High Performance P4 Software Switch

Student: YuriIozzelli (y.iozzelli@gmail.com)
Mentor: LuigiRizzo (luigi@freebsd.org)

Project description

Software switches are a key component of any cloud infrastructure, and even proposals focused on hardware targets, such as P4 (http://p4.org), cannot ignore software implementations and their performance.

Historically, software packet processors have been limited in performance by network I/O, but this is no longer true with high-speed frameworks such as netmap and DPDK.

In this project I would like to implement a modified version of the reference P4 switch (https://github.com/p4lang/behavioral-model) on FreeBSD that uses netmap for faster packet I/O. The goal is to reach speeds on the order of 1 Mpps (the current reference implementation is limited to 150 Kpps for a simple L2 switch with 2 hosts).

This would enable P4 to be used in fast networking experiments as well as in real environments.

Approach to solving the problem

The required steps would include:

Develop a simplified P4 target which uses netmap as a backend for packet I/O (the current implementation uses pcap)
Optimize the target's performance, leveraging both standard techniques (e.g. limiting memory allocations by re-using buffers, using a more parallel architecture) and netmap-enabled techniques (e.g. batching I/O)
Reach feature parity with the reference switch target, and have a functional OpenFlow-enabled demo, which will be used to evaluate performance against the reference implementation and other software switch solutions
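
The batching idea mentioned in the steps above can be illustrated with a minimal sketch. It does not use the netmap API: FakeRing, poll and drain are hypothetical names that stand in for a receive ring where each poll has a fixed cost (for example a system call), so that reading many packets per poll amortizes that cost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a receive ring (NOT the netmap API).
// Each call to poll() models one expensive synchronization with the NIC.
struct FakeRing {
  explicit FakeRing(std::size_t n) : pending(n) {}

  // Hands out up to max_batch packets per call and counts the call.
  std::size_t poll(std::size_t max_batch) {
    ++polls;
    const std::size_t n = std::min(pending, max_batch);
    pending -= n;
    return n;
  }

  std::size_t pending;    // packets still waiting in the ring
  std::size_t polls = 0;  // number of poll() calls made so far
};

// Drains the ring using the given batch size.
// Returns the number of packets processed; ring.polls holds the call count.
std::size_t drain(FakeRing& ring, std::size_t batch) {
  assert(batch > 0);
  std::size_t processed = 0;
  while (ring.pending > 0) {
    processed += ring.poll(batch);
  }
  return processed;
}
```

With 1000 pending packets, draining with a batch size of 1 takes 1000 polls, while a batch size of 32 takes 32 polls for the same number of packets.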

Deliverables

D1

Port the existing reference implementation code to FreeBSD. Develop a simplified P4 switch target which replaces pcap with netmap for packet I/O.

D2

Now that the I/O is faster, find the new bottlenecks and address them. I will probably need to modify the target to use pre-allocated buffers for packets and process packets in multiple threads in a pipeline, and try to keep synchronization costs low.
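
As a rough sketch of what pre-allocated packet buffers could look like, the following pool allocates all memory once and then only hands out and takes back pointers. PacketPool and its methods are hypothetical names chosen for illustration, not code from the behavioral-model repository, and the sketch is not thread-safe: in a multi-threaded pipeline each stage would need its own pool or some form of synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size buffer pool: one allocation up front,
// no further heap allocations on the packet path.
class PacketPool {
 public:
  PacketPool(std::size_t count, std::size_t buf_size)
      : buf_size_(buf_size), storage_(count * buf_size) {
    assert(count > 0 && buf_size > 0);
    // Reserve once so that release() never reallocates the free list.
    free_.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
      free_.push_back(storage_.data() + i * buf_size);
    }
  }

  // Returns nullptr when the pool is exhausted; the caller may then
  // drop the packet instead of allocating.
  std::uint8_t* acquire() {
    if (free_.empty()) return nullptr;
    std::uint8_t* buf = free_.back();
    free_.pop_back();
    return buf;
  }

  // Gives a buffer previously obtained from acquire() back to the pool.
  void release(std::uint8_t* buf) { free_.push_back(buf); }

  std::size_t available() const { return free_.size(); }
  std::size_t buffer_size() const { return buf_size_; }

 private:
  std::size_t buf_size_;
  std::vector<std::uint8_t> storage_;  // backing memory for all buffers
  std::vector<std::uint8_t*> free_;    // stack of currently unused buffers
};
```

A receive loop would call acquire() for each incoming packet and release() once the packet has been transmitted or dropped, so the steady state involves no calls to the allocator.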

D3

Extend the version developed to support more advanced features used by common P4 programs (packet replication, learning, multicast groups), and evaluate its performance against the reference implementation.

D4

Develop an extended demo which implements the OpenFlow protocol on top of the developed target, and evaluate its performance against the reference implementation and other OpenFlow software switches.

Milestones

This is the expected timeline with the milestones, with the actual work done added by linking to relevant soc-status mailing list posts.

23 – 29 May: port the existing code to FreeBSD --> 1
30 May – 5 June: add netmap-enabled I/O to a simple P4 target (D1) --> 1
6 – 12 June: optimize memory management --> 1
13 – 19 June: implement multithreaded processing of packets in a pipeline (D2) --> 1
20 – 26 June: Prepare for mid-term evaluation: clean code and write results report --> 1
27 June – 17 July: Extend the target to support replication, learning and multicast --> 1, 2
18 – 31 July: Evaluate and possibly improve performance (D3) --> 1, 2
1 – 14 Aug: Develop the OpenFlow demo and measure performance (D4) --> 1, 2
15 – 23 Aug: clean code, improve documentation and report final speed measurements

End of GSoC: State of the Project

I successfully implemented D1 and D2, improving the performance of the target by 2x-4x (depending on the P4 program).

Deliverable D3 was about improving the performance of advanced features like multicast and learning, but after many attempts I did not obtain a significant speedup. The existing implementation of these features is of course compatible with the improvements introduced by D1 and D2.

In order to allow the simple_switch target to use the newly developed lockless queue, I had to adapt it to more advanced behaviors like rate limiting and multiple producers (my original lockless queue was single producer, single consumer), so I used the time allocated for D3 to do that.
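
For readers unfamiliar with the technique, a single-producer, single-consumer lockless queue can be sketched as a ring buffer with two atomic indices. This is a generic illustration under stated assumptions, not the queue from the gsoc-lockless branch: SpscRing, try_push and try_pop are hypothetical names, and the multiple-producer and rate-limiting extensions described above are not covered.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SPSC ring buffer: exactly one thread may call try_push()
// and exactly one (other) thread may call try_pop().
template <typename T>
class SpscRing {
 public:
  explicit SpscRing(std::size_t capacity)
      : slots_(capacity), mask_(capacity - 1) {
    // A power-of-two capacity lets indices wrap with a bit mask.
    assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
  }

  // Producer side. Returns false if the ring is full.
  bool try_push(const T& value) {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (tail - head == slots_.size()) return false;
    slots_[tail & mask_] = value;
    // Release: the slot write above becomes visible before the new tail.
    tail_.store(tail + 1, std::memory_order_release);
    return true;
  }

  // Consumer side. Returns false if the ring is empty.
  bool try_pop(T& out) {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (head == tail) return false;
    out = slots_[head & mask_];
    // Release: the slot read above completes before the slot is reusable.
    head_.store(head + 1, std::memory_order_release);
    return true;
  }

 private:
  std::vector<T> slots_;
  const std::size_t mask_;
  std::atomic<std::size_t> head_{0};  // next index to read
  std::atomic<std::size_t> tail_{0};  // next index to write
};
```

Because each index is written by only one thread, no lock or compare-and-swap is needed. Supporting multiple producers requires coordinating the writers on the tail index, which is why it calls for a different design.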

I couldn't set up the OpenFlow demo for a performance comparison because I ran out of time (I lost almost two weeks because this summer I also had my thesis dissertation). I still have more basic performance comparisons that show the improvements.

The project as of now is structured this way:

a gsoc-master branch, with all the experimental modifications to the codebase, and a speed_tests directory with some scripts to measure performance. It also has a new target (called fast_switch), which uses the new queue classes. The netmap device manager is available to all targets via the --enable-netmap option.
a gsoc-dev-manager-netmap branch, which contains only the netmap-related code. The commits are squashed and I plan to use it for a pull request to the upstream repository. Netmap support is enabled by passing the --enable-netmap parameter to the configure script.
a gsoc-lockless branch, which contains only the code related to the lockless queue. The commits are squashed and I plan to use it for a pull request to the upstream repository.
a new repository: freebsd_ports. This contains the Makefiles and the patches necessary to build the software on FreeBSD. I also needed to port a couple of other projects that are dependencies. As of now, the port builds version 1.2.0 of the vanilla upstream code. If the above pull requests are accepted, I plan to update the version in the Makefile. If they are not accepted, I could use my own GitHub repo, or leave the choice of which features to use to the user.

How to test the improvements

To test the performance of the fast_switch target (which includes all the features):

check out the gsoc-master branch
configure and compile with the -O3 optimization level
go to the speed_tests directory
see the help for available options
(the virtual Ethernet tests only work on Linux, because they create veth pairs; it should be easy to use an epair instead)

The Code

https://github.com/zarghul/behavioral-model
- relevant branches are named gsoc-*
https://github.com/zarghul/freebsd_ports

Useful links

SummerOfCode2016/HighPerformanceP4SoftwareSwitch (last edited 2016-08-22T11:08:52+0000 by YuriIozzelli)