FreeBSD Developer Summit: Networking Working Group
We would like to cover the following topics. This is not an exhaustive list and if you feel there is something missing that you want to talk about, contact the session chair and your topic will be included here.
Please send talks, brainstorm topics, or other proposals to the session leaders, LawrenceStewart and GlebSmirnoff.
Agenda
- Protocols
  - TCP
    - Deferring reassembly to user thread's kernel context (Lawrence)
    - MPTCP (Lawrence)
    - Deferring mbuf drop on ACK (Scott)
    - UI for controlling TCP/stack dynamics (Lawrence, Andre)
    - Reworking SACK + integrating with reassembly (Andre)
    - Pacing (Lawrence)
- TCP
  - Infrastructure
    - Hostcache (Lawrence)
- Interface dangling pointers
  - RX pointer in mbufs
    - Problem: we don't refcount the interface per mbuf, which can lead to dangling pointers (particularly problematic in subsystems which hold mbufs for some time, e.g. dummynet, netgraph)
  - IOCTLs, rtentries, etc.
- Stack-driver interface (Andre)
  - RX/TX IFQ/Multiqueue
  - API/advertising hardware offload capabilities (kernel + userland)
  - Making ifnet opaque to drivers
    - Juniper proposal?
- Performance optimisation
  - Routing subsystem
    - FIB vs RIB
      - Problem: the trie is too generic
    - rtentry locking
      - Problem: per-packet, per-rtentry locking kills performance
  - Netgraph (Alexander, Gleb)
    - Topology locking
      - Problem: topology locking hurts performance and doesn't currently protect nodes from all possible config changes, which can lead to a panic
    - Hand-rolled RW locking
      - Problem: for nodes that do nothing, e.g. ng_tee, atomic inc/dec is too heavyweight, so ideally we need a better solution for all nodes
  - Fast path cache misses, amount of memory accessed per read/write (Adrian)
  - Mbufs
    - Ref counting (Gleb, Adrian)
      - Problem: the uint32_t refcounts come from a single zone and are not contiguous in memory with the mbuf
  - Batching
  - Multi-(process|thread) accept (Gleb)
    - Problem: a wakeup causes all threads to race and only one wins
  - Lightweight ref counter based on counter(9) (Gleb)
  - Routing subsystem
- Testing/development
  - NSC userspace stack (Lawrence)
  - Testing @ Netflix (Lawrence)
  - Testing @ Orange (Olivier)
- Ownership
  - Identify chunks of code without owners
Results
Stack/driver interface - Andre Oppermann
To come:
- Formal documentation of the stack/driver boundary
- Split of the ifnet structure into stack-owned and driver-owned parts (evaluate the Juniper proposal; sketched below)
- Adjust all drivers to the new world order -- request for feedback in mailing list messages
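A purely illustrative sketch of the stack-owned/driver-owned split being discussed; this is neither the Juniper proposal nor any existing header, and all type and field names are hypothetical:

    #include <stdint.h>

    struct mbuf;        /* opaque in this sketch */
    struct ifnet;

    /* Stack-owned half: drivers must not touch these fields directly. */
    struct ifnet_stackpart {
        uint64_t ifs_ipackets;      /* input packet counter (stack updates) */
        uint64_t ifs_opackets;      /* output packet counter */
        uint32_t ifs_flags;         /* administrative/link state */
    };

    /* Driver-owned half: the stack only calls through the method pointers. */
    struct ifnet_driverpart {
        void  *ifd_softc;                                   /* driver softc */
        int  (*ifd_transmit)(struct ifnet *, struct mbuf *);
        int  (*ifd_ioctl)(struct ifnet *, unsigned long, void *);
    };

    struct ifnet {
        struct ifnet_stackpart  if_stack;    /* stack writes, driver reads */
        struct ifnet_driverpart if_driver;   /* driver owns, stack calls through */
    };

The point is only that drivers stop poking at stack-owned fields and the stack stops reaching into driver state, so either side can change its half without breaking the other.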
Version shims, or will all vendors be expected to switch?
- Shouldn't be too hard to change -- clear documentation
Will open a projects branch for this.
Technical monitor is Ed Maste from the FreeBSD Foundation.
Can the change be made as something other than one large commit? Conditionalize with #ifdef and convert the most important drivers sequentially.
One aim is to separate L2 and L3, but that's outside the scope here.
Changing the way drivers announce capabilities to the stack -- the idea is to have an area of a struct declared by the driver, similar to setsockopt(): a list of items that the stack will walk (see the sketch below).
It's like newbus. So why not newbus? "Newbus is not just buses, it's also not new" (phk). (It has been reinvented twice since, not well.) To be investigated.
There can be additional failure modes if the functions exported as pointers in the list have to allocate memory; otherwise behaviour will be the same as now.
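A minimal sketch of what such a driver-declared capability list might look like; if_capdesc, the xx_* driver stubs and the walk below are hypothetical illustrations, not an existing or proposed KPI:

    #include <stddef.h>
    #include <string.h>

    struct ifnet;       /* opaque in this sketch */

    /* One entry per capability the driver offers. */
    struct if_capdesc {
        const char *cd_name;                       /* e.g. "tso4", "rxcsum" */
        int       (*cd_enable)(struct ifnet *);    /* may fail, e.g. ENOMEM */
        void      (*cd_disable)(struct ifnet *);
    };

    /* A driver would declare a static table, terminated by a NULL name. */
    static int  xx_tso_enable(struct ifnet *ifp)  { (void)ifp; return (0); }
    static void xx_tso_disable(struct ifnet *ifp) { (void)ifp; }

    static const struct if_capdesc xx_caps[] = {
        { "tso4", xx_tso_enable, xx_tso_disable },
        { NULL,   NULL,          NULL },
    };

    /* The stack walks the table, much like setsockopt() dispatches options. */
    static const struct if_capdesc *
    if_cap_lookup(const struct if_capdesc *caps, const char *name)
    {
        for (; caps->cd_name != NULL; caps++)
            if (strcmp(caps->cd_name, name) == 0)
                return (caps);
        return (NULL);
    }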
Vendor capabilities: a route to integrate advanced features of the hardware.
TCP offload? Is ifnet the right level of abstraction for modern network hardware?
Just-in-time compilation to improve performance on cache-deep processors? Not a problem now, but it will be tomorrow. There will be cards that offload the entire network stack.
Automagic export of C code at compile time a la
IFQ interface: pain point
- Different types of workload need different types of parallelism. There is no one true method, like binding one queue to one core. Need flexibility to configure via ifconfig or some other method. ALTQ is not working.
Get rid of the IFQ and the associated soft queue -- most hardware supports queueing in hardware.
The stack will call down completely locklessly (no ifnet lock or driver lock). The driver can implement the optimal approach.
Pretty much everything uses if_transmit now, except ALTQ. QoS? There is no explicit API for timing. Intermediary queue discipline? The stack's if_transmit function pointer is replaced by ALTQ, which then calls down into the driver's if_transmit (see the interposition sketch below). Multiple layers of queue discipline? The interface only supports one layer.
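A toy sketch of the interposition idea, assuming a cut-down ifnet with only the fields needed here (the real struct ifnet and ALTQ plumbing are more involved); the wrapper remembers the driver's if_transmit and installs itself in its place:

    #include <stddef.h>

    struct mbuf;        /* opaque in this toy */

    /* Toy ifnet: just a transmit method and a slot for discipline state. */
    struct ifnet {
        int   (*if_transmit)(struct ifnet *, struct mbuf *);
        void   *if_qdisc_state;     /* hypothetical: discipline private state */
    };

    struct altq_wrap {
        int   (*aw_driver_transmit)(struct ifnet *, struct mbuf *);
        /* ... queueing discipline state would live here ... */
    };

    /* Sits where the stack's single if_transmit pointer points; classifies
     * and shapes (omitted), then calls down into the driver's real routine. */
    static int
    altq_transmit(struct ifnet *ifp, struct mbuf *m)
    {
        struct altq_wrap *aw = ifp->if_qdisc_state;

        /* enqueue/dequeue according to the configured discipline ... */
        return (aw->aw_driver_transmit(ifp, m));
    }

    /* Interpose: save the driver's transmit method and install the wrapper. */
    static void
    altq_attach(struct ifnet *ifp, struct altq_wrap *aw)
    {
        aw->aw_driver_transmit = ifp->if_transmit;
        ifp->if_qdisc_state = aw;
        ifp->if_transmit = altq_transmit;
    }

The stack keeps calling one pointer and stays lockless; only the discipline decides when packets are passed on, and detaching it is just restoring the saved pointer.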
The work is to prepare a proposal and documentation and to run performance tests so we can make an informed decision.
Can we pass in other structures -- e.g. writing disk blocks directly into a network packet? This is another construction site. Who is looking at busdma?
Once the documentation etc. has been reviewed, drivers can be updated, sometime in the next 4 months.
A common API of common functions for drivers to use, but not part of the stack. Drivers can benefit automatically from improvements.
Vendor feedback: can we move buffering up into the stack? Control queue lengths and multi-layer queue disciplines. Adaptive behaviour depending on how many packets are waiting.
Copy-pasting in drivers is not a good thing. Copied code should be centralised in the stack and made available for drivers to use. The infrastructure to remove that is almost in place (expected in 10.1).
Start thinking further into the future. A driver consists of about 20% boilerplate. Write a condensed driver spec file and process it into .[ch] files at compile time.
Interface dangling pointers -- Gleb
Implement some lightweight ref counting. Need an efficient way to fetch the current value gathered from all CPUs (see the counter(9) sketch below).
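A rough sketch of how this could sit on top of the existing counter(9) KPI; the if_mbuf_refs usage and the drain check are hypothetical, and counter_u64_fetch() only sums a snapshot, so a real implementation would still need to quiesce new references first:

    #include <sys/types.h>
    #include <sys/systm.h>
    #include <sys/counter.h>

    /* Hypothetical per-ifnet member: counter_u64_t if_mbuf_refs; */

    /* Hot path: per-CPU increment/decrement, no atomics or locks. */
    static inline void
    if_ref_mbuf(counter_u64_t refs)
    {
        counter_u64_add(refs, 1);
    }

    static inline void
    if_unref_mbuf(counter_u64_t refs)
    {
        counter_u64_add(refs, -1);
    }

    /*
     * Slow path (e.g. interface detach): counter_u64_fetch() walks all CPUs
     * and sums the per-CPU values -- the "efficient way to fetch the current
     * value gathered from all CPUs" -- but it is only meaningful once new
     * references can no longer be taken.
     */
    static int
    if_mbuf_refs_drained(counter_u64_t refs)
    {
        return (counter_u64_fetch(refs) == 0);
    }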
Routing performance
Mutex contention kills performance on routing lookups. Problems when the egress interface disappears: a cached rtentry leads to crashes.
Look at killing the need for rtentry locking. Ref counting as proposed by Gleb.
Testing / Development
Developers should communicate with companies able to test with different workloads.
Orange (Olivier) and Yandex (Alexander) have interest in routing/forwarding performance. Orange have Ixia test hardware.
Netflix (Lawrence, Adrian, Scott) have access to TCP-heavy production workload.
Netflix looking to host focused mini devsummit events in Los Gatos, CA on a semi-regular basis.
Mbuf refcounting
- Cache contention from the tightly packed slab (see the layout sketch below)
  - Gleb to continue testing the prototype patch and post it to the list
- CPU stall due to the external memory read
  - Evaluate further after Gleb's patch (which may address this issue)
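A simplified illustration of the layout difference (not the actual mbuf(9) structures): with the refcount allocated from its own zone, every free dereferences an unrelated cache line, whereas an embedded counter sits on the line that is already being touched:

    #include <stdint.h>

    /* Today (simplified): the refcount comes from a separate UMA zone, so it
     * is not contiguous with the mbuf/cluster it protects and costs an extra
     * memory read on every hold/release. */
    struct ext_refcnt_separate {
        char              *ext_buf;        /* cluster data */
        volatile uint32_t *ext_refcnt;     /* points into a separate zone */
    };

    /* Prototype direction (simplified): the counter is embedded next to the
     * buffer pointer, so no additional cache line is fetched. */
    struct ext_refcnt_embedded {
        char              *ext_buf;
        volatile uint32_t  ext_count;
    };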
Multi-(process|thread) accept
"Thundering herd" problem.
- SO_REUSEPORT modified as in Linux/DragonFly BSD (see the userland sketch below)
  - Pros: currently used in Linux/DragonFly, good for small apps
  - Cons: listen queue migration, annoying to support for larger apps, less flexibility
- Go our own way with e.g. SO_SHAREDQUEUE
  - There seems to be a leaning towards this direction, but a clear proposal should be articulated on a mailing list and a pro/con comparison done later
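For reference, this is roughly what the Linux/DragonFly-style approach looks like from userland. SO_REUSEPORT itself exists on FreeBSD today; the connection-distribution semantics discussed above do not, so this is only a sketch of the intended usage:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <err.h>
    #include <stdint.h>
    #include <string.h>

    /* Each worker process creates its own listener on the same port; with
     * Linux/DragonFly-style semantics the kernel hashes incoming connections
     * across the listen queues instead of waking every blocked acceptor
     * (the "thundering herd"). */
    static int
    make_listener(uint16_t port)
    {
        struct sockaddr_in sin;
        int s, on = 1;

        if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
            err(1, "socket");
        /* Must be set before bind() so several processes can share the port. */
        if (setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) == -1)
            err(1, "setsockopt(SO_REUSEPORT)");

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_len = sizeof(sin);              /* BSD-specific field */
        sin.sin_port = htons(port);
        sin.sin_addr.s_addr = htonl(INADDR_ANY);

        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
            err(1, "bind");
        if (listen(s, 128) == -1)
            err(1, "listen");
        return (s);
    }

Each worker would call make_listener() with the same port and then accept() on its own socket, which is what makes per-process listen queues attractive for small apps and awkward for ones that need queue migration.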
Routing subsystem
- Multiple cache misses per radix lookup
- Radix is too generic. Could have a super-compact v4-only trie alongside it (see the toy sketch below)
- Luigi/Marko's netmap + DXR work needs to be looked into
- Alexander needs to solve the API for RIB->FIB callbacks to deal with MPLS, which can be used to support per-protocol-specific lookup structures
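As a strawman for the "super-compact v4-only" idea, here is a toy DIR-24-8-style lookup (in the spirit of, but not taken from, the DXR work): a /24-indexed table of 16-bit entries with a small secondary table for longer prefixes, giving one or two memory accesses per lookup instead of a radix walk:

    #include <stdint.h>

    struct fib4_toy {
        uint16_t  tbl24[1 << 24];   /* one 16-bit entry per /24, ~32 MB */
        uint16_t *tbl8;             /* 256-entry blocks for >/24 prefixes */
    };

    /*
     * dst is an IPv4 address in host byte order.  If the top bit of the /24
     * entry is clear, the entry is a next-hop index (one memory access);
     * otherwise the low 15 bits select a tbl8 block indexed by the last
     * address byte (two memory accesses).
     */
    static inline uint16_t
    fib4_toy_lookup(const struct fib4_toy *f, uint32_t dst)
    {
        uint16_t e = f->tbl24[dst >> 8];

        if ((e & 0x8000) == 0)
            return (e);
        return (f->tbl8[((uint32_t)(e & 0x7fff) << 8) | (dst & 0xff)]);
    }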
Ownership
OFED -- get NetApp, Isilon, iX, and RDMA NIC vendors to talk
Miscellaneous
Andre is carving IPsec out into a pfil-based kmod.
Andre is working on the final bits of TCP-AO.
TSO doesn't work with NAT because the copy-out buffer is too small and truncates the packet.