Network RSS
What is RSS?
In this context, it's "Receive Side Scaling". It started at Microsoft.
Start here: http://msdn.microsoft.com/en-us/library/windows/hardware/ff570736(v=vs.85).aspx
Then, read this: http://msdn.microsoft.com/en-us/library/windows/hardware/ff567236(v=vs.85).aspx
The basic idea is to keep the code and data for each given TCP and UDP flow on a single CPU, which aims to:
- Reduce lock contention (as multiple CPUs aren't competing for the same resource);
- Keep data hot in the local CPU cache;
- .. and eventually, keep each flow local to one CPU socket and its memory.
Where's the code?
Robert's initial RSS work and the PCBGROUPS work are in -HEAD.
I'm committing things to -HEAD as they get tested.
I do development in this branch - but right now everything is in -HEAD:
wiki-admin note: this repository is not available as of 20240829.
RSS support overview
The RSS code is in sys/netinet/in_rss.[ch]. It provides the framework for mapping a given mbuf and RSS hash value to an RSS bucket, which then typically maps to a CPU, netisr context and PCBGROUP. For now the RSS code treats the RSS bucket as the netisr context and PCBGROUP value.
The PCBGROUP code in sys/netinet/in_pcbgroup.[ch] creates one PCB table per configured "thing". For RSS, it's one per RSS bucket - not per CPU. The RSS bucket -> CPU mapping occurs separately.
The RSS hash is a 32 bit number calculated using a Toeplitz hash over the packet's address/port fields and the RSS key. The Microsoft links above describe how the RSS hash is calculated for each frame.
The RSS code derives the number of RSS bits (and hence the number of buckets) from the number of CPUs - typically twice the number of CPUs, so rebalancing is possible. The maximum number of RSS buckets is 128, or 7 bits.
The sysctl "net.inet.rss.bits" is currently the only tunable (set at boot time) and controls how many of the RSS bits above are used when creating the RSS buckets.
Some NICs (eg the Intel igb(4) NICs) only support 8 RSS queues (and some earlier NICs only support 4) - but the RSS code doesn't yet know about this. net.inet.rss.bits may need to be capped to the maximum number of RSS queues supported by the NIC(s) in use.
The RSS code will then allocate CPUs to each RSS bucket in a round-robin fashion.
Userland can query the RSS bucket to CPU mapping via "net.inet.rss.bucket_mapping" - it's a string of bucket:CPU pairs. Userland can use this to create one worker thread per RSS bucket and bind each thread to the right CPU.
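As a rough illustration, here's a hedged userland sketch (not in-tree code) that reads net.inet.rss.bits and net.inet.rss.bucket_mapping and pins the calling thread to the CPU backing a given RSS bucket. It assumes the mapping string is space-separated "bucket:cpu" pairs as described above; check the sysctl output on a real system before relying on that format.

    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <sys/sysctl.h>

    #include <err.h>
    #include <stdio.h>
    #include <string.h>

    /* Number of RSS buckets: (1 << net.inet.rss.bits). */
    static int
    rss_num_buckets(void)
    {
        int bits;
        size_t len = sizeof(bits);

        if (sysctlbyname("net.inet.rss.bits", &bits, &len, NULL, 0) == -1)
            err(1, "sysctlbyname(net.inet.rss.bits)");
        return (1 << bits);
    }

    /* Parse "bucket:cpu" pairs and return the CPU for the given bucket. */
    static int
    rss_bucket_to_cpu(int bucket)
    {
        char buf[2048], *s, *tok;
        size_t len = sizeof(buf);
        int b, c;

        if (sysctlbyname("net.inet.rss.bucket_mapping", buf, &len,
            NULL, 0) == -1)
            err(1, "sysctlbyname(net.inet.rss.bucket_mapping)");
        s = buf;
        while ((tok = strsep(&s, " ")) != NULL) {
            if (sscanf(tok, "%d:%d", &b, &c) == 2 && b == bucket)
                return (c);
        }
        return (-1);
    }

    /* Pin the calling (worker) thread to the CPU backing an RSS bucket. */
    static void
    pin_to_bucket_cpu(int bucket)
    {
        cpuset_t mask;
        int cpu = rss_bucket_to_cpu(bucket);

        if (cpu < 0)
            errx(1, "no CPU mapping for RSS bucket %d", bucket);
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
            sizeof(mask), &mask) == -1)
            err(1, "cpuset_setaffinity");
    }

A server would then typically call pin_to_bucket_cpu() once from each worker thread, with one worker per bucket returned by rss_num_buckets().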
The RSS kernel calls are as follows:
- rss_getbits() returns how many RSS bits in the bucket space are configured. The number of buckets is (1 << rss_getbits()).
- rss_getbucket(uint32_t hash) returns the RSS bucket for the given RSS hash value.
- rss_getcpu(u_int bucket) returns the currently configured RSS CPU for the given RSS bucket. RSS aware drivers will use this to pin worker threads and interrupts for the given RSS bucket (ie, "queue") to the correct CPU.
- rss_hash2cpuid(), rss_hash2bucket(), rss_m2cpuid(), rss_m2bucket() return hash / mbuf to bucket/cpuid mappings.
- rss_getkey() returns the currently configured RSS key. RSS aware drivers will use this to fetch the key to program into the hardware.
- rss_getnumbuckets() returns the number of RSS buckets (ie, 1 << rss_getbits())
- rss_getnumcpus() returns how many CPUs the RSS layer is allowed to use.
- rss_get_indirection_to_bucket() takes an RSS indirection table entry and maps it to the correct RSS bucket. RSS aware drivers should use this to map each indirection entry to an RSS bucket.
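To make the relationships concrete, here's a minimal kernel-side sketch using only the calls described above (the prototypes are assumed to match those descriptions; check sys/netinet/in_rss.h for the exact signatures in your tree). It maps a hardware-supplied 32-bit RSS hash to a bucket and then to a CPU:

    #include <sys/param.h>

    #include <netinet/in_rss.h>

    /*
     * Given the 32-bit RSS hash a NIC reported for a received frame,
     * look up the RSS bucket and the CPU currently assigned to it.
     * The bucket space is (1 << rss_getbits()) entries wide.
     */
    static u_int
    example_hash_to_cpu(uint32_t rss_hash)
    {
        u_int bucket;

        bucket = rss_getbucket(rss_hash);
        return (rss_getcpu(bucket));
    }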
TODO list
Here's the list of things to do for basic RSS support.
- RSS Rebalancing - re-assigning the bucket -> queue -> CPU mapping to deal with slightly different CPU / queue loads (ie, "rebalancing" the load between each bucket so each CPU gets an even set of load)
- .. some driver hooks are required to inform the drivers to reprogram their RSS configuration
- .. and some notification mechanism for userland is required.
- See if it's worth modifying the driver transmit behaviour so that if the transmitting CPU isn't the RSS-correct one, the driver doesn't transmit directly - it just queues the frame and schedules the taskqueue to run on the correct CPU. This may mitigate the cost of having the occasional wrongly-affined socket transmit some data whilst there's 40+ gigabit of correctly-affined sockets banging away...
Current work
General work
The current RSS work aims to finish up RSS awareness in the UDP path and tidy up the immediate loose ends around IPv4 and IPv6 fragment handling.
- Convert UDP code to initialise 2-tuple or 4-tuple hashing based on whether 4-tuple UDP is enabled or not;
- Add RSS calculation to the IPv6 fragment reassembly code - again depending upon whether 2-tuple or 4-tuple hashing is configured for the packet type (eg if it's UDPv6 but only 2-tuple hashing is available for UDP, then only do the software hash over the 2-tuple data.)
- If it's easy, re-inject re-assembled IPv6 fragments back into the correct netisr queue - the new RSS hash value / type will direct them to the correct CPU.
- Handle IPv4 tunnel decapsulated frames (gif, gre, ipip, ipsec)
- Handle IPv6 tunnel decapsulated frames (gif, gre, ipip, ipsec)
- Look at what was done for IPv4 netisr receive (ie, calling into the RSS code to see if frames needed their RSS hash recalculated based on what the hardware provided) and do the same for IPv6.
- IPv6 UDP receive works - but IPv6 UDP send has very simplistic locking. In order for it to scale it should do something similar to what the IPv4 UDP send path does (ie, optimise for local bind() sockets, so it doesn't have to keep taking a write lock on the inp on each send.)
FreeBSD-12 work
TODO list for FreeBSD-12:
- add support to configure per-device RSS keys on the fly;
- configure per-device RSS bucket mappings on the fly;
- configure per-device RSS hash configurations on the fly;
- configure per-bucket CPU set mappings on the fly;
- split the RSS driver side, the RSS packet input side and the RSS stack side of things up into separate options;
- UDP IPv6 RSS support would be nice (it works, but I need to test/integrate bz's v6 udp locking changes for it to really matter);
- work on scaling linearly on incoming /and/ outgoing connections. Right now incoming connections are easy, but outgoing connections aren't so easy;
- allow the system TCP/UDP hash type to be set at boot time - right now it's 4-tuple for TCP, 2-tuple for UDP, but some NICs (eg the 40G Intel ixl hardware) can only hash UDPv4 as 4-tuple; they don't do 2-tuple hashing for UDP except on fragments;
- remove the requirement that the RSS CPU/hash bucket mapping is a power of two; allow it to be an arbitrary configuration.
And:
- it'd be nice if this were better documented;
- it'd be nice if we had easy examples of this stuff working, complete with library bits in base.
Completed work
- (done) For NICs that hash on RX we can certainly use their RSS hash; but for NICs that don't, we may want to set the flowid on packet reception to a software calculated RSS value.
- .. which will be amusing, as currently the flowid tag on mbufs isn't always set along with a flowtype value, making it difficult to know whether the NIC is actually setting an RSS hash or some arbitrary flowid value.
- (done - ixgbe, igb) .. So, we likely need to populate the flowtype value in the ethernet driver, complete with whether it's a 2-tuple or a 4-tuple value, and whether it's for IPv4 or IPv6.
- (done - ipv4) Investigate / handle RSS for IP fragments - what CPU does the defrag/assembly? How's it end up on the right CPU?
- .. will likely have to do a software calculation of the frame once it finishes ip_reass()
- (done - ixgbe, igb) Modify (some) drivers to use the RSS routines to pin the transmit/receive paths to the correct RSS CPU ID
- .. and make sure it's all correctly pinned.
- (done, ipv4) UDP RSS? What's required? (see below)
- (done, ipv4) Reinject completed IPv4/IPv6 fragments into netisr but destined for the correct RSS bucket / queue / CPU.
- (done) Figure out what's going on with ixgbe(4) and UDPv4 receive on fragmented (> 1 MTU) frames. Testing showed that the RX path was stamping these large UDP frames with a hashtype of straight IPv4 rather than UDP IPv4. It may be that hardware LRO is enabled and it's coalescing things for us, but returning it as an IPv4 frame. Perhaps a higher / varied traffic load would run the LRO buffers out and we would start to see some frames returned with UDP IPv4 (first fragment) and some with straight IPv4 (subsequent fragments.)
- What's happening: ixgbe, igb and cxgbe treat fragmented frames as 2-tuple IPv4 or IPv6 - including the first fragment. So the RSS hash needs to be re-calculated and re-injected where appropriate.
- (Done - adrian) add support for multiple listener sockets on the same listen address/port
- (Done - adrian) .. and then modify the pcbgroups / pcb lookup routines to terminate an incoming socket on a CPU local pcbgroup entry if one exists, before defaulting to a wildcard entry
- (Done - adrian) Investigate a userland API to query the local CPU needed for a given socket, so applications can migrate fds to the right thread if required (eg for outbound connections)
- (Done - rwatson) Integrate Robert's existing work into HEAD
- (Done - adrian) Allow swi's to be CPU pinned, for TCP timers to correctly run on the right CPU
- (Done - adrian) Store the TCP hashtype in the inpcb
- (Done - adrian) Map TCP timers to the RSS selected CPU
- (Done - adrian) Ensure that the RSS used is the receive hash from the hardware - and that for software calculation of the RSS hash, the source/destination tuples are used in the right order
- .. ie, that the RSS hash for the transmit direction and receive direction of a flow come out to the same value
- (Done - adrian) IPv6 RSS? What's required?
- (Done - adrian) Add RSS calculation to the IPv4 fragment reassembly code - again depending upon whether 2-tuple or 4-tuple hashing is configured for the packet type (eg if it's UDPv4 but only 2-tuple hashing is available for UDP, then only do the software hash over the 2-tuple data.)
- (Done - adrian) If it's easy, re-inject re-assembled IPv4 fragments back into the correct netisr queue - the new RSS hash value / type will direct them to the correct CPU.
- (Done - adrian) UDP IPv4 and IPv6 RSS aware receive distribution to RSS aware sockets;
- (Done - adrian) UDP IPv4 and IPv6 RSS software hash for outbound UDP frames;
- (Done - tiwei) RSS software hash for IPv6 packet receive path
Later work
What isn't required for basic RSS but would be nice:
- Merge in the GSoC support for policies per FD/socket for RSS (GSoC project from Takuya Asada?) (URL?)
- RSS-ised TCP syncache?
- There's some lock contention between the if_transmit() thread(s), TX interrupt and deferred TX taskqueue. An interrupt can come in and schedule the swi during if_transmit() and deferred TX taskqueue. Keep an eye on that over time.
- .. maybe do a trylock and schedule a deferred TX taskqueue to handle the TX completion, at least so the swi doesn't preempt an existing running task? (a sketch of this follows the list)
- .. maybe temporarily disable scheduling the swi if the TX lock is held?
- How should multi-CPU-socket and multi-CPU-NIC RSS work? Ideally each CPU local to a NIC would have a separate PCBGROUPs table (or part of one) and separate RSS, so the traffic can be kept local to that CPU socket.
- Decouple the netisr and PCBGROUPS "cpu" pinning with some abstract "queue" concept, then have a way to bind each of those (and the RSS "CPU" for each RSS bucket) into a cpuset for that queue.
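A hedged sketch of the trylock-and-defer idea mentioned above. The example_txq structure and its fields are hypothetical placeholders for whatever per-queue state a driver actually keeps; only the mutex and taskqueue primitives are real kernel APIs.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/taskqueue.h>

    /* Hypothetical per-TX-queue state (placeholder, not a real driver). */
    struct example_txq {
        struct mtx          txq_mtx;    /* protects the TX ring */
        struct taskqueue    *txq_tq;    /* deferred TX taskqueue */
        struct task         txq_task;   /* runs TX completion */
    };

    /*
     * TX completion interrupt path: rather than letting the swi contend
     * with a lock held by if_transmit(), try the lock once and defer to
     * the taskqueue if it's busy.
     */
    static void
    example_txeof_intr(struct example_txq *txq)
    {
        if (mtx_trylock(&txq->txq_mtx) == 0) {
            /* Lock is busy (likely if_transmit()); defer the work. */
            taskqueue_enqueue(txq->txq_tq, &txq->txq_task);
            return;
        }
        /* ... process TX completions here ... */
        mtx_unlock(&txq->txq_mtx);
    }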
Drivers
Work is being done using igb(4) and ixgbe(4), as that's what AdrianChadd has on hand.
TODO - hopefully also cxgbe
Each RSS aware driver needs two things:
- Support multiple transmit and receive queues. At some point software RSS will be added, which will use multiple netisr queues and threads to handle traffic from a single hardware queue.
- Support RSS, obviously. The same caveat above holds.
RSS support in NICs typically comes with:
- Multiple transmit and receive hardware queues;
- A way to configure the RSS key;
- An RSS indirection table - typically 128 entries, indexed by the lower 7 bits of the RSS hash calculated for each received frame
- .. and each RSS indirection entry maps to one receive queue.
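Conceptually, the hardware-side lookup looks something like the sketch below. The 128-entry table indexed by the low 7 bits of the hash is an assumption for illustration; table sizes and masks vary by NIC.

    #include <sys/types.h>

    #define EXAMPLE_RSS_TABLE_SIZE  128     /* assumed table size */

    /* Each entry names the receive queue for hashes that land on it. */
    static uint8_t example_indirection_table[EXAMPLE_RSS_TABLE_SIZE];

    static u_int
    example_hash_to_rx_queue(uint32_t rss_hash)
    {
        return (example_indirection_table[rss_hash &
            (EXAMPLE_RSS_TABLE_SIZE - 1)]);
    }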
Then if RSS is enabled:
- Program the RSS key into hardware at startup.
- Create multiple transmit and receive queues, one per RSS bucket. The number of RSS buckets is whatever the FreeBSD RSS code is configured for, which is typically based on the number of available CPUs.
- Bind the interrupt swi, transmit and receive queue threads and any deferred task threads to the correct CPU for the given RSS bucket.
- Program the RSS indirection table - for each RSS indirection entry, query the FreeBSD RSS code for which RSS bucket that maps to.
- If a received frame has RSS information attached, populate the receive mbuf with said information. Typically the hardware will have some RSS bits in the receive descriptor saying what the hash was calculated with (TCP, UDP, IPv4, IPv6, etc) as well as the 32-bit RSS hash value itself.
- When transmitting, ensure that the correct destination queue is selected by querying the RSS layer with the mbuf RSS hash and RSS hash type. This ensures transmit and receive lines up on the same hardware queue(s).
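Putting the steps above together, a hedged sketch of an RSS-aware driver's attach-time setup might look like this. The example_hw_*() and example_bind_queue_to_cpu() helpers are hypothetical placeholders for NIC-specific register programming and interrupt/taskqueue binding; the rss_*() calls are the kernel API described earlier (check sys/netinet/in_rss.h for exact prototypes), and the 40-byte key size and 128-entry indirection table are assumptions.

    #include <sys/param.h>
    #include <sys/mbuf.h>

    #include <netinet/in_rss.h>

    /* Hypothetical NIC-specific helpers (placeholders, not a real API). */
    void example_hw_set_rss_key(const uint8_t *key, size_t len);
    void example_hw_setup_queues(u_int nqueues);
    void example_hw_set_indirection_entry(u_int index, u_int queue);
    void example_bind_queue_to_cpu(u_int queue, u_int cpu);

    static void
    example_driver_rss_setup(void)
    {
        uint8_t key[40];                /* assumption: 40-byte Toeplitz key */
        u_int i, bucket, cpu, nbuckets;
        const u_int tbl_size = 128;     /* assumption: 128-entry table */

        /* 1. Fetch the system RSS key and program it into the NIC
         *    (assumption: rss_getkey() copies into a caller buffer). */
        rss_getkey(key);
        example_hw_set_rss_key(key, sizeof(key));

        /* 2. One transmit/receive queue per RSS bucket. */
        nbuckets = rss_getnumbuckets();
        example_hw_setup_queues(nbuckets);

        /* 3. Bind each queue's interrupt and taskqueue threads to the
         *    CPU currently assigned to that bucket. */
        for (bucket = 0; bucket < nbuckets; bucket++) {
            cpu = rss_getcpu(bucket);
            example_bind_queue_to_cpu(bucket, cpu);
        }

        /* 4. Program the indirection table from the RSS bucket layout;
         *    here queue index == bucket index (one queue per bucket). */
        for (i = 0; i < tbl_size; i++) {
            bucket = rss_get_indirection_to_bucket(i);
            example_hw_set_indirection_entry(i, bucket);
        }
    }

    /*
     * 5. On receive, record the hardware hash and hash type in the mbuf
     *    so the stack can reuse it (hash-type constants from sys/mbuf.h).
     */
    static void
    example_rx_stamp(struct mbuf *m, uint32_t hw_hash, int is_tcpv4)
    {
        m->m_pkthdr.flowid = hw_hash;
        M_HASHTYPE_SET(m, is_tcpv4 ? M_HASHTYPE_RSS_TCP_IPV4 :
            M_HASHTYPE_RSS_IPV4);
    }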
UDP RSS
UDP is mostly the same as TCP, except where it isn't.
Some NICs support hashing on IPv4 / IPv6 UDP information. It's not part of the Microsoft specification, but the support is there.
The UDP transmit path doesn't assign a flowid / flowtype during udp_output(). The only time this occurs is when udp_output() calls ip_output() - if flowtable is enabled, the flowtable code will assign the flowtable hash to the mbuf flowid.
The ip_output() path has some code that inspects the inp for a flowid/flowtype and assigns that to the mbuf. For UDP this won't work in all cases, as the inp may not reflect the actual source/destination of the frame. So the inp can't just be populated with the flow details.
.. which isn't entirely true - it can be, but then during udp_output() the inp flowid/flowtype can only be used when the source/destination is an exact match, not a wildcard match. (Ie, it's a connected socket with a defined local and remote address/port - then send / recv will just use those IPs. If it's sourcing from INADDR_ANY, or the destination is being overridden by sendto / sendmsg, the flowid will need re-calculating.)
So for transmit:
- If the flowid in the inp (eventually) needs recalculating, then recalculate it before transmit
- If userland already has the flow details (ie, it's replying back to the same sending IP/port over a symmetric path - the reply goes out the same network interface, with the same source address the original frame was sent to) then we should be able to recycle it
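A hedged sketch of that decision (simplified, and not the in-tree code): example_software_hash() is a hypothetical placeholder for a software Toeplitz hash over the tuple actually being sent, the ports are assumed to be in network byte order, and the inp_flowid / inp_flowtype fields are assumed to be the ones the RSS work added to struct inpcb.

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>

    #include <netinet/in.h>
    #include <netinet/in_pcb.h>

    /* Hypothetical placeholder for a software Toeplitz hash calculation. */
    uint32_t example_software_hash(struct in_addr src, struct in_addr dst,
        uint16_t sport, uint16_t dport);

    static uint32_t
    example_udp_tx_flowid(struct inpcb *inp, struct in_addr src,
        struct in_addr dst, uint16_t sport, uint16_t dport)
    {
        /*
         * A connected socket whose local/foreign details exactly match
         * this datagram can recycle the flowid cached in the inp.
         */
        if (inp->inp_laddr.s_addr != INADDR_ANY &&
            inp->inp_faddr.s_addr == dst.s_addr &&
            inp->inp_fport == dport &&
            inp->inp_flowtype != M_HASHTYPE_NONE)
            return (inp->inp_flowid);

        /* Otherwise, recalculate over the tuple actually being sent. */
        return (example_software_hash(src, dst, sport, dport));
    }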
There's also the problem of IP fragments - see below.
The PCBGROUPS/RSS code doesn't know about UDP hashing. By default the UDP setup code configures the PCB info table as a 2-tuple hash, rather than a 4-tuple hash. The default igb/ixgbe RSS configuration, however, hashes UDP on the 4-tuple. So unless a few pieces are updated to treat UDP hashing as 4-tuple, the NIC-provided hashing won't line up with the hash types RSS/PCBGROUPS expect, and things won't work.
So, the TODO list looks something like this:
- Extend the RSS and PCBGROUP code to understand UDP as a hash type;
- Add in IP options to populate the received frame with the RSS and flow information - for use by recvmsg();
- Add in support in udp_output() to parse the information provided via sendmsg() and override the flowid / flowtype / RSS bucket;
- Modify ip_output() to optionally not set a flowid and trust that the sender has set it correctly (or that zero is the correct value);
- Handle IP options;
- Eventually - add support for determining if the inp flowid for a connected UDP socket is usable and use it - otherwise either default to zero or do a software Toeplitz hash;
- Handle IP fragments.
It's possible that for now the correct thing to do is to change the NIC hash configuration to not hash on UDP 4-tuple and to treat UDP as a straight IPv4 or IPv6 packet. It means that all UDP traffic between any given pair of hosts will be mapped to the same CPU instead of being distributed, but it avoids worrying about IP fragment hash handling.
So to finish it off:
- Modify igb(4), ixgbe(4) to not hash on UDP 4-tuple, but only two tuple;
- Add awareness in PCBGROUPS/RSS about the UDP hash type, just as a placeholder;
- Document somewhere that UDP is being hashed as a 2-tuple hash rather than a 4-tuple hash for the above IP fragment reasons;
- Ensure that the hashing is correctly occurring using 2-tuple logic, not 4-tuple logic.
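For the driver side of this, a hedged sketch: it assumes rss_gethashconfig() and the RSS_HASHTYPE_RSS_* flags are available in the RSS header (they're not listed in the API summary above, so treat them as an assumption), and example_hw_enable_hash() is a hypothetical placeholder for whatever hash-enable bits the NIC exposes.

    #include <sys/param.h>

    #include <netinet/in_rss.h>

    /* Hypothetical NIC-specific hash-enable helper (placeholder). */
    void example_hw_enable_hash(int tcpv4, int udpv4, int ipv4);

    static void
    example_configure_hw_hashing(void)
    {
        uint32_t cfg = rss_gethashconfig();

        /*
         * Only enable UDP 4-tuple hashing in hardware when the stack is
         * configured to expect it; otherwise fall back to plain IPv4
         * 2-tuple hashing for UDP traffic.
         */
        example_hw_enable_hash(
            (cfg & RSS_HASHTYPE_RSS_TCP_IPV4) != 0,
            (cfg & RSS_HASHTYPE_RSS_UDP_IPV4) != 0,
            (cfg & RSS_HASHTYPE_RSS_IPV4) != 0);
    }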
Handling IP Fragments
This is especially a problem with UDP.
For the IPv4 path, it's handled via ip_reass() in sys/netinet/ip_input.c.
For the IPv6 path, IPv6 fragments are carried as a separate IPv6 protocol (the fragment header). They're effectively treated like a tunnel - the fragments are reassembled and the complete frame is then re-parsed as a full IPv6 frame. That's handled in sys/netinet6/frag6.c.
So, once the fragments have been received and reassembled a 2-tuple or 4-tuple RSS hash is required. This depends upon the protocol type (TCP, UDP, other) and whether TCP/UDP hashing is enabled in RSS.
The easiest solution would be to just recalculate the hash on the completed frame. The more complicated solution is to check if the hash type has changed (eg IPv4 TCP frame, which the hardware stamped with a 2-tuple hash type) and if it still has the correct hash type, treat it as valid.
Another amusing thing is where to reinject the completed packet. The destination RSS bucket (and thus CPU) is likely different from the CPU which reassembled the fragments. Somehow the packet needs to be re-injected into the correct netisr queue and handled appropriately.
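A hedged sketch of that re-hash / re-inject step for IPv4 (the in-tree version lives around ip_reass() in sys/netinet/ip_input.c; example_mbuf_rss_hash_v4() is a hypothetical placeholder for a software Toeplitz hash over the reassembled packet's headers):

    #include <sys/param.h>
    #include <sys/mbuf.h>

    #include <net/netisr.h>

    /* Hypothetical placeholder: fills in the software hash value/type. */
    int example_mbuf_rss_hash_v4(struct mbuf *m, uint32_t *hashval,
        uint32_t *hashtype);

    static void
    example_reinject_reassembled(struct mbuf *m)
    {
        uint32_t hash, type;

        /* The fragments' 2-tuple hash may no longer be the right one. */
        if (example_mbuf_rss_hash_v4(m, &hash, &type) == 0) {
            m->m_pkthdr.flowid = hash;
            M_HASHTYPE_SET(m, type);
        }

        /*
         * Hand the completed packet back to netisr; with RSS-aware
         * dispatch it is queued to the CPU owning the flow's RSS bucket
         * rather than processed on the reassembling CPU.
         */
        netisr_dispatch(NETISR_IP, m);
    }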