TCP Callout Redesign in an RSS World
These notes summarize a recent discussion among several of us (Adrian, Matt, Randall, Drew, Jonathan) about TCP callouts, particularly in an RSS world.
These are my (Jonathan's) notes, and I've fleshed out the idea beyond what we discussed, so any errors are mine. Credit goes to the group.
Observations
We manage state on a session almost statelessly and asynchronously: we only evaluate it when something (a timer, a packet, a user request, etc.) triggers a thread to look at the TCPCB again.
- We have numerous timers, which may or may not interact with each other at any given time.
- There is no guarantee about the locality (CPU core) of the threads that will touch a TCPCB.
- There is no guarantee that all timers for a TCPCB will run on the same core or callout wheel. (Randall disputes this observation. I haven't re-examined the code in a while.)
- There is no guarantee that two timers on different cores will not try to access the same TCPCB at the same time.
- There is no guarantee that timers will run on the same core as each other, as the core that last touched the input/output queue, or as the core that receives input packets. Hopefully, RSS will eventually localize much of this; however, I’m not sure there is a guarantee that it really is localized.
Suggestion
Create N threads per RSS bucket to handle TCP "session state maintenance" (which I am basically defining as the same things our timers do for us today).
- Place the TCPCB in the callout wheel based on the time of the next event that needs to occur.
- When the TCPCB hits the front of the callout wheel, evaluate its state in a sane, holistic way and take whatever action needs to be taken for the TCPCB.
- If the TCPCB is locked, just reschedule the callout (up to X times, after which you block on the lock).
- Fudge things so you will evaluate all actions that are "about" to happen (e.g. in the next half-millisecond or so) at the same time.
- Ideally, co-locate the "session state maintenance" (i.e. "timer") threads on the same core as the RSS bucket.
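The steps above can be sketched roughly as follows. This is a minimal userland illustration, not kernel code: `struct tcb`, the `EV_*` event names, `FUDGE_US`, `MAX_RETRY`, and the boolean `locked` flag are all hypothetical stand-ins (for the real TCPCB, its timers, the half-millisecond fudge window, the "X" retry limit, and the TCPCB lock, respectively). It shows one deadline per TCPCB driving the wheel position, a single pass that handles everything due within the fudge window, and the reschedule-then-block behavior when the TCPCB is locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event kinds, loosely modeled on the existing TCP timers. */
enum tcb_event { EV_REXMT, EV_PERSIST, EV_KEEP, EV_2MSL, EV_DELACK, EV_COUNT };

#define EV_IDLE   UINT64_MAX  /* event is not armed */
#define FUDGE_US  500         /* "about to happen" window, about half a millisecond */
#define MAX_RETRY 3           /* the "X" from the notes; the value is arbitrary */

/* Simplified stand-in for a TCPCB; this is NOT the real struct tcpcb. */
struct tcb {
	uint64_t deadline_us[EV_COUNT]; /* absolute due time of each event */
	bool     locked;                /* stand-in for the TCPCB lock being held */
	unsigned retries;               /* consecutive failed lock attempts */
};

enum tcb_result { TCB_RAN, TCB_RESCHEDULED, TCB_MUST_BLOCK };

static void
tcb_init(struct tcb *t)
{
	assert(t != NULL);
	for (int i = 0; i < EV_COUNT; i++)
		t->deadline_us[i] = EV_IDLE;
	t->locked = false;
	t->retries = 0;
}

/* Earliest armed deadline: where the TCPCB belongs in the callout wheel. */
static uint64_t
tcb_next_deadline(const struct tcb *t)
{
	uint64_t next = EV_IDLE;

	assert(t != NULL);
	for (int i = 0; i < EV_COUNT; i++)
		if (t->deadline_us[i] < next)
			next = t->deadline_us[i];
	return next;
}

/*
 * Called when the TCPCB reaches the front of the wheel.  On TCB_RAN,
 * *fired holds a bitmask of the events handled in this single pass.
 */
static enum tcb_result
tcb_service(struct tcb *t, uint64_t now_us, unsigned *fired)
{
	assert(t != NULL && fired != NULL);
	*fired = 0;

	if (t->locked) {
		/* Reschedule up to MAX_RETRY times, then tell the caller to block. */
		if (++t->retries > MAX_RETRY)
			return TCB_MUST_BLOCK;
		return TCB_RESCHEDULED;
	}
	t->retries = 0;

	/* Handle everything already due or due within the fudge window. */
	for (int i = 0; i < EV_COUNT; i++) {
		if (t->deadline_us[i] != EV_IDLE &&
		    t->deadline_us[i] <= now_us + FUDGE_US) {
			*fired |= 1u << i;
			t->deadline_us[i] = EV_IDLE; /* real code would act, then re-arm */
		}
	}
	return TCB_RAN;
}
```

After a `TCB_RAN` result, the caller would re-insert the TCPCB into the wheel at `tcb_next_deadline()`, so each TCPCB occupies at most one wheel slot at a time instead of one slot per timer.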
If you really want to take this a step further, you could even do things like:
- When incoming packets arrive on a different CPU, queue them to be handled by the correct TCP "session state maintenance" thread.
- When the user does something that would create a call to tcp_output(), schedule the call to occur within the TCP "session state maintenance" thread.
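The hand-off in these two items might look something like the sketch below. Everything here is an assumption made for illustration: `bucket_of()` uses a plain modulo where real RSS would consult the NIC's hash and indirection table, `struct work` and `submit_work()` are hypothetical names, and the ring has no locking, so a real version would need a lock or a lock-free queue between the submitting CPU and the owning thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 4  /* hypothetical number of RSS buckets */
#define QLEN     8  /* hypothetical per-bucket queue depth */

/* The two kinds of deferred work named in the notes. */
enum work_kind { WORK_INPUT, WORK_OUTPUT };

struct work {
	enum work_kind kind;
	uint32_t       flow_hash; /* hash identifying the connection */
};

/* One FIFO per bucket, drained by that bucket's maintenance thread. */
struct bucket_queue {
	struct work ring[QLEN];
	size_t      head; /* count of items dequeued so far */
	size_t      tail; /* count of items enqueued so far */
};

static struct bucket_queue queues[NBUCKETS];

/* Stand-in for the RSS hash-to-bucket mapping. */
static unsigned
bucket_of(uint32_t flow_hash)
{
	return flow_hash % NBUCKETS;
}

/* Returns 0 on success, -1 if the bucket's queue is full. */
static int
bucket_enqueue(unsigned bucket, struct work w)
{
	assert(bucket < NBUCKETS);
	struct bucket_queue *q = &queues[bucket];

	if (q->tail - q->head == QLEN)
		return -1;
	q->ring[q->tail % QLEN] = w;
	q->tail++;
	return 0;
}

/* Returns 0 and fills *out on success, -1 if the queue is empty. */
static int
bucket_dequeue(unsigned bucket, struct work *out)
{
	assert(bucket < NBUCKETS && out != NULL);
	struct bucket_queue *q = &queues[bucket];

	if (q->head == q->tail)
		return -1;
	*out = q->ring[q->head % QLEN];
	q->head++;
	return 0;
}

/*
 * Returns 1 if the caller is already on the owning bucket and should do
 * the work inline, 0 if the work was queued to the owner, -1 if the
 * owner's queue was full.
 */
static int
submit_work(unsigned cur_bucket, struct work w)
{
	unsigned owner = bucket_of(w.flow_hash);

	if (owner == cur_bucket)
		return 1;
	return bucket_enqueue(owner, w);
}
```

The point of routing both input and output requests through one queue is that only the owning thread ever touches the TCPCB, which is where the hoped-for reduction in lock contention and cache misses would come from.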
This is just a thumbnail sketch, but it should give the general idea. Hopefully, this will decrease cache misses, reduce lock contention, reduce the number of times we reschedule callouts, etc.