Limelight Networks ixgbe(4) tuning
Hardware
http://www.supermicro.com/products/accessories/addon/AOC-CTG-i2S.cfm
- 6-core, single-socket Sandy Bridge/Ivy Bridge CPU
Tuning
- hw.ix.tx_process_limit: 256 (512?)
- hw.ix.rx_process_limit: 256 (512?)
- hw.ix.rxd: 2048 (4096?) - In the past we've found that 2048 gives the best performance, but with different cards
- hw.ix.txd: 2048 (4096?)
- net.inet.tcp.tso=0
- kern.sched.slice=12
- hw.ix.enable_aim: 1
- dev.ix.0.fc=3
- hw.ix.num_queues=6 (4/5?)
- kern.random.sys.harvest.ethernet=1
- kern.random.sys.harvest.point_to_point=1
- kern.random.sys.harvest.interrupt=1
- kern.ipc.maxsockbuf=524288 (16777216?) - This is tuned for multiple 1G links; I don't see us hitting this limit yet, but we may need to increase it in the future. (See the sketch after this list for where each setting goes.)
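As a minimal sketch of how these settings would typically be split on a stock FreeBSD install (file placement is an assumption, not from the original notes): the hw.ix.* knobs are boot-time loader tunables, the rest are runtime sysctls.

# /boot/loader.conf (loader tunables, read at boot)
hw.ix.tx_process_limit="256"
hw.ix.rx_process_limit="256"
hw.ix.rxd="2048"
hw.ix.txd="2048"
hw.ix.enable_aim="1"
hw.ix.num_queues="6"

# /etc/sysctl.conf (runtime sysctls)
net.inet.tcp.tso=0
kern.sched.slice=12
dev.ix.0.fc=3
kern.ipc.maxsockbuf=524288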
netperf can be used here to easily measure throughput, and netstat -I ix0 -hw 1 will show a running packets-per-second count.
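For example (the peer address 192.0.2.10 is illustrative and assumes a netserver instance is already running on that host):

netperf -H 192.0.2.10 -t TCP_STREAM   # bulk TCP throughput to the peer
netstat -I ix0 -hw 1                  # per-second packet/byte counters for ix0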
The process-limit, entropy-harvest, and TSO/LRO settings are the ones that could ease CPU usage; we are very light on inbound, though, so TSO is the more likely candidate. The rest are performance related. To cut the entropy-harvesting overhead (applied at runtime as shown after this list):
- kern.random.sys.harvest.ethernet=0
- kern.random.sys.harvest.point_to_point=0
- kern.random.sys.harvest.interrupt=0
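A minimal sketch of applying these immediately and persisting them (assuming /etc/sysctl.conf is used for persistence):

sysctl kern.random.sys.harvest.ethernet=0
sysctl kern.random.sys.harvest.point_to_point=0
sysctl kern.random.sys.harvest.interrupt=0
# persist across reboots by adding the same three lines to /etc/sysctl.conf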
BSD Router Project ixgbe(4) tuning (objective: routing)
Specific tuning for obtaining the best PPS performance (smallest packet size).
Hardware tested
Benchmarks were done on a quad-core Intel Xeon L5630 2.13GHz with hyper-threading disabled (IBM System x3550 M3), with an Intel Ethernet Controller 10-Gigabit X540-AT2:
ix0@pci0:21:0:0: class=0x020000 card=0x00018086 chip=0x15288086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller 10-Gigabit X540-AT2'
    class      = network
    subclass   = ethernet
    bar   [10] = type Prefetchable Memory, range 64, base 0xfbc00000, size 2097152, enabled
    bar   [20] = type Prefetchable Memory, range 64, base 0xfbe04000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 64 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR
                 link x8(x8) speed 5.0(5.0) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 a0369fffff1e2814
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SRIOV 1
    ecap 000d[1d0] = ACS 1
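This is the kind of listing printed by pciconf(8); it was presumably obtained with something along these lines (flag combination is an assumption):

pciconf -lvbc    # -v vendor/device names, -b BARs, -c capabilities; the ix0 entry is shown above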
Bench lab and packet generator/receiver
Lab diagram:
+-----------------------------------+      +----------------------------------+
|   Packet generator and receiver   |      |        Device under Test         |
|                                   |      |                                  |
| ix0: 8.8.8.1 (a0:36:9f:1e:1e:d8)  |=====>| ix0: 8.8.8.2 (a0:36:9f:1e:28:14) |
|      2001:db8:8::1                |      |      2001:db8:8::2               |
|                                   |      |                                  |
| ix1: 9.9.9.1 (a0:36:9f:1e:1e:da)  |<=====| ix1: 9.9.9.2 (a0:36:9f:1e:28:16) |
|      2001:db8:9::1                |      |      2001:db8:9::2               |
|                                   |      |                                  |
|                                   |      | static routes                    |
|                                   |      | 8.0.0.0/8 => 8.8.8.1             |
|                                   |      | 9.0.0.0/8 => 9.9.9.1             |
|                                   |      | 2001:db8:8::/48 => 2001:db8:8::1 |
|                                   |      | 2001:db8:9::/48 => 2001:db8:9::1 |
+-----------------------------------+      +----------------------------------+
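On the Device under Test, those static routes could be added with route(8) along these lines (addresses are taken from the diagram; the exact commands are a sketch, not the original setup procedure):

route add -net 8.0.0.0/8 8.8.8.1
route add -net 9.0.0.0/8 9.9.9.1
route add -inet6 2001:db8:8::/48 2001:db8:8::1
route add -inet6 2001:db8:9::/48 2001:db8:9::1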
The packet generator uses this command line (2000 flows: 20 different source IPs * 100 different destination IPs):
pkt-gen -i ix0 -f tx -n 1000000000 -l 60 -d 9.1.1.1:2000-9.1.1.100 -D a0:36:9f:1e:28:14 -s 8.1.1.1:2000-8.1.1.20 -w 4
The packet receiver uses this command line:
pkt-gen -i ix1 -f rx -w 4
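Roughly, the netmap pkt-gen flags used above mean the following (a summary, not from the original page; check pkt-gen's usage output for your version):

# -i ix0/ix1       interface to attach to (netmap mode)
# -f tx | -f rx    function: generate or receive/count packets
# -n 1000000000    number of packets to send
# -l 60            frame size in bytes (minimum Ethernet frame, excluding the 4-byte FCS)
# -s / -d          source / destination IPv4:port ranges (here 20 sources x 100 destinations = 2000 flows)
# -D               destination MAC address (the DUT's ix0)
# -w 4             wait 4 seconds before starting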
=> The reference PPS value for all these benchmarks is the "packet receiver" value.
Generic advice
See the Router/Gateway page for NIC-independent advice for router/gateway use (disabling Ethernet flow control, disabling LRO/TSO, etc.).
hw.ix.[r|t]x_process_limit
Disabling rx_process_limit and tx_process_limit gives the best performance (values in pps):
x rx|tx_process_limit=256 (default)
+ rx|tx_process_limit=512
* rx|tx_process_limit=-1 (no limit)
[ministat distribution plot not reproduced]
    N           Min           Max        Median           Avg        Stddev
x   5       1656772       1786725       1690620     1704710.4     48677.407
+   5       1657620       1703418       1679787     1681153.8     17459.489
No difference proven at 95.0% confidence
*   5       1918159       2036665       1950208       1963257     44988.621
Difference at 95.0% confidence
        258547 +/- 68356.2
        15.1666% +/- 4.00984%
        (Student's t, pooled s = 46869.3)
…but under heavy load, if the number of NIC queues equals the number of CPUs, the system becomes unresponsive.
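A minimal loader.conf sketch for removing the limits (values from the benchmark above; hw.ix.* are boot-time tunables, so a reboot is needed):

# /boot/loader.conf
hw.ix.rx_process_limit="-1"   # -1 = no per-interrupt RX packet limit
hw.ix.tx_process_limit="-1"   # -1 = no per-interrupt TX cleanup limit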
hw.ix.rxd and hw.ix.txd
What is the impact of modifying these limits on PPS?
x hw.ix.[r|t]xd=1024
+ hw.ix.[r|t]xd=2048
* hw.ix.[r|t]xd=4096
[ministat distribution plot not reproduced]
    N           Min           Max        Median           Avg        Stddev
x   5       1937832       2018222       2002526     1985231.6     35961.532
+   5       1881470       2011917       1937898     1943572.2     46782.722
No difference proven at 95.0% confidence
*   5       1886620       1989850       1968491       1956497     40190.155
No difference proven at 95.0% confidence
=> No difference
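Given that result, the defaults can be left alone; if you still want to pin the ring sizes, a minimal sketch (values from the table above; loader tunables, reboot required):

# /boot/loader.conf
hw.ix.rxd="2048"
hw.ix.txd="2048"
# current values can be read back with: sysctl hw.ix.rxd hw.ix.txd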