Bug #1805
pfring: zero copy broken
Description
It appears that in certain setups, using PF_RING with multiple threads and zero copy mode is broken.
My test is simple: I blast ~9.6Gbps at the affected system. After a while it sometimes crashes.
I have made a test to trigger the issue very quickly: in our 'Packet' structure we have a pointer to the position in the packet where the ethernet header starts. I can see that in some cases the data it points to gets corrupted.
So the test I added does this:
Next to the pointer, I added a static data structure for holding the contents of the ethernet header. On ethernet layer decoding I copy the data from the pointer into the static struct. Then just before the end of the life of the packet inside Suricata (so before the next pfring_recv call on that thread) I compare the data the pointer points to with my static copy to check that they are the same. If not, I abort.
This test can be found here https://github.com/inliniac/suricata/pull/2144/files
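A minimal sketch of the check described above, for readers who do not want to open the pull request. This is not the code from the PR: `DemoPacket`, `DemoDecodeEthernet` and `DemoEthernetIntact` are hypothetical names, and the struct is a stripped-down stand-in for Suricata's real `Packet`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Plain ethernet header, 14 bytes, no VLAN tag. */
typedef struct EthHdr_ {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t eth_type;      /* network byte order */
} EthHdr;

/* Hypothetical, stripped-down stand-in for Suricata's Packet. */
typedef struct DemoPacket_ {
    const EthHdr *ethh;     /* points into the capture buffer (zero copy) */
    EthHdr ethh_copy;       /* private snapshot taken at decode time */
} DemoPacket;

/* Ethernet decode step: remember where the header lives and snapshot it. */
static void DemoDecodeEthernet(DemoPacket *p, const uint8_t *pkt, size_t len)
{
    assert(p != NULL && pkt != NULL);
    if (len < sizeof(EthHdr)) {
        p->ethh = NULL;     /* too short to hold an ethernet header */
        return;
    }
    p->ethh = (const EthHdr *)pkt;
    memcpy(&p->ethh_copy, pkt, sizeof(EthHdr));
}

/* Called just before the packet is released (before the next receive call
 * on the same thread). Returns 1 if the buffer still matches the snapshot,
 * 0 if the data behind the pointer changed in the meantime. */
static int DemoEthernetIntact(const DemoPacket *p)
{
    if (p->ethh == NULL)
        return 1;           /* nothing was decoded, nothing to compare */
    return memcmp(p->ethh, &p->ethh_copy, sizeof(EthHdr)) == 0;
}
```

In a real capture loop the caller would abort as soon as `DemoEthernetIntact()` returns 0, since nothing in the packet path itself is supposed to modify the buffer between decode and release.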
When using more than one thread, it blows up within a minute. When I use one thread, it appears to work correctly, even when running for a long time.
On manual inspection I can see that the 'static' copy of the ethernet header is correct. It contains the proper eth_type. The packet has also been decoded correctly at the higher levels, which proves that the data behind the pointer was correct at one point in time as well. However, at the time of the check the pointer to the ethernet header shows junk values.
I suspect there is some synchronization issue in the kernel/pfring module/driver.
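To illustrate the general class of problem suspected here (not the actual root cause, which this report does not establish), the following sketch simulates a ring in which the producer reuses a slot while the consumer still holds a zero-copy pointer into it. `DemoRing` and its functions are hypothetical and unrelated to the PF_RING internals.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SLOTS    4
#define DEMO_SLOT_LEN 16

/* Hypothetical toy ring, not the PF_RING layout. */
typedef struct DemoRing_ {
    uint8_t  slots[DEMO_SLOTS][DEMO_SLOT_LEN];
    unsigned insert_idx;    /* total packets written by the producer */
    unsigned remove_idx;    /* total packets handed to the consumer */
} DemoRing;

/* Producer: writes the next slot WITHOUT checking whether the consumer
 * is still using it. This missing check is the deliberate bug. */
static void DemoRingInsert(DemoRing *r, uint8_t fill)
{
    memset(r->slots[r->insert_idx % DEMO_SLOTS], fill, DEMO_SLOT_LEN);
    r->insert_idx++;
}

/* Consumer, zero-copy style: returns a pointer into the ring itself
 * instead of copying the packet out. */
static const uint8_t *DemoRingRecvZeroCopy(DemoRing *r)
{
    assert(r->remove_idx < r->insert_idx);   /* a packet must be pending */
    return r->slots[r->remove_idx++ % DEMO_SLOTS];
}
```

If the consumer receives a packet and the producer then inserts `DEMO_SLOTS` more packets before the consumer is done, the bytes behind the consumer's pointer change underneath it. That is the same symptom the test above detects: a private copy stays valid while the zero-copy pointer shows different data.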
On the same hardware and running the same test both AF_PACKET(v3) and NETMAP behave correctly.
Setup:
Intel X710:
# ethtool -i ens2f1
driver: i40e
version: 1.4.25-k
firmware-version: 4.53 0x8000206e 0.0.0
expansion-rom-version:
bus-info: 0000:0f:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
It's an older (Nehalem) 4-core Xeon with Hyper-Threading:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 26
Model name: Intel(R) Xeon(R) CPU W3550 @ 3.07GHz
8 RSS queues:
[ 0.869890] i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 1.4.25-k
[ 0.869892] i40e: Copyright (c) 2013 - 2014 Intel Corporation.
[ 0.885006] i40e 0000:0f:00.0: fw 4.40.35115 api 1.4 nvm 4.53 0x8000206e 0.0.0
[ 0.989150] i40e 0000:0f:00.0: MAC address: xxx
[ 0.993134] i40e 0000:0f:00.0: SAN MAC: xxx
[ 1.673081] i40e 0000:0f:00.0: PCI-Express: Speed 5.0GT/s Width x8
[ 1.673084] i40e 0000:0f:00.0: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
[ 1.673086] i40e 0000:0f:00.0: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
[ 1.679122] i40e 0000:0f:00.0: Features: PF-id[0] VFs: 64 VSIs: 66 QP: 8 RX: 1BUF RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[ 1.693104] i40e 0000:0f:00.1: fw 4.40.35115 api 1.4 nvm 4.53 0x8000206e 0.0.0
[ 1.795281] i40e 0000:0f:00.1: MAC address: xxx
[ 1.799253] i40e 0000:0f:00.1: SAN MAC: xxx
[ 2.043232] i40e 0000:0f:00.1: PCI-Express: Speed 5.0GT/s Width x8
[ 2.043237] i40e 0000:0f:00.1: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
[ 2.043240] i40e 0000:0f:00.1: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
[ 2.074505] i40e 0000:0f:00.1: Features: PF-id[1] VFs: 64 VSIs: 66 QP: 8 RX: 1BUF RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[ 2.075630] i40e 0000:0f:00.1 ens2f1: renamed from eth2
[ 2.093337] i40e 0000:0f:00.0 ens2f0: renamed from eth0
[ 3953.702730] i40e 0000:0f:00.1 ens2f1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[ 3957.127461] i40e 0000:0f:00.1 ens2f1: NIC Link is Down
[ 3959.517008] i40e 0000:0f:00.1 ens2f1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
Using PF_RING 6.4.0
[18827] 9/6/2016 -- 11:01:34 - (runmode-pfring.c:343) <Info> (ParsePfringConfig) -- Using flow cluster mode for PF_RING (iface ens2f1)
[18827] 9/6/2016 -- 11:01:34 - (util-runmodes.c:295) <Info> (RunModeSetLiveCaptureWorkersForDevice) -- Going to use 2 thread(s)
[New Thread 0x7ffff3e18700 (LWP 18859)]
[18859] 9/6/2016 -- 11:01:34 - (source-pfring.c:472) <Info> (ReceivePfringThreadInit) -- Enabling zero-copy for ens2f1
[18859] 9/6/2016 -- 11:01:34 - (source-pfring.c:537) <Info> (ReceivePfringThreadInit) -- (W#01-ens2f1) Using PF_RING v.6.4.0, interface ens2f1, cluster-id 99
[New Thread 0x7ffff2f54700 (LWP 18860)]
[18860] 9/6/2016 -- 11:01:34 - (source-pfring.c:472) <Info> (ReceivePfringThreadInit) -- Enabling zero-copy for ens2f1
[18860] 9/6/2016 -- 11:01:34 - (source-pfring.c:537) <Info> (ReceivePfringThreadInit) -- (W#02-ens2f1) Using PF_RING v.6.4.0, interface ens2f1, cluster-id 99
[18827] 9/6/2016 -- 11:01:34 - (runmode-pfring.c:521) <Info> (RunModeIdsPfringWorkers) -- RunModeIdsPfringWorkers initialised

$ cat /proc/net/pf_ring/info
PF_RING Version : 6.4.0 (unknown)
Total rings : 2

Standard (non ZC) Options
Ring slots : 4096
Slot version : 16
Capture TX : Yes [RX+TX]
IP Defragment : No
Socket Mode : Standard
Total plugins : 0
Cluster Fragment Queue : 0
Cluster Fragment Discard : 0

$ cat /proc/net/pf_ring/19136-ens2f1.37
Bound Device(s) : ens2f1
Active : 1
Breed : Standard
Appl. Name : Suricata
Socket Mode : RX+TX
Capture Direction : RX+TX
Sampling Rate : 1
IP Defragment : No
BPF Filtering : Disabled
Sw Filt Hash Rules : 0
Sw Filt WC Rules : 0
Hw Filt Rules : 0
Sw Filt Hash Match : 0
Sw Filt Hash Miss : 0
Poll Pkt Watermark : 128
Num Poll Calls : 2
Channel Id Mask : 0xFFFFFFFFFFFFFFFF
Cluster Id : 99
Slot Version : 16 [6.4.0]
Min Num Slots : 4098
Bucket Len : 1524
Slot Len : 1728 [bucket+header]
Tot Memory : 7090176
Tot Packets : 9680214
Tot Pkt Lost : 9220222
Tot Insert : 458907
Tot Read : 448573
Insert Offset : 294608
Remove Offset : 297888
Num Free Slots : 0
TX: Send Ok : 0
TX: Send Errors : 0
Reflect: Fwd Ok : 0
Reflect: Fwd Errors: 0
Updated by Victor Julien almost 8 years ago
If anyone is willing to run https://github.com/inliniac/suricata/pull/2144 and report back, I'd appreciate it very much! It will abort Suricata in case of this issue, or it will run happily w/o issues otherwise.
Updated by Victor Julien almost 8 years ago
It's still broken with these options:
# ethtool -k ens2f1
Features for ens2f1:
rx-checksumming: off
tx-checksumming: off
tx-checksum-ipv4: off
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: off
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off
scatter-gather: off
tx-scatter-gather: off
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off
tx-tcp6-segmentation: off
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: off
tx-vlan-offload: off
ntuple-filters: off
receive-hashing: off
highdma: off
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]
hw-tc-offload: off [fixed]

# ethtool -a ens2f1
Pause parameters for ens2f1:
Autonegotiate: off
RX: off
TX: off

# ethtool -c ens2f1
Coalesce parameters for ens2f1:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 25
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 256
tx-usecs: 25
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 256
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
Updated by Victor Julien almost 8 years ago
The problem also appears if I modify the 'single' runmode to still use zero copy. Normally zero copy is only activated in the 'workers' runmode. So the problem also appears with a single pfring processing thread, as long as zero copy is enabled.
Updated by Victor Julien almost 8 years ago
I can also reproduce this in a modified pfcount, so it appears not specific to Suricata.
Updated by Peter Manev almost 8 years ago
I can confirm the same. Some additional info from my test environment:
- using the latest pfring git master
- Ubuntu LTS Trusty with 3.19 kernel
PF_RING Version : 6.5.0 (dev:9f358aa8dd5b43bb74f67304c10ff41915e2f562)
Total rings : 0

Standard (non ZC) Options
Ring slots : 65534
Slot version : 16
Capture TX : Yes [RX+TX]
IP Defragment : No
Socket Mode : Standard
Total plugins : 0
Cluster Fragment Queue : 0
Cluster Fragment Discard : 0

driver: ixgbe
version: 4.2.1
firmware-version: 0x800000cb
bus-info: 0000:04:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

driver: ixgbe
version: 4.2.1
firmware-version: 0x800000cb
bus-info: 0000:04:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
stepping : 7
microcode : 0x70b
cpu MHz : 3186.105
cache size : 20480 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid xsaveopt
bugs :
bogomips : 5399.69
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Updated by Victor Julien almost 8 years ago
Also reproducible on a different (Intel) card: an 82576 with the igb driver.
PF_RING upstream is now convinced it's a PF_RING issue.
Updated by Victor Julien almost 8 years ago
- Status changed from New to Closed
Addressed upstream:
https://github.com/ntop/PF_RING/commit/939ac93b8d2920d10364bdfd78b2eb0f91800f05
https://github.com/ntop/PF_RING/commit/e738bb088eb28bcc11c6534e950c42ea9f92d64b
Only in the PF_RING dev branch at this moment.
Updated by Victor Julien almost 8 years ago
It appears 6.4.1 has been released to fix this and other issues.