Support #2725

stream/packet on wrong thread

Added by Peter Manev 7 months ago. Updated about 11 hours ago.

Status:
Feedback
Priority:
Normal
Assignee:
Target version:
Affected Versions:
Effort:
Difficulty:
Label:

Description

Looking for feedback.

While investigating various research points with af-packet on live traffic and the latest git master (e.g. eedf08be/4.1), I noticed that I have never seen the following counters be 0 on any occasion:

stream.wrong_thread                        | Total                     | 2982446
tcp.pkt_on_wrong_thread                    | Total                     | 156187846

These statistics can be enabled via the stats section of suricata.yaml ( https://github.com/OISF/suricata/blob/master/suricata.yaml.in#L66 ):

# global stats configuration
stats:
  enabled: yes
  # The interval field (in seconds) controls at what interval
  # the loggers are invoked.
  interval: 8
  # Add decode events as stats.
  decoder-events: true
  # Add stream events as stats.
  stream-events: true

I have tried different NICs/drivers (tested ixgbe/i40e), af-packet v3/v2, cluster_flow/cluster_cpu/cluster_qm, VLAN tracking enabled or not, on different live-traffic machines and different kernels (4.18/4.19) -
capture.kernel_drops and stream.wrong_thread are never 0 and always increasing (more like 10-15% of the total in my test cases).

Looking for any feedback on whether you are experiencing the same issue or not, and what your setup is (if you would like to share).
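As a quick way to put these counters in perspective, the cumulative "Total" lines in stats.log can be parsed and compared against the overall packet count. A minimal Python sketch - the counter names follow the stats.log excerpts in this ticket, and the sample value for capture.kernel_packets is invented for illustration:

```python
# Sketch: parse the cumulative "Total" counters out of a stats.log and report
# the wrong-thread counter as a share of all captured packets. Counter names
# follow the excerpts in this ticket; the sample capture.kernel_packets value
# is a made-up placeholder.
import re

def parse_totals(stats_text):
    """Return the most recent value of every '<counter> | Total | <n>' line."""
    totals = {}
    for line in stats_text.splitlines():
        m = re.match(r"\s*([\w.]+)\s*\|\s*Total\s*\|\s*(\d+)", line)
        if m:
            totals[m.group(1)] = int(m.group(2))  # later records overwrite earlier ones
    return totals

sample = """
capture.kernel_packets                     | Total                     | 20000000
capture.kernel_drops                       | Total                     | 150000
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5671362
"""

t = parse_totals(sample)
pct = 100.0 * t["tcp.pkt_on_wrong_thread"] / t["capture.kernel_packets"]
print(f"tcp.pkt_on_wrong_thread is {pct:.1f}% of capture.kernel_packets")  # 28.4% here
```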


Files

statslog.tar.gz (802 KB) statslog.tar.gz Peter Manev, 12/05/2018 07:37 AM
wrong_threads.png (43.4 KB) wrong_threads.png Andreas Herz, 05/03/2019 08:53 AM
WRONG_THREAD.ods (13.4 KB) WRONG_THREAD.ods Sean Cloherty, 05/23/2019 07:59 PM
stats.log (10.7 KB) stats.log Sean Cloherty, 05/31/2019 02:19 PM
issue-2725.tar.xz (99.6 KB) issue-2725.tar.xz Peter Manev, 06/11/2019 08:18 PM

Related issues

Related to Support #2900: alert 'SURICATA STREAM pkt seen on wrong thread' when run mode set to workers
Related to Feature #3011: Add new 'cluster_peer' runmode to allow for load balancing by IP header (src<->dst) only

History

#1

Updated by Victor Julien 6 months ago

  • Tracker changed from Bug to Support

It could be interesting to check if having a single NIC queue helps.

It would also be interesting to see how it works with PF_RING and NETMAP.

#2

Updated by Victor Julien 6 months ago

Can you share a full stats.log record?

#3

Updated by Peter Manev 6 months ago

I could not test conclusively with a single RSS queue because the traffic was too much for one queue on all the live setups I can test. I will try to reproduce it some other way on a slower machine.

It would also be interesting to see/confirm that the issue is not present with dedicated or other HW - Netronome/Napatech/Mellanox/Accolade+pfring (that one I will test next week and confirm)/other...

#4

Updated by Peter Manev 6 months ago

stats.log attached

#5

Updated by Victor Julien 6 months ago

There is Teredo, ipv4-in-ipv6 and GRE tunneling. I wonder if those are handled correctly wrt AF_PACKET's load balancing.

#6

Updated by Peter Manev 6 months ago

Valid point, I think.
I am supposed to get access to a replay setup next week and will do a few runs with those types of live traffic specifically, to cover those cases as well. I will report back as soon as I have some findings.

#7

Updated by Victor Julien 6 months ago

I wonder if, as a test, it would make sense to set a capture filter of just "tcp". I think this should rule out the tunnels.

Also, despite the performance concern, I think it is still a good test to use a single queue on the NIC, just to see if that affects the 'wrong_thread' counters.

#8

Updated by Peter Manev 6 months ago

An update:
Setup - af-packet config:


suricata --dump-config |grep af-packet
af-packet = (null)
af-packet.0 = interface
af-packet.0.interface = enp59s0
af-packet.0.threads = 40
af-packet.0.cluster-id = 99
af-packet.0.cluster-type = cluster_flow
af-packet.0.defrag = yes
af-packet.0.xdp-mode = driver
af-packet.0.xdp-filter-file = /etc/suricata/xdp_filter.bpf
af-packet.0.bypass = yes
af-packet.0.use-mmap = yes
af-packet.0.mmap-locked = yes
af-packet.0.tpacket-v3 = yes
af-packet.0.ring-size = 300000
af-packet.0.block-size = 1048576
af-packet.1 = interface
af-packet.1.interface = default

NIC setup


ethtool -l enp59s0
Channel parameters for enp59s0:
Pre-set maximums:
RX:             0
TX:             0
Other:          1
Combined:       112
Current hardware settings:
RX:             0
TX:             0
Other:          1
Combined:       1

Case 1

Run command:

suricata -vvv --af-packet -S /opt/rules/*.rules --pidfile=/var/run/suricata.pid -F /var/log/suricata/bpf-vlan.filter

# BPF filter
((ip and tcp port 80) or (vlan and tcp port 80))

So "stream.wrong_thread" stays the same, but "tcp.pkt_on_wrong_thread" keeps increasing:


tail -F /var/log/suricata/stats.log  |grep -E "wrong_thread" 

stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5671362
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5713760
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5757011
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5798877
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5844993
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5890561
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5938549
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5977920
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6018627
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6061152
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6100331
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6136587
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6174691
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6207062
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6245910
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6280238
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6319797
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6351413
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6389877
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6427672
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6468473

Case 2

Both "stream.wrong_thread" and "tcp.pkt_on_wrong_thread" are increasing.

Run command:

suricata -vvv --af-packet -S /opt/rules/*.rules --pidfile=/var/run/suricata.pid tcp port 80 --set vlan.use-for-tracking=false


tail -F /var/log/suricata/stats.log  |grep -E "wrong_thread" 

stream.wrong_thread                        | Total                     | 5415
tcp.pkt_on_wrong_thread                    | Total                     | 975303
stream.wrong_thread                        | Total                     | 5788
tcp.pkt_on_wrong_thread                    | Total                     | 1599922
stream.wrong_thread                        | Total                     | 6206
tcp.pkt_on_wrong_thread                    | Total                     | 2002613
stream.wrong_thread                        | Total                     | 6570
tcp.pkt_on_wrong_thread                    | Total                     | 2324993
stream.wrong_thread                        | Total                     | 6979
tcp.pkt_on_wrong_thread                    | Total                     | 2595832
stream.wrong_thread                        | Total                     | 7437
tcp.pkt_on_wrong_thread                    | Total                     | 2816437
stream.wrong_thread                        | Total                     | 7804
tcp.pkt_on_wrong_thread                    | Total                     | 3046528
stream.wrong_thread                        | Total                     | 8130
tcp.pkt_on_wrong_thread                    | Total                     | 3231108
stream.wrong_thread                        | Total                     | 8505
tcp.pkt_on_wrong_thread                    | Total                     | 3393051
stream.wrong_thread                        | Total                     | 8825
tcp.pkt_on_wrong_thread                    | Total                     | 3500183
stream.wrong_thread                        | Total                     | 9196
tcp.pkt_on_wrong_thread                    | Total                     | 3617008
stream.wrong_thread                        | Total                     | 9519
tcp.pkt_on_wrong_thread                    | Total                     | 3715475
stream.wrong_thread                        | Total                     | 9911
tcp.pkt_on_wrong_thread                    | Total                     | 3816075
stream.wrong_thread                        | Total                     | 10289
tcp.pkt_on_wrong_thread                    | Total                     | 3927234

#9

Updated by Peter Manev 6 months ago

Same behavior with AFP/Mellanox (MT27800 Family [ConnectX-5]).
While the PF_RING (cluster_flow)/Intel combo has the "stream.wrong_thread" and "tcp.pkt_on_wrong_thread" counters increasing as well, it seems to be on a much smaller scale (0.00x% of the total pkts) during the first test trials.
Continuing the investigation - will keep updating.

#10

Updated by Peter Manev 6 months ago

I have been able to pinpoint a reproducible case in which the "wrong_thread" counters are present and increasing, and a different test case in which they are not.
I have handed the test cases to Eric for feedback.

#11

Updated by Peter Manev 5 months ago

As an FYI, this is tested/reproducible with the following kernels.
Debian Stable with:

Linux suricata-ng01 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 GNU/Linux

Ubuntu LTS 16.04 with:

Linux suricata 4.18.6-amd64 #1 SMP Sat Sep 8 15:19:53 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux

Also reported on CentOS 7.6 on the user mailing list:
https://lists.openinfosecfoundation.org/pipermail/oisf-users/2019-January/016527.html

I am currently doing some test runs to see if it is possible to pinpoint a direction for the issue.

#12

Updated by Peter Manev 5 months ago

It seems the problem is only present with af-packet using cluster_flow or cluster_cpu.

Using af-packet with cluster_qm, with the exact same number of threads and the same NUMA node as the NIC / NIC's queues, seems to be doing great at 30Gbps.
Confirmed that with x710/i40e or Mellanox the counters are much lower, almost insignificant (200 pkts on the wrong-thread counter out of 10 billion):

 ethtool -i enp59s0
driver: i40e
version: 2.3.2-k
firmware-version: 5.05 0x8000291e 1.1313.0
expansion-rom-version: 
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes


ethtool -i enp94s0f0
driver: mlx5_core
version: 5.0-0
firmware-version: 16.23.1000 (MT_0000000012)
expansion-rom-version: 
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 GNU/Linux

#13

Updated by Peter Manev 5 months ago

A simple table to start entering some stats -

| Issue Appears | OS/Distribution | Kernel Version | NIC/driver/version | runmode | Notes |
| NO | Debian Stretch | 4.18.0-0.bpo.1-amd64 | x710, i40e, 2.3.2-k | AFP v3, cluster_qm | With Rust enabled and low entropy key |
| YES | Debian Stretch | 4.18.0-0.bpo.1-amd64 | x710, i40e, 2.3.2-k | AFP v3, cluster_flow | With Rust enabled and low entropy key |
| YES | Debian Stretch | 4.18.0-0.bpo.1-amd64 | x710, i40e, 2.3.2-k | AFP v3, cluster_cpu | With Rust enabled and low entropy key |
| NO | Debian Stretch | 4.18.0-0.bpo.1-amd64 | x710, i40e, 2.3.2-k | PF_RING | With Rust enabled and low entropy key |
| NO | Debian Stretch | 4.18.0-0.bpo.1-amd64 | MT27800, mlx5_core, 5.0-0 | AFP v3, cluster_qm | With Rust enabled and low entropy key |
| YES | Debian Stretch | 4.18.0-0.bpo.1-amd64 | MT27800, mlx5_core, 5.0-0 | AFP v3, cluster_flow | With Rust enabled and low entropy key |
| YES | Debian Stretch | 4.18.0-0.bpo.1-amd64 | MT27800, mlx5_core, 5.0-0 | AFP v3, cluster_cpu | With Rust enabled and low entropy key |
| YES | Ubuntu LTS 18.04 | 4.18.6-amd64 | x520 82599ES, ixgbe, 5.1.0-k | AFP v3, cluster_qm | With Rust enabled and low entropy key |
| YES | Ubuntu LTS 18.04 | 4.18.6-amd64 | x520 82599ES, ixgbe, 5.1.0-k | AFP v3, cluster_flow | With Rust enabled and low entropy key |
| NO | Ubuntu LTS 18.04 | 4.18.6-amd64 | x520 82599ES, ixgbe, 5.1.0-k | PF_RING | With Rust enabled and low entropy key |
| NO | Debian Stretch 9.7 | 4.9.0-8-amd64 | x710, i40e, 1.6.16-k, fw 5.40 0x80002e8c 18.0.14 | AFP v3, cluster_qm | With Rust enabled and low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |

#14

Updated by Eric Urban 4 months ago

Peter, here is the requested info from the host with the log stats-412_sur13_snf2.log-2019020616.gz that had a value of 100 for tcp.pkt_on_wrong_thread.

  • We are running:
    CentOS Linux release 7.6.1810 (Core)
  • kernel (just uname -a)
    Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • ethtool -i iface
    (Since we are running Myricom, I believe this is the equivalent to ethtool -i)
    $ myri_nic_info -B
    ProductCode Driver Version License
    10G-PCIE2-8C2-2S myri_snf 3.0.15.50857 Valid
  • runmode (example : af-packet cluster_flow / pfring )
    pcap (using --pcap command-line option) capture method with workers runmode
  • ethtool -x iface
    (Since we are running Myricom, this is hopefully what you wanted from ethtool -x)
    For quick reference, SNF_RSS_FLAGS = 49 (0x31) means SRC/DST addr and SRC/DST port.
    $ SNF_DEBUG_MASK=3 /opt/snf/bin/tests/snf_simple_recv
    91253 snf.0.-1 P (userset) SNF_PORTNUM = 0
    91253 snf.0.-1 P (default) SNF_RING_ID = -1 (0xffffffff)
    91253 snf.0.-1 P (default) SNF_NUM_RINGS = 1 (0x1)
    91253 snf.0.-1 P (default) SNF_RSS_FLAGS = 49 (0x31)
    91253 snf.0.-1 P (default) SNF_DATARING_SIZE = 268435456 (0x10000000) (256.0 MiB)
    91253 snf.0.-1 P (default) SNF_DESCRING_SIZE = 134217728 (0x8000000) (128.0 MiB)
    91253 snf.0.-1 P (default) SNF_FLAGS = 0
    91253 snf.0.-1 P (environ) SNF_DEBUG_MASK = 3 (0x3)
    91253 snf.0.-1 P (default) SNF_DEBUG_FILENAME = stderr
    91253 snf.0.-1 P (default) SNF_APP_ID = -1 (0xffffffff)
    91253 snf.0.-1 P SNF_DEBUG_MASK=0x3 for modes WARN=0x1, PARAM=0x2 QSTATS=0x4 TIMESYNC=0x8 IOCTL=0x10 QEVENTS=0x20 ARISTA=0x40
    91253 snf.0.-1 P lib version=3.0.15.50857 build=snf-3.0.15.50857_e2bb46352 11/13/18_15:34 e2bb46352
    91253 snf.0.-1 P kernel version=3.0.15.50857 build=snf-3.0.15.50857_e2bb46352 11/13/18_15:34 e2bb46352
    91253 snf.0.-1 P pqstate [ 0x7f04b97cc000.. 0x7f04b97ce000) size 8 KiB 8192 (0x2000)
    91253 snf.0.-1 P desc_ring [ 0x7f0438dbd000.. 0x7f04b8dbd000) size 2048 MiB 2147483648 (0x80000000)
    91253 snf.0.-1 P data_ring [ 0x7f0238dad000.. 0x7f0438dbd000) size 8192 MiB 8590000128 (0x200010000)
    91253 snf.0.-1 P pq_init: desc[seq=99,ev_idx=3412634,cnt=32115004058]

#15

Updated by Andreas Herz 4 months ago

  • Assignee set to OISF Dev
  • Target version set to Support

#16

Updated by Peter Manev 4 months ago

Thanks Eric for the info.

I have updated the table below:
| Issue Appears | OS/Distribution | Kernel Version | NIC/driver/version | runmode | Notes |
| NO | Debian Stretch | 4.18.0-0.bpo.1-amd64 | x710, i40e, 2.3.2-k | AFP v3, cluster_qm | With Rust, Toeplitz hash function, low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |
| YES | Debian Stretch | 4.18.0-0.bpo.1-amd64 | x710, i40e, 2.3.2-k | AFP v3, cluster_flow | With Rust, Toeplitz hash function, low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |
| YES | Debian Stretch | 4.18.0-0.bpo.1-amd64 | x710, i40e, 2.3.2-k | AFP v3, cluster_cpu | With Rust, Toeplitz hash function, low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |
| NO | Debian Stretch | 4.18.0-0.bpo.1-amd64 | x710, i40e, 2.3.2-k | PF_RING | With Rust, Toeplitz hash function, low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |
| NO | Debian Stretch | 4.18.0-0.bpo.1-amd64 | MT27800, mlx5_core, 5.0-0 | AFP v3, cluster_qm | With Rust, Toeplitz hash function, low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |
| YES | Debian Stretch | 4.18.0-0.bpo.1-amd64 | MT27800, mlx5_core, 5.0-0 | AFP v3, cluster_flow | With Rust, Toeplitz hash function, low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |
| YES | Debian Stretch | 4.18.0-0.bpo.1-amd64 | MT27800, mlx5_core, 5.0-0 | AFP v3, cluster_cpu | With Rust, Toeplitz hash function, low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |
| YES | Ubuntu LTS 18.04 | 4.18.6-amd64 | x520 82599ES, ixgbe, 5.1.0-k | AFP v3, cluster_qm | With Rust, Toeplitz hash function, low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |
| YES | Ubuntu LTS 18.04 | 4.18.6-amd64 | x520 82599ES, ixgbe, 5.1.0-k | AFP v3, cluster_flow | With Rust, Toeplitz hash function, low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |
| NO | Ubuntu LTS 18.04 | 4.18.6-amd64 | x520 82599ES, ixgbe, 5.1.0-k | PF_RING | With Rust, Toeplitz hash function, low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |
| YES | Ubuntu LTS 18.04 | 4.18.6-amd64 | x520 82599ES, ixgbe, 5.1.0-k | AFP v3, cluster_flow | With Rust, 1 RSS queue only. Offloading off, ntuple-filters: off, receive-hashing: off. Only tcp.pkt_on_wrong_thread observed. |
| YES | CentOS Linux release 7.6.1810 | 3.10.0-957.1.3.el7.x86_64 | x520 82599ES, ixgbe, 5.5.2 | AFP v3, cluster_flow | With Rust, 1 RSS queue only. Offloading off, ntuple-filters: off, receive-hashing: off. Only tcp.pkt_on_wrong_thread observed. |
| NO | CentOS Linux release 7.6.1810 | 3.10.0-957.5.1.el7.x86_64 | 10G-PCIE2-8C2-2S, myri_snf, 3.0.15.50857 | pcap capture method, workers runmode | With Rust |
| NO | Debian Stretch 9.7 | 4.9.0-8-amd64 | x710, i40e, 1.6.16-k, fw 5.40 0x80002e8c 18.0.14 | AFP v3, cluster_qm | With Rust enabled and low entropy key. Offloading off, ntuple-filters: on, receive-hashing: on |

Looking at the above, it appears:
- PF_RING does not have the problem
- newer NIC models - x710/Mellanox - with AFP v3, cluster_qm and a Toeplitz hash with a low entropy key do not have the problem
- Myricom with pcap mode does not have the problem
- AFP v3 cluster_flow reports the problem even with the NIC forced to 1 RSS queue
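The cluster_qm rows line up with how symmetric RSS hashing behaves. Below is a small illustrative Python sketch of the standard Toeplitz hash used for RSS, showing why the repeating 0x6D5A "low entropy" key maps both directions of a TCP flow to the same hash (and hence the same queue). The addresses and ports are made up; only the key and the hash definition follow the RSS convention:

```python
# Illustrative sketch of the RSS Toeplitz hash. With the repeating 0x6D5A
# ("low entropy") key, swapping src<->dst IP and src<->dst port shifts every
# input bit by a multiple of 16, and the key is 16-bit periodic, so both
# directions of a flow produce the same hash and land on the same RSS queue.
import socket
import struct

SYMMETRIC_KEY = bytes([0x6D, 0x5A] * 20)  # the 40-byte key as loaded with ethtool -X ... hkey

def toeplitz_hash(key, data):
    """Standard Toeplitz: XOR, for every set input bit, the 32-bit window of
    the key that starts at that bit position."""
    key_bits = int.from_bytes(key, "big")
    key_len = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            result ^= (key_bits >> (key_len - 32 - i)) & 0xFFFFFFFF
    return result

def rss_input_v4(src, dst, sport, dport):
    # IPv4 TCP hash input: src addr, dst addr, src port, dst port
    return (socket.inet_aton(src) + socket.inet_aton(dst)
            + struct.pack(">HH", sport, dport))

h1 = toeplitz_hash(SYMMETRIC_KEY, rss_input_v4("10.0.0.1", "192.168.1.2", 34567, 80))
h2 = toeplitz_hash(SYMMETRIC_KEY, rss_input_v4("192.168.1.2", "10.0.0.1", 80, 34567))
assert h1 == h2  # both directions of the flow hash identically
print(hex(h1))
```

With the default high-entropy RSS keys most drivers ship, h1 and h2 generally differ, so the two directions of one flow can land on different queues - consistent with the table rows where a low entropy key was required.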

#17

Updated by Victor Julien 3 months ago

  • Related to Support #2900: alert 'SURICATA STREAM pkt seen on wrong thread' when run mode set to workers added

#18

Updated by Andreas Herz about 2 months ago

Peter, can you share more config details about the first entry of the table? I have tested that configuration (same Debian kernel, x710 and runmode) and the wrong_thread counter starts to grow a bit faster than with cluster_flow :( although the packet drop percentage decreased a bit.
I just want to make sure we didn't miss another configuration setting.

#19

Updated by Peter Manev about 2 months ago

For example -
- 16 queues on the NIC; a low entropy key is a must and symmetric hashing is a must
- 16 af-packet worker threads with cluster_qm

#20

Updated by Andreas Herz about 2 months ago

Unless I missed something, it didn't change much on our setup with the same Debian, kernel etc. (did ifdown/up after the changes with ethtool). I will post a bit more details:

CPU: 2*Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Memory: 96GB
NIC for Monitoring: X710 for 10GbE SFP+

config

af-packet:
  - interface: enp94s0f1
    threads: 16
    cluster-id: 98
    cluster-type: cluster_qm
    defrag: yes
    rollover: yes
    use-mmap: yes
    tpacket-v3: yes
    use-emergency-flush: yes

ethtool settings:

driver: i40e
version: 2.3.2-k
firmware-version: 5.05 0x80002927 1.1313.0

Current hardware settings:
RX:        0
TX:        0
Other:        1
Combined:    16

RX flow hash indirection table for enp94s0f1 with 40 RX ring(s):
    0:      0     1     2     3     4     5     6     7
    8:      8     9    10    11    12    13    14    15
    16:      0     1     2     3     4     5     6     7
    24:      8     9    10    11    12    13    14    15
...
RSS hash key:
6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a

Current hardware settings:
RX:        512
RX Mini:    0
RX Jumbo:    0
TX:        512

Cannot get device udp-fragmentation-offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: on
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: on
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: on
scatter-gather: off
    tx-scatter-gather: off
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
    tx-tcp-segmentation: off
    tx-tcp-ecn-segmentation: off
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

We also set the IRQ affinity, and watching /proc/interrupts confirms that it's set correctly - only one column is updated.

We only saw a much reduced drop rate at the NIC level, but at the kernel level (capture.kernel_drops) it's still a lot, and the wrong_thread counter is also very high.

#21

Updated by Victor Julien about 2 months ago

Andreas, you could try upgrading the card's firmware. In the entry that I added we had fw 5.40. The customer had issues with an older version. Not sure if that was related, but it might be worth a shot.

#22

Updated by Peter Manev about 2 months ago

@Andreas - one more question (in addition to Victor's): what is the RSS hashing function? In the example below it is Toeplitz:

/usr/local/sbin/ethtool --show-rxfh  enp94s0f0
RX flow hash indirection table for enp94s0f0 with 40 RX ring(s):
    0:      0     1     2     3     4     5     6     7
    8:      8     9    10    11    12    13    14    15
   16:     16    17    18    19    20    21    22    23
   24:     24    25    26    27    28    29    30    31
   32:     32    33    34    35    36    37    38    39
   40:      0     1     2     3     4     5     6     7
   48:      8     9    10    11    12    13    14    15
   56:     16    17    18    19    20    21    22    23
   64:     24    25    26    27    28    29    30    31
   72:     32    33    34    35    36    37    38    39
   80:      0     1     2     3     4     5     6     7
   88:      8     9    10    11    12    13    14    15
   96:     16    17    18    19    20    21    22    23
  104:     24    25    26    27    28    29    30    31
  112:     32    33    34    35    36    37    38    39
  120:      0     1     2     3     4     5     6     7
RSS hash key:
    6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a
RSS hash function:
    toeplitz: on
    xor     : off
    crc32   : off

#23

Updated by Andreas Herz about 2 months ago

@Victor good point - we have another system with 6.01 firmware; we will see if the changes are more promising there.

@Peter:

RX flow hash indirection table for enp94s0f1 with 16 RX ring(s):
    0:      0     1     2     3     4     5     6     7
    8:      8     9    10    11    12    13    14    15
   16:      0     1     2     3     4     5     6     7
   24:      8     9    10    11    12    13    14    15
RSS hash key:
6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a
RSS hash function:
    toeplitz: on
    xor: off
    crc32: off

#24

Updated by Andreas Herz about 2 months ago

Sadly no difference with the 6.01 firmware - the counter increases right from the start. Memory usage is fine (~50%), load is fine (load of 15 on 40 cores).

Maybe we can reproduce it in a lab environment so we could run Suricata in debug mode to see the exact thread IDs. So far we only suspect that the traffic on the live systems is a separate issue.

#25

Updated by Peter Manev about 2 months ago

How are they increasing - can you provide some stats over time compared to the total number of packets?
I have seen a similar but very minor increase, almost negligible compared to the total number/volume of packets. The difference in the number of packets on the wrong thread before and after the config changes described was big though (for the better - much lower).
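For reference, one way to produce that comparison is to diff consecutive cumulative snapshots from stats.log. A rough Python sketch - the wrong-thread values below are taken from the Case 2 output earlier in this ticket, while the capture.kernel_packets totals are invented for illustration:

```python
# Sketch: turn cumulative counters into per-interval increments so wrong-thread
# growth can be compared to total packet growth over time. The wrong-thread
# numbers come from Case 2 above; the kernel_packets totals are made-up
# placeholders.
def deltas(snapshots, counter):
    """Per-interval increase of a cumulative counter across snapshots."""
    vals = [s[counter] for s in snapshots]
    return [b - a for a, b in zip(vals, vals[1:])]

snaps = [
    {"tcp.pkt_on_wrong_thread": 975303,  "capture.kernel_packets": 10_000_000},
    {"tcp.pkt_on_wrong_thread": 1599922, "capture.kernel_packets": 13_000_000},
    {"tcp.pkt_on_wrong_thread": 2002613, "capture.kernel_packets": 16_500_000},
]
wrong = deltas(snaps, "tcp.pkt_on_wrong_thread")
total = deltas(snaps, "capture.kernel_packets")
for w, t in zip(wrong, total):
    print(f"{w} of {t} new packets were on a wrong thread ({100.0 * w / t:.1f}%)")
```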

#26

Updated by Andreas Herz about 2 months ago

I could, but it's also quite clearly visible in the attached picture. The changes I made throughout the day didn't change much on the overall issue.

#27

Updated by Peter Manev about 2 months ago

What Suricata version is that? (apologies if I missed the info before)

#28

Updated by Andreas Herz about 2 months ago

4.1.2 in that case, as soon as it's in the repo we can bump it to 4.1.4

#29

Updated by Peter Manev about 2 months ago

Any chance you can try the latest git?

#30

Updated by Peter Manev about 2 months ago

Reading through the info again, it seems something might be different in your setup.

In all the latest cases I've tested, I always used the latest git. However, your traffic type might differ, so besides trying git, can you look at different traffic patterns within the same traffic to see if any difference is noticeable? For example run Suricata with the BPF "port 25", then "port 80", etc. (of course make sure to cover the VLAN part of the BPF if VLAN is present in the traffic).

#31

Updated by Andreas Herz about 1 month ago

Some updates on this issue:

- We finally got it reproduced in a lab environment \o/

- We also updated the X710 firmware to the most recent 6.80 -> no measurable change

- We also used isolcpus (see /proc/cmdline) and cpu affinity settings to force the cpu cores -> no measurable change

What I can see is that isolcpus, IRQ affinity etc. work, since only the CPU cores included in the configuration are used, but the issue still shows up within them. IRQ affinity looks as expected and wanted; the counters increase on a dedicated CPU core for each queue.

- We also tried 5.0 beta1 -> no change with that either

I enabled debug (--enable-debug) and so far I don't see an obvious pattern, but this is what it looked like in a test run:

7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 11                                                                              
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 2
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 3
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 14
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 3
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 14
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 3
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 14
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 12
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 14
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 12
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 1
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 9
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 7
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 11
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 2
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 11
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 4
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 13
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 15
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 3
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 6
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 14
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 12
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 1
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 9
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 7
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 5
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 2
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 5
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 4
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 5
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 15
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 3
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 15
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 3
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 12, we are 15
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 14
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 12
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 1
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 9
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 7
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 11
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 2
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 11
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 4
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 11
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 6
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 11
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 15
7/5/2019 -- 09:49:58 - <Debug> - wrong thread: flow has 8, we are 11

From what I can tell so far, the id is always within the range of 1 to 16 (which is the number of queues/threads we set up)

[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 10, we are 14                                           
[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 10, we are 14                                           
[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 10, we are 14                                           
[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 10, we are 14                                           
[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 12, we are 14                                           
[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 12, we are 14                                           
[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 10, we are 14                                           
[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 12, we are 14                                           
[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 12, we are 14                                           
[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 12, we are 14                                           
[19443] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 12, we are 14                                           
[19440] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 2, we are 11                                            
[19440] 7/5/2019 -- 10:27:15 - (stream-tcp.c:4687) <Debug> (StreamTcpPacket) -- wrong thread: flow has 2, we are 11
#32

Updated by Peter Manev about 1 month ago

Is there anything related to decoder invalids in the stats?

#33

Updated by Andreas Herz about 1 month ago

nope:

------------------------------------------------------------------------------------
Date: 5/7/2019 -- 11:00:42 (uptime: 0d, 00h 13m 10s)
------------------------------------------------------------------------------------
Counter                                       | TM Name                   | Value
------------------------------------------------------------------------------------
capture.kernel_packets                        | Total                     | 240402716
capture.kernel_drops                          | Total                     | 123134832
decoder.pkts                                  | Total                     | 117058380
decoder.bytes                                 | Total                     | 87585127020
decoder.ipv4                                  | Total                     | 116624434
decoder.ethernet                              | Total                     | 117058380
decoder.tcp                                   | Total                     | 98124452
decoder.udp                                   | Total                     | 6929596
decoder.icmpv4                                | Total                     | 7068123
decoder.gre                                   | Total                     | 172
decoder.vlan                                  | Total                     | 117058380
decoder.avg_pkt_size                          | Total                     | 748
decoder.max_pkt_size                          | Total                     | 1514
flow.tcp                                      | Total                     | 2337356
flow.udp                                      | Total                     | 187546
flow.icmpv4                                   | Total                     | 13789
defrag.ipv4.fragments                         | Total                     | 5421
defrag.ipv4.reassembled                       | Total                     | 2611
decoder.ipv4.frag_overlap                     | Total                     | 67
tcp.sessions                                  | Total                     | 2210929
tcp.pseudo                                    | Total                     | 13812
tcp.syn                                       | Total                     | 2503758
tcp.synack                                    | Total                     | 1596557
tcp.rst                                       | Total                     | 1137302
tcp.pkt_on_wrong_thread                       | Total                     | 46241013
tcp.stream_depth_reached                      | Total                     | 5207
tcp.reassembly_gap                            | Total                     | 35494
tcp.overlap                                   | Total                     | 14671936
tcp.overlap_diff_data                         | Total                     | 4119
detect.alert                                  | Total                     | 1079050
app_layer.flow.http                           | Total                     | 76092
app_layer.tx.http                             | Total                     | 118715
app_layer.flow.smtp                           | Total                     | 1720
app_layer.tx.smtp                             | Total                     | 1773
app_layer.flow.tls                            | Total                     | 290061
app_layer.flow.ssh                            | Total                     | 48
app_layer.flow.dns_tcp                        | Total                     | 370
app_layer.tx.dns_tcp                          | Total                     | 741
app_layer.flow.ntp                            | Total                     | 2863
app_layer.flow.tftp                           | Total                     | 61
app_layer.flow.failed_tcp                     | Total                     | 3387
app_layer.flow.dns_udp                        | Total                     | 115961
app_layer.tx.dns_udp                          | Total                     | 2116115
app_layer.flow.enip_udp                       | Total                     | 144
app_layer.tx.enip_udp                         | Total                     | 144
app_layer.tx.ntp                              | Total                     | 24621
app_layer.flow.krb5_udp                       | Total                     | 129
app_layer.flow.failed_udp                     | Total                     | 68388
flow_mgr.closed_pruned                        | Total                     | 1141078
flow_mgr.new_pruned                           | Total                     | 1283603
flow_mgr.est_pruned                           | Total                     | 57623
flow_mgr.bypassed_pruned                      | Total                     | 797
flow.spare                                    | Total                     | 20369
flow.tcp_reuse                                | Total                     | 66804
flow_mgr.flows_checked                        | Total                     | 4744
flow_mgr.flows_notimeout                      | Total                     | 4354
flow_mgr.flows_timeout                        | Total                     | 390
flow_mgr.flows_timeout_inuse                  | Total                     | 73
flow_mgr.flows_removed                        | Total                     | 317
flow_mgr.rows_checked                         | Total                     | 65536
flow_mgr.rows_skipped                         | Total                     | 62229
flow_mgr.rows_empty                           | Total                     | 607
flow_mgr.rows_maxlen                          | Total                     | 11
tcp.memuse                                    | Total                     | 473600000
tcp.reassembly_memuse                         | Total                     | 76800000
flow.memuse                                   | Total                     | 25284544
#34

Updated by Andreas Herz 29 days ago

I promised you an update:

- disabling the rollover feature helped reduce the wrong_thread counter
- switching to cluster_qm helped on systems with X710 AND X520
- changes in drop rates are not always an improvement; some got better, some worse with the changes. We are still trying to narrow it down and see if other configuration changes help
- the sort of traffic plays a big role, but while GRE traffic increased the drop rate a lot on one sensor, it didn't on another sensor with the same settings. The same applies to vlan, QinQ, teredo and other types of traffic that are known to be more difficult

#35

Updated by Peter Manev 29 days ago

Thank you for the update! So if I understand correctly, part of it was a config adjustment that helped get the "wrong thread" counter down to 0. In some cases this does or does not directly help with kernel_drops, and that is still under investigation. Did I understand correctly?

#36

Updated by Andreas Herz 29 days ago

The "wrong threads" counter went down to 0 in most but not all cases, in the other cases it is still much better. So still some debugging left to do for us :)

And in some cases the changes helped to reduce kernel_drops but not everywhere.

So the conclusion is, setting cluster_qm and the necessary steps to use it properly (driver update, set_irq_affinity, rss queues) is an improvement in general with the exception of some corner cases.
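As a rough mental model of why cluster_qm behaves this way (a simplified sketch, not the kernel's actual fanout code; the function and field names are purely illustrative): each AF_PACKET cluster mode keys the worker choice on a different packet property, so cluster_qm inherits whatever load balance the NIC's RSS queues produce, while cluster_flow depends on the kernel's own flow hash:

```python
# Toy model of AF_PACKET fanout modes (illustrative assumption, not kernel code).
# A "packet" here is just a dict of the properties each mode keys on.

def worker_qm(pkt, n_workers):
    # cluster_qm: follows the NIC RSS queue index, so Suricata's balance
    # is only as good as the NIC's RSS hash and queue setup
    return pkt["rss_queue"] % n_workers

def worker_flow(pkt, n_workers):
    # cluster_flow: kernel-computed flow hash over the packet headers
    return pkt["flow_hash"] % n_workers

def worker_cpu(pkt, n_workers):
    # cluster_cpu: whichever CPU handled the packet in the kernel
    return pkt["cpu"] % n_workers

# With queues == workers and symmetric RSS, both directions of a flow
# land on the same RSS queue and therefore the same worker under cluster_qm:
a_to_b = {"rss_queue": 5, "flow_hash": 0xdeadbeef, "cpu": 2}
b_to_a = {"rss_queue": 5, "flow_hash": 0xdeadbeef, "cpu": 7}
assert worker_qm(a_to_b, 16) == worker_qm(b_to_a, 16)
```

This also illustrates why cluster_cpu can scatter a flow across workers when IRQ affinity is not pinned: the two directions may be serviced by different CPUs.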

#37

Updated by Peter Manev 29 days ago

OK, then that would mean the load balancing still needs debugging in some cases, even if the config is correct.
Would love to explore that more...

#38

Updated by Andreas Herz 27 days ago

The current test environment looks like this:

We have a T-rex instance running on one machine with the `delay_10_http_browsing_0.pcap` pcap, which produces 10Gbit/s of traffic (http). We forward this through a Gigamon to the test sensor hardware with a dual-NIC XL710. We have ~5Gbit/s on each of the interfaces enp175s0f0 and enp175s0f1.

As soon as we use more RSS queues, we either run into a huge wrong_thread counter (with the default RSS hash key) or huge drop rates (with the "6D:5A..." key), where huge is somewhere between 30-40%. We could easily reproduce this by changing the RSS hash key from the default one to the "6D:5A..." one and back.
There is no big difference between using 4, 6, 8 or 10 queues on each interface, with the assigned CPU cores for those.

When we use just one queue on each interface there is no wrong_thread issue (as expected) and the drop rate goes below 5%, but we see that for each interface one CPU core is at 100%. I tried cluster_flow, cluster_qm, cluster_cpu and even enabled rollover again. The idea is that we still need to load-balance to the other CPU cores, which currently do nothing, but without running into the wrong_thread and drop-rate issues.
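The "6D:5A..." key mentioned above is, I assume, the well-known low-entropy symmetric RSS key (the 16-bit pattern 0x6d5a repeated), which makes the Toeplitz hash return the same value for both directions of a flow. A minimal sketch of that property in Python (the example addresses and ports are illustrative):

```python
import socket

def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Standard 32-bit Toeplitz hash as used by RSS-capable NICs."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                # XOR in the 32-bit key window starting at this bit position
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result

# 40-byte symmetric key: the 16-bit pattern 0x6d5a repeated 20 times
SYM_KEY = bytes.fromhex("6d5a" * 20)

def tcp4_input(src, dst, sport, dport):
    """RSS hash input for tcp4: src IP, dst IP, src port, dst port."""
    return (socket.inet_aton(src) + socket.inet_aton(dst)
            + sport.to_bytes(2, "big") + dport.to_bytes(2, "big"))

fwd = toeplitz_hash(SYM_KEY, tcp4_input("192.168.1.10", "10.0.0.5", 40000, 443))
rev = toeplitz_hash(SYM_KEY, tcp4_input("10.0.0.5", "192.168.1.10", 443, 40000))
assert fwd == rev  # both directions land on the same RSS queue
```

The symmetry holds because the key is periodic with a 16-bit period, and swapping src/dst fields shifts their bit positions by multiples of 16; the trade-off is that a 16-bit-periodic key has far less entropy, which can skew the queue distribution on some traffic mixes.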

#39

Updated by Peter Manev 27 days ago

Ok - thanks for the update.

What is your Trex start command? (server/client)
When you say you increase the queues, what do you mean? Do you start seeing the problem as soon as you have more than 1 queue, or e.g. with 18 RSS queues but 16 threads in cluster_qm mode with AFP?

#40

Updated by Sean Cloherty 26 days ago

Additional info for the table -

I was seeing the same behavior on CentOS 7.6.1810 with Intel X710 and 82599ES NICs running 4.1.4 and using af-packet v3, cluster_flow, 1 queue. I just now noticed that it hasn't been happening on my test hosts since it stopped on 3/22 and 5/17, so that is curious.

Also, I see CPU cores pinned at 100% while others are idling. It seems to happen to one (sometimes two) cores and ONLY to those cores. It is different per box and possibly per Suricata run. I ran ps -eHO psr --sort=psr to see what was running on those cores. Here is the output for 2 occurrences of 100% utilization, on CPU 26 and CPU 28:

156 26 S ? 00:00:00 [migration/26]
157 26 S ? 00:00:00 [ksoftirqd/26]
158 26 S ? 00:00:00 [kworker/26:0]
159 26 S ? 00:00:00 [kworker/26:0H]
160 0 S ? 00:00:00 [rcuob/26]

170 28 S ? 00:00:00 [migration/28]
171 28 S ? 00:00:00 [ksoftirqd/28]
172 28 S ? 00:00:00 [kworker/28:0]
173 28 S ? 00:00:00 [kworker/28:0H]

156 26 S ? 00:00:00 [migration/26]
157 26 S ? 00:00:00 [ksoftirqd/26]
158 26 S ? 00:00:00 [kworker/26:0]
159 26 S ? 00:00:00 [kworker/26:0H]

170 28 S ? 00:00:00 [migration/28]
171 28 S ? 00:00:00 [ksoftirqd/28]
172 28 S ? 00:00:00 [kworker/28:0]
173 28 S ? 00:00:00 [kworker/28:0H]

#41

Updated by Sean Cloherty 25 days ago

I have stats.log files (5 minute intervals) going back a ways for two test boxes, HOSTB and HOSTM. I determined when the WRONG_THREAD alerts first started and when they stopped. With that info I figured out which version of suricata.yaml was running just before the WRONG_THREADS stopped, and swapped that back onto HOSTM. Using the older suricata.yaml, it took less than 12 hours before WRONG_THREADS appeared (it hadn't happened in the week since I last made changes).

Attached is a spreadsheet. The first worksheet has three consecutive stats.log entries: the first is the last entry before the first occurrence of WRONG_THREAD, the second is the first entry with WRONG_THREADS, and the third is the next one in sequence. The second worksheet has the diffs between the two suricata.yaml files. I only included items that were likely to have an impact and omitted items such as different values for logs that were disabled, etc.

Timeline for the two hosts looks like - - -

HOSTB : Logs going back to June 2017

FIRST OCCURRENCE : Date: 11/15/2018 -- 16:12:52 (uptime: 9d, 00h 25m 17s) from 2018-11-15_stats.zip
tcp.pkt_on_wrong_thread | Total | 378909830

LAST OCCURRENCE : Date: 3/22/2019 -- 16:40:25 (uptime: 1d, 05h 55m 06s) from 2019-03-22_stats.zip
tcp.pkt_on_wrong_thread | Total | 710594620

HOSTM - Logs go back to September 2018

FIRST OCCURRENCE: Date: 1/14/2019 -- 13:27:45 (uptime: 2d, 20h 18m 54s) from 2019-01-14_stats.zip:
tcp.pkt_on_wrong_thread | Total | 1924133244

LAST OCCURRENCE: Date: 5/17/2019 -- 11:23:43 (uptime: 15d, 20h 03m 42s) from 2019-05-17_stats.zip
tcp.pkt_on_wrong_thread | Total | 2407581403

#42

Updated by Cooper Nelson 24 days ago

Hi all, found this while doing a google search for 'tcp.pkt_on_wrong_thread'. I've been doing a deep-dive of our current deployment and this counter is around 25% of total packets, which is troubling.

Unless I'm missing something, this has to be either a problem with Suricata's software load balancer or perhaps just an issue of the counter being broken. I say this based on the observation that I'm using RSS on isolated NUMA nodes to send packets via cluster_flow to worker threads on other NUMA nodes. If cluster_flow were working properly, they should all be directed to the same thread, even if RSS is broken. We can see in the table above that 'cluster_flow' isn't working in any config, even when using a single queue.

I tried using the ebpf load balancer but that didn't work either. All packets were sent to a single core, per NIC.

Would it be possible to add a flag to log packets that increment a specified counter? Being able to examine the packets triggering this state would be a big help I would think.

#43

Updated by Cooper Nelson 24 days ago

I might know what the problem is. I've been researching 40G NICs for a new build and encountered some interesting discussions of edge cases involving hashing functions, TCP/UDP and IP fragmentation.

When a TCP/UDP packet gets fragmented, only the first fragment has the 'original' header in it. This breaks a 5-tuple hashing function, assuming the IP reassembly happens after the hashing, which is the case with Suricata. The reassembly happens within the context of the decode thread, so the subsequent reassembled packets are going to go to a different thread than the first one, incrementing this counter.

The fix I would recommend would be to simplify the cluster_flow hashing function to the point that it works consistently on all IP traffic. This might have to be just a trivial src->dst hash, as I don't recall exactly how IP fragmentation works.
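The effect described above can be sketched as follows (the dict-based packet representation is purely illustrative): non-first fragments carry no L4 header, so any hash input that includes ports keys them differently from the first fragment, while a plain src/dst hash keeps them together:

```python
def five_tuple_key(pkt):
    # non-first fragments have no TCP/UDP header, so ports fall back to 0
    return (pkt["src"], pkt["dst"], pkt["proto"],
            pkt.get("sport", 0), pkt.get("dport", 0))

def two_tuple_key(pkt):
    # trivial src->dst hash input: survives fragmentation
    return (pkt["src"], pkt["dst"])

first_frag = {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": 6,
              "sport": 40000, "dport": 443, "frag_offset": 0}
later_frag = {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": 6,
              "frag_offset": 1480}

# the 5-tuple splits the fragments across hash buckets...
assert five_tuple_key(first_frag) != five_tuple_key(later_frag)
# ...while the 2-tuple keeps them on the same worker
assert two_tuple_key(first_frag) == two_tuple_key(later_frag)
```

The cost of the 2-tuple is coarser balancing: all flows between the same pair of hosts share one worker.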

#44

Updated by Peter Manev 23 days ago

I think I have some suggestions that can significantly lower the "wrong_thread" counter. At least in the different setups I've tried, it has provided very positive results - mainly the AFPv3 plus RSS plus cluster_qm setup.

Cooper - what is your setup? I may be mistaken, but I think you are running an AMD arch, right?
Sean - you were on Intel, correct?

If you like, you could share yaml/NIC details (privately) and I can suggest some test config adjustments to see if there is any improvement - after which we can update the table and the ticket here. Your help/feedback is much appreciated, as always!

#45

Updated by Cooper Nelson 23 days ago

Yes, I am on AMD Piledriver for my current deployment. Unfortunately, due to issues with the architecture and our network, I have to use the cluster_flow load balancer in order to make use of all of the cores properly. Additionally, I did experiment with the cluster_qm runmode; that did not work either.

Looking over the documentation and source code, it does appear that AFPv3 should be defragmenting TCP packets prior to handing them off to Suricata (this is enabled via the 'defrag' config var in the af-packet section).

However what I think is happening is that there is some issue either on the NIC, driver or kernel that is not allowing this to happen properly in all cases. I found a blog post that summarizes the issue:

http://adrianchadd.blogspot.com/2014/08/receive-side-scaling-figuring-out-how.html

The tl;dr is that fragmented TCP packets can end up on different cores in some cases, even if you have RSS configured with a symmetric hashing algorithm. So the first TCP fragment would get hashed to core #3 and the rest would get hashed to core #7, for example. The fragmented packets should be getting reassembled properly by AFPv3; however, if cluster_flow is using the hash from the NIC, it's going to send the reassembled partial packet to a different core. Due to the buffering within AFPv3 (controlled by the block-size), it's very likely that the fragmented TCP packet is going to be processed out of order. I'm also not sure if there is any information within the TCP/IP header that Suricata could use to reassemble the two fragments in the correct order, as their headers would be identical, I think. They might even get rejected by the stream tracker due to having a bad checksum, if you have that feature enabled.

Based on the blog post, it looks like Intel handled this for UDP packets by simply hashing on the IPv4 header only. I'd suggest the simplest fix would be to provide an option to force 'src->dst' hashing for all IP traffic, specifically for IDS deployments. I would think this should be possible via ethtool by setting the hashing to 'sd' instead of 'sdfn'; however, I get an error when I try to do that for TCP. Others have reported this error as well, with no solution:

https://stackoverflow.com/questions/51791046/how-to-exclude-port-number-from-rss-hashing-for-tcp4-with-ixgbe

So, if anyone knows how to force hashing on the IP header only for all protocols on the ixgbe driver, I would appreciate it if they let me know. Next week I'll look at the driver code and see if it's possible to force it there as well.

I originally thought about modifying the cluster_flow hashing algorithm in Suricata to ignore the RSS hash and compute its own 'src->dst' hash only; however, this would still result in TCP fragments being delivered out of order to the worker threads, which might still cause issues. The 'right' way to fix this is to simply force 'sd' hashing for all IP traffic on RSS IDS deployments.

#46

Updated by Peter Manev 21 days ago

Thanks for the feedback Cooper !
Here is what I am looking at:
- I am consistently able to minimize (and in static testing basically zero out) the wrong_thread counter on live and/or Trex replay traffic at speeds of 20-30Gbps.
- On live setups the wrong_thread counter is basically 0.00some% - still there but rather negligible (I would still want to know why, though).
- The above is only true if I use a number of NIC RSS queues equal to the number of AFPv3 threads with cluster_qm, on the same NUMA node, running Intel HW.
- I used Trex to test different pcap combo scenarios (http/https/oracle/dns/udp/tcp/smtp/etc...) at speeds of 30-40Gbps. While I have drops, I never hit the wrong_thread counter, not even once. This however is all non-frag traffic. As Trex seems not keen on playing with fragments, I then used tcpreplay with a very diverse pcap of merged fragments. I still could not trigger the wrong_thread/pkts counter.
- I need to try more diverse fragmented pcaps, I guess.
- It seems this may be a traffic-specific issue in that case?
- If this is related to fragments, then I think the wrong_thread counter should be somewhat related to the number of frags? Can you correlate those in your setup?
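One way to check the frags-vs-wrong-thread relation is to pull both counters out of stats.log and compare them. A minimal parser sketch (the embedded snippet reuses counter lines pasted earlier in this thread; the function name is mine):

```python
import re

def parse_counters(text):
    """Parse 'counter | TM name | value' lines from a Suricata stats.log dump."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\S+)\s*\|\s*(\S+)\s*\|\s*(\d+)\s*$", line)
        if m:
            counters[m.group(1)] = int(m.group(3))
    return counters

snippet = """\
defrag.ipv4.fragments                         | Total                     | 5421
defrag.ipv4.reassembled                       | Total                     | 2611
tcp.pkt_on_wrong_thread                       | Total                     | 46241013
"""

c = parse_counters(snippet)
# in this sample the wrong-thread count dwarfs the fragment count,
# so fragments alone would not explain it
assert c["tcp.pkt_on_wrong_thread"] > 1000 * c["defrag.ipv4.fragments"]
```

Running the same comparison across a series of stats.log intervals would show whether the two counters grow together or independently.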

#47

Updated by Sean Cloherty 21 days ago

Yes, Peter that is correct. Intel CPUs. I'll forward the configs to you ASAP.

#48

Updated by Andreas Herz 20 days ago

I can confirm that with the recommendation by Peter the wrong_thread issue is gone for setups with X710 (i40e) and X5*0 (ixgbe). The counter is either 0 or very very small.

I'm now trying to achieve the same with cards using igb and e1000e - has anyone come up with a solution for those?

While e1000e is really thin on features, igb at least offers up to 8 queues, but no hash key as far as I can see. So my idea was to reduce it to 1 queue and 1 cluster_qm thread, but the issue still occurs. Any hints for those cards/drivers are also welcome.

This is especially present in setups where I have 2 interfaces in use. If I go down to 1 interface, 1 queue and 1 thread with igb it's fine, but when I use the 2nd interface I see the wrong_thread counter again.

This is what I tried:

ethtool -L enp216s0f0 combined 1
ethtool -L enp216s0f1 combined 1
systemctl stop irqbalance
set_irq_affinity.sh 10 enp216s0f0
set_irq_affinity.sh 11 enp216s0f0

I also enabled cpu affinity and did set the worker:

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 0 ]  # include only these CPUs in affinity settings
    - worker-cpu-set:
        cpu: [ "10-11" ]
        mode: "exclusive" 
        prio:
          default: "high" 

Interface looks like:
af-packet:
  - interface: enp216s0f0
    threads: 1
    cluster-id: 97
    cluster-type: cluster_qm
    defrag: yes
    rollover: no
    use-mmap: yes
    tpacket-v3: yes
    use-emergency-flush: yes

  - interface: enp216s0f1
    threads: 1
    cluster-id: 96
    cluster-type: cluster_qm
    defrag: yes
    rollover: no
    use-mmap: yes
    tpacket-v3: yes
    use-emergency-flush: yes

Changing back to cluster_flow doesn't help either, only removing one interface helps. Maybe it's not possible to get rid of it on those cards :/

#49

Updated by Cooper Nelson 20 days ago

This is one of those issues that is only going to happen on 'live' traffic traversing bad/broken network hardware that causes fragmented TCP packets. Test traffic isn't going to have fragmented IP packets (unless you generate them deliberately, I guess).

I'm seeing lots of large http downloads showing as "TRUNCATED", which I think is indicative of a TCP flow starting with small, non-fragmented TCP packets and then later becoming fragmented as the receive window increases. This is the specific case that will break RSS on the older Intel cards, as described in the article.

Andreas, from reading the documentation on the newer Intel 40G cards it appears they have a more robust RSS implementation that allows for fragmented TCP/UDP packets to be properly load-balanced. I think you are correct that older cards are simply 'broken' in this aspect.

However, as I mentioned, if we can force a trivial src->dst (sd) hash for all IP traffic on the NIC, that should fix it. The load-balancing won't be as 'elegant' on the RSS cores, however at least for my deployment that shouldn't matter. Suricata should properly load-balance the threads in software after they are delivered to the dedicated RSS cores and defragmented.

Again, if anyone knows how to force a trivial 'sd' RSS hash for all IP traffic I would appreciate it. As discussed ethtool does not seem to allow it.

#50

Updated by Sean Cloherty 20 days ago

One change to the NIC config seems to have made a significant difference. I had a different takeaway from this thread and tried enabling csum offloading (ethtool --offload tx on rx on). After 15-16 hours, the result is: 1) no dropped packets; 2) no instances of tcp.pkt_on_wrong_thread or stream.wrong_thread; 3) no instance where one or two worker CPUs are running at 100% for long periods. These are the only offloading options enabled:

rx-checksumming: on
tx-checksumming: on
tx-checksum-ip-generic: on
tx-checksum-fcoe-crc: on [fixed]
tx-checksum-sctp: on
tx-fcoe-segmentation: on [fixed]

Unlike Andreas' results, I was seeing the problem with only 1 active interface on the NUMA node. My only concern is whether enabling the offloading of those two functions is masking the issue, or will have a negative impact on detection. I am using the Intel ixgbe driver v5.5.5 on 82599ES 10-Gigabit NICs.

#51

Updated by Cooper Nelson 20 days ago

Enabling csum offloading seems to have eliminated packet drops.

Still seeing lots of tcp.pkt_on_wrong_thread, though. I did experiment with enabling all offloading, in the hope that maybe something like GRO would reassemble the fragmented TCP packets. This didn't work.

Again, really wish there was some easy way just to force 'sd' hashing on RSS for all IP protocols.

#52

Updated by Peter Manev 19 days ago

Thank you for the updates guys!

One thing that I notice in Sean's setup - it is definitely using 1 combined RSS queue with cluster_flow - so it seems in this case the NIC is not to blame (just 1 queue).

In my setups - both the tests I run on live traffic and with Trex (which does not handle/replay vlans and frags) - the successful formula so far has been:
symmetric hashing, number of RSS queues == number of workers on the same NUMA node, with cluster_qm and using XDP, AFPv3 (X710 NIC).

I ran many dry runs with Trex trying to reproduce the wrong-thread counters, without success, until I switched to good old tcprewrite/tcpreplay. Now I have a smaller pcap with which, on the same setup, I am able to reproduce an increase of the wrong_thread counter with tcpreplay. The pcap contains fragments and special corner-case vlan and QinQ frags, where for example one frag has a different vlan id than it should, etc. I am trying to narrow it down to just a single small pcap and will update...
It does seem to be related to frags and/or vlans in that case, at least from the observations so far.
Will keep you posted on the findings.

#53

Updated by Sean Cloherty 19 days ago

I spoke too soon. After about 16 hours I started seeing drops again on both the test server and the production server getting the same traffic. However, no errors regarding packets on the wrong thread. One notable difference: before enabling the csum offloading, I would see a pattern where Suricata would run for hours or days, then there would be a spike of millions of lost packets in a short (an hour or less) timeframe, and then it would stop for a period of time and then repeat. Now packet drops have been continuing steadily for the last 21 hours. I am going to take the stats and deltas into Splunk and see if there is any correlation between other stats/deltas and the drops.

#54

Updated by Peter Manev 18 days ago

Sean (if I read it correctly) - you only see wrong-thread counters in your setup if you disable the csum offloading, right? (Currently it is enabled; you have some drops but no wrong-thread counters.)

#55

Updated by Sean Cloherty 18 days ago

That is correct. I enabled the csum offload. In 2+ days there have been no wrong threads. However, drops are reaching over 5% on traffic that hardly ever exceeds 2Gb/s. I attached the last stats.log output. Should I set checksum-checks to yes, auto, or no? And would that have any impact on the checksum-validation setting in the stream section?

Something else of note: the hosts monitoring general traffic seem to be suffering from this issue, while the two hosts which only ingest proxy traffic have been 100% fine, with no thread errors as far back as I can search.

#56

Updated by robson Nascimento 18 days ago

Hello guys,

I have checked my setup and found some problems with balancing and fanout; when I correct the fanout problem, I get problems with balancing. I also have a problem when I set the CPU affinity: the NUMA nodes of the hardware I use have their cores divided into a non-sequential range.

I am using Suricata in IPS inline mode, and with cluster_qm I have verified a high number of drops / tcp.reassembly_gap / tcp.pkt_on_wrong_thread.

When I disable cluster_qm and use cluster_flow with a single network queue, I do not have drops, but the same thing happens as in the table above: the packets get held up somewhere that I cannot see.

My setup follows:

Model name:          Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
L3 cache:            14080K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
Interface utilized:
export iface_in="ens1f1" 
export iface_out="ens3f1" 
Affinity script
~/i40e-2.7.29/scripts/set_irq_affinity -x 6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36 $iface_in
~/i40e-2.7.29/scripts/set_irq_affinity -x 7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37 $iface_out
Interface configuration mode:
%YAML 1.1
---
- interface: ens1f1
  copy-iface: ens3f1
  copy-mode: ips
  threads: 16
  use-mmap: yes
  tpacket-v3: yes
  cluster-id: 99
  cluster-type: cluster_qm
  checksum-checks: yes
  defrag: yes
  xdp-mode: driver
  xdp-filter-file: /etc/suricata/ebpf/xdp_filter.bpf
  #ebpf-lb-file:  /etc/suricata/ebpf/lb.bpf
  bypass: yes
  ring-size: 400000
  block-size: 1048576
  mmap-locked: yes
  buffer: 2147483647

- interface: ens3f1
  copy-iface: ens1f1
  copy-mode: ips
  threads: 16
  use-mmap: yes
  tpacket-v3: yes
  cluster-id: 98
  cluster-type: cluster_qm
  checksum-checks: yes
  defrag: yes
  xdp-mode: driver
  xdp-filter-file: /etc/suricata/ebpf/xdp_filter.bpf
  #ebpf-lb-file:  /etc/suricata/ebpf/lb.bpf
  bypass: yes
  ring-size: 400000
  block-size: 1048576
  mmap-locked: yes
  buffer: 2147483647
threading:
  set-cpu-affinity: yes
  # Tune cpu affinity of threads. Each family of threads can be bound
  # on specific CPUs.
  #
  # These 2 apply to the all runmodes:
  # management-cpu-set is used for flow timeout handling, counters
  # worker-cpu-set is used for 'worker' threads
  #
  # Additionally, for autofp these apply:
  # receive-cpu-set is used for capture threads
  # verdict-cpu-set is used for IPS verdict threads
  #
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 2, 3, 4, 5 ]  # include only these CPUs in affinity settings
        #mode: "balanced" 
        #prio:
        #  default: "high" 
    - worker-cpu-set:
        cpu: [ "6","8","10","12","14","16","18","20","22","24","26","28","30","32","34","36","7","9","11","13","15","17","19","21","23","25","27","29","31","33","35","37" ]
        mode: "exclusive" 
        prio:
          low: [ 0 ]
          medium: [ "1" ]
          high: [ "2-37" ]
          default: "high" 
       
Fanout problem logs; Suricata doesn't respect the affinity configuration:

4204] 30/5/2019 -- 17:21:23 - (util-runmodes.c:297) <Info> (RunModeSetLiveCaptureWorkersForDevice) -- Going to use 16 thread(s)
[4205] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#01-ens1f1" to cpu/core 6, thread id 4205
[4206] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#02-ens1f1" to cpu/core 7, thread id 4206
[4207] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#03-ens1f1" to cpu/core 8, thread id 4207
[4208] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#04-ens1f1" to cpu/core 9, thread id 4208
[4209] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#05-ens1f1" to cpu/core 10, thread id 4209
[4210] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#06-ens1f1" to cpu/core 11, thread id 4210
[4211] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#07-ens1f1" to cpu/core 12, thread id 4211
[4212] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#08-ens1f1" to cpu/core 13, thread id 4212
[4213] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#09-ens1f1" to cpu/core 14, thread id 4213
[4214] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#10-ens1f1" to cpu/core 15, thread id 4214
[4215] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#11-ens1f1" to cpu/core 16, thread id 4215
[4216] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#12-ens1f1" to cpu/core 17, thread id 4216
[4217] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#13-ens1f1" to cpu/core 18, thread id 4217
[4218] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#14-ens1f1" to cpu/core 19, thread id 4218
[4219] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#15-ens1f1" to cpu/core 20, thread id 4219
[4220] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#16-ens1f1" to cpu/core 21, thread id 4220

[4204] 30/5/2019 -- 17:21:23 - (util-runmodes.c:297) <Info> (RunModeSetLiveCaptureWorkersForDevice) -- Going to use 16 thread(s)
[4221] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#01-ens3f1" to cpu/core 22, thread id 4221
[4222] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#02-ens3f1" to cpu/core 23, thread id 4222
[4223] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#03-ens3f1" to cpu/core 24, thread id 4223
[4224] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#04-ens3f1" to cpu/core 25, thread id 4224
[4225] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#05-ens3f1" to cpu/core 26, thread id 4225
[4226] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#06-ens3f1" to cpu/core 27, thread id 4226
[4227] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#07-ens3f1" to cpu/core 28, thread id 4227
[4228] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#08-ens3f1" to cpu/core 29, thread id 4228
[4229] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#09-ens3f1" to cpu/core 30, thread id 4229
[4230] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#10-ens3f1" to cpu/core 31, thread id 4230
[4231] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#11-ens3f1" to cpu/core 32, thread id 4231
[4232] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#12-ens3f1" to cpu/core 33, thread id 4232
[4233] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#13-ens3f1" to cpu/core 34, thread id 4233
[4234] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#14-ens3f1" to cpu/core 35, thread id 4234
[4235] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#15-ens3f1" to cpu/core 36, thread id 4235
[4236] 30/5/2019 -- 17:21:23 - (tm-threads.c:1101) <Perf> (TmThreadSetupOptions) -- Setting prio -2 for thread "W#16-ens3f1" to cpu/core 37, thread id 4236
Cluster_qm load-balancing problem; packets are not equally distributed:
[4205] 30/5/2019 -- 17:23:55 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#01-ens1f1) Kernel: Packets 84948332, dropped 42359227
[4206] 30/5/2019 -- 17:23:56 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#02-ens1f1) Kernel: Packets 0, dropped 0
[4207] 30/5/2019 -- 17:23:56 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#03-ens1f1) Kernel: Packets 0, dropped 0
[4208] 30/5/2019 -- 17:23:56 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#04-ens1f1) Kernel: Packets 0, dropped 0
[4209] 30/5/2019 -- 17:23:56 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#05-ens1f1) Kernel: Packets 0, dropped 0
[4210] 30/5/2019 -- 17:23:56 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#06-ens1f1) Kernel: Packets 18, dropped 0
[4211] 30/5/2019 -- 17:23:56 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#07-ens1f1) Kernel: Packets 14, dropped 0
[4212] 30/5/2019 -- 17:23:56 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#08-ens1f1) Kernel: Packets 32, dropped 0
[4213] 30/5/2019 -- 17:23:56 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#09-ens1f1) Kernel: Packets 13, dropped 0
[4214] 30/5/2019 -- 17:23:56 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#10-ens1f1) Kernel: Packets 47806, dropped 0
[4215] 30/5/2019 -- 17:23:57 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#11-ens1f1) Kernel: Packets 60698, dropped 0
[4216] 30/5/2019 -- 17:23:57 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#12-ens1f1) Kernel: Packets 47625, dropped 0
[4217] 30/5/2019 -- 17:23:57 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#13-ens1f1) Kernel: Packets 110868, dropped 0
[4218] 30/5/2019 -- 17:23:57 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#14-ens1f1) Kernel: Packets 54795, dropped 0
[4219] 30/5/2019 -- 17:23:57 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#15-ens1f1) Kernel: Packets 111116, dropped 0
[4220] 30/5/2019 -- 17:23:57 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#16-ens1f1) Kernel: Packets 139395, dropped 0
I have checked in my setup some problems with balancing and fanout; when I correct the fanout problem I have problems with balancing. I also have a problem when I set the CPU affinity, because the hardware I use has the cores of one NUMA node divided into a non-sequential range (see the setup above).
[4221] 30/5/2019 -- 17:23:57 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#01-ens3f1) Kernel: Packets 69213171, dropped 25853962
[4222] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#02-ens3f1) Kernel: Packets 27, dropped 0
[4223] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#03-ens3f1) Kernel: Packets 18, dropped 0
[4224] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#04-ens3f1) Kernel: Packets 30, dropped 0
[4225] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#05-ens3f1) Kernel: Packets 27, dropped 0
[4226] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#06-ens3f1) Kernel: Packets 31, dropped 0
[4227] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#07-ens3f1) Kernel: Packets 26, dropped 0
[4228] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#08-ens3f1) Kernel: Packets 33, dropped 0
[4229] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#09-ens3f1) Kernel: Packets 28, dropped 0
[4230] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#10-ens3f1) Kernel: Packets 53986, dropped 0
[4231] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#11-ens3f1) Kernel: Packets 56785, dropped 0
[4232] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#12-ens3f1) Kernel: Packets 89036, dropped 0
[4233] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#13-ens3f1) Kernel: Packets 95541, dropped 0
[4234] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#14-ens3f1) Kernel: Packets 97854, dropped 0
[4235] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#15-ens3f1) Kernel: Packets 116324, dropped 0
[4236] 30/5/2019 -- 17:23:58 - (source-af-packet.c:2670) <Perf> (ReceiveAFPThreadExitStats) -- (W#16-ens3f1) Kernel: Packets 119019, dropped 0
GAP/DROPS running traffic around "8Gb out and 4Gb in":
capture.kernel_drops                          | Total                     | 30851380
tcp.reassembly_gap                            | Total                     | 415
capture.kernel_drops                          | Total                     | 33108671
tcp.reassembly_gap                            | Total                     | 418
capture.kernel_drops                          | Total                     | 36862751
tcp.reassembly_gap                            | Total                     | 437
capture.kernel_drops                          | Total                     | 40214600
tcp.reassembly_gap                            | Total                     | 471
capture.kernel_drops                          | Total                     | 41667557
tcp.reassembly_gap                            | Total                     | 475
capture.kernel_drops                          | Total                     | 41667557
tcp.reassembly_gap                            | Total                     | 475
capture.kernel_drops                          | Total                     | 41667557
tcp.reassembly_gap                            | Total                     | 475
perf stat -C 6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37 -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetch-misses -I500

   472.873703526    <not supported>      LLC-prefetch-misses                                         
   473.374174031          8,791,280      LLC-loads                                                     (50.00%)
   473.374174031          5,271,962      LLC-load-misses           #  323.35% of all LL-cache hits     (50.00%)
   473.374174031          3,567,877      LLC-stores                                                    (50.00%)
   473.374174031          1,728,600      LLC-store-misses                                              (50.00%)
#57

Updated by Cooper Nelson 17 days ago

Sean Cloherty wrote:

Something else of note. The hosts monitoring general traffic seem to be suffering this issue. I have two hosts which only ingest proxy traffic and they have been 100% fine with no thread errors as far back as I can search.

This is another datapoint in favor of my theory that what we are seeing is an artifact of TCP packets fragmented mid-session on general "dirty" internet traffic.

In the interest of full disclosure, it's been pointed out to me that TCP window scaling shouldn't cause this directly, as the IP packets are still going to be 1514 bytes. However, I'm of the opinion that the higher data rates associated with it might cause the effect on marginal connections/hardware.

#58

Updated by Andreas Herz 15 days ago

@robson do you also see wrong_threads or just drops? what version of suricata? and could you test out ids mode instead of ips for comparison?

@all should we start a shared gdoc sheet to gather our findings and setups (hardware, nic, config)?

#59

Updated by robson Nascimento 15 days ago

@ Andreas Herz

I see wrong_threads and drops, but I think something is wrong where the thread affinity is defined: looking at the logs, Suricata is starting them in the wrong sequence. The same goes for the balancing when using workers with cluster_qm: it does not balance and always gets stuck on a single queue (look at the per-thread traffic statistics). I'm using version 4.0-dev, but as a test I have also used versions 4.1.2, 4.1.4 and 5.0-dev. I'm using kernel 5.1.3, but I have also used 4.14/4.15/4.19; the behavior is similar in all of them.

I already tested in IDS mode but I did not identify differences.

Suricata affinity

threading:
  set-cpu-affinity: yes
  # Tune cpu affinity of threads. Each family of threads can be bound
  # on specific CPUs.
  #
  # These 2 apply to the all runmodes:
  # management-cpu-set is used for flow timeout handling, counters
  # worker-cpu-set is used for 'worker' threads
  #
  # Additionally, for autofp these apply:
  # receive-cpu-set is used for capture threads
  # verdict-cpu-set is used for IPS verdict threads
  #
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 2, 3, 4, 5 ]  # include only these CPUs in affinity settings
        #mode: "balanced" 
        #prio:
        #  default: "high" 
    - worker-cpu-set:
        cpu: [ "6","8","10","12","14","16","18","20","22","24","26","28","30","32","34","36","7","9","11","13","15","17","19","21","23","25","27","29","31","33","35","37" ]
        mode: "exclusive" 
        prio:
          low: [ 0 ]
          medium: [ "1" ]
          high: [ "2-37" ]
          default: "high" 

Interface configuration:

Interface configuration mode:
%YAML 1.1
---
- interface: ens1f1
  copy-iface: ens3f1
  copy-mode: ips
  threads: 16
  use-mmap: yes
  tpacket-v3: yes
  cluster-id: 99
  cluster-type: cluster_qm
  checksum-checks: yes
  defrag: yes
  xdp-mode: driver
  xdp-filter-file: /etc/suricata/ebpf/xdp_filter.bpf
  #ebpf-lb-file:  /etc/suricata/ebpf/lb.bpf
  bypass: yes
  ring-size: 400000
  block-size: 1048576
  mmap-locked: yes
  buffer: 2147483647

- interface: ens3f1
  copy-iface: ens1f1
  copy-mode: ips
  threads: 16
  use-mmap: yes
  tpacket-v3: yes
  cluster-id: 98
  cluster-type: cluster_qm
  checksum-checks: yes
  defrag: yes
  xdp-mode: driver
  xdp-filter-file: /etc/suricata/ebpf/xdp_filter.bpf
  #ebpf-lb-file:  /etc/suricata/ebpf/lb.bpf
  bypass: yes
  ring-size: 400000
  block-size: 1048576
  mmap-locked: yes
  buffer: 2147483647

Intel X710 driver/firmware version

driver: i40e
version: 2.1.14-k
firmware-version: 6.80 0x80003d72 18.8.9
expansion-rom-version: 
bus-info: 0000:05:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

#60

Updated by Cooper Nelson 15 days ago

@ Robson Nascimento

I see you are using one of the newer i40e NICs.

Could you try using ethtool to set an 'sd' hash for TCP and see if that addresses the TCP packets landing on the wrong thread?

/usr/sbin/ethtool -N ens1f1 rx-flow-hash tcp4 sd
/usr/sbin/ethtool -N ens1f1 rx-flow-hash tcp6 sd

This doesn't work on the ixgbe driver; not sure if it's a hardware or software issue.

#61

Updated by Peter Manev 14 days ago

@ robson Nascimento

This is not a Suricata issue - in my opinion this is bound to the kernel version, at least from what I have seen.

On 4.18+ and 4.14 when using xdp and affinity in corner cases like yours you need to rerun

~/i40e-2.7.29/scripts/set_irq_affinity -x 6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36 $iface_in
~/i40e-2.7.29/scripts/set_irq_affinity -x 7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37 $iface_out

right after Suricata starts. It's only on those two kernel versions that I've seen this happen with Intel and XDP. I am tracking it and will update if it really is related to the kernel (this is at least my observation with this NIC/driver version combo).

Also, I see you're running IPS mode with AFPv3, which is not recommended.

#62

Updated by robson Nascimento 14 days ago

@Peter

If you want, I can provide a lab environment with traffic mirrored via a Gigamon at the border, around 20Gb.

I tested the latest firmware / driver available for these intel x710. I have already tested native versions of the kernel but without significant change.

#63

Updated by Andreas Herz 14 days ago

So some updates from my side. I have two systems that are each receiving around 8Gbit/s split over two interfaces. We use t-rex (./t-rex-64 -f cap2/http_simple.yaml -c 8 -m 10000 -d 10000) to generate the traffic.

One system is an X710 with the i40e driver and one system is an X520 with the ixgbe driver. As soon as I enable symmetric hashing and 10/8 queues per interface I can confirm that the wrong_threads counter is 0. This even works with cluster_flow besides cluster_qm. I can also confirm, based on the per-thread stats.log, that the distribution between the 20/16 threads (2*10/2*8 queues) is very well done, same for the load on the dedicated CPU cores. I also disabled rollover in the interface section.

So IMHO that is what we want to have as a baseline: all cores busy with similar amounts of traffic.

BUT the drops are still a big, and also strange, issue: when I restart Suricata they jump up to 65% and decline step by step. Even when I look at the delta between two stats.log outputs (20 seconds) I still see a 5-10% drop rate.
The huge spike at the beginning lasts for around 3 minutes before it starts declining.

When the threads and cores aren't aligned to the queues, the wrong_thread counter comes back, which is no real surprise, but the drop rate is low.
The increase of the wrong_thread counter correlates with a smaller increase in the drop rates compared to the modes above (threads aligned to queues). So an explanation might be that packets "lost" to wrong_threads won't be counted as drops.

So while I recommend achieving the zero-wrong_thread state, try to compare drop rates between those scenarios if you also see an increase.

EDIT: I forgot to mention that there is also a huge improvement with regards to the drop counter on the NIC itself. Before the queue changes the drop counter on the NIC was also quite high, so it is quite obvious that the drop counter in Suricata might be lower since a lot of traffic doesn't even reach it.

On the other hand, runs with ./t-rex-64 -f cap2/sfr3.yaml -c 8 -m 100 -d 100000 can achieve 10Gbit/s without drops (only some minor peaks which result in a 0.05% drop rate).
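The delta check mentioned above can be sketched as follows. Suricata's `capture.kernel_packets`/`capture.kernel_drops` counters in stats.log are cumulative, so the interval drop rate has to be computed from the difference between two snapshots; the sample values here are invented.

```python
# Sketch: drop rate over one stats.log interval. Suricata's counters are
# cumulative, so the rate must be computed from the delta between two
# snapshots. The sample values below are invented.

def drop_rate(prev: dict, curr: dict) -> float:
    """Percent of packets dropped between two stats snapshots."""
    d_drops = curr["capture.kernel_drops"] - prev["capture.kernel_drops"]
    d_pkts = curr["capture.kernel_packets"] - prev["capture.kernel_packets"]
    return 100.0 * d_drops / d_pkts if d_pkts else 0.0

prev = {"capture.kernel_packets": 1_000_000, "capture.kernel_drops": 30_000}
curr = {"capture.kernel_packets": 1_500_000, "capture.kernel_drops": 65_000}
print(f"{drop_rate(prev, curr):.1f}% dropped this interval")  # → 7.0% dropped this interval
```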

#64

Updated by Peter Manev 14 days ago

Since Trex does not handle/replay VLANs and fragments (at least in my observations), I think we can conclude it is most likely vlan/QinQ/fragment related, with respect to the "wrong_thread" counter. I have a case where I can consistently reproduce the "wrong_thread" counter using tcpreplay. I have been trying to narrow it down to a specific flow.

#65

Updated by Cooper Nelson 13 days ago

Indeed, it is very important to test with 'organic' IP traffic from the "Dirty Internet". LAN and test boxes won't show this effect unless you explicitly engineer it.

#66

Updated by Cooper Nelson 13 days ago

Found this in the ixgbe sources; it looks like a hardware limitation for TCP on these cards:

static int ixgbe_set_rss_hash_opt(struct ixgbe_adapter *adapter,
                                  struct ethtool_rxnfc *nfc)
{
        u32 flags2 = adapter->flags2;

        /*
         * RSS does not support anything other than hashing
         * to queues on src and dst IPs and ports
         */
#67

Updated by Cooper Nelson 13 days ago

I have possibly made some progress here.

I was experimenting with the older runmodes; the tcp.pkt_on_wrong_thread counter isn't present when using either pcap or af_packet with the 'autofp' runmode. However, packet drops are through the roof.

What I think is happening here is that by putting the stream tracker on the RSS cores, fragmented TCP packets are properly reassembled by Suricata before being sent to the detect threads. In the 'workers' runmode, fragmented TCP packets end up on different worker threads, which breaks the stream tracker.

Not sure how to fix this, as you really need to use the 'workers' runmode on a busy sensor. For small deployments autofp will be a practical solution.

#68

Updated by Victor Julien 12 days ago

  • Related to Feature #3011: Add new 'cluster_peer' runmode to allow for load balancing by IP header (src<->dst) only added
#69

Updated by robson Nascimento 11 days ago

@Cooper Nelson

This doesn't work on the i40e driver; not sure if it's a hardware or software issue.

#70

Updated by Andreas Herz 11 days ago

I did more testing with Intel I350 Gigabit cards (igb driver). I can get rid of the wrong_thread counter when I use just one combined queue and disable rollover in the interface section. BUT this only works as long as I use just one interface; as soon as I use a second interface with the same settings the counter increases again. I even forced 1 thread per interface (as it's not much traffic, ~200mbit/s average), so I don't see any performance issues. But even though I set the IRQ affinity and CPU affinity there is still a wrong_thread counter at around 10% of received packets.

So far I'm out of ideas on how to fix it with igb-based cards; they don't support setting the RSS hash key. So at least for those cases I think it might be worth discussing options that AF_PACKET could provide. Eric Leblond, do you have any ideas what we could do with cards like those?

With e1000e cards it's not the same: as soon as I disable rollover the wrong_thread counter is gone or really low (with cluster_flow). So it's something to work on the igb side, I guess.

#71

Updated by Andreas Herz 11 days ago

As pointed out in issue #2900, it can be fixed by using autofp instead of workers, which is not perfect but should be enough for 1Gbit/s.
I can confirm this for e1000e and igb setups, even those with several interfaces - can anyone else confirm it as well?

#72

Updated by Cooper Nelson 11 days ago

Andreas Herz wrote:

So far I'm out of ideas how to fix it with igb based cards, they don't support the rss hash key setting. So at least for those cases I think it might be worth to discuss options that AF_PACKET could provide. Eric Leblond do you have any ideas what we could do with cards like those?

AF_PACKET is supposed to handle this case per the man pages:

IP fragmentation causes packets from the same flow to have different flow
hashes. The flag PACKET_FANOUT_FLAG_DEFRAG, if set, causes packets to be
defragmented before fanout is applied, to preserve order even in this case.

http://man7.org/linux/man-pages/man7/packet.7.html

I presume this is what the 'defrag' option in the af-packet section of the .yaml controls. This could be broken, or dependent on some hardware or driver feature.
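For illustration, here is a minimal sketch of how the PACKET_FANOUT socket-option word is assembled, per packet(7): the fanout group id goes in the low 16 bits, the fanout mode plus flags in the high 16 bits. The constants are from <linux/if_packet.h>; the group id 99 simply mirrors the cluster-id used in the configs earlier in this thread, and this is not Suricata's actual code.

```python
# Sketch of the PACKET_FANOUT option word, per packet(7):
# low 16 bits = fanout group id, high 16 bits = mode | flags.

PACKET_FANOUT_HASH = 0x0            # flow-hash fanout (cluster_flow)
PACKET_FANOUT_FLAG_DEFRAG = 0x8000  # defragment before applying fanout

def fanout_arg(group_id: int, mode: int, flags: int = 0) -> int:
    return (group_id & 0xFFFF) | ((mode | flags) << 16)

# The value a capture engine would pass to
# setsockopt(fd, SOL_PACKET, PACKET_FANOUT, arg):
arg = fanout_arg(99, PACKET_FANOUT_HASH, PACKET_FANOUT_FLAG_DEFRAG)
print(hex(arg))  # → 0x80000063
```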

#73

Updated by Peter Manev 11 days ago

I have a reproducible test with a pcap using tcpreplay that I have asked Andreas to confirm on his setup - basically i40e/symmetric hashing/cluster_qm, resulting in

stream.wrong_thread                           | Total                     | 4
tcp.pkt_on_wrong_thread                       | Total                     | 8

and none of those counters when using autofp. If he confirms, then we may actually have something to zero in on - I can share the pcap too (I just need the confirmation).

#74

Updated by Karel Van Hecke 11 days ago

FreeBSD 11.2
netmap
Intel pro/1000 PT (em driver)

Can confirm that using autofp instead of workers run mode fixes the issue.

#75

Updated by Peter Manev 10 days ago

Good info - at this point it seems it is not af_packet/netmap specific either.

#76

Updated by Peter Manev 7 days ago

I can narrow it down with a specific pcap in my setup to the following:

- afpv3, 1 RSS, and cluster_flow with 36 threads - no wrong_thread counters while doing a one-time tcpreplay of the pcap
- afpv3, 36 RSS, and cluster_qm with 36 threads, symmetric hashing - 4 wrong_thread counters while doing a one-time tcpreplay of the pcap
- no wrong_thread counters while using autofp either

The pcap has lots of frag/vlan/QinQ cases.

#77

Updated by Cooper Nelson 7 days ago

@Peter Manev,

Maybe a tall order, but could you try identifying all TCP flows with at least one fragmented packet and putting them in a separate file for testing?

If my hypothesis is correct you should get the same number (4) of wrong_thread counters. Then you could check whether you can find TCP flows which contain a mix of fragmented and unfragmented packets and focus on those?

Based on your observation, the problem is an interaction between RSS on the NIC and afpv3, which is putting TCP packets from the same flow in different queues. If it doesn't happen with 1 queue and cluster_flow, then the problem isn't with suri.

If I remember correctly, regit mentioned that cluster_flow will use the RSS hash from the NIC to compute its own hash if present, which means using multi-queue RSS and cluster_flow will result in the same problem.

#78

Updated by Peter Manev 6 days ago

Here is a counter corner case, if you will. The difference is that I can actually reproduce this at will.
Attached (in the archive) are full details for each run, plus the inventory setup, plus a (nasty) pcap to reproduce the results below. Also attached in the details are the alerts/packets that triggered the "SURICATA STREAM pkt seen on wrong thread" alerts/counters.

So here is the breakdown and summary of a multitude of runs of the tests:

# | Issue Appears | OS/Distribution | Kernel Version | NIC/driver/version | Runmode | Queues/threads/workers | Notes
1 | NO  | Ubuntu Bionic | 4.18.0-20-generic | XL710 for 40GbE QSFP+, i40e 2.7.29 | AFP v3, cluster_flow | 4 RSS, 4 workers | With Rust, Toeplitz hash function, low-entropy key. Offloading off, ntuple-filters: on, receive-hashing: on, "af-packet.0.defrag = yes" and "af-packet.1.defrag = yes"
2 | YES | Ubuntu Bionic | 4.18.0-20-generic | XL710 for 40GbE QSFP+, i40e 2.7.29 | AFP v3, cluster_qm | 4 RSS, 4 workers | Same as test 1
3 | YES | Ubuntu Bionic | 4.18.0-20-generic | XL710 for 40GbE QSFP+, i40e 2.7.29 | AFP v3, cluster_qm | 36 RSS, 36 workers | Same as test 1
4 | YES | Ubuntu Bionic | 4.18.0-20-generic | XL710 for 40GbE QSFP+, i40e 2.7.29 | AFP v3, cluster_qm | 36 RSS, 36 workers | Same as test 1, but with "af-packet.0.defrag = no" and "af-packet.1.defrag = no"

# | Issue Appears | stream.wrong_thread / tcp.pkt_on_wrong_thread | Notes
1 | NO  | 0 / 0 |
2 | YES | 2 / 4 |
3 | YES | 4 / 8 | x2 the counters from test 2
4 | YES | 4 / 8 | x2 the counters from test 2

If I use the same setup as described in the attached details and the tables above - but with Trex - and do a run like so:

./t-rex-64 -f cap2/sfr_agg_tcp14_udp11_http200msec_new_high_new_nir_profile_ipg_mix.yaml -c 8 -m 22 -d 10000

I have NO "wrong" counters in any cluster_qm setup. The only "wrong" counters appear with cluster_flow (the reverse of the above). The difference between the two test processes (trex vs. tcpreplay of the pcap) is that the pcap purposefully contains out-of-order fragments with vlans, while the trex run has none of those.

I am open to feedback/reproducible runs info/thoughts etc...

NOTE: Just for info - I have NO stream.wrong_thread / tcp.pkt_on_wrong_thread counters with the autofp runmode, or when reading the pcap with autofp.

#79

Updated by Cooper Nelson 6 days ago

Peter Manev wrote:

I am open to feedback/reproducible runs info/thoughts etc...

I think the "Issue Appears: NO" for the first case is a false negative caused by the low number of worker threads. Remember cluster_flow isn't using a proper Toeplitz hash; rather it's using an additive function (per regit) over the sdfn five-tuple. This is much less 'random' than the Toeplitz function, so packets with similar headers (fragments) are much more likely to end up in the same queue when there is a low number of worker threads. Since there isn't a TCP header on the second IP fragment, cluster_flow is probably going to just use '0' for the 'fn' part of its 'sdfn' hash, while 'sd' will be the same. So the packet will be sent to a 'close' worker, rather than a random one. This hides the issue with a low number of worker threads; if you did a test with 36 workers you would probably see the issue.
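A toy illustration of that point (this is NOT Suricata's or the kernel's actual code, just a hypothetical additive sdfn hash matching the description above): the sum is symmetric by construction, and a fragment with the ports treated as 0 differs from the flow's hash only by (sport+dport) modulo the thread count, so with few threads it has a good chance of still colliding with the right worker.

```python
# Hypothetical additive sdfn hash (illustration only, not Suricata code).
# Symmetric by construction; a fragment with no TCP header (ports = 0)
# only loses the port terms, so its thread is offset from the flow's
# thread by (sport + dport) % n_threads.

def additive_sdfn(src: int, dst: int, sport: int, dport: int, n_threads: int) -> int:
    return (src + dst + sport + dport) % n_threads

n = 4  # low worker-thread count, as in test cases 1 and 2
full = additive_sdfn(0x01020304, 0x05060708, 1234, 80, n)
frag = additive_sdfn(0x01020304, 0x05060708, 0, 0, n)  # fragment: no ports
assert additive_sdfn(0x05060708, 0x01020304, 80, 1234, n) == full  # symmetric
print(full, frag)  # → 2 0
```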

From my understanding of the RSS Toeplitz implementation, zeroing out the 'fn' fields will send the packet to a 'random' thread (as it's a 'proper' hash), though there is still a 25% chance it will end up in the right queue given the low number of worker threads.

I'm using 48 worker threads so this is still a problem in my deployment.

Basically what I'm asking for is either a new cluster mode (cluster_peer) that is even less random than the current implementation, or maybe a new section in the yaml to configure the cluster_flow load balancer, i.e. sd vs. sdfn.

I think it's important to understand that this is simultaneously a software, hardware, TCP/IP and (physical) networking problem.

#80

Updated by Cooper Nelson 6 days ago

Addendum: you really should have 'Queues/threads/workers' set to a high number, and keep it the same across test cases.

#81

Updated by Cooper Nelson about 11 hours ago

@Peter Manev

At this point I'm very close to 100% sure I know what the 'generic' problem is, meaning what is happening in all the cases described here. It's possible there are other related edge cases, but I'm certain what we are seeing on our sensors is the same thing.

The most important thing to realize is that the source of the problem is in the RSS implementation on your NICs, which is baked into the NIC firmware and can't be fixed (as far as I know). I'll break the problem down into pieces and show why you are seeing what appear to be odd results in your experiment.

It's critical to first understand what happens when a TCP/IP packet gets fragmented. Only the first fragment has the TCP header; subsequent fragments have the IP header only (along with an offset). Of course, this breaks RSS implementations that are using 'sdfn' hashing, sending the fragments to different RSS queues. This is undesired behavior.

The way RSS handles this case is to check whether the 'fragflag' is set on each TCP packet; if it is, RSS reverts to 'sd' hashing only. So the first fragment and all subsequent fragments are hashed on the IP header only, sending them all to the same RSS queue. The problem we are seeing is simply due to TCP flows that contain a mix of fragmented and unfragmented packets. As I've mentioned, it's very common for flows to begin without fragmentation, with fragmentation only appearing later as the receive window increases. So the fragmented TCP packets are sent to a different RSS queue, where AF_PACKET reassembles them. They are, however, now on the 'wrong' thread, which is why the counter increments.
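The queue-selection rule described above can be simulated in a few lines. This is an illustrative sketch, not driver code: the hash is a stand-in (blake2b rather than Toeplitz), and the addresses, ports, and queue count are made up. It shows how the 'sd' fallback keeps all fragments of a flow together on one queue, which only matches the unfragmented packets' queue by luck.

```python
import hashlib
from dataclasses import dataclass

N_QUEUES = 4  # illustrative RSS queue count

@dataclass
class Pkt:
    src: str
    dst: str
    sport: int
    dport: int
    fragment: bool  # True if the IP fragment flag/offset indicates fragmentation

def rss_queue(p: Pkt) -> int:
    """Pick an RSS queue: sdfn hash normally, sd fallback for fragments."""
    if p.fragment:
        key = f"{p.src}|{p.dst}"                      # 'sd' fallback: addresses only
    else:
        key = f"{p.src}|{p.dst}|{p.sport}|{p.dport}"  # 'sdfn': full five-tuple
    h = int.from_bytes(hashlib.blake2b(key.encode(), digest_size=4).digest(), "big")
    return h % N_QUEUES

# One TCP flow: three unfragmented packets, then two fragmented ones
flow = [Pkt("10.0.0.1", "10.0.0.2", 40000, 443, fragment=False)] * 3 \
     + [Pkt("10.0.0.1", "10.0.0.2", 40000, 443, fragment=True)] * 2

queues = [rss_queue(p) for p in flow]
# All fragments land on a single (sd-hashed) queue; with a good hash there is
# only a 1-in-N_QUEUES chance it is the same queue as the unfragmented packets.
```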

In your specific case, let's use an example. Say there are two large TCP flows (1,000+ packets), each containing four fragmented TCP packets. This would produce test cases 3 and 4, given the large number of RSS queues/worker threads: there is only a 1-in-36 chance that the fragmented packets are sent to the same thread as the unfragmented packets, resulting in a total of 8 tcp.pkt_on_wrong_thread.
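The arithmetic in that example can be checked directly. A small sketch, using the hypothetical numbers above (2 flows × 4 fragments = 8 fragmented packets) and assuming the sd-hashed fragments land on a uniformly random queue:

```python
def expected_wrong_thread(frag_pkts: int, n_queues: int) -> float:
    """Expected tcp.pkt_on_wrong_thread count: each flow's fragments move
    together, but the expectation is the same as if each fragmented packet
    independently missed the 'right' queue with probability (1 - 1/n_queues)."""
    return frag_pkts * (1 - 1 / n_queues)

print(expected_wrong_thread(8, 36))  # ~7.78: with 36 queues, nearly all 8 counted
print(expected_wrong_thread(8, 4))   # 6.0: a low queue count hides some hits
```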

In cases 1 and 2, given there are only 4 RSS queues/threads, there is a 1-in-4 chance the packets will end up on the same queue, assuming a purely 'random' hashing implementation like Toeplitz. So it's not really surprising that we see lower numbers with lower thread counts in general. I think the first case is 0 simply because cluster_flow doesn't use a truly 'random' hash and will often direct packets to the same queue when the port numbers are missing and the thread count is low.

Given that this is a 'feature' of the RSS implementation, not a bug, it will need to be fixed either in the Linux kernel or in Suricata itself. In the kernel, AF_PACKET could be modified to compute a proper Toeplitz hash after TCP packets are defragmented and then send the packet to the 'correct' RSS queue. Or Suricata could be modified to hash only on IPv4 headers, or possibly to check whether the fragflag is still set after reassembly, compute a new Toeplitz hash, and copy the packet to the proper worker thread.
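A related workaround on the NIC side (not a fix for the cluster_flow side) is to tell the hardware to hash TCP/IPv4 on addresses only, so fragmented and unfragmented packets of a flow at least agree at the RSS level. ethtool exposes this on drivers that support it; `eth0` below is a placeholder device name:

```shell
# Switch the NIC's RSS hash for TCP over IPv4 from 'sdfn' to 'sd' (addresses only)
ethtool -N eth0 rx-flow-hash tcp4 sd
# Query the currently configured hash fields
ethtool -n eth0 rx-flow-hash tcp4
```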

This issue doesn't happen in autofp mode simply because there is a single stream tracker processing all TCP packets, regardless of whether or not they were fragmented. In the 'workers' runmode each thread has its own stream tracker.
