Support #2725

stream/packet on wrong thread

Added by Peter Manev about 2 months ago. Updated 31 minutes ago.

Status: Feedback
Priority: Normal
Assignee: -
Target version: -
Affected Versions:
Effort:
Difficulty:

Description

Looking for feedback.

While investigating various research points with af-packet on live traffic and the latest git master (e.g. eedf08be/4.1), I noticed that I have never seen the following counters at 0 on any occasion:

stream.wrong_thread                        | Total                     | 2982446
tcp.pkt_on_wrong_thread                    | Total                     | 156187846

These statistics can be enabled via the global stats configuration in suricata.yaml ( https://github.com/OISF/suricata/blob/master/suricata.yaml.in#L66 ):

# global stats configuration
stats:
  enabled: yes
  # The interval field (in seconds) controls at what interval
  # the loggers are invoked.
  interval: 8
  # Add decode events as stats.
  decoder-events: true
  # Add stream events as stats.
  stream-events: true

I have tried different NICs/drivers (tested ixgbe/i40e), af-packet v3/v2, cluster_flow/cluster_cpu/cluster_qm, vlan tracking enabled or not, on different live-traffic machines and different kernels (4.18/4.19) -
capture.kernel_drops and stream.wrong_thread are never 0 and always increasing (more like 10-15% of the total in my test cases).

Looking for any feedback on whether you are experiencing the same issue or not, and what your setup is (if you would like to share).
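
A quick way to watch these counters (a sketch; it assumes the default stats.log location, and jq plus the eve.json stats output for the second variant, where the counter names map to nested JSON keys):

tail -F /var/log/suricata/stats.log | grep -E "wrong_thread"

# or, if the eve.json stats output is enabled:
tail -F /var/log/suricata/eve.json | jq 'select(.event_type=="stats") | .stats.tcp.pkt_on_wrong_thread'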

statslog.tar.gz (802 KB) - Peter Manev, 12/05/2018 07:37 AM

History

#1 Updated by Victor Julien about 2 months ago

  • Tracker changed from Bug to Support

It could be interesting to check if having a single NIC queue helps.

It could also be interesting to see how it works with PF_RING and NETMAP.
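
Something along these lines should force a single queue (a sketch; the interface name is only an example, and the driver has to support ethtool -L):

# reduce the NIC to a single combined RX/TX queue
ethtool -L enp59s0 combined 1
# confirm the current setting
ethtool -l enp59s0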

#2 Updated by Victor Julien about 2 months ago

Can you share a full stats.log record?

#3 Updated by Peter Manev about 2 months ago

I could not definitively test with a single RSS queue because the traffic was too much for one queue on all the live setups I can test on. I will try to reproduce it some other way on a slower machine.

It would also be interesting to see/confirm that the issue is not present with dedicated or other HW - Netronome/Napatech/Mellanox/Accolade+pfring (that one I will test and confirm next week) / other ...

#4 Updated by Peter Manev about 2 months ago

stats.log attached

#5 Updated by Victor Julien about 2 months ago

There is Teredo, ipv4-in-ipv6 and GRE tunneling. I wonder if those are handled correctly with respect to AF_PACKET's load balancing.
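
A rough way to gauge how much of that tunneled traffic is on the wire could be something like this (a sketch; GRE is IP protocol 47, Teredo runs over UDP port 3544, ipv4-in-ipv6 is IPv6 next header 4, and the filter may need a vlan variant for tagged traffic):

tcpdump -ni enp59s0 -c 1000 'ip proto 47 or udp port 3544 or ip6 proto 4'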

#6 Updated by Peter Manev about 1 month ago

A valid point, I think.
I am supposed to get access to a replay setup next week and will do a few runs with those types of live traffic specifically, to cover those cases as well. I will report back as soon as I have some findings.

#7 Updated by Victor Julien about 1 month ago

I wonder if, as a test, it would make sense to set a capture filter of just: tcp. I think this should rule out the tunnels.

Also, despite the perf concern, I think it is still a good test to just use a single queue on the NIC, just to see if that affects the 'wrong_thread' counters.
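
Something like this could work for the test run (a sketch; the filter file name is only an example, and the vlan variant is included in case the traffic is tagged):

# /var/log/suricata/bpf-tcp.filter
tcp or (vlan and tcp)

suricata -vvv --af-packet -S /opt/rules/*.rules --pidfile=/var/run/suricata.pid -F /var/log/suricata/bpf-tcp.filter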

#8 Updated by Peter Manev about 1 month ago

An update.

Setup - af-packet config:


suricata --dump-config |grep af-packet
af-packet = (null)
af-packet.0 = interface
af-packet.0.interface = enp59s0
af-packet.0.threads = 40
af-packet.0.cluster-id = 99
af-packet.0.cluster-type = cluster_flow
af-packet.0.defrag = yes
af-packet.0.xdp-mode = driver
af-packet.0.xdp-filter-file = /etc/suricata/xdp_filter.bpf
af-packet.0.bypass = yes
af-packet.0.use-mmap = yes
af-packet.0.mmap-locked = yes
af-packet.0.tpacket-v3 = yes
af-packet.0.ring-size = 300000
af-packet.0.block-size = 1048576
af-packet.1 = interface
af-packet.1.interface = default
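
For reference, the corresponding suricata.yaml af-packet section would look roughly like this (reconstructed from the dump-config values above, not the exact file):

af-packet:
  - interface: enp59s0
    threads: 40
    cluster-id: 99
    cluster-type: cluster_flow
    defrag: yes
    xdp-mode: driver
    xdp-filter-file: /etc/suricata/xdp_filter.bpf
    bypass: yes
    use-mmap: yes
    mmap-locked: yes
    tpacket-v3: yes
    ring-size: 300000
    block-size: 1048576
  - interface: default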

NIC setup


ethtool -l enp59s0
Channel parameters for enp59s0:
Pre-set maximums:
RX:             0
TX:             0
Other:          1
Combined:       112
Current hardware settings:
RX:             0
TX:             0
Other:          1
Combined:       1

Case 1

Run command:

suricata -vvv --af-packet -S /opt/rules/*.rules --pidfile=/var/run/suricata.pid -F /var/log/suricata/bpf-vlan.filter

# bpf filter
((ip and tcp port 80) or (vlan and tcp port 80))

So the "stream.wrong_thread" stay the same but "tcp.pkt_on_wrong_thread" are increasing


tail -F /var/log/suricata/stats.log  |grep -E "wrong_thread" 

stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5671362
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5713760
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5757011
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5798877
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5844993
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5890561
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5938549
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 5977920
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6018627
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6061152
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6100331
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6136587
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6174691
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6207062
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6245910
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6280238
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6319797
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6351413
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6389877
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6427672
stream.wrong_thread                        | Total                     | 5093
tcp.pkt_on_wrong_thread                    | Total                     | 6468473
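
To narrow down which worker threads the misrouted packets land on, per-thread counters can be turned on in the eve stats output (a sketch, assuming the eve-log stats type from the default suricata.yaml):

outputs:
  - eve-log:
      enabled: yes
      filetype: regular
      filename: eve.json
      types:
        - stats:
            totals: yes      # merged counters, as in stats.log
            threads: yes     # per-thread counters, to see which workers get the wrong-thread packets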

Case 2

Both "stream.wrong_thread" and "tcp.pkt_on_wrong_thread" are increasing.

Run command:

suricata -vvv --af-packet -S /opt/rules/*.rules --pidfile=/var/run/suricata.pid tcp port 80 --set vlan.use-for-tracking=false


tail -F /var/log/suricata/stats.log  |grep -E "wrong_thread" 

stream.wrong_thread                        | Total                     | 5415
tcp.pkt_on_wrong_thread                    | Total                     | 975303
stream.wrong_thread                        | Total                     | 5788
tcp.pkt_on_wrong_thread                    | Total                     | 1599922
stream.wrong_thread                        | Total                     | 6206
tcp.pkt_on_wrong_thread                    | Total                     | 2002613
stream.wrong_thread                        | Total                     | 6570
tcp.pkt_on_wrong_thread                    | Total                     | 2324993
stream.wrong_thread                        | Total                     | 6979
tcp.pkt_on_wrong_thread                    | Total                     | 2595832
stream.wrong_thread                        | Total                     | 7437
tcp.pkt_on_wrong_thread                    | Total                     | 2816437
stream.wrong_thread                        | Total                     | 7804
tcp.pkt_on_wrong_thread                    | Total                     | 3046528
stream.wrong_thread                        | Total                     | 8130
tcp.pkt_on_wrong_thread                    | Total                     | 3231108
stream.wrong_thread                        | Total                     | 8505
tcp.pkt_on_wrong_thread                    | Total                     | 3393051
stream.wrong_thread                        | Total                     | 8825
tcp.pkt_on_wrong_thread                    | Total                     | 3500183
stream.wrong_thread                        | Total                     | 9196
tcp.pkt_on_wrong_thread                    | Total                     | 3617008
stream.wrong_thread                        | Total                     | 9519
tcp.pkt_on_wrong_thread                    | Total                     | 3715475
stream.wrong_thread                        | Total                     | 9911
tcp.pkt_on_wrong_thread                    | Total                     | 3816075
stream.wrong_thread                        | Total                     | 10289
tcp.pkt_on_wrong_thread                    | Total                     | 3927234
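
For reference, the --set vlan.use-for-tracking=false used above maps to this suricata.yaml setting (shown here only as a sketch of the equivalent config):

vlan:
  use-for-tracking: false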

#9 Updated by Peter Manev about 1 month ago

Same behavior with AFP/Mellanox (MT27800 Family [ConnectX-5]).
The PF_RING(cluster_flow)/Intel combo also has the "stream.wrong_thread" and "tcp.pkt_on_wrong_thread" counters increasing, but it seems to be on a much smaller scale (0.00x% of the total packets) during the first test trials.
Continuing the investigation - I will keep updating.
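
For anyone who wants to repeat the PF_RING comparison, the run used flags along these lines (a sketch with placeholder interface and rule paths, not the exact command from the test):

suricata -vvv --pfring-int=enp59s0 --pfring-cluster-id=99 --pfring-cluster-type=cluster_flow -S /opt/rules/*.rules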

#10 Updated by Peter Manev about 1 month ago

I have been able to pinpoint a reproducible case where the "wrong_thread" counters are present and increasing, and a different test case where they are not present.
I have handed the test cases to Eric for feedback.

#11 Updated by Peter Manev 31 minutes ago

As an FYI, this is tested/reproducible with the following kernels.

Debian Stable with:

Linux suricata-ng01 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 GNU/Linux

Ubuntu LTS 16.04 with:

Linux suricata 4.18.6-amd64 #1 SMP Sat Sep 8 15:19:53 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux

Also reported on CentOS 7.6 on the user mailing list:
https://lists.openinfosecfoundation.org/pipermail/oisf-users/2019-January/016527.html

I am currently trying some test runs to see if it is possible to pinpoint a possible direction for the issue.
