Support #3778


AF_Packet Config Tweaks

Added by Taylor Walton almost 4 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Affected Versions:
Label:

Description

Hey Team,

I have currently deployed Suricata version 4.1.8 in three of my environments (Dev, Pre Prod, and Prod). The boxes are inline, using iptables to act as a gateway and forward all traffic they receive:

iptables -I FORWARD -j NFQUEUE

The boxes have 96 cores, two 10GbE NICs (one serving upstream traffic and one serving downstream traffic), and 92 GB of RAM, running CentOS 7 with kernel version 3.10.0-1062.9.1.el7.x86_64. I am using runmode: autofp, mpm-algo: ac, and spm-algo: auto.

The initial deployment of Suricata in my Dev and Pre Prod environments went well: there was no added network latency, the load on the boxes was average, I never saw CPU or memory spike, and Suricata seemed to balance across the 96 cores well. However, when I deployed in the Prod environment using the same configuration, heavy network latency was observed. This resulted in me killing the Suricata process and strictly using the box as a gateway (because it is inline) until I can improve Suricata's performance. What is odd is that I never saw the server's CPU or RAM spike in any way. Just looking at the box, one could not tell that Suricata was struggling to keep up with received packets, but looking at some bandwidth graphs, we could see latency rise to around 800 ms shortly after starting Suricata. Killing Suricata brought latency back down to around 1-2 ms.

To improve Suricata's performance I am looking into enabling Hyperscan and AF_PACKET. I believe I have figured out how to compile Suricata with Hyperscan, but I have some questions about AF_PACKET and how best to set up my NICs to load balance as much as possible. I have been reading the documentation here: https://suricata.readthedocs.io/en/suricata-5.0.3/performance/high-performance-config.html and am interested in some of those commands and how they pertain to my NICs.
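
As a reference point for myself, my understanding is that once the build has Hyperscan support, switching the matchers is just a suricata.yaml change along these lines (a sketch, not my current settings):

mpm-algo: hs   # multi-pattern matcher: Hyperscan instead of ac
spm-algo: hs   # single-pattern matcher: Hyperscan instead of auto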

In total, the NICs can support 32 RSS Queues:

ethtool -l eth15
    Channel parameters for eth15:
    Pre-set maximums:
    RX: 32
    TX: 32
    Other: 0
    Combined: 32
    Current hardware settings:
    RX: 0
    TX: 0
    Other: 0
    Combined: 8

Should I look into increasing this value from combined 8 to 32? Would this be appropriate, or would raising to 16 be sufficient?
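
If increasing it is the right move, the change I have in mind is something like this (the queue count is purely illustrative, and I assume it should match the number of Suricata threads pinned to the NIC):

ethtool -L eth15 combined 16
ethtool -l eth15   # verify the new current hardware settings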

The below settings are currently set for the interfaces:

ethtool -k eth15
    Features for eth15:
    rx-checksumming: on [fixed]
    tx-checksumming: on
    tx-checksum-ipv4: on
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: on
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
    scatter-gather: off
    tx-scatter-gather: off
    tx-scatter-gather-fraglist: off [fixed]
    tcp-segmentation-offload: off
    tx-tcp-segmentation: off
    tx-tcp-ecn-segmentation: off
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: off
    udp-fragmentation-offload: off
    generic-segmentation-offload: off
    generic-receive-offload: off
    large-receive-offload: off [fixed]
    rx-vlan-offload: on [fixed]
    tx-vlan-offload: on [fixed]
    ntuple-filters: on
    receive-hashing: on [fixed]
    highdma: on [fixed]
    rx-vlan-filter: on [fixed]
    vlan-challenged: off [fixed]
    tx-lockless: off [fixed]
    netns-local: off [fixed]
    tx-gso-robust: off [fixed]
    tx-fcoe-segmentation: off [fixed]
    tx-gre-segmentation: on
    tx-gre-csum-segmentation: on
    tx-ipxip4-segmentation: off [fixed]
    tx-ipxip6-segmentation: off [fixed]
    tx-udp_tnl-segmentation: on
    tx-udp_tnl-csum-segmentation: on
    tx-gso-partial: off [fixed]
    tx-sctp-segmentation: off [fixed]
    tx-esp-segmentation: off [fixed]
    tx-udp-segmentation: off [fixed]
    fcoe-mtu: off [fixed]
    tx-nocache-copy: off
    loopback: off [fixed]
    rx-fcs: off [fixed]
    rx-all: off [fixed]
    tx-vlan-stag-hw-insert: off [fixed]
    rx-vlan-stag-hw-parse: off [fixed]
    rx-vlan-stag-filter: off [fixed]
    l2-fwd-offload: off [fixed]
    hw-tc-offload: on
    esp-hw-offload: off [fixed]
    esp-tx-csum-hw-offload: off [fixed]
    rx-udp_tunnel-port-offload: on
    tls-hw-tx-offload: off [fixed]
    tls-hw-rx-offload: off [fixed]
    rx-gro-hw: off [requested on]
    tls-hw-record: off [fixed]

ethtool -x eth15
    RX flow hash indirection table for eth15 with 8 RX ring(s):
    0: 0 1 2 3 4 5 6 7
    8: 0 1 2 3 4 5 6 7
    16: 0 1 2 3 4 5 6 7
    24: 0 1 2 3 4 5 6 7
    32: 0 1 2 3 4 5 6 7
    40: 0 1 2 3 4 5 6 7
    48: 0 1 2 3 4 5 6 7
    56: 0 1 2 3 4 5 6 7
    64: 0 1 2 3 4 5 6 7
    72: 0 1 2 3 4 5 6 7
    80: 0 1 2 3 4 5 6 7
    88: 0 1 2 3 4 5 6 7
    96: 0 1 2 3 4 5 6 7
    104: 0 1 2 3 4 5 6 7
    112: 0 1 2 3 4 5 6 7
    120: 0 1 2 3 4 5 6 7
    RSS hash key:
    e6:90:0d:17:35:92:56:71:d6:e8:d7:96:6f:2e:03:e2:ce:f0:09:bc:c3:ca:14:0e:23:b2:42:4d:53:3e:d0:bf:87:db:ca:43:82:3c:0b:47
    RSS hash function:
    toeplitz: on
    xor: off
    crc32: off

I believe here I would need to change to a symmetric hash key and leave toeplitz on?
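
If so, my reading of the guide is that the command would look roughly like this (the low-entropy key length matches the 40 bytes this NIC reports, and the equal value would match whatever queue count I settle on):

ethtool -X eth15 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 16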

ethtool -g eth15
    Ring parameters for eth15:
    Pre-set maximums:
    RX: 8191
    RX Mini: 0
    RX Jumbo: 0
    TX: 8191
    Current hardware settings:
    RX: 1023
    RX Mini: 0
    RX Jumbo: 0
    TX: 8191

For the ring descriptor size, would you recommend increasing to 8191? The documentation suggests 1024, but this NIC could very well support a bigger ring.
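
If bumping it is worthwhile, the command I have in mind is simply (value illustrative, up to the 8191 maximum reported above):

ethtool -G eth15 rx 8191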

Then, in the af-packet section of suricata.yaml, would I need to set cluster_qm (I will need to upgrade my kernel to support it) in order to see performance increases, or would making the NIC adjustments shown above and enabling cluster_flow still give Suricata a performance boost?

Will I need to set the ring-size to match what the NIC can support (8191)? With a NIC configured like this, what would you recommend as a good buffer-size? I see the default is 32768, but I think I could increase that. What is the maximum you would suggest?

Do you recommend a checksum-validation mode of kernel, yes, no, or auto? What I do not want is for the kernel to drop packets whose checksums it cannot validate.
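
To make the questions above concrete, the kind of af-packet block I am picturing looks roughly like this; every value is a placeholder I would still need to tune, the peer interface name is hypothetical, and cluster_qm assumes the kernel upgrade:

af-packet:
  - interface: eth15
    threads: 16                 # would match the RSS queue count on the NIC
    cluster-id: 99
    cluster-type: cluster_qm    # or cluster_flow if I stay on the current kernel
    defrag: yes
    use-mmap: yes
    tpacket-v3: no              # matching what I have now
    ring-size: 200000           # placeholder, sized against available memory
    block-size: 1048576         # placeholder, must be a multiple of page size
    buffer-size: 65536          # placeholder for the buffer-size question above
    checksum-checks: kernel     # the option my checksum question refers to
    copy-mode: ips
    copy-iface: eth16           # hypothetical peer interface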

Thank you so much for your time and any insight you can provide! I am eager to make these tweaks to improve Suricata's performance so that it is no longer the bottleneck in my Production environment.

As always, let me know if you would like me to clear anything up or if more information is needed.

Thanks so much for the help!

Best Regards,

Taylor

Actions #1

Updated by Taylor Walton almost 4 years ago

Below are my CPU specs with NUMA node0 and node1; a rough threading/pinning sketch follows the output:

lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU: 96
    On-line CPU list: 0-95
    Thread(s) per core: 2
    Core(s) per socket: 24
    Socket(s): 2
    NUMA node(s): 2
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 85
    Model name: Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
    Stepping: 7
    CPU MHz: 1000.910
    CPU max MHz: 3700.0000
    CPU min MHz: 1000.0000
    BogoMIPS: 4200.00
    Virtualization: VT-x
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 1024K
    L3 cache: 36608K
    NUMA node0 CPU: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94
    NUMA node1 CPU: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
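
In case it helps, I was also looking at pinning workers per NUMA node through the threading section of suricata.yaml; this is only a sketch with placeholder core lists (node0 is the even-numbered cores per the output above):

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 0, 2 ]
    - worker-cpu-set:
        cpu: [ 4, 6, 8, 10, 12, 14, 16, 18 ]   # placeholder subset of node0 cores
        mode: "exclusive"
        prio:
          default: "high"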
Actions #2

Updated by Taylor Walton almost 4 years ago

Also, I see in the doc:

  • The MTU on both interfaces has to be equal: the copy from one interface to the other is direct, and packets bigger than the MTU will be dropped by the kernel.

Is the MTU a setting I specify within the af-packet section of suricata.yaml, or does this refer to the MTU configured on the interfaces themselves?
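
If it is the interface MTU, then I assume keeping them aligned is just something like this (interface names and MTU value are examples):

ip link show dev eth13 | grep mtu
ip link show dev eth17 | grep mtu
ip link set dev eth13 mtu 1500
ip link set dev eth17 mtu 1500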

Thanks,

Taylor

Actions #3

Updated by Andreas Herz almost 4 years ago

First of all, if you are talking about AF_PACKET inline, did you get rid of the NFQUEUE setup?
Those are two different approaches: the first does a direct copy from interface to interface, while the other uses the netfilter path and might be less performant. NFQUEUE supports several queues, which can help performance as well. And yes, go with Hyperscan and a newer kernel in general.
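
For reference, a multi-queue NFQUEUE setup is typically queue balancing in iptables plus one -q per queue when starting Suricata, roughly like this (queue numbers are just an example):

iptables -I FORWARD -j NFQUEUE --queue-balance 0:3
suricata -c /etc/suricata/suricata.yaml -q 0 -q 1 -q 2 -q 3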
What NIC do you use?

Actions #4

Updated by Taylor Walton almost 4 years ago

Hey Andreas,

Thanks for reaching out.

First of all, if you talk about AF_PACKET inline did you get rid of the NFQUEUE setup?

  • Yes, I am no longer running NFQUEUE in my dev environment. I cleared my iptables rules using iptables -F and started Suricata with --af-packet and these suricata.yaml settings:

af-packet = (null)
af-packet.0 = interface
af-packet.0.interface = eth17
af-packet.0.threads = auto
af-packet.0.defrag = yes
af-packet.0.cluster-type = cluster_flow
af-packet.0.cluster-id = 98
af-packet.0.ring-size = 2000
af-packet.0.copy-mode = ips
af-packet.0.copy-iface = eth13
af-packet.0.use-mmap = yes
af-packet.0.tpacket-v3 = no
af-packet.0.rollover = yes
af-packet.1 = interface
af-packet.1.interface = eth13
af-packet.1.threads = auto
af-packet.1.cluster-id = 99
af-packet.1.cluster-type = cluster_flow
af-packet.1.defrag = yes
af-packet.1.rollover = yes
af-packet.1.use-mmap = yes
af-packet.1.tpacket-v3 = no
af-packet.1.ring-size = 2000
af-packet.1.block-size = 4096
af-packet.1.copy-mode = ips
af-packet.1.copy-iface = eth17
af-packet.2 = interface
af-packet.2.interface = default

I am wondering if my understanding of AF_PACKET is wrong. My current architecture is (in simple terms) Internet -> Firewall -> Suricata -> Core Switch -> Internal Hosts, and the reverse for network traffic leaving the internal network back out to the Internet. Suricata is running on a CentOS 7 box with two 10GbE NICs (upstream and downstream). I am using OSPF on the Suricata box to receive routes from the firewall (above Suricata) and from the core switch (below Suricata). So you could say my box is serving as an advanced router, and if those routes on the box were lost, traffic would not flow through the network. I have another neighbor Suricata box that serves as a failover. Suricata 02 also receives routes via OSPF, but the firewall and the core switch use Suricata 01 as their default gateway as long as that box is up.

Would this type of architecture work for an AF_PACKET deployment? Reading some other message boards online, it seems that AF_PACKET converts my box into a bridge instead of a router? If that is the case, would my current deployment (described above) work with this AF_PACKET mode? If AF_PACKET mode makes the box a bridge, is it possible to configure the firewall and core switch to send their traffic to the bridged interfaces of the Suricata box, or does that clash between layer 2 and layer 3?

I also noticed an interesting issue when deploying this in my dev environment the other night. When trying to start both boxes in AF_PACKET mode, traffic would split-brain: the core would send traffic to both the Suricata 01 and Suricata 02 boxes. Could this be because the core switch saw the MAC addresses of both Suricata boxes, since this operates at layer 2, and sent traffic to all the MACs it could see?

If I need to fall back to an NFQUEUE mode, how much of a performance increase can be seen (just in your experience) by adding several queues?

Thank you so much for your time and insight.

Best Regards,

Taylor

Actions #5

Updated by Andreas Herz almost 4 years ago

Taylor Walton wrote in #note-4:

Would this type of architecture work for an AF_PACKET deployment? Reading some other message boards online, it seems that AF_PACKET converts my box into a bridge instead of a router?

AF_PACKET as you configured it just copies each packet it receives on eth17 to eth13 and the other way round; there is only forwarding, no routing or netfilter-related processing as with NFQUEUE.

If I need to fall back to an NFQU mode, how much of a performance increase can be seen (just in your experiences) by adding several queues?

Hard to tell but it should scale well

Actions #6

Updated by Andreas Herz about 2 years ago

  • Status changed from New to Closed

Hi, we're closing this issue since there have been no further responses.
If you think this issue is still relevant, try to test it again with the
most recent version of suricata and reopen the issue. If you want to
improve the bug report please take a look at
https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Reporting_Bugs
