Support #3778

closed

AF_Packet Config Tweaks

Added by Taylor Walton almost 4 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Affected Versions:
Label:

Description

Hey Team,

I currently have Suricata version 4.1.8 deployed in three of my environments (Dev, Pre Prod, and Prod). The boxes are inline and use iptables to act as gateways and forward all traffic they receive:

iptables -I FORWARD -j NFQUEUE

Each box has 96 CPU cores, two 10GbE NICs (one serving upstream traffic and one serving downstream traffic), and 92 GB of RAM. They run CentOS 7 with kernel version 3.10.0-1062.9.1.el7.x86_64. I am using runmode: autofp, mpm-algo: ac, and spm-algo: auto.

The initial deployment of Suricata in my Dev and Pre Prod environments went well. Suricata ran with no added network latency and an average load on the box. I never saw CPU or memory spike, and Suricata seemed to balance across the 96 cores well. However, when I deployed in the Prod environment using the same configuration, heavy network latency was observed. This resulted in me killing the Suricata app and using the box strictly as a gateway (because it is inline) until I can improve Suricata's performance. What is odd is that I never saw the server's CPU or RAM spike in any way. Just looking at the box, one could not tell that Suricata was struggling to keep up with received packets, but our bandwidth graphs showed network latency rising to around 800 ms shortly after starting Suricata. Killing Suricata resulted in latency hovering around 1-2 ms.

To improve Suricata's performance I am looking into enabling Hyperscan and AF_Packet. I believe I have figured out installing Hyperscan and compiling Suricata with it, but I have some questions about AF_Packet and how best to configure my NICs to load balance as much as possible. I have been reading the documentation here: https://suricata.readthedocs.io/en/suricata-5.0.3/performance/high-performance-config.html and am interested in some of those commands and how they pertain to my NICs.

In total, the NICs can support 32 RSS Queues:

    # ethtool -l eth15
    Channel parameters for eth15:
    Pre-set maximums:
    RX: 32
    TX: 32
    Other: 0
    Combined: 32
    Current hardware settings:
    RX: 0
    TX: 0
    Other: 0
    Combined: 8

Should I look into increasing this value from combined 8 to 32? Would this be appropriate, or would raising to 16 be sufficient?
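For reference, the change being asked about could be sketched as below. This is a dry run under stated assumptions: `set_channels_cmd` is a hypothetical helper, 16 is only one of the two candidate values from the question, and the upper bound of 32 is taken from the `ethtool -l` output above. It prints the `ethtool -L` command rather than running it, since applying it needs root and briefly resets the interface on many drivers.

```shell
#!/bin/sh
# Hypothetical helper: print the ethtool command that sets the combined
# RSS queue count. 32 is the "Combined" pre-set maximum reported above.
MAX_COMBINED=32

set_channels_cmd() {  # $1 = interface, $2 = requested combined queues
    if [ "$2" -lt 1 ] || [ "$2" -gt "$MAX_COMBINED" ]; then
        echo "queue count $2 is outside 1..$MAX_COMBINED" >&2
        return 1
    fi
    echo "ethtool -L $1 combined $2"
}

# Dry run: prints the command instead of executing it.
set_channels_cmd eth15 16
```

Whichever value is chosen, the same count would presumably need to be applied to both the upstream and downstream NIC.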

The below settings are currently set for the interfaces:

    # ethtool -k eth15
    Features for eth15:
    rx-checksumming: on [fixed]
    tx-checksumming: on
    tx-checksum-ipv4: on
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: on
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
    scatter-gather: off
    tx-scatter-gather: off
    tx-scatter-gather-fraglist: off [fixed]
    tcp-segmentation-offload: off
    tx-tcp-segmentation: off
    tx-tcp-ecn-segmentation: off
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: off
    udp-fragmentation-offload: off
    generic-segmentation-offload: off
    generic-receive-offload: off
    large-receive-offload: off [fixed]
    rx-vlan-offload: on [fixed]
    tx-vlan-offload: on [fixed]
    ntuple-filters: on
    receive-hashing: on [fixed]
    highdma: on [fixed]
    rx-vlan-filter: on [fixed]
    vlan-challenged: off [fixed]
    tx-lockless: off [fixed]
    netns-local: off [fixed]
    tx-gso-robust: off [fixed]
    tx-fcoe-segmentation: off [fixed]
    tx-gre-segmentation: on
    tx-gre-csum-segmentation: on
    tx-ipxip4-segmentation: off [fixed]
    tx-ipxip6-segmentation: off [fixed]
    tx-udp_tnl-segmentation: on
    tx-udp_tnl-csum-segmentation: on
    tx-gso-partial: off [fixed]
    tx-sctp-segmentation: off [fixed]
    tx-esp-segmentation: off [fixed]
    tx-udp-segmentation: off [fixed]
    fcoe-mtu: off [fixed]
    tx-nocache-copy: off
    loopback: off [fixed]
    rx-fcs: off [fixed]
    rx-all: off [fixed]
    tx-vlan-stag-hw-insert: off [fixed]
    rx-vlan-stag-hw-parse: off [fixed]
    rx-vlan-stag-filter: off [fixed]
    l2-fwd-offload: off [fixed]
    hw-tc-offload: on
    esp-hw-offload: off [fixed]
    esp-tx-csum-hw-offload: off [fixed]
    rx-udp_tunnel-port-offload: on
    tls-hw-tx-offload: off [fixed]
    tls-hw-rx-offload: off [fixed]
    rx-gro-hw: off [requested on]
    tls-hw-record: off [fixed]

    # ethtool -x eth15
    RX flow hash indirection table for eth15 with 8 RX ring(s):
    0: 0 1 2 3 4 5 6 7
    8: 0 1 2 3 4 5 6 7
    16: 0 1 2 3 4 5 6 7
    24: 0 1 2 3 4 5 6 7
    32: 0 1 2 3 4 5 6 7
    40: 0 1 2 3 4 5 6 7
    48: 0 1 2 3 4 5 6 7
    56: 0 1 2 3 4 5 6 7
    64: 0 1 2 3 4 5 6 7
    72: 0 1 2 3 4 5 6 7
    80: 0 1 2 3 4 5 6 7
    88: 0 1 2 3 4 5 6 7
    96: 0 1 2 3 4 5 6 7
    104: 0 1 2 3 4 5 6 7
    112: 0 1 2 3 4 5 6 7
    120: 0 1 2 3 4 5 6 7
    RSS hash key:
    e6:90:0d:17:35:92:56:71:d6:e8:d7:96:6f:2e:03:e2:ce:f0:09:bc:c3:ca:14:0e:23:b2:42:4d:53:3e:d0:bf:87:db:ca:43:82:3c:0b:47
    RSS hash function:
    toeplitz: on
    xor: off
    crc32: off

I believe here I would need to change to a symmetric hash key and leave toeplitz on?
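A sketch of what that could look like, assuming the driver and the installed ethtool accept the `hkey` argument of `ethtool -X` (not verified for this NIC or this CentOS 7 build). The repeating 6D:5A pattern is the commonly used symmetric Toeplitz key; the length of 40 bytes matches the key printed above, and `equal 16` is a placeholder that should match the chosen queue count. `symmetric_hkey` is a hypothetical helper, and the command is only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: build a symmetric Toeplitz key of $1 bytes by
# repeating the 6D:5A byte pair.
symmetric_hkey() {
    bytes="$1"; key=""; i=0
    while [ "$i" -lt "$bytes" ]; do
        if [ $((i % 2)) -eq 0 ]; then b="6D"; else b="5A"; fi
        key="${key:+$key:}$b"   # join bytes with ':' without a leading colon
        i=$((i + 1))
    done
    echo "$key"
}

# Dry run: 40 bytes matches the key length shown by `ethtool -x eth15` above;
# "equal 16" spreads the indirection table over 16 queues (placeholder value).
echo "ethtool -X eth15 hkey $(symmetric_hkey 40) equal 16"
```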

    # ethtool -g eth15
    Ring parameters for eth15:
    Pre-set maximums:
    RX: 8191
    RX Mini: 0
    RX Jumbo: 0
    TX: 8191
    Current hardware settings:
    RX: 1023
    RX Mini: 0
    RX Jumbo: 0
    TX: 8191

For the ring descriptor size, would you recommend increasing to 8191? The documentation suggests 1024, but this NIC may very well support a bigger ring size.
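The two candidate settings can be compared with a dry run like the one below. `ring_cmd` is a hypothetical helper that clamps the requested value to the RX pre-set maximum of 8191 from the output above and prints the resulting `ethtool -G` command. It does not execute anything, and it makes no claim about which value performs better.

```shell
#!/bin/sh
# RX pre-set maximum reported by `ethtool -g eth15` above.
RX_MAX=8191

# Hypothetical helper: print the command that sets the RX descriptor ring.
ring_cmd() {  # $1 = interface, $2 = requested RX descriptors
    rx="$2"
    if [ "$rx" -gt "$RX_MAX" ]; then
        rx="$RX_MAX"          # clamp to what the NIC advertises
    fi
    echo "ethtool -G $1 rx $rx"
}

ring_cmd eth15 1024   # the value suggested in the documentation
ring_cmd eth15 8191   # the NIC's advertised maximum
```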

Then, in the AF_Packet section of suricata.yaml, would I need to set cluster_qm (I will need to upgrade my kernel to support it) in order to see performance increases, or would making the adjustments on the NICs as shown above and enabling cluster_flow still give Suricata a performance boost?

Will I need to set the ring-size to match what the NIC can support (8191)? For a NIC with these settings, what would you recommend as a good buffer-size? I see the default is set to 32768, but I think I could increase that. What is the maximum you would suggest?

Do you recommend a checksum-validation mode of kernel, yes, no, or auto? What I do not want to happen is have the kernel drop packets if it cannot resolve a checksum.
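To make the options in question concrete, here is a sketch of the af-packet fragment they would live in, printed from a small shell helper. Every value is a placeholder or an assumption, not a recommendation: `af_packet_fragment` is hypothetical, `cluster-id: 99` and `ring-size: 2048` are illustrative, and the checksum option is written as `checksum-checks`, which is the key name used in the af-packet section as far as I know. Inline (IPS) use would need further settings that are not shown here.

```shell
#!/bin/sh
# Hypothetical helper: print an af-packet fragment for suricata.yaml.
# $1 = interface, $2 = threads, $3 = cluster type
af_packet_fragment() {
    cat <<EOF
af-packet:
  - interface: $1
    threads: $2              # with cluster_qm, commonly matched to the RSS queue count
    cluster-id: 99           # illustrative value
    cluster-type: $3         # cluster_flow or cluster_qm
    defrag: yes
    use-mmap: yes
    ring-size: 2048          # placeholder, in packets; separate from the NIC descriptor ring
    buffer-size: 32768       # the default mentioned above
    checksum-checks: auto    # accepted values: kernel, yes, no, auto
EOF
}

af_packet_fragment eth15 16 cluster_flow
```

Switching the third argument to `cluster_qm` produces the variant that depends on the newer kernel.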

Thank you so much for your time and any insight you can provide! I am eager to make these tweaks to improve Suricata's performance so that it is no longer the bottleneck within my Production environment.

As always, let me know if you would like me to clear anything up or if more information is needed.

Thanks so much for the help!

Best Regards,

Taylor
