Task #3318


Research: NUMA awareness

Added by Victor Julien over 4 years ago. Updated 10 months ago.

Status: New
Priority: Normal
Assignee:
Target version:
Effort:
Difficulty:
Label:

Description

In several talks at SuriCon we've seen that the best performance is achieved when the NIC and Suricata are on the same NUMA node, and that Suricata should be limited to that node.

Even in a multi-NIC scenario, Suricata will likely not perform well when running across multiple nodes at once, as global data structures like the flow table are then frequently accessed and updated over the interconnect.

Evaluate what strategies exist.
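
As a starting point for that evaluation, here is a minimal sketch (not Suricata code) of how the NIC-to-node mapping could be discovered: the kernel exposes the PCI device's NUMA node in sysfs, and libnuma can list the CPUs on that node. The interface name "eth0" is a placeholder.

```
/* Sketch: find the NUMA node of a NIC via sysfs and list the CPUs on
 * that node with libnuma. "eth0" is a placeholder interface name.
 * Build with: gcc -o nicnode nicnode.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return 1;
    }

    /* The kernel reports the node the PCI device is attached to;
     * -1 means single-node or not reported by the platform. */
    int node = -1;
    FILE *f = fopen("/sys/class/net/eth0/device/numa_node", "r");
    if (f != NULL) {
        if (fscanf(f, "%d", &node) != 1)
            node = -1;
        fclose(f);
    }
    printf("NIC is on NUMA node %d\n", node);

    if (node < 0)
        return 0;

    /* List the CPUs that belong to that node: these are the CPUs the
     * worker threads would be limited to. */
    struct bitmask *cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) == 0) {
        for (unsigned int i = 0; i < cpus->size; i++) {
            if (numa_bitmask_isbitset(cpus, i))
                printf("cpu %u\n", i);
        }
    }
    numa_free_cpumask(cpus);
    return 0;
}
```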

Reading material:
https://www.akkadia.org/drepper/cpumemory.pdf
https://stackoverflow.com/a/47714514/2756873


Related issues (2 open, 0 closed)

Related to Suricata - Task #3288: Suricon 2019 brainstorm (Assigned, Victor Julien)
Related to Suricata - Task #3695: research: libhwloc for better autoconfiguration (Assigned, Shivani Bhardwaj)
#1

Updated by Victor Julien over 4 years ago

  • Related to Task #3288: Suricon 2019 brainstorm added
#2

Updated by Victor Julien over 4 years ago

  • Tracker changed from Feature to Task
  • Subject changed from numa awareness to Research: NUMA awareness
#3

Updated by Victor Julien over 4 years ago

Several possible subtasks come to mind:
  1. making configuration easier: take NUMA into account when configuring CPU affinity. Currently a list of CPUs has to be provided, which can be tedious and error-prone. libnuma could help with identifying the CPUs that belong to a node (see the sketch after this list).
  2. assign memory to specific nodes: the default allocation behaviour (at least on Linux) seems to already be that the allocating thread allocates memory on its own node. For packets we already do this correctly, with packet pools initialized per thread, in the thread. But for example the flow spare queue is global and the flows in it are initially alloc'd from the main thread, and later updated from the flow manager. This means these flows will likely be unbalanced and lean towards one node more than others. Creating per-thread flow spare queues could be one way to address this. Similarly for other 'pools' like stream segments, sessions, etc.
  3. duplicate data structures per node. Not sure yet if this is a good strategy, but the idea is that something like the flow table or detect engine would have a copy per node to guarantee locality. In a properly functioning flow table this should be clean, as the flows should stay on the same thread (=CPU). For the detection engine this would essentially duplicate its memory use per node. Unless loading is done in parallel, start-up time would also increase.
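
To make the first two items more concrete, below is a minimal libnuma sketch, not existing Suricata code: node_of_worker_cpu() and alloc_flow_block_on_node() are illustrative names for mapping a CPU to its node (item 1) and for backing a hypothetical per-node pool such as a flow spare queue with node-local memory (item 2).

```
/* Sketch of the libnuma calls behind subtasks 1 and 2; the function
 * names are illustrative, not existing Suricata code.
 * Build with: gcc -o numasketch numasketch.c -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

/* Subtask 1: instead of asking the user for an explicit CPU list,
 * derive the node a worker CPU belongs to. */
static int node_of_worker_cpu(int cpu)
{
    return numa_node_of_cpu(cpu);   /* -1 on error */
}

/* Subtask 2: back a per-node pool (e.g. a flow spare queue) with
 * memory that is guaranteed to live on that node. */
static void *alloc_flow_block_on_node(size_t size, int node)
{
    /* numa_alloc_onnode() rounds up to page size and places the
     * pages on the requested node. */
    return numa_alloc_onnode(size, node);
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    int cpus  = numa_num_configured_cpus();
    printf("%d node(s), %d cpu(s)\n", nodes, cpus);

    for (int cpu = 0; cpu < cpus; cpu++)
        printf("cpu %d -> node %d\n", cpu, node_of_worker_cpu(cpu));

    /* One hypothetical 1 MiB flow block per node. */
    for (int n = 0; n < nodes; n++) {
        void *block = alloc_flow_block_on_node(1 << 20, n);
        if (block != NULL)
            numa_free(block, 1 << 20);
    }
    return 0;
}
```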
#4

Updated by Andreas Herz over 4 years ago

  • Assignee set to OISF Dev
  • Target version set to TBD
#5

Updated by Victor Julien over 4 years ago

  • Description updated (diff)
  • Status changed from New to Assigned
  • Assignee changed from OISF Dev to Victor Julien
#6

Updated by Andreas Herz over 4 years ago

Do we also have more insight into how this affects the management threads, for example? Could we at least move those to a different node to keep the other CPU cores free for the heavy tasks?

#7

Updated by Victor Julien over 4 years ago

They would probably have to run on the same node as the traffic, where the memory for that traffic is owned, to avoid accessing locks over the interconnect.
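A rough sketch of what that could look like, assuming the node id is already known from the capture side: the calling thread (e.g. the flow manager) is restricted to the CPUs of that node via libnuma and pthread affinity. pin_thread_to_node() is an illustrative name, not an existing Suricata function.

```
/* Sketch: pin the calling (e.g. flow manager) thread to the CPUs of a
 * given NUMA node so lock and flow-table accesses stay node-local.
 * Build with: gcc -o pinnode pinnode.c -lnuma -lpthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <numa.h>

static int pin_thread_to_node(int node)
{
    struct bitmask *cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) != 0) {
        numa_free_cpumask(cpus);
        return -1;
    }

    /* Translate the node's CPU mask into a cpu_set_t. */
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned int i = 0; i < cpus->size; i++) {
        if (numa_bitmask_isbitset(cpus, i))
            CPU_SET(i, &set);
    }
    numa_free_cpumask(cpus);

    /* Restrict the current thread to that node's CPUs. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    if (numa_available() < 0)
        return 1;
    /* Node 0 as a placeholder; in practice the node of the capture NIC. */
    return pin_thread_to_node(0) == 0 ? 0 : 1;
}
```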

#8

Updated by Victor Julien over 3 years ago

  • Related to Task #3695: research: libhwloc for better autoconfiguration added
#9

Updated by Victor Julien 10 months ago

  • Status changed from Assigned to New
  • Assignee changed from Victor Julien to OISF Dev

@Lukas Sismis, since you've been doing a bit of NUMA work for DPDK, I wonder if you have some thoughts on the topic.

