Optimization #1039: Packetpool should be a stack - Suricata - Open Information Security Foundation

Optimization #1039

closed

Packetpool should be a stack

Added by Ken Steele over 11 years ago. Updated almost 11 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Ken Steele

Target version:

2.1beta1

Description

It is better to store free Packets on a LIFO stack, rather than a FIFO queue. This way recently used Packets, which might still be in cache, are reused more quickly.
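A minimal sketch of the LIFO idea, assuming a hypothetical stripped-down `DemoPacket` type and an intrusive singly linked list. The names `DemoPacket`, `DemoStack`, `DemoStackPush` and `DemoStackPop` are illustrative only and are not taken from tmqh-packetpool.c. The most recently freed packet is the first one handed out again, which is the one most likely to still be in cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down packet; not the real Suricata Packet. */
typedef struct DemoPacket {
    struct DemoPacket *next; /* intrusive link used while the packet is free */
    int id;
} DemoPacket;

/* Free list kept as a stack: push and pop both work on the head. */
typedef struct DemoStack {
    DemoPacket *head;
} DemoStack;

/* Return a packet to the free stack (O(1)). */
static void DemoStackPush(DemoStack *s, DemoPacket *p)
{
    assert(s != NULL && p != NULL);
    p->next = s->head;
    s->head = p;
}

/* Take the most recently freed packet, or NULL if the stack is empty. */
static DemoPacket *DemoStackPop(DemoStack *s)
{
    assert(s != NULL);
    DemoPacket *p = s->head;
    if (p != NULL) {
        s->head = p->next;
        p->next = NULL;
    }
    return p;
}
```

With a FIFO queue the packet handed out would instead be the one that has been idle the longest, so its memory is the least likely to still be cached.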

Updated by Ken Steele over 11 years ago

The code is in tmqh-packetpool.c

Updated by Victor Julien over 11 years ago

Target version set to 3.0RC2

Updated by Song Liu over 11 years ago

Assignee set to Song Liu

Updated by Ken Steele over 11 years ago

I would recommend having a per-thread free stack of packets that is accessed only by its owning thread, so it needs no mutex. To allow threads to free a Packet back to the free stack of another thread, use a second "return-stack", protected by a mutex. When the thread's local free stack is empty, it can lock the return-stack and move all the packets to its local free stack.

This requires that each Packet record the thread on which it was allocated, but that can be stored in one byte for up to 256 threads.
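The scheme described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `SketchPacket`, `SketchPool`, `SketchPoolInit`, `SketchPoolGet` and `SketchPoolReturn` are hypothetical names, the pools are assumed to live in an array indexed by the one-byte `owner` field, and POSIX mutexes stand in for whatever locking primitive the real code uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet: records its allocating thread's pool in one byte. */
typedef struct SketchPacket {
    struct SketchPacket *next;
    uint8_t owner; /* pool index; one byte limits this to 256 pools */
} SketchPacket;

typedef struct SketchPool {
    SketchPacket *local_head;    /* touched only by the owning thread, no lock */
    pthread_mutex_t return_lock; /* protects return_head */
    SketchPacket *return_head;   /* packets freed by other threads */
} SketchPool;

static void SketchPoolInit(SketchPool *pool)
{
    pool->local_head = NULL;
    pool->return_head = NULL;
    pthread_mutex_init(&pool->return_lock, NULL);
}

/* Called only by the owning thread. */
static SketchPacket *SketchPoolGet(SketchPool *pool)
{
    if (pool->local_head == NULL) {
        /* Local stack is empty: take the whole return stack in one locked step. */
        pthread_mutex_lock(&pool->return_lock);
        pool->local_head = pool->return_head;
        pool->return_head = NULL;
        pthread_mutex_unlock(&pool->return_lock);
    }
    SketchPacket *p = pool->local_head;
    if (p != NULL) {
        pool->local_head = p->next;
        p->next = NULL;
    }
    return p;
}

/* Called by any thread that has a pool; self is the caller's own pool index. */
static void SketchPoolReturn(SketchPool *pools, uint8_t self, SketchPacket *p)
{
    SketchPool *home = &pools[p->owner];
    if (p->owner == self) {
        /* Freed by the allocating thread: lock-free push on the local stack. */
        p->next = home->local_head;
        home->local_head = p;
    } else {
        /* Freed by another thread: push on the home pool's return stack. */
        pthread_mutex_lock(&home->return_lock);
        p->next = home->return_head;
        home->return_head = p;
        pthread_mutex_unlock(&home->return_lock);
    }
}
```

The owning thread takes the lock only when its local stack runs dry, and it then moves every returned packet at once, so the cost of the lock is spread over many packets.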

Updated by Victor Julien over 11 years ago

I agree, Ken. There is one common use case to consider: the autofp runmodes. In this case the packet will almost certainly be freed by a thread other than the one that allocated it.

Updated by Victor Julien over 11 years ago

Btw, a while ago I played with this code: https://github.com/inliniac/suricata/pull/845. At the time I was investigating slowdowns. It seemed like we could experience 'pseudo packet storms', where the packet processing virtually stopped. The goal of this queue experiment was not to reduce locking, but to reduce lock contention. IIRC it worked well. We might consider an approach like this for the 'return stack', so that we never get serious contention there.

Updated by Anoop Saldanha over 11 years ago

If we are planning to use the LIFO approach, for cuda we might need another "still-in-use" kind of return stack. In cuda, once I send the packet over to the gpu, on the cpu side I might not need the results from the gpu and so pass the packet back to the packetpool, even though the gpu still holds a reference to it. If we then reuse this packet from the packetpool, the decoder would have to wait until the gpu frees the packet up.

Cuda now only works with autofp, so by default we would end up using the return-stack. The thread might need to check the "in-use-by-gpu" flag on each packet before transferring it back to its free stack, or the thread could accept a gpu wait hit in the decoder and move all of them back to its "free" packet pool. Either way, keeping a sufficiently large number of packets in the free stack would give the gpu enough time to free the packet up.

An additional advantage I see with LIFO, again keeping cuda in mind, is that we would no longer be constrained by the 65k-packet limit we have now. We can provide additional queue types to support > 65k packets, but LIFO seems easier.

Updated by Victor Julien over 11 years ago

That sounds like an architecture problem in the CUDA code then. We shouldn't be putting packets back into the pool if they are still referenced elsewhere. Think we can exclude this from the general packet stack discussion and need to address it separately.

Updated by Anoop Saldanha over 11 years ago

Right, the cuda-packet-return issue lies outside the packetpool.

From the cuda perspective though, the advantage of a LIFO packetpool is that it makes it much easier to have more than 65k packets per packetpool than other methods such as multiple queues.

Updated by Song Liu over 11 years ago

In worker mode (or single mode), even the return-stack does not need a mutex. Actually, the return-stack is not necessary in worker mode at all, as a single thread handles each packet from beginning to end. So the question comes down to whether we should handle this per mode, or use two per-thread stacks for all modes.

But one byte for up to 256 threads might not be enough. Tilera already supported up to 288 cores, and I bet it will support more later.

Updated by Peter Manev over 11 years ago

I also think 256 threads might not be enough.
Is it a lot of effort to redesign (increase) that number?

Updated by Ken Steele over 11 years ago

Assignee changed from Song Liu to Ken Steele
% Done changed from 0 to 90
Estimated time set to 8.00 h

Fixed in Pull 913 (https://github.com/inliniac/suricata/pull/913).

Updated by Ken Steele over 11 years ago

Instead of using an index byte or short, which would have limited the number of stacks, the Packet has a pointer to the stack. This allows any thread, even one without its own PacketPool, to free packets.
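A rough sketch of the pointer-based variant, not the code from Pull 913: `PtrPacket`, `PtrPool`, `PtrPoolInit` and `PtrPacketRelease` are hypothetical names, and a POSIX mutex is assumed for the return stack. Because the packet carries a pointer to its home pool, the releasing thread needs neither a pool index nor a pool of its own.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct PtrPool;

/* Hypothetical packet that remembers its home pool by pointer. */
typedef struct PtrPacket {
    struct PtrPacket *next;
    struct PtrPool *pool; /* set once, when the packet is allocated */
} PtrPacket;

typedef struct PtrPool {
    pthread_mutex_t return_lock; /* protects return_head */
    PtrPacket *return_head;      /* packets handed back by any thread */
} PtrPool;

static void PtrPoolInit(PtrPool *pool)
{
    pool->return_head = NULL;
    pthread_mutex_init(&pool->return_lock, NULL);
}

/* Callable from any thread, including one that owns no pool:
 * the destination is read from the packet itself. */
static void PtrPacketRelease(PtrPacket *p)
{
    assert(p != NULL && p->pool != NULL);
    PtrPool *home = p->pool;
    pthread_mutex_lock(&home->return_lock);
    p->next = home->return_head;
    home->return_head = p;
    pthread_mutex_unlock(&home->return_lock);
}
```

A pointer costs more bytes per Packet than a one-byte index, but it removes the cap on the number of pools and the need for a global pool table.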
