Optimization #1039

Packetpool should be a stack

Added by Ken Steele about 7 years ago. Updated over 6 years ago.

Target version:


It is better to store free Packets on a LIFO stack, rather than a FIFO queue. This way recently used Packets, which might still be in cache, are reused more quickly.


Updated by Ken Steele about 7 years ago

The code is in tmqh-packetpool.c


Updated by Victor Julien about 7 years ago

  • Target version set to 3.0RC2

Updated by Song Liu about 7 years ago

  • Assignee set to Song Liu

Updated by Ken Steele almost 7 years ago

I would recommend having a per thread free stack of packets that is only accessed by the thread, thus not needing a mutex. To allow threads to free a Packet back to the free stack on another thread, use a second "return-stack", protected by a mutex. When the thread's local free stack is empty, it can lock the return-stack and move all the packets to its local free stack.

This requires that each Packet record the thread on which is was allocated, but that can be stored in one byte for up to 256 threads.


Updated by Victor Julien almost 7 years ago

I agree Ken. There is one common use case to consider, the autofp runmodes. In this case the packet will almost certainly be freed by another thread than the one that alloc'd it.


Updated by Victor Julien almost 7 years ago

Btw, a while ago I played with this code:, at the time I was investigating slowdowns. It seemed like we could experience 'pseudo packet storms', where the packet processing virtually stopped. The goal of this queue experiment was not to reduce locking, but reduce lock contention. IIRC it worked well. Might consider an approach like this for the 'return stack', so that we'll never get serious contention there.


Updated by Anoop Saldanha almost 7 years ago

If we are planning ton use the LIFO approach, for cuda we might need another "still-in-use" kind of return stack. In cuda once I send the packet over to the gpu, on the cpu side I might not need the results from the gpu and pass the packet back to the packetpool, despite the gpu holding a reference to this packet. If we reuse this packet from the packetpool, inside the decoder we would wait till the gpu frees this packet up.

Cuda now only works with autofp, so be default we would end up using the return-stack, but the thread might need to check for the "in-use-by-gpu" flag on the packet before transfering it back to its free stack pool or maybe the thread is ready to take a gpu wait hit in decoder, and move all of them back to its "free" packet pool. Either ways assigning a sufficiently huge no in the free stack would give the gpu enough time to free the packet up.

An additional advantage I see with LIFO is we won't be constrained by 65k packets we are constrained now, again keeping cuda in mind. We can provide additional queue types to support > 65k packets, but LIFO seems easier.


Updated by Victor Julien almost 7 years ago

That sounds like an architecture problem in the CUDA code then. We shouldn't be putting packets back into the pool if they are still referenced elsewhere. Think we can exclude this from the general packet stack discussion and need to address it separately.


Updated by Anoop Saldanha almost 7 years ago

Right, the cuda-packet-return issue lies outside the packetpool.

From cuda perspective though, the advantage with LIFO packetpool is that it's much easier to have more than 65k packets per packetpool, than use other methods like multiple queues.


Updated by Song Liu almost 7 years ago

In worker mode(or single mode), even the return-stack does not need a mutex. Actually return-stack is not necessary in worker mode, as only one thread to handle from the beginning to end. Therefore the question comes down to whether we should handle this based on each mode, or use two per-thread-stacks for all modes?

But one byte for up to 256 threads might not be enough. Tilera already supported up to 288 cores, and I bet it will support more later.


Updated by Peter Manev almost 7 years ago

I also think 256 threads might not be enough.
Is it a lot of effort to redesign (increase) that number?


Updated by Ken Steele almost 7 years ago

  • Assignee changed from Song Liu to Ken Steele
  • % Done changed from 0 to 90
  • Estimated time set to 8.00 h

Updated by Ken Steele almost 7 years ago

Instead of using an index byte or short, which would have limited the number of stacks. The Packet has a pointer to the stack, which then allows any thread, even one without its own PacketPool, to free packets.


Updated by Victor Julien over 6 years ago

  • Status changed from New to Closed
  • % Done changed from 90 to 100

Updated by Victor Julien over 6 years ago

  • Target version changed from 3.0RC2 to 2.1beta1

Also available in: Atom PDF