Bug #4785

af-packet: threads sometimes get stuck in capture

Added by Victor Julien 11 months ago. Updated 11 months ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Affected Versions:
Effort:
Difficulty:
Label: Needs backport to 5.0, Needs backport to 6.0

Description

I can reproduce a case where af-packet threads stop receiving traffic even though traffic is still being pushed to them. The poll() call in the Suricata af-packet code times out, returns 0, and never recovers. Drop stats for the thread increase rapidly because no packets are read. It's as if the socket is somehow getting confused. No errno is set and there is no kernel message in dmesg.
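For readers who want to spot the same symptom, here is a minimal sketch of counting back-to-back poll() timeouts on a descriptor. The helper name count_consecutive_timeouts and its parameters are hypothetical and not part of Suricata. It only illustrates what "poll() returns 0 and never recovers" looks like from the caller's side.

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical helper (not Suricata code): poll `fd` up to `attempts` times
 * and return how many calls in a row timed out (poll() returned 0) before
 * either an event or an error was reported. */
int count_consecutive_timeouts(int fd, int attempts, int timeout_ms)
{
    assert(attempts >= 0);

    int timeouts = 0;
    for (int i = 0; i < attempts; i++) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        int r = poll(&pfd, 1, timeout_ms);
        if (r != 0) {
            /* r > 0: an event is pending; r < 0: error, errno is set. */
            break;
        }
        timeouts++; /* r == 0: timeout, nothing readable and no error */
    }
    return timeouts;
}
```

A capture thread whose count keeps growing while the NIC and drop counters show traffic arriving would match the behaviour described above.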

The issue will appear consistently within half an hour of a sustained t-rex test. The test I'm using is simple:

victor@z420:/opt/trex/v2.92$ sudo ./t-rex-64 -f cap2/http_very_long.yaml -c 1 -m 11111 -d 7200 --active-flows 512 -p --cfg ../cfg_ids_10gb_x520_tr1.yaml

Suricata runs on another machine:
./src/suricata -c ids.yaml --af-packet -l /var/log/suricata/ -S bypass.rules -vv --set flow.hash-size=1000000

af-packet config:
af-packet:
  - interface: enp65s0f1
    cluster-id: 20

  - interface: default
    threads: 16
    cluster-type: cluster_flow
    defrag: yes
    use-mmap: yes
    mmap-locked: no
    #tpacket-v3: yes
    ring-size: 8192
    block-size: 262144

NIC stats in ethtool show that the traffic continues to be well balanced.

Kernel: Linux tr1 5.11.0-38-generic #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
NIC: Intel X710, i40e driver


Related issues 2 (0 open, 2 closed)

Copied to Bug #4789: af-packet: threads sometimes get stuck in capture (Closed, Victor Julien)
Copied to Bug #4790: af-packet: threads sometimes get stuck in capture (Closed, Victor Julien)
Actions #1

Updated by Victor Julien 11 months ago

  • Status changed from New to In Progress
  • Assignee set to Victor Julien
  • Target version set to 7.0rc1
  • Label Needs backport to 5.0, Needs backport to 6.0 added

This looks like a bug in our tpacket-v2 implementation. Testing a fix.

Actions #2

Updated by Victor Julien 11 months ago

It looks like this issue appears in more than one way. During startup, the AFPReadAndDiscardFromRing function, which is called in a loop from AFPSynchronizeStart, checks each position in the ring for a timestamp value, regardless of whether we're supposed to look at the packet (tp_status != TP_STATUS_KERNEL).
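As a rough illustration of the ownership check described above, the sketch below walks a simulated ring and looks at a frame's timestamp only after confirming that the frame has been handed to user space. The struct, the FRAME_STATUS_* constants (stand-ins for TP_STATUS_KERNEL and TP_STATUS_USER), and the function name discard_until are all hypothetical simplifications. This is not the actual Suricata code or the fix that was applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's TP_STATUS_KERNEL / TP_STATUS_USER values. */
#define FRAME_STATUS_KERNEL 0u
#define FRAME_STATUS_USER   1u

/* Simplified, hypothetical frame header; the real tpacket2_hdr has more fields. */
struct frame_hdr {
    uint32_t status; /* who owns the frame */
    uint32_t sec;    /* timestamp, only meaningful when user space owns the frame */
};

/* Release frames older than sync_sec, starting at *pos.
 * Stops at the first frame still owned by the kernel, or at the first frame
 * whose timestamp is at or after sync_sec. Returns the number of frames
 * handed back to the kernel. */
size_t discard_until(struct frame_hdr *ring, size_t ring_len,
                     size_t *pos, uint32_t sync_sec)
{
    assert(ring_len > 0 && *pos < ring_len);

    size_t released = 0;
    while (released < ring_len) {
        struct frame_hdr *h = &ring[*pos];

        /* Check ownership first: a kernel-owned frame may hold stale data. */
        if (h->status == FRAME_STATUS_KERNEL)
            break;

        /* Only now is it safe to read the timestamp. */
        if (h->sec >= sync_sec)
            break;

        h->status = FRAME_STATUS_KERNEL; /* hand the frame back */
        *pos = (*pos + 1) % ring_len;
        released++;
    }
    return released;
}
```

The point of the ordering is that a kernel-owned slot can still carry a timestamp from an earlier pass through the ring, so reading it without the status check can lead the reader to act on a frame it does not own.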

To reproduce this easily: replay at max speed, small ring size, many threads. This leads to the ring overflowing during the sync start logic.

I'm still not sure I fully understand what is happening, but I have fixes that appear to resolve the issue reliably.

Actions #3

Updated by Jeff Lucovsky 11 months ago

  • Copied to Bug #4789: af-packet: threads sometimes get stuck in capture added
Actions #4

Updated by Jeff Lucovsky 11 months ago

  • Copied to Bug #4790: af-packet: threads sometimes get stuck in capture added
Actions #5

Updated by Victor Julien 11 months ago

  • Status changed from In Progress to Closed