Bug #8667 » README.md

Shane Dugan, 06/17/2026 04:15 PM

 

AF-PACKET inline-IPS startup race: sendto(socket 0) ENOTSOCK wedges forwarding on cold restart

Summary

A concurrency bug introduced in Suricata 8.0.0 by commit "af-packet: speed up thread sync during startup" lets a worker thread in inline-IPS copy-mode call sendto() on a peer socket file descriptor that has not yet been published. The call fails with ENOTSOCK, and the inline forwarding path for that interface pair is permanently wedged.

Affected versions

  • Suricata 8.0.0+ (any version containing the "speed up thread sync" optimization)
  • Not present in Suricata 7.0.x

Symptom

Runtime warning in Suricata logs:

<iface>_RX: sending packet failed on socket 0: Socket operation on non-socket

Followed by a permanent TX-ok / RX-dead asymmetry at shutdown stats:

SFE_0_TX: packets: 1298782314
SFE_0_RX: packets: 0

A healthy inline-IPS engine accumulates packets on both sides of a peer pair. Under live traffic, an RX interface reporting packets: 0 is the definitive indicator that the forwarding path never came up. The bug fires only on a cold restart (SIGTERM followed by a relaunch); the steady-state AFPTryReopen path is unaffected.

Root cause

The optimization (introduced in 8.0.0)

Commit "af-packet: speed up thread sync during startup" moved AFPPeersListReachedInc() — the turn-barrier bump that releases the next thread — from after AFPCreateSocket() returns to inside AFPCreateSocket() immediately after bind(), before AFPSetupRing() and before AFPSwitchState(AFP_STATE_UP). The intent was to allow all threads to run the expensive packet-ring mmap in parallel rather than sequentially:

/* bind() done, allow next thread to continue */
if (peer_update) {
    AFPPeersListReachedInc();   /* ← barrier bumped here in v8 */
}
ret = AFPSetupRing(ptv, devname);   /* ring mmap, can be slow */
...
AFPSwitchState(ptv, AFP_STATE_UP);  /* ← peer socket fd published HERE */

What broke

AFPSynchronizeStart() releases forwarding threads when AFPPeersListStarted() returns true, i.e. when peerslist.turn == 0. In Suricata 7, turn == 0 implied every peer had completed AFPCreateSocket() — including AFPSwitchState(AFP_STATE_UP) which publishes the socket fd. In Suricata 8, turn == 0 only implies every peer has completed bind(). The fd is published later, after AFPSetupRing().

AFPWritePacket() — the forward path — reads peer->socket and calls sendto() with no check on peer->state:

socket = SC_ATOMIC_GET(p->afp_v.peer->socket);  /* still 0 if peer not yet UP */
if (sendto(socket, ...) < 0) {
    if (SC_ATOMIC_ADD(p->afp_v.peer->send_errors, 1) == 0) {
        SCLogWarning("%s: sending packet failed on socket %d: %s",
                     p->afp_v.peer->iface, socket, strerror(errno));
    }
}

When peer->socket is still 0 (zero-initialized; peer hasn't reached AFPSwitchState(AFP_STATE_UP) yet), sendto(0, ...) returns ENOTSOCK. After the first ENOTSOCK the warning is silenced by the send_errors rate-limit, so every subsequent packet is silently dropped with no further log output. The engine never recovers — AFPTryReopen is only triggered by read-side errors, not write-side errors.
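Both mechanisms described above can be observed outside Suricata. The following is a minimal sketch, not Suricata code. It calls sendto() on the write end of a pipe, which stands in for descriptor 0 because both are open descriptors that are not sockets, and it uses C11 atomic_fetch_add as an assumed stand-in for SC_ATOMIC_ADD to show why only the first failure is logged. The function names send_on_non_socket and simulate_failed_sends are hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Calls sendto() on the write end of a pipe: an open descriptor that is
 * not a socket. Returns the resulting errno, 0 if the call unexpectedly
 * succeeded, or -1 if the pipe could not be created. */
int send_on_non_socket(void)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    const char payload[] = "x";
    errno = 0;
    ssize_t n = sendto(fds[1], payload, sizeof(payload), 0, NULL, 0);
    int err = (n < 0) ? errno : 0;
    close(fds[0]);
    close(fds[1]);
    return err;
}

/* Simulates `failures` consecutive send errors guarded by a log-once
 * counter. atomic_fetch_add returns the value held before the increment,
 * so the "== 0" test is true for the first failure only. Returns the
 * number of warnings emitted. */
int simulate_failed_sends(int failures)
{
    atomic_uint send_errors;
    atomic_init(&send_errors, 0u);
    int warnings = 0;
    for (int i = 0; i < failures; i++) {
        if (atomic_fetch_add(&send_errors, 1u) == 0u) {
            fprintf(stderr, "sending packet failed: %s\n", strerror(ENOTSOCK));
            warnings++;
        }
    }
    return warnings;
}
```

On Linux, send_on_non_socket() returns ENOTSOCK, and simulate_failed_sends(1000) emits a single warning. This matches the single log line followed by silent drops in the symptom.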

Race timeline

Thread A (forwards to B's socket)       Thread B (publishes B's fd)
─────────────────────────────────       ────────────────────────────
bind()                                  bind()
AFPPeersListReachedInc()  ──────┐       AFPPeersListReachedInc()
                         (last bump → turn=0)
AFPSetupRing()  (fast)          │       AFPSetupRing()  (slow: fragmented memory)
AFPSynchronizeStart(): turn==0  │                │
RELEASED ←──────────────────────┘                │  (still in ring setup,
read peer->socket == 0                           │   socket NOT yet published)
sendto(0,...) → ENOTSOCK                         │
"socket 0: Socket operation                      │
 on non-socket"                                  │
                                        AFPSwitchState(AFP_STATE_UP)
                                        → publishes B's fd  ← too late

The race window is the interval between the barrier release and the peer's AFPSwitchState(AFP_STATE_UP). Its dominant cost is AFPSetupRing() (packet-ring mmap). On hosts with long uptime and memory fragmentation, this window widens, increasing the probability of the race.
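The timeline can be replayed deterministically in a single thread. The sketch below is a simplified model under stated assumptions, not the Suricata barrier: the types sim_peer and sim_barrier, the function socket_seen_at_release, and the descriptor values 7 and 8 are all hypothetical. It steps two peers through startup in the order shown above and reports the socket value thread A would read from peer B at the moment A is released.

```c
#include <stdbool.h>

enum sim_state { SIM_DOWN = 0, SIM_UP = 1 };

struct sim_peer {
    int socket;              /* 0 until published */
    enum sim_state state;
};

struct sim_barrier {
    int expected;            /* threads that must check in */
    int reached;             /* threads that have checked in */
};

void sim_barrier_reach(struct sim_barrier *b) { b->reached++; }

bool sim_barrier_released(const struct sim_barrier *b)
{
    return b->reached >= b->expected;
}

void sim_peer_publish(struct sim_peer *p, int fd)
{
    p->socket = fd;
    p->state = SIM_UP;
}

/* early_release = true models the 8.0 ordering (barrier bumped right
 * after bind); false models the 7.0 ordering (barrier bumped only after
 * the fd is published). Returns the value of peer B's socket as seen by
 * thread A when A is released. */
int socket_seen_at_release(bool early_release)
{
    struct sim_barrier b = { .expected = 2, .reached = 0 };
    struct sim_peer peer_a = {0}, peer_b = {0};

    /* Both threads have completed bind(). */
    if (early_release) {
        sim_barrier_reach(&b);
        sim_barrier_reach(&b);
    }

    /* Thread A finishes its fast ring setup and publishes its fd. */
    sim_peer_publish(&peer_a, 7);
    if (!early_release) {
        sim_barrier_reach(&b);
    }

    /* Thread B is still in its slow ring setup. If A is already
     * released, it reads B's socket now. */
    if (sim_barrier_released(&b)) {
        return peer_b.socket;
    }

    /* Thread B finishes ring setup and publishes its fd. */
    sim_peer_publish(&peer_b, 8);
    if (!early_release) {
        sim_barrier_reach(&b);
    }
    return sim_barrier_released(&b) ? peer_b.socket : -1;
}
```

With early_release set, the function returns 0, the unpublished value that leads to sendto(0, ...). With the 7.0 ordering it returns 8, the published descriptor.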

Key publication ordering (why the fix is race-free)

AFPPeerUpdate() — the only place peer->socket is set — writes the two atomics in a fixed order:

SC_ATOMIC_SET(ptv->mpeer->socket, ptv->socket);   /* written first */
SC_ATOMIC_SET(ptv->mpeer->state, ptv->afp_state); /* written second */

Therefore: observing peer->state == AFP_STATE_UP guarantees that peer->socket has already been published. This ordering makes the proposed fix race-free.
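The ordering argument can be written as a small publish/observe pair with C11 atomics. This is a sketch, assuming SC_ATOMIC_SET and SC_ATOMIC_GET provide at least release and acquire ordering respectively; the explicit memory orders, the type pub_peer and the function names are illustrative choices, not taken from the Suricata source.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { PUB_STATE_DOWN = 0, PUB_STATE_UP = 1 };

struct pub_peer {
    atomic_int socket;
    atomic_int state;
};

void pub_peer_init(struct pub_peer *p)
{
    atomic_init(&p->socket, 0);
    atomic_init(&p->state, PUB_STATE_DOWN);
}

/* Writer: fd first, state second. The release store on state makes the
 * earlier socket store visible to any reader whose acquire load of state
 * observes PUB_STATE_UP. */
void pub_peer_publish(struct pub_peer *p, int fd)
{
    atomic_store_explicit(&p->socket, fd, memory_order_relaxed);
    atomic_store_explicit(&p->state, PUB_STATE_UP, memory_order_release);
}

/* Reader: state first, socket second. Returns false and leaves *fd_out
 * untouched while the peer is not yet up. */
bool pub_peer_get_socket(struct pub_peer *p, int *fd_out)
{
    if (atomic_load_explicit(&p->state, memory_order_acquire) != PUB_STATE_UP) {
        return false;
    }
    *fd_out = atomic_load_explicit(&p->socket, memory_order_relaxed);
    return true;
}
```

A reader that checks state before reading socket can never see the zero-initialized descriptor, which is the property the guard relies on. Reversing either the two stores or the two loads would reopen the window.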

Proposed fix (Fix A — recommended)

Add a single peer-state guard at the top of AFPWritePacket(), before the socket is read:

static void AFPWritePacket(Packet *p, int version)
{
    /* ... latency stats ... */

    /* Guard: peer fd not yet published during startup window.
     * state is published *after* socket in AFPPeerUpdate(), so
     * observing AFP_STATE_UP guarantees the fd is valid. */
    if (SC_ATOMIC_GET(p->afp_v.peer->state) != AFP_STATE_UP) {
        return;
    }

    if (p->afp_v.copy_mode == AFP_COPY_MODE_IPS) {
        if (PacketCheckAction(p, ACTION_DROP)) {
            return;
        }
    }
    /* ... rest of function unchanged ... */
}

This preserves the parallel ring-setup optimization while closing the race window. Packets arriving during the window are dropped cleanly; the peer socket does not exist yet, so they would have been lost to ENOTSOCK anyway. The guard stops dropping packets as soon as the peer reaches AFP_STATE_UP.

Alternative fix (Fix B — restores v7 ordering)

Move AFPPeersListReachedInc() back to after AFPSwitchState(AFP_STATE_UP) inside AFPCreateSocket():

    /* Init is ok */
    AFPSwitchState(ptv, AFP_STATE_UP);
    /* fd published — now allow next thread to proceed */
    if (peer_update) {
        AFPPeersListReachedInc();
    }
    return 0;

This restores the Suricata 7 invariant (turn == 0 → all peer fds published) at the cost of re-serializing AFPSetupRing() across threads, giving up the startup-time optimization.
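The startup cost of Fix B can be estimated with a back-of-the-envelope model. The sketch below is an idealized illustration, assuming ring setup dominates startup and that parallel setups overlap completely; the function names and the per-thread timings passed in are hypothetical.

```c
#include <stddef.h>

/* Ring-setup time when threads run AFPSetupRing() one after another
 * (7.0 ordering, Fix B): the sum of the per-thread times. */
unsigned long serialized_setup_us(const unsigned long *setup_us, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += setup_us[i];
    }
    return total;
}

/* Ring-setup time when all threads overlap fully (8.0 ordering, Fix A):
 * the slowest single thread. */
unsigned long parallel_setup_us(const unsigned long *setup_us, size_t n)
{
    unsigned long slowest = 0;
    for (size_t i = 0; i < n; i++) {
        if (setup_us[i] > slowest) {
            slowest = setup_us[i];
        }
    }
    return slowest;
}
```

For four threads that each need 500000 µs, the serialized model gives 2000000 µs against 500000 µs in parallel. That difference is the optimization Fix B gives up and Fix A keeps.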

Reproduction results

Reproduced on a Linux dev host running Suricata 8.0.3 with 6 AF-PACKET interface pairs
(dummy/veth substrate, copy-mode: ips, runmode workers, 2 threads per interface)
and SURICATA_RING_SETUP_DELAY_US=500000 (deterministic window widener applied via
widener.patch):

Metric                            Value
Restart cycles                    10
Cycles reaching "Engine started"  8
Cycles with ENOTSOCK              7 (87.5% of the 8 started cycles)
Total ENOTSOCK lines              28

Sample log lines:

[W#01-SFE_3_TX] Warning: af-packet: SFE_3_RX: sending packet failed on socket 0: Socket operation on non-socket
[W#01-SFE_4_TX] Warning: af-packet: SFE_4_RX: sending packet failed on socket 0: Socket operation on non-socket
[W#01-SFE_5_TX] Warning: af-packet: SFE_5_RX: sending packet failed on socket 0: Socket operation on non-socket
[W#02-SFE_5_TX] Warning: af-packet: SFE_5_RX: sending packet failed on socket 0: Socket operation on non-socket

TX/RX asymmetry on a wedged engine (last cycle):

SFE_0_TX:  packets: 3,727,852   SFE_0_RX:  packets: 279
SFE_1_TX:  packets: 19,696,196  SFE_1_RX:  packets: 279
SFE_4_TX:  packets: 19,874,739  SFE_4_RX:  packets: 279
SFE_5_TX:  packets: 36,548,604  SFE_5_RX:  packets: 558

The RX interfaces are stuck at 279–558 packets (only the frames received before the race fired).
All subsequent forwarded traffic is silently dropped.

Timing: ENOTSOCK fires at the same second as or 1 second before "Engine started" in every
case, confirming the race fires during the startup window between bind() +
AFPPeersListReachedInc() and AFPSwitchState(AFP_STATE_UP).

Fix verification: After adding the AFP_STATE_UP guard to AFPWritePacket() (Fix A),
10/10 restart cycles produced zero ENOTSOCK lines with the 500ms widener still active.

Reproduction

  • runmode: workers
  • copy-mode: ips
  • interfaces: paired veth or physical (SFE_i_TX ↔ SFE_i_RX in copy-mode IPS)
  • threads: ≥ 2 per interface pair
  • pairs: ≥ 5-6 (increases hit rate)
  • CPU affinity: do NOT pin all workers to one core (serializes setup, hides race)

Steps

  1. Start continuous traffic into the read side of the interface pairs.
  2. With traffic flowing, loop cold restarts: SIGTERM Suricata and relaunch it immediately.
  3. Watch startup logs for the ENOTSOCK signature.
  4. After "Engine started." appears, check the per-interface packet counters for SFE_N_RX: packets: 0.
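Steps 3 and 4 can be automated with two small line classifiers. This is a sketch, assuming counter lines begin with the interface name and follow the "<iface>: packets: <count>" form shown in the Symptom section, with no log prefix and no thousands separators. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if the log line carries the ENOTSOCK-on-socket-0 signature. */
bool is_enotsock_signature(const char *line)
{
    return strstr(line, "sending packet failed on socket 0: "
                        "Socket operation on non-socket") != NULL;
}

/* Parses a counter line of the form "<iface>: packets: <count>".
 * On success stores the interface name and count and returns true. */
bool parse_packet_counter(const char *line, char iface[64],
                          unsigned long long *count)
{
    return sscanf(line, " %63[^:]: packets: %llu", iface, count) == 2;
}

/* True if the line reports an interface whose name ends in "_RX" and
 * whose packet counter is zero. */
bool is_dead_rx(const char *line)
{
    char iface[64];
    unsigned long long count = 0;
    if (!parse_packet_counter(line, iface, &count)) {
        return false;
    }
    size_t len = strlen(iface);
    return len >= 3 && strcmp(iface + len - 3, "_RX") == 0 && count == 0;
}
```

Feeding each line of the startup log to is_enotsock_signature() and each line of the shutdown stats to is_dead_rx() flags a wedged cycle without manual inspection.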

Expected behavior

The packets counters of both SFE_N_TX and SFE_N_RX grow under traffic after startup.

Actual behavior

One or more SFE_N_RX interfaces show packets: 0 permanently. The warning:

SFE_N_RX: sending packet failed on socket 0: Socket operation on non-socket

appears in the log between "AF_PACKET IPS mode activated" and "Engine started.", then is silenced by the rate-limit while dropping continues indefinitely.

Deterministic reproduction with window widener (test-only)

Add a usleep at the top of AFPSetupRing() gated behind an environment variable. This holds the existing race window open deterministically without creating the bug. With ~6 interface pairs and the widener enabled, the ENOTSOCK signature reproduced on 20/20 cold-start iterations.
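A widener of this kind might look like the sketch below. It is not the contents of widener.patch: the function names are hypothetical, the only name taken from this report is the SURICATA_RING_SETUP_DELAY_US variable, and nanosleep() is used in place of usleep() so that delays of one second or more are handled uniformly.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for nanosleep() */
#endif
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Parses a delay in microseconds. Returns 0 for NULL, empty, signed,
 * malformed, or out-of-range input, so the widener stays off unless it
 * is enabled with a valid value. */
unsigned long parse_delay_us(const char *value)
{
    if (value == NULL || value[0] < '0' || value[0] > '9') {
        return 0;
    }
    char *end = NULL;
    errno = 0;
    unsigned long us = strtoul(value, &end, 10);
    if (errno != 0 || *end != '\0') {
        return 0;
    }
    return us;
}

/* Test-only hook: sleeps for the configured delay, if any. Meant to be
 * called at the top of the ring setup to hold the existing window open. */
void ring_setup_test_delay(void)
{
    unsigned long us = parse_delay_us(getenv("SURICATA_RING_SETUP_DELAY_US"));
    if (us == 0) {
        return;
    }
    struct timespec ts;
    ts.tv_sec = (time_t)(us / 1000000UL);
    ts.tv_nsec = (long)(us % 1000000UL) * 1000L;
    nanosleep(&ts, NULL);
}
```

Because the hook does nothing when the variable is unset or invalid, a build carrying it behaves like an unpatched build unless the variable is exported before launch.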

Upstream commit introducing the regression

Commit: "923ad6af" — "af-packet: speed up thread sync during startup"

Message excerpt:

The ring setup doesn't need to be done sequentially. This patch releases the thread early, after bind but before the ring setups.

The optimization is correct in intent. The oversight is that "release after bind" also means "release before AFPSwitchState(AFP_STATE_UP)," which is the only place the peer socket fd is published. No compensating guard was added to AFPWritePacket().

Files

  • README.md — this document
  • bug-report.textile — condensed one-paragraph summary for Redmine ticket body