# AF-PACKET inline-IPS startup race: `sendto(socket 0)` ENOTSOCK wedges forwarding on cold restart

## Summary

A concurrency bug introduced in Suricata 8.0.0 by commit "af-packet: speed up thread sync during startup" causes a worker thread in inline-IPS copy-mode to call `sendto()` on a peer socket file descriptor that has not yet been published, producing an `ENOTSOCK` error and permanently wedging the inline forwarding path for that interface pair.

## Affected versions

- Suricata 8.0.0+ (any version containing the "speed up thread sync" optimization)
- **Not present in Suricata 7.0.x**

## Symptom

Runtime warning in Suricata logs:

```
<iface>_RX: sending packet failed on socket 0: Socket operation on non-socket
```

Followed by a permanent TX-ok / RX-dead asymmetry in the shutdown stats:

```
SFE_0_TX: packets: 1298782314
SFE_0_RX: packets: 0
```

A healthy inline-IPS engine accumulates packets on both sides of a peer pair. An RX counter that stays at `packets: 0` under live traffic, or that stops growing after a handful of startup frames, indicates that the forwarding path never came up. The bug only fires on a cold restart (SIGTERM + relaunch); steady-state `AFPTryReopen` is unaffected.

## Root cause

### The optimization (introduced in 8.0.0)

Commit "af-packet: speed up thread sync during startup" moved `AFPPeersListReachedInc()` — the turn-barrier bump that releases the next thread — from **after** `AFPCreateSocket()` returns to **inside** `AFPCreateSocket()` immediately after `bind()`, before `AFPSetupRing()` and before `AFPSwitchState(AFP_STATE_UP)`. The intent was to allow all threads to run the expensive packet-ring `mmap` in parallel rather than sequentially:

```c
/* bind() done, allow next thread to continue */
if (peer_update) {
    AFPPeersListReachedInc();   /* ← barrier bumped here in v8 */
}
ret = AFPSetupRing(ptv, devname);   /* ring mmap, can be slow */
...
AFPSwitchState(ptv, AFP_STATE_UP);  /* ← peer socket fd published HERE */
```

### What broke

`AFPSynchronizeStart()` releases forwarding threads when `AFPPeersListStarted()` returns true, i.e. when `peerslist.turn == 0`. In Suricata 7, `turn == 0` implied every peer had completed `AFPCreateSocket()` — including `AFPSwitchState(AFP_STATE_UP)` which publishes the socket fd. In Suricata 8, `turn == 0` only implies every peer has completed `bind()`. The fd is published later, after `AFPSetupRing()`.

`AFPWritePacket()` — the forward path — reads `peer->socket` and calls `sendto()` with no check on `peer->state`:

```c
socket = SC_ATOMIC_GET(p->afp_v.peer->socket);  /* still 0 if peer not yet UP */
if (sendto(socket, ...) < 0) {
    if (SC_ATOMIC_ADD(p->afp_v.peer->send_errors, 1) == 0) {
        SCLogWarning("%s: sending packet failed on socket %d: %s",
                     p->afp_v.peer->iface, socket, strerror(errno));
    }
}
```

When `peer->socket` is still 0 (zero-initialized, because the peer has not yet reached `AFPSwitchState(AFP_STATE_UP)`), `sendto(0, ...)` fails with `errno` set to `ENOTSOCK`: descriptor 0 is normally the process's standard input, not a socket. After the first failure the warning is suppressed by the `send_errors` rate limit, so every subsequent packet is dropped with no further log output. The engine never recovers, because `AFPTryReopen` is triggered only by read-side errors, not write-side errors.
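
The two effects described above can be checked in isolation. The following is a minimal sketch, not Suricata code: `sendto_errno_on_non_socket()` and `warnings_logged()` are hypothetical helpers, `/dev/null` stands in for the non-socket descriptor 0, and the counter gate assumes fetch-and-add semantics (the old value is returned), which is what the `== 0` comparison in the excerpt implies.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdatomic.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the errno produced by sendto() on a descriptor that is open
 * but is not a socket. /dev/null stands in for the unpublished "fd 0". */
int sendto_errno_on_non_socket(void)
{
    int fd = open("/dev/null", O_WRONLY);
    assert(fd >= 0);

    const char payload[] = "x";
    errno = 0;
    ssize_t n = sendto(fd, payload, sizeof(payload), 0, NULL, 0);
    int err = (n < 0) ? errno : 0;   /* capture before close() can change it */

    close(fd);
    return err;
}

/* Hypothetical stand-in for the send_errors gate: only the caller that
 * observes the counter at 0 logs. Returns how many of `failures`
 * consecutive send failures would produce a warning. */
unsigned warnings_logged(unsigned failures)
{
    atomic_uint send_errors = 0;
    unsigned logged = 0;

    for (unsigned i = 0; i < failures; i++) {
        if (atomic_fetch_add(&send_errors, 1) == 0) {
            logged++;
        }
    }
    return logged;
}
```

Under these assumptions, any number of failed sends yields a single warning, which matches the one-line-then-silence pattern in the logs.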

### Race timeline

```
Thread A (forwards to B's socket)       Thread B (publishes B's fd)
─────────────────────────────────       ────────────────────────────
bind()                                  bind()
AFPPeersListReachedInc()  ──────┐       AFPPeersListReachedInc()
                         (last bump → turn=0)
AFPSetupRing()  (fast)          │       AFPSetupRing()  (slow: fragmented memory)
AFPSynchronizeStart(): turn==0  │                │
RELEASED ←──────────────────────┘                │  (still in ring setup,
read peer->socket == 0                           │   socket NOT yet published)
sendto(0,...) → ENOTSOCK                         │
"socket 0: Socket operation                      │
 on non-socket"                                  │
                                        AFPSwitchState(AFP_STATE_UP)
                                        → publishes B's fd  ← too late
```

The race window is the interval between the barrier release and the peer's `AFPSwitchState(AFP_STATE_UP)`. Its dominant cost is `AFPSetupRing()` (packet-ring `mmap`). On hosts with long uptime and memory fragmentation, this window widens, increasing the probability of the race.
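
The interleaving in the timeline can be replayed deterministically in a single-threaded model. This is a simplified sketch under stated assumptions: `barrier_model`, `reached_inc()`, `started()` and `peer_socket_seen_by_a()` are hypothetical names, the barrier is reduced to a countdown, and the real `peerslist` bookkeeping may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the startup barrier; the names below are
 * illustrative and are not Suricata identifiers. */
enum barrier_order { ORDER_V7, ORDER_V8 };

struct peer_model {
    int socket;              /* 0 until the fd is published */
};

struct barrier_model {
    int turn;                /* bumps still outstanding; 0 means "started" */
};

void reached_inc(struct barrier_model *b)
{
    if (b->turn > 0) {
        b->turn--;
    }
}

bool started(const struct barrier_model *b)
{
    return b->turn == 0;
}

/* Replays the timeline: A is fast, B is still in ring setup when A polls.
 * Returns the value of B's socket field that A would pass to sendto(). */
int peer_socket_seen_by_a(enum barrier_order order)
{
    struct barrier_model barrier = { .turn = 2 };
    struct peer_model b = { 0 };
    const int fd_b = 8;      /* arbitrary example descriptor */

    /* Thread A: bind(), fast ring setup, state UP (A's own fd not modelled). */
    if (order == ORDER_V8) reached_inc(&barrier);   /* v8: bump after bind() */
    if (order == ORDER_V7) reached_inc(&barrier);   /* v7: bump after UP */

    /* Thread B: bind() done, ring setup still in progress. */
    if (order == ORDER_V8) reached_inc(&barrier);   /* v8: last bump */

    /* Thread A polls the barrier while B is still mapping its ring. */
    if (started(&barrier)) {
        return b.socket;     /* released early: B's fd is still 0 */
    }

    /* Thread B finishes ring setup and publishes its fd. */
    b.socket = fd_b;
    if (order == ORDER_V7) reached_inc(&barrier);

    assert(started(&barrier));
    return b.socket;         /* released only after publication */
}
```

In this model the v8 ordering hands thread A a socket value of 0, while the v7 ordering cannot release A until B's descriptor has been stored.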

### Key publication ordering (why the fix is race-free)

`AFPPeerUpdate()` — the only place `peer->socket` is set — writes the two atomics in a fixed order:

```c
SC_ATOMIC_SET(ptv->mpeer->socket, ptv->socket);   /* written first */
SC_ATOMIC_SET(ptv->mpeer->state, ptv->afp_state); /* written second */
```

Therefore, a reader that loads `peer->state` first and observes `AFP_STATE_UP` is **guaranteed** to see the published `peer->socket`, provided the atomic accessors give at least release/acquire ordering. This ordering makes the proposed fix race-free.
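
The publication pattern can be sketched with plain C11 atomics. This is an illustration under assumptions, not the Suricata implementation: `struct peer`, `peer_publish()` and `peer_socket_if_up()` are hypothetical names, and the memory orders are spelled out explicitly because this document does not show what ordering `SC_ATOMIC_SET` and `SC_ATOMIC_GET` provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { STATE_DOWN = 0, STATE_UP = 1 };   /* hypothetical stand-ins */

struct peer {
    atomic_int socket;
    atomic_int state;
};

/* Writer side: store the socket first, then the state. The release store
 * on `state` makes the earlier socket store visible to any thread that
 * later observes STATE_UP through an acquire load. */
void peer_publish(struct peer *p, int fd)
{
    assert(p != NULL);
    atomic_store_explicit(&p->socket, fd, memory_order_relaxed);
    atomic_store_explicit(&p->state, STATE_UP, memory_order_release);
}

/* Reader side: load the state first, then the socket. Returns -1 while
 * the peer is not UP, so the caller never sees an unpublished fd. */
int peer_socket_if_up(struct peer *p)
{
    assert(p != NULL);
    if (atomic_load_explicit(&p->state, memory_order_acquire) != STATE_UP) {
        return -1;
    }
    return atomic_load_explicit(&p->socket, memory_order_relaxed);
}
```

The reader must check the state before it loads the socket; reversing the two loads would reopen the window even with correct writer ordering.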

## Proposed fix (Fix A — recommended)

Add a single peer-state guard at the top of `AFPWritePacket()`, before the socket is read:

```c
static void AFPWritePacket(Packet *p, int version)
{
    /* ... latency stats ... */

    /* Guard: peer fd not yet published during startup window.
     * state is published *after* socket in AFPPeerUpdate(), so
     * observing AFP_STATE_UP guarantees the fd is valid. */
    if (SC_ATOMIC_GET(p->afp_v.peer->state) != AFP_STATE_UP) {
        return;
    }

    if (p->afp_v.copy_mode == AFP_COPY_MODE_IPS) {
        if (PacketCheckAction(p, ACTION_DROP)) {
            return;
        }
    }
    /* ... rest of function unchanged ... */
```

This preserves the parallel ring-setup optimization while closing the race window. Packets arriving during the window are dropped cleanly (the peer socket does not exist yet, so they would have been lost anyway via ENOTSOCK). The guard stops dropping packets as soon as the peer reaches `AFP_STATE_UP`.

## Alternative fix (Fix B — restores v7 ordering)

Move `AFPPeersListReachedInc()` back to after `AFPSwitchState(AFP_STATE_UP)` inside `AFPCreateSocket()`:

```c
    /* Init is ok */
    AFPSwitchState(ptv, AFP_STATE_UP);
    /* fd published — now allow next thread to proceed */
    if (peer_update) {
        AFPPeersListReachedInc();
    }
    return 0;
```

This restores the Suricata 7 invariant (`turn == 0` → all peer fds published) at the cost of re-serializing `AFPSetupRing()` across threads, giving up the startup-time optimization.

## Reproduction results

Reproduced on a Linux dev host running Suricata 8.0.3 with 6 AF-PACKET interface pairs
(dummy/veth substrate, `copy-mode: ips`, `runmode workers`, 2 threads per interface)
and `SURICATA_RING_SETUP_DELAY_US=500000` (deterministic window widener applied via
`widener.patch`):

| Metric | Value |
|---|---|
| Restart cycles | 10 |
| Cycles reaching "Engine started" | 8 |
| Cycles with ENOTSOCK | 7 (87.5% of the 8 started cycles) |
| Total ENOTSOCK lines | 28 |

Sample log lines:
```
[W#01-SFE_3_TX] Warning: af-packet: SFE_3_RX: sending packet failed on socket 0: Socket operation on non-socket
[W#01-SFE_4_TX] Warning: af-packet: SFE_4_RX: sending packet failed on socket 0: Socket operation on non-socket
[W#01-SFE_5_TX] Warning: af-packet: SFE_5_RX: sending packet failed on socket 0: Socket operation on non-socket
[W#02-SFE_5_TX] Warning: af-packet: SFE_5_RX: sending packet failed on socket 0: Socket operation on non-socket
```

TX/RX asymmetry on a wedged engine (last cycle):
```
SFE_0_TX:  packets: 3,727,852   SFE_0_RX:  packets: 279
SFE_1_TX:  packets: 19,696,196  SFE_1_RX:  packets: 279
SFE_4_TX:  packets: 19,874,739  SFE_4_RX:  packets: 279
SFE_5_TX:  packets: 36,548,604  SFE_5_RX:  packets: 558
```

The RX interfaces are stuck at 279–558 packets, which are only the frames received before
the race fired. All subsequent forwarded traffic is silently dropped for the rest of the run.

Timing: in every case the ENOTSOCK warning is logged in the same second as "Engine started"
or one second before it, which is consistent with the race firing in the startup window
between `bind()` + `AFPPeersListReachedInc()` and `AFPSwitchState(AFP_STATE_UP)`.

**Fix verification:** After adding the `AFP_STATE_UP` guard to `AFPWritePacket()` (Fix A),
10/10 restart cycles produced zero ENOTSOCK lines with the 500 ms widener still active.

## Reproduction

- **runmode:** `workers`
- **copy-mode:** `ips`
- **interfaces:** paired veth or physical (`SFE_i_TX` ↔ `SFE_i_RX` in copy-mode IPS)
- **threads:** ≥ 2 per interface pair
- **pairs:** ≥ 5 (more pairs increase the hit rate; the run above used 6)
- **CPU affinity:** do NOT pin all workers to one core (serializes setup, hides race)

### Steps

1. Start continuous traffic into the read side of the interface pairs.
2. With traffic flowing, loop cold restarts: `SIGTERM` Suricata and relaunch it immediately.
3. Watch startup logs for the ENOTSOCK signature.
4. After `Engine started.`, check per-interface packet counters for `SFE_N_RX: packets: 0`.
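
Step 3 can be automated with a small log scanner. The sketch below is an assumption about how one might do it, not part of the original test setup: `is_enotsock_signature()` and `count_signature_lines()` are hypothetical helpers that match the two fixed substrings of the warning shown in the sample log lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if the line carries the "socket 0 / ENOTSOCK" warning. The
 * interface name and thread prefix vary, so only the fixed parts of
 * the message are matched. */
bool is_enotsock_signature(const char *line)
{
    assert(line != NULL);
    return strstr(line, "sending packet failed on socket 0:") != NULL &&
           strstr(line, "Socket operation on non-socket") != NULL;
}

/* Counts matching lines in an already opened log stream. Lines longer
 * than the buffer are read in chunks, so a signature split across two
 * chunks would be missed; Suricata warning lines are far shorter. */
unsigned count_signature_lines(FILE *log)
{
    char line[1024];
    unsigned hits = 0;

    assert(log != NULL);
    while (fgets(line, sizeof(line), log) != NULL) {
        if (is_enotsock_signature(line)) {
            hits++;
        }
    }
    return hits;
}
```

Because of the rate limit, a non-zero count per restart cycle is the useful signal; the count itself only reflects how many peers logged their first failure.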

### Expected behavior

`SFE_N_TX: packets:` and `SFE_N_RX: packets:` both grow under traffic after startup.

### Actual behavior

One or more `SFE_N_RX` interfaces show `packets: 0` permanently. The warning:

```
SFE_N_RX: sending packet failed on socket 0: Socket operation on non-socket
```

appears in the log between `AF_PACKET IPS mode activated` and `Engine started.`, then is silenced by the rate-limit while dropping continues indefinitely.

### Deterministic reproduction with window widener (test-only)

Add a `usleep` at the top of `AFPSetupRing()` gated behind an environment variable. This holds the existing race window open deterministically without creating the bug. With ~6 interface pairs and the widener enabled, the ENOTSOCK signature reproduced on 20/20 cold-start iterations.
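
A widener of this kind could look roughly like the sketch below. The contents of `widener.patch` are not reproduced in this document, so everything here is an assumption: the helper names and the 5-second cap are hypothetical, the environment variable name is taken from the reproduction setup above, and `nanosleep()` is used in place of the `usleep()` the text describes because `usleep()` is obsolete in POSIX.1-2008.

```c
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */

#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

#define RING_DELAY_ENV    "SURICATA_RING_SETUP_DELAY_US"
#define RING_DELAY_MAX_US 5000000UL   /* hypothetical safety cap: 5 s */

/* Parses the delay in microseconds from the environment. Returns 0 when
 * the variable is unset, empty, or not a plain decimal number, so a
 * build carrying the patch behaves normally unless explicitly enabled. */
unsigned long ring_setup_delay_us(void)
{
    const char *val = getenv(RING_DELAY_ENV);
    if (val == NULL || *val < '0' || *val > '9') {
        return 0;
    }

    errno = 0;
    char *end = NULL;
    unsigned long us = strtoul(val, &end, 10);
    if (errno != 0 || *end != '\0') {
        return 0;
    }
    return us > RING_DELAY_MAX_US ? RING_DELAY_MAX_US : us;
}

/* Test-only hook, intended to be called at the top of the ring setup
 * routine. It only lengthens the existing window; it adds no new race. */
void ring_setup_test_delay(void)
{
    unsigned long us = ring_setup_delay_us();
    if (us > 0) {
        struct timespec ts = {
            .tv_sec = (time_t)(us / 1000000UL),
            .tv_nsec = (long)(us % 1000000UL) * 1000L,
        };
        nanosleep(&ts, NULL);
    }
}
```

Gating on an environment variable keeps the delay out of normal runs, and the cap guards against a typo turning a test restart into a very long stall.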

## Upstream commit introducing the regression

Commit: ["923ad6af" — "af-packet: speed up thread sync during startup"](https://github.com/OISF/suricata/commit/923ad6af7709c9eca6f0f5856ee267845b425ae5)

Message excerpt:
> The ring setup doesn't need to be done sequentially. This patch releases the thread early, after bind but before the ring setups.

The optimization is correct in intent. The oversight is that "release after bind" also means "release before `AFPSwitchState(AFP_STATE_UP)`," which is the only place the peer socket fd is published. No compensating guard was added to `AFPWritePacket()`.

## Files

- `README.md` — this document
- `bug-report.textile` — condensed one-paragraph summary for Redmine ticket body
