Bug #8442
open
capture-bypass: worker timeout of flows causes statistics inconsistencies
Description
Problem:
When a worker times out a capture-bypassed flow, it does not call the necessary functions to update the flow statistics.
Gathering statistics can be a costly operation, as it depends on the BypassUpdate callback implementation (e.g., querying hardware).
Proposed solution:
Forbid workers from timing out capture-bypassed flows and allow only FlowManager to handle their timeouts and updates.
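As a rough illustration of the proposed rule, the two eviction paths can be modeled in a few lines of Python. This is a toy model, not Suricata code: the `Flow` class, the function names, and the `bypass_update(flow, now)` signature (returning True when the capture layer saw new traffic and refreshed the flow) are assumptions made for this sketch. Only the idea itself comes from the description above: workers leave capture-bypassed flows alone, and the flow manager decides their fate after asking a BypassUpdate-style callback.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Flow:
    """Hypothetical stand-in for a flow table entry."""
    capture_bypassed: bool
    last_seen: float   # seconds
    timeout: float     # seconds
    packets: int = 0


def is_timed_out(flow: Flow, now: float) -> bool:
    return now - flow.last_seen >= flow.timeout


def worker_may_evict(flow: Flow, now: float) -> bool:
    # Proposed rule: a worker never evicts a capture-bypassed flow,
    # because it would skip the statistics update and the map cleanup.
    if flow.capture_bypassed:
        return False
    return is_timed_out(flow, now)


def flow_manager_may_evict(
    flow: Flow,
    now: float,
    bypass_update: Callable[[Flow, float], bool],
) -> bool:
    if not is_timed_out(flow, now):
        return False
    # For bypassed flows, ask the capture layer first. This call may be
    # costly (e.g., querying hardware), so it runs only in the flow manager.
    if flow.capture_bypassed and bypass_update(flow, now):
        return False  # traffic was seen recently, keep the flow
    return True
```

Keeping the potentially costly callback out of the worker path is the design point: the packet path stays cheap, and every bypassed flow passes through the one place that collects its counters before it is removed.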
Updated by OISF Ticketbot 3 months ago
- Subtask #8443 added
Updated by OISF Ticketbot 3 months ago
- Label deleted (Needs backport to 8.0)
Updated by Adam Kiripolsky 5 days ago
- File port_any.pcap added
- File port_443.pcap added
- File suricata-worket-bypass-stats.yml added
Reproducibility test
I ran Suricata in AF_PACKET runmode with eBPF bypass.
To see the results more clearly, I have created a test branch: https://github.com/adaki4/suricata/tree/reproduce-wrong-worker-bypass-stats-v1
This branch adds counters for capture-bypassed flows that would be timed out by the function FlowIsTimedOut() and for the number of deletions from the bypass eBPF map.
I used two pcaps, port_443.pcap and port_any.pcap, both generated with scapy.
- port_443.pcap contains 1000 TCP flows (each of 10 packets) with different IP addresses, all with port 443.
- port_any.pcap contains 1000 TCP flows (each of 10 packets) with different IP addresses, all with different ports other than 443.
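The scripts that produced these pcaps are not shown in this report, so the following is only a sketch of how comparable files could be generated. Everything concrete in it is an assumption: the address ranges, the source ports, the 1024+ destination port range, and the layout of the 10 packets per flow (a three-way handshake followed by 7 small data segments). It needs the third-party scapy package, which is imported only when packets are actually built.

```python
import ipaddress


def flow_tuples(count, fixed_dport=None, first_dport=1024):
    """Yield (src_ip, dst_ip, sport, dport) for `count` distinct flows.

    With fixed_dport set, every flow uses that destination port;
    otherwise each flow gets its own port starting at first_dport.
    """
    base_src = ipaddress.IPv4Address("10.0.0.1")   # assumed range
    base_dst = ipaddress.IPv4Address("10.1.0.1")   # assumed range
    for i in range(count):
        dport = fixed_dport if fixed_dport is not None else first_dport + i
        yield (str(base_src + i), str(base_dst + i), 20000 + i, dport)


def build_flow(src, dst, sport, dport, packets=10):
    """Build one TCP flow: handshake plus data segments (assumed layout)."""
    from scapy.all import Ether, IP, TCP, Raw  # third-party dependency

    c2s = Ether() / IP(src=src, dst=dst)
    s2c = Ether() / IP(src=dst, dst=src)
    pkts = [
        c2s / TCP(sport=sport, dport=dport, flags="S", seq=1000),
        s2c / TCP(sport=dport, dport=sport, flags="SA", seq=2000, ack=1001),
        c2s / TCP(sport=sport, dport=dport, flags="A", seq=1001, ack=2001),
    ]
    seq, payload = 1001, b"x" * 16
    for _ in range(packets - len(pkts)):
        pkts.append(
            c2s
            / TCP(sport=sport, dport=dport, flags="PA", seq=seq, ack=2001)
            / Raw(payload)
        )
        seq += len(payload)
    return pkts


def write_pcap(path, tuples):
    """Write all flows described by `tuples` into one pcap file."""
    from scapy.all import wrpcap  # third-party dependency

    pkts = []
    for src, dst, sport, dport in tuples:
        pkts.extend(build_flow(src, dst, sport, dport))
    wrpcap(path, pkts)
```

With these helpers, `write_pcap("port_443.pcap", flow_tuples(1000, fixed_dport=443))` and `write_pcap("port_any.pcap", flow_tuples(1000))` would produce files with the same shape as the attached ones, though not byte-identical copies.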
The Suricata rules are in the file drop-443.rules and can look like this:
drop tcp any any -> any 443 (msg:"Dropping all HTTPS traffic (port 443)"; bypass; sid:1000004; rev:1;)
drop tcp any 443 -> any any (msg:"Dropping all HTTPS traffic (port 443)"; bypass; sid:1000005; rev:1;)
I also reduced the size of Suricata's flow table and its hash-size, as configured in the attached configuration file (suricata-worket-bypass-stats.yml).
I launched Suricata with:
sudo src/suricata -S ./rules/drop-443.rules -c suricata-worket-bypass-stats.yml -l /tmp/ -vvvv --af-packet
Note: In my setup, the interface I use to replay traffic is mirrored to Suricata's interface.
First, I send port_443.pcap to Suricata's running interface via tcpreplay:
sudo tcpreplay -i <if> port_443.pcap
When the replay ends, I immediately send port_any.pcap. The second pcap must follow right after the first one because the timeout for bypassed flows is set to only 20 s.
sudo tcpreplay -i <if> port_any.pcap
After the replay ends, Suricata can be shut down. Two lines in the console log are important:
Info: af-packet: rules for bypass deleted: x [ReceiveAFPThreadDeinit:source-af-packet.c:2750]
Info: af-packet: capture bypassed flows timeouted by worker: y [ReceiveAFPThreadDeinit:source-af-packet.c:2751]
x is the number of capture-bypassed flows that were deinitialized correctly, i.e. their entries were deleted from the eBPF map. y is the number of capture-bypassed flows that were timed out in the function FlowGetFlowFromHash(). These flows are removed from the flow table, but their entries are not deleted from the eBPF map and their statistics are not collected. x + y gives the total number of flows that were supposed to be bypassed, which is 1000 in this case.
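To avoid reading the two counters by eye, the check can be scripted. The sketch below is only a convenience helper written for this report, not part of Suricata or the test branch: the function name and the returned dictionary are made up, and it assumes the two log messages keep exactly the wording shown above. Because the lines are printed from ReceiveAFPThreadDeinit, it also assumes there may be one pair per capture thread and sums them.

```python
import re

# Patterns match the two messages printed by the test branch.
DELETED_RE = re.compile(r"rules for bypass deleted: (\d+)")
WORKER_RE = re.compile(r"capture bypassed flows timeouted by worker: (\d+)")


def summarize_bypass_log(log_text, expected_flows=1000):
    """Sum x and y over all matching lines and check them.

    Returns a dict with:
      x         - entries deleted from the eBPF map
      y         - bypassed flows timed out by a worker
      accounted - True if x + y equals the expected flow count
      clean     - True if no flow was timed out by a worker
    """
    x = sum(int(n) for n in DELETED_RE.findall(log_text))
    y = sum(int(n) for n in WORKER_RE.findall(log_text))
    return {
        "x": x,
        "y": y,
        "accounted": x + y == expected_flows,
        "clean": y == 0,
    }
```

Feeding it the captured console output, for example `summarize_bypass_log(open("suricata-console.log").read())` with a file name of your choosing, gives a quick pass/fail: `accounted` should hold in both the affected and the fixed build, while `clean` should hold only in the fixed one.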
In the fixed version (e.g. with the fix from PR https://github.com/OISF/suricata/pull/15331 applied), y is always 0 and x is 1000, meaning that all flows and their bypass data are properly deinitialized.