Support #3320
Significant packet loss when using Suricata with Rust enabled (Closed)
Description
Summary
We experience significant packet loss at times with Rust enabled in Suricata. In our environment we have two instances with the same version, same configuration file, same rules loaded, and same traffic, where the one without Rust enabled has little to no packet loss and the one with Rust enabled experiences packet loss. Disabling Rust on the host with packet loss has been shown to correct the issue.
Details
Currently we are running two instances of 4.1.5 side by side with the same configuration, rules loaded, and traffic. In both cases Suricata was compiled with the options "HAVE_PYTHON=/usr/bin/python3 ./configure --with-libpcap=/opt/snf --localstatedir=/var/ --with-libhs-includes=/usr/local/include/hs/ --with-libhs-libraries=/usr/local/lib64/", but one had Rust/Cargo present during compilation and the other did not. We also have a 5.0.0 instance, where Rust is required and enabled by default, with the same config/rules/traffic that experiences drops as well. This same behavior was also seen on 4.1.2, where we did a side-by-side comparison of using Rust vs. not using it.
Our current comparison setup unfortunately is being done on hosts with different hardware. However, we did run this comparison on identical hardware back when using 4.1.2 and had the same results where Rust being enabled produced many more drops. I also believe in our current test setup that both hosts are more than adequately sized. The Rust enabled host has 40 cores with 128GB memory and 1 instance of Suricata. The non-Rust host has 88 cores with 256GB memory and 4 instances of Suricata, though only one of four instances is getting the traffic mirroring that of our Rust enabled instance.
The Suricata stats show drops and so do our Myricom stats. It appears there could be a counter issue of some kind, because the number of packets reported during these periods of large drops also increases significantly. When I compared packets received minus packets dropped across these two hosts, the Rust enabled instance still had noticeably fewer total packets in most cases, so it would seem something else is going on. One example: for the minute of 11:06 on Nov 4, the sum of stats.capture.kernel_packets_delta and stats.capture.kernel_drops_delta was 1,440,535 packets on the Rust instance vs. 8,237,600 on the instance without Rust.
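For reference, the per-minute totals above were computed by summing the two delta counters from the eve.json stats records. Below is a rough sketch of that calculation, not our exact script; it assumes the delta counters are enabled in the eve stats output and that the log file is at ./eve.json:

    #!/usr/bin/env python3
    # Rough sketch: sum kernel_packets_delta + kernel_drops_delta per minute
    # from a Suricata eve.json stats log. Assumes delta counters are enabled
    # in the eve stats output.
    import json
    from collections import defaultdict

    totals = defaultdict(int)

    with open("eve.json") as fh:
        for line in fh:
            try:
                rec = json.loads(line)
            except ValueError:
                continue
            if rec.get("event_type") != "stats":
                continue
            capture = rec.get("stats", {}).get("capture", {})
            # Timestamps look like "2019-11-04T11:06:03.000000+0000";
            # truncating to 16 characters groups records by minute.
            minute = rec.get("timestamp", "")[:16]
            totals[minute] += capture.get("kernel_packets_delta", 0)
            totals[minute] += capture.get("kernel_drops_delta", 0)

    for minute in sorted(totals):
        print(minute, totals[minute])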
During the periods of drops, the Rust enabled instance has fewer alerts. The difference varies quite a bit depending on the time period analyzed and which period of drops is analyzed. One example: between 09:00 and 10:00 on November 4, while drops were happening, the Rust instance had 13,601 alerts and the one without Rust had 15,820. Looking at times outside of drop periods, for the times I sampled, the Rust host generally has slightly more alerts, but the difference is around 1% or less. I am guessing this small difference during normal operating periods isn't too unusual, since enabling Rust does change the traffic analyzers for some protocols.
I did seek help through the mailing list earlier this year in a thread starting at https://lists.openinfosecfoundation.org/pipermail/oisf-users/2019-February/016618.html. That thread had some activity over at least a few months, but there was no resolution, and it became quite long, so it may be best to avoid looking at it and start from scratch here.
Some additional info that applies to both 4.1.5 instances:
- CentOS Linux release 7.7.1908 (Core) / 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Pcap capture method (using --pcap command-line option) with workers runmode
- Myricom cards.
  ProductCode        Driver    Version
  10G-PCIE2-8C2-2S   myri_snf  3.0.18.50878
- Rust/cargo versions:
Rust compiler: rustc 1.38.0
Rust cargo: cargo 1.38.0
I will attach stats from the eve logs for both hosts and also Myricom stats logs. Note that the counters ending in __per_second in the Myricom log should not be used as these are not standard. Build-info output is also included. I can provide configuration directly (not through Redmine) if requested.
Steps to reproduce
Not known for certain how to reproduce, other than building Suricata with Rust support enabled.
Updated by Victor Julien about 5 years ago
Hi Eric, can you share the app-layer section of your yaml here?
I'm mostly thinking about how SMB behaves very differently between the C and Rust code with the same config. The SMB parser enables a 'stream depth' of 0 (unlimited) by default, which means it does not respect your global stream.reassembly.depth setting. That setting defaults to 1MiB. So in the old C code scenario, SMB is tracked only up to that 1MiB (or whatever your setting is), and even then only for SMB1, which I hope you're not seeing much of anymore. In the Rust case it will track all SMB without limit, plus do much more work on the traffic: file tracking, logging, etc.
If you monitor a lot of SMB traffic you could force the SMB parser to use the old stream depth behavior:
    smb:
      enabled: yes
      detection-ports:
        dp: 139, 445
      # Stream reassembly size for SMB streams. By default track it completely.
      stream-depth: 1mb
Updated by Eric Urban about 5 years ago
Hello Victor, thank you for the prompt response!
I put the config change in place on our test sensor with Rust yesterday. I will monitor and get back to you with our app-layer config if there is no change. So far there have not been any drops, but we have had other full days where we didn't see this issue, so I will need to let it run for a while longer to build confidence.
Updated by Eric Urban about 5 years ago
After making this change we have had 0 drops, and the packet counts between our Rust enabled and non-Rust enabled sensors have aligned very closely. We had not previously gone this many days without drops, so it looks like this fixes the issue. Thank you again for your response and the config suggestion that appears to have corrected this issue!
Victor, I am wondering if you could clarify one thing in your last comment. You explained that in the old C code scenario SMB is tracked only up to 1MiB, and in the next sentence you wrote "And even then only for SMB1". Does that mean that in the C code SMB analyzer SMB1 is tracked only up to 1MiB but SMB2 is tracked without limit?
Updated by Eric Urban about 5 years ago
I believe I found the answer to my question in the previous comment, which is that the SMB2 app-layer parser is disabled internally in the C code parser. I was thrown off because we regularly see alerts for SMB2 traffic, but those alerts are triggered by rules using tcp as the protocol that do a content match on SMB2, so they do not actually involve app-layer parsing of SMB2.
Updated by Andreas Herz about 4 years ago
- Status changed from Feedback to Closed