Bug #3004
closedSC_ERR_PCAP_DISPATCH with message "error code -2" upon rule reload completion
Description
Summary
When doing rule reloads (using the USR2 signal), about half of the time at the completion of the reload we see the error SC_ERR_PCAP_DISPATCH with error_code 20 and the message "error code -2". I am not sure at this time if we should be concerned about this error or if it can safely be ignored.
Details
Upon the completion of rule reloads, about half of the time this produces a sequence of events like the following:
{"timestamp":"2019-04-26T10:07:54.238852-0500","log_level":"Error","event_type":"engine","engine":{"error_code":20,"error":"SC_ERR_PCAP_DISPATCH","message":"error code -2 "}} {"timestamp":"2019-04-26T10:07:54.664296-0500","log_level":"Info","event_type":"engine","engine":{"message":"cleaning up signature grouping structure... complete"}} {"timestamp":"2019-04-26T10:07:54.665821-0500","log_level":"Notice","event_type":"engine","engine":{"message":"rule reload complete"}}
We upgraded from 4.0.6 to 4.1.3 a while back and that is believed to be when this started happening. I checked the 4.0.6 logs and did not see these messages. Upgrading from 4.1.3 to 4.1.4 did not resolve the issue.
We are using the pcap capture method (Myricom) with workers runmode.
I looked into this issue back when I submitted this to the OISF-USERS mailing list (https://lists.openinfosecfoundation.org/pipermail/oisf-users/2019-April/016850.html) and have the following observations:
It seems this error is coming from source-pcap.c on line 269 (https://github.com/OISF/suricata/blob/7f38ffc8bcfa3bca793eb3be41f112634b48de2a/src/source-pcap.c#L269), since we aren't loading a pcap file in this case and that is mostly where else this error is thrown.
There is a pcap_dispatch call above this one (line 265) and the conditional on line 267 to enter the trigger for this error checks that the return from pcap_dispatch is < 0. From https://linux.die.net/man/3/pcap_dispatch, "-2 (is returned) if the loop terminated due to a call to pcap_breakloop() before any packets were processed". The PCAP_ERROR_BREAK (-2) code would be handled on line 272 once inside of here. There is a pcap_breakloop() call (line 226) inside PcapCallbackLoop which is called on line 266, but I believe instead this could be the result of the change for 4.1.3 in https://github.com/OISF/suricata/commit/bb26e6216e5190d841529c0ecb1292b9a358ed54#diff-2079412a59d37868318fc953aeddef52 where ReceivePcapBreakLoop was created for PktAcqBreakLoop. So possibly in tm-threads.c at https://github.com/OISF/suricata/blob/d6903e70c1b653984ca95f8808755efbc6a9ece4/src/tm-threads.c#L1610?
If that is how the error occurs, then I am curious if we may be losing a half second (at least) of traffic visibility due to the reconnect on line 277 of source-pcap.c?
Steps to reproduce
Unknown at this time other than the possibility of needing to use pcap capture method.
Updated by Andreas Herz over 5 years ago
- Assignee set to OISF Dev
- Target version set to Support
Updated by Victor Julien over 5 years ago
The logic that triggers this lives in detect-engine.c, look at the BreakCapture function and how it is called here https://github.com/OISF/suricata/blob/14c2b6e445b3e8b5f802c4538397c12bfd3f831d/src/detect-engine.c#L1505
The goal is to make sure that all packet threads wake up within a reasonable time, even if the capture threads see no packets for a while. So it appears that the capture threads are not responding quickly enough and thus the BreakCapture is called which in turn calls the pcap_breakloop() function (indirectly).
The issue is that the breakloop result is considered an error. Could you try this patch?
diff --git a/src/source-pcap.c b/src/source-pcap.c
index 9c7388b96..a16e42d05 100644
--- a/src/source-pcap.c
+++ b/src/source-pcap.c
@@ -237,6 +237,10 @@ static void PcapCallbackLoop(char *user, struct pcap_pkthdr *h, u_char *pkt)
SCReturn;
}
+#ifndef PCAP_ERROR_BREAK
+#define PCAP_ERROR_BREAK -2
+#endif
+
/**
* \brief Main PCAP reading Loop function
*/
@@ -264,15 +268,15 @@ TmEcode ReceivePcapLoop(ThreadVars *tv, void *data, void *slot)
/* Right now we just support reading packets one at a time. */
r = pcap_dispatch(ptv->pcap_handle, packet_q_len,
(pcap_handler)PcapCallbackLoop, (u_char *)ptv);
- if (unlikely(r < 0)) {
+ if (unlikely(r == 0 || r == PCAP_ERROR_BREAK)) {
+ if (r == PCAP_ERROR_BREAK && ptv->cb_result == TM_ECODE_FAILED) {
+ SCReturnInt(TM_ECODE_FAILED);
+ }
+ TmThreadsCaptureHandleTimeout(tv, ptv->slot, NULL);
+ } else if (unlikely(r < 0)) {
int dbreak = 0;
SCLogError(SC_ERR_PCAP_DISPATCH, "error code %" PRId32 " %s",
r, pcap_geterr(ptv->pcap_handle));
-#ifdef PCAP_ERROR_BREAK
- if (r == PCAP_ERROR_BREAK) {
- SCReturnInt(ptv->cb_result);
- }
-#endif
do {
usleep(PCAP_RECONNECT_TIMEOUT);
if (suricata_ctl_flags != 0) {
@@ -287,8 +291,6 @@ TmEcode ReceivePcapLoop(ThreadVars *tv, void *data, void *slot)
} else if (ptv->cb_result == TM_ECODE_FAILED) {
SCLogError(SC_ERR_PCAP_DISPATCH, "Pcap callback PcapCallbackLoop failed");
SCReturnInt(TM_ECODE_FAILED);
- } else if (unlikely(r == 0)) {
- TmThreadsCaptureHandleTimeout(tv, ptv->slot, NULL);
}
StatsSyncCountersIfSignalled(tv);
Updated by Eric Urban over 5 years ago
Thank you for the patch. I was planning on applying this to the code from the 4.1.4 branch we are running and putting it on one of our redundant sensors to test. However, I then noticed that the 4.1.4 branch calls TmThreadsCaptureInjectPacket instead of TmThreadsCaptureHandleTimeout in the else-if from the patch hunk 3 for lines 287-291, which the patch moves to the lines 264-268 in hunk 2. It seems we would need to test this on the master branch instead of the current version we are running.
You wrote that "The issue is that the breakloop result is considered an error". What I am wondering is if this means this error can be ignored or are we losing visibility momentarily due to the capture threads not responding quickly?
Updated by Andreas Herz over 5 years ago
- Status changed from New to Feedback
Did you have a chance to test the patch (or a modified version)?
Updated by Eric Urban over 5 years ago
Sorry for the delay. I just applied a modified version of the patch on an instance where we have seen this error on 12 of the last 30 days. I am hoping the modifications I made still have the same effect. I took the patch provided by Victor above and replaced TmThreadsCaptureHandleTimeout with TmThreadsCaptureInjectPacket, since that was the function previously used in the 4.1.4 version in the final else-if condition of patch and TmThreadsCaptureHandleTimeout didn't exist there.
If that invalidates the patch, please let me know. Otherwise I will do my best to get back to you once we have a good idea of the results from the change.
Updated by Eric Urban over 5 years ago
A modified version of the patch (described in last comment) has been in place on one of our 4.1.4 version sensors since 2019-07-09. Since applying the patch, we have not encountered the error.
Updated by Victor Julien over 5 years ago
- Status changed from Feedback to Assigned
- Assignee changed from OISF Dev to Victor Julien
- Target version changed from Support to 5.0rc1
- Label Needs backport added
Updated by Victor Julien over 5 years ago
- Status changed from Assigned to Closed
Updated by Victor Julien over 5 years ago
- Copied to Bug #3159: SC_ERR_PCAP_DISPATCH with message "error code -2" upon rule reload completion (4.1.x) added