Bug #3075
openRX thread hang in pcap-file mode
Description
based on suricata4.1.4, RX thread sometimes(always two days) hang, so RX can't read pcap file.
gstack RX_thread_id
#0 0x00007f0a98fe8945 in pthread_cond_wait@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000005a19c2 in PacketPoolWait () at tmqh-packetpool.c:155 #2 0x000000000058afdd in PcapFileDispatch (ptv=ptv@entry=0x7f0a8f38f2b0) at source-pcap-file-helper.c:135 #3 0x0000000000588a1fin PcapDirectoryDispatchForTimeRange (older_than=0x7f0a09150a0, pv=0x7f0a8c030e70) at source-pcap-file-directory-helper.c:462 #4 PcapDirectoryDispatch (ptv=0x7f0a8c030e70) at source-pcap-file-directory-helper.c:530 #5 0x00000000005860c6 ReceivePcapFileLoop (tv=<optimized out>, data=0x7f0a8c030db0, slot=<optimized out>) at source-pcap-file.c:177 #6 0x00000000005a5b26 in TmThreadsSlotPktAcqLoop (td=0x9deedc0) at tm-threads.c:356 #4 0x00007f0a98fe4e25 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f0a9869834d in clone () from /lib64/libc.so.6
Files
Updated by Andreas Herz over 5 years ago
- Status changed from New to Feedback
- Assignee set to OISF Dev
- Target version changed from 4.1.5 to TBD
Can you give us more details about your setup?
(Linux, NIC, configuration, runmode, parameter)
Updated by WenTan Liu over 5 years ago
- File suricata.yaml suricata.yaml added
1. Centos 7.2
2. NIC Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection
3. runmode <autofp>
4. suricata -c suricata.yaml -r pcap_file_directory --pcap-file-continuous -l log_dir
Updated by Victor Julien over 5 years ago
Are you able to test the current git master? I made some fixes some time ago that might be related.
Updated by Feng Dai over 5 years ago
I got similar issue in suriata 4.0.6 with loading test of 400Mbps for 20 minutes. The RX thread doesn't receive any more packages.
#0 0x00007fb60082f945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000561bdf192b7a in PacketPoolWait () #2 0x0000561bdf178bd5 in ReceivePcapLoop () #3 0x0000561bdf1975e7 in TmThreadsSlotPktAcqLoop () #4 0x00007fb60082be25 in start_thread () from /lib64/libpthread.so.0 #5 0x00007fb60013e34d in clone () from /lib64/libc.so.6
I got a fix to pass my load test. The pattern to use condition variable was not correct from my opinion. Please review if my fix makes sense. Thanks.
diff -Naurp --exclude tags suricata-4.0.6/src/tmqh-packetpool.c suricata-4.0.6-twutm1605/src/tmqh-packetpool.c
--- suricata-4.0.6/src/tmqh-packetpool.c 2018-11-06 03:01:46.000000000 -0600
+++ suricata-4.0.6-twutm1605/src/tmqh-packetpool.c 2019-09-13 15:41:20.673513665 -0500
@@ -149,10 +149,13 @@ void PacketPoolWait(void)
{
PktPool *my_pool = GetThreadPacketPool();
- if (PacketPoolIsEmpty(my_pool)) {
+ if (!my_pool->head) {
+ /* local stack is empty */
SCMutexLock(&my_pool->return_stack.mutex);
- SC_ATOMIC_ADD(my_pool->return_stack.sync_now, 1);
- SCCondWait(&my_pool->return_stack.cond, &my_pool->return_stack.mutex);
+ while (PacketPoolIsEmpty(my_pool)) {
+ SC_ATOMIC_ADD(my_pool->return_stack.sync_now, 1);
+ SCCondWait(&my_pool->return_stack.cond, &my_pool->return_stack.mutex);
+ }
SCMutexUnlock(&my_pool->return_stack.mutex);
}
@@ -323,8 +326,8 @@ void PacketPoolReturnPacket(Packet *p)
my_pool->pending_tail->next = pool->return_stack.head;
pool->return_stack.head = my_pool->pending_head;
SC_ATOMIC_RESET(pool->return_stack.sync_now);
- SCMutexUnlock(&pool->return_stack.mutex);
SCCondSignal(&pool->return_stack.cond);
+ SCMutexUnlock(&pool->return_stack.mutex);
/* Clear the list of pending packets to return. */
my_pool->pending_pool = NULL;
my_pool->pending_head = NULL;
@@ -337,8 +340,8 @@ void PacketPoolReturnPacket(Packet *p)
p->next = pool->return_stack.head;
pool->return_stack.head = p;
SC_ATOMIC_RESET(pool->return_stack.sync_now);
- SCMutexUnlock(&pool->return_stack.mutex);
SCCondSignal(&pool->return_stack.cond);
+ SCMutexUnlock(&pool->return_stack.mutex);
}
}
}
@@ -395,8 +398,8 @@ void PacketPoolInit(void)
PacketPoolStorePacket(p);
}
- //SCLogInfo("preallocated %"PRIiMAX" packets. Total memory %"PRIuMAX"",
- // max_pending_packets, (uintmax_t)(max_pending_packets*SIZE_OF_PACKET));
+ SCLogInfo("preallocated %"PRIiMAX" packets. Total memory %"PRIuMAX"",
+ max_pending_packets, (uintmax_t)(max_pending_packets*SIZE_OF_PACKET));
}
void PacketPoolDestroy(void)
Updated by Andreas Herz over 5 years ago
First of all please test it again with current versions, 4.0.6 is rather old.
If you want to commit your patch please follow those https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Contributing steps, thanks!
Updated by Andreas Herz almost 3 years ago
- Status changed from Feedback to Closed
Hi, we're closing this issue since there have been no further responses.
If you think this issue is still relevant, try to test it again with the
most recent version of suricata and reopen the issue. If you want to
improve the bug report please take a look at
https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Reporting_Bugs
Updated by Simeon Miteff about 2 years ago
I can confirm that this issue is still present in git master in autofp mode. I can reproduce this about 50% of the time with repeated runs on archives of ~660 PCAP files totaling ~600GB input data in each run. I have suricata configured with 4 worker threads (because I found that it slows down rapidly with >4 threads, but that is a separate issue) as follows:
threading:
set-cpu-affinity: yes
cpu-affinity:
- management-cpu-set:
cpu: [ 1-21,88-109 ] # Pin to NUMA node 0, avoid CPU 0
- receive-cpu-set:
cpu: [ 1-21,88-109 ] # Pin to NUMA node 0, avoid CPU 0
- worker-cpu-set:
cpu: [ 1-21,88-109 ] # Pin to NUMA node 0, avoid CPU 0
mode: "balanced"
threads: 4
The suricata process with hung RX threads have only one FD open (for one of the ~660 PCAP input files). The backtrace on the hung thread looks the same every time:
(gdb) info threads Id Target Id Frame * 1 Thread 0x7f880f338d80 (LWP 8) "Suricata-Main" 0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6 2 Thread 0x7f880d73c640 (LWP 9) "RX#01" 0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6 3 Thread 0x7f880cf3b640 (LWP 10) "W#01" 0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6 4 Thread 0x7f8807fff640 (LWP 11) "W#02" 0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6 5 Thread 0x7f88077fe640 (LWP 12) "W#03" 0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6 6 Thread 0x7f8806ffd640 (LWP 13) "W#04" 0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6 7 Thread 0x7f88067fc640 (LWP 14) "FM#01" 0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6 8 Thread 0x7f8805ffb640 (LWP 15) "FM#02" 0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6 9 Thread 0x7f88057fa640 (LWP 16) "FR#01" 0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6 10 Thread 0x7f8804ff9640 (LWP 17) "FR#02" 0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6 (gdb) thread 2 [Switching to thread 2 (Thread 0x7f880d73c640 (LWP 9))] #0 0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6 (gdb) bt #0 0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6 #1 0x00007f880f6e5ba0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6 #2 0x000000000055f734 in PacketPoolWait () at tmqh-packetpool.c:76 #3 0x000000000064b771 in PcapFileDispatch (ptv=ptv@entry=0x7f88088e4340) at source-pcap-file-helper.c:152 #4 0x000000000064aa8c in PcapDirectoryDispatchForTimeRange (older_than=0x7f880d73a710, pv=0x7f8808880c20) at source-pcap-file-directory-helper.c:460 #5 PcapDirectoryDispatch (ptv=0x7f8808880c20) at source-pcap-file-directory-helper.c:520 #6 0x0000000000649cf9 in ReceivePcapFileLoop (tv=<optimized out>, data=0x7f8808878b60, slot=<optimized out>) at source-pcap-file.c:184 #7 0x0000000000562d37 in TmThreadsSlotPktAcqLoop (td=0x1e612e70) at tm-threads.c:310 #8 0x00007f880f6e6802 in start_thread () from /lib64/libc.so.6 #9 0x00007f880f686314 in clone () from /lib64/libc.so.6
I applied Feng Dai's patch above and so far it appears to resolve the hang (I've only run it 8 times, but I will continue and report if this changes) but the runs are also ~3 times longer than successful (non-hanging) runs on unpatched git master, which suggests to me that the patch forces sequential processing.
Updated by Andreas Herz about 2 years ago
- Status changed from Closed to In Progress
Updated by Andreas Herz about 2 years ago
- Status changed from In Progress to New
Updated by Simeon Miteff about 2 years ago
- None of them got stuck.
- The reduction in performance is closer to 1/2 rather than the 1/3 I reported in the previous note.
Updated by Victor Julien about 2 years ago
@Simeon Miteff can you share the exact yaml, build-info output, commandline and rules used? Also, can you share the output of stats.log (or a eve.stats record from the time of the hang)?
Updated by Simeon Miteff almost 2 years ago
- File suricata.yaml suricata.yaml added
- File build-info.txt build-info.txt added
- File statline.json statline.json added
Sorry for the delayed response. I attach the config yaml (suricata.yaml), build info in build-info.txt (note that I think the hang is reproducable in just about any version).
The contents of statline.json is the last EVE log stat output before the hang, and also references the PCAP file which is open at the time of the hang.
Command line is of the form: suricata -vvv -k none --pcap-file-recursive -r /path/to/month/containing/daily/subdirs/filled/with/pcaps -S /suricata.rules -c /suricata.yaml -F /ignore-internal-scanners.bpf
I can't share the rules file publically but it is OK to share with OISF, so will send to you out-of-band.
Updated by Victor Julien over 1 year ago
- Related to Bug #5445: RX thread hang in pcap-file mode added
Updated by Simeon Miteff over 1 year ago
Hi Victor and Andreas.
I'm about to resume the kind of PCAP processing that is affected by this bug, so I'm checking in if there is any progress on this or interest to work on it?
Updated by Ofer Dagan 4 months ago
Hi,
We are still experiencing the issue.
Is the patch expected to enter some version soon? Is there another solution that doesn't decrease performances that much?
We have a use case where performance is important so we prefer a solution that won't hurt it that much.
Also not needed to manually patch the solution every time will be really helpful.