Bug #3075: RX thread hang in pcap-file mode - Suricata - Open Information Security Foundation


Bug #3075

open

RX thread hang in pcap-file mode

Added by WenTan Liu over 6 years ago. Updated 5 months ago.

Status:

New

Priority:

Normal

Assignee:

OISF Dev

Target version:

TBD

Affected Versions:

8.0.0

Effort:

Difficulty:

Label:

Description

On Suricata 4.1.4, the RX thread sometimes hangs (consistently after about two days of running), after which it can no longer read pcap files.

gstack RX_thread_id

#0 0x00007f0a98fe8945 in pthread_cond_wait@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005a19c2 in PacketPoolWait () at tmqh-packetpool.c:155
#2 0x000000000058afdd in PcapFileDispatch (ptv=ptv@entry=0x7f0a8f38f2b0) at source-pcap-file-helper.c:135
#3 0x0000000000588a1f in PcapDirectoryDispatchForTimeRange (older_than=0x7f0a09150a0, pv=0x7f0a8c030e70) at source-pcap-file-directory-helper.c:462
#4 PcapDirectoryDispatch (ptv=0x7f0a8c030e70) at source-pcap-file-directory-helper.c:530
#5 0x00000000005860c6 in ReceivePcapFileLoop (tv=<optimized out>, data=0x7f0a8c030db0, slot=<optimized out>) at source-pcap-file.c:177
#6 0x00000000005a5b26 in TmThreadsSlotPktAcqLoop (td=0x9deedc0) at tm-threads.c:356
#7 0x00007f0a98fe4e25 in start_thread () from /lib64/libpthread.so.0
#8 0x00007f0a9869834d in clone () from /lib64/libc.so.6

Files


suricata.yaml (73 KB), WenTan Liu, 07/08/2019 10:22 AM
suricata.yaml (10.6 KB), Simeon Miteff, 02/07/2023 02:48 AM
build-info.txt (3.31 KB), Simeon Miteff, 02/07/2023 02:48 AM
statline.json (196 KB), Simeon Miteff, 02/07/2023 02:48 AM

Related issues 1 (0 open — 1 closed)


Updated by Andreas Herz over 6 years ago

Status changed from New to Feedback
Assignee set to OISF Dev
Target version changed from 4.1.5 to TBD

Can you give us more details about your setup?
(Linux, NIC, configuration, runmode, parameter)


Updated by WenTan Liu over 6 years ago

File suricata.yaml suricata.yaml added

1. OS: CentOS 7.2
2. NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection
3. Runmode: autofp
4. Command line: suricata -c suricata.yaml -r pcap_file_directory --pcap-file-continuous -l log_dir


Updated by Victor Julien over 6 years ago

Are you able to test the current git master? I made some fixes some time ago that might be related.


Updated by Victor Julien over 6 years ago

Priority changed from High to Normal


Updated by Victor Julien over 6 years ago

Description updated (diff)


Updated by Feng Dai over 6 years ago

I hit a similar issue in Suricata 4.0.6 during a load test at 400 Mbps for 20 minutes. The RX thread stops receiving packets.

#0  0x00007fb60082f945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000561bdf192b7a in PacketPoolWait ()
#2  0x0000561bdf178bd5 in ReceivePcapLoop ()
#3  0x0000561bdf1975e7 in TmThreadsSlotPktAcqLoop ()
#4  0x00007fb60082be25 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fb60013e34d in clone () from /lib64/libc.so.6

I have a fix that passes my load test. In my opinion, the condition variable was not being used correctly. Please review whether my fix makes sense. Thanks.

diff -Naurp --exclude tags suricata-4.0.6/src/tmqh-packetpool.c suricata-4.0.6-twutm1605/src/tmqh-packetpool.c
--- suricata-4.0.6/src/tmqh-packetpool.c    2018-11-06 03:01:46.000000000 -0600
+++ suricata-4.0.6-twutm1605/src/tmqh-packetpool.c  2019-09-13 15:41:20.673513665 -0500
@@ -149,10 +149,13 @@ void PacketPoolWait(void)
 {
     PktPool *my_pool = GetThreadPacketPool();

-    if (PacketPoolIsEmpty(my_pool)) {
+    if (!my_pool->head) {
+        /* local stack is empty */
         SCMutexLock(&my_pool->return_stack.mutex);
-        SC_ATOMIC_ADD(my_pool->return_stack.sync_now, 1);
-        SCCondWait(&my_pool->return_stack.cond, &my_pool->return_stack.mutex);
+        while (PacketPoolIsEmpty(my_pool)) {
+            SC_ATOMIC_ADD(my_pool->return_stack.sync_now, 1);
+            SCCondWait(&my_pool->return_stack.cond, &my_pool->return_stack.mutex);
+        }
         SCMutexUnlock(&my_pool->return_stack.mutex);
     }

@@ -323,8 +326,8 @@ void PacketPoolReturnPacket(Packet *p)
                 my_pool->pending_tail->next = pool->return_stack.head;
                 pool->return_stack.head = my_pool->pending_head;
                 SC_ATOMIC_RESET(pool->return_stack.sync_now);
-                SCMutexUnlock(&pool->return_stack.mutex);
                 SCCondSignal(&pool->return_stack.cond);
+                SCMutexUnlock(&pool->return_stack.mutex);
                 /* Clear the list of pending packets to return. */
                 my_pool->pending_pool = NULL;
                 my_pool->pending_head = NULL;
@@ -337,8 +340,8 @@ void PacketPoolReturnPacket(Packet *p)
             p->next = pool->return_stack.head;
             pool->return_stack.head = p;
             SC_ATOMIC_RESET(pool->return_stack.sync_now);
-            SCMutexUnlock(&pool->return_stack.mutex);
             SCCondSignal(&pool->return_stack.cond);
+            SCMutexUnlock(&pool->return_stack.mutex);
         }
     }
 }
@@ -395,8 +398,8 @@ void PacketPoolInit(void)
         PacketPoolStorePacket(p);
     }

-    //SCLogInfo("preallocated %"PRIiMAX" packets. Total memory %"PRIuMAX"",
-    //        max_pending_packets, (uintmax_t)(max_pending_packets*SIZE_OF_PACKET));
+    SCLogInfo("preallocated %"PRIiMAX" packets. Total memory %"PRIuMAX"",
+            max_pending_packets, (uintmax_t)(max_pending_packets*SIZE_OF_PACKET));
 }

 void PacketPoolDestroy(void)
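
The condition variable pattern that the patch moves toward can be sketched in isolation. This is a minimal sketch using plain POSIX threads, not Suricata's code: `DemoPool`, `demo_pool_take`, `demo_pool_give`, and `demo_run` are hypothetical names, and a simple counter stands in for the return stack. It shows the predicate being re-checked in a `while` loop while the mutex is held, so that a signal sent before the waiter blocks is not lost and a spurious wakeup is harmless.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a pool's return stack; not Suricata's PktPool. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int available; /* number of returned items; this is the wait predicate */
} DemoPool;

/* Consumer side: block until at least one item is available, then take it. */
static void demo_pool_take(DemoPool *pool)
{
    pthread_mutex_lock(&pool->mutex);
    /* Check the predicate under the mutex and in a loop. If the producer
     * already gave an item, we never wait; if we wake spuriously, we wait
     * again. */
    while (pool->available == 0) {
        pthread_cond_wait(&pool->cond, &pool->mutex);
    }
    pool->available--;
    pthread_mutex_unlock(&pool->mutex);
}

/* Producer side: publish one item and wake the waiter while holding the mutex. */
static void demo_pool_give(DemoPool *pool)
{
    pthread_mutex_lock(&pool->mutex);
    pool->available++;
    pthread_cond_signal(&pool->cond);
    pthread_mutex_unlock(&pool->mutex);
}

typedef struct {
    DemoPool *pool;
    int count;
} DemoArgs;

static void *demo_producer(void *arg)
{
    DemoArgs *args = arg;
    for (int i = 0; i < args->count; i++) {
        demo_pool_give(args->pool);
    }
    return NULL;
}

/* Run one producer thread against a consumer in the calling thread.
 * Returns the number of items the consumer took, or -1 on setup failure. */
int demo_run(int count)
{
    DemoPool pool;
    DemoArgs args = { &pool, count };
    pthread_t producer;
    int taken = 0;

    pool.available = 0;
    if (pthread_mutex_init(&pool.mutex, NULL) != 0) {
        return -1;
    }
    if (pthread_cond_init(&pool.cond, NULL) != 0) {
        pthread_mutex_destroy(&pool.mutex);
        return -1;
    }
    if (pthread_create(&producer, NULL, demo_producer, &args) != 0) {
        pthread_cond_destroy(&pool.cond);
        pthread_mutex_destroy(&pool.mutex);
        return -1;
    }

    for (int i = 0; i < count; i++) {
        demo_pool_take(&pool);
        taken++;
    }
    pthread_join(producer, NULL);

    /* Every item given was taken, so nothing is left in the pool. */
    assert(pool.available == 0);

    pthread_cond_destroy(&pool.cond);
    pthread_mutex_destroy(&pool.mutex);
    return taken;
}
```

Calling `demo_run(1000)` from a test program built with `-pthread` should return 1000 without blocking, whichever thread runs first. If the consumer instead tested the counter once outside the mutex and then waited unconditionally, a give that landed between the test and the wait could leave it blocked with no further signal coming.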


Updated by Andreas Herz over 6 years ago

First of all please test it again with current versions, 4.0.6 is rather old.
If you want to submit your patch, please follow the steps at https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Contributing, thanks!


Updated by Andreas Herz almost 4 years ago

Status changed from Feedback to Closed

Hi, we're closing this issue since there have been no further responses.
If you think this issue is still relevant, try to test it again with the
most recent version of suricata and reopen the issue. If you want to
improve the bug report please take a look at
https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Reporting_Bugs


Updated by Simeon Miteff about 3 years ago

I can confirm that this issue is still present in git master in autofp mode. I can reproduce this about 50% of the time with repeated runs on archives of ~660 PCAP files totaling ~600GB input data in each run. I have suricata configured with 4 worker threads (because I found that it slows down rapidly with >4 threads, but that is a separate issue) as follows:

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 1-21,88-109 ]  # Pin to NUMA node 0, avoid CPU 0
    - receive-cpu-set:
        cpu: [ 1-21,88-109 ]  # Pin to NUMA node 0, avoid CPU 0
    - worker-cpu-set:
        cpu: [ 1-21,88-109 ]  # Pin to NUMA node 0, avoid CPU 0
        mode: "balanced" 
        threads: 4

The suricata process with the hung RX thread has only one FD open (for one of the ~660 PCAP input files). The backtrace on the hung thread looks the same every time:

(gdb) info threads
  Id   Target Id                                     Frame
* 1    Thread 0x7f880f338d80 (LWP 8) "Suricata-Main" 0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  2    Thread 0x7f880d73c640 (LWP 9) "RX#01"         0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
  3    Thread 0x7f880cf3b640 (LWP 10) "W#01"         0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
  4    Thread 0x7f8807fff640 (LWP 11) "W#02"         0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
  5    Thread 0x7f88077fe640 (LWP 12) "W#03"         0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
  6    Thread 0x7f8806ffd640 (LWP 13) "W#04"         0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
  7    Thread 0x7f88067fc640 (LWP 14) "FM#01"        0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  8    Thread 0x7f8805ffb640 (LWP 15) "FM#02"        0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  9    Thread 0x7f88057fa640 (LWP 16) "FR#01"        0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  10   Thread 0x7f8804ff9640 (LWP 17) "FR#02"        0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f880d73c640 (LWP 9))]
#0  0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007f880f6e5ba0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2  0x000000000055f734 in PacketPoolWait () at tmqh-packetpool.c:76
#3  0x000000000064b771 in PcapFileDispatch (ptv=ptv@entry=0x7f88088e4340) at source-pcap-file-helper.c:152
#4  0x000000000064aa8c in PcapDirectoryDispatchForTimeRange (older_than=0x7f880d73a710, pv=0x7f8808880c20) at source-pcap-file-directory-helper.c:460
#5  PcapDirectoryDispatch (ptv=0x7f8808880c20) at source-pcap-file-directory-helper.c:520
#6  0x0000000000649cf9 in ReceivePcapFileLoop (tv=<optimized out>, data=0x7f8808878b60, slot=<optimized out>) at source-pcap-file.c:184
#7  0x0000000000562d37 in TmThreadsSlotPktAcqLoop (td=0x1e612e70) at tm-threads.c:310
#8  0x00007f880f6e6802 in start_thread () from /lib64/libc.so.6
#9  0x00007f880f686314 in clone () from /lib64/libc.so.6

I applied Feng Dai's patch above, and so far it appears to resolve the hang. I have only run it 8 times, but I will continue and report if this changes. However, the patched runs take ~3 times longer than successful (non-hanging) runs on unpatched git master, which suggests to me that the patch forces sequential processing.


#10

Updated by Andreas Herz about 3 years ago

Status changed from Closed to In Progress


#11

Updated by Andreas Herz about 3 years ago

Status changed from In Progress to New


#12

Updated by Simeon Miteff about 3 years ago

FYI, I have run another 10 of these 1-day PCAP jobs and:

1. None of them got stuck.
2. The reduction in performance is closer to 1/2 than the 1/3 I reported in the previous note.


#13

Updated by Victor Julien about 3 years ago

@simeonmiteff can you share the exact yaml, build-info output, command line, and rules used? Also, can you share the output of stats.log (or an eve.stats record from the time of the hang)?


#14

Updated by Simeon Miteff almost 3 years ago

File suricata.yaml suricata.yaml added
File build-info.txt build-info.txt added
File statline.json statline.json added

Sorry for the delayed response. I have attached the config yaml (suricata.yaml) and the build info (build-info.txt). Note that I think the hang is reproducible in just about any version.

statline.json contains the last EVE log stats output before the hang, and it also references the PCAP file that was open at the time of the hang.

Command line is of the form: suricata -vvv -k none --pcap-file-recursive -r /path/to/month/containing/daily/subdirs/filled/with/pcaps -S /suricata.rules -c /suricata.yaml -F /ignore-internal-scanners.bpf

I can't share the rules file publicly, but it is OK to share with OISF, so I will send it to you out-of-band.


#15

Updated by Victor Julien over 2 years ago

Related to Bug #5445: RX thread hang in pcap-file mode added


#16

Updated by Simeon Miteff over 2 years ago

Hi Victor and Andreas.

I'm about to resume the kind of PCAP processing that is affected by this bug, so I'm checking in to ask whether there has been any progress on this, or any interest in working on it.


#17

Updated by Victor Julien over 1 year ago

Is this still an issue?


#18

Updated by Ofer Dagan about 1 year ago

Hi,
We are still experiencing the issue.
Is the patch expected to land in a release soon? Is there another solution that does not reduce performance as much?
Performance is important for our use case, so we would prefer a solution that does not hurt it as much.
Not having to apply the patch manually every time would also be really helpful.


#19

Updated by Philippe Antoine 5 months ago

Affected Versions 8.0.0 added
Affected Versions deleted (4.1.4)
