Project

General

Profile

Actions

Bug #3075

open

RX thread hang in pcap-file mode

Added by WenTan Liu over 5 years ago. Updated about 2 months ago.

Status:
New
Priority:
Normal
Assignee:
Target version:
Affected Versions:
Effort:
Difficulty:
Label:

Description

based on suricata4.1.4, RX thread sometimes(always two days) hang, so RX can't read pcap file.

gstack RX_thread_id

#0 0x00007f0a98fe8945 in pthread_cond_wait@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005a19c2 in PacketPoolWait () at tmqh-packetpool.c:155
#2 0x000000000058afdd in PcapFileDispatch (ptv=ptv@entry=0x7f0a8f38f2b0) at source-pcap-file-helper.c:135
#3 0x0000000000588a1fin PcapDirectoryDispatchForTimeRange (older_than=0x7f0a09150a0, pv=0x7f0a8c030e70) at source-pcap-file-directory-helper.c:462
#4 PcapDirectoryDispatch (ptv=0x7f0a8c030e70) at source-pcap-file-directory-helper.c:530
#5 0x00000000005860c6 ReceivePcapFileLoop (tv=<optimized out>, data=0x7f0a8c030db0, slot=<optimized out>) at source-pcap-file.c:177
#6 0x00000000005a5b26 in TmThreadsSlotPktAcqLoop (td=0x9deedc0) at tm-threads.c:356
#4 0x00007f0a98fe4e25 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f0a9869834d in clone () from /lib64/libc.so.6


Files

suricata.yaml (73 KB) suricata.yaml WenTan Liu, 07/08/2019 10:22 AM
suricata.yaml (10.6 KB) suricata.yaml Simeon Miteff, 02/07/2023 02:48 AM
build-info.txt (3.31 KB) build-info.txt Simeon Miteff, 02/07/2023 02:48 AM
statline.json (196 KB) statline.json Simeon Miteff, 02/07/2023 02:48 AM

Related issues 1 (0 open1 closed)

Related to Suricata - Bug #5445: RX thread hang in pcap-file modeRejectedActions
Actions #1

Updated by Andreas Herz over 5 years ago

  • Status changed from New to Feedback
  • Assignee set to OISF Dev
  • Target version changed from 4.1.5 to TBD

Can you give us more details about your setup?
(Linux, NIC, configuration, runmode, parameter)

Actions #2

Updated by WenTan Liu over 5 years ago

1. Centos 7.2
2. NIC Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection
3. runmode <autofp>
4. suricata -c suricata.yaml -r pcap_file_directory --pcap-file-continuous -l log_dir

Actions #3

Updated by Victor Julien over 5 years ago

Are you able to test the current git master? I made some fixes some time ago that might be related.

Actions #4

Updated by Victor Julien over 5 years ago

  • Priority changed from High to Normal
Actions #5

Updated by Victor Julien over 5 years ago

  • Description updated (diff)
Actions #6

Updated by Feng Dai about 5 years ago

I got similar issue in suriata 4.0.6 with loading test of 400Mbps for 20 minutes. The RX thread doesn't receive any more packages.

#0  0x00007fb60082f945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000561bdf192b7a in PacketPoolWait ()
#2  0x0000561bdf178bd5 in ReceivePcapLoop ()
#3  0x0000561bdf1975e7 in TmThreadsSlotPktAcqLoop ()
#4  0x00007fb60082be25 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fb60013e34d in clone () from /lib64/libc.so.6

I got a fix to pass my load test. The pattern to use condition variable was not correct from my opinion. Please review if my fix makes sense. Thanks.

diff -Naurp --exclude tags suricata-4.0.6/src/tmqh-packetpool.c suricata-4.0.6-twutm1605/src/tmqh-packetpool.c
--- suricata-4.0.6/src/tmqh-packetpool.c    2018-11-06 03:01:46.000000000 -0600
+++ suricata-4.0.6-twutm1605/src/tmqh-packetpool.c  2019-09-13 15:41:20.673513665 -0500
@@ -149,10 +149,13 @@ void PacketPoolWait(void)
 {
     PktPool *my_pool = GetThreadPacketPool();

-    if (PacketPoolIsEmpty(my_pool)) {
+    if (!my_pool->head) {
+        /* local stack is empty */
         SCMutexLock(&my_pool->return_stack.mutex);
-        SC_ATOMIC_ADD(my_pool->return_stack.sync_now, 1);
-        SCCondWait(&my_pool->return_stack.cond, &my_pool->return_stack.mutex);
+        while (PacketPoolIsEmpty(my_pool)) {
+            SC_ATOMIC_ADD(my_pool->return_stack.sync_now, 1);
+            SCCondWait(&my_pool->return_stack.cond, &my_pool->return_stack.mutex);
+        }
         SCMutexUnlock(&my_pool->return_stack.mutex);
     }

@@ -323,8 +326,8 @@ void PacketPoolReturnPacket(Packet *p)
                 my_pool->pending_tail->next = pool->return_stack.head;
                 pool->return_stack.head = my_pool->pending_head;
                 SC_ATOMIC_RESET(pool->return_stack.sync_now);
-                SCMutexUnlock(&pool->return_stack.mutex);
                 SCCondSignal(&pool->return_stack.cond);
+                SCMutexUnlock(&pool->return_stack.mutex);
                 /* Clear the list of pending packets to return. */
                 my_pool->pending_pool = NULL;
                 my_pool->pending_head = NULL;
@@ -337,8 +340,8 @@ void PacketPoolReturnPacket(Packet *p)
             p->next = pool->return_stack.head;
             pool->return_stack.head = p;
             SC_ATOMIC_RESET(pool->return_stack.sync_now);
-            SCMutexUnlock(&pool->return_stack.mutex);
             SCCondSignal(&pool->return_stack.cond);
+            SCMutexUnlock(&pool->return_stack.mutex);
         }
     }
 }
@@ -395,8 +398,8 @@ void PacketPoolInit(void)
         PacketPoolStorePacket(p);
     }

-    //SCLogInfo("preallocated %"PRIiMAX" packets. Total memory %"PRIuMAX"",
-    //        max_pending_packets, (uintmax_t)(max_pending_packets*SIZE_OF_PACKET));
+    SCLogInfo("preallocated %"PRIiMAX" packets. Total memory %"PRIuMAX"",
+            max_pending_packets, (uintmax_t)(max_pending_packets*SIZE_OF_PACKET));
 }

 void PacketPoolDestroy(void)

Actions #7

Updated by Andreas Herz about 5 years ago

First of all please test it again with current versions, 4.0.6 is rather old.
If you want to commit your patch please follow those https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Contributing steps, thanks!

Actions #8

Updated by Andreas Herz almost 3 years ago

  • Status changed from Feedback to Closed

Hi, we're closing this issue since there have been no further responses.
If you think this issue is still relevant, try to test it again with the
most recent version of suricata and reopen the issue. If you want to
improve the bug report please take a look at
https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Reporting_Bugs

Actions #9

Updated by Simeon Miteff almost 2 years ago

I can confirm that this issue is still present in git master in autofp mode. I can reproduce this about 50% of the time with repeated runs on archives of ~660 PCAP files totaling ~600GB input data in each run. I have suricata configured with 4 worker threads (because I found that it slows down rapidly with >4 threads, but that is a separate issue) as follows:

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 1-21,88-109 ]  # Pin to NUMA node 0, avoid CPU 0
    - receive-cpu-set:
        cpu: [ 1-21,88-109 ]  # Pin to NUMA node 0, avoid CPU 0
    - worker-cpu-set:
        cpu: [ 1-21,88-109 ]  # Pin to NUMA node 0, avoid CPU 0
        mode: "balanced" 
        threads: 4

The suricata process with hung RX threads have only one FD open (for one of the ~660 PCAP input files). The backtrace on the hung thread looks the same every time:

(gdb) info threads
  Id   Target Id                                     Frame
* 1    Thread 0x7f880f338d80 (LWP 8) "Suricata-Main" 0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  2    Thread 0x7f880d73c640 (LWP 9) "RX#01"         0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
  3    Thread 0x7f880cf3b640 (LWP 10) "W#01"         0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
  4    Thread 0x7f8807fff640 (LWP 11) "W#02"         0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
  5    Thread 0x7f88077fe640 (LWP 12) "W#03"         0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
  6    Thread 0x7f8806ffd640 (LWP 13) "W#04"         0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
  7    Thread 0x7f88067fc640 (LWP 14) "FM#01"        0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  8    Thread 0x7f8805ffb640 (LWP 15) "FM#02"        0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  9    Thread 0x7f88057fa640 (LWP 16) "FR#01"        0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  10   Thread 0x7f8804ff9640 (LWP 17) "FR#02"        0x00007f880f75a845 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f880d73c640 (LWP 9))]
#0  0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f880f6e339a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007f880f6e5ba0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2  0x000000000055f734 in PacketPoolWait () at tmqh-packetpool.c:76
#3  0x000000000064b771 in PcapFileDispatch (ptv=ptv@entry=0x7f88088e4340) at source-pcap-file-helper.c:152
#4  0x000000000064aa8c in PcapDirectoryDispatchForTimeRange (older_than=0x7f880d73a710, pv=0x7f8808880c20) at source-pcap-file-directory-helper.c:460
#5  PcapDirectoryDispatch (ptv=0x7f8808880c20) at source-pcap-file-directory-helper.c:520
#6  0x0000000000649cf9 in ReceivePcapFileLoop (tv=<optimized out>, data=0x7f8808878b60, slot=<optimized out>) at source-pcap-file.c:184
#7  0x0000000000562d37 in TmThreadsSlotPktAcqLoop (td=0x1e612e70) at tm-threads.c:310
#8  0x00007f880f6e6802 in start_thread () from /lib64/libc.so.6
#9  0x00007f880f686314 in clone () from /lib64/libc.so.6

I applied Feng Dai's patch above and so far it appears to resolve the hang (I've only run it 8 times, but I will continue and report if this changes) but the runs are also ~3 times longer than successful (non-hanging) runs on unpatched git master, which suggests to me that the patch forces sequential processing.

Actions #10

Updated by Andreas Herz almost 2 years ago

  • Status changed from Closed to In Progress
Actions #11

Updated by Andreas Herz almost 2 years ago

  • Status changed from In Progress to New
Actions #12

Updated by Simeon Miteff almost 2 years ago

FYI, I have run another 10 of these 1 day PCAP jobs and:
  • None of them got stuck.
  • The reduction in performance is closer to 1/2 rather than the 1/3 I reported in the previous note.
Actions #13

Updated by Victor Julien almost 2 years ago

@Simeon Miteff can you share the exact yaml, build-info output, commandline and rules used? Also, can you share the output of stats.log (or a eve.stats record from the time of the hang)?

Updated by Simeon Miteff almost 2 years ago

Sorry for the delayed response. I attach the config yaml (suricata.yaml), build info in build-info.txt (note that I think the hang is reproducable in just about any version).

The contents of statline.json is the last EVE log stat output before the hang, and also references the PCAP file which is open at the time of the hang.

Command line is of the form: suricata -vvv -k none --pcap-file-recursive -r /path/to/month/containing/daily/subdirs/filled/with/pcaps -S /suricata.rules -c /suricata.yaml -F /ignore-internal-scanners.bpf

I can't share the rules file publically but it is OK to share with OISF, so will send to you out-of-band.

Actions #15

Updated by Victor Julien over 1 year ago

  • Related to Bug #5445: RX thread hang in pcap-file mode added
Actions #16

Updated by Simeon Miteff about 1 year ago

Hi Victor and Andreas.

I'm about to resume the kind of PCAP processing that is affected by this bug, so I'm checking in if there is any progress on this or interest to work on it?

Actions #17

Updated by Victor Julien 5 months ago

Is this still an issue?

Actions #18

Updated by Ofer Dagan about 2 months ago

Hi,
We are still experiencing the issue.
Is the patch expected to enter some version soon? Is there another solution that doesn't decrease performances that much?
We have a use case where performance is important so we prefer a solution that won't hurt it that much.
Also not needed to manually patch the solution every time will be really helpful.

Actions

Also available in: Atom PDF