Support #2618

suricata 4.0.5 + pfring 7.2.0 crashes

Added by Rado Ce over 5 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Affected Versions:
Label:

Description

Hi!

I'm running Suricata 4.0.5 (4.0.4 was also affected) with PF_RING 7.2.0 on CentOS 7 (1804), kernel 3.10.0-862.el7.x86_64, like this:

suricata  -c /etc/suricata/suricata.yaml --pfring-int=zc:99@0 --pfring-int=zc:99@1 \
--pfring-int=zc:99@2 --pfring-int=zc:99@3 --pfring-int=zc:99@4 --pfring-int=zc:99@5 \
--pidfile /run/suricata.pid --runmode=workers

From time to time it just exits, so I've rebuilt it with debugging enabled. Right before the crash there are these log entries:

12/9/2018 -- 14:07:59 - <Debug> - creating thread "US"...
12/9/2018 -- 14:07:59 - <Debug> - US starting
12/9/2018 -- 14:07:59 - <Debug> - error with the counter ids
12/9/2018 -- 14:07:59 - <Debug> - stats_ctx->sts 0x7fa1580008e0
12/9/2018 -- 14:07:59 - <Debug> - US started...
12/9/2018 -- 14:07:59 - <Debug> - error with the counter ids
12/9/2018 -- 14:07:59 - <Debug> - stats_ctx->sts 0x7fa158000910
12/9/2018 -- 14:07:59 - <Error> - [ERRCODE: SC_ERR_THREAD_INIT(49)] - thread "W#01-zc:99@0" failed to initialize: flags 0145
12/9/2018 -- 14:07:59 - <Error> - [ERRCODE: SC_ERR_INITIALIZATION(45)] - Engine initialization failed, aborting...

After this, PF_RING's egress queues are locked, so I have to restart PF_RING and then Suricata. Other applications using other PF_RING queues at that time keep working after this occurs.

This happens sometimes every day, sometimes every two or three days; there's no pattern to it. It also happens regardless of traffic.

Build info log attached.
Thanks in advance
Rado


Files

suricata_build_info.txt (3.2 KB), Rado Ce, 09/12/2018 03:16 PM
suricata04.log (7.57 KB), Rado Ce, 09/13/2018 03:43 PM
Actions #1

Updated by Victor Julien over 5 years ago

What is the output if you disable the debug messages? I suspect there should be an error message before the errors posted above.

Actions #2

Updated by Rado Ce over 5 years ago

Victor Julien wrote:

What is the output if you disable the debug messages? I suspect there should be an error message before the errors posted above.

I've attached a log from another machine with the same problem, which I've cleaned up a bit by removing the Debug messages.
To answer your question: I've enabled the 'Info' log level and now have to wait until it crashes again. But as far as I remember, there wasn't much info in the log when this started happening.

Actions #3

Updated by Victor Julien over 5 years ago

11/9/2018 -- 07:13:45 - <Error> - [ERRCODE: SC_ERR_PF_RING_OPEN(34)] - Failed to open zc:27@0: pfring_open error. Check if zc:27@0 exists and pf_ring module is loaded.

This is what I was after. Do you have the module loaded? Anything from pfring in the dmesg?
Actions #4

Updated by Rado Ce over 5 years ago

dmesg is clean. The PF_RING interface is divided into 32 egress queues on which some other applications are running (for example Bro), and after Suricata crashes (this happens after a day, two, or three of working normally) the rest of the applications and PF_RING are still working.

Actions #5

Updated by Victor Julien over 5 years ago

So this happens with running Suricata, not a (re)start of Suricata?

Actions #6

Updated by Rado Ce over 5 years ago

Victor Julien wrote:

So this happens with running Suricata, not a (re)start of Suricata?

Yes, this happens while it's running.

Actions #7

Updated by Victor Julien over 5 years ago

The outputs you've shown both show signs of this happening at startup. Do you perhaps have some watchdog that automatically restarts suricata after a crash or after a rule update or something?

Of course this shouldn't happen at startup either. I wonder if this could happen if another Suricata instance were still running at the same time?

Actions #8

Updated by Rado Ce over 5 years ago

Victor Julien wrote:

The outputs you've shown both show signs of this happening at startup. Do you perhaps have some watchdog that automatically restarts suricata after a crash or after a rule update or something?

Of course this shouldn't happen at startup either. I wonder if this could happen if another Suricata instance were still running at the same time?

Brilliant! It's logrotate:

/var/log/suricata/*.log {
        missingok
        rotate 120
        compress
        delaycompress
        create 0644 suricata suricata
        size 500M
        postrotate
                kill -HUP $(cat /var/run/suricata.pid)
                #kill -HUP $(cat /var/run/rsyslogd.pid)
        endscript
}

But the question remains: why does this happen so randomly?

Actions #9

Updated by Rado Ce over 5 years ago

Unfortunately it wasn't logrotate's fault. I've run Suricata for a few days under gdb and captured the fault:

Reading symbols from /usr/sbin/suricata...Reading symbols from /usr/lib/debug/usr/sbin/suricata.debug...done.
done.
Starting program: /sbin/suricata -c /etc/suricata/suricata.yaml --pidfile /var/run/suricata.pid --pfring-int=zc:27@0 --pfring-int=zc:27@1 --pfring-int=zc:27@2 --pfring-int=zc:27@3 --pfring-int=zc:27@4 --pfring-int=zc:27@5 --pidfile /run/suricata.pid --runmode=workers
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
5/10/2018 -- 12:23:39 - <Info> - Including configuration file /etc/suricata/suricata_output.yaml.
5/10/2018 -- 12:23:39 - <Info> - Including configuration file /etc/suricata/rules/ATAsuricata.yaml.
5/10/2018 -- 12:23:39 - <Info> - Configuration node 'vlan' redefined.
5/10/2018 -- 12:23:39 - <Info> - Configuration node 'default-rule-path' redefined.
5/10/2018 -- 12:23:39 - <Info> - Configuration node 'rule-files' redefined.
5/10/2018 -- 12:23:39 - <Info> - Configuration node 'classification-file' redefined.
5/10/2018 -- 12:23:39 - <Info> - Configuration node 'reference-config-file' redefined.
5/10/2018 -- 12:23:39 - <Info> - Configuration node 'vars' redefined.
[New Thread 0x7ffff1346700 (LWP 45966)]
[New Thread 0x7fffebfff700 (LWP 45971)]
[New Thread 0x7fffeb7fe700 (LWP 45984)]
[New Thread 0x7fffeaffd700 (LWP 45989)]
[New Thread 0x7fffea7fc700 (LWP 45994)]
[New Thread 0x7fffe9ffb700 (LWP 45999)]
[New Thread 0x7fffe97fa700 (LWP 46004)]
[New Thread 0x7fffe8ff9700 (LWP 46005)]
[New Thread 0x7fffb7fff700 (LWP 46006)]
[New Thread 0x7fffb77fe700 (LWP 46007)]
[New Thread 0x7fffb6ffd700 (LWP 46008)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffeb7fe700 (LWP 45984)]
0x00007ffff3a2bdd6 in __memcpy_ssse3_back () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install GeoIP-1.5.0-11.el7.x86_64 file-libs-5.11-33.el7.x86_64 glibc-2.17-222.el7.x86_64 hyperscan-4.7.0-1.ATA3.x86_64 jansson-2.10-1.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libgcc-4.8.5-28.el7.x86_64 libmnl-1.0.3-7.el7.x86_64 libnet-1.1.6-7.el7.x86_64 libnetfilter_queue-1.0.2-2.el7_2.x86_64 libnfnetlink-1.0.1-4.el7.x86_64 libstdc++-4.8.5-28.el7.x86_64 libyaml-0.1.4-11.el7_0.x86_64 nspr-4.17.0-1.el7.x86_64 nss-3.34.0-4.el7.x86_64 nss-softokn-3.34.0-2.el7.x86_64 nss-softokn-freebl-3.34.0-2.el7.x86_64 nss-util-3.34.0-2.el7.x86_64 pcre-8.32-17.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 zlib-1.2.7-17.el7.x86_64

(gdb) bt
#0  0x00007ffff3a2bdd6 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x00005555555dd30b in memcpy (__len=4294966022, __src=<optimized out>, __dest=<optimized out>)
    at /usr/include/bits/string3.h:51
#2  DefragInsertFrag (tv=tv@entry=0x55555a1ed4c0, dtv=dtv@entry=0x7fffdd1610c0,
    tracker=tracker@entry=0x55555690d450, p=p@entry=0x7fffdd15d4a0, pq=pq@entry=0x55555bf656a0) at defrag.c:790
#3  0x00005555555ddfe3 in Defrag (tv=tv@entry=0x55555a1ed4c0, dtv=dtv@entry=0x7fffdd1610c0,
    p=p@entry=0x7fffdd15d4a0, pq=pq@entry=0x55555bf656a0) at defrag.c:1006
#4  0x00005555555d95a3 in DecodeIPV4 (tv=tv@entry=0x55555a1ed4c0, dtv=dtv@entry=0x7fffdd1610c0,
    p=p@entry=0x7fffdd15d4a0, pkt=pkt@entry=0x2aaad2b1ad52 "E", len=<optimized out>, pq=0x55555bf656a0)
    at decode-ipv4.c:549
#5  0x00005555555dbcdd in DecodeVLAN (tv=tv@entry=0x55555a1ed4c0, dtv=dtv@entry=0x7fffdd1610c0,
    p=p@entry=0x7fffdd15d4a0, pkt=pkt@entry=0x2aaad2b1ad4e "\001\271\b", len=len@entry=88,
    pq=pq@entry=0x55555bf656a0) at decode-vlan.c:98
#6  0x00005555555d73c3 in DecodeEthernet (tv=tv@entry=0x55555a1ed4c0, dtv=dtv@entry=0x7fffdd1610c0,
    p=p@entry=0x7fffdd15d4a0, pkt=0x2aaad2b1ad40 "", len=<optimized out>, pq=pq@entry=0x55555bf656a0)
    at decode-ethernet.c:80
#7  0x00005555556ea698 in DecodePfring (tv=0x55555a1ed4c0, p=0x7fffdd15d4a0, data=0x7fffdd1610c0,
    pq=0x55555bf656a0, postpq=<optimized out>) at source-pfring.c:694
#8  0x0000555555702061 in TmThreadsSlotVarRun (tv=tv@entry=0x55555a1ed4c0, p=p@entry=0x7fffdd15d4a0,
    slot=slot@entry=0x55555bf65660) at tm-threads.c:130
#9  0x00005555556eac3a in TmThreadsSlotProcessPkt (p=0x7fffdd15d4a0, s=0x55555bf65660, tv=0x55555a1ed4c0)
    at tm-threads.h:147
#10 ReceivePfringLoop (tv=0x55555a1ed4c0, data=0x7fffdd1608c0, slot=<optimized out>) at source-pfring.c:386
#11 0x0000555555704ed7 in TmThreadsSlotPktAcqLoop (td=0x55555a1ed4c0) at tm-threads.c:334
#12 0x00007ffff40c4e25 in start_thread () from /lib64/libpthread.so.0
#13 0x00007ffff39d4bad in clone () from /lib64/libc.so.6
Actions #10

Updated by Victor Julien over 5 years ago

Do you still have this open or did you save the core?

If so, could you please do:

f 2
print ltrim
print *p
print *new

and post the output of each print command?

Actions #11

Updated by Rado Ce over 5 years ago

Here it is:

(gdb) f 2
#2  DefragInsertFrag (tv=tv@entry=0x55555a1ed4c0, dtv=dtv@entry=0x7fffdd1610c0,
    tracker=tracker@entry=0x55555690d450, p=p@entry=0x7fffdd15d4a0, pq=pq@entry=0x55555bf656a0) at defrag.c:790
790         memcpy(new->pkt, GET_PKT_DATA(p) + ltrim, GET_PKT_LEN(p) - ltrim);

(gdb) print ltrim
$1 = 1376

(gdb) print *p
$2 = {src = {family = 2 '\002', address = {address_un_data32 = {3743224236, 0, 0, 0}, address_un_data16 = {4524,
        57117, 0, 0, 0, 0, 0, 0}, address_un_data8 = "\254\021\035\337", '\000' <repeats 11 times>,
      address_un_in6 = {__in6_u = {__u6_addr8 = "\254\021\035\337", '\000' <repeats 11 times>, __u6_addr16 = {
            4524, 57117, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {3743224236, 0, 0, 0}}}}}, dst = {family = 2 '\002',
    address = {address_un_data32 = {2911800644, 0, 0, 0}, address_un_data16 = {36164, 44430, 0, 0, 0, 0, 0, 0},
      address_un_data8 = "D\215\216\255", '\000' <repeats 11 times>, address_un_in6 = {__in6_u = {
          __u6_addr8 = "D\215\216\255", '\000' <repeats 11 times>, __u6_addr16 = {36164, 44430, 0, 0, 0, 0, 0,
            0}, __u6_addr32 = {2911800644, 0, 0, 0}}}}}, {sp = 0, type = 0 '\000'}, {dp = 0, code = 0 '\000'},
  proto = 17 '\021', recursion_level = 0 '\000', vlan_id = {0, 0}, vlan_idx = 1 '\001', flowflags = 0 '\000',
  flags = 98304, flow = 0x0, flow_hash = 66984416, ts = {tv_sec = 1538834646, tv_usec = 663113}, {nfq_v = {
      id = 0, nfq_index = 0, verdicted = 0 '\000', mark = 0, ifi = 0, ifo = 0, hw_protocol = 0}, afp_v = {
      relptr = 0x0, peer = 0x0, mpeer = 0x0, copy_mode = 0 '\000'}, pcap_v = {tenant_id = 0}},
  ReleasePacket = 0x5555556fff00 <PacketPoolReturnPacket>, BypassPacketsFlow = 0x0, pktvar = 0x0,
  ethh = 0x2aaad2b1ad40, level3_comp_csum = -1, level4_comp_csum = -1, ip4h = 0x2aaad2b1ad52, ip6h = 0x0, {
    ip4vars = {comp_csum = 0, opt_cnt = 0, opts_set = 0}, {ip6vars = {ip_opts_len = 0 '\000',
        l4proto = 0 '\000'}, ip6eh = {rh_set = false, rh_type = 0 '\000', fh_set = false,
        fh_more_frags_set = false, fh_nh = 0 '\000', fh_prev_nh = 0 '\000', fh_prev_hdr_offset = 0,
        fh_header_offset = 0, fh_data_offset = 0, fh_data_len = 0, fh_offset = 0, fh_id = 0}}}, l4vars = {
    tcpvars = {ts_set = false, ts_val = 0, ts_ecr = 0, sack = {type = 0 '\000', len = 0 '\000', data = 0x0},
      sackok = {type = 0 '\000', len = 0 '\000', data = 0x0}, ws = {type = 0 '\000', len = 0 '\000', data = 0x0},
      mss = {type = 0 '\000', len = 0 '\000', data = 0x0}}, icmpv4vars = {id = 0, seq = 0, emb_ipv4h = 0x0,
      emb_tcph = 0x0, emb_udph = 0x0, emb_icmpv4h = 0x0, emb_ip4_src = {s_addr = 0}, emb_ip4_dst = {s_addr = 0},
      emb_ip4_hlen = 0 '\000', emb_ip4_proto = 0 '\000', emb_sport = 0, emb_dport = 0}, icmpv6vars = {id = 0,
      seq = 0, mtu = 0, error_ptr = 0, emb_ipv6h = 0x0, emb_tcph = 0x0, emb_udph = 0x0, emb_icmpv6h = 0x0,
      emb_ip6_src = {0, 0, 0, 0}, emb_ip6_dst = {0, 0, 0, 0}, emb_ip6_proto_next = 0 '\000', emb_sport = 0,
      emb_dport = 0}}, tcph = 0x0, udph = 0x0, sctph = 0x0, icmpv4h = 0x0, icmpv6h = 0x0, ppph = 0x0,
  pppoesh = 0x0, pppoedh = 0x0, greh = 0x0, vlanh = {0x2aaad2b1ad4e, 0x0}, payload = 0x0, payload_len = 0,
  action = 0 '\000', pkt_src = 1 '\001', pktlen = 102, ext_pkt = 0x2aaad2b1ad40 "", livedev = 0x555555ff9470,
  alerts = {cnt = 0, alerts = {{num = 7242, action = 1 '\001', flags = 10 '\n', s = 0x55555c81fd50, tx_id = 0}, {
        num = 7242, action = 1 '\001', flags = 10 '\n', s = 0x55555c81fd50, tx_id = 0}, {num = 0,
        action = 0 '\000', flags = 0 '\000', s = 0x0, tx_id = 0} <repeats 13 times>}, drop = {num = 0,
      action = 0 '\000', flags = 0 '\000', s = 0x0, tx_id = 0}}, host_src = 0x0, host_dst = 0x0, pcap_cnt = 0,
  events = {cnt = 0 '\000', events = "\003\217\223\233\227\000\000\000\000\000\000\000\000\000"},
  app_layer_events = 0x7fffa8108d00, next = 0x0, prev = 0x0, datalink = 1, root = 0x0, tunnel_mutex = {__data = {
      __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {
        __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, tunnel_rtv_cnt = 0,
  tunnel_tpr_cnt = 0, tenant_id = 0, pool = 0x7fffeb7fe540, profile = 0x7fffd32391e0}

(gdb) print *new
$3 = {offset = 0, len = 0, hlen = 0 '\000', more_frags = 0 '\000', skip = 0 '\000', ip_hdr_offset = 0,
  frag_hdr_offset = 0, data_offset = 0, data_len = 0, ltrim = 0, pkt = 0x7fffd398d810 "\330", next = {
    tqe_next = 0x0, tqe_prev = 0x0}}

Actions #12

Updated by Victor Julien over 5 years ago

Thanks, could you also try:

print *prev
print *next
print *tracker
print hlen
print data_offset
print data_len
print frag_end
print ip_hdr_offset
print frag_hdr_offset
print af
print ip6_nh_set_offset
print ip6_nh_set_value

Actions #13

Updated by Rado Ce over 5 years ago

Unfortunately most of them are optimized out:

(gdb) print *prev
value has been optimized out
(gdb) print *next
value has been optimized out
(gdb) print *tracker
$4 = {lock = {__data = {__lock = 1, __count = 0, __owner = 45984, __nusers = 1, __kind = 0, __spins = 0,
      __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = "\001\000\000\000\000\000\000\000\240\263\000\000\001", '\000' <repeats 26 times>, __align = 1},
  vlan_id = {0, 0}, id = 11656, proto = 17 '\021', policy = 3 '\003', af = 2 '\002', seen_last = 0 '\000',
  remove = 0 '\000', src_addr = {family = 2 '\002', address = {address_un_data32 = {3743224236, 0, 0, 0},
      address_un_data16 = {4524, 57117, 0, 0, 0, 0, 0, 0},
      address_un_data8 = "\254\021\035\337", '\000' <repeats 11 times>, address_un_in6 = {__in6_u = {
          __u6_addr8 = "\254\021\035\337", '\000' <repeats 11 times>, __u6_addr16 = {4524, 57117, 0, 0, 0, 0, 0,
            0}, __u6_addr32 = {3743224236, 0, 0, 0}}}}}, dst_addr = {family = 2 '\002', address = {
      address_un_data32 = {2911800644, 0, 0, 0}, address_un_data16 = {36164, 44430, 0, 0, 0, 0, 0, 0},
      address_un_data8 = "D\215\216\255", '\000' <repeats 11 times>, address_un_in6 = {__in6_u = {
          __u6_addr8 = "D\215\216\255", '\000' <repeats 11 times>, __u6_addr16 = {36164, 44430, 0, 0, 0, 0, 0,
            0}, __u6_addr32 = {2911800644, 0, 0, 0}}}}}, timeout = {tv_sec = 1538834946, tv_usec = 663113},
  host_timeout = 300, use_cnt_sc_atomic__ = 1, frags = {tqh_first = 0x7ffff276d370, tqh_last = 0x7ffff276d388},
  hnext = 0x0, hprev = 0x0, lnext = 0x0, lprev = 0x0}
(gdb) print hlen
$5 = <optimized out>
(gdb) print data_offset
$6 = <optimized out>
(gdb) print data_len
$7 = 1400
(gdb) print frag_end
$8 = <optimized out>
(gdb) print ip_hdr_offset
$9 = <optimized out>
(gdb) print frag_hdr_offset
$10 = <optimized out>
(gdb) print af
$11 = <optimized out>
(gdb) print ip6_nh_set_offset
$12 = <optimized out>
(gdb) print ip6_nh_set_value
$13 = <optimized out>
(gdb)
Actions #14

Updated by Victor Julien over 5 years ago

Thanks again. One more request for now:

print *p->ip4h

Thanks!

Actions #15

Updated by Victor Julien over 5 years ago

Some notes for analysing this:

PKT_ZERO_COPY is set on the packet. PF_RING code sets p->pktlen twice for zero copy, both to 'struct pfring_pkthdr::caplen'. It's 102 bytes in this case.

Defrag reports a 'data_len' of 1400, which is set using 'data_len = IPV4_GET_IPLEN(p) - hlen;'

This mismatch between caplen and the IP header fields should have been caught by the IPv4 decoder. However, p->events shows no indication of this.

Actions #16

Updated by Rado Ce over 5 years ago

(gdb) print *p->ip4h
$15 = {ip_verhl = 69 'E', ip_tos = 0 '\000', ip_len = 35845, ip_id = 10819, ip_off = 64, ip_ttl = 255 '\377',
  ip_proto = 6 '\006', ip_csum = 1774, ip4_hdrun1 = {ip4_un1 = {ip_src = {s_addr = 809446410}, ip_dst = {
        s_addr = 3572566538}}, ip_addrs = {11274, 12351, 2570, 54513}}}
Actions #17

Updated by Victor Julien over 5 years ago

One more request:

f 10
print *hdr
print *ptv->pd

Actions #18

Updated by Rado Ce over 5 years ago

(gdb) f 10
#10 ReceivePfringLoop (tv=0x55555a1ed4c0, data=0x7fffdd1608c0, slot=<optimized out>) at source-pfring.c:386
386                 if (TmThreadsSlotProcessPkt(ptv->tv, ptv->slot, p) != TM_ECODE_OK) {

(gdb) print *hdr
Structure has no component named operator*.

(gdb) print *ptv->pd
$17 = {initialized = 1 '\001', enabled = 1 '\001', long_header = 1 '\001', force_timestamp = 0 '\000',
  strip_hw_timestamp = 0 '\000', disable_parsing = 0 '\000', disable_timestamp = 0 '\000',
  ixia_timestamp_enabled = 0 '\000', vss_apcon_timestamp_enabled = 0 '\000', chunk_mode_enabled = 0 '\000',
  userspace_bpf = 0 '\000', force_userspace_bpf = 0 '\000', rss_mode = 0, direction = rx_only_direction,
  mode = send_and_recv_mode, userspace_bpf_filter = {bf_len = 0, bf_insns = 0x0}, hw_ts = {
    force_timestamp = 0 '\000', is_silicom_hw_timestamp_card = 0 '\000', enable_hw_timestamp = 0 '\000',
    last_hw_timestamp_head_offset = 0 '\000', last_hw_timestamp = {tv_sec = 0, tv_nsec = 0}}, tx = {
    enabled_rx_packet_send = 0 '\000', last_received_hdr = 0x0}, zc_device = 0 '\000',
  priv_data = 0x7fffdd160a70, close = 0x7ffff6be2940 <pfring_mod_zc_spsc_close>,
  stats = 0x7ffff6be2580 <pfring_mod_zc_spsc_stats>, recv = 0x7ffff6be26b0 <pfring_mod_zc_spsc_recv>,
  set_poll_watermark = 0x0, set_poll_watermark_timeout = 0x0, set_poll_duration = 0x0, set_tx_watermark = 0x0,
  set_channel_id = 0x0, set_channel_mask = 0x0,
  set_application_name = 0x7ffff6bd5760 <pfring_mod_set_application_name>,
  set_application_stats = 0x7ffff6bd57a0 <pfring_mod_set_application_stats>,
  get_appl_stats_file_name = 0x7ffff6bd5820 <pfring_mod_get_appl_stats_file_name>, set_vlan_id = 0x0, bind = 0x0,
  send = 0x7ffff6be28f0 <pfring_mod_zc_spsc_send>, send_get_time = 0x0, get_num_rx_channels = 0x0,
  get_card_settings = 0x0, set_sampling_rate = 0x0, set_packet_slicing = 0x0,
  get_selectable_fd = 0x7ffff6be2520 <pfring_mod_zc_spsc_get_selectable_fd>,
  set_direction = 0x7ffff6be2410 <pfring_mod_zc_spsc_set_direction>,
  set_socket_mode = 0x7ffff6be2420 <pfring_mod_zc_spsc_set_socket_mode>, set_cluster = 0x0,
  remove_from_cluster = 0x0, set_master_id = 0x0, set_master = 0x0, get_ring_id = 0x0, get_num_queued_pkts = 0x0,
  get_hash_filtering_rule_stats = 0x0, handle_hash_filtering_rule = 0x0, purge_idle_hash_rules = 0x0,
  add_filtering_rule = 0x0, remove_filtering_rule = 0x0, purge_idle_rules = 0x0, get_filtering_rule_stats = 0x0,
  toggle_filtering_policy = 0x0, enable_rss_rehash = 0x0, poll = 0x7ffff6be25f0 <pfring_mod_zc_spsc_poll>,
  is_pkt_available = 0x7ffff6be2480 <pfring_mod_zc_spsc_is_pkt_available>, next_pkt_time = 0x0,
  next_pkt_raw_timestamp = 0x0, version = 0x0, get_bound_device_address = 0x0, get_bound_device_ifindex = 0x0,
  get_device_ifindex = 0x0, get_slot_header_len = 0x0, set_virtual_device = 0x0, add_hw_rule = 0x0,
  remove_hw_rule = 0x0, loopback_test = 0x0, enable_ring = 0x7ffff6be2450 <pfring_mod_zc_spsc_enable_ring>,
  disable_ring = 0x7ffff6be2460 <pfring_mod_zc_spsc_disable_ring>, shutdown = 0x0,
  set_bpf_filter = 0x7ffff6be29a0 <pfring_mod_zc_spsc_set_bpf_filter>, remove_bpf_filter = 0x0,
  get_device_clock = 0x0, set_device_clock = 0x0, adjust_device_clock = 0x0,
  sync_indexes_with_kernel = 0x7ffff6be25c0 <pfring_mod_zc_spsc_sync_rx>, send_last_rx_packet = 0x0,
  flush_tx_packets = 0x7ffff6be25d0 <pfring_mod_zc_spsc_sync_tx>, register_zerocopy_tx_ring = 0x0,
  recv_chunk = 0x0, set_bound_dev_name = 0x0, get_metadata = 0x7ffff6be23f0 <pfring_mod_zc_spsc_get_metadata>,
  get_interface_speed = 0x7ffff6be2470 <pfring_mod_zc_spsc_get_interface_speed>, rdi = {device_id = -1 '\377',
    port_id = -1 '\377'}, ft_mode = software_only, ft_device_type = standard_nic_family, buffer = 0x0,
  slots = 0x0, device_name = 0x7fffdd160950 "27", caplen = 1522, slot_header_len = 0, mtu = 9000,
  sampling_rate = 0, sampling_counter = 0, slicing_level = FULL_PACKET_SLICING, slicing_additional_bytes = 0,
  is_shutting_down = 0 '\000', socket_default_accept_policy = 1 '\001', fd = 17, device_id = 0, slots_info = 0x0,
  poll_sleep = 0, poll_duration = 0, promisc = 1 '\001', ft_enabled = 0 '\000', reentrant = 0 '\000',
  break_recv_loop = 0 '\000', num_poll_calls = 0, rx_lock = {__data = {__lock = 0, __nr_readers = 0,
      __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 0, __writer = 0,
      __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0},
  tx_lock = {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 0,
      __nr_readers_queued = 0, __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0,
      __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, flags = 12, ft = 0x0, sock_tx = {
    sll_family = 0, sll_protocol = 0, sll_ifindex = 0, sll_hatype = 0, sll_pkttype = 0 '\000',
    sll_halen = 0 '\000', sll_addr = "\000\000\000\000\000\000\000"}, reflector_socket = 0x0,
  one_copy_rx_pfring = 0x0}
Actions #19

Updated by Victor Julien over 5 years ago

Ah sorry, that should have been:

print hdr

(w/o the *) Could you still add that too?

Actions #20

Updated by Rado Ce over 5 years ago

Sure, here you go:

(gdb) print hdr
$18 = {ts = {tv_sec = 1538834646, tv_usec = 663113}, caplen = 102, len = 102, extended_hdr = {timestamp_ns = 0,
    flags = 0, rx_direction = 1 '\001', if_index = 16777301, pkt_hash = 1651439071, tx = {bounce_interface = 0,
      reserved = 0xffdc000020000100}, parsed_pkt = {dmac = "\177\000\000(\000", smac = "\000\000\000\000\000 ",
      eth_type = 32716, vlan_id = 65515, qinq_vlan_id = 127, ip_version = 0 '\000', l3_proto = 88 'X',
      ip_tos = 213 '\325', ip_src = {v6 = {__in6_u = {__u6_addr8 = "\036ZUU\000\000@\ry`UU\000\000\000\347",
            __u6_addr16 = {23070, 21845, 0, 3392, 24697, 21845, 0, 59136}, __u6_addr32 = {1431656990, 222298112,
              1431658617, 3875536896}}}, v4 = 1431656990}, ip_dst = {v6 = {__in6_u = {
            __u6_addr8 = "\177\353\377\177\000\000L\270\225\363\377\177\000\000\000", __u6_addr16 = {60287,
              32767, 0, 47180, 62357, 32767, 0, 0}, __u6_addr32 = {2147478399, 3091988480, 2147480469, 0}}},
        v4 = 2147478399}, l4_src_port = 0, l4_dst_port = 0, icmp_type = 0 '\000', icmp_code = 0 '\000', tcp = {
        flags = 0 '\000', seq_num = 0, ack_num = 3842180960}, tunnel = {tunnel_id = 32767,
        tunneled_ip_version = 33 '!', tunneled_proto = 49 '1', tunneled_ip_src = {v6 = {__in6_u = {
              __u6_addr8 = "]UUU\000\000\310\000\000\000\000\000\000\000\tJ", __u6_addr16 = {21853, 21845, 0,
                200, 0, 0, 0, 18953}, __u6_addr32 = {1431655773, 13107200, 0, 1242103808}}}, v4 = 1431655773},
        tunneled_ip_dst = {v6 = {__in6_u = {__u6_addr8 = "]UUU\000\000\000\000\177\353\377\177\000\000\300\331",
              __u6_addr16 = {21853, 21845, 0, 0, 60287, 32767, 0, 55744}, __u6_addr32 = {1431655773, 0,
                2147478399, 3653238784}}}, v4 = 1431655773}, tunneled_l4_src_port = 21984,
        tunneled_l4_dst_port = 21845}, last_matched_rule_id = 0, offset = {eth_offset = 0, vlan_offset = 0,
        l3_offset = 0, l4_offset = 512, payload_offset = 7}}}}
Actions #21

Updated by Victor Julien over 5 years ago

IPv4 HLEN 20, IPLEN 1420.

Pointer offsets vs ext_pkt:

ethh: 0
vlanh0: 14
ipv4h: 18

These look reasonable. ext_data is set to the raw packet by PacketSetData.

Actions #22

Updated by Victor Julien over 5 years ago

I asked Alfredo, one of the PF_RING developers, to have a look. He said: "I see that the packet len in the PF_RING packet header is 102 (caplen and len), while the IPLEN computed from the packet headers is 1420. This should definitely not happen, especially with ZC, as there is no snaplen: it always captures full packets.

Since the reporter is using zero-copy distribution (with zbalance_ipc), I would not exclude that some other application is corrupting the packet. I would ask the reporter to provide the zbalance_ipc configuration, stop all other applications if any, and try to reproduce the crash again."

Can you provide that config and run his suggested test?

Actions #23

Updated by Rado Ce over 5 years ago

Thank you for looking into this! The second suspect after logrotate's HUP was Bro running on the same zbalance_ipc cluster, so there might be something to it.

zbalance_ipc config file:

-c=27
-i=zc:eth3
-n=6,21,4,1
-m=1
-P=/var/run/cluster-27.pid
-p
-g=0,1
-q=16384
-b=32
-l=/var/log/zbalance.log

The apps on the egress queues are:
6 - suricata
21 - bro 2.5.4 with pf_ring input plugin
4 - tcpdump from pf_ring
1 - yaf

This crash occurs randomly and sometimes takes a few days to appear; it might not be possible to turn off all the other applications, as this is a production site. Is there any other way?
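For reference, the option-per-line file above corresponds roughly to the following zbalance_ipc invocation (a reconstruction from the config, so treat the flag meanings as an assumption and check them against your PF_RING version; the comma-separated `-n` list carves the cluster into per-application queue groups, which is why Suricata attaches to zc:27@0 through zc:27@5):

```shell
# Hypothetical command-line equivalent of the config file above
# (reconstructed; verify against your PF_RING version before use):
#   -c 27         ZC cluster id (matches the zc:27@N queue names)
#   -i zc:eth3    ingress interface
#   -n 6,21,4,1   queue groups: 6 suricata, 21 bro, 4 tcpdump, 1 yaf
#   -m 1          distribution/hash mode
#   -g 0,1        CPU core binding
#   -q 16384      per-queue length
zbalance_ipc -c 27 -i zc:eth3 -n 6,21,4,1 -m 1 \
    -P /var/run/cluster-27.pid -p -g 0,1 -q 16384 -b 32 \
    -l /var/log/zbalance.log
```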

Actions #24

Updated by Victor Julien over 5 years ago

The feedback I got from the PF_RING folks is that any of the tools could be to blame, so it's impossible to isolate whether this is a Suricata issue without stopping the other tools.

Actions #25

Updated by Rado Ce over 5 years ago

Yeah, I thought so. Anyway, I'll arrange to take it out of production and get back to you with the results. Thanks

Actions #26

Updated by Victor Julien about 5 years ago

Have you been able to look at this?

Actions #27

Updated by Andreas Herz almost 5 years ago

  • Assignee set to Rado Ce
  • Target version set to TBD
Actions #28

Updated by Victor Julien almost 5 years ago

  • Status changed from New to Feedback
Actions #29

Updated by Andreas Herz over 4 years ago

@Rado did you have time to look into this?

Actions #30

Updated by Victor Julien over 4 years ago

  • Tracker changed from Bug to Support
  • Status changed from Feedback to Closed
  • Assignee deleted (Rado Ce)
  • Target version deleted (TBD)