Support #1781

fast.log stops updating

Added by Jon Zeolla almost 8 years ago. Updated almost 8 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-

Description

Several hours after starting my Suricata service, fast.log stops being appended to without the service crashing, and it stays that way for hours until I do a manual service restart. I've attached some relevant information, but because the timing is unpredictable (anywhere between 7 and 24 hours after service start) it's hard for me to capture a gdb session; I am working on that now. There is no core dump and the service stays up; fast.log simply stops being written to. stats.log continues to receive updates, but all of the values are static until the instance is restarted.

The service is not failing outright, which makes it more difficult to monitor, maintain, and debug than if it simply crashed and core dumped. Some additional details are below, and I'd be happy to gather more as requested. I also idle in #suricata as jzeolla if it's easier to chat there and then post the results of any conversation in this thread.

Dell PowerEdge R730
- 12 cores * 2 processors w/HT (48 total)
- 192GB RAM
- 300GB usable disk space, with usage < 30%

Ubuntu 14.04.4 LTS, x86_64
- 3.13.0-85-generic

Suricata 3.0.1 RELEASE
- See attached "build-info.txt" for the output of `suricata --build-info`
- See attached "sanitized suricata.yaml" for the related suricata.yaml
- See attached "sanitized suricata.log" for a not-yet-crashed log file (as far as I can tell, when this issue hits, nothing is written to suricata.log until the service gets restarted).


Files

sanitized suricata.yaml (4 KB) sanitized suricata.yaml Jon Zeolla, 05/09/2016 10:47 AM
build-info.txt (2.82 KB) build-info.txt Jon Zeolla, 05/09/2016 11:53 AM
sanitized suricata.log (20.1 KB) sanitized suricata.log Jon Zeolla, 05/09/2016 11:55 AM
stats.log (3.72 MB) stats.log Jon Zeolla, 05/09/2016 12:51 PM
sanitized suricata 2.yaml (4 KB) sanitized suricata 2.yaml Jon Zeolla, 05/09/2016 01:06 PM
stats.log (1.63 MB) stats.log Jon Zeolla, 05/14/2016 09:58 AM
sanitized_suricata_2.log (21.1 KB) sanitized_suricata_2.log Jon Zeolla, 05/18/2016 07:12 PM
stats.log (1.06 MB) stats.log Jon Zeolla, 05/19/2016 06:06 AM
Actions #1

Updated by Victor Julien almost 8 years ago

Can you add stats.log records?

Actions #2

Updated by Jon Zeolla almost 8 years ago

Here's yesterday's log; you can see the numbers freeze starting at 5/8/2016 -- 13:47:29.

Actions #3

Updated by Victor Julien almost 8 years ago

The tcp.reassembly_gap counter is really high: 23M gaps vs 42M sessions. This means a great number of your TCP sessions are not fully tracked. The cause appears to be the memcap:

tcp.reassembly_memuse | Total | 12884901882

This means all 12GB is in use, leading to many segments not being considered for reassembly:

tcp.segment_memcap_drop | Total | 233438924
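
For reference, the cap being hit here is the reassembly memcap in the stream section of suricata.yaml. A minimal sketch of where it lives, with the 12gb figure matching the memuse counter above (the layout follows the stock config, not the attached file):

  stream:
    reassembly:
      memcap: 12gb    # total memory allowed for TCP stream reassembly; once
                      # exhausted, new segments are dropped from reassembly
                      # (counted as tcp.segment_memcap_drop)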

Actions #4

Updated by Jon Zeolla almost 8 years ago

Good catch - that was a leftover from before we added more RAM to the box (previously we were using 100% of our memory). I just increased the memcap from 12gb to 30gb and it's back up and running now. The new suricata.yaml is attached.

Actions #5

Updated by Victor Julien almost 8 years ago

  • Tracker changed from Bug to Support

Actions #6

Updated by Jon Zeolla almost 8 years ago

I'm still encountering the same issue. Here's a new stats.log, with the issue becoming apparent at 01:54:54. I've modified suricata.yaml to prune TCP connections more aggressively, and hopefully that will help - any other suggestions?
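
For context, one common way to prune TCP flows more aggressively is to shorten the TCP entries in the flow-timeouts section of suricata.yaml. The sketch below only illustrates that kind of change; the values are assumptions, not the ones actually applied here:

  flow-timeouts:
    tcp:
      new: 30                    # seconds before a not-yet-established flow is pruned
      established: 300           # much shorter than the stock default
      closed: 10
      emergency-new: 10          # tighter limits once memcaps come under pressure
      emergency-established: 100
      emergency-closed: 5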

I also feel it would be better for Suricata to crash outright rather than fail silently the way it has been. Thoughts?

Thanks,

Jon

Actions #7

Updated by Andreas Herz almost 8 years ago

Did you try other output formats besides fast.log?
Just to see whether it's related only to fast.log or to eve/JSON etc. as well.

Actions #8

Updated by Jon Zeolla almost 8 years ago

I have added parallel JSON output. I will report back when I have more information. Thanks.
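
A sketch of what running both outputs side by side looks like in the outputs section of suricata.yaml (filenames and the types list are the usual defaults, assumed rather than taken from the attached config):

  outputs:
    - fast:
        enabled: yes
        filename: fast.log
    - eve-log:
        enabled: yes
        filetype: regular        # plain file output (other options include syslog and unix_stream)
        filename: eve.json
        types:
          - alert                # mirror the alerts that fast.log carries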

Actions #9

Updated by Jon Zeolla almost 8 years ago

We encountered the issue again in the middle of the night last night, and the fast.log behavior matches eve: both stopped writing at exactly the same time and only resumed after a service restart.

The last write to fast.log was:
05/17/2016-02:32:22.611623

The last write to eve.json was:
"timestamp":"2016-05-17T02:32:22.611623-0400"

Actions #10

Updated by Peter Manev almost 8 years ago

Can you try the following changes/adjustments and see if they make a difference:

- max-pending-packets: 65534
- mpm-algo: ac-ks

in the af-packet section:
- rollover: no
- #use-emergency-flush: yes
- #checksum-checks: no
- ring-size: 65534

in the detect-engine section:
- sgh-mpm-context: auto

in the flow section:
- memcap: 1gb
- #prune-flows: 1

in the stream section:
- prealloc-sessions: 100000
- midstream: false

then start Suricata in verbose mode (-v) and please share the suricata.log file.
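
Pulling those suggestions together, a rough sketch of the affected parts of suricata.yaml (nesting follows the stock config; the interface name is a placeholder):

  max-pending-packets: 65534
  mpm-algo: ac-ks                # see the follow-up below; not available in the 3.0.1 release

  af-packet:
    - interface: eth0            # placeholder
      ring-size: 65534
      rollover: no
      #use-emergency-flush: yes
      #checksum-checks: no

  detect-engine:
    - sgh-mpm-context: auto

  flow:
    memcap: 1gb
    #prune-flows: 1

  stream:
    prealloc-sessions: 100000
    midstream: false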

Actions #11

Updated by Jon Zeolla almost 8 years ago

I gave that a shot and got:
18/5/2016 -- 19:56:25 - <Error> - [ERRCODE: SC_ERR_INVALID_YAML_CONF_ENTRY(139)] - Invalid mpm algo supplied in the yaml conf file: "ac-ks"

After reviewing the current suricata.yaml.in in master, I changed mpm-algo to ac-bs, as it appears to be the closest to what you suggested. Attached is my suricata.log from startup. My initial impression is that restarts are much faster now; I will monitor and report back soon with an update.
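
For the record, the resulting setting in suricata.yaml (ac-bs being the closest Aho-Corasick variant available in the 3.0.1 release):

  mpm-algo: ac-bs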

Actions #12

Updated by Peter Manev almost 8 years ago

Yes, they should be indeed.
Sorry, ac-ks is only available in git (I rarely use stable :) ).

A silly question - you are not running out of disk space, right? (It has happened to me, hence the question.)

Actions #13

Updated by Victor Julien almost 8 years ago

Jon, could you add another stats.log from when it stops alerting again?

Actions #14

Updated by Jon Zeolla almost 8 years ago

Peter,

I don't believe that has ever been an issue. We're at roughly 30% usage on the disk the logs write to, and stats.log continually writes the same metrics to disk while fast.log/eve.json is frozen.

Peter & Victor,

We got through the night without issue (and at very low memory usage), but here is the stats.log anyway. I've been evaluating whether there are any ill effects from the new config. I will report back with a stats.log if it stops reporting again.

Actions #15

Updated by Jon Zeolla almost 8 years ago

That last modification has made major improvements to our stability. I've rolled back our interim workarounds (scheduled restarts) and reverted to using USR2 for signature reloading. I will let this bake for a couple of days and then review stats.log again to ensure drops are reasonable.

Actions #16

Updated by Jon Zeolla almost 8 years ago

Things appear to be much better now; you can close the ticket. Would you mind pointing me to some documentation regarding the mpm-algo change?

Actions #17

Updated by Victor Julien almost 8 years ago

  • Status changed from New to Closed
  • Priority changed from High to Normal