Bug #1358 (Closed)
Gradual memory leak using reload (kill -USR2 $pid)
Description
Greetings,
Per discussion on the mailing list, I am entering a ticket for a reproducible memory leak when using the 'kill -USR2 $pid' command to initiate a reload of the suricata process. Each time the reload is triggered, the RAM allocated to the suricata process typically doubles, eventually eating into swap. This happens regardless of capture method (tested and confirmed with both pcap live and af-packet).
Better memory management is needed to prevent this from happening.
- My experience:
Pulling new rules and reloading (via a systemd unit as the 'suri' user) every two hours caused all memory and swap to be consumed in 5 days (16/8 GB respectively). Testing with ad-hoc reloads reproduces roughly 1.5x - 2x memory allocation on each reload (a rough sketch of the measurement loop is included at the end of this description). I am running suricata 2.1beta2 64-bit (have not tested the 2.0 series) on Arch Linux, kernel 3.17.6-1-ARCH, in a VMware 11 environment (8 cores, 24 GB).
- Mailing list, per Peter Manev:
I was able to reproduce your behavior (on Ubuntu Trusty LTS, 3.13 kernel) -> a simple kill -USR2 $pid on the latest dev while inspecting some traffic.
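For reference, a minimal sketch of how the measurement can be automated: repeatedly send SIGUSR2 to the running suricata process and record its RSS from /proc after each reload. This is only an illustration of the method, not a tool used in this report; the file name, the 60-second settle time, and all identifiers are assumptions.

/* Sketch only: trigger repeated rule reloads via SIGUSR2 and record the
 * suricata process RSS after each one.
 * Build: cc -o reload_rss reload_rss.c
 * Usage: ./reload_rss <suricata-pid> <reload-count> */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read VmRSS (in kB) from /proc/<pid>/status; returns -1 on error. */
static long ReadRssKb(pid_t pid)
{
    char path[64], line[256];
    long rss = -1;
    snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (sscanf(line, "VmRSS: %ld kB", &rss) == 1)
            break;
    }
    fclose(fp);
    return rss;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <reload-count>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    int count = atoi(argv[2]);

    printf("reload 0: %ld kB\n", ReadRssKb(pid));
    for (int i = 1; i <= count; i++) {
        if (kill(pid, SIGUSR2) != 0) {   /* equivalent of: kill -USR2 $pid */
            perror("kill");
            return 1;
        }
        sleep(60);                       /* assumed settle time for the reload */
        printf("reload %d: %ld kB\n", i, ReadRssKb(pid));
    }
    return 0;
}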
Updated by Victor Julien almost 10 years ago
If you do memory debugging with valgrind or similar, you won't find memory leaks. The issue here seems to be that the way we reload the detection engine while keeping the engine running causes really bad memory fragmentation. This is a long standing issue that isn't easily resolved.
Using different allocators such as tcmalloc or jemalloc may help with this, perhaps you can give them a try:
http://blog.inliniac.net/2010/10/21/speeding-up-suricata-with-tcmalloc/
http://blog.inliniac.net/2014/12/23/profiling-suricata-with-jemalloc/
In the longer run several possible solutions exist:
1. incremental updates so that we don't replace all of the detection engine
2. preallocating a large chunk of memory for the reload to reduce/prevent fragmentation
3. using an allocator like jemalloc to allocate the de_ctx into a different 'arena' than the packet/flow based allocations (sketched below)
4. extensive caching of allocations in the detection engine so we can reuse much of the memory after a reload
I think 2 and 3 would be the easiest to implement, where my preference would go to 3.
1 is the holy grail, but it will be extremely tricky to update, in real time, the very complicated state machine that our detection engine is.
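For illustration of option 3, a minimal sketch of what a dedicated jemalloc arena for the detect engine could look like, assuming the process is built against or preloaded with jemalloc so its non-standard mallctl()/mallocx() API is available. The names (de_ctx_arena, DetectEngineArenaInit, DetectEngineArenaAlloc) are hypothetical and not part of the Suricata code base.

/* Sketch: route detect-engine allocations into their own jemalloc arena so
 * they don't fragment the arenas used by packet/flow allocations. */
#include <stdlib.h>
#include <jemalloc/jemalloc.h>

static unsigned de_ctx_arena;
static int de_ctx_arena_ok = 0;

/* Create one extra arena at startup; every reload then reuses it. Note the
 * mallctl name is "arenas.create" in jemalloc 4+, "arenas.extend" in 3.x. */
static int DetectEngineArenaInit(void)
{
    size_t sz = sizeof(de_ctx_arena);
    if (mallctl("arenas.create", &de_ctx_arena, &sz, NULL, 0) != 0)
        return -1;
    de_ctx_arena_ok = 1;
    return 0;
}

/* Allocate detect-engine memory from the dedicated arena when available,
 * falling back to plain malloc otherwise. free()/dallocx() work as usual. */
static void *DetectEngineArenaAlloc(size_t size)
{
    if (de_ctx_arena_ok)
        return mallocx(size, MALLOCX_ARENA(de_ctx_arena));
    return malloc(size);
}

The point of this approach is only that de_ctx memory stays inside its own arena, so fragmentation caused by a reload does not spill over into the arenas serving traffic-based allocations.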
Updated by Peter Manev almost 10 years ago
What are the pros and cons with regards to option 2 and 3?
What do you mean by - large chunk of memory - for option 2?
Thanks
Updated by Jay MJ almost 10 years ago
Victor Julien wrote:
Using different allocators such as tcmalloc or jemalloc may help with this, perhaps you can give them a try:
http://blog.inliniac.net/2010/10/21/speeding-up-suricata-with-tcmalloc/
http://blog.inliniac.net/2014/12/23/profiling-suricata-with-jemalloc/
Thank you Victor, I will give those a good test run through reloading and report back in a few days.
Updated by Andreas Herz over 9 years ago
Jay MJ wrote:
I am running suricata 2.1beta2 64-bit (have not tested the 2.0 series) on Arch Linux, kernel 3.17.6-1-ARCH, in a VMware 11 environment (8 cores, 24 GB).
I can confirm this also with 2.0.5 as already discussed with Peter and Victor on IRC.
Updated by Jay MJ over 9 years ago
An initial look at preloading tcmalloc and jemalloc does not appear to make much of a difference out of the gate. jemalloc seems slightly more efficient than tcmalloc (I haven't noticed a change with the latter). I am still seeing allocation 1.5x ~ 2x greater upon each reload.
Updated by john kely over 9 years ago
In my opinion, when rules are reloaded, suricata allocates a new memory chunk. Although that causes memory fragmentation, the total memory of suricata should not change. But in this case, the total memory of suricata increases abnormally. So I think suricata does not completely free the detect context.
Updated by Victor Julien over 9 years ago
@john: luckily we don't have to depend on 'think', we can measure using tools like valgrind, jemalloc and others. So if you're convinced this is a memleak, feel free to report the actual leak.
Updated by Victor Julien over 9 years ago
I'm wondering about another possible explanation: the new detect engine is created by a short-lived thread that exists just for the purpose of one single reload. If I understand how allocators work correctly, each thread gets its own memory 'arena' for (most of) the allocations it does. So by creating a new thread each time, we may actually force the new detection engine into a new arena (= large memory block) each time. I'm assuming that the main process/thread inherits the reload thread's arenas when that thread closes.
If true, I still wonder why the new arenas are never freed. Perhaps small leaks prevent that. Or perhaps the arenas are being reused for traffic-based allocations as well.
Will soon try a test where the reload thread never quits. Easy enough to test.
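For what it's worth, a minimal sketch of what "the reload thread never quits" could look like: one long-lived worker woken per reload, so every detect-engine rebuild allocates from the same thread (and thus, for most allocators, the same arena). The names below (RequestRuleReload, ReloadThread, reload_requested, etc.) are hypothetical; this is not the actual Suricata implementation.

/* Sketch: keep one long-lived reload thread around instead of spawning a
 * new thread per reload, so repeated detect-engine rebuilds allocate from
 * the same thread-associated arena. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t reload_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t reload_cond = PTHREAD_COND_INITIALIZER;
static bool reload_requested = false;
static bool engine_running = true;

/* Called from the SIGUSR2 handling path (hypothetical hook). */
void RequestRuleReload(void)
{
    pthread_mutex_lock(&reload_lock);
    reload_requested = true;
    pthread_cond_signal(&reload_cond);
    pthread_mutex_unlock(&reload_lock);
}

/* Long-lived reload thread: started once, woken for every reload, and never
 * exits until shutdown, so its allocations keep landing in one arena. */
void *ReloadThread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&reload_lock);
    while (engine_running) {
        while (!reload_requested && engine_running)
            pthread_cond_wait(&reload_cond, &reload_lock);
        if (!engine_running)
            break;
        reload_requested = false;
        pthread_mutex_unlock(&reload_lock);

        /* Build the new detect engine and swap it in here, i.e. work along
         * the lines of DetectEngineReloadThreads() seen in the logs below. */
        printf("rebuilding detection engine\n");

        pthread_mutex_lock(&reload_lock);
    }
    pthread_mutex_unlock(&reload_lock);
    return NULL;
}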
Updated by Victor Julien over 9 years ago
Anyone able to try https://github.com/inliniac/suricata/pull/1319 ?
I do still see some memory growth, but much less of it.
Updated by Peter Manev over 9 years ago
It does not seem to work for me.
I can not get past one live rule swap.
Updated by Victor Julien over 9 years ago
Peter Manev wrote:
It does not seem to work for me.
I can not get past one live rule swap.
What do you mean here? What did you try? What are you seeing?
Updated by Peter Manev over 9 years ago
I was testing with and without profiling enabled.
Started a better more consistent test run.
It looks good so far.
I would like a couple more days to confirm the consistency of the results.
Updated by Jay MJ over 9 years ago
Victor Julien wrote:
Anyone able to try https://github.com/inliniac/suricata/pull/1319 ?
Can't say I'm the best person to test this, so please take it with a grain of salt. I cloned the repo and grabbed the branch (the latter part was new to me), and compared the files against the changed branch; they appeared to include your changes correctly. Anyway, this was a quick test this morning: basically start the daemon via the systemd unit as the suri user, wait for RAM allocation to stabilize, reload the systemd unit (kill -USR2 $pid), repeat, all within 30 minutes.
Next I will try my rule update and reload timer, record the stats to a file each time, and let that sit for a while. This should give a better representation of the use I was putting suricata to when I noticed the issue to begin with. If there are other ways I can help contribute without getting too deep into dev world, please do not hesitate to let me know.
One thing I noticed with the first test: the reloads were very quick (kinda makes sense, I didn't change any rules), and the suricata.log file was not noting them all, only the first. Standard output captured by systemd, however, did catch them all (although the timestamps look off between the two). Not sure if it's worth noting; logs are below.
Reload   Res (MB)   % of 12 GB RAM
0        2338       19.5
1        4389       36.6
2        4398       36.6
3        4398       36.6
Systemd:
Feb 06 08:40:36 hostname systemd[1]: Started Open Source Next Generation Intrusion Detection and Prevention Engine.
Feb 06 08:48:11 hostname systemd[1]: Reloading Open Source Next Generation Intrusion Detection and Prevention Engine.
Feb 06 08:48:11 hostname systemd[1]: Reloaded Open Source Next Generation Intrusion Detection and Prevention Engine.
Feb 06 08:56:12 hostname systemd[1]: Reloading Open Source Next Generation Intrusion Detection and Prevention Engine.
Feb 06 08:56:12 hostname systemd[1]: Reloaded Open Source Next Generation Intrusion Detection and Prevention Engine.
Feb 06 09:00:16 hostname systemd[1]: Reloading Open Source Next Generation Intrusion Detection and Prevention Engine.
Feb 06 09:00:16 hostname systemd[1]: Reloaded Open Source Next Generation Intrusion Detection and Prevention Engine.
suricata.log:
[30898] 6/2/2015 -- 08:51:05 - (util-threshold-config.c:1203) <Info> (SCThresholdConfParseFile) -- Threshold config parsed: 8 rule(s) found
[30898] 6/2/2015 -- 08:51:05 - (detect-engine.c:449) <Notice> (DetectEngineReloadThreads) -- rule reload starting
[30898] 6/2/2015 -- 08:51:05 - (detect-engine.c:559) <Info> (DetectEngineReloadThreads) -- Live rule swap has swapped 8 old det_ctx's with new ones, along with the new de_ctx
Updated by Andreas Herz over 9 years ago
I can confirm this observation: after the first reload there is a rather huge increase, and after that it's barely noticeable.
I see a jump from ~180 MB to ~290 MB, and after 100 runs I end up with 299 MB, so <1 MB increase per reload.
But the yaml reload, which was lost, should be added back, since I have use cases in which HOME_NET gets changed in the config and I prefer to use reload for this as well.
thanks so far, looking promising :)
Updated by Jay MJ over 9 years ago
Apologies for the initial e-mail response, wrong bug tracker.
Ran this all weekend with hourly updates to the pro rule set and IP reputation (a temporary frequency for testing). It does appear to stabilize after copious reloading once the initial reload is complete.
But the yaml reload, which was lost, should be added back, since I have use cases in which HOME_NET gets changed in the config and I prefer to use reload for this as well.
Just curious, why would yaml reloading be necessary? Is HOME_NET a complex and changing DHCP network? Playing devil's advocate here; if I change the config, a restart seems sufficient for this purpose.
Updated by Andreas Herz over 9 years ago
Jay MJ wrote:
Just curious, why would yaml reloading be necessary? Is HOME_NET a complex and changing DHCP network? Playing devil's advocate here; if I change the config, a restart seems sufficient for this purpose.
A restart is also an option (which I have running right now).
But my use case is that HOME_NET also includes the dynamic IP of WAN interfaces (with a DSL connection, for example), so I change HOME_NET as soon as the PPP login is done.
So I just update one IP/var, and the reload is much faster with much less downtime.
On IRC someone also pointed out another use case: you added another .rules file to your config and don't want to restart just for that.
Another idea might be to have two reloads: a simple rule reload and a full reload without a restart.
So it would be rather nice to have this part in the reload as well.
Updated by Peter Manev over 9 years ago
I tested both pf-ring and af-packet on live 2-4 Gbps traffic - it looks much more stable to me - no memory consumption increase.
Updated by Victor Julien over 9 years ago
- Status changed from New to Closed
- Assignee set to Victor Julien
- Target version set to 2.1beta4
Updated by Andreas Herz almost 9 years ago
- Status changed from Closed to New
- Target version changed from 2.1beta4 to 3.0
This is still an issue; related issues from older times are #492, and maybe even #573 might be involved.
I can reproduce it by sending USR2 several times and seeing increased memory usage.
That's with 3.0RC3 on Debian, Gentoo, Arch and CentOS (no difference between 32/64-bit x86).
It's an issue for systems that run for days and have dynamic IPs, for example. Those get triggered with USR2 to reload the new HOME_NET info at least once per day.
So in 1 month you have ~40% more memory usage.
Updated by Victor Julien almost 9 years ago
- Status changed from New to Closed
- Target version changed from 3.0 to 2.1beta4
I'd like to track this in a new ticket. Based on the description it seems to be a new and different issue.