Project

General

Profile

Actions

Bug #3705

open

VarNameStoreLookupById: Assertion `!(current == ((void *)0))'

Added by Justin Ossevoort over 4 years ago. Updated 6 months ago.

Status:
New
Priority:
Normal
Assignee:
Target version:
Affected Versions:
Effort:
Difficulty:
Label:

Description

We're trying to use the multi-tenant support in Suricata. In our testsetup we were not having any problems, but trying to load a multi-tenant configuration (both device and vlan based) on a system that is monitoring a relatively busy link we get this error (often within a few minutes):

[May 18 14:48:38] 18/5/2020 -- 14:48:38 - <Notice> - rule reload complete
[May 18 14:48:38] 18/5/2020 -- 14:48:38 - <Notice> - Signature(s) loaded, Detect thread(s) activated.
[May 18 14:52:47] suricata: util-var-name.c:300: VarNameStoreLookupById: Assertion `!(current == ((void *)0))' failed.

It seems related to https://redmine.openinfosecfoundation.org/issues/2518 only we're not trying to perform a reload. We've tried both vlan and device based multi-tenant configurations. We tried setting "loaders" to 1 and 3. We tried "default" on and off. But this always happens.

As soon as we disable multi-select everything runs fine.

Actions #1

Updated by Justin Ossevoort over 4 years ago

I've run Suricata in gdb with a breakpoint on the `g_varnamestore_current_sc` global variable

(gdb) watch g_varnamestore_current_sc_atomic__
Hardware watchpoint 1: g_varnamestore_current_sc_atomic__
(gdb) run
Starting program: /usr/bin/suricata --af-packet=mon3
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
19/5/2020 -- 09:00:37 - <Notice> - This is Suricata version 5.0.3 RELEASE running in SYSTEM mode
[New Thread 0x7fffee069700 (LWP 683)]

Thread 1 "Suricata-Main" hit Hardware watchpoint 1: g_varnamestore_current_sc_atomic__

Old value = (VarNameStore *) 0x0
New value = (VarNameStore *) 0x7fffe804f6c0
0x00005555557d05bf in VarNameStoreActivateStaging () at util-var-name.c:348
348    in util-var-name.c
(gdb) c
Continuing.
[New Thread 0x7fffdcd1f700 (LWP 684)]
[New Thread 0x7fffc7fff700 (LWP 685)]
[New Thread 0x7fffc77fe700 (LWP 686)]
[New Thread 0x7fffc6ffd700 (LWP 687)]
[New Thread 0x7fffc67fc700 (LWP 688)]
[New Thread 0x7fffc5ffb700 (LWP 689)]
[New Thread 0x7fffc57fa700 (LWP 690)]
[New Thread 0x7fffc4ff9700 (LWP 691)]
[New Thread 0x7fff8ffff700 (LWP 692)]
19/5/2020 -- 09:02:44 - <Notice> - all 4 packet processing threads, 4 management threads initialized, engine started.
19/5/2020 -- 09:02:44 - <Notice> - rule reload starting

Thread 1 "Suricata-Main" hit Hardware watchpoint 1: g_varnamestore_current_sc_atomic__

Old value = (VarNameStore *) 0x7fffe804f6c0
New value = (VarNameStore *) 0x0
0x00005555557d06c7 in VarNameStoreFree (de_ctx_version=99) at util-var-name.c:385
385    in util-var-name.c
(gdb) bt
#0  0x00005555557d06c7 in VarNameStoreFree (de_ctx_version=99) at util-var-name.c:385
#1  0x0000555555618649 in DetectEngineCtxFree (de_ctx=0x55555ecf7a40) at detect-engine.c:2148
#2  0x000055555562351b in DetectEngineCtxFree (de_ctx=<optimized out>) at detect-engine.c:4025
#3  DetectEnginePruneFreeList () at detect-engine.c:4025
#4  0x0000555555623951 in DetectEngineReload (suri=<optimized out>) at detect-engine.c:4117
#5  0x000055555557dd8a in PostRunStartedDetectSetup (suri=0x555555ad0000 <suricata>)
    at suricata.c:2600
#6  main (argc=<optimized out>, argv=<optimized out>) at suricata.c:3092
(gdb) c
Continuing.
19/5/2020 -- 09:09:36 - <Notice> - rule reload complete
19/5/2020 -- 09:09:36 - <Notice> - Signature(s) loaded, Detect thread(s) activated.
suricata: util-var-name.c:300: VarNameStoreLookupById: Assertion `!(current == ((void *)0))' failed.

Thread 6 "W#04-mon3" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffc6ffd700 (LWP 687)]
0x00007ffff6376781 in raise () from /lib/x86_64-linux-gnu/libc.so.6
Actions #2

Updated by Justin Ossevoort over 4 years ago

I'm not claiming any fundamental understanding of the codebase, but it seems the following is happening:

  • Suricata starts up and instantiates a `DetectEngineCtx`, which in turn initializes the "VarNameStore"
  • After initial startup it triggers a load reload (probably so that loading and reloading use the same codepaths)
  • In `DetectEngineReload` a new `DetectEngineCtx` is created (which doesn't initialize a new "VarNameStore" as one is already present)
  • It swaps the old and new context in `DetectEngineReloadThreads`
  • Afterwards it triggers the cleanup of the old `DetectEngineCtx` using `DetectEngineCtxFree` in `DetectEnginePruneFreeList`, which in turn calls `VarNameStoreFree` which cleans up the current "VarNameStore"

Now we have new instances of the `DetectEngineCtx` running without a current "VarNameStore", which leads to us triggering the assertion.

My first guess would be that we need either a refcount for the "VarNameStore" to keep it alive across reloads or we need to perform a swap/handover during reload and a flag in the `DetectEngineCtx` that indicates if it owns the "VarNameStore".

Actions #3

Updated by Justin Ossevoort over 4 years ago

Justin Ossevoort wrote in #note-2:

My first guess would be that we need either a refcount for the "VarNameStore" to keep it alive across reloads or we need to perform a swap/handover during reload and a flag in the `DetectEngineCtx` that indicates if it owns the "VarNameStore".

Ah, it appears the part I was missing is the `de_ctx_version` logic, which acts effectively what I meant with the handover logic. It should only free the current "VarNameStore" if the version being freed matches the version being active.

This is possibly where the multi-tenant shenanigans come in.

Actions #4

Updated by Justin Ossevoort over 4 years ago

I've been running Suricata in gdb some more:

  • Around reload the new `DetectEngineCtx` allocate a new "VarNameStore", which is put in the `g_varnamestore_staging` global variable
  • When `VarNameStoreFree` (after the reload when cleaning up the old `DetectEngineCtx`) the new "VarNameStore" is still in the `g_varnamestore_staging` variable en the one in `g_varnamestore_current` matches the old `de_ctx_version` that is being cleaned up, so a NULL pointer is put in it's place.

Looking at the code I suspect in between these two lines someone should have called `VarNameStoreActivateStaging` (which puts `g_varnamestore_current` in `g_varnamestore_old` and `g_varnamestore_staging` in `g_varnamestore_current`).

This function is currently called in only two places:

  • `detect-engine-build.c`: At the end of `SigGroupBuild` there is:
    if (!DetectEngineMultiTenantEnabled()) {
        VarNameStoreActivateStaging();
    }
  • `detect-engine.c`: At the end of `DetectEngineMultiTenantSetup`

I suspect the first codepath is why the not-multi-tenant reload is working. The second callsite is only triggered at startup, but not around reloads.

I'm starting to think this is exactly https://redmine.openinfosecfoundation.org/issues/2518 as this problem is around reloads and I can't seem to find any logic that would handle multi-tenant reloading in the reload logic and consequently call the `VarNameStoreActivateStaging` for this case.

Actions #5

Updated by Justin Ossevoort over 4 years ago

Some further puzzle pieces:

  • The `DetectEngineReload function appears to want only support non DETECT_ENGINE_MT_STUB type detect engines
  • In `detect-engine.c` there is a function `DetectEngineMTApply` that contains parts of the logic from `DetectEngineReload` that also deal with `DETECT_ENGINE_MT_STUB` detect engine types, but this method appears to never be called in the normal reload logic, only around unix socket commands
Actions #6

Updated by Justin Ossevoort over 4 years ago

It appears I've zeroed in on the problem:

As before it appears that `DetectEngineReload` function doesn't support DETECT_ENGINE_MT_STUB, so why is it actually being called in the first place? And it should have short-circuited in the case of DETECT_ENGINE_MT_STUB, so why is it being called and runs to completion?

Well according to the breakpoint I set right before the check:

(gdb) print old_de_ctx->type
$9 = DETECT_ENGINE_TYPE_DD_STUB

After a little more digging I found this is used to implement `detect.delayed-detect` (which is an option we've always had enabled). After disabling this option the entire `DetectEngineReload` is never called and Suricata starts succesfully.

But triggering a reload using the command channel will indeed trigger the same assertion so the `reload-rules` is indeed still broken (aka: https://redmine.openinfosecfoundation.org/issues/2518)

So the current state is:

  • Bug: `multi-detect` can not be combined with `detect.delayed-detect` (leading to the earlier assertion in "VarNameStore"); It might be enough to declare this unsupported and reorder the check in `PostConfLoadedDetectSetup` so that `multi-detect` takes precedence over `detect.delayed-detect`
  • Bug: `reload-rules` can not be used with `multi-detect` (again leading to the earlier assertion in "VarNameStore"); This probably needs a special case in `DetectEngineReload` and could possible use loop over the tenants and `DetectEngineMTApply` for activation.
  • I feel there is still some strange corner case around `multi-detect` and `default: true` that should be handled, but I don't understand how that is supposed to work in the first place
Actions #7

Updated by Philippe Antoine over 1 year ago

  • Assignee set to OISF Dev
Actions #8

Updated by Philippe Antoine 6 months ago

  • Target version set to TBD
Actions

Also available in: Atom PDF