Bug #6173: http: loss of backward compatibility in HTTP logs from v6 to v7 - Suricata - Open Information Security Foundation

Actions

Copy link

Bug #6173

open

http: loss of backward compatibility in HTTP logs from v6 to v7

Added by Eric Leblond about 2 years ago. Updated 24 days ago.

Status:

New

Priority:

Normal

Assignee:

OISF Dev

Target version:

TBD

Affected Versions:

8.0.0

Effort:

Difficulty:

Label:

Description

Resolution of #5320 done in https://github.com/OISF/suricata/pull/8714/commits is causing a set of problems.

First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages. I think this could have been avoid.

Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake. For example, let's take Elasticsearch case. If default template is kept, we are going to have the following issue. On an event with 2 headers at least:

"key: "alpha", "value": "jane" 
"key": "beta", "value": "robert"

if user want to see if "alpha": "robert" is present then he will naturally do "key:alpha AND value:robert" and this will match on our event because of the way Elasticsearch handle array by default. If Elasticsearch is set up to use nested elements, the match should be correct. But I'm still completely missing how we can do a list of all values for a given header. Even more if we add the third point below.

Third, the key corresponding to header were "normalized" as it was a fixed string. This is not the case anymore as the key are not converted to lower case before being added to 'key'. The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Andreas Herz about 2 years ago

Affected Versions 7.0.0-rc2 added

Actions

Copy link

Updated by Andreas Herz about 2 years ago

Subject changed from Loss of backward compatibility in HTTP log to http: loss of backward compatibility in HTTP logs from v6 to v7

Actions

Copy link

Updated by Victor Julien about 2 years ago

Related to Bug #5320: Key collisions in HTTP JSON eve-logs added

Actions

Copy link

Updated by Victor Julien about 2 years ago

Description updated (diff)

For reference, the upgrade guide mentions that the format changed https://docs.suricata.io/en/suricata-7.0.0-rc2/upgrade.html#logging-changes

Actions

Copy link

Updated by Philippe Antoine about 2 years ago

First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages.

So, how do you propose to fix #5320 ?
Having content_range changed to content_range_parsed in EveHttpLogJSONBasic ?
This is still a backward compatibility breakage (but maybe a lesser one)

Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake.

Then, regit is not that a problem with @dump-all-headers suricata.yaml option ?
(disclaimer: I do not understand everything about Elastic)

The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.

Not sure I understand if you want normalization or raw data here...

Actions

Copy link

Updated by Eric Leblond about 2 years ago

Philippe Antoine wrote in #note-5:

First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages.

So, how do you propose to fix #5320 ?
Having content_range changed to content_range_parsed in EveHttpLogJSONBasic ?
This is still a backward compatibility breakage (but maybe a lesser one)

Yes, changing one single key would be better. Also there is something we should avoid which is to have a JSON key pointing to different values type. Pointed in the issue was the fact one single key was single value or multiple values. In the case of Elasticsearch it handles it fine but for most "DB" engines or even for coding this is a nightmare.

Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake.

Then, regit is not that a problem with @dump-all-headers suricata.yaml option ?
(disclaimer: I do not understand everything about Elastic)

For me, dump-all-headers is really nice when you want to investigate one single event or doing forensic as we are not going to miss a single field. The key, value is needed as adding dynamic keys to tools like Elasticsearch is causing an explosion of their engine. But it is not usable for doing queries like give me the rare $name_your_header_key values. I think we need to keep the set of static key we already have as it is really easy to query them and do stats or dashboards.

The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.

Not sure I understand if you want normalization or raw data here...

In dump all headers, I think it is making sense to have raw data. This is for me a one event at the time or something used in processing.

Actions

Copy link

Updated by Jason Ish about 2 years ago

Can we get some examples of the actual breakage? I'm playing with 6.0.13 and 7.0.0-rc2 with some basic requests, but not too much is jumping out, at least with a default configuration.

I believe the main issue was that listing headers in custom could overwrite some http fields as we, arguably in error logged these objects directly into the http object instead of some sub-object named headers. And now we log these response_headers. regit am I correct that you see this breakage when using a custom value for @http.custom?

As the header names allowed in custom are a defined list, an alternative fix to the issue could be to make sure there is no name overlap by renaming some fields, but still logging them under the http object.

Then response_headers and request_headers could be saved for the dump-all-headers option.

Within response_headers and request_headers what is the best format? What we have now:

"response_headers": [
  { "name": "Server", value: "Caddy" }
]

or..

"response_headers": {
  "Server": ["Caddy"],
  "Header With Space": ["Testing"]
}

where the first may be harder to query, but second causing index expansion and possibly other issues.

Actions

Copy link

Updated by Eric Leblond about 2 years ago

Jason Ish wrote in #note-7:

Can we get some examples of the actual breakage? I'm playing with 6.0.13 and 7.0.0-rc2 with some basic requests, but not too much is jumping out, at least with a default configuration.

So I was trying to get the http server information in the log and I used the configuration to do so:

        - http:
            extended: yes     # enable this for extended logging information
            custom: [server, accept-encoding]

the thing I have with 7 is:

{
  "hostname": "198.71.247.91",
  "url": "/",
  "http_user_agent": "curl/7.58.0",
  "http_content_type": "text/html",
  "http_method": "GET",
  "protocol": "HTTP/1.1",
  "status": 200,
  "length": 51,
  "response_headers": [
    {
      "name": "Server",
      "value": "Apache/2.4.41 (Ubuntu)" 
    }
  ]
}

With 6.x I got

{
  "hostname": "198.71.247.91",
  "url": "/",
  "http_user_agent": "curl/7.58.0",
  "http_content_type": "text/html",
  "server": "Apache/2.4.41 (Ubuntu)",
  "http_method": "GET",
  "protocol": "HTTP/1.1",
  "status": 200,
  "length": 51
}

I believe the main issue was that listing headers in custom could overwrite some http fields as we, arguably in error logged these objects directly into the http object instead of some sub-object named headers. And now we log these response_headers. regit am I correct that you see this breakage when using a custom value for @http.custom?

As the header names allowed in custom are a defined list, an alternative fix to the issue could be to make sure there is no name overlap by renaming some fields, but still logging them under the http object.

Then response_headers and request_headers could be saved for the dump-all-headers option.

Within response_headers and request_headers what is the best format? What we have now:
"response_headers": [

{ "name": "Server", value: "Caddy" }
]

IMO, this is the way to go for listing of arbitrary keys.

or..

[...]

where the first may be harder to query, but second causing index expansion and possibly other issues.

Yes, this one is going to cause software like Elasticsearch to explode.

Eric Leblond wrote in #note-6:

Philippe Antoine wrote in #note-5:

First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages.

So, how do you propose to fix #5320 ?
Having content_range changed to content_range_parsed in EveHttpLogJSONBasic ?
This is still a backward compatibility breakage (but maybe a lesser one)

Yes, changing one single key would be better. Also there is something we should avoid which is to have a JSON key pointing to different values type. Pointed in the issue was the fact one single key was single value or multiple values. In the case of Elasticsearch it handles it fine but for most "DB" engines or even for coding this is a nightmare.

Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake.

Then, regit is not that a problem with @dump-all-headers suricata.yaml option ?
(disclaimer: I do not understand everything about Elastic)

For me, dump-all-headers is really nice when you want to investigate one single event or doing forensic as we are not going to miss a single field. The key, value is needed as adding dynamic keys to tools like Elasticsearch is causing an explosion of their engine. But it is not usable for doing queries like give me the rare $name_your_header_key values. I think we need to keep the set of static key we already have as it is really easy to query them and do stats or dashboards.

The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.

Not sure I understand if you want normalization or raw data here...

In dump all headers, I think it is making sense to have raw data. This is for me a one event at the time or something used in processing.

Actions

Copy link

Updated by Jason Ish about 2 years ago

Any recommendations on fixing this breaking change or comments on my suggested fix?

Actions

Copy link

#10

Updated by Eric Leblond about 2 years ago

For content_range, I would rename the array version in content_ranges like it is done for ethernet address but proposed content_range_parsed works too.

Actions

Copy link

#11