Support #5366
closed
Displaying Chinese Characters in eve.json
Description
Hi OISF Team,
Is there a way to display Chinese characters in my eve.json?
This question came up as I was creating sigs today. I was looking at content similar to this:
return d.includes("hbWallet") ? "火币钱包"
I generated a pcap for it. To verify the pcap, I confirmed that the To Hex encoding of the content above matched the hex dump in Wireshark. Here is the To Hex output for the content:
return|20|d|2e|includes|28 22|hbWallet|22 29 20 3f 20 22 e7 81 ab e5 b8 81 e9 92 b1 e5 8c 85 22|
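As a quick sanity check on the hex above, the twelve bytes between the quotes can be decoded as UTF-8. This is only a minimal sketch; the variable names are illustrative and not part of any Suricata tooling.

```python
# The 12 bytes that sit between the two 0x22 quote bytes in the To Hex output.
hex_bytes = "e7 81 ab e5 b8 81 e9 92 b1 e5 8c 85"

# bytes.fromhex ignores the spaces between byte pairs.
raw = bytes.fromhex(hex_bytes)

# Each of the four characters occupies three bytes in UTF-8.
text = raw.decode("utf-8")

print(len(raw), text)
```

Twelve bytes for four characters is consistent with the twelve dots that appear in the printable output.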
The generated .pcap should be attached for your testing as well.
As I was testing my sigs, I noticed that eve.json displays dots (.) in place of the Chinese characters:
"http_response_body_printable":"return d.includes(\"hbWallet\") ? \"............\"\n"
and
"payload_printable":"HTTP/1.0 200 OK\r\nServer: SimpleHTTP/0.6 Python/3.8.10\r\nDate: Wed, 18 May 2022 00:10:49 GMT\r\nContent-type: application/javascript\r\nContent-Length: 47\r\nLast-Modified: Tue, 17 May 2022 23:59:19 GMT\r\n\r\nreturn d.includes(\"hbWallet\") ? \"............\"\n"
I have reviewed this similar past ticket: https://redmine.openinfosecfoundation.org/issues/2647. I confirmed that the following options are set to "yes" and are not commented out in my suricata.yaml while testing.
payload-printable: yes # enable dumping payload in printable (lossy) format
http-body: yes # Requires metadata; enable dumping of HTTP body in Base64
http-body-printable: yes # Requires metadata; enable dumping of HTTP body in printable format
decode-base64: yes
decode-quoted-printable: yes
Is there anything else you can suggest to help display the Chinese characters?
Updated by Jason Ish over 2 years ago
We don't make any assumptions about the encoding of the data, other than that there might be some ASCII characters in there. These buffers are just raw bytes as far as Suricata is concerned. I think to display non-ASCII character sets we'd have to attempt to decode them as UTF-8. If you're lucky, the whole thing will decode as UTF-8 and we could log it as such. However, if it didn't decode as UTF-8, we'd have to attempt to decode chunks of it as UTF-8, which could get expensive. So it's just more consistent to log the ASCII set and the rest as unprintable.
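To illustrate the trade-off described here, the following is a rough sketch of a lossy printable rendering next to an all-or-nothing UTF-8 decode. It is not Suricata's actual code: `to_printable` and `try_utf8` are hypothetical helpers, and the exact set of bytes Suricata keeps as printable is an assumption.

```python
from typing import Optional


def to_printable(data: bytes) -> str:
    """Lossy rendering: keep printable ASCII plus tab/LF/CR, dot everything else.

    Assumption: this only approximates what a "printable" logger does.
    """
    keep = {0x09, 0x0A, 0x0D}
    return "".join(chr(b) if 0x20 <= b <= 0x7E or b in keep else "." for b in data)


def try_utf8(data: bytes) -> Optional[str]:
    """Decode the whole buffer as UTF-8, or return None if any byte is invalid."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The response body from the ticket: ASCII text plus 12 bytes of UTF-8.
body = (
    b'return d.includes("hbWallet") ? "'
    + bytes.fromhex("e781abe5b881e992b1e58c85")
    + b'"\n'
)

print(to_printable(body))             # every non-ASCII byte becomes one dot
print(try_utf8(body))                 # decodes, because the whole buffer is valid UTF-8
print(try_utf8(b"\xff\xfe" + body))   # None: one bad byte fails the whole decode
```

The last call shows why the whole-buffer approach is fragile on arbitrary traffic: a single invalid byte anywhere forces either a fallback or a chunk-by-chunk decode.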
When using the Base64 logging (default), the data is logged in a lossless format, so presentation tools could attempt to perform the conversion. I think doing this in Suricata on untrusted data could cause issues in terms of performance and inconsistencies, and could perhaps even become an attack vector.
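A presentation tool could do that conversion roughly as follows. This is a minimal sketch under assumptions: the record below is hand-built for illustration rather than taken from a real eve.json, `decode_field` is a hypothetical helper, and the exact field names and nesting of Base64 fields may vary with the Suricata version and configuration.

```python
import base64
import json


def decode_field(event: dict, field: str) -> str:
    """Base64-decode one field of an event and interpret the bytes as UTF-8.

    Invalid sequences are replaced with U+FFFD instead of raising, since the
    logged data is untrusted and may not be UTF-8 at all.
    """
    raw = base64.b64decode(event[field])
    return raw.decode("utf-8", errors="replace")


# Hand-built sample line standing in for one eve.json record (assumption).
body = 'return d.includes("hbWallet") ? "火币钱包"\n'.encode("utf-8")
line = json.dumps({
    "event_type": "alert",
    "payload": base64.b64encode(body).decode("ascii"),
})

event = json.loads(line)
print(decode_field(event, "payload"))
```

Because the decoding happens in the viewer, the cost and any encoding guesswork stay out of the sensor, and the original bytes remain available unchanged.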
Updated by Jason Ish about 2 years ago
- Status changed from New to Closed
- Assignee changed from OISF Dev to Jason Ish
Closing. Was told answer was sufficient via Discord.