Bug #7950
closed
Potentially incorrect decoding of quoted-printable mime text attachments
Description
If I am correct, there might be a problem with decoding quoted-printable encoded email text attachments in Suricata.
In addition, if there is an empty line at the end of the last attachment (before the "."), it is also appended to the decoded file. AFAIK, that should not be the case.
I believe I found a test case where Suricata 7.0.12, Suricata 8.0.1 and the GMime library (as a reference) all produce different text file output and checksums, which might cause an IoC matching problem.
- "testcase.smtp" is the SMTP stream from the PCAP extracted from wireshark follow TCP stream.
- "Attachment2-gmime" is the output that is generated by the GMime library.
- "Attachment2-suri7" and "Attachment2-suri8" are the respective outputs of the above versions when activating filestore.
$ diff Attachment2-gmime Attachment2-suri7
119c119
< ===================================================================
---
> ==3D================================================================
246,247c246
< +static gboolean related_url_string_cb(field_info *finfo, gboolean doit,
< const gchar** ret_url)
---
> +static gboolean related_url_string_cb(field_info *finfo, gboolean doit, const gchar** ret_url)
451c450
< ===================================================================
---
> ==========================================3D========================
1037a1037
>
$ diff Attachment2-gmime Attachment2-suri8
119c119
< ===================================================================
---
> ==3D================================================================
246,247c246
< +static gboolean related_url_string_cb(field_info *finfo, gboolean doit,
< const gchar** ret_url)
---
> +static gboolean related_url_string_cb(field_info *finfo, gboolean doit, const gchar** ret_url)
451c450
< ===================================================================
---
> ==========================================D========================
1037a1037
>
Hopefully I did not make a mistake. But if I am correct, there might be unwanted IoC differences.
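To illustrate why even the single extra empty line matters for IoC matching, here is a minimal Python sketch. The payload bytes and the helper name sha256_hex are made up for illustration; they are not taken from the attached files.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical decoded attachment contents (not the real test case).
reference = b"+{\n+    return TRUE;\n+}\n"
# Same content with one extra empty line appended at the end.
with_extra_line = reference + b"\n"

print(sha256_hex(reference))
print(sha256_hex(with_extra_line))
print("digests match:", sha256_hex(reference) == sha256_hex(with_extra_line))
```

Any hash-based IoC computed from one decoder's output would therefore not match the file stored by another decoder.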
Best regards,
MaJa
Files
Updated by Marko Jahnke 26 days ago
Ma Ja wrote:
If I am correct, there might be a problem with decoding quoted-printable encoded email text attachments in Suricata.
Of course, with "attachments" I meant MIME multipart bodyparts.
Updated by Victor Julien 21 days ago
- Status changed from New to Assigned
- Assignee changed from OISF Dev to Philippe Antoine
- Target version changed from TBD to 9.0.0-beta1
@Philippe Antoine can you check this and mark for backport(s) if needed?
Updated by Philippe Antoine 21 days ago
Suricata 8 and 7 seem incorrect.
So does GMime, in a different way, when compared to the IMF object exported by Wireshark.
Updated by Philippe Antoine 21 days ago
- Label Needs backport to 7.0, Needs backport to 8.0 added
Not really a backport for 7, but a fix for the C parser...
Updated by Philippe Antoine 21 days ago
- Status changed from Assigned to In Review
Updated by Marko Jahnke 19 days ago
@Philippe Antoine wrote:
So does GMime, in a different way, when compared to the IMF object exported by Wireshark.
Is it possible to tell what GMime does wrong? If we use that as a reference, we might also have a problem.
Updated by Philippe Antoine 19 days ago
< +static gboolean related_url_string_cb(field_info *finfo, gboolean doit,
< const gchar** ret_url)
---
> +static gboolean related_url_string_cb(field_info *finfo, gboolean doit, const gchar** ret_url)
I think GMime wrongly inserts a newline (compared to Wireshark and Suricata).
Updated by Albrecht Dreß 18 days ago
(Sorry for jumping into this thread, a colleague pointed me to it as I use GMime in several projects, inter alia the MUA Balsa – thus I'm interested in any possible bugs in that library…)
The dump from the PCAP actually looks a little odd at this point:
000030d0  73 74 61 74 69 63 20 67 62 6f 6f 6c 65 61 6e 20  |static gboolean |
000030e0  72 65 6c 61 74 65 64 5f 75 72 6c 5f 73 74 72 69  |related_url_stri|
000030f0  6e 67 5f 63 62 28 66 69 65 6c 64 5f 69 6e 66 6f  |ng_cb(field_info|
00003100  20 2a 66 69 6e 66 6f 2c 20 67 62 6f 6f 6c 65 61  | *finfo, gboolea|
00003110  6e 20 64 6f 69 74 2c 20 3d 0a 0a 63 6f 6e 73 74  |n doit, =..const|
00003120  20 67 63 68 61 72 2a 2a 20 20 72 65 74 5f 75 72  | gchar**  ret_ur|
00003130  6c 29 3d 30 41 3d 0a 2b 7b 3d 30 41 3d 0a 2b 3d  |l)=0A=.+{=0A=.+=|
Apparently, the sending MUA did not convert the line breaks into RFC 5322 (i.e. CRLF) sequences. However, RFC 2045, Section 6.7, Clause 4 (Line Breaks) states:
A line break in a text body, represented as a CRLF sequence in the text canonical form, must be represented by a (RFC 822) line break, which is also a CRLF sequence, in the Quoted-Printable encoding.
[…]
Note that many implementations may elect to encode the local representation of various content types directly rather than converting to canonical form first, encoding, and then converting back to local representation. In particular, this may apply to plain text material on systems that use newline conventions other than a CRLF terminator sequence. Such an implementation optimization is permissible, but only when the combined canonicalization-encoding step is equivalent to performing the three steps separately.
IMHO the GMime decoder does decode the input at offset 0x3118 correctly according to this optimisation: the two octets 0x3d 0x0a represent the soft line break, which is removed according to Clause 5 of the aforementioned standard, whilst the hard line break at offset 0x311a is preserved. It may be confusing that the attachment is an application/octet-stream, but the standard does not explicitly rule out using the “simplified” line breaks for this content type. This really looks like a somewhat special corner case…
Or did I miss something here?
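The difference between the two byte sequences can be sketched with Python's standard quopri module. This only shows how one common decoder treats the two inputs; it says nothing about the Suricata or GMime internals, and the sample strings are shortened, hand-written stand-ins for the captured data.

```python
import quopri

# "=" followed by CRLF: a regular soft line break.
crlf_soft_break = b"gboolean doit, =\r\nconst gchar** ret_url)=0A=\r\n+{"
# "=" followed by LF, then another LF: a soft line break plus a hard one.
lf_soft_then_hard = b"gboolean doit, =\n\nconst gchar** ret_url)=0A=\n+{"

decoded_crlf = quopri.decodestring(crlf_soft_break)
decoded_lf = quopri.decodestring(lf_soft_then_hard)

# The soft break vanishes; only the encoded =0A survives as a newline.
print(decoded_crlf)  # b'gboolean doit, const gchar** ret_url)\n+{'
# The second LF is a real line break and is kept in the output.
print(decoded_lf)    # b'gboolean doit, \nconst gchar** ret_url)\n+{'
```

With 3d 0a 0a in the input, a decoder that keeps the extra newline is behaving consistently with the soft line break rule, so which bytes are really on the wire decides which output is correct.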
Updated by Albrecht Dreß 17 days ago
Albrecht Dreß wrote in #note-12:
IMHO the GMime decoder does decode the input at offset 0x3118 correctly according to this optimisation: the two octets 0x3d 0x0a represent the soft line break, which is removed according to Clause 5 of the aforementioned standard, whilst the hard line break at offset 0x311a is preserved. […]
As an additional test, I loaded testcase.smtp into Thunderbird on Trixie, opened the 2nd attachment (TB calls an external application for that), and apparently TB also keeps the newline (which breaks the patch file, but that seems to be an issue of the MUA producing the message).
Updated by Philippe Antoine 17 days ago
The patch file looks indeed correct without the newline...
Updated by Philippe Antoine 14 days ago
- Status changed from In Review to Resolved
https://github.com/OISF/suricata/pull/13937
@albrecht thanks for the feedback.
your pcap dump seems wrong, I see
00003194  2c 20 3d 0d                                       , =.
00003198  0a 63 6f 6e 73 74 20 67 63 68 61 72 2a 2a 20 20   .const g char**
So, 3d0d0a and not 3d0a0a as you posted in #12.
3d0d0a seems a legit soft line break.
Updated by Albrecht Dreß 14 days ago
- File Bildschirmfoto_2025-10-06_10-24-33.png Bildschirmfoto_2025-10-06_10-24-33.png added
- File Bildschirmfoto_2025-10-06_10-27-12.png Bildschirmfoto_2025-10-06_10-27-12.png added
Philippe Antoine wrote in #note-15:
@albrecht thanks for the feedback.
your pcap dump seems wrong, I see
[...] So, 3d0d0a and not 3d0a0a as you posted in #12.
3d0d0a seems a legit soft line break.
Yes, you're right, of course!
Looking into the Wireshark hex display of the re-assembled TCP stream, I actually see 3d0d0a, as you said.
Switching Wireshark to ASCII display, saving the re-assembled TCP stream to a file, and running hd on it, I get the 3d0a0a sequence that I posted in #12.
I.e. I probably didn't understand how Wireshark's ASCII export actually works (or there is a glitch in Wireshark's export?) which led to the confusion… Fixing that byte, GMime, too, produces the proper output.
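A quick way to check which kind of soft line break a saved stream really contains is to count the bytes that follow each "=". The following Python sketch does that; the function name classify_soft_breaks is hypothetical and the sample bytes are hand-written, not taken from the capture.

```python
def classify_soft_breaks(raw: bytes) -> dict:
    """Count '=' characters directly followed by CRLF or by a bare LF."""
    counts = {"=CRLF": 0, "=LF": 0}
    i = raw.find(b"=")
    while i != -1:
        if raw.startswith(b"\r\n", i + 1):
            counts["=CRLF"] += 1
        elif raw.startswith(b"\n", i + 1):
            counts["=LF"] += 1
        i = raw.find(b"=", i + 1)
    return counts


# Hand-written sample: one CRLF soft break, one bare-LF soft break,
# and one encoded octet (=0A) that is not a soft break at all.
sample = b"doit, =\r\nconst=\nret_url)=0A"
print(classify_soft_breaks(sample))  # {'=CRLF': 1, '=LF': 1}

# Assumed usage on a saved stream, read in binary mode so that no
# line-ending translation happens:
# with open("testcase.smtp", "rb") as fh:
#     print(classify_soft_breaks(fh.read()))
```

If a file exported from a capture shows only bare-LF soft breaks while the hex view of the same stream shows CRLF, the export step has probably altered the line endings.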
Thanks again for the clarification!