Bug #705


http.log missing lots of requests under high traffic load

Added by Peter Manev almost 12 years ago. Updated over 5 years ago.

Status: Closed
Priority: High
Assignee:
Target version:
Affected Versions:
Effort:
Difficulty:
Label:

Description

Using:

 suricata --build-info
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:560) <Info> (SCPrintBuildInfo) -- This is Suricata version 1.4dev (rev 5f4c528)
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:633) <Info> (SCPrintBuildInfo) -- Features: PCAP_SET_BUFF LIBPCAP_VERSION_MAJOR=1 PF_RING AF_PACKET HAVE_PACKET_FANOUT LIBCAP_NG LIBNET1.1 HAVE_HTP_URI_NORMALIZE_HOOK HAVE_HTP_TX_GET_RESPONSE_HEADERS_RAW HAVE_NSS PROFILING
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:647) <Info> (SCPrintBuildInfo) -- 64-bits, Little-endian architecture
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:649) <Info> (SCPrintBuildInfo) -- GCC version 4.6.3, C version 199901
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:655) <Info> (SCPrintBuildInfo) -- __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:658) <Info> (SCPrintBuildInfo) -- __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:661) <Info> (SCPrintBuildInfo) -- __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:664) <Info> (SCPrintBuildInfo) -- __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:667) <Info> (SCPrintBuildInfo) -- __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:671) <Info> (SCPrintBuildInfo) -- compiled with -fstack-protector
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:677) <Info> (SCPrintBuildInfo) -- compiled with _FORTIFY_SOURCE=2
[23628] 10/1/2013 -- 00:49:59 - (suricata.c:680) <Info> (SCPrintBuildInfo) -- compiled with libhtp 0.2.11, linked against 0.2.11

root@suricata:/var/data/regit/log/suricata# time tcpdump -i eth3 -n -s 0 -w - | grep -a -o -E "Host\: .*|GET \/.*"  >gets.txt
tcpdump: WARNING: eth3: no IPv4 address assigned
tcpdump: listening on eth3, link-type EN10MB (Ethernet), capture size 65535 bytes
^C637676 packets captured
4955063 packets received by filter
4317303 packets dropped by kernel
1 packet dropped by interface

real    0m3.942s
user    0m2.824s
sys     0m1.420s
root@suricata:/var/data/regit/log/suricata# grep GET gets.txt |wc -l
1529

So in 3 seconds tcpdump sees 1529 HTTP GET requests (and that is just GETs).
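(Counting the Host: headers in the same dump gives a rough estimate of the total request count, non-GET methods included - still only an approximation, of course:)

  grep -a -c "^Host: " gets.txt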

root@suricata:/var/data/regit/log/suricata# date && wc -l http.log
Thu Jan 10 00:14:16 EET 2013
518476 http.log
root@suricata:/var/data/regit/log/suricata# date && wc -l http.log
Thu Jan 10 00:14:18 EET 2013
518480 http.log
root@suricata:/var/data/regit/log/suricata# date && wc -l http.log
Thu Jan 10 00:14:19 EET 2013
518489 http.log


and in those same 3 seconds Suricata writes only 13 new entries to http.log, on the same 10 Gbit/s of traffic.
root@suricata:/var/data/regit/log/suricata# wc -l gets.txt
3209 gets.txt

root@suricata:/var/data/regit/log/suricata# tail -8 gets.txt
Host: redirector.c.youtube.com
GET /v302708503/6baa/fIDnzBb3MPE.jpg HTTP/1.1
Host: cs302708.userapi.com
GET /csi?v=2&s=youtube&action=results&e=901312,916623,922911,919108,920704,912806,922403,922405,929901,913605,929104,929110,920201,913302,919009,911116,926403,901451&li=1&rt=ct.281,js_head.591,js_page.863,js_foot.865,ol.1208,aft.1208 HTTP/1.1
Host: csi.gstatic.com
GET /socket.io/1/?t=1357769748997 HTTP/1.1
Host: ws.9gag.com:8084
GET /vi/9d6IsU01-HE/hqdefault.jpg HTTP/1.1
root@suricata:/var/data/regit/log/suricata#

My http-log config section in suricata.yaml:

  # a line based log of HTTP requests (no alerts)
  - http-log:
      enabled: yes
      filename: http.log
      append: yes
      extended: yes     # enable this for extended logging information
      #filetype: regular # 'regular', 'unix_stream' or 'unix_dgram'

Just for general info: toggling extended: yes/no has no effect on the number of logs written.

I think 1.3.1 didn't have that issue.

Actions #1

Updated by Victor Julien almost 12 years ago

Sounds like you're just experiencing packet loss.

Actions #2

Updated by Peter Manev almost 12 years ago

How can we be sure that is the case?

I mean, all of these counters:

tcp.stream_depth_reached
tcp.reassembly_gap
tcp.segment_memcap_drop   
capture.kernel_drops      
tcp.ssn_memcap_drop  
defrag.max_frag_hits
tcp.invalid_checksum
defrag.ipv6.timeouts      
defrag.ipv4.timeouts

are zero (0) in stats.log for every thread (running af_packet).
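(For reference, a quick way to eyeball those counters across all threads - a rough one-liner; adjust the path to wherever your stats.log lives:)

  grep -E "capture.kernel_drops|tcp.reassembly_gap|tcp.segment_memcap_drop|tcp.ssn_memcap_drop" \
      /var/log/suricata/stats.log | tail -n 20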

Actions #3

Updated by Anoop Saldanha almost 12 years ago

The regex you have there is not a definitive way to count the HTTP requests from tcpdump.

Not saying that we don't have an issue here - 15 requests for that network does sound suspicious. Is it possible to show the same behaviour with a pcap?

Actions #4

Updated by Peter Manev almost 12 years ago

No, the regex is not :) - but here is a more definitive way (I think), using tshark:


root@suricata:/var/data/peter/http-problem# tshark -r http-problem.pcap -R "http.request" -T fields -e http.request.method |wc -l
tshark: Lua: Error during loading:
Running as user "root" and group "root". This could be dangerous.

14799
root@suricata:/var/data/peter/http-problem#
root@suricata:/var/data/peter/http-problem#

tshark shows ~14K HTTP requests in the pcap.


root@suricata:/var/data/peter/http-problem# capinfos http-problem.pcap
File name:           http-problem.pcap
File type:           Wireshark/tcpdump/... - libpcap
File encapsulation:  Ethernet
Packet size limit:   file hdr: 65535 bytes
Number of packets:   4798877
File size:           4632192790 bytes
Data size:           4555410734 bytes
Capture duration:    9 seconds
Start time:          Thu Jan 10 19:20:38 2013
End time:            Thu Jan 10 19:20:47 2013
Data byte rate:      533608406.22 bytes/sec
Data bit rate:       4268867249.74 bits/sec
Average packet size: 949.27 bytes
Average packet rate: 562127.38 packets/sec
SHA1:                eb62c3788428230bfdb50c6577b1bd61e7d0ebfb
RIPEMD160:           9dcb0acabd482aa460ae192af4cc3b45c15f0ae3
MD5:                 9bd082aab4a4373f3625ba125ba2e482
Strict time order:   False
root@suricata:/var/data/peter/http-problem#

Running the same pcap through Suricata with --runmode=single and zero rules:


[27313] 10/1/2013 -- 21:23:44 - (source-pcap-file.c:190) <Info> (ReceivePcapFileLoop) -- pcap file end of file reached (pcap err code 0)
[27312] 10/1/2013 -- 21:23:44 - (suricata.c:2013) <Info> (main) -- Signal Received.  Stopping engine.
[27314] 10/1/2013 -- 21:23:44 - (flow-manager.c:554) <Info> (FlowManagerThread) -- 0 new flows, 0 established flows were timed out, 0 flows in closed state
[27312] 10/1/2013 -- 21:23:44 - (suricata.c:2049) <Info> (main) -- time elapsed 32.480s
[27313] 10/1/2013 -- 21:23:44 - (source-pcap-file.c:290) <Info> (ReceivePcapFileThreadExitStats) -- Pcap-file module read 4798877 packets, 4555410734 bytes
[27313] 10/1/2013 -- 21:23:44 - (stream-tcp.c:4180) <Info> (StreamTcpExitPrintStats) -- Stream TCP processed 2191278 TCP packets
[27313] 10/1/2013 -- 21:23:44 - (alert-fastlog.c:321) <Info> (AlertFastLogExitPrintStats) -- Fast log output wrote 0 alerts
[27313] 10/1/2013 -- 21:23:44 - (log-httplog.c:617) <Info> (LogHttpLogExitPrintStats) -- HTTP logger logged 349 requests
[27313] 10/1/2013 -- 21:23:44 - (log-tlslog.c:528) <Info> (LogTlsLogExitPrintStats) -- TLS logger logged 15 requests
[27313] 10/1/2013 -- 21:23:44 - (log-file.c:432) <Info> (LogFileLogExitPrintStats) -- (PcapFile) Files logged: 87
[27312] 10/1/2013 -- 21:23:45 - (host.c:244) <Info> (HostPrintStats) -- host memory usage: 349376 bytes, maximum: 16777216
[27312] 10/1/2013 -- 21:23:45 - (util-profiling-rules.c:299) <Info> (SCProfilingRuleDump) -- Dumping profiling data for 0 rules.
[27312] 10/1/2013 -- 21:23:45 - (util-profiling-rules.c:416) <Info> (SCProfilingRuleDump) -- Done dumping profiling data.
[27312] 10/1/2013 -- 21:23:45 - (detect.c:3942) <Info> (SigAddressCleanupStage1) -- cleaning up signature grouping structure... complete
[27312] 10/1/2013 -- 21:23:45 - (util-profiling.c:275) <Info> (SCProfilingDump) -- Done dumping profiling data.

So the HTTP logger logged only 349 requests, versus the ~14K requests tshark sees in the same pcap.
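(For anyone repeating this comparison, roughly the kind of invocation used - a sketch; the config path and the /dev/null trick for loading zero rules are my assumptions:)

  # offline run over the capture, single runmode, no rules loaded
  suricata -c /etc/suricata/suricata.yaml --runmode=single -r http-problem.pcap -S /dev/null
  # compare the logger's count against tshark's view of the same pcap
  wc -l http.log
  tshark -r http-problem.pcap -R "http.request" -T fields -e http.request.method | wc -l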
Actions #5

Updated by Anoop Saldanha almost 12 years ago

Can you attach/share the pcap?

Actions #6

Updated by Peter Manev almost 12 years ago

Shared privately.

Actions #7

Updated by Anoop Saldanha almost 12 years ago

  • Assignee set to Anoop Saldanha
Actions #8

Updated by Victor Julien about 11 years ago

  • Target version set to TBD
Actions #9

Updated by Erik C almost 11 years ago

I see there is a "target version" set to TBD. Is there any update on that? I am struggling with this identical issue and really want to use Suri for dedicated HTTP logging (to cover IPv6 HTTP logging), as I already use it for IDS. Being able to deploy the same software universally, with only config changes between purposes, would go a long way towards simplifying my life. Currently I am using a urllog program built on libnids, which unfortunately is IPv4 only.

Thank you for any update on this issue.

Actions #10

Updated by Victor Julien almost 11 years ago

I suspect this is either caused by packet loss, or by traffic weirdness like asymmetric routing or bad checksums.

Actions #11

Updated by Erik C almost 11 years ago

I am running 1.4.6. Below is the only ixgbe configuration I could find that gets me a full 10 Gbit link packet dump using tcpdump. Without these ethtool modifications it is impossible for us to see all the packets, as shown by tcpdump and the ifconfig packet stats. Does this have something to do with why/how we are not getting our URL logs? If so, how can this be managed without us modifying our ixgbe configuration?

----
ethtool -G ${i} rx 4096
ethtool -C ${i} rx-usecs 10000
ethtool -C $i adaptive-rx off
ethtool -K $i tso off
ethtool -K $i gro off
ethtool -K $i lro off
ethtool -K $i gso off
ethtool -K $i rx off
ethtool -K $i tx off
ethtool -K $i sg off
# ethtool -i tgig0
    driver: ixgbe
    version: 3.6.7-k
    firmware-version: 0x29b70001
    bus-info: 0000:0a:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: no
    root@snots:/home/dpoole# ethtool tgig0
    Settings for tgig0:
    Supported ports: [ FIBRE ]
    Supported link modes: 10000baseT/Full
    Supported pause frame use: No
    Supports auto-negotiation: No
    Advertised link modes: 10000baseT/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Speed: 10000Mb/s
    Duplex: Full
    Port: Direct Attach Copper
    PHYAD: 0
    Transceiver: external
    Auto-negotiation: off
    Supports Wake-on: d
    Wake-on: d
    Current message level: 0x00000007 (7)
    drv probe link
    Link detected: yes
ifenslave bond0 $i
done
Actions #12

Updated by Erik C almost 11 years ago

Update: running with auto-negotiation enabled causes all forms of IDS to "not work" on our traffic. We have to drop auto-negotiation in order to get a complete picture of what is going across our bond.

Actions #13

Updated by Peter Manev almost 11 years ago

The issue is not caused by packet loss or traffic weirdness. I can reproduce it reliably (which I just did again).

On our high-speed test box, if I take a tcpdump capture and run the pcap through Suricata, I get all the logs.
Example: on the same live-traffic box, a 4-second tcpdump capture (about a 4 GB pcap), run through Suricata, results in over 6K HTTP requests written to http.log, while in any given 4-5 second interval on live traffic the HTTP logs are no more than 200-400 entries.

Actions #14

Updated by Victor Julien almost 11 years ago

Possibly related to #1082.

Actions #15

Updated by Erik C over 10 years ago

Can confirm it is not related to #1082, which presumably was patched in the 2.0 release that was released on the 28th? (very recently, at least).

Actions #16

Updated by Eric Leblond over 10 years ago

  • Priority changed from Normal to High
I've done some tests related to this problem on a 10G test box:
  • tcpdump + pcap replay in Suricata shows 1000 HTTP requests/s
  • only 30 HTTP logs/s when parsing live traffic
  • only 30 HTTP logs/s when parsing "any" subset of live traffic (port 1000:3000 and port 3001:65535)
  • 300 HTTP logs/s if we write to /run/shm instead of a plain hard disk
  • 30 HTTP logs/s when suppressing the fflush call after the write to the fd
  • 30 HTTP logs/s when suppressing the lock around the write (if \n is written then the line count is correct)

This issue is really strange: Suri is 20 times below reality.
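(For the /run/shm data point, the logger can simply be pointed at tmpfs - a rough sketch reusing the http-log section from the description; the exact tmpfs path here is an assumption:)

  - http-log:
      enabled: yes
      filename: /run/shm/http.log   # tmpfs, to take disk I/O out of the equation
      append: yes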

Actions #17

Updated by Victor Julien over 10 years ago

Eric Leblond wrote:

I've done some tests related to this problem on a 10G test box:
  • tcpdump + pcap replay in Suricata shows 1000 HTTP requests/s

If you replay this in Suri, does it log all of them?

Actions #18

Updated by Eric Leblond over 10 years ago

Victor Julien wrote:

Eric Leblond wrote:

I've done some tests related to this problem on a 10G test box:
  • tcpdump + pcap replay in Suricata shows 1000 HTTP requests/s

If you replay this in Suri, does it log all of them?

Yes, exactly: tcpdump for 3 sec saving to a pcap, then -r pcap in Suri.

Actions #19

Updated by Victor Julien over 10 years ago

Do you happen to have very aggressive flow-timeout settings for TCP? I wonder if we may be discarding the flow before the HTTP session is complete and ready to be logged.

Actions #20

Updated by Eric Leblond over 10 years ago

Victor Julien wrote:

Do you happen to have very aggressive flow-timeout settings for TCP? I wonder if we may be discarding the flow before the HTTP session is complete and ready to be logged.

Here's the part of the YAML:

flow-timeouts:
  default:
    new: 5 # 30
    established: 10 # 300
    closed: 0
    emergency-new: 1 # 10
    emergency-established: 2 # 100
    emergency-closed: 0
  tcp:
    new: 5 # 60
    established: 100 # 3600
    closed: 10 # 30
    emergency-new: 1 # 10
    emergency-established: 5 # 300
    emergency-closed: 20 # 20
  udp:
    new: 5 # 30
    established: 5 # 300
    emergency-new: 5 # 10
    emergency-established: 5 # 100
  icmp:
    new: 5 # 30
    established: 5 # 300
    emergency-new: 5 # 10
    emergency-established: 5 # 100
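(For comparison, restoring the commented-out values in the tcp section - which look like the stock defaults - would give something like the sketch below; worth a try if the aggressive timeouts turn out to matter:)

  tcp:
    new: 60                     # was 5
    established: 3600           # was 100
    closed: 30                  # was 10
    emergency-new: 10           # was 1
    emergency-established: 300  # was 5
    emergency-closed: 20        # unchanged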

Actions #21

Updated by Victor Julien over 10 years ago

Was emergency mode active during the live processing tests?

Actions #22

Updated by Eric Leblond over 10 years ago

Victor Julien wrote:

Was emergency mode active during the live processing tests?

No trace of it.

Actions #23

Updated by Victor Julien over 10 years ago

Did you have stream.midstream enabled or disabled (in both the live and the replay case)?

I'm noticing something odd: on the DK box I get about 10k HTTP logs per minute with midstream disabled, but only about 2.5k/min with midstream enabled.

Actions #24

Updated by Peter Manev over 10 years ago

I have tried both (midstream on/off on live traffic) on the 10G test box - no effect on the numbers.

Actions #25

Updated by Victor Julien over 10 years ago

So, looking further at this midstream thing, I noticed that in the non-midstream case, after 10 minutes we have:

tcp.reassembly_memuse     | AFPacketeth21             | 955346671

or about 900 MB.

In the midstream case we have:

tcp.reassembly_memuse     | AFPacketeth21             | 4294967261

or about 4 GB, which is also the memcap limit on this box.

I think there is an issue with 'ageing out' segments from the session in the midstream case, which then causes segments to not be freed.

This could easily lead to missing HTTP requests, as the TCP reassembly fails more often (memcap reached) and thus doesn't feed data to the app-layer parsers.
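(A quick way to watch for this on a live box - a rough sketch; the stats.log path and refresh interval are assumptions:)

  # watch reassembly memory use creep towards the memcap, and memcap drops start to appear
  watch -n 10 'grep -E "tcp.reassembly_memuse|tcp.segment_memcap_drop" /var/log/suricata/stats.log | tail -n 8'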

Actions #26

Updated by Erik C over 10 years ago

Any word on this? I am anxious to get this up and functioning.

Actions #27

Updated by john kely almost 10 years ago

I tested on 500 Mbps of traffic. Results:
midstream = false -> HTTP requests are logged in full
midstream = true -> about 5% of HTTP requests are missed
This is complicated.

Actions #28

Updated by Andreas Herz over 8 years ago

  • Assignee changed from Anoop Saldanha to OISF Dev
Actions #29

Updated by Andreas Herz over 5 years ago

Would it be too hard to try to reproduce this again with the 5.0 beta :)?
Or do any of you still have issues like that?

Actions #30

Updated by Peter Manev over 5 years ago

I would not be interested in http.log anymore :) but rather in "event_type: http" in eve.json.
Looking at it now, after all the changes and improvements made to the Suricata code since this ticket was opened :), this very well may be related to the drops or wrong_thread counters and similar config/setup issues.
I am in favor of closing this until we actually have a reproducible case.
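(For reference, the eve.json equivalent of the old http.log is the http type under the eve-log output - roughly:)

  - eve-log:
      enabled: yes
      filetype: regular
      filename: eve.json
      types:
        - http    # produces the "event_type: http" records mentioned above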

Actions #31

Updated by Andreas Herz over 5 years ago

  • Status changed from New to Closed