Project

General

Profile

Actions

Documentation #6552

open

doc: add tcp timeout fix to upgrade guide

Added by Thomas Winter 6 months ago. Updated 4 months ago.

Status:
In Review
Priority:
Normal
Assignee:
Target version:
Affected Versions:
Effort:
Difficulty:
Label:

Description

I have bisected two git commits that have caused large increases in memory use.

This one https://github.com/OISF/suricata/commit/f2e2576ff4 causes an increase in http.memuse.
4.0.6:

http.memuse                                | Total                     | 61928

7.0.0
http.memuse                                   | Total                     | 2698429

This one https://github.com/OISF/suricata/commit/1dac2467c5 causes a massive increase in tcp.reassembly_memuse such that memory use is almost 6x and the reassembly.memcap is hit causing traffic drops (IPS mode) and throughput reduction.
4.0.6

tcp.reassembly_memuse                      | Total                     | 11128832

7.0.0
tcp.reassembly_memuse                         | Total                     | 63608832

I have not been able to figure out why this increased so much hence I am making this ticket for help.

For these stats I used trex with 1 IP.

sudo ./t-rex-64 --astf -f astf/http_https.py -m 100 -d 230400 -l 100


Files

http_https_suri5.pcap (11.9 MB) http_https_suri5.pcap Thomas Winter, 11/23/2023 12:09 AM
massif.out.3238 (50.3 KB) massif.out.3238 Thomas Winter, 11/23/2023 01:21 AM
massif.out.2298 (91 KB) massif.out.2298 Thomas Winter, 11/23/2023 01:21 AM

Subtasks 2 (1 open1 closed)

Documentation #6640: doc: add tcp timeout fix to upgrade guide (6.0.x backport)RejectedActions
Documentation #6641: doc: add tcp timeout fix to upgrade guide (7.0.x backport)AssignedVictor JulienActions
Actions #1

Updated by Thomas Winter 6 months ago

  • Priority changed from Normal to High
Actions #2

Updated by Victor Julien 6 months ago

  • Status changed from New to Feedback
  • Assignee changed from OISF Dev to Thomas Winter

It would be good to get a better test case, ideally pcap based. Also, you could use tools like valgrinds massif to try to analyze where the memory use is exactly.

Updated by Thomas Winter 6 months ago

I've attached massif.out.3238 which is 4.0.6 with ReassembleRealloc at 17.1MB peak.
I've attached massif.out.2298 which is 7.0.0 with StreamTcpReassembleRealloc at 61.7MB peak.
Both were captured with `astf/http_https.py -m 100`.

After flows are closed/timedout, memuse returns to baseline so it is not a leak.

Here are some notes on running trex.

#Download and extract
wget --no-cache --no-check-certificate https://trex-tgn.cisco.com/trex/release/latest
tar -xzvf latest
cd v3.04/

#Create a trex config file
cat /etc/trex_cfg.yaml
- port_limit      : 2
  version         : 2
#List of interfaces. Change to suit your setup. Use ./dpdk_setup_ports.py -s to see available options
#interfaces    : ["enp3s0","enp4s0"] 
  interfaces    : ["br10","br20"] 
  port_info       :  # Port IPs. Change to suit your needs. In case of loopback, you can leave as is.
          - ip         : 192.168.10.100
            #default_gw : 192.168.10.15
            default_gw : 192.168.10.16
          - ip         : 192.168.20.100
            #default_gw : 192.168.20.15
            default_gw : 192.168.20.16

#Update traffic to use single client/server IP
diff astf/http_https.py.orig astf/http_https.py
15,16c15,16
<         ip_gen_c = ASTFIPGenDist(ip_range=["16.0.0.0", "16.0.0.255"], distribution="seq")
<         ip_gen_s = ASTFIPGenDist(ip_range=["48.0.0.0", "48.0.255.255"], distribution="seq")
---
>         ip_gen_c = ASTFIPGenDist(ip_range=["192.168.10.100", "192.168.10.100"], distribution="seq")
>         ip_gen_s = ASTFIPGenDist(ip_range=["192.168.20.100", "192.168.20.100"], distribution="seq")

#IPs are on the local interfaces, NICs are connected to the device running suricata acting as router/gateway
ip addr
8: br10: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.10.100/24 brd 192.168.10.255 scope global br10
9: br20: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.20.100/24 brd 192.168.20.255 scope global br20

#Run the test
sudo ./t-rex-64 --astf -f astf/http_https.py -m 250

I've attached a packet capture of a trex run with -m 10 for 10 seconds. Packet capture would be too large otherwise, you would need to set a very low reassembly memcap to see it run out with this one. I suspect any tcp traffic should see the increased memory usage eg http(s) transactions.

Actions #4

Updated by Thomas Winter 6 months ago

So far I have only seen this issue when using trex and another traffic generator.
Running multiple continuous curl to download large files, I do not see such large memory use.
A difference between the curl transactions is that they end with FIN/ACKs. The trex traffic ends in a RST/ACK.
I can use wget and kill the process to do a RST/ACK and see an increase in tcp.reassembly_memuse

while [ 1 ]
do
wget -o /dev/null -O /dev/null --http-user="user" --http-password="pass" --no-check-certificate http://192.168.10.15/upload_test &
wget_pid=$!
sleep 1
kill -s SIGTERM "$wget_pid" 
done

If wget is not interrupted, I do not see the memory use increase as drastically.

Actions #5

Updated by Thomas Winter 6 months ago

The streamingBuffer rework seems to be a red herring. I think it just increased memuse just enough to hit the memcap. There seems to be a large change in usage between 5.0.0 and 6.0.0 - I am performing another git bisect.

Actions #6

Updated by Thomas Winter 6 months ago

The cause was this patch c50ef8cc21 ("flow: fix TCP closed default initialization") which increased the timeout for TCP sessions in the closed state which caused flows to hang around longer causing elevated memuse which previously would have decreased much sooner. I'm not sure why I could only see this for sessions closed by RST and not FIN as both should become closed state.

Applying the following config restores the previous behaviour

flow-timeouts:
 tcp:
  closed: 0
  emergency_closed: 0

Actions #7

Updated by Thomas Winter 6 months ago

Feel free to close this now.

Actions #8

Updated by Victor Julien 6 months ago

  • Tracker changed from Bug to Documentation
  • Subject changed from Large increase in memory after upgrade to 7.0.0 causing traffic loss due to memcap to doc: add tcp timeout fix to upgrade guide
  • Assignee changed from Thomas Winter to Victor Julien
  • Priority changed from High to Normal
  • Target version changed from TBD to 7.0.3

Thanks Thomas. Switching this to a task to update the upgrading from 6 to 7 guide about this fix and how it can affect memory use.

Actions #9

Updated by Victor Julien 6 months ago

  • Status changed from Feedback to Assigned
Actions #10

Updated by Victor Julien 6 months ago

  • Status changed from Assigned to In Review
  • Label Needs backport to 6.0, Needs backport to 7.0 added
Actions #11

Updated by Shivani Bhardwaj 5 months ago

  • Target version changed from 7.0.3 to 8.0.0-beta1
Actions #12

Updated by OISF Ticketbot 5 months ago

  • Subtask #6640 added
Actions #13

Updated by OISF Ticketbot 5 months ago

  • Label deleted (Needs backport to 6.0)
Actions #14

Updated by OISF Ticketbot 5 months ago

  • Subtask #6641 added
Actions #15

Updated by OISF Ticketbot 5 months ago

  • Label deleted (Needs backport to 7.0)
Actions #16

Updated by Thomas Winter 5 months ago

Should use "emergency-closed" not "emergency_closed" otherwise you get a deprecated warning log.

Actions

Also available in: Atom PDF