Bug #5118: smb: excessive CPU utilization and higher packet processing latency due to excessive calls to Vec::extend_from_slice() - Suricata - Open Information Security Foundation

Actions

Bug #5118

closed

JL JI

smb: excessive CPU utilization and higher packet processing latency due to excessive calls to Vec::extend_from_slice()

Bug #5118: smb: excessive CPU utilization and higher packet processing latency due to excessive calls to Vec::extend_from_slice()

Added by Jeff Lucovsky over 4 years ago. Updated over 4 years ago.

Status:

Closed

Priority:

High

Assignee:

Jason Ish

Target version:

6.0.5

Affected Versions:

4.1.10, 5.0.8

Effort:

Difficulty:

Label:

Rust

Description

This issue is related to https://redmine.openinfosecfoundation.org/issues/4842 and possibly https://redmine.openinfosecfoundation.org/issues/4580

While testing the workaround for 4842 by setting a stream-depth:20mb for SMB, it was seen that there was higher worker thread CPU utilization and hence higher packet processing latency when doing large file transfers between Windows 10 clients. Issue is seen in 5.0.8, but not in 6.0.4 as 6.x code does not have buffering in the SMB rust protocol parser [https://github.com/OISF/suricata/commit/674b8dc0fb35c7f91074cdb993d7b925f1987027].

On using perf top, it was evident that __memmove_avx_unaligned_erms() was getting called frequently

@
PerfTop: 375 irqs/sec kernel: 6.7% exact: 0.0% [300Hz cycles], (target_pid: 23192)
---------------------------------------------------------------------------------------------------------------------------------------

41.69%  libc-2.27.so        [.] __memmove_avx_unaligned_erms
     4.59%  suricata            [.] AppLayerParserGetStateProgress
     4.06%  suricata            [.] rs_smb_state_get_tx_iterator
     3.58%  suricata            [.] GetDetectTx
     2.75%  libpthread-2.27.so  [.] __pthread_mutex_unlock
     2.61%  libpthread-2.27.so  [.] __pthread_mutex_trylock
     2.39%  suricata            [.] FlowGetProtoMapping
     2.27%  suricata            [.] AppLayerParserTransactionsCleanup
     2.10%  suricata            [.] DetectRunTx
     2.10%  libc-2.27.so        [.] cfree@GLIBC_2.2.5
     1.72%  suricata            [.] AppLayerParserGetTxDetectFlags
     1.70%  suricata            [.] FlowTimeoutHash
     1.57%  suricata            [.] SMBGetTxIterator
     1.12%  suricata            [.] rs_smb_tx_get_alstate_progress
     0.99%  suricata            [.] InspectionBufferClean
     0.94%  suricata            [.] SMBGetAlstateProgress

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
24092 root      20   0  685.5m 196.3m  21.2m R 93.5  1.2   0:16.16 W#01-ens224
24091 root      20   0  685.5m 196.3m  21.2m S 25.8  1.2   0:04.54 W#01-ens192
24090 root      20   0  685.5m 196.3m  21.2m S  0.0  1.2   0:00.30 Suricata-Main
24093 root      20   0  685.5m 196.3m  21.2m S  0.0  1.2   0:00.12 FM#01
24094 root      20   0  685.5m 196.3m  21.2m S  0.0  1.2   0:00.10 FR#01
24095 root      20   0  685.5m 196.3m  21.2m S  0.0  1.2   0:00.00 CW
24096 root      20   0  685.5m 196.3m  21.2m S  0.0  1.2   0:00.00 CS
24097 root      20   0  685.5m 196.3m  21.2m S  0.0  1.2   0:00.00 US

From Intel VTune, it was clear that most calls to __memmove_avx_unaligned_erms() were getting called from suricata::smb::smb::SMBState::parse_tcp_data_ts(). Using rust-gdb, this was further narrowed down to the following call stack.

#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:221
#1 0x00005555558d3b12 in core::intrinsics::copy_nonoverlapping (src=0x7fffe5b2ba58 "\002 \020-߷", dst=<optimized out>, count=1448)
at /build/rustc-fdjxBl/rustc-1.53.0+dfsg1+llvm/library/core/src/intrinsics.rs:1861
#2 alloc::vec::Vec<T,A>::append_elements (self=0x7ffff233fa40, other=...)
at /build/rustc-fdjxBl/rustc-1.53.0+dfsg1+llvm/library/alloc/src/vec/mod.rs:1698
#3 <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<&T,core::slice::iter::Iter<T>>>::spec_extend (self=0x7ffff233fa40,
iterator=...) at /build/rustc-fdjxBl/rustc-1.53.0+dfsg1+llvm/library/alloc/src/vec/spec_extend.rs:83
#4 alloc::vec::Vec<T,A>::extend_from_slice (self=0x7ffff233fa40, other=...)
at /build/rustc-fdjxBl/rustc-1.53.0+dfsg1+llvm/library/alloc/src/vec/mod.rs:2119
#5 suricata::smb::smb::SMBState::parse_tcp_data_ts (self=0x7fffec3c0440, i=...) at src/smb/smb.rs:1382
#6 0x00005555558d92f1 in suricata::smb::smb::rs_smb_parse_request_tcp (flow=<optimized out>, state=0x7fffec3c0440,
_pstate=<optimized out>, input=0x7fffe5b2ba58 "\002 \020-߷", input_len=1448, _data=<optimized out>, flags=4)
at src/smb/smb.rs:1862
#7 0x00005555555fc871 in SMBTCPParseRequest (f=0x555556bedfe0, state=0x7fffec3c0440, pstate=0x7fffec3c0400,
input=0x7fffe5b2ba58 "\002 \020-߷", input_len=1448, local_data=0x0, flags=4 '\004') at app-layer-smb.c:47
#8 0x00005555555fb1d1 in AppLayerParserParse (tv=0x555555e30350, alp_tctx=0x7fffe429adb0, f=0x555556bedfe0, alproto=8,
flags=4 '\004', input=0x7fffe5b2ba58 "\002 \020-߷", input_len=1448) at app-layer-parser.c:1266
#9 0x0000555555593616 in AppLayerHandleTCPData (tv=0x555555e30350, ra_ctx=0x7fffe4293af0, p=0x7fffe426c170, f=0x555556bedfe0,
ssn=0x7fffe435d570, stream=0x7ffff233fe00, data=0x7fffe5b2ba58 "\002 \020-߷", data_len=1448, flags=4 '\004') at app-layer.c:683
#10 0x00005555557ecd19 in ReassembleUpdateAppLayer (tv=0x555555e30350, ra_ctx=0x7fffe4293af0, ssn=0x7fffe435d570,
stream=0x7ffff233fe00, p=0x7fffe426c170, dir=UPDATE_DIR_PACKET) at stream-tcp-reassemble.c:1102
#11 0x00005555557eceb4 in StreamTcpReassembleAppLayer (tv=0x555555e30350, ra_ctx=0x7fffe4293af0, ssn=0x7fffe435d570,
stream=0x7fffe435d608, p=0x7fffe426c170, dir=UPDATE_DIR_PACKET) at stream-tcp-reassemble.c:1159
#12 0x00005555557ee30a in StreamTcpReassembleHandleSegment (tv=0x555555e30350, ra_ctx=0x7fffe4293af0, ssn=0x7fffe435d570,
stream=0x7fffe435d608, p=0x7fffe426c170, pq=0x7fffe4293798) at stream-tcp-reassemble.c:1832
#13 0x00005555557db67c in HandleEstablishedPacketToServer (tv=0x555555e30350, ssn=0x7fffe435d570, p=0x7fffe426c170,
stt=0x7fffe4293790, pq=0x7fffe4293798) at stream-tcp.c:2318
#14 0x00005555557dc9d8 in StreamTcpPacketStateEstablished (tv=0x555555e30350, p=0x7fffe426c170, stt=0x7fffe4293790,
ssn=0x7fffe435d570, pq=0x7fffe4293798) at stream-tcp.c:2692
#15 0x00005555557e4247 in StreamTcpStateDispatch (tv=0x555555e30350, p=0x7fffe426c170, stt=0x7fffe4293790, ssn=0x7fffe435d570,
pq=0x7fffe4293798, state=4 '\004') at stream-tcp.c:4726
#16 0x00005555557e4a59 in StreamTcpPacket (tv=0x555555e30350, p=0x7fffe426c170, stt=0x7fffe4293790, pq=0x7fffe4289ea8)
at stream-tcp.c:4916
#17 0x00005555557e53aa in StreamTcp (tv=0x555555e30350, p=0x7fffe426c170, data=0x7fffe4293790, pq=0x7fffe4289ea8, postpq=0x0)
at stream-tcp.c:5252
#18 0x0000555555731e52 in FlowWorker (tv=0x555555e30350, p=0x7fffe426c170, data=0x7fffe4289e80, preq=0x555555d9acb0, unused=0x0)
at flow-worker.c:245
#19 0x0000555555800517 in TmThreadsSlotVarRun (tv=0x555555e30350, p=0x7fffe426c170, slot=0x555555e16440) at tm-threads.c:134
#20 0x00005555557bd3ea in TmThreadsSlotProcessPkt (tv=0x555555e30350, s=0x555555e16440, p=0x7fffe426c170) at tm-threads.h:163
#21 0x00005555557bfc39 in AFPParsePacketV3 (ptv=0x7fffe426cb20, pbd=0x7fffe2cfe000, ppd=0x7fffe2d01db0) at source-af-packet.c:1180

Looking at the code at parse_tcp_data_ts() code in rust/src/smb/smb.rs, its seen that self.tcp_buffer_ts.len() is called post call to self.tcp_buffer_ts.split_off(0). But its seen that
self.tcp_buffer_ts.len() always returns the length as zero since I think Vec::split_off() function sets the self.length to zero post call to self.tcp_buffer_ts.split_off(0), which nullifies the greater than 100000 check.

@
From: "rust/src/smb/smb.rs"
1367 /// Parsing function, handling TCP chunks fragmentation
1368 pub fn parse_tcp_data_ts<'b>(&mut self, i: &'b[u8]) -> u32
1369 {
1370 let mut v : Vec<u8>;
1371 //println!("parse_tcp_data_ts ({})",i.len());
1372 //println!("{:?}",i);
1373 // Check if TCP data is being defragmented
1374 let tcp_buffer = match self.tcp_buffer_ts.len() {
1375 0 => i,
1376 _ => {
1377 v = self.tcp_buffer_ts.split_off(0);
1378 if self.tcp_buffer_ts.len() + i.len() > 100000 {
1379 self.set_event(SMBEvent::RecordOverflow);
1380 return 1;
1381 };
1382 v.extend_from_slice(i);
1383 v.as_slice()
1384 },
1385 };

From: "rust-1.53.0/library/alloc/src/vec/mod.rs"
1833 pub fn split_off(&mut self, at: usize) -> Self
1834 where
1835 A: Clone,
1836 {
1837 #[cold]
1838 #[inline(never)]
1839 fn assert_failed(at: usize, len: usize) -> ! {
1840 panic!("`at` split index (is {}) should be <= len (is {})", at, len);
1841 }
1842
1843 if at > self.len() {
1844 assert_failed(at, self.len());
1845 }
1846
1847 if at == 0 {
1848 // the new vector can take over the original buffer and avoid the copy
1849 return mem::replace(
1850 self,
1851 Vec::with_capacity_in(self.capacity(), self.allocator().clone()),
1852 );
1853 }
1854
1855 let other_len = self.len - at;
1856 let mut other = Vec::with_capacity_in(other_len, self.allocator().clone());
1857
1858 // Unsafely `set_len` and copy items to `other`.
1859 unsafe {
1860 self.set_len(at);
1861 other.set_len(other_len);

From: "rust/src/krb/krb5.rs"
536 // sanity check to avoid memory exhaustion
537 if state.defrag_buf_ts.len() + buf.len() > 100000 {
538 SCLogDebug!("rs_krb5_parse_request_tcp: TCP buffer exploded {} {}",
539 state.defrag_buf_ts.len(), buf.len());
540 return 1;
541 }
542 v = state.defrag_buf_ts.split_off(0);
543 v.extend_from_slice(buf);
544 v.as_slice()

To fix the issue, the following code change is required (based on similar code in rust/src/krb/krb5.rs). Similar fix will also be needed in rust/src/nfs/nfs.rs as well as the bug seems to exists for NFS parser as well.

diff aruNp rust/src/smb/smb.rs.orig rust/src/smb/smb.rs
-- rust/src/smb/smb.rs.orig 2022-01-10 12:59:53.269071810 0530
++ rust/src/smb/smb.rs 2022-01-10 13:00:13.099088084 0530
@ -1374,11 +1374,11 @ impl SMBState {
let tcp_buffer = match self.tcp_buffer_ts.len() {
0 => i,
_ => {
- v = self.tcp_buffer_ts.split_off(0);
if self.tcp_buffer_ts.len() + i.len() > 100000 {
self.set_event(SMBEvent::RecordOverflow);
return 1;
};
v = self.tcp_buffer_ts.split_off(0);
v.extend_from_slice(i);
v.as_slice()
},
@ -1602,11 +1602,11 @ impl SMBState {
let tcp_buffer = match self.tcp_buffer_tc.len() {
0 => i,
_ => {
- v = self.tcp_buffer_tc.split_off(0);
if self.tcp_buffer_tc.len() + i.len() > 100000 {
self.set_event(SMBEvent::RecordOverflow);
return 1;
};
+ v = self.tcp_buffer_tc.split_off(0);
v.extend_from_slice(i);
v.as_slice()
},
@

Post the change, the CPU utilization comes down.

@
PerfTop: 584 irqs/sec kernel:10.3% exact: 0.0% [300Hz cycles], (target_pid: 22936)
---------------------------------------------------------------------------------------------------------------------------------------

8.70%  suricata            [.] AppLayerParserGetStateProgress
     7.33%  suricata            [.] rs_smb_state_get_tx_iterator
     6.79%  suricata            [.] GetDetectTx
     5.88%  suricata            [.] AppLayerParserTransactionsCleanup
     5.65%  libpthread-2.27.so  [.] __pthread_mutex_unlock
     5.40%  libpthread-2.27.so  [.] __pthread_mutex_trylock
     4.39%  suricata            [.] DetectRunTx
     3.90%  suricata            [.] AppLayerParserGetTxDetectFlags
     3.11%  suricata            [.] FlowGetProtoMapping
     2.99%  suricata            [.] SMBGetTxIterator
     2.89%  suricata            [.] FlowTimeoutHash
     2.83%  suricata            [.] SMBGetAlstateProgress
     2.05%  libc-2.27.so        [.] cfree@GLIBC_2.2.5
     1.33%  libc-2.27.so        [.] __memset_avx2_erms

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
23836 root      20   0  557.5m 102.1m  21.3m R 71.0  0.6   0:13.64 W#01-ens192
23837 root      20   0  557.5m 102.1m  21.3m R 48.4  0.6   0:12.20 W#01-ens224
23835 root      20   0  557.5m 102.1m  21.3m S  0.0  0.6   0:00.27 Suricata-Main
23838 root      20   0  557.5m 102.1m  21.3m S  0.0  0.6   0:00.14 FM#01
23839 root      20   0  557.5m 102.1m  21.3m S  0.0  0.6   0:00.13 FR#01
23840 root      20   0  557.5m 102.1m  21.3m S  0.0  0.6   0:00.00 CW
23841 root      20   0  557.5m 102.1m  21.3m S  0.0  0.6   0:00.00 CS
23842 root      20   0  557.5m 102.1m  21.3m S  0.0  0.6   0:00.00 US

Attaching the Suricata config and Intel VTune screenshot. I also have a 800Mb pcap (compressed size 54MB) which can reproduce the issue. Request Suricata team to include the pcap as part of Suricata regression tests.

md5sum smb_small_ixia.pcap.7z 0506633f0e9351354429f70b1597b2eb smb_small_ixia.pcap.7z

Content of smb.rules (single ETOpen rule is enough to trigger the issue)
@

drop smb $HOME_NET any -> any any (msg:"ET EXPLOIT ETERNALBLUE Probe Vulnerable System Response MS17-010"; flow:from_server,established; content:"|ff|SMB|25 05 02 00 c0 98 01|"; offset:4; depth:11; content:"|00 00 00 00 00 00 00 00 00 00|"; distance:3; within:10; content:"|00 00 00|"; distance:8; within:3; isdataat:!1,relative; threshold: type limit, track by_src, count 1, seconds 30; reference:url,github.com/rapid7/metasploit-framework/blob/master/modules/auxiliary/scanner/smb/smb_ms17_010.rb; classtype:trojan-activity; sid:2025650; rev:2; metadata:affected_product Windows_XP_Vista_7_8_10_Server_32_64_Bit, attack_target Client_Endpoint, created_at 2018_07_11, deployment Internal, former_category EXPLOIT, signature_severity Major, tag Metasploit, tag ETERNALBLUE, updated_at 2019_09_28;)

Files

Download all files

config.yaml (73.7 KB) config.yaml		Srini J, 01/10/2022 07:50 AM
vtune_smb.png (133 KB) vtune_smb.png		Srini J, 01/10/2022 07:51 AM

Related issues 1 (0 open — 1 closed)

JL Updated by Jeff Lucovsky over 4 years ago Actions
Copy link
#1

Copied from Bug #4945: smb: excessive CPU utilization and higher packet processing latency due to excessive calls to Vec::extend_from_slice() added