Bug #5863: netmap: packet stalls (6.0.x backport) - Suricata - Open Information Security Foundation

Actions

Copy link

Bug #5863

closed

Bug #5862: netmap: packet stalls

netmap: packet stalls (6.0.x backport)

Added by Jeff Lucovsky over 2 years ago. Updated over 2 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Bill Meeks

Target version:

6.0.11

Affected Versions:

6.0.10

Effort:

Difficulty:

Label:

Files

netmap-patch-6.0.x-backport-v5.diff (26.1 KB) netmap-patch-6.0.x-backport-v5.diff

Updated netmap stall fix patch

Bill Meeks, 03/02/2023 02:39 PM

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Bill Meeks over 2 years ago

File netmap-patch-6.0.x-backport.diff added

I have created a proposed patch for the netmap code in Suricata 6.0.x. My patch is attached to this ticket. It includes the fix for the stall/hang condition previously reported in #5744 along with a number of other needed changes.

Actions

Copy link

Updated by Bill Meeks over 2 years ago

File deleted (~~netmap-patch-6.0.x-backport.diff~~)

Actions

Copy link

Updated by Bill Meeks over 2 years ago

File netmap-patch-6.0.x-backport.diff added

Added patch file for netmap in the master-6.0.x branch.

Actions

Copy link

Updated by Bill Meeks over 2 years ago

The attached patch has been tested and verified to solve the stall issue by an OPNsense user that was previously experiencing the stall regularly. Here is the info I posted in the original #5744 thread:

I have some very encouraging news to report about my latest patch. It appears to correct the stall issue. I encountered a former OPNsense user on the pfSense forum, and we were discussing the stall issue. He agreed to test Franco's latest Suricata test package for OPNsense. He was routinely experiencing the stall condition previously with the new Suricata netmap code (6.0.9) shortly after it was released last December. He just resurrected his OPNsense VM and tested again with the latest Suricata netmap patch applied (my v5 patch) and had zero stalls and was able to saturate his Gigabit link.

So, I think the v5 patch is the solution. I will cross-post these results in the newly opened tickets.

I suggest the patch file attached to this ticket be merged into Suricata 6.0.11 and also 7.0-RC. I will be happy to prepare and submit a pull request against 7.0-RC containing this patch.

Actions

Copy link

Updated by Franco Fichtner over 2 years ago

The sleep could very well decrease performance under link saturation. For the sake of avoiding further regressions in 6.0.x I'd make the patch as minimal as it can be and only bring back the ioctl in the error case. I'm not sure which way the locking should go. ioctl() under lock is safer but may also slow down multiple threads.

Actions

Copy link

Updated by Victor Julien over 2 years ago

Related to Bug #5744: netmap: 6.0.9 v14 backport causes known packet stalls from v14 implementation in "legacy" mode too added

Actions

Copy link

Updated by Bill Meeks over 2 years ago

Franco Fichtner wrote in #note-5:

The sleep could very well decrease performance under link saturation. For the sake of avoiding further regressions in 6.0.x I'd make the patch as minimal as it can be and only bring back the ioctl in the error case. I'm not sure which way the locking should go. ioctl() under lock is safer but may also slow down multiple threads.

What about removing the sleep() call but leaving the error retry loop in place? So, if the first send attempt fails, call the ioctl() function to sync TX rings, then loop up to three times attempting to resend the packet. After three successive failures, bail with an error.

So, do you think it is better to perform the ioctl() call outside of the netmap device lock in both instances (meaning the instance where the nm_inject() call fails and even later, farther down just before the function returns, when it was successful)? That device lock is only present when using runmode "autofp". Workers mode does not use the lock as no contention for the same ring is anticipated there.

There are some other potential issues fixed in the latest 6.0.x backport patch I attached to this ticket. There are two places where the netmap device list lock used in both the legacy and new code is missing "unlock" calls before exiting from an error. I believe those are necessary. And I identified a potential error situation when the hardware NIC exposes unmatched RX/TX ring counts. For example, the NIC exposes more TX rings than RX rings. There are such animals I understand, but they are rare. Since the netmap code in Suricata has always assumed the ring counts were symmetrical, I thought it better to test to verify that and error out if they are not. Perhaps a stern logged warning message and continuing by using the lower ring count is okay, but I favor the Fatal Error exit in that instance.

Actions

Copy link

Updated by Jeff Lucovsky over 2 years ago

I'd suggest holding the lock over the ioctl call.

Actions

Copy link

Updated by Bill Meeks over 2 years ago

Jeff Lucovsky wrote in #note-8:

I'd suggest holding the lock over the ioctl call.

I concur. The device lock is only used with "autofp" run mode, and in that mode the ring needs protection from concurrent access until we are completely finished with trying to write data to it in the current thread.

I've attached an updated version of the master-6.0.x-backport.diff patch file that removes the sleep() call but leaves the other changes.

Actions

Copy link

#10

Updated by Bill Meeks over 2 years ago

File deleted (~~netmap-patch-6.0.x-backport.diff~~)

Actions

Copy link

#11

Updated by Bill Meeks over 2 years ago

File netmap-patch-6.0.x-backport.diff added

Updated master-6.0.x-backport.diff file. This version has the call to sleep() removed from the NetmapWritePackets() function.

Actions

Copy link

#12

Updated by Bill Meeks over 2 years ago

File netmap-patch-6.0.x-backport.diff added

Update to the patch diff file per comments from Jeff. Latest version attached.

A pull request containing the latest version of the patch has been submitted to the master-6.0.x branch here: https://github.com/OISF/suricata/pull/8549.

Actions

Copy link

#13

Updated by Jeff Lucovsky over 2 years ago

Status changed from Assigned to In Review
Assignee changed from Jeff Lucovsky to Bill Meeks

https://github.com/OISF/suricata/pull/8554

Actions

Copy link

#14

Updated by Bill Meeks over 2 years ago

File netmap-patch-6.0.x-backport-v5.diff netmap-patch-6.0.x-backport-v5.diff added

Attaching latest iteration of the netmap stall patch diff that matches the current pull request.

Actions

Copy link

#15

Updated by Bill Meeks over 2 years ago

File deleted (~~netmap-patch-6.0.x-backport-v1.diff~~)

Actions

Copy link

#16

Updated by Bill Meeks over 2 years ago

File deleted (~~netmap-patch-6.0.x-backport-v2.diff~~)

Actions

Copy link

#17

Updated by Victor Julien over 2 years ago

Status changed from In Review to Closed

https://github.com/OISF/suricata/pull/8554

Actions

Copy link

#18

Updated by Jeff Lucovsky over 2 years ago

franco The @master-6.0.x branch has the updates to remove the packet stall.

Could you confirm that the stall is eliminated in your deployments?

Actions

Copy link

#19

Updated by Franco Fichtner over 2 years ago

The ioctl() spot looks good, but I can't easily verify a branch as the FreeBSD port requires a distribution tarball and local build doesn't easily package to ship this to relevant users.

I'll put out a call for testing once 6.0.11 is out in any case to see if things look good now.

Thanks,
Franco

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Suricata

Custom queries

Bug #5863

netmap: packet stalls (6.0.x backport)

Updated by Bill Meeks over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Franco Fichtner over 2 years ago

Updated by Victor Julien over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Jeff Lucovsky over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Jeff Lucovsky over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Bill Meeks over 2 years ago

Updated by Victor Julien over 2 years ago

Updated by Jeff Lucovsky over 2 years ago

Updated by Franco Fichtner over 2 years ago