Add support for SKIP-RX-COPY using MSG_TRUNC and Zero-copy using SO_ZEROCOPY/MSG_ZEROCOPY #1690

Conversation

davidBar-On
Contributor

@davidBar-On davidBar-On commented Apr 27, 2024

Add support for SKIP-RX-COPY (using MSG_TRUNC) and SO_ZEROCOPY/MSG_ZEROCOPY. Although it is not clear that all of the added functionality improves performance and throughput, support for all these options and their combinations was added so that each of them can be tested. The assumption is that different environments may have different levels of support for the different options and their combinations.

(Note that running bootstrap.sh; configure is required before make to support the new features.)

The added options are:

  1. --skip-rx-copy: when used, for both TCP and UDP, recv(..., MSG_TRUNC) is used instead of read().
  2. Support for MSG_ZEROCOPY. When used, the socket option SO_ZEROCOPY is set and send(..., MSG_ZEROCOPY) is used instead of write(). MSG_ZEROCOPY is used in the following cases:
    2.1 UDP: when the -Z/--zerocopy option is set.
    2.2 TCP: when --zerocopy=z is set. Otherwise, sendfile() continues to be used for TCP zero copy.
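
The second option can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: try_enable_zerocopy() and send_buffer() are hypothetical helper names, and the fallback to a plain send() when SO_ZEROCOPY is unavailable is an assumption about how a caller might want to behave. Note that after a successful MSG_ZEROCOPY send the buffer is shared with the kernel until a completion notification arrives.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: returns 1 if SO_ZEROCOPY was enabled on fd, else 0. */
int try_enable_zerocopy(int fd)
{
#ifdef SO_ZEROCOPY
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0;
#else
    (void)fd;   /* platform headers do not define SO_ZEROCOPY */
    return 0;
#endif
}

/* Hypothetical helper: send() with MSG_ZEROCOPY only when it was enabled. */
ssize_t send_buffer(int fd, const void *buf, size_t len, int zerocopy_enabled)
{
    int flags = 0;

    assert(buf != NULL);
#ifdef MSG_ZEROCOPY
    if (zerocopy_enabled)
        flags |= MSG_ZEROCOPY;   /* kernel may pin buf instead of copying it */
#else
    (void)zerocopy_enabled;
#endif
    return send(fd, buf, len, flags);
}
```

A caller would run try_enable_zerocopy() once after creating the socket and pass its result to every send_buffer() call, so that sockets or kernels without zero-copy support keep working with ordinary copies.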

@marcosfsch
Contributor

Thanks David, this looks very promising.

Here are my results on a 100G back-to-back test environment:

Server command: sudo taskset --cpu-list 1 ./src/iperf3 -s -i 1
Client commands:
sudo taskset --cpu-list 1 ./src/iperf3 -c 192.168.1.18 -i 1 -t 30 -V
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 126 GBytes 36.0 Gbits/sec 8 sender
[ 5] 0.00-30.00 sec 126 GBytes 36.0 Gbits/sec receiver
CPU Utilization: local/sender 40.2% (0.9%u/39.3%s), remote/receiver 99.7% (1.7%u/98.0%s)

sudo taskset --cpu-list 1 ./src/iperf3 -c 192.168.1.18 -i 1 -t 30 -V --skip-rx-copy
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 239 GBytes 68.4 Gbits/sec 0 sender
[ 5] 0.00-30.00 sec 239 GBytes 68.4 Gbits/sec receiver
CPU Utilization: local/sender 79.5% (1.8%u/77.8%s), remote/receiver 90.5% (4.4%u/86.0%s)

sudo taskset --cpu-list 1 ./src/iperf3 -c 192.168.1.18 -i 1 -t 30 -V --skip-rx-copy --zerocopy=z
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 344 GBytes 98.5 Gbits/sec 0 sender
[ 5] 0.00-30.00 sec 344 GBytes 98.5 Gbits/sec receiver
CPU Utilization: local/sender 33.2% (2.1%u/31.1%s), remote/receiver 99.9% (6.4%u/93.5%s)

sudo taskset --cpu-list 1 ./src/iperf3 -c 192.168.1.18 -i 1 -t 30 -V --skip-rx-copy -Z
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 346 GBytes 99.0 Gbits/sec 0 sender
[ 5] 0.00-30.00 sec 346 GBytes 99.0 Gbits/sec receiver
CPU Utilization: local/sender 52.5% (2.2%u/50.3%s), remote/receiver 100.0% (6.5%u/93.4%s)

@bltierney
Contributor

This is fantastic! Thanks! I get 2x throughput on a 100G path.

Question: Shouldn't --skip-rx-copy be a server side option instead of a client side? (or both if using the -R option)

@davidBar-On
Contributor Author

Question: Shouldn't --skip-rx-copy be a server side option instead of a client side? (or both if using the -R option)

The client sends this option (like several of the other options) to the server, so setting it for a test applies to both normal and reverse (-R) modes. Making it a server option would mean that there is no way to set the option per test when the server is the receiver.

@bltierney
Contributor

Ah, that makes sense. Thanks. I do see a use case where I might want to force it on the server side, but passing the option from the client is more useful.

@davidBar-On
Contributor Author

davidBar-On commented May 3, 2024 via email

@bmah888
Contributor

bmah888 commented May 20, 2024

Thanks for the pull request! We're gonna need to study this a bit.

@MattCatz
Contributor

Having read through the patch (haven't run anything yet) and associated Linux documentation:

  • MSG_TRUNC seems to affect TCP and UDP differently. You have things set-up in a way that I would expect only TCP throughput to be improved. Has anyone seen an improvement in UDP throughput?
  • This iteration of changes is probably incompatible with the --file flag (specifically MSG_ZEROCOPY). To be compatible we would need to wait for the kernel to notify us that it is done with the shared buffer.

https://man7.org/linux/man-pages/man2/recv.2.html
https://man7.org/linux/man-pages/man7/tcp.7.html
https://docs.kernel.org/networking/msg_zerocopy.html

src/iperf_api.c Outdated
Comment on lines 1770 to 1776
// sero copy for TCP use sendfile()
if (test->zerocopy && test->protocol->id != Pudp && !has_sendfile()) {
i_errno = IENOSENDFILE;
return -1;
}
#else
// sero copy is supported only by TCP
Contributor


Suggested change
- // sero copy for TCP use sendfile()
- if (test->zerocopy && test->protocol->id != Pudp && !has_sendfile()) {
- i_errno = IENOSENDFILE;
- return -1;
- }
- #else
- // sero copy is supported only by TCP
+ // zero copy for TCP use sendfile()
+ if (test->zerocopy && test->protocol->id != Pudp && !has_sendfile()) {
+ i_errno = IENOSENDFILE;
+ return -1;
+ }
+ #else
+ // zero copy is supported only by TCP

Contributor Author


Changed (with other changes - see comments)

@davidBar-On
Contributor Author

MSG_TRUNC seems to affect TCP and UDP differently. You have things set-up in a way that I would expect only TCP throughput to be improved. Has anyone seen an improvement in UDP throughput?

Why "things set-up in a way that I would expect only TCP throughput to be improved"? Although practically that may be the case, what is wrong in the implementation regarding UDP? (Both UDP and TCP use Nrecv() with MSG_TRUNC as socket-option.)

This iteration of changes is probably incompatible with the --file flag (specifically MSG_ZEROCOPY). To be compatible we would need to wait for the kernel to notify us that it is done with the shared buffer.

Forgot to take --file into account. Without this option, the sent buffer is fixed, so there is no need to handle the kernel notifications. To simplify the initial solution, I have now added a check to not allow the use of MSG_ZEROCOPY when --file is set.

@MattCatz
Contributor

Why "things set-up in a way that I would expect only TCP throughput to be improved"? Although practically that may be the case, what is wrong in the implementation regarding UDP? (Both UDP and TCP use Nrecv() with MSG_TRUNC as socket-option.)

If I understand the Linux kernel documentation correctly, recv(fd, buf, nleft, MSG_TRUNC) will discard different parts of the kernel's buffer depending on if it is UDP or TCP.

With TCP it will discard nleft from the kernel buffer.

With UDP it will discard everything after nleft.

This means that we only have to read the first part of UDP packet to get the UDP stats the server sticks in there:

--- a/src/iperf_udp.c
+++ b/src/iperf_udp.c
@@ -69,6 +69,9 @@ iperf_udp_recv(struct iperf_stream *sp)
     sock_opt = 0;
 #endif /* HAVE_MSG_TRUNC */
 
+    if (sock_opt)
+        size = sizeof(sec) + sizeof(usec) + sizeof(pcount);
+
     r = Nrecv(sp->socket, sp->buffer, size, Pudp, sock_opt);
 
     /*
--- a/src/net.c
+++ b/src/net.c
@@ -397,8 +397,8 @@ Nread(int fd, char *buf, size_t count, int prot)
 int
 Nrecv(int fd, char *buf, size_t count, int prot, int sock_opt)
 {
-    register ssize_t r;
-    register size_t nleft = count;
+    register ssize_t r, total=0;
+    register ssize_t nleft = count;
     struct iperf_time ftimeout = { 0, 0 };
 
     fd_set rfdset;
@@ -441,6 +441,7 @@ Nrecv(int fd, char *buf, size_t count, int prot, int sock_opt)
         } else if (r == 0)
             break;
 
+        total += r;
         nleft -= r;
         buf += r;
 
@@ -477,7 +478,7 @@ Nrecv(int fd, char *buf, size_t count, int prot, int sock_opt)
             }
         }
     }
-    return count - nleft;
+    return total;
 }
 
 

Doing a quick loopback test on my system shows about a 20% increase in throughput for large UDP packets.
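
The UDP behaviour described above can be checked in isolation with a sketch like the following. It is only an illustration under assumptions: recv_header_only() is a hypothetical helper, and HDR_LEN is set to 16 bytes as a stand-in for sizeof(sec) + sizeof(usec) + sizeof(pcount). On Linux datagram sockets, recv() with MSG_TRUNC copies at most the requested bytes but returns the real datagram length.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Assumed header size for illustration only. */
enum { HDR_LEN = 16 };

/*
 * Hypothetical helper: receive one datagram, copying only its first
 * HDR_LEN bytes into hdr. The rest of the datagram is dropped by the
 * kernel, yet the return value is the full datagram length, so byte
 * counters stay correct.
 */
ssize_t recv_header_only(int fd, unsigned char *hdr)
{
    assert(hdr != NULL);
    return recv(fd, hdr, HDR_LEN, MSG_TRUNC);
}
```

The same call can be tried locally on an AF_UNIX datagram socketpair, which also reports the real datagram length with MSG_TRUNC on Linux.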

On a quick tangent, it looks like with this change UDP tests are pretty much dominated by the overhead of select.

Below is not using --skip-rx-copy:
[image: flame graph]

Below is using --skip-rx-copy:
[image: flame graph]

@MattCatz
Contributor

Forgot to take --file into account. Without this option, the sent buffer is fixed, so there is no need to handle the kernel notifications. To simplify the initial solution, I have now added a check to not allow the use of MSG_ZEROCOPY when --file is set.

I wasn't aware of this when I wrote my original comment, but since iperf inserts UDP stats at the start of the buffer used for UDP packets, using MSG_ZEROCOPY will create a similar race condition for UDP tests. This will cause iperf to misreport lost datagrams.
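
Avoiding that race means waiting for the kernel's completion notification before rewriting the buffer. The kernel MSG_ZEROCOPY documentation linked earlier describes reading these from the socket error queue; a rough sketch follows. Both function names are hypothetical, the control buffer size is an arbitrary assumption, and a fuller version would also check the cmsg level and type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

/*
 * Hypothetical helper: decode one error-queue record. Returns 0 and the
 * inclusive range [*first, *last] of completed zerocopy send calls, or
 * -1 if the record is not a zerocopy completion.
 */
int zerocopy_completed_range(const struct sock_extended_err *serr,
                             uint32_t *first, uint32_t *last)
{
    assert(serr != NULL && first != NULL && last != NULL);
    if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
        return -1;
    *first = serr->ee_info;
    *last = serr->ee_data;
    return 0;
}

/*
 * Hypothetical helper: fetch one notification from fd's error queue.
 * Reading the error queue does not block; -1 also covers the case
 * where nothing has completed yet.
 */
int read_zerocopy_completion(int fd, uint32_t *first, uint32_t *last)
{
    union { char buf[256]; struct cmsghdr align; } control;
    struct sock_extended_err serr;
    struct msghdr msg;
    struct cmsghdr *cm;

    memset(&msg, 0, sizeof(msg));
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
        return -1;
    cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_len < CMSG_LEN(sizeof(serr)))
        return -1;
    memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
    return zerocopy_completed_range(&serr, first, last);
}
```

A sender that rewrites the start of its buffer for every datagram would have to track which send calls are still outstanding and touch the buffer only once the matching range has been reported.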

@davidBar-On davidBar-On force-pushed the issue-1678-MSG_TRUNC_and_SO_ZEROCOPY-MSG_ZEROCOPY branch from 95ee7db to abf359a Compare June 14, 2024 13:30
@davidBar-On
Contributor Author

If I understand the Linux kernel documentation correctly, recv(fd, buf, nleft, MSG_TRUNC) will discard different parts of the kernel's buffer depending on if it is UDP or TCP. ... With UDP it will discard everything after nleft. This means that we only have to read the first part of UDP packet to get the UDP stats the server sticks in there:

Thanks a lot! I completely missed that point. I have now added your suggested change to iperf_udp_recv(). I didn't make the suggested changes to Nrecv(), since I understand that they are just cosmetic, and my usual approach is to make the minimum changes required, as this seems safer (risk of introducing bugs, portability, etc.).

The new commit is rebased, and I forgot to mention previously that the changes include the PR #1708 fix, to reduce the server's CPU overhead.

... UDP stats at the start of the buffer used for UDP packets, using MSG_ZEROCOPY will create a similar race condition for UDP tests.

As you probably understood, I somehow overlooked the dynamic UDP prefix of a packet ... For the receiving side you solved the issue with the above. For the sending side it seems that the only solution is to use the MSG_ZEROCOPY notifications, but I don't want to add this complexity at this point. It seems better initially to just not implement zero copy for UDP. Before I do that, do you agree? Do you have another suggestion?

@MattCatz
Contributor

I agree that MSG_ZEROCOPY won't work with UDP without notifications.

The math for Nread/Nrecv took me a second to reason through (because of the double negative). I.e., with UDP_MAX reads you get something like:

count = 16;
nleft = count;
...
nleft -= 65507; // nleft = (-65491)
...
return count - nleft;  // 16 - (-65491) = 65507

(which does account for all the bits lol)
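
The same arithmetic as a small compilable sketch, assuming a single recv() call that returns more than count. The function names are made up for illustration. The signed variant mirrors the ssize_t nleft from the suggested patch, and the unsigned variant mirrors the size_t nleft in the original Nrecv(), where the subtraction wraps around and then wraps back.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Signed bookkeeping: nleft is allowed to go negative. */
ssize_t received_signed(size_t count, ssize_t r)
{
    ssize_t nleft = (ssize_t)count;

    assert(r >= 0);
    nleft -= r;                      /* 16 - 65507 = -65491 */
    return (ssize_t)count - nleft;   /* 16 - (-65491) = 65507 */
}

/* Unsigned bookkeeping: nleft wraps modulo SIZE_MAX + 1. */
size_t received_unsigned(size_t count, size_t r)
{
    size_t nleft = count;

    nleft -= r;                      /* wraps to a huge value when r > count */
    return count - nleft;            /* wraps back, giving r again */
}
```

Both variants return the number of bytes reported by recv(), which is why the original code still counted correctly. Keeping a separate running total just makes that easier to see.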

@davidBar-On davidBar-On changed the title Add support for SKIP-RX-COPY and SO_ZEROCOPY/MSG_ZEROCOPY Add support for SO_ZEROCOPY/MSG_ZEROCOPY Jun 16, 2024
@davidBar-On davidBar-On changed the title Add support for SO_ZEROCOPY/MSG_ZEROCOPY Add support for Zero-copy using SO_ZEROCOPY/MSG_ZEROCOPY Jun 16, 2024
@davidBar-On
Contributor Author

The changes to support SKIP-RX-COPY were moved to PR #1717, and this PR will be used only for the support of zero-copy using SO_ZEROCOPY/MSG_ZEROCOPY.

The math for Nread/Nrecv took me a second to reason through (because of the double negative). ....

I am not sure I realized that nleft becomes negative, so for clarity I did make the suggested changes in PR #1717.

I agree that MSG_ZEROCOPY won't work with UDP without notifications.

Will add support for the notifications in this PR.

@davidBar-On davidBar-On changed the title Add support for Zero-copy using SO_ZEROCOPY/MSG_ZEROCOPY Add support for SKIP-RX-COPY using MSG_TRUNC and Zero-copy using SO_ZEROCOPY/MSG_ZEROCOPY Jun 23, 2024
@davidBar-On
Contributor Author

Added a separate PR #1720 for the support of MSG_ZEROCOPY/SO_ZEROCOPY, now with notifications support.

Closing this PR as its functionality is now split between PRs #1717 and #1720.

@swlars
Contributor

swlars commented Oct 25, 2024

@MattCatz Hi MattCatz, what were you using for the "flame graphs" for performance? Thanks!

@MattCatz
Contributor

@MattCatz Hi MattCatz, what were you using for the "flame graphs" for performance? Thanks!

I didn't write down the exact steps but probably something like:

  • build iperf3 with debug symbols and frame pointers (i.e. -fno-omit-frame-pointer)
  • install kernel debug symbols (I believe on Ubuntu it is something like sudo apt-get install linux-image-$(uname -r)-dbgsym)
  • Follow this guide: https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
  • Run iperf3 under perf, e.g. perf record -F 97 -g -- iperf3 -c 127.0.0.1

If that doesn't work then feel free to reach out over email (you can find it in the git log).

@swlars
Contributor

swlars commented Nov 7, 2024

@MattCatz Thanks for the instructions! I'll give it a shot!
