NXP Ethernet eth_mcux poor performance #60144

Closed · DerekSnell opened this issue Jul 7, 2023 · 14 comments

Labels: area: Ethernet, area: Networking, bug, platform: NXP, priority: low

Comments

@DerekSnell (Contributor)

Creating a new issue, originally reported by @jameswalmsley in #51107. He reports that zperf shows only 1 Mbps:

I have the mimxrt1064_evk board. So far I can build the following (from zephyr/main):

```
west build -b mimxrt1064_evk zephyr/samples/net/zperf/
```

Zephyr shell:

```
zperf udp download
```

Linux:

```
iperf -V -u -c fe80::4:9fff:fe39:3ca4%enp0s20f0u1u1
------------------------------------------------------------
Client connecting to fe80::4:9fff:fe39:3ca4%enp0s20f0u1u1, UDP port 5001
Sending 1450 byte datagrams, IPG target: 11062.62 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  1] local fe80::935:d1d6:7fda:bd83 port 49230 connected with fe80::4:9fff:fe39:3ca4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0119 sec  1.25 MBytes  1.05 Mbits/sec
[  1] Sent 906 datagrams
read failed: Connection refused
read failed: Connection refused
read failed: Connection refused
```

The speed seems very slow. I have tried changing many settings (DTCM, hardware acceleration, etc.), and I always get a very similar result of 1.05 Mbits/sec. I get the same result on our board too.

I have created a PR to fix some issues with the driver: #60073.

The previous ENET_GetRxFrameSize() errors came from the eth_mcux driver only supporting a REFCLK generated by the RT1064 (as on the reference board).

I have added changes in #60073 to enable the REFCLK as an input, and to support configuring either a 25 MHz or a 50 MHz crystal on the PHY.

I've also added other changes to disable cache maintenance in the HAL driver when DTCM is used for all buffers.

Unfortunately I was not able to find the source of the performance issue.

@danieldegrasse
Copy link
Collaborator

I've verified I can reproduce this on an RT1060 EVK. TCP RX performance also seems to have dropped by ~10 Mbps since Zephyr 3.2. Moving the application to ITCM does not seem to help performance (and actually results in dropped packets in the TCP test).

@jhedberg added the "priority: low" label on Jul 11, 2023
@ssharks (Collaborator) commented Jul 13, 2023

Could you check whether setting CONFIG_NET_TCP_RECV_QUEUE_TIMEOUT=0 has any influence? Handling of out-of-order data changed significantly in 3.2.
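
For reference, a minimal sketch of how this could be tried in the zperf sample's prj.conf (assuming a value of 0 disables the out-of-order receive queue):

```
# Sketch: do not hold received out-of-order TCP segments in a queue
# while waiting for the gap to be filled.
CONFIG_NET_TCP_RECV_QUEUE_TIMEOUT=0
```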

@danieldegrasse (Collaborator)

Based on some tests run on 3.4 as well as the latest main, it appears that the performance issues I encountered may have been caused by commits between 3.4 and main; updating to the latest main resolved the issue. I did not observe poor performance on 3.4 either. @DerekSnell, would you like to keep this issue open?

@ssharks (Collaborator) commented Jul 19, 2023

> Based on some tests run on 3.4 as well as the latest main, it appears that the performance issues I encountered may have been caused by commits between 3.4 and main; updating to the latest main resolved the issue. I did not observe poor performance on 3.4 either. @DerekSnell, would you like to keep this issue open?

Could you explain / check the exact numbers in this comment? As I read it, the statement contradicts itself.

@DerekSnell (Contributor, Author)

> Based on some tests run on 3.4 as well as the latest main, it appears that the performance issues I encountered may have been caused by commits between 3.4 and main; updating to the latest main resolved the issue. I did not observe poor performance on 3.4 either. @DerekSnell, would you like to keep this issue open?

@jameswalmsley, does using the latest main resolve your performance issues? Thanks

@danieldegrasse (Collaborator)

> Could you explain / check the exact numbers in this comment? As I read it, the statement contradicts itself.

Sure, I'll do the best I can. When I ran some initial tests to validate that the issue existed, I was able to reproduce the performance problems mentioned in this issue. Later in the day, I sat down to bisect the issue (since I know Ethernet performance was acceptable at the 3.2 release point). As part of this, I updated my branch to the latest main. When I then ran additional performance validation, I was unable to reproduce the issue. Unfortunately, I did not record the initial test commands I used, so I am not sure whether there was an error there, or whether a commit between 3.4 and main caused a temporary performance regression that has since been patched. I've attached the performance results from the testing I ran on 3.4 and main as a CSV to this comment.
Network Stack Testing.csv

@danieldegrasse (Collaborator)

@jameswalmsley, could you try #60273 and see if it improves your performance? If you are running a custom application, you will need to integrate those code relocation directives into your CMakeLists.txt file.
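
For a custom application, a minimal sketch of what that integration could look like in CMakeLists.txt (the file path and memory region here are illustrative, not taken from #60273):

```cmake
# Requires CONFIG_CODE_DATA_RELOCATION=y in prj.conf.
# Relocate performance-critical sources into ITCM (illustrative file name;
# see #60273 for the actual directives):
zephyr_code_relocate(FILES src/hot_path.c LOCATION ITCM_TEXT)
```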

@jameswalmsley-cpi

@danieldegrasse I'll try this on our end. Thanks

@ssharks (Collaborator) commented Jul 22, 2023

During testing for PR #60112 I found some unexpected behavior that could be related to this issue.

There are three large-transfer tests in tests/net/socket/tcp: one without packet loss, one with 12.5% packet loss and the NO_DELAY option, and one with packet loss but without the NO_DELAY option.

In the packet-loss tests, I observed that on some occasions fast retransmit was triggered by a triple duplicate ACK, while on others the packet loss was only resolved by a timeout. Resolving a packet loss via fast retransmit is orders of magnitude faster than waiting for a timeout. However, if the socket buffers are too small to hold at least 4 x MSS worth of packets, a triple duplicate ACK can never be triggered. Since by default the connection buffers are set to a third of the total buffer size, you need at least 12 x MSS of total buffer space for this to work.

With the default MSS of 1280 bytes and 128 bytes per buffer, that means you need 10 x 12 = 120 network buffers to make fast retransmit work.
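
Expressed as a prj.conf sketch (assuming the default 128-byte buffer data size):

```
# 128 B per network buffer (default); 12 x MSS = 12 x 1280 B needs
# 120 buffers before a triple duplicate ACK can trigger fast retransmit.
CONFIG_NET_BUF_DATA_SIZE=128
CONFIG_NET_BUF_RX_COUNT=120
CONFIG_NET_BUF_TX_COUNT=120
```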

Could this be related to your issue?

@blemouzy (Contributor) commented Aug 2, 2023

I ran some Ethernet tests today between a custom board (based on the i.MXRT1052, with a 50 MHz external Ethernet oscillator, running Zephyr v3.4.0) and a Raspberry Pi 3, using the following Zephyr settings:

```
#
# Network tuning from
#  https://github.com/zephyrproject-rtos/zephyr/issues/23302#issuecomment-1142138602
#
config NET_BUF_DATA_SIZE
	int
	default 1500

config NET_TC_TX_COUNT
	int
	default 1

config NET_BUF_TX_COUNT
	int
	default 80

config NET_BUF_RX_COUNT
	int
	default 80

config NET_PKT_TX_COUNT
	int
	default 80

config NET_PKT_RX_COUNT
	int
	default 80
```
| Configuration | Command | Result |
| --- | --- | --- |
| RPI server<br>Zephyr client<br>TCP 1KB | `zperf tcp upload IP 5001 60 1K 1M` | Duration: 1.00 m<br>Packet size: 1024 bytes<br>Rate: 1024 kbps<br>Starting...<br>Upload completed!<br>Duration: 1.00 m<br>Num packets: 540935<br>Num errors: 0 (retry or fail)<br>Rate: 70.43 Mbps |
| RPI server<br>Zephyr client<br>TCP 10B | `zperf tcp upload IP 5001 60 10 1M` | Duration: 1.00 m<br>Packet size: 10 bytes<br>Rate: 1024 kbps<br>Starting...<br>Upload completed!<br>Duration: 1.00 m<br>Num packets: 3834358<br>Num errors: 0 (retry or fail)<br>Rate: 4.87 Mbps |
| RPI server<br>Zephyr client<br>UDP 80Mbps | `zperf udp upload IP 5001 60 1K 80M` | Duration: 1.00 m<br>Packet size: 1024 bytes<br>Rate: 81920 kbps<br>Starting...<br>Rate: 80.00 Mbps<br>Packet duration 97 us<br>Upload completed!<br>Statistics: server (client)<br>Duration: 1.00 m (1.00 m)<br>Num packets: 688070 (688070)<br>Num packets out order: 0<br>Num packets lost: 49<br>Jitter: 127 us<br>Rate: 89.57 Mbps (89.59 Mbps) |
| RPI server<br>Zephyr client<br>UDP 70Mbps | `zperf udp upload IP 5001 60 1K 70M` | Duration: 1.00 m<br>Packet size: 1024 bytes<br>Rate: 71680 kbps<br>Starting...<br>Rate: 70.00 Mbps<br>Packet duration 111 us<br>Upload completed!<br>Statistics: server (client)<br>Duration: 1.00 m (1.00 m)<br>Num packets: 462634 (462634)<br>Num packets out order: 0<br>Num packets lost: 0<br>Jitter: 55 us<br>Rate: 60.20 Mbps (60.23 Mbps) |

Note: I got the same results with Zephyr as the server and the RPI as the client.

That looks good to me.

Hope that helps.

@blemouzy (Contributor) commented Aug 2, 2023

> ```
> iperf -V -u -c fe80::4:9fff:fe39:3ca4%enp0s20f0u1u1
> ------------------------------------------------------------
> Client connecting to fe80::4:9fff:fe39:3ca4%enp0s20f0u1u1, UDP port 5001
> Sending 1450 byte datagrams, IPG target: 11062.62 us (kalman adjust)
> UDP buffer size:  208 KByte (default)
> ------------------------------------------------------------
> [  1] local fe80::935:d1d6:7fda:bd83 port 49230 connected with fe80::4:9fff:fe39:3ca4 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  1] 0.0000-10.0119 sec  1.25 MBytes  1.05 Mbits/sec
> [  1] Sent 906 datagrams
> read failed: Connection refused
> read failed: Connection refused
> read failed: Connection refused
> ```

Are you adding -b at the end of the command to specify the UDP bandwidth?

```
iperf -V -u -c fe80::4:9fff:fe39:3ca4%enp0s20f0u1u1 -b 70M
```

@ssharks (Collaborator) commented Aug 16, 2023

@DerekSnell What is the status of this issue? Is it still present or can it be closed?

@DerekSnell (Contributor, Author)

Hi @ssharks,
I closed this issue. We had left it open for @jameswalmsley-cpi to report whether he still had an issue, but since it has been a month with no update, and others are reporting good performance, we can close it.

@jameswalmsley-cpi

@DerekSnell, many thanks. Apologies, I was on holiday but got back today. I'll run it on my board tomorrow and confirm.

Sounds like it is fixed based on other reports.
