[SONiC] CPU Stall: Soft Lockup & Hard Lockup Issues observed with SONiC NOS builds (202305/202311/master) #367

Open
thovikeerthi opened this issue Dec 11, 2023 · 0 comments

thovikeerthi commented Dec 11, 2023

Hi Team,
We are running the SONiC community-defined T1-topology test suite with the 202305 release build on the Accton-AS7716-32X platform.
On every test run we hit a CPU stall at a random point: kernel threads enter a hung state, the NMI watchdog timer fires, and the board restarts.
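
To help compare systems, the watchdog configuration in effect on the DUT can be captured alongside the logs. The following is a minimal sketch, not part of our test setup: it reads the standard Linux lockup-detector sysctl files under /proc/sys/kernel. The function name `read_watchdog_settings` and the `base` parameter are hypothetical, and some files may be absent depending on the kernel configuration.

```python
from pathlib import Path

# Standard Linux sysctl files related to lockup and hung-task detection.
# Any of them may be missing if the kernel was built without that detector.
WATCHDOG_SYSCTLS = (
    "watchdog",
    "nmi_watchdog",
    "soft_watchdog",
    "watchdog_thresh",
    "softlockup_panic",
    "hardlockup_panic",
    "hung_task_timeout_secs",
)


def read_watchdog_settings(base="/proc/sys/kernel"):
    """Return {sysctl name: value string, or None if the file is unreadable}."""
    settings = {}
    for name in WATCHDOG_SYSCTLS:
        try:
            settings[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # File missing or not readable on this kernel.
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, value in read_watchdog_settings().items():
        print(f"{name} = {value if value is not None else 'not available'}")
```

Recording these values for each build (202305, 202311, master) would show whether the watchdog thresholds or panic settings differ between the images under test.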

We observe the same issue with the 202311 and master branch builds (generated from the community Azure pipeline). That is, the CPU stall also occurs with SONiC NOS builds based on the Bullseye/5.10 kernel (202311) and the Bookworm/6.1 kernel (master).

We consulted the Accton team and shared our observations. They confirmed that it is not a hardware issue and that it needs to be investigated further on the community NOS side. We did some initial analysis based on the collected logs and reported the issues on the community GitHub (links below), but we are currently unable to proceed further or resolve the issue.

We request your expert advice and any suggestions that would help resolve this issue.

GitHub issue links for the CPU stall issues:
1. CPU Stall: Soft Lockup issue observed in 202305 Release Branch [Kernel: 5.10.140-1] [Accton-AS7716-32X]
sonic-net/sonic-buildimage#17358

2. CPU Stall: Hard Lockup issue observed in 202305 Release Branch [Kernel: 5.10.140-1] [Accton-AS7716-32X]
sonic-net/sonic-buildimage#17361
sonic-net/sonic-buildimage#17363

Some observations:
- The CPU stalls occur randomly while executing the community test suite.
- The CPU stalls are seen predominantly with T1-topology cases, not with T0-topology cases.
- The CPU stalls are seen only when test cases are executed in a batch. When the same test case is executed individually, the issue does not occur.
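
To correlate stalls with the test batch, captured console or kernel logs can be scanned for the different stall types. The following is a minimal sketch under stated assumptions: the message patterns follow the wording commonly printed by the Linux soft-lockup, hard-lockup, RCU-stall and hung-task detectors, but the exact text can vary between kernel versions, so the patterns may need adjusting. The names `PATTERNS`, `classify_line` and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Assumed message formats; verify against the actual kernel log before relying on them.
PATTERNS = {
    "soft_lockup": re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s"),
    "hard_lockup": re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)"),
    "rcu_stall": re.compile(r"rcu: INFO: \w+ (?:self-detected stall|detected stalls)"),
    "hung_task": re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"),
}


def classify_line(line):
    """Return the stall type matched by this log line, or None if no pattern matches."""
    for kind, pattern in PATTERNS.items():
        if pattern.search(line):
            return kind
    return None


def summarize(lines):
    """Count how many lines of each stall type appear in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        kind = classify_line(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

For example, `summarize(open("console.log", errors="replace"))` returns a per-type count for one captured log (the file name is a placeholder). Running it on the log of each batch run would show whether soft lockups, hard lockups, or hung tasks dominate for a given test sequence.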
