[202205] [Mellanox] Disable SSD NCQ on Mellanox platforms #17662

Merged
merged 1 commit into from
Jan 10, 2024

volodymyrsamotiy
Collaborator

@volodymyrsamotiy volodymyrsamotiy commented Jan 3, 2024

Backport of #17567

Why I did it

Research indicates that some products may experience occasional I/O failures in the communication between the CPU and the SSD because of NCQ (Native Command Queuing).
The problem appears to stem from an interaction between certain kernel versions and certain SATA controllers.

Syslog error message examples:

  • Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
  • Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".
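The two message patterns above can be matched mechanically when scanning logs. The following is a minimal Python sketch and not part of this PR: the function name `find_ncq_errors` and the regular expressions are illustrative assumptions based only on the two examples quoted above. Real syslog lines may carry extra prefixes (timestamps, host names, device suffixes such as `ata1.00`), which the patterns tolerate by searching anywhere in the line.

```python
import re
from typing import Dict, List

# Patterns derived from the two examples quoted in this PR (assumption:
# other affected systems log the same wording).
SERROR_RE = re.compile(r"(ata\d+)(?:\.\d+)?: SError: \{ ([^}]*?) \}")
FAILED_CMD_RE = re.compile(r"failed command: (READ|WRITE) FPDMA QUEUED")


def find_ncq_errors(lines: List[str]) -> Dict[str, list]:
    """Collect SError flag sets and failed NCQ commands from log lines."""
    result: Dict[str, list] = {"serror_flags": [], "failed_commands": []}
    for line in lines:
        serror = SERROR_RE.search(line)
        if serror:
            # Store (port, [flags]), e.g. ("ata1", ["UnrecovData", "Handshk"]).
            result["serror_flags"].append(
                (serror.group(1), serror.group(2).split())
            )
        failed = FAILED_CMD_RE.search(line)
        if failed:
            # FPDMA QUEUED is the NCQ form of a read or write command.
            result["failed_commands"].append(failed.group(1))
    return result
```

Calling `find_ncq_errors(open("/var/log/syslog").read().splitlines())` would return the SError flag sets and the READ/WRITE failures found in that file; the log path is also an assumption and may differ per system.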

Some vendors have already disabled NCQ on their platforms in SONiC because of similar issues:

There are also discussions on Debian/Ubuntu forums about similar issues, in which disabling NCQ was suggested:

Work item tracking
  • Microsoft ADO (number only):

How I did it

Add a kernel parameter to tell libata to disable NCQ
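To illustrate how such a parameter could be checked on a running system: the mainline kernel documents a `libata.force=` boot parameter whose accepted values include `noncq`. This page does not show the exact parameter the commit adds, so treating it as `libata.force=noncq` is an assumption. The Python sketch below (helper names are hypothetical) inspects a kernel command line string, such as the contents of `/proc/cmdline`, for that setting.

```python
from typing import List

PREFIX = "libata.force="


def libata_force_values(cmdline: str) -> List[str]:
    """Return every comma-separated libata.force entry on the command line."""
    values: List[str] = []
    for token in cmdline.split():
        if token.startswith(PREFIX):
            values.extend(v for v in token[len(PREFIX):].split(",") if v)
    return values


def ncq_disabled_for_all_ports(cmdline: str) -> bool:
    """True if a 'noncq' entry without a port/device ID prefix is present.

    Entries of the form 'ID:noncq' apply only to that port or device,
    so they are deliberately not counted here.
    """
    return "noncq" in libata_force_values(cmdline)
```

On a booted switch, `ncq_disabled_for_all_ports(open("/proc/cmdline").read())` would report whether the setting is active for every port.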

How to verify it

Use the fio tool:

fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4

Test results with NCQ enabled:

 READ: bw=128MiB/s (135MB/s), 128MiB/s-128MiB/s (135MB/s-135MB/s), io=247MiB (259MB), run=1924-1924msec
WRITE: bw=131MiB/s (138MB/s), 131MiB/s-131MiB/s (138MB/s-138MB/s), io=253MiB (265MB), run=1924-1924msec
…
 READ: bw=130MiB/s (136MB/s), 130MiB/s-130MiB/s (136MB/s-136MB/s), io=247MiB (259MB), run=1902-1902msec
WRITE: bw=133MiB/s (139MB/s), 133MiB/s-133MiB/s (139MB/s-139MB/s), io=253MiB (265MB), run=1902-1902msec
…
 READ: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=247MiB (259MB), run=1919-1919msec
WRITE: bw=132MiB/s (138MB/s), 132MiB/s-132MiB/s (138MB/s-138MB/s), io=253MiB (265MB), run=1919-1919msec

Test results with NCQ disabled:

 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2354-2354msec
WRITE: bw=107MiB/s (113MB/s), 107MiB/s-107MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2354-2354msec
…
 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2349-2349msec
WRITE: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2349-2349msec
…
 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2349-2349msec
WRITE: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2349-2349msec
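The two result sets can be compared with a short script. The Python sketch below is not part of the PR: `mean_bandwidth` and `percent_drop` are hypothetical helpers that parse the aggregate `bw=` value (MiB/s) from fio summary lines and average it per direction. The sample strings reproduce only the leading `bw=` field of each line quoted above.

```python
import re
from statistics import mean
from typing import Dict

# Matches the aggregate bandwidth at the start of a fio summary line.
BW_RE = re.compile(r"^\s*(READ|WRITE): bw=(\d+(?:\.\d+)?)MiB/s", re.MULTILINE)


def mean_bandwidth(fio_summary: str) -> Dict[str, float]:
    """Average the aggregate bandwidth (MiB/s) per direction over all runs."""
    samples: Dict[str, list] = {"READ": [], "WRITE": []}
    for direction, value in BW_RE.findall(fio_summary):
        samples[direction].append(float(value))
    return {d: mean(v) for d, v in samples.items() if v}


def percent_drop(before: float, after: float) -> float:
    """Relative decrease from 'before' to 'after', in percent."""
    return (before - after) / before * 100.0


# Leading fields of the summary lines from this PR (rest of each line trimmed).
NCQ_ENABLED = """\
 READ: bw=128MiB/s (135MB/s)
WRITE: bw=131MiB/s (138MB/s)
 READ: bw=130MiB/s (136MB/s)
WRITE: bw=133MiB/s (139MB/s)
 READ: bw=129MiB/s (135MB/s)
WRITE: bw=132MiB/s (138MB/s)
"""
NCQ_DISABLED = """\
 READ: bw=105MiB/s (110MB/s)
WRITE: bw=107MiB/s (113MB/s)
 READ: bw=105MiB/s (110MB/s)
WRITE: bw=108MiB/s (113MB/s)
 READ: bw=105MiB/s (110MB/s)
WRITE: bw=108MiB/s (113MB/s)
"""

enabled = mean_bandwidth(NCQ_ENABLED)
disabled = mean_bandwidth(NCQ_DISABLED)
for direction in ("READ", "WRITE"):
    drop = percent_drop(enabled[direction], disabled[direction])
    print(f"{direction}: {drop:.1f}% lower with NCQ disabled")
# READ: 18.6% lower with NCQ disabled
# WRITE: 18.4% lower with NCQ disabled
```

For these three runs per configuration, the averages work out to about 129 vs. 105 MiB/s for reads and 132 vs. 107.7 MiB/s for writes.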

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305
  • 202311

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@volodymyrsamotiy volodymyrsamotiy changed the title from "[Mellanox] Disable SSD NCQ on Mellanox platforms" to "[202205] [Mellanox] Disable SSD NCQ on Mellanox platforms" Jan 3, 2024
@mssonicbld
Collaborator

@yxieca PR #17662 conflicts with the MS internal repo.
Please complete the following PR by pushing a fix commit to sonicbld/conflict_prefix/17662-fix:
https://msazure.visualstudio.com/One/_git/Networking-acs-buildimage/pullrequest/9303124
Then comment "/azpw ms_conflict" to rerun the PR checker.

@yxieca
Contributor

yxieca commented Jan 3, 2024

/azpw ms_conflict

@yxieca
Contributor

yxieca commented Jan 3, 2024

/azp ms_conflict


Command 'ms_conflict' is not supported by Azure Pipelines.

Supported commands
  • help:
    • Get descriptions, examples and documentation about supported commands
    • Example: help "command_name"
  • list:
    • List all pipelines for this repository using a comment.
    • Example: "list"
  • run:
    • Run all pipelines or specific pipelines for this repository using a comment. Use this command by itself to trigger all related pipelines, or specify specific pipelines to run.
    • Example: "run" or "run pipeline_name, pipeline_name, pipeline_name"
  • where:
    • Report back the Azure DevOps orgs that are related to this repository and org
    • Example: "where"

See additional documentation.

@yxieca
Contributor

yxieca commented Jan 4, 2024

/azpw ms_conflict

@yxieca
Contributor

yxieca commented Jan 4, 2024

@volodymyrsamotiy why is this PR still in draft mode? Can we move forward?

@volodymyrsamotiy volodymyrsamotiy marked this pull request as ready for review January 8, 2024 12:52
@prgeor
Contributor

prgeor commented Jan 10, 2024

@yxieca this is an important bug fix. Could you merge it?

@yxieca yxieca merged commit b085fb1 into sonic-net:202205 Jan 10, 2024
16 checks passed

4 participants