-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-14442 control: Make NVMe auto-faulty configurable #13548
Conversation
Bug-tracker data: |
8315958
to
3b377ce
Compare
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/5/execution/node/1112/log |
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/6/execution/node/694/log |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/6/execution/node/648/log |
I don't think it's worth running through CI again for these indentation check warnings: https://github.com/daos-stack/daos/actions/runs/7454353436/job/20281518852?pr=13548 I will fix in a subsequent PR |
CI run https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13548/6/tests/ failed with unrelated known issues:
|
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13548/7/testReport/ |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/7/execution/node/1418/log |
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/7/execution/node/1526/log |
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13548/8/testReport/ |
Required-githooks: true Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13548/9/testReport/ |
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/9/execution/node/1480/log |
CI run https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13548/9/tests/ failed on the following issues:
|
Required-githooks: true Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
…to-faulty-config Features: control Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13548/10/testReport/ |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/10/execution/node/1433/log |
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13548/11/testReport/ |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/11/execution/node/437/log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C changes look good to me.
CI results for run no. 11 with control and pool features failed for the following known issues: Requesting forced landing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly OK for me.
Just minor remarks which could be fix if repush is needed.
## Optional parameter that should only be set if overriding the automatically calculated value is # | ||
## #necessary. Specifies the number (not size) of hugepages to allocate for use by NVMe through | ||
## #SPDK. For optimum performance each target requires 1 GiB of hugepage space. The provided value | ||
## should be calculated by dividing the total amount of hugepages memory required for all targets | ||
## across all engines on a host by the system hugepage size. If not set here, the value will be | ||
## automatically calculated based on the number of targets (using the default system hugepage size). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
## Optional parameter that should only be set if overriding the automatically calculated value is # | |
## #necessary. Specifies the number (not size) of hugepages to allocate for use by NVMe through | |
## #SPDK. For optimum performance each target requires 1 GiB of hugepage space. The provided value | |
## should be calculated by dividing the total amount of hugepages memory required for all targets | |
## across all engines on a host by the system hugepage size. If not set here, the value will be | |
## automatically calculated based on the number of targets (using the default system hugepage size). | |
## Optional parameter that should only be set if overriding the automatically calculated value is | |
## necessary. Specifies the number (not size) of hugepages to allocate for use by NVMe through | |
## SPDK. For optimum performance each target requires 1 GiB of hugepage space. The provided value | |
## should be calculated by dividing the total amount of hugepages memory required for all targets | |
## across all engines on a host by the system hugepage size. If not set here, the value will be | |
## automatically calculated based on the number of targets (using the default system hugepage size). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix if repushed
D_INFO("NVMe auto faulty is %s. Criteria: max_io_errs:%u, max_csum_errs:%u\n", | ||
*enable ? "enabled" : "disabled", *max_io_errs, *max_csum_errs); | ||
|
||
if (cfg.method != NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, not needed to test as D_FREE will do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix if repushed
…to-faulty-config Features: control Required-githooks: true Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@daos-stack/daos-gatekeeper please can this PR be force landed now landings are open on master? |
since there is a control test failure on the latest run, please merge latest master and add gatekeeper when testing completes. |
clang-format failures to be addressed in a subsequent PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code changes seem fine to me, but I have some concerns about the design.
In an ideal world, I think that this configuration would be done as system properties, as I don't see any valid use case for allowing per-engine permutations in behavior. We don't currently have a way to propagate system-level configuration settings out to engines as they join, so I understand that we need to use the server yaml. That having been said, it would probably be better to set these in a top-level entry which is then applied to all engines, similar to how we set the provider and other parameters that should be the same for all engines in a configuration.
I wouldn't block on making this change, but I recommend that it be made sooner rather than later so that you don't have to support legacy configuration styles.
@@ -242,6 +236,14 @@ bio_nvme_init(const char *nvme_conf, int numa_node, unsigned int mem_size, | |||
goto free_mutex; | |||
} | |||
|
|||
glb_criteria.fc_enabled = true; | |||
glb_criteria.fc_max_io_errs = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: I know they weren't before this patch, but it seems like these should be named constant values.
noted |
@mjmac can you land the PR please? |
SSD auto faulty is enabled by default and criteria set through
environment variables. Make criteria settable through the server
configuration file instead of environment variables. bdev_auto_faulty
parameter can be used to set enable, max_io_errs and max_csum_errs.
Features: control
Required-githooks: true
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: