DAOS-14442 control: Make NVMe auto-faulty configurable #13548

tanabarr · 2024-01-01T14:48:47Z

SSD auto-faulty is enabled by default, and its criteria are currently set
through environment variables. Make the criteria settable through the server
configuration file instead. The bdev_auto_faulty parameter can be used to
set enable, max_io_errs and max_csum_errs.

Features: control
Required-githooks: true
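
As a rough illustration of how the three bdev_auto_faulty settings described above might fit together, here is a minimal C sketch. The struct, the function name and the "counter exceeds maximum" rule are assumptions made for illustration only; they are not taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the three bdev_auto_faulty settings. */
struct auto_faulty_criteria {
	bool     enable;        /* auto-faulty detection on/off */
	uint32_t max_io_errs;   /* I/O error threshold */
	uint32_t max_csum_errs; /* checksum error threshold */
};

/*
 * Assumed semantics (not confirmed by the PR): a device becomes a candidate
 * for the faulty state once either error counter exceeds its configured
 * maximum, and never when detection is disabled.
 */
static bool
should_mark_faulty(const struct auto_faulty_criteria *crit, uint32_t io_errs,
		   uint32_t csum_errs)
{
	assert(crit != NULL);

	if (!crit->enable)
		return false;

	return io_errs > crit->max_io_errs || csum_errs > crit->max_csum_errs;
}
```

Under these assumptions, setting enable to false switches the check off entirely, regardless of the two thresholds.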

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follow the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

github-actions · 2024-01-01T14:49:04Z

Bug-tracker data:
Ticket title is 'Make SSD auto faulty configurable'
Status is 'In Review'
Labels: 'hotplug'
https://daosio.atlassian.net/browse/DAOS-14442

daosbuild1 · 2024-01-09T02:36:38Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/5/execution/node/1112/log

daosbuild1 · 2024-01-09T22:04:31Z

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/6/execution/node/694/log

daosbuild1 · 2024-01-10T00:07:51Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/6/execution/node/648/log

tanabarr · 2024-01-10T14:06:24Z

I don't think it's worth running through CI again just for these indentation check warnings (https://github.com/daos-stack/daos/actions/runs/7454353436/job/20281518852?pr=13548). I will fix them in a subsequent PR.

tanabarr · 2024-01-10T14:06:43Z

CI run https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13548/6/tests/ failed with unrelated known issues:

DFS_Parallel_DTX_cmocka
EcodOnlineRebuildMdtest
test_daos_rebuild_ec

daosbuild1 · 2024-01-10T22:22:55Z

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13548/7/testReport/

daosbuild1 · 2024-01-10T23:38:29Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/7/execution/node/1418/log

daosbuild1 · 2024-01-10T23:44:20Z

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/7/execution/node/1526/log

daosbuild1 · 2024-01-12T14:18:07Z

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13548/8/testReport/

daosbuild1 · 2024-01-13T00:53:45Z

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13548/9/testReport/

daosbuild1 · 2024-01-13T01:19:04Z

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/9/execution/node/1480/log

tanabarr · 2024-01-13T10:46:13Z

CI run https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13548/9/tests/ failed on the following issues:

DAOS-14845 online_rebuild_mdtest.py Timeout for mdtest after killing one rank
DAOS-14884 dmg failure with scm-size zero

daosbuild1 · 2024-01-15T06:31:31Z

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13548/10/testReport/

daosbuild1 · 2024-01-15T08:46:37Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/10/execution/node/1433/log

daosbuild1 · 2024-01-15T17:21:20Z

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13548/11/testReport/

daosbuild1 · 2024-01-15T18:30:28Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13548/11/execution/node/437/log

NiuYawei

The C changes look good to me.

tanabarr · 2024-01-16T10:54:09Z

CI results for run no. 11 with control and pool features failed for the following known issues:

Requesting forced landing

knard-intel

Mostly OK for me.
Just minor remarks, which could be fixed if a repush is needed.

knard-intel · 2024-01-16T13:46:46Z

utils/config/daos_server.yml

+## Optional parameter that should only be set if overriding the automatically calculated value is #
+## #necessary. Specifies the number (not size) of hugepages to allocate for use by NVMe through
+## #SPDK. For optimum performance each target requires 1 GiB of hugepage space. The provided value
+## should be calculated by dividing the total amount of hugepages memory required for all targets
+## across all engines on a host by the system hugepage size. If not set here, the value will be
+## automatically calculated based on the number of targets (using the default system hugepage size).


Suggested change

-## Optional parameter that should only be set if overriding the automatically calculated value is #
-## #necessary. Specifies the number (not size) of hugepages to allocate for use by NVMe through
-## #SPDK. For optimum performance each target requires 1 GiB of hugepage space. The provided value
-## should be calculated by dividing the total amount of hugepages memory required for all targets
-## across all engines on a host by the system hugepage size. If not set here, the value will be
-## automatically calculated based on the number of targets (using the default system hugepage size).
+## Optional parameter that should only be set if overriding the automatically calculated value is
+## necessary. Specifies the number (not size) of hugepages to allocate for use by NVMe through
+## SPDK. For optimum performance each target requires 1 GiB of hugepage space. The provided value
+## should be calculated by dividing the total amount of hugepages memory required for all targets
+## across all engines on a host by the system hugepage size. If not set here, the value will be
+## automatically calculated based on the number of targets (using the default system hugepage size).

will fix if repushed

knard-intel · 2024-01-16T13:52:53Z

src/bio/bio_config.c

+	D_INFO("NVMe auto faulty is %s. Criteria: max_io_errs:%u, max_csum_errs:%u\n",
+	       *enable ? "enabled" : "disabled", *max_io_errs, *max_csum_errs);
+
+	if (cfg.method != NULL)


Nit: the NULL check is not needed, as D_FREE already does it.

will fix if repushed
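
The nit about the redundant NULL check can be sketched as follows. SKETCH_FREE, struct sketch_cfg and sketch_cfg_fini are hypothetical stand-ins written for this illustration, not the actual D_FREE macro or the code in bio_config.c; the sketch only relies on the standard C guarantee that free(NULL) is a no-op.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * SKETCH_FREE is a hypothetical stand-in for a D_FREE-style macro: it frees
 * the pointer and resets it to NULL. Because free(NULL) does nothing in
 * standard C, the macro is safe on a pointer that was never allocated.
 */
#define SKETCH_FREE(ptr)      \
	do {                  \
		free(ptr);    \
		(ptr) = NULL; \
	} while (0)

/* Hypothetical config holder with an optionally allocated member. */
struct sketch_cfg {
	char *method; /* may legitimately be NULL */
};

static void
sketch_cfg_fini(struct sketch_cfg *cfg)
{
	assert(cfg != NULL);

	/* No "if (cfg->method != NULL)" guard is needed before freeing. */
	SKETCH_FREE(cfg->method);
}
```

Calling sketch_cfg_fini() on a struct whose method was never allocated, or calling it twice in a row, is harmless in this sketch because the pointer is reset to NULL after each free.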

tanabarr · 2024-01-22T16:04:16Z

@daos-stack/daos-gatekeeper please can this PR be force landed now that landings are open on master?

mchaarawi · 2024-01-22T17:23:23Z

Since there is a control test failure on the latest run, please merge latest master and add gatekeeper when testing completes.

tanabarr · 2024-01-24T16:14:43Z

clang-format failures to be addressed in a subsequent PR

mjmac

The code changes seem fine to me, but I have some concerns about the design.

In an ideal world, I think that this configuration would be done as system properties, as I don't see any valid use case for allowing per-engine permutations in behavior. We don't currently have a way to propagate system-level configuration settings out to engines as they join, so I understand that we need to use the server yaml. That having been said, it would probably be better to set these in a top-level entry which is then applied to all engines, similar to how we set the provider and other parameters that should be the same for all engines in a configuration.

I wouldn't block on making this change, but I recommend that it be made sooner rather than later so that you don't have to support legacy configuration styles.

mjmac · 2024-01-24T17:48:34Z

src/bio/bio_xstream.c

@@ -242,6 +236,14 @@ bio_nvme_init(const char *nvme_conf, int numa_node, unsigned int mem_size,
 		goto free_mutex;
 	}

+	glb_criteria.fc_enabled     = true;
+	glb_criteria.fc_max_io_errs = 10;


minor: I know they weren't before this patch, but it seems like these should be named constant values.
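
One way the named-constant suggestion could look is sketched below. The constant names, the struct name and the init helper are hypothetical; only the two field names and the values true and 10 come from the diff excerpt above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant names; the values are those shown in the diff. */
#define SKETCH_DEF_AUTO_FAULTY_ENABLED true
#define SKETCH_DEF_MAX_IO_ERRS         10

/* Reduced stand-in for the criteria struct, limited to the fields shown. */
struct sketch_faulty_criteria {
	bool     fc_enabled;
	uint32_t fc_max_io_errs;
};

/* Apply the defaults in one place instead of scattering literal values. */
static void
sketch_criteria_set_defaults(struct sketch_faulty_criteria *crit)
{
	assert(crit != NULL);

	crit->fc_enabled     = SKETCH_DEF_AUTO_FAULTY_ENABLED;
	crit->fc_max_io_errs = SKETCH_DEF_MAX_IO_ERRS;
}
```

Naming the defaults gives documentation and any later configuration override a single definition to refer to.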

tanabarr · 2024-01-24T18:29:55Z

> The code changes seem fine to me, but I have some concerns about the design.
>
> In an ideal world, I think that this configuration would be done as system properties, as I don't see any valid use case for allowing per-engine permutations in behavior. We don't currently have a way to propagate system-level configuration settings out to engines as they join, so I understand that we need to use the server yaml. That having been said, it would probably be better to set these in a top-level entry which is then applied to all engines, similar to how we set the provider and other parameters that should be the same for all engines in a configuration.
>
> I wouldn't block on making this change, but I recommend that it be made sooner rather than later so that you don't have to support legacy configuration styles.

noted

tanabarr · 2024-01-24T18:30:57Z

@mjmac can you land the PR please?

tanabarr requested a review from a team as a code owner January 1, 2024 14:48

tanabarr requested review from mjmac and kjacque and removed request for a team January 1, 2024 14:48

tanabarr self-assigned this Jan 1, 2024

tanabarr force-pushed the tanabarr/control-auto-faulty-config branch from 8315958 to 3b377ce Compare January 3, 2024 00:27

tanabarr requested review from wangshilong and NiuYawei January 3, 2024 00:27

tanabarr marked this pull request as draft January 3, 2024 00:27

wangshilong previously approved these changes Jan 3, 2024

tanabarr dismissed wangshilong’s stale review via f4102e6 January 8, 2024 16:40

tanabarr marked this pull request as ready for review January 8, 2024 17:43

tanabarr requested a review from a team as a code owner January 8, 2024 23:16

wangshilong previously approved these changes Jan 9, 2024

tanabarr added the forced-landing label ("The PR has known failures or has intentionally reduced testing, but should still be landed.") Jan 10, 2024

tanabarr dismissed wangshilong’s stale review via 6e0888b January 10, 2024 14:29

wangshilong previously approved these changes Jan 11, 2024

tanabarr removed the forced-landing label Jan 12, 2024

tanabarr dismissed wangshilong’s stale review via 108398b January 12, 2024 13:46

DAOS-14442 control: Make NVMe auto-faulty configurable

f8595bd

Required-githooks: true Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

tanabarr added 2 commits January 14, 2024 23:18

fix codespell issue

58642d8

Required-githooks: true Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

Merge remote-tracking branch 'origin/master' into tanabarr/control-au…

cb4b3ae

…to-faulty-config Features: control Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

NiuYawei approved these changes Jan 16, 2024

tanabarr requested a review from wangshilong January 16, 2024 10:51

tanabarr added the forced-landing label Jan 16, 2024

tanabarr requested a review from knard-intel January 16, 2024 11:00

knard-intel approved these changes Jan 16, 2024

tanabarr requested a review from a team January 16, 2024 14:23

Merge remote-tracking branch 'origin/master' into tanabarr/control-au…

9e2efb6

…to-faulty-config Features: control Required-githooks: true Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

mchaarawi removed the request for review from a team January 22, 2024 17:23

tanabarr removed the forced-landing label Jan 22, 2024

tanabarr requested a review from a team January 24, 2024 16:12

tanabarr added and removed the forced-landing label Jan 24, 2024

mjmac approved these changes Jan 24, 2024

mchaarawi merged commit f128f07 into master Jan 24, 2024
49 of 50 checks passed

mchaarawi deleted the tanabarr/control-auto-faulty-config branch January 24, 2024 18:42
