DAOS-14408 common: enable NDCTL for DCPM #14371

Merged
merged 74 commits into from
Oct 9, 2024


Contributor

@grom72 grom72 commented May 15, 2024

This PR prepares DAOS to be used with NDCTL enabled in PMDK, which means:

  • NDCTL must not be used with non-DCPM (simulated PMem) storage, i.e. when storage class: "ram" is used.
    The PMEMOBJ_CONF=sds.at_create=0 env variable disables NDCTL features in PMDK.
    This change affects all tests run on simulated PMem (e.g. inside VMs).
    Some DAOS utility applications may also require PMEMOBJ_CONF=sds.at_create=0 to be set.

  • The default ULT stack size must be at least 20KiB, and aligned with the Linux page size, to avoid stack overuse by PMDK with NDCTL enabled.
    The ABT_THREAD_STACKSIZE=20480 env variable is used to increase the default ULT stack size.
    This env variable is set by the control/server module just before the engine is started.
    A much bigger stack is used for pmempool open/create-related tasks (e.g. tgt_vos_create_one) to avoid stack overuse.
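
The two environment variables above can be illustrated with a minimal Go sketch. It is not the code from this PR: the function names (alignUp, hasKey, engineEnv) are hypothetical, and two behaviours are assumptions made for illustration only, namely that an existing ABT_THREAD_STACKSIZE value is left untouched and that PMEMOBJ_CONF is added only when the storage class is "ram". Only the variable names and values come from the description.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// 20 KiB minimum ULT stack size stated in the PR description.
const minULTStackBytes = 20 * 1024

// alignUp rounds n up to the next multiple of align (align must be > 0).
func alignUp(n, align int) int {
	return (n + align - 1) / align * align
}

// hasKey reports whether env already holds a "key=value" entry for key.
func hasKey(env []string, key string) bool {
	prefix := key + "="
	for _, kv := range env {
		if strings.HasPrefix(kv, prefix) {
			return true
		}
	}
	return false
}

// engineEnv returns a copy of base extended with the two variables from the
// description. Hypothetical helper: keeping a user-provided value and keying
// PMEMOBJ_CONF on the "ram" storage class are assumptions of this sketch.
func engineEnv(base []string, storageClass string, pageSize int) []string {
	env := append([]string(nil), base...)
	if !hasKey(env, "ABT_THREAD_STACKSIZE") {
		// At least 20 KiB, rounded up to a whole number of pages.
		size := alignUp(minULTStackBytes, pageSize)
		env = append(env, fmt.Sprintf("ABT_THREAD_STACKSIZE=%d", size))
	}
	if storageClass == "ram" && !hasKey(env, "PMEMOBJ_CONF") {
		// Simulated PMem: disable NDCTL-related features in PMDK.
		env = append(env, "PMEMOBJ_CONF=sds.at_create=0")
	}
	return env
}

func main() {
	for _, kv := range engineEnv(nil, "ram", os.Getpagesize()) {
		fmt.Println(kv)
	}
}
```

With a 4KiB page size, 20480 bytes is exactly five pages, so the rounded value equals the 20480 quoted above; on systems with larger pages the sketch rounds up to one full page instead.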

This modification shall not affect md-on-ssd mode as long as storage class: "ram" is used for the first tier in the storage configuration.
This change does not require any configuration changes to existing systems.

The new PMDK package with NDCTL enabled (daos-stack/pmdk#38) will land as soon as this PR is merged and backported to stable/2.6.

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.


github-actions bot commented May 15, 2024

Ticket title is 'NDCTL must be enabled to provide support for RAS functionality in PMDK'
Status is 'In Review'
Labels: 'scrubbed_2.8,triaged'
https://daosio.atlassian.net/browse/DAOS-14408

@daosbuild1
Collaborator

Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/5/execution/node/886/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/7/execution/node/329/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/7/execution/node/366/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/7/execution/node/363/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/7/execution/node/310/log

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/8/execution/node/1176/log

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/9/execution/node/1176/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/9/execution/node/1417/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/9/execution/node/1509/log

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/10/execution/node/1152/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/10/execution/node/1463/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/10/execution/node/1417/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/9/execution/node/1601/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/10/execution/node/1602/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/11/execution/node/273/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/11/execution/node/367/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/11/execution/node/343/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/11/execution/node/383/log

@daosbuild1
Collaborator

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/15/execution/node/758/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/17/execution/node/920/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/18/execution/node/920/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/18/execution/node/904/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/19/execution/node/870/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/16/execution/node/968/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/20/execution/node/962/log

kjacque previously approved these changes Oct 1, 2024
@daltonbohning
Contributor

There are conflicts now :(
Conflicting files
debian/changelog
debian/control
utils/rpms/daos.spec

…tion

Skip-list: test_dfuse_daos_build_wt_pil4dfs:DAOS-16556

Priority: 2
Cancel-prev-build: false
Allow-unstable-test: true

Required-githooks: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@grom72 grom72 dismissed stale reviews from kjacque and tanabarr via d13a99b October 1, 2024 16:26
@grom72
Contributor Author

grom72 commented Oct 1, 2024

There are conflicts now :( Conflicting files debian/changelog debian/control utils/rpms/daos.spec

All conflicts resolved

Contributor

@daltonbohning daltonbohning left a comment

Will reserve approval for those familiar with engine and control plane. Build-wise, I don't see issues

…tion

PR-repos: pmdk@PR-38:14
Skip-list: test_dfuse_daos_build_wt_pil4dfs:DAOS-16556
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@grom72 grom72 requested a review from a team October 7, 2024 09:42
@brianjmurrell
Contributor

@grom72 Your final commit, the one that you want @daos-stack/daos-gatekeeper to land, did not do all of the required testing. It did not (functionally) test the RPMs on Leap 15 and EL9, even though you are making daos.spec changes that are specific to those distributions and that need testing on those platforms.

While I can appreciate that you did test those two distributions two commits prior to the final commit that you want to get landed, we now don't know if those two new commits may have introduced any regressions relative to your testing of two commits ago.

Generally speaking, when you are ready for landing, your final commit should do all of the testing that is necessary for the changes you are making in your PR, even the optional testing; otherwise we may land regressions. (I would suggest that the platforms you skipped testing on should not have been optional, and that we should have some kind of context testing to enable those platforms automatically.)

PR-repos: pmdk@PR-38:14
Skip-list: test_dfuse_daos_build_wt_pil4dfs:DAOS-16556
Priority: 2

Do not re-run UT
Skip-unit-tests: true

Force tests on various OSes
Skip-func-test-leap15: false
Skip-func-test-el9: false
Skip-test-leap-15.4-rpms: false
Skip-test-el9-rpms: false

Allow-unstable-test: true

HW tests already done in the previous build
Skip-func-hw-test: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Skip-list: test_dfuse_daos_build_wt_pil4dfs:DAOS-16556
Priority: 2

Cancel-prev-build: false

Force tests on various OSes
Skip-func-test-leap15: false
Skip-func-test-el9: false
Skip-test-leap-15.4-rpms: false
Skip-test-el9-rpms: false

Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@grom72
Contributor Author

grom72 commented Oct 8, 2024

Validation with NDCTL enabled:
https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-14371/133/
Validation with NDCTL enabled (extended tests on various OSes, as suggested in #14371 (comment)):
https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-14371/134/

Validation with legacy PMDK (extended tests on various OSes, as suggested in #14371 (comment)):
https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-14371/135/
plus
https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-14371/136 for the tests that failed in build 135 due to a Jenkins issue.

@daltonbohning daltonbohning removed the request for review from a team October 8, 2024 15:16
Skip-list: test_dfuse_daos_build_wt_pil4dfs:DAOS-16556
Priority: 2

Cancel-prev-build: false

Skip tests that passed in previous build
Skip-unit-tests: true
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-nlt: true
Skip-func-test-vm: true

Allow-unstable-test: true

Skip-func-test-hw-large: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@grom72 grom72 requested a review from a team October 9, 2024 18:44
@grom72
Contributor Author

grom72 commented Oct 9, 2024

@daos-stack/daos-gatekeeper please let me know if you want me to squash all commits into a few logical ones before landing

@daltonbohning
Contributor

@daos-stack/daos-gatekeeper please let me know if you want me to squash all commits into a few logical ones before landing

Please don't squash - that will throw off the history and CI status. If the description at the top is updated, we can use that for merge

@grom72
Contributor Author

grom72 commented Oct 9, 2024

@daos-stack/daos-gatekeeper please let me know if you want me to squash all commits into a few logical ones before landing

Please don't squash - that will throw off the history and CI status. If the description at the top is updated, we can use that for merge

Done. Please use it.

@daltonbohning daltonbohning merged commit dad109c into master Oct 9, 2024
67 of 69 checks passed
@daltonbohning daltonbohning deleted the grom72/ndctl-validation branch October 9, 2024 22:01
Labels

  • go: Pull requests that update Go code
  • release-2.6.2: Targeted for release 2.6.2