DAOS-14408 common: enable NDCTL for DCPM #14371

Open: wants to merge 69 commits into base: master

grom72
Contributor

@grom72 grom72 commented May 15, 2024

This PR prepares DAOS to be used with NDCTL enabled in PMDK, which means:

  • NDCTL must not be used with non-DCPM (simulated PMem) storage, i.e. when storage class: "ram" is used:
    the PMEMOBJ_CONF=sds.at_create=0 environment variable disables the NDCTL features in PMDK.

  • The default ULT stack size must be at least 20 KiB, so that PMDK with NDCTL enabled does not overrun the stack, and it must be aligned to the Linux page size:
    the ABT_THREAD_STACKSIZE=20480 environment variable is used to increase the default ULT stack size.
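
The two rules above can be sketched as a small helper. This is only an illustration under stated assumptions, not the code changed in this PR: the function names, the 4096-byte page size, and the idea that the stack size applies to every storage class are all hypothetical.

```python
PAGE_SIZE = 4096           # assumed Linux page size; real systems may differ
MIN_ULT_STACK = 20 * 1024  # 20 KiB lower bound stated in the PR description


def is_valid_ult_stack_size(size, page_size=PAGE_SIZE):
    """Return True if size is at least 20 KiB and a multiple of the page size."""
    return size >= MIN_ULT_STACK and size % page_size == 0


def engine_env(storage_class, stack_size=20480):
    """Build environment overrides for one engine (hypothetical helper)."""
    if not is_valid_ult_stack_size(stack_size):
        raise ValueError(f"invalid ULT stack size: {stack_size}")
    # Assumption: the larger ULT stack is set regardless of storage class.
    env = {"ABT_THREAD_STACKSIZE": str(stack_size)}
    if storage_class == "ram":
        # Simulated PMem: disable the NDCTL-backed features in PMDK.
        env["PMEMOBJ_CONF"] = "sds.at_create=0"
    return env
```

With these assumptions, 20480 passes the check because it is exactly five 4096-byte pages, while 16384 is page-aligned but too small.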

This modification shall not affect md-on-ssd mode as long as storage class: "ram" is used for the first tier in the storage configuration.
This change does not require any configuration changes to existing systems.

The new PMDK package with NDCTL enabled (daos-stack/pmdk#38) will be delivered as soon as this PR is merged and backported to stable/2.6.

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

github-actions bot commented May 15, 2024

Ticket title is 'NDCTL must be enabled to provide support for RAS functionality in PMDK'
Status is 'In Review'
Labels: 'scrubbed_2.8,triaged'
https://daosio.atlassian.net/browse/DAOS-14408

@daosbuild1
Collaborator

Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/5/execution/node/886/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/7/execution/node/329/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/7/execution/node/366/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/7/execution/node/363/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/7/execution/node/310/log

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/8/execution/node/1176/log

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/9/execution/node/1176/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/9/execution/node/1417/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/9/execution/node/1509/log

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/10/execution/node/1152/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/10/execution/node/1463/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/10/execution/node/1417/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/9/execution/node/1601/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/10/execution/node/1602/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/11/execution/node/273/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/11/execution/node/367/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/11/execution/node/343/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/11/execution/node/383/log

@daosbuild1
Collaborator

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/15/execution/node/758/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/17/execution/node/920/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/18/execution/node/920/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/18/execution/node/904/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/19/execution/node/870/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/16/execution/node/968/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/20/execution/node/962/log

@brianjmurrell
Contributor

@brianjmurrell retested with

Skip-func-test-leap15: false
Skip-func-test-el9: false
Skip-test-leap-15.4-rpms: false
Skip-test-el9-rpms: false

No test failure: build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14371/123/pipeline-graph

@grom72 Don't you also need the Test-tag: DaosBuild commit pragma to run the tests affected (by the daos-client-tests Requires: changes) in the el9 and leap15 stages?

Test-tag: DaosBuild
PR-repos: pmdk@PR-38:11

Priority: 2
Cancel-prev-build: false
Skip-func-test-leap15: false
Skip-func-test-el9: false
Skip-test-leap-15.4-rpms: false
Skip-test-el9-rpms: false

Allow-unstable-test: true
Skip-func-hw-test: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
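
The commit pragmas above are plain "Key: value" lines in the commit message. As a rough sketch of how such lines could be collected (this is not the DAOS CI's actual parser, and `parse_pragmas` is a hypothetical name):

```python
def parse_pragmas(message):
    """Collect 'Key: value' lines from a commit message into a dict."""
    pragmas = {}
    for line in message.splitlines():
        # Split on the first ': ' so values such as 'pmdk@PR-38:11' stay whole.
        key, sep, value = line.partition(": ")
        key = key.strip()
        # Keep only single-token keys such as 'Skip-nlt' or 'Test-tag'.
        if sep and key and " " not in key:
            pragmas[key] = value.strip()
    return pragmas
```

If a key is repeated, as Skip-unit-tests is in some of the messages in this thread, the last value wins in this sketch.
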
@daosbuild1
Collaborator

Test stage Functional on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/124/execution/node/1310/log

@daosbuild1
Collaborator

Test stage Functional on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/125/execution/node/538/log

Test-tag: DaosBuild
PR-repos: pmdk@PR-38:11
Skip-list: test_dfuse_daos_build_wt_il:DAOS-16556

Priority: 2
Cancel-prev-build: false
Skip-func-test-leap15: false
Skip-func-test-el9: false
Skip-test-leap-15.4-rpms: false
Skip-test-el9-rpms: false

Allow-unstable-test: true
Skip-func-hw-test: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@daosbuild1
Collaborator

Test stage Functional on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14371/126/execution/node/1232/log

Test-tag: DaosBuild
PR-repos: pmdk@PR-38:11
Skip-list: test_dfuse_daos_build_wt_pil4dfs:DAOS-16556

Priority: 2
Cancel-prev-build: false
Skip-func-test-leap15: false
Skip-func-test-el9: false
Skip-test-leap-15.4-rpms: false
Skip-test-el9-rpms: false

Allow-unstable-test: true
Skip-func-hw-test: true

Required-githooks: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@daosbuild1
Collaborator

Test stage Functional on Leap 15.5 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14371/127/testReport/

Contributor

@brianjmurrell brianjmurrell left a comment

As you can see in your latest test run, DaosBuild.test_dfuse_daos_build_wt_il failed on Leap15 with:

06:40:54 DEBUG|     src/common.inc:336: *** Please install libndctl-dev/libndctl-devel/ndctl-devel >= 63.  Stop.

This is because of the concern I raised previously.
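
The build error above is a minimum-version check on the ndctl development package. A minimal sketch of a similar check from a test script, assuming the pkg-config module is named `libndctl` and that 63 is the required minimum as the message states (`has_min_version` is a hypothetical helper, not part of DAOS or PMDK):

```python
import shutil
import subprocess


def has_min_version(module, min_version):
    """Return True if pkg-config reports module at min_version or newer."""
    pkg_config = shutil.which("pkg-config")
    if pkg_config is None:
        return False  # cannot check without pkg-config installed
    result = subprocess.run(
        [pkg_config, f"--atleast-version={min_version}", module],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0
```

Calling `has_min_version("libndctl", "63")` on a test node before starting the build would surface the missing package earlier than the make step does.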

I suspect you also need to add this requirement (lib{nd,dax}ctl-dev packages) to the debian/control file's Package: daos-client-tests's Depends: clause. To be sure, there are lots of others missing there that have been added to the daos.spec in the past without adding them to debian/control, but since we are identifying this particular one missing, let's take the opportunity to add it in this PR so as not to increase technical debt. Indeed, it would be nice if we were doing even minimal testing on Ubuntu to help identify these kinds of gaps as they happen.

utils/rpms/daos.spec (resolved)
utils/rpms/daos.spec (outdated, resolved)
Contributor

Just curious why daxctl-devel is not needed on Leap15.

Contributor

Any answer here @grom72?

debian/changelog (outdated, resolved)
Test-tag: DaosBuild
PR-repos: pmdk@PR-38:11
Skip-list: test_dfuse_daos_build_wt_pil4dfs:DAOS-16556

Priority: 2
Cancel-prev-build: false

Skip-func-test-leap15: false
Skip-func-test-el9: false
Skip-test-leap-15.4-rpms: false
Skip-test-el9-rpms: false

Allow-unstable-test: true
Skip-func-hw-test: false

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@grom72 grom72 dismissed stale reviews from kjacque, NiuYawei, and tanabarr via c6f4853 September 25, 2024 17:46
@grom72
Contributor Author

grom72 commented Sep 25, 2024

I suspect you also need to add this requirement (lib{nd,dax}ctl-dev packages) to the debian/control file's Package: daos-client-tests's Depends: clause. To be sure, there are lots of others missing there that have been added to the daos.spec in the past without adding them to debian/control, but since we are identifying this particular one missing, let's take the opportunity to add it in this PR so as not to increase technical debt. Indeed, it would be nice if we were doing even minimal testing on Ubuntu to help identify these kinds of gaps as they happen.

I wonder why we have to add PMem-related dependencies to daos-client-tests. DAOS client has nothing to do with PMem directly.

Test-tag: DaosBuild
PR-repos: pmdk@PR-38:11
Skip-list: test_dfuse_daos_build_wt_pil4dfs:DAOS-16556

Priority: 2
Cancel-prev-build: false

Skip-build-ubuntu20-rpm: false
Skip-build-leap15-rpm: true
Skip-build-leap15-icc: true
Skip-build-el9-rpm: true
Skip-nlt: true
Skip-unit-tests: true
Skip-func-test-vm: true
Skip-test-rpms: true
Skip-unit-test-memcheck: true
Skip-func-test: true
Skip-unit-tests: true
Allow-unstable-test: true
Skip-func-hw-test: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@grom72
Contributor Author

grom72 commented Sep 25, 2024

@brianjmurrell
Contributor

I wonder why we have to add PMem-related dependencies to daos-client-tests. DAOS client has nothing to do with PMem directly.

Because one of the daos (client) tests is to clone and build daos itself on daos (i.e. dfuse IIUC). It builds server+client. So the dependencies there are not actually for the client but for the building of daos.

I have more than once questioned the value of building daos specifically as a test of daos. While building something with a lot of files is probably a good test of daos/dfuse, I wonder if it needs to be as complicated as daos. Maybe the Linux kernel, for example, which is entirely self-contained in its own source tree and doesn't have a brazillion dependencies, is a better project. But that's just my opinion.

@grom72
Contributor Author

grom72 commented Sep 25, 2024

I wonder why we have to add PMem-related dependencies to daos-client-tests. DAOS client has nothing to do with PMem directly.

Because one of the daos (client) tests is to clone and build daos itself on daos (i.e. dfuse IIUC). It builds server+client. So the dependencies there are not actually for the client but for the building of daos.

I have more than once questioned the value of building daos specifically as a test of daos. While building something with a lot of files is probably a good test of daos/dfuse, I wonder if it needs to be as complicated as daos. Maybe the Linux kernel, for example, which is entirely self-contained in its own source tree and doesn't have a brazillion dependencies, is a better project. But that's just my opinion.

Does it mean that in the case of all client tests, we install dependencies required for DAOS build (*-dev)?
If yes, how can we know that the client works properly as we use it in a "dirty" environment with some development packages?

@brianjmurrell
Contributor

Does it mean that in the case of all client tests, we install dependencies required for DAOS build (*-dev)?

Yes, *-devel to be accurate. Most specifically, we (want to) install the result of dnf builddep utils/rpms/daos.spec so we effectively mirror the BuildRequires: for EL8 in utils/rpms/daos.spec as Requires: in the daos-client-tests subpackage.

Yes, it's messy.
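
As a rough illustration of the mirroring described above, the sketch below lists BuildRequires: entries that a test subpackage does not also require. It is a deliberately naive reading of a spec file: it ignores %if conditionals, macros, and version bounds, and both function names are hypothetical rather than existing DAOS tooling.

```python
import re


def build_requires(spec_text):
    """Return package names from BuildRequires: lines, ignoring version bounds."""
    names = []
    for line in spec_text.splitlines():
        match = re.match(r"\s*BuildRequires:\s*(\S+)", line)
        if match:
            names.append(match.group(1))
    return names


def missing_requires(spec_text, test_pkg_requires):
    """List build dependencies absent from a test subpackage's Requires set."""
    return [name for name in build_requires(spec_text)
            if name not in test_pkg_requires]
```

A check like this, run against utils/rpms/daos.spec and the Depends: list in debian/control, might catch the kind of gap discussed in this thread, though a real tool would need to evaluate the spec's conditionals first.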

If yes, how can we know that the client works properly as we use it in a "dirty" environment with some development packages?

This is a valid question and point. Fortunately the tendency is to miss Requires: in the daos-client-tests package and not the daos-client package. But this is another good reason to switch from trying to build daos on daos to building something else that has no such large pool of dependencies. I propose[d] the Linux kernel. It seems to be a popular target for filesystem testing.

@brianjmurrell
Contributor

Skip-build-leap15-rpm: true
Skip-build-leap15-icc: true
Skip-build-el9-rpm: true
Skip-nlt: true
Skip-unit-tests: true
Skip-func-test-vm: true
Skip-test-rpms: true
Skip-unit-test-memcheck: true
Skip-func-test: true
Skip-unit-tests: true
Allow-unstable-test: true
Skip-func-hw-test: true

Maybe gatekeepers will disagree but I don't think it's valid to skip all of the build and test that is required to show the validity of the PR in what you expect to be the final commit before requesting landing.

@brianjmurrell brianjmurrell dismissed their stale review September 25, 2024 20:57

All requested changes have been made.

PR-repos: pmdk@PR-38:11
Skip-list: test_dfuse_daos_build_wt_pil4dfs:DAOS-16556

Priority: 2

Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Skip-list: test_dfuse_daos_build_wt_pil4dfs:DAOS-16556

Priority: 2
Cancel-prev-build: false
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@grom72
Contributor Author

grom72 commented Sep 26, 2024

Skip-build-leap15-rpm: true
Skip-build-leap15-icc: true
Skip-build-el9-rpm: true
Skip-nlt: true
Skip-unit-tests: true
Skip-func-test-vm: true
Skip-test-rpms: true
Skip-unit-test-memcheck: true
Skip-func-test: true
Skip-unit-tests: true
Allow-unstable-test: true
Skip-func-hw-test: true

Maybe gatekeepers will disagree but I don't think it's valid to skip all of the build and test that is required to show the validity of the PR in what you expect to be the final commit before requesting landing.

The last builds were only to confirm that Leap rpms/Ubuntu pkg are properly built in a test environment.
9175174...e444b1f

The full validation has been done in the following builds:
w/o NDCTL
https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-14371/121/pipeline-graph/
w/ NDCTL
https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-14371/122/pipeline-graph/
https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-14371/123/pipeline-graph/

Anyhow, I have triggered a full validation run so that we have a consistent picture of the validation in one place:
with NDCTL:
https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-14371/130/
without NDCTL:
https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-14371/131/

@grom72 grom72 requested a review from a team September 26, 2024 05:35
Labels: go (Pull requests that update Go code), release-2.6.2 (Targeted for release 2.6.2)