
DAOS-16591 mgmt, vos, common: Align scm/meta size #15146

Merged: 3 commits merged from sherintg/vos_on_blob_p2/DAOS-16591 into feature/vos_on_blob_p2 on Sep 24, 2024

Conversation

sherintg (Collaborator):

The scm and meta sizes for vos are now aligned up to 16 MiB for pools using the phase 2 allocator.
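For illustration only, a minimal sketch of the round-up arithmetic this implies, assuming a DAOS-style align-up macro and a fixed 16 MiB page size (the real code obtains the page size from umempobj_pgsz(), and the macro and values here are illustrative, not taken from the patch):

#include <stdint.h>
#include <stdio.h>

/* Round x up to the next multiple of a, where a is a power of two. */
#define ALIGNUP(x, a) (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

int main(void)
{
	uint64_t pgsz   = 1ULL << 24;            /* 16 MiB, illustrative page size */
	uint64_t scm_sz = 1000ULL * 1000 * 1000; /* ~1 GB requested scm size */

	/* 1000000000 rounds up to 1006632960, i.e. 60 * 16 MiB. */
	printf("%llu\n", (unsigned long long)ALIGNUP(scm_sz, pgsz));
	return 0;
}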

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed, or there is a reason documented in the PR why it should be force landed and the forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that the user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.


github-actions bot commented Sep 18, 2024

Ticket title is 'The scm and meta size should be aligned up to the page size of BMEM allocator'
Status is 'Open'
Labels: 'md_on_ssd2'
Errors are component not formatted correctly
https://daosio.atlassian.net/browse/DAOS-16591

The scm and meta sizes for vos are now aligned-up to 16M for pools
using phase 2 allocator.

Signed-off-by: Sherin T George <sherin-t.george@hpe.com>
@sherintg force-pushed the sherintg/vos_on_blob_p2/DAOS-16591 branch from 488082a to 08e761b on September 18, 2024 07:20
@sherintg marked this pull request as ready for review on September 18, 2024 08:36
@sherintg requested review from a team as code owners on September 18, 2024 08:36

backend = umempobj_get_backend_type();
if ((*scm_sz != *meta_sz) && (backend == DAOS_MD_BMEM))
	backend = DAOS_MD_BMEM_V2;
Contributor:

I think we should return failure in this case?

Contributor:

Why? This seems valid to me. It is the standard case for phase 2, where mem_ratio is not 100% and we therefore need to use V2.

Contributor:

I was thinking of the case where users configure the backend as DAOS_MD_BMEM via the DAOS_MD_ON_SSD_MODE environment variable. I would think we don't support phase 2 in that case, since phase 1 and phase 2 use totally different layouts.

Contributor:

@wangshilong see vos_pool_create() in the vos_on_blob_p2 branch: DAOS_MD_BMEM is now the default backend, and we use the BMEM_V2 backend only when "meta size > scm size" or when DAOS_MD_ON_SSD_MODE is explicitly set to BMEM_V2 by the user.

@sherintg I think we need to turn the above code into a function to avoid duplicating it in vos_pool_create(), so that when we change our backend selection strategy in the future we don't have to update code here and there.
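As a hedged illustration of that suggestion, a shared helper could look roughly like the sketch below; the function name vos_pool_backend_select() is hypothetical, while umempobj_get_backend_type(), DAOS_MD_BMEM and DAOS_MD_BMEM_V2 are the identifiers from the quoted snippet and require the DAOS headers:

/* Hypothetical helper: choose the MD-on-SSD backend for a new pool.
 * Upgrade the default BMEM backend to BMEM_V2 whenever the scm and meta
 * sizes differ (i.e. mem_ratio is not 100%). */
static int
vos_pool_backend_select(size_t scm_sz, size_t meta_sz)
{
	int backend = umempobj_get_backend_type();

	if (scm_sz != meta_sz && backend == DAOS_MD_BMEM)
		backend = DAOS_MD_BMEM_V2;

	return backend;
}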

Collaborator Author:

Addressed in the latest commit. Also added a new tunable, DAOS_MD_DISABLE_BMEM_V2, to disable creation of v2 pools.
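A rough sketch of how such a tunable might be consulted; plain getenv() is used here for illustration, and the precise semantics (a value of "1" disables BMEM_V2 pool creation) are assumed rather than taken from the patch:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Assumed semantics: DAOS_MD_DISABLE_BMEM_V2=1 keeps the phase 1 (BMEM)
 * backend even when BMEM_V2 would otherwise be selected. */
static bool
bmem_v2_disabled(void)
{
	const char *val = getenv("DAOS_MD_DISABLE_BMEM_V2");

	return val != NULL && strcmp(val, "1") == 0;
}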

@wangshilong (Contributor) left a comment:

Anyway, I was just confirming the question; no objections to the PR.

@daosbuild1 (Collaborator):

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15146/2/testReport/



size_t alignsz;

backend = umempobj_get_backend_type();
if ((*scm_sz != *meta_sz) && (backend == DAOS_MD_BMEM))
Contributor:

scm_sz > meta_sz is invalid; we should assert or return an error. BTW, do we guarantee the sizes passed from the control plane are always non-zero?

Collaborator Author:

Addressed

Contributor:

meta_sz and scm_sz should always be non-zero, per src/mgmt/srv_drpc.c L504.

alignsz = umempobj_pgsz(backend);

*scm_sz = D_ALIGNUP(*scm_sz, alignsz);
if (*meta_sz)
Contributor:

Looks like the control plane may pass a zero meta_sz? Then the code above could mistakenly select the BMEM_V2 backend.

BTW, I think we should move the "max(tca->tca_scm_size / dss_tgt_nr, 1 << 24)" from tgt_create_preallocate() to here and apply it to both the scm size and the meta size, right?
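For clarity, the clamp being discussed splits the pool size evenly across targets and enforces a 16 MiB (1 << 24) floor per target before alignment. A hedged sketch, where scm_size and meta_size stand in for the pool-level inputs and max() mirrors the macro used in the quoted code:

/* Illustrative only: per-target sizes with a 16 MiB floor. */
uint64_t tgt_scm_sz  = max(scm_size / dss_tgt_nr, 1ULL << 24);
uint64_t tgt_meta_sz = max(meta_size / dss_tgt_nr, 1ULL << 24);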

Collaborator Author:

The meta_sz check was added on the assumption that it might not be set in pmem mode. I have verified that both scm_sz and meta_sz are set by the control plane in all modes (pmem, bmem_v1 and bmem_v2), so I removed the checks and added asserts.

Addressed review comments.
Added a tunable to disable creating BMEM_V2 pools.

Signed-off-by: Sherin T George <sherin-t.george@hpe.com>
@tanabarr (Contributor):

@sherintg please try to avoid force pushes if you can, because they force reviewers to go through the whole patch again rather than just the differential. TIA

rc = tgt_vos_preallocate_sequential(tca->tca_ptrec->dptr_uuid,
				    max(tca->tca_scm_size / dss_tgt_nr,
					1 << 24), dss_tgt_nr);
rc = tgt_vos_preallocate_sequential(
Contributor:

Why do we have to allocate sequentially in MD mode rather than parallel like in PMem?

@sherintg (Collaborator Author) commented on Sep 20, 2024:

It's the other way round: it's sequential in PMEM and parallel in MD. This change was made in phase 1 to overcome the overhead of fallocate on tmpfs.

@@ -1123,7 +1122,12 @@ ds_mgmt_hdlr_tgt_create(crt_rpc_t *tc_req)

tgt_scm_sz = tc_in->tc_scm_size / dss_tgt_nr;
tgt_meta_sz = tc_in->tc_meta_size / dss_tgt_nr;
vos_pool_roundup_size(&tgt_scm_sz, &tgt_meta_sz);
rc = vos_pool_roundup_size(&tgt_scm_sz, &tgt_meta_sz);
Contributor:

Is this strange formatting enforced by the clang linter?

Collaborator Author:

Yes.

@sherintg (Collaborator Author):

@tanabarr Apologies for the force push on the second commit. I made a second commit yesterday addressing the review comments; however, after the md_on_ssd meeting I modified that commit (with a force push) to include a tunable to disable the phase 2 allocator. The second commit should still contain the diff between what you had reviewed and approved and the rework changes.

@daosbuild1 (Collaborator):

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15146/4/execution/node/1463/log

@daosbuild1 (Collaborator):

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15146/4/execution/node/1447/log

size_t alignsz;
int rc;

D_ASSERT((*scm_sz != 0) && (*meta_sz != 0));
Contributor:

This assert is triggered in CI tests; it looks like we sometimes pass a zero meta_sz (for pmem or phase 1 mode).

Collaborator Author:

Addressed with the 3rd commit.

Contributor:

From the failed run at https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-15146/4/artifact/Functional%20Hardware%20Medium/daos_test/rebuild.py/daos_logs.wolf-254/01-DAOS_Rebuild_0to10_daos_control.log/ we can see that the assert happens before pool create, just after engine start (before any dmg pool commands are issued).

  • Wolf-254 engine 1 starts - rank 2
  • Wolf-254 engine 0 starts - rank 5
  • SmdQuery issued from dmg to wolf-254 for ranks 2,5
  • ranks 2&5 fail due to assert
    e.g. 13:06:12 INFO | Sep 20 13:03:37 wolf-254.wolf.hpdd.intel.com daos_server[167095]: ERROR: daos_engine:0 09/20-13:03:37.29 wolf-254 DAOS[168521/0/4058] vos EMRG src/vos/vos_pool.c:787 vos_pool_roundup_size() Assertion '(*scm_sz != 0) && (*meta_sz != 0)' failed

So, whilst meta_sz will be set to a non-zero value in control-plane pool create calls, other engine-internal calls need to handle meta_sz == 0.
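Purely as an illustration of the kind of guard this implies (the signature of vos_pool_roundup_size() is inferred from the quoted diff; per the follow-up commit the actual handling was done in the pool extend path and may differ):

/* Sketch only: tolerate meta_sz == 0 from engine-internal callers. */
int
vos_pool_roundup_size(daos_size_t *scm_sz, daos_size_t *meta_sz)
{
	int    backend = umempobj_get_backend_type();
	size_t alignsz;

	D_ASSERT(*scm_sz != 0);
	/* Only a non-zero meta size can select the BMEM_V2 backend. */
	if (*meta_sz != 0 && *scm_sz != *meta_sz && backend == DAOS_MD_BMEM)
		backend = DAOS_MD_BMEM_V2;

	alignsz = umempobj_pgsz(backend);
	*scm_sz = D_ALIGNUP(*scm_sz, alignsz);
	if (*meta_sz != 0)
		*meta_sz = D_ALIGNUP(*meta_sz, alignsz);
	return 0;
}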

Handled meta_sz == 0 during pool extend operation.

Signed-off-by: Sherin T George <sherin-t.george@hpe.com>
@NiuYawei self-requested a review on September 24, 2024 05:52
@NiuYawei merged commit af71f88 into feature/vos_on_blob_p2 on Sep 24, 2024
49 of 53 checks passed
@NiuYawei deleted the sherintg/vos_on_blob_p2/DAOS-16591 branch on September 24, 2024 05:53