DAOS-16591 mgmt, vos, common: Align scm/meta size #15146
Conversation
Ticket title is 'The scm and meta size should be aligned up to the page size of BMEM allocator'
The scm and meta sizes for vos are now aligned-up to 16M for pools using phase 2 allocator. Signed-off-by: Sherin T George <sherin-t.george@hpe.com>
Force-pushed from 488082a to 08e761b
src/vos/vos_pool.c (Outdated)

    backend = umempobj_get_backend_type();
    if ((*scm_sz != *meta_sz) && (backend == DAOS_MD_BMEM))
        backend = DAOS_MD_BMEM_V2;
I think we should return failure in this case?
Why, this seems valid to me. It is the standard case for phase 2 where mem_ratio is not 100% and therefore we need to use V2.
I thought users might configure the backend as DAOS_MD_BMEM via the DAOS_MD_ON_SSD_MODE environment variable. I would think we don't support phase 2 in that case, since phase 1 and phase 2 use totally different layouts.
@wangshilong see vos_pool_create() in the vos_on_blob_p2 branch: DAOS_MD_BMEM is now the default backend, and we use the BMEM_V2 backend only when "meta size > scm size" or when DAOS_MD_ON_SSD_MODE is explicitly set to BMEM_V2 by the user.
@sherintg I think we should turn the above code into a function to avoid duplicating it in vos_pool_create(), so that when we change our backend selection strategy in the future we don't have to update code here and there.
Addressed in the latest commit. Also, added a new tunable DAOS_MD_DISABLE_BMEM_V2 to disable creation of v2 pools.
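The refactor suggested above can be sketched as a small helper. This is an illustrative sketch only: the enum values and the helper name are stand-ins, not the actual DAOS definitions or the code in the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the DAOS backend identifiers. */
enum md_backend { DAOS_MD_BMEM = 1, DAOS_MD_BMEM_V2 = 2 };

/*
 * Hypothetical helper centralizing the backend selection policy so that
 * vos_pool_create() and the size-roundup path share one implementation:
 * when the configured backend is phase-1 BMEM but the scm and meta sizes
 * differ (i.e. mem_ratio is not 100%), switch to the phase-2 (V2) backend.
 */
static enum md_backend
select_backend(enum md_backend configured, uint64_t scm_sz, uint64_t meta_sz)
{
	if (scm_sz != meta_sz && configured == DAOS_MD_BMEM)
		return DAOS_MD_BMEM_V2;
	return configured;
}
```

With a single helper, a future change to the selection strategy (such as honoring the DAOS_MD_DISABLE_BMEM_V2 tunable mentioned above) only has to touch one place.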
Anyway, just confirming the question, no objections to the PR.
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15146/2/testReport/
src/vos/vos_pool.c (Outdated)

    size_t alignsz;

    backend = umempobj_get_backend_type();
    if ((*scm_sz != *meta_sz) && (backend == DAOS_MD_BMEM))
scm_sz > meta_sz is invalid; we should assert or return an error. BTW, do we guarantee the sizes passed from the control plane are always non-zero?
Addressed
meta and scm sz should always be non-zero as per src/mgmt/srv_drpc.c L504
src/vos/vos_pool.c (Outdated)

    alignsz = umempobj_pgsz(backend);

    *scm_sz = D_ALIGNUP(*scm_sz, alignsz);
    if (*meta_sz)
Looks like the control plane may pass a zero meta_sz? Then we could mistakenly assume it's the BMEM_V2 backend in the above code?
BTW, I think we should move the "max(tca->tca_scm_size / dss_tgt_nr, 1 << 24)" from tgt_create_preallocate() to here and apply it to both the scm size and the meta size, right?
The meta_sz check was added assuming it may not be set in pmem mode. Verified that both scm_sz and meta_sz are set by the control plane in all modes (pmem, bmem_v1 and bmem_v2). Hence removed the checks and added asserts.
Addressed review comments. Added a tunable to disable creating BMEM_V2 pools. Signed-off-by: Sherin T George <sherin-t.george@hpe.com>
Force-pushed from b1bb4cf to 35ae67a
@sherintg please try to avoid force pushes if you can, because they force reviewers to go through the whole patch again, not just the differential. TIA
    rc = tgt_vos_preallocate_sequential(tca->tca_ptrec->dptr_uuid,
                                        max(tca->tca_scm_size / dss_tgt_nr,
                                            1 << 24), dss_tgt_nr);
    rc = tgt_vos_preallocate_sequential(
Why do we have to allocate sequentially in MD mode rather than parallel like in PMem?
It's the other way round: it's sequential in PMEM and parallel in MD. This change was made in phase 1 to overcome the overhead of fallocate on tmpfs.
@@ -1123,7 +1122,12 @@ ds_mgmt_hdlr_tgt_create(crt_rpc_t *tc_req)

    tgt_scm_sz = tc_in->tc_scm_size / dss_tgt_nr;
    tgt_meta_sz = tc_in->tc_meta_size / dss_tgt_nr;
    vos_pool_roundup_size(&tgt_scm_sz, &tgt_meta_sz);
    rc = vos_pool_roundup_size(&tgt_scm_sz, &tgt_meta_sz);
Is this strange formatting enforced by the clang linter?
yes
@tanabarr Apologies for the force push of the second commit. I made a second commit yesterday addressing the review comments; however, after the md_on_ssd meeting I modified that commit (with a force push) to include a tunable to disable the phase 2 allocator. The second commit should still contain the difference between what you had reviewed and approved and the changes made as part of the rework.
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15146/4/execution/node/1463/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15146/4/execution/node/1447/log
src/vos/vos_pool.c (Outdated)

    size_t alignsz;
    int    rc;

    D_ASSERT((*scm_sz != 0) && (*meta_sz != 0));
This assert is triggered in CI tests; it looks like we pass a zero meta_size sometimes (for pmem or phase-1 mode).
Addressed with the 3rd commit.
From the failed run's logs at https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-15146/4/artifact/Functional%20Hardware%20Medium/daos_test/rebuild.py/daos_logs.wolf-254/01-DAOS_Rebuild_0to10_daos_control.log/ we can see that the assert happens before pool create, just after engine start (before any dmg pool commands are issued):
- Wolf-254 engine 1 starts - rank 2
- Wolf-254 engine 0 starts - rank 5
- SmdQuery issued from dmg to wolf-254 for ranks 2,5
- ranks 2 & 5 fail due to the assert
e.g. 13:06:12 INFO | Sep 20 13:03:37 wolf-254.wolf.hpdd.intel.com daos_server[167095]: ERROR: daos_engine:0 09/20-13:03:37.29 wolf-254 DAOS[168521/0/4058] vos EMRG src/vos/vos_pool.c:787 vos_pool_roundup_size() Assertion '(*scm_sz != 0) && (*meta_sz != 0)' failed
So whilst meta_sz will be set nonzero in control-plane pool create calls, other engine-internal calls need to be handled for meta_sz == 0.
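A minimal sketch of the kind of guard the follow-up commit adds (hypothetical shape, not the actual patch): align scm_sz unconditionally, but only align meta_sz when the caller supplied a nonzero value, so engine-internal paths such as pool extend don't trip the assert. PAGE_SZ and ROUNDUP are illustrative stand-ins.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ (16ULL << 20)  /* assumed 16M phase-2 allocator page size */
#define ROUNDUP(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

/* Hypothetical roundup that tolerates meta_sz == 0 from internal callers. */
static void
roundup_sizes(uint64_t *scm_sz, uint64_t *meta_sz)
{
	assert(*scm_sz != 0);             /* scm size is always provided    */
	*scm_sz = ROUNDUP(*scm_sz, PAGE_SZ);
	if (*meta_sz != 0)                /* e.g. pool extend may pass 0    */
		*meta_sz = ROUNDUP(*meta_sz, PAGE_SZ);
}
```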
Handled meta_sz == 0 during pool extend operation. Signed-off-by: Sherin T George <sherin-t.george@hpe.com>
The scm and meta sizes for vos are now aligned-up to 16M for pools using phase 2 allocator.
Before requesting gatekeeper:
- Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: