DAOS-15420 pool: Clean up ds_pool_svc_<op> #14036
Ticket title is 'Clean up engine-side PS client functions like |
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14036/1/execution/node/1198/log
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14036/2/testReport/
Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14036/3/testReport/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14036/3/testReport/
Force-pushed from 6920612 to d919d44.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14036/6/execution/node/1405/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14036/8/execution/node/1174/log
Force-pushed from 429f293 to 07c867f.
Convert

  ds_pool_svc_check_evict
  ds_pool_svc_query_target
  ds_pool_svc_get_prop
  ds_pool_svc_set_prop
  ds_pool_svc_target_update_state
  ds_pool_svc_update_acl
  ds_pool_svc_delete_acl
  ds_pool_svc_upgrade
  ds_pool_extend

to the dsc_pool_svc_call framework, so that they will

  - time out, instead of hanging forever, if PSs are unavailable, and
  - respond much faster in common cases thanks to exponential backoffs.

The req_time variable in dsc_pool_svc_call is part of the operation identifier, and should therefore retain its value across retries.

Features: pool
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
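The behavior described in this commit message (a deadline instead of indefinite retries, exponential backoff between attempts, and a req_time that is fixed for the whole operation) can be illustrated with a minimal sketch. This is not the dsc_pool_svc_call implementation: struct op_ctx, call_with_deadline, fail_until_zero, the backoff bounds, and the simulated clock are all hypothetical names and values chosen for illustration.

```c
#include <stdint.h>

#define BACKOFF_MIN_MS 1    /* assumed initial backoff */
#define BACKOFF_MAX_MS 1024 /* assumed backoff cap */
#define ERR_TIMEDOUT   (-1) /* stand-in for a timeout error code */

/* Hypothetical per-operation context; not a DAOS structure. */
struct op_ctx {
	uint64_t req_time; /* part of the operation identifier; set once */
	uint64_t deadline; /* absolute deadline in milliseconds */
	uint32_t attempts; /* attempts made so far */
};

/* Hypothetical attempt callback: 0 on success, nonzero to request a retry. */
typedef int (*attempt_fn)(const struct op_ctx *ctx, void *arg);

/*
 * Retry try_once until it succeeds or the deadline passes. The clock is
 * simulated through *now_ms so that the sketch is deterministic; sleeping
 * is modeled by advancing the clock.
 */
static int
call_with_deadline(struct op_ctx *ctx, uint64_t *now_ms, attempt_fn try_once, void *arg)
{
	uint64_t backoff = BACKOFF_MIN_MS;

	ctx->req_time = *now_ms; /* assigned once, before the retry loop */
	ctx->attempts = 0;

	for (;;) {
		if (*now_ms >= ctx->deadline)
			return ERR_TIMEDOUT; /* give up instead of hanging forever */
		ctx->attempts++;
		if (try_once(ctx, arg) == 0)
			return 0;
		*now_ms += backoff; /* simulated sleep */
		if (backoff < BACKOFF_MAX_MS)
			backoff *= 2; /* exponential backoff */
	}
}

/* Example attempt: fails until the counter pointed to by arg reaches zero. */
static int
fail_until_zero(const struct op_ctx *ctx, void *arg)
{
	int *remaining = arg;

	(void)ctx;
	if (*remaining > 0) {
		(*remaining)--;
		return 1;
	}
	return 0;
}
```

Because req_time is written before the loop and never inside it, every retry carries the same identifier, which is the property the commit message calls out. Short initial backoffs keep the common case (a brief PS hiccup) fast, while the deadline bounds the worst case.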
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14036/11/execution/node/1455/log
daos_test/rebuild_simple: DAOS-15431
no problems with this change that I can see
Features: pool Required-githooks: true
Features: pool Required-githooks: true
I have 3 comments. One is only a very minor change request (unconditional use of the word "Destroy" in the evict log message). Other comments are open questions that you may decide to make some changes for, depending on the answers.
src/pool/srv_pool_scrub_ult.c
Outdated
@@ -244,7 +244,9 @@ drain_pool_target(uuid_t pool_uuid, d_rank_t rank, uint32_t target)
 	addr.pta_target = target;
 	target_list.pta_addrs = &addr;

-	rc = ds_pool_target_update_state(pool_uuid, &out_ranks, &target_list, PO_COMP_ST_DRAIN);
+	rc = dsc_pool_svc_update_target_state(pool_uuid, &out_ranks,
+					      daos_getmtime_coarse() + 60 * 1000, &target_list,
If I have this right, if crt_timeout in the engine is 60 seconds, then this choice of deadline (current time + 60 seconds) doesn't allow for any retries to occur. Should the deadline be larger (but maybe not as high as 5 minutes that would be the case if this were a control-plane initiated operation)?
In another respect, if the crt_timeout is intentionally set larger for some environments requiring it, this 60 second deadline will cause the timeout to be readjusted downward approximately to current time + this shorter deadline.
Should anything be changed, or is it OK for this particular operation?
To be honest, I don't know the answer at the moment. For now, perhaps it's safer to change this to a higher value, since the existing code retries indefinitely.
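The arithmetic behind this exchange can be sketched as follows. The sketch assumes, as the reviewer describes, that each attempt's timeout is the RPC timeout shortened to whatever remains before the deadline. The function names are hypothetical, the model ignores backoff sleeps, and it is not taken from the DAOS or CaRT sources.

```c
#include <stdint.h>

/*
 * Timeout for one attempt: the RPC timeout, clamped to the time remaining
 * before the absolute deadline. All values are in milliseconds.
 */
static uint64_t
attempt_timeout_ms(uint64_t now_ms, uint64_t deadline_ms, uint64_t rpc_timeout_ms)
{
	uint64_t remaining;

	if (now_ms >= deadline_ms)
		return 0; /* deadline already passed */
	remaining = deadline_ms - now_ms;
	return remaining < rpc_timeout_ms ? remaining : rpc_timeout_ms;
}

/*
 * Worst-case number of attempts that fit into the budget when every
 * attempt runs until it times out (ceiling division).
 */
static uint64_t
worst_case_attempts(uint64_t budget_ms, uint64_t rpc_timeout_ms)
{
	if (rpc_timeout_ms == 0)
		return 0;
	return (budget_ms + rpc_timeout_ms - 1) / rpc_timeout_ms;
}
```

Under this model, a 60-second budget with a 60-second RPC timeout gives worst_case_attempts(60000, 60000) == 1, so a single timed-out attempt exhausts the budget and no retry can follow. A 5-minute budget with the same RPC timeout would allow up to 5 attempts. With a 120-second RPC timeout and the 60-second budget, the first attempt is clamped to 60 seconds, matching the reviewer's second concern.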
src/pool/srv_cli.c
Outdated
	int rc = out->pvo_op.po_rc;

	if (rc != 0)
		DL_ERROR(rc, DF_UUID ": pool destroy failed to evict handles", DP_UUID(pool_uuid));
Should this return rc in the error case, rather than proceeding to set *arg->pea_count?
I can't remember whether the failure path still requires the value to be set.
This is again the existing logic. I recall I had the same question and found that ds_pool_evict_handler indeed sets pvo_n_hdls_evicted regardless of rc, perhaps because an error may occur after having evicted some handles? (I don't know the real answer.)
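The pattern under discussion, where the reply's count is copied to the caller even when the operation reports an error, might look like the sketch below. struct evict_out, struct evict_arg, and evict_consume are hypothetical stand-ins rather than the code in srv_cli.c, and the rationale in the comment is the author's guess from this thread, not a confirmed design decision.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the evict reply. */
struct evict_out {
	int      rc;             /* operation result */
	uint32_t n_hdls_evicted; /* handles evicted, valid even when rc != 0 */
};

/* Hypothetical stand-in for the caller-supplied argument. */
struct evict_arg {
	uint32_t *count; /* optional output; may be NULL */
};

static int
evict_consume(const struct evict_out *out, struct evict_arg *arg)
{
	int rc = out->rc;

	if (rc != 0)
		fprintf(stderr, "failed to evict handles: rc=%d\n", rc);
	/*
	 * Copy the count regardless of rc: an error may have occurred after
	 * some handles were already evicted (a guess, per the discussion).
	 */
	if (arg->count != NULL)
		*arg->count = out->n_hdls_evicted;
	return rc;
}
```

Returning early on error would be simpler, but it would discard a count that the handler has already filled in, so the caller could no longer report a partial eviction.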
src/pool/srv_cli.c
Outdated
		.pea_count = count
	};

	D_DEBUG(DB_MGMT, DF_UUID ": Destroy pool (force: %d), inspect/evict handles\n",
The log message should not always say "Destroy"; the wording should depend on the value of the uint32_t destroy argument.
Ah, this was copied from the existing code without modification. Let me fix the ambiguity.
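One way to remove the ambiguity is to pick the verb from the destroy argument before formatting the message. The sketch below is only illustrative: evict_op_name and format_evict_msg are hypothetical helpers, the "Evict" wording for the non-destroy case is an assumption, and plain snprintf stands in for the D_DEBUG macro so that the example is self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Choose the operation name from the destroy flag ("Evict" is an assumed label). */
static const char *
evict_op_name(uint32_t destroy)
{
	return destroy ? "Destroy" : "Evict";
}

/* Format the message into buf; returns the snprintf result. */
static int
format_evict_msg(char *buf, size_t len, const char *uuid_str, uint32_t destroy,
		 uint32_t force)
{
	return snprintf(buf, len, "%s: %s pool (force: %u), inspect/evict handles",
			uuid_str, evict_op_name(destroy), (unsigned int)force);
}
```

With destroy set to 0 the message no longer claims that a pool destroy is in progress, which addresses the reviewer's point without changing the rest of the log line.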
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14036/13/execution/node/1408/log
Signed-off-by: Li Wei <wei.g.li@intel.com> Required-githooks: true
Features: pool Required-githooks: true