DAOS-15420 pool: Clean up ds_pool_svc_<op> #14036

Merged
merged 5 commits into from
May 20, 2024

liw
Contributor

liw commented Mar 21, 2024

Convert

ds_pool_svc_check_evict
ds_pool_svc_query_target
ds_pool_svc_get_prop
ds_pool_svc_set_prop
ds_pool_svc_target_update_state
ds_pool_svc_update_acl
ds_pool_svc_delete_acl
ds_pool_svc_upgrade
ds_pool_extend

to the dsc_pool_svc_call framework, so that they will

  • time out, instead of hanging forever, if PSs are unavailable, and
  • respond much faster in common cases thanks to exponential backoffs.
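The two properties above (a bounded wait instead of hanging, and quick success once a PS becomes reachable again) can be pictured with a minimal deadline-bounded retry loop that uses exponential backoff. This is a sketch, not the dsc_pool_svc_call implementation. The names (`svc_call_with_deadline`, `clock_ops`, `simulate`), the use of `-EAGAIN` to mean "PS unavailable, retry", and the 1 ms initial delay that doubles up to a 1024 ms cap are all assumptions made for illustration. Short early delays let a call complete quickly when a PS is only briefly unavailable, and the deadline check turns an otherwise endless retry loop into a timeout.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical clock/sleep hooks; an engine would use its own clock and ULT sleep. */
struct clock_ops {
	uint64_t (*now_ms)(void *ctx);
	void     (*sleep_ms)(void *ctx, uint64_t ms);
	void      *ctx;
};

/* One attempt against the PS; -EAGAIN means "PS unavailable, retry" (assumed). */
typedef int (*attempt_fn)(void *arg);

#define BACKOFF_INIT_MS 1    /* assumed initial delay */
#define BACKOFF_MAX_MS  1024 /* assumed cap on a single delay */

/*
 * Retry `attempt` with exponential backoff until it returns anything other
 * than -EAGAIN, or give up with -ETIMEDOUT once the next wait would reach
 * `deadline_ms`, instead of retrying forever.
 */
int
svc_call_with_deadline(attempt_fn attempt, void *arg, uint64_t deadline_ms,
		       const struct clock_ops *clk, int *n_attempts)
{
	uint64_t backoff = BACKOFF_INIT_MS;
	uint64_t now;
	int      rc;

	*n_attempts = 0;
	for (;;) {
		(*n_attempts)++;
		rc = attempt(arg);
		if (rc != -EAGAIN)
			return rc; /* success or a non-retryable error */

		now = clk->now_ms(clk->ctx);
		if (now >= deadline_ms || deadline_ms - now <= backoff)
			return -ETIMEDOUT;

		clk->sleep_ms(clk->ctx, backoff);
		backoff = backoff * 2 > BACKOFF_MAX_MS ? BACKOFF_MAX_MS : backoff * 2;
	}
}

/* Simulation: a fake clock and a PS that fails `failures` times, then succeeds. */
struct sim {
	uint64_t now;      /* simulated time in ms */
	int      failures; /* remaining attempts that return -EAGAIN */
};

static uint64_t
sim_now(void *ctx)
{
	return ((struct sim *)ctx)->now;
}

static void
sim_sleep(void *ctx, uint64_t ms)
{
	((struct sim *)ctx)->now += ms;
}

static int
sim_attempt(void *arg)
{
	struct sim *s = arg;

	s->now += 1; /* each attempt costs 1 ms of simulated time */
	return s->failures-- > 0 ? -EAGAIN : 0;
}

int
simulate(int failures, uint64_t deadline_ms, int *n_attempts)
{
	struct sim       s   = {.now = 0, .failures = failures};
	struct clock_ops clk = {sim_now, sim_sleep, &s};

	return svc_call_with_deadline(sim_attempt, &s, deadline_ms, &clk, n_attempts);
}
```

Under these assumptions, `simulate(3, 1000, &n)` succeeds on the fourth attempt, while a PS that never becomes available with a 100 ms deadline makes the loop give up with `-ETIMEDOUT` after seven attempts.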

The req_time variable in dsc_pool_svc_call is part of the operation
identifier, and should therefore retain its value across retries.
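Why req_time must keep its value can be shown with a small hypothetical model. It assumes the PS recognizes a retried request by comparing the request's identifier (here a client ID plus req_time) with those of operations it has already executed. The types and functions (`op_key`, `server_handle`, `run_with_retries`) are invented for this sketch and do not mirror DAOS data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operation identifier: which client sent it, and when it was first issued. */
struct op_key {
	uint64_t client_id; /* stands in for a client UUID */
	uint64_t req_time;  /* should be set once and reused by every retry */
};

#define MAX_OPS 16

/* Hypothetical PS that remembers the keys of operations it has executed. */
struct server {
	struct op_key done[MAX_OPS];
	int           n_done;
	int           executions; /* how many times the operation really ran */
};

static bool
key_eq(const struct op_key *a, const struct op_key *b)
{
	return a->client_id == b->client_id && a->req_time == b->req_time;
}

static void
server_handle(struct server *srv, const struct op_key *key)
{
	for (int i = 0; i < srv->n_done; i++)
		if (key_eq(&srv->done[i], key))
			return; /* duplicate: a real PS would resend the saved result */
	srv->executions++;
	if (srv->n_done < MAX_OPS)
		srv->done[srv->n_done++] = *key;
}

/*
 * Send one operation whose first `lost_replies` replies are lost, so the
 * client retries. If `fresh_time_per_retry` is true, the client wrongly
 * recomputes req_time for every attempt. Returns how many times it ran.
 */
int
run_with_retries(int lost_replies, bool fresh_time_per_retry)
{
	struct server srv = {.n_done = 0, .executions = 0};
	uint64_t      now = 1000; /* simulated time of the first send */
	struct op_key key = {.client_id = 42, .req_time = now};

	for (int attempt = 0; attempt <= lost_replies; attempt++) {
		if (fresh_time_per_retry)
			key.req_time = ++now; /* each attempt now looks like a new op */
		server_handle(&srv, &key);
	}
	return srv.executions;
}
```

In this model, `run_with_retries(2, false)` executes the operation once despite two lost replies, whereas `run_with_retries(2, true)` executes it three times, because each recomputed req_time makes a retry look like a new operation.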

Features: pool

Before requesting gatekeeper:

  • Two review approvals have been obtained, and any prior change requests have been resolved.
  • Testing is complete, and either all tests passed or the PR documents why it should be force-landed and the forced-landing tag is set.
  • The Features: (or Test-tag*) commit pragma was used, or the PR documents why there are no appropriate tags for it.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that the user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask the submitter for a new summary.

github-actions bot commented Mar 21, 2024

Ticket title is 'Clean up engine-side PS client functions like ds_pool_svc_check_evict'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-15420

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14036/1/execution/node/1198/log

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14036/2/testReport/

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14036/3/testReport/

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14036/3/testReport/

liw force-pushed the liw/dsc_pool_svc branch 2 times, most recently from 6920612 to d919d44 April 16, 2024 23:48
@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14036/6/execution/node/1405/log

liw changed the title from "DAOS-15420 pool: Clean up ds_pool_svc_<op> patch 1" to "DAOS-15420 pool: Clean up ds_pool_svc_<op>" Apr 26, 2024
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14036/8/execution/node/1174/log

liw force-pushed the liw/dsc_pool_svc branch 2 times, most recently from 429f293 to 07c867f May 7, 2024 23:36
Features: pool
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14036/11/execution/node/1455/log

@liw
Contributor Author

liw commented May 9, 2024

daos_test/rebuild_simple: DAOS-15431

liw marked this pull request as ready for review May 9, 2024 11:10
liw requested review from a team as code owners May 9, 2024 11:10
liw requested review from kccain and tanabarr May 9, 2024 11:11
tanabarr previously approved these changes May 9, 2024
Contributor

tanabarr left a comment

No problems with this change that I can see.

liw added 2 commits May 11, 2024 10:02
Features: pool
Required-githooks: true
Features: pool
Required-githooks: true
kccain previously approved these changes May 15, 2024
Contributor

kccain left a comment

I have three comments. One is a very minor change request (the evict log message unconditionally uses the word "Destroy"). The others are open questions that may or may not call for changes, depending on the answers.

@@ -244,7 +244,9 @@ drain_pool_target(uuid_t pool_uuid, d_rank_t rank, uint32_t target)
 	addr.pta_target = target;
 	target_list.pta_addrs = &addr;

-	rc = ds_pool_target_update_state(pool_uuid, &out_ranks, &target_list, PO_COMP_ST_DRAIN);
+	rc = dsc_pool_svc_update_target_state(pool_uuid, &out_ranks,
+					      daos_getmtime_coarse() + 60 * 1000, &target_list,
Contributor

If I have this right, with an engine crt_timeout of 60 seconds, this choice of deadline (current time + 60 seconds) leaves no room for any retries. Should the deadline be larger (though perhaps not as high as the 5 minutes it would be if this were a control-plane-initiated operation)?

Conversely, if crt_timeout is intentionally set larger in environments that need it, this 60-second deadline will pull the RPC timeout down to roughly the deadline (current time + 60 seconds).

Should anything be changed, or is it OK for this particular operation?
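The arithmetic behind this question can be checked with a small model. It assumes, as the comment suggests, that each attempt's RPC timeout is the smaller of crt_timeout and the time remaining until the deadline, and that an unresponsive PS makes every attempt run to its full timeout; backoff delays are ignored. The helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed policy: an attempt's RPC timeout is crt_timeout, clamped so that it
 * never runs past the caller's deadline. Returns 0 once the deadline is reached.
 */
uint64_t
attempt_timeout_ms(uint64_t now_ms, uint64_t deadline_ms, uint64_t crt_timeout_ms)
{
	uint64_t remaining;

	if (now_ms >= deadline_ms)
		return 0;
	remaining = deadline_ms - now_ms;
	return remaining < crt_timeout_ms ? remaining : crt_timeout_ms;
}

/*
 * Count the attempts made when the PS never answers, so every attempt runs
 * to its full timeout. Backoff delays between attempts are ignored.
 */
int
attempts_before_deadline(uint64_t deadline_offset_ms, uint64_t crt_timeout_ms)
{
	uint64_t now = 0;
	uint64_t t;
	int      n = 0;

	while ((t = attempt_timeout_ms(now, deadline_offset_ms, crt_timeout_ms)) > 0) {
		n++;
		now += t; /* the attempt consumes its whole (possibly clamped) timeout */
	}
	return n;
}
```

Under these assumptions, a 60 s deadline with a 60 s crt_timeout allows exactly one attempt (no retry), a 5-minute deadline allows five, and a 120 s crt_timeout is clamped to the 60 s remaining before the deadline.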

Contributor Author

To be honest, I don't know the answer at the moment. For now, perhaps it's safer to change this to a higher value, since the existing code retries indefinitely.

	int rc = out->pvo_op.po_rc;

	if (rc != 0)
		DL_ERROR(rc, DF_UUID ": pool destroy failed to evict handles", DP_UUID(pool_uuid));
Contributor

Should this return rc in the error case rather than proceed to set *arg->pea_count?
I can't remember whether the failure path still requires the value to be set.

Contributor Author

This is again the existing logic. I had the same question and found that ds_pool_evict_handler does set pvo_n_hdls_evicted regardless of rc, perhaps because an error may occur after some handles have already been evicted. (I don't know the real answer.)
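The behavior described in this reply can be expressed as a reply-consume callback that logs the failure but still reports the evicted-handle count, since some handles may have been evicted before the error. This is a sketch with simplified, hypothetical struct layouts (the real reply nests `po_rc` under `pvo_op`); `fprintf` stands in for `DL_ERROR`, and returning `rc` afterwards is an assumed design choice, not necessarily what the DAOS code does.

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the evict reply and callback argument (hypothetical layouts). */
struct evict_out {
	int      po_rc;              /* result reported by the PS handler */
	uint32_t pvo_n_hdls_evicted; /* handles evicted, possibly before an error */
};

struct evict_arg {
	uint32_t *pea_count; /* where the caller wants the evicted-handle count */
};

/*
 * Log a failure, but still copy out how many handles were evicted, since the
 * handler may have evicted some before failing. The error is returned so the
 * caller still sees the failure.
 */
int
evict_consume(const struct evict_out *out, struct evict_arg *arg)
{
	int rc = out->po_rc;

	if (rc != 0)
		fprintf(stderr, "pool evict failed: rc=%d\n", rc);
	if (arg->pea_count != NULL)
		*arg->pea_count = out->pvo_n_hdls_evicted;
	return rc;
}
```

With this shape, a caller that only checks the return code sees the error, while a caller interested in partial progress can still read the count.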

		.pea_count = count
	};

	D_DEBUG(DB_MGMT, DF_UUID ": Destroy pool (force: %d), inspect/evict handles\n",
Contributor

The log message probably shouldn't always say "Destroy"; that should depend on the value of the uint32_t destroy argument.

Contributor Author

Ah, this was copied from the existing code without modification. Let me fix the ambiguity.
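One possible way to remove the ambiguity is to pick the wording from the `destroy` argument. The helper below is hypothetical: it uses `snprintf` in place of `D_DEBUG` and takes the pool UUID as an already-formatted string, so it only illustrates the conditional wording, not the actual fix that landed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the inspect/evict debug message so that it only
 * says "Destroy" when the caller is actually destroying the pool.
 * `pool` is an already-formatted pool UUID string.
 */
int
format_evict_msg(char *buf, size_t len, const char *pool, uint32_t destroy, int force)
{
	const char *what = destroy ? "Destroy pool" : "Evict pool";

	return snprintf(buf, len, "%s: %s (force: %d), inspect/evict handles", pool, what, force);
}
```

For example, with `destroy == 0` and a pool string `p1`, the message reads `p1: Evict pool (force: 1), inspect/evict handles`.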

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14036/13/execution/node/1408/log

liw added 2 commits May 17, 2024 17:59
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
Features: pool
Required-githooks: true
liw dismissed stale reviews from kccain and tanabarr via 30ce6cb May 17, 2024 08:59
liw requested review from tanabarr and kccain May 17, 2024 09:00
gnailzenh merged commit b090745 into master May 20, 2024
51 of 52 checks passed
gnailzenh deleted the liw/dsc_pool_svc branch May 20, 2024 13:19