Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-13380 engine: refine tgt_nr check #12405

Merged
merged 10 commits into from
Jul 25, 2023
Merged

DAOS-13380 engine: refine tgt_nr check #12405

merged 10 commits into from
Jul 25, 2023

Conversation

liuxuezhao
Copy link
Contributor

@liuxuezhao liuxuezhao commented Jun 15, 2023

  1. for non-DAOS_TARGET_OVERSUBSCRIBE case fail to start engine if #cores is not enough
  2. for DAOS_TARGET_OVERSUBSCRIBE case allow to force start engine.

Required-githooks: true

Co-authored-by: Tom Nabarro tom.nabarro@intel.com
Signed-off-by: Xuezhao Liu xuezhao.liu@intel.com

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@github-actions
Copy link

github-actions bot commented Jun 15, 2023

Bug-tracker data:
Ticket title is 'engine shall not change requested #targets'
Status is 'In Review'
Labels: 'florence,triaged,usability'
https://daosio.atlassian.net/browse/DAOS-13380

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@tanabarr
Copy link
Contributor

I think there are some associated control-plane changes related to this, as we update the number of targets somewhere in the control plane if the number used is different from the number requested. @liuxuezhao do you mind if I push to this PR or should I create a new one?

@liuxuezhao
Copy link
Contributor Author

liuxuezhao commented Jun 15, 2023

I think there are some associated control-plane changes related to this, as we update the number of targets somewhere in the control plane if the number used is different from the number requested. @liuxuezhao do you mind if I push to this PR or should I create a new one?

Ah, sure, please go ahead, thanks! some details in the ticket https://daosio.atlassian.net/browse/DAOS-13380 @tanabarr

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12405/3/execution/node/1115/log

1. for non-DAOS_TARGET_OVERSUBSCRIBE case
   fail to start engine if #cores is not enough
2. for DAOS_TARGET_OVERSUBSCRIBE case
   allow to force start engine
The #nr_xs_helpers possibly be reduced for either case.

Required-githooks: true

Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ftest LGTM

…r necessary

Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@tanabarr tanabarr requested a review from a team as a code owner June 16, 2023 16:28
@tanabarr tanabarr requested review from mjmac and daltonbohning and removed request for a team June 16, 2023 16:28
tanabarr
tanabarr previously approved these changes Jun 16, 2023
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-12405/5/testReport/(root)/

@tanabarr
Copy link
Contributor

tanabarr commented Jun 20, 2023 via email

@tanabarr
Copy link
Contributor

go unit test failures for previous CI runs seem unrelated, I have created a ticket for them as they seem to be a new issue: https://daosio.atlassian.net/browse/DAOS-13771

Copy link
Contributor

@knard-intel knard-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some points which are not clear for me, probably as I am not used to work on the engine code.

return -DER_INVAL;
}
dss_tgt_offload_xs_nr = ncores - DAOS_TGT0_OFFSET - tgt_nr;
D_PRINT("Start engine with %d targets on %d cores, #nr_xs_helpers set as %d.\n",
Copy link
Contributor

@knard-intel knard-intel Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message should also indicate that the number of helpers have been forced.

I was expecting that updating this value could only be done if oversubscribe is true.


out:
if (dss_tgt_offload_xs_nr % tgt_nr != 0)
dss_helper_pool = true;
Copy link
Contributor

@knard-intel knard-intel Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is some performance impact, it could be useful to indicate that the helper threads will be shared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whether or not helper XS is shared by targets is based on the #helpers and #targets. for example if with 8 tgts and 4 helpers, it will be shared.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I was just wondering that it could be useful for the end user and the support to know if the helper XS is shared or not.
Indeed, this configuration is not explicitly managed by the end user, and thus if there is some performance impact, this information could be useful.

*/
static int
dss_tgt_nr_get(unsigned int ncores, unsigned int nr, bool oversubscribe)
dss_tgt_nr_check(unsigned int ncores, unsigned int tgt_nr, bool oversubscribe)
Copy link
Contributor

@knard-intel knard-intel Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not used to the engine code, but it seems a little bit odd that a check function change the value of the global variables dss_tgt_offload_xs_nr and dss_helper_pool.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm wondering if we should treat nr_offload in the same of how we treat nr_target? I mean should we silently 'correct' user's invalid config or should we explicitly reject the invalid config? Should we apply 'oversubscribe' to nr_offload too?
@Michael-Hennecke , @tanabarr , any thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, just see this comment.
helper threads is not must needed to start daos engine, and start more #helpers than #cores could hurt performance, that is different with #targets, so I select to reduce #helpers if #cores is not enough but #cores are enough to start with #targets.
For example even if DAOS_TARGET_OVERSUBSCRIBE not specified, on a node with 16 cores, user configured with #target=12 and #helpers=12, I think the #target=12 is mandatory but #helpers=12 is not mandatory, so why not just start server with #targets=12 and #helpers=2 in this case? (another 2 cores used for system XS).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the argument above but I would be favour of being explicit and refusing to start with #helpers=12 if there are not enough cores (assuming oversubscribe is not set). Printing explicit error that is easy to understand and forces the user to reduce #helpers and restart the server. This should make behaviour easier to understand and avoid the situation where the user is running an engine unaware that the number of actual helper XSs isn't what they requested in the config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to think that the "nr_xs_helpers" is not a mandatory requirement, that is different with "targets". Imagine the case that #cores is not enough for helpers but #cores are enough to start with #targets, if fails the engine starting it just add some trouble for user to change yaml file and restart. For a system with different nodes that with different #cores, user have to provide different yml files for different node.
we may need to change document (admin/deployment.md) to explain the "nr_xs_helpers" is not a mandatory requirement and possibly be reduced if the hardware core resource is not enough.
How do you think? @tanabarr @Michael-Hennecke

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liuxuezhao sorry for delay in response, I'm on leave. I think that explanation makes sense to me so let's leave it as it is for the moment with the loose requirement special case for nr_xs_helpers. We should get this PR landed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liuxuezhao sorry for delay in response, I'm on leave. I think that explanation makes sense to me so let's leave it as it is for the moment with the loose requirement special case for nr_xs_helpers. We should get this PR landed.

Hi no problem at all, I just refreshed it to add a few other small changes.

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@liuxuezhao liuxuezhao removed the request for review from a team July 17, 2023 01:52
Required-githooks: true
@liuxuezhao liuxuezhao dismissed stale reviews from tanabarr, knard-intel, and NiuYawei via 580844e July 17, 2023 08:09
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

tanabarr
tanabarr previously approved these changes Jul 17, 2023
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12405/12/execution/node/1096/log

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

make the #xs_helpers requirement be mandatory.

Required-githooks: true

Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12405/15/execution/node/1074/log

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@liuxuezhao liuxuezhao requested review from NiuYawei, Michael-Hennecke and a team July 20, 2023 12:56
@liuxuezhao
Copy link
Contributor Author

I would prefer that the #xs_helpers are treated the same way as the #targets: If DAOS_TARGET_OVERSUBSCRIBE is not set, and the sum of targets and helpers exceeds the number of cores on the socket, we should throw an error rather than silently adjusting either #targets or #xs_helpers.

it is important to notify the administrator that something is wrong in the requested server config, and it's easy for the admin to adjust the YML file once they understand the requirements. but silently changing the requested values may lead to obscure and hard-to-find performance behaviour so we should really avoid doing that.

OK, changed as the suggestion. thanks

@gnailzenh gnailzenh merged commit 27b50c1 into master Jul 25, 2023
@gnailzenh gnailzenh deleted the lxz/tgt_nr branch July 25, 2023 08:20
jolivier23 added a commit that referenced this pull request May 8, 2024
Backport for the following patches
DAOS-13380 engine: refine tgt_nr check (#12405)
DAOS-15739 engine: Add multi-socket support (#14234)
DAOS-623 engine: Fix a typo (#14329)

* DAOS-13380 engine: refine tgt_nr check

1. for non-DAOS_TARGET_OVERSUBSCRIBE case
   fail to start engine if #cores is not enough
2. for DAOS_TARGET_OVERSUBSCRIBE case
   allow to force start engine
The #nr_xs_helpers possibly be reduced for either case.

* DAOS-15739 engine: Add multi-socket support (#14234)

Add a simple multi-socket mode for use cases where a single
engine must be used. Avoids the issue of having all helper
xstreams automatically assigned to a single NUMA node thus
increasing efficiency of synchronizations between I/O and
helper xstreams.

It is the default behavior if all of the following are true

Neither pinned_numa_node nor first_core are used.
No oversubscription is requested
NUMA has uniform number of cores
targets and helpers divide evenly among numa nodes
There is more than one numa node
Update server config logic to ensure first_core is passed
on to engine if it's set while keeping existing behavior
when both first_core: 0 and pinned_numa_node are set.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

9 participants