DAOS-13380 engine: refine tgt_nr check #12405
Conversation
LGTM. No errors found by checkpatch.
I think there are some associated control-plane changes related to this, as we update the number of targets somewhere in the control plane if the number used is different from the number requested. @liuxuezhao do you mind if I push to this PR or should I create a new one?
Ah, sure, please go ahead, thanks! Some details are in the ticket https://daosio.atlassian.net/browse/DAOS-13380 @tanabarr
Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12405/3/execution/node/1115/log
1. For the non-DAOS_TARGET_OVERSUBSCRIBE case, fail to start the engine if #cores is not enough.
2. For the DAOS_TARGET_OVERSUBSCRIBE case, allow force-starting the engine.
The #nr_xs_helpers may be reduced in either case.
Required-githooks: true
Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
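For illustration, a standalone sketch of the two paths the commit message describes; `check_tgt_nr` and `SYS_XS_CORES` are invented names for this sketch, not the engine's actual identifiers (the real logic lives in src/engine/init.c):

```c
#include <stdbool.h>
#include <stdio.h>

#define SYS_XS_CORES 2 /* assumed reservation for system xstreams */

static int
check_tgt_nr(unsigned int ncores, unsigned int tgt_nr,
	     unsigned int *helpers, bool oversubscribe)
{
	if (SYS_XS_CORES + tgt_nr + *helpers <= ncores)
		return 0; /* enough cores for targets and helpers */

	/* Case 1: without oversubscribe, too few cores for the targets
	 * themselves is a hard failure. */
	if (!oversubscribe && SYS_XS_CORES + tgt_nr > ncores) {
		fprintf(stderr, "not enough cores (%u) for %u targets\n",
			ncores, tgt_nr);
		return -1; /* stands in for -DER_INVAL */
	}

	/* Case 2: oversubscribed, or the targets still fit: shrink the
	 * helper count instead of failing. */
	*helpers = ncores > SYS_XS_CORES + tgt_nr ?
		   ncores - SYS_XS_CORES - tgt_nr : 0;
	printf("reduced #nr_xs_helpers to %u\n", *helpers);
	return 0;
}

int
main(void)
{
	unsigned int helpers = 12;

	/* 8 cores cannot host 12 targets without oversubscribe. */
	return check_tgt_nr(8, 12, &helpers, false) ? 1 : 0;
}
```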
LGTM. No errors found by checkpatch.
ftest LGTM
…r necessary
Required-githooks: true
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
LGTM. No errors found by checkpatch.
Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-12405/5/testReport/(root)/
I’ve commented, thanks. It has passed the tests now.
@tanabarr thanks for updating the Go part. There are a few test failures in src/control/run_go_tests.sh (log in unit_test_logs/src-control-run_go_tests.sh_61/output.log) that I am not quite familiar with; could you please check at a convenient time?
The Go unit test failures in previous CI runs seem unrelated; I have created a ticket for them as they appear to be a new issue: https://daosio.atlassian.net/browse/DAOS-13771
There are some points which are not clear to me, probably because I am not used to working on the engine code.
src/engine/init.c
Outdated
```c
		return -DER_INVAL;
	}
	dss_tgt_offload_xs_nr = ncores - DAOS_TGT0_OFFSET - tgt_nr;
	D_PRINT("Start engine with %d targets on %d cores, #nr_xs_helpers set as %d.\n",
```
The message should also indicate that the number of helpers has been forced.
I was expecting that updating this value could only be done if oversubscribe is true.
```c
out:
	if (dss_tgt_offload_xs_nr % tgt_nr != 0)
		dss_helper_pool = true;
```
If there is some performance impact, it could be useful to indicate that the helper threads will be shared.
Whether or not a helper XS is shared by targets is based on #helpers and #targets. For example, with 8 tgts and 4 helpers, it will be shared.
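As a rough sketch of that rule, mirroring the `dss_tgt_offload_xs_nr % tgt_nr != 0` test in the diff above (`helpers_shared` is an invented name; the engine sets the global `dss_helper_pool` instead of returning a value):

```c
#include <stdbool.h>
#include <stdio.h>

static bool
helpers_shared(unsigned int tgt_nr, unsigned int helper_nr)
{
	/* Helpers that do not divide evenly per target are pooled and
	 * shared, rather than dedicated one-per-target. */
	return helper_nr % tgt_nr != 0;
}

int
main(void)
{
	/* 8 targets, 4 helpers: 4 % 8 != 0, so the pool is shared. */
	printf("8 tgts, 4 helpers -> shared: %d\n", helpers_shared(8, 4));
	/* 8 targets, 8 helpers: one dedicated helper per target. */
	printf("8 tgts, 8 helpers -> shared: %d\n", helpers_shared(8, 8));
	return 0;
}
```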
In fact, I was just wondering whether it could be useful for the end user and for support to know if the helper XS is shared or not.
Indeed, this configuration is not explicitly managed by the end user, so if there is some performance impact, this information could be useful.
```diff
  */
 static int
-dss_tgt_nr_get(unsigned int ncores, unsigned int nr, bool oversubscribe)
+dss_tgt_nr_check(unsigned int ncores, unsigned int tgt_nr, bool oversubscribe)
```
I am not used to the engine code, but it seems a little odd that a check function changes the values of the global variables dss_tgt_offload_xs_nr and dss_helper_pool.
Yeah, I'm wondering if we should treat nr_offload in the same way as we treat nr_target? I mean, should we silently 'correct' the user's invalid config, or should we explicitly reject it? Should we apply 'oversubscribe' to nr_offload too?
@Michael-Hennecke, @tanabarr, any thoughts?
Ah, just saw this comment.
Helper threads are not strictly required to start the DAOS engine, and starting more #helpers than there are #cores could hurt performance; that is different from #targets. So I chose to reduce #helpers when #cores is not enough for the helpers but is enough to start the configured #targets.
For example, even if DAOS_TARGET_OVERSUBSCRIBE is not specified, on a node with 16 cores configured with #targets=12 and #helpers=12, I think #targets=12 is mandatory but #helpers=12 is not, so why not just start the server with #targets=12 and #helpers=2 in this case? (The other 2 cores are used for system XS.)
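Worked numerically, this is a sketch assuming the system-XS reservation is the 2 cores mentioned in the parenthesis above; the variables are local stand-ins, not engine globals:

```c
#include <stdio.h>

int
main(void)
{
	unsigned int ncores = 16, tgt_nr = 12, sys_xs = 2;
	/* 16 - 2 - 12 = 2 helpers remain after system XS and targets,
	 * mirroring "ncores - DAOS_TGT0_OFFSET - tgt_nr" in the diff. */
	unsigned int helpers = ncores - sys_xs - tgt_nr;

	printf("start with #targets=%u, #helpers=%u\n", tgt_nr, helpers);
	return 0;
}
```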
I understand the argument above, but I would be in favour of being explicit and refusing to start with #helpers=12 if there are not enough cores (assuming oversubscribe is not set), printing an explicit error that is easy to understand and forces the user to reduce #helpers and restart the server. This would make the behaviour easier to understand and avoid the situation where the user is running an engine unaware that the actual number of helper XSs isn't what they requested in the config.
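A minimal sketch of the strict behaviour proposed here, under the same illustrative 2-core system-XS assumption as the earlier sketches (plain -1 stands in for -DER_INVAL; `check_strict` is an invented name):

```c
#include <stdbool.h>
#include <stdio.h>

static int
check_strict(unsigned int ncores, unsigned int tgt_nr,
	     unsigned int helper_nr, bool oversubscribe)
{
	if (!oversubscribe && 2 /* sys XS */ + tgt_nr + helper_nr > ncores) {
		fprintf(stderr,
			"%u cores cannot host %u targets + %u helpers; "
			"reduce nr_xs_helpers in the server config\n",
			ncores, tgt_nr, helper_nr);
		return -1; /* refuse to start instead of shrinking */
	}
	return 0;
}

int
main(void)
{
	/* 16 cores, 12 targets, 12 helpers: rejected outright. */
	return check_strict(16, 12, 12, false) ? 1 : 0;
}
```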
I tend to think that "nr_xs_helpers" is not a mandatory requirement, which is different from "targets". Imagine the case where #cores is not enough for the helpers but is enough to start the configured #targets; failing the engine start just adds trouble for the user, who must change the yaml file and restart. On a system whose nodes have different #cores, the user would have to provide a different yml file for each node.
We may need to change the documentation (admin/deployment.md) to explain that "nr_xs_helpers" is not a mandatory requirement and may be reduced if the hardware core resources are not enough.
What do you think? @tanabarr @Michael-Hennecke
@liuxuezhao sorry for the delay in response, I'm on leave. That explanation makes sense to me, so let's leave it as it is for the moment with the loose-requirement special case for nr_xs_helpers. We should get this PR landed.
Fair enough for me.
> @liuxuezhao sorry for the delay in response, I'm on leave. That explanation makes sense to me, so let's leave it as it is for the moment with the loose-requirement special case for nr_xs_helpers. We should get this PR landed.

Hi, no problem at all, I just refreshed it to add a few other small changes.
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Required-githooks: true
580844e
LGTM. No errors found by checkpatch.
Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12405/12/execution/node/1096/log
LGTM. No errors found by checkpatch.
Make the #xs_helpers requirement mandatory.
Required-githooks: true
Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
LGTM. No errors found by checkpatch.
Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12405/15/execution/node/1074/log
LGTM. No errors found by checkpatch.
OK, changed as suggested. Thanks.
Backport of the following patches:
- DAOS-13380 engine: refine tgt_nr check (#12405)
- DAOS-15739 engine: Add multi-socket support (#14234)
- DAOS-623 engine: Fix a typo (#14329)

* DAOS-13380 engine: refine tgt_nr check
1. For the non-DAOS_TARGET_OVERSUBSCRIBE case, fail to start the engine if #cores is not enough.
2. For the DAOS_TARGET_OVERSUBSCRIBE case, allow force-starting the engine.
The #nr_xs_helpers may be reduced in either case.

* DAOS-15739 engine: Add multi-socket support (#14234)
Add a simple multi-socket mode for use cases where a single engine must be used. This avoids the issue of having all helper xstreams automatically assigned to a single NUMA node, thus increasing the efficiency of synchronization between I/O and helper xstreams. It is the default behavior if all of the following are true:
- Neither pinned_numa_node nor first_core is used.
- No oversubscription is requested.
- NUMA nodes have a uniform number of cores.
- Targets and helpers divide evenly among NUMA nodes.
- There is more than one NUMA node.
Update the server config logic to ensure first_core is passed on to the engine if it is set, while keeping existing behavior when both first_core: 0 and pinned_numa_node are set.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Required-githooks: true
Co-authored-by: Tom Nabarro <tom.nabarro@intel.com>
Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
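As a hedged sketch of the multi-socket eligibility conditions listed in the backport message (the struct fields and use_multi_socket are invented for illustration and are not the engine's actual API):

```c
#include <stdbool.h>
#include <stdio.h>

struct engine_topo {
	bool	 pinned_numa_node_set;
	bool	 first_core_set;
	bool	 oversubscribe;
	bool	 uniform_cores;	/* same core count on every NUMA node */
	unsigned nr_numa;	/* number of NUMA nodes */
	unsigned tgt_nr;
	unsigned helper_nr;
};

/* Multi-socket mode is the default only when every listed condition
 * holds; otherwise the engine keeps the single-NUMA behaviour. */
static bool
use_multi_socket(const struct engine_topo *t)
{
	return !t->pinned_numa_node_set && !t->first_core_set &&
	       !t->oversubscribe && t->uniform_cores && t->nr_numa > 1 &&
	       t->tgt_nr % t->nr_numa == 0 &&
	       t->helper_nr % t->nr_numa == 0;
}

int
main(void)
{
	struct engine_topo t = { false, false, false, true, 2, 16, 4 };

	/* 2 NUMA nodes, uniform cores, 16 targets and 4 helpers divide
	 * evenly: multi-socket mode would be the default. */
	printf("multi-socket: %d\n", use_multi_socket(&t));
	return 0;
}
```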
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: