DAOS-13380 engine: refine tgt_nr check #12405

liuxuezhao · 2023-06-15T10:04:40Z

1. If DAOS_TARGET_OVERSUBSCRIBE is not set, fail to start the engine when there are not enough cores.
2. If DAOS_TARGET_OVERSUBSCRIBE is set, allow the engine to be force-started.

Required-githooks: true

Co-authored-by: Tom Nabarro <tom.nabarro@intel.com>
Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follows the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

github-actions · 2023-06-15T10:04:57Z

Bug-tracker data:
Ticket title is 'engine shall not change requested #targets'
Status is 'In Review'
Labels: 'florence,triaged,usability'
https://daosio.atlassian.net/browse/DAOS-13380

daosbuild1

LGTM. No errors found by checkpatch.

daosbuild1

LGTM. No errors found by checkpatch.

daosbuild1

LGTM. No errors found by checkpatch.

tanabarr · 2023-06-15T13:53:45Z

I think there are some associated control-plane changes related to this, as we update the number of targets somewhere in the control plane if the number used is different from the number requested. @liuxuezhao do you mind if I push to this PR or should I create a new one?

liuxuezhao · 2023-06-15T14:07:55Z

Ah, sure, please go ahead, thanks! Some details are in the ticket https://daosio.atlassian.net/browse/DAOS-13380 @tanabarr

daosbuild1 · 2023-06-15T18:09:03Z

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12405/3/execution/node/1115/log

1. for non-DAOS_TARGET_OVERSUBSCRIBE case fail to start engine if #cores is not enough
2. for DAOS_TARGET_OVERSUBSCRIBE case allow to force start engine

The #nr_xs_helpers possibly be reduced for either case.

Required-githooks: true

Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>

daosbuild1

LGTM. No errors found by checkpatch.

daltonbohning

ftest LGTM

…r necessary Required-githooks: true Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

daosbuild1

LGTM. No errors found by checkpatch.

daosbuild1 · 2023-06-16T17:05:32Z

Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-12405/5/testReport/(root)/

tanabarr · 2023-06-20T13:22:31Z

I've commented, thanks. It passes the tests now.

In reply to liuxuezhao (Sunday, June 18, 2023 12:00 PM):
@tanabarr thanks for updating the Go part. There are a few test failures in src/control/run_go_tests.sh (log in unit_test_logs/src-control-run_go_tests.sh_61/output.log). I am not very familiar with it; could you please check when convenient?

tanabarr · 2023-06-20T13:22:53Z

The Go unit test failures in previous CI runs seem unrelated. I have created a ticket for them, as they appear to be a new issue: https://daosio.atlassian.net/browse/DAOS-13771

knard-intel

There are some points that are not clear to me, probably because I am not used to working on the engine code.

knard-intel · 2023-06-21T06:30:18Z

src/engine/init.c

+			return -DER_INVAL;
+		}
+		dss_tgt_offload_xs_nr = ncores - DAOS_TGT0_OFFSET - tgt_nr;
+		D_PRINT("Start engine with %d targets on %d cores, #nr_xs_helpers set as %d.\n",


The message should also indicate that the number of helpers has been forced.

I was expecting that updating this value could only be done if oversubscribe is true.

knard-intel · 2023-06-21T06:36:19Z

src/engine/init.c


+out:
 	if (dss_tgt_offload_xs_nr % tgt_nr != 0)
 		dss_helper_pool = true;


If there is some performance impact, it could be useful to indicate that the helper threads will be shared.

liuxuezhao · Jun 25, 2023

Whether or not a helper XS is shared by targets depends on #helpers and #targets. For example, with 8 targets and 4 helpers, the helpers will be shared.

knard-intel · Jul 4, 2023

In fact, I was just wondering whether it would be useful for the end user and for support to know whether the helper XS is shared or not.
This configuration is not explicitly managed by the end user, so if there is some performance impact, this information could be useful.
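To make the sharing rule in this thread concrete, here is a minimal sketch of the condition from the hunk above. Only the modulo test comes from the diff; the function name `helper_pool_is_shared` is hypothetical, and the real code sets the global `dss_helper_pool` rather than returning a value.

```c
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the condition in the hunk above:
 * helpers are treated as a shared pool whenever the helper count is
 * not an exact multiple of the target count. Assumes tgt_nr > 0.
 */
static bool
helper_pool_is_shared(unsigned int helper_nr, unsigned int tgt_nr)
{
	return helper_nr % tgt_nr != 0;
}
```

With 8 targets and 4 helpers, `4 % 8` is 4, so the helpers are shared, which matches the example given in the reply. With 8 targets and 8 or 16 helpers the remainder is 0 and each target gets a fixed number of helpers.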

knard-intel · 2023-06-21T06:37:05Z

src/engine/init.c

 */
 static int
-dss_tgt_nr_get(unsigned int ncores, unsigned int nr, bool oversubscribe)
+dss_tgt_nr_check(unsigned int ncores, unsigned int tgt_nr, bool oversubscribe)


I am not used to the engine code, but it seems a little odd that a check function changes the values of the global variables dss_tgt_offload_xs_nr and dss_helper_pool.

NiuYawei · Jun 25, 2023

Yeah, I'm wondering if we should treat nr_offload the same way we treat nr_target. I mean, should we silently 'correct' the user's invalid config, or should we explicitly reject it? Should we apply 'oversubscribe' to nr_offload too?
@Michael-Hennecke, @tanabarr, any thoughts?

liuxuezhao · Jun 25, 2023

Ah, I just saw this comment.
Helper threads are not strictly required to start the DAOS engine, and starting more helpers than there are cores could hurt performance. That is different from #targets, so I chose to reduce #helpers when there are not enough cores for the helpers but enough cores to start with the requested #targets.
For example, even if DAOS_TARGET_OVERSUBSCRIBE is not specified, on a node with 16 cores where the user configured #targets=12 and #helpers=12, I think #targets=12 is mandatory but #helpers=12 is not. So why not just start the server with #targets=12 and #helpers=2 in this case? (The other 2 cores are used for system XS.)
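The arithmetic in this example can be sketched as follows. This reflects the reduction proposed at this point in the review, not necessarily the merged behaviour. `DAOS_TGT0_OFFSET` appears in the diff above, but its value of 2 is an assumption inferred from the "2 cores used for system XS" remark, and `reduced_helper_nr` is a hypothetical name.

```c
/* Assumed value: the comment above reserves 2 cores for system XS. */
#define DAOS_TGT0_OFFSET 2

/*
 * Hypothetical sketch of the proposed reduction: keep the requested
 * targets and trim the helpers to whatever cores are left over.
 * Returns -1 if even the targets do not fit on the available cores.
 */
static int
reduced_helper_nr(unsigned int ncores, unsigned int tgt_nr, unsigned int helper_nr)
{
	unsigned int avail;

	if (ncores < DAOS_TGT0_OFFSET + tgt_nr)
		return -1;

	avail = ncores - DAOS_TGT0_OFFSET - tgt_nr;
	return (int)(helper_nr < avail ? helper_nr : avail);
}
```

For the case in the comment, `reduced_helper_nr(16, 12, 12)` leaves `16 - 2 - 12 = 2` cores for helpers, so the engine would start with 12 targets and 2 helpers.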

tanabarr · Jun 26, 2023

I understand the argument above, but I would be in favour of being explicit and refusing to start with #helpers=12 if there are not enough cores (assuming oversubscribe is not set), printing an explicit, easy-to-understand error that forces the user to reduce #helpers and restart the server. This should make the behaviour easier to understand and avoid the situation where the user is running an engine unaware that the actual number of helper XSs isn't what they requested in the config.

liuxuezhao · Jun 27, 2023

I tend to think that "nr_xs_helpers" is not a mandatory requirement, unlike "targets". Consider the case where there are not enough cores for the helpers but enough to start with the requested #targets: failing the engine start just forces the user to change the yaml file and restart. For a system whose nodes have different numbers of cores, the user would have to provide a different yaml file for each kind of node.
We may need to change the documentation (admin/deployment.md) to explain that "nr_xs_helpers" is not a mandatory requirement and may be reduced if there are not enough hardware cores.
What do you think? @tanabarr @Michael-Hennecke

tanabarr · Jul 4, 2023

@liuxuezhao sorry for the delay in responding, I'm on leave. That explanation makes sense to me, so let's leave it as it is for the moment, with the loose-requirement special case for nr_xs_helpers. We should get this PR landed.

knard-intel · Jul 4, 2023

Fair enough for me.

liuxuezhao · Jul 12, 2023

Hi, no problem at all. I just refreshed it to add a few other small changes.

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

Required-githooks: true

daosbuild1

LGTM. No errors found by checkpatch.

daosbuild1 · 2023-07-17T10:50:19Z

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12405/12/execution/node/1096/log

daosbuild1

LGTM. No errors found by checkpatch.

make the #xs_helpers requirement be mandatory.

Required-githooks: true

Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>

daosbuild1

LGTM. No errors found by checkpatch.

daosbuild1

LGTM. No errors found by checkpatch.

daosbuild1 · 2023-07-18T11:29:02Z

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12405/15/execution/node/1074/log

daosbuild1

LGTM. No errors found by checkpatch.

liuxuezhao · 2023-07-20T12:57:38Z

I would prefer that the #xs_helpers are treated the same way as the #targets: If DAOS_TARGET_OVERSUBSCRIBE is not set, and the sum of targets and helpers exceeds the number of cores on the socket, we should throw an error rather than silently adjusting either #targets or #xs_helpers.

It is important to notify the administrator that something is wrong in the requested server config, and it is easy for the admin to adjust the YML file once they understand the requirements. Silently changing the requested values may lead to obscure and hard-to-find performance behaviour, so we should really avoid doing that.

OK, changed as suggested. Thanks.
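The rule agreed on here can be sketched as below. This is only an illustration of the policy described in the comment, not the code that landed: the function name, the error stand-in, and the messages are hypothetical, and counting `DAOS_TGT0_OFFSET` system cores (assumed to be 2) alongside targets and helpers is an assumption carried over from the earlier diff hunk.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed value: cores reserved for system XS. */
#define DAOS_TGT0_OFFSET 2
/* Stand-in for the engine's -DER_INVAL; the numeric value is arbitrary. */
#define SKETCH_ERR_INVAL (-1)

/*
 * Hypothetical strict check: targets and helpers are both mandatory.
 * Without oversubscribe, a config that needs more cores than are
 * available is rejected; with oversubscribe, it is allowed with a warning.
 * Nothing is silently adjusted in either case.
 */
static int
tgt_nr_check_sketch(unsigned int ncores, unsigned int tgt_nr,
		    unsigned int helper_nr, bool oversubscribe)
{
	unsigned int need = DAOS_TGT0_OFFSET + tgt_nr + helper_nr;

	if (need <= ncores)
		return 0;

	if (!oversubscribe) {
		fprintf(stderr, "need %u cores (%u targets, %u helpers) but only %u available\n",
			need, tgt_nr, helper_nr, ncores);
		return SKETCH_ERR_INVAL;
	}

	fprintf(stderr, "oversubscribing: %u cores requested on %u available\n", need, ncores);
	return 0;
}
```

Under these assumptions, the 16-core example from earlier in the thread (12 targets, 12 helpers) is rejected unless oversubscription is requested, while 12 targets with 2 helpers fits exactly.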

Backport for the following patches

DAOS-13380 engine: refine tgt_nr check (#12405)
DAOS-15739 engine: Add multi-socket support (#14234)
DAOS-623 engine: Fix a typo (#14329)

* DAOS-13380 engine: refine tgt_nr check

1. for non-DAOS_TARGET_OVERSUBSCRIBE case fail to start engine if #cores is not enough
2. for DAOS_TARGET_OVERSUBSCRIBE case allow to force start engine

The #nr_xs_helpers possibly be reduced for either case.

* DAOS-15739 engine: Add multi-socket support (#14234)

Add a simple multi-socket mode for use cases where a single engine must be used. Avoids the issue of having all helper xstreams automatically assigned to a single NUMA node thus increasing efficiency of synchronizations between I/O and helper xstreams.

It is the default behavior if all of the following are true
- Neither pinned_numa_node nor first_core are used.
- No oversubscription is requested
- NUMA has uniform number of cores
- targets and helpers divide evenly among numa nodes
- There is more than one numa node

Update server config logic to ensure first_core is passed on to engine if it's set while keeping existing behavior when both first_core: 0 and pinned_numa_node are set.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
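The conditions listed in the DAOS-15739 commit message can be read as a single predicate. The sketch below is only that reading: the struct, its field names, and the function name are hypothetical and are not taken from the engine source.

```c
#include <stdbool.h>

/* Hypothetical view of the relevant inputs; field names are illustrative only. */
struct engine_layout {
	bool         pinned_numa_node_set; /* pinned_numa_node given in the config */
	bool         first_core_set;       /* first_core given in the config */
	bool         oversubscribe;        /* oversubscription requested */
	bool         numa_uniform;         /* every NUMA node has the same core count */
	unsigned int numa_nr;              /* number of NUMA nodes */
	unsigned int tgt_nr;               /* requested targets */
	unsigned int helper_nr;            /* requested helper xstreams */
};

/* Returns true when all conditions from the commit message hold. */
static bool
multi_socket_default(const struct engine_layout *l)
{
	if (l->pinned_numa_node_set || l->first_core_set)
		return false;
	if (l->oversubscribe || !l->numa_uniform)
		return false;
	if (l->numa_nr < 2)
		return false;

	return l->tgt_nr % l->numa_nr == 0 && l->helper_nr % l->numa_nr == 0;
}
```

For instance, 8 targets and 4 helpers on two uniform NUMA nodes with no pinning would satisfy the predicate, while the same request with 3 helpers would not, because the helpers no longer divide evenly.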

daosbuild1 reviewed Jun 15, 2023

liuxuezhao force-pushed the lxz/tgt_nr branch from bdbfe69 to 28b79fe Compare June 15, 2023 10:10

daosbuild1 reviewed Jun 15, 2023

liuxuezhao force-pushed the lxz/tgt_nr branch from 28b79fe to 32584f4 Compare June 15, 2023 12:29

daosbuild1 reviewed Jun 15, 2023

liuxuezhao force-pushed the lxz/tgt_nr branch from 32584f4 to c5ce565 Compare June 16, 2023 01:43

liuxuezhao requested a review from a team as a code owner June 16, 2023 01:43

daosbuild1 reviewed Jun 16, 2023

daltonbohning reviewed Jun 16, 2023

updating target count based on allocated verses requested is no longe…

de91aa4

…r necessary Required-githooks: true Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

tanabarr requested a review from a team as a code owner June 16, 2023 16:28

tanabarr requested review from mjmac and daltonbohning and removed request for a team June 16, 2023 16:28

tanabarr previously approved these changes Jun 16, 2023

daosbuild1 reviewed Jun 16, 2023

liuxuezhao requested review from NiuYawei and frostedcmos June 20, 2023 00:48

tanabarr requested a review from knard-intel June 20, 2023 09:23

knard-intel reviewed Jun 21, 2023

Merge branch 'master' into lxz/tgt_nr

dbc6585

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

tanabarr dismissed their stale review via dbc6585 June 23, 2023 10:03

liuxuezhao removed the request for review from a team July 17, 2023 01:52

Merge branch 'master' into lxz/tgt_nr

2b7c10f

Required-githooks: true

liuxuezhao dismissed stale reviews from tanabarr, knard-intel, and NiuYawei via 580844e July 17, 2023 08:09

daosbuild1 reviewed Jul 17, 2023

tanabarr previously approved these changes Jul 17, 2023

liuxuezhao dismissed tanabarr’s stale review via 0db5f4f July 17, 2023 15:47

liuxuezhao force-pushed the lxz/tgt_nr branch from 580844e to 0db5f4f Compare July 17, 2023 15:47

daosbuild1 reviewed Jul 17, 2023

DAOS-13380 engine: address comment

3d3ac8c

make the #xs_helpers requirement be mandatory.

Required-githooks: true

Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>

liuxuezhao force-pushed the lxz/tgt_nr branch from 0db5f4f to 3d3ac8c Compare July 18, 2023 07:00

daosbuild1 reviewed Jul 18, 2023

daosbuild1 reviewed Jul 19, 2023

tanabarr approved these changes Jul 20, 2023

knard-intel approved these changes Jul 20, 2023

liuxuezhao requested review from NiuYawei, Michael-Hennecke and a team July 20, 2023 12:56

liuxuezhao removed request for daltonbohning and frostedcmos July 24, 2023 07:03

gnailzenh approved these changes Jul 25, 2023

gnailzenh merged commit 27b50c1 into master Jul 25, 2023

gnailzenh deleted the lxz/tgt_nr branch July 25, 2023 08:20

jolivier23 mentioned this pull request May 3, 2024

DAOS-15739 engine: Add single-engine, multi-socket support #14311

Merged

18 tasks
