DAOS-8331 client: add client side metrics #13517
Conversation
Bug-tracker data:
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13517/1/execution/node/792/log
Force-pushed from 148f294 to 2f9961a.
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13517/2/execution/node/794/log
This is failing with a segfault in pydaos. As a start you could disable this test by commenting out the two lines at https://github.com/daos-stack/daos/blob/master/utils/node_local_test.py#L6031-L6032, which would allow the rest of the testing to proceed and may give some indication of the cause.
Force-pushed from 2f9961a to dd93dd8.
Thanks! @ashleypittman
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13517/3/execution/node/794/log
Force-pushed from dd93dd8 to 63a3f72.
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13517/4/execution/node/794/log
		ops |= D_TM_ITER_RESET;
		iter_cb = iter_reset;
		break;
	case 'j':
Please update the usage text in print_usage() to include this new flag.
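For illustration only, the addition to print_usage() could look roughly like this; the helper name and option wording are assumptions, not the PR's actual code:

#include <stdio.h>

/* Hypothetical usage text for the new flag; wording is illustrative only */
static void
print_jobid_usage(void)
{
	printf("    --jobid, -j <jobid>\n"
	       "      Display metrics only for the given client job ID\n");
}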
Sure
	iter_arg.iter_cb = iter_cb;
	iter_arg.delay = delay;
	iter_arg.num_iter = num_iter;
	d_tm_iterate(ctx, root, 0, D_TM_DIRECTORY, NULL, format, extra_descriptors,
I played around with the PR on my development node where I'm running the server and some client processes. When I run daos_metrics -j, I get the engine metrics mixed in and the client metrics are all zero. I wonder if it's because the wrong context is being used in the iterator?
Did you provide the correct jobid after -j? Hmm, the iteration walks all pids under root/jobid and then iterates each pid's metrics; see d_tm_iterate->iter_per_pid->process_metrics(). It should only include client object metrics for the moment.
I tried this on my side with multiple jobids, and it works here.
Force-pushed from 63a3f72 to d195131.
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13517/5/execution/node/794/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13517/6/execution/node/1111/log
-1 for the namespace pollution.
If this is creating shared memory on the client then we need some documentation or user education on how to manage this and how to detect and clean up leaked resources, even if that is just "the agent does it".
src/client/api/metrics.c
Outdated
bool client_metric;
bool client_metric_retain;
These (and many others) should be static if possible or renamed with a daos_ prefix if not. There are much stricter naming requirements for symbols in client libraries than there are for the server.
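For illustration, a minimal sketch of the two options (the variable names here are just placeholders):

#include <stdbool.h>

/* Option 1: keep the symbol private to a single compilation unit */
static bool client_metric;

/* Option 2: if it must be shared across the library, give it a daos_
 * prefix so it cannot collide with symbols in end-user applications */
bool daos_client_metric;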
sure
src/client/api/metrics.c
Outdated
rc = d_tm_init(DC_TM_JOB_ROOT_ID, MAX_IDS_SIZE(INIT_JOB_NUM), metrics_tag);
if (rc != 0) {
	D_ERROR("init job root id %u: %d\n", DC_TM_JOB_ROOT_ID, rc);
I won't comment on all of these, but this one in particular looks like we should log the failure properly.
D_ERROR("init job root id %u: %d\n", DC_TM_JOB_ROOT_ID, rc); | |
DL_ERROR(rc, "init job root id %u", DC_TM_JOB_ROOT_ID); |
sure
src/client/api/metrics.c
Outdated
rc = d_tm_add_metric(&job_node, D_TM_DIRECTORY,
		     "job id directory", "dir",
		     "%s/%u", dc_jobid, pid);
/* Close job root sheme */
These codespell annotations all need to be fixed.
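For example, the misspelling codespell presumably flags in the snippet above would simply become:

/* Close job root scheme */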
sure
src/common/tls.c
Outdated
#include <daos/tls.h>

/* The array remember all of registered module keys on one node. */
struct daos_module_key *daos_module_keys[DAOS_MODULE_KEYS_NR] = { NULL };
Should be static again, and you don't need to initialise it.
Suggested change:
- struct daos_module_key *daos_module_keys[DAOS_MODULE_KEYS_NR] = { NULL };
+ static struct daos_module_key *daos_module_keys[DAOS_MODULE_KEYS_NR];
sure
src/common/tls.c
Outdated
/**
 * Init thread context
 *
 * \param[in]dtls	Init the thread context to allocate the
I don't think doxygen will like this; you need to adjust the whitespace. This applies elsewhere as well.
Suggested change:
- * \param[in]dtls	Init the thread context to allocate the
+ * \param[in]	dtls	Init the thread context to allocate the
sure
src/gurt/telemetry.c
Outdated
D_INFO("Delete share memory for id %u destory %s\n", tm_shmem.id,
There's a DF_BOOL macro to use here rather than a yes/no ternary.
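On the quoted line that would look roughly like this, assuming DF_BOOL/DP_BOOL follow the usual DF_*/DP_* logging pattern (the retain field name is a placeholder):

D_INFO("Delete share memory for id %u destroy " DF_BOOL "\n",
       tm_shmem.id, DP_BOOL(tm_shmem.retain));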
sure
src/include/daos/metric.h
Outdated
We don't require all our code to be well formatted with clang-format yet, but for new public header files we really should. I can push an update to this PR for just these new files if you would like; alternatively it should just be a case of running clang-format -i on them.
src/include/daos/tls.h
Outdated
#define METRIC_DUMP_ENV "DAOS_METRIC_DUMP_ENV"
#define DAOS_CLIENT_METRICS_ENV "DAOS_CLIENT_METRICS"
#define DAOS_CLIENT_METRICS_RETAIN_ENV "DAOS_CLIENT_METRICS_RETAIN"
extern bool client_metric;
You can't have a variable called "client_metric" in a shared library that you're linking with end-user applications.
sure
src/include/daos/tls.h
Outdated
}

/* For now TLS is only enabled if metrics are enabled */
#define METRIC_DUMP_ENV "DAOS_METRIC_DUMP_ENV"
Sorry, I forgot to mention this on my other review... Suggest that this env variable should be something more like "DAOS_CLIENT_METRICS_FILE" for consistency.
yeah, sure.
Force-pushed from 82a786e to 1e874d0.
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13517/8/execution/node/335/log
Yes, you can test it with a single client and multiple threads, and the metrics directory is created per client here. @daltonbohning
@wangdi1 Two issues with the latest.
If this code is complicated or problematic then we probably need CI tests to verify.
Could you please delete those shared memory segments before running the job, e.g. with ipcrm -a on all client nodes? Ah, there is a known shmem size calculation issue if you run too many ranks per node. I think @mjmac pointed this out before, but we decided to resolve it later in a follow-on patch. Could you please run this with fewer ranks per client, like <= 16 ranks per client? @daltonbohning I will see what I can do here. Thanks.
Does it matter to run
Even if it works, the performance numbers won't be meaningful because we typically run with 32 ranks per client. IMO, we should resolve the synchronization issues before running on Frontera. It's time-consuming to test small changes like this on Frontera, and we're running v2.4.2 testing now. Let me try to get a reproducer on boro so we can debug this more quickly.
And if I can get a reproducer on boro, I can help with a CI test.
Different issue on boro. I get past daos_init and ior or mdtest runs okay. But sometimes during cleanup (maybe daos_fini?) I see
Only with this D_CLIENT_METRICS_ENABLE=1 setting? How many clients and ranks per client in your run? @daltonbohning Thanks.
Only with D_CLIENT_METRICS_ENABLE=1.
Yes, it will help to resolve this shmem calculation issue. Also, the updated PR actually changes the shmem format a bit, so if you do not delete the old shmem it might cause a hang. I will play with larger scale on boro and see how it goes. Thanks.
@wangdi1 So far I can't reproduce the hangs or segfaults on boro, even with 144 ranks on one client. I'll queue some more jobs on Frontera, but I'm not sure when they'll finish. We're running 2.4.2 testing and Frontera will be unavailable Friday through Tuesday.
if (d_list_empty(&shmem->sh_subregions))
	cur = (d_list_t *)(shmem->sh_base_addr +
			   (uint64_t)(&((struct d_tm_shmem_hdr *)(0))->sh_subregions));
I don't follow what this change is doing. This is a pointer to the list head, right? If the list is empty, isn't head->next == head?
The list is initialized by D_INIT_LIST_HEAD(&header->sh_subregions), so when it is empty the head holds a process-local address rather than a shmem address, if I understand correctly.
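Restated as a sketch, using the types and fields from the snippet above (not the PR's exact code): when the list is empty, head->next holds the process-local address of the head itself, so the change instead computes the address of the embedded list head from the region's base address plus the member offset:

#include <stddef.h>

static d_list_t *
subregions_shmem_addr(struct d_tm_shmem_hdr *shmem)
{
	/* same value as (uint64_t)&((struct d_tm_shmem_hdr *)0)->sh_subregions
	 * in the diff above, written with offsetof() */
	return (d_list_t *)(shmem->sh_base_addr +
			    offsetof(struct d_tm_shmem_hdr, sh_subregions));
}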
src/gurt/telemetry.c
Outdated
@@ -2556,10 +2572,13 @@ d_tm_add_ephemeral_dir(struct d_tm_node_t **node, size_t size_bytes,
	if (rc != 0)
		D_GOTO(fail, rc);

	if (tm_shmem.ephemeral_dir_lock)
Are we only concerned with the local setting, and not the flag in the shmem header? If we don't want each shmem region to potentially have its own settings, we should only have the flag in one place.
@wangdi1, @kjacque: Can we please arrive at a consensus on how to get this landed? I have two follow-on PRs that are waiting for this one (#13545, #5382). The most recent comments by Kris look like only small adjustments. There will undoubtedly be some more adjustments needed as we start to really use this feature, but as long as we're not introducing any regressions for the non-telemetry use cases, then I don't think it needs to be perfect before landing.
Required-githooks: true Signed-off-by: Di Wang <di.wang@intel.com>
Required-githooks: true Signed-off-by: Di Wang <di.wang@intel.com>
Resolve comments from Kris. Required-githooks: true Signed-off-by: Di Wang <di.wang@intel.com>
Bug-tracker data:
@wangdi1, @kjacque: Thanks for the quick turnaround on the feedback. FYI, I found an issue in my testing that I've fixed in my follow-on: https://github.com/daos-stack/daos/pull/13545/files#diff-49b244164a1877abd5a7363633cb22baec5bce3b270d978f1beea89a00761bbcR2730 With that fix in place, the agent was able to manage the per-job/pid subdirs with no problems, and ran for a few hours with telemetry managed for hundreds of client processes. @mchaarawi, @ashleypittman, @daltonbohning: Can we please land these two PRs so that we can keep making progress? I think Dalton's testing has shown that there is no appreciable performance impact with metrics enabled or disabled, correct?
I don't see any feedback from Dalton on his latest run about whether that segfaulted again or not, but we probably need to run the test again since it was two weeks ago and more changes have landed on this PR.
The last I ran with
Looking back at my mdtest hard results, it seems unlikely that
For what it's worth, my follow-on PR based on this one passes all of the linter checks: #13545. Can we please get another +1 on it and get it landed? @johannlombardi, @mchaarawi, @ashleypittman? Thanks in advance. :)
I'm going to object to this until 1) we have proved that the segfault issue is resolved/tested, and 2) perf numbers show there is no effect on the non-metrics path.
OK. @daltonbohning: Please advise when you'll have time to re-run the performance tests.
Sorry, I'm behind. I should get back to this either this week or early next week.
With the latest version, I still see mdtest hanging with
Move TLS to common, so both client and server can have TLS, to which metrics can be attached.
Add object metrics on the client side, enabled by export DAOS_CLIENT_METRICS=1. Client metrics are organized as "root/jobid/pid/xxxxx".
root/jobid/pid are stored in an independent shared memory segment, which is only destroyed once all jobs are destroyed.
During each daos thread initialization, another shmem (pid/xxx) is created, to which all metrics of that thread are attached. These metrics are destroyed once the thread exits.
Add DAOS_METRIC_DUMP_ENV to dump metrics from the current thread once it exits.
Some fixes in telemetry around conv_ptr when re-opening the shared memory.
Add a daos_metrics --jobid XXX option to retrieve all metrics of the job.
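A rough sketch of how the per-job layout is built on the client side, using only the calls visible in the review snippets above (error handling and surrounding declarations are elided, and the arguments follow those snippets rather than being authoritative):

/* Attach this process under root/jobid/pid */
rc = d_tm_init(DC_TM_JOB_ROOT_ID, MAX_IDS_SIZE(INIT_JOB_NUM), metrics_tag);
if (rc != 0)
	DL_ERROR(rc, "init job root id %u", DC_TM_JOB_ROOT_ID);

/* Create the jobid/pid directory node that per-process metrics hang off */
rc = d_tm_add_metric(&job_node, D_TM_DIRECTORY, "job id directory", "dir",
		     "%s/%u", dc_jobid, pid);

The job's metrics can then be read back with daos_metrics --jobid <jobid>.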
Required-githooks: true
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: