DAOS-8331 client: Add client side metrics (#14030) #14204

Merged: mjmac merged 3 commits into google/2.4 from mjmac/DAOS-8331-2.4-backport on Apr 29, 2024

@mjmac mjmac commented Apr 20, 2024

This commit comprises two separate patches to enable optional
collection and export of client-side telemetry.

The daos_agent configuration file includes new parameters to control
collection and export of per-client telemetry. If the telemetry_port option
is set, then per-client telemetry will be published in Prometheus format
for real-time sampling of client processes. By default, the client telemetry
will be automatically cleaned up on client exit, but may be optionally
retained for some amount of time after client exit in order to allow for
a final sample to be read.

Example daos_agent.yml updates:
telemetry_port: 9192 # export on port 9192
telemetry_enable: true # enable client telemetry for all connected clients
telemetry_retain: 1m # retain metrics for 1 minute after client exit
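
As a rough illustration of how the exported telemetry might be sampled once telemetry_port is set, the sketch below parses Prometheus text-format samples in Python. The `/metrics` path, the `localhost` address, and the function names are assumptions for illustration and are not taken from this patch; the parser only handles simple `name value` and `name{labels} value` lines.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse simple Prometheus text-format lines into (name, labels, value) tuples.

    Comments and blank lines are skipped. Trailing timestamps and label values
    containing commas or escaped quotes are not supported in this sketch.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The sample value is the last whitespace-separated field.
        series, _, value = line.rpartition(" ")
        name, labels = series, {}
        if "{" in series and series.endswith("}"):
            name, _, label_str = series[:-1].partition("{")
            for pair in label_str.split(","):
                if not pair:
                    continue
                key, _, val = pair.partition("=")
                labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


def scrape(host="localhost", port=9192, path="/metrics", timeout=5.0):
    """Fetch and parse one sample from the agent's telemetry endpoint.

    The "/metrics" path is an assumption (the conventional Prometheus path).
    """
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

Calling `scrape()` against a running agent would return one tuple per exported sample; the actual metric names and labels depend on what the agent publishes.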

If telemetry_enable is false (default), client telemetry may be enabled on
a per-process basis by setting D_CLIENT_METRICS_ENABLE=1 in the
environment for clients that should collect telemetry.
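
A minimal sketch of the per-process opt-in, assuming a launcher written in Python: it copies the environment, sets D_CLIENT_METRICS_ENABLE=1, and starts the client with it. The helper names are hypothetical; only the variable name and value come from the description above.

```python
import os
import subprocess


def with_client_metrics(base_env=None):
    """Return a copy of the environment with client telemetry enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["D_CLIENT_METRICS_ENABLE"] = "1"
    return env


def run_with_metrics(argv):
    """Run one client command with telemetry enabled for that process only."""
    return subprocess.run(argv, env=with_client_metrics(), check=False)
```

Because the variable is set only in the child's environment, other clients on the same node stay unaffected unless telemetry_enable is turned on in the agent configuration.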

Notes from the first patch by Di:

Move TLS (thread-local storage) to common so that both client and server
can have TLS, to which metrics can be attached.

Add object metrics on the client side, enabled by exporting
D_CLIENT_METRICS_ENABLE=1. Client metrics are organized
as "/jobid/pid/xxxxx".
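
To make the "/jobid/pid/xxxxx" layout concrete, here is a small sketch that splits such a path into its parts. It assumes the pid component is numeric and that everything after it is the metric name; the type and function names, and the example metric name in the usage note, are hypothetical.

```python
from typing import NamedTuple


class ClientMetricPath(NamedTuple):
    jobid: str
    pid: int
    metric: str


def split_client_metric_path(path):
    """Split "/jobid/pid/metric..." into (jobid, pid, metric)."""
    parts = path.strip("/").split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected /jobid/pid/metric, got {path!r}")
    jobid, pid, metric = parts
    return ClientMetricPath(jobid, int(pid), metric)
```

For example, `split_client_metric_path("/job42/1234/io/ops")` yields jobid `job42`, pid `1234`, and metric `io/ops`.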

During each DAOS thread initialization, another shmem segment (pid/xxx)
is created, to which all metrics of the thread are attached. These
metrics are destroyed once the thread exits; if D_CLIENT_METRICS_RETAIN
is set, however, the client metrics are retained and can be retrieved with
daos_metrics --jobid

Add D_CLIENT_METRICS_DUMP_PATH to dump metrics from the current thread
once it exits.
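
The retention and dump settings can be sketched as environment setup for a client process. This assumes that D_CLIENT_METRICS_RETAIN only needs to be set (the value "1" is a guess) and that D_CLIENT_METRICS_DUMP_PATH takes a file path; the helper name and the path in the usage note are hypothetical.

```python
import os


def retention_env(base_env=None, retain=True, dump_path=None):
    """Build an environment that keeps or dumps client metrics at exit."""
    env = dict(os.environ if base_env is None else base_env)
    env["D_CLIENT_METRICS_ENABLE"] = "1"
    if retain:
        # Assumption: any set value keeps the metrics after exit.
        env["D_CLIENT_METRICS_RETAIN"] = "1"
    else:
        env.pop("D_CLIENT_METRICS_RETAIN", None)
    if dump_path is not None:
        # Assumption: the variable holds the path the dump is written to.
        env["D_CLIENT_METRICS_DUMP_PATH"] = os.fspath(dump_path)
    return env
```

A launcher could pass `retention_env(dump_path="/tmp/client_metrics.dump")` as the `env` argument of `subprocess.run` so that a final set of metrics survives the client's exit.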

Some fixes in telemetry for conv_ptr when re-opening the
shared memory.

Add a daos_metrics --jobid XXX option to retrieve all metrics
of the job.
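
A hedged sketch of driving the new option from a script: it only builds and runs the `daos_metrics --jobid XXX` command line described above and returns the raw output, without assuming anything about the output format. The function names are hypothetical.

```python
import subprocess


def daos_metrics_cmd(jobid, binary="daos_metrics"):
    """Build the argv for retrieving all metrics of one job."""
    if not jobid:
        raise ValueError("jobid must be non-empty")
    return [binary, "--jobid", str(jobid)]


def fetch_job_metrics(jobid):
    """Run daos_metrics for the job and return its raw stdout as text."""
    result = subprocess.run(
        daos_metrics_cmd(jobid), capture_output=True, text=True, check=True
    )
    return result.stdout
```

`fetch_job_metrics("job42")` would raise `subprocess.CalledProcessError` if the tool exits with a non-zero status, for example when no metrics exist for that job.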

Required-githooks: true

Change-Id: Ib80ff89f39d259e0dce26e0ae8388318f96a3540
Signed-off-by: Di Wang <di.wang@intel.com>
Signed-off-by: Michael MacDonald <mjmac@google.com>
Co-authored-by: Di Wang <di.wang@intel.com>

github-actions bot commented Apr 20, 2024

Bug-tracker data:
Ticket title is 'Client side metrics/stats support for DAOS'
Status is 'In Review'
Labels: 'HPE'
https://daosio.atlassian.net/browse/DAOS-8331

@mjmac mjmac changed the title from "DAOS-14850 control: Allow logging.Logger in Context (#13569)" to "DAOS-8331 client: Add client side metrics (#14030)" Apr 20, 2024

@daosbuild1

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14204/1/testReport/

@mjmac mjmac force-pushed the mjmac/DAOS-8331-2.4-backport branch 2 times, most recently from 1070f9d to c6f802b Compare April 21, 2024 12:39
@daosbuild1 daosbuild1 dismissed their stale review April 21, 2024 12:43

Updated patch

@mjmac mjmac force-pushed the mjmac/DAOS-8331-2.4-backport branch from c6f802b to 2aa0249 Compare April 21, 2024 12:51
@daosbuild1 daosbuild1 dismissed their stale review April 21, 2024 12:53

Updated patch

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14204/4/execution/node/1263/log

@mjmac mjmac force-pushed the mjmac/DAOS-8331-2.4-backport branch from 2aa0249 to 91ab43a Compare April 22, 2024 14:24
@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@mjmac mjmac changed the base branch from google/2.4 to mjmac/DAOS-14850-2.4 April 22, 2024 14:37
@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

Base automatically changed from mjmac/DAOS-14850-2.4 to google/2.4 April 24, 2024 15:16
@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

phender and others added 3 commits April 24, 2024 15:22
Adding tests for WAL commit, reply, and checkpoint metrics.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Required-githooks: true

Change-Id: I17ac20dd02462edcf09af1267a66e31d39eac691
Signed-off-by: Michael MacDonald <mjmac@google.com>
Features: telemetry
Required-githooks: true
Change-Id: Ib80ff89f39d259e0dce26e0ae8388318f96a3540
Co-authored-by: Di Wang <di.wang@intel.com>
Signed-off-by: Di Wang <di.wang@intel.com>
Signed-off-by: Michael MacDonald <mjmac@google.com>
@mjmac mjmac force-pushed the mjmac/DAOS-8331-2.4-backport branch from d6d083e to a647825 Compare April 24, 2024 15:24
@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@mjmac mjmac merged commit 5a027b1 into google/2.4 Apr 29, 2024
31 of 32 checks passed
@mjmac mjmac deleted the mjmac/DAOS-8331-2.4-backport branch April 29, 2024 15:39
4 participants