This commit comprises two separate patches to enable optional
collection and export of client-side telemetry.
The daos_agent configuration file includes new parameters to control
collection and export of per-client telemetry. If the telemetry_port option
is set, per-client telemetry is published in Prometheus format for
real-time sampling of client processes. By default, client telemetry is
cleaned up automatically when the client exits, but it may optionally be
retained for a configurable period after exit (telemetry_retain) so that
a final sample can be read.
Example daos_agent.yml updates:
    telemetry_port: 9192    # export on port 9192
    telemetry_enable: true  # enable client telemetry for all connected clients
    telemetry_retain: 1m    # retain metrics for 1 minute after client exit
If telemetry_enable is false (default), client telemetry may be enabled on
a per-process basis by setting D_CLIENT_METRICS_ENABLE=1 in the
environment for clients that should collect telemetry.
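The enable rule above can be sketched as follows. clientMetricsEnabled is
a hypothetical helper; the environment lookup is injected so the rule can
be exercised without touching the real environment, and accepting any
value that strconv.ParseBool treats as true ("1", "true", ...) is an
assumption, since the client library may parse the variable differently.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// clientMetricsEnabled (hypothetical) returns true if the agent enables
// telemetry for all clients, or if this process opts in through
// D_CLIENT_METRICS_ENABLE.
func clientMetricsEnabled(agentEnable bool, lookup func(string) (string, bool)) bool {
	if agentEnable {
		return true // telemetry_enable: true covers every client
	}
	val, ok := lookup("D_CLIENT_METRICS_ENABLE")
	if !ok {
		return false // default: disabled
	}
	enabled, err := strconv.ParseBool(val)
	return err == nil && enabled
}

func main() {
	// Check the current process environment with telemetry_enable: false.
	fmt.Println(clientMetricsEnabled(false, os.LookupEnv))
}
```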
Notes from the first patch by Di:
Move thread-local storage (TLS) to common, so that both client and
server have TLS to which metrics can be attached.
Add client-side object metrics, enabled by
export D_CLIENT_METRICS_ENABLE=1. Client metrics are organized
as "/jobid/pid/xxxxx".
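As an illustration of the "/jobid/pid/xxxxx" layout, this sketch builds
a metric path from a job ID, a process ID, and a metric name. The
clientMetricPath helper and the example job and metric names are
hypothetical; only the jobid/pid nesting comes from the description above.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// clientMetricPath (hypothetical) places a metric under its job and
// process, following the /jobid/pid/<metric> layout.
func clientMetricPath(jobID string, pid int, metric string) string {
	return path.Join("/", jobID, strconv.Itoa(pid), metric)
}

func main() {
	fmt.Println(clientMetricPath("myjob", 1234, "obj/update")) // /myjob/1234/obj/update
}
```

Grouping by job first is what allows a per-job query such as
daos_metrics --jobid to find every process of that job under one prefix.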
During initialization, each DAOS thread creates its own shared-memory
segment (pid/xxx), to which all of that thread's metrics are attached.
The segment is destroyed when the thread exits; however, if
D_CLIENT_METRICS_RETAIN is set, these client metrics are retained and
can be retrieved with
daos_metrics --jobid
Add D_CLIENT_METRICS_DUMP_PATH to dump the metrics of the current
thread when it exits.
Fix conv_ptr handling in telemetry when re-opening the shared
memory.
Add a daos_metrics --jobid XXX option to retrieve all metrics
of a job.
Features: telemetry
Required-githooks: true
Change-Id: Ib80ff89f39d259e0dce26e0ae8388318f96a3540
Co-authored-by: Di Wang <di.wang@intel.com>
Signed-off-by: Di Wang <di.wang@intel.com>
Signed-off-by: Michael MacDonald <mjmac@google.com>