First pass at tracking system metrics #48

Quantumplation · 2025-01-06T07:36:47Z

This spawns a background thread to track common system metrics; currently only tracks a few, but we can add more as needed.
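For readers unfamiliar with the pattern, here is a minimal sketch of a background sampling thread using only the standard library. It is not the code in this PR: `SystemMetrics`, `Sampler`, and the uptime reading are hypothetical stand-ins, and a real implementation would query CPU, memory, and similar values through a metrics crate.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Latest values published by the sampler thread (hypothetical shape).
#[derive(Default)]
pub struct SystemMetrics {
    /// Number of samples taken so far.
    pub samples: AtomicU64,
    /// Time since the sampler started, in milliseconds, at the last sample.
    pub uptime_ms: AtomicU64,
}

/// Handle used to stop the background thread.
pub struct Sampler {
    stop: Arc<AtomicBool>,
    handle: JoinHandle<()>,
}

impl Sampler {
    /// Spawns a thread that samples once immediately, then once per `interval`.
    pub fn spawn(metrics: Arc<SystemMetrics>, interval: Duration) -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let stop_flag = Arc::clone(&stop);
        let started = Instant::now();
        let handle = thread::spawn(move || loop {
            // Assumption: uptime stands in for real readings (CPU, memory, ...).
            let uptime = started.elapsed().as_millis() as u64;
            metrics.uptime_ms.store(uptime, Ordering::Relaxed);
            metrics.samples.fetch_add(1, Ordering::Relaxed);
            if stop_flag.load(Ordering::Relaxed) {
                break;
            }
            thread::sleep(interval);
        });
        Sampler { stop, handle }
    }

    /// Signals the thread to stop and waits for it to finish.
    pub fn shutdown(self) {
        self.stop.store(true, Ordering::Relaxed);
        let _ = self.handle.join();
    }
}

fn main() {
    let metrics = Arc::new(SystemMetrics::default());
    let sampler = Sampler::spawn(Arc::clone(&metrics), Duration::from_millis(10));
    thread::sleep(Duration::from_millis(50));
    sampler.shutdown();
    println!("samples taken: {}", metrics.samples.load(Ordering::Relaxed));
}
```

Keeping the sampler on its own thread means a slow reading never blocks the node itself; the trade-off is that values are only as fresh as the sampling interval.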

I evaluated a few different metrics crates:

- [opentelemetry-system-metrics](https://crates.io/crates/opentelemetry-system-metrics)
- [sys_metrics](https://crates.io/crates/sys_metrics)
- [heim](https://github.com/heim-rs/heim?tab=readme-ov-file)

Of these, sys_metrics doesn't support Windows, but is the most actively maintained.

As a follow-up, it might be useful to track Tokio runtime metrics as well:

https://docs.rs/tokio/latest/tokio/runtime/struct.RuntimeMetrics.html

Opening this as a draft PR, looking for feedback on the structure / organization before I write a few more tests and add a few more metrics.

Quantumplation · 2025-01-06T07:38:24Z

Example view in Grafana:

I should also note that this follows naming conventions as specified here: https://opentelemetry.io/docs/specs/semconv/general/metrics/#general-metric-semantic-conventions
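As a rough illustration of what those conventions look like in practice, the sketch below checks that a metric name is lowercase, dot-namespaced, and snake_case within each component. This is a simplified check, not the full specification, and `follows_naming_sketch` is a hypothetical helper. The sample names come from the OpenTelemetry process semantic conventions and are not necessarily the names this PR emits.

```rust
/// Simplified check (assumption: not the complete OpenTelemetry rules):
/// at least one dot-separated namespace, each component starting with a
/// lowercase letter and containing only lowercase letters, digits, or `_`.
fn follows_naming_sketch(name: &str) -> bool {
    name.contains('.')
        && name.split('.').all(|part| {
            let mut chars = part.chars();
            matches!(chars.next(), Some(c) if c.is_ascii_lowercase())
                && chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
        })
}

fn main() {
    for name in ["process.cpu.time", "process.memory.usage", "ProcessCpuTime"] {
        println!("{name}: {}", follows_naming_sketch(name));
    }
}
```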

abailly · 2025-01-06T13:36:02Z

Do we want to track system-level metrics from within the process? Would it not be simpler and more flexible to let the users reap those from the underlying system, using whatever makes sense for them?

Quantumplation · 2025-01-09T02:45:32Z

I have two main thoughts here, though I'll defer to @KtorZ as he's the one that requested this.

1. If we completely externalize this, then we place the burden on the operator to find something. I think we should have a built-in solution, and then allow someone to disable it (or just ignore it) if they have a preferred way to gather metrics. Tracking metrics is such a critical part of infrastructure health that, if we have nothing, we're basically forcing someone to do extra steps to run Amaru, which is one of the big mistakes that the Haskell node made, IMO.
2. In general, these metrics focus on the node process and what it's been allocated, not the underlying system, so I still think it's in the purview of the process itself to track/report these alongside other metrics.

KtorZ · 2025-01-09T08:48:45Z

@abailly: Do we want to track system-level metrics from within the process?

That definitely also crossed my mind, and was part of the question when we talked about it on Discord. Yet, I also agree with @Quantumplation that it's a good complement because we can make metrics available that are runtime-specific rather than being process-specific.

I still expect anyone doing ops to have their preferred ways of monitoring processes. But we can at least provide some simple metrics (if only for development) as well as some more fine-grained ones that aren't immediately observable from a process.

KtorZ · 2025-01-09T08:51:42Z

crates/amaru/src/bin/amaru/main.rs

@@ -39,7 +40,7 @@ async fn main() -> miette::Result<()> {
    let args = Cli::parse();

    let result = match args.command {
-        Command::Daemon(args) => cmd::daemon::run(args, counter).await,
+        Command::Daemon(args) => cmd::daemon::run(args, counter, metrics.clone()).await,


Note that this counter was merely me toying around with passing a metric counter all the way down to the ledger. What I had in mind was for this to become some sort of interface / handle to metrics in general, possibly abstracted behind traits and driven by the tracing setup.

Quantumplation · 2025-01-09

Agreed, I had the same thought. I considered refactoring it as part of this, but figured a lighter touch was better, at least for the draft.

(I can take a pass at something like that in a follow up PR, if you want!)
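To make the "handle behind traits" idea from this thread concrete, here is one possible shape, sketched with the standard library only. The `Metrics` trait, `NoopMetrics`, `InMemoryMetrics`, `apply_block`, and the metric name are all hypothetical; nothing here reflects an agreed design or the existing `counter` argument.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical interface: components depend on this trait rather than on a
/// concrete counter type.
pub trait Metrics: Send + Sync {
    fn add(&self, name: &'static str, value: u64);
}

/// Discards everything; useful when metrics are disabled.
pub struct NoopMetrics;

impl Metrics for NoopMetrics {
    fn add(&self, _name: &'static str, _value: u64) {}
}

/// Keeps counters in memory; useful for tests.
#[derive(Default)]
pub struct InMemoryMetrics {
    counters: Mutex<HashMap<&'static str, u64>>,
}

impl InMemoryMetrics {
    /// Returns the current value of a counter, or 0 if it was never touched.
    pub fn get(&self, name: &str) -> u64 {
        self.counters.lock().unwrap().get(name).copied().unwrap_or(0)
    }
}

impl Metrics for InMemoryMetrics {
    fn add(&self, name: &'static str, value: u64) {
        *self.counters.lock().unwrap().entry(name).or_insert(0) += value;
    }
}

/// Stand-in for code deep inside the ledger that only sees the trait.
fn apply_block(metrics: &dyn Metrics, tx_count: u64) {
    metrics.add("ledger.tx.count", tx_count);
}

fn main() {
    let recorder = Arc::new(InMemoryMetrics::default());
    // The same handle can be cloned and passed down as a trait object.
    let handle: Arc<dyn Metrics> = recorder.clone();
    apply_block(handle.as_ref(), 3);
    apply_block(handle.as_ref(), 2);
    println!("ledger.tx.count = {}", recorder.get("ledger.tx.count"));
}
```

With this shape, `main.rs` would pass a single handle instead of one argument per metric, and the backing implementation could be swapped (for example, one driven by the tracing setup) without touching call sites.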


chore: clippy fixes

32e8d9e

Quantumplation force-pushed the pi/sys-metrics branch from c66cfbd to 32e8d9e January 6, 2025 07:43

KtorZ reviewed Jan 9, 2025


Quantumplation marked this pull request as ready for review January 10, 2025 14:45

Quantumplation and others added 4 commits January 10, 2025 09:53

Merge branch 'main' into pi/sys-metrics

d5a0185

Disable system metrics on windows until the dependency supports windows

2d86c88

Fix windows build

96abd6e

Move internals into submodule

0b257ef

This should make conditional compilation easier, since we can enable/disable the module as a whole
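A minimal sketch of the idea, assuming a module named `system_metrics` (hypothetical, as are its functions): the real module is compiled everywhere except Windows, and a stub with the same public surface is compiled on Windows, so callers need no `cfg` attributes of their own.

```rust
// Real implementation: compiled on every target except Windows.
#[cfg(not(target_os = "windows"))]
mod system_metrics {
    pub fn enabled() -> bool {
        true
    }

    pub fn start() -> &'static str {
        // Assumption: the sampling thread would be spawned here.
        "system metrics started"
    }
}

// Stub with the same public surface: compiled only on Windows.
#[cfg(target_os = "windows")]
mod system_metrics {
    pub fn enabled() -> bool {
        false
    }

    pub fn start() -> &'static str {
        "system metrics are disabled on this platform"
    }
}

fn main() {
    // The call site is identical on every platform.
    println!("{} (enabled: {})", system_metrics::start(), system_metrics::enabled());
}
```

Gating the module as a whole keeps the platform check in one place, which also makes it simple to remove once the dependency supports Windows.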
