Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pass at tracking system metrics #48

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Quantumplation
Copy link
Contributor

This spawns a background thread to track common system metrics; currently only tracks a few, but we can add more as needed.

I evaluated a few different metrics:

Of these, sys_metrics doesn't support windows, but is the most actively maintained.

As a follow up, it might be useful to track Tokio runtime metrics as well:

https://docs.rs/tokio/latest/tokio/runtime/struct.RuntimeMetrics.html

Opening this as a draft PR, looking for feedback on the structure / organization before I write a few more tests and add a few more metrics.

This spawns a background thread to track common system metrics;
currently only tracks a few, but we can add more as needed.

I evaluated a few different metrics:
 - [opentelemetry-system-metrics](https://crates.io/crates/opentelemetry-system-metrics)
 - [sys_metrics](https://crates.io/crates/sys_metrics)
 - [heim](https://github.com/heim-rs/heim?tab=readme-ov-file)

Of these, sys_metrics doesn't support windows, but is the most actively
maintained.

As a follow up, it might be useful to track Tokio runtime metrics as
well:

https://docs.rs/tokio/latest/tokio/runtime/struct.RuntimeMetrics.html
@Quantumplation
Copy link
Contributor Author

Example view in grafana:
image

I should also note that this follows naming conventions as specified here: https://opentelemetry.io/docs/specs/semconv/general/metrics/#general-metric-semantic-conventions

@abailly
Copy link
Contributor

abailly commented Jan 6, 2025

Do we want to track system-level metrics from within the process? Would it not be simpler and more flexible to let the users reap those from the underlying system, using whatever makes sense for them?

@Quantumplation
Copy link
Contributor Author

I have two main thoughts here, though I'll defer to @KtorZ as he's the one that requested this.

  1. If we completely externalize this, then we place the burden on the operator to find something; I think we should have an in-built solution, and then allow someone to disable it (or just ignore it) if they have a preferred way to gather metrics. But tracking metrics is such a critical part of infrastructure health, that if we have nothing, we're basically forcing someone to do extra steps to run Amaru, which is one of the big mistakes that the Haskell node made, IMO

  2. In general, these metrics focus on the node process and what it's been allocated, not the underlying system system, so I still think it's in the purview of the process itself to track/report these alongside other metrics.

@KtorZ
Copy link
Contributor

KtorZ commented Jan 9, 2025

@abailly: Do we want to track system-level metrics from within the process?

That definitely also crossed my mind, and was part of the question when we talked about it on Discord. Yet, I also agree with @Quantumplation that it's a good complement because we can make metrics available that are runtime-specific rather than being process-specific.

I still expect anyone doing ops to have their preferred ways of monitoring processes. But we can at least provide some simple metrics (if only for development) as well as some more fine-grained ones that aren't immediately observable from a process.

@@ -39,7 +40,7 @@ async fn main() -> miette::Result<()> {
let args = Cli::parse();

let result = match args.command {
Command::Daemon(args) => cmd::daemon::run(args, counter).await,
Command::Daemon(args) => cmd::daemon::run(args, counter, metrics.clone()).await,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this counter was merely me toying around passing a metric counter all-the-way down the ledger. What I had in mind for this was to become some sort of interface / handle to metrics in general; possibly abstract behind traits and driven by the tracing setup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, I had the same thought; I thought about trying to refactor it as part of this, but figured a lighter touch at least for the draft was better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I can take a pass at something like that in a follow up PR, if you want!)

@Quantumplation Quantumplation marked this pull request as ready for review January 10, 2025 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants