-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PeerDAS metrics: add data column, kzg, custody metrics #14
base: master
Are you sure you want to change the base?
PeerDAS metrics: add data column, kzg, custody metrics #14
Conversation
Clients' Status Overview
✅ - implemented |
| `beacon_data_column_sidecar_computation_seconds` | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof | On data column sidecar computation | | ||
| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof | On data column sidecar inclusion proof verification | | ||
| `beacon_kzg_verification_data_column_single_seconds` | Histogram | Runtime of single data column kzg verification | On single data column kzg verification | | ||
| `beacon_kzg_verification_data_column_batch_seconds` | Histogram | Runtime of batched data column kzg verification | On batched data column kzg verification | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KatyaRyazantseva in the last call you mentioned that there was proposals to make this milliseconds for more granularity.
Lighthouse currently record these metrics in seconds with floating-point precision, which already provides a high degree of granularity, depending on how the histogram buckets are set.
The units for histograms in all prometheus client libraries are standardised to seconds:
A histogram SHOULD have the following methods:
Some way to time code for users in seconds. In Python this is the time() decorator/context manager. In Java this is startTimer/observeDuration. Units other than seconds MUST NOT be offered (if a user wants something else, they can do it by hand). This should follow the same pattern as Gauge/Summary.
https://prometheus.io/docs/instrumenting/writing_clientlibs/#histogram
| `beacon_data_availability_reconstruction_time_seconds` | Histogram | Time taken to reconstruct columns | On data column kzg verification | | ||
| `beacon_data_column_sidecar_computation_seconds` | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof | On data column sidecar computation | | ||
| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof | On data column sidecar inclusion proof verification | | ||
| `beacon_kzg_verification_data_column_single_seconds` | Histogram | Runtime of single data column kzg verification | On single data column kzg verification | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed on the call, this metric can be removed.
|
||
| Name | Metric type | Usage | Sample collection event | | ||
|--------------------------------------------|-------------|-------------------------------------------------------------|----------------------| | ||
| `beacon_data_column_sidecar_processing_requests_total` | Counter | Number of data column sidecars submitted for processing | On data column sidecar gossip verification | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name and description may be a bit confusing here, as "processing" could mean different things in different clients. The description makes sense in Lighthouse because we have a task scheduling component called BeaconProcessor
, and all the tasks are "submitted" to this scheduler for processing - so you see this terminology used quite frequently in our metrics.
It's more of a convenient metric rather than a must-have in Lighthouse, as we can get the same data with count(beacon_data_column_sidecar_gossip_verification_seconds)
. IMO we should probably minimise the number of standardise metrics, so we don't force all clients to implement metrics that aren't necessary for them. It comes with maintenance cost (once introduced, renaming / removing would be a breaking change) as well as extra prometheus storage cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may still make sense to have this metric though.
For a few of our metrics we discard the timer if the operations fails, so it may not reflect the actual attemtpt count.
Is it also worth mentioning whether we should discard a timer metric if the operation fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah right, I see this implemented in Teku already, if it make sense feel free to ignore the above comment!
| `beacon_data_column_sidecar_gossip_verification_seconds` | Histogram | Full runtime of data column sidecars gossip verification | On data column sidecar gossip verification | | ||
| `beacon_data_availability_reconstructed_columns_total` | Counter | Total count of reconstructed columns | On data column kzg verification | | ||
| `beacon_data_availability_reconstruction_time_seconds` | Histogram | Time taken to reconstruct columns | On data column kzg verification | | ||
| `beacon_data_column_sidecar_computation_seconds` | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof | On data column sidecar computation | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth suggesting a historgram bucket? e.g. [0.1, 0.15, 0.25, 0.35, 0.5, 0.7, 1.0, 2.5, 5.0, 10.0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've also implemented a blob_count
label, so that we can individual timings for different blob counts.
| `beacon_data_column_sidecar_computation_seconds` | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof | On data column sidecar computation | | ||
| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof | On data column sidecar inclusion proof verification | | ||
| `beacon_kzg_verification_data_column_single_seconds` | Histogram | Runtime of single data column kzg verification | On single data column kzg verification | | ||
| `beacon_kzg_verification_data_column_batch_seconds` | Histogram | Runtime of batched data column kzg verification | On batched data column kzg verification | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our current buckets for this histogram in case it helps:
0.002, 0.004, 0.006, 0.008, 0.01, 0.012, 0.015, 0.02, 0.03, 0.05, 0.07
| `beacon_data_column_sidecar_processing_successes_total` | Counter | Number of data column sidecars verified for gossip | On data column sidecar gossip verification | | ||
| `beacon_data_column_sidecar_gossip_verification_seconds` | Histogram | Full runtime of data column sidecars gossip verification | On data column sidecar gossip verification | | ||
| `beacon_data_availability_reconstructed_columns_total` | Counter | Total count of reconstructed columns | On data column kzg verification | | ||
| `beacon_data_availability_reconstruction_time_seconds` | Histogram | Time taken to reconstruct columns | On data column kzg verification | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more reconstruction metrics that could be useful:
kzg_data_column_reconstruction_attempts
: Count of times data column reconstruction has been attemptedkzg_data_column_reconstruction_failures
: Count of times data column reconstruction has failed
Lighthouse source for this here
| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof | On data column sidecar inclusion proof verification | | ||
| `beacon_kzg_verification_data_column_single_seconds` | Histogram | Runtime of single data column kzg verification | On single data column kzg verification | | ||
| `beacon_kzg_verification_data_column_batch_seconds` | Histogram | Runtime of batched data column kzg verification | On batched data column kzg verification | | ||
| `beacon_custody_columns_count_total` | Counter | Total count of columns in custody within the data availability boundary | On custody collecting and verification | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there another metric for tracking the custody count?
This PR introduces a subset of the recommended metrics for monitoring PeerDAS - data column, kzg and custody metrics. The proposed changes attempt to standardize the naming convention for these metrics across client implementations. By adopting these metrics, clients can provide better PeerDAS monitoring, compatibility and simplify the process of building summary dashboards for PeerDAS performance tracking and analysis.
The list is open for discussion. Each client has the opportunity to contribute to it by suggesting additions or disputing existing metrics.
Data column, kzg, custody metrics
beacon_data_column_sidecar_processing_requests_total
beacon_data_column_sidecar_processing_successes_total
beacon_data_column_sidecar_gossip_verification_seconds
beacon_data_availability_reconstructed_columns_total
beacon_data_availability_reconstruction_time_seconds
beacon_data_column_sidecar_computation_seconds
beacon_data_column_sidecar_inclusion_proof_verification_seconds
beacon_kzg_verification_data_column_single_seconds
beacon_kzg_verification_data_column_batch_seconds
beacon_custody_columns_count_total