Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PeerDAS metrics: add data column, kzg, custody metrics #14

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

KatyaRyazantseva
Copy link

@KatyaRyazantseva KatyaRyazantseva commented Oct 3, 2024

This PR introduces a subset of the recommended metrics for monitoring PeerDAS - data column, kzg and custody metrics. The proposed changes attempt to standardize the naming convention for these metrics across client implementations. By adopting these metrics, clients can provide better PeerDAS monitoring, compatibility and simplify the process of building summary dashboards for PeerDAS performance tracking and analysis.

The list is open for discussion. Each client has the opportunity to contribute to it by suggesting additions or disputing existing metrics.

Data column, kzg, custody metrics

Name Usage
beacon_data_column_sidecar_processing_requests_total Number of data column sidecars submitted for processing (counter)
beacon_data_column_sidecar_processing_successes_total Number of data column sidecars verified for gossip (counter)
beacon_data_column_sidecar_gossip_verification_seconds Full runtime of data column sidecars gossip verification (histogram)
beacon_data_availability_reconstructed_columns_total Total count of reconstructed columns (counter)
beacon_data_availability_reconstruction_time_seconds Time taken to reconstruct columns (histogram)
beacon_data_column_sidecar_computation_seconds Time taken to compute data column sidecar, including cells, proofs and inclusion proof (histogram)
beacon_data_column_sidecar_inclusion_proof_verification_seconds Time taken to verify data column sidecar inclusion proof (histogram)
beacon_kzg_verification_data_column_single_seconds Runtime of single data column kzg verification (histogram)
beacon_kzg_verification_data_column_batch_seconds Runtime of batched data column kzg verification (histogram)
beacon_custody_columns_count_total Total count of columns in custody within the data availability boundary (counter)

@KatyaRyazantseva
Copy link
Author

KatyaRyazantseva commented Oct 3, 2024

Clients' Status Overview

Metric Lighthouse Teku Grandine Prysm Lodestar Nimbus
Raised issue #6018 #65 #14129
beacon_data_column_sidecar_processing_requests_total 📝
beacon_data_column_sidecar_processing_successes_total 📝
beacon_data_column_sidecar_gossip_verification_seconds 📝 📝
beacon_data_availability_reconstructed_columns_total 📝 📝
beacon_data_availability_reconstruction_time_seconds 📝 📝 📝
beacon_data_column_sidecar_computation_seconds 📝 📝
beacon_data_column_sidecar_inclusion_proof_verification_seconds 📝 📝
beacon_kzg_verification_data_column_single_seconds 📝
beacon_kzg_verification_data_column_batch_seconds 📝 📝
beacon_custody_columns_count_total

✅ - implemented
📝 - in progress, requiring adjustments
□ - not implemented

| `beacon_data_column_sidecar_computation_seconds` | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof | On data column sidecar computation |
| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof | On data column sidecar inclusion proof verification |
| `beacon_kzg_verification_data_column_single_seconds` | Histogram | Runtime of single data column kzg verification | On single data column kzg verification |
| `beacon_kzg_verification_data_column_batch_seconds` | Histogram | Runtime of batched data column kzg verification | On batched data column kzg verification |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KatyaRyazantseva in the last call you mentioned that there was proposals to make this milliseconds for more granularity.

Lighthouse currently record these metrics in seconds with floating-point precision, which already provides a high degree of granularity, depending on how the histogram buckets are set.

The units for histograms in all prometheus client libraries are standardised to seconds:

A histogram SHOULD have the following methods:
Some way to time code for users in seconds. In Python this is the time() decorator/context manager. In Java this is startTimer/observeDuration. Units other than seconds MUST NOT be offered (if a user wants something else, they can do it by hand). This should follow the same pattern as Gauge/Summary.

https://prometheus.io/docs/instrumenting/writing_clientlibs/#histogram

| `beacon_data_availability_reconstruction_time_seconds` | Histogram | Time taken to reconstruct columns | On data column kzg verification |
| `beacon_data_column_sidecar_computation_seconds` | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof | On data column sidecar computation |
| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof | On data column sidecar inclusion proof verification |
| `beacon_kzg_verification_data_column_single_seconds` | Histogram | Runtime of single data column kzg verification | On single data column kzg verification |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on the call, this metric can be removed.


| Name | Metric type | Usage | Sample collection event |
|--------------------------------------------|-------------|-------------------------------------------------------------|----------------------|
| `beacon_data_column_sidecar_processing_requests_total` | Counter | Number of data column sidecars submitted for processing | On data column sidecar gossip verification |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name and description may be a bit confusing here, as "processing" could mean different things in different clients. The description makes sense in Lighthouse because we have a task scheduling component called BeaconProcessor, and all the tasks are "submitted" to this scheduler for processing - so you see this terminology used quite frequently in our metrics.

It's more of a convenient metric rather than a must-have in Lighthouse, as we can get the same data with count(beacon_data_column_sidecar_gossip_verification_seconds). IMO we should probably minimise the number of standardise metrics, so we don't force all clients to implement metrics that aren't necessary for them. It comes with maintenance cost (once introduced, renaming / removing would be a breaking change) as well as extra prometheus storage cost.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may still make sense to have this metric though.

For a few of our metrics we discard the timer if the operations fails, so it may not reflect the actual attemtpt count.

Is it also worth mentioning whether we should discard a timer metric if the operation fails?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah right, I see this implemented in Teku already, if it make sense feel free to ignore the above comment!

| `beacon_data_column_sidecar_gossip_verification_seconds` | Histogram | Full runtime of data column sidecars gossip verification | On data column sidecar gossip verification |
| `beacon_data_availability_reconstructed_columns_total` | Counter | Total count of reconstructed columns | On data column kzg verification |
| `beacon_data_availability_reconstruction_time_seconds` | Histogram | Time taken to reconstruct columns | On data column kzg verification |
| `beacon_data_column_sidecar_computation_seconds` | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof | On data column sidecar computation |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth suggesting a historgram bucket? e.g. [0.1, 0.15, 0.25, 0.35, 0.5, 0.7, 1.0, 2.5, 5.0, 10.0]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've also implemented a blob_count label, so that we can individual timings for different blob counts.

| `beacon_data_column_sidecar_computation_seconds` | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof | On data column sidecar computation |
| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof | On data column sidecar inclusion proof verification |
| `beacon_kzg_verification_data_column_single_seconds` | Histogram | Runtime of single data column kzg verification | On single data column kzg verification |
| `beacon_kzg_verification_data_column_batch_seconds` | Histogram | Runtime of batched data column kzg verification | On batched data column kzg verification |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our current buckets for this histogram in case it helps:
0.002, 0.004, 0.006, 0.008, 0.01, 0.012, 0.015, 0.02, 0.03, 0.05, 0.07

| `beacon_data_column_sidecar_processing_successes_total` | Counter | Number of data column sidecars verified for gossip | On data column sidecar gossip verification |
| `beacon_data_column_sidecar_gossip_verification_seconds` | Histogram | Full runtime of data column sidecars gossip verification | On data column sidecar gossip verification |
| `beacon_data_availability_reconstructed_columns_total` | Counter | Total count of reconstructed columns | On data column kzg verification |
| `beacon_data_availability_reconstruction_time_seconds` | Histogram | Time taken to reconstruct columns | On data column kzg verification |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more reconstruction metrics that could be useful:

  • kzg_data_column_reconstruction_attempts: Count of times data column reconstruction has been attempted
  • kzg_data_column_reconstruction_failures: Count of times data column reconstruction has failed

Lighthouse source for this here

| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof | On data column sidecar inclusion proof verification |
| `beacon_kzg_verification_data_column_single_seconds` | Histogram | Runtime of single data column kzg verification | On single data column kzg verification |
| `beacon_kzg_verification_data_column_batch_seconds` | Histogram | Runtime of batched data column kzg verification | On batched data column kzg verification |
| `beacon_custody_columns_count_total` | Counter | Total count of columns in custody within the data availability boundary | On custody collecting and verification |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another metric for tracking the custody count?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants