PeerDAS metrics: add data column, kzg, custody metrics #14

Draft · wants to merge 6 commits into `base: master`
20 changes: 20 additions & 0 deletions metrics.md
@@ -30,6 +30,26 @@ The following are the minimal metrics agreed to be conformed by the various clients
| `beacon_processed_deposits_total` | Gauge | Total number of deposits processed | On epoch transition |

\* All `*_root` values are converted to signed 64-bit integers utilizing the last 8 bytes interpreted as little-endian (`int.from_bytes(root[24:32], byteorder='little', signed=True)`).
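As a quick illustration of the conversion rule above (plain Python, nothing client-specific):

```python
def root_to_int64(root: bytes) -> int:
    """Convert a 32-byte root to a signed 64-bit integer using the
    last 8 bytes interpreted as little-endian, per the rule above."""
    assert len(root) == 32
    return int.from_bytes(root[24:32], byteorder='little', signed=True)

# A root ending in 0x01 followed by seven zero bytes maps to 1;
# a root ending in eight 0xff bytes maps to -1.
print(root_to_int64(b'\x00' * 24 + b'\x01' + b'\x00' * 7))  # 1
print(root_to_int64(b'\x00' * 24 + b'\xff' * 8))            # -1
```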

### PeerDAS Metrics

The following metrics are proposed to be added to clients for PeerDAS monitoring. This list is open for discussion. Each client has the opportunity to contribute to it by suggesting additions or disputing existing metrics.

#### Data column, kzg, custody metrics

| Name | Metric type | Usage | Sample collection event |
|--------------------------------------------|-------------|-------------------------------------------------------------|----------------------|
| `beacon_data_column_sidecar_processing_requests_total` | Counter | Number of data column sidecars submitted for processing | On data column sidecar gossip verification |


The name and description may be a bit confusing here, as "processing" could mean different things in different clients. The description makes sense in Lighthouse because we have a task scheduling component called BeaconProcessor, and all the tasks are "submitted" to this scheduler for processing - so you see this terminology used quite frequently in our metrics.

It's more of a convenience metric than a must-have in Lighthouse, as we can get the same data with `count(beacon_data_column_sidecar_gossip_verification_seconds)`. IMO we should probably minimise the number of standardised metrics, so we don't force all clients to implement metrics that aren't necessary for them. It comes with maintenance cost (once introduced, renaming / removing would be a breaking change) as well as extra Prometheus storage cost.


It may still make sense to have this metric though.

For a few of our metrics we discard the timer if the operation fails, so it may not reflect the actual attempt count.

Is it also worth mentioning whether we should discard a timer metric if the operation fails?


ah right, I see this implemented in Teku already, if it makes sense feel free to ignore the above comment!

| `beacon_data_column_sidecar_processing_successes_total` | Counter | Number of data column sidecars verified for gossip | On data column sidecar gossip verification |
| `beacon_data_column_sidecar_gossip_verification_seconds` | Histogram | Full runtime of data column sidecars gossip verification | On data column sidecar gossip verification |
| `beacon_data_availability_reconstructed_columns_total` | Counter | Total count of reconstructed columns | On data column kzg verification |
| `beacon_data_availability_reconstruction_time_seconds` | Histogram | Time taken to reconstruct columns | On data column kzg verification |


A few more reconstruction metrics that could be useful:

- `kzg_data_column_reconstruction_attempts`: Count of times data column reconstruction has been attempted
- `kzg_data_column_reconstruction_failures`: Count of times data column reconstruction has failed

Lighthouse source for this here

| `beacon_data_column_sidecar_computation_seconds` | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof | On data column sidecar computation |


Is it worth suggesting histogram buckets? e.g. `[0.1, 0.15, 0.25, 0.35, 0.5, 0.7, 1.0, 2.5, 5.0, 10.0]`
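To make the bucket choice concrete, here is an illustrative pure-Python sketch (not any client's implementation) of how Prometheus cumulative `le` bucket semantics would classify observations against the suggested bounds:

```python
import bisect

# Upper bounds (in seconds) from the suggested buckets above.
# Prometheus histogram buckets are cumulative: an observation increments
# every bucket whose `le` upper bound is >= the observed value, plus the
# implicit +Inf bucket.
BUCKETS = [0.1, 0.15, 0.25, 0.35, 0.5, 0.7, 1.0, 2.5, 5.0, 10.0]

def observe(counts, value):
    """Record one observation into cumulative bucket counts."""
    for i in range(bisect.bisect_left(BUCKETS, value), len(BUCKETS)):
        counts[i] += 1
    counts[len(BUCKETS)] += 1  # the +Inf bucket counts every observation

counts = [0] * (len(BUCKETS) + 1)
for seconds in (0.12, 0.3, 4.2):
    observe(counts, seconds)

print(counts)  # [0, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
```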


We've also implemented a `blob_count` label, so that we can get individual timings for different blob counts.

| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof | On data column sidecar inclusion proof verification |
| `beacon_kzg_verification_data_column_single_seconds` | Histogram | Runtime of single data column kzg verification | On single data column kzg verification |


As discussed on the call, this metric can be removed.

| `beacon_kzg_verification_data_column_batch_seconds` | Histogram | Runtime of batched data column kzg verification | On batched data column kzg verification |


@KatyaRyazantseva in the last call you mentioned that there were proposals to make this milliseconds for more granularity.

Lighthouse currently records these metrics in seconds with floating-point precision, which already provides a high degree of granularity, depending on how the histogram buckets are set.

The units for histograms in all Prometheus client libraries are standardised to seconds:

> A histogram SHOULD have the following methods:
> Some way to time code for users in seconds. In Python this is the `time()` decorator/context manager. In Java this is `startTimer`/`observeDuration`. Units other than seconds MUST NOT be offered (if a user wants something else, they can do it by hand). This should follow the same pattern as Gauge/Summary.

https://prometheus.io/docs/instrumenting/writing_clientlibs/#histogram
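As a minimal plain-Python sketch of that convention (names here are illustrative, not from any client): record durations as float seconds rather than converting to milliseconds; sub-second granularity comes from the float, not from switching units.

```python
import time

def timed_seconds(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds).
    The duration stays a float in seconds, per the Prometheus
    convention; do not multiply into milliseconds before recording."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

result, elapsed = timed_seconds(sum, range(1000))
print(result)  # 499500
```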


Our current buckets for this histogram, in case it helps:
`0.002, 0.004, 0.006, 0.008, 0.01, 0.012, 0.015, 0.02, 0.03, 0.05, 0.07`

| `beacon_custody_columns_count_total` | Counter | Total count of columns in custody within the data availability boundary | On custody collecting and verification |


Is there another metric for tracking the custody count?


### Additional Metrics

The following are proposed metrics to be added to clients. This list is _not_ stable and is subject to drastic changes, deletions, and additions. The additional metric list is being