PeerDAS metrics: add data column, kzg, custody metrics #14

KatyaRyazantseva · 2024-10-03T20:19:33Z

This PR introduces a subset of the recommended metrics for monitoring PeerDAS: data column, kzg and custody metrics. The proposed changes aim to standardize the naming convention for these metrics across client implementations. Adopting these metrics lets clients provide better PeerDAS monitoring, improves compatibility between clients, and simplifies building summary dashboards for PeerDAS performance tracking and analysis.

The list is open for discussion. Each client has the opportunity to contribute to it by suggesting additions or disputing existing metrics.

Data column, kzg, custody metrics

| Name | Usage |
|------|-------|
| `beacon_data_column_sidecar_processing_requests_total` | Number of data column sidecars submitted for processing (counter) |
| `beacon_data_column_sidecar_processing_successes_total` | Number of data column sidecars verified for gossip (counter) |
| `beacon_data_column_sidecar_gossip_verification_seconds` | Full runtime of data column sidecars gossip verification (histogram) |
| `beacon_data_availability_reconstructed_columns_total` | Total count of reconstructed columns (counter) |
| `beacon_data_availability_reconstruction_time_seconds` | Time taken to reconstruct columns (histogram) |
| `beacon_data_column_sidecar_computation_seconds` | Time taken to compute data column sidecar, including cells, proofs and inclusion proof (histogram) |
| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Time taken to verify data column sidecar inclusion proof (histogram) |
| `beacon_kzg_verification_data_column_single_seconds` | Runtime of single data column kzg verification (histogram) |
| `beacon_kzg_verification_data_column_batch_seconds` | Runtime of batched data column kzg verification (histogram) |
| `beacon_custody_columns_count_total` | Total count of columns in custody within the data availability boundary (counter) |

KatyaRyazantseva · 2024-10-03T20:22:52Z

Clients' Status Overview

| Metric | Lighthouse | Teku | Grandine | Prysm |
|--------|------------|------|----------|-------|
| Raised issue | #6018 | #65 | | #14129 |
| `beacon_data_column_sidecar_processing_requests_total` | ✅ | ✅ | 📝 | □ |
| `beacon_data_column_sidecar_processing_successes_total` | ✅ | ✅ | 📝 | □ |
| `beacon_data_column_sidecar_gossip_verification_seconds` | ✅ | ✅ | 📝 | 📝 |
| `beacon_data_availability_reconstructed_columns_total` | 📝 | 📝 | □ | □ |
| `beacon_data_availability_reconstruction_time_seconds` | 📝 | 📝 | 📝 | □ |
| `beacon_data_column_sidecar_computation_seconds` | 📝 | □ | 📝 | □ |
| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | 📝 | □ | 📝 | □ |
| `beacon_kzg_verification_data_column_single_seconds` | 📝 | □ | □ | □ |
| `beacon_kzg_verification_data_column_batch_seconds` | 📝 | □ | 📝 | □ |
| `beacon_custody_columns_count_total` | □ | □ | □ | □ |

✅ - implemented
📝 - in progress, requiring adjustments
□ - not implemented

jimmygchen · 2024-11-05T01:09:14Z

metrics.md

+| `beacon_data_column_sidecar_computation_seconds`                  | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof                |  On data column sidecar computation            |
+| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof                          |  On data column sidecar inclusion proof verification  |
+| `beacon_kzg_verification_data_column_single_seconds`              | Histogram | Runtime of single data column kzg verification                                 | On single data column kzg verification  |
+| `beacon_kzg_verification_data_column_batch_seconds`               | Histogram | Runtime of batched data column kzg verification                                 | On batched data column kzg verification |


@KatyaRyazantseva in the last call you mentioned that there were proposals to record these in milliseconds for more granularity.

Lighthouse currently records these metrics in seconds with floating-point precision, which already provides a high degree of granularity, depending on how the histogram buckets are set.

The units for histograms in all Prometheus client libraries are standardised to seconds:

> A histogram SHOULD have the following methods:
>
> - Some way to time code for users in seconds. In Python this is the time() decorator/context manager. In Java this is startTimer/observeDuration. Units other than seconds MUST NOT be offered (if a user wants something else, they can do it by hand). This should follow the same pattern as Gauge/Summary.

https://prometheus.io/docs/instrumenting/writing_clientlibs/#histogram
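To make the granularity point concrete, here is a minimal sketch in plain Python (standard library only, not a real Prometheus client and not Lighthouse code). It assumes bucket upper bounds expressed in seconds; `bucket_for` and `time_call` are hypothetical helper names. A duration of a few milliseconds is an ordinary float such as `0.0035`, and the bucket boundaries, not the unit, decide how finely it is resolved.

```python
import bisect
import time

# Bucket upper bounds in seconds. Illustrative values only, not a recommendation.
BUCKETS = [0.002, 0.004, 0.006, 0.008, 0.01, 0.012, 0.015, 0.02, 0.03, 0.05, 0.07]


def bucket_for(seconds: float) -> str:
    """Return the `le` label of the first bucket whose upper bound covers the value."""
    index = bisect.bisect_left(BUCKETS, seconds)
    return str(BUCKETS[index]) if index < len(BUCKETS) else "+Inf"


def time_call(fn) -> float:
    """Run fn once and return the elapsed wall time in float seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = time_call(lambda: sum(range(10_000)))
    print(f"{elapsed:.6f}s falls in bucket le={bucket_for(elapsed)}")
```

With these bounds a 3.5 ms observation lands in the `0.004` bucket, so millisecond-level differences remain visible without changing the unit.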

jimmygchen · 2024-11-05T01:09:52Z

metrics.md

+| `beacon_data_availability_reconstruction_time_seconds`            | Histogram | Time taken to reconstruct columns                                      | On data column kzg verification  |
+| `beacon_data_column_sidecar_computation_seconds`                  | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof                |  On data column sidecar computation            |
+| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof                          |  On data column sidecar inclusion proof verification  |
+| `beacon_kzg_verification_data_column_single_seconds`              | Histogram | Runtime of single data column kzg verification                                 | On single data column kzg verification  |


As discussed on the call, this metric can be removed.

jimmygchen · 2024-11-05T01:18:07Z

metrics.md

+
+| Name | Metric type | Usage | Sample collection event |
+|--------------------------------------------|-------------|-------------------------------------------------------------|----------------------|
+| `beacon_data_column_sidecar_processing_requests_total`            | Counter   | Number of data column sidecars submitted for processing                     | On data column sidecar gossip verification     |


The name and description may be a bit confusing here, as "processing" could mean different things in different clients. The description makes sense in Lighthouse because we have a task scheduling component called BeaconProcessor, and all the tasks are "submitted" to this scheduler for processing - so you see this terminology used quite frequently in our metrics.

It's more of a convenience metric than a must-have in Lighthouse, as we can get the same data with count(beacon_data_column_sidecar_gossip_verification_seconds). IMO we should probably minimise the number of standardised metrics, so we don't force all clients to implement metrics that aren't necessary for them. Each one comes with a maintenance cost (once introduced, renaming or removing it would be a breaking change) as well as extra Prometheus storage cost.

It may still make sense to have this metric though.

For a few of our metrics we discard the timer if the operation fails, so the histogram count may not reflect the actual attempt count.

Is it also worth mentioning whether we should discard a timer metric if the operation fails?

Ah right, I see this is already implemented in Teku. If it makes sense, feel free to ignore the above comment!
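A rough sketch of why the two numbers can diverge, in plain Python rather than any client's actual code. `GossipVerificationMetrics` and its fields are hypothetical stand-ins for the requests counter, the successes counter and the verification histogram. It assumes the "discard the timer on failure" policy described above: a failed verification increments the request count but records no duration.

```python
import time
from contextlib import contextmanager


class GossipVerificationMetrics:
    """Hypothetical in-memory stand-in for two counters and one histogram."""

    def __init__(self) -> None:
        self.requests_total = 0
        self.successes_total = 0
        self.durations = []  # observed durations in seconds (successes only)

    @contextmanager
    def timed(self):
        # Every attempt is counted, whether or not it succeeds.
        self.requests_total += 1
        start = time.perf_counter()
        # If the body raises, the exception propagates from this yield and
        # the lines below are skipped, so the timer is discarded.
        yield
        self.durations.append(time.perf_counter() - start)
        self.successes_total += 1


if __name__ == "__main__":
    metrics = GossipVerificationMetrics()
    with metrics.timed():
        pass  # a verification that succeeds
    try:
        with metrics.timed():
            raise ValueError("invalid sidecar")  # a verification that fails
    except ValueError:
        pass
    print(metrics.requests_total, metrics.successes_total, len(metrics.durations))
```

Under this policy the histogram's observation count tracks successes, so a separate requests counter is the only place failed attempts show up.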

jimmygchen · 2024-11-05T01:46:52Z

metrics.md

+| `beacon_data_column_sidecar_gossip_verification_seconds`          | Histogram | Full runtime of data column sidecars gossip verification                   | On data column sidecar gossip verification     |
+| `beacon_data_availability_reconstructed_columns_total`            | Counter   | Total count of reconstructed columns                                      | On data column kzg verification  |
+| `beacon_data_availability_reconstruction_time_seconds`            | Histogram | Time taken to reconstruct columns                                      | On data column kzg verification  |
+| `beacon_data_column_sidecar_computation_seconds`                  | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof                |  On data column sidecar computation            |


Is it worth suggesting histogram buckets? e.g. [0.1, 0.15, 0.25, 0.35, 0.5, 0.7, 1.0, 2.5, 5.0, 10.0]

We've also implemented a blob_count label, so that we can see individual timings for different blob counts.
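As an illustration of how the suggested buckets and a `blob_count` label could fit together, here is a small standard-library sketch. `LabelledHistogram` is a hypothetical class, not the Lighthouse implementation or a Prometheus client API. It keeps one row of bucket counts per label value and reports them cumulatively, the way `le` buckets are exposed.

```python
import bisect
from collections import defaultdict
from itertools import accumulate

# Bucket upper bounds in seconds, taken from the suggestion above.
BUCKETS = [0.1, 0.15, 0.25, 0.35, 0.5, 0.7, 1.0, 2.5, 5.0, 10.0]


class LabelledHistogram:
    """Hypothetical histogram with a single `blob_count` label."""

    def __init__(self, buckets) -> None:
        self.buckets = list(buckets)
        # One row per label value: a count per bucket plus a final +Inf slot.
        self.counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))

    def observe(self, blob_count: int, seconds: float) -> None:
        # bisect_left picks the first bucket whose upper bound is >= seconds.
        self.counts[blob_count][bisect.bisect_left(self.buckets, seconds)] += 1

    def cumulative(self, blob_count: int) -> list:
        """Cumulative counts per bucket; the last entry is the total."""
        return list(accumulate(self.counts[blob_count]))


if __name__ == "__main__":
    hist = LabelledHistogram(BUCKETS)
    hist.observe(blob_count=1, seconds=0.12)
    hist.observe(blob_count=6, seconds=0.6)
    hist.observe(blob_count=6, seconds=12.0)  # beyond the last bound, counted under +Inf
    print(hist.cumulative(1))
    print(hist.cumulative(6))
```

Each distinct label value adds a full set of bucket series, so the label's cardinality is worth weighing against the storage cost of the extra series.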

jimmygchen · 2024-11-05T01:49:06Z

metrics.md

+| `beacon_data_column_sidecar_computation_seconds`                  | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof                |  On data column sidecar computation            |
+| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof                          |  On data column sidecar inclusion proof verification  |
+| `beacon_kzg_verification_data_column_single_seconds`              | Histogram | Runtime of single data column kzg verification                                 | On single data column kzg verification  |
+| `beacon_kzg_verification_data_column_batch_seconds`               | Histogram | Runtime of batched data column kzg verification                                 | On batched data column kzg verification |


Our current buckets for this histogram in case it helps:
0.002, 0.004, 0.006, 0.008, 0.01, 0.012, 0.015, 0.02, 0.03, 0.05, 0.07

jimmygchen · 2024-11-05T01:52:32Z

metrics.md

+| `beacon_data_column_sidecar_processing_successes_total`           | Counter   | Number of data column sidecars verified for gossip                         | On data column sidecar gossip verification     |
+| `beacon_data_column_sidecar_gossip_verification_seconds`          | Histogram | Full runtime of data column sidecars gossip verification                   | On data column sidecar gossip verification     |
+| `beacon_data_availability_reconstructed_columns_total`            | Counter   | Total count of reconstructed columns                                      | On data column kzg verification  |
+| `beacon_data_availability_reconstruction_time_seconds`            | Histogram | Time taken to reconstruct columns                                      | On data column kzg verification  |


A few more reconstruction metrics that could be useful:

kzg_data_column_reconstruction_attempts: Count of times data column reconstruction has been attempted

kzg_data_column_reconstruction_failures: Count of times data column reconstruction has failed

Lighthouse source for this here

jimmygchen · 2024-11-05T01:53:27Z

metrics.md

+| `beacon_data_column_sidecar_inclusion_proof_verification_seconds` | Histogram | Time taken to verify data column sidecar inclusion proof                          |  On data column sidecar inclusion proof verification  |
+| `beacon_kzg_verification_data_column_single_seconds`              | Histogram | Runtime of single data column kzg verification                                 | On single data column kzg verification  |
+| `beacon_kzg_verification_data_column_batch_seconds`               | Histogram | Runtime of batched data column kzg verification                                 | On batched data column kzg verification |
+| `beacon_custody_columns_count_total`                              | Counter     | Total count of columns in custody within the data availability boundary                                     | On custody collecting and verification |


Is there another metric for tracking the custody count?

KatyaRyazantseva added 6 commits August 22, 2024 11:52

add draft PeerDAS metrics

7976fc5

correct custody events

85262be

Gossip and req/resp metrics

024a544

add labels for metrics

f748c61

typo fix

4cdc1e1

add data column, kzg, custody metrics

0384ab4

KatyaRyazantseva mentioned this pull request Oct 23, 2024

Fix 'data_column_sidecar_computation' metric align with PeerDAS metrics specs prysmaticlabs/prysm#14574

Open

3 tasks

jimmygchen reviewed Nov 5, 2024
