
Add timing for block availability #5510

Merged
merged 10 commits into sigp:unstable from block-availability-metric on Apr 23, 2024

Conversation

@michaelsproul (Member) commented Apr 2, 2024

Issue Addressed

Presently the block import metrics that measure block observation delay and import delay (time to import a block) are unaware of blob-processing and related delays.

This can lead to the import_delay seeming artificially high because it also factors in the time spent waiting for blobs to arrive.

Proposed Changes

  • Add an available timestamp to the block delay cache which records when a block became fully available.
  • Add a similar attestable timestamp that records when a block became capable of being attested to (this allows us to ignore the non-critical disk writes and similar that are present in the import_delay). Revived from an old PR: Add basic cause analysis for delayed head blocks #3232.
  • Log these delays every slot (not just for late blocks), in milliseconds.
  • Add a debug log with the exact time taken by each newPayload call to the EL. This helps per-block investigations and is clearer and more accurate than the metric histogram. We can also use the exact ms value combined with log timestamp and the observed_delay to calculate the time taken doing other stuff prior to calling the EL (log_timestamp - slot_start - time_taken_ms - observed_delay).
  • Add matching prometheus metrics.
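
The timing arithmetic described above (time spent before calling the EL) can be sketched as follows. This is a minimal illustration; the function and parameter names are assumptions for the sketch, not Lighthouse's actual identifiers:

```rust
use std::time::Duration;

/// Given the wall-clock timestamp of the `newPayload` debug log, the start
/// of the slot, the exact time spent inside the EL call, and how long after
/// the slot start the block was first observed, estimate the time spent on
/// other work (gossip verification, queueing, ...) before calling the EL.
fn pre_el_overhead(
    log_timestamp: Duration,  // time of the debug log, since the UNIX epoch
    slot_start: Duration,     // slot start time, since the UNIX epoch
    time_taken: Duration,     // exact duration of the newPayload call
    observed_delay: Duration, // delay from slot start until block observed
) -> Duration {
    // log_timestamp - slot_start - time_taken_ms - observed_delay
    log_timestamp - slot_start - time_taken - observed_delay
}

fn main() {
    // Hypothetical values: log written 900 ms into the slot, EL call took
    // 300 ms, block observed 450 ms after slot start.
    let overhead = pre_el_overhead(
        Duration::from_millis(1_700_000_900),
        Duration::from_millis(1_700_000_000),
        Duration::from_millis(300),
        Duration::from_millis(450),
    );
    println!("{}", overhead.as_millis()); // 900 - 300 - 450 = 150
}
```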

Additional Info

The available delay currently reported by this PR is much higher than I expected, and sometimes seems to include things like snapshot cache misses. I need to do some more investigating to make sure I've put the observation of availability at the earliest possible point.

let available_timestamp = self
    .slot_clock
    .now_duration()
    .ok_or(AvailabilityCheckError::SlotClockError)?;
results.push(MaybeAvailableBlock::Available(AvailableBlock {
Member

I'm fairly certain this function is only called during range-sync or backfill. So these blocks are not related to the current slot. Also due to the way sync works, none of the blocks hit this cache until all the data for all the blocks in the request are downloaded. Thus it's a bit meaningless to measure how far into the current slot they became available. Might be a good idea to exclude them from the data?

Member Author

got it!

@@ -543,6 +567,8 @@ pub struct AvailableBlock<E: EthSpec> {
    block_root: Hash256,
    block: Arc<SignedBeaconBlock<E>>,
    blobs: Option<BlobSidecarList<E>>,
    /// Timestamp at which this block first became available (UNIX timestamp, time since 1970).
    available_timestamp: Duration,
}
Member

Storing the timestamp here will force you to include the timestamp for backfill & range-sync blocks, which will only pollute the data. I think a better way would be to include this timestamp in the AvailableExecutedBlock & then populate it inside make_available(). This will provide what I believe you actually want, as make_available() is the final destination of:

which are the 3 entry points for which new data can come in & complete a block.

The other entry points:

Don't actually hit the DA cache as they assume all blobs are already present.
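The shape of the suggestion above might look something like this. This is a hedged sketch with stand-in types: `make_available` and `AvailableExecutedBlock` are named in the PR, but the struct bodies and the timestamping logic here are simplified illustrations, not Lighthouse's actual code:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Stand-in type: the real AvailableBlock carries a beacon block, blobs, etc.
struct AvailableBlock;

struct AvailableExecutedBlock {
    block: AvailableBlock,
    /// Time since the UNIX epoch at which the block became fully available.
    available_timestamp: Duration,
}

/// Sketch of the reviewer's suggestion: stamp the time inside
/// `make_available`, a single point through which newly-completed blocks
/// pass, so that range-sync / backfill blocks (which never hit this path)
/// don't pollute the metric.
fn make_available(block: AvailableBlock) -> AvailableExecutedBlock {
    let available_timestamp = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock set before 1970");
    AvailableExecutedBlock { block, available_timestamp }
}

fn main() {
    let executed = make_available(AvailableBlock);
    println!("{}", executed.available_timestamp.as_secs() > 0);
}
```

The design point being made is about placement, not mechanism: whichever struct holds the field, the stamp should be taken only on the code path that live (non-sync) blocks traverse.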

Member Author

I tried doing this just now but got stuck because to fill in the available_timestamp on AvailableExecutedBlock here would require it to be part of the AvailableBlock anyway:

MaybeAvailableBlock::Available(available_block) => {
    Self::Available(AvailableExecutedBlock::new(
        available_block,
        import_data,
        payload_verification_outcome,
    ))
}

@AgeManning AgeManning changed the base branch from stable to unstable April 3, 2024 00:08
@AgeManning AgeManning added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Apr 21, 2024
@michaelsproul (Member Author)

@AgeManning I pushed a commit here 73cb982 which I think fixes the issues with RPC blobs not being recorded. We now record a seen timestamp for every blob, and when we call make_available we take the max of the seen timestamps to calculate the blobs_available_timestamp (this is similar to what @ethDreamer suggested here: #5510 (comment)).

From watching the logs it seems to produce similar results. If you could try your metrics dash with the latest commit and confirm that it looks as expected, that would be great 🙏
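
The max-of-seen-timestamps logic described above can be sketched like this; `blobs_available_timestamp` comes from the comment, while the function signature and types here are simplified assumptions:

```rust
use std::time::Duration;

/// A block only becomes available once its *last* blob arrives (via gossip
/// or RPC), so the availability timestamp is the max of the per-blob seen
/// timestamps. Returns None when the block has no blobs.
fn blobs_available_timestamp(seen_timestamps: &[Duration]) -> Option<Duration> {
    seen_timestamps.iter().copied().max()
}

fn main() {
    let seen = [
        Duration::from_millis(1_200),
        Duration::from_millis(3_400), // last blob to arrive
        Duration::from_millis(2_100),
    ];
    println!("{:?}", blobs_available_timestamp(&seen)); // the 3_400 ms entry
}
```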

@michaelsproul (Member Author)

@Mergifyio queue

mergify bot commented Apr 23, 2024

queue

🛑 The pull request has been removed from the queue default

The queue conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@michaelsproul michaelsproul added ready-for-merge This PR is ready to merge. backwards-incompat Backwards-incompatible API change and removed ready-for-review The code is ready for review labels Apr 23, 2024
@michaelsproul (Member Author)

Marking as a breaking change because this PR deletes and changes some metrics related to block processing.

mergify bot added a commit that referenced this pull request Apr 23, 2024
@michaelsproul (Member Author)

@Mergifyio queue

mergify bot commented Apr 23, 2024

queue

🛑 The pull request has been removed from the queue default

The queue conditions cannot be satisfied due to failing checks.


mergify bot added a commit that referenced this pull request Apr 23, 2024
@michaelsproul (Member Author)

@Mergifyio requeue

mergify bot commented Apr 23, 2024

requeue

✅ This pull request will be re-embarked automatically

The followup queue command will be automatically executed to re-embark the pull request

mergify bot commented Apr 23, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 72a3360

mergify bot added a commit that referenced this pull request Apr 23, 2024
@mergify mergify bot merged commit 72a3360 into sigp:unstable Apr 23, 2024
26 of 27 checks passed
@michaelsproul michaelsproul deleted the block-availability-metric branch July 15, 2024 01:57
);
pub static ref BEACON_BLOCK_DELAY_HEAD_SLOT_START_EXCEEDED_TOTAL: Result<IntCounter> = try_create_int_counter(
    "beacon_block_delay_head_slot_start_exceeded_total",
    "A counter that is triggered when the duration between the start of the block's slot and the current time \
Collaborator

@michaelsproul why did you change from histogram to gauge here?

Member Author

that was @AgeManning

the thinking was to get exact values rather than buckets

Collaborator

hmm I guess the average observation frequency is 1 every 12 seconds so kinda ok? Still weird tho
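
The histogram-vs-gauge trade-off being debated here can be illustrated with a toy model (plain stdlib, deliberately not the prometheus crate): a histogram retains only per-bucket counts across observations, while a gauge keeps the exact last value but loses any observation made between scrapes.

```rust
/// Toy bucketed histogram: records which bucket each observation falls
/// into, losing the exact value (you only ever get bucket counts back).
struct ToyHistogram {
    bounds: Vec<u64>, // bucket upper bounds in ms
    counts: Vec<u64>, // one counter per bucket, plus an overflow (+Inf) bucket
}

impl ToyHistogram {
    fn new(bounds: Vec<u64>) -> Self {
        let n = bounds.len() + 1;
        Self { bounds, counts: vec![0; n] }
    }

    fn observe(&mut self, value_ms: u64) {
        let idx = self
            .bounds
            .iter()
            .position(|&bound| value_ms <= bound)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }
}

/// Toy gauge: keeps only the most recent exact value. With roughly one
/// block per 12-second slot and typical scrape intervals of 10-15 s, most
/// observations survive until the next scrape, which is the "kinda ok"
/// in the comment above.
struct ToyGauge {
    value_ms: u64,
}

fn main() {
    let mut hist = ToyHistogram::new(vec![250, 500, 1_000]);
    let mut gauge = ToyGauge { value_ms: 0 };
    for v in [120, 480, 730] {
        hist.observe(v);
        gauge.value_ms = v; // overwrites: only the last value survives
    }
    println!("{:?} {}", hist.counts, gauge.value_ms); // [1, 1, 1, 0] 730
}
```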

Labels: backwards-incompat (Backwards-incompatible API change), deneb, ready-for-merge (This PR is ready to merge.), UX-and-logs

4 participants