Add timing for block availability #5510
Conversation
let available_timestamp = self
    .slot_clock
    .now_duration()
    .ok_or(AvailabilityCheckError::SlotClockError)?;
results.push(MaybeAvailableBlock::Available(AvailableBlock {
I'm fairly certain this function is only called during range-sync or backfill. So these blocks are not related to the current slot. Also due to the way sync works, none of the blocks hit this cache until all the data for all the blocks in the request are downloaded. Thus it's a bit meaningless to measure how far into the current slot they became available. Might be a good idea to exclude them from the data?
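If range-sync and backfill blocks were to be excluded from the measurement, one hedged way to sketch the filter (names are illustrative, not actual Lighthouse APIs) is to only record timing for blocks whose slot matches the clock's current slot:

```rust
/// Illustrative sketch: only record an availability timestamp for blocks
/// from the current slot, skipping range-sync/backfill blocks whose
/// timing relative to "now" is meaningless. Not Lighthouse's actual API.
fn should_record_availability(block_slot: u64, current_slot: u64) -> bool {
    block_slot == current_slot
}
```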
got it!
@@ -543,6 +567,8 @@ pub struct AvailableBlock<E: EthSpec> {
    block_root: Hash256,
    block: Arc<SignedBeaconBlock<E>>,
    blobs: Option<BlobSidecarList<E>>,
    /// Timestamp at which this block first became available (UNIX timestamp, time since 1970).
    available_timestamp: Duration,
}
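As a rough standalone illustration of how such a timestamp could be consumed (the helper name is an assumption, not from the PR), the delay into the slot is the availability time minus the slot's start time, both measured as durations since the UNIX epoch:

```rust
use std::time::Duration;

/// Hypothetical helper: how far into its slot the block became available.
/// Returns None if the timestamp precedes the slot start (e.g. clock skew).
fn availability_delay(available_timestamp: Duration, slot_start: Duration) -> Option<Duration> {
    available_timestamp.checked_sub(slot_start)
}
```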
Storing the timestamp here will force you to include the timestamp for backfill & range-sync blocks, which will only pollute the data. I think a better way would be to include this timestamp in the AvailableExecutedBlock & then populate it inside make_available(). This will probably capture what I believe you actually want, as make_available() is the final destination of the 3 entry points through which new data can come in & complete a block.
The other entry points don't actually hit the DA cache, as they assume all blobs are already present.
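The suggestion above, stamping the block inside make_available() at the moment its final component arrives, might look roughly like this (all types and fields are simplified stand-ins, not the real Lighthouse structs):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Simplified stand-in for the real pending-components bookkeeping.
struct PendingComponents {
    have_block: bool,
    blobs_received: usize,
    blobs_expected: usize,
}

struct AvailableExecutedBlock {
    available_timestamp: Duration,
}

/// Sketch: the timestamp is taken exactly when the block transitions to
/// "available", so backfill/range-sync paths that bypass this function
/// never get a timestamp at all.
fn make_available(c: &PendingComponents) -> Option<AvailableExecutedBlock> {
    if c.have_block && c.blobs_received == c.blobs_expected {
        let now = SystemTime::now().duration_since(UNIX_EPOCH).ok()?;
        Some(AvailableExecutedBlock { available_timestamp: now })
    } else {
        None
    }
}
```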
I tried doing this just now but got stuck, because filling in the available_timestamp on AvailableExecutedBlock here would require it to be part of the AvailableBlock anyway:
lighthouse/beacon_node/beacon_chain/src/block_verification_types.rs
Lines 194 to 200 in 3058b96
MaybeAvailableBlock::Available(available_block) => {
    Self::Available(AvailableExecutedBlock::new(
        available_block,
        import_data,
        payload_verification_outcome,
    ))
}
@AgeManning I pushed a commit here 73cb982 which I think fixes the issues with RPC blobs not being recorded. We now record a seen timestamp for every blob. From watching the logs it seems to produce similar results. If you could try your metrics dash with the latest commit and confirm that it looks as expected, that would be great 🙏
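The idea of recording a seen timestamp for every blob can be modelled as taking the max over the block's and blobs' seen times, since the block only becomes available when its last required component arrives. A simplified sketch, not the commit's actual code:

```rust
use std::time::Duration;

/// Simplified model: availability time = when the last required component
/// (the block itself or any of its blobs) was first seen.
fn availability_time(block_seen: Duration, blob_seen_times: &[Duration]) -> Duration {
    blob_seen_times
        .iter()
        .copied()
        .fold(block_seen, |latest, t| latest.max(t))
}
```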
@Mergifyio queue |
🛑 The pull request has been removed from the queue
Marking as a breaking change because this PR deletes and changes some metrics related to block processing.
@Mergifyio queue |
🛑 The pull request has been removed from the queue
@Mergifyio requeue |
✅ This pull request will be re-embarked automatically
✅ The pull request has been merged automatically at 72a3360
);
pub static ref BEACON_BLOCK_DELAY_HEAD_SLOT_START_EXCEEDED_TOTAL: Result<IntCounter> = try_create_int_counter(
    "beacon_block_delay_head_slot_start_exceeded_total",
    "A counter that is triggered when the duration between the start of the block's slot and the current time \
@michaelsproul why did you change from histogram to gauge here?
that was @AgeManning; the thinking was to get exact values rather than buckets
hmm I guess the average observation frequency is 1 every 12 seconds so kinda ok? Still weird tho
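For context on the trade-off being discussed: a histogram only retains which bucket each observation fell into, while a gauge keeps the exact last value, which matters when there is only about one observation per 12-second slot. A toy illustration of the precision lost to bucketing (the bucket bounds are made up, not Lighthouse's actual ones):

```rust
/// Toy bucketing: returns the index of the first bucket whose upper bound
/// is >= the value, which is all a histogram would remember about it.
fn bucket_for(value_ms: u64, upper_bounds_ms: &[u64]) -> usize {
    upper_bounds_ms
        .iter()
        .position(|&bound| value_ms <= bound)
        .unwrap_or(upper_bounds_ms.len()) // overflow bucket
}
```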
Issue Addressed
Presently the block import metrics that measure block observation delay and import delay (time to import a block) are unaware of blob-processing and related delays. This can lead to the import_delay seeming artificially high because it also factors in the time spent waiting for blobs to arrive.

Proposed Changes
- Add an available timestamp to the block delay cache which records when a block became fully available.
- Add an attestable timestamp that records when a block became capable of being attested to (this allows us to ignore the non-critical disk writes and similar that are present in the import_delay). Revived from an old PR: Add basic cause analysis for delayed head blocks #3232.
- Add a debug log with the exact time taken by each newPayload call to the EL. This helps per-block investigations and is clearer and more accurate than the metric histogram. We can also use the exact ms value combined with the log timestamp and the observed_delay to calculate the time taken doing other stuff prior to calling the EL (log_timestamp - slot_start - time_taken_ms - observed_delay).

Additional Info
The available delay currently reported by this PR is much higher than I expected, and sometimes seems to include things like snapshot cache misses. I need to do some more investigating to make sure I've put the observation of availability at the earliest possible point.
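The pre-EL breakdown from the Proposed Changes (log_timestamp - slot_start - time_taken_ms - observed_delay) is plain Duration arithmetic; a hedged sketch with illustrative names, not Lighthouse's actual code:

```rust
use std::time::Duration;

/// Time spent on "other stuff" before calling the EL, per the formula
/// log_timestamp - slot_start - time_taken_ms - observed_delay.
/// Returns None if any subtraction would underflow.
fn pre_el_time(
    log_timestamp: Duration,
    slot_start: Duration,
    el_time_taken: Duration,
    observed_delay: Duration,
) -> Option<Duration> {
    log_timestamp
        .checked_sub(slot_start)?
        .checked_sub(el_time_taken)?
        .checked_sub(observed_delay)
}
```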