Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge unstable deneb june 6th #4477

Merged

Conversation

realbigsean
Copy link
Member

Issue Addressed

Which issue # does this PR address?

Proposed Changes

Please list or describe the changes introduced by this PR.

Additional Info

Please provide any additional information. For example, future considerations
or information useful for reviewers.

int88 and others added 11 commits June 29, 2023 09:39
## Issue Addressed

sigp#4314 

## Proposed Changes

avoid relocking head during builder health check

## Additional Info

NA
## Issue Addressed

 - sigp#4293 
 - sigp#4264 

## Proposed Changes

*Changes largely follow those suggested in the main issue*.

 - Add new routes to HTTP API
   - `post_beacon_blocks_v2`
   - `post_blinded_beacon_blocks_v2`
 - Add new routes to `BeaconNodeHttpClient`
   - `post_beacon_blocks_v2`
   - `post_blinded_beacon_blocks_v2`
 - Define new Eth2 common types
   - `BroadcastValidation`, enum representing the level of validation to apply to blocks prior to broadcast
   - `BroadcastValidationQuery`, the corresponding HTTP query string type for the above type
 - ~~Define `_checked` variants of both `publish_block` and `publish_blinded_block` that enforce a validation level at a type level~~
 - Add interactive tests to the `bn_http_api_tests` test target covering each validation level (to their own test module, `broadcast_validation_tests`)
   - `beacon/blocks`
       - `broadcast_validation=gossip`
         - Invalid (400)
         - Full Pass (200)
         - Partial Pass (202)
        - `broadcast_validation=consensus`
          - Invalid (400)
          - Only gossip (400)
          - Only consensus pass (i.e., equivocates) (200)
          - Full pass (200)
        - `broadcast_validation=consensus_and_equivocation`
          - Invalid (400)
          - Invalid due to early equivocation (400)
          - Only gossip (400)
          - Only consensus (400)
          - Pass (200)
   - `beacon/blinded_blocks`
       - `broadcast_validation=gossip`
         - Invalid (400)
         - Full Pass (200)
         - Partial Pass (202)
        - `broadcast_validation=consensus`
          - Invalid (400)
          - Only gossip (400)
          - ~~Only consensus pass (i.e., equivocates) (200)~~
          - Full pass (200)
        - `broadcast_validation=consensus_and_equivocation`
          - Invalid (400)
          - Invalid due to early equivocation (400)
          - Only gossip (400)
          - Only consensus (400)
          - Pass (200)
 - Add a new trait, `IntoGossipVerifiedBlock`, which allows type-level guarantees to be made as to gossip validity
 - Modify the structure of the `ObservedBlockProducers` cache from a `(slot, validator_index)` mapping to a `((slot, validator_index), block_root)` mapping
 - Modify `ObservedBlockProducers::proposer_has_been_observed` to return a `SeenBlock` rather than a boolean on success
 - Punish gossip peer (low) for submitting equivocating blocks
 - Rename `BlockError::SlashablePublish` to `BlockError::SlashableProposal`

## Additional Info

This PR contains changes that directly modify how blocks are verified within the client. For more context, consult [comments in-thread](sigp#4316 (comment)).


Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed

[sigp#4259](sigp#4259)

## Proposed Changes

debounce spammy `Unable to send message to the beacon processor` log messages

## Additional Info

We could potentially debounce other logs that have the potential to be "spammy". 

After some feedback we decided to additionally add the following change:

create a newtype wrapper around `mpsc::Sender<BeaconWorkEvent<T>>`. When there is an error on the try_send method on the wrapper, we increase a counter metric with one label per work type.
## Issue Addressed

[sigp#4292](sigp#4292)

## Proposed Changes

Updated the node health endpoint

will return a 200 status code if  `!syncing && !el_offline && !optimistic`

wil return a 206 if `(syncing || optimistic) &&  !el_offline`

will return a 503 if `el_offline`



## Additional Info
…tion (sigp#4362)

## Issue Addressed

sigp#4118 

## Proposed Changes

This PR introduces a "progressive balances" cache on the `BeaconState`, which keeps track of the accumulated target attestation balance for the current & previous epochs. The cached values are utilised by fork choice to calculate unrealized justification and finalization (instead of converting epoch participation arrays to balances for each block we receive).

This optimization will be rolled out gradually to allow for more testing. A new `--progressive-balances disabled|checked|strict|fast` flag is introduced to support this:
- `checked`: enabled with checks against participation cache, and falls back to the existing epoch processing calculation if there is a total target attester balance mismatch. There is no performance gain from this as the participation cache still needs to be computed. **This is the default mode for now.**
- `strict`: enabled with checks against participation cache, returns error if there is a mismatch. **Used for testing only**.
- `fast`: enabled with no comparative checks and without computing the participation cache. This mode gives us the performance gains from the optimization. This is still experimental and not currently recommended for production usage, but will become the default mode in a future release.
- `disabled`: disable the usage of progressive cache, and use the existing method for FFG progression calculation. This mode may be useful if we find a bug and want to stop the frequent error logs.

### Tasks

- [x] Initial cache implementation in `BeaconState`
- [x] Perform checks in fork choice to compare the progressive balances cache against results from `ParticipationCache`
- [x] Add CLI flag, and disable the optimization by default
- [x] Testing on Goerli & Benchmarking
- [x]  Move caching logic from state processing to the `ProgressiveBalancesCache` (see [this comment](sigp#4362 (comment)))
- [x] Add attesting balance metrics



Co-authored-by: Jimmy Chen <jimmy@sigmaprime.io>
We now officially have ipv6 support. The mainnet bootnodes have been updated to support ipv6. This PR updates lighthouse's internal bootnodes for mainnet to avoid fetching them on initial load.
## Issue Addressed

NA

## Proposed Changes

Bump versions

## Additional Info

NA
## Issue Addressed

sigp#4331 

## Proposed Changes

 - Use comparison rather than strict equality between the earliest epoch we know about and the backfill target (which will be the most recent WSP by default or genesis)
 - Add helper function `BackFillSync<T>::would_complete` to achieve this in one location

## Additional Info

 - There's an ad hoc test for this in sigp#4461


Co-authored-by: Age Manning <Age@AgeManning.com>
michaelsproul and others added 9 commits July 8, 2023 13:16
## Issue Addressed

[Users on Twitter](https://twitter.com/ashekhirin/status/1676334843192397824) are getting checkpoint sync URL timeouts with the default of 60s, so this PR increases the default timeout to 3 minutes.

I've also added a short section to the book about adjusting the timeout with `--checkpoint-sync-url-timeout`.
*Replaces sigp#4434. It is identical, but this PR has a smaller diff due to a curated commit history.*

## Issue Addressed

NA

## Proposed Changes

This PR moves the scheduling logic for the `BeaconProcessor` into a new crate in `beacon_node/beacon_processor`. Previously it existed in the `beacon_node/network` crate.

This addresses a circular-dependency problem where it's not possible to use the `BeaconProcessor` from the `beacon_chain` crate. The `network` crate depends on the `beacon_chain` crate (`network -> beacon_chain`), but importing the `BeaconProcessor` into the `beacon_chain` crate would create a circular dependancy of `beacon_chain -> network`.

The `BeaconProcessor` was designed to provide queuing and prioritized scheduling for messages from the network. It has proven to be quite valuable and I believe we'd make Lighthouse more stable and effective by using it elsewhere. In particular, I think we should use the `BeaconProcessor` for:

1. HTTP API requests.
1. Scheduled tasks in the `BeaconChain` (e.g., state advance).

Using the `BeaconProcessor` for these tasks would help prevent the BN from becoming overwhelmed and would also help it to prioritize operations (e.g., choosing to process blocks from gossip before responding to low-priority HTTP API requests).

## Additional Info

This PR is intended to have zero impact on runtime behaviour. It aims to simply separate the *scheduling* code (i.e., the `BeaconProcessor`) from the *business logic* in the `network` crate (i.e., the `Worker` impls). Future PRs (see sigp#4462) can build upon these works to actually use the `BeaconProcessor` for more operations.

I've gone to some effort to use `git mv` to make the diff look more like "file was moved and modified" rather than "file was deleted and a new one added". This should reduce review burden and help maintain commit attribution.
## Proposed Changes

* Add `lcli state-root` command for computing the hash tree root of a `BeaconState`.
* Add a `--network` flag which can be used instead of `--testnet-dir` to set the network, e.g. Mainnet, Goerli, Gnosis.
* Use the new network flag in `transition-blocks`, `skip-slots`, and `block-root`, which previously only supported mainnet.
* **BREAKING CHANGE** Remove the default value of `~/.lighthouse/testnet` from `--testnet-dir`. This may have made sense in previous versions where `lcli` was more testnet focussed, but IMO it is an unnecessary complication and foot-gun today.
## Issue Addressed

sigp#4494 

## Proposed Changes

 - Remove explicit re-exports of various types to appease the new compiler lint

## Additional Info

It seems `warn(hidden_glob_reexports)` is the main culprit.
@realbigsean realbigsean added the ready-for-review The code is ready for review label Jul 12, 2023
realbigsean and others added 8 commits July 12, 2023 13:05
## Issue Addressed

Addresses issue:  sigp#4459

## Proposed Changes

`cargo update -p proc-macro2`

proc-macro2 v1.0.58 -> v1.0.63

## Additional Info

See: sigp#4459 / rust-lang/rust#113152
## Issue Addressed

Fix an issue observed by `@zlan` on Discord where Lighthouse would sometimes return this error when looking up states via the API:

> {"code":500,"message":"UNHANDLED_ERROR: ForkChoiceError(MissingProtoArrayBlock(0xc9cf1495421b6ef3215d82253b388d77321176a1dcef0db0e71a0cd0ffc8cdb7))","stacktraces":[]}

## Proposed Changes

The error stems from a faulty assumption in the HTTP API logic: that any state in the hot database must have its block in fork choice. This isn't true because the state's hot database may update much less frequently than the fork choice store, e.g. if reconstructing states (where freezer migration pauses), or if the freezer migration runs slowly. There could also be a race between loading the hot state and checking fork choice, e.g. even if the finalization migration of DB+fork choice were atomic, the update could happen between the 1st and 2nd calls.

To address this I've changed the HTTP API logic to use the finalized block's execution status as a fallback where it is safe to do so. In the case where a block is non-canonical and prior to finalization (permanently orphaned) we default `execution_optimistic` to `true`.

## Additional Info

I've also added a new CLI flag to reduce the frequency of the finalization migration as this is useful for several purposes:

- Spacing out database writes (less frequent, larger batches)
- Keeping a limited chain history with high availability, e.g. the last month in the hot database.

This new flag made it _substantially_ easier to test this change. It was extracted from `tree-states` (where it's called `--db-migration-period`), which is why this PR also carries the `tree-states` label.
## Issue Addressed

When trying to run `eth1-sim` locally, the simulator doesn't start for me, and panicked due to duplicate arg names for `proposer-nodes` (using same arg names as `nodes`). Not sure why this isn't failing on CI but failing on mine 🤔 

```
thread 'main' panicked at 'Argument short must be unique
thread 'main' panicked at 'Argument long must be unique
```
## Issue Addressed

sigp#4488 

## Proposed Changes

 - Remove all instances of the `required` modifier where we have a default value specified for a subcommand

## Additional Info

N/A
michaelsproul and others added 7 commits July 17, 2023 00:14
## Issue Addressed

Fixes occasional compilation errors with mev-rs (see sigp#4456).

## Proposed Changes

- Update `mev-rs` to the latest version, which allows us to remove hacky `[patch]` sections
- Update the `axum` version used in `watch` so LH only uses a single version
## Proposed Changes

Replace `wget` in the EF-tests makefile with `curl`.

On macOS `curl` is pre-installed, and I found myself making this change to avoid installing `wget`.

The `-L` flag is used to follow redirects which is useful if a repo gets renamed, and more similar to `wget`'s default behaviour.
## Issue Addressed

Addresses an issue where CI could fail due to an nonexisting file error:

```
Run ./clean.sh
rm: cannot remove '/home/runner/.lighthouse/local-testnet/geth_datadir4/geth/fastcache.tmp.1549331618': No such file or directory
Error: Process completed with exit code 1.
```

This seems to happen quite frequently now, I'm not sure exactly why but perhaps worth trying suppressing the prompt?

https://github.com/sigp/lighthouse/actions/runs/5455027574/jobs/9925916159?pr=4463
## Issue Addressed

This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. 

> FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 })

This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. 

Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914

## Proposed Changes

The quick fix applied here adds a timeout to node startup and restarts the node again.

- Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners.
- Wrap the startup code in a retry function, that allows for 3 retries before returning an error.

## Additional Info

We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
## Issue Addressed

N/A

## Proposed Changes

Add lints for rust 1.71

[3789134](sigp@3789134) is probably the one that needs most attention as it changes beacon state code. I changed the `is_in_inactivity_leak ` function to return a `ArithError` as not all consumers of that function work well with a `BeaconState::Error`.
## Issue Addressed

Speed up CI by installing foundry with Github action instead of building Anvil from source.

Building anvil from source on GItHub hosted runners currently takes about 10 mins. Using the `foundry-toolchain` action to install only takes about 2 seconds.
…sts (sigp#4485)

## Issue Addressed
n/a Noticed this while working on something else

## Proposed Changes
- leverage the appropriate types to avoid a bunch of `unwrap` and errors

## Additional Info
n/a
@jimmygchen
Copy link
Member

@realbigsean
I made a quick attempt to merge latest unstable into this branch before I realized you may have already done some of the work on the 14th July branch!

The branch builds and passes lint, but I'm not sure about the correctness of the merge since it appeared to be a tricky one. I've added comments on places where I added additional code due to the BeaconProcessor refactor. Feel free to use it only if it helps!

realbigsean#37

Copy link
Member

@pawanjay176 pawanjay176 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I mainly looked at changes in deneb stuff resulting from the merge and paid some extra attention to changes resulting from:

  1. Broadcast validation PR
  2. Changes in beacon processor because of moving it to a separate module

I'm satisfied with the changes. Amazing work with the changes, the conflicts would have been nasty.
Feel free to merge (i think there's one conflict remaining still)

@@ -2798,6 +2831,9 @@ impl<T: BeaconChainTypes> BeaconChain<T> {
notify_execution_layer,
)?;

//TODO(sean) error handling?
Copy link
Member

@pawanjay176 pawanjay176 Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably something that makes more sense in the original PR. I say we keep it as is here and change it only if required for something deneb specific.
Also, the errors are being handled at the calling site as well, so not too much of an issue imo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think the thing here was I was wondering if I should pass the error through handle_block_error, but that method just prints out a log. I think publish_fn handles the logging fine, there's only a couple error cases and they're logged or returned via API errors

@pawanjay176
Copy link
Member

@realbigsean merged deneb-free-blobs to get CI going on this

@realbigsean realbigsean merged commit f98671f into sigp:deneb-free-blobs Jul 17, 2023
7 checks passed
@realbigsean realbigsean deleted the merge-unstable-deneb-june-6th branch November 21, 2023 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deneb ready-for-review The code is ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.