Erigon3 upstream e4eb9fc #483

Merged
merged 91 commits into node-real:erigon3 from erigon3_upstream_e4eb9fc on Aug 20, 2024

Conversation

blxdyx
Collaborator

@blxdyx blxdyx commented Aug 19, 2024

Upstream latest erigon3

dvovk and others added 30 commits August 7, 2024 09:18
- Updated gopsutil version as it has improvements in getting process
and memory info.
`murmur3.New*` methods return an interface, and you need to call at least 3
methods on it; avoiding that brings `16ns` -> `11ns`


I also benchmarked `github.com/segmentio/murmur3` vs
`github.com/twmb/murmur3` on a 60-byte hashed string.
The 2nd is faster but adds asm deps, so I'm sticking with the pure-Go dep
(because asm deps are not friendly for cross-compilation); maybe I'll try it
later, after our new release pipeline is ready. Bench results:
intel: `20ns` -> `14ns`
amd: `31ns` -> `26ns`
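
For reference, a minimal benchmark sketch of the interface-vs-one-shot difference, assuming a spaolacci-compatible murmur3 API (`New64`/`Sum64`); the 60-byte key mirrors the input size above:

```go
package murmur3bench

import (
	"testing"

	"github.com/twmb/murmur3" // assumption: spaolacci-compatible API
)

// 60-byte input, matching the hashed-string size mentioned above.
var key = make([]byte, 60)

// Interface path: allocate a hash.Hash64, Write, then Sum64 - 3 calls minimum.
func BenchmarkMurmur3Interface(b *testing.B) {
	for i := 0; i < b.N; i++ {
		h := murmur3.New64()
		h.Write(key)
		_ = h.Sum64()
	}
}

// One-shot path: a single direct call, no interface or allocation.
func BenchmarkMurmur3OneShot(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = murmur3.Sum64(key)
	}
}
```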
Before this PR we called heimdall.Synchronize as part of
heimdall.CheckpointsFromBlock and heimdall.MilestonesFromBlock. The
previous implementation of Synchronize waited for all scrapers to
be synchronised.

This is inefficient because `heimdall.CheckpointsFromBlock` needs only
the `checkpoints` scraper to be synchronised. For the initial sync we
first only need to wait for the checkpoints to be downloaded, and then we
can start downloading blocks from devp2p. While we are doing that we can
let the spans and milestones be scraped in the background. Note this is
based on the fact that fetching checkpoints has been optimised with
bulk fetching and finishes in seconds, while fetching Spans has not yet
been optimised and for bor-mainnet can take a long time.

Changes in the PR:
- splits Synchronize into 3 more fine-grained SynchronizeCheckpoints,
SynchronizeMilestones and SynchronizeSpans calls which are invoked by
the Sync algorithm at the right time (see the sketch after this list)
- Optimises SynchronizeSpans to check if it already has the
corresponding span for the given block number before blocking
- Moves the synchronisation point for Spans and State Sync Events into
`Sync.commitExecution`, just before we call
ExecutionEngine.UpdateForkChoice, to make it clearer what data needs
to be synced before calling Execution
- Changes EventNotifier and Synchronize funcs to return an err if the ctx is
cancelled or other errors have happened
- Input consistency between the heimdallSynchronizer and
bridgeSynchronizer - use blockNum instead of *types.Header
- Interface tidy-ups
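
A hedged sketch of the finer-grained shape described above; the method names are taken from this description, while the store/scraper helpers (e.g. `HasSpanFor`) are hypothetical, not the actual erigon types:

```go
package heimdall

import "context"

// Finer-grained sync points, so each caller waits only for the data it needs.
type heimdallSynchronizer interface {
	SynchronizeCheckpoints(ctx context.Context) error
	SynchronizeMilestones(ctx context.Context) error
	SynchronizeSpans(ctx context.Context, blockNum uint64) error
}

type spanStore interface {
	HasSpanFor(blockNum uint64) (bool, error) // hypothetical helper
}

type scraper interface {
	Synchronize(ctx context.Context) error
}

type service struct {
	store       spanStore
	spanScraper scraper
}

// SynchronizeSpans first checks whether the span covering blockNum is already
// stored locally; only on a miss does it block on the spans scraper - never
// on the checkpoints or milestones scrapers.
func (s *service) SynchronizeSpans(ctx context.Context, blockNum uint64) error {
	ok, err := s.store.HasSpanFor(blockNum)
	if err != nil || ok {
		return err
	}
	return s.spanScraper.Synchronize(ctx)
}
```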
Make Cell unexported
Remove ProcessTree/Keys/Update
Reviewed and refreshed all unit/bench/fuzz tests related to commitment
erigontech#11326
Change test scheduling and timeouts after Ottersync introduction.
Now we can execute tests more frequently due to the significant
reduction in test time.

Scheduled to run every night:
- tip-tracking
- snap-download
- sync-from-scratch for mainnet, minimal node

Scheduled to run on Sunday:
- sync-from-scratch for testnets, archive node
- Collecting CPU and Memory usage info about all processes running on
the machine
- Running the loop 5 times with a 2-second delay to calculate averages
- Sort by CPU usage
- Write the result to a report file
Result:
![Screenshot 2024-08-07 at 18 40 08](https://github.com/user-attachments/assets/aac1264c-1eb9-4c8e-b6a6-7e248e37855a)
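
A minimal sketch of that sampling loop using gopsutil (the library mentioned above), printing to stdout here instead of the report file:

```go
package main

import (
	"fmt"
	"sort"
	"time"

	"github.com/shirou/gopsutil/v3/process"
)

func main() {
	const samples = 5
	cpuTotals := map[int32]float64{}
	names := map[int32]string{}

	// Sample all processes 5 times with a 2-second delay between rounds.
	for i := 0; i < samples; i++ {
		procs, err := process.Processes()
		if err != nil {
			panic(err)
		}
		for _, p := range procs {
			cpu, err := p.CPUPercent()
			if err != nil {
				continue // the process may have exited between calls
			}
			cpuTotals[p.Pid] += cpu
			if _, ok := names[p.Pid]; !ok {
				names[p.Pid], _ = p.Name()
			}
		}
		time.Sleep(2 * time.Second)
	}

	// Average and sort by CPU usage, descending.
	type row struct {
		pid int32
		avg float64
	}
	rows := make([]row, 0, len(cpuTotals))
	for pid, total := range cpuTotals {
		rows = append(rows, row{pid, total / samples})
	}
	sort.Slice(rows, func(i, j int) bool { return rows[i].avg > rows[j].avg })

	for _, r := range rows {
		fmt.Printf("%-8d %-30s %6.2f%%\n", r.pid, names[r.pid], r.avg)
	}
}
```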
closes erigontech#11173

Adds tests for the Heimdall Service which cover:
- Milestone scraping
- Span scraping
- Checkpoint scraping
- `Producers` API - compares the results with results from the
`bor_getSnapshotProposerSequence` RPC API
- Added totals for CPU and memory usage to the processes table
- Added CPU usage per core

Example output:
![Screenshot 2024-08-08 at 12 46 17](https://github.com/user-attachments/assets/ec0897d0-81c8-4436-bb65-527363157e76)
Forgot to silence the logging in the heimdall service tests in a
previous PR. The logging lvl can be tweaked in times of need if debugging
is necessary.
Refactored table utils to have an option to generate a table and return it
as a string, which will be used for saving data to a file.
…ot and added clearIndexing command (erigontech#11539)

Main checks:
* No gap in steps/blocks
* Check if all indexing present
* Check if all idx, history, domain files are present
closes erigontech#11177
- adds unwind logic to the new polygon sync stage which uses astrid
- seems like we've never had this running for bor heimdall, so removing the
empty funcs
Refactored CPU info printing:
- move CPU details to table
- move CPU usage next to details table
- refactor code
…ch#11549)

relates to:
erigontech#10734
erigontech#11387

Restart Erigon with the `SAVE_HEAP_PROFILE = true` env variable and
wait until we reach 45% or more alloc in stage_headers, when
"noProgressCounter >= 5" or "Rejected header marked as bad" occurs
and also move `design` into `docs` in order to reduce the number of
top-level directories
Before we had a transaction-wide cache (map).
Now I'm changing it to EVM-wide.
The EVM is a thread-unsafe object, so it's ok to use a thread-unsafe LRU.
And ExecV3 already uses 1 EVM per worker, which means we will share the
cache between blocks (not on chain-tip for now).

bench:
- on `mainnet`: it shows a 12% improvement on a large eth_getLogs call
(re-exec of a large historical range of blocks near block 6M) - on hot state.

About chain-tip:
- don't see much impact (even when making the cache global) - because the
current mainnet/bor-mainnet bottleneck is "flushing" changes to db. But
`integration loop_exec --unwind=2` shows a 5% improvement.
- in a future PR we can share 1 LRU across many new blocks - currently we
create a new one every stage loop iteration.
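
A hedged sketch of the idea using hashicorp's non-thread-safe `simplelru`; the field name, key type and size bound are illustrative, not the actual erigon types:

```go
package vm

import (
	lru "github.com/hashicorp/golang-lru/v2/simplelru"
)

type codeHash [32]byte

// The EVM is driven by exactly one worker goroutine at a time, so a
// non-thread-safe LRU is safe here; and because ExecV3 keeps one EVM per
// worker, the cache now outlives a single transaction/block.
type EVM struct {
	analysedCode *lru.LRU[codeHash, []byte] // hypothetical field
}

func NewEVM() *EVM {
	// Size-bounded, no eviction callback needed.
	cache, _ := lru.NewLRU[codeHash, []byte](4096, nil)
	return &EVM{analysedCode: cache}
}

// analysedOrAdd returns the cached analysis for h, computing and caching it
// on a miss.
func (evm *EVM) analysedOrAdd(h codeHash, analyse func() []byte) []byte {
	if v, ok := evm.analysedCode.Get(h); ok {
		return v
	}
	v := analyse()
	evm.analysedCode.Add(h, v)
	return v
}
```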
- added the flags which were applied to the run command to the report
Notable things:
Removed all testing on Total Difficulty or Difficulty as that is a PoW
concept.
Integration tests about difficulty were removed as they only test ETH PoW.
Removed difficulty checks in a test.
I had to tweak some hashes: since I removed difficulty computation,
some hashes were different in some tests (now hardcoded to 1).

- It is running on the tip of the chain too, YAY
The CL may be able to handle a non-error message better.
Eventually I plan on improving this corner case, as FCU has an 8s
timeout and I don't think the EL waits that long before declaring itself busy.
This causes some (or all) CLs to miss slots, as they don't know what
to do now.
part of erigontech#11149
All events are supported now
Giulio2002 and others added 27 commits August 15, 2024 07:52
Co-authored-by: alex.sharov <AskAlexSharov@gmail.com>
Tidying up bor contracts code:
- Rename `GenesisContractsClient` to `StateReceiver` for clarity
- Remove unused `LastStateId` func in `StateReceiver`
- Move `StateReceiver` into its own file `state_receiver.go`
- Create a gomock for it in `state_receiver_mock.go` and use in test
- Remove unused function input for `ChainSpanner`
kurtosis assertoor tests in ci

decision points:

- since the job takes about 45 minutes to run, I decided not to run it
on every PR, but rather we run it every 12 hours at 2AM/PM UTC.
- all tests are present in
https://github.com/ethpandaops/assertoor-test/tree/master/assertoor-tests.
This ci runs
[these](https://github.com/erigontech/erigon/pull/11464/files#diff-b45e49409c33f39133315225d30c02d8cfacfd9a53d1157bca61284683a4c498R22)
tests.
- test is for ubuntu only; not for mac or windows.

Many tests are failing, especially the validator-related ones. Tracked
separately - erigontech#11590
I'd like to add `MarshallKey`, `MarshallValue` (and corresponding
`Unmarshall`) receiver functions on `EventRecordWithTime` which hide
away the `abi` detail so callers don't have to deal with the ABI as a
function input everywhere in the code base - PR
erigontech#11620

However, there is a circular dependency which blocks me from doing that:
`EventRecordWithTime` is in package `polygon/heimdall`, which imports
the `polygon/bor` package for `bor.GenesisContractStateReceiverABI`,
while `polygon/bor` imports `polygon/heimdall` for
`EventRecordWithTime` due to the [bor event fallback
hack](https://github.com/erigontech/erigon/blob/main/polygon/bor/bor.go#L1516-L1556)
we have in the Bor consensus engine (it is pending removal once we fix the
underlying issue that led to it appearing, which is a WIP).

For now, to unblock myself, I am moving the bor ABIs into a sub-package
`polygon/bor/borabi`
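
A hedged sketch of what the unblocked receiver could then look like; the `borabi` accessor name and pack arguments are assumptions for illustration, not the actual API:

```go
package heimdall

import (
	"math/big"

	"github.com/erigontech/erigon/polygon/bor/borabi" // now a leaf package: no cycle
)

// MarshallValue hides the ABI detail inside the type itself, so callers no
// longer thread an abi.ABI through every function signature.
func (e *EventRecordWithTime) MarshallValue() ([]byte, error) {
	// Hypothetical accessor; the real name/args may differ after the move.
	return borabi.GenesisContractStateReceiverABI().Pack(
		"commitState",
		big.NewInt(e.Time.Unix()),
		e.Data,
	)
}
```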
…bi (erigontech#11620)

Relates to and inspired by
erigontech#11225

Main motivation for this change is to simplify the Bridge Store
interface by removing the need to pass an `ABI` in the function
signatures. But it ended up tidying up other bits of the codebase too.

We should do the same for the entities in `polygon/heimdall EntityStore`,
as it is much, much cleaner.
This reverts PR erigontech#11556, which caused Hive tests (e.g. "Bad Hash on
NewPayload") to crash.
Things added:
* Keep genesis file stored in datadir
* Fixed race conditions in Sentinel
* Fixed scheduling of hard fork for same epoch-hardforks
* Small refactorings

---------

Co-authored-by: Kewei <kewei.train@gmail.com>
Fixed an issue with setting up the diagnostics client on erigon start
- `br.FrozenBlocks()` to ensure that all file types exist
Write MB/s is also big at prune time - NVMe handles it - but on a
bit slower disks it will affect chain-tip perf.

BorMainnet: 
<img width="1054" alt="Screenshot 2024-08-16 at 08 26 22"
src="https://github.com/user-attachments/assets/87e8d227-ac33-4800-a53a-594d872b074f">

Gnosis:
<img width="1026" alt="Screenshot 2024-08-16 at 08 36 23"
src="https://github.com/user-attachments/assets/1290e57c-3325-4e89-95cd-83baf8fa93c7">
…#11489)

- on mainnet: it shows a +10% improvement on a large eth_getLogs call
(re-exec of a large historical range of blocks near block 6M) - on hot state.

on chain tip:
- no noticeable impact, but now it lives only inside the current Tx
- maybe in future PRs I can share such an LRU between blocks somehow
- on loop_exec I also noticed no difference
- Added functionality to grab the heap profile by calling the
diagnostics endpoint
- Added support to pass the heap profile file to the diagnostics UI through
WebSocket
…rom chaindata (erigontech#11634)

We prune `CanonicalHash` and `HeaderNumber` in `chaindata`. This leads
to a -99% reduction in these 2 tables and shrinks their total size to 40
pages in TOTAL (from >500k pages).

Benchmarks:

```
Total time taken to query blocks from 1000000 to 3000000: 1720.1410 seconds (OLD)
Total time taken to query blocks from 1000000 to 3000000: 1747.4729 seconds (New)

Amortized delta cost: 0.1ms/req
```

Is the RPC working for those ranges? Yes.

## How does it work?

Query the canonical hash from snapshots via `viewHeaderByNumber` ->
`header.Hash()` if the db lookup misses.
Query the header number from snapshots via `viewHeaderByHash` ->
`header.Num64` if the db lookup misses.
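
A hedged sketch of that read path with hypothetical names; the point is the order - try the (now mostly pruned) chaindata tables first, then fall back to the immutable header snapshots:

```go
package blockio

type header interface{ Hash() [32]byte }

type chainDB interface {
	// ok=false means the entry was pruned from (or never in) chaindata.
	ReadCanonicalHash(blockNum uint64) (hash [32]byte, ok bool, err error)
}

type headerSnapshots interface {
	ViewHeaderByNumber(blockNum uint64) (header, error)
}

type blockReader struct {
	db    chainDB
	snaps headerSnapshots
}

// CanonicalHash prefers the db (recent, unpruned blocks); on a miss it
// recomputes the hash from the snapshot header, as described above.
func (r *blockReader) CanonicalHash(blockNum uint64) ([32]byte, error) {
	if hash, ok, err := r.db.ReadCanonicalHash(blockNum); err != nil || ok {
		return hash, err
	}
	h, err := r.snaps.ViewHeaderByNumber(blockNum)
	if err != nil {
		return [32]byte{}, err
	}
	return h.Hash(), nil
}
```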
10% better throughput and latency for small and big eth_getLogs.
Increased the limit to 4096, because we can re-use LRU objects without
worrying about "alloc speed".
This adds integrity checking for borevents in the following places:

1. Stage Bor Heimdall - check that the event occurs within the expected
time window
2. Dump Events to snapshots - check that the event ids are continuous in
and between blocks (will add time window tests)
3. Index event snapshots - check that the event ids are continuous in
and between blocks

It also adds an external integrity checker which runs on snapshots
checking for continuity and timeliness of events. This can be called
using the following command:
`erigon snapshots integrity --datadir=~/snapshots/bor-mainnet-patch-1
--from=45500000`
(--to specifies an end block)

This also now fixes the long-running issue with bor events causing
unexpected failures in execution:

The problem that event validation uncovered was as follows:

The **kv.BorEventNums** mapping table currently keeps the mapping of first
event id -> block. The code which produces bor-event snapshots uses it to
determine which events to put into each snapshot.

However, if no additional blocks have events by the time the block is
stored in the snapshot, the snapshot creation code does not know which
events to include - so it drops them.

This causes problems in two places:

* RPC queries & re-execution from snapshots can't find these dropped
events
* Depending on purge timing these events may erroneously get inserted
into future blocks
  
 The code change in this PR fixes that bug.
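
A minimal sketch of the continuity invariant those checks enforce, with a stand-in event type: ids must increase by exactly one within and between blocks:

```go
package integrity

import "fmt"

// borEvent is a stand-in for the real event record; only id and block matter here.
type borEvent struct {
	ID       uint64
	BlockNum uint64
}

// checkEventIDsContinuous enforces the invariant from checks 2 and 3 above:
// no gaps and no duplicates in event ids, within and between blocks.
func checkEventIDsContinuous(events []borEvent) error {
	for i := 1; i < len(events); i++ {
		prev, cur := events[i-1], events[i]
		if cur.ID != prev.ID+1 {
			return fmt.Errorf("bor event id gap between blocks %d and %d: %d -> %d",
				prev.BlockNum, cur.BlockNum, prev.ID, cur.ID)
		}
	}
	return nil
}
```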
 
It has been tested by running:

```
erigon --datadir=~/chains/e3/amoy --chain=amoy --bor.heimdall=https://heimdall-api-amoy.polygon.technology --db.writemap=false --txpool.disable --no-downloader --bor.milestone=false
```

with validation in place, and then confirmed by running the following:

```
erigon snapshots rm-all-state-snapshots --datadir=~/chains/e3/amoy
rm ~/chains/e3/amoy/chaindata/
erigon --datadir=~/chains/e3/amoy --chain=amoy --bor.heimdall=https://heimdall-api-amoy.polygon.technology --db.writemap=false --no-downloader --bor.milestone=false
```
To recreate the chain from snapshots.

It has also been checked with:

```
erigon snapshots integrity --datadir=~/chains/e3/amoy --check=NoBorEventGaps --failFast=true
```

---------

Co-authored-by: alex.sharov <AskAlexSharov@gmail.com>
Co-authored-by: Mark Holt <mark@disributed.vision>
…_e4eb9fc

# Conflicts:
#	cmd/state/exec3/state_recon.go
#	core/chain_makers.go
#	core/state/rw_v3.go
#	core/types/block.go
#	erigon-lib/go.mod
#	erigon-lib/go.sum
#	erigon-lib/kv/tables.go
#	eth/stagedsync/exec3.go
#	eth/stagedsync/stage_snapshots.go
#	go.mod
#	go.sum
#	turbo/app/snapshots_cmd.go
#	turbo/jsonrpc/tracing.go
#	turbo/stages/headerdownload/header_algo_test.go
@setunapo setunapo merged commit f296530 into node-real:erigon3 Aug 20, 2024
1 of 3 checks passed
@blxdyx blxdyx deleted the erigon3_upstream_e4eb9fc branch August 20, 2024 07:29