
Skip withdrawn validators in withdrawal sweep #3528

Open · wants to merge 1 commit into base: dev
Conversation

@ppopth (Member) commented Oct 24, 2023

Motivation

Since withdrawing a deposit is slow when the validator index grows large, we tried to make it faster by reusing the indices in ethereum/EIPs#6914 (consensus-specs at #3307).

However, reusing the indices brings complexity to the protocol and the spec. This PR instead skips the withdrawn validators in the withdrawal sweep, making withdrawals faster without reusing the validator indices.

The good thing about not reusing the validator indices is that we have a one-to-one correspondence between the validator indices and the validator public keys.

This improvement is intended as an alternative to EIP-6914, so it can be applied only if we decide not to do EIP-6914.

Implementation detail

In the current spec, we know that get_expected_withdrawals never runs more than MAX_VALIDATORS_PER_WITHDRAWALS_SWEEP iterations. This improvement changes that function so that it may run more iterations than before.

Clients can optimize this by keeping a cache of unskippable validators and sweeping through the cache instead of the set of all validators. The running time will then still be on the order of MAX_VALIDATORS_PER_WITHDRAWALS_SWEEP.
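As a rough sketch of the idea (simplified, not the exact spec diff in this PR; `Validator`, `is_withdrawn`, and `sweep_candidates` are hypothetical stand-ins for the spec's types and helpers):

```python
from typing import List, NamedTuple

MAX_VALIDATORS_PER_WITHDRAWALS_SWEEP = 2**14  # 16,384, as in Capella

class Validator(NamedTuple):
    withdrawable_epoch: int
    balance: int  # simplified: balance kept alongside the validator record

def is_withdrawn(validator: Validator, epoch: int) -> bool:
    # A validator whose withdrawable epoch has passed and whose balance is
    # zero has nothing left to withdraw, so the sweep can skip it.
    return validator.withdrawable_epoch <= epoch and validator.balance == 0

def sweep_candidates(validators: List[Validator], start: int, epoch: int) -> List[int]:
    # Collect the indices that the sweep inspects for full/partial
    # withdrawals, as in Capella. Withdrawn validators are skipped and do
    # NOT count against the bound, so the loop may run for more than
    # MAX_VALIDATORS_PER_WITHDRAWALS_SWEEP iterations.
    bound = min(len(validators), MAX_VALIDATORS_PER_WITHDRAWALS_SWEEP)
    candidates: List[int] = []
    index = start
    for _ in range(len(validators)):  # wrap around the registry at most once
        if len(candidates) == bound:
            break
        if not is_withdrawn(validators[index], epoch):
            candidates.append(index)
        index = (index + 1) % len(validators)
    return candidates
```

Each candidate index would then be checked for a full or partial withdrawal exactly as in the Capella `get_expected_withdrawals`.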

Since withdrawing a deposit is slow when the validator index grows
large, we tried to make it faster by reusing the indices in EIP-6914.

However, reusing the indices brings complexity to the protocol and the
spec. This commit instead skips the already-withdrawn validators in
the withdrawal sweep to make the withdrawal faster.

The function `get_expected_withdrawals` is modified in a way that it
will not always run in the order of `MAX_VALIDATORS_PER_WITHDRAWALS_SWEEP`
anymore, but clients SHOULD implement this function in a way that keeps
the running time still on the order of
`MAX_VALIDATORS_PER_WITHDRAWALS_SWEEP` by keeping a cache of unskippable
validators.

This improvement is intended to be an alternative to EIP-6914, so it
can be applied only if we don't want to go with EIP-6914.
@dapplion (Collaborator)

This PR does not address the main goals of EIP-6914. The key constraint on the beacon chain is that the full state must be loaded in memory and offer fast iteration over all indices. I would recommend reading this WIP draft on the engineering details of EIP-6914 (https://hackmd.io/@dapplion/eip6914), but the main goals are:

  • limit the max memory cost of a single state and related data structures
  • bound the computational complexity of epoch transitions

Addressing the specific implementation of this PR: it defeats the original goal of the bounded withdrawal sweep. The worst case of this PR is consensus clients having to iterate over the majority of the validator registry every block.

@ppopth (Member, Author) commented Oct 24, 2023

> bound the computational complexity of epoch transitions

> The worst case of this PR is consensus clients having to iterate over the majority of the validator registry every block.

I guess what you mean is that, if the majority of the validators were withdrawn long ago, clients would have to iterate through them every block.

If clients implement it exactly as written in the spec, yes, that's true. But the main idea of this PR is to encourage clients to keep a cache of unskippable validators and iterate through that cache instead of the set of all validators. By doing this, the number of iterations is no more than MAX_VALIDATORS_PER_WITHDRAWALS_SWEEP.
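For illustration only, such a cache could be as simple as a sorted list of the indices that are not yet fully withdrawn (`UnskippableCache` and its methods are hypothetical names, not spec code; real clients would integrate this with their own state representation):

```python
import bisect
from typing import List

class UnskippableCache:
    """Sorted indices of validators that are not fully withdrawn, so each
    block's sweep touches at most `bound` entries."""

    def __init__(self, unskippable_indices: List[int]) -> None:
        self.indices = sorted(unskippable_indices)

    def on_fully_withdrawn(self, index: int) -> None:
        # Maintained incrementally: drop a validator once its balance hits
        # zero after its withdrawable epoch has passed.
        pos = bisect.bisect_left(self.indices, index)
        if pos < len(self.indices) and self.indices[pos] == index:
            del self.indices[pos]

    def sweep(self, start: int, bound: int) -> List[int]:
        # Resume from `start` and take up to `bound` unskippable indices,
        # wrapping around the registry like the spec's modulo sweep.
        pos = bisect.bisect_left(self.indices, start)
        rotated = self.indices[pos:] + self.indices[:pos]
        return rotated[:bound]
```

Since the set of withdrawals derived is identical with or without the cache, this is purely a client-side performance optimization.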

So the computational complexity is no different from Capella, except that this PR iterates through more potential validators than Capella does.

The code in the spec is written in a slower way just for simplicity.

I understand that clients would have to use a different algorithm from the one specified in the spec, but, as far as I understand, we are okay with that (for example, the full-sweep strategy in #3307).

@ppopth (Member, Author) commented Oct 24, 2023

> The key constraint on the beacon chain is that the full state must be loaded in memory

> limit the max memory cost of a single state and related data structures

Do we have a requirement that the beacon state must fit in memory? If so, and the state grows rapidly, then yes, reusing the validator indices makes sense and is better.

If not, and we are allowed to keep some parts of the beacon state in memory and other parts in storage, the state size may not be an issue.

However, the state-size requirement seems counter-intuitive to me because the EL state is already large (smart-contract state, etc.), so having a large beacon state may not be an issue, especially since the beacon state may not grow as fast as the EL state.

@dapplion (Collaborator) commented Oct 25, 2023

> However, the state-size requirement seems counter-intuitive to me because the EL state is already large (smart-contract state, etc.), so having a large beacon state may not be an issue, especially since the beacon state may not grow as fast as the EL state.

The nature of EL vs CL state is very different; see below a summary comparison (from https://hackmd.io/@dapplion/eip6914#Why-is-this-a-problem):

|  | Execution state | Consensus state |
| --- | --- | --- |
| lookups | by hash | by index |
| reads per block | very sparse | all state is read at least once per epoch |
| total serialized size | many GBs | many MBs |

The fact that CL state lookups are done by index, and the need to do complete iterations "frequently" (once per epoch), restricts the design space. Note that neither of those things happens in EL clients.

> But the main idea of this PR is to encourage clients to keep a cache of unskippable validators and iterate through that cache instead of the set of all validators.

If I recall correctly, part of the appeal of the partial sweep is that it allows clients to not add any cache at all. Caches have been a good source of consensus failures, so not needing one is a positive. Of course clients make extensive use of caches today; I'm just pointing out the positives of cache reduction.

@ppopth (Member, Author) commented Oct 25, 2023

> If I recall correctly, part of the appeal of the partial sweep is that it allows clients to not add any cache at all. Caches have been a good source of consensus failures, so not needing one is a positive. Of course clients make extensive use of caches today; I'm just pointing out the positives of cache reduction.

I'm not sure what the partial sweep means, but I guess it's the current Capella sweep.

The full-sweep approach in EIP-6914 requires a cache as well, right? Otherwise, get_index_for_new_validator will be very slow, as shown in #3307.

So I think the cache stuff is not the advantage that EIP-6914 has over this PR.

@potuz (Contributor) commented Oct 26, 2023

Although @dapplion already mentioned both of these points, I'd like to stress them again:

  • We are forced to keep the full beacon state in memory, and not just one state but several in fact. There are techniques to keep one full state plus a bunch of diffs against others, but even with those techniques the amount of memory consumed is still several beacon states' worth (with the advantage that many more states can be recovered quickly).
  • Caches are pretty bad and have caused most of the consensus problems we see on devnets/testnets. Most of our day is currently spent reviewing edge cases around caches. The code complexity of having to deal with an extra cache, especially for a loop that is already very cheap, does not make it worth it.

I would suggest we hold off on any approach to dealing with this until it becomes a real UX problem (or at least is close to becoming one). Blocks are going to be packed with 16 withdrawals for the foreseeable future. When they are not, we can always increase the sweep size substantially, but realistically, by the time we get there we will have had to implement index reuse or other means to bound the state in memory, which is the real problem.
