Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EIP-7594: Decouple network subnets from das-core #3832

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

ppopth
Copy link
Member

@ppopth ppopth commented Jul 3, 2024

Currently we use subnets as a unit of custody in the PeerDAS core protocol because it doesn't make sense to partially custody only some columns in the subnets and waste the bandwidth to download the columns the node doesn't custody.

Since subnets correspond to GossipSub topics which are in a layer lower than the core protocol, using subnets as a unit of custody makes the core layer and the network layer too coupled to each other and leave no room for the network layer flexibility.

This commit introduces "custody groups" which are used as a unit of custody instead of subnets.

The immediate benefit of the decoupling is that we can immediately increase the number of subnets without affecting the expected number of peers to cover all columns and affecting the network stability and without touching the core protocol.

The reason we want to increase the number of subnets to match the number of columns is that the columns will be propagated through the network faster when they have their own subnets. Just like EIP-4844, each blob has its own subnet because, if all the blobs are in a single subnet, the blobs will be propagated more slowly.

Since we keep the number of custody groups the same as the previous number of subnets (32), the expected number of peers you need to cover all the columns is not changed. In fact, you need only NUMBER_OF_COLUMNS and NUMBER_OF_CUSTODY_GROUPS to analyze the expected number, which makes the core protocol completely decoupled from the network layer.

@djrtwo
Copy link
Contributor

djrtwo commented Jul 3, 2024

I haven't done a detailed review of how you do the layering here, but I generally support this direction to ensure that core protocol and network are sufficiently decoupled so they can be tuned/optimized independently

@Nashatyrev
Copy link
Member

I think this is a good proposal. I believe it has something in common with former 'Network Shards' idea.

However it's worth to keep in mind that in this case the gossip 'backbone' is defined rather implicitly (by the upper das core layer) than explicitly. Changing for example NUMBER_OF_CUSTODY_GROUPS to 128 would affect not only the upper das core layer but would also affect the network layer by making subnet backbones too small and potentially unreliable/vulnerable.

@hwwhww hwwhww added the EIP-7594 PeerDAS label Jul 5, 2024
@cskiraly
Copy link
Contributor

cskiraly commented Jul 9, 2024

I like the idea of decoupling the concepts related to custody allocation from the delivery method.

My concern is that the only reason I know of for grouping columns in custody is the (supposedly) inefficiency of gossipsub (implementations) in handling a larger number of topics and having smaller topics. Anyway, I think the conceptual decoupling is good to have.

Another thing I see in the PR is that you also have FAQ extensions and alike. Haven't seen the details yet, I will try to give it a review later on.

@cskiraly
Copy link
Contributor

cskiraly commented Jul 9, 2024

One more thing: parameter changes are also proposed in #3779
That one is proposing to change DATA_COLUMN_SIDECAR_SUBNET_COUNT to 128

@ppopth
Copy link
Member Author

ppopth commented Jul 9, 2024

Changing for example NUMBER_OF_CUSTODY_GROUPS to 128 would affect not only the upper das core layer but would also affect the network layer by making subnet backbones too small and potentially unreliable/vulnerable.

From another point of view, that can be seen as a non-networking issue. Let me rephrase your issue. If the number of custody groups is too large, the number of nodes in each group will be so small that you cannot find the peers to cover all the custody groups.

I think whether it's an issue on the network layer is quite subjective. However, the whole point of doing DAS is to reduce the bandwidth consumption, so every part of the protocol (including the core one) can be seen as related to the network in some way.

@ppopth
Copy link
Member Author

ppopth commented Jul 9, 2024

Another way to decouple the GossipSub topics from the core protocol is not to correspond subnets to GossipSub topics, but think of subnets as groups of nodes. However, we use the term "subnets" to mean GossipSub topics a lot in the past, so it's hard to think of subnets as anything else.

@ppopth
Copy link
Member Author

ppopth commented Jul 16, 2024

One more thing: parameter changes are also proposed in #3779
That one is proposing to change DATA_COLUMN_SIDECAR_SUBNET_COUNT to 128

I have a big concern on that change. Let's discuss it in that PR.

specs/_features/eip7594/das-core.md Outdated Show resolved Hide resolved
specs/_features/eip7594/das-core.md Show resolved Hide resolved
specs/_features/eip7594/das-core.md Outdated Show resolved Hide resolved
specs/_features/eip7594/p2p-interface.md Outdated Show resolved Hide resolved
tests/formats/networking/get_custody_groups.md Outdated Show resolved Hide resolved
@zilm13
Copy link
Contributor

zilm13 commented Jul 18, 2024

I miss the method which translates custody group into subnet, I think we need it or I don't understand an idea a bit. So instead of this

def compute_subnet_for_data_column_sidecar(column_index: ColumnIndex) -> SubnetID:
    return SubnetID(column_index % DATA_COLUMN_SIDECAR_SUBNET_COUNT)

we should have something like

def compute_subnet_for_custody_group(custody_group: UInt64) -> SubnetID:
    assert custody_group < NUMBER_OF_CUSTODY_GROUPS
    return SubnetID(custody_group % DATA_COLUMN_SIDECAR_SUBNET_COUNT)

And also it would be good to have compile assertion that NUMBER_OF_CUSTODY_GROUPS is set to the number, which could be always translated to DATA_COLUMN_SIDECAR_SUBNET_COUNT (say, we don't have 32 groups and 64 subnets or 33 groups and 32 subnets)

@ppopth
Copy link
Member Author

ppopth commented Jul 25, 2024

I miss the method which translates custody group into subnet, I think we need it or I don't understand an idea a bit. So instead of this

The idea is that you can get the subnets by calling compute_columns_for_custody_group and then compute_subnet_for_data_column_sidecar. Does that make sense?

We don't want to do compute_subnet_for_custody_group because it will allows only one subnet for each custody group.

And also it would be good to have compile assertion that NUMBER_OF_CUSTODY_GROUPS is set to the number, which could be always translated to DATA_COLUMN_SIDECAR_SUBNET_COUNT (say, we don't have 32 groups and 64 subnets or 33 groups and 32 subnets)

What do you mean? NUMBER_OF_CUSTODY_GROUPS is independent from DATA_COLUMN_SIDECAR_SUBNET_COUNT and only the assertion that I can think of is that DATA_COLUMN_SIDECAR_SUBNET_COUNT should be greater than or equal to NUMBER_OF_CUSTODY_GROUPS.

we don't have 32 groups and 64 subnets

Absolutely, we can have 32 groups and 64 subnets.

@zilm13
Copy link
Contributor

zilm13 commented Jul 30, 2024

The idea is that you can get the subnets by calling compute_columns_for_custody_group and then compute_subnet_for_data_column_sidecar

@ppopth now I got this, thank you for the explanation

Copy link
Contributor

@g11tech g11tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm from lodestar

@hwwhww hwwhww mentioned this pull request Aug 9, 2024
3 tasks
Currently we use subnets as a unit of custody in the PeerDAS core
protocol because it doesn't make sense to partially custody only some
columns in the subnets and waste the bandwidth to download the columns
the node doesn't custody.

Since subnets correspond to GossipSub topics which are in a layer lower
than the core protocol, using subnets as a unit of custody makes the
core layer and the network layer too coupled to each other and leave no
room for the network layer flexibility.

This commit introduces "custody groups" which are used a unit of custody
instead of subnets.

The immediate benefit of the decoupling is that we can immediately
increase the number of subnets without affecting the expected number of
peers to cover all columns and affecting the network stability and
without touching the core protocol.

The reason we want to increase the number of subnets to match the number
of columns is that the columns will be propagated through the network
faster when they have their own subnets. Just like EIP-4844, each
blob has its own subnet because, if all the blobs are in a single subnet,
the blobs will be propagated more slowly.

Since we keep the number of custody groups the same as the previous
number of subnets (32), the expected number of peers you need to cover
all the columns is not changed. In fact, you need only NUMBER_OF_COLUMNS
and NUMBER_OF_CUSTODY_GROUPS to analyze the expected number, which
makes the core protocol completely decoupled from the network layer.
@ppopth
Copy link
Member Author

ppopth commented Aug 14, 2024

I think this is ready to be merged?


To start with a simple, stable backbone, for now, we don't shuffle the subnet assignments via the deterministic custody selection helper `get_custody_columns`. However, staggered rotation likely needs to happen on the order of the pruning period to ensure subnets can be utilized for recovery. For example, introducing an `epoch` argument allows the function to maintain stability over many epochs.
No, the number of subnets don't really matter. What matters to the network stability is the number of nodes and the churn rate in the network.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
No, the number of subnets don't really matter. What matters to the network stability is the number of nodes and the churn rate in the network.
No, the number of subnets doesn't really matter. What matters to network stability is the number of nodes and the churn rate in the network.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to have this paragraph at all I think it would be good to expand on it a bit, for example on the "what matters" part

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I also thought it was a bit light on details.

@@ -320,8 +321,8 @@ Requests the MetaData of a peer, using the new `MetaData` definition given above

##### Custody subnet count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
##### Custody subnet count
##### Custody group count

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that during the rebase as well

@@ -54,6 +55,7 @@ The following values are (non-configurable) constants used throughout the specif
| - | - | - |
| `RowIndex` | `uint64` | Row identifier in the matrix of cells |
| `ColumnIndex` | `uint64` | Column identifier in the matrix of cells |
| `CustodyIndex` | `uint64` | Custody group identifier in the matrix of cells |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit wrong in my opinion. It's an index into a set of groups & doesn't involve the matrix.

Suggested change
| `CustodyIndex` | `uint64` | Custody group identifier in the matrix of cells |
| `CustodyIndex` | `uint64` | Custody group identifier in the set of custody groups |

### Custody setting

| Name | Value | Description |
| - | - | - |
| `SAMPLES_PER_SLOT` | `8` | Number of `DataColumnSidecar` random samples a node queries per slot |
| `CUSTODY_REQUIREMENT` | `4` | Minimum number of subnets an honest node custodies and serves samples from |
| `NUMBER_OF_CUSTODY_GROUPS` | `128` | Number of column groups available for nodes to custody |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a custody group is a collection of columns, but calling it a "column group" is a inconsistent.

Suggested change
| `NUMBER_OF_CUSTODY_GROUPS` | `128` | Number of column groups available for nodes to custody |
| `NUMBER_OF_CUSTODY_GROUPS` | `128` | Number of custody groups available for nodes to custody |

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe I missed that during the rebase

@jimmygchen
Copy link
Contributor

I know this proposal has been around for a while, and I actually reviewed it earlier, but I have some additional thoughts.

I’m trying to wrap my head around the benefits of this proposal. It looks like the idea is to allow us to tweak the subnet count without touching the core protocol (custody_requirement) and its security aspects, which makes sense given where PeerDAS is right now. But I’m wondering if the benefits we gain from this would really justify the added complexity, and whether this change is something we truly need at this stage.

Personally, I’m not entirely convinced that this change is necessary. The more we decouple things, the more complexity we introduce, which might not be ideal. Since we're already looking at up to 128 subnets, it feels like we won’t need to go beyond that in the near future.

I’ve heard that research shows it can be more efficient to spread larger chunks of data across multiple subnets rather than sticking to just one. So, if we end up increasing the blob count significantly, I can see how increasing the subnet count might make sense. But for now, I’m not sure there’s a big advantage in going above 128 subnets unless we’re dealing with a much higher blob count. Even with 256 blobs per block, we’re still only sending up to 512kb per block to each subnet.

Maybe there’s something I’m missing here and about the future plans - Is there an issue with the current spec, or are we just trying to future-proof for scenarios we’re not certain about yet?

@ppopth
Copy link
Member Author

ppopth commented Sep 10, 2024

@jimmygchen Sorry for the late reply.

I would like to compare this PR to attestation subnets and attestation committees. In the phase0 spec, we decoupled subnets from committees as well. In real life and as I always heard, no one really mentions committees. People always talk about attestation subnets. In fact, according to the spec, the phase0 core protocol has nothing to do with subnets. They only deal with committees. The subnets are in the network layer, another layer.

So in PeerDAS, people may still discuss things in term of subnets (just like the attestation subnets), but I think it doesn't add more complexity since phase0 is already complex (according to your standard, some people may say it's not complex).

@jimmygchen
Copy link
Contributor

Thanks for the context @ppopth.

It's a fair comparison with some differences. Attestations have a target committee size, while with DAS, we’ve got a fixed number of columns/groups. And repeating my earlier comment, I'm not sure we need to design this with increasing to over 128 subnets in mind at this stage, or if there's any reason to reduce the number of subnets after shipping PeerDAS.

I feel there might be some tension, but I just wanted to pass along our team's feedback and keep things moving since we talked about scheduling this for peerdas-devnet-3. I’m not saying the proposal is overly complex, just that it adds more complexity than not having it. But even compared to attestation committees, I’m not sure the benefits are clear right now. It feels less urgent compared to other priorities like validator custody or optimizing cell proof computation and bandwidth. That’s just my take - I just wanted to share our feedback. It’s not something I’m going to fight hard on.

@ppopth
Copy link
Member Author

ppopth commented Sep 11, 2024

@jimmygchen Thank you for the feedback.

I'm not sure we need to design this with increasing to over 128 subnets in mind at this stage, or if there's any reason to reduce the number of subnets after shipping PeerDAS.

I think it's reasonable to make the spec agile to changes even if we don't know clearly yet if we want to change the number of subnets or not. Again, just like attestation subnets, the number of subnets and the number of committees have been both 64 for a long time and we don't see whether this will ever change.

That reason also applies here. The number of column subnets will probably be equal to the number of custody groups forever, but it has value in the view of agility.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.