Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout-based message expiration #212

Open
Hyodar opened this issue Jun 5, 2024 · 8 comments
Open

Timeout-based message expiration #212

Hyodar opened this issue Jun 5, 2024 · 8 comments
Assignees

Comments

@Hyodar
Copy link
Contributor

Hyodar commented Jun 5, 2024

In both aggregator and operator we don't really have a timeout for processing messages.

For operators, it means when a message is added to the resend queue, it'll just stay there while it's not resent enough, even though it's a very old message. This is not interesting because (1) the interest in a state root update is immediate, so resending a really old message is essentially unnecessary info and can result in a lot of latency for the actual blocks to be processed when re-upping and (2) the resend queue just grows unbounded, which is never nice. Here, I propose we just check in the resend queue if a message is old enough to be dropped and drop it even if the aggregator connection is down. We could use a fancier queue with TTL (I'd prefer that) or just straight up manage this on tryResendFromQueue.

For the aggregator, there's still the latency consideration - should take quite a while to get to the 'good' messages and, depending on the situation, aggregating an old message can affect a past checkpoint info. This should ideally be considered in slashing terms as well, but this is not desirable behavior. In that case, I propose we set a submission timeout and just ignore messages that are older than that when receiving them. It'd also be good to then adapt the checkpoint time range logic so it's from lastCheckpointTimestamp + 1 to currentBlockTime - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT.

@Hyodar
Copy link
Contributor Author

Hyodar commented Jun 6, 2024

Btw, if you'll see testnet-0, this was already implemented in a naive way to avoid any load issues.

@emlautarom1
Copy link
Contributor

emlautarom1 commented Jun 24, 2024

Some work was done in #259 but it seems like on master we already have somewhat of a mechanism for TTL in the form of timeout + retries for the operator's RPC client:

const (
ResendInterval = 2 * time.Second
MaxRetries = 10
)

@emlautarom1
Copy link
Contributor

On the aggregator side, we handle 3 messages:

func (agg *Aggregator) ProcessSignedCheckpointTaskResponse(signedCheckpointTaskResponse *messages.SignedCheckpointTaskResponse) error {

func (agg *Aggregator) ProcessSignedStateRootUpdateMessage(signedStateRootUpdateMessage *messages.SignedStateRootUpdateMessage) error {

func (agg *Aggregator) ProcessSignedOperatorSetUpdateMessage(signedOperatorSetUpdateMessage *messages.SignedOperatorSetUpdateMessage) error {

ProcessSignedCheckpointTaskResponse is the only one that does not handle internally a TTL, though I'm not sure if this is what is being discussed in the ticket.

@emlautarom1
Copy link
Contributor

I propose we set a submission timeout and just ignore messages that are older than that when receiving them

Would adding the following validation to ProcessSignedStateRootUpdateMessage and ProcessSignedOperatorSetUpdateMessage be enough?

// Error handling is ignored for brevity
func (agg *Aggregator) validateMessageTimestamp(ctx context.Context, messageTimestamp uint64) error {
	currentBlockNumber, _ := agg.httpClient.BlockNumber(ctx)
	currentBlock, _ := agg.httpClient.BlockByNumber(ctx, big.NewInt(0).SetUint64(currentBlockNumber))

	if messageTimestamp < (currentBlock.Time() - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT) {
		return errors.New("Message has expired")
	}

	return nil
}

It'd also be good to then adapt the checkpoint time range logic so it's from lastCheckpointTimestamp + 1 to currentBlockTime - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT.

Would changing the following

toTimestamp := block.Time()

to

	toTimestamp := block.Time() - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT

be enough?

@Hyodar
Copy link
Contributor Author

Hyodar commented Jul 1, 2024

Let me get up to speed on this, need to gather past thoughts!

@Hyodar
Copy link
Contributor Author

Hyodar commented Jul 1, 2024

Okay, so my understanding on the aggregator is:
We need to introduce MESSAGE_SUBMISSION_TIMEOUT - this timeout is purely for submission. We're not going to accept messages from timestamps lower than currentTimestamp() - MESSAGE_SUBMISSION_TIMEOUT - the message timestamp being defined as the timestamp in the message content. This means we need to filter these messages out when receiving (as pointed out by you, but then using the aggregator timestamp as reference as opposed to the current block). The idea for that is to avoid aggregating old messages, which skews the checkpoint content - whoever signs a checkpoint attests that the messages were passed around during that time.
Then, whenever we're making a checkpoint, we should be making it as you linked there.

@emlautarom1
Copy link
Contributor

but then using the aggregator timestamp as reference as opposed to the current block

What is the aggregator timestamp? Is it the unix timestamp of the machine running the aggregator or is it a domain specific concept?

@Hyodar
Copy link
Contributor Author

Hyodar commented Jul 1, 2024

Unix timestamp, really. The point here is just establishing some baseline for when a message is old or not. Let me know if it makes sense, I may also be forgetting about something here.

@emlautarom1 emlautarom1 linked a pull request Jul 3, 2024 that will close this issue
@emlautarom1 emlautarom1 removed a link to a pull request Jul 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants