Timeout-based message expiration #212

Hyodar · 2024-06-05T20:11:31Z

In both aggregator and operator we don't really have a timeout for processing messages.

For operators, it means when a message is added to the resend queue, it'll just stay there while it's not resent enough, even though it's a very old message. This is not interesting because (1) the interest in a state root update is immediate, so resending a really old message is essentially unnecessary info and can result in a lot of latency for the actual blocks to be processed when re-upping and (2) the resend queue just grows unbounded, which is never nice. Here, I propose we just check in the resend queue if a message is old enough to be dropped and drop it even if the aggregator connection is down. We could use a fancier queue with TTL (I'd prefer that) or just straight up manage this on tryResendFromQueue.

For the aggregator, there's still the latency consideration - should take quite a while to get to the 'good' messages and, depending on the situation, aggregating an old message can affect a past checkpoint info. This should ideally be considered in slashing terms as well, but this is not desirable behavior. In that case, I propose we set a submission timeout and just ignore messages that are older than that when receiving them. It'd also be good to then adapt the checkpoint time range logic so it's from lastCheckpointTimestamp + 1 to currentBlockTime - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT.

The text was updated successfully, but these errors were encountered:

Hyodar · 2024-06-06T23:50:45Z

Btw, if you'll see testnet-0, this was already implemented in a naive way to avoid any load issues.

emlautarom1 · 2024-06-24T13:24:25Z

Some work was done in #259 but it seems like on master we already have somewhat of a mechanism for TTL in the form of timeout + retries for the operator's RPC client:

near-sffl/operator/rpc_client.go

Lines 20 to 23 in 8f43707

 const ( 

 ResendInterval = 2 * time.Second 

 MaxRetries = 10 

 )

emlautarom1 · 2024-06-24T16:39:01Z

On the aggregator side, we handle 3 messages:

near-sffl/aggregator/aggregator.go

Line 522 in 8f43707

 func (agg *Aggregator) ProcessSignedCheckpointTaskResponse(signedCheckpointTaskResponse *messages.SignedCheckpointTaskResponse) error { 

near-sffl/aggregator/aggregator.go

Line 551 in 8f43707

 func (agg *Aggregator) ProcessSignedStateRootUpdateMessage(signedStateRootUpdateMessage *messages.SignedStateRootUpdateMessage) error { 

near-sffl/aggregator/aggregator.go

Line 581 in 8f43707

 func (agg *Aggregator) ProcessSignedOperatorSetUpdateMessage(signedOperatorSetUpdateMessage *messages.SignedOperatorSetUpdateMessage) error { 

ProcessSignedCheckpointTaskResponse is the only one that does not handle internally a TTL, though I'm not sure if this is what is being discussed in the ticket.

emlautarom1 · 2024-07-01T15:14:30Z

I propose we set a submission timeout and just ignore messages that are older than that when receiving them

Would adding the following validation to ProcessSignedStateRootUpdateMessage and ProcessSignedOperatorSetUpdateMessage be enough?

// Error handling is ignored for brevity
func (agg *Aggregator) validateMessageTimestamp(ctx context.Context, messageTimestamp uint64) error {
	currentBlockNumber, _ := agg.httpClient.BlockNumber(ctx)
	currentBlock, _ := agg.httpClient.BlockByNumber(ctx, big.NewInt(0).SetUint64(currentBlockNumber))

	if messageTimestamp < (currentBlock.Time() - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT) {
		return errors.New("Message has expired")
	}

	return nil
}

It'd also be good to then adapt the checkpoint time range logic so it's from lastCheckpointTimestamp + 1 to currentBlockTime - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT.

Would changing the following

near-sffl/aggregator/aggregator.go

Line 396 in dfb1874

toTimestamp := block.Time()

to

	toTimestamp := block.Time() - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT

be enough?

Hyodar · 2024-07-01T16:26:34Z

Let me get up to speed on this, need to gather past thoughts!

Hyodar · 2024-07-01T16:56:47Z

Okay, so my understanding on the aggregator is:
We need to introduce MESSAGE_SUBMISSION_TIMEOUT - this timeout is purely for submission. We're not going to accept messages from timestamps lower than currentTimestamp() - MESSAGE_SUBMISSION_TIMEOUT - the message timestamp being defined as the timestamp in the message content. This means we need to filter these messages out when receiving (as pointed out by you, but then using the aggregator timestamp as reference as opposed to the current block). The idea for that is to avoid aggregating old messages, which skews the checkpoint content - whoever signs a checkpoint attests that the messages were passed around during that time.
Then, whenever we're making a checkpoint, we should be making it as you linked there.

emlautarom1 · 2024-07-01T20:09:35Z

but then using the aggregator timestamp as reference as opposed to the current block

What is the aggregator timestamp? Is it the unix timestamp of the machine running the aggregator or is it a domain specific concept?

Hyodar · 2024-07-01T20:52:36Z

Unix timestamp, really. The point here is just establishing some baseline for when a message is old or not. Let me know if it makes sense, I may also be forgetting about something here.

Hyodar added the priority: 3 label Jun 6, 2024

emlautarom1 self-assigned this Jun 18, 2024

emlautarom1 mentioned this issue Jun 20, 2024

Add TTL to Operator RPC Client #259

Closed

emlautarom1 mentioned this issue Jul 3, 2024

Aggregator message TTL #273

Merged

emlautarom1 linked a pull request Jul 3, 2024 that will close this issue

Aggregator message TTL #273

Merged

emlautarom1 removed a link to a pull request Jul 3, 2024

Aggregator message TTL #273

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Timeout-based message expiration #212

Timeout-based message expiration #212

Hyodar commented Jun 5, 2024

Hyodar commented Jun 6, 2024

emlautarom1 commented Jun 24, 2024 •

edited

Loading

emlautarom1 commented Jun 24, 2024

emlautarom1 commented Jul 1, 2024

Hyodar commented Jul 1, 2024

Hyodar commented Jul 1, 2024 •

edited

Loading

emlautarom1 commented Jul 1, 2024

Hyodar commented Jul 1, 2024

Timeout-based message expiration #212

Timeout-based message expiration #212

Comments

Hyodar commented Jun 5, 2024

Hyodar commented Jun 6, 2024

emlautarom1 commented Jun 24, 2024 • edited Loading

emlautarom1 commented Jun 24, 2024

emlautarom1 commented Jul 1, 2024

Hyodar commented Jul 1, 2024

Hyodar commented Jul 1, 2024 • edited Loading

emlautarom1 commented Jul 1, 2024

Hyodar commented Jul 1, 2024

emlautarom1 commented Jun 24, 2024 •

edited

Loading

Hyodar commented Jul 1, 2024 •

edited

Loading