Handle terminally stuck transactions on send #14127

@amit-momin amit-momin commented Aug 14, 2024

BCI-3014

  • Added client error classification for terminally stuck transactions on send
  • Added cases in the Broadcaster and Confirmer to handle these client errors

@@ -999,6 +999,21 @@ func (ec *Confirmer[CHAIN_ID, HEAD, ADDR, TX_HASH, BLOCK_HASH, R, SEQ, FEE]) han
ec.SvcErrBuffer.Append(sendError)
// This will loop continuously on every new head so it must be handled manually by the node operator!
return ec.txStore.DeleteInProgressAttempt(ctx, attempt)
case client.TerminallyStuck:
// A transaction could broadcast successfully but then be considered terminally stuck on another attempt
// Even though the transaction can succeeed under different circumstances, we want to purge this transaction as soon as we get this error

Suggested change
- // Even though the transaction can succeeed under different circumstances, we want to purge this transaction as soon as we get this error
+ // Even though the transaction can succeed under different circumstances, we want to purge this transaction as soon as we get this error

case client.TerminallyStuck:
// A transaction could broadcast successfully but then be considered terminally stuck on another attempt
// Even though the transaction can succeeed under different circumstances, we want to purge this transaction as soon as we get this error
lggr.Errorw("terminally stuck transaction detected", "err", sendError.Error())

What are the criteria for an Error log vs. a Critical log?

@huangzhen1997 huangzhen1997 Aug 14, 2024

From my understanding, this (a tx stuck due to overflow, i.e. not enough keccak counters to continue the execution) is expected behavior, at least for zkSync. When it happens we just need to cancel/purge the existing tx by reprocessing. It's not a critical/fatal issue with the Chainlink node that we need to raise an alert for. An example of a critical/fatal one: Invariant violation: fatal error while re-attempting transaction should not happen

Contributor Author

That's my understanding as well. Since the TXM resolves this on its own, we don't have to raise a signal for NOPs to take any actions.

Agreed. I would also change the log level to warn.
The tx is bad, and there is nothing wrong with the TXM.

return tx.Nonce() == uint64(346) && tx.Value().Cmp(big.NewInt(243)) == 0
}), fromAddress).Return(commonclient.Fatal, errors.New(terminallyStuckError)).Once()

// Do the thing

nit: replace with more descriptive comment

@huangzhen1997 huangzhen1997 Aug 14, 2024

lol I notice we have a lot of "Do the thing" lines in many test files, confirmer_test.go for example

Contributor Author

Ya this was just copied from another broadcaster test haha. But never too late to update at least the new tests to say something better.

huangzhen1997 previously approved these changes Aug 14, 2024
@huangzhen1997 huangzhen1997 left a comment

lgtm except some lint errors

@poopoothegorilla poopoothegorilla left a comment

This looks good

@@ -596,6 +596,11 @@ func ClassifySendError(err error, clientErrors config.ClientErrors, lggr logger.
)
return commonclient.ExceedsMaxFee
}
if sendError.IsTerminallyStuckConfigError(configErrors) {
lggr.Criticalw("Transaction that would have been terminally stuck in the mempool detected on send. Marking as fatal error.", "err", sendError, "etx", tx)

This shouldn't be a critical log. It should be a warning.
Even the Errorw() log is for cases where we clearly see a failure, albeit one that we can recover from.
For example, an important RPC call failed, or a database write failed.

A stuck tx is now expected behavior, so a warn log is enough.

Contributor Author

Will do! I was just trying to match the behavior for Fatal. Think we're planning to rework logs in the near future so that might change anyways
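
The log-level criteria discussed in this thread could be summarized in a small sketch like the following. The `Condition` type, its values, and `logLevelFor` are hypothetical and exist only to make the rule of thumb concrete; they are not part of the codebase or its logger.

```go
package main

import "fmt"

// Condition is a hypothetical classification of what went wrong.
type Condition int

const (
	// InvariantViolation should never happen and needs operator attention.
	InvariantViolation Condition = iota
	// RecoverableFailure is a clear failure we can recover from,
	// e.g. an important RPC call or a database write failed.
	RecoverableFailure
	// ExpectedBadTx is an expected condition the TXM resolves on its own,
	// e.g. a terminally stuck transaction that gets purged.
	ExpectedBadTx
)

// logLevelFor maps a condition to the log level suggested in the review.
func logLevelFor(c Condition) string {
	switch c {
	case InvariantViolation:
		return "critical"
	case RecoverableFailure:
		return "error"
	case ExpectedBadTx:
		return "warn"
	default:
		return "info"
	}
}

func main() {
	fmt.Println(logLevelFor(ExpectedBadTx))
}
```

Under this reading, a terminally stuck transaction maps to warn because the tx is bad while the TXM itself is behaving as intended.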

@prashantkumar1982 prashantkumar1982 added this pull request to the merge queue Aug 15, 2024
Merged via the queue into develop with commit 5e99bdb Aug 15, 2024
139 of 140 checks passed
@prashantkumar1982 prashantkumar1982 deleted the BCI-3014-Handle-ZK-overflow-error-on-send-transaction branch August 15, 2024 00:58
RensR added a commit to smartcontractkit/ccip that referenced this pull request Aug 15, 2024
Cherry-pick of smartcontractkit/chainlink#14127

Required to successfully handle zk overflows on Polygon zkEVM and X Layer.

Co-authored-by: amit-momin <108959691+amit-momin@users.noreply.github.com>
Co-authored-by: Rens Rooimans <github@rensrooimans.nl>
6 participants