
BCI-2525: check all responses on transaction submission #11599

Conversation

Collaborator

@dhaidashenko dhaidashenko commented Dec 18, 2023

Addresses slow SendTransaction execution and unnecessary retries caused by a slow or unhealthy active node.

Contributor

I see that you haven't updated any README files. Would it make sense to do so?

@dhaidashenko dhaidashenko marked this pull request as ready for review December 19, 2023 16:35
jmank88 previously approved these changes Dec 19, 2023
Collaborator

@samsondav samsondav left a comment

This is not safe. Send-only nodes might return OK even if the transaction can never be broadcast; e.g. I believe our own internal broadcast node returns OK for everything, even impossible transactions.

We must verify the response from a real Ethereum node.

@samsondav
Collaborator

What you could try, maybe, is to check the responses of primary nodes and use those. It's unsafe to treat responses from sendonly nodes as conclusive, though; they are "fire and forget".

Contributor

@prashantkumar1982 prashantkumar1982 left a comment

I also have a concern that some SendOnlyNodes may just forward the Tx to a real RPC and blindly return success, so their response may not be reliable.
Can't be sure, but I suspect that.

Also, if 2 nodes return incompatible results, for example one returns Success and one returns Fatal, then we ought to log Critical Errors for that case.

Can we think of changing the overall logic to this (rough Go sketch after the list):

  • Wait until we have responses from at least 70% of nodes. (Not waiting for 100% of nodes allows us to skip really slow nodes, which would likely time out.)
  • Save the response category for each node.
  • If we have even 1 success, we return success.
  • Before returning, we do a validation check: if there's 1 success, then there shouldn't be any Fatal (or ExceedsMaxFee/FeeOutOfValidRange/Underpriced) responses. If there are, we log critical errors and allow this error to be shown in health monitors; there's likely a deeper misconfiguration or a bug somewhere.
  • If any error was of Unknown type, also log a critical error. We shouldn't ever get this.
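A rough Go sketch of the aggregation described above; the types here are hypothetical stand-ins for illustration, not the PR's actual implementation:

package multinode

import "math"

// Hypothetical stand-ins for the real result types, to keep the sketch self-contained.
type SendTxReturnCode int

const (
	Successful SendTxReturnCode = iota
	Fatal
	Unknown
)

type sendTxResult struct {
	Err        error
	ResultCode SendTxReturnCode
}

// collectTxResults waits for responses from ~70% of nodes, saves the response
// category of each, and prefers any success over errors.
func collectTxResults(results <-chan sendTxResult, totalNodes int) error {
	quorum := int(math.Ceil(float64(totalNodes) * 0.7))
	resultsByCode := map[SendTxReturnCode][]error{}
	for i := 0; i < quorum; i++ {
		r, ok := <-results
		if !ok {
			break // every node replied before quorum was reached
		}
		resultsByCode[r.ResultCode] = append(resultsByCode[r.ResultCode], r.Err)
	}

	_, gotSuccess := resultsByCode[Successful]
	if _, gotFatal := resultsByCode[Fatal]; gotSuccess && gotFatal {
		// Incompatible results: one node accepted the tx while another calls it
		// fatal. Log a critical error / surface it in health monitors here.
	}
	if _, gotUnknown := resultsByCode[Unknown]; gotUnknown {
		// Unknown classifications should never happen; log critically as well.
	}
	if gotSuccess {
		return nil // even a single success means the tx may be included on-chain
	}
	for _, errs := range resultsByCode {
		return errs[0] // otherwise surface one of the recorded errors
	}
	return nil
}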

 		}(n)
 	})
 	if !ok {
-		c.lggr.Debug("Cannot send transaction on sendonly node; MultiNode is stopped", "node", n.String())
+		c.lggr.Debugw("Cannot send transaction on node; MultiNode is stopped", "node", n.String())
 		return fmt.Errorf("MulltiNode is stopped: %w", context.Canceled)
Contributor

nice catch!

-	sendOnlyError := c.sendOnlyErrorParser(txErr)
-	if sendOnlyError != Successful {
+	c.lggr.Debugw("Node sent transaction", "name", n.String(), "tx", tx, "err", txErr)
+	isSuccess := c.sendOnlyErrorParser(txErr) == Successful
Contributor

nit: The name sendOnlyErrorParser is misleading.
Could you please rename it to sendTxErrorParser?

Collaborator

I think it's important to keep the sendOnly prefix, as it emphasizes the distinction between the standard ClassifySendError function and ClassifySendOnlyError.

c.lggr.Warnw("RPC returned error", "name", n.String(), "tx", tx, "err", txErr)
}

if isSuccess || n == main {
Contributor

I don't really like that we will still return an error as soon as the main node says so, without waiting for the other nodes.
Could we wait until the other nodes have had a chance to return success?
I don't think it's a huge concern if that takes a long time. Or maybe we could use some heuristics, e.g. if the main node has errored, then wait until at least 50% of the nodes have also errored before returning that error code back.

@dimriou
Collaborator

dimriou commented Dec 20, 2023

> I also have a concern that some SendOnlyNodes may just forward the Tx to a real RPC and blindly return success, so their response may not be reliable. Can't be sure, but I suspect that.
>
> Also, if 2 nodes return incompatible results, for example one returns Success and one returns Fatal, then we ought to log Critical Errors for that case.
>
> Can we think of changing the overall logic to this:
>
>   • Wait until we have responses from at least 70% of nodes. (Not waiting for 100% of nodes allows us to skip really slow nodes, which would likely time out.)
>   • Save the response category for each node.
>   • If we have even 1 success, we return success.
>   • Before returning, we do a validation check: if there's 1 success, then there shouldn't be any Fatal (or ExceedsMaxFee/FeeOutOfValidRange/Underpriced) responses. If there are, we log critical errors and allow this error to be shown in health monitors; there's likely a deeper misconfiguration or a bug somewhere.
>   • If any error was of Unknown type, also log a critical error. We shouldn't ever get this.

Given that we can't rely on send-only nodes at all, since their response can be arbitrary, do we really need to optimize this logic? Do we have any indication that this is currently slow and needs to be optimized? I'd rather have "dumb" code that is 95% as fast as super-complicated optimized code.

@dhaidashenko dhaidashenko marked this pull request as draft December 21, 2023 17:34
…of github.com:smartcontractkit/chainlink into feature/BCI-2525-send-transaction-check-all-responses
@dhaidashenko dhaidashenko marked this pull request as ready for review December 22, 2023 19:02
Contributor

@prashantkumar1982 prashantkumar1982 left a comment

Looks very solid!

c.lggr.Criticalw(errMsg, "tx", tx, "resultsByCode", resultsByCode)
err := fmt.Errorf(errMsg)
c.SvcErrBuffer.Append(err)
return severeErrors[0]
Contributor

I think we should return success here, since at least 1 node has accepted our broadcasted Tx, and thus it can now be included on-chain.
We shouldn't throw away this signal.

Regarding manual intervention: you are already logging a Critical log, which implies someone should be investigating it manually.
In the future, when we create a proper health dashboard, this error should flow into that dashboard as a critical issue, with all the details.
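A minimal sketch of that suggestion, reusing names from the quoted diff above (an illustration, not the code the PR ultimately landed):

if len(resultsByCode[Successful]) > 0 {
	if len(severeErrors) > 0 {
		// Surface the inconsistency loudly, but keep the success result.
		c.lggr.Criticalw(errMsg, "tx", tx, "resultsByCode", resultsByCode)
		c.SvcErrBuffer.Append(fmt.Errorf(errMsg))
	}
	return nil // at least one node accepted the tx, so it may be included on-chain
}
return severeErrors[0]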

Collaborator

Consider supplementing this with a prom metric (or adding prom metrics to this more generally), since that is more likely to get alerted on and actioned than a critical log.
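One way to wire that up with the standard Prometheus client; the metric name and labels here are illustrative, not necessarily what the PR landed:

package multinode

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// promSendTxInvariantViolations counts incompatible send-tx results, e.g. one
// node reporting success while another reports a fatal error.
var promSendTxInvariantViolations = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "multi_node_send_tx_invariant_violations",
	Help: "Counts occurrences of incompatible transaction submission results",
}, []string{"network", "chainID"})

// At the point where the critical log is emitted:
//	promSendTxInvariantViolations.WithLabelValues(network, chainID).Inc()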

@@ -73,7 +73,10 @@ func NewChainClient(
 	chainID,
 	chainType,
 	"EVM",
 	ClassifySendOnlyError,
+	func(tx *types.Transaction, err error) commonclient.SendTxReturnCode {
+		return ClassifySendError(err, logger.Sugared(logger.Nop()), tx, common.Address{}, chainType.IsL2())
Collaborator

Why not pass the existing logger param instead of Nop?

Collaborator Author

Passing the proper logger would result in partial logs that duplicate the proper ones: ClassifySendError is called inside SendTransaction, which is called by SendTransactionReturnCode, and the latter performs logging on its own.
Also, ClassifySendError called from SendTransaction does not have access to the address of the source account.

go func() {
	wg.Wait()
	c.wg.Done()
	close(inTxResults)
Collaborator

Better to call close before c.wg.Done(), since Close() shouldn't proceed until this routine has fully returned.
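That is, swapping the last two calls of the quoted goroutine:

go func() {
	wg.Wait()
	close(inTxResults) // close first: Close() must not proceed until this routine is done
	c.wg.Done()
}()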

}

txResultsToReport := make(chan sendTxResult, len(c.nodes))
go fanOut(inTxResults, txResultsToReport, txResults)
Collaborator

What if broadcastTxAsync is already over at this point and someone calls Close(); would we have a leak?

Collaborator Author

fanOut is not leaking, because if broadcastTxAsync is already over, inTxResults is closed. Both txResultsToReport and txResults are sufficiently buffered, so we won't get stuck even if no one is reading the results.
Also, Close cannot complete before fanOut is done, as reportSendTxAnomalies, which is protected by a wait group, waits for txResultsToReport to be closed.
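For reference, a minimal sketch of the kind of fanOut helper being discussed (an assumed shape; the PR's actual helper may differ):

// fanOut copies every value read from source into each destination channel,
// then closes all destinations once source is closed.
func fanOut[T any](source <-chan T, destinations ...chan T) {
	for v := range source {
		for _, dest := range destinations {
			dest <- v
		}
	}
	for _, dest := range destinations {
		close(dest)
	}
}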

Collaborator

I'm not suggesting that it will get stuck; I'm saying there is a rare case where the multinode can close before the fanOut method returns. The fanOut method will still close eventually, perhaps almost instantly after the multinode closes, but theoretically you can have a scenario where the parent routine exits before the child routine, which might get picked up by random tests. So perhaps it would be better if we did c.wg.Add(2) at the beginning.

Collaborator Author

@dhaidashenko dhaidashenko Jan 16, 2024

Yes, that makes sense. It might be better to make it more explicit.

Collaborator

@dimriou dimriou left a comment

@dhaidashenko another approach I drafted could look like this:

ok := c.IfNotStopped(func() {
	c.wg.Add(len(c.sendonlys))
	// fire-n-forget, as sendOnlyNodes can not be trusted with result reporting
	for _, n := range c.sendonlys {
		go func(n SendOnlyNode[CHAIN_ID, RPC_CLIENT]) {
			defer c.wg.Done()
			c.broadcastTx(ctx, n, tx)
		}(n)
	}

	// signal when all the primary nodes done broadcasting tx
	txResultsToReport := make(chan sendTxResult, len(c.nodes))
	c.wg.Add(1)
	var wg sync.WaitGroup
	wg.Add(len(c.nodes))
	for _, n := range c.nodes {
		go func(n SendOnlyNode[CHAIN_ID, RPC_CLIENT]) {
			defer wg.Done()
			resultCode, txErr := c.broadcastTx(ctx, n, tx)

			txResults <- sendTxResult{Err: txErr, ResultCode: resultCode}
			txResultsToReport <- sendTxResult{Err: txErr, ResultCode: resultCode}
		}(n)
	}
	go func() {
		wg.Wait()
		close(txResults)
		close(txResultsToReport)
		c.wg.Done()
	}()

	c.wg.Add(1)
	go c.reportSendTxAnomalies(tx, txResultsToReport)
})

and broadcastTx will simply return the results without the channels. This way we remove the complexity from fanOut. If what you've written seems more comprehensive, I'm fine either way.

txErr := n.RPC().SendTransaction(ctx, tx)
c.lggr.Debugw("Node sent transaction", "name", n.String(), "tx", tx, "err", txErr)
resultCode := c.classifySendTxError(tx, txErr)
if resultCode != Successful && resultCode != TransactionAlreadyKnown {
Contributor

Since you just added the new list sendTxSuccessfulCodes, this check should just change to checking whether resultCode is in that list.
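For instance, with the standard library's slices package (Go 1.21+); a sketch of the suggested check against the quoted code:

import "slices"

if !slices.Contains(sendTxSuccessfulCodes, resultCode) {
	c.lggr.Warnw("RPC returned error", "name", n.String(), "tx", tx, "err", txErr)
}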

 	})
-	t.Run("Fails when closed on sendonly broadcast", func(t *testing.T) {
+	t.Run("Returns success without waiting for the rest of the nodes", func(t *testing.T) {
Contributor

Could you ensure that we call mn.Close() at the end of all tests?
This will help verify that the MultiNode's waitgroups have all ended gracefully, especially with the dependencies now added on the wg field.

Collaborator Author

All of the subtests explicitly close the multinode at the end of the test case, or we do it on cleanup:

newStartedMultiNode := func(t *testing.T, opts multiNodeOpts) testMultiNode {
		mn := newTestMultiNode(t, opts)
		err := mn.StartOnce("startedTestMultiNode", func() error { return nil })
		require.NoError(t, err)
		t.Cleanup(func() {
			require.NoError(t, mn.Close())
		})
		return mn
	}

return err, err
}

const sendTxQuorum = 0.7
Collaborator

Could you add a comment on this const?
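For example, a doc comment along these lines (the wording is only a suggestion):

// sendTxQuorum is the fraction of nodes that must report a result for a
// broadcast transaction before the aggregated outcome is returned; waiting
// for fewer than 100% lets us skip nodes slow enough to time out.
const sendTxQuorum = 0.7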

@prashantkumar1982 prashantkumar1982 added this pull request to the merge queue Feb 9, 2024
Merged via the queue into develop with commit 556a4f3 Feb 9, 2024
93 checks passed
@prashantkumar1982 prashantkumar1982 deleted the feature/BCI-2525-send-transaction-check-all-responses branch February 9, 2024 23:18
asoliman92 pushed a commit that referenced this pull request Jul 31, 2024
* sendTx: signal success if one of the nodes accepted transaction

* fix logger

* fix merge

* fix race

* fixed multinode tests race

* improve test coverage

* WIP: wait for 70% of nodes to reply on send TX

* tests

* Report invariant violation via prom metrics

* fixed sendTx tests

* address comments

* polish PR

* Describe implementation details in the comment to SendTransaction

* nit fixes

* more fixes

* use softTimeOut default value

* nit fix

* ensure all goroutines are done before Close

* refactor broadcast

* use sendTxSuccessfulCodes slice to identify if result is successful

---------

Co-authored-by: Prashant Yadav <34992934+prashantkumar1982@users.noreply.github.com>