Fix flake in detecting reorg by nodes #15289

Merged
merged 3 commits into develop from fix-reorgtest-flake on Nov 20, 2024

Conversation

b-gopalswami (Contributor) commented Nov 18, 2024

For some reason, the health check call on the nodes doesn't report the finality-violated error consistently across all of them. As a result, the test TestSmokeCCIPReorgAboveFinalityAtDestination/Above_finality_reorg_in_destination_chain flakes often in CI.

This change adjusts the expectation so that at least half of the nodes detect the violation, instead of every node.

CI failure examples:
https://github.com/smartcontractkit/chainlink/actions/runs/11894867899/job/33143516553
https://github.com/smartcontractkit/chainlink/actions/runs/11894232425/job/33141414438?pr=15282
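
A minimal Go sketch of the relaxed assertion (not the actual smoke-test code; the report struct and helper names are hypothetical) showing the idea of requiring at least half of the nodes, rather than all of them, to report the finality violation:

```go
package main

import (
	"fmt"
	"strings"
)

// nodeHealthReport stands in for one node's health-check output.
type nodeHealthReport struct {
	name   string
	errors []string
}

// countFinalityViolations counts the nodes whose health report contains a
// finality-violation error.
func countFinalityViolations(reports []nodeHealthReport) int {
	count := 0
	for _, r := range reports {
		for _, e := range r.errors {
			if strings.Contains(e, "finality violated") {
				count++
				break
			}
		}
	}
	return count
}

func main() {
	reports := []nodeHealthReport{
		{name: "node-0", errors: []string{"finality violated"}},
		{name: "node-1", errors: nil}, // this node never surfaces the error
		{name: "node-2", errors: []string{"finality violated"}},
		{name: "node-3", errors: []string{"finality violated"}},
	}
	detected := countFinalityViolations(reports)
	required := (len(reports) + 1) / 2 // at least half, rounded up
	if detected >= required {
		fmt.Printf("pass: %d of %d nodes detected the reorg (need >= %d)\n", detected, len(reports), required)
	} else {
		fmt.Printf("flake: only %d of %d nodes detected the reorg (need >= %d)\n", detected, len(reports), required)
	}
}
```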

AnieeG (Contributor) left a comment

Should we adjust this to half of the nodes just to make the tests pass? I think it would help if we first investigated why the rest of the nodes are not able to detect it. Also, why half? If we verify that f+1 nodes are able to detect the reorg, I think that would make more sense.

b-gopalswami (Contributor, Author) commented

> Should we adjust this to half of the nodes just to make the tests pass? I think it would help if we first investigated why the rest of the nodes are not able to detect it. Also, why half? If we verify that f+1 nodes are able to detect the reorg, I think that would make more sense.

@AnieeG My understanding is that even if one node detects a violation, the system should halt. So I was trying to make sure at least half of the nodes detect it. If that's not the case and we need f+1 nodes to detect it, then it would make sense to update the check.

I couldn't reproduce this failure locally, and the node logs are not available in the CI run. Since every node is set up the same way in this test, I'm wondering why only some nodes behave differently, or whether this might be down to timing.

AnieeG (Contributor) commented Nov 18, 2024

@mateusz-sekara is this the right assumption?

> My understanding is that even if one node detects a violation, the system should halt.

makramkd (Contributor) commented

No, enough nodes need to detect it and avoid participating in OCR. Commit requires 2f+1 observations, so f+1 nodes need to detect it in order to force the entire DON to stop processing messages.

Half the committee is roughly that: for f = 1, N = 4, and half the committee means 2 nodes seeing finality violations, which is also f+1. For f = 2, N = 7 and ceil(N/2) = 4, which is f+2.
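
A quick worked check of that arithmetic, assuming the usual OCR committee sizing N = 3f + 1 (this snippet is illustrative, not from the codebase): ceil(N/2) is always at least f + 1, so requiring half the nodes is at least as strict as the f + 1 threshold.

```go
package main

import "fmt"

func main() {
	// For each fault-tolerance level f, compare the f+1 detection threshold
	// with the ceil(N/2) threshold used by the test, assuming N = 3f + 1.
	for _, f := range []int{1, 2, 3} {
		n := 3*f + 1        // committee size
		needed := f + 1     // nodes that must detect the reorg to halt the DON
		half := (n + 1) / 2 // ceil(N/2)
		fmt.Printf("f=%d N=%d f+1=%d ceil(N/2)=%d\n", f, n, needed, half)
	}
}
```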

asoliman92 enabled auto-merge November 19, 2024 18:12
asoliman92 added this pull request to the merge queue Nov 20, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 20, 2024
asoliman92 added this pull request to the merge queue Nov 20, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 20, 2024
asoliman92 added this pull request to the merge queue Nov 20, 2024
Merged via the queue into develop with commit 73e82d2 Nov 20, 2024
165 checks passed
asoliman92 deleted the fix-reorgtest-flake branch November 20, 2024 13:22