Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in citus_set_coordinator_host when adding multiple coordinator nodes concurrently #7682

Merged
merged 1 commit into from
Sep 9, 2024

Conversation

m3hm3t
Copy link
Contributor

@m3hm3t m3hm3t commented Sep 4, 2024

When multiple sessions concurrently attempt to add the same coordinator node using citus_set_coordinator_host, there is a potential race condition. Both sessions may pass the initial metadata check (isCoordinatorInMetadata), but only one will succeed in adding the node. The other session will fail with an assertion error (Assert(!nodeAlreadyExists)), causing the server to crash. Even though the AddNodeMetadata function takes an exclusive lock, it appears that the lock is not preventing the race condition before the initial metadata check.

  • Issue: The current logic allows concurrent sessions to pass the check for existing coordinators, leading to an attempt to insert duplicate nodes, which triggers the assertion failure.

  • Impact: This race condition leads to crashes during operations that involve concurrent coordinator additions, as seen in Assert when executing SELECT citus_set_coordinator_host('localhost'); #7646.

Test Plan:

  • Isolation Test Limitation: An isolation test was added to simulate concurrent additions of the same coordinator node, but due to the behavior of PostgreSQL locking mechanisms, the test does not trigger the edge case. The lock applied within the function serializes the operations, preventing the race condition from occurring in the isolation test environment.
    While the edge case is difficult to reproduce in an isolation test, the fix addresses the core issue by ensuring concurrency control through proper locking.

  • Existing Tests: All existing tests related to node metadata and coordinator management have been run to ensure that no regressions were introduced.

After the Fix:

  • Concurrent attempts to add the same coordinator node will be serialized. One session will succeed in adding the node, while the others will skip the operation without crashing the server.

Copy link

codecov bot commented Sep 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.69%. Comparing base (68d28ec) to head (3e5afb2).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7682      +/-   ##
==========================================
- Coverage   89.70%   89.69%   -0.02%     
==========================================
  Files         283      283              
  Lines       60509    60511       +2     
  Branches     7541     7541              
==========================================
- Hits        54280    54275       -5     
- Misses       4074     4080       +6     
- Partials     2155     2156       +1     

@m3hm3t m3hm3t changed the title Add the locking mechanism before the check for whether the coordinator already exists Fix race condition in citus_set_coordinator_host when adding multiple coordinator nodes concurrently Sep 4, 2024
@m3hm3t m3hm3t self-assigned this Sep 4, 2024
@m3hm3t m3hm3t marked this pull request as ready for review September 4, 2024 18:47
@m3hm3t m3hm3t linked an issue Sep 4, 2024 that may be closed by this pull request
@onurctirtir onurctirtir self-requested a review September 6, 2024 11:06
Copy link
Member

@onurctirtir onurctirtir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed PR description!

@onurctirtir
Copy link
Member

Just to be more clear;

Impact: This race condition leads to crashes during operations that involve concurrent coordinator additions, as seen in #7646.

Assertion failures crash the backend if --enable-assert was provided when building the Postgres. For this reason, we usually prefer to just say "it causes an assertion failure when we do .." (as you also did) and avoid to add "it crashes when we do ..". This is because, specifically in our case, this bug doesn't actually cause a crash or such in prod builds, where asserts are almost always disabled.

Also a side note: Most of the time, if a code causes an assertion failure when asserts are enabled, it's quite likely that it'd cause some incorrect behavior later in the same code-path if asserts were disable since we mostly use asserts to check for the cases that where the rest of the code is not hardened against etc. (like out-of-bound array indexes etc.) However, sort of exceptionally for this issue, the underlying function that we use a few lines below just becomes a no-op when the node was already added. That's why this is more like a fix for dev builds.

@m3hm3t m3hm3t merged commit 4775715 into main Sep 9, 2024
122 checks passed
@m3hm3t m3hm3t deleted the m3hm3t/set_coordinator_lock branch September 9, 2024 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Assert when executing SELECT citus_set_coordinator_host('localhost');
3 participants