
fix(clustering/sync): avoiding long delay caused by race condition #13896

Open
wants to merge 1 commit into base: master
Conversation

@StarlightIbuki (Contributor) commented Nov 20, 2024

Summary

The DP will now retry the sync until its local configuration version matches the latest version notified by the CP.
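
A minimal sketch of this retry flow, assuming the helpers and names that appear in the diff below (sync_handler, MAX_RETRY, kong_shm, CLUSTERING_DATA_PLANES_LATEST_VERSION_KEY); the exact wiring in the patch may differ:

-- Sketch only: each attempt runs in its own short-lived timer and re-arms
-- itself until the local version catches up with the notified one.
local function sync_once_impl(premature, retry_count)
  if premature then
    return
  end

  sync_handler()  -- one sync attempt, serialized by the worker mutex

  local latest_notified_version = kong_shm:get(CLUSTERING_DATA_PLANES_LATEST_VERSION_KEY)
  local current_version = tonumber(declarative.get_current_hash()) or 0

  if not latest_notified_version or current_version >= latest_notified_version then
    return true  -- nothing pending, or already caught up
  end

  retry_count = (retry_count or 0) + 1
  if retry_count > MAX_RETRY then
    ngx_log(ngx_ERR, "sync_once retry count exceeded. retry_count: ", retry_count)
    return
  end

  -- schedule a fresh zero-delay timer instead of looping inside this one
  return ngx.timer.at(0, sync_once_impl, retry_count)
end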

Checklist

  • The Pull Request has tests
  • A changelog file has been added to CHANGELOG/unreleased/kong (see its README.md), or the skip-changelog label has been applied if unnecessary
  • The Pull Request has backports to all the versions it needs to cover
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix KAG-5857, KAG-5876

github-actions bot added the core/clustering and cherry-pick kong-ee (schedule this PR for cherry-picking to kong/kong-ee) labels on Nov 20, 2024
StarlightIbuki requested review from dndx, chronolaw, and chobits, and removed the review request for dndx, on Nov 20, 2024 08:56
chronolaw changed the title from "fix(sync): avoiding long delay caused by race condition" to "fix(clustering/sync): avoiding long delay caused by race condition" on Nov 20, 2024
return nil, err
end

return true
end


function sync_once(premature, retry_count)
Contributor:

It has the same name as _M:sync_once; should we choose another one?

Contributor:

I think it should be the new sync_handler().

@StarlightIbuki (Contributor, Author):

No. sync_handler is the name of another function.

@chronolaw (Contributor):

Could we write a test case to verify this fix?

kong/clustering/services/sync/rpc.lua (resolved)

@StarlightIbuki (Contributor, Author), replying to "Could we write a test case to verify this fix?":

Not easily. We cannot be sure that the race condition will occur when we test. We could observe whether it relieves the flaky tests.

@@ -164,6 +167,8 @@ function _M:init_dp(manager)

  local lmdb_ver = tonumber(declarative.get_current_hash()) or 0
  if lmdb_ver < version then
    -- set the latest version in shm
    kong_shm:set(CLUSTERING_DATA_PLANES_LATEST_VERSION_KEY, version)
Contributor:

We only run sync in one worker; is it necessary to store it in shared memory?

@StarlightIbuki (Contributor, Author):

This is not true. @dndx, could you confirm?

Member:

We run incremental sync inside all workers; it's just that only one worker can sync at a time.
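
To illustrate the point about shared memory (a sketch reusing names from the diff; not the exact call sites in the patch): the notify RPC may arrive in any worker, while the sync itself runs in whichever worker acquires the mutex, so the notified version has to be visible across workers.

-- Notify handler (may run in any worker): publish the version to shm.
kong_shm:set(CLUSTERING_DATA_PLANES_LATEST_VERSION_KEY, version)

-- Whichever worker wins the mutex performs the sync and reads it back.
local res, err = concurrency.with_worker_mutex(SYNC_MUTEX_OPTS, function()
  local latest = kong_shm:get(CLUSTERING_DATA_PLANES_LATEST_VERSION_KEY)
  -- run do_sync() until tonumber(declarative.get_current_hash()) reaches latest
  return true
end)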


return true
end)
local res, err = concurrency.with_worker_mutex(SYNC_MUTEX_OPTS, do_sync)
Contributor:

Not sure; could we use concurrency.with_coroutine_mutex?

@StarlightIbuki (Contributor, Author):

We designed sync v2 to work without the privileged worker (and worker no. 0).

return nil, err
end

return true
end


function sync_once_impl(premature, retry_count)
Contributor:

Could we use a simple loop? Like:

for i = 1, 5 do
  sync_handler()
  if updated then
    break
  end
  ngx.sleep(0)
end

@StarlightIbuki (Contributor, Author):

No. Re-creating the timer prevents a long-lived timer from causing a resource leak.

if not latest_notified_version or current_version < latest_notified_version then
  retry_count = retry_count or 0
  if retry_count > MAX_RETRY then
    ngx_log(ngx_ERR, "sync_once retry count exceeded. retry_count: ", retry_count)
@chobits (Contributor), Nov 26, 2024:

I found that if a DP starts without getting a notify RPC call from the CP, it will always report this ERROR log.

@chobits (Contributor), Nov 26, 2024:

So could we do this: if no one has updated the latest version, we could just sync once and then return directly, without adding the task again for initialization:

if not latest_notified_version then
  return
end

@chronolaw (Contributor) left a review comment:

I think it should work; refactoring or cleanup can come later.

Labels: cherry-pick kong-ee (schedule this PR for cherry-picking to kong/kong-ee), core/clustering, size/M, skip-changelog
Projects: none yet
Participants: 4