fix(clustering/sync): avoiding long delay caused by race condition #13896
base: master
Conversation
return nil, err
end

return true
end


function sync_once(premature, retry_count)
It has the same name as _M:sync_once; should we choose another one?
I think it should be the new sync_handler().
No. sync_handler is the name of another function.
Could we write a test case to verify this fix?
Not easily. We cannot be sure that the race condition will happen when we test. We could observe whether it relieves the flaky tests.
The branch was force-pushed from 6ac9004 to c9f0ab7.
@@ -164,6 +167,8 @@ function _M:init_dp(manager)

local lmdb_ver = tonumber(declarative.get_current_hash()) or 0
if lmdb_ver < version then
  -- set latest version to shm
  kong_shm:set(CLUSTERING_DATA_PLANES_LATEST_VERSION_KEY, version)
We only run sync in one worker; is it necessary to store it in shared memory?
This is not true. @dndx Could you confirm?
We run incremental sync inside all workers; it's just that only one worker can sync at a time.
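To illustrate why the version lives in shared memory: any worker may receive the notify RPC, while a different worker may hold the sync mutex when the next sync runs. A minimal sketch, assuming kong_shm is the "kong" shared dict and using a hypothetical key value:

local kong_shm = ngx.shared.kong
-- assumed value; the real constant is CLUSTERING_DATA_PLANES_LATEST_VERSION_KEY
local KEY = "clustering_data_planes:latest_version"

-- in whichever worker handles the notify RPC from the CP:
local function on_notify(version)
  kong_shm:set(KEY, version)
end

-- in whichever worker later wins the sync mutex:
local function latest_notified_version()
  return kong_shm:get(KEY)
end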
  return true
end)

local res, err = concurrency.with_worker_mutex(SYNC_MUTEX_OPTS, do_sync)
Not sure, but could we use concurrency.with_coroutine_mutex?
We designed sync.v2 to work without the privileged worker (and without worker no. 0).
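For reference, a minimal sketch of the cross-worker lock under discussion, with assumed option values: kong.concurrency.with_worker_mutex serializes the callback across all workers of the node (no privileged worker needed), whereas with_coroutine_mutex only serializes coroutines within a single worker.

local concurrency = require("kong.concurrency")

-- assumed options; the lock name must be identical in every worker
local SYNC_MUTEX_OPTS = { name = "sync", timeout = 0 }

local function do_sync()
  -- only the worker currently holding the lock reaches this point
  return true
end

local res, err = concurrency.with_worker_mutex(SYNC_MUTEX_OPTS, do_sync)
if not res then
  ngx.log(ngx.ERR, "sync failed: ", err)
end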
return nil, err
end

return true
end


function sync_once_impl(premature, retry_count)
Could we use a simple loop? Like:

for i = 1, 5 do
  sync_handler()
  if updated then
    break
  end
  ngx.sleep(0)
end
No. Recreating the timer prevents a long-lived timer from causing a resource leak.
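A hedged sketch of the re-arming pattern being defended here (sync_handler's behavior and MAX_RETRY's value are assumptions): each attempt runs in its own short-lived zero-delay timer, and the handler schedules a fresh timer instead of looping, so no timer context stays alive across attempts.

local MAX_RETRY = 25 -- assumed limit

local function attempt(premature, retry_count)
  if premature then
    return -- the worker is shutting down
  end

  local updated = sync_handler() -- assumed: performs one sync attempt

  if not updated and retry_count < MAX_RETRY then
    -- re-arm with a fresh timer and let this one exit immediately
    local ok, err = ngx.timer.at(0, attempt, retry_count + 1)
    if not ok then
      ngx.log(ngx.ERR, "failed to re-arm sync timer: ", err)
    end
  end
end

ngx.timer.at(0, attempt, 0)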
if not latest_notified_version or current_version < latest_notified_version then
  retry_count = retry_count or 0
  if retry_count > MAX_RETRY then
    ngx_log(ngx_ERR, "sync_once retry count exceeded. retry_count: ", retry_count)
I found that when the DP starts without getting a notify RPC call from the CP, it always reports this ERROR log.
So could we do this: if no one has updated the latest version, we could just sync once and then return directly, without re-adding the task for initialization:

if not latest_notified_version then
  return
end
I think it should work; refactoring or cleanup can come later.
Summary
It will retry until the version matches
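Putting the diff hunks together, a hedged sketch of the retry-until-match flow (the identifiers come from the hunks above, but the exact wiring here is an assumption, not the PR's code):

local function sync_once_impl(premature, retry_count)
  if premature then
    return
  end

  sync_handler() -- assumed: one sync attempt under the worker mutex

  local current_version = tonumber(declarative.get_current_hash()) or 0
  local latest_notified_version =
    kong_shm:get(CLUSTERING_DATA_PLANES_LATEST_VERSION_KEY)

  -- nothing announced by the CP yet: stop instead of retrying forever
  if not latest_notified_version then
    return
  end

  if current_version < latest_notified_version then
    retry_count = retry_count or 0
    if retry_count > MAX_RETRY then
      ngx_log(ngx_ERR, "sync_once retry count exceeded. retry_count: ",
              retry_count)
      return
    end
    -- still behind: schedule another attempt in a fresh timer
    ngx.timer.at(0, sync_once_impl, retry_count + 1)
  end
end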
Checklist
- Changelog entry added under CHANGELOG/unreleased/kong, or skip-changelog label added on the PR if unnecessary (see README.md)

Issue reference
Fix KAG-5857, KAG-5876