-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add failing test for edge case max_batch_size=max_entries #11378
Conversation
I think I found a reasonable bugfix (already pushed, tests are fine again): potential fix for race conditionThis commit might fix #11377 by removing currently processed elements Two tests needed changes:
Why do I think this works?
These assumptions strongly need verification by some lua experts! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! Can you address the one comment that I have put in and also add a CHANGELOG entry? Thank you!
kong/tools/queue.lua
Outdated
@@ -251,11 +251,21 @@ function Queue:process_once() | |||
end | |||
end | |||
|
|||
local batch = {unpack(self.entries, self.front, self.front + entry_count - 1)} | |||
-- Guard against queue shrinkage during handler invocation by using math.min below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer needed. The comment kind of pointed at the bug that you've discovered without me actually realizing that it was a bug. With your change, we'll actually always be able to remove entry_count
entries.
When defining a very special edge case configuration having max_batch_size=max_entries, the queue can fail with an assertion error when removing the frontmost element. This happens especially when the callback repeatedly fails (eg. an unavailable backend system receiving data). What happens: 1. we add max_batch_size elements, all of which "post" resources 2. the batch queue consumes all of those resources in `process_once` by `wait()`ing for them, but gets stuck processing/sending the batch 3. as `process_once` is stuck until `max_retry_time` passed, the function does not run `delete_frontmost_entry()` and thus actually moves the `front` reference 4. when enqueuing the next item, it tries to drop the oldest entry, but triggers the assertion in queue.lua as no resources are left
This commit might fix Kong#11377 by removing currently processed elements out of the race condition window. Two tests needed changes: 1. "giving up sending after retrying" needed another (otherwise) ignored value, such that we can wait long enough in `wait_until_queue_done` (there might be a more elegant solution here) 2. the new test required reactivating the handler to succeed to finally clear the queue Why do I think this works? - immediately after the last call on `semaphore:wait()`, we'll start actually removing items from `entries` - the code cannot be interrupted by other light threads before we actually start the handler These assumptions strongly need verification by some lua experts!
1b35377
to
2885cbd
Compare
Removed the min statement and added the changelog entry. I bundled both queue changes in subsequent lines, I guess the changelog order is not relevant otherwise. |
I've moved these changes to #11431 so that I can deal with the CI issues that we currently have. |
When defining a very special edge case configuration having
max_batch_size=max_entries
, the queue can fail with an assertion error when removing the frontmost element. This happens especially when the callback repeatedly fails (eg. an unavailable backend system receiving data).What happens:
process_once
bywait()
ing for them, but gets stuck processing/sending the batchprocess_once
is stuck untilmax_retry_time
passed, the function does not rundelete_frontmost_entry()
and thus actually moves thefront
referenceKudos to @27ascii for discovering the edge case configuration.
Summary
Checklist
Full changelog
Issue reference
FixTest case for #11377Jens Erat <jens.erat@mercedes-benz.com>, Mercedes-Benz Tech Innovation GmbH, imprint