add failing test for edge case max_batch_size=max_entries #11378

JensErat · 2023-08-09T13:43:05Z

With the very special edge case configuration max_batch_size=max_entries, the queue can fail with an assertion error when removing the frontmost element. This happens especially when the callback fails repeatedly (e.g. because the backend system receiving the data is unavailable).

What happens:

1. we add `max_batch_size` elements, all of which "post" resources
2. the batch queue consumes all of those resources in `process_once` by `wait()`ing for them, but gets stuck processing/sending the batch
3. as `process_once` is stuck until `max_retry_time` has passed, the function does not run `delete_frontmost_entry()` and thus never moves the `front` reference
4. when enqueuing the next item, the queue tries to drop the oldest entry, but triggers the assertion in queue.lua as no resources are left
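
The steps above can be reproduced with a small stand-alone model. This is only a sketch under assumptions, not Kong's actual `queue.lua`: the `resources` counter stands in for the semaphore, and `new_queue`, `start_batch` and the assertion message are hypothetical names chosen for illustration.

```lua
-- Simplified, hypothetical model of the failure; NOT Kong's queue.lua.
-- `resources` stands in for the semaphore count.
local Queue = {}
Queue.__index = Queue

local function new_queue(max_entries)
  return setmetatable({
    entries = {},
    front = 1,
    back = 1,
    resources = 0,              -- one "posted" resource per queued entry
    max_entries = max_entries,
  }, Queue)
end

function Queue:count()
  return self.back - self.front
end

function Queue:delete_frontmost_entry()
  -- models the assertion: dropping an entry needs a free resource
  assert(self.resources > 0, "no resources left")
  self.resources = self.resources - 1
  self.entries[self.front] = nil
  self.front = self.front + 1
end

function Queue:enqueue(entry)
  if self:count() >= self.max_entries then
    self:delete_frontmost_entry()       -- drop the oldest entry to make room
  end
  self.entries[self.back] = entry
  self.back = self.back + 1
  self.resources = self.resources + 1   -- "post"
end

-- Buggy batch start: wait() once per entry, but leave the entries in place.
function Queue:start_batch(n)
  for _ = 1, n do
    self.resources = self.resources - 1 -- "wait"
  end
end

local q = new_queue(2)          -- max_entries = 2
q:enqueue("a")
q:enqueue("b")
q:start_batch(2)                -- max_batch_size = 2, handler is now "stuck"
local ok, err = pcall(q.enqueue, q, "c")
print(ok, err)
```

In this model the third `enqueue` fails inside `pcall`, because the queue still reports two entries while the resource count has already dropped to zero.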

Kudos to @27ascii for discovering the edge case configuration.

Summary

Checklist

The Pull Request has tests
There's an entry in the CHANGELOG
There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Full changelog

[Implement ...]

Issue reference

~~Fix~~ Test case for #11377

^{Jens Erat <jens.erat@mercedes-benz.com>, Mercedes-Benz Tech Innovation GmbH, imprint}

JensErat · 2023-08-09T14:38:58Z

I think I found a reasonable bugfix (already pushed, tests are fine again):

potential fix for race condition

This commit might fix #11377 by moving the elements that are currently being processed out of the race condition window.

Two tests needed changes:

1. "giving up sending after retrying" needed another (otherwise ignored) value, such that we can wait long enough in `wait_until_queue_done` (there might be a more elegant solution here)
2. the new test required reactivating the handler so that it succeeds and finally clears the queue

Why do I think this works?

- immediately after the last call to `semaphore:wait()`, we'll start actually removing items from `entries`
- the code cannot be interrupted by other light threads before we actually start the handler
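
The idea behind the fix can be illustrated with the same kind of simplified model. Again, this is a sketch under assumptions and not the actual patch: `resources` stands in for the semaphore, and `new_queue`, `take_batch` and `drop_oldest_entry` are hypothetical names. Only the `unpack(self.entries, self.front, ...)` expression is borrowed from the diff under review.

```lua
-- Simplified, hypothetical model of the fix; NOT Kong's queue.lua.
local unpack = table.unpack or unpack   -- Lua 5.2+ vs. Lua 5.1/LuaJIT

local Queue = {}
Queue.__index = Queue

local function new_queue(max_entries)
  return setmetatable({
    entries = {},
    front = 1,
    back = 1,
    resources = 0,              -- stands in for the semaphore count
    max_entries = max_entries,
  }, Queue)
end

function Queue:count()
  return self.back - self.front
end

function Queue:delete_frontmost_entry()
  self.entries[self.front] = nil
  self.front = self.front + 1
end

function Queue:drop_oldest_entry()
  assert(self.resources > 0, "no resources left")
  self.resources = self.resources - 1
  self:delete_frontmost_entry()
end

function Queue:enqueue(entry)
  if self:count() >= self.max_entries then
    self:drop_oldest_entry()
  end
  self.entries[self.back] = entry
  self.back = self.back + 1
  self.resources = self.resources + 1   -- "post"
end

-- Fixed batch start: copy the batch and remove its entries right after
-- the last wait(), before the (possibly slow) handler is invoked.
function Queue:take_batch(n)
  for _ = 1, n do
    self.resources = self.resources - 1 -- "wait"
  end
  local batch = { unpack(self.entries, self.front, self.front + n - 1) }
  for _ = 1, n do
    self:delete_frontmost_entry()
  end
  return batch
end

local q = new_queue(2)
q:enqueue("a")
q:enqueue("b")
local batch = q:take_batch(2)   -- the handler would now work on `batch` only
local ok = pcall(q.enqueue, q, "c")
print(ok, #batch, q:count())
```

Because the entries leave the queue together with their resources, the entry count and the resource count stay in step, so enqueuing while the handler is busy no longer reaches the drop-oldest path in this model.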

These assumptions strongly need verification by some Lua experts!
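
The second assumption relies on cooperative scheduling: a light thread only gives up control at a yield point. As a rough illustration, the sketch below uses plain Lua coroutines as a stand-in for OpenResty light threads (an assumption of this example), with hypothetical log messages. It only shows that nothing else runs between two statements unless there is a yield in between.

```lua
-- Plain coroutines as a stand-in for light threads (assumption of this sketch).
local log = {}

local worker = coroutine.create(function()
  log[#log + 1] = "worker: last wait() returned"
  log[#log + 1] = "worker: entries removed"   -- no yield point in between
  coroutine.yield()                           -- handler does I/O and yields here
  log[#log + 1] = "worker: handler finished"
end)

local producer = coroutine.create(function()
  log[#log + 1] = "producer: enqueue"
end)

coroutine.resume(worker)    -- runs up to the first yield
coroutine.resume(producer)  -- the other thread can only run now
coroutine.resume(worker)    -- worker continues after its yield

for _, line in ipairs(log) do
  print(line)
end
```

Whether the real code path between `semaphore:wait()` and the handler call is free of yield points still has to be checked against the actual implementation.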

hanshuebner

Nice catch! Can you address the one comment that I have put in and also add a CHANGELOG entry? Thank you!

hanshuebner · 2023-08-17T08:58:15Z

kong/tools/queue.lua

@@ -251,11 +251,21 @@ function Queue:process_once()
    end
  end

+  local batch = {unpack(self.entries, self.front, self.front + entry_count - 1)}
+  -- Guard against queue shrinkage during handler invocation by using math.min below.


This is no longer needed. The comment kind of pointed at the bug that you've discovered without me actually realizing that it was a bug. With your change, we'll actually always be able to remove entry_count entries.


JensErat · 2023-08-21T09:08:02Z

> Nice catch! Can you address the one comment that I have put in and also add a CHANGELOG entry? Thank you!

Removed the min statement and added the changelog entry. I put both queue changes on consecutive lines; I assume the changelog order is otherwise not relevant.

hanshuebner · 2023-08-21T13:09:07Z

I've moved these changes to #11431 so that I can deal with the CI issues that we currently have.

pull-request-size bot added the size/M label Aug 9, 2023

JensErat mentioned this pull request Aug 9, 2023

New queue fails in edge case max_batch_size=max_entries with assertion error #11377

Closed

1 task

bungle requested a review from hanshuebner August 9, 2023 13:57

samugi linked an issue Aug 11, 2023 that may be closed by this pull request

New queue fails in edge case max_batch_size=max_entries with assertion error #11377


hanshuebner suggested changes Aug 17, 2023


JensErat added 4 commits August 21, 2023 10:32

remove shrinking queue workaround

7432cea

update changelog

2885cbd

JensErat force-pushed the bug-queue-assertion-fail branch from 1b35377 to 2885cbd Compare August 21, 2023 09:06

github-actions bot added the changelog label Aug 21, 2023

hanshuebner approved these changes Aug 21, 2023


hanshuebner closed this Aug 21, 2023
