Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[test] Fix IndexShardTests#testScheduledRefresh #110312

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

arteam
Copy link
Contributor

@arteam arteam commented Jun 30, 2024

After we flushed the shard, we only make sure that the refresh call is propagated to the shard engine, but we can't be sure that the call is actually ends up in a shard refresh. The call in InternalEngine#refresh can return false if we couldn't acquire the lock on ElasticsearchDirectoryReader, because it's already being refreshed.

We can wrap the call in assertBusy to retry it in order to make sure that the shard eventually gets refreshed.

Resolves #101008

After we flushed the shard, we only make sure that the refresh call is propagated
to the shard engine, but we can't be sure that the call is actually ends up in a shard refresh.
The call in `InternalEngine#refresh` can return `false` if we couldn't acquire the lock
on `ElasticsearchDirectoryReader`, because it's already being refreshed.

We can wrap the call in `assertBusy` to retry it in order to make sure that the
shard eventually gets refreshed.

Resolves #101008
@arteam arteam added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >test Issues or PRs that are addressing/adding tests labels Jun 30, 2024
@arteam arteam marked this pull request as ready for review July 1, 2024 06:19
@arteam arteam requested a review from fcofdez July 1, 2024 06:20
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Jul 1, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@arteam arteam requested review from kingherc and tlrx July 1, 2024 11:02
@fcofdez
Copy link
Contributor

fcofdez commented Jul 2, 2024

The call in InternalEngine#refresh can return false if we couldn't acquire the lock on ElasticsearchDirectoryReader, because it's already being refreshed.

But this test concerns the IndexShard alone, meaning that there shouldn't be any background threads trying to refresh the shard (i.e. scheduled refreshes). Could you elaborate on which thread is calling refresh concurrently?

@@ -3927,9 +3926,11 @@ public void testScheduledRefresh() throws Exception {
assertTrue(primary.searchIdleTime() >= TimeValue.ZERO.millis());
primary.flushOnIdle(0);
logger.info("--> scheduledRefresh(future5)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions related to the test (and not specifically to this line) that I don't understand:

  • Why is assertFalse(future4.actionGet());? The comment before future4 says while shard is search active and ensure scheduleRefresh(...) makes documen visible:. So I am not sure why the scheduled refresh returns false? I think that the shard was search idle, and that's why the scheduled refresh is false. Right? We should probably correct that comment.
  • Is the primary.flushOnIdle(0); actually flushing the shard? Doesn't flush also refresh? So how come assertTrue(future5.actionGet());? I would expect that since the shard was refreshed during the flush, that the scheduled refresh of future5 would return false.

Until I understand the test, I'm not sure I can follow the proposed solution. Any answers would be very helpful to me!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the shard was search idle, and that's why the scheduled refresh is false. Right? We should probably correct that comment.

Yes, that's the reason why it returns false in this case, we should update the comment 👍.

Doesn't flush also refresh?

Notice that the refresh executed after a flush concerns the internal searcher (

refresh("version_table_flush", SearcherScope.INTERNAL, true);
) whereas the scheduled refresh concerns the external searcher (
ActionListener.completeWith(listener, () -> refresh(source, SearcherScope.EXTERNAL, false));
) hence the need for the explicit refresh.

If it's the flush thread racing against the scheduled refresh call what's causing this test failure, I'm not sure if the proposed solution will fix it, since it will return false always as the refresh will be a no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fcofdez for the clarification, I see!

If it's the flush thread racing against the scheduled refresh call what's causing this test failure, I'm not sure if the proposed solution will fix it, since it will return false always as the refresh will be a no-op.

Why is that? I think it actually may work. If there's a race with flush and refresh, the refresh may return false due to the race. But once the flush is executed, it makes active flag to be false, which should make the scheduledRefresh() finally refresh. So an assertBusy with scheduledRefresh should finally succeed. Right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which should make the scheduledRefresh() finally refresh. So an assertBusy with scheduledRefresh should finally succeed. Right?

But if the refresh results in a no-op (there're no pending docs in the Lucene IndexWriter) the refresh will return false , right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that assured by the previous indexDoc(primary, "_doc", "2", "{\"foo\" : \"bar\"}"); which has not been refreshed yet? I am thinking this is the reason the scheduled refresh actually returns true in the majority of the CI runs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kingherc I've added a check that the flush finishes before the refresh and a check that there are no pending refreshes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the same point, they do if (store.tryIncRef()) {, and the flush wins while the refresh does not win.

@kingherc but the store shouldn't be closed while the test is running 🤔 (both flush and refresh should be able to incRef the store ref unless the shard is closing)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only possibility is that

returns false due to a concurrent refresh making this one a no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fcofdez , when I was quickly checking the code, I mistakenly thought tryIncRef does not have while-loop internally to retry, but it does, so my thinking was erroneous. Needs more in-depth investigation than my quick going-back-and-forth in the code finally :D

Still I believe it was not the intention of the test to let both the flush & the refresh to race with each other. However, I cannot say for sure why the racing results in refresh returning false. It would be good to understand why @arteam , plus whether there's some other refresh going on as @fcofdez suggests. @arteam were you able to reproduce it? Maybe we can add some more tracing to see if another refresh is going on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that when that test was written the flush call was executed synchronously... that needs a bit of digging :D

@arteam arteam requested a review from kingherc July 3, 2024 10:25
@arteam arteam requested review from kingherc and removed request for kingherc July 3, 2024 12:34
@@ -3927,9 +3926,11 @@ public void testScheduledRefresh() throws Exception {
assertTrue(primary.searchIdleTime() >= TimeValue.ZERO.millis());
primary.flushOnIdle(0);
logger.info("--> scheduledRefresh(future5)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fcofdez , when I was quickly checking the code, I mistakenly thought tryIncRef does not have while-loop internally to retry, but it does, so my thinking was erroneous. Needs more in-depth investigation than my quick going-back-and-forth in the code finally :D

Still I believe it was not the intention of the test to let both the flush & the refresh to race with each other. However, I cannot say for sure why the racing results in refresh returning false. It would be good to understand why @arteam , plus whether there's some other refresh going on as @fcofdez suggests. @arteam were you able to reproduce it? Maybe we can add some more tracing to see if another refresh is going on.

logger.info("--> scheduledRefresh(future5)");
ensureNoPendingScheduledRefresh();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's some spurious refresh being run, we cannot be sure it is at this exact spot.

Maybe a better approach would be to get the number of external refreshes before the future5, and asserting that after the scheduled refresh, it's incremented by 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed Meta label for distributed team >test Issues or PRs that are addressing/adding tests v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] IndexShardTests testScheduledRefresh failing
4 participants