Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[test] Fix IndexShardTests#testScheduledRefresh #110312

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Original file line number Diff line number Diff line change
Expand Up @@ -157,7 +157,9 @@
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
Expand Down Expand Up @@ -3849,7 +3851,6 @@ public void testIsSearchIdle() throws Exception {
closeShards(primary);
}

@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/101008")
@TestIssueLogging(
issueUrl = "https://github.com/elastic/elasticsearch/issues/101008",
value = "org.elasticsearch.index.shard.IndexShard:TRACE"
Expand Down Expand Up @@ -3925,8 +3926,12 @@ public void testScheduledRefresh() throws Exception {
logger.info("--> ensure search idle");
assertTrue(primary.isSearchIdle());
assertTrue(primary.searchIdleTime() >= TimeValue.ZERO.millis());
long periodicFlushesBefore = primary.flushStats().getPeriodic();
primary.flushOnIdle(0);
assertBusy(() -> assertThat(primary.flushStats().getPeriodic(), greaterThan(periodicFlushesBefore)));

logger.info("--> scheduledRefresh(future5)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions related to the test (and not specifically to this line) that I don't understand:

  • Why is assertFalse(future4.actionGet());? The comment before future4 says while shard is search active and ensure scheduleRefresh(...) makes documen visible:. So I am not sure why the scheduled refresh returns false? I think that the shard was search idle, and that's why the scheduled refresh is false. Right? We should probably correct that comment.
  • Is the primary.flushOnIdle(0); actually flushing the shard? Doesn't flush also refresh? So how come assertTrue(future5.actionGet());? I would expect that since the shard was refreshed during the flush, that the scheduled refresh of future5 would return false.

Until I understand the test, I'm not sure I can follow the proposed solution. Any answers would be very helpful to me!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the shard was search idle, and that's why the scheduled refresh is false. Right? We should probably correct that comment.

Yes, that's the reason why it returns false in this case, we should update the comment 👍.

Doesn't flush also refresh?

Notice that the refresh executed after a flush concerns the internal searcher (

refresh("version_table_flush", SearcherScope.INTERNAL, true);
) whereas the scheduled refresh concerns the external searcher (
ActionListener.completeWith(listener, () -> refresh(source, SearcherScope.EXTERNAL, false));
) hence the need for the explicit refresh.

If it's the flush thread racing against the scheduled refresh call what's causing this test failure, I'm not sure if the proposed solution will fix it, since it will return false always as the refresh will be a no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fcofdez for the clarification, I see!

If it's the flush thread racing against the scheduled refresh call what's causing this test failure, I'm not sure if the proposed solution will fix it, since it will return false always as the refresh will be a no-op.

Why is that? I think it actually may work. If there's a race with flush and refresh, the refresh may return false due to the race. But once the flush is executed, it makes active flag to be false, which should make the scheduledRefresh() finally refresh. So an assertBusy with scheduledRefresh should finally succeed. Right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which should make the scheduledRefresh() finally refresh. So an assertBusy with scheduledRefresh should finally succeed. Right?

But if the refresh results in a no-op (there're no pending docs in the Lucene IndexWriter) the refresh will return false , right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that assured by the previous indexDoc(primary, "_doc", "2", "{\"foo\" : \"bar\"}"); which has not been refreshed yet? I am thinking this is the reason the scheduled refresh actually returns true in the majority of the CI runs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kingherc I've added a check that the flush finishes before the refresh and a check that there are no pending refreshes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the same point, they do if (store.tryIncRef()) {, and the flush wins while the refresh does not win.

@kingherc but the store shouldn't be closed while the test is running 🤔 (both flush and refresh should be able to incRef the store ref unless the shard is closing)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only possibility is that

returns false due to a concurrent refresh making this one a no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fcofdez , when I was quickly checking the code, I mistakenly thought tryIncRef does not have while-loop internally to retry, but it does, so my thinking was erroneous. Needs more in-depth investigation than my quick going-back-and-forth in the code finally :D

Still I believe it was not the intention of the test to let both the flush & the refresh to race with each other. However, I cannot say for sure why the racing results in refresh returning false. It would be good to understand why @arteam , plus whether there's some other refresh going on as @fcofdez suggests. @arteam were you able to reproduce it? Maybe we can add some more tracing to see if another refresh is going on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that when that test was written the flush call was executed synchronously... that needs a bit of digging :D

ensureNoPendingScheduledRefresh();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's some spurious refresh being run, we cannot be sure it is at this exact spot.

Maybe a better approach would be to get the number of external refreshes before the future5, and asserting that after the scheduled refresh, it's incremented by 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kingherc That's a great idea. I've replaced the hack with blocking the refresh thread pool with checking the refresh stats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to understand why the test was failing, in the logs of the test failures you can see that the flush is executed after the assertion trips, so I'm not convinced about the flush being the issue here

PlainActionFuture<Boolean> future5 = new PlainActionFuture<>();
primary.scheduledRefresh(future5);
assertTrue(future5.actionGet()); // make sure we refresh once the shard is inactive
Expand All @@ -3936,6 +3941,16 @@ public void testScheduledRefresh() throws Exception {
closeShards(primary);
}

private void ensureNoPendingScheduledRefresh() {
var refreshThreadPoolExecutor = (ThreadPoolExecutor) threadPool.executor(ThreadPool.Names.REFRESH);
int maximumPoolSize = refreshThreadPoolExecutor.getMaximumPoolSize();
var phaser = new Phaser(maximumPoolSize + 1);
for (int i = 0; i < maximumPoolSize; i++) {
refreshThreadPoolExecutor.execute(phaser::arriveAndAwaitAdvance);
}
phaser.arriveAndAwaitAdvance();
}

public void testRefreshIsNeededWithRefreshListeners() throws IOException, InterruptedException {
Settings settings = indexSettings(IndexVersion.current(), 1, 1).build();
IndexMetadata metadata = IndexMetadata.builder("test").putMapping("""
Expand Down
Loading