-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[test] Fix IndexShardTests#testScheduledRefresh #110312
base: main
Are you sure you want to change the base?
Changes from 9 commits
b68e38f
9ed2634
45e9ddb
d3781ab
a37a174
c57ce1a
5afadce
26be0a1
1921630
0720dee
759c80a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -157,7 +157,9 @@ | |
import java.util.concurrent.ExecutionException; | ||
import java.util.concurrent.Executor; | ||
import java.util.concurrent.Executors; | ||
import java.util.concurrent.Phaser; | ||
import java.util.concurrent.Semaphore; | ||
import java.util.concurrent.ThreadPoolExecutor; | ||
import java.util.concurrent.TimeUnit; | ||
import java.util.concurrent.atomic.AtomicBoolean; | ||
import java.util.concurrent.atomic.AtomicInteger; | ||
|
@@ -3849,7 +3851,6 @@ public void testIsSearchIdle() throws Exception { | |
closeShards(primary); | ||
} | ||
|
||
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/101008") | ||
@TestIssueLogging( | ||
issueUrl = "https://github.com/elastic/elasticsearch/issues/101008", | ||
value = "org.elasticsearch.index.shard.IndexShard:TRACE" | ||
|
@@ -3925,8 +3926,12 @@ public void testScheduledRefresh() throws Exception { | |
logger.info("--> ensure search idle"); | ||
assertTrue(primary.isSearchIdle()); | ||
assertTrue(primary.searchIdleTime() >= TimeValue.ZERO.millis()); | ||
long periodicFlushesBefore = primary.flushStats().getPeriodic(); | ||
primary.flushOnIdle(0); | ||
assertBusy(() -> assertThat(primary.flushStats().getPeriodic(), greaterThan(periodicFlushesBefore))); | ||
|
||
logger.info("--> scheduledRefresh(future5)"); | ||
ensureNoPendingScheduledRefresh(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If there's some spurious refresh being run, we cannot be sure it is at this exact spot. Maybe a better approach would be to get the number of external refreshes before the future5, and asserting that after the scheduled refresh, it's incremented by 1. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @kingherc That's a great idea. I've replaced the hack with blocking the refresh thread pool with checking the refresh stats. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would like to understand why the test was failing, in the logs of the test failures you can see that the flush is executed after the assertion trips, so I'm not convinced about the flush being the issue here |
||
PlainActionFuture<Boolean> future5 = new PlainActionFuture<>(); | ||
primary.scheduledRefresh(future5); | ||
assertTrue(future5.actionGet()); // make sure we refresh once the shard is inactive | ||
|
@@ -3936,6 +3941,16 @@ public void testScheduledRefresh() throws Exception { | |
closeShards(primary); | ||
} | ||
|
||
private void ensureNoPendingScheduledRefresh() { | ||
var refreshThreadPoolExecutor = (ThreadPoolExecutor) threadPool.executor(ThreadPool.Names.REFRESH); | ||
int maximumPoolSize = refreshThreadPoolExecutor.getMaximumPoolSize(); | ||
var phaser = new Phaser(maximumPoolSize + 1); | ||
for (int i = 0; i < maximumPoolSize; i++) { | ||
refreshThreadPoolExecutor.execute(phaser::arriveAndAwaitAdvance); | ||
} | ||
phaser.arriveAndAwaitAdvance(); | ||
} | ||
|
||
public void testRefreshIsNeededWithRefreshListeners() throws IOException, InterruptedException { | ||
Settings settings = indexSettings(IndexVersion.current(), 1, 1).build(); | ||
IndexMetadata metadata = IndexMetadata.builder("test").putMapping(""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some questions related to the test (and not specifically to this line) that I don't understand:
assertFalse(future4.actionGet());
? The comment before future4 sayswhile shard is search active and ensure scheduleRefresh(...) makes documen visible:
. So I am not sure why the scheduled refresh returns false? I think that the shard was search idle, and that's why the scheduled refresh is false. Right? We should probably correct that comment.primary.flushOnIdle(0);
actually flushing the shard? Doesn't flush also refresh? So how comeassertTrue(future5.actionGet());
? I would expect that since the shard was refreshed during the flush, that the scheduled refresh of future5 would return false.Until I understand the test, I'm not sure I can follow the proposed solution. Any answers would be very helpful to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the reason why it returns false in this case, we should update the comment 👍.
Notice that the refresh executed after a flush concerns the internal searcher (
elasticsearch/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
Line 2225 in 6ae652f
elasticsearch/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
Line 2020 in 6ae652f
If it's the flush thread racing against the scheduled refresh call what's causing this test failure, I'm not sure if the proposed solution will fix it, since it will return false always as the refresh will be a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fcofdez for the clarification, I see!
Why is that? I think it actually may work. If there's a race with flush and refresh, the refresh may return false due to the race. But once the flush is executed, it makes
active
flag to befalse
, which should make thescheduledRefresh()
finally refresh. So an assertBusy with scheduledRefresh should finally succeed. Right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if the refresh results in a no-op (there're no pending docs in the Lucene IndexWriter) the refresh will return false , right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that assured by the previous
indexDoc(primary, "_doc", "2", "{\"foo\" : \"bar\"}");
which has not been refreshed yet? I am thinking this is the reason the scheduled refresh actually returns true in the majority of the CI runs.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kingherc I've added a check that the flush finishes before the refresh and a check that there are no pending refreshes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kingherc but the store shouldn't be closed while the test is running 🤔 (both flush and refresh should be able to incRef the store ref unless the shard is closing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only possibility is that
elasticsearch/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
Line 2046 in bdc65cb
false
due to a concurrent refresh making this one a no-op.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fcofdez , when I was quickly checking the code, I mistakenly thought
tryIncRef
does not have while-loop internally to retry, but it does, so my thinking was erroneous. Needs more in-depth investigation than my quick going-back-and-forth in the code finally :DStill I believe it was not the intention of the test to let both the flush & the refresh to race with each other. However, I cannot say for sure why the racing results in refresh returning false. It would be good to understand why @arteam , plus whether there's some other refresh going on as @fcofdez suggests. @arteam were you able to reproduce it? Maybe we can add some more tracing to see if another refresh is going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that when that test was written the flush call was executed synchronously... that needs a bit of digging :D