-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[test] Fix IndexShardTests#testScheduledRefresh #110312
base: main
Are you sure you want to change the base?
Conversation
After we flushed the shard, we only make sure that the refresh call is propagated to the shard engine, but we can't be sure that the call is actually ends up in a shard refresh. The call in `InternalEngine#refresh` can return `false` if we couldn't acquire the lock on `ElasticsearchDirectoryReader`, because it's already being refreshed. We can wrap the call in `assertBusy` to retry it in order to make sure that the shard eventually gets refreshed. Resolves #101008
Pinging @elastic/es-distributed (Team:Distributed) |
But this test concerns the |
@@ -3927,9 +3926,11 @@ public void testScheduledRefresh() throws Exception { | |||
assertTrue(primary.searchIdleTime() >= TimeValue.ZERO.millis()); | |||
primary.flushOnIdle(0); | |||
logger.info("--> scheduledRefresh(future5)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some questions related to the test (and not specifically to this line) that I don't understand:
- Why is
assertFalse(future4.actionGet());
? The comment before future4 sayswhile shard is search active and ensure scheduleRefresh(...) makes documen visible:
. So I am not sure why the scheduled refresh returns false? I think that the shard was search idle, and that's why the scheduled refresh is false. Right? We should probably correct that comment. - Is the
primary.flushOnIdle(0);
actually flushing the shard? Doesn't flush also refresh? So how comeassertTrue(future5.actionGet());
? I would expect that since the shard was refreshed during the flush, that the scheduled refresh of future5 would return false.
Until I understand the test, I'm not sure I can follow the proposed solution. Any answers would be very helpful to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the shard was search idle, and that's why the scheduled refresh is false. Right? We should probably correct that comment.
Yes, that's the reason why it returns false in this case, we should update the comment 👍.
Doesn't flush also refresh?
Notice that the refresh executed after a flush concerns the internal searcher (
elasticsearch/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
Line 2225 in 6ae652f
refresh("version_table_flush", SearcherScope.INTERNAL, true); |
elasticsearch/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
Line 2020 in 6ae652f
ActionListener.completeWith(listener, () -> refresh(source, SearcherScope.EXTERNAL, false)); |
If it's the flush thread racing against the scheduled refresh call what's causing this test failure, I'm not sure if the proposed solution will fix it, since it will return false always as the refresh will be a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fcofdez for the clarification, I see!
If it's the flush thread racing against the scheduled refresh call what's causing this test failure, I'm not sure if the proposed solution will fix it, since it will return false always as the refresh will be a no-op.
Why is that? I think it actually may work. If there's a race with flush and refresh, the refresh may return false due to the race. But once the flush is executed, it makes active
flag to be false
, which should make the scheduledRefresh()
finally refresh. So an assertBusy with scheduledRefresh should finally succeed. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which should make the scheduledRefresh() finally refresh. So an assertBusy with scheduledRefresh should finally succeed. Right?
But if the refresh results in a no-op (there're no pending docs in the Lucene IndexWriter) the refresh will return false , right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that assured by the previous indexDoc(primary, "_doc", "2", "{\"foo\" : \"bar\"}");
which has not been refreshed yet? I am thinking this is the reason the scheduled refresh actually returns true in the majority of the CI runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kingherc I've added a check that the flush finishes before the refresh and a check that there are no pending refreshes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the same point, they do if (store.tryIncRef()) {, and the flush wins while the refresh does not win.
@kingherc but the store shouldn't be closed while the test is running 🤔 (both flush and refresh should be able to incRef the store ref unless the shard is closing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only possibility is that
elasticsearch/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
Line 2046 in bdc65cb
refreshed = referenceManager.maybeRefresh(); |
false
due to a concurrent refresh making this one a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fcofdez , when I was quickly checking the code, I mistakenly thought tryIncRef
does not have while-loop internally to retry, but it does, so my thinking was erroneous. Needs more in-depth investigation than my quick going-back-and-forth in the code finally :D
Still I believe it was not the intention of the test to let both the flush & the refresh to race with each other. However, I cannot say for sure why the racing results in refresh returning false. It would be good to understand why @arteam , plus whether there's some other refresh going on as @fcofdez suggests. @arteam were you able to reproduce it? Maybe we can add some more tracing to see if another refresh is going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that when that test was written the flush call was executed synchronously... that needs a bit of digging :D
server/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
Outdated
Show resolved
Hide resolved
This reverts commit a37a174.
@@ -3927,9 +3926,11 @@ public void testScheduledRefresh() throws Exception { | |||
assertTrue(primary.searchIdleTime() >= TimeValue.ZERO.millis()); | |||
primary.flushOnIdle(0); | |||
logger.info("--> scheduledRefresh(future5)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fcofdez , when I was quickly checking the code, I mistakenly thought tryIncRef
does not have while-loop internally to retry, but it does, so my thinking was erroneous. Needs more in-depth investigation than my quick going-back-and-forth in the code finally :D
Still I believe it was not the intention of the test to let both the flush & the refresh to race with each other. However, I cannot say for sure why the racing results in refresh returning false. It would be good to understand why @arteam , plus whether there's some other refresh going on as @fcofdez suggests. @arteam were you able to reproduce it? Maybe we can add some more tracing to see if another refresh is going on.
logger.info("--> scheduledRefresh(future5)"); | ||
ensureNoPendingScheduledRefresh(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's some spurious refresh being run, we cannot be sure it is at this exact spot.
Maybe a better approach would be to get the number of external refreshes before the future5, and asserting that after the scheduled refresh, it's incremented by 1.
After we flushed the shard, we only make sure that the refresh call is propagated to the shard engine, but we can't be sure that the call is actually ends up in a shard refresh. The call in
InternalEngine#refresh
can returnfalse
if we couldn't acquire the lock onElasticsearchDirectoryReader
, because it's already being refreshed.We can wrap the call in
assertBusy
to retry it in order to make sure that the shard eventually gets refreshed.Resolves #101008