Fix race condition in ThreadJob.waitForRun #659 #661

HeikoKlare · 2023-09-05T12:12:30Z

ThreadJob.waitForRun() suffers from a race condition. Between checking for a blocking job with a conflicting scheduling rule and starting to wait on the state lock of that blocking job to get notified when the scheduling rule is released, the blocking job may have already changed its state and acquired a different scheduling rule, such that no conflict exists anymore. Since this condition is not reevaluated but waiting on the lock object is started without a timeout, the blocked job unnecessarily waits for the initially blocking job to change its state again, i.e., to finish its execution.

One relevant situation in which the scheduling rule of a job can change is if the job is an internally used ThreadJob. They are started whenever a rule is acquired on the JobManager and are reused for the same thread. When the same thread acquires multiple rules, one after another, the rule of the ThreadJob may effectively change. This, for example, happens during a workspace build operation, which changes the rule for two times. Thus, if the blocking job in ThreadJob.waitForRun() is a
ThreadJob, it may change its rule, in particular when that job performs a build operation.

With this change, the scheduling rules of blocking and blocked thread are checked for conflicts again before starting to wait on the state lock in the blocked job. This leads to a reevaluation of the block condition in case the scheduling rule of the blocking job has changed in between and avoids the job to unnecessarily block until the blocking job finishes.

Fixes #659.

github-actions · 2023-09-05T12:17:16Z

Test Results

      42 files +    14       42 suites +14 52m 0s ⏱️ + 14m 46s
  3 778 tests +      1   3 775 ✔️ +      6   3 💤 - 5 0 ❌ ±0
11 337 runs +3 781 11 310 ✔️ +3 774 27 💤 +7 0 ❌ ±0

Results for commit 2b90917. ± Comparison against base commit 9dbc9e4.

♻️ This comment has been updated with latest results.

HeikoKlare · 2023-09-21T10:47:32Z

Any objections on these changes?
Otherwise I would like to merge soon to reduce random test failures (#657) and avoid potential performance losses in parallel job execution (#659).

iloveeclipse · 2023-09-21T10:56:40Z

Would be nice if @jukzi could check this, he recently was involved in Job related changes.

jukzi

while it might be an improvement it does not seem to totally fix the issue.

...ore.tests.resources/src/org/eclipse/core/tests/internal/builders/ParallelBuildChainTest.java

runtime/bundles/org.eclipse.core.jobs/src/org/eclipse/core/internal/jobs/ThreadJob.java

HeikoKlare · 2023-09-27T13:26:09Z

@jukzi Thanks for all the investigation. The issue you found with a large number of jobs seems to be an (already existing) performance issue (see #661 (comment)).

While simply changing wait() to wait(250), as you propose in #661 (comment), would also address the problem, I still think that re-evaluating the rule conflict is better as it proceeds on the actual state change rather than on a timeout (as the wait() call should actually finish due to a notify because of the job state change, which may already have happended once calling wait()).

Still it might make sense to replace the unconditional wait() call with a call using some timeout to ensure that in case anything yet undetected goes wrong the job will not hang but proceed after some time. However, that may easily hide bugs, so there should probably be some log information about such events. As this is another issue, I would defer it to a different PR.

Do you still have objections to the proposed change?

jukzi · 2023-09-27T13:33:41Z

While simply changing wait() to wait(250), as you propose in #661 (comment), would also address the problem, I still think that re-evaluating the rule conflict is better as it proceeds on the actual state change rather than on a timeout (as the wait() call should actually finish due to a notify because of the job state change, which may already have happended once calling wait()).

I wonder why i don't see the performance benefit from checking again over wait(250) in the junit test. If there is not measurable impact, then i suggest to make the code less complex over adding additional logic.

HeikoKlare · 2023-09-27T15:40:21Z

You won't see a performance impact in the test for two reasons:

When having a small number of jobs, running into the wait() after the the rule of the blocking job has changed (i.e., the error case to be fixed by this PR), is an exceptional case. Otherwise the referenced test would have failed more often.
When starting many jobs, the impact of a small timeout in a single job does not affect the overall execution time, as the other jobs will proceed in between.

I see the point that using a timeout makes the code simpler while achieving the same result in terms of performance. However, I see three arguments speaking against that solution:

The code will only work quite correctly "by accident" (or more correctly: appear to work correctly), as a timeout hides a semantic error (which is waiting for some event that will never occur).
When executing the test with 400 jobs, this error occurs around 200 times, i.e., for half of the jobs an unnecessary wait() is performed.
Adding a timeout for the wait() operation is only necessary to ensure that execution proceeds in case of some other potential and yet undetected bug. That's why I would not suggest to set a low timeout value, but some rather high value like 5 or 10 seconds. In case there is no undetected bug in the code, it only makes sense to proceed execution once a notification comes from the blocking thread, as otherwise the thread would continue execution without a chance of being successful as the relevant state has not changed.

That's why I am not in favor of a "simple" solution here. Still, I improved the code by directly checking the jobs for conflicts instead of their rules iun 02fabca.

jukzi · 2023-09-28T07:45:47Z

You could add a note to the commit somewhere that it is (for example?) yieldRule() which triggered the deadlock.

HeikoKlare · 2023-09-28T09:58:39Z

You could add a note to the commit somewhere that it is (for example?) yieldRule() which triggered the deadlock.

Good idea. I have adapted the code comment to reflect why this can happen and added an explanation to the commit message in af5395a. I incorporated the insights concerning the reasons for a change of the scheduling rule in #661 (comment).

ThreadJob.waitForRun() suffers from a race condition. Between checking for a blocking job with a conflicting scheduling rule and starting to wait on the state lock of that blocking job to get notified when the scheduling rule is released, the blocking job may have already changed its state and acquired a different scheduling rule, such that no conflict exists anymore. Since this condition is not reevaluated but waiting on the lock object is started without a timeout, the blocked job unnecessarily waits for the initially blocking job to change its state again, i.e., to finish its execution. One relevant situation in which the scheduling rule of a job can change is if the job is an internally used ThreadJob. They are started whenever a rule is acquired on the JobManager and are reused for the same thread. When the same thread acquires multiple rules, one after another, the rule of the ThreadJob may effectively change. This, for example, happens during a workspace build operation, which changes the rule for two times. Thus, if the blocking job in ThreadJob.waitForRun() is a ThreadJob, it may change its rule, in particular when that job performs a build operation. With this change, the scheduling rules of blocking and blocked thread are checked for conflicts again before starting to wait on the state lock in the blocked job. This leads to a reevaluation of the block condition in case the scheduling rule of the blocking job has changed in between and avoids the job to unnecessarily block until the blocking job finishes. Fixes eclipse-platform#659.

HeikoKlare force-pushed the issue-659-fix branch 3 times, most recently from b0c619b to b176418 Compare September 12, 2023 11:06

HeikoKlare force-pushed the issue-659-fix branch from b176418 to c641bf0 Compare September 13, 2023 06:36

HeikoKlare marked this pull request as ready for review September 13, 2023 08:04

HeikoKlare force-pushed the issue-659-fix branch from c641bf0 to 4e8c24f Compare September 13, 2023 08:04

HeikoKlare mentioned this pull request Sep 19, 2023

SearchIndex: avoid access to files outside destDir #698

Merged

HeikoKlare force-pushed the issue-659-fix branch from 4e8c24f to b9d62d0 Compare September 21, 2023 08:10

jukzi self-requested a review September 22, 2023 07:30

jukzi requested changes Sep 27, 2023

View reviewed changes

...ore.tests.resources/src/org/eclipse/core/tests/internal/builders/ParallelBuildChainTest.java Show resolved Hide resolved

runtime/bundles/org.eclipse.core.jobs/src/org/eclipse/core/internal/jobs/ThreadJob.java Show resolved Hide resolved

HeikoKlare force-pushed the issue-659-fix branch from b9d62d0 to 02fabca Compare September 27, 2023 15:39

jukzi approved these changes Sep 28, 2023

View reviewed changes

HeikoKlare force-pushed the issue-659-fix branch 2 times, most recently from 5dc0af4 to af5395a Compare September 28, 2023 09:56

HeikoKlare mentioned this pull request Sep 28, 2023

Remove recycling of ThreadJob in ImplicitJobs #724

Closed

HeikoKlare force-pushed the issue-659-fix branch from af5395a to 2b90917 Compare September 29, 2023 10:04

HeikoKlare merged commit 113cae1 into eclipse-platform:master Sep 29, 2023
13 of 14 checks passed

HeikoKlare deleted the issue-659-fix branch September 29, 2023 12:36

HeikoKlare mentioned this pull request Sep 29, 2023

Random failing ParallelBuildChainTest.testIndividualProjectBuilds_ProjectRelaxedRule #657

Closed

HeikoKlare mentioned this pull request Oct 12, 2023

Race condition in ThreadJob unnecessarily blocks parallel build execution vi-eclipse/Eclipse-Platform#21

Closed

HeikoKlare mentioned this pull request Nov 6, 2023

Random failing ParallelBuildChainTest.testIndividualProjectBuilds_WithManyProjects_ProjectRelaxedRule #825

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race condition in ThreadJob.waitForRun #659 #661

Fix race condition in ThreadJob.waitForRun #659 #661

HeikoKlare commented Sep 5, 2023 •

edited

Loading

github-actions bot commented Sep 5, 2023 •

edited

Loading

HeikoKlare commented Sep 21, 2023

iloveeclipse commented Sep 21, 2023

jukzi left a comment

HeikoKlare commented Sep 27, 2023

jukzi commented Sep 27, 2023

HeikoKlare commented Sep 27, 2023

jukzi commented Sep 28, 2023

HeikoKlare commented Sep 28, 2023

Fix race condition in ThreadJob.waitForRun #659 #661

Fix race condition in ThreadJob.waitForRun #659 #661

Conversation

HeikoKlare commented Sep 5, 2023 • edited Loading

github-actions bot commented Sep 5, 2023 • edited Loading

Test Results

HeikoKlare commented Sep 21, 2023

iloveeclipse commented Sep 21, 2023

jukzi left a comment

Choose a reason for hiding this comment

HeikoKlare commented Sep 27, 2023

jukzi commented Sep 27, 2023

HeikoKlare commented Sep 27, 2023

jukzi commented Sep 28, 2023

HeikoKlare commented Sep 28, 2023

HeikoKlare commented Sep 5, 2023 •

edited

Loading

github-actions bot commented Sep 5, 2023 •

edited

Loading