Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in ThreadJob.waitForRun #659 #661

Merged
merged 1 commit into from
Sep 29, 2023

Conversation

HeikoKlare
Copy link
Contributor

@HeikoKlare HeikoKlare commented Sep 5, 2023

ThreadJob.waitForRun() suffers from a race condition. Between checking for a blocking job with a conflicting scheduling rule and starting to wait on the state lock of that blocking job to get notified when the scheduling rule is released, the blocking job may have already changed its state and acquired a different scheduling rule, such that no conflict exists anymore. Since this condition is not reevaluated but waiting on the lock object is started without a timeout, the blocked job unnecessarily waits for the initially blocking job to change its state again, i.e., to finish its execution.

One relevant situation in which the scheduling rule of a job can change is if the job is an internally used ThreadJob. They are started whenever a rule is acquired on the JobManager and are reused for the same thread. When the same thread acquires multiple rules, one after another, the rule of the ThreadJob may effectively change. This, for example, happens during a workspace build operation, which changes the rule for two times. Thus, if the blocking job in ThreadJob.waitForRun() is a
ThreadJob, it may change its rule, in particular when that job performs a build operation.

With this change, the scheduling rules of blocking and blocked thread are checked for conflicts again before starting to wait on the state lock in the blocked job. This leads to a reevaluation of the block condition in case the scheduling rule of the blocking job has changed in between and avoids the job to unnecessarily block until the blocking job finishes.

Fixes #659.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 5, 2023

Test Results

       42 files  +     14         42 suites  +14   52m 0s ⏱️ + 14m 46s
  3 778 tests +       1    3 775 ✔️ +       6    3 💤  - 5  0 ±0 
11 337 runs  +3 781  11 310 ✔️ +3 774  27 💤 +7  0 ±0 

Results for commit 2b90917. ± Comparison against base commit 9dbc9e4.

♻️ This comment has been updated with latest results.

@HeikoKlare
Copy link
Contributor Author

Any objections on these changes?
Otherwise I would like to merge soon to reduce random test failures (#657) and avoid potential performance losses in parallel job execution (#659).

@iloveeclipse
Copy link
Member

Would be nice if @jukzi could check this, he recently was involved in Job related changes.

@jukzi jukzi self-requested a review September 22, 2023 07:30
Copy link
Contributor

@jukzi jukzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while it might be an improvement it does not seem to totally fix the issue.

@HeikoKlare
Copy link
Contributor Author

@jukzi Thanks for all the investigation. The issue you found with a large number of jobs seems to be an (already existing) performance issue (see #661 (comment)).

While simply changing wait() to wait(250), as you propose in #661 (comment), would also address the problem, I still think that re-evaluating the rule conflict is better as it proceeds on the actual state change rather than on a timeout (as the wait() call should actually finish due to a notify because of the job state change, which may already have happended once calling wait()).

Still it might make sense to replace the unconditional wait() call with a call using some timeout to ensure that in case anything yet undetected goes wrong the job will not hang but proceed after some time. However, that may easily hide bugs, so there should probably be some log information about such events. As this is another issue, I would defer it to a different PR.

Do you still have objections to the proposed change?

@jukzi
Copy link
Contributor

jukzi commented Sep 27, 2023

While simply changing wait() to wait(250), as you propose in #661 (comment), would also address the problem, I still think that re-evaluating the rule conflict is better as it proceeds on the actual state change rather than on a timeout (as the wait() call should actually finish due to a notify because of the job state change, which may already have happended once calling wait()).

I wonder why i don't see the performance benefit from checking again over wait(250) in the junit test. If there is not measurable impact, then i suggest to make the code less complex over adding additional logic.

@HeikoKlare
Copy link
Contributor Author

You won't see a performance impact in the test for two reasons:

  1. When having a small number of jobs, running into the wait() after the the rule of the blocking job has changed (i.e., the error case to be fixed by this PR), is an exceptional case. Otherwise the referenced test would have failed more often.
  2. When starting many jobs, the impact of a small timeout in a single job does not affect the overall execution time, as the other jobs will proceed in between.

I see the point that using a timeout makes the code simpler while achieving the same result in terms of performance. However, I see three arguments speaking against that solution:

  1. The code will only work quite correctly "by accident" (or more correctly: appear to work correctly), as a timeout hides a semantic error (which is waiting for some event that will never occur).
  2. When executing the test with 400 jobs, this error occurs around 200 times, i.e., for half of the jobs an unnecessary wait() is performed.
  3. Adding a timeout for the wait() operation is only necessary to ensure that execution proceeds in case of some other potential and yet undetected bug. That's why I would not suggest to set a low timeout value, but some rather high value like 5 or 10 seconds. In case there is no undetected bug in the code, it only makes sense to proceed execution once a notification comes from the blocking thread, as otherwise the thread would continue execution without a chance of being successful as the relevant state has not changed.

That's why I am not in favor of a "simple" solution here. Still, I improved the code by directly checking the jobs for conflicts instead of their rules iun 02fabca.

@jukzi
Copy link
Contributor

jukzi commented Sep 28, 2023

You could add a note to the commit somewhere that it is (for example?) yieldRule() which triggered the deadlock.

@HeikoKlare HeikoKlare force-pushed the issue-659-fix branch 2 times, most recently from 5dc0af4 to af5395a Compare September 28, 2023 09:56
@HeikoKlare
Copy link
Contributor Author

You could add a note to the commit somewhere that it is (for example?) yieldRule() which triggered the deadlock.

Good idea. I have adapted the code comment to reflect why this can happen and added an explanation to the commit message in af5395a. I incorporated the insights concerning the reasons for a change of the scheduling rule in #661 (comment).

ThreadJob.waitForRun() suffers from a race condition. Between checking
for a blocking job with a conflicting scheduling rule and starting to
wait on the state lock of that blocking job to get notified when the
scheduling rule is released, the blocking job may have already changed
its state and acquired a different scheduling rule, such that no
conflict exists anymore. Since this condition is not reevaluated but
waiting on the lock object is started without a timeout, the blocked job
unnecessarily waits for the initially blocking job to change its state
again, i.e., to finish its execution.

One relevant situation in which the scheduling rule of a job can change
is if the job is an internally used ThreadJob. They are started whenever
a rule is acquired on the JobManager and are reused for the same thread.
When the same thread acquires multiple rules, one after another, the
rule of the ThreadJob may effectively change. This, for example, happens
during a workspace build operation, which changes the rule for two
times. Thus, if the blocking job in ThreadJob.waitForRun() is a
ThreadJob, it may change its rule, in particular when that job performs
a build operation.

With this change, the scheduling rules of blocking and blocked thread
are checked for conflicts again before starting to wait on the state
lock in the blocked job. This leads to a reevaluation of the block
condition in case the scheduling rule of the blocking job has changed in
between and avoids the job to unnecessarily block until the blocking job
finishes.

Fixes eclipse-platform#659.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition in ThreadJob unnecessarily blocks parallel build execution
3 participants