How to get into a joinable pthread state? #17984
Conversation
Looks reasonable, but I'm not sure it will actually fix the problems, because the way things are currently set up the shutdown should already work (provided dt_control_shutdown isn't called multiple times): pthread_join blocks until the given thread exits (if it hasn't yet), and each worker thread exits as soon as it finishes its current job or sleep and finds that an exit was requested before we called pthread_join. As for the counter not working properly, might you have missed some path on which a job could be terminated? (Or started, for that matter, if you're seeing negative counts.)
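For illustration, a minimal sketch of the join pattern described above, assuming a simplified worker pool; all names here (`queue_mutex`, `exit_requested`, ...) are hypothetical and darktable's real control code is more involved:

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative names only -- not darktable's actual identifiers. */
static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond  = PTHREAD_COND_INITIALIZER;
static bool exit_requested = false;

static void *_worker(void *arg)
{
  (void)arg;
  pthread_mutex_lock(&queue_mutex);
  while(!exit_requested)
  {
    /* ... if a job is queued: unlock, run it, re-lock ... */
    pthread_cond_wait(&queue_cond, &queue_mutex); // sleep until woken
  }
  pthread_mutex_unlock(&queue_mutex);
  return NULL; // the thread is joinable once it returns
}

static void shutdown_workers(pthread_t *workers, int n)
{
  pthread_mutex_lock(&queue_mutex);
  exit_requested = true;               // ask every worker to stop
  pthread_cond_broadcast(&queue_cond); // wake the sleeping ones
  pthread_mutex_unlock(&queue_mutex);

  for(int i = 0; i < n; i++)
    pthread_join(workers[i], NULL);    // blocks until each worker returns
}
```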
The point is it "should", but the worker threads are definitely not in a correct state to be joined if jobs are pending. Try the reproducers. If I wait for pending jobs to be done, everything is safe and good.
Edit: we were just not aware of this because the thread-joining part was never actually executed, hence the late errors @zisoft demonstrated.
With this PR (compile error fixed) I can 100% reproduce an assertion failure:
Stacktrace coming from Lines 48 to 70 in 0acaaab, called from ...
@zisoft this is exactly the issue I was referring to.
In ...
Force-pushed from a9d0f10 to 0bde20e
This has been very hard work over the last days. Some things (debug logs ...) are not strictly necessary; @TurboGit, let me know if you want a stripped-down version. As it is possible that running jobs continue in the background after the user requests to close darktable, we could possibly avoid quitting, for UI usability, by using the pending-jobs counter ... @dterrahe thanks for your hint!
Force-pushed from 71c3b5d to 2dc1dfb
@jenshannoschwalm : Thanks and congrats! Of course I'm a bit worried so close to the release; I'll cut the sources in about 48h now. I don't want to put too much pressure on you after you have worked so hard, so my question: what would a stripped-down version be? How much work would it be for you?
Stripping down? No problem for me ...
Force-pushed from 2dc1dfb to fdaf125
Just re-force-pushed with the extra checks and debug logs removed, as they are not required any more while being very verbose ...
Well, the job counter commit is pretty simple and looks safe. I'll test this PR.
I get a lock-up when I quit early and can reproduce it 100% of the time.
Fixed with the last commit ...
Still reproducible ...
Force-pushed from a72b9ef to 872ac2c
Hmm, I couldn't reproduce it even once. It could be a race condition, so I use atomics now. I also added a simple commit for dtpthreads adding textual output for pthread creation and joining. Tried on Wayland and X11, with and without gphoto, cups and OpenCL ... no issue.
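In case it helps to picture the change: a hedged sketch of guarding such a shared flag with C11 atomics, so every thread observes the store without a data race. darktable wraps this in its own dt_atomic helpers; the names below are illustrative only:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag; darktable has its own dt_atomic wrappers. */
static atomic_bool cleanup_requested = false;

void request_cleanup(void)
{
  atomic_store(&cleanup_requested, true); // visible to all threads, race-free
}

bool keep_running(void)
{
  return !atomic_load(&cleanup_requested);
}
```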
Some tests on macOS:
darktable/src/develop/pixelpipe_hb.c Lines 1348 to 1378 in 50f0e50
This is new and introduced by moving the cache_cleanups after ... Note: I'm always testing with ...
Not sure if this will help, but:
Maybe due to Lua:
Confirmed, it is due to Lua. With Lua disabled I cannot reproduce the issue.
I have this code that helps exiting while Lua is still not fully initialized; at least the process doesn't lock, but I get a crash about one out of 20 runs:
The Lua init/finalize code is complex; we may need help from @wpferguson to find a proper way to exit Lua while it is still not 100% initialized.
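One possible shape for that, purely as a sketch: track whether the async initialization has completed and only then allow finalization. lua_close() is the real Lua C API call; the flag and function names are hypothetical:

```c
#include <lua.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical: set to true only once the async luarc/init script finished. */
static atomic_bool lua_fully_initialized = false;
static lua_State *darktable_lua_state = NULL;

void finalize_lua(void) // hypothetical shutdown hook
{
  /* Closing a state whose init script may still be running is unsafe,
     so only finalize once initialization is known to be complete. */
  if(darktable_lua_state && atomic_load(&lua_fully_initialized))
  {
    lua_close(darktable_lua_state); // real Lua C API call
    darktable_lua_state = NULL;
  }
}
```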
I got a crash while images were exporting after Ctrl-Q.
Force-pushed from e1ced0b to 7c5b153
@TurboGit it's sort of a mess; I/we were all not aware of the implications of all the things we now do in the background while wanting a safe darktable exit ... I just force-pushed:
@zisoft : Can you check again on your side? I have tested a bit; no issue on my side except the Lua one I have reported. Yet the Lua issue is reproducible on master too, so it is not a regression of this PR.
Force-pushed from 7c5b153 to 1a287f4
Just added a fix to not wait on image import if control is not running any more.
@TurboGit One of the first PRs I will do for 5.2 is "Only accept Ctrl-Q if no jobs are pending" :-)
It crashes in the first loop
If I set a breakpoint and single-step through the loop, most of the steps give no log output. Then it comes to a step with lots of log output, and that is the step which crashes. The last lines of the log (started with ...):
Crash in darktable/src/develop/pixelpipe_hb.c Lines 1348 to 1378 in 50f0e50
Force-pushed from 1a287f4 to e7f554a
@zisoft As there is just one of the threads involved, it's random at what position in the loop it "happens", but as there is no imageio system running any more, the flip might fail here. So this has been added to the jobs where Ctrl-Q is understood as "skip the rest". BTW, are you running ARM or Intel? I am absolutely not sure if there are some peculiar things in _DEBUG mode, even less so on Macs. Could you re-test?
First try, same crash :( I'm on an Intel Mac.
Ah, OK. That's my experience here: the crash position in the loop is indeed random. EDIT: ah, sorry, I missed your last commit ... now testing again.
A Lua hang on very early exit is probably the same underlying issue that I encountered while writing the splash screen code, and which kept me from displaying any further progress updates after starting up Lua. I have since been able to trace that hang to the async processing of the Lua startup script. No luarc, no hang ... Once the script finishes, it is again OK to call dt_gui_process_events().
OK, some things have changed. I'm sending screenshots here with the log at the bottom and the stacktrace on the left. The first type is again an assertion failure: ... And the second type is a ... I also removed my luarc, but that does not make a difference, as these are not early quit attempts.
At this stage, let's move this to 5.2; it is too risky for 5.0 despite all the hard work done here.
Making this a draft for now, as there will be PRs doing maintenance and safety checks to be merged first, and this will have to be rebased afterwards.
As we now have a lot of things in background jobs related to images and mipmaps, we must keep those systems available until the control system has been fully closed. NOTE: for some reason we can't have dt_imageio_cleanup() after dt_control_shutdown()
1. For code readability use 'control' instead of 's'
2. Simplify dt_control_running()
3. Remove checks on errors that are not needed
4. Two functions are strictly internal, so don't expose them via .h files and make them static:
   - _control_get_threadid()
   - _control_job_set_synchronous()
5. Two functions were not used at all (so were untested), so they were removed:
   - dt_control_job_set_state_callback()
   - dt_control_job_wait()
6. gphoto joining after cleanup test
7. Never create a control job if control is not in the running state
While running the job dispatchers we keep track of jobs being inserted and executed, so we at least know the number of unprocessed jobs via dt_control_jobs_pending(). Also show pending jobs in the log when closing dt.
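A minimal sketch of such a counter with C11 atomics (darktable uses its own dt_atomic wrappers; the function names below are hypothetical stand-ins, not the real API):

```c
#include <stdatomic.h>

/* Illustrative counter for unprocessed jobs. */
static atomic_int pending_jobs = 0;

static void _job_enqueued(void)  { atomic_fetch_add(&pending_jobs, 1); }
static void _job_finalized(void) { atomic_fetch_sub(&pending_jobs, 1); }

/* Hypothetical stand-in for dt_control_jobs_pending(). */
static int _jobs_pending(void) { return atomic_load(&pending_jobs); }
```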
Strictly prohibit reentrant use of dt_control_quit() and avoid a race condition via an atomic instruction.
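The usual pattern for such a guard is an atomic exchange, so only the first caller proceeds; a sketch under that assumption (the names are stand-ins, not darktable's actual code):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool quit_in_progress = false;

void control_quit_sketch(void) // hypothetical stand-in for dt_control_quit()
{
  /* atomic_exchange returns the previous value, so only the first
     caller sees 'false' and runs the shutdown path */
  if(atomic_exchange(&quit_in_progress, true))
    return; // reentrant or concurrent call: ignore

  /* ... switch to cleanup state, stop dispatching new jobs ... */
}
```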
As we have control jobs for "file actions", we have to make sure that we have dt_imageio available inside the job for export, import, HDR creation, flip and EXIF refresh. While working through the list of images we test for dt_control_running() and only process the job-specific action if it returns true. Also, we have some **very** time-consuming jobs that can be canceled in the UI; it seems much better to understand a Ctrl-Q or a click on the close-darktable button as a request to cancel, so there will be no lengthy background actions after the window has closed. And if we aborted a file import via Ctrl-Q, we don't want to wait ...
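The per-image check described above, sketched with a hypothetical job loop; dt_control_running() is the real predicate named in the commit, but its signature and everything else here are assumptions:

```c
#include <glib.h>

extern int dt_control_running(void); // assumed signature of the real predicate

static void _process_one_image(gpointer image) { (void)image; /* ... */ }

/* Hypothetical job body iterating a list of images. */
static void _run_image_job(GList *images)
{
  for(GList *img = images; img; img = g_list_next(img))
  {
    if(!dt_control_running())
      break; // Ctrl-Q arrived: treat it as "skip the rest"
    _process_one_image(img->data);
  }
}
```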
Force-pushed from 428576f to ecaa37b
EDIT after latest commits, see commit message about it ...
Currently the job dispatcher might have pending jobs. These are not processed any more if we test for dt_control_running() in job.c. We continue dispatching even in DT_CONTROL_STATE_CLEANUP to get all things done. But how can we know if all jobs are finalized? This might be a way: we increase `pending_jobs` for jobs-to-be-done and decrease it for finalized jobs. In control_shutdown we have to wait until everything is done, as sketched below.
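The waiting side of that idea, as a hedged sketch: this variant guards the counter with a mutex so the shutdown path can sleep on a condition variable instead of spinning; all names are illustrative, not darktable's actual code:

```c
#include <pthread.h>

static pthread_mutex_t jobs_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  jobs_cond  = PTHREAD_COND_INITIALIZER;
static int pending_jobs = 0; // guarded by jobs_mutex in this variant

void job_enqueued(void)
{
  pthread_mutex_lock(&jobs_mutex);
  pending_jobs++;
  pthread_mutex_unlock(&jobs_mutex);
}

void job_finalized(void)
{
  pthread_mutex_lock(&jobs_mutex);
  if(--pending_jobs == 0)
    pthread_cond_signal(&jobs_cond); // last job done: wake the waiter
  pthread_mutex_unlock(&jobs_mutex);
}

void control_shutdown_wait(void) // hypothetical piece of control_shutdown
{
  pthread_mutex_lock(&jobs_mutex);
  while(pending_jobs > 0)
    pthread_cond_wait(&jobs_cond, &jobs_mutex); // wait for the last job
  pthread_mutex_unlock(&jobs_mutex);
  /* only now are the worker threads in a state where joining is safe */
}
```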