Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to get into a joinable pthread state? #17984

Closed

Conversation

jenshannoschwalm
Copy link
Collaborator

@jenshannoschwalm jenshannoschwalm commented Dec 10, 2024

EDIT after latest commits, see commit message about it ...

  1. Currently the job dispatcher might have pending jobs. These are not processed any more if we test for dt_control_running() in job.c

  2. We continue dispatching even in DT_CONTROL_STATE_CLEANUP to get all things done.

  3. But how can we know if all jobs are finalized? This might be a way, we increase pending_jobs jobs-to-be-done and decrease for finalized jobs.

  4. In control_shutdown we have to wait until everything is done.

@jenshannoschwalm jenshannoschwalm marked this pull request as draft December 10, 2024 19:33
@jenshannoschwalm jenshannoschwalm added bugfix pull request fixing a bug priority: high core features are broken and not usable at all, software crashes labels Dec 10, 2024
@jenshannoschwalm
Copy link
Collaborator Author

@TurboGit @ralfbrown @dterrahe @zisoft i really need help here.

Is this the way to go?
Any other ideas?
The up/down-counter doesn't work correctly, can you have a look?

Anyway, see #17961 #17978 for existing problems.

@ralfbrown
Copy link
Collaborator

ralfbrown commented Dec 10, 2024

Looks reasonable, but I'm not sure it will actually fix the problems, because the way things are currently setup the shutdown should be working (provided dt_control_shutdown isn't called multiple times) - pthread_join blocks until the given thread exits (if it hasn't yet), and each worker thread exits as it finishes its current job or sleep and finds that an exit was requested before we called pthread_join.

As for the counter not working properly, might you have missed some path on which a job could be terminated? (Or started, for that matter, if you're seeing negative counts.)

@jenshannoschwalm
Copy link
Collaborator Author

The point "it should" but the worker threads are definitely not in a correct state to be joined if jobs are pending. Try the reproducers.

If I wait for pending jobs are done, everything is safe and good.

@jenshannoschwalm
Copy link
Collaborator Author

jenshannoschwalm commented Dec 10, 2024

Edit: we just were not aware of this as the thread joining part was not processed. Thus late errors as @zisoft demonstrated.

@zisoft
Copy link
Collaborator

zisoft commented Dec 11, 2024

With this PR (compile error fixed) I can 100% reproduce an assertion failure:

  • select 10 images in lightroom
  • click the rotate 90 degrees button
  • wait a few seconds to see the first thumbnails rotate
  • quit

Assertion failed: (!res), function dt_pthread_rwlock_destroy, file dtpthread.h, line 255.

Stacktrace coming from dt_cache_cleanup(), line 65:

void dt_cache_cleanup(dt_cache_t *cache)
{
g_hash_table_destroy(cache->hashtable);
for(GList *l = cache->lru; l; l = g_list_next(l))
{
dt_cache_entry_t *entry = l->data;
if(cache->cleanup)
{
assert(entry->data_size);
ASAN_UNPOISON_MEMORY_REGION(entry->data, entry->data_size);
cache->cleanup(cache->cleanup_data, entry);
}
else
dt_free_align(entry->data);
dt_pthread_rwlock_destroy(&entry->lock);
g_slice_free1(sizeof(*entry), entry);
}
g_list_free(cache->lru);
dt_pthread_mutex_destroy(&cache->lock);
}

called from dt_mipmap_cache_cleanup() in dt_cleanup()

@jenshannoschwalm
Copy link
Collaborator Author

@zisoft this is exactly the issue I was referring to.

@dterrahe
Copy link
Member

In dt_cleanup dt_image_cache_cleanup is called before dt_control_shutdown, so there might still be threads wanting to use the caches? If properly mutexed, that might/should not be a problem, but at least when I move the dt_image_cache_cleanup and dt_mipmap_cache_cleanup calls to after the if(init_gui) surrounding dt_control_shutdown, the above crash is not (as easily) reproducable here. Hopefully you got this far, or further, already.

@jenshannoschwalm jenshannoschwalm marked this pull request as ready for review December 14, 2024 07:08
@jenshannoschwalm jenshannoschwalm added this to the 5.0 milestone Dec 14, 2024
@jenshannoschwalm
Copy link
Collaborator Author

jenshannoschwalm commented Dec 14, 2024

This has been very hard work the last days. Some things (debug logs ...) are not necessary, @TurboGit let me know if you want a stripped down version.

As it is possible that working jobs are continuing in the background after user-request to close darktable we could possibly avoid to quit for UI usability as we could use the pending jobs counter ...

@dterrahe thanks for your hint!

@jenshannoschwalm jenshannoschwalm force-pushed the my_checks branch 2 times, most recently from 71c3b5d to 2dc1dfb Compare December 14, 2024 07:27
@TurboGit
Copy link
Member

@jenshannoschwalm : Thanks and congrats! Of course I'm a bit worried so close to the release. I'll cut the sources in about 48h now. I don't want to put too much pressure on you after working so hard, so my question what would be a stripped down version? How much work for you?

@jenshannoschwalm
Copy link
Collaborator Author

Stripping down? No problem for me ...

  1. You could just ignore the pending_jobs counter commit for now (Use that in 5.2 for UI feedback ...)
  2. I would remove added dt_print() stuff for less code to review although not really relevant
  3. I would prefer to keep the 's' to 'control' renaming in the first commit for readability and making stuff static also seems good&safe

@jenshannoschwalm
Copy link
Collaborator Author

Just re-force-pushed with extra checks and debug logs removed as not required any more while being very verbose ...

@TurboGit
Copy link
Member

Well the job counter commit is pretty simple and looks safe. I'll test this PR.

@TurboGit
Copy link
Member

I have a lock when I quit early and can reproduce 100% of the time.

  • Start dt from the console
  • As soon as I see the splash screen I hit Ctrl+Q multiple time until the UI is displayed
  • Darktable UI quit but the process does not give back the prompt on the console, you need a Ctrl+C to hard kill it.

@jenshannoschwalm
Copy link
Collaborator Author

I have a lock when I quit early and can reproduce 100% of the time.

Fixed with last commit ...

@TurboGit
Copy link
Member

Still reproduce...

@jenshannoschwalm
Copy link
Collaborator Author

jenshannoschwalm commented Dec 14, 2024

Hmm, couldn't reproduce even once. Could be a race condition so use atomics now.

Also added a simple commit for dtpthreads adding textual output for pthread creation and joining.

Tried on wayland and X11, in/excluded gphoto, cups and opencl ... no issue.

@zisoft
Copy link
Collaborator

zisoft commented Dec 14, 2024

Some tests on macOS:

  • Splash screen disabled:
    • quit after GUI came up fully: no issues
    • early quit as soon as the main window comes up: no issues
  • Splash screen enabled
    • quit immediately after splash screen comes up: some strange effects with the main window but dt quits without issues
  • GUI fully started, quit, no issues
  • GUI fully started, quit while running the rotate 10 images test: This gives a 100% reproducible hard crash with EXC_BAD_ACCESS while accessing gui_module->flags in line 1374 of develop/pixelpipe_hb.c:

static gboolean _dev_pixelpipe_process_rec(dt_dev_pixelpipe_t *pipe,
dt_develop_t *dev,
void **output,
void **cl_mem_output,
dt_iop_buffer_dsc_t **out_format,
const dt_iop_roi_t *roi_out,
GList *modules,
GList *pieces,
const int pos)
{
if(dt_atomic_get_int(&pipe->shutdown))
return TRUE;
dt_iop_roi_t roi_in = *roi_out;
char module_name[256] = { 0 };
void *input = NULL;
void *cl_mem_input = NULL;
*cl_mem_output = NULL;
dt_iop_module_t *module = NULL;
dt_dev_pixelpipe_iop_t *piece = NULL;
const dt_dev_pixelpipe_type_t old_pipetype = pipe->type;
const dt_iop_module_t *gui_module = dt_dev_gui_module();
// if a module is active, check if this module allow a fast pipe run
if(gui_module
&& gui_module->flags() & IOP_FLAGS_ALLOW_FAST_PIPE
&& pipe->type & DT_DEV_PIXELPIPE_BASIC
&& dt_dev_modulegroups_test_activated(darktable.develop))
pipe->type |= DT_DEV_PIXELPIPE_FAST;
else

This is new and introduced by moving the cache_cleanups after if(init_gui)

Note: I'm always testing with --build-type Debug

@TurboGit
Copy link
Member

No sure if this will help, but:

(gdb) info threads 
  Id   Target Id                                           Frame 
* 1    Thread 0x7fffeea70f00 (LWP 101058) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, 
    futex_word=0x555555558cf0 <darktable+3184>, expected=0, op=393, 
    abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  2    Thread 0x7fffeea5c6c0 (LWP 101061) "pool-spawner"   syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7fffee25b6c0 (LWP 101062) "gmain"          0x00007ffff770d3f6 in __GI_ppoll (fds=0x5555555aadf0, nfds=2, timeout=<optimized out>, sigmask=0x0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:42
  5    Thread 0x7fffed2596c0 (LWP 101064) "gdbus"          0x00007ffff770d3f6 in __GI_ppoll (fds=0x7fff08000c70, nfds=4, timeout=<optimized out>, sigmask=0x0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:42
  6    Thread 0x7fffe7fff6c0 (LWP 101066) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7fffe7dfe6c0 (LWP 101067) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  8    Thread 0x7fffe7bfd6c0 (LWP 101068) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  9    Thread 0x7fffe79fc6c0 (LWP 101069) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  10   Thread 0x7fffe77fb6c0 (LWP 101070) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  11   Thread 0x7fffe75fa6c0 (LWP 101071) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  12   Thread 0x7fffe73f96c0 (LWP 101072) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  13   Thread 0x7fffe71f86c0 (LWP 101073) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  14   Thread 0x7fffe6ff76c0 (LWP 101074) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  15   Thread 0x7fffe6df66c0 (LWP 101075) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  16   Thread 0x7fffe6bf56c0 (LWP 101076) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  17   Thread 0x7fffe69f46c0 (LWP 101077) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  18   Thread 0x7fffe67f36c0 (LWP 101078) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  19   Thread 0x7fffe65f26c0 (LWP 101079) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  20   Thread 0x7fffe63f16c0 (LWP 101080) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  21   Thread 0x7fffe61f06c0 (LWP 101081) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  22   Thread 0x7fffe5fef6c0 (LWP 101082) "darktable"      syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  35   Thread 0x7fffccff96c0 (LWP 101098) "cuda-EvtHandlr" 0x00007ffff770ce9f in __GI___poll (fds=0x7fff28000c20, nfds=11, timeout=100)
    at ../sysdeps/unix/sysv/linux/poll.c:29
  36   Thread 0x7fffc17ff6c0 (LWP 101099) "worker 1"       0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff64006130, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  37   Thread 0x7fffc0ffe6c0 (LWP 101100) "worker 1"       0x00007ffff769891e in __futex_abstimed_wait_common64 (private=<optimized out>, 
    futex_word=0x7fff64711c58, expected=0, op=393, abstime=0x0, cancel=true)
    at ./nptl/futex-internal.c:57
  38   Thread 0x7fffb61ff6c0 (LWP 101101) "worker 1"       0x00007ffff769891e in __futex_abstimed_wait_common64 (private=<optimized out>, 
    futex_word=0x7fff64711c58, expected=0, op=393, abstime=0x0, cancel=true)
    at ./nptl/futex-internal.c:57
  39   Thread 0x7fffb59fe6c0 (LWP 101102) "worker 1"       0x00007ffff769891e in __futex_abstimed_wait_common64 (private=<optimized out>, 
    futex_word=0x7fff64711c58, expected=0, op=393, abstime=0x0, cancel=true)
    at ./nptl/futex-internal.c:57
  40   Thread 0x7fffb51fd6c0 (LWP 101103) "worker 1"       0x00007ffff769891e in __futex_abstimed_wait_common64 (private=<optimized out>, 
    futex_word=0x7fff64711c58, expected=0, op=393, abstime=0x0, cancel=true)
    at ./nptl/futex-internal.c:57
  41   Thread 0x7fffb49fc6c0 (LWP 101104) "worker 1"       0x00007ffff769891e in __futex_abstimed_wait_common64 (private=<optimized out>, 
    futex_word=0x7fff64712968, expected=0, op=393, abstime=0x0, cancel=true)
    at ./nptl/futex-internal.c:57
  44   Thread 0x7fffa9fff6c0 (LWP 101118) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  45   Thread 0x7fffeda5a6c0 (LWP 101119) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  46   Thread 0x7fffa8ffd6c0 (LWP 101120) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  47   Thread 0x7fff9dfff6c0 (LWP 101121) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  48   Thread 0x7fff9d7fe6c0 (LWP 101122) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  49   Thread 0x7fff9cffd6c0 (LWP 101123) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  50   Thread 0x7fff93fff6c0 (LWP 101124) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  51   Thread 0x7fff937fe6c0 (LWP 101125) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  52   Thread 0x7fff92ffd6c0 (LWP 101126) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  53   Thread 0x7fff927fc6c0 (LWP 101127) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  54   Thread 0x7fff91ffb6c0 (LWP 101128) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  55   Thread 0x7fff917fa6c0 (LWP 101129) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  56   Thread 0x7fff90ff96c0 (LWP 101130) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  57   Thread 0x7fff87fff6c0 (LWP 101131) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  58   Thread 0x7fff877fe6c0 (LWP 101132) "darktable"      0x00007ffff7e7ae8e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
  59   Thread 0x7fff613ff6c0 (LWP 101133) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9edde0, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  60   Thread 0x7fff60bfe6c0 (LWP 101134) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ede60, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  61   Thread 0x7fff57fff6c0 (LWP 101135) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9edee0, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  62   Thread 0x7fff577fe6c0 (LWP 101136) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9edf60, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  63   Thread 0x7fff56ffd6c0 (LWP 101137) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9edfe0, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  64   Thread 0x7fff567fc6c0 (LWP 101138) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ee060, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  65   Thread 0x7fff55ffb6c0 (LWP 101139) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ee0e0, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  66   Thread 0x7fff557fa6c0 (LWP 101140) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ee160, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  67   Thread 0x7fff54ff96c0 (LWP 101141) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ee1e0, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  68   Thread 0x7fff4bfff6c0 (LWP 101142) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ee260, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  69   Thread 0x7fff4b7fe6c0 (LWP 101143) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ee2e0, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  70   Thread 0x7fff4affd6c0 (LWP 101144) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ee360, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  71   Thread 0x7fff4a7fc6c0 (LWP 101145) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ee3e0, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  72   Thread 0x7fff49ffb6c0 (LWP 101146) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ee460, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  73   Thread 0x7fff497fa6c0 (LWP 101147) "darktable"      0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, futex_word=0x7fff6e9ee4e0, 
    expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  78   Thread 0x7fff48ff96c0 (LWP 101161) "lua thread"     0x00007ffff769891e in __futex_abstimed_wait_common64 (private=0, 
    futex_word=0x555555558cf4 <darktable+3188>, expected=0, op=393, 
    abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
  79   Thread 0x7fff2dfff6c0 (LWP 101162) "pool-darktable" syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  80   Thread 0x7fff2d7fe6c0 (LWP 101163) "pool-darktable" syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

Maybe due to Lua:

(gdb) bt
#0  0x00007ffff769891e in __futex_abstimed_wait_common64
    (private=0, futex_word=0x555555558cf0 <darktable+3184>, expected=0, op=393, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common
    (futex_word=futex_word@entry=0x555555558cf0 <darktable+3184>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
#2  0x00007ffff769899b in __GI___futex_abstimed_wait_cancelable64
    (futex_word=futex_word@entry=0x555555558cf0 <darktable+3184>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff769b1c0 in __pthread_cond_wait_common
    (cond=0x555555558cc8 <darktable+3144>, mutex=0x555555558ca0 <darktable+3104>, clockid=0, abstime=0x0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait
    (cond=cond@entry=0x555555558cc8 <darktable+3144>, mutex=mutex@entry=0x555555558ca0 <darktable+3104>) at ./nptl/pthread_cond_wait.c:618
#5  0x00007ffff7ae819f in dt_pthread_cond_wait
    (cond=<optimized out>, mutex=<optimized out>)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/common/dtpthread.h:396
#6  dt_lua_lock_internal
    (function=function@entry=0x7ffff7cc6c30 <__FUNCTION__.0> "dt_lua_finalize", file=file@entry=0x7ffff7c95f50 "/home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/lua/init.c", line=line@entry=265, silent=silent@entry=0)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/lua/lua.c:146
#7  0x00007ffff7ae72bb in dt_lua_finalize ()
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/lua/init.c:265
#8  0x00007ffff78f2893 in dt_cleanup ()
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/common/darktable.c:2055
#9  0x00005555555551c8 in main (argc=<optimized out>, argv=<optimized out>)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/main.c:148
(gdb) 

@TurboGit
Copy link
Member

Confirmed, it is due to Lua. Disabling Lua I cannot reproduce the issue.

@TurboGit
Copy link
Member

I have this code that help exiting while Lua is still not fully initialized, at least the process doesn't lock but I get a crash one out of 20 runs or so:

diff --git a/src/lua/init.c b/src/lua/init.c
index 54e7f79397..ec61ef0afc 100644
--- a/src/lua/init.c
+++ b/src/lua/init.c
@@ -248,9 +248,10 @@ int luaopen_darktable(lua_State *L)
 void dt_lua_finalize_early()
 {
   darktable.lua_state.ending = true;
-  gboolean running = (darktable.lua_state.loop
-                      && g_atomic_int_get(&_lua_fully_initialized)
-                      && g_main_loop_is_running(darktable.lua_state.loop));
+
+  const gboolean running = (darktable.lua_state.loop
+                            && g_atomic_int_get(&_lua_fully_initialized)
+                            && g_main_loop_is_running(darktable.lua_state.loop));
   if(running)
   {
     dt_lua_lock();
@@ -258,6 +259,8 @@ void dt_lua_finalize_early()
     dt_lua_unlock();
     g_main_context_wakeup(darktable.lua_state.context);
   }
+  else
+    darktable.lua_state.exec_lock = false;
 }
 
 void dt_lua_finalize()

The Lua init/finalize code is complex, we may need help from @wpferguson to find a proper way to exit Lua while still not 100% initialized.

@jenshannoschwalm
Copy link
Collaborator Author

I got a crash while images exporting after ctrl-q

@jenshannoschwalm
Copy link
Collaborator Author

@TurboGit it's sort of a mess i/we all were not aware of the implications of all the things we now do in the background and wanting a safe darktable exit ...

I just force-pushed:

  1. in 8891c6c there is an additional safety check prohibiting to create a new job if control isnot running any more
  2. b064f9e is just plain renaming maintenance done while analyse and understand ...
  3. 7c5b153 fixes image file I/O problems in background jobs.

@TurboGit
Copy link
Member

@zisoft : Can you check again on your side?

I have tested a bit, no issue on my side except the Lua one I have reported. Yet, the Lua issue is reproducible on master too, so not a regression on this PR.

@jenshannoschwalm
Copy link
Collaborator Author

Just added a fix for not waiting on image import if control is not running any more.

@jenshannoschwalm jenshannoschwalm added the difficulty: hard big changes across different parts of the code base label Dec 15, 2024
@jenshannoschwalm
Copy link
Collaborator Author

@TurboGit One of the first PR's i will do for 5.2 is "Only accept Ctrl-q if no jobs are pending" :-)

@zisoft
Copy link
Collaborator

zisoft commented Dec 15, 2024

It crashes in the first loop

  /* wait for kick_on_workers_thread */
  dt_pthread_join(control->kick_on_workers_thread);

  for(int k = 0; k < control->num_threads-1; k++)
    dt_pthread_join(control->thread[k]);          // <-- breakpoint here

  for(int k = 0; k < DT_CTL_WORKER_RESERVED; k++)
    dt_pthread_join(control->thread_res[k]);

If I set a breakpoint and single-step through the loop, most of the steps gives no log output. Then it comes to a step with lots of log output and that is the step which crashes. The last lines of the log (started with -d all):


59.7676 [iop_validate_params] `flip' validated data for type "dt_image_orientation_t", field: orientation (5)
59.7676 [iop_validate_params] `flip' validated data for type "dt_iop_flip_params_t"
59.7676 committed                     [thumbnail]      flip                   1700  enabled  piece hash=9df07eab6adab321
59.7676 synch all modules done        [thumbnail]                                   defaults 0.1505s, history 0.0106s
59.7676 [dt_imageio_export_with_flags]  modules: rawprepare temperature highlights demosaic flip exposure colorin channelmixerrgb sigmoid colorout gamma
59.7676 modify roi OUT                [thumbnail]      rawprepare              100 (   0/   0) 5344x3516 scale=1.0000 --> (   0/   0) 5184x3456 scale=1.0000  ID=37
59.7676 modify roi OUT                [thumbnail]      flip                   1700 (   0/   0) 5184x3456 scale=1.0000 --> (   0/   0) 3456x5184 scale=1.0000  ID=37
59.7676 [export] creating pixelpipe took 0.168 secs (0.330 CPU)
59.7677 [dt_imageio_export] thumbnail imgid 37, 3456x5184 --> 73x110 (scale=0.0212, maxscale=1.0000). upscale=no, hq=no
59.7677 pipe starting             CL1 [thumbnail]                                  (   0/   0)   73x 110 scale=0.0212 --> (   0/   0)   73x 110 scale=0.0212  ID=37, appleamdradeonpro5500mcomputeengine
59.7677 [memory] before pixelpipe process
        max address space (vmpeak):         unknown
        cur address space (vmsize):     41845448 kB
        max used memory   (vmhwm ):         unknown
        cur used memory   (vmrss ):      2067612 kB

crash in line 1374 of pixelpipe.hb

static gboolean _dev_pixelpipe_process_rec(dt_dev_pixelpipe_t *pipe,
dt_develop_t *dev,
void **output,
void **cl_mem_output,
dt_iop_buffer_dsc_t **out_format,
const dt_iop_roi_t *roi_out,
GList *modules,
GList *pieces,
const int pos)
{
if(dt_atomic_get_int(&pipe->shutdown))
return TRUE;
dt_iop_roi_t roi_in = *roi_out;
char module_name[256] = { 0 };
void *input = NULL;
void *cl_mem_input = NULL;
*cl_mem_output = NULL;
dt_iop_module_t *module = NULL;
dt_dev_pixelpipe_iop_t *piece = NULL;
const dt_dev_pixelpipe_type_t old_pipetype = pipe->type;
const dt_iop_module_t *gui_module = dt_dev_gui_module();
// if a module is active, check if this module allow a fast pipe run
if(gui_module
&& gui_module->flags() & IOP_FLAGS_ALLOW_FAST_PIPE
&& pipe->type & DT_DEV_PIXELPIPE_BASIC
&& dt_dev_modulegroups_test_activated(darktable.develop))
pipe->type |= DT_DEV_PIXELPIPE_FAST;
else

@jenshannoschwalm
Copy link
Collaborator Author

If I set a breakpoint and single-step through the loop, most of the steps gives no log output. Then it comes to a step with lots of log output and that is the step which crashes. The last lines of the log (started with -d all):

@zisoft As there is just one of the threads involved it's random in what position in the loop it "happens", but as there is no imageio system running any more the flip might fail here. So this has been added to the jobs where ctrl-q is understood as "skip-the-rest'

BTW you run ARM or intel? I am absolutely not sure if there are some peculiar things in the _DEBUG mode, even less on MACs.

Could you re-test?

@zisoft
Copy link
Collaborator

zisoft commented Dec 15, 2024

First try, same crash :(

I'm on an Intel Mac.

As there is just one of the threads involved it's random in what position in the loop it "happens"

ah, ok. That's my experience here, the crash position in the loop is indeed random.

EDIT: ah, sorry, I missed your last commit... now testing again

@ralfbrown
Copy link
Collaborator

A Lua hang on very early exit is probably the same underlying issue that I encountered while writing the splash screen code, and kept me from displaying any further progress updates after starting up Lua. I have since been able to trace that hang to the async processing of the Lua startup script. No luarc, no hang.... Once the script finishes, it is again OK to call dt_gui_process_events().

@zisoft
Copy link
Collaborator

zisoft commented Dec 15, 2024

ok, some things have changed.
I now get two different crashes randomly. And successful quits as well.

I send screenshots here with log at the bottom and stacktrace on the left.

The first type is again an assertion failure:

Bildschirmfoto 2024-12-15 um 16 51 55

And the second type is a EXC_BAD_ACCESS in the dbus routines:

Bildschirmfoto 2024-12-15 um 16 53 29

I also removed my luarc but that does not make a difference as these are no early quit attempts.

@TurboGit
Copy link
Member

At this stage, let's move this to 5.2 too risky for 5.0 despite all the hard work done here.

@TurboGit TurboGit modified the milestones: 5.0, 5.2 Dec 15, 2024
@jenshannoschwalm
Copy link
Collaborator Author

Making this draft for now, as there will be PR's doing maintenance and safety checks to be merged first and this will have to be rebased afterwards.

As we now have a lot of things in background jobs related to images and mipmaps
we must keep those systems available until the control system has been fully closed.

NOTE: for some reason we can't have dt_imageio_cleanup() after dt_control_shutdown()
1. For code readability use 'control' instead of 's'
2. Simplify dt_control_running()
3. Removed not needed checks on errors
4. Two functions are strictly internal so no exposing via .h files and be static
   - _control_get_threadid()
   - _control_job_set_synchronous()
5. Two function were not used at all (so were untested) so the were removed
   - dt_control_job_set_state_callback()
   - dt_control_job_wait()
6. gphoto joining after cleanup test
7. Never create a control job if control is not in running state
While running the job dispatchers we keep track on jobs being inserted and executed so we
at least know about the number of unprocessed jobs via dt_control_jobs_pending()

Also show pending jobs in the log if closing dt.
Strictly prohibit reentrant use of dt_control_quit() and avoid a race condition via
atomic instruction.
As we have control jobs on "file actions" we have to make sure that we have 'dt_imageio` available for:
export, import, hdr creation, flip and exif refresh inside the job.
While working the list of images we test for dt_control_running() and only process the job specific action if so.

Also we have some **very** time consuming jobs that can be canceled in the UI, it seems to be much better to understand
a ctrl-q or click on close-darktable button as a request-to-cancel so there will be no lengthy background actions after
the window has closed.

While we aborted file import via ctrl-q we don't want to wait ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix pull request fixing a bug difficulty: hard big changes across different parts of the code base priority: high core features are broken and not usable at all, software crashes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants