Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safe darktable shutdown #18061

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

jenshannoschwalm
Copy link
Collaborator

@jenshannoschwalm jenshannoschwalm commented Dec 24, 2024

EDIT again 01/04

This is a comprehensive collection of everything necessary to avoid darktable crashing after closing the app via ctrl-q or click on close-button. There have been lengthy discussions in previous pr's #18023 and #17984 for those of you interested in the progress we made. It has been a very hard and bumpy ride to get this all sorted out, with this pr it is very hard to get dt crashing after closing the ui. Thanks to all of you having tested or reviewed commented on those pr's.

I had still observed rare and spurious problems with the late PRs if there are many backjobs running exporting and flipping the same images for example in rawspeed loader, exif blob reading and mem access in nirvana.

Also there might be issues in dbus as reported by @zisoft but i couldn't trigger those.
The lua terminations problems have been finally understood and fixed.

The good one: The bunch is intended for master and 5.0 branch if reviewed for this.
The bad one for 5.0, to work safely the first 3 commits! are required

Commit 1 Fixes the order in control shutdown
Commit 2+3 fix crashes related to wrong cache usage. All 3 big bugs were not detected and were likely not happening if backthread processing on images had much lower cache pressure.

Proposal 4 Some additional easy fixes related to mipmap caching and bad dt_image_t returned and not fully checking for safe results. Not as important ...

Let's go into details from the commit messages for easier understanding why this all necessary.


Control quit and shutdown order issue (shutdown part 1)

How is darktable shutdown working safely?

  1. dt_control_quit() is called via user request, it sets control->running anatomically to DT_CONTROL_STATE_CLEANUP implying that dt_control_running() will not be TRUE any more.
    It finally calls gtk_main_quit() so with current code we don't have gtk events after that. Avoid race conditions by using an anatomic flag ensuring the valid code part is executed only once.

  2. Quitting gtk also means we are exactly here to continue the shutdown. Anything requiring a still active UI must be done before ...

  3. dt_control_shutdown() first waits for all threads to be joined, that means waiting for all pending jobs to be done completely.

  4. So we have to ensure:

    1. a full working software stack including image_cache, mipmap_cache and darktable.imageio to allow processing the pixelpipe and full access to all images.
    2. we don't access darktable.lib related stuff with dereferencing.
    3. The pipeline processing uses access to gui related data - focus, active module ... so make sure to avoid those too.
    4. As lua events might be fired by the backthreads it's state mutex must still be unlocked.
  5. After dt_control_shutdown() has finished we are sure there are no background threads running any more so we can safely close all mentioned subsystems and continue.

  6. In shutdown mode - tested via dt_control_running() is FALSE - we will not add new jobs. Avoids unlikely but possible race conditions for now.


Progress destroy and safety (shutdown part 2)

  1. A very annoying bug in dt_control_progress_cancel() has been fixed.

    Calling the progress->cancel callback is somewhat tricky as that might include destroying the progress structure which also includes destroying the process->mutex.

    dt_control_progress_cancel_callback() is doing that as it deep-down calls dt_control_job_cancel() which finally wants to destroy the mutex but as it's locked it can't do so.

    So, if progress->cancel == &dt_control_progress_cancel_callback we must not call progress->cancel with the mutex locked as that would make the mutex destroy failing leading to assert() failures in debug builds and we can't use that mutex anyway leading to a crash.

    Otherwise do the cancel callback in mutex locked state.

  2. We strictly avoid progress de-referencing if being NULL for safety, in case of getters we return safe dummies,
    in case of putters we avoid the function.


Fix dt_image_update_final_size() and dt_image_get_final_size() (shutdown part 3)

In dt_image_update_final_size() we must not use dt_cache_release() if the entry is NULL because of dereferencing.

In dt_image_get_final_size() we kept a pointer to a dt_image_t struct and later used it after the original dt_image_t has been released.

If system is under heavy cache load this could very well fail - and it does.


dt_image_t not available fixes (proposal shutdown part 4)

If we get dt_image_t *img = dt_image_cache_get() from the image cache img can be NULL for various reasons.

In all those cases we must not de-reference img in any way and must use a meaningful value - like when getting the imgid - and avoid further processing on that data.

We had that being checked in most cases, a few ones have been overseen yet.


Above has been edited to current commit status

@jenshannoschwalm jenshannoschwalm added bugfix pull request fixing a bug priority: high core features are broken and not usable at all, software crashes difficulty: hard big changes across different parts of the code base lua labels Dec 24, 2024
@wpferguson
Copy link
Member

As dt_lua_finalize() or the dt_lua_finalize_early() variant has been called before, the mutex behind dt_lua_lock() is not safe any longer and as the lua subsystem is not fully functional any more, we have to strictly avoid processing (problematic) lua code from now on.
So take care, anything happening in lua code must be prepared to respect dt_control_running() for signals or a responding gui.

dt_lua_finalize_early() isn't a variant, it's the preparatory for dt_lua_finalize(). dt_lua_finalize_early() sends the exit event to all the running scripts letting them know that darktable is ending. dt_lua_lock() is functional in dt_lua_finalize_early() so that the scripts can exit cleanly. dt_lua_finalize() destroys the Lua context, so Lua is no longer running after that.

Lua doesn't need the GUI to run, except for the case of darktable.gui.action() being called in darkroom. The GUI is used by some scripts just to get parameters so that they can perform the requested functions. So if a script has been triggered to run from a the GUI, it will continue running even if the GUI disappears.

For sure the dt_lua_lock() can't work and executions leads to crashes ( or assert() failing in debug builds ) if the lock mutex has been destroyed already.

Can you give me an instance or scenario where it fails or causes a crash? Or are you talking about the final lock in dt_lua_finalize() where it never unlocks, because Lua isn't running anymore so therefore it can't.

problematic lua code

Can I have a definition or example of that so I have some idea what to test?

@jenshannoschwalm
Copy link
Collaborator Author

dt_lua_finalize_early() isn't a variant, it's the preparatory for dt_lua_finalize(). dt_lua_finalize_early() sends the exit event to all the running scripts letting them know that darktable is ending. dt_lua_lock() is functional in dt_lua_finalize_early() so that the scripts can exit cleanly. dt_lua_finalize() destroys the Lua context, so Lua is no longer running after that.

OK. So this remarks should be changed. Maybe you make a proposal or i just take your comment ?

Lua doesn't need the GUI to run, except for the case of darktable.gui.action() being called in darkroom. The GUI is used by some scripts just to get parameters so that they can perform the requested functions. So if a script has been triggered to run from a the GUI, it will continue running even if the GUI disappears.

Ok, as long as as does not read "fresh" data or uses the lib stuff.

For sure the dt_lua_lock() can't work and executions leads to crashes ( or assert() failing in debug builds ) if the lock mutex has been destroyed already.

Can you give me an instance or scenario where it fails or causes a crash? Or are you talking about the final lock in dt_lua_finalize() where it never unlocks, because Lua isn't running anymore so therefore it can't.

problematic lua code

Can I have a definition or example of that so I have some idea what to test?

Sure - just look at the commit code :-)

  1. _import_internal uses dt_lua_lock(). Had been good this way until we started to using the backjobs. So there might be running import jobs after that lock had been destroyed.
  2. export_with_flags is the same story.
    Both crashed here and iirc also on @zisoft machine, for sure this is absolutely not safe and on debug builds runs into assert() issues.

@wpferguson
Copy link
Member

wpferguson commented Dec 24, 2024

What are backjobs? Is that in 5.0?

EDIT: So why not move dt_lua_finalize() after all the backjobs? dt_lua_lock() works until dt_lua_finalize() runs.

@wpferguson
Copy link
Member

Lua has been working just fine until backjobs were added. So since the only thing that changed was backjobs, that's where the problem is, not in Lua.

The "problematic" code is that the backjobs are firing events that Lua is supposed to handle. The reason for the crashes is that the backjobs were added after Lua has stopped. So the fix is to put the backjobs before Lua finalizes so that Lua can handle the events.

@jenshannoschwalm
Copy link
Collaborator Author

jenshannoschwalm commented Dec 25, 2024

The reason for the crashes is that the backjobs were added after Lua has stopped. So the fix is to put the backjobs before Lua finalizes so that Lua can handle the events.

Well, the backjobs were not added but continue running ...

Just mentioning this, i was also very surprised why dt started behaving like this and it took me a long time to understand :-)

So why not move dt_lua_finalize() after all the backjobs

I agree. I was a bit reluctant to do so as that has been in that position for a "long time" . Had that here in "test commit" and indeed could not find any issue the last days doing that, so squashed and pushing that now.

I would appreciate testing on your side as you will know best if anythings behaves strange.
(BTW i couldn't spot the code part that actually and finally unlocks and destroys the lua state mutex and cond.

@jenshannoschwalm
Copy link
Collaborator Author

Fixed, squashed and force-pushed after above discussion, updating the top "presentation" to current status

src/common/exif.cc Outdated Show resolved Hide resolved
@jenshannoschwalm
Copy link
Collaborator Author

Again fixed, squashed and force-pushed after above discussion, updating the top "presentation" to current status

@jenshannoschwalm
Copy link
Collaborator Author

Added a 4-th commit with two additional and i think rather important fixes.

Affected were dt_image_update_final_size() and dt_image_get_final_size(), under heavy cache load both can fail miserably.

@jenshannoschwalm jenshannoschwalm force-pushed the processing_in_the_dark branch 3 times, most recently from a2b085a to 8442b6a Compare December 31, 2024 04:26
@jenshannoschwalm jenshannoschwalm added this to the 5.2 milestone Jan 2, 2025
How is darktable shutdown working safely?
1. dt_control_quit() is called via user request, it sets control->running anatomically
     to DT_CONTROL_STATE_CLEANUP implying that dt_control_running() will not be TRUE any more.
     It finally calls gtk_main_quit() so with current code we don't have gtk events after that.

2. Quitting gtk also means **we are exactly here** to continue the shutdown. Anything requiring a
     still active UI must be done before ...

3. dt_control_shutdown() first waits for all threads to be joined, that means waiting for all pending
     jobs to be done completely.

4. So we have to ensure:
   1) a full working software stack including image_cache, mipmap_cache and darktable.imageio to allow
        processing the pixelpipe and full access to all images.
   2) we don't access darktable.lib related stuff with dereferencing.
   3) The pipeline processing uses access to gui related data - focus, active module ...
        so make sure to avoid those too.
   4) As lua events might be fired by the backthreads it's state mutex must still be unlocked.

5. After dt_control_shutdown() has finished we are sure there are no background threads running any
     more so we can safely close all mentioned subsystems and continue.

6. In shutdown mode - tested via dt_control_running() is FALSE - we will not add new jobs.
     Avoids unlikely but possible race conditions for now.
1. A very annoying bug in dt_control_progress_cancel() has been fixed.

Calling the progress->cancel callback is somewhat tricky as that might include
destroying the progress structure which also includes destroying the process->mutex.
  _control_progress_cancel_callback() doing in this order
  --> dt_control_job_cancel() --> _control_job_set_state --> dt_control_progress_destroy

So in this case - we used dt_control_progress_attach_job()
  a)  we **must not** call progress->cancel with the mutex locked as that would make the mutex
      destroy failing leading to assert() failures in debug builds and an undefined behaviour in
      release builds.
  b)  We unlock a mutex immediately after that but that location is not existing any more.

Otherwise, do the cancel callback in mutex locked state as the progress struct will not be destroyed.

2. We strictly avoid progress de-referencing if being NULL for safety, in case of getters we return safe dummies,
   in case of putters we avoid the function.
…own part 3)

In dt_image_update_final_size() we **must not** use dt_cache_release() if the entry is NULL.

In dt_image_get_final_size() we kept a pointer to a dt_image_t struct and later used it after
the original dt_image_t has been released.
If system is under heavy cache load this could very well fail - and it does.
If we get `dt_image_t *img = dt_image_cache_get()` from the image cache, img can be NULL
for various reasons.

In all those cases we must not de-reference img in any way and must use a meaningful
value - like when getting the imgid - and avoid further processing on that data.

We had that being checked in most cases, a few ones have been overseen yet.
@jenshannoschwalm
Copy link
Collaborator Author

@TurboGit i did a lot more testing the last week or so, checking what could possibly be done for the 5.0 branch as a bugfix. Remember this all is about safe closing without a crash if there is a background job still running when you do ctrl-q.

If we would take only a497c56 the chance for a safe close is already pretty good (all the other commits fix bugs that happen mostly with heavy cache pressure by multiple background jobs or by discarding a job) so i personally would suggest to take that for 5.0 too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix pull request fixing a bug difficulty: hard big changes across different parts of the code base lua priority: high core features are broken and not usable at all, software crashes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants