BenchExec subprocess hangs in __malloc_fork_lock_parent #656

Open
sim642 opened this issue Nov 11, 2020 · 10 comments
Labels
container related to container mode

Comments

@sim642
Contributor

sim642 commented Nov 11, 2020

The following has happened to me twice now, so I guess it's not totally random.

I've run Goblint on SV-COMP SoftwareSystems-DeviceDriversLinux64-ReachSafety via BenchExec with the following additional options:

--read-only-dir / --overlay-dir . --hidden-dir /home --numOfThreads 15

What seems to happen is that it just gets stuck at the last task:

  • The terminal title says Task goblint-all-fast.sv-comp20_prop-reachsafety.SoftwareSystems-DeviceDriversLinux64-ReachSafety (2729/2729)
  • There's no real CPU usage; nothing is printed, but it also doesn't print the statistics or exit.
  • It stays that way for hours, so every run should have timed out long ago.
  • I can SIGKILL some subprocesses and then it prints:
    2020-11-11 10:33:51 - CRITICAL - OSError 0 while starting './goblint' in '.': Child process of RunExecutor terminated with exit signal 9.
    
    Statistics:           2728 Files
      correct:            1006
        correct true:     1006
        correct false:       0
      incorrect:             0
        incorrect true:      0
        incorrect false:     0
      unknown:            1722
      Score:              2012 (max: 5108)
    
    In order to get HTML and CSV tables, run
    table-generator '/mnt/goblint-svcomp/benchexec/results/new-results14-all-fast-systems-fix_arithmetic_on_bot/goblint-all-fast.2020-11-10_17-46-59.results.sv-comp20_prop-reachsafety.SoftwareSystems-DeviceDriversLinux64-ReachSafety.xml.bz2'
    
    Script was interrupted by user, some runs may not be done.
    
    But it still doesn't exit unless I SIGKILL the main benchexec process itself.
  • The results are then just missing one task's result.

So far, this has always been at the last task, never earlier, and only on a big set like SoftwareSystems, not on ones like NoDataRace.

This is on Ubuntu 20.04 and BenchExec 3.4-dev (commit 0207553e). I guess I should update that, but are there any changes since then that would even influence this?

@PhilippWendler
Member

There are no changes since then that should improve the behavior in this regard. In principle a log with --debug would be useful in such cases, but I already have a suspicion of what the problem is.

In particular, due to the high number of parallel executions you are using, I suspect that this is #435. Typically this happens only on highly loaded systems (many executions on few cores), and I thought it was fixed completely for Python 3.7+ on Linux x86_64, but maybe it is not.

Note that it does not always happen for the last task; it is just that BenchExec continues to execute other tasks until only the one with the deadlock remains.

How many cores does your machine have, how many cores do you allow per task, and are there other processes running at the same time that use significant CPU resources?

To double check something, could you please run python3 -c "import benchexec.container as c; print(c.NATIVE_CLONE_CALLBACK_SUPPORTED)" and check that it outputs True (this needs to have BenchExec on the Python package path)?

Furthermore, it would be nice if you could follow our FAQ entry on this and provide some gdb output with stack traces, if possible.

BenchExec also has another mitigation for this issue, which is just disabled for systems for which we thought the issue would not occur. You could test whether this mitigation helps for you by applying 656-workaround.patch.txt to your BenchExec installation locally and trying to reproduce the problem. If that improves the situation, we could enable it for everybody.

Apart from that, as a workaround, using fewer parallel runs should make it work more reliably.

@sim642
Contributor Author

sim642 commented Nov 11, 2020

How many cores does your machine have, how many cores do you allow per task, and are there other processes running at the same time that use significant CPU resources?

The machine has 16 cores, and I have 15 parallel runs, each of which is limited to 1 core. So there's one CPU core spare for anything else (BenchExec itself, ssh, htop, etc., but nothing CPU-intensive).

To double check something, could you please run python3 -c "import benchexec.container as c; print(c.NATIVE_CLONE_CALLBACK_SUPPORTED)" and check that it outputs True (this needs to have BenchExec on the Python package path)?

Yes, it outputs True.

Furthermore, it would be nice if you could follow our FAQ entry on this and provide some gdb output with stack traces, if possible.

Thanks for the link! I'll keep this in mind for the next time this happens.

@sim642
Contributor Author

sim642 commented Nov 17, 2020

It has happened a third time now, so here are the bt and bt full outputs (although the latter is largely useless) for all the BenchExec processes, and a snippet from htop showing their tree structure:
282458.txt
282459.txt
282463.txt
282857.txt
htop.txt

@PhilippWendler
Member

Thanks. I think that 282458, 282459, and 282463 are the BenchExec main process and two helper processes. One is hanging in read(), one waits for a lock, and one is executing something; this should be normal.

282857 should be the problematic process that is dead-locked and needs to be killed in order to let BenchExec continue. Its stack trace starts at #21 ... in clone, which means that this is one of the processes that we clone so that they can start the actual tool execution in the container (as expected). Contrary to #435, though, this process did not hang immediately after the clone. Instead, it seems that it could successfully execute some Python code (several call methods of Python are on the stack) and is hanging inside fork. The actual hanging function is __malloc_fork_lock_parent, which is supposed to take an internal malloc lock before the actual fork is done (source code).

From this stack trace my assumption is that something similar to #435 is happening, i.e., some inconsistent lock state in the cloned subprocess, which leads to a deadlock when the subprocess attempts an operation that relies on that lock state. However, because the problematic lock seems to belong to malloc, I am not sure whether this is a bug in BenchExec, Python, or libc. When we do the clone, we don't do it through a Python API (which is what caused the problem in #435). However, we do perform the clone through libc's API, which should mean that it is libc's responsibility to handle the state of its malloc lock for both fork and clone.

But as far as I understand the code, it just does not do that? __libc_fork calls the three __malloc_fork_* functions, as can be seen in the source code, but there seem to be no other callers of these functions (the results of GitHub's search match what I find locally). I am pretty sure that this is the source of libc's clone, and it does not handle anything related to malloc's lock.

So it all looks like using libc's clone can lead to broken malloc state in the child process, and this could affect all kinds of programs.

The problem is that there seem to be no easy workarounds for this libc bug. We cannot manage malloc's lock state manually, and we cannot easily avoid calling clone because we need its namespacing features. However, after reading some more glibc source code, the clone man page, and bugs like 21793 and 4737, it seems like we should really consider (a) not using clone and (b) avoiding forking/cloning from a multi-threaded process, because, for example, clone does not call the atfork handlers, and forking in multi-threaded processes is still not async-signal-safe.
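
Just to make the fork-handler point concrete, here is a minimal sketch (not BenchExec's actual code; it uses Python's os.register_at_fork as a stand-in for the handlers glibc runs internally around fork(), and it omits the namespace flags so it can run unprivileged on Linux):

```python
import ctypes
import ctypes.util
import os
import signal

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# A fork handler registered via Python's at-fork machinery; glibc's internal
# __malloc_fork_* handlers play an analogous role around fork().
os.register_at_fork(after_in_child=lambda: os.write(1, b"fork handler ran in child\n"))

# Case 1: fork() -- the handler (and glibc's malloc lock handling) runs in the child.
pid = os.fork()
if pid == 0:
    os._exit(0)
os.waitpid(pid, 0)

# Case 2: clone() through libc's API, roughly like a containerized start
# (real code would additionally pass namespace flags such as CLONE_NEWNS).
CHILD_FUNC = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p)

def child(_arg):
    # Nothing was printed before this line: clone() bypassed the fork handlers,
    # just as it bypasses glibc's internal handling of the malloc lock.
    os.write(1, b"cloned child, no fork handler ran\n")
    return 0

callback = CHILD_FUNC(child)  # keep a reference so it is not garbage-collected
STACK_SIZE = 1024 * 1024
stack = ctypes.create_string_buffer(STACK_SIZE)
stack_top = ctypes.c_void_p(ctypes.addressof(stack) + STACK_SIZE)

pid = libc.clone(callback, stack_top, signal.SIGCHLD, None)
if pid == -1:
    errno = ctypes.get_errno()
    raise OSError(errno, os.strerror(errno))
os.waitpid(pid, 0)
```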

So thank you very much for reporting this problem and providing the stack traces, but I am afraid that fixing it will be complex and take some time. Running BenchExec with fewer parallel executions and/or a smaller number of tasks should make the bug appear less often.

@PhilippWendler PhilippWendler changed the title BenchExec stuck at last task BenchExec subprocess hangs in __malloc_fork_lock_parent Nov 17, 2020
@PhilippWendler
Member

Idea: If the child process deadlocks soon enough after being created (before the grandchild execs the benchmarked tool), we can still safely kill the child (and the grandchild, if it already exists). For this we would just need to add a timeout to this line, plus some cleanup logic:

grandchild_pid = int(os.read(from_grandchild, 10))

Then we can retry the run.

This might be a useful workaround in practice until we get a full solution.
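
A rough sketch of what such a timeout could look like (hypothetical names except for from_grandchild and the final os.read() line from the snippet above; not the actual implementation):

```python
import os
import select
import signal

CHILD_READY_TIMEOUT = 60  # seconds to wait before assuming the deadlock occurred

def read_grandchild_pid(from_grandchild, child_pid, timeout=CHILD_READY_TIMEOUT):
    """Wait for the grandchild pid on the pipe; on timeout, kill the
    (presumably dead-locked) child and return None so the caller can retry."""
    ready, _, _ = select.select([from_grandchild], [], [], timeout)
    if not ready:
        # Child did not report the grandchild pid in time; assume the
        # deadlock from this issue occurred and kill it.
        os.kill(child_pid, signal.SIGKILL)
        os.waitpid(child_pid, 0)
        return None
    return int(os.read(from_grandchild, 10))
```

If None is returned, the caller would clean up the container setup and simply start the run again with a fresh child process.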

@michael-schwarz

Such a fix would be highly appreciated by us at TUM; we still regularly encounter this bug on our 48-core machine!

PhilippWendler added a commit that referenced this issue Nov 2, 2022
When using BenchExec with concurrent execution of runs,
we sometimes experience a deadlock in our child process
that is forked from the main process.
This is due to missing proper handling of clone() in glibc.

With this workaround, we check if the child process takes unusually long
to create the container (timeout is set to 60s)
and if this happens, we assume the deadlock has occurred.
Because almost everything related to the container creation
happens inside the child process,
we can just kill the child process and attempt to start the run
(with a new child process) again.
This is safe because we only do this up to a point where we are sure
that the child process cannot have started the benchmarked tool yet.

This is not guaranteed to catch all instances of the deadlock,
only those that happen soon enough after clone() in the child.
But in my tests this is the case.
@PhilippWendler
Member

So this idea for a workaround was really not difficult to implement; I did so in https://github.com/sosy-lab/benchexec/tree/656-deadlock-workaround and it works in my (small) tests. It waits for 60s, then prints a warning like "Child 310396 not ready after 60s, likely https://github.com/sosy-lab/benchexec/issues/656 occurred. Killing it and trying again." and then tries again.

@sim642 @michael-schwarz It would be really cool if you could give this branch a try on some real-world workloads and tell me whether it works as expected and actually triggers (look for the warning in the log). In theory it might trigger sometimes but not catch all cases, but I think it is likely that it catches most or all of them.

@michael-schwarz

Great, thank you! I will start a workload with this fix next time we have some free capacity on our large server (should be this week or beginning of next week at the latest)!

@michael-schwarz

michael-schwarz commented Nov 4, 2022

It seems to work! 🎉 For a workload this morning, the warning triggered and execution did not freeze. We are currently running it a lot, so I can keep an eye on our executions for the next few weeks and see if the issue resurfaces or whether this fixes it in all cases.

PhilippWendler added a commit that referenced this issue Nov 7, 2022
This workaround has been tested and found to be working well enough
as an intermediate solution:
#656 (comment)
@PhilippWendler PhilippWendler added the container related to container mode label Nov 7, 2022
@PhilippWendler
Member

Current state: A workaround for this problem was added to BenchExec 3.14. BenchExec will now attempt to detect the deadlock and continue benchmarking after 60s. A full solution can only be implemented as part of a larger restructuring that is tracked in #875.
