Attempt a workaround for glibc deadlock described in #656
When using BenchExec with concurrent execution of runs,
we sometimes experience a deadlock in our child process
that is forked from the main process.
This is caused by glibc not handling clone() properly.

With this workaround, we check if the child process takes unusually long
to create the container (timeout is set to 60s)
and if this happens, we assume the deadlock has occurred.
Because almost everything related to the container creation
happens inside the child process,
we can just kill the child process and attempt to start the run
(with a new child process) again.
This is safe up to a certain point, because until then we are sure
that the child process cannot have started the benchmarked tool yet.

This is not guaranteed to catch all instances of the deadlock,
only those that happen soon enough after clone() in the child.
In my tests, however, this was always the case.
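
The mechanism described in this message can be sketched as a standalone example. This is not BenchExec code: the helpers `start_once` and `start_with_retry`, the `b"ready"` message, and the bounded list of attempts are assumptions made up for illustration (the commit itself retries in a `while True` loop, uses a 60s timeout, and kills the child with `util.kill_process`). The sketch forks a child that reports readiness over a pipe, waits for that report with `select.select` and a timeout, and kills the child and retries if the timeout expires.

```python
import os
import select
import signal
import time


def start_once(child_delay, timeout):
    """Fork a child that reports readiness over a pipe.

    Returns the child's PID, or None if the child was not ready in time
    (in which case it has been killed and the caller may retry).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: simulate setup work (or a hang), then report readiness.
        os.close(read_fd)
        try:
            time.sleep(child_delay)
            os.write(write_fd, b"ready")
        finally:
            os._exit(0)

    # Parent: only the read end is needed here.
    os.close(write_fd)
    try:
        # Wait with a timeout until the read end becomes readable.
        rlist, _, _ = select.select([read_fd], [], [], timeout)
        if read_fd not in rlist:
            # Timeout: assume the child is stuck, kill and reap it.
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
            return None  # signal "retry" to the caller
        os.read(read_fd, 10)
        os.waitpid(pid, 0)
        return pid
    finally:
        os.close(read_fd)


def start_with_retry(delays, timeout):
    """Try once per entry in delays; return (pid, number_of_attempts)."""
    for attempt, delay in enumerate(delays, start=1):
        result = start_once(delay, timeout)
        if result is not None:
            return result, attempt
    raise RuntimeError("child never became ready")
```

Under these assumptions, `start_with_retry([5.0, 0.0], timeout=0.2)` would time out on the first child, kill it, and succeed with the second one. Killing is only acceptable here because the child has not yet done anything with side effects, which mirrors the safety argument above.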
PhilippWendler committed Nov 2, 2022
1 parent bc08946 commit eb7b79f
Showing 1 changed file with 33 additions and 9 deletions.
42 changes: 33 additions & 9 deletions benchexec/containerexecutor.py
@@ -13,6 +13,7 @@
 import collections
 import shutil
 import pickle
+import select
 import signal
 import socket
 import subprocess
@@ -506,15 +507,19 @@ def _start_execution(
                         f"Invalid relative result-files pattern '{pattern}'."
                     )

-        return self._start_execution_in_container(
-            root_dir=root_dir,
-            output_dir=output_dir,
-            memlimit=memlimit,
-            memory_nodes=memory_nodes,
-            result_files_patterns=result_files_patterns,
-            *args,
-            **kwargs,
-        )
+        while True:
+            result = self._start_execution_in_container(
+                root_dir=root_dir,
+                output_dir=output_dir,
+                memlimit=memlimit,
+                memory_nodes=memory_nodes,
+                result_files_patterns=result_files_patterns,
+                *args,
+                **kwargs,
+            )
+            if result is not None:
+                return result
+            # else retry as workaround for #656

     # --- container implementation with namespaces ---

@@ -855,6 +860,25 @@ def check_child_exit_code():
             os.write(to_grandchild, MARKER_USER_MAPPING_COMPLETED)

             try:
+                # Wait with timeout until from_grandchild becomes ready to be read.
+                rlist, _, _ = select.select([from_grandchild], [], [], 60)
+                if from_grandchild not in rlist:
+                    # Timeout has occurred, likely deadlock in child (cf. #656).
+                    logging.warning(
+                        "Child %s not ready after 60s, likely "
+                        "https://github.com/sosy-lab/benchexec/issues/656 occurred. "
+                        "Killing it and trying again.",
+                        child_pid,
+                    )
+                    # As long as we have not sent MARKER_PARENT_COMPLETED, the tool is
+                    # not yet started and it is safe to kill the child and restart.
+                    # Killing child (PID 1 in container) will also kill grandchild if it
+                    # already exists.
+                    util.kill_process(child_pid)
+                    # Open pipes will be closed in finally.
+                    # Signal retry to caller.
+                    return None
+
                 # read at most 10 bytes because this is enough for 32bit int
                 grandchild_pid = int(os.read(from_grandchild, 10))
             except ValueError:
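
The timeout check in the last hunk relies on `select.select` returning an empty ready list when nothing arrives in time. A minimal sketch of that behavior, using a plain `os.pipe()` instead of BenchExec's `from_grandchild` descriptor (the variable names and the 0.1s timeout are assumptions for illustration), also shows why reading 10 bytes is enough for a decimal 32-bit integer:

```python
import os
import select

read_fd, write_fd = os.pipe()

# Nothing has been written yet, so select() returns an empty ready list
# once the timeout expires.
rlist, _, _ = select.select([read_fd], [], [], 0.1)
timed_out = read_fd not in rlist

# After the writer sends a decimal number, the descriptor is readable.
# 4294967295 is the largest unsigned 32-bit value and has 10 digits.
os.write(write_fd, b"4294967295")
rlist, _, _ = select.select([read_fd], [], [], 0.1)
ready = read_fd in rlist
pid_value = int(os.read(read_fd, 10))

os.close(read_fd)
os.close(write_fd)
```

Here `timed_out` is true after the first call and `ready` is true after the second, so the caller can tell a stuck writer apart from one that has already reported its PID.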
