tests/int/cpt: fix lazy-pages flakiness
The "checkpoint --lazy-pages and restore" test sometimes fails on restore
in our CI on Fedora 33 when the systemd cgroup driver is used:

> (00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f
> (00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
> ...
> (00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED.

I think what happens is this:

1. The test runs runc checkpoint in lazy-pages mode in background.
2. The test runs criu lazy-pages in background.
3. The test runs runc restore.

Now all three are working together: criu restore restores, criu
lazy-pages listens for page faults on a uffd and fetches missing pages
from runc checkpoint, which serves those pages.

At some point criu lazy-pages decides to fetch the rest of the pages,
and once it's done it exits, and runc checkpoint, as there are no more
pages to serve, exits too.

At the end of runc checkpoint the container is removed (see "defer
destroy(container)" in checkpoint.go). This involves a call to
cgroupManager.Destroy, which, when the systemd manager is used,
calls stopUnit. That makes systemd not just remove the unit,
but also send SIGTERM to its processes, if there are any.

As the container is being restored into the same systemd unit,
sometimes this results in sending SIGTERM to a process which
criu restores, and thus restoring fails.

The remedy here is to change the name of the systemd unit into which
the container is restored.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin committed Apr 1, 2021
1 parent 2dd62b3 commit 36fe3cc
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion tests/integration/checkpoint.bats
@@ -211,11 +211,13 @@ function simple_cr() {
 lp_pid=$!

 # Restore lazily from checkpoint.
-# The restored container needs a different name as the checkpointed
+# The restored container needs a different name (as well as systemd
+# unit name, in case systemd cgroup driver is used) as the checkpointed
 # container is not yet destroyed. It is only destroyed at that point
 # in time when the last page is lazily transferred to the destination.
 # Killing the CRIU on the checkpoint side will let the container
 # continue to run if the migration failed at some point.
+[ -n "$RUNC_USE_SYSTEMD" ] && set_cgroups_path
 runc_restore_with_pipes ./image-dir test_busybox_restore --lazy-pages

 wait $cpt_pid
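
Why a fresh cgroups path yields a fresh unit: with the systemd driver, runc
expects cgroupsPath in the form slice:prefix:name and composes the scope unit
name as prefix-name.scope. The helper below is a simplified sketch of that
mapping, not the code runc uses; the function name and the example paths are
hypothetical, and what set_cgroups_path actually assigns is not shown in this
commit.

```bash
# unit_name_from_cgroups_path "slice:prefix:name" prints the systemd unit
# name that such a path maps to (simplified; hypothetical helper).
unit_name_from_cgroups_path() {
	name=${1##*:}       # text after the last colon
	rest=${1%:*}        # "slice:prefix"
	prefix=${rest#*:}   # text after the first colon
	case $name in
	*.slice) echo "$name" ;;            # a slice name is used as is
	*) echo "$prefix-$name.scope" ;;
	esac
}

# Hypothetical paths for the checkpointed and the restored container.
old=$(unit_name_from_cgroups_path "machine.slice:runc-cgroups:integration-test-1")
new=$(unit_name_from_cgroups_path "machine.slice:runc-cgroups:integration-test-2")
[ "$old" != "$new" ] && echo "restore gets its own unit: $new (old: $old)"
```

With distinct paths the two containers end up in distinct units, so stopping
the old unit no longer touches processes of the restored container.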