cuda: prevent task lockup on timeout error #2547
base: criu-dev
Conversation
LGTM. I think we need to move run_plugins(CHECKPOINT_DEVICES) out of collect_pstree(); collect_pstree() should just freeze processes.
Unfortunately, this problem also requires a driver fix. Due to the way cuda-checkpoint works at the moment, killing it via ctrl-c or an alarm timeout in the middle of an operation can make it get out of sync with later invocations of cuda-checkpoint. Essentially, you will get stale operation responses from previous invocations, so this may not always work as intended. For example: if you kill it in the middle of a checkpoint operation that then completes behind the scenes, the following call to cuda-checkpoint to restore will return the status of the checkpoint rather than the restore. Fixing it currently requires restarting the target application, which is not very user friendly. I'll forward the issue internally though, as it's been on the radar to fix for a while. The patch itself LGTM, and I also agree with Andrei's point to move the checkpoint plugin call out of the pstree walk/freeze.
When creating a checkpoint of large models, the `checkpoint` action of `cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail with the following error, leaving the CUDA task in a locked state:

    cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
    Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
    Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
    Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
    net: Unlock network
    cuda_plugin: finished cuda_plugin stage 0 err -1
    cuda_plugin: resuming devices on pid 84145
    cuda_plugin: Restore thread pid 84202 found for real pid 84145
    Unfreezing tasks into 1
    Unseizing 84145 into 1
    Error (criu/cr-dump.c:2111): Dumping FAILED.

To fix this, we set `task_info->checkpointed` before invoking the `checkpoint` action to ensure that the CUDA task is resumed even if CRIU times out.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
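A minimal sketch of that idea in C, not the actual `cuda_plugin.c` code: the helper `cuda_checkpoint_action()` and the exact task-info structure are assumptions made for illustration; only the `task_info->checkpointed` flag and the `checkpoint` action come from the commit message above.

```c
/*
 * Sketch only: mark the task as checkpointed *before* running the
 * (potentially long-running) checkpoint action, so the error path that
 * resumes devices after a CRIU timeout still issues a restore.
 * cuda_checkpoint_action() is a hypothetical helper wrapping an
 * invocation of the cuda-checkpoint binary.
 */
static int checkpoint_devices(int pid, struct cuda_task_info *task_info)
{
	/*
	 * Set the flag first: if CRIU's timeout alarm interrupts the call
	 * below, the resume path sees task_info->checkpointed and knows it
	 * must run the restore action to unlock the CUDA task.
	 */
	task_info->checkpointed = true;

	if (cuda_checkpoint_action(pid, "checkpoint")) {
		pr_err("CHECKPOINT_DEVICES failed for pid %d\n", pid);
		return -1;
	}

	return 0;
}
```

Without setting the flag early, the resume path would assume the checkpoint never started and skip the restore, which is what leaves the CUDA task locked after a timeout.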
Move `run_plugins(CHECKPOINT_DEVICES)` out of `collect_pstree()` to ensure that the function's sole responsibility is to use the cgroup freezer for the process tree. This allows us to avoid a timeout error when checkpointing applications with large GPU state.

Suggested-by: Andrei Vagin <avagin@google.com>
Suggested-by: Jesus Ramos <jeramos@nvidia.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
24c158f to 66cb6de
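For illustration, a rough sketch of the reordering described in the commit message above, under assumed names rather than the literal CRIU diff: `collect_pstree()` only freezes and collects the process tree, and the `CHECKPOINT_DEVICES` plugin hook runs afterwards from the dump path, where a slow `cuda-checkpoint` no longer trips the freeze timeout.

```c
/*
 * Sketch of the reordering, not the actual criu/cr-dump.c code; the
 * surrounding function and the hook's argument list are simplified.
 */
static int dump_one_process_tree(pid_t pid)
{
	/*
	 * collect_pstree() is now limited to freezing the process tree
	 * (via the cgroup freezer) and collecting it.
	 */
	if (collect_pstree())
		return -1;

	/*
	 * The device checkpoint runs after the tree is frozen, outside the
	 * freeze/seize timeout, so checkpointing large GPU state no longer
	 * triggers the "Timeout reached" error shown earlier.
	 */
	if (run_plugins(CHECKPOINT_DEVICES, pid))
		return -1;

	/* ... continue with the regular dump ... */
	return 0;
}
```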
@avagin @jesus-ramos I've updated the pull request with this change, and I was able to confirm that CRIU no longer fails with a timeout error when checkpointing large models.