cuda: prevent task lockup on timeout error
When creating a checkpoint of large models, the `checkpoint` action of
`cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail
with the following error, leaving the CUDA task in a locked state:

	cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
	Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
	Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
	Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
	net: Unlock network
	cuda_plugin: finished cuda_plugin stage 0 err -1
	cuda_plugin: resuming devices on pid 84145
	cuda_plugin: Restore thread pid 84202 found for real pid 84145
	Unfreezing tasks into 1
		Unseizing 84145 into 1
	Error (criu/cr-dump.c:2111): Dumping FAILED.

To fix this, set `task_info->checkpointed = 1` before invoking the
`checkpoint` action instead of after it succeeds. This ensures that
the CUDA task is resumed even if CRIU times out during the action.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
rst0git committed Dec 14, 2024
1 parent d46cbf7 commit 24c158f
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions plugins/cuda/cuda_plugin.c
@@ -391,14 +391,14 @@ int cuda_plugin_checkpoint_devices(int pid)
 	if (resume_restore_thread(restore_tid, &save_sigset)) {
 		return -1;
 	}

+	task_info->checkpointed = 1;
 	status = cuda_process_checkpoint_action(pid, ACTION_CHECKPOINT, 0, msg_buf, sizeof(msg_buf));
 	if (status) {
 		pr_err("CHECKPOINT_DEVICES failed with %s\n", msg_buf);
 		goto interrupt;
 	}

-	task_info->checkpointed = 1;

 interrupt:
 	int_ret = interrupt_restore_thread(restore_tid, &save_sigset);
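
The effect of moving the assignment can be illustrated with a small standalone sketch. It is not the plugin's code: `struct task_state`, `slow_checkpoint()`, `resume_task()` and both `checkpoint_flag_*()` helpers are hypothetical names. It also assumes that the resume path only acts on tasks whose `checkpointed` flag is set, which is inferred from the commit message rather than shown in this diff.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the plugin's per-task bookkeeping. */
struct task_state {
	bool checkpointed; /* tells the resume path there is work to undo */
	bool resumed;      /* set by the resume path so the result can be inspected */
};

/* Hypothetical slow action; a nonzero return models the interrupted (timed out) call. */
static int slow_checkpoint(bool time_out)
{
	return time_out ? -1 : 0;
}

/* Assumed resume path: it skips tasks that were never flagged as checkpointed. */
static void resume_task(struct task_state *t)
{
	if (t->checkpointed)
		t->resumed = true;
}

/* Ordering before the commit: the flag is set only after the action succeeds. */
static int checkpoint_flag_after(struct task_state *t, bool time_out)
{
	int status = slow_checkpoint(time_out);

	if (status)
		return status; /* flag stays unset, so resume_task() does nothing */
	t->checkpointed = true;
	return 0;
}

/* Ordering after the commit: the flag is set first, so a failed action still gets resumed. */
static int checkpoint_flag_before(struct task_state *t, bool time_out)
{
	t->checkpointed = true;
	return slow_checkpoint(time_out);
}
```

With a simulated timeout, `checkpoint_flag_after()` leaves the flag unset and `resume_task()` skips the task, while `checkpoint_flag_before()` lets `resume_task()` run. The trade-off is that the flag may now be set for a task whose checkpoint never completed, so the resume path has to tolerate that case.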
