cuda: enable checkpoint support for paused tasks #2517
03b51f9 to 8f544eb
8f544eb to 030bc9a
LGTM
Is there any real use-case for that?
030bc9a to 58de16a
The use-case is similar to #2514 -- the CUDA tasks may be in a "locked" or "checkpointed" state before
If a CUDA process is already in a "locked" or "checkpointed" state during criu dump, the CUDA plugin currently fails with an error because it attempts an unnecessary "lock" action using the cuda-checkpoint tool. This patch extends the CUDA plugin to handle such cases by first verifying the initial state of the CUDA processes and skipping unnecessary "lock" and "checkpoint" actions when a process has been locked or checkpointed before CRIU is invoked.

In particular, CUDA tasks may already be in a "locked" or "checkpointed" state to ensure consistent checkpoint/restore for distributed workloads, such as model training, where multiple containers run across different cluster nodes. Another use case for this functionality is optimizing resource utilization, where low-priority CUDA tasks are preempted immediately to release GPU resources needed by high-priority tasks, and the paused workloads are later resumed or migrated to another node.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
58de16a to 20a7cfa
@avagin I've updated the commit message with a brief description of the use-cases for this functionality.
If a CUDA process is already in a "locked" or "checkpointed" state during criu dump, the CUDA plugin fails with an error because it attempts an unnecessary "lock" action using the cuda-checkpoint tool. This pull request extends the CUDA plugin to handle such cases by first checking the original state of the CUDA process and skipping unnecessary "lock" and "checkpoint" actions if the process was already locked or checkpointed before CRIU was invoked.
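
The skip logic described above can be illustrated with a small sketch. This is not the plugin's actual code (the CUDA plugin is written in C); it is a hypothetical Python model of the decision only. The names CudaState and actions_for_dump are invented for illustration, and "running" is an assumed name for the state of a task that has been neither locked nor checkpointed.

```python
from enum import Enum


class CudaState(Enum):
    """Hypothetical model of the states a CUDA task may be in at dump time."""
    RUNNING = "running"            # assumed name: neither locked nor checkpointed
    LOCKED = "locked"              # state named in the PR description
    CHECKPOINTED = "checkpointed"  # state named in the PR description


def actions_for_dump(initial_state):
    """Return the actions still needed to bring a task to the checkpointed state.

    Actions that were already performed before CRIU was invoked are skipped.
    """
    if initial_state is CudaState.RUNNING:
        return ["lock", "checkpoint"]
    if initial_state is CudaState.LOCKED:
        # Already locked: issuing "lock" again is the unnecessary action
        # that the PR description says caused the failure.
        return ["checkpoint"]
    if initial_state is CudaState.CHECKPOINTED:
        # Nothing left to do for this task.
        return []
    raise ValueError(f"unexpected CUDA state: {initial_state!r}")


if __name__ == "__main__":
    for state in CudaState:
        print(state.value, "->", actions_for_dump(state))
```

Recording the initial state also makes it possible to know which actions were performed by CRIU itself and which were performed by whoever paused the task beforehand.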