cuda: enable checkpoint support for paused tasks #2517

@rst0git rst0git commented Nov 12, 2024

If a CUDA process is already in a "locked" or "checkpointed" state when criu dump runs, the CUDA plugin fails with an error because it attempts an unnecessary "lock" action using the cuda-checkpoint tool. This pull request extends the CUDA plugin to handle such cases: it first checks the original state of the CUDA process and skips the "lock" and "checkpoint" actions if the process was already locked or checkpointed before CRIU was invoked.
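The decision described above can be sketched in C roughly as follows. This is an illustration only, not the code from this pull request: the type and function names (`cuda_task_state_t`, `dump_plan_t`, `parse_cuda_task_state`, `plan_dump_actions`) and the lowercase state strings are hypothetical. The sketch also assumes that a "locked" task still needs the "checkpoint" action, while a "checkpointed" task needs neither.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical enum; the names mirror the states quoted in the PR text. */
typedef enum {
	CUDA_TASK_RUNNING,
	CUDA_TASK_LOCKED,
	CUDA_TASK_CHECKPOINTED,
	CUDA_TASK_UNKNOWN
} cuda_task_state_t;

/* Which actions the dump path would still have to perform. */
typedef struct {
	bool valid;           /* false if the initial state was not recognised */
	bool need_lock;       /* issue the "lock" action */
	bool need_checkpoint; /* issue the "checkpoint" action */
} dump_plan_t;

/* Map a state string to the enum. The string values are assumptions. */
static cuda_task_state_t parse_cuda_task_state(const char *s)
{
	if (s == NULL)
		return CUDA_TASK_UNKNOWN;
	if (strcmp(s, "running") == 0)
		return CUDA_TASK_RUNNING;
	if (strcmp(s, "locked") == 0)
		return CUDA_TASK_LOCKED;
	if (strcmp(s, "checkpointed") == 0)
		return CUDA_TASK_CHECKPOINTED;
	return CUDA_TASK_UNKNOWN;
}

/* Decide which actions to skip based on the state seen before the dump. */
static dump_plan_t plan_dump_actions(cuda_task_state_t initial)
{
	dump_plan_t plan = { .valid = true, .need_lock = true, .need_checkpoint = true };

	switch (initial) {
	case CUDA_TASK_RUNNING:
		/* Nothing was done beforehand: lock, then checkpoint. */
		break;
	case CUDA_TASK_LOCKED:
		/* Already locked: skip the lock (assumed: checkpoint is still needed). */
		plan.need_lock = false;
		break;
	case CUDA_TASK_CHECKPOINTED:
		/* Already checkpointed: skip both actions. */
		plan.need_lock = false;
		plan.need_checkpoint = false;
		break;
	default:
		/* Unrecognised state: report it instead of guessing. */
		plan.valid = false;
		plan.need_lock = false;
		plan.need_checkpoint = false;
		break;
	}
	return plan;
}
```

A caller would record the initial state once per task, run only the actions flagged in the returned plan, and keep the recorded state so that later steps can tell which actions CRIU itself performed.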

@rst0git rst0git force-pushed the 2024-11-12-cuda-checkpointing-paused-tasks branch 6 times, most recently from 03b51f9 to 8f544eb Compare November 12, 2024 17:00
@rst0git rst0git force-pushed the 2024-11-12-cuda-checkpointing-paused-tasks branch from 8f544eb to 030bc9a Compare November 12, 2024 20:11
@jesus-ramos jesus-ramos left a comment

LGTM

avagin commented Nov 12, 2024

Is there any real use-case for that?

@rst0git rst0git force-pushed the 2024-11-12-cuda-checkpointing-paused-tasks branch from 030bc9a to 58de16a Compare November 12, 2024 20:35
rst0git commented Nov 12, 2024

> Is there any real use-case for that?

The use-case is similar to #2514: CUDA tasks may already be in a "locked" or "checkpointed" state before criu dump is invoked, to ensure consistent checkpoint/restore, particularly in distributed model training where multiple containers run across different cluster nodes.

If a CUDA process is already in a "locked" or "checkpointed" state
during criu dump, the CUDA plugin currently fails with an error because
it attempts an unnecessary "lock" action using the cuda-checkpoint tool.

This patch extends the CUDA plugin to handle such cases by first
verifying the initial state of the CUDA processes and skipping
unnecessary "lock" and "checkpoint" actions when a process has been
locked or checkpointed before CRIU is invoked.

In particular, CUDA tasks may already be in a "locked" or "checkpointed"
state to ensure consistent checkpoint/restore for distributed workloads,
such as model training, where multiple containers run across different
cluster nodes.

Another use case for this functionality is optimizing resource
utilization, where low-priority CUDA tasks are preempted
immediately to release GPU resources needed by high-priority
tasks, and the paused workloads are later resumed or migrated
to another node.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
@rst0git rst0git force-pushed the 2024-11-12-cuda-checkpointing-paused-tasks branch from 58de16a to 20a7cfa Compare November 13, 2024 10:28
rst0git commented Nov 13, 2024

@avagin I've updated the commit message with a brief description of the use-cases for this functionality.

@avagin avagin merged commit dd6b580 into checkpoint-restore:criu-dev Nov 13, 2024
37 of 41 checks passed
@rst0git rst0git deleted the 2024-11-12-cuda-checkpointing-paused-tasks branch November 13, 2024 15:17