

Add support for handling training/validation OOMs gracefully #81

Merged — amorehead merged 2 commits into main from catch-oom on Mar 5, 2024

Conversation

amorehead (Collaborator) commented:
  • Adds single-GPU support for handling training/validation OOMs gracefully (a sketch of the general pattern appears below).
  • Note that multi-GPU support for OOM-catching, although in principle handled by this code, does not currently work with the latest release of Lightning and may lead to unpredictable stalls during training/validation forward (or backward) passes; this behavior may change with future updates to Lightning.
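
For context, here is a minimal sketch of the kind of single-GPU OOM-skipping pattern described above: catch the CUDA out-of-memory `RuntimeError` inside the step, free cached allocations, and return `None` so Lightning skips the optimizer step for that batch. The class, layer, and batch layout (`OOMTolerantModule`, a toy linear model on `(x, y)` pairs) are illustrative assumptions, not code from this PR.

```python
import torch
import lightning.pytorch as pl


class OOMTolerantModule(pl.LightningModule):
    """Toy module illustrating single-GPU OOM skipping (names are illustrative)."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        try:
            x, y = batch
            return torch.nn.functional.mse_loss(self.net(x), y)
        except RuntimeError as exc:
            # Only swallow CUDA out-of-memory errors; re-raise everything else.
            if "out of memory" in str(exc).lower():
                # Free cached allocations and return None so Lightning skips the
                # optimizer step for this batch. This is only safe on a single
                # device: under DDP, the other ranks would still expect a gradient
                # sync for this batch and training can stall, as noted above.
                torch.cuda.empty_cache()
                return None
            raise

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

Returning `None` to skip a batch relies on Lightning's automatic optimization; with manual optimization or multi-GPU strategies the skip must be coordinated across ranks, which is the limitation called out in the second bullet.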

amorehead changed the title from "Add single-GPU support for handling training/validation OOMs gracefully" to "Add support for handling training/validation OOMs gracefully" on Mar 5, 2024.
a-r-j (Owner) commented:


LGTM. Would you be able to link this conversation to any relevant discussions/issues with Lightning so we can track progress?

amorehead (Collaborator, Author) commented:

Sure thing. This Lightning issue seems most relevant and is actually where I got the initial implementation from: Lightning-AI/pytorch-lightning#5243 (comment)

amorehead merged commit b97cbe9 into main on Mar 5, 2024
2 checks passed
amorehead deleted the catch-oom branch on March 5, 2024