
Inconsistent OOMs occur during long-running jobs #2

Open
knagrecha opened this issue Feb 14, 2022 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@knagrecha
Owner

knagrecha commented Feb 14, 2022

Problem:

Because the Pilot partitioner's memory estimation is inexact, it often underestimates the memory cost of a minibatch pass. During training, the model then exceeds its allocated memory bounds and errors out. Typically this occurs during the backward pass.

Quick fix: increase the double-buffer space, which shrinks shard sizes and guarantees more free headroom.
Longer-term fix: replace the Pilot partitioner with a more exact algorithm, or one that does not push up against the memory limits.
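The quick fix above could be sketched as a padding step on the partitioner's memory budget. This is a hypothetical illustration, not Hydra's actual API: the names `padded_shard_budget` and `SAFETY_FACTOR`, and the byte figures, are assumptions, and the real partitioner's interface may differ.

```python
# Hypothetical sketch of the quick fix: reserve a larger double-buffer
# region and pad the remaining budget with a safety factor, so that a
# shard whose cost was underestimated still fits at runtime.

SAFETY_FACTOR = 1.25  # assume estimates can run ~25% low (tunable)

def padded_shard_budget(gpu_memory_bytes: int,
                        double_buffer_bytes: int) -> int:
    """Memory budget a shard must fit in after reserving buffer space.

    Subtracts the double-buffer reservation, then divides by the safety
    factor so the partitioner cuts smaller shards than a raw estimate
    would allow.
    """
    usable = gpu_memory_bytes - double_buffer_bytes
    return int(usable / SAFETY_FACTOR)

# Example: a 16 GiB GPU with 2 GiB reserved for double buffering.
budget = padded_shard_budget(16 * 2**30, 2 * 2**30)
```

Increasing `double_buffer_bytes` (or `SAFETY_FACTOR`) trades some throughput for a lower chance of mid-job OOMs, which matches the spirit of the quick fix while the longer-term partitioner rework lands.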

@knagrecha knagrecha added the bug Something isn't working label Feb 14, 2022
@knagrecha knagrecha self-assigned this Feb 14, 2022
@csci-acct

This is affecting a recent job. Is there a fix?
