
How to set dshm size for training? #1044

Closed
Andrew-Su-0718 opened this issue Feb 29, 2024 · 3 comments · Fixed by #1104
Andrew-Su-0718 commented Feb 29, 2024

When I submit a PyTorchJob with arena, I couldn't find any parameter related to the shared memory size, which is very important for PyTorch training.

The size is fixed to 2Gi.

...
    - mountPath: /dev/shm
      name: dshm
...
...
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: dshm
...

Does anyone know how to set the dshm size?
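As a quick sanity check, the effective limit can be confirmed from inside a running worker container (assuming a standard Linux image where `df` is available):

```shell
# Show the size and usage of the shared-memory mount; the "Size" column
# reflects the emptyDir sizeLimit (2.0G with the default chart value).
df -h /dev/shm
```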

Andrew-Su-0718 (Author) commented


OK, I found a workaround. Modify the file /charts/pytorchjob/values.yaml, changing

shmSize: 2Gi

to

shmSize: 64Gi # or any size you want
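For reference, this chart value ends up rendered as the memory-backed emptyDir shown in the issue. A minimal standalone Pod equivalent (the pod and container names here are illustrative, not from the chart) would look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shm-demo          # illustrative name
spec:
  containers:
  - name: trainer         # illustrative name
    image: pytorch/pytorch:latest
    volumeMounts:
    - mountPath: /dev/shm # overrides the container's default 64 MiB shm
      name: dshm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory      # tmpfs-backed volume
      sizeLimit: 64Gi     # corresponds to the shmSize chart value
```

Note that a `medium: Memory` emptyDir counts against the container's memory limit, so the pod's memory requests/limits should be sized accordingly.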

yanshui177 commented

Same issue

Syulin7 (Collaborator) commented Jun 20, 2024

/assign
