[docker] DeepSpeed image should contain nvcc #1710

Closed
vfdev-5 opened this issue Feb 26, 2021 · 4 comments · Fixed by #1711

Comments

@vfdev-5
Collaborator

vfdev-5 commented Feb 26, 2021

Currently, the "pytorchignite/msdp-apex:latest" Docker image cannot run the cifar10 DeepSpeed example. It fails with the following error:

...
    basic_optimizer = self._configure_basic_optimizer(model_parameters)                                                                                                                          
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 661, in _configure_basic_optimizer
    optimizer = FusedAdam(model_parameters, **optimizer_parameters)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 174, in load
    return self.jit_load(verbose)
...
  File "/opt/conda/lib/python3.8/subprocess.py", line 1702, in _execute_child         
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'
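
The traceback shows that the failure only surfaces when DeepSpeed tries to JIT-compile FusedAdam. A small preflight check run inside the image could report the missing compiler earlier. The following is a minimal sketch, not part of DeepSpeed or Ignite: the helper name `find_nvcc` and its `search_path` parameter are hypothetical, and it assumes the toolkit lives under `CUDA_HOME` or the default `/usr/local/cuda` seen in the error above.

```python
import os
import shutil


def find_nvcc(cuda_home=None, search_path=True):
    """Return the path to an executable nvcc, or None if none is found.

    Hypothetical helper for illustration only.
    """
    # Assumption: CUDA_HOME points at the toolkit root; otherwise fall back
    # to /usr/local/cuda, the location reported in the FileNotFoundError.
    cuda_home = cuda_home or os.environ.get("CUDA_HOME", "/usr/local/cuda")
    candidate = os.path.join(cuda_home, "bin", "nvcc")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Optionally fall back to whichever nvcc is on PATH, if any.
    if search_path:
        return shutil.which("nvcc")
    return None
```

Calling `find_nvcc()` in an image smoke test and failing the build when it returns `None` would catch this regression before any training script runs.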
@trsvchn
Collaborator

trsvchn commented Feb 26, 2021

Is it enough to add

RUN apt-get update && apt-get -qq install -y --no-install-recommends nvidia-cuda-toolkit

?
I have never tried installing it inside a container.

I'll try it tomorrow.

@vfdev-5
Collaborator Author

vfdev-5 commented Feb 26, 2021

I think this part of the Dockerfile:

apt-get remove -y g++ && \
apt-get autoremove -y && \
rm -rf /var/lib/apt/lists/*

removed nvcc, which should already be present in the PyTorch devel Docker image:
FROM pytorch/pytorch:${PTH_VERSION}-devel

vfdev-5 added the bug label and removed the enhancement label Feb 26, 2021
@trsvchn
Collaborator

trsvchn commented Feb 27, 2021

@vfdev-5 yeah, you're absolutely right!

Running this line

apt-get remove -y g++ && \

removes the compiler even before autoremove:

The following packages will be REMOVED:
  build-essential cuda-command-line-tools-10-1 cuda-compiler-10-1 cuda-cupti-10-1
  cuda-minimal-build-10-1 cuda-nvcc-10-1 g++

Can we just remove this line, or all of these lines?
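
The dependency chain above can be checked before a Dockerfile change is committed by doing a dry run and inspecting what apt plans to remove. The sketch below is illustrative only: the function names are hypothetical, it relies on `apt-get -s` (simulate, no changes made), and it assumes the output uses the same "The following packages will be REMOVED:" layout quoted above.

```python
import subprocess


def parse_removed_packages(apt_output):
    """Extract package names listed under apt's 'will be REMOVED' header."""
    removed = []
    in_section = False
    for line in apt_output.splitlines():
        if line.startswith("The following packages will be REMOVED"):
            in_section = True
            continue
        if in_section:
            # Package names sit on indented continuation lines.
            if line.startswith(" ") and line.strip():
                removed.extend(line.split())
            else:
                break
    return removed


def simulate_remove(package):
    """Dry-run `apt-get remove`; -s only simulates, nothing is uninstalled."""
    result = subprocess.run(
        ["apt-get", "-s", "remove", package],
        capture_output=True, text=True, check=True,
    )
    return parse_removed_packages(result.stdout)


def cuda_compiler_packages(removed):
    """Return the removed packages that provide the CUDA compiler."""
    prefixes = ("cuda-nvcc", "cuda-compiler")
    return [name for name in removed if name.startswith(prefixes)]
```

For the listing quoted above, `cuda_compiler_packages` would flag `cuda-compiler-10-1` and `cuda-nvcc-10-1`, which is the signal that removing `g++` is unsafe in this image.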

@vfdev-5
Collaborator Author

vfdev-5 commented Feb 27, 2021

Yes, let's just remove those lines.
