Trial won't run under Windows when gpuNum is set to 1 #1037
Comments
Thanks for raising this issue. Could you provide more details about your installation, along with nnimanager.log and the trial stderr? I cannot reproduce this error at the moment. |
nnimanager.log The trial folder doesn't exist. I installed nni with pip. |
Is that the complete nnimanager.log? Please make sure it is not missing a line such as 'Trial job kDj198ea status changed from WAITING to RUNNING' (or FAILED). What is your Python version, and is it 64-bit? Could you provide the experiment IP? |
Yes, it's the complete log. No lines are missing. The Python version: Python 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32 Is this the experiment IP?
|
I cannot open the URL. Can you run mnist.py on the GPU without nni? By the way, the URL is private, so you can edit your comment and delete it if you want. |
CUDA_VISIBLE_DEVICES = '-1' or CUDA_VISIBLE_DEVICES = '' means no GPU. However, if you set gpuNum = 0 in the yml, there is no need to set CUDA_VISIBLE_DEVICES. What is the command for your GPU experiment? |
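The effect of these CUDA_VISIBLE_DEVICES values can be sketched in Python. This is only a simplified model of how the CUDA runtime interprets numeric indices in the variable; it is not code from NNI or TensorFlow, and `visible_gpu_indices` and `total_gpus` are hypothetical names introduced for illustration. It assumes the documented CUDA behaviour that enumeration stops at the first invalid index, which is why '-1' hides every GPU.

```python
from typing import List, Optional


def visible_gpu_indices(env_value: Optional[str], total_gpus: int) -> List[int]:
    """Model which physical GPU indices a CUDA process would see.

    Hypothetical helper: handles numeric indices only (not GPU UUIDs).
    """
    if env_value is None:
        # Variable not set at all: every GPU is visible.
        return list(range(total_gpus))
    visible = []
    for token in env_value.split(","):
        token = token.strip()
        # An empty token, a negative number such as "-1", or an index that
        # is out of range is invalid; devices listed after it are ignored.
        if not token.isdigit() or int(token) >= total_gpus:
            break
        visible.append(int(token))
    return visible


if __name__ == "__main__":
    for value in (None, "", "-1", "0", "0,-1,1"):
        print(repr(value), "->", visible_gpu_indices(value, total_gpus=2))
```

Under this model, both '' and '-1' yield an empty list, which matches the "no GPU" behaviour described in the comment above.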
The run.ps1 under the trial folder sets CUDA_VISIBLE_DEVICES = '-1' if gpuNum = 0. But my data uses channels-first order. If I don't do so, TensorFlow will tell me that the CPU version does not support the NCHW format. The command is c:\Miniconda3\envs\tf-alpha\python.exe hyperparam_tunning.py |
Could you provide the run.ps1 file generated when running the GPU experiment? In my reproduction, |
I see. When you run the GPU version, is there a free GPU (one with no process running on it)? I want to check here. If there is no free GPU, the trial stays in the waiting state, looping until it finds one. I reproduced this problem by occupying the GPU with another process. A temporary workaround is to free up a GPU, after which the trial will run. Thanks. |
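The "free GPU" condition described above can be approximated with a short Python sketch. It is not NNI's scheduler code (the implementation linked in the comment is not reproduced in this thread); it only illustrates the stated rule that a GPU counts as free when no process is running on it. The helper names are hypothetical. The two `nvidia-smi` queries used are `--query-gpu=index,uuid` and `--query-compute-apps=gpu_uuid,pid`, both with `--format=csv,noheader`.

```python
import subprocess
from typing import Dict, List, Optional, Set


def parse_gpus(gpu_csv: str) -> Dict[str, int]:
    """Map GPU UUID -> index from lines such as '0, GPU-aaa'."""
    gpus = {}
    for line in gpu_csv.splitlines():
        if line.strip():
            index, uuid = (part.strip() for part in line.split(",", 1))
            gpus[uuid] = int(index)
    return gpus


def parse_busy_uuids(apps_csv: str) -> Set[str]:
    """Collect UUIDs of GPUs that have at least one compute process."""
    return {line.split(",", 1)[0].strip()
            for line in apps_csv.splitlines() if line.strip()}


def pick_free_gpus(gpu_csv: str, apps_csv: str, gpu_num: int) -> Optional[List[int]]:
    """Return gpu_num free GPU indices, or None if too few are free.

    None corresponds to the case where the trial would keep waiting.
    """
    busy = parse_busy_uuids(apps_csv)
    free = sorted(i for uuid, i in parse_gpus(gpu_csv).items() if uuid not in busy)
    return free[:gpu_num] if len(free) >= gpu_num else None


def query_nvidia_smi(query_flag: str) -> str:
    """Run nvidia-smi with one query flag and return its CSV output."""
    result = subprocess.run(
        ["nvidia-smi", query_flag, "--format=csv,noheader"],
        capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    gpus = query_nvidia_smi("--query-gpu=index,uuid")
    apps = query_nvidia_smi("--query-compute-apps=gpu_uuid,pid")
    print(pick_free_gpus(gpus, apps, gpu_num=1))
```

If this prints `None` on a single-GPU machine, some other process is already using the GPU, which is the situation in which the trial was reported to stay in the waiting state.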
I am having similar issues running with Anaconda. Is there a way I can insert In my case, with GPU the status is always "Waiting", and with CPU it is "Failed" after a few seconds, with nothing in the logs. Edit: activate.bat will not work in PowerShell |
I found it succeeded after typing in PowerShell:
Set-ExecutionPolicy RemoteSigned
You might want to use a safer approach:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
It didn't solve the problem for the GPU, status is "waiting" and GPU There is no In
Deeper:
Still doesn't fix.
|
Hi @crackcomm, have you tried this command |
Hey, @demianzhang, I can try that. The
When I run On the other hand, |
I did install after
EDIT: Currently trying with your implementation of |
@demianzhang Using your implementation from #1043 it starts to run trials! Great progress. It brings to light an issue with TensorFlow and cuDNN. One of the trial logs: Notice lines 57-537; there is indeed a cuDNN initialization failure, but the log is scrambled.
Tensorflow works just fine on GPU, what could be the reason it fails to initialize there? My GPU memory usage is |
It may be an environment problem. I have listed the possible solutions below. You can also try mnist_before.py (without nni) to check that TensorFlow itself works.
|
I tried 4. first: Before:
After:
I tried in cmd "Run as Administrator", and:
I already can feel the pain of reinstalling CuDNN. |
I tried to change every parameter like batch size, batch num, hidden size and it still overflows. |
I think it's most likely a cuDNN/cuBLAS version mismatch.
|
The log is broken as hell: https://gist.github.com/crackcomm/c60ba4a100e9a359624915b17ef90e0a |
My issue was solved with the help of @peterjc123 in another unrelated issue: pytorch/pytorch#20202 (what a nice issue id too) |
Glad to hear that. Thanks for trying the tools and persisting in solving the problem. The things related to the issue have improved too. |
@demianzhang Thank you for your help as well. |
Closing the issue per latest feedback from @crackcomm |
Short summary about the issue/question:
The trial won't run under Windows when gpuNum is set to 1. The trial keeps waiting. When I set gpuNum to 0, it works normally.
Brief what process you are following:
How to reproduce it:
My own code didn't work. I tried on the mnist example and it didn't work if gpuNum = 1.
nni Environment:
need to update document(yes/no):
Anything else we need to know: