
Trial won't run under Windows when gpuNum is set to 1 #1037

Closed
0x8A63F77D opened this issue Apr 30, 2019 · 27 comments

Comments

@0x8A63F77D

Short summary about the issue/question:
The trial won't run under Windows when gpuNum is set to 1; the trial keeps waiting. When I set gpuNum to 0, it works normally.
Brief what process you are following:

How to reproduce it:
My own code didn't run. I also tried the mnist example, and it doesn't work when gpuNum = 1.

nni Environment:

  • nni version: 0.7
  • nni mode (local|pai|remote): local
  • OS: Windows 10
  • python version: 3.7
  • is conda or virtualenv used?: miniconda
  • is running in docker?: no

need to update document(yes/no):

Anything else we need to know:

@demianzhang
Contributor

Thanks for raising this issue. Could you provide more details about your installation, along with nnimanager.log and the trial stderr? I cannot reproduce this error at the moment.

@0x8A63F77D
Author

nnimanager.log
[2019-4-30 14:26:29] INFO [ 'Datastore initialization done' ]
[2019-4-30 14:26:29] INFO [ 'Rest server listening on: http://0.0.0.0:8080' ]
[2019-4-30 14:26:29] INFO [ 'Construct local machine training service.' ]
[2019-4-30 14:26:29] INFO [ 'RestServer start' ]
[2019-4-30 14:26:29] INFO [ 'RestServer base port is 8080' ]
[2019-4-30 14:26:33] INFO [ 'NNIManager setClusterMetadata, key: trial_config, value: {"command":"c:\\Miniconda3\\envs\\tensorflow\\python.exe mnist.py","codeDir":"D:\\Users\\0x8A63F77D\\Documents\\GitHub\\nni\\examples\\trials\\mnist\\.","gpuNum":1}' ]
[2019-4-30 14:26:33] INFO [ 'required GPU number is 1' ]
[2019-4-30 14:26:35] INFO [ 'Starting experiment: kDj198ea' ]
[2019-4-30 14:26:35] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[2019-4-30 14:26:35] INFO [ 'Add event listeners' ]
[2019-4-30 14:26:35] INFO [ 'Run local machine training service.' ]
[2019-4-30 14:26:36] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[2019-4-30 14:26:36] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"dropout_rate": 0.6742607414411488, "conv_size": 2, "hidden_size": 1024, "batch_size": 8, "learning_rate": 0.001}}' ]
[2019-4-30 14:26:40] INFO [ 'submitTrialJob: form: {"jobType":"TRIAL","hyperParameters":{"value":"{\"parameter_id\": 0, \"parameter_source\": \"algorithm\", \"parameters\": {\"dropout_rate\": 0.6742607414411488, \"conv_size\": 2, \"hidden_size\": 1024, \"batch_size\": 8, \"learning_rate\": 0.001}}","index":0}}' ]

The trial folder doesn't exist.

I installed nni with pip.

@demianzhang
Contributor

Is that the complete nnimanager.log? Please make sure it is not missing a line like [ 'Trial job kDj198ea status changed from WAITING to RUNNING' ] or FAILED. What is your Python version, and is it 64-bit? Could you provide the experiment IP?

@0x8A63F77D
Author

Yes it's the complete log. No missing lines.

The python version: Python 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32

Is this the experiment IP?

The experiment id is PuQDTCUx
The Web UI urls are: http://192.168.56.1:8080 http://10.0.0.2:8080 http://127.0.0.1:8080

@demianzhang
Contributor

I cannot open the URL. Can you run mnist.py on the GPU without nni? By the way, the URL is private information, so you can edit your comment and delete it if you want.

@0x8A63F77D
Author

0x8A63F77D commented Apr 30, 2019

Well, the machine is behind a firewall and I have no rights to change the firewall settings.

My own code runs without error when I set gpuNum = 0 and set CUDA_VISIBLE_DEVICES = 0 in code.

By the way, the example code in the nni repository needs some modifications to run without nni.

mnist_before.py runs without errors.

@demianzhang
Contributor

CUDA_VISIBLE_DEVICES='-1' or CUDA_VISIBLE_DEVICES='' means no GPU is visible. However, if you set gpuNum = 0 in the yml, there is no need to set CUDA_VISIBLE_DEVICES yourself. What is the command for your GPU experiment?
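
For reference, a minimal Python sketch (assuming TensorFlow 1.x; this is not NNI's own code) of how the variable masks GPUs:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # "-1" or "" hides all GPUs
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0

# Import TensorFlow only after the variable is set, otherwise the mask has no effect.
from tensorflow.python.client import device_lib

# With "-1" only CPU devices are listed; with "0" a single GPU appears as well.
print([d.name for d in device_lib.list_local_devices()])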

@0x8A63F77D
Author

The run.ps1 under the trial folder sets CUDA_VISIBLE_DEVICES = '-1' when gpuNum = 0. But my data uses channels-first order, and if I don't override the variable, TensorFlow tells me that the CPU version does not support the NCHW format.
The problem is:
gpuNum = 1 -> no trial folder under the experiment folder -> the trial keeps waiting in the web UI
gpuNum = 0 -> trials run with CUDA_VISIBLE_DEVICES = '-1' -> I set CUDA_VISIBLE_DEVICES = '0' in code -> everything is fine

The command is c:\Miniconda3\envs\tf-alpha\python.exe hyperparam_tunning.py

@demianzhang
Contributor

Could you provide the run.ps1 file from a GPU experiment? In my reproduction, $env:CUDA_VISIBLE_DEVICES="0" is already set in run.ps1 when a GPU is used.

@0x8A63F77D
Author

I can't run a GPU experiment the normal way (setting gpuNum = 1 in the yml). No trial folder is created. That is the problem I am facing.

@demianzhang
Contributor

demianzhang commented Apr 30, 2019

I see. When you run the GPU version, is there a free GPU (one with no process running on it)? I want to check here. If there is no free GPU, the trial will stay WAITING in a loop until one is found. I can reproduce this problem when the GPU is occupied by another process. A temporary workaround is to free up a GPU; then it will work. Thanks.
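
To illustrate the kind of check involved (a rough sketch based on nvidia-smi's XML output, not NNI's actual implementation), a GPU is treated as free only if no process is running on it:

import subprocess
import xml.etree.ElementTree as ET

def free_gpu_indices():
    # `nvidia-smi -q -x` prints one <gpu> element per device.
    xml = subprocess.check_output(["nvidia-smi", "-q", "-x"]).decode()
    root = ET.fromstring(xml)
    free = []
    for index, gpu in enumerate(root.findall("gpu")):
        procs = gpu.find("processes")
        n = 0 if procs is None else len(procs.findall("process_info"))
        if n == 0:
            free.append(index)
    return free

print("GPUs with no running process:", free_gpu_indices())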

@0x8A63F77D
Author

The GPU utilization is almost 0, and I'm sure I didn't start any other compute applications that use the GPU. But nvidia-smi shows a lot of apps using the GPU, and I can't shut them all down because some of them are system processes.

@crackcomm

crackcomm commented May 4, 2019

I am having similar issues running with Anaconda. Is there a way I can insert activate base into the .ps1 file?

In my case, with GPU the trial is always "Waiting", and with CPU it is "Failed" after a few seconds, with nothing in the logs.

Edit:

activate.bat will not work in PowerShell

@crackcomm

crackcomm commented May 4, 2019

I found that it succeeded after typing this in PowerShell:

Set-ExecutionPolicy RemoteSigned

You might want to use a safer approach:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

It didn't solve the problem for the GPU: the status is "waiting" and GPU Compute_0 is at 0%.

There is no trials directory.

In the nnimanager logs this one line repeats a lot:

[5/4/2019, 6:47:10 PM] ERROR [ 'Could not get gpu metrics information!' ]

Deeper:

> python -m nni_gpu_tool.gpu_metrics_collector
'nvidia-smi' is not recognized as an internal or external command,
operable program or batch file.
job exporter error <class 'subprocess.CalledProcessError'>
job exporter error Command 'nvidia-smi -q -x' returned non-zero exit status 1.
job exporter error <traceback object at 0x000001C23571EA88>
set PATH=%PATH%;C:\Program Files\NVIDIA Corporation\NVSMI

Adding NVSMI to PATH still doesn't fix it, even though the gpu_metrics file is now being updated.

[5/4/2019, 7:30:22 PM] ERROR [ 'Could not get gpu metrics information!' ]
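
A quick diagnostic sketch (just an assumption that the collector shells out to nvidia-smi): check whether nvidia-smi is reachable from the Python process itself, since a PATH change made in one shell does not affect processes nni has already started.

import shutil
import subprocess

path = shutil.which("nvidia-smi")
print("nvidia-smi found at:", path)

if path is None:
    # Typical default install location on Windows; adjust if yours differs.
    print(r"Try adding C:\Program Files\NVIDIA Corporation\NVSMI to the system PATH.")
else:
    out = subprocess.check_output([path, "-q", "-x"])
    print("nvidia-smi -q -x returned", len(out), "bytes of XML")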

@demianzhang
Contributor

demianzhang commented May 5, 2019

Hi @crackcomm, have you tried the command Set-ExecutionPolicy -ExecutionPolicy Unrestricted before installing?
As for the GPU problem, could you check the gpu_metrics file in C:\Users\**\AppData\Local\Temp\nni\script\gpu_metrics? If it is blank, the 'nvidia-smi -q -x' command is not running successfully. Please make sure your GPU device and driver are installed correctly and that 'nvidia-smi -q -x' returns the GPU information normally.

@crackcomm

crackcomm commented May 5, 2019

Hey, @demianzhang, I can try that.

The gpu_metrics file does work, and I even checked whether the execTail function works on Windows; it does return the last line of the file.
The file is updated every few seconds, but the nnimanager log now contains only one line of "Could not get GPU metrics information", whereas before (when nvidia-smi was not found) there were a dozen.

gpu_metrics: 50kb, last line:

{"Timestamp": "Sat May  4 20:00:04 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 10, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}

When I run nvidia-smi -q -x I get a few thousand lines of XML.

On the other hand, the nvidia-smi failure was silent and really should be logged.
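
For what it's worth, a small sketch of such a tail check (file location and record format taken from the comments above; this is not NNI's code):

import json
import os

# Path assumed from the earlier comment; substitute your own user directory if it differs.
metrics_path = os.path.expandvars(r"%LOCALAPPDATA%\Temp\nni\script\gpu_metrics")

with open(metrics_path) as f:
    last_line = f.read().splitlines()[-1]

record = json.loads(last_line)
for gpu in record["gpuInfos"]:
    print("GPU", gpu["index"],
          "util:", gpu["gpuUtil"], "%",
          "mem:", gpu["gpuMemUtil"], "%",
          "processes:", gpu["activeProcessNum"])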

@crackcomm

crackcomm commented May 5, 2019

I did install after running Set-ExecutionPolicy -ExecutionPolicy Unrestricted, and:

gpu_metrics:

{"Timestamp": "Sun May  5 16:21:02 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:07 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "5", "gpuUtil": "8", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:12 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:18 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "4", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:23 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "5", "gpuUtil": "5", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:28 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:33 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "7", "gpuUtil": "11", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:38 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:43 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "4", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:49 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "8", "gpuUtil": "13", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:54 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "4", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:59 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:04 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:09 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:14 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:20 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:25 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:30 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "5", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:35 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "4", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:40 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:45 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:50 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "4", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:56 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "5", "gpuUtil": "7", "index": 0}]}

nnimanager.log

[5/5/2019, 4:20:57 PM] INFO [ 'Datastore initialization done' ]
[5/5/2019, 4:20:57 PM] INFO [ 'Rest server listening on: http://0.0.0.0:8080' ]
[5/5/2019, 4:20:57 PM] INFO [ 'Construct local machine training service.' ]
[5/5/2019, 4:20:57 PM] INFO [ 'RestServer start' ]
[5/5/2019, 4:20:57 PM] INFO [ 'RestServer base port is 8080' ]
[5/5/2019, 4:21:00 PM] INFO [ 'NNIManager setClusterMetadata, key: trial_config, value: {"command":"python mnist.py","codeDir":"C:\\\\Users\\\\Pah\\\\python\\\\nni\\\\examples\\\\trials\\\\mnist\\\\.","gpuNum":1}' ]
[5/5/2019, 4:21:00 PM] INFO [ 'required GPU number is 1' ]
[5/5/2019, 4:21:01 PM] INFO [ 'Starting experiment: NbMXDFrs' ]
[5/5/2019, 4:21:01 PM] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[5/5/2019, 4:21:01 PM] INFO [ 'Add event listeners' ]
[5/5/2019, 4:21:01 PM] INFO [ 'Run local machine training service.' ]
[5/5/2019, 4:21:02 PM] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[5/5/2019, 4:21:02 PM] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"dropout_rate": 0.5958430967093737, "conv_size": 2, "hidden_size": 124, "batch_size": 32, "learning_rate": 0.001}}' ]
[5/5/2019, 4:21:06 PM] INFO [ 'submitTrialJob: form: {"jobType":"TRIAL","hyperParameters":{"value":"{\\"parameter_id\\": 0, \\"parameter_source\\": \\"algorithm\\", \\"parameters\\": {\\"dropout_rate\\": 0.5958430967093737, \\"conv_size\\": 2, \\"hidden_size\\": 124, \\"batch_size\\": 32, \\"learning_rate\\": 0.001}}","index":0}}' ]
(base) > nnictl log stderr
-----------------------------------------------------------------------
                Experiment start time 2019-05-05 16:20:53
-----------------------------------------------------------------------
(node:5520) ExperimentalWarning: The fs.promises API is experimental


(base) > nnictl log stdout
-----------------------------------------------------------------------
                Experiment start time 2019-05-05 16:20:53
-----------------------------------------------------------------------

EDIT: Currently trying with your implementation of the getAvailableGPUIndices function.

@crackcomm

crackcomm commented May 5, 2019

@demianzhang Using your implementation from #1043, it starts to run trials! Great progress.

It brings to light an issue with TensorFlow and cuDNN.

Logs from one of the trials:
trials: https://gist.github.com/crackcomm/5227f0ec455f434dc4cd7aadbf5dee6e
stderr: https://gist.github.com/crackcomm/938b93bd5e0deee5df43b6b978ac49d9

Notice lines 57-537: there is indeed a cuDNN initialization failure, but the log is scrambled.

2019-05-05 17:30:36.993255: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally 
2019-05-05 17:30:37.259571: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:37.260302: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:37.261223: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:37.261707: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:37.262430: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:37.262892: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:38.133564: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR 
2019-05-05 17:30:39.070983: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

TensorFlow works just fine on the GPU on its own, so what could be the reason it fails to initialize here? My GPU memory usage is 0.9/2.0 GB.

@demianzhang
Contributor

Maybe it is an environment problem. I list some possible solutions below; you can also try mnist_before.py (without nni) to check TensorFlow itself. A sketch of one commonly suggested workaround (letting TensorFlow grow GPU memory on demand) follows after the list.

  1. try the pip cuDNN version
  2. run TensorFlow as administrator/root (issue)
  3. remove the cache folder (stackoverflow)
  4. if mnist_before.py also has this error, please try this (stackoverflow)
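
A minimal sketch of that workaround (assuming TensorFlow 1.x, as in the logs above; not the mnist example's exact code): let TensorFlow allocate GPU memory on demand instead of grabbing nearly all of it up front, which on a 2 GB card often shows up as CUBLAS_STATUS_ALLOC_FAILED / CUDNN_STATUS_INTERNAL_ERROR.

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow GPU memory usage as needed
# config.gpu_options.per_process_gpu_memory_fraction = 0.5  # or cap it instead

with tf.Session(config=config) as sess:
    # build and run the model inside this session, e.g. the mnist graph
    pass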

@crackcomm

I tried 4. first:

Before:

2019-05-06 19:48:58.903538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1378 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950, pci bus id: 0000:02:00.0, compute capability: 5.2)
2019-05-06 19:49:00.334877: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-05-06 19:49:02.183060: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.201263: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.246908: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.250334: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.253934: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.258173: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.828163: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-05-06 19:49:03.127638: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[node conv1/Conv2D (defined at mnist_before.py:127) ]]

After:

2019-05-06 19:50:09.249347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1378 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950, pci bus id: 0000:02:00.0, compute capability: 5.2)
2019-05-06 19:50:09.896994: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-05-06 19:50:15.874104: E tensorflow/stream_executor/cuda/cuda_driver.cc:981] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-05-06 19:50:17.115978: E tensorflow/stream_executor/cuda/cuda_timer.cc:55] Internal: error destroying CUDA event in context 00000296210E2F80: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-05-06 19:50:17.121858: E tensorflow/stream_executor/cuda/cuda_timer.cc:60] Internal: error destroying CUDA event in context 00000296210E2F80: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-05-06 19:50:17.126914: F tensorflow/stream_executor/cuda/cuda_dnn.cc:194] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.

I tried in cmd with "Run as Administrator", and:

2019-05-06 19:55:37.006186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1378 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950, pci bus id: 0000:02:00.0, compute capability: 5.2)
2019-05-06 19:55:37.498716: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-05-06 19:55:38.860549: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 747.68MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-06 19:55:38.878328: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-06 19:55:38.882719: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.34GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-06 19:55:38.886318: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.32GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-06 19:55:48.892565: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 239.26MiB.  Current allocation summary follows.
2019-05-06 19:55:48.894966: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):   Total Chunks: 28, Chunks in use: 28. 7.0KiB allocated for chunks. 7.0KiB in use in bin. 2.1KiB client-requested in use in bin.
2019-05-06 19:55:48.897580: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.900013: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1024):  Total Chunks: 1, Chunks in use: 1. 1.3KiB allocated for chunks. 1.3KiB in use in bin. 1.0KiB client-requested in use in bin.
2019-05-06 19:55:48.903058: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2048):  Total Chunks: 4, Chunks in use: 4. 13.0KiB allocated for chunks. 13.0KiB in use in bin. 12.5KiB client-requested in use in bin.
2019-05-06 19:55:48.905736: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4096):  Total Chunks: 5, Chunks in use: 5. 20.0KiB allocated for chunks. 20.0KiB in use in bin. 20.0KiB client-requested in use in bin.
2019-05-06 19:55:48.908380: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8192):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.911283: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16384):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.913845: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (32768):         Total Chunks: 4, Chunks in use: 4. 160.0KiB allocated for chunks. 160.0KiB in use in bin. 160.0KiB client-requested in use in bin.
2019-05-06 19:55:48.916615: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (65536):         Total Chunks: 1, Chunks in use: 1. 78.3KiB allocated for chunks. 78.3KiB in use in bin. 78.1KiB client-requested in use in bin.
2019-05-06 19:55:48.919244: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (131072):        Total Chunks: 4, Chunks in use: 4. 835.8KiB allocated for chunks. 835.8KiB in use in bin. 800.0KiB client-requested in use in bin.
2019-05-06 19:55:48.922591: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (262144):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.925083: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (524288):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.927573: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1048576):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.930526: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2097152):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.932988: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4194304):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.935452: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8388608):       Total Chunks: 1, Chunks in use: 1. 12.25MiB allocated for chunks. 12.25MiB in use in bin. 12.25MiB client-requested in use in bin.
2019-05-06 19:55:48.938745: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16777216):      Total Chunks: 4, Chunks in use: 3. 82.04MiB allocated for chunks. 51.75MiB in use in bin. 36.75MiB client-requested in use in bin.
2019-05-06 19:55:48.941508: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (33554432):      Total Chunks: 1, Chunks in use: 0. 33.62MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.944123: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (67108864):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.947065: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.949558: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (268435456):     Total Chunks: 1, Chunks in use: 1. 1.00GiB allocated for chunks. 1.00GiB in use in bin. 957.03MiB client-requested in use in bin.
2019-05-06 19:55:48.952248: I tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 239.26MiB was 128.00MiB, Chunk State:
2019-05-06 19:55:48.953843: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EC0000 of size 256
2019-05-06 19:55:48.955265: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EC0100 of size 256
2019-05-06 19:55:48.961166: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EC0200 of size 256
2019-05-06 19:55:48.962717: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EC0300 of size 40960
2019-05-06 19:55:48.964183: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01ECA300 of size 4096
2019-05-06 19:55:48.965620: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01ECB300 of size 4096
2019-05-06 19:55:48.967071: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01ECC300 of size 256
2019-05-06 19:55:48.968464: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01ECC400 of size 256
2019-05-06 19:55:48.970554: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01ECC500 of size 204800
2019-05-06 19:55:48.972364: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFE500 of size 256
2019-05-06 19:55:48.973793: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFE600 of size 256
2019-05-06 19:55:48.975196: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFE700 of size 3328
2019-05-06 19:55:48.976630: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFF400 of size 256
2019-05-06 19:55:48.978496: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFF500 of size 256
2019-05-06 19:55:48.979939: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFF600 of size 1280
2019-05-06 19:55:48.981349: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFFB00 of size 256
2019-05-06 19:55:48.982755: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFFC00 of size 256
2019-05-06 19:55:48.984154: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFFD00 of size 256
2019-05-06 19:55:48.985641: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFFE00 of size 40960
2019-05-06 19:55:48.987757: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F09E00 of size 40960
2019-05-06 19:55:48.989393: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F13E00 of size 40960
2019-05-06 19:55:48.991142: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F1DE00 of size 4096
2019-05-06 19:55:48.993055: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F1EE00 of size 4096
2019-05-06 19:55:48.994776: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F1FE00 of size 4096
2019-05-06 19:55:48.998344: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F20E00 of size 256
2019-05-06 19:55:49.000128: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F20F00 of size 256
2019-05-06 19:55:49.001898: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F21000 of size 256
2019-05-06 19:55:49.003545: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F21100 of size 204800
2019-05-06 19:55:49.005417: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F53100 of size 204800
2019-05-06 19:55:49.009131: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F85100 of size 241408
2019-05-06 19:55:49.010913: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01FC0000 of size 16777216
2019-05-06 19:55:49.012641: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D02FC0000 of size 16777216
2019-05-06 19:55:49.014359: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D03FC0000 of size 12845056
2019-05-06 19:55:49.016086: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D04C00000 of size 20709376
2019-05-06 19:55:49.019472: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC0000 of size 256
2019-05-06 19:55:49.021280: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC0100 of size 256
2019-05-06 19:55:49.023151: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC0200 of size 256
2019-05-06 19:55:49.024805: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC0300 of size 3328
2019-05-06 19:55:49.028261: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC1000 of size 3328
2019-05-06 19:55:49.029992: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC1D00 of size 3328
2019-05-06 19:55:49.031675: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2A00 of size 256
2019-05-06 19:55:49.033352: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2B00 of size 256
2019-05-06 19:55:49.035026: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2C00 of size 256
2019-05-06 19:55:49.038615: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2D00 of size 256
2019-05-06 19:55:49.040530: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2E00 of size 256
2019-05-06 19:55:49.042363: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2F00 of size 256
2019-05-06 19:55:49.044052: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC3000 of size 256
2019-05-06 19:55:49.045738: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC3100 of size 256
2019-05-06 19:55:49.049285: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC3200 of size 256
2019-05-06 19:55:49.050999: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free  at 0000000D05FC3300 of size 31760128
2019-05-06 19:55:49.052742: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D07E0D200 of size 256
2019-05-06 19:55:49.054589: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D07E0D300 of size 80128
2019-05-06 19:55:49.056468: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free  at 0000000D07E20C00 of size 35255296
2019-05-06 19:55:49.060105: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D0A4C0000 of size 1073741824
2019-05-06 19:55:49.061919: I tensorflow/core/common_runtime/bfc_allocator.cc:638]      Summary of in-use Chunks by size:
2019-05-06 19:55:49.063699: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 28 Chunks of size 256 totalling 7.0KiB
2019-05-06 19:55:49.065377: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 1280 totalling 1.3KiB
2019-05-06 19:55:49.068794: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 3328 totalling 13.0KiB
2019-05-06 19:55:49.070512: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 4096 totalling 20.0KiB
2019-05-06 19:55:49.072294: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 40960 totalling 160.0KiB
2019-05-06 19:55:49.074035: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 80128 totalling 78.3KiB
2019-05-06 19:55:49.075745: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 204800 totalling 600.0KiB
2019-05-06 19:55:49.079279: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 241408 totalling 235.8KiB
2019-05-06 19:55:49.081033: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 12845056 totalling 12.25MiB
2019-05-06 19:55:49.082772: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 16777216 totalling 32.00MiB
2019-05-06 19:55:49.084531: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 20709376 totalling 19.75MiB
2019-05-06 19:55:49.086291: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 1073741824 totalling 1.00GiB
2019-05-06 19:55:49.089882: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 1.06GiB
2019-05-06 19:55:49.091615: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:                  1445835571
InUse:                  1141992704
MaxInUse:               1173752832
NumAllocs:                     134
MaxAllocSize:           1073741824

2019-05-06 19:55:49.098856: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ******__*__************************************************************************************xxxxx
2019-05-06 19:55:49.101753: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at pooling_ops_common.cc:148 : Resource exhausted: OOM when allocating tensor with shape[10000,32,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
OOM when allocating tensor with shape[10000,32,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node pool1/MaxPool (defined at mnist_before.py:133) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[node accuracy/Mean (defined at mnist_before.py:122) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'pool1/MaxPool', defined at:
totalMemory: 2.00GiB freeMemory: 1.64GiB

I can already feel the pain of reinstalling cuDNN.

@crackcomm

I tried changing every parameter (batch size, batch num, hidden size) and it still runs out of memory.
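
One observation from the OOM log above: the failing tensor shape [10000, 32, 14, 14] suggests the whole 10000-image test set is pushed through the network at once during evaluation, which would OOM on a 2 GB card no matter what training batch size is used. A hedged sketch of evaluating in chunks instead (not the mnist example's actual code; the names x, y_, keep_prob, accuracy are placeholder assumptions):

def batched_accuracy(sess, accuracy, x, y_, keep_prob, images, labels, batch=500):
    """Average the accuracy tensor over the test set in small batches."""
    total, count = 0.0, 0
    for start in range(0, len(images), batch):
        xs = images[start:start + batch]
        ys = labels[start:start + batch]
        acc = sess.run(accuracy, feed_dict={x: xs, y_: ys, keep_prob: 1.0})
        total += acc * len(xs)
        count += len(xs)
    return total / count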

@crackcomm

crackcomm commented May 6, 2019

I think it's most likely a cuDNN/cuBLAS version mismatch.

2019-05-06 19:50:17.126914: F tensorflow/stream_executor/cuda/cuda_dnn.cc:194] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.
2019-05-06 19:50:09.896994: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally

@crackcomm

[image: false positives]

The log is broken as hell: https://gist.github.com/crackcomm/c60ba4a100e9a359624915b17ef90e0a

@crackcomm

crackcomm commented May 7, 2019

My issue was solved with the help of @peterjc123 in another unrelated issue: pytorch/pytorch#20202 (what a nice issue id too)

@demianzhang
Contributor

Glad to hear that. Thanks for trying the tool and sticking with the problem until it was solved. The related issues on our side are getting better too.

@crackcomm

@demianzhang Thank you for your help as well.

@scarlett2018
Member

Closing the issue per the latest feedback from @crackcomm.
