
Trial won't run under Windows when gpuNum is set to 1 #1037

Closed
0x8A63F77D opened this issue Apr 30, 2019 · 27 comments

Comments

@0x8A63F77D

Short summary about the issue/question:
The trial won't run under Windows when gpuNum is set to 1; the trial keeps waiting. When I set gpuNum to 0, it works normally.
Brief what process you are following:

How to reproduce it:
My own code didn't run. I also tried the mnist example, and it doesn't work when gpuNum = 1.

nni Environment:

  • nni version: 0.7
  • nni mode (local|pai|remote): local
  • OS: Windows 10
  • python version: 3.7
  • is conda or virtualenv used?: miniconda
  • is running in docker?: no

need to update document(yes/no):

Anything else we need to know:

@demianzhang
Contributor

Thanks for raising this issue. Could you provide more details about your installation, along with nnimanager.log and the trial stderr? I cannot reproduce this error at the moment.

@0x8A63F77D
Author

nnimanager.log
[2019-4-30 14:26:29] INFO [ 'Datastore initialization done' ]
[2019-4-30 14:26:29] INFO [ 'Rest server listening on: http://0.0.0.0:8080' ]
[2019-4-30 14:26:29] INFO [ 'Construct local machine training service.' ]
[2019-4-30 14:26:29] INFO [ 'RestServer start' ]
[2019-4-30 14:26:29] INFO [ 'RestServer base port is 8080' ]
[2019-4-30 14:26:33] INFO [ 'NNIManager setClusterMetadata, key: trial_config, value: {"command":"c:\\Miniconda3\\envs\\tensorflow\\python.exe mnist.py","codeDir":"D:\\Users\\0x8A63F77D\\Documents\\GitHub\\nni\\examples\\trials\\mnist\\.","gpuNum":1}' ]
[2019-4-30 14:26:33] INFO [ 'required GPU number is 1' ]
[2019-4-30 14:26:35] INFO [ 'Starting experiment: kDj198ea' ]
[2019-4-30 14:26:35] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[2019-4-30 14:26:35] INFO [ 'Add event listeners' ]
[2019-4-30 14:26:35] INFO [ 'Run local machine training service.' ]
[2019-4-30 14:26:36] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[2019-4-30 14:26:36] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"dropout_rate": 0.6742607414411488, "conv_size": 2, "hidden_size": 1024, "batch_size": 8, "learning_rate": 0.001}}' ]
[2019-4-30 14:26:40] INFO [ 'submitTrialJob: form: {"jobType":"TRIAL","hyperParameters":{"value":"{\"parameter_id\": 0, \"parameter_source\": \"algorithm\", \"parameters\": {\"dropout_rate\": 0.6742607414411488, \"conv_size\": 2, \"hidden_size\": 1024, \"batch_size\": 8, \"learning_rate\": 0.001}}","index":0}}' ]

The trial folder doesn't exist.

I installed nni with pip.

@demianzhang
Contributor

Is that the complete nnimanager.log? Please make sure it is not missing a line like [ 'Trial job kDj198ea status changed from WAITING to RUNNING' ] or FAILED. What is your Python version, and is it 64-bit? Could you provide the experiment IP?

@0x8A63F77D
Author

Yes it's the complete log. No missing lines.

The python version: Python 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32

Is this the experiment IP?

The experiment id is PuQDTCUx
The Web UI urls are: http://192.168.56.1:8080 http://10.0.0.2:8080 http://127.0.0.1:8080

@demianzhang
Contributor

I cannot open the URL. Can you run mnist.py on the GPU without nni? By the way, the URL is private information, so you can edit your comment and delete it if you want.

@0x8A63F77D
Author

0x8A63F77D commented Apr 30, 2019

Well, the machine is behind a firewall and I have no rights to change the firewall settings.

My own code runs without error when I set gpuNum = 0 and set CUDA_VISIBLE_DEVICES = 0 in code.

By the way, the example code in the nni repository needs some modifications to run without nni.

mnist_before.py runs without errors.

@demianzhang
Contributor

CUDA_VISIBLE_DEVICES='-1' or CUDA_VISIBLE_DEVICES='' means no GPU is visible. However, if you set gpuNum = 0 in the yml, there is no need to set CUDA_VISIBLE_DEVICES yourself. What is the command for your GPU experiment?
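
For reference, a minimal Python sketch (assuming TensorFlow 1.x; this is not NNI's own code) of how the variable masks GPUs:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # "-1" or "" hides all GPUs
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0

# Import TensorFlow only after the variable is set, otherwise the mask has no effect.
from tensorflow.python.client import device_lib

# With "-1" only CPU devices are listed; with "0" a single GPU appears as well.
print([d.name for d in device_lib.list_local_devices()])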

@0x8A63F77D
Author

The run.ps1 under the trial folder sets CUDA_VISIBLE_DEVICES = '-1' when gpuNum = 0. But my data uses channels-first order, and if I don't override the variable, TensorFlow tells me that the CPU version does not support the NCHW format.
The problem is:
gpuNum = 1 -> no trial folder under the experiment folder -> the trial keeps waiting in the web UI
gpuNum = 0 -> trials run with CUDA_VISIBLE_DEVICES = '-1' -> I set CUDA_VISIBLE_DEVICES = '0' in code -> everything is fine

The command is c:\Miniconda3\envs\tf-alpha\python.exe hyperparam_tunning.py

@demianzhang
Contributor

Could you provide the run.ps1 file from a GPU experiment? In my reproduction, $env:CUDA_VISIBLE_DEVICES="0" is already set in run.ps1 when a GPU is used.

@0x8A63F77D
Author

I can't run a GPU experiment the normal way (setting gpuNum = 1 in the yml). No trial folder is created. That is the problem I am facing.

@demianzhang
Contributor

demianzhang commented Apr 30, 2019

I see. When you run the GPU version, is there a free GPU (one with no process running on it)? I want to check here. If there is no free GPU, the trial will stay WAITING in a loop until one is found. I can reproduce this problem when the GPU is occupied by another process. A temporary workaround is to free up a GPU; then it will work. Thanks.
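
To illustrate the kind of check involved (a rough sketch based on nvidia-smi's XML output, not NNI's actual implementation), a GPU is treated as free only if no process is running on it:

import subprocess
import xml.etree.ElementTree as ET

def free_gpu_indices():
    # `nvidia-smi -q -x` prints one <gpu> element per device.
    xml = subprocess.check_output(["nvidia-smi", "-q", "-x"]).decode()
    root = ET.fromstring(xml)
    free = []
    for index, gpu in enumerate(root.findall("gpu")):
        procs = gpu.find("processes")
        n = 0 if procs is None else len(procs.findall("process_info"))
        if n == 0:
            free.append(index)
    return free

print("GPUs with no running process:", free_gpu_indices())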

@0x8A63F77D
Author

The GPU utilization is almost 0, and I'm sure I didn't start any other compute applications that use the GPU. But nvidia-smi shows a lot of apps using the GPU, and I can't shut them all down because some of them are system processes.

@crackcomm

crackcomm commented May 4, 2019

I am having similar issues running with Anaconda. Is there a way I can insert activate base into the .ps1 file?

In my case, with GPU the trial is always "Waiting", and with CPU it is "Failed" after a few seconds, with nothing in the logs.

Edit:

activate.bat will not work in PowerShell

@crackcomm

crackcomm commented May 4, 2019

I found that it succeeded after typing this in PowerShell:

Set-ExecutionPolicy RemoteSigned

You might want to use a safer approach:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

It didn't solve the problem for the GPU: the status is "waiting" and GPU Compute_0 is at 0%.

There is no trials directory.

In the nnimanager logs this one line repeats a lot:

[5/4/2019, 6:47:10 PM] ERROR [ 'Could not get gpu metrics information!' ]

Deeper:

> python -m nni_gpu_tool.gpu_metrics_collector
'nvidia-smi' is not recognized as an internal or external command,
operable program or batch file.
job exporter error <class 'subprocess.CalledProcessError'>
job exporter error Command 'nvidia-smi -q -x' returned non-zero exit status 1.
job exporter error <traceback object at 0x000001C23571EA88>
set PATH=%PATH%;C:\Program Files\NVIDIA Corporation\NVSMI

Adding NVSMI to PATH still doesn't fix it, even though the gpu_metrics file is now being updated.

[5/4/2019, 7:30:22 PM] ERROR [ 'Could not get gpu metrics information!' ]
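
A quick diagnostic sketch (just an assumption that the collector shells out to nvidia-smi): check whether nvidia-smi is reachable from the Python process itself, since a PATH change made in one shell does not affect processes nni has already started.

import shutil
import subprocess

path = shutil.which("nvidia-smi")
print("nvidia-smi found at:", path)

if path is None:
    # Typical default install location on Windows; adjust if yours differs.
    print(r"Try adding C:\Program Files\NVIDIA Corporation\NVSMI to the system PATH.")
else:
    out = subprocess.check_output([path, "-q", "-x"])
    print("nvidia-smi -q -x returned", len(out), "bytes of XML")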

@demianzhang
Contributor

demianzhang commented May 5, 2019

Hi @crackcomm, have you tried the command Set-ExecutionPolicy -ExecutionPolicy Unrestricted before installing?
As for the GPU problem, could you check the gpu_metrics file in C:\Users\**\AppData\Local\Temp\nni\script\gpu_metrics? If it is blank, the 'nvidia-smi -q -x' command is not running successfully. Please make sure your GPU device and driver are installed correctly and that 'nvidia-smi -q -x' returns the GPU information normally.

@crackcomm

crackcomm commented May 5, 2019

Hey, @demianzhang, I can try that.

The gpu_metrics file does work, and I even checked whether the execTail function works on Windows; it does return the last line of the file.
The file is updated every few seconds, but the nnimanager log now contains only one line of "Could not get GPU metrics information", whereas before (when nvidia-smi was not found) there were a dozen.

gpu_metrics: 50kb, last line:

{"Timestamp": "Sat May  4 20:00:04 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 10, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}

When I run nvidia-smi -q -x I get a few thousand lines of XML.

On the other hand, the nvidia-smi failure was silent and really should be logged.
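
For what it's worth, a small sketch of such a tail check (file location and record format taken from the comments above; this is not NNI's code):

import json
import os

# Path assumed from the earlier comment; substitute your own user directory if it differs.
metrics_path = os.path.expandvars(r"%LOCALAPPDATA%\Temp\nni\script\gpu_metrics")

with open(metrics_path) as f:
    last_line = f.read().splitlines()[-1]

record = json.loads(last_line)
for gpu in record["gpuInfos"]:
    print("GPU", gpu["index"],
          "util:", gpu["gpuUtil"], "%",
          "mem:", gpu["gpuMemUtil"], "%",
          "processes:", gpu["activeProcessNum"])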

@crackcomm

crackcomm commented May 5, 2019

I did install after running Set-ExecutionPolicy -ExecutionPolicy Unrestricted, and:

gpu_metrics:

{"Timestamp": "Sun May  5 16:21:02 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:07 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "5", "gpuUtil": "8", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:12 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:18 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "4", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:23 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "5", "gpuUtil": "5", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:28 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:33 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "7", "gpuUtil": "11", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:38 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:43 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "4", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:49 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "8", "gpuUtil": "13", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:54 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "4", "index": 0}]}
{"Timestamp": "Sun May  5 16:21:59 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:04 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:09 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:14 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:20 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:25 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:30 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "5", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:35 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "4", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:40 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:45 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "3", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:50 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "4", "gpuUtil": "4", "index": 0}]}
{"Timestamp": "Sun May  5 16:22:56 2019", "gpuCount": 1, "gpuInfos": [{"activeProcessNum": 11, "gpuMemUtil": "5", "gpuUtil": "7", "index": 0}]}

nnimanager.log

[5/5/2019, 4:20:57 PM] INFO [ 'Datastore initialization done' ]
[5/5/2019, 4:20:57 PM] INFO [ 'Rest server listening on: http://0.0.0.0:8080' ]
[5/5/2019, 4:20:57 PM] INFO [ 'Construct local machine training service.' ]
[5/5/2019, 4:20:57 PM] INFO [ 'RestServer start' ]
[5/5/2019, 4:20:57 PM] INFO [ 'RestServer base port is 8080' ]
[5/5/2019, 4:21:00 PM] INFO [ 'NNIManager setClusterMetadata, key: trial_config, value: {"command":"python mnist.py","codeDir":"C:\\\\Users\\\\Pah\\\\python\\\\nni\\\\examples\\\\trials\\\\mnist\\\\.","gpuNum":1}' ]
[5/5/2019, 4:21:00 PM] INFO [ 'required GPU number is 1' ]
[5/5/2019, 4:21:01 PM] INFO [ 'Starting experiment: NbMXDFrs' ]
[5/5/2019, 4:21:01 PM] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[5/5/2019, 4:21:01 PM] INFO [ 'Add event listeners' ]
[5/5/2019, 4:21:01 PM] INFO [ 'Run local machine training service.' ]
[5/5/2019, 4:21:02 PM] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[5/5/2019, 4:21:02 PM] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"dropout_rate": 0.5958430967093737, "conv_size": 2, "hidden_size": 124, "batch_size": 32, "learning_rate": 0.001}}' ]
[5/5/2019, 4:21:06 PM] INFO [ 'submitTrialJob: form: {"jobType":"TRIAL","hyperParameters":{"value":"{\\"parameter_id\\": 0, \\"parameter_source\\": \\"algorithm\\", \\"parameters\\": {\\"dropout_rate\\": 0.5958430967093737, \\"conv_size\\": 2, \\"hidden_size\\": 124, \\"batch_size\\": 32, \\"learning_rate\\": 0.001}}","index":0}}' ]
(base) > nnictl log stderr
-----------------------------------------------------------------------
                Experiment start time 2019-05-05 16:20:53
-----------------------------------------------------------------------
(node:5520) ExperimentalWarning: The fs.promises API is experimental


(base) > nnictl log stdout
-----------------------------------------------------------------------
                Experiment start time 2019-05-05 16:20:53
-----------------------------------------------------------------------

EDIT: Currently trying with your implementation of the getAvailableGPUIndices function.

@crackcomm

crackcomm commented May 5, 2019

@demianzhang Using your implementation from #1043, it starts to run trials! Great progress.

It brings to light an issue with TensorFlow and cuDNN.

Logs from one of the trials:
trials: https://gist.github.com/crackcomm/5227f0ec455f434dc4cd7aadbf5dee6e
stderr: https://gist.github.com/crackcomm/938b93bd5e0deee5df43b6b978ac49d9

Notice lines 57-537: there is indeed a cuDNN initialization failure, but the log is scrambled.

2019-05-05 17:30:36.993255: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally 
2019-05-05 17:30:37.259571: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:37.260302: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:37.261223: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:37.261707: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:37.262430: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:37.262892: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 
2019-05-05 17:30:38.133564: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR 
2019-05-05 17:30:39.070983: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

TensorFlow works just fine on the GPU on its own, so what could be the reason it fails to initialize here? My GPU memory usage is 0.9/2.0 GB.

@demianzhang
Contributor

Maybe it is an environment problem. I list some possible solutions below; you can also try mnist_before.py (without nni) to check TensorFlow itself. A sketch of one commonly suggested workaround (letting TensorFlow grow GPU memory on demand) follows after the list.

  1. try the pip cuDNN version
  2. run TensorFlow as administrator/root (issue)
  3. remove the cache folder (stackoverflow)
  4. if mnist_before.py also has this error, please try this (stackoverflow)
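
A minimal sketch of that workaround (assuming TensorFlow 1.x, as in the logs above; not the mnist example's exact code): let TensorFlow allocate GPU memory on demand instead of grabbing nearly all of it up front, which on a 2 GB card often shows up as CUBLAS_STATUS_ALLOC_FAILED / CUDNN_STATUS_INTERNAL_ERROR.

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow GPU memory usage as needed
# config.gpu_options.per_process_gpu_memory_fraction = 0.5  # or cap it instead

with tf.Session(config=config) as sess:
    # build and run the model inside this session, e.g. the mnist graph
    pass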

@crackcomm

I tried 4. first:

Before:

2019-05-06 19:48:58.903538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1378 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950, pci bus id: 0000:02:00.0, compute capability: 5.2)
2019-05-06 19:49:00.334877: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-05-06 19:49:02.183060: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.201263: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.246908: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.250334: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.253934: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.258173: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-05-06 19:49:02.828163: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-05-06 19:49:03.127638: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[node conv1/Conv2D (defined at mnist_before.py:127) ]]

After:

2019-05-06 19:50:09.249347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1378 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950, pci bus id: 0000:02:00.0, compute capability: 5.2)
2019-05-06 19:50:09.896994: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-05-06 19:50:15.874104: E tensorflow/stream_executor/cuda/cuda_driver.cc:981] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-05-06 19:50:17.115978: E tensorflow/stream_executor/cuda/cuda_timer.cc:55] Internal: error destroying CUDA event in context 00000296210E2F80: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-05-06 19:50:17.121858: E tensorflow/stream_executor/cuda/cuda_timer.cc:60] Internal: error destroying CUDA event in context 00000296210E2F80: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-05-06 19:50:17.126914: F tensorflow/stream_executor/cuda/cuda_dnn.cc:194] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.

I tried in cmd with "Run as Administrator", and:

2019-05-06 19:55:37.006186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1378 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950, pci bus id: 0000:02:00.0, compute capability: 5.2)
2019-05-06 19:55:37.498716: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-05-06 19:55:38.860549: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 747.68MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-06 19:55:38.878328: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-06 19:55:38.882719: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.34GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-06 19:55:38.886318: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.32GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-06 19:55:48.892565: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 239.26MiB.  Current allocation summary follows.
2019-05-06 19:55:48.894966: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):   Total Chunks: 28, Chunks in use: 28. 7.0KiB allocated for chunks. 7.0KiB in use in bin. 2.1KiB client-requested in use in bin.
2019-05-06 19:55:48.897580: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.900013: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1024):  Total Chunks: 1, Chunks in use: 1. 1.3KiB allocated for chunks. 1.3KiB in use in bin. 1.0KiB client-requested in use in bin.
2019-05-06 19:55:48.903058: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2048):  Total Chunks: 4, Chunks in use: 4. 13.0KiB allocated for chunks. 13.0KiB in use in bin. 12.5KiB client-requested in use in bin.
2019-05-06 19:55:48.905736: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4096):  Total Chunks: 5, Chunks in use: 5. 20.0KiB allocated for chunks. 20.0KiB in use in bin. 20.0KiB client-requested in use in bin.
2019-05-06 19:55:48.908380: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8192):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.911283: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16384):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.913845: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (32768):         Total Chunks: 4, Chunks in use: 4. 160.0KiB allocated for chunks. 160.0KiB in use in bin. 160.0KiB client-requested in use in bin.
2019-05-06 19:55:48.916615: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (65536):         Total Chunks: 1, Chunks in use: 1. 78.3KiB allocated for chunks. 78.3KiB in use in bin. 78.1KiB client-requested in use in bin.
2019-05-06 19:55:48.919244: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (131072):        Total Chunks: 4, Chunks in use: 4. 835.8KiB allocated for chunks. 835.8KiB in use in bin. 800.0KiB client-requested in use in bin.
2019-05-06 19:55:48.922591: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (262144):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.925083: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (524288):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.927573: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1048576):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.930526: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2097152):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.932988: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4194304):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.935452: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8388608):       Total Chunks: 1, Chunks in use: 1. 12.25MiB allocated for chunks. 12.25MiB in use in bin. 12.25MiB client-requested in use in bin.
2019-05-06 19:55:48.938745: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16777216):      Total Chunks: 4, Chunks in use: 3. 82.04MiB allocated for chunks. 51.75MiB in use in bin. 36.75MiB client-requested in use in bin.
2019-05-06 19:55:48.941508: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (33554432):      Total Chunks: 1, Chunks in use: 0. 33.62MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.944123: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (67108864):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.947065: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-05-06 19:55:48.949558: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (268435456):     Total Chunks: 1, Chunks in use: 1. 1.00GiB allocated for chunks. 1.00GiB in use in bin. 957.03MiB client-requested in use in bin.
2019-05-06 19:55:48.952248: I tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 239.26MiB was 128.00MiB, Chunk State:
2019-05-06 19:55:48.953843: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EC0000 of size 256
2019-05-06 19:55:48.955265: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EC0100 of size 256
2019-05-06 19:55:48.961166: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EC0200 of size 256
2019-05-06 19:55:48.962717: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EC0300 of size 40960
2019-05-06 19:55:48.964183: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01ECA300 of size 4096
2019-05-06 19:55:48.965620: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01ECB300 of size 4096
2019-05-06 19:55:48.967071: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01ECC300 of size 256
2019-05-06 19:55:48.968464: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01ECC400 of size 256
2019-05-06 19:55:48.970554: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01ECC500 of size 204800
2019-05-06 19:55:48.972364: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFE500 of size 256
2019-05-06 19:55:48.973793: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFE600 of size 256
2019-05-06 19:55:48.975196: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFE700 of size 3328
2019-05-06 19:55:48.976630: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFF400 of size 256
2019-05-06 19:55:48.978496: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFF500 of size 256
2019-05-06 19:55:48.979939: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFF600 of size 1280
2019-05-06 19:55:48.981349: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFFB00 of size 256
2019-05-06 19:55:48.982755: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFFC00 of size 256
2019-05-06 19:55:48.984154: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFFD00 of size 256
2019-05-06 19:55:48.985641: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01EFFE00 of size 40960
2019-05-06 19:55:48.987757: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F09E00 of size 40960
2019-05-06 19:55:48.989393: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F13E00 of size 40960
2019-05-06 19:55:48.991142: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F1DE00 of size 4096
2019-05-06 19:55:48.993055: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F1EE00 of size 4096
2019-05-06 19:55:48.994776: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F1FE00 of size 4096
2019-05-06 19:55:48.998344: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F20E00 of size 256
2019-05-06 19:55:49.000128: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F20F00 of size 256
2019-05-06 19:55:49.001898: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F21000 of size 256
2019-05-06 19:55:49.003545: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F21100 of size 204800
2019-05-06 19:55:49.005417: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F53100 of size 204800
2019-05-06 19:55:49.009131: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01F85100 of size 241408
2019-05-06 19:55:49.010913: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D01FC0000 of size 16777216
2019-05-06 19:55:49.012641: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D02FC0000 of size 16777216
2019-05-06 19:55:49.014359: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D03FC0000 of size 12845056
2019-05-06 19:55:49.016086: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D04C00000 of size 20709376
2019-05-06 19:55:49.019472: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC0000 of size 256
2019-05-06 19:55:49.021280: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC0100 of size 256
2019-05-06 19:55:49.023151: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC0200 of size 256
2019-05-06 19:55:49.024805: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC0300 of size 3328
2019-05-06 19:55:49.028261: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC1000 of size 3328
2019-05-06 19:55:49.029992: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC1D00 of size 3328
2019-05-06 19:55:49.031675: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2A00 of size 256
2019-05-06 19:55:49.033352: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2B00 of size 256
2019-05-06 19:55:49.035026: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2C00 of size 256
2019-05-06 19:55:49.038615: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2D00 of size 256
2019-05-06 19:55:49.040530: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2E00 of size 256
2019-05-06 19:55:49.042363: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC2F00 of size 256
2019-05-06 19:55:49.044052: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC3000 of size 256
2019-05-06 19:55:49.045738: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC3100 of size 256
2019-05-06 19:55:49.049285: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D05FC3200 of size 256
2019-05-06 19:55:49.050999: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free  at 0000000D05FC3300 of size 31760128
2019-05-06 19:55:49.052742: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D07E0D200 of size 256
2019-05-06 19:55:49.054589: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D07E0D300 of size 80128
2019-05-06 19:55:49.056468: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free  at 0000000D07E20C00 of size 35255296
2019-05-06 19:55:49.060105: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0000000D0A4C0000 of size 1073741824
2019-05-06 19:55:49.061919: I tensorflow/core/common_runtime/bfc_allocator.cc:638]      Summary of in-use Chunks by size:
2019-05-06 19:55:49.063699: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 28 Chunks of size 256 totalling 7.0KiB
2019-05-06 19:55:49.065377: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 1280 totalling 1.3KiB
2019-05-06 19:55:49.068794: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 3328 totalling 13.0KiB
2019-05-06 19:55:49.070512: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 4096 totalling 20.0KiB
2019-05-06 19:55:49.072294: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 40960 totalling 160.0KiB
2019-05-06 19:55:49.074035: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 80128 totalling 78.3KiB
2019-05-06 19:55:49.075745: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 204800 totalling 600.0KiB
2019-05-06 19:55:49.079279: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 241408 totalling 235.8KiB
2019-05-06 19:55:49.081033: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 12845056 totalling 12.25MiB
2019-05-06 19:55:49.082772: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 16777216 totalling 32.00MiB
2019-05-06 19:55:49.084531: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 20709376 totalling 19.75MiB
2019-05-06 19:55:49.086291: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 1073741824 totalling 1.00GiB
2019-05-06 19:55:49.089882: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 1.06GiB
2019-05-06 19:55:49.091615: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:                  1445835571
InUse:                  1141992704
MaxInUse:               1173752832
NumAllocs:                     134
MaxAllocSize:           1073741824

2019-05-06 19:55:49.098856: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ******__*__************************************************************************************xxxxx
2019-05-06 19:55:49.101753: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at pooling_ops_common.cc:148 : Resource exhausted: OOM when allocating tensor with shape[10000,32,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
OOM when allocating tensor with shape[10000,32,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node pool1/MaxPool (defined at mnist_before.py:133) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[node accuracy/Mean (defined at mnist_before.py:122) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'pool1/MaxPool', defined at:
totalMemory: 2.00GiB freeMemory: 1.64GiB

I can already feel the pain of reinstalling cuDNN.

@crackcomm

I tried changing every parameter (batch size, batch num, hidden size) and it still runs out of memory.
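
One observation from the OOM log above: the failing tensor shape [10000, 32, 14, 14] suggests the whole 10000-image test set is pushed through the network at once during evaluation, which would OOM on a 2 GB card no matter what training batch size is used. A hedged sketch of evaluating in chunks instead (not the mnist example's actual code; the names x, y_, keep_prob, accuracy are placeholder assumptions):

def batched_accuracy(sess, accuracy, x, y_, keep_prob, images, labels, batch=500):
    """Average the accuracy tensor over the test set in small batches."""
    total, count = 0.0, 0
    for start in range(0, len(images), batch):
        xs = images[start:start + batch]
        ys = labels[start:start + batch]
        acc = sess.run(accuracy, feed_dict={x: xs, y_: ys, keep_prob: 1.0})
        total += acc * len(xs)
        count += len(xs)
    return total / count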

@crackcomm

crackcomm commented May 6, 2019

I think it's most likely a cuDNN/cuBLAS version mismatch.

2019-05-06 19:50:17.126914: F tensorflow/stream_executor/cuda/cuda_dnn.cc:194] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.
2019-05-06 19:50:09.896994: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally

@crackcomm

[image: false positives]

The log is broken as hell: https://gist.github.com/crackcomm/c60ba4a100e9a359624915b17ef90e0a

@crackcomm

crackcomm commented May 7, 2019

My issue was solved with the help of @peterjc123 in another unrelated issue: pytorch/pytorch#20202 (what a nice issue id too)

@demianzhang
Contributor

Glad to hear that. Thanks for trying the tool and sticking with the problem until it was solved. The related issues on our side are getting better too.

@crackcomm

@demianzhang Thank you for your help as well.

@scarlett2018
Member

Closing the issue per the latest feedback from @crackcomm.
