
A question when training #2

Open
xiaowenhe opened this issue Jul 28, 2017 · 17 comments

Comments

@xiaowenhe

Hey, thanks a lot for your great work; I am following it now.
But I ran into a problem when training:
iter: 0 / 100000, total loss: 5.0174, rpn_loss_cls: 1.2912, rpn_loss_box: 0.4618, loss_cls: 3.1485, loss_box: 0.1159, lr: 0.001000
speed: 3.746s / iter
cudaCheckError() failed : invalid device function.
That is to say, it throws an error after the first iteration.
I don't know how to deal with it. Can you help me?

Device: K80, CUDA 8.0, cuDNN 5.1

@Zardinality
Owner

@xiaowenhe You might want to check this: smallcorgi/Faster-RCNN_TF#19

I was planning to add it to the README (and I did, but in another repo). I will add it here before tomorrow.

@xiaowenhe
Author

@Zardinality, thanks for your answer, but the error still occurs, even after I changed -arch to sm_37 (for the K80) in make.sh and setup.py and reran make.

@Zardinality
Owner

@xiaowenhe That is odd. Can you make sure you are using the recompiled version by deleting the original ones? Or comment out all lines related to the deform ops to check whether the regular roi_pooling op works.
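That clean-and-rebuild step can be sketched as follows. This is only a sketch under assumptions: the lib/ directory and make.sh come from the TF_Deformable_Net layout visible in the demo log later in this thread, and the exact .so paths and sm_XX value depend on your checkout and GPU.

```shell
# Remove previously compiled CUDA ops so a stale build (compiled with the
# wrong -arch) cannot be picked up at import time.
cd TF_Deformable_Net/lib
find . -name '*.so' -delete

# Edit make.sh and setup.py so every nvcc invocation uses an -arch matching
# your GPU's compute capability (e.g. sm_37 for a K80, sm_35 for a K40m),
# then rebuild the ops:
./make.sh
```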

@xiaowenhe
Author

@Zardinality Thank you very much. I got it working by deleting the original compiled libraries and using the recompiled version.

@feitiandemiaomi

@xiaowenhe I met the same problem. What did you delete? The Makefile already has an rm step, so I don't understand.

@Zardinality
Owner

@feitiandemiaomi Have you changed -arch to a value compatible with your device?

@feitiandemiaomi

@Zardinality I have changed it. My machine is a K40m, so I set -arch=sm_35 and reran make, but it did not work.

@feitiandemiaomi

gpu2@gpu2-PowerEdge-R730:~/OWFO/TF_Deformable_Net$ python faster_rcnn/demo.py --model tf_deformable_net/restore_output/Resnet50_iter_145000.ckpt
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
filename: /home/gpu2/OWFO/TF_Deformable_Net/lib/psroi_pooling_layer/psroi_pooling.so


/home/gpu2/OWFO/TF_Deformable_Net/lib/psroi_pooling_layer/psroi_pooling.so
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla K40m
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:04:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x18aa400
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: Tesla K40m
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:82:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:04:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40m, pci bus id: 0000:82:00.0)
Tensor("Placeholder:0", shape=(?, ?, ?, 3), dtype=float32)
Tensor("pool1:0", shape=(?, ?, ?, 64), dtype=float32)
Tensor("bn2a_branch1/batchnorm/add_1:0", shape=(?, ?, ?, 256), dtype=float32)
Tensor("bn2a_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 256), dtype=float32)
Tensor("res2a_relu:0", shape=(?, ?, ?, 256), dtype=float32)
Tensor("bn2b_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 256), dtype=float32)
Tensor("res2b_relu:0", shape=(?, ?, ?, 256), dtype=float32)
Tensor("bn2c_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 256), dtype=float32)
Tensor("res2c_relu:0", shape=(?, ?, ?, 256), dtype=float32)
Tensor("bn3a_branch1/batchnorm/add_1:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("bn3a_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("res3a_relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("bn3b_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("res3b_relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("bn3c_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("res3c_relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("bn3d_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("res3d_relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("bn4a_branch1/batchnorm/add_1:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("bn4a_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("res4a_relu:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("bn4b_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("res4b_relu:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("bn4c_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("res4c_relu:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("bn4d_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("res4d_relu:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("bn4e_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("res4e_relu:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("bn4f_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("res4f_relu:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("rpn_conv/3x3/Relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("rpn_cls_score/BiasAdd:0", shape=(?, ?, ?, 18), dtype=float32)
Tensor("rpn_cls_prob:0", shape=(?, ?, ?, ?), dtype=float32)
Tensor("Reshape_2:0", shape=(?, ?, ?, 18), dtype=float32)
Tensor("rpn_bbox_pred/BiasAdd:0", shape=(?, ?, ?, 36), dtype=float32)
Tensor("Placeholder_1:0", shape=(?, 3), dtype=float32)
Tensor("res4f_relu:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("res4f_relu:0", shape=(?, ?, ?, 1024), dtype=float32)
Tensor("res5a_branch2a_relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("res5a_branch2b_offset/BiasAdd:0", shape=(?, ?, ?, 72), dtype=float32)
Tensor("transpose:0", shape=(?, 512, ?, ?), dtype=float32) Tensor("res5a_branch2b/weights/read:0", shape=(512, 512, 3, 3), dtype=float32) Tensor("transpose_1:0", shape=(?, 72, ?, ?), dtype=float32)
Tensor("bn5a_branch1/batchnorm/add_1:0", shape=(?, ?, ?, 2048), dtype=float32)
Tensor("bn5a_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 2048), dtype=float32)
Tensor("res5b_branch2a_relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("res5b_branch2b_offset/BiasAdd:0", shape=(?, ?, ?, 72), dtype=float32)
Tensor("transpose_2:0", shape=(?, 512, ?, ?), dtype=float32) Tensor("res5b_branch2b/weights/read:0", shape=(512, 512, 3, 3), dtype=float32) Tensor("transpose_3:0", shape=(?, 72, ?, ?), dtype=float32)
Tensor("res5a_relu:0", shape=(?, ?, ?, 2048), dtype=float32)
Tensor("bn5b_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 2048), dtype=float32)
Tensor("res5c_branch2a_relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("res5c_branch2b_offset/BiasAdd:0", shape=(?, ?, ?, 72), dtype=float32)
Tensor("transpose_4:0", shape=(?, 512, ?, ?), dtype=float32) Tensor("res5c_branch2b/weights/read:0", shape=(512, 512, 3, 3), dtype=float32) Tensor("transpose_5:0", shape=(?, 72, ?, ?), dtype=float32)
Tensor("res5b_relu:0", shape=(?, ?, ?, 2048), dtype=float32)
Tensor("bn5c_branch2c/batchnorm/add_1:0", shape=(?, ?, ?, 2048), dtype=float32)
Tensor("conv_new_1_relu:0", shape=(?, ?, ?, 256), dtype=float32)
Tensor("rois:0", shape=(?, 5), dtype=float32)
Tensor("conv_new_1_relu:0", shape=(?, ?, ?, 256), dtype=float32)
Tensor("rois:0", shape=(?, 5), dtype=float32)
Tensor("offset_reshape:0", shape=(?, 2, 7, 7), dtype=float32)
Tensor("fc_new_2/fc_new_2:0", shape=(?, 1024), dtype=float32)
Tensor("fc_new_2/fc_new_2:0", shape=(?, 1024), dtype=float32)
Loading network Resnet50_test... restore from the checkpointtf_deformable_net/restore_output/Resnet50_iter_145000.ckpt
done.
cudaCheckError() failed : invalid device function

@Zardinality
Owner

@feitiandemiaomi I guess the Makefile may only rebuild everything when you manually run its clean step. The only cause of this error I am aware of is the -arch flag, so I suggest you remove all previously built libraries and rebuild.
Also, I don't know if it is relevant, but here also contains an -arch flag.

@feitiandemiaomi

@Zardinality Thank you a lot. I will try again and not give up.

@xiaowenhe
Author

@feitiandemiaomi I deleted the whole project and downloaded it again, changed -arch to sm_37 (for the K80) in make.sh and setup.py, and reran make.
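For reference, the -arch value is just sm_&lt;major&gt;&lt;minor&gt; of the device's CUDA compute capability. A tiny hypothetical helper (not part of this repo) that encodes the mapping used in this thread:

```python
# Hypothetical helper, not part of TF_Deformable_Net: build the nvcc -arch
# flag from a device's CUDA compute capability (major, minor).
def arch_flag(major, minor):
    return "sm_{}{}".format(major, minor)

# Values from this thread: the Tesla K80 is compute capability 3.7, and the
# Tesla K40m is 3.5 (see "major: 3 minor: 5" in the demo log above).
print(arch_flag(3, 7))  # sm_37 -> use -arch=sm_37 in make.sh / setup.py
print(arch_flag(3, 5))  # sm_35
```

You can read the capability of your own card from `deviceQuery` in the CUDA samples, then rebuild with the matching flag.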

@feitiandemiaomi

@xiaowenhe Thanks, you are right. That made it work.

@feitiandemiaomi

@Zardinality I just ran the test, and a file seems to be missing in /experiments/scripts/, such as voc2007_test_vgg.sh.

@Zardinality
Owner

@feitiandemiaomi Already pushed.

@feitiandemiaomi

@Zardinality Thank you for your reply. If possible, could I add you as a friend on WeChat or QQ?

@Zardinality
Owner

Sure, my WeChat nickname is the same as my GitHub one.

@feitiandemiaomi

I have sent you a friend request.
