CUDA Error with LSTM model, after a lot of tries. All files provided! #6781

Open

arnaud-nt2i opened this issue Oct 2, 2020 · 2 comments

Labels: Training issue - no-detections / Nan avg-loss / low accuracy

arnaud-nt2i commented Oct 2, 2020

@AlexeyAB
I'm trying to train one of the YOLOv3 LSTM models from #3114,
but with all of them I get the same error:

CUDA status = cudaDeviceSynchronize() Error: file: C:/Users/nt2i_arnaud_pauwelyn/Documents/darknet-master(02-10)/darknet-master/src/blas_kernels.cu : axpy_ongpu_offset() : line: 741 : build time: Oct  2 2020 - 10:35:39

CUDA Error: an illegal memory access was encountered

My config is the attached one, built with the latest repo (but I also tried with the August 15 repo and with CUDA 10.0).
I'll add that every model without LSTM works great (train and infer), YOLOv3 or YOLOv4, SAM, PAN, etc., with the same dataset.
yolo_v3_tiny_lstm.cfg.txt
train.txt
image samples: sample.zip
obj.names.txt
obj.data.txt

 .\darknet.exe detector train data/obj.data cfg/yolo_v3_tiny_lstm.cfg yolov3-tiny.conv.14 -map -dont_show -cuda_debug_sync -benchmark_layers
 CUDA-version: 10020 (10020), cuDNN: 7.6.5, CUDNN_HALF=1, GPU count: 1
 CUDNN_HALF=1
 OpenCV version: 3.4.0
 Prepare additional network for mAP calculation...
 0 : compute_capability = 750, cudnn_half = 1, GPU: GeForce GTX 1660 Ti
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0

I am not the only one encountering issues when training LSTM models: #6531, #6708.
What should we do to solve this?

@arnaud-nt2i added the Training issue - no-detections / Nan avg-loss / low accuracy label Oct 2, 2020
@arnaud-nt2i
Author

OK, I have found a workaround by using "Yolo v3 optimal" from here: https://github.com/AlexeyAB/darknet/releases/tag/darknet_yolo_v3_optimal

@aotiansysu

Setting bottleneck=1 for every conv_lstm layer in yolo_v3_tiny_lstm.cfg works for me:

[conv_lstm]
batch_normalize=1
size=3
pad=1
output=128
peephole=0
bottleneck=1
activation=leaky
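
If you want to apply this change to every [conv_lstm] section without editing the cfg by hand, here is a minimal sketch. It assumes only the standard darknet .cfg section syntax; the force_bottleneck helper and the output file name are hypothetical, not part of darknet:

# Sketch: force "bottleneck=1" in every [conv_lstm] section of a darknet .cfg file.
from pathlib import Path

def force_bottleneck(cfg_in, cfg_out):
    lines = Path(cfg_in).read_text().splitlines()
    out = []
    in_lstm = False
    has_bottleneck = False

    def close_section():
        # When leaving a [conv_lstm] section, append bottleneck=1 if it was missing.
        nonlocal in_lstm, has_bottleneck
        if in_lstm and not has_bottleneck:
            out.append("bottleneck=1")
        in_lstm = False
        has_bottleneck = False

    for line in lines:
        stripped = line.strip()
        if stripped.startswith("["):              # a new [section] header starts
            close_section()
            in_lstm = (stripped == "[conv_lstm]")
        elif in_lstm and stripped.startswith("bottleneck"):
            line = "bottleneck=1"                 # overwrite any existing value
            has_bottleneck = True
        out.append(line)
    close_section()                               # handle the final section
    Path(cfg_out).write_text("\n".join(out) + "\n")

force_bottleneck("cfg/yolo_v3_tiny_lstm.cfg", "cfg/yolo_v3_tiny_lstm_bn1.cfg")

Then point the training command at the rewritten cfg instead of the original one.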
