
Implement Yolo-LSTM (~+4-9 AP) for detection on Video with high mAP and without blinking issues #3114

AlexeyAB opened this issue May 7, 2019 · 390 comments

@AlexeyAB commented May 7, 2019

Implement a Yolo-LSTM detection network that will be trained on video frames to increase mAP and solve blinking issues.


Think about whether we can use a Transformer (Vaswani et al., 2017) / GPT-2 / BERT on frame sequences instead of word sequences: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf and https://arxiv.org/pdf/1706.03762.pdf

Or can we use Transformer-XL (https://arxiv.org/abs/1901.02860v2) or Universal Transformers (https://arxiv.org/abs/1807.03819v3) for long sequences?

@AlexeyAB commented May 20, 2019

Comparison of different models on a very small custom dataset: 250 training and 250 validation images from a video: https://drive.google.com/open?id=1QzXSCkl9wqr73GHFLIdJ2IIRMgP1OnXG

Validation video: https://drive.google.com/open?id=1rdxV1hYSQs6MNxBSIO9dNkAiBvb07aun

Ideas are based on:

  • LSTM object detection: the model achieves state-of-the-art performance among mobile methods on the ImageNet VID 2015 dataset, while running at up to 70+ FPS on a Pixel 3 phone: https://arxiv.org/abs/1903.10172v1

  • PANet reaches 1st place in the COCO 2017 Challenge Instance Segmentation task and 2nd place in the Object Detection task without large-batch training. It is also state-of-the-art on MVD and Cityscapes: https://arxiv.org/abs/1803.01534v4


The following are implemented:

  • convolutional-LSTM models for training and detection on video, currently without the interleaved lightweight network (it may be implemented later)

  • PANet models:

    • _pan networks use [reorg3d] + [convolutional] size=1 instead of Adaptive Feature Pooling (depth-maxpool) for the path aggregation; depth-maxpool may be implemented later
    • _pan2 networks use maxpooling across channels ([maxpool] maxpool_depth=1 out_channels=64, see the sketch below) as in the original PAN paper, except that the preceding layers are [convolutional] instead of [connected] for resizability
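For illustration, here is a minimal cfg sketch of the across-channels pooling used by the _pan2 networks (the values are illustrative, not taken from a specific cfg file):

```
# [maxpool] with maxpool_depth=1 pools across channels instead of
# spatially, keeping out_channels feature maps at the same resolution
[maxpool]
maxpool_depth=1
out_channels=64
```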
| Model (cfg & weights), network size 544x544 | Training chart | Validation video | BFLOPs | Inference time, RTX 2070 (ms) | mAP (%) |
|---|---|---|---|---|---|
| yolo_v3_spp_pan_lstm.cfg.txt (must be trained using frames from the video) | - | - | - | - | - |
| yolo_v3_tiny_pan3.cfg.txt and weights-file. Features: PAN3, AntiAliasing, Assisted Excitation, scale_x_y, Mixup, GIoU | chart | video | 14 | 8.5 | 67.3 |
| yolo_v3_tiny_pan5 matrix_gaussian_GIoU aa_ae_mixup_new.cfg.txt and weights-file. Features: MatrixNet, Gaussian-yolo + GIoU, PAN5, IoU_thresh, Deformation-block, Assisted Excitation, scale_x_y, Mixup, 512x512, use -thresh 0.6 | chart | video | 30 | 31 | 64.6 |
| yolo_v3_tiny_pan3 aa_ae_mixup_scale_giou blur dropblock_mosaic.cfg.txt and weights-file | chart | video | 14 | 8.5 | 63.51 |
| yolo_v3_spp_pan_scale.cfg.txt and weights-file | chart | video | 137 | 33.8 | 60.4 |
| yolo_v3_spp_pan.cfg.txt and weights-file | chart | video | 137 | 33.8 | 58.5 |
| yolo_v3_tiny_pan_lstm.cfg.txt and weights-file (must be trained using frames from the video) | chart | video | 23 | 14.9 | 58.5 |
| tiny_v3_pan3_CenterNet_Gaus ae_mosaic_scale_iouthresh mosaic.txt and weights-file | chart | video | 25 | 14.5 | 57.9 |
| yolo_v3_spp_lstm.cfg.txt and weights-file (must be trained using frames from the video) | chart | video | 102 | 26.0 | 57.5 |
| yolo_v3_tiny_pan3 matrix_gaussian aa_ae_mixup.cfg.txt and weights-file | chart | video | 13 | 19.0 | 57.2 |
| resnet152_trident.cfg.txt and weights-file (train using resnet152.201 pre-trained weights) | chart | video | 193 | 110 | 56.6 |
| yolo_v3_tiny_pan_mixup.cfg.txt and weights-file | chart | video | 17 | 8.7 | 52.4 |
| yolo_v3_spp.cfg.txt and weights-file (common old model) | chart | video | 112 | 23.5 | 51.8 |
| yolo_v3_tiny_lstm.cfg.txt and weights-file (must be trained using frames from the video) | chart | video | 19 | 12.0 | 50.9 |
| yolo_v3_tiny_pan2.cfg.txt and weights-file | chart | video | 14 | 7.0 | 50.6 |
| yolo_v3_tiny_pan.cfg.txt and weights-file | chart | video | 17 | 8.7 | 49.7 |
| yolov3-tiny_3l.cfg.txt (common old model) and weights-file | chart | video | 12 | 5.6 | 46.8 |
| yolo_v3_tiny_comparison.cfg.txt and weights-file (approximately the same conv layers as the conv + conv_lstm layers in yolo_v3_tiny_lstm.cfg) | chart | video | 20 | 10.0 | 36.1 |
| yolo_v3_tiny.cfg.txt (common old model) and weights-file | chart | video | 9 | 5.0 | 32.3 |

@i-chaochen commented May 20, 2019

Great work! Thank you very much for sharing this result.

LSTM indeed improves results. I wonder whether you have evaluated the inference time with LSTM as well?

Thanks

@AlexeyAB commented May 20, 2019

How to train LSTM networks:

  1. Use one of the cfg files with LSTM in the filename.

  2. Use a pre-trained weights file.

  3. Train it on sequential frames from one or several videos:

    • `./yolo_mark data/self_driving cap_video self_driving.mp4 1` - grabs every 1st frame from the video (the step can vary from 1 to 5)

    • `./yolo_mark data/self_driving data/self_driving.txt data/self_driving.names` - mark bboxes, even if at some point the object is invisible (occluded/obscured by another object)

    • `./darknet detector train data/self_driving.data yolo_v3_tiny_pan_lstm.cfg yolov3-tiny.conv.14 -map` - train the detector

    • `./darknet detector demo data/self_driving.data yolo_v3_tiny_pan_lstm.cfg backup/yolo_v3_tiny_pan_lstm_last.weights forward.avi` - run detection


If you encounter a CUDA out-of-memory error, halve the time_steps= value in your cfg file.
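For reference, a minimal sketch of the relevant [net] fields (an illustration, assuming the usual darknet semantics where the mini-batch is split into batch/time_steps sequences):

```
[net]
# batch should remain a multiple of time_steps: the network sees
# batch/time_steps sequences of time_steps consecutive frames each
batch=64
subdivisions=8
time_steps=16   # halve this (e.g. to 8) on CUDA out-of-memory errors
```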


The only condition is that the frames from the video must appear sequentially in the train.txt file.
You should validate results on a separate validation dataset; for example, divide your dataset in two:

  1. train.txt - the first 80% of frames (80% from video 1 + 80% from video 2, if you use frames from 2 videos)
  2. valid.txt - the last 20% of frames (20% from video 1 + 20% from video 2, if you use frames from 2 videos)

Or you can split by videos, for example:

  1. train.txt - frames from 8 of the videos
  2. valid.txt - frames from the other 2 videos

LSTM:
[diagram of the LSTM cell]

@AlexeyAB
@i-chaochen I added the inference time to the table. When I improve the inference time for LSTM-networks, I will change them.

@i-chaochen
> @i-chaochen I added the inference time to the table. When I improve the inference time for LSTM-networks, I will change them.

Thanks for the updates!
What unit is the inference time in, seconds? Is it for the whole video? How about the inference time per frame, or FPS?

@AlexeyAB commented May 20, 2019

@i-chaochen It is in milliseconds, I have fixed it :)

@i-chaochen
Interesting, it seems yolo_v3_spp_lstm has fewer BFLOPs (102) than yolo_v3_spp.cfg.txt (112), but it is still slower...

@AlexeyAB commented May 20, 2019

@i-chaochen
I removed some overhead (from calling many functions and from extra reads/writes to GPU RAM): I replaced these several calls (shown here for f; the same applies to i, g, o, c):

```c
// f = wf + uf + vf
copy_ongpu(l.outputs*l.batch, wf.output_gpu, 1, l.f_gpu, 1);
axpy_ongpu(l.outputs*l.batch, 1, uf.output_gpu, 1, l.f_gpu, 1);
if (l.peephole) axpy_ongpu(l.outputs*l.batch, 1, vf.output_gpu, 1, l.f_gpu, 1);
```

with the single fast function `add_3_arrays_activate(float *a1, float *a2, float *a3, size_t size, ACTIVATION a, float *dst);`
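For anyone curious what such a fused function can look like, here is a minimal CUDA sketch (an assumption for illustration, not the exact darknet implementation; the activation helper is simplified): one pass over GPU memory replaces the copy, the two axpy calls, and a separate activation pass.

```c
#include <math.h>
#include <stddef.h>

typedef enum { LOGISTIC, TANH_ACT } ACTIVATION;

// Gates f, i, o use a logistic activation; the candidate state g uses tanh.
__device__ static float activate_value(float x, ACTIVATION a)
{
    return (a == LOGISTIC) ? 1.f / (1.f + expf(-x)) : tanhf(x);
}

// Fused element-wise kernel: dst = activation(a1 + a2 [+ a3]).
__global__ void add_3_arrays_activate_kernel(const float *a1, const float *a2,
                                             const float *a3, size_t size,
                                             ACTIVATION a, float *dst)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= size) return;
    float v = a1[i] + a2[i];
    if (a3) v += a3[i];             // the peephole term is optional
    dst[i] = activate_value(v, a);
}
```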

@NickiBD commented May 23, 2019

Hi @AlexeyAB,
I am trying to use yolo_v3_tiny_lstm.cfg to improve small object detection on videos. However, I am getting the following error:

```
 14 Type not recognized: [conv_lstm]
Unused field: 'batch_normalize = 1'
Unused field: 'size = 3'
Unused field: 'pad = 1'
Unused field: 'output = 128'
Unused field: 'peephole = 0'
Unused field: 'activation = leaky'
 15 Type not recognized: [conv_lstm]
Unused field: 'batch_normalize = 1'
Unused field: 'size = 3'
Unused field: 'pad = 1'
Unused field: 'output = 128'
Unused field: 'peephole = 0'
Unused field: 'activation = leaky'
```

Could you please advise me on this?
Many thanks

@AlexeyAB

@NickiBD For these models you must use the latest version of this repository: https://github.com/AlexeyAB/darknet

@NickiBD commented May 23, 2019

@AlexeyAB

Thanks a lot for the help. I will update my repository.

@passion3394 commented May 25, 2019

@AlexeyAB Hi, how did you run yolov3-tiny on the Pixel smartphone? Could you give some tips? Thanks very much.

@NickiBD commented May 27, 2019

Hi @AlexeyAB,
I have trained yolo_v3_tiny_lstm.cfg and I want to convert it to .h5 and then to .tflite for a smartphone. However, I am getting `Unsupported section header type: conv_lstm_0` and an unsupported-operation error while converting. I really need to solve this issue. Could you please advise me on this?
Many thanks.

@AlexeyAB

@NickiBD Hi,

Which repository and which script do you use for this conversion?

@NickiBD commented May 27, 2019

Hi @AlexeyAB,
I am using the converter in Adamdad/keras-YOLOv3-mobilenet to convert to .h5, and it worked for other models, e.g. yolo-v3-tiny 3 layers, modified yolov3, etc. Could you please tell me which converter to use?

Many thanks.

@AdamCuellar commented Apr 2, 2020

Hey @AlexeyAB, could you help me use the LSTM cfgs properly? Currently, regular yolov3 does much better on a custom dataset. Files are in sequential order in the training file; some videos have 200 frames and others 900. The file the mAP is calculated on has videos with 900 frames.

Yolov3: yolov3-obj.cfg.txt
[training chart]

Yolov3-tiny-pan-lstm: yolo_v3_tiny_pan_lstm.cfg.txt
[training chart]

I don't have the chart for the following:
Yolov3-spp-lstm: highest mAP is around 60%
yolo_v3_spp_lstm.cfg.txt

@AdamCuellar

@AlexeyAB any idea how to improve the performance in the case mentioned above?

@kaishijeng

Any plan to add lstm to yolov4?

Thanks,

@i-chaochen commented May 3, 2020

> Any plan to add lstm to yolov4?

I don't think it's necessary: LSTM / conv-LSTM is designed for the video scenario, where there is a sequence-to-sequence "connection" between frames, while YOLOv4 should be a general model for image object detection, as on the MS COCO or ImageNet benchmarks.

You can add it to your model if your YOLOv4 is used on video.

@Witek- commented May 15, 2020

I am processing traffic scenes from a stationary camera, so I think LSTM could be helpful. How do I actually add it to YOLOv4?

@LucasSloan

Is there a way to train an LSTM layer on top of an already-trained network?

@i-chaochen

> Is there a way to train an LSTM layer on top of an already-trained network?

The purpose of LSTM is to "memorize" features across frames. If you add it at the very top/beginning of the trained CNN network, which hasn't learned anything high-level yet, the LSTM won't learn or memorize anything.

This paper gives some insights about where to put the LSTM to get the optimal result; basically, it should be after Conv-13 (see the hedged sketch below):

https://arxiv.org/pdf/1711.06368.pdf
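For illustration, a hedged cfg sketch (the layer sizes are placeholders, not from an official file) of placing a [conv_lstm] block after the deep convolutional features, as the paper suggests:

```
# ... backbone layers up to the high-level features (e.g. Conv-13) ...

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

# recurrent block inserted after the deep features
[conv_lstm]
batch_normalize=1
size=3
pad=1
output=256
peephole=0
activation=leaky
```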

@AlexeyAB

@i-chaochen
Maybe I will add this cheap conv Bottleneck-LSTM: #5774

I think the more complex the recurrent layer is, the later we should add it.
So conv-RNN can be used for Conv1-13, and conv-LSTM for Conv13-FM.

In this case maybe we should create a workaround for CRNN:

```
[crnn]

[route]
layers=-1,-2
```
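A hedged illustration of that workaround (the [crnn] parameter values are placeholders): the [route] concatenates the recurrent output with the plain convolutional features from just before the [crnn] block, so later layers see both.

```
[crnn]
batch_normalize=1
size=3
pad=1
output=128
hidden=128
activation=leaky

# -1 = the [crnn] output, -2 = the layer before the [crnn] block
[route]
layers=-1,-2
```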

@AlexeyAB

Does memory consumption increase over time and eventually lead to running out of memory?

@i-chaochen

> Does memory consumption increase over time and eventually lead to running out of memory?

Speaking of memory consumption, maybe you can have a look at gradient checkpointing:
https://github.com/cybertronai/gradient-checkpointing

It can save significant memory during training.

@smallerhand

@AlexeyAB
Hi, I am grateful for the YOLO versions and Yolo-LSTM. But is LSTM only applicable to YOLOv3?
If LSTM can also be applied to YOLOv4, I would really appreciate it if you let me know how to do that.

@AlexeyAB

@smallerhand It is in progress.
Did you train https://github.com/AlexeyAB/darknet/files/3199654/yolo_v3_spp_lstm.cfg.txt on video?
Did you get any improvements?

@smallerhand

@AlexeyAB
Thank you for your reply!
Is yolo_v3_spp_lstm.cfg your recommendation? I will try it, although I can only compare it with yolov4.

@HaolyShiit

> Implement a Yolo-LSTM detection network that will be trained on video frames to increase mAP and solve blinking issues.

@AlexeyAB, hello. What are the blinking issues? Does it mean that objects can be detected in one frame, but not in the next one?

@fabiozappo

Hi Alexey, I really appreciate your work and the improvements over the previous pjreddie repo. I had a Yolov3 people detector trained on custom dataset videos using single frames; now I want to test your new Yolov4 model and conv-lstm layers. I trained the model with yolov4-custom.cfg and results improved just by doing this, and I am now wondering how to add temporal information (i.e. conv-lstm layers).
Is it possible? If yes, how do I have to modify the cfg file, perform transfer learning, and then run the training?

@arnaud-nt2i commented Sep 10, 2020

@smallerhand Have you done a comparison between yolo_v3_spp_lstm.cfg and yolov4? What are the results?
Have you tried comparing with yolo_v3_tiny_constrastive.cfg from #6004?

@HaolyShiit Blinking issues can mean any of the following:

  • objects are detected in one frame but not in the following one

  • the class jumps from one to another across two consecutive frames

  • within the same class, bounding boxes change in size more than needed, causing flickering

@fabiozappo It is not yet possible to add LSTM to YoloV4; Alexey is actively working on it.

@arnaud-nt2i

TO ALL PEOPLE READING THIS PAGE: in order to try those LSTM models, you have to use the "Yolo v3 optimal" release,
here: https://github.com/AlexeyAB/darknet/releases/tag/darknet_yolo_v3_optimal

@HaolyShiit

@arnaud-nt2i
Thank you very much! I will try the "Yolo v3 optimal" repo.

@AdamCuellar commented Mar 2, 2022

@AlexeyAB

If you're interested in fixing the conv_lstm module, the issue is in conv_lstm_layer.c at line 1457:

darknet/src/conv_lstm_layer.c, lines 1450 to 1458 at b4d03f8:

```c
if (l.bottleneck) {
    reset_nan_and_inf(l.bottelneck_delta_gpu, l.outputs*l.batch*2);
    //constrain_ongpu(l.outputs*l.batch*2, 1, l.bottelneck_delta_gpu, 1);
    if (l.dh_gpu) axpy_ongpu(l.outputs*l.batch, l.time_normalizer, l.bottelneck_delta_gpu, 1, l.dh_gpu, 1);
    axpy_ongpu(l.outputs*l.batch, 1, l.bottelneck_delta_gpu + l.outputs*l.batch, 1, state.delta, 1); // leads to nan
}
else {
    axpy_ongpu(l.outputs*l.batch, l.time_normalizer, l.temp3_gpu, 1, l.dh_gpu, 1);
}
```

The else-branch should check for l.dh_gpu:

```c
if (l.dh_gpu) axpy_ongpu(l.outputs*l.batch, l.time_normalizer, l.temp3_gpu, 1, l.dh_gpu, 1);
```

This solves the CUDA errors but can cause NaNs during training. To avoid this, I commented it out completely. I trained on the small self-driving dataset with some of the cfgs you provided above and got these results.

yolov3-tiny-pan_lstm.cfg.txt
(I had to add bottleneck to avoid CUDA errors before the fix)
[training chart]

yolov3-tiny-pan.cfg.txt
(after the fix)
[training chart]

yolov3-tiny-pan_lstm_noBottleNeck.cfg.txt
(after the fix)
[training chart]

yolov4-tiny_smallSelfDriving.cfg.txt
(yolov4-tiny-custom for comparison)
[training chart]

@AlexeyAB commented Mar 2, 2022

@AdamCuellar Thanks! Could you add a PR with the line commented out: `//if (l.dh_gpu) axpy_ongpu(l.outputs*l.batch, l.time_normalizer, l.temp3_gpu, 1, l.dh_gpu, 1);`?

@AdamCuellar

@AlexeyAB Yep, done!
