Given the URL of a YouTube video (here, the Dior - Eau de Parfum commercial), generate a new video that shows the presence of humans by drawing bounding boxes around them. The pipeline is as follows:
- split the video into frames;
- apply a detection model every K-th frame;
- for each frame where no detection has been run, interpolate the bounding boxes;
- draw the bounding boxes found on each frame;
- recombine the frames into the final video (steps 1 and 5 are sketched right below).
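For steps 1 and 5, a minimal sketch in Python using OpenCV might look like the following (the helpers split_into_frames and recombine are illustrative, not the script's actual functions):

```python
import cv2

def split_into_frames(path):
    # Read every frame of the video into memory, keeping the FPS for later.
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)  # BGR numpy array of shape (H, W, 3)
        ok, frame = cap.read()
    cap.release()
    return frames, fps

def recombine(frames, fps, out_path="output.mp4"):
    # Write the (possibly annotated) frames back into a video file.
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
```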
👉 For the detection model, the script loads either RetinaNet or Faster R-CNN from torchvision.models.
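As an illustration, loading either model could look like this (the load_detector helper and its model_name argument are hypothetical and simply mirror the --model option):

```python
# A minimal sketch, assuming torchvision >= 0.8 (the version shipped with PyTorch 1.7).
from torchvision.models import detection

def load_detector(model_name="retinanet"):
    if model_name == "fasterrcnn":
        model = detection.fasterrcnn_resnet50_fpn(pretrained=True)
    else:
        model = detection.retinanet_resnet50_fpn(pretrained=True)
    model.eval()  # inference mode: the model returns boxes, labels and scores per image
    return model
```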
👉 To speed up processing, person detection is performed only on every K-th frame, where K is a parameter chosen by the user (argument --stride=K).
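A sketch of this striding logic, assuming frames are float tensors in [0, 1] as expected by torchvision detection models (detect_every_k is a hypothetical helper):

```python
import torch

def detect_every_k(model, frames, stride=4):
    # frames: list of float tensors (C, H, W) scaled to [0, 1]
    detections = {}
    with torch.no_grad():
        for i in range(0, len(frames), stride):
            out = model([frames[i]])[0]   # dict with 'boxes', 'labels', 'scores'
            keep = out["labels"] == 1     # class 1 is 'person' in the COCO label set
            detections[i] = (out["boxes"][keep], out["scores"][keep])
    return detections
```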
👉 To find the bounding boxes on frames where no detection is performed, the script interpolates boxes between two successive frames where a detection has been made. To decide whether two boxes on two different frames correspond to the same person, it first checks that they have a similar size, and then selects the pair whose interpolation yields the highest minimum IoU between successive interpolated boxes. Finally, it checks that this IoU score is above some threshold (to avoid interpolating boxes that are too far apart) and that the confidence level of at least one of the two boxes is high enough (to avoid detecting false positives).
The principle of the algorithm is described below, where T(box) is the box of the same size with its lower-left corner (in Cartesian coordinates) translated to the origin, and conf(box) is the confidence level associated with a box:
    input:  boxes0, boxes1 (boxes found on two successive frames, frame0 and frame1,
            where detection has been run, ordered by decreasing confidence level)
    output: final_boxes (list of bounding boxes for each frame between frame0 and frame1)
    params: min_IoU, min_conf

    for each box1 in boxes1:
        best_IoU <- 0
        best_box <- None
        for each box0 in boxes0:
            if IoU(T(box0), T(box1)) < min_IoU:   # sizes too dissimilar
                continue
            interpolate(box0, box1)               # generate candidate interpolated boxes
            score <- min(IoU between successive interpolated boxes)
            if score > best_IoU:
                best_IoU <- score
                best_box <- box0
        if best_IoU > min_IoU and (conf(best_box) > min_conf or conf(box1) > min_conf):
            boxes0 <- boxes0 \ {best_box}
            add each box in interpolate(best_box, box1) to final_boxes
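The helpers used by this pseudocode could be implemented as follows; this is an assumed sketch, not the script's actual code, with boxes stored as (x1, y1, x2, y2) tuples:

```python
def T(box):
    # Translate the box so its reference corner sits at the origin; only the
    # width and height remain, so IoU(T(a), T(b)) measures size similarity.
    x1, y1, x2, y2 = box
    return (0.0, 0.0, x2 - x1, y2 - y1)

def iou(a, b):
    # Standard intersection-over-union between two axis-aligned boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def interpolate(box0, box1, n_steps):
    # Linearly interpolate each coordinate over the n_steps frames in between.
    return [tuple(c0 + (c1 - c0) * t / (n_steps + 1) for c0, c1 in zip(box0, box1))
            for t in range(1, n_steps + 1)]
```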
👉 Before running this algorithm, a first selection filters out the bounding boxes whose confidence level is below some threshold eps (argument --eps).
🚩 The above algorithm will display a box even if its confidence level is below min_conf, provided a similar box with a high confidence level is detected on the previous or the next frame. In other words, it decreases the number of false negative detections (and symmetrically increases the number of false positives).
🚩 On the other hand, if the primary concern is reducing the number of false positive detections, it is better to set eps = min_conf = 0.5 (or any suitable value).
First, set up a Python virtual environment (named env here) with PyTorch and torchvision (see details here). The script has been tested with Python 3.8 and PyTorch 1.7.
Then, install the required packages:
(env)$ python -m pip install -U -r requirements.txt
To generate the output video, activate the virtual environment and run the script:
(env)$ python detect.py --stride=4 --eps=0.3 --min_conf=0.7 --min_iou=0.66 --with_conf
The input video is automatically downloaded into the same folder as the script detect.py, and the output video is created at the same location. To try another video, add --url='https://youtu.be/<xxx>'.
To use the GPU, add the argument --with_gpu and choose an adequate batch size (say 4) with --batch_size=4.
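A batched GPU inference loop mirroring these two options might look like this (detect_batched is a hypothetical helper, not the script's actual code):

```python
import torch

def detect_batched(model, frames, batch_size=4, device="cuda"):
    # frames: list of float tensors (C, H, W) in [0, 1]
    model.to(device)
    results = []
    with torch.no_grad():
        for i in range(0, len(frames), batch_size):
            batch = [f.to(device) for f in frames[i:i + batch_size]]
            results.extend(model(batch))  # one dict per image in the batch
    return results
```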
The default detection model is RetinaNet. To use Faster R-CNN instead, add the argument --model='fasterrcnn'.
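Finally, the drawing step (step 4 of the pipeline) could be sketched with OpenCV as below; draw_boxes is a hypothetical helper, and the confidence overlay mirrors the --with_conf flag:

```python
import cv2

def draw_boxes(frame, boxes, confs=None, with_conf=False):
    # Draw each (x1, y1, x2, y2) box; optionally print its confidence above it.
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        if with_conf and confs is not None:
            cv2.putText(frame, f"{confs[i]:.2f}", (int(x1), int(y1) - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return frame
```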