ActionFormer

This folder contains all the scripts and data required to use BSH's cooking video dataset with the ActionFormer model, a transformer-based network for action detection, as well as the modified version of the original ActionFormer used during the project.

Video metadata

The script process-videos.py extracts information from each video in the input folder, such as its resolution, framerate and duration, and writes it to the video_information.json file. It also renames the videos to match the names used in the annotation files.

The script also divides the dataset into a training subset and a validation subset, randomly assigning each video to one of them with a probability specified as input. To call the script, use:

python process-videos.py <path to video folder> <output_file> <training_set_chance>
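
For reference, here is a minimal sketch of what the metadata extraction and the random training/validation split might look like, assuming OpenCV is used to read the video properties; the .mp4 extension, JSON field names and subset labels are illustrative and not necessarily the exact ones produced by process-videos.py:

# Illustrative sketch only; field names, the ".mp4" extension and subset labels are assumptions.
import cv2, json, random, sys
from pathlib import Path

video_dir, output_file, train_chance = sys.argv[1], sys.argv[2], float(sys.argv[3])

info = {}
for video in Path(video_dir).glob("*.mp4"):
    cap = cv2.VideoCapture(str(video))
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    info[video.stem] = {
        "resolution": [int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                       int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))],
        "fps": fps,
        "duration": frame_count / fps if fps else None,
        # Randomly assign each video to the training or validation subset.
        "subset": "training" if random.random() < train_chance else "validation",
    }
    cap.release()

with open(output_file, "w") as f:
    json.dump(info, f, indent=2)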

Action Annotations

To adapt the previously available annotations in the dataset, you can use the convert-annotations.py script, which transforms the CSV file into two JSON annotation files, one for the verbs and one for the nouns. The script can be used as follows:

python convert-annotations.py <input annotations> <video info file> <output name>

Using this script, I translated the annotations in the file datasetEntero_verbosYnombres.csv (provided by Alex) to the files video_annotations_verbs|nouns.json, which follow the format used by ActionFormer. The script also outputs the file vid_list.csv, which contains the list of annotated videos and is required by the SlowFast feature extractor. This file has to be placed in the same folder as the videos whose features are going to be extracted.
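
As a rough orientation, the conversion can be thought of as grouping the CSV rows by video and emitting an ActivityNet-style JSON database, which is the layout ActionFormer's EpicKitchens annotations follow. The sketch below only covers the verb file and assumes hypothetical column names (video_id, start, end, verb, verb_id) that may not match datasetEntero_verbosYnombres.csv:

# Hypothetical sketch of the CSV-to-JSON conversion; column names are assumptions.
import csv, json, sys
from collections import defaultdict

annotations_csv, video_info_file, output_name = sys.argv[1], sys.argv[2], sys.argv[3]

with open(video_info_file) as f:
    video_info = json.load(f)

database = defaultdict(lambda: {"annotations": []})
with open(annotations_csv) as f:
    for row in csv.DictReader(f):
        vid = row["video_id"]                       # hypothetical column name
        database[vid].setdefault("subset", video_info[vid]["subset"])
        database[vid].setdefault("fps", video_info[vid]["fps"])
        database[vid].setdefault("duration", video_info[vid]["duration"])
        database[vid]["annotations"].append({
            "segment": [float(row["start"]), float(row["end"])],
            "label": row["verb"],                   # use the noun columns for the noun file
            "label_id": int(row["verb_id"]),
        })

with open(f"{output_name}_verbs.json", "w") as f:
    json.dump({"version": "BSH", "database": database}, f, indent=2)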

Furthermore, the remove-unused-videos.py script deletes any videos from the video folder that have not been annotated, that is, videos not listed in vid_list.csv. Do NOT execute it if you still need those videos.
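
Its core logic is essentially the following sketch, assuming vid_list.csv contains one video name per row and that the videos are .mp4 files:

# Sketch only: deletes every video whose name is not listed in vid_list.csv.
import csv, sys
from pathlib import Path

video_dir, vid_list = Path(sys.argv[1]), sys.argv[2]

with open(vid_list) as f:
    annotated = {row[0] for row in csv.reader(f) if row}

for video in video_dir.glob("*.mp4"):
    if video.stem not in annotated:
        print(f"Deleting {video}")
        video.unlink()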

Gluon CV

To extract the features we use the Gluon CV toolkit, which supports feature extraction from videos with several models, including SlowFast. One of its main problems is that, if we set the number of segments of the video from which to extract features, the process terminates with a memory error, so it was necessary to write a script that divides each video into 32-frame clips.

This script, split_videos.py, can be executed with:

python split_videos.py <video folder> <video clip list file> <clip size in frames> <frame stride>
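
The splitting itself boils down to a sliding window over the decoded frames. The sketch below uses OpenCV and interprets <frame stride> as the step between consecutive clip starts; the clip naming scheme and the clip list format expected by Gluon CV are assumptions:

# Illustrative sketch of the video splitting; clip naming and list format are assumptions.
import cv2, sys
from pathlib import Path

video_dir, clip_list_file = Path(sys.argv[1]), sys.argv[2]
clip_size, stride = int(sys.argv[3]), int(sys.argv[4])

with open(clip_list_file, "w") as clip_list:
    for video in video_dir.glob("*.mp4"):
        cap = cv2.VideoCapture(str(video))
        fps = cap.get(cv2.CAP_PROP_FPS)
        size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
        buffer, clip_idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            buffer.append(frame)
            if len(buffer) == clip_size:
                clip_path = video.with_name(f"{video.stem}_clip{clip_idx:05d}.mp4")
                writer = cv2.VideoWriter(str(clip_path),
                                         cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
                for f in buffer:
                    writer.write(f)
                writer.release()
                clip_list.write(f"{clip_path.name}\n")
                clip_idx += 1
                buffer = buffer[stride:]  # slide the window by <frame stride>
        cap.release()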

Once the videos are split, we can extract their features using Gluon CV (which must be installed) with the following command:

python feat_extract.py --data-list video.txt --model slowfast_4x16_resnet50_kinetics400 --save-dir ./features \
--slowfast --slow-temporal-stride 8 --fast-temporal-stride 1 --new-length 32 --num-segments 1 --use-pretrained \
--gpu-id 1

Finally, the script compress_features.py merges the features extracted into individual clip files into a single file for each video. It can be executed with:

python compress_features.py <clip feature folder> <output folder>
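
Conceptually this is just stacking the per-clip arrays for each video, roughly as in the following sketch, which assumes the clip features are .npy files whose names start with the video name followed by _clip:

# Sketch of the feature compression; the clip file naming convention is an assumption.
import sys
import numpy as np
from pathlib import Path

clip_feature_dir, output_dir = Path(sys.argv[1]), Path(sys.argv[2])
output_dir.mkdir(parents=True, exist_ok=True)

# Group clip feature files by the video they belong to.
by_video = {}
for feat_file in sorted(clip_feature_dir.glob("*.npy")):
    video_name = feat_file.stem.split("_clip")[0]
    by_video.setdefault(video_name, []).append(feat_file)

for video_name, files in by_video.items():
    features = np.stack([np.load(f).squeeze() for f in files])  # (num_clips, feat_dim)
    np.save(output_dir / f"{video_name}.npy", features)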

ActionFormer: Training and evaluation

The model requires two independent training runs, one for the nouns and another for the verbs. To train the model, use:

python ./train.py ./configs/bsh_verbs.yaml --output reproduce
python ./train.py ./configs/bsh_nouns.yaml --output reproduce

And for testing:

python ./eval.py ./configs/bsh_verbs.yaml ./ckpt/bsh_verbs_reproduce/
python ./eval.py ./configs/bsh_nouns.yaml ./ckpt/bsh_nouns_reproduce/

The config files used are the ones provided for the EpicKitchens dataset in the ActionFormer repository, modified for our own data. After testing, the model outputs two CSV files, ground_truth.csv and preds.csv, which contain the ground-truth action intervals and the predicted ones.

If you would like to use either the TemporalMaxer or the mixed model, you will need to modify the backbone_type parameter in the ActionFormer file libs/core/config.py. The alpha parameter in the same file controls the relative weight given to the MaxPooling branch and the Transformer branch.
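
As a hedged illustration only, the edit refers to entries of the default configuration dictionary in libs/core/config.py; the exact keys and accepted values, in particular alpha, depend on the modified version in this repository:

# Excerpt-style sketch; keys and values are assumptions based on the upstream defaults.
DEFAULTS = {
    "model": {
        # "convTransformer" is the original ActionFormer backbone; switch this
        # value to select the TemporalMaxer or the mixed backbone.
        "backbone_type": "convTransformer",
        # Relative weight given to the MaxPooling branch vs. the Transformer
        # branch in the mixed model (exact semantics depend on the modified code).
        "alpha": 0.5,
    },
}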

Prediction Results

To better visualize the results obtained after inference, we provide the show_predictions.py script, which plots the predicted action intervals alongside the ground-truth ones. To run it, you will need the ground_truth.csv and preds.csv files produced by the evaluation, as well as a CSV file that contains each label's id and name.

To run the script, execute:

python show_predictions.py --ground_truth <ground csv file> --predictions <preds csv file> \
--label_names <label names file> --threshold <value> --separated
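
The plot itself is essentially two rows of horizontal interval bars, one for the ground truth and one for the predictions, as in this minimal matplotlib sketch; the column names (video-id, t-start, t-end, score) are assumed to follow ActionFormer's output convention and may differ:

# Minimal plotting sketch; CSV column names are assumptions.
import sys
import pandas as pd
import matplotlib.pyplot as plt

ground_truth = pd.read_csv(sys.argv[1])
preds = pd.read_csv(sys.argv[2])
score_threshold = 0.3                                  # cf. the --threshold option

video_id = ground_truth["video-id"].iloc[0]            # plot a single video
gt = ground_truth[ground_truth["video-id"] == video_id]
pr = preds[(preds["video-id"] == video_id) & (preds["score"] >= score_threshold)]

fig, ax = plt.subplots(figsize=(12, 3))
for _, row in gt.iterrows():
    ax.hlines(1, row["t-start"], row["t-end"], colors="green", linewidth=6)
for _, row in pr.iterrows():
    ax.hlines(0, row["t-start"], row["t-end"], colors="red", linewidth=6)
ax.set_yticks([0, 1])
ax.set_yticklabels(["predictions", "ground truth"])
ax.set_xlabel("time (s)")
plt.show()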

Furthermore, the --help option lists all the available options. If the --web flag is set and the script is launched with streamlit run show_predictions.py -- [ARGS], it generates an interactive HTML plot using Streamlit.

Finally, the confusion_matrix.py script generates a confusion matrix. It requires both interval files, predictions and ground truth, as well as the label names file. You can use the --help option to show all the available options.

To execute it:

python confusion_matrix.py <ground truth file> <predictions file> <label names file> \
<IoU threshold> <score threshold>
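
The underlying idea is to match each ground-truth interval to the best-overlapping prediction above the score threshold and count the resulting label pairs, roughly as in the sketch below; the column names and the exact matching strategy used by confusion_matrix.py may differ:

# Sketch of IoU-based matching for the confusion matrix; column names are assumptions.
import sys
import numpy as np
import pandas as pd

def tiou(s1, e1, s2, e2):
    # Temporal intersection over union of two intervals.
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = max(e1, e2) - min(s1, s2)
    return inter / union if union > 0 else 0.0

gt = pd.read_csv(sys.argv[1])
preds = pd.read_csv(sys.argv[2])
labels = pd.read_csv(sys.argv[3])                      # one id,name pair per row
iou_thr, score_thr = float(sys.argv[4]), float(sys.argv[5])

preds = preds[preds["score"] >= score_thr]
conf = np.zeros((len(labels), len(labels)), dtype=int)

for _, g in gt.iterrows():
    candidates = preds[preds["video-id"] == g["video-id"]]
    best, best_iou = None, iou_thr
    for _, p in candidates.iterrows():
        overlap = tiou(g["t-start"], g["t-end"], p["t-start"], p["t-end"])
        if overlap >= best_iou:
            best, best_iou = p, overlap
    if best is not None:
        conf[int(g["label"]), int(best["label"])] += 1  # rows: ground truth, cols: predicted

print(conf)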
