A collection of works on self-supervised deep learning for video. The papers listed here are covered in our survey:
Self-Supervised Learning for Videos: A Survey
Madeline Chantry Schiappa, Yogesh Singh Rawat, Mubarak Shah
In this survey, we provide a review of existing approaches to self-supervised learning, focusing on the video domain. We summarize these methods into four categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets and downstream evaluation tasks, and provide insights into the limitations of existing works and potential future directions in this area.
Statistics of self-supervised (SSL) video representation learning research in recent years. From left to right we show: a) the total number of SSL-related papers published in top conference venues, b) a categorical breakdown of the main research topics studied in SSL, and c) a breakdown of the modalities used in SSL. The year 2022 is incomplete because a majority of the conferences occur later in the year.
Action recognition performance of models over time for different self-supervised strategies and modalities: video-only (V), video-text (V+T), video-audio (V+A), and video-text-audio (V+T+A). More recently, contrastive learning has become the most popular strategy.
Downstream evaluation of action recognition on pretext-task self-supervised learning measured by prediction accuracy. Top scores are in bold. Playback-speed-related tasks typically perform best.
Model | Subcategory | Visual Backbone | Pre-Train | UCF101 | HMDB51 |
---|---|---|---|---|---|
Geometry | Appearance | AlexNet | UCF101/HMDB51 | 54.10 | 22.60 |
Wang et al. | Appearance | C3D | UCF101 | 61.20 | 33.40 |
3D RotNet | Appearance | 3D R-18 | MT | 62.90 | 33.70 |
VideoJigsaw | Jigsaw | CaffeNet | Kinetics | 54.70 | 27.00 |
3D ST-puzzle | Jigsaw | C3D | Kinetics | 65.80 | 33.70 |
CSJ | Jigsaw | R(2+3)D | Kinetics+UCF101+HMDB51 | 79.50 | 52.60 |
PRP | Speed | R3D | Kinetics | 72.10 | 35.00 |
SpeedNet | Speed | S3D-G | Kinetics | 81.10 | 48.80 |
Jenni et al. | Speed | R(2+1)D | UCF101 | 87.10 | 49.80 |
PacePred | Speed | S3D-G | UCF101 | 87.10 | 52.60 |
ShuffleLearn | Temporal Order | AlexNet | UCF101 | 50.90 | 19.80 |
OPN | Temporal Order | VGG-M | UCF101 | 59.80 | 23.80 |
O3N | Temporal Order | AlexNet | UCF101 | 60.30 | 32.50 |
ClipOrder | Temporal Order | R3D | UCF101 | 72.40 | 30.90 |
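To make the speed-based pretext tasks above concrete, below is a minimal, hypothetical PyTorch sketch of playback-speed prediction: a clip is subsampled from a video at a random temporal stride, and the network is trained to classify which stride was used. The tiny encoder, the `SPEEDS` set, and all tensor shapes are illustrative assumptions, not the setup of any specific paper such as SpeedNet or PacePred.

```python
import torch
import torch.nn as nn

SPEEDS = [1, 2, 4, 8]  # candidate playback speeds (temporal subsampling strides)

class SpeedPredictor(nn.Module):
    """Toy 3D-conv encoder plus a linear head that classifies playback speed."""
    def __init__(self, num_speeds=len(SPEEDS)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(16, num_speeds)

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        return self.head(self.encoder(clip).flatten(1))

def make_speed_sample(video, num_frames=8):
    """Subsample a (3, T, H, W) video at a random stride; the stride index is the label."""
    label = torch.randint(len(SPEEDS), (1,)).item()
    idx = (torch.arange(num_frames) * SPEEDS[label]).clamp(max=video.shape[1] - 1)
    return video[:, idx], label

video = torch.randn(3, 64, 112, 112)              # fake RGB video, 64 frames
clip, label = make_speed_sample(video)
logits = SpeedPredictor()(clip.unsqueeze(0))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([label]))
loss.backward()
```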
Performance for the downstream video retrieval task with top scores for each category in bold. K/U/H indicates using all three datasets for pre-training, i.e. Kinetics, UCF101, and HMDB51.
Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
---|---|---|---|---|---|---|
SpeedNet | Pretext | Speed | S3D-G | Kinetics | 28.10 | -- |
ClipOrder | Pretext | Temporal Order | R3D | UCF101 | 30.30 | 22.90 |
OPN | Pretext | Temporal Order | CaffeNet | UCF101 | 28.70 | -- |
CSJ | Pretext | Jigsaw | R(2+3)D | K/U/H | 40.50 | -- |
PRP | Pretext | Speed | R3D | Kinetics | 38.50 | 27.20 |
Jenni et al. | Pretext | Speed | 3D R-18 | Kinetics | 48.50 | -- |
PacePred | Pretext | Speed | R(2+1)D | UCF101 | 49.70 | 32.20 |
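For context, the R@5 values above are usually obtained by nearest-neighbor clip retrieval: each test clip's feature queries the training set, and the query counts as correct if any of its top-5 neighbors shares its action class. Below is a minimal sketch of that evaluation, assuming clip features have already been extracted with the frozen pre-trained backbone; the `recall_at_k` helper and the random tensors are purely illustrative.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=5):
    """Fraction of queries whose k nearest gallery clips (by cosine similarity)
    contain at least one clip of the same action class."""
    q = F.normalize(query_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    topk = (q @ g.t()).topk(k, dim=1).indices          # (num_queries, k)
    hits = (gallery_labels[topk] == query_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Toy usage with random 512-d clip features and 101 action classes.
qf, gf = torch.randn(200, 512), torch.randn(2000, 512)
ql, gl = torch.randint(101, (200,)), torch.randint(101, (2000,))
print(f"R@5: {recall_at_k(qf, ql, gf, gl):.3f}")
```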
Downstream action recognition evaluation for models that use a generative self-supervised pre-training approach. Top scores are in bold.
Model | Subcategory | Visual Backbone | Pre-train | UCF101 | HMDB51 |
---|---|---|---|---|---|
Mathieu et al. | Frame Prediction | C3D | Sports1M | 52.10 | -- |
VideoGan | Reconstruction | VAE | Flickr | 52.90 | -- |
Liang et al. | Frame Prediction | LSTM | UCF101 | 55.10 | -- |
VideoMoCo | Frame Prediction | R(2+1)D | Kinetics | 78.70 | 49.20 |
MemDPC-Dual | Frame Prediction | R(2+3)D | Kinetics | 86.10 | 54.50 |
Tian et al. | Reconstruction | 3D R-101 | Kinetics | 88.10 | 59.00 |
VideoMAE | MAE | ViT-L | ImageNet | 91.3 | 62.6 |
MotionMAE | MAE | ViT-B | Kinetics | 96.3 | -- |
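The strongest generative entries above are masked autoencoders (the MAE subcategory). The sketch below is a deliberately tiny, hypothetical version of the idea rather than any specific model: the video is split into space-time tubes, a large fraction of tubes is hidden, and only the hidden tubes are reconstructed. Real methods such as VideoMAE use ViT encoders, tube-masking strategies, and normalized pixel targets; the MLP encoder, 90% mask ratio, and shapes here are assumptions for illustration.

```python
import torch
import torch.nn as nn

def tube_patchify(video, ps=16, ts=2):
    """Split a (B, 3, T, H, W) video into non-overlapping space-time tubes,
    returning (B, num_tubes, tube_dim)."""
    B, C, T, H, W = video.shape
    x = video.reshape(B, C, T // ts, ts, H // ps, ps, W // ps, ps)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)          # group tube indices first
    return x.reshape(B, -1, C * ts * ps * ps)

class TinyMaskedAutoencoder(nn.Module):
    def __init__(self, tube_dim, width=256, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.Sequential(nn.Linear(tube_dim, width), nn.GELU(),
                                     nn.Linear(width, width))
        self.decoder = nn.Linear(width, tube_dim)
        self.mask_token = nn.Parameter(torch.zeros(width))

    def forward(self, tubes):                      # tubes: (B, N, D)
        B, N, _ = tubes.shape
        keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N).argsort(dim=1)     # random tube masking
        visible, masked = perm[:, :keep], perm[:, keep:]
        batch = torch.arange(B).unsqueeze(1)
        latent = torch.zeros(B, N, self.mask_token.numel())
        latent[batch, visible] = self.encoder(tubes[batch, visible])
        latent[batch, masked] = self.mask_token
        pred = self.decoder(latent)
        # Reconstruction loss only on the masked tubes.
        return nn.functional.mse_loss(pred[batch, masked], tubes[batch, masked])

video = torch.randn(2, 3, 16, 224, 224)            # fake clip batch
tubes = tube_patchify(video)
loss = TinyMaskedAutoencoder(tube_dim=tubes.shape[-1])(tubes)
loss.backward()
```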
Downstream evaluation of action recognition on self-supervised learning measured by prediction accuracy for Something-Something (SS) and Kinetics400 (Kinetics). SS is a more temporally relevant dataset and is therefore more challenging. Top scores for each category are in bold and second-best scores underlined.
Model | Category | Subcategory | Visual Backbone | Pre-Train | SS | Kinetics |
---|---|---|---|---|---|---|
BEVT | Generative | MAE | SWIN-B | Kinetics+ImageNet | 71.4 | 81.1 |
MAE | Generative | MAE | ViT-H | Kinetics | 74.1 | 81.1 |
MaskFeat | Generative | MAE | MViT | Kinetics | 74.4 | 86.7 |
VideoMAE | Generative | MAE | ViT-L | ImageNet | 75.3 | 85.1 |
MotionMAE | Generative | MAE | ViT-B | Kinetics | 75.5 | 81.7 |
Performance for the downstream video retrieval task with top scores for each category in bold. K/U/H indicates using all three datasets for pre-training, i.e. Kinetics, UCF101, and HMDB51.
Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
---|---|---|---|---|---|---|
MemDPC-RGP | Generative | Frame Prediction | R(2+3)D | Kinetics | 40.40 | 25.70 |
MemDPC-Flow | Generative | Frame Prediction | R(2+3)D | Kinetics | 63.20 | 37.60 |
Performance for the downstream text-to-video retrieval task. Top scores for each category are in bold. Masked Modeling (MM) is a generative approach that uses both video and text. Cross-modal agreement covers a variety of contrastive approaches that can pair video with audio and/or text; these pre-training approaches typically perform best. Models marked with an asterisk (*) report a variant fine-tuned on the target dataset, YouCook2 or MSRVTT. The pre-training dataset titled COMBO is the combination of CC3M, WV-2M, and COCO.
Model | Visual | Text | Pre-Train | R@5 YouCook2 | R@5 MSRVTT |
---|---|---|---|---|---|
ActBERT | 3D R-32 | BERT | Kinetics+How2 | 26.70 | 23.40 |
HERO | SlowFast | WordPieces | How2+TV | -- | 43.40 |
ClipBERT | R-50 | WordPieces | VisualGenome | -- | 46.80 |
VLM | S3D-g | BERT | How2 | 56.88 | 55.50 |
UniVL | S3D-g | BERT | How2 | 57.60 | 49.60 |
Amrani et al. | R-152 | Word2Vec | How2 | -- | 21.30 |
Downstream evaluation for video captioning on the YouCook2 dataset for video-language models. Top scores are in bold. MM: Masked modeling with video and text, and K/H: Kinetics+HowTo100M.
Model | Category | Subcategory | Visual | Text | Pre-train | BLEU4 | METEOR | ROUGE | CIDEr |
---|---|---|---|---|---|---|---|---|---|
VideoBert | Generative | MM | S3D-g | BERT | Kinetics | 4.33 | 11.94 | 28.80 | 0.55 |
ActBERT | Generative | MM | 3D R-32 | BERT | K/H | 5.41 | 13.30 | 30.56 | 0.65 |
VLM | Generative | MM | S3D-g | BERT | How2 | 12.27 | 18.22 | 41.51 | 1.39 |
UniVL | Generative | MM | S3D-g | BERT | How2 | 17.35 | 22.35 | 46.52 | 1.81 |
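BLEU-4 in the captioning tables is a geometric mean of 1- to 4-gram precisions with a brevity penalty. A quick sentence-level illustration with NLTK is shown below; note that the survey numbers are corpus-level scores, and the tokenized captions here are invented.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Made-up reference and generated captions for a cooking step.
reference = ["add", "the", "chopped", "onions", "to", "the", "pan"]
hypothesis = ["add", "onions", "to", "the", "pan"]

bleu4 = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),   # BLEU-4
                      smoothing_function=SmoothingFunction().method1)
print(f"sentence-level BLEU-4: {bleu4:.3f}")
```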
Downstream evaluation of action recognition on self-supervised learning measured by prediction accuracy for Something-Something (SS) and Kinetics400 (Kinetics). SS is a more temporally relevant dataset and is therefore more challenging. Top scores for each category are in bold and second-best scores underlined.
Model | Category | Subcategory | Visual Backbone | Pre-Train | SS | Kinetics |
---|---|---|---|---|---|---|
pSwaV | Contrastive | View Aug. | R-50 | Kinetics | 51.7 | 62.7 |
pSimCLR | Contrastive | View Aug. | R-50 | Kinetics | 52.0 | 62.0 |
pMoCo | Contrastive | View Aug. | R-50 | Kinetics | 54.4 | 69.0 |
pBYOL | Contrastive | View Aug. | R-50 | Kinetics | 55.8 | 71.5 |
Downstream action recognition on UCF101 and HMDB51 for models that use contrastive learning and/or cross-modal agreement. Top scores for each category are in bold. Modalities include video (V), optical flow (F), human keypoints (K), text (T), and audio (A). Contrastive learning with spatio-temporal augmentations is typically the highest-performing approach.
Model | Subcategory | Visual | Modalities | Pre-Train | UCF101 | HMDB51 |
---|---|---|---|---|---|---|
VIE | Clustering | Slowfast | V | Kinetics | 78.90 | 50.1 |
VIE-2pathway | Clustering | R-18 | V | Kinetics | 80.40 | 52.5 |
Tokmakov et al. | Clustering | 3D R-18 | V | Kinetics | 83.00 | 50.4 |
TCE | Temporal Aug. | R-50 | V | UCF101 | 71.20 | 36.6 |
Lorre et al. | Temporal Aug. | R-18 | V+F | UCF101 | 87.90 | 55.4 |
CMC-Dual | Spatial Aug. | CaffeNet | V+F | UCF101 | 59.10 | 26.7 |
SwAV | Spatial Aug. | R-50 | V | Kinetics | 74.70 | -- |
VDIM | Spatial Aug. | R(2+1)D | V | Kinetics | 79.70 | 49.2 |
CoCon | Spatial Aug. | R-34 | V+F+K | UCF101 | 82.40 | 53.1 |
SimCLR | Spatial Aug. | R-50 | V | Kinetics | 84.20 | -- |
CoCLR | Spatial Aug. | S3D-G | V+F | UCF101 | 90.60 | 62.9 |
MoCo | Spatial Aug. | R-50 | V | Kinetics | 90.80 | -- |
BYOL | Spatial Aug. | R-50 | V | Kinetics | 91.20 | -- |
DVIM | Spatio-Temporal Aug. | R-18 | V+F | UCF101 | 64.00 | 29.7 |
IIC | Spatio-Temporal Aug. | R3D | V+F | Kinetics | 74.40 | 38.3 |
DSM | Spatio-Temporal Aug. | I3D | V | Kinetics | 78.20 | 52.8 |
pSimCLR | Spatio-Temporal Aug. | R-50 | V | Kinetics | 87.90 | -- |
TCLR | Spatio-Temporal Aug. | R(2+1)D | V | UCF101 | 88.20 | 60.0 |
SeCo | Spatio-Temporal Aug. | R-50 | V | ImageNet | 88.30 | 55.6 |
pSwaV | Spatio-Temporal Aug. | R-50 | V | Kinetics | 89.40 | -- |
pBYOL | Spatio-Temporal Aug. | R-50 | V | Kinetics | 93.80 | -- |
CVRL | Spatio-Temporal Aug. | 3D R-50 | V | Kinetics | 93 | -- |
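Most of the Spatial Aug. and Spatio-Temporal Aug. entries above optimize an instance-discrimination objective between two differently augmented clips of the same video (BYOL and SwAV variants use related but non-contrastive objectives). Below is a minimal InfoNCE sketch under the assumption that clip embeddings have already been produced by a video backbone; the `clip_info_nce` name and the temperature value are illustrative choices, not any paper's exact setting.

```python
import torch
import torch.nn.functional as F

def clip_info_nce(z1, z2, temperature=0.1):
    """InfoNCE between embeddings of two augmented clips per video: clip i in z1
    should match clip i in z2 and repel all other clips in the batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(z1.shape[0])       # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: embeddings of two random spatio-temporal crops per video.
loss = clip_info_nce(torch.randn(32, 128), torch.randn(32, 128))
print(f"InfoNCE loss: {loss.item():.3f}")
```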
Performance for the downstream text-to-video retrieval task. Top scores for each category are in bold. Masked Modeling (MM) is a generative approach that uses both video and text. Cross-modal agreement covers a variety of contrastive approaches that can pair video with audio and/or text; these pre-training approaches typically perform best. Models marked with an asterisk (*) report a variant fine-tuned on the target dataset, YouCook2 or MSRVTT. The pre-training dataset titled COMBO is the combination of CC3M, WV-2M, and COCO.
Model | Visual | Text | Pre-Train | R@5 YouCook2 | R@5 MSRVTT |
---|---|---|---|---|---|
ActBERT | 3D R-32 | BERT | Kinetics+How2 | 26.70 | 23.40 |
HERO | SlowFast | WordPieces | How2+TV | -- | 43.40 |
ClipBERT | R-50 | WordPieces | VisualGenome | -- | 46.80 |
VLM | S3D-g | BERT | How2 | 56.88 | 55.50 |
UniVL | S3D-g | BERT | How2 | 57.60 | 49.60 |
Amrani et al. | R-152 | Word2Vec | How2 | -- | 21.30 |
MIL-NCE | S3D | Word2Vec | How2 | 38.00 | 24.00 |
COOT | S3D-g | BERT | How2+YouCook2 | 40.20 | -- |
CE* | Experts | NetVLAD | MSRVTT | -- | 29.00 |
VideoClip | S3D-g | BERT | How2 | 50.40 | 22.20 |
VATT | Linear Proj. | Linear Proj. | AS+How2 | -- | -- |
MEE | Experts | NetVLAD | COCO | -- | 39.20 |
JPoSE | TSN | Word2Vec | Kinetics | -- | 38.10 |
Amrani et al.* | R-152 | Word2Vec | How2 | -- | 41.60 |
AVLnet* | 3D R-101 | Word2Vec | How2 | 55.50 | 50.50 |
MMT | Experts | BERT | How2 | -- | 14.40 |
MMT* | Experts | BERT | How2 | -- | 55.70 |
Patrick et al.* | Experts | T-5 | How2 | 58.50 | -- |
VideoClip* | S3D-g | BERT | How2 | 62.60 | 55.40 |
FIT | ViT | BERT | COMBO | -- | 61.50 |
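The dual-encoder models in the retrieval table above are generally pre-trained with a symmetric video-text contrastive objective over paired clips and captions or ASR narrations. The sketch below is a generic version of that objective, assuming pre-computed embeddings; it is not the exact loss of any single model (MIL-NCE, for instance, pools several neighboring narrations per clip).

```python
import torch
import torch.nn.functional as F

def video_text_contrastive(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired clip/text embeddings:
    each clip should retrieve its own narration and vice versa."""
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) +        # video -> text
                  F.cross_entropy(logits.t(), targets))     # text -> video

loss = video_text_contrastive(torch.randn(16, 256), torch.randn(16, 256))
print(f"video-text contrastive loss: {loss.item():.3f}")
```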
Downstream evaluation for video captioning on the YouCook2 dataset for video-language models. Top scores are in bold. MM: Masked modeling with video and text, and K/H: Kinetics+HowTo100M.
Model | Category | Subcategory | Visual | Text | Pre-train | BLEU4 | METEOR | ROUGE | CIDEr |
---|---|---|---|---|---|---|---|---|---|
CBT | Cross-Modal | Video+Text | S3D-G | BERT | Kinetics | 5.12 | 12.97 | 30.44 | 0.64 |
COOT | Cross-Modal | Video+Text | S3D-g | BERT | YouCook2 | 11.30 | 19.85 | 37.94 | -- |
VideoBert | Generative | MM | S3D-g | BERT | Kinetics | 4.33 | 11.94 | 28.80 | 0.55 |
ActBERT | Generative | MM | 3D R-32 | BERT | K/H | 5.41 | 13.30 | 30.56 | 0.65 |
VLM | Generative | MM | S3D-g | BERT | How2 | 12.27 | 18.22 | 41.51 | 1.39 |
UniVL | Generative | MM | S3D-g | BERT | How2 | 17.35 | 22.35 | 46.52 | 1.81 |
Downstream action segmentation evaluation on COIN for models that use a cross-modal agreement self-supervised pre-training approach. The top score is in bold.
Model | Visual | Text | Pre-train | Frame-Acc |
---|---|---|---|---|
CBT | S3D-G | BERT | Kinetics+How2 | 53.90 |
ActBERT | 3D R-32 | BERT | Kinetics+How2 | 56.95 |
VideoClip (zs) | S3D-g | BERT | How2 | 58.90 |
MIL-NCE | S3D | Word2Vec | How2 | 61.00 |
VLM | S3D-g | BERT | How2 | 68.39 |
VideoClip (ft) | S3D-g | BERT | How2 | 68.70 |
UniVL | S3D-g | BERT | How2 | 70.20 |
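Frame-Acc in the COIN table is frame-wise accuracy of the predicted procedure step, typically computed over the labeled test frames. A minimal sketch, assuming per-frame step predictions are already available (the tensors here are invented):

```python
import torch

def frame_accuracy(pred_steps, gt_steps):
    """Fraction of frames whose predicted step label matches the ground truth."""
    return (pred_steps == gt_steps).float().mean().item()

# Toy usage: step labels for a 300-frame video with 5 possible steps.
gt = torch.randint(0, 5, (300,))
pred = gt.clone()
pred[::4] = (pred[::4] + 1) % 5          # corrupt every 4th frame
print(f"Frame-Acc: {frame_accuracy(pred, gt):.2%}")
```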
Downstream temporal action step localization evaluation on CrossTask for models that use a contrastive multimodal self-supervised pre-training approach. Top scores are in bold.
Model | Visual | Text | Pre-train | Recall |
---|---|---|---|---|
VideoClip (zs) | S3D-g | BERT | How2 | 33.90 |
MIL-NCE | S3D | Word2Vec | How2 | 40.50 |
ActBERT | 3D R-32 | BERT | Kinetics+How2 | 41.40 |
UniVL | S3D-g | BERT | How2 | 42.00 |
VLM | S3D-g | BERT | How2 | 46.50 |
VideoClip (ft) | S3D-g | BERT | How2 | 47.30 |
Downstream action recognition on UCF101 and HMDB51 for models that use contrastive learning and/or cross-modal agreement. Top scores for each category are in bold. Modalities include video (V), optical flow (F), human keypoints (K), text (T), and audio (A). Contrastive learning with spatio-temporal augmentations is typically the highest-performing approach.
Model | Subcategory | Visual | Modalities | Pre-Train | UCF101 | HMDB51 |
---|---|---|---|---|---|---|
Geometry | Appearance | AlexNet | V | UCF101/HMDB51 | 54.10 | 22.60 |
Wang et al. | Appearance | C3D | V | UCF101 | 61.20 | 33.40 |
3D RotNet | Appearance | 3D R-18 | V | MT | 62.90 | 33.70 |
VideoJigsaw | Jigsaw | CaffeNet | V | Kinetics | 54.70 | 27.00 |
3D ST-puzzle | Jigsaw | C3D | V | Kinetics | 65.80 | 33.70 |
CSJ | Jigsaw | R(2+3)D | V | Kinetics+UCF101+HMDB51 | 79.50 | 52.60 |
PRP | Speed | R3D | V | Kinetics | 72.10 | 35.00 |
SpeedNet | Speed | S3D-G | V | Kinetics | 81.10 | 48.80 |
Jenni et al. | Speed | R(2+1)D | V | UCF101 | 87.10 | 49.80 |
PacePred | Speed | S3D-G | V | UCF101 | 87.10 | 52.60 |
ShuffleLearn | Temporal Order | AlexNet | V | UCF101 | 50.90 | 19.80 |
OPN | Temporal Order | VGG-M | V | UCF101 | 59.80 | 23.80 |
O3N | Temporal Order | AlexNet | V | UCF101 | 60.30 | 32.50 |
ClipOrder | Temporal Order | R3D | V | UCF101 | 72.40 | 30.90 |
VIE | Clustering | Slowfast | V | Kinetics | 78.90 | 50.1 |
VIE-2pathway | Clustering | R-18 | V | Kinetics | 80.40 | 52.5 |
Tokmakov et al. | Clustering | 3D R-18 | V | Kinetics | 83.00 | 50.4 |
TCE | Temporal Aug. | R-50 | V | UCF101 | 71.20 | 36.6 |
Lorre et al. | Temporal Aug. | R-18 | V+F | UCF101 | 87.90 | 55.4 |
CMC-Dual | Spatial Aug. | CaffeNet | V+F | UCF101 | 59.10 | 26.7 |
SwAV | Spatial Aug. | R-50 | V | Kinetics | 74.70 | -- |
VDIM | Spatial Aug. | R(2+1)D | V | Kinetics | 79.70 | 49.2 |
CoCon | Spatial Aug. | R-34 | V+F+K | UCF101 | 82.40 | 53.1 |
SimCLR | Spatial Aug. | R-50 | V | Kinetics | 84.20 | -- |
CoCLR | Spatial Aug. | S3D-G | V+F | UCF101 | 90.60 | 62.9 |
MoCo | Spatial Aug. | R-50 | V | Kinetics | 90.80 | -- |
BYOL | Spatial Aug. | R-50 | V | Kinetics | 91.20 | -- |
MIL-NCE | Cross-Modal | S3D-G | V+T | How2 | 91.30 | 61.0
GDT | Cross-Modal | R(2+1)D | V+T+A | Kinetics | 95.50 | 72.8
CBT | Cross-Modal | S3D-G | V+T | Kinetics | 79.50 | 44.6 |
VATT | Cross-Modal | Transformer | V+T | AS+How2 | 85.50 | 64.8 |
AVTS | Cross-Modal | MC3 | V+A | Kinetics | 85.80 | 56.9 |
AVID+Cross | Cross-Modal | R(2+1)D | V+A | Kinetics | 91.00 | 64.1 |
AVID+CMA | Cross-Modal | R(2+1)D | V+A | Kinetics | 91.50 | 64.7 |
MMV-FAC | Cross-Modal | TSM | V+T+A | AS+How2 | 91.80 | 67.1 |
XDC | Cross-Modal | R(2+1)D | V+A | Kinetics | 95.50 | 68.9 |
DVIM | Spatio-Temporal Aug. | R-18 | V+F | UCF101 | 64.00 | 29.7 |
IIC | Spatio-Temporal Aug. | R3D | V+F | Kinetics | 74.40 | 38.3 |
DSM | Spatio-Temporal Aug. | I3D | V | Kinetics | 78.20 | 52.8 |
pSimCLR | Spatio-Temporal Aug. | R-50 | V | Kinetics | 87.90 | -- |
TCLR | Spatio-Temporal Aug. | R(2+1)D | V | UCF101 | 88.20 | 60.0 |
SeCo | Spatio-Temporal Aug. | R-50 | V | ImageNet | 88.30 | 55.6 |
pSwaV | Spatio-Temporal Aug. | R-50 | V | Kinetics | 89.40 | -- |
pBYOL | Spatio-Temporal Aug. | R-50 | V | Kinetics | 93.80 | -- |
CVRL | Spatio-Temporal Aug. | 3D R-50 | V | Kinetics | 93 | -- |
Downstream evaluation of action recognition on self-supervised learning measured by prediction accuracy for Something-Something (SS) and Kinetics400 (Kinetics). SS is a more temporally relevant dataset and is therefore more challenging. Top scores for each category are in bold and second-best scores underlined.
Model | Category | Subcategory | Visual Backbone | Pre-Train | SS | Kinetics |
---|---|---|---|---|---|---|
pSwaV | Contrastive | View Aug. | R-50 | Kinetics | 51.7 | 62.7 |
pSimCLR | Contrastive | View Aug. | R-50 | Kinetics | 52.0 | 62.0 |
pMoCo | Contrastive | View Aug. | R-50 | Kinetics | 54.4 | 69.0 |
pBYOL | Contrastive | View Aug. | R-50 | Kinetics | 55.8 | 71.5 |
BEVT | Generative | MAE | SWIN-B | Kinetics+ImageNet | 71.4 | 81.1 |
MAE | Generative | MAE | ViT-H | Kinetics | 74.1 | 81.1 |
MaskFeat | Generative | MAE | MViT | Kinetics | 74.4 | 86.7 |
VideoMAE | Generative | MAE | ViT-L | ImageNet | 75.3 | 85.1 |
MotionMAE | Generative | MAE | ViT-B | Kinetics | 75.5 | 81.7 |
Performance for the downstream video retrieval task with top scores for each category in bold. K/U/H indicates using all three datasets for pre-training, i.e. Kinetics, UCF101, and HMDB51.
Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
---|---|---|---|---|---|---|
SpeedNet | Pretext | Speed | S3D-G | Kinetics | 28.10 | -- |
ClipOrder | Pretext | Temporal Order | R3D | UCF101 | 30.30 | 22.90 |
OPN | Pretext | Temporal Order | CaffeNet | UCF101 | 28.70 | -- |
CSJ | Pretext | Jigsaw | R(2+3)D | K/U/H | 40.50 | -- |
PRP | Pretext | Speed | R3D | Kinetics | 38.50 | 27.20 |
Jenni et al. | Pretext | Speed | 3D R-18 | Kinetics | 48.50 | -- |
PacePred | Pretext | Speed | R(2+1)D | UCF101 | 49.70 | 32.20 |
MemDPC-RGP | Generative | Frame Prediction | R(2+3)D | Kinetics | 40.40 | 25.70 |
MemDPC-Flow | Generative | Frame Prediction | R(2+3)D | Kinetics | 63.20 | 37.60 |
DSM | Contrastive | Spatio-Temporal | I3D | Kinetics | 35.20 | 25.90 |
IIC | Contrastive | Spatio-Temporal | R-18 | UCF101 | 60.90 | 42.90 |
SeLaVi | Cross-Modal | Video+Audio | R(2+1)D | Kinetics | 68.60 | 47.60 |
CoCLR | Contrastive | View Augmentation | S3D-G | UCF101 | 70.80 | 45.80 |
GDT | Cross-Modal | Video+Audio | R(2+1)D | Kinetics | 79.00 | 51.70 |
Downstream evaluation for video captioning on the YouCook2 dataset for video-language models. Top scores are in bold. MM: Masked modeling with video and text, and K/H: Kinetics+HowTo100M.
Model | Category | Subcategory | Visual | Text | Pre-train | BLEU4 | METEOR | ROUGE | CIDEr |
---|---|---|---|---|---|---|---|---|---|
CBT | Cross-Modal | Video+Text | S3D-G | BERT | Kinetics | 5.12 | 12.97 | 30.44 | 0.64 |
COOT | Cross-Modal | Video+Text | S3D-g | BERT | YouCook2 | 11.30 | 19.85 | 37.94 | -- |
VideoBert | Generative | MM | S3D-g | BERT | Kinetics | 4.33 | 11.94 | 28.80 | 0.55 |
ActBERT | Generative | MM | 3D R-32 | BERT | K/H | 5.41 | 13.30 | 30.56 | 0.65 |
VLM | Generative | MM | S3D-g | BERT | How2 | 12.27 | 18.22 | 41.51 | 1.39 |
UniVL | Generative | MM | S3D-g | BERT | How2 | 17.35 | 22.35 | 46.52 | 1.81 |
Performance for the downstream text-to-video retrieval task. Top scores for each category are in bold. Masked Modeling (MM) is a generative approach that uses both video and text. Cross-modal agreement covers a variety of contrastive approaches that can pair video with audio and/or text; these pre-training approaches typically perform best. Models marked with an asterisk (*) report a variant fine-tuned on the target dataset, YouCook2 or MSRVTT. The pre-training dataset titled COMBO is the combination of CC3M, WV-2M, and COCO.
Model | Visual | Text | Pre-Train | R@5 YouCook2 | R@5 MSRVTT |
---|---|---|---|---|---|
ActBERT | 3D R-32 | BERT | Kinetics+How2 | 26.70 | 23.40 |
HERO | SlowFast | WordPieces | How2+TV | -- | 43.40 |
ClipBERT | R-50 | WordPieces | VisualGenome | -- | 46.80 |
VLM | S3D-g | BERT | How2 | 56.88 | 55.50 |
UniVL | S3D-g | BERT | How2 | 57.60 | 49.60 |
Amrani et al. | R-152 | Word2Vec | How2 | -- | 21.30 |
MIL-NCE | S3D | Word2Vec | How2 | 38.00 | 24.00 |
COOT | S3D-g | BERT | How2+YouCook2 | 40.20 | -- |
CE* | Experts | NetVLAD | MSRVTT | -- | 29.00 |
VideoClip | S3D-g | BERT | How2 | 50.40 | 22.20 |
VATT | Linear Proj. | Linear Proj. | AS+How2 | -- | -- |
MEE | Experts | NetVLAD | COCO | -- | 39.20 |
JPoSE | TSN | Word2Vec | Kinetics | -- | 38.10 |
Amrani et al.* | R-152 | Word2Vec | How2 | -- | 41.60 |
AVLnet* | 3D R-101 | Word2Vec | How2 | 55.50 | 50.50 |
MMT | Experts | BERT | How2 | -- | 14.40 |
MMT* | Experts | BERT | How2 | -- | 55.70 |
Patrick et al.* | Experts | T-5 | How2 | 58.50 | -- |
VideoClip* | S3D-g | BERT | How2 | 62.60 | 55.40 |
FIT | ViT | BERT | COMBO | -- | 61.50 |
Dataset | Labels | Modalities | Classes | Videos | Tasks |
---|---|---|---|---|---|
ActivityNet (ActN) | Activity, Captions, Bounding Box | Video, Video+Text | 200 | 19,995 | Action-Recognition, Video Captioning, Video Grounding |
AVA | Activity, Face Tracks | Video, Video+Audio | 80 | 430 | Action-Recognition, Audio-Visual Grounding |
Breakfast | Activity | Video | 10 | 1,989 | Action Recognition, Action Segmentation |
Charades | Activity, Objects, Indoor Scenes, Verbs | Video | 157 | 9,848 | Action-Recognition, Object Recognition, Scene Recognition, Temporal Action Step Localization |
COIN | Activity, Temporal Actions, ASR | Video, Video+Text | 180 | 11,827 | Action-Recognition, Action Segmentation, Video-Retrieval |
CrossTask | Temporal Steps, Activity | Video | 83 | 4,700 | Temporal Action Step Localization, Recognition |
HMDB51 | Activity | Video | 51 | 6,849 | Action-Recognition, Video-Retrieval |
HowTo100M (How2) | ASR | Video+Text | - | 136M | Text-to-Video Retrieval, VideoQA |
Kinetics | Activity | Video | 400/600/700 | ~0.5M | Action-Recognition |
MSRVTT | Activity, Captions | Video+Text | 20 | 10,000 | Action-Recognition, Video-Captioning, Video-Retrieval, Visual-Question Answering |
MultiThumos | Activity, Temporal Steps | Video | 65 | 400 | Action Recognition, Temporal Action Step Localization |
UCF101 | Activity | Video | 101 | 13,320 | Recognition, Video-Retrieval |
YouCook2 | Captions | Video+Text | 89 | 2,000 | Video Captioning, Video-Retrieval |
YouTube-8M | Activity | Video | 4,716 | 8M | Action Recognition |
@article{schiappa_survey_ssl_video,
author = {Schiappa, Madeline C. and Rawat, Yogesh S. and Shah, Mubarak},
title = {Self-Supervised Learning for Videos: A Survey},
year = {2023},
issue_date = {December 2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {55},
number = {13s},
issn = {0360-0300},
url = {https://doi.org/10.1145/3577925},
doi = {10.1145/3577925},
journal = {ACM Comput. Surv.},
month = {jul},
articleno = {288},
numpages = {37},
}