- This model simply forwards all spatio-temporal tokens extracted from the video, $z_0$, through the transformer encoder.
Factorised self-attention (Model 3). Within each transformer block, the multi-headed self-attention operation is factorised into two operations (indicated by striped boxes): the first computes self-attention only spatially, and the second only temporally.
- Spatial attention: across the H and W dimensions
- Temporal attention: across the T dimension
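The factorisation above can be sketched in a few lines of numpy. This is a simplified, single-head illustration with identity query/key/value projections (the names `self_attention` and `factorised_self_attention` are my own, not from the model's reference implementation): tokens of shape (T·H·W, d) are reshaped so that attention is first computed over the H·W spatial positions within each frame, and then over the T frames at each spatial position.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z):
    # Single-head scaled dot-product attention over the second-to-last axis.
    # z: (..., n, d); identity W_q/W_k/W_v for brevity.
    d = z.shape[-1]
    scores = z @ np.swapaxes(z, -1, -2) / np.sqrt(d)  # (..., n, n)
    return softmax(scores) @ z                        # (..., n, d)

def factorised_self_attention(z, T, H, W):
    # z: (T*H*W, d) spatio-temporal tokens, as in Model 3's factorisation
    d = z.shape[-1]
    z = z.reshape(T, H * W, d)
    z = self_attention(z)               # spatial: attend over H*W within each frame
    z = np.transpose(z, (1, 0, 2))      # (H*W, T, d)
    z = self_attention(z)               # temporal: attend over T at each location
    return np.transpose(z, (1, 0, 2)).reshape(T * H * W, d)
```

Compared with full spatio-temporal attention over all T·H·W tokens at once, each factorised step attends over a much smaller set (H·W or T tokens), which is what makes the factorisation cheaper.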