ViViT


Embedding Video Clips

  • Uniform frame sampling: sample n_t frames from the video, embed each frame independently with the standard 2D ViT patch embedding, and concatenate all the resulting tokens.
  • Tubelet embedding: extract non-overlapping spatio-temporal "tubes" of shape t × h × w and linearly project each one to a token, so temporal information is fused already at the embedding stage (see the sketch after this list).
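
A minimal PyTorch sketch of tubelet embedding (not the repository's actual code): a strided Conv3d over non-overlapping t × h × w tubelets is equivalent to linearly projecting each tubelet to a token. The module name TubeletEmbedding and the default sizes (d_model=192, 2×16×16 tubelets) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Maps non-overlapping t x h x w tubelets to d_model-dim tokens via a strided Conv3d."""
    def __init__(self, d_model=192, tubelet=(2, 16, 16), in_channels=3):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, d_model, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                    # video: (B, C, T, H, W)
        x = self.proj(video)                     # (B, d_model, nt, nh, nw)
        B, D, nt, nh, nw = x.shape
        # Flatten the spatio-temporal grid into a token sequence, frame-major.
        return x.flatten(2).transpose(1, 2), (nt, nh, nw)  # (B, nt*nh*nw, d_model)
```

Uniform frame sampling corresponds to the t = 1 case applied to the sampled frames.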

Transformer Models for Video

Model 1: Spatio-temporal attention

  • This model simply forwards all spatio-temporal tokens extracted from the video, $z_0$, through the transformer encoder, as sketched below.
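
A hedged sketch of Model 1 (joint spatio-temporal attention), reusing the TubeletEmbedding sketch above; class names and layer sizes are illustrative, and positional embeddings and dropout are omitted for brevity:

```python
import torch
import torch.nn as nn

class ViViTModel1(nn.Module):
    """Model 1: every spatio-temporal token attends to every other token."""
    def __init__(self, d_model=192, n_heads=3, n_layers=4, n_classes=400):
        super().__init__()
        self.embed = TubeletEmbedding(d_model)   # from the sketch above
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, video):                    # video: (B, C, T, H, W)
        tokens, _ = self.embed(video)            # (B, N, d_model), N = nt*nh*nw
        cls = self.cls.expand(tokens.size(0), -1, -1)
        z = torch.cat([cls, tokens], dim=1)      # z_0: [CLS] + all spatio-temporal tokens
        z = self.encoder(z)                      # joint attention over space and time
        return self.head(z[:, 0])                # classify from the [CLS] token
```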

Model 2: Factorised encoder

  • This model uses two transformer encoders in series: a spatial encoder first models interactions between tokens extracted from the same temporal index, and a temporal encoder then models interactions between the resulting frame-level representations (see the sketch after this list).
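
A minimal sketch of the factorised encoder under the same assumptions (illustrative sizes, positional embeddings omitted). For brevity it mean-pools tokens instead of using the per-frame [CLS] tokens described in the paper:

```python
import torch
import torch.nn as nn

class FactorisedEncoder(nn.Module):
    """Model 2: a spatial encoder per frame, then a temporal encoder across frames."""
    def __init__(self, d_model=192, n_heads=3, n_spatial=4, n_temporal=2, n_classes=400):
        super().__init__()
        def make(n_layers):
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, n_layers)
        self.spatial = make(n_spatial)
        self.temporal = make(n_temporal)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens, nt):        # tokens: (B, nt*ns, d_model), frame-major
        B, N, D = tokens.shape
        ns = N // nt                      # spatial tokens per temporal index
        x = tokens.reshape(B * nt, ns, D) # spatial attention within each frame
        x = self.spatial(x)
        frames = x.mean(dim=1).reshape(B, nt, D)  # one representation per frame
        z = self.temporal(frames)                 # temporal attention across frames
        return self.head(z.mean(dim=1))
```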

Model 3: Factorized self-attention

  • Within each transformer block, the multi-headed self-attention operation is factorised into two operations that first compute self-attention spatially, and then temporally (see the sketch after this list).
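
A sketch of one factorised self-attention block, assuming tokens are ordered frame-major, i.e. shaped (B, nt·ns, d) where nt is the number of temporal indices and ns = nh·nw the number of spatial tokens per frame; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class FactorisedSelfAttentionBlock(nn.Module):
    """Model 3: within one block, self-attention is applied spatially, then temporally."""
    def __init__(self, d_model=192, n_heads=3):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, nt, ns):                    # x: (B, nt*ns, d_model)
        B, N, D = x.shape
        # Spatial attention: each token attends only within its own frame (across H, W).
        xs = self.norm1(x).reshape(B * nt, ns, D)
        x = x + self.spatial_attn(xs, xs, xs)[0].reshape(B, N, D)
        # Temporal attention: each token attends only across frames at the same
        # spatial location (across T).
        xt = self.norm2(x).reshape(B, nt, ns, D).transpose(1, 2).reshape(B * ns, nt, D)
        t = self.temporal_attn(xt, xt, xt)[0]
        x = x + t.reshape(B, ns, nt, D).transpose(1, 2).reshape(B, N, D)
        return x + self.mlp(self.norm3(x))
```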

Model 4: Factorised dot-product attention

  • This model factorises the multi-head dot-product attention itself: half of the attention heads attend over the spatial dimensions and the other half over the temporal dimension, and their outputs are concatenated (see the sketch after this list).
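
A sketch of factorised dot-product attention, with one simplification loudly flagged: here the channels are split in half so the spatial and temporal heads operate on disjoint halves of the embedding, whereas the paper keeps the full embedding and only changes each head's attention neighbourhood. Shapes follow the Model 3 sketch:

```python
import torch
import torch.nn as nn

class FactorisedDotProductAttention(nn.Module):
    """Model 4: half the heads attend over space, half over time; outputs concatenated."""
    def __init__(self, d_model=192, n_heads=4):
        super().__init__()
        assert n_heads % 2 == 0
        half = d_model // 2  # simplification: each group of heads gets half the channels
        self.spatial_attn = nn.MultiheadAttention(half, n_heads // 2, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(half, n_heads // 2, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x, nt, ns):                 # x: (B, nt*ns, d_model), frame-major
        B, N, D = x.shape
        xs, xt = x.split(D // 2, dim=-1)
        # Spatial heads: keys/values restricted to tokens in the same frame.
        s = xs.reshape(B * nt, ns, D // 2)
        s = self.spatial_attn(s, s, s)[0].reshape(B, N, D // 2)
        # Temporal heads: keys/values restricted to the same spatial location.
        t = xt.reshape(B, nt, ns, D // 2).transpose(1, 2).reshape(B * ns, nt, D // 2)
        t = self.temporal_attn(t, t, t)[0]
        t = t.reshape(B, ns, nt, D // 2).transpose(1, 2).reshape(B, N, D // 2)
        return self.proj(torch.cat([s, t], dim=-1))  # fuse spatial and temporal heads
```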

Spatial attention: attends across the H and W dimensions (tokens within the same frame).
Temporal attention: attends across the T dimension (tokens at the same spatial location).


Ablation

Model Variants

  • The unfactorised model (Model 1) performs the best on Kinetics 400. However, it can also overfit on smaller datasets such as Epic Kitchens, where the paper finds the Factorised Encoder (Model 2) to perform the best.

About

ViViT: Video Vision Transformer in PyTorch
