- This model simply forwards all spatio-temporal tokens extracted from the video, $z_0$, through the transformer encoder.
Factorised self-attention (Model 3). Within each transformer block, the multi-headed self-attention operation is factorised into two operations (indicated by striped boxes): the first computes self-attention only spatially, and the second only temporally.
- Spatial attention: across the H and W dimensions
- Temporal attention: across the T dimension
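The factorisation above can be sketched in a few lines of numpy. This is a simplified, single-head illustration with identity query/key/value projections (the names `self_attention` and `factorised_self_attention` are my own, not from the model's reference implementation): tokens of shape (T·H·W, d) are reshaped so that attention is first computed over the H·W spatial positions within each frame, and then over the T frames at each spatial position.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z):
    # Single-head scaled dot-product attention over the second-to-last axis.
    # z: (..., n, d); identity W_q/W_k/W_v for brevity.
    d = z.shape[-1]
    scores = z @ np.swapaxes(z, -1, -2) / np.sqrt(d)  # (..., n, n)
    return softmax(scores) @ z                        # (..., n, d)

def factorised_self_attention(z, T, H, W):
    # z: (T*H*W, d) spatio-temporal tokens, as in Model 3's factorisation
    d = z.shape[-1]
    z = z.reshape(T, H * W, d)
    z = self_attention(z)               # spatial: attend over H*W within each frame
    z = np.transpose(z, (1, 0, 2))      # (H*W, T, d)
    z = self_attention(z)               # temporal: attend over T at each location
    return np.transpose(z, (1, 0, 2)).reshape(T * H * W, d)
```

Compared with full spatio-temporal attention over all T·H·W tokens at once, each factorised step attends over a much smaller set (H·W or T tokens), which is what makes the factorisation cheaper.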