Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (arxiv)

(Update 2021-11-10) Code is released and ported weights are uploaded

Introduction

T2T-ViT (Tokens-to-Token Vision Transformer) is a Vision Transformer that incorporates two components: 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structures the image into tokens by recursively aggregating neighboring tokens into a single token, so that the local structure represented by surrounding tokens is modeled and the token sequence length is reduced; and 2) an efficient deep-narrow backbone for vision transformers, motivated by CNN architecture design and chosen after an empirical study.
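The reduction in token length can be illustrated with a little arithmetic. The sketch below is not taken from this repository: it assumes the three soft-split (unfold) steps use the kernel/stride/padding values from the T2T-ViT paper's default setup (7/4/2, then 3/2/1 twice) and a 224x224 input. The helper names `unfold_output_size` and `token_counts` are hypothetical and exist only for this example.

```python
from typing import List

# ASSUMPTION: (kernel, stride, padding) per soft split, following the
# T2T-ViT paper's default setup; these values are not stated in this README.
SOFT_SPLITS = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]


def unfold_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Number of sliding-window positions along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def token_counts(image_size: int = 224) -> List[int]:
    """Token count after each soft split, for a square input image."""
    side = image_size
    counts = []
    for kernel, stride, padding in SOFT_SPLITS:
        # Each window of neighboring tokens is aggregated into one new token,
        # so the grid side (and the sequence length) shrinks at every step.
        side = unfold_output_size(side, kernel, stride, padding)
        counts.append(side * side)
    return counts


if __name__ == "__main__":
    print(token_counts(224))  # [3136, 784, 196]
```

Under these assumptions the sequence shrinks from 3136 to 784 to 196 tokens, so the backbone operates on a 14x14 token grid.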

For details, see Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet by Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis E.H. Tay, Jiashi Feng, and Shuicheng Yan.

Model Zoo

The results are evaluated on the ImageNet2012 validation set.

Arch	Weight	Top-1 Acc (%)	Top-5 Acc (%)	Crop ratio	# Params
t2t_vit_14	pretrain 1k	81.50	95.67	0.9	21.5M
t2t_vit_19	pretrain 1k	81.93	95.74	0.9	39.1M
t2t_vit_24	pretrain 1k	82.28	95.89	0.9	64.0M
t2t_vit_t_14	pretrain 1k	81.69	95.85	0.9	21.5M
t2t_vit_t_19	pretrain 1k	82.44	96.08	0.9	39.1M
t2t_vit_t_24	pretrain 1k	82.55	96.07	0.9	64.0M

Note: "pretrain 1k" weights are trained directly on the ImageNet-1k dataset, without additional pretraining data.
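The crop ratio column can be read as follows, assuming it follows the common evaluation convention in which the image's shorter side is resized to `crop_size / crop_ratio` and then center-cropped to `crop_size`. This is an assumption about the convention, not a confirmed description of this repository's preprocessing, and `eval_resize_size` is a hypothetical helper.

```python
import math


def eval_resize_size(crop_size: int = 224, crop_ratio: float = 0.9) -> int:
    """Shorter-side resize target used before the center crop.

    ASSUMPTION: crop ratio = crop_size / resize_size, rounded down.
    """
    return int(math.floor(crop_size / crop_ratio))


if __name__ == "__main__":
    # With the table's crop ratio of 0.9 and an assumed 224-pixel crop.
    print(eval_resize_size(224, 0.9))  # 248
```

Under this reading, a crop ratio of 0.9 means the center crop keeps 90% of the resized image's shorter side.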

Usage

import paddle.nn as nn

from passl.modeling.backbones import build_backbone
from passl.modeling.heads import build_head
from passl.utils.config import get_config


class Model(nn.Layer):
    def __init__(self, cfg_file):
        super().__init__()
        # Load the YAML config, then build the backbone and head it describes.
        cfg = get_config(cfg_file)
        self.backbone = build_backbone(cfg.model.architecture)
        self.head = build_head(cfg.model.head)

    def forward(self, x):
        # The backbone extracts features; the head produces the final output.
        x = self.backbone(x)
        x = self.head(x)
        return x


cfg_file = "configs/t2t_vit/t2t_vit_14.yaml"
m = Model(cfg_file)

Reference

@InProceedings{Yuan_2021_ICCV,
    author    = {Yuan, Li and Chen, Yunpeng and Wang, Tao and Yu, Weihao and Shi, Yujun and Jiang, Zi-Hang and Tay, Francis E.H. and Feng, Jiashi and Yan, Shuicheng},
    title     = {Tokens-to-Token ViT: Training Vision Transformers From Scratch on ImageNet},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {558-567}
}
