InceptionNeXt: When Inception Meets ConvNeXt (CVPR 2024)
This is a PyTorch implementation of InceptionNeXt, proposed in our paper "InceptionNeXt: When Inception Meets ConvNeXt". Thanks to Ross Wightman, InceptionNeXt is integrated into timm.
TLDR: To speed up ConvNeXt, we build InceptionNeXt by decomposing the large-kernel depthwise convolution in an Inception style. Our InceptionNeXt-T enjoys both ResNet-50's speed and ConvNeXt-T's accuracy.
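The decomposition can be made concrete with a little arithmetic. The sketch below assumes the defaults we understand timm's `InceptionDWConv2d` to use: 1/8 of the channels go to a 3x3 square kernel, 1/8 to a 1x11 band, 1/8 to an 11x1 band, and the rest pass through as identity. The helper names are hypothetical, and the counts ignore bias, normalization and memory traffic, which also affect real speed.

```python
from __future__ import annotations


def inception_split(channels: int, branch_ratio: float = 0.125) -> tuple[int, int, int, int]:
    """Channel counts (identity, square, band_w, band_h) under the assumed split."""
    gc = int(channels * branch_ratio)  # channels per convolutional branch
    return channels - 3 * gc, gc, gc, gc


def dwconv_macs(h: int, w: int, channels: int, kh: int, kw: int) -> int:
    """MACs of a stride-1 depthwise conv whose output keeps the h x w size."""
    return h * w * channels * kh * kw


def inception_mixer_macs(h: int, w: int, channels: int, square_k: int = 3,
                         band_k: int = 11, branch_ratio: float = 0.125) -> int:
    """MACs of the assumed Inception depthwise mixer (the identity branch is free)."""
    _, sq, bw, bh = inception_split(channels, branch_ratio)
    return (dwconv_macs(h, w, sq, square_k, square_k)   # 3x3 square branch
            + dwconv_macs(h, w, bw, 1, band_k)          # 1x11 horizontal band
            + dwconv_macs(h, w, bh, band_k, 1))         # 11x1 vertical band


# Example: a 56x56 feature map with 96 channels (an assumed first-stage shape at 224px).
print(inception_split(96))                    # (60, 12, 12, 12)
print(dwconv_macs(56, 56, 96, 7, 7))          # 14751744 for ConvNeXt's 7x7 depthwise conv
print(inception_mixer_macs(56, 56, 96))       # 1166592
```

Under these assumptions the mixer needs roughly 12.6x fewer MACs than a 7x7 depthwise convolution on the same tensor. Whole-network speedups are smaller, because the token mixer is only one part of each block.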
Our models are trained and tested in an environment with PyTorch 1.13, NVIDIA CUDA 11.7.1 and timm 0.6.11 (pip install timm==0.6.11). If you use Docker, see the Dockerfile we used.
Data preparation: ImageNet with the following folder structure. You can extract ImageNet with this script.
│imagenet/
├──train/
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ......
│ ├── ......
├──val/
│ ├── n01440764
│ │ ├── ILSVRC2012_val_00000293.JPEG
│ │ ├── ILSVRC2012_val_00002138.JPEG
│ │ ├── ......
│ ├── ......
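Before training, it can help to sanity-check this layout. The sketch below is a hypothetical helper using only the standard library: it checks that `train/` and `val/` contain the same class folders and counts the JPEG files in each split. For the full ImageNet-1K release you would expect 1,000 class folders, 1,281,167 training images and 50,000 validation images.

```python
from pathlib import Path
from typing import Dict


def summarize_imagenet(root: str) -> Dict[str, Dict[str, int]]:
    """Return {'train': {...}, 'val': {...}} with class and image counts per split.

    Raises if a split is missing or the two splits have different class folders.
    """
    root_path = Path(root)
    classes, summary = {}, {}
    for split in ("train", "val"):
        split_dir = root_path / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        class_dirs = sorted(p for p in split_dir.iterdir() if p.is_dir())
        classes[split] = [p.name for p in class_dirs]
        # Count image files inside each class folder (e.g. n01440764_10026.JPEG).
        num_images = sum(1 for d in class_dirs for f in d.iterdir()
                         if f.suffix.lower() in {".jpeg", ".jpg"})
        summary[split] = {"classes": len(class_dirs), "images": num_images}
    if classes["train"] != classes["val"]:
        raise ValueError("train/ and val/ contain different class folders")
    return summary


# Usage: summarize_imagenet("/path/to/imagenet")
```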
Model | Resolution | Params | MACs | Train throughput (img/s) | Infer. throughput (img/s) | Top-1 Acc. (%) |
---|---|---|---|---|---|---|
mobilenetv2_140 | 224 | 6.1M | 0.60G | 1001 | 5190 | 74.7 |
efficientnet_b0 | 224 | 5.3M | 0.40G | 954 | 5502 | 77.1 |
ghostnet_130 | 224 | 7.3M | 0.24G | 946 | 7451 | 75.7 |
convnext_atto | 224 | 3.7M | 0.55G | 835 | 4539 | 75.7 |
inceptionnext_atto | 224 | 4.2M | 0.51G | 2661 | 9876 | 75.3 |
resnet50 | 224 | 26M | 4.1G | 969 | 3149 | 78.4 |
convnext_tiny | 224 | 29M | 4.5G | 575 | 2413 | 82.1 |
inceptionnext_tiny | 224 | 28M | 4.2G | 901 | 2900 | 82.3 |
inceptionnext_small | 224 | 49M | 8.4G | 521 | 1750 | 83.5 |
inceptionnext_base | 224 | 87M | 14.9G | 375 | 1244 | 84.0 |
inceptionnext_base_384 | 384 | 87M | 43.6G | 139 | 428 | 85.2 |
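To put the TLDR claim into numbers, the snippet below divides throughputs copied from the table above. It is plain arithmetic on the reported values; the dictionary and function names are illustrative only.

```python
# (train throughput, inference throughput, top-1 %) copied from the table above.
RESULTS = {
    "resnet50": (969, 3149, 78.4),
    "convnext_tiny": (575, 2413, 82.1),
    "inceptionnext_tiny": (901, 2900, 82.3),
}
COLUMNS = {"train": 0, "infer": 1}


def relative_throughput(model: str, baseline: str, column: str = "train") -> float:
    """Throughput of `model` divided by that of `baseline` (>1 means faster)."""
    i = COLUMNS[column]
    return RESULTS[model][i] / RESULTS[baseline][i]


for base in ("convnext_tiny", "resnet50"):
    for col in ("train", "infer"):
        ratio = relative_throughput("inceptionnext_tiny", base, col)
        print(f"inceptionnext_tiny vs {base} ({col}): {ratio:.2f}x")
```

This gives about 1.57x ConvNeXt-T's training throughput (1.20x at inference) and about 0.93x ResNet-50's training throughput (0.92x at inference), at 82.3% versus 82.1% top-1 for ConvNeXt-T.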
Model | Resolution | Params | MACs | Train throughput (img/s) | Infer. throughput (img/s) | Top-1 Acc. (%) |
---|---|---|---|---|---|---|
resnet50 | 224 | 26M | 4.1G | 969 | 3149 | 78.4 |
convnext_tiny | 224 | 29M | 4.5G | 575 | 2413 | 82.1 |
convnext_tiny_k5 | 224 | 29M | 4.4G | 675 | 2704 | 82.0 |
convnext_tiny_k3 | 224 | 28M | 4.4G | 798 | 2802 | 81.5 |
convnext_tiny_k3_par1_2 | 224 | 28M | 4.4G | 818 | 2740 | 81.4 |
convnext_tiny_k3_par3_8 | 224 | 28M | 4.4G | 847 | 2762 | 81.4 |
convnext_tiny_k3_par1_4 | 224 | 28M | 4.4G | 871 | 2808 | 81.3 |
convnext_tiny_k3_par1_8 | 224 | 28M | 4.4G | 901 | 2833 | 80.8 |
convnext_tiny_k3_par1_16 | 224 | 28M | 4.4G | 916 | 2846 | 80.1 |
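The variant names in this table encode their settings. Our reading, which is an assumption the table itself does not spell out, is that `k3`/`k5` is the depthwise kernel size and a `_parA_B` suffix means only a fraction A/B of the channels goes through that convolution while the rest pass through unchanged. The hypothetical helpers below parse the suffix and show how many channels of a 96-channel stage would be convolved.

```python
import re
from fractions import Fraction


def partial_ratio(name: str) -> Fraction:
    """Fraction of channels convolved, parsed from a '_parA_B' suffix (1 if absent)."""
    match = re.search(r"_par(\d+)_(\d+)$", name)
    if match is None:
        return Fraction(1)
    return Fraction(int(match.group(1)), int(match.group(2)))


def convolved_channels(name: str, channels: int) -> int:
    """Channels that go through the depthwise conv under the assumed naming scheme."""
    return int(channels * partial_ratio(name))


for name in ("convnext_tiny_k3", "convnext_tiny_k3_par1_2", "convnext_tiny_k3_par3_8",
             "convnext_tiny_k3_par1_4", "convnext_tiny_k3_par1_8", "convnext_tiny_k3_par1_16"):
    print(name, partial_ratio(name), convolved_channels(name, 96))
```

Read together with the table, going from convnext_tiny_k3 to convnext_tiny_k3_par1_8 raises training throughput from 798 to 901 while top-1 accuracy drops from 81.5% to 80.8%.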
The throughputs are measured on an A100 with full precision and a batch size of 128. See Benchmarking throughput.
We also provide a Colab notebook that runs the steps to perform inference with InceptionNeXt.
To evaluate our InceptionNeXt models, run:
MODEL=inceptionnext_tiny
python3 validate.py /path/to/imagenet --model $MODEL -b 128 \
--pretrained
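The Top-1 Acc. column is the percentage of validation images whose highest-scoring class equals the label. A dependency-free sketch of that metric, generalized to top-k, follows; `topk_accuracy` is a hypothetical helper for illustration, not a function from validate.py or timm.

```python
from typing import Dict, Sequence


def topk_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int],
                  ks: Sequence[int] = (1, 5)) -> Dict[int, float]:
    """Percentage of samples whose label is among the k highest-scoring classes."""
    if len(logits) != len(labels) or not labels:
        raise ValueError("need one non-empty row of scores per label")
    correct = {k: 0 for k in ks}
    for scores, label in zip(logits, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        for k in ks:
            if label in ranked[:k]:
                correct[k] += 1
    return {k: 100.0 * correct[k] / len(labels) for k in ks}


scores = [[0.1, 0.9, 0.0], [0.8, 0.15, 0.05], [0.2, 0.3, 0.5]]
print(topk_accuracy(scores, labels=[1, 2, 1], ks=(1, 2)))
```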
In the environment described above, we benchmark throughput on an A100 with a batch size of 128 and report the better result of the "Channel First" and "Channel Last" memory layouts.
For Channel First:
MODEL=inceptionnext_tiny # convnext_tiny
python3 benchmark.py /path/to/imagenet --model $MODEL
For Channel Last:
MODEL=inceptionnext_tiny # convnext_tiny
python3 benchmark.py /path/to/imagenet --model $MODEL --channel-last
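Throughput numbers like these typically follow a simple recipe: warm up, time a fixed number of batches, divide the images processed by the elapsed time, and keep the faster of the two memory layouts. The sketch below is a framework-agnostic illustration with hypothetical helper names; it is not benchmark.py. With a GPU model, the `step` callable should synchronize (for example with `torch.cuda.synchronize()`) before returning, otherwise asynchronous kernel launches make the timing meaningless.

```python
import time
from typing import Callable, Dict, Tuple


def throughput(step: Callable[[], None], batch_size: int,
               num_iters: int = 50, num_warmup: int = 10) -> float:
    """Images per second for `step`, a callable that processes one full batch."""
    for _ in range(num_warmup):  # exclude one-time costs (allocation, autotuning)
        step()
    start = time.perf_counter()
    for _ in range(num_iters):
        step()
    elapsed = time.perf_counter() - start
    return batch_size * num_iters / elapsed


def best_layout(results: Dict[str, float]) -> Tuple[str, float]:
    """Return the (layout, img/s) pair with the highest throughput."""
    return max(results.items(), key=lambda item: item[1])


# Dummy workload standing in for a forward pass on a batch of 128 images.
rate = throughput(lambda: time.sleep(0.001), batch_size=128, num_iters=20, num_warmup=2)
print(f"{rate:.0f} img/s")
```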
We use a batch size of 4096 by default and show how to train models with 8 GPUs. For multi-node training, adjust --grad-accum-steps according to your setup.
DATA_PATH=/path/to/imagenet
CODE_PATH=/path/to/code/inceptionnext # modify code path here
ALL_BATCH_SIZE=4096
NUM_GPU=8
GRAD_ACCUM_STEPS=4 # Adjust according to your GPU numbers and memory size.
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS
MODEL=inceptionnext_tiny # or inceptionnext_small, inceptionnext_base
DROP_PATH=0.1 # 0.3 for inceptionnext_small, 0.4 for inceptionnext_base
cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model $MODEL --opt adamw --lr 4e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path $DROP_PATH
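The `let` line divides the global batch by the GPU count and the accumulation steps, so the defaults give 4096 / 8 / 4 = 128 images per GPU per step. When the number of GPUs changes, for example for multi-node training, the accumulation steps can be recomputed to keep the global batch at 4096. The hypothetical helpers below do both and, unlike shell integer division, reject settings that do not divide evenly.

```python
def per_gpu_batch_size(all_batch_size: int = 4096, num_gpu: int = 8,
                       grad_accum_steps: int = 4) -> int:
    """Same result as `let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS`
    when everything divides evenly; raises instead of truncating otherwise."""
    denom = num_gpu * grad_accum_steps
    if all_batch_size % denom:
        raise ValueError(f"{all_batch_size} is not divisible by {num_gpu} x {grad_accum_steps}")
    return all_batch_size // denom


def grad_accum_steps_for(all_batch_size: int, num_gpu: int, per_gpu_batch: int) -> int:
    """Accumulation steps that keep the global batch fixed for a given GPU count."""
    denom = num_gpu * per_gpu_batch
    if all_batch_size % denom:
        raise ValueError(f"{all_batch_size} is not divisible by {num_gpu} x {per_gpu_batch}")
    return all_batch_size // denom


print(per_gpu_batch_size())                  # 128
print(grad_accum_steps_for(4096, 16, 128))   # 2, e.g. two 8-GPU nodes
```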
Training (fine-tuning) scripts for other models are provided in scripts.
@inproceedings{yu2024inceptionnext,
title={{InceptionNeXt: When Inception Meets ConvNeXt}},
author={Yu, Weihao and Zhou, Pan and Yan, Shuicheng and Wang, Xinchao},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={5672--5683},
year={2024}
}
Weihao Yu would like to thank the TRC program and GCP research credits for supporting part of the computational resources. Our implementation is based on pytorch-image-models, poolformer, ConvNeXt and metaformer.