Flamcon is a flexible and powerful deep learning architecture that enables efficient cross-modal attention between language, image, and video data. This repository contains the code for training and using Flamcon, along with detailed descriptions of its components and how to get started.
- Overview
- Requirements
- Usage
- Configuration
- License
- Team
- Completed
- Future plans
- Acknowledgments
- Citations
## Overview

Flamcon (Flamingo + Falcon Language-Video Cross-Attention Network) is a versatile architecture that facilitates cross-modal attention between text, image, and video inputs. It combines vision and language processing using transformer-based models to perform tasks such as text generation and image/video captioning.
Key features of Flamcon:
- Cross-Attention: Flamcon enables efficient cross-attention between language and visual inputs, allowing the model to learn rich, meaningful associations (see the sketch after this list).
- Modular Design: The architecture is modular, allowing you to easily configure the number of layers, attention heads, and other hyperparameters.
- Efficient Training: Leveraging the benefits of DeepSpeed and Fully Sharded Data Parallelism (FSDP), Flamcon is designed for efficient and scalable distributed training.
- Video Support: Video input is supported via a training example (train.py) and a WebVid dataloader.
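To make the cross-attention idea concrete, below is a minimal, self-contained sketch of a Flamingo-style gated cross-attention block in PyTorch. This is an illustration only, not the repository's actual implementation; the class name, dimensions, and gating scheme are assumptions.

```python
# Minimal sketch of Flamingo-style gated cross-attention (illustrative only;
# not this repository's actual implementation).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Tanh gate initialized to zero, so the block starts as an identity
        # and the pretrained language model is undisturbed early in training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text: (batch, text_len, dim); visual: (batch, visual_len, dim)
        attended, _ = self.attn(self.norm(text), visual, visual)
        return text + torch.tanh(self.gate) * attended

# Example: 64 text tokens attending over visual tokens from a video clip.
block = GatedCrossAttention(dim=512)
out = block(torch.randn(2, 64, 512), torch.randn(2, 40, 512))
print(out.shape)  # torch.Size([2, 64, 512])
```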
## Requirements

- Docker image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
- Hugging Face Transformers
- DeepSpeed (for distributed training with FSDP)
- tqdm (for progress bars)
- At least one A6000 GPU with 48 GB of memory
The following hardware was used to develop, train, and test the model:
- 2 A6000 GPUs (48 GB each) for development
- 8 A6000 GPUs (48 GB each) for training
- 1 A6000 GPU (48 GB) for testing
Training samples were drawn from the WebVid-10M dataset. Each video was preprocessed into a 40-frame clip at 256x256 resolution (shape: 40, 3, 256, 256). A text filter was then applied to select a more focused subset, and 10k videos were randomly selected and split into training and validation sets. A sketch of this preprocessing appears below.
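The following is a hedged sketch of that preprocessing step: it uniformly samples 40 frames from a decoded video and resizes them to 256x256. The uniform sampling strategy and the function name are assumptions; the repository's WebVid dataloader may do this differently.

```python
# Hypothetical sketch of the clip preprocessing described above; the actual
# WebVid dataloader in this repository may sample and resize differently.
import torch
import torch.nn.functional as F

def preprocess_clip(frames: torch.Tensor, num_frames: int = 40, size: int = 256) -> torch.Tensor:
    # frames: (T, 3, H, W) float tensor of decoded video frames
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).long()
    clip = frames[idx]  # uniformly sample num_frames frames across the video
    # Resize each frame to size x size, giving (num_frames, 3, size, size)
    return F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)

clip = preprocess_clip(torch.rand(300, 3, 360, 640))
print(clip.shape)  # torch.Size([40, 3, 256, 256])
```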
## Usage

To train the Flamcon model, follow these steps:
- Clone this repository:
```sh
git clone https://github.com/jamesbenharris/flamcon
```
- To install the required dependencies, download data, and start training:
```sh
sh setup.sh
```
To test the Flamcon model once trained, follow these steps:
- Run the testing script:
```sh
python test.py
```
## Configuration

You can adjust various parameters in the training and inference scripts to customize the behavior of the Flamcon model. Refer to the script comments and the DeepSpeed documentation for more details on configuring distributed training.
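For orientation, the snippet below sketches the kind of FSDP auto-wrapping a training script like train.py typically performs. The size threshold and helper name are assumptions, not the exact code in this repository.

```python
# Illustrative FSDP wrapping sketch (an assumption, not train.py's exact code).
import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def shard_model(model: nn.Module) -> FSDP:
    # Wrap every submodule above ~1M parameters in its own FSDP unit so that
    # parameters, gradients, and optimizer state are sharded across ranks.
    policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
    return FSDP(model, auto_wrap_policy=policy)

# Usage, inside a process launched per GPU (e.g. by deepspeed or torchrun),
# after torch.distributed.init_process_group("nccl"):
#   model = shard_model(build_flamcon().cuda())  # build_flamcon is hypothetical
```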
## License

This project is licensed under the MIT License.
## Completed

- Train model on 1,000 videos (more training needed)
- Test generator

## Future plans

- Train model on 10 million videos
- Train model on a custom dataset
## Team

Flamcon is developed by:
## Acknowledgments

This code is based on lucidrains' Flamingo implementation and OpenFlamingo. Thanks to both projects for expediting development and for improving my understanding of FSDP wrapping and checkpointing techniques!
## Citations

```bibtex
@article{awadalla2023openflamingo,
  title={OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models},
  author={Anas Awadalla and Irena Gao and Josh Gardner and Jack Hessel and Yusuf Hanafy and Wanrong Zhu and Kalyani Marathe and Yonatan Bitton and Samir Gadre and Shiori Sagawa and Jenia Jitsev and Simon Kornblith and Pang Wei Koh and Gabriel Ilharco and Mitchell Wortsman and Ludwig Schmidt},
  journal={arXiv preprint arXiv:2308.01390},
  year={2023}
}
```