Flamcon: Flamingo+Falcon Language-Video Cross Attention Network

Flamcon is a flexible and powerful deep learning architecture that enables efficient cross-modal attention between language, image, and video data. This repository contains the code for training and using Flamcon, along with detailed descriptions of its components and how to get started.

flamcon image

Table of Contents

  • Overview
  • Requirements
  • Usage
  • Configuration
  • License
  • Completed
  • Future
  • Team
  • Acknowledgments
  • Citations

Overview

Flamcon (Flamingo+Falcon Language-Video Cross Attention Network) is a versatile architecture that facilitates cross-modal attention between text, image, and video inputs. It combines vision and language processing using transformer-based models to perform tasks such as text generation, image/video captioning, and more.

Key features of Flamcon:

  • Cross-Attention: Flamcon enables efficient and effective cross-attention between language and image inputs, allowing the model to learn rich and meaningful associations (a minimal sketch of the mechanism follows this list).
  • Modular Design: The architecture is modular, allowing you to easily configure the number of layers, attention heads, and other hyperparameters.
  • Efficient Training: Leveraging DeepSpeed and Fully Sharded Data Parallelism (FSDP), Flamcon is designed for efficient and scalable distributed training.
  • Video Support: Video inputs are supported via a training example (train.py) and a WebVid dataloader.
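
The block below is a minimal sketch of the Flamingo-style gated cross-attention pattern that Flamcon builds on, not the exact Flamcon module: language tokens attend to flattened video features, and a learnable tanh gate scales the residual so the wrapped language model's behavior is preserved at initialization. All dimensions and names here are illustrative assumptions.

```python
# Illustrative Flamingo-style gated cross-attention block (not the exact Flamcon code).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0, so the gate starts closed

    def forward(self, text_tokens, video_features):
        # text_tokens: (batch, seq_len, dim); video_features: (batch, frames * patches, dim)
        attended, _ = self.attn(text_tokens, video_features, video_features)
        return text_tokens + torch.tanh(self.gate) * attended

block = GatedCrossAttention(dim=512, heads=8)
out = block(torch.randn(2, 32, 512), torch.randn(2, 40 * 64, 512))
print(out.shape)  # torch.Size([2, 32, 512])
```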

Requirements

  • Docker image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
  • Hugging Face Transformers
  • DeepSpeed (for distributed training with FSDP)
  • tqdm (for progress bars)
  • At least one A6000 GPU with 48 GB of RAM

Usage

Infrastructure

The following hardware was used for developing, training, and testing the model.

  1. 2 A6000 GPUs (48 GB each) for development
  2. 8 A6000 GPUs (48 GB each) for training
  3. 1 A6000 GPU (48 GB) for testing

Data

Training samples were drawn from the WebVid-10M dataset (roughly 10 million video-text pairs). Each video was preprocessed into a 40-frame clip at 256x256 resolution (shape 40, 3, 256, 256). A text filter was then applied to select a more focused subset, and 10k videos were randomly selected and split into training and validation sets.
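
The exact preprocessing lives in the WebVid dataloader; the sketch below only illustrates producing the (40, 3, 256, 256) clip shape, assuming frames have already been decoded into a tensor. The function and variable names are hypothetical.

```python
# Illustrative clip preprocessing: sample 40 frames and resize them to 256x256.
import torch
import torch.nn.functional as F

def preprocess_clip(frames: torch.Tensor, num_frames: int = 40, size: int = 256) -> torch.Tensor:
    """frames: (T, 3, H, W) uint8 tensor decoded from a video."""
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).long()  # uniform temporal sampling
    clip = frames[idx].float() / 255.0
    clip = F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)
    return clip  # (num_frames, 3, size, size)

clip = preprocess_clip(torch.randint(0, 255, (120, 3, 360, 640), dtype=torch.uint8))
print(clip.shape)  # torch.Size([40, 3, 256, 256])
```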

Training

To train the Flamcon model, follow these steps:

  1. Clone this repository: git clone https://github.com/jamesbenharris/flamcon
  2. Install the required dependencies, download the data, and start training (a sketch of a single training step follows below):
    sh setup.sh

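As a rough guide to what setup.sh kicks off, here is a hedged sketch of a single training step: the model takes a batch of preprocessed clips plus tokenized captions and is optimized with a next-token cross-entropy loss. The model signature and batch keys below are assumptions, not the actual interface; see train.py for the real loop.

```python
# Hypothetical training step (interface names are illustrative, not the real train.py API).
import torch
import torch.nn.functional as F

def train_step(model, optimizer, batch, device="cuda"):
    videos = batch["video"].to(device)       # (B, 40, 3, 256, 256)
    tokens = batch["input_ids"].to(device)   # (B, seq_len)
    logits = model(videos, tokens)           # (B, seq_len, vocab_size)
    # Next-token prediction: predict token t+1 from positions <= t.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
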
Testing

To test the Flamcon model once trained, follow these steps:

  1. Run the testing script:
    python test.py
    

Configuration

You can adjust various parameters in the training and inference scripts to customize the behavior of the Flamcon model. Refer to the script comments and the DeepSpeed documentation for more details on configuring distributed training.
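
For reference, here is a minimal FSDP wrapping sketch assuming a standard torchrun/NCCL launch. The build_flamcon_model helper is hypothetical; see train.py and the DeepSpeed/FSDP documentation for the wrapping and policy actually used.

```python
# Minimal FSDP wrapping sketch -- illustrative, not the exact train.py setup.
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group("nccl")  # assumes launch via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_flamcon_model()  # hypothetical helper; substitute the real model constructor
policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
model = FSDP(model, auto_wrap_policy=policy, device_id=torch.cuda.current_device())
```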

License

This project is licensed under the MIT License.

Completed

  • Train model on 1,000 videos (more training needed)
  • Test generator

Future

  • Train model on 10 million videos
  • Train model on a custom dataset

Team

Flamcon is developed by:

Ben Harris

Acknowledgments

This code is based on Lucidrains' Flamingo implementation and OpenFlamingo. Thanks to both projects for expediting this work by improving my understanding of FSDP wrapping and checkpointing techniques!

Citations

@article{awadalla2023openflamingo,
  title={OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models},
  author={Anas Awadalla and Irena Gao and Josh Gardner and Jack Hessel and Yusuf Hanafy and Wanrong Zhu and Kalyani Marathe and Yonatan Bitton and Samir Gadre and Shiori Sagawa and Jenia Jitsev and Simon Kornblith and Pang Wei Koh and Gabriel Ilharco and Mitchell Wortsman and Ludwig Schmidt},
  journal={arXiv preprint arXiv:2308.01390},
  year={2023}
}