This project provides a modular framework for running an ensemble of image-to-text models and then synthesizing their outputs into a single caption with a downstream LLM. The goal is to produce captions that are more accurate, valid, and descriptive than any single model can provide on its own. This can be useful for speeding up the creation of datasets for fine-tuning generative AI models such as Stable Diffusion.
As it stands, the default values assume the user has an NVIDIA GPU with at least 24 GB of VRAM.
This project is under active development and should generally be considered pre-release.
The system includes the following components:
- A BLIP2 captioning script that generates captions for a collection of images. By default, captions are saved as separate files in the image input directory with a '.b2cap' extension.
- An Open Flamingo captioning script. By default, captions are saved as separate files in the image input directory with a '.flamcap' extension.
- A WD14 tagging script that generates tags for images using pre-trained wd14 models. By default, tags are saved in the image input directory with a '.wd14cap' extension.
- A summarization script that attempts to combine the captions/tags using a llama-derived local model.
- A summarization script that attempts to combine the captions/tags using one of OpenAI's GPT models.
- A setup script that creates a venv and installs the requirements for each module.
- A run script that serves as a control center, letting the user choose which tasks to perform via command-line options.
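For example, with all three captioning modules enabled and default output extensions, an input directory might end up looking like the sketch below (filenames are illustrative; the summarize step then combines the caption/tag files for each image into a single final caption):

```
input_dir/
├── photo_001.jpg
├── photo_001.b2cap     # BLIP2 caption
├── photo_001.flamcap   # Open Flamingo caption
├── photo_001.wd14cap   # WD14 tags
├── photo_002.jpg
└── ...
```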
This project provides a wide range of options for you to customize its behavior. All options are passed to the run.sh control script:
--use_config_file
: Absolute path to a config file containing arguments to be used. If using both a config file and CLI arguments, this must be the first argument passed. See example_config_file.txt.

--use_blip2
: Generate BLIP2 captions of images in your input directory.

--use_open_flamingo
: Generate Open Flamingo captions of images in your input directory.

--use_wd14
: Generate WD14 tags for images in your input directory.

--summarize_with_gpt
: Use OpenAI's GPT to attempt to combine your caption files into one. Requires that the summarize_openai_api_key argument be passed with a valid OpenAI API key OR that the OPENAI_API_KEY environment variable be set. If this is set, do not use --summarize_with_llama. WARNING: this can get expensive, especially if using GPT-4.

--summarize_with_llama
: Use a llama-derived local model for combining/summarizing your caption files. If this is set, do not use --summarize_with_gpt.

--input_directory
: Absolute path to the input directory containing the image files you wish to caption.

--output_directory
: Output directory for saving caption files. If not set, defaults to the value passed to --input_directory.
--wd14_stack_models
: If set, runs three wd14 models ('SmilingWolf/wd-v1-4-convnext-tagger-v2', 'SmilingWolf/wd-v1-4-vit-tagger-v2', 'SmilingWolf/wd-v1-4-swinv2-tagger-v2') and takes the mean of their values.

--wd14_model
: If not stacking, which wd14 model to run. Default: 'SmilingWolf/wd-v1-4-swinv2-tagger-v2'

--wd14_threshold
: Minimum confidence threshold for wd14 tags. If --wd14_stack_models is passed, the threshold is applied before stacking. Default: 0.5

--wd14_filter
: Tags to filter out when running the wd14 tagger.

--wd14_output_extension
: File extension that wd14 tags will be saved with. Default: 'wd14cap'
--blip2_model
: BLIP2 model to use for generating captions. Default: 'blip2_opt/caption_coco_opt6.7b'

--blip2_use_nucleus_sampling
: Whether to use nucleus sampling when generating BLIP2 captions. Default: False

--blip2_beams
: Number of beams to use for BLIP2 captioning. More beams may be more accurate, but are slower and use more VRAM. Default: 6

--blip2_max_tokens
: max_tokens value to be passed to the BLIP2 model. Default: 75

--blip2_min_tokens
: min_tokens value to be passed to the BLIP2 model. Default: 20

--blip2_top_p
: top_p value to be passed to the BLIP2 model. Default: 1.0

--blip2_output_extension
: File extension that BLIP2 captions will be saved with. Default: 'b2cap'
--flamingo_example_img_dir
: Path to Open Flamingo example image/caption pairs.

--flamingo_model
: Open Flamingo model to be used for captioning. Default: 'openflamingo/OpenFlamingo-9B-vitl-mpt7b'

--flamingo_min_new_tokens
: min_tokens value to be passed to the Open Flamingo model. Default: 20

--flamingo_max_new_tokens
: max_tokens value to be passed to the Open Flamingo model. Default: 48

--flamingo_num_beams
: num_beams value to be passed to the Open Flamingo model. Default: 6

--flamingo_prompt
: Prompt value to be passed to the Open Flamingo model. Default: 'Output:'

--flamingo_temperature
: Temperature value to be passed to the Open Flamingo model. Default: 1.0

--flamingo_top_k
: top_k value to be passed to the Open Flamingo model. Default: 0

--flamingo_top_p
: top_p value to be passed to the Open Flamingo model. Default: 1.0

--flamingo_repetition_penalty
: Repetition penalty value to be passed to the Open Flamingo model. Default: 1.0

--flamingo_length_penalty
: Length penalty value to be passed to the Open Flamingo model. Default: 1.0

--flamingo_output_extension
: File extension that Open Flamingo captions will be saved with. Default: 'flamcap'
--summarize_gpt_model
: OpenAI model to use for summarization. Default: 'gpt-3.5-turbo'

--summarize_gpt_max_tokens
: Max tokens for GPT. Default: 75

--summarize_gpt_temperature
: Temperature to be set for GPT. Default: 1.0

--summarize_gpt_prompt_file_path
: File path to a TXT file containing the system prompt to be passed to GPT for summarizing your captions.

--summarize_file_extensions
: The file extensions/captions you want to be passed to your summarize model. Defaults to the values of the Flamingo, BLIP2, and WD14 output extensions, e.g., ['wd14cap','flamcap','b2cap'].

--summarize_openai_api_key
: Value of a valid OpenAI API key. Not needed if the OPENAI_API_KEY env variable is set.

--summarize_llama_model_repo_id
: Hugging Face repository ID of the Llama model to use for summarization. Must be set in conjunction with --summarize_llama_model_filename. Default: TheBloke/StableBeluga2-70B-GGML

--summarize_llama_model_filename
: Filename of the specific model to be used for Llama summarization. Must be set in conjunction with --summarize_llama_model_repo_id. Default: stablebeluga2-70b.ggmlv3.q2_K.bin

--summarize_llama_prompt_filepath
: Path to a prompt file that provides the system prompt for Llama summarization.

--summarize_llama_n_threads
: Number of CPU threads to run the llama model on. Default: 4

--summarize_llama_n_batch
: Batch size to load the llama model with. Default: 512

--summarize_llama_n_gpu_layers
: Number of layers to offload to the GPU. Default: 55

--summarize_llama_n_gqa
: Grouped-query attention (GQA) parameter; needs to be set to 8 for 70B models. Default: 8

--summarize_llama_max_tokens
: Maximum number of output tokens to use for Llama summarization. Default: 75

--summarize_llama_temperature
: Temperature value for controlling the randomness of Llama summarization. Default: 1.0

--summarize_llama_top_p
: top_p value to run the llama model with. Default: 1.0

--summarize_llama_frequency_penalty
: Frequency penalty value to run the llama model with. Default: 0

--summarize_llama_top_presence_penalty
: Presence penalty value to run the llama model with. Default: 0

--summarize_llama_prompt_template_path
: Path to a template file that defines the prompt template to use for your chosen llama model. Prompt templates should contain the following variables, enclosed in curly braces: {SYSTEM}, {CAPTIONS}, {TAGS}, {ASSISTANT}. These will be filled in by the script with their appropriate values for each iteration. See summarize/llama_prompt_template.txt for the default example.
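As an illustration, a hypothetical prompt template for a StableBeluga-style chat model might arrange the four placeholders like this (the authoritative layout is the shipped summarize/llama_prompt_template.txt; the exact headers depend on the model you choose):

```
### System:
{SYSTEM}

### User:
Captions:
{CAPTIONS}

Tags:
{TAGS}

### Assistant:
{ASSISTANT}
```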
git clone https://github.com/jbmiller10/CaptionFusionator.git
cd CaptionFusionator
chmod +x setup.sh
chmod +x run.sh
./setup.sh
git clone https://github.com/jbmiller10/CaptionFusionator.git
cd CaptionFusionator
setup.bat
You can run this project by executing the run.sh (Linux) or run.ps1 (Windows) script with your desired options. Here's an example command that uses multiple models and summarizes with a llama-derived model:
./run.sh --input_directory /path/to/your/image/dir --use_blip2 --use_open_flamingo --use_wd14 --wd14_stack_models --summarize_with_llama
Or, using options in a config file:
./run.sh --use_config_file /config_file.txt
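If you would rather summarize with OpenAI's GPT instead of a local llama model, a minimal sketch (assuming a Linux shell and a valid API key) looks like this; remember not to combine --summarize_with_gpt with --summarize_with_llama:

```bash
# Make the key available to the summarize module (or pass --summarize_openai_api_key instead)
export OPENAI_API_KEY="your-api-key-here"

./run.sh --input_directory /path/to/your/image/dir --use_blip2 --use_wd14 --summarize_with_gpt
```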
./run.ps1 --input_directory /path/to/your/image/dir --use_blip2 --use_open_flamingo --use_wd14 --wd14_stack_models --summarize_with_llama
Or, using a config file:
./run.ps1 --use_config_file ./config_file.txt
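The exact config file syntax is defined by example_config_file.txt in the repository; as a rough, hypothetical sketch (assuming one argument per line, which you should check against the shipped example), a config file could collect the same flags used above:

```
--input_directory /path/to/your/image/dir
--use_blip2
--use_open_flamingo
--use_wd14
--wd14_stack_models
--summarize_with_llama
```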
(in no particular order)
- Create .bat counterparts to setup.sh & run.sh for Windows (thanks, @lrzjason!)
- Set better defaults for current modules
- Set default models based on a user-defined VRAM value
- Add MiniGPT4-Batch module
- Add GIT (Generative Image-to-Text) module
- Add DeepFace module
- Add Described module