
Robin: Multimodal (Visual-Language) Models.  - CERC-AAI Lab - Robin v1.0 #184

Open
irthomasthomas opened this issue Dec 30, 2023 · 0 comments
Labels
llm (Large Language Models) · llm-experiments (experiments with large language models) · llm-function-calling (Function Calling with Large Language Models) · Models (LLM and ML model repos and links) · multimodal-llm (LLMs that combine modes such as text and image recognition)

Comments

@irthomasthomas (Owner)

  • CERC-AAI Lab - Robin v1.0

    The Robin team is proud to present Robin, a suite of multimodal (visual-language) models.
    These models outperform, or perform on par with, state-of-the-art models of similar scale.
    In the ever-evolving realm of artificial intelligence, the intersection of language understanding and visual perception has paved the way for groundbreaking multimodal models. We study different components and methods for merging pretrained vision and language models, with the goal of building better visual language models.
    As part of this first milestone, we release this LLaVA fork, which enables the Mistral-7B and OpenHermes-2.5 language models to process images. We combine pretrained LLMs (Vicuna, Mistral and OpenHermes 2.5) with vision models (CLIP and SigLIP), and further enhance capabilities by finetuning the vision encoder (see the sketch below).
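
A minimal sketch of what finetuning (rather than freezing) the vision encoder can look like with Hugging Face transformers. The CLIP checkpoint and the choice to unfreeze only the last two blocks are illustrative assumptions, not the Robin training recipe:

```python
# Illustrative sketch: unfreeze part of a pretrained CLIP vision encoder for
# finetuning. The model id and "last two blocks only" policy are assumptions;
# the Robin release may use a different encoder and unfreezing schedule.
from transformers import CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# Freeze everything first (the usual LLaVA-style default) ...
for param in vision_tower.parameters():
    param.requires_grad = False

# ... then unfreeze the last two transformer blocks for finetuning.
for block in vision_tower.vision_model.encoder.layers[-2:]:
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in vision_tower.parameters() if p.requires_grad)
print(f"trainable vision-encoder parameters: {trainable:,}")
```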

Models detailed below are available here: https://huggingface.co/agi-collective
The code used is available here: https://github.com/AGI-Collective/Robin/releases/tag/v1.0.0
Also, some related work by our team on aligning multimodal models: https://arxiv.org/abs/2304.13765
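
For reference, the released checkpoints can be discovered programmatically from the Hub. A small sketch using huggingface_hub; the exact model ids under agi-collective are not reproduced here:

```python
# Sketch: list the Robin checkpoints published under the agi-collective
# organisation on the Hugging Face Hub. Requires `pip install huggingface_hub`.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(author="agi-collective"):
    print(model.id)
```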
LLaVA Architecture Overview
The LLaVA architecture, an acronym for Large Language and Vision Assistant, represents a multimodal Visual Language Model (VLM). At its core, LLaVA integrates a pretrained language model with a pretrained vision encoder, connected through a projection layer. In its original incarnation, the Vicuna model served as the language foundation, while OpenAI's CLIP ViT-Large assumed the role of the vision encoder.
Building upon this foundation, as part of the first milestone we study the impact of different language models and vision encoders, and the effect of finetuning the vision encoder, on the performance of our multimodal model. Notably, our journey led us to experiment with fusing various versions of the Mistral AI LLMs with DeepMind's SigLIP vision encoder.
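
As a rough illustration of the wiring described above (not the actual Robin/LLaVA code), here is a minimal PyTorch-style sketch of a vision encoder feeding an LLM through a projection layer; the module interfaces, dimensions, and the two-layer MLP projector are all assumptions:

```python
# Minimal LLaVA-style wiring sketch (illustrative, not the Robin implementation):
# a vision encoder produces patch embeddings, a small projector maps them into
# the LLM's embedding space, and the projected "visual tokens" are prepended to
# the text token embeddings before the language model runs.
import torch
import torch.nn as nn


class VisualLanguageModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. CLIP ViT-L or SigLIP
        self.language_model = language_model   # e.g. Mistral-7B / OpenHermes-2.5
        # Two-layer MLP projector (an assumption here, in the style of LLaVA-1.5).
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, input_embeds):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        visual_feats = self.vision_encoder(pixel_values)
        visual_tokens = self.projector(visual_feats)
        # Prepend visual tokens to the text embeddings and run the LLM on the
        # combined sequence (assumes an HF-style `inputs_embeds` keyword).
        combined = torch.cat([visual_tokens, input_embeds], dim=1)
        return self.language_model(inputs_embeds=combined)
```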
Architecture Variations
Our model variations are best encapsulated in the table below, outlining the diverse combinations of language models, vision encoders, and fine-tuning strategies.

irthomasthomas added the inbox-url, llm, llm-experiments, llm-function-calling, and multimodal-llm labels on Dec 30, 2023
irthomasthomas added the Models label on Jan 9, 2024
ShellLM removed the llama label on May 9, 2024