Multi-Modal RAG Agent for Financial Analysis

This repository hosts a project that leverages LLMs to build a multi-modal Retrieval-Augmented Generation (RAG) agent. The agent uses Google Gemini models, Vertex AI embeddings, the Vertex AI SDK, LangChain, and Chainlit to deliver sophisticated financial analysis and investment advice based on user queries and provided documents.

Overview

The Multi-Modal RAG Agent is designed to function as a financial analyst, answering user queries with detailed investment advice. It combines multiple modalities and AI capabilities to interpret and analyze financial documents containing text, images, and tables, producing responses that are both informed and contextually relevant.

Demo

Key Features

  • Google Gemini Models: Uses the Gemini model family for high-quality text and image processing.
  • Vertex AI Embeddings: Embeds textual data for efficient retrieval and contextual understanding.
  • Vertex AI SDK: Provides seamless access to Gemini models and other Vertex AI services.
  • LangChain Integration: Chains the AI modules and models together into a coherent analysis pipeline.
  • Chainlit UI: Provides an interactive user interface for querying the agent and receiving advice (see the sketch after this list).
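As a rough illustration of how the UI layer fits in, the sketch below wires a Chainlit message handler to a Gemini chat model via LangChain. The model name and handler are illustrative; the repository's actual handler also runs retrieval before calling the model.

```python
import chainlit as cl
from langchain_google_vertexai import ChatVertexAI

# Illustrative model name; the real agent also injects retrieved
# text, table, and image context into the prompt.
llm = ChatVertexAI(model_name="gemini-pro")

@cl.on_message
async def on_message(message: cl.Message):
    # Forward the user's query to the model and send the reply back to the UI.
    response = await llm.ainvoke(message.content)
    await cl.Message(content=response.content).send()
```

Saved as app.py (a hypothetical file name), this would be launched with `chainlit run app.py`.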

RAG on Tables, Images and Texts [Source]

LangChain provides a multi-vector retriever, which lets us decouple the documents we want to use for answer synthesis from the references we want to use for retrieval. Quoting LangChain, using this we can "create a summary of a verbose document optimized to vector-based similarity search, but still pass the full document into the LLM to ensure no context is lost during answer synthesis." This approach is useful beyond raw text and can be applied just as well to tables or images to support RAG.
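A minimal sketch of this pattern, assuming Vertex AI embeddings with a Chroma vector store and an in-memory docstore; the chunks and summaries are placeholders for what the real pipeline produces, and the embedding model name is illustrative:

```python
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_google_vertexai import VertexAIEmbeddings

# Placeholder data: raw chunks for answer synthesis, summaries for retrieval.
raw_chunks = ["<full, verbose document chunk>"]
summaries = ["<short summary of that chunk>"]

id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=Chroma(
        collection_name="summaries",
        embedding_function=VertexAIEmbeddings(model_name="textembedding-gecko"),
    ),
    docstore=InMemoryStore(),
    id_key=id_key,
)

# Index the summaries for vector similarity search...
doc_ids = [str(uuid.uuid4()) for _ in raw_chunks]
retriever.vectorstore.add_documents(
    [
        Document(page_content=summary, metadata={id_key: doc_ids[i]})
        for i, summary in enumerate(summaries)
    ]
)
# ...but store the full chunks, which are what the LLM sees during synthesis.
retriever.docstore.mset(list(zip(doc_ids, raw_chunks)))
```

At query time the retriever matches against the summaries and returns the corresponding full chunks from the docstore.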

Multi-modal data using Multi-Vector Retriever

There are at least three ways to approach the problem, which utilize the multi-vector retriever framework as discussed above:

  1. Use multimodal embeddings (such as CLIP) to embed images and text together. Retrieve both using similarity search, with the raw images simply linked from a docstore. Pass raw images and text chunks to a multimodal LLM for answer synthesis.

  2. Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images. Embed and retrieve the text summaries using a text embedding model. Again, reference raw text chunks or tables from a docstore for answer synthesis by an LLM; in this case, we exclude raw images from the docstore (because it is not feasible to use a multi-modal LLM for synthesis).

  3. Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images. Embed and retrieve the image summaries with a reference to the raw image, as in option 1. Again, pass raw images and text chunks to a multimodal LLM for answer synthesis. This option is sensible if we don't want to use multimodal embeddings.

This project uses option 2, relying on the Gemini Pro text and vision models to generate the text and image summaries, as sketched below.
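A minimal sketch of the summarization step, assuming the Vertex AI SDK has been initialized and that table text and image bytes have already been extracted from the source documents; the prompts and helper names are illustrative:

```python
from vertexai.generative_models import GenerativeModel, Part

text_model = GenerativeModel("gemini-pro")            # text and table summaries
vision_model = GenerativeModel("gemini-pro-vision")   # image summaries

def summarize_table(table_text: str) -> str:
    # The summary is what gets embedded; the raw table is kept for synthesis.
    prompt = (
        "Summarize the following financial table so it can be found "
        f"by similarity search:\n\n{table_text}"
    )
    return text_model.generate_content(prompt).text

def summarize_image(image_bytes: bytes, mime_type: str = "image/png") -> str:
    # The image summary is embedded and retrieved in place of the raw image.
    part = Part.from_data(data=image_bytes, mime_type=mime_type)
    prompt = "Describe this chart or figure, focusing on the financial information it conveys."
    return vision_model.generate_content([part, prompt]).text
```

The resulting summaries are what get embedded and indexed; the raw text chunks and tables go into the docstore for answer synthesis.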

Getting Started

Prerequisites

  • Python 3.9+
  • Google Cloud account with Vertex AI enabled (see the initialization sketch after this list)
  • API keys for Vertex AI
  • Access to Gemini models
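Once these are in place, initializing the SDK looks roughly like the sketch below; the environment variable names are illustrative, and credentials are assumed to come from Application Default Credentials (e.g. `gcloud auth application-default login`):

```python
import os

import vertexai

# Point the SDK at the Google Cloud project where Vertex AI is enabled.
vertexai.init(
    project=os.environ["GOOGLE_CLOUD_PROJECT"],
    location=os.environ.get("GOOGLE_CLOUD_REGION", "us-central1"),
)
```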
