Add MultimodalQnA as MMRAG usecase in Example (#751)
Signed-off-by: Tiep Le <tiep.le@intel.com>
Signed-off-by: siddhivelankar23 <siddhi.velankar@intel.com>
Signed-off-by: sjagtap1803 <siddhant.jagtap@intel.com>
tileintel committed Sep 14, 2024
1 parent 06696c8 commit b6cce35
Showing 21 changed files with 2,558 additions and 0 deletions.
31 changes: 31 additions & 0 deletions MultimodalQnA/Dockerfile
@@ -0,0 +1,31 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
libgl1-mesa-glx \
libjemalloc-dev \
git

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

WORKDIR /home/user/
RUN git clone https://github.com/opea-project/GenAIComps.git

WORKDIR /home/user/GenAIComps
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r /home/user/GenAIComps/requirements.txt

COPY ./multimodalqna.py /home/user/multimodalqna.py

ENV PYTHONPATH=$PYTHONPATH:/home/user/GenAIComps

USER user

WORKDIR /home/user

ENTRYPOINT ["python", "multimodalqna.py"]
# ENTRYPOINT ["/usr/bin/sleep", "infinity"]
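As a quick sanity check, the megaservice image can be built directly from this Dockerfile. A minimal sketch, assuming you are in the `MultimodalQnA` directory and that `opea/multimodalqna:latest` is the tag your compose file references (the tag is illustrative):

```bash
# Build the MultimodalQnA megaservice image; the tag is illustrative.
# The proxy build args are standard Docker args, only needed behind a proxy.
cd GenAIExamples/MultimodalQnA
docker build -t opea/multimodalqna:latest \
  --build-arg http_proxy=$http_proxy \
  --build-arg https_proxy=$https_proxy \
  .
```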
188 changes: 188 additions & 0 deletions MultimodalQnA/README.md
@@ -0,0 +1,188 @@
# MultimodalQnA Application

Suppose you have a collection of videos and want to ask questions that extract insights from them. Answering those questions typically requires understanding visual cues in the videos, knowledge derived from the audio content, or often a mix of both. The MultimodalQnA framework is designed for exactly this purpose.

`MultimodalQnA` answers your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos. For this purpose, MultimodalQnA utilizes the [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal transformer encoder that merges visual and textual data into a unified semantic space. During the video ingestion phase, the BridgeTower model embeds both visual cues and auditory facts (as text), and those embeddings are stored in a vector database. To answer a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.

The MultimodalQnA architecture is shown below:

![architecture](./assets/img/MultimodalQnA.png)

MultimodalQnA is implemented on top of [GenAIComps](https://github.com/opea-project/GenAIComps); its flow chart is shown below:

```mermaid
---
config:
flowchart:
nodeSpacing: 100
rankSpacing: 100
curve: linear
theme: base
themeVariables:
fontSize: 42px
---
flowchart LR
%% Colors %%
classDef blue fill:#ADD8E6,stroke:#ADD8E6,stroke-width:2px,fill-opacity:0.5
classDef orange fill:#FBAA60,stroke:#ADD8E6,stroke-width:2px,fill-opacity:0.5
classDef orchid fill:#C26DBC,stroke:#ADD8E6,stroke-width:2px,fill-opacity:0.5
classDef invisible fill:transparent,stroke:transparent;
style MultimodalQnA-MegaService stroke:#000000
%% Subgraphs %%
subgraph MultimodalQnA-MegaService["MultimodalQnA-MegaService"]
direction LR
EM([Embedding <br>]):::blue
RET([Retrieval <br>]):::blue
LVM([LVM <br>]):::blue
end
subgraph User Interface
direction TB
a([User Input Query]):::orchid
Ingest([Ingest data]):::orchid
UI([UI server<br>]):::orchid
end
subgraph MultimodalQnA Gateway
direction LR
invisible1[ ]:::invisible
GW([MultimodalQnA Gateway<br>]):::orange
end
subgraph .
X([OPEA Microservice]):::blue
Y{{Open Source Service}}
Z([OPEA Gateway]):::orange
Z1([UI]):::orchid
end
TEI_EM{{Embedding service <br>}}
VDB{{Vector DB<br><br>}}
R_RET{{Retriever service <br>}}
DP([Data Preparation<br>]):::blue
LVM_gen{{LVM Service <br>}}
%% Data Preparation flow
%% Ingest data flow
direction LR
Ingest[Ingest data] -->|a| UI
UI -->|b| DP
DP <-.->|c| TEI_EM
%% Questions interaction
direction LR
a[User Input Query] -->|1| UI
UI -->|2| GW
GW <==>|3| MultimodalQnA-MegaService
EM ==>|4| RET
RET ==>|5| LVM
%% Embedding service flow
direction TB
EM <-.->|3'| TEI_EM
RET <-.->|4'| R_RET
LVM <-.->|5'| LVM_gen
direction TB
%% Vector DB interaction
R_RET <-.->|d|VDB
DP <-.->|e|VDB
```

This MultimodalQnA use case performs multimodal RAG using LangChain, Redis VectorDB, and Text Generation Inference (TGI) on Intel Gaudi2 or Intel Xeon Scalable processors. The Intel Gaudi2 accelerator supports both training and inference for deep learning models, in particular LLMs. Visit [Habana AI products](https://habana.ai/products) for more details.

The table below lists, for each microservice component in the MultimodalQnA architecture, the default open source project, hardware, port, and endpoint.

<details>
<summary><b>Gaudi default compose.yaml</b></summary>

| MicroService | Open Source Project | HW | Port | Endpoint |
| ------------ | --------------------- | ----- | ---- | ----------------------------------------------- |
| Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
| Retriever | Langchain, Redis | Xeon | 7000 | /v1/multimodal_retrieval |
| LVM | Langchain, TGI | Gaudi | 9399 | /v1/lvm |
| Dataprep | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions |

</details>
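With the services up, each microservice can be exercised individually at the ports above. A minimal sketch against the embedding service, assuming `host_ip` is set as described in the deployment section below and that the service accepts a JSON body with a `text` field (the exact request schema is defined by the corresponding GenAIComps microservice, so treat the payload as illustrative):

```bash
# Send a text query to the multimodal embedding microservice on port 6000.
# The payload fields are illustrative; check the GenAIComps service for the schema.
curl http://${host_ip}:6000/v1/embeddings \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"text": "An astronaut on the moon"}'
```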

## Required Models

The embedding and LVM models are set to the default values listed below:

| Service | Model |
| -------------------- | ------------------------------------------- |
| embedding-multimodal | BridgeTower/bridgetower-large-itm-mlm-gaudi |
| LVM | llava-hf/llava-v1.6-vicuna-13b-hf |

You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf` and `llava-hf/llava-1.5-13b-hf`, as needed.
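For instance, you could export the model identifier before sourcing the environment script below. This is a hedged sketch: the `LVM_MODEL_ID` variable name is an assumption, so confirm the exact name in the `set_env.sh` for your hardware.

```bash
# Hypothetical override; verify the variable name in set_env.sh.
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
```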

## Deploy MultimodalQnA Service

The MultimodalQnA service can be effortlessly deployed on either Intel Gaudi2 or Intel Xeon Scalable processors.

Currently, we support deploying MultimodalQnA services with Docker Compose.

### Set Up Environment Variables

To set up environment variables for deploying MultimodalQnA services, follow these steps:

1. Set the required environment variables:

```bash
# Example: export host_ip=$(hostname -I | awk '{print $1}')
export host_ip="External_Public_IP"
# Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
export no_proxy="Your_No_Proxy"
```

2. If you are in a proxy environment, also set the proxy-related environment variables:

```bash
export http_proxy="Your_HTTP_Proxy"
export https_proxy="Your_HTTPs_Proxy"
```

3. Set up other environment variables:

> Notice that you should run only **one** of the commands below, according to your hardware; otherwise, the port numbers may be set incorrectly.

```bash
# on Gaudi
source ./docker_compose/intel/hpu/gaudi/set_env.sh
# on Xeon
source ./docker_compose/intel/cpu/xeon/set_env.sh
```

### Deploy MultimodalQnA on Gaudi

Refer to the [Gaudi Guide](./docker_compose/intel/hpu/gaudi/README.md) to build docker images from source.

Find the corresponding [compose.yaml](./docker_compose/intel/hpu/gaudi/compose.yaml).

```bash
cd GenAIExamples/MultimodalQnA/docker_compose/intel/hpu/gaudi/
docker compose -f compose.yaml up -d
```

> Notice: Currently, only **Habana Driver 1.17.x** is supported for Gaudi.
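Once `docker compose` returns, you can confirm that the containers came up and watch the services initialize, for example:

```bash
# List the MultimodalQnA containers and their status.
docker compose -f compose.yaml ps
# Tail recent logs from all services while the models warm up.
docker compose -f compose.yaml logs --tail=50 -f
```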

### Deploy MultimodalQnA on Xeon

Refer to the [Xeon Guide](./docker_compose/intel/cpu/xeon/README.md) for more instructions on building docker images from source.

Find the corresponding [compose.yaml](./docker_compose/intel/cpu/xeon/compose.yaml).

```bash
cd GenAIExamples/MultimodalQnA/docker_compose/intel/cpu/xeon/
docker compose -f compose.yaml up -d
```
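After the containers are running, a question can be sent end to end through the MultimodalQnA gateway. A minimal sketch, assuming the gateway listens on port 8888 and exposes a `/v1/multimodalqna` route with a `messages` field (check your compose.yaml and set_env.sh for the actual port and request schema):

```bash
# Illustrative end-to-end smoke test; port, path, and payload are assumptions.
curl http://${host_ip}:8888/v1/multimodalqna \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"messages": "What do the astronauts say in this video?"}'
```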

## MultimodalQnA Demo on Gaudi2

![MultimodalQnA-upload-waiting-screenshot](./assets/img/upload-gen-trans.png)

![MultimodalQnA-upload-done-screenshot](./assets/img/upload-gen-captions.png)

![MultimodalQnA-query-example-screenshot](./assets/img/example_query.png)
Binary file added MultimodalQnA/assets/img/MultimodalQnA.png
Binary file added MultimodalQnA/assets/img/example_query.png
Binary file added MultimodalQnA/assets/img/upload-gen-captions.png
Binary file added MultimodalQnA/assets/img/upload-gen-trans.png