Before using our model, make sure all required dependencies are installed in your environment. They provide the libraries and tools the model needs, so that inference runs smoothly.
Please follow these steps for the installation:
- Open the Terminal or Command Prompt: Depending on your operating system, open the corresponding command-line interface.
- Install dependencies using pip: enter the following command to install the required Python packages and libraries (a quick way to verify the installation is sketched below the command).
pip install -r requirements.txt
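If you want to confirm that the installation succeeded before running inference, you can check the environment with a short stand-alone script. The sketch below is an assumption rather than part of this repo (the file name check_env.py is hypothetical): it reads requirements.txt and reports whether each listed package is installed.

```python
# check_env.py -- hypothetical helper, not shipped with this repo.
# Reads requirements.txt and reports whether each listed package is
# installed in the current environment.
from importlib.metadata import PackageNotFoundError, version

with open("requirements.txt") as f:
    for line in f:
        req = line.split("#")[0].strip()  # drop comments and whitespace
        if not req:
            continue
        # Keep only the distribution name: drop extras and version pins.
        name = req.split(";")[0].split("[")[0]
        for sep in ("==", ">=", "<=", "~=", "!=", ">", "<"):
            name = name.split(sep)[0]
        name = name.strip()
        try:
            print(f"{name} {version(name)} -- OK")
        except PackageNotFoundError:
            print(f"{name} -- MISSING, run: pip install -r requirements.txt")
```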
After installing all the necessary dependencies, you can start using our model for inference. We provide two ways of performing inference: running a command in the terminal, and using interactive inference.
Here, we will use the example image asserts/demo.jpg for illustration.
If you want to directly run the inference script in the terminal, you can use the following command:
python chatme.py --image asserts/demo.jpg --question "How many apples are there on the shelf?"
This command will load the pre-trained model and perform inference using the provided image (demo.jpg) and question ("How many apples are there on the shelf?").
The model will analyze the image and attempt to answer the question. The inference result will be output to the terminal in text form, for example:
Xiaochuan: There are three apples on the shelf.
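Because chatme.py is an ordinary command-line script, you can also drive it from your own Python code. The following is a minimal sketch under two assumptions: the wrapper file name batch_infer.py is hypothetical, and chatme.py prints only the answer line to stdout.

```python
# batch_infer.py -- hypothetical wrapper, not shipped with this repo.
# Calls the terminal inference script as a subprocess and returns the
# text it prints, so several questions can be scripted in one run.
import subprocess

def ask(image_path: str, question: str) -> str:
    """Run chatme.py on one image/question pair and return its output."""
    result = subprocess.run(
        ["python", "chatme.py", "--image", image_path, "--question", question],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if inference fails
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(ask("asserts/demo.jpg", "How many apples are there on the shelf?"))
```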
In addition to using the terminal for inference, you can also use the interactive inference feature to interact with the large model in real time. To start the interactive terminal, run the following command:
python main.py
This command will launch an interactive terminal that waits for you to enter the image path. You can type the image path (e.g., asserts/demo.jpg) in the terminal and press Enter. The model will load the image and then wait for you to enter a question.
Once you enter a question (e.g., "How many apples are there on the shelf?"), the model will analyze the image and attempt to answer it. The inference result will be output to the terminal in text form, for example:
Image Path >>>>> asserts/demo.jpg
User: How many apples are there on the shelf?
Xiaochuan: There are three apples on the shelf.
Using this approach, you can easily interact with the model and ask it various questions.
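If you want to reproduce this interactive flow in your own code, a simple read-answer loop is enough. The sketch below is not the repo's main.py; it reuses the documented chatme.py command for each question, and the empty-line exit convention is an assumption.

```python
# interactive_demo.py -- hypothetical sketch, not the repo's main.py.
# Reproduces the documented interactive flow: prompt once for an image
# path, then answer questions in a loop by shelling out to chatme.py.
import subprocess

def ask(image_path: str, question: str) -> str:
    """Run chatme.py on one image/question pair and return its output."""
    result = subprocess.run(
        ["python", "chatme.py", "--image", image_path, "--question", question],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    image_path = input("Image Path >>>>> ").strip()
    while True:
        question = input("User: ").strip()
        if not question:
            break  # assumption: an empty line ends the session
        print(ask(image_path, question))
```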
- Towards Real-World Test-Time Adaptation: Tri-Net Self-Training with Balanced Normalization
- Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering
- Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection
- Intra- and Inter-Slice Contrastive Learning for Point Supervised OCT Fluid Segmentation
- Partitioning Stateful Data Stream Applications in Dynamic Edge Cloud Environments
- Closed-loop Matters: Dual Regression Networks for Single Image Super-Resolution
- Graph Convolutional Networks for Temporal Action Localization
- NAT: Neural Architecture Transformer for Accurate and Compact Architectures
- Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search
- Contrastive Neural Architecture Search with Neural Architecture Comparators
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning
- Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation
- Self-Supervised Gait Encoding with Locality-Aware Attention for Person Re-Identification
- Detecting Adversarial Data by Probing Multiple Perturbations Using Expected Perturbation Score
- Masked Motion Encoding for Self-Supervised Video Representation Learning
- Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation
- Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection
- Polysemy Deciphering Network for Human-Object Interaction Detection
- Bidirectional Posture-Appearance Interaction Network for Driver Behavior Recognition
- Improving Generative Adversarial Networks with Local Coordinate Coding

- SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation
- Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks
- VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention
- Deep Multi-View Learning Using Neuron-Wise Correlation-Maximizing Regularizers
- Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation
- Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation
- Quasi-Balanced Self-Training on Noise-Aware Synthesis of Object Point Clouds for Closing Domain Gap

- Test-Time Model Adaptation for Visual Question Answering with Debiased Self-Supervisions
- Debiased Visual Question Answering from Feature and Sample Perspectives
- Intelligent Home 3D: Automatic 3D-House Design from Linguistic Descriptions Only
- Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization
- Cascade Reasoning Network for Text-based Visual Question Answering