简体中文 | English

💌 Table of Contents

📰 News
📣 Latest Developments
🌈 Introduction
✨ Key Features
🔍 Installation
🔥 Tutorials
🤔 FAQ
📱 Model Library
📝 License
📌 Community

📰 News

🔥 Live Class on October 31, 2024: PaddleMIX v2.1 Release

🎉 PaddleMIX v2.1 has been released! At 20:00 on Thursday, October 31, Baidu's R&D engineers will explain the updates in detail, along with the implementation and application cases of PP-InsCapTagger, the multimodal data capability tagging model. Scan the QR code on the poster below to register!

📣 Latest Developments

🎉 2024.10.31 The External Developers' Creative Tutorials Page Has Been Updated

🌟 Since the launch of our Large Model Suite Premium Project Collection activity on September 6th, we have received 30 high-quality developer projects. Among them, 25 premium projects have successfully passed the platform evaluation and been featured.
🙏 We sincerely thank all developers for their wonderful creations based on our suite! 🚀 We cordially invite you to share your creativity as well - welcome to publish your tutorials on public web pages or in the PaddlePaddle AI Studio community!

🔥 PaddleMIX v2.1 Released on 2024.10.11

Supports the PaddleNLP 3.0 beta version, allowing early access to its latest features.
Added cutting-edge models like Qwen2-VL, InternVL2, and Stable Diffusion 3 (SD3).
Released our self-developed multimodal data capability tagging model PP-InsCapTagger, which can be used for data analysis and filtering. Experimental cases show that it can reduce data volume by 50% while maintaining model performance, significantly improving training efficiency.
The multimodal large models InternVL2, LLaVA, SD3, and SDXL are now adapted to the Ascend 910B, offering training and inference capabilities on domestic computing chips.

PaddleMIX v2.0 Released on 2024.07.25

Multimodal Understanding: Added LLaVA series, Qwen-VL, etc.; introduced Auto module to unify the SFT training process; introduced Mixtoken training strategy, increasing SFT throughput by 5.6 times.
Multimodal Generation: Released PPDiffusers 0.24.1, supporting video generation capabilities, and added LCM to the text-to-image model. Also added a PaddlePaddle version of PEFT and the Accelerate backend. Provided a ComfyUI plugin developed with PaddlePaddle.
Multimodal Data Processing Toolbox DataCopilot: Supports custom data structures, data transformation, and offline format checks. Includes basic statistical information and data visualization functionality.

PaddleMIX v1.0 Released on 2023.10.7

Added distributed training capabilities for vision-language pre-training models, and BLIP-2 now supports trillion-scale training.
Introduced the cross-modal application pipeline AppFlow, which supports 11 cross-modal applications such as automatic annotation, image editing, and audio-to-image with one click.
PPDiffusers released version 0.19.3, adding SDXL and related tasks.

🌈 Introduction

PaddleMIX is a multimodal large model development suite based on PaddlePaddle, integrating various modalities such as images, text, and video. It covers a wide range of multimodal tasks, including vision-language pre-training, fine-tuning, text-to-image, text-to-video, and multimodal understanding. It offers an out-of-the-box development experience while supporting flexible customization to meet diverse needs, empowering the exploration of general artificial intelligence.

The PaddleMIX toolchain includes data processing, model development, pre-training, fine-tuning, and inference deployment, supporting mainstream multimodal models such as EVA-CLIP, BLIP-2, and Stable Diffusion. With cross-modal task pipelines like AppFlow and text-to-image application pipelines, developers can quickly build multimodal applications.

An example of multimodal understanding is shown below:

Multimodal understanding 🤝 integrates visual 👀 and linguistic 💬 processing capabilities. It includes functions such as basic perception, fine-grained image understanding, and complex visual reasoning 🧠. Our Model Library offers practical applications for single-image, multi-image, and video inference. Features include natural image summarization 📝, question answering 🤔, OCR 🔍, sentiment recognition ❤️😢, specialized image analysis 🔬, and code interpretation 💻. These technologies can be applied in various fields such as education 📚, healthcare 🏥, industry 🏭, and more, enabling comprehensive intelligent analysis from static images 🖼️ to dynamic videos 🎥. We invite you to experience and explore these capabilities!

An example of multimodal generation is shown below:

Multimodal generation ✍️ combines the creative power of text 💬 and visuals 👀. It includes various technologies ranging from text-to-image 🖼️ to text-to-video 🎥, featuring advanced models like Stable Diffusion 3 and Open-Sora. We provide practical applications for single-image generation, multi-image synthesis, and video generation in ppdiffusers. These features cover areas such as artistic creation 🎨, animation production 📽️, and content generation 📝. With these technologies, creative generation from static images to dynamic videos can be applied in fields like education 📚, entertainment 🎮, advertising 📺, and more. We invite you to experience and explore these innovations!

Example of featured applications (click the titles for a quick jump to the online experience):

ComfyUI Creative Workflow	Art Style QR Code Model	Mix Image Overlay

Anime Text-to-Image	AI Art｜50+ Lora Style Overlays	ControlNet｜Partial Image Repainting

✨ Key Features

📱 Rich Multimodal Capabilities

PaddleMIX supports a wide range of the latest mainstream algorithm benchmarks and pre-trained models, covering vision-language pre-training, text-to-image, cross-modal visual tasks, and enabling diverse functionalities such as image editing, image description, and data annotation. Gateway: 📱 Model Library

🧩 Simple Development Experience

PaddleMIX provides a unified model development interface, allowing developers to quickly integrate and customize models. With the Auto module, users can efficiently load pre-trained models, perform tokenization, and easily complete model training, fine-tuning (SFT), inference, and deployment through a simplified API. Additionally, the Auto module supports developers in customizing automated model integration, ensuring flexibility and scalability while enhancing development efficiency.
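To make the idea of an "Auto" loader with custom model integration more concrete, here is a minimal sketch of the general registry pattern such modules tend to follow. It is not PaddleMIX's implementation: `AutoModelSketch`, `register`, and the two stand-in model classes are hypothetical names invented for this illustration.

```python
class AutoModelSketch:
    """Hypothetical 'Auto' loader: maps a model-name prefix to a model class."""

    _registry = {}

    @classmethod
    def register(cls, prefix):
        """Decorator that lets developers plug in their own model class."""
        def decorator(model_cls):
            cls._registry[prefix] = model_cls
            return model_cls
        return decorator

    @classmethod
    def from_pretrained(cls, name):
        """Pick the class whose registered prefix best matches the model name."""
        matches = [p for p in cls._registry if name.lower().startswith(p)]
        if not matches:
            raise KeyError(f"no model class registered for {name!r}")
        # Prefer the longest (most specific) matching prefix.
        return cls._registry[max(matches, key=len)](name)


@AutoModelSketch.register("llava")
class LlavaSketch:
    """Stand-in class; a real model would load weights here."""
    def __init__(self, name):
        self.name = name


@AutoModelSketch.register("qwen-vl")
class QwenVLSketch:
    """Second stand-in class to show that dispatch depends on the name."""
    def __init__(self, name):
        self.name = name


model = AutoModelSketch.from_pretrained("llava-1.5-7b")
print(type(model).__name__, model.name)
```

The point of the pattern is that calling code only names a checkpoint, while new model families are added by registering a class rather than by editing the loader.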

💡 High-Performance Distributed Training and Inference Capabilities

PaddleMIX offers high-performance distributed training and inference, integrating acceleration operators such as ✨Fused Linear✨ and ✨Flash Attention✨. It supports 🌀BF16 mixed-precision training and 4D hybrid-parallel strategies. For inference, it applies convolution layout optimization, GroupNorm fusion, and rotary positional encoding optimization. Together, these significantly speed up large-scale pre-training and inference.
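As a rough illustration of what a 4D parallel setup implies, the sketch below checks that the four parallel degrees multiply to the number of devices. The key names mirror the `hybrid_configs` dictionary of PaddlePaddle's `fleet.DistributedStrategy`; the helper function `make_hybrid_configs` and the example degrees are assumptions made for this example, not PaddleMIX code.

```python
from math import prod


def make_hybrid_configs(world_size, dp=1, mp=1, pp=1, sharding=1):
    """Build a 4D parallel config and check that it fits the device count."""
    degrees = {
        "dp_degree": dp,              # data parallel
        "mp_degree": mp,              # tensor (model) parallel
        "pp_degree": pp,              # pipeline parallel
        "sharding_degree": sharding,  # sharded optimizer/parameter states
    }
    if any(d < 1 for d in degrees.values()):
        raise ValueError("every parallel degree must be >= 1")
    if prod(degrees.values()) != world_size:
        raise ValueError(
            f"degrees multiply to {prod(degrees.values())}, "
            f"but world_size is {world_size}"
        )
    return degrees


# Example: 16 devices split evenly across the four dimensions.
configs = make_hybrid_configs(world_size=16, dp=2, mp=2, pp=2, sharding=2)
print(configs)
```

A mismatch between the degrees and the device count is a common launch-time error, so validating it before starting a job can save a failed run.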

🔧 Unique Features and Tools

The multimodal data processing toolbox, DataCopilot, accelerates model iteration and upgrades. It allows developers to perform basic data operations with low code based on specific tasks. Gateway: 🏆 Featured Models | Tools

🔍 Installation

1. Clone the PaddleMIX Repository

git clone https://github.com/PaddlePaddle/PaddleMIX
cd PaddleMIX

2. Create a Virtual Environment

conda create -n paddlemix python=3.10 -y
conda activate paddlemix

3. Install PaddlePaddle

Method 1: One-Click Installation (Recommended for GPU/CPU)

CUDA 11.x or 12.3
PaddlePaddle 3.0.0b1

sh build_paddle_env.sh

Method 2: Manual Installation

For detailed instructions on installing PaddlePaddle, please refer to the Installation Guide.
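Whichever method you use, it can help to confirm that Python can see the installed package. The snippet below is a minimal sketch that relies only on `paddle.__version__`; the `paddle_status` helper name is made up for illustration.

```python
import importlib.util


def paddle_status():
    """Return the installed PaddlePaddle version, or a hint if it is missing."""
    if importlib.util.find_spec("paddle") is None:
        return "paddle is not installed in this environment"
    import paddle  # imported lazily so the script also runs without Paddle
    return f"paddle {paddle.__version__}"


print(paddle_status())
```

With a working install, PaddlePaddle's own `paddle.utils.run_check()` performs a fuller check that also exercises the device.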

4. Ascend Environment Installation (Optional)

Currently, PaddleMIX supports the Ascend 910B chip (more models are in progress; if you have other model requirements, please submit an issue to let us know). The Ascend driver version is 23.0.3. Considering the variability in environments, we recommend using the standard image provided by PaddlePaddle to prepare your environment.

Refer to the command below to start the container; ASCEND_RT_VISIBLE_DEVICES specifies the visible NPU card numbers.

docker run -it --name paddle-npu-dev -v $(pwd):/work \
    --privileged --network=host --shm-size=128G -w=/work \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -e ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
    registry.baidubce.com/device/paddle-npu:cann80T13-ubuntu20-$(uname -m)-gcc84-py39 /bin/bash
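The value passed to ASCEND_RT_VISIBLE_DEVICES is a comma-separated list of card numbers. If you script the container launch, small helpers like the following can build and parse that value. Both functions are hypothetical and not part of PaddleMIX.

```python
def visible_devices_value(card_ids):
    """Format NPU card numbers as a comma-separated string without duplicates."""
    ids = sorted(set(card_ids))
    if any(i < 0 for i in ids):
        raise ValueError("card numbers must be non-negative")
    return ",".join(str(i) for i in ids)


def parse_visible_devices(value):
    """Turn a value such as '0,1,2' back into a list of integers."""
    return [int(part) for part in value.split(",") if part.strip()]


value = visible_devices_value(range(8))
print(value)  # 0,1,2,3,4,5,6,7
print(parse_visible_devices(value))
```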

Install PaddlePaddle inside the container

# Note: You need to install the CPU version of PaddlePaddle first. Currently, only Python 3.9 is supported.
python -m pip install --pre paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
python -m pip install --pre paddle-custom-npu -i https://www.paddlepaddle.org.cn/packages/nightly/npu/

5. Install Dependencies

Method 1: One-Click Installation (Recommended)

Run the following command to automatically install all necessary dependencies:

sh build_env.sh

Method 2: Manual Installation (Please refer to build_env.sh)

🔥 Tutorials

Quick Start

Multimodal Understanding: Beginner's Experience
Multimodal Generation: Zero-Basics Getting Started Guide
Cross-Modal Task Pipeline: End-to-End Process Demonstration

Hands-On Practice & Examples

LLaVA Model: Full Process Practice from Training to Inference
SDXL Application: Create Your Own Olympic Poster Generator
PaddleMIX Multimodal AI Applications: Project Classification Overview

Multi-Hardware Usage

For the model list and usage supported by Ascend 910B, please refer to Ascend Hardware Usage

Data Preparation & Fine-Tuning

Model Training and Fine-Tuning Techniques

Inference Deployment

Deployment Guide: From Development to Production Environment

📱 Model Library

Multimodal Understanding

Image-Text Pre-training

CLIP
EVA-CLIP
LLaVA
LLaVA-1.5
LLaVA-1.6
LLaVA-NeXT
Qwen-VL
Qwen2-VL
InternVL2
Mini-Monkey
CoCa
BLIP-2
MiniGPT-4
VisualGLM
CogVLM && CogAgent
InternLM-XComposer2

Open-World Visual Model

Grounding DINO
SAM
YOLO-World

More Multimodal Pre-trained Models

ImageBind

Data Analysis

PP-InsCapTagger

Multimodal Generation

Text-to-Image

Stable Diffusion
Stable Diffusion 3 (SD3)
ControlNet
T2I-Adapter
LDM
Unidiffuser
DiT
HunyuanDiT

Text-to-Video

LVDM
SVD
AnimateAnyone
OpenSora

Audio Generation

AudioLDM
AudioLDM2

For more model capabilities, please refer to the Model Capability Matrix

🏆 Featured Models | Tools

💎 Cross-Modal Task Pipeline AppFlow

Introduction

AppFlow is PaddleMIX's cross-modal application pipeline. It integrates cutting-edge algorithms such as LLaVA and Stable Diffusion and covers images, text, audio, and video. Through a flexible pipeline design, it provides more than ten multimodal applications, including text-to-image generation, text-to-video generation, text-to-audio generation, and image understanding, each with demo examples. Its highlight is one-click prediction: users run model inference with a few simple commands, with no training and little code, which significantly lowers the barrier to use. AppFlow also builds on the dynamic-static unification of the PaddlePaddle framework: by setting a few parameters, users can export a model from dynamic to static graph and run high-performance inference, enabling one-stop application deployment.
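The pipeline idea can be pictured as a chain of stages behind a single call. The sketch below is a toy stand-in, not the AppFlow API: `PipelineSketch`, the stage functions, and the app name are all hypothetical, and the stages return placeholder strings instead of running real models.

```python
class PipelineSketch:
    """Hypothetical cross-modal pipeline: runs stages in order on shared state."""

    def __init__(self, app, stages):
        self.app = app
        self.stages = list(stages)

    def __call__(self, **inputs):
        state = dict(inputs)
        for stage in self.stages:
            # Each stage reads what it needs and adds its own outputs.
            state.update(stage(state))
        return state


def caption_stage(state):
    """Stand-in for an image-understanding model."""
    return {"caption": f"a description of {state['image']}"}


def text2image_stage(state):
    """Stand-in for a text-to-image model."""
    return {"result": f"<image generated from: {state['caption']}>"}


task = PipelineSketch("caption_then_generate", [caption_stage, text2image_stage])
output = task(image="cat.png")
print(output["result"])
```

The caller supplies inputs once and reads the final result, while intermediate outputs such as the caption stay available in the returned state.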

Gateway: Application Documentation Example.

💎 Multimodal Data Processing Toolbox DataCopilot

Introduction

In real-world application scenarios, there is a substantial demand for fine-tuning multimodal large models using proprietary data to enhance model performance, making data elements the core of this process. Based on this, PaddleMIX provides the DataCopilot tool for data processing and analysis, allowing developers to achieve an end-to-end development experience within the PaddleMIX suite.

PP-InsCapTagger (Instance Capability Tagger) is a dataset capability tagging model implemented by DataCopilot based on PaddleMIX. It labels the capabilities of multimodal data instances. Optimizing a dataset according to the distribution of instance capabilities can improve model training efficiency and provides an efficient way to analyze and evaluate datasets. Using the model's tagging results to optimize the LLaVA SFT dataset can improve LLaVA training efficiency by 50% during the SFT phase.
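One way to picture tag-driven dataset optimization is to keep the instances that carry the rarest capability tags. The scoring rule, the `select_by_tag_rarity` function, and the sample tags below are assumptions for illustration only; the selection strategy actually used with PP-InsCapTagger is described in the DataCopilot documentation.

```python
from collections import Counter


def select_by_tag_rarity(instances, keep_ratio=0.5):
    """Keep the share of instances whose capability tags are rarest overall."""
    tag_counts = Counter(tag for inst in instances for tag in inst["tags"])

    def rarity(inst):
        # Rare tags contribute more; an instance without tags scores 0.
        return sum(1.0 / tag_counts[tag] for tag in inst["tags"])

    keep = max(1, round(len(instances) * keep_ratio))
    ranked = sorted(instances, key=rarity, reverse=True)
    return ranked[:keep]


data = [
    {"id": 0, "tags": ["ocr"]},
    {"id": 1, "tags": ["ocr"]},
    {"id": 2, "tags": ["ocr", "counting"]},
    {"id": 3, "tags": ["chart reasoning"]},
]
subset = select_by_tag_rarity(data, keep_ratio=0.5)
print([inst["id"] for inst in subset])  # [2, 3]
```

In this toy dataset the two retained instances still cover all three tags, which is the intuition behind reducing data volume without losing capability coverage.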

Gateway: Application Documentation Example.

PP-InsCapTagger

Model	ScienceQA	TextVQA	VQAv2	GQA	MMMU	MME
llava-1.5-7b (origin)	66.8	58.2	78.5	62	-	-
llava-1.5-7b (rerun)	69.01	57.6	79	62.95	36.89	1521 / 323
llava-1.5-7b (random 50%)	67.31	55.6	76.89	61.01	34.67	1421 / 286
llava-1.5-7b (our 50%)	70.24 (+2.93)	57.12 (+1.52)	78.32 (+1.43)	62.14 (+1.13)	37.11 (+2.44)	1476 (+55) / 338 (+52)

Values in parentheses are gains over the random 50% row.

Gateway: Application Documentation Example.
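The gains shown in parentheses in the table match the difference between the "our 50%" row and the "random 50%" row, which a few lines of Python can confirm. The dictionary keys are just labels chosen for this check; the two MME entries correspond to the two scores listed in the MME column.

```python
# Scores copied from the "random 50%" and "our 50%" rows of the table.
random_half = {
    "ScienceQA": 67.31, "TextVQA": 55.6, "VQAv2": 76.89,
    "GQA": 61.01, "MMMU": 34.67, "MME (1st)": 1421, "MME (2nd)": 286,
}
ours_half = {
    "ScienceQA": 70.24, "TextVQA": 57.12, "VQAv2": 78.32,
    "GQA": 62.14, "MMMU": 37.11, "MME (1st)": 1476, "MME (2nd)": 338,
}

# Round to two decimals to remove floating-point noise.
gains = {name: round(ours_half[name] - random_half[name], 2) for name in random_half}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```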

🤔 FAQ

For answers to some common questions about our project, please refer to the FAQ. If your question is not addressed, feel free to raise it in the Issues.

📝 License

This project is released under the Apache 2.0 license.

📌 Community Communication

Scan the QR code and fill out the questionnaire to join the communication group and engage deeply with numerous community developers and the official team.
