This repo contains the code and data for the LREC 2022 paper "MMChat: Multi-Modal Chat Dataset on Social Media".
from datasets import load_dataset
dataset = load_dataset("silver/mmchat")
# or
# dataset = load_dataset("silver/mmchat", "mmchat_hf")
# dataset = load_dataset("silver/mmchat", "mmchat_raw")
# dataset = load_dataset("silver/mmchat", "mmchat_lccc_filtered")
MMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese.
Each dialogue in MMChat is associated with one or more images (maximum 9 images per dialogue).
We design various strategies to ensure the quality of the dialogues in MMChat. Please read our paper for more details.
The images in the dataset are hosted on Weibo's static image server.
You can refer to the scripts provided in data_processing/weibo_image_crawler to download these images.
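For readers who only need a rough idea of what such a download step involves, here is a minimal sketch. It is not the crawler shipped in data_processing/weibo_image_crawler; the function names `local_name` and `download_images`, the `fetch` parameter, and the assumption that each image is addressed by a plain URL whose last path component is a usable file name are all hypothetical.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def local_name(url):
    """Use the last path component of the URL as the local file name (assumption)."""
    return os.path.basename(urlparse(url).path)


def download_images(urls, out_dir, fetch=urlretrieve):
    """Download each URL into out_dir, skipping files that already exist.

    Returns the URLs that failed so they can be retried later.
    `fetch` is injectable so the logic can be exercised without network access.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url in urls:
        target = os.path.join(out_dir, local_name(url))
        if os.path.exists(target):
            continue  # already downloaded in an earlier run
        try:
            fetch(url, target)
        except OSError:  # urllib's URLError and HTTPError are subclasses of OSError
            failed.append(url)
    return failed
```

Skipping existing files makes the download resumable, which matters when fetching a large number of images from a remote server that may drop connections.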
Two sample dialogues from MMChat are given below (translated from Chinese):
MMChat is released in different versions:
The MMChat dataset reported in our paper is given here. The Weibo content corresponding to each of these dialogues is "分享图片" (i.e., "Share Images" in English). The following table shows some basic statistics:
Item Description | Count |
---|---|
Sessions | 120.84 K |
Sessions with more than 4 utterances | 17.32 K |
Utterances | 314.13 K |
Images | 198.82 K |
Avg. utterance per session | 2.599 |
Avg. image per session | 2.791 |
Avg. character per utterance | 8.521 |
The above dialogues can be downloaded from either Google Drive or Baidu Netdisk.
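The quantities in the table above can be recomputed from the dialogues with a few lines of Python. The sketch below is illustrative only: the field names `dialog` and `images`, the function name `session_stats`, and the counting conventions (characters counted with `len()`, images counted as distinct identifiers) are assumptions, not the procedure used in the paper.

```python
def session_stats(sessions):
    """Compute basic corpus statistics for a list of dialogue sessions.

    Each session is assumed to be a dict with a "dialog" list of utterance
    strings and an "images" list of image identifiers (hypothetical schema).
    """
    n_sessions = len(sessions)
    n_utterances = sum(len(s["dialog"]) for s in sessions)
    n_chars = sum(len(u) for s in sessions for u in s["dialog"])
    n_image_refs = sum(len(s["images"]) for s in sessions)
    return {
        "sessions": n_sessions,
        "sessions_gt4": sum(1 for s in sessions if len(s["dialog"]) > 4),
        "utterances": n_utterances,
        "unique_images": len({img for s in sessions for img in s["images"]}),
        "avg_utterance_per_session": n_utterances / n_sessions,
        "avg_image_per_session": n_image_refs / n_sessions,
        "avg_character_per_utterance": n_chars / n_utterances,
    }


# Toy data for illustration; these are not dialogues from MMChat.
toy_sessions = [
    {"dialog": ["你好", "好久不见", "是啊"], "images": ["a", "b"]},
    {"dialog": ["看图", "真美"], "images": ["b"]},
]
print(session_stats(toy_sessions))
```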
We perform human annotation on the sampled dialogues to determine whether the given images are related to the corresponding dialogues. The following table only shows the statistics for dialogues that are annotated as image-related.
Item Description | Count |
---|---|
Sessions | 19.90 K |
Sessions with more than 4 utterances | 8.91 K |
Utterances | 81.06 K |
Images | 52.66 K |
Avg. utterance per session | 4.07 |
Avg. image per session | 2.70 |
Avg. character per utterance | 11.93 |
We annotated about 100K dialogues. All the annotated dialogues can be downloaded from either Google Drive or Baidu Netdisk.
We are also releasing the raw dialogues we collected to facilitate further research. This version of MMChat contains raw dialogues filtered by our rules. The following table shows some basic statistics:
Item Description | Count |
---|---|
Sessions | 4.257 M |
Sessions with more than 4 utterances | 2.304 M |
Utterances | 18.590 M |
Images | 4.874 M |
Avg. utterance per session | 4.367 |
Avg. image per session | 1.670 |
Avg. character per utterance | 14.104 |
We divide the above dialogues into 9 splits to facilitate the download:
- Split0 Google Drive, Baidu Netdisk
- Split1 Google Drive, Baidu Netdisk
- Split2 Google Drive, Baidu Netdisk
- Split3 Google Drive, Baidu Netdisk
- Split4 Google Drive, Baidu Netdisk
- Split5 Google Drive, Baidu Netdisk
- Split6 Google Drive, Baidu Netdisk
- Split7 Google Drive, Baidu Netdisk
- Split8 Google Drive, Baidu Netdisk
This version of MMChat contains the dialogues that are filtered based on the LCCC (Large-scale Cleaned Chinese Conversation) dataset.
Specifically, some dialogues in MMChat are also contained in LCCC.
We regard these dialogues as cleaner, since LCCC applies sophisticated schemes to filter out noise.
This version of MMChat is obtained using the script data_processing/LCCC_filter.py.
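The idea of keeping only dialogues that also occur in a reference corpus can be sketched as a set-membership test. This is a minimal illustration and not the logic of data_processing/LCCC_filter.py: the helper names, the `dialog` field, the whitespace normalization, and the requirement that a whole dialogue match exactly are all assumptions made for the example.

```python
def normalize(utterance):
    """Drop all whitespace so space-tokenized and raw text compare equal."""
    return "".join(utterance.split())


def dialogue_key(utterances):
    """Hashable key for a dialogue: the tuple of its normalized utterances."""
    return tuple(normalize(u) for u in utterances)


def filter_by_reference(candidates, reference):
    """Keep the candidate sessions whose full dialogue appears in `reference`.

    `candidates` is a list of dicts with a "dialog" list (hypothetical schema);
    `reference` is a list of dialogues, each a list of utterance strings.
    """
    reference_keys = {dialogue_key(d) for d in reference}
    return [c for c in candidates if dialogue_key(c["dialog"]) in reference_keys]


# Toy data for illustration; these are not dialogues from LCCC or MMChat.
reference_corpus = [["你 好", "好久 不见"], ["吃 了 吗", "吃 了"]]
candidate_sessions = [
    {"dialog": ["你好", "好久不见"], "images": ["a"]},
    {"dialog": ["看图", "真美"], "images": ["b"]},
]
kept = filter_by_reference(candidate_sessions, reference_corpus)
```

Building the key set once keeps each lookup at constant expected time, which matters when the reference corpus contains millions of dialogues.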
The following table shows some basic statistics:
Item Description | Count |
---|---|
Sessions | 492.6 K |
Sessions with more than 4 utterances | 208.8 K |
Utterances | 1.986 M |
Images | 1.066 M |
Avg. utterance per session | 4.031 |
Avg. image per session | 2.514 |
Avg. character per utterance | 11.336 |
We divide the above dialogues into 9 splits to facilitate the download:
- Split0 Google Drive, Baidu Netdisk
- Split1 Google Drive, Baidu Netdisk
- Split2 Google Drive, Baidu Netdisk
- Split3 Google Drive, Baidu Netdisk
- Split4 Google Drive, Baidu Netdisk
- Split5 Google Drive, Baidu Netdisk
- Split6 Google Drive, Baidu Netdisk
- Split7 Google Drive, Baidu Netdisk
- Split8 Google Drive, Baidu Netdisk
We are also releasing all the code used for our experiments.
You can use the script run_training.sh in each folder to launch distributed training.
For models that require image features, you can extract them using the scripts in data_processing/extract_image_features.
The model shown in our paper can be found in dialog_image.
The pre-trained chinese_gpt_original model used in our experiments can be downloaded from Baidu Netdisk (extraction code: nmoc) or from Google Drive.
Please cite our paper if you find our work useful ;)
@inproceedings{zheng2022MMChat,
author = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},
title = {MMChat: Multi-Modal Chat Dataset on Social Media},
booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
year = {2022},
publisher = {European Language Resources Association},
}
@inproceedings{wang2020chinese,
title = {A Large-Scale Chinese Short-Text Conversation Dataset},
author = {Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
booktitle = {NLPCC},
year = {2020},
url = {https://arxiv.org/abs/2008.03946}
}