diff --git a/README.md b/README.md
index 9731845e6..b5106aa88 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-[[中文主页]](README_ZH.md) | [[Docs]](README.md#documentation-index--文档索引-a-namedocumentationindex) | [[API]](https://alibaba.github.io/data-juicer) | [[*DJ-SORA*]](docs/DJ_SORA.md)
+[[中文主页]](README_ZH.md) | [[Docs]](#documents) | [[API]](https://alibaba.github.io/data-juicer) | [[*DJ-SORA*]](docs/DJ_SORA.md)
# Data-Juicer: A One-Stop Data Processing System for Large Language Models
@@ -16,8 +16,8 @@
-[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](README.md#documentation-index--文档索引-a-namedocumentationindex)
-[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md#documentation-index--文档索引-a-namedocumentationindex)
+[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](#documents)
+[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md#documents)
[![API Reference](https://img.shields.io/badge/Docs-API_Reference-blue?logo=Markdown)](https://alibaba.github.io/data-juicer/)
[![Paper](http://img.shields.io/badge/cs.LG-arXiv%3A2309.02033-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2309.02033)
@@ -45,7 +45,7 @@ In this new version, we support more features for **multimodal data (including v
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track!
- [2024-01-10] Discover new horizons in "Data Mixture"—Our second data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532174) for more information.
- [2024-01-05] We release **Data-Juicer v0.1.3** now!
-In this new version, we support **more Python versions** (3.7-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
+In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
- [2023-10-13] Our first data-centric LLM competition begins! Please
@@ -59,7 +59,7 @@ Table of Contents
* [Data-Juicer: A One-Stop Data Processing System for Large Language Models](#data-juicer-a-one-stop-data-processing-system-for-large-language-models)
* [Table of Contents](#table-of-contents)
* [Features](#features)
- * [Documentation Index | 文档索引](#documentation-index--文档索引-a-namedocumentationindex)
+ * [Documentation Index](#documents)
* [Demos](#demos)
* [Prerequisites](#prerequisites)
* [Installation](#installation)
@@ -111,19 +111,19 @@ Table of Contents
-## Documentation Index | 文档索引 <a name="documentationindex"/>
+## Documentation Index <a name="documents"/>
-- [Overview](README.md) | [概览](README_ZH.md)
-- [Operator Zoo](docs/Operators.md) | [算子库](docs/Operators_ZH.md)
-- [Configs](configs/README.md) | [配置系统](configs/README_ZH.md)
-- [Developer Guide](docs/DeveloperGuide.md) | [开发者指南](docs/DeveloperGuide_ZH.md)
-- ["Bad" Data Exhibition](docs/BadDataExhibition.md) | [“坏”数据展览](docs/BadDataExhibition_ZH.md)
-- Dedicated Toolkits | 专用工具箱
- - [Quality Classifier](tools/quality_classifier/README.md) | [质量分类器](tools/quality_classifier/README_ZH.md)
- - [Auto Evaluation](tools/evaluator/README.md) | [自动评测](tools/evaluator/README_ZH.md)
- - [Preprocess](tools/preprocess/README.md) | [前处理](tools/preprocess/README_ZH.md)
- - [Postprocess](tools/postprocess/README.md) | [后处理](tools/postprocess/README_ZH.md)
-- [Third-parties (LLM Ecosystems)](thirdparty/README.md) | [第三方库(大语言模型生态)](thirdparty/README_ZH.md)
+- [Overview](README.md)
+- [Operator Zoo](docs/Operators.md)
+- [Configs](configs/README.md)
+- [Developer Guide](docs/DeveloperGuide.md)
+- ["Bad" Data Exhibition](docs/BadDataExhibition.md)
+- Dedicated Toolkits
+ - [Quality Classifier](tools/quality_classifier/README.md)
+ - [Auto Evaluation](tools/evaluator/README.md)
+ - [Preprocess](tools/preprocess/README.md)
+ - [Postprocess](tools/postprocess/README.md)
+- [Third-parties (LLM Ecosystems)](thirdparty/README.md)
- [API references](https://alibaba.github.io/data-juicer/)
- [Awesome LLM-Data](docs/awesome_llm_data.md)
- [DJ-SORA](docs/DJ_SORA.md)
diff --git a/README_ZH.md b/README_ZH.md
index 405adaed5..b5b138907 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -1,4 +1,4 @@
-[[English Page]](README.md) | [[文档]](README_ZH.md#documentation-index--文档索引-a-namedocumentationindex) | [[API]](https://alibaba.github.io/data-juicer) | [[*DJ-SORA*]](docs/DJ_SORA_ZH.md)
+[[English Page]](README.md) | [[文档]](#documents) | [[API]](https://alibaba.github.io/data-juicer) | [[*DJ-SORA*]](docs/DJ_SORA_ZH.md)
# Data-Juicer: 为大语言模型提供更高质量、更丰富、更易“消化”的数据
@@ -14,8 +14,8 @@
[![ModelScope- Demos](https://img.shields.io/badge/ModelScope-Demos-4e29ff.svg?logo=data:image/svg+xml;base64,PHN2ZyB2aWV3Qm94PSIwIDAgMjI0IDEyMS4zMyIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCTxwYXRoIGQ9Im0wIDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtOTkuMTQgNzMuNDloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xNzYuMDkgOTkuMTRoLTI1LjY1djIyLjE5aDQ3Ljg0di00Ny44NGgtMjIuMTl6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTEyNC43OSA0Ny44NGgyNS42NXYyNS42NWgtMjUuNjV6IiBmaWxsPSIjMzZjZmQxIiAvPgoJPHBhdGggZD0ibTAgMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xOTguMjggNDcuODRoMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xOTguMjggMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xNTAuNDQgMHYyMi4xOWgyNS42NXYyNS42NWgyMi4xOXYtNDcuODR6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTczLjQ5IDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiMzNmNmZDEiIC8+Cgk8cGF0aCBkPSJtNDcuODQgMjIuMTloMjUuNjV2LTIyLjE5aC00Ny44NHY0Ny44NGgyMi4xOXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtNDcuODQgNzMuNDloLTIyLjE5djQ3Ljg0aDQ3Ljg0di0yMi4xOWgtMjUuNjV6IiBmaWxsPSIjNjI0YWZmIiAvPgo8L3N2Zz4K)](https://modelscope.cn/studios?name=Data-Jiucer&page=1&sort=latest&type=1)
[![HuggingFace- Demos](https://img.shields.io/badge/🤗HuggingFace-Demos-4e29ff.svg)](https://huggingface.co/spaces?&search=datajuicer)
-[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](README.md#documentation-index--文档索引-a-namedocumentationindex)
-[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md#documentation-index--文档索引-a-namedocumentationindex)
+[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](README.md#documents)
+[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](#documents)
[![API Reference](https://img.shields.io/badge/Docs-API_Reference-blue?logo=Markdown)](https://alibaba.github.io/data-juicer/)
[![Paper](http://img.shields.io/badge/cs.LG-arXiv%3A2309.02033-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2309.02033)
@@ -40,7 +40,7 @@ Data-Juicer(包含[DJ-SORA](docs/DJ_SORA_ZH.md))正在积极更新和维护
- [2024-01-10] 开启“数据混合”新视界——第二届Data-Juicer大模型数据挑战赛已经正式启动!立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532174),了解赛事详情。
-[2024-01-05] 现在,我们发布了 **Data-Juicer v0.1.3** 版本!
-在这个新版本中,我们支持了**更多Python版本**(3.7-3.10),同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)(包括文本、图像和音频。更多模态也将会在之后支持)。
+在这个新版本中,我们支持了**更多Python版本**(3.8-3.10),同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)(包括文本、图像和音频。更多模态也将会在之后支持)。
此外,我们的论文也更新到了[第三版](https://arxiv.org/abs/2309.02033) 。
- [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了!
@@ -53,7 +53,7 @@ Data-Juicer(包含[DJ-SORA](docs/DJ_SORA_ZH.md))正在积极更新和维护
* [Data-Juicer: 为大语言模型提供更高质量、更丰富、更易“消化”的数据](#data-juicer-为大语言模型提供更高质量更丰富更易消化的数据)
* [目录](#目录)
* [特点](#特点)
- * [Documentation Index | 文档索引](#documentation-index--文档索引-a-namedocumentationindex)
+ * [文档索引](#documents)
* [演示样例](#演示样例)
* [前置条件](#前置条件)
* [安装](#安装)
@@ -93,20 +93,20 @@ Data-Juicer(包含[DJ-SORA](docs/DJ_SORA_ZH.md))正在积极更新和维护
* **灵活 & 易扩展**:支持大多数数据格式(如jsonl、parquet、csv等),并允许灵活组合算子。支持[自定义算子](docs/DeveloperGuide_ZH.md#构建自己的算子),以执行定制化的数据处理。
-## Documentation Index | 文档索引 <a name="documentationindex"/>
-
-* [Overview](README.md) | [概览](README_ZH.md)
-* [Operator Zoo](docs/Operators.md) | [算子库](docs/Operators_ZH.md)
-* [Configs](configs/README.md) | [配置系统](configs/README_ZH.md)
-* [Developer Guide](docs/DeveloperGuide.md) | [开发者指南](docs/DeveloperGuide_ZH.md)
-* ["Bad" Data Exhibition](docs/BadDataExhibition.md) | [“坏”数据展览](docs/BadDataExhibition_ZH.md)
-* Dedicated Toolkits | 专用工具箱
- * [Quality Classifier](tools/quality_classifier/README.md) | [质量分类器](tools/quality_classifier/README_ZH.md)
- * [Auto Evaluation](tools/evaluator/README.md) | [自动评测](tools/evaluator/README_ZH.md)
- * [Preprocess](tools/preprocess/README.md) | [前处理](tools/preprocess/README_ZH.md)
- * [Postprocess](tools/postprocess/README.md) | [后处理](tools/postprocess/README_ZH.md)
-* [Third-parties (LLM Ecosystems)](thirdparty/README.md) | [第三方库(大语言模型生态)](thirdparty/README_ZH.md)
-* [API references](https://alibaba.github.io/data-juicer/)
+## 文档索引 <a name="documents"/>
+
+* [概览](README_ZH.md)
+* [算子库](docs/Operators_ZH.md)
+* [配置系统](configs/README_ZH.md)
+* [开发者指南](docs/DeveloperGuide_ZH.md)
+* [“坏”数据展览](docs/BadDataExhibition_ZH.md)
+* 专用工具箱
+ * [质量分类器](tools/quality_classifier/README_ZH.md)
+ * [自动评测](tools/evaluator/README_ZH.md)
+ * [前处理](tools/preprocess/README_ZH.md)
+ * [后处理](tools/postprocess/README_ZH.md)
+* [第三方库(大语言模型生态)](thirdparty/README_ZH.md)
+* [API 参考](https://alibaba.github.io/data-juicer/)
* [Awesome LLM-Data](docs/awesome_llm_data.md)
* [DJ-SORA](docs/DJ_SORA_ZH.md)
diff --git a/data_juicer/__init__.py b/data_juicer/__init__.py
index 4a91ee691..f0927b8b7 100644
--- a/data_juicer/__init__.py
+++ b/data_juicer/__init__.py
@@ -1,4 +1,4 @@
-__version__ = '0.1.3'
+__version__ = '0.2.0'
import os
import subprocess
diff --git a/setup.py b/setup.py
index 9ce0369b9..0cf944927 100644
--- a/setup.py
+++ b/setup.py
@@ -55,7 +55,7 @@ def get_install_requirements(require_f_paths, env_dir='environments'):
name='py-data-juicer',
version=version,
url='https://github.com/alibaba/data-juicer',
- author='SysML team of Alibaba DAMO Academy',
+ author='SysML Team of Alibaba Tongyi Lab',
description='A One-Stop Data Processing System for Large Language '
'Models.',
long_description=readme_md,