fix readme link #228

Merged Mar 7, 2024 (18 commits)
34 changes: 17 additions & 17 deletions README.md
@@ -1,4 +1,4 @@
[[中文主页]](README_ZH.md) | [[Docs]](README.md#documentation-index--文档索引-a-namedocumentationindex) | [[API]](https://alibaba.github.io/data-juicer) | [[*DJ-SORA*]](docs/DJ_SORA.md)
[[中文主页]](README_ZH.md) | [[Docs]](#documents) | [[API]](https://alibaba.github.io/data-juicer) | [[*DJ-SORA*]](docs/DJ_SORA.md)

# Data-Juicer: A One-Stop Data Processing System for Large Language Models

@@ -16,8 +16,8 @@



[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](README.md#documentation-index--文档索引-a-namedocumentationindex)
[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md#documentation-index--文档索引-a-namedocumentationindex)
[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](#documents)
[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md#documents)
[![API Reference](https://img.shields.io/badge/Docs-API_Reference-blue?logo=Markdown)](https://alibaba.github.io/data-juicer/)
[![Paper](http://img.shields.io/badge/cs.LG-arXiv%3A2309.02033-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2309.02033)

@@ -45,7 +45,7 @@ In this new version, we support more features for **multimodal data (including v
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track!
- [2024-01-10] Discover new horizons in "Data Mixture"—Our second data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532174) for more information.
- [2024-01-05] We release **Data-Juicer v0.1.3** now!
In this new version, we support **more Python versions** (3.7-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).

- [2023-10-13] Our first data-centric LLM competition begins! Please
@@ -59,7 +59,7 @@ Table of Contents
* [Data-Juicer: A One-Stop Data Processing System for Large Language Models](#data-juicer-a-one-stop-data-processing-system-for-large-language-models)
* [Table of Contents](#table-of-contents)
* [Features](#features)
* [Documentation Index | 文档索引](#documentation-index--文档索引-a-namedocumentationindex)
* [Documentation Index](#documents)
* [Demos](#demos)
* [Prerequisites](#prerequisites)
* [Installation](#installation)
@@ -111,19 +111,19 @@ Table of Contents



## Documentation Index | 文档索引 <a name="documentationindex"/>
## Documentation Index <a name="documents"/>

- [Overview](README.md) | [概览](README_ZH.md)
- [Operator Zoo](docs/Operators.md) | [算子库](docs/Operators_ZH.md)
- [Configs](configs/README.md) | [配置系统](configs/README_ZH.md)
- [Developer Guide](docs/DeveloperGuide.md) | [开发者指南](docs/DeveloperGuide_ZH.md)
- ["Bad" Data Exhibition](docs/BadDataExhibition.md) | [“坏”数据展览](docs/BadDataExhibition_ZH.md)
- Dedicated Toolkits | 专用工具箱
- [Quality Classifier](tools/quality_classifier/README.md) | [质量分类器](tools/quality_classifier/README_ZH.md)
- [Auto Evaluation](tools/evaluator/README.md) | [自动评测](tools/evaluator/README_ZH.md)
- [Preprocess](tools/preprocess/README.md) | [前处理](tools/preprocess/README_ZH.md)
- [Postprocess](tools/postprocess/README.md) | [后处理](tools/postprocess/README_ZH.md)
- [Third-parties (LLM Ecosystems)](thirdparty/README.md) | [第三方库(大语言模型生态)](thirdparty/README_ZH.md)
- [Overview](README.md)
- [Operator Zoo](docs/Operators.md)
- [Configs](configs/README.md)
- [Developer Guide](docs/DeveloperGuide.md)
- ["Bad" Data Exhibition](docs/BadDataExhibition.md)
- Dedicated Toolkits
- [Quality Classifier](tools/quality_classifier/README.md)
- [Auto Evaluation](tools/evaluator/README.md)
- [Preprocess](tools/preprocess/README.md)
- [Postprocess](tools/postprocess/README.md)
- [Third-parties (LLM Ecosystems)](thirdparty/README.md)
- [API references](https://alibaba.github.io/data-juicer/)
- [Awesome LLM-Data](docs/awesome_llm_data.md)
- [DJ-SORA](docs/DJ_SORA.md)
38 changes: 19 additions & 19 deletions README_ZH.md
@@ -1,4 +1,4 @@
[[English Page]](README.md) | [[文档]](README_ZH.md#documentation-index--文档索引-a-namedocumentationindex) | [[API]](https://alibaba.github.io/data-juicer) | [[*DJ-SORA*]](docs/DJ_SORA_ZH.md)
[[English Page]](README.md) | [[文档]](#documents) | [[API]](https://alibaba.github.io/data-juicer) | [[*DJ-SORA*]](docs/DJ_SORA_ZH.md)

# Data-Juicer: 为大语言模型提供更高质量、更丰富、更易“消化”的数据

@@ -14,8 +14,8 @@
[![ModelScope- Demos](https://img.shields.io/badge/ModelScope-Demos-4e29ff.svg?logo=data:image/svg+xml;base64,PHN2ZyB2aWV3Qm94PSIwIDAgMjI0IDEyMS4zMyIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCTxwYXRoIGQ9Im0wIDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtOTkuMTQgNzMuNDloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xNzYuMDkgOTkuMTRoLTI1LjY1djIyLjE5aDQ3Ljg0di00Ny44NGgtMjIuMTl6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTEyNC43OSA0Ny44NGgyNS42NXYyNS42NWgtMjUuNjV6IiBmaWxsPSIjMzZjZmQxIiAvPgoJPHBhdGggZD0ibTAgMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xOTguMjggNDcuODRoMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xOTguMjggMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xNTAuNDQgMHYyMi4xOWgyNS42NXYyNS42NWgyMi4xOXYtNDcuODR6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTczLjQ5IDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiMzNmNmZDEiIC8+Cgk8cGF0aCBkPSJtNDcuODQgMjIuMTloMjUuNjV2LTIyLjE5aC00Ny44NHY0Ny44NGgyMi4xOXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtNDcuODQgNzMuNDloLTIyLjE5djQ3Ljg0aDQ3Ljg0di0yMi4xOWgtMjUuNjV6IiBmaWxsPSIjNjI0YWZmIiAvPgo8L3N2Zz4K)](https://modelscope.cn/studios?name=Data-Jiucer&page=1&sort=latest&type=1)
[![HuggingFace- Demos](https://img.shields.io/badge/🤗HuggingFace-Demos-4e29ff.svg)](https://huggingface.co/spaces?&search=datajuicer)

[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](README.md#documentation-index--文档索引-a-namedocumentationindex)
[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md#documentation-index--文档索引-a-namedocumentationindex)
[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](README.md#documents)
[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](#documents)
[![API Reference](https://img.shields.io/badge/Docs-API_Reference-blue?logo=Markdown)](https://alibaba.github.io/data-juicer/)
[![Paper](http://img.shields.io/badge/cs.LG-arXiv%3A2309.02033-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2309.02033)

@@ -40,7 +40,7 @@ Data-Juicer(包含[DJ-SORA](docs/DJ_SORA_ZH.md))正在积极更新和维护
- [2024-01-10] 开启“数据混合”新视界——第二届Data-Juicer大模型数据挑战赛已经正式启动!立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532174),了解赛事详情。

-[2024-01-05] 现在,我们发布了 **Data-Juicer v0.1.3** 版本!
在这个新版本中,我们支持了**更多Python版本**(3.7-3.10),同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)(包括文本、图像和音频。更多模态也将会在之后支持)。
在这个新版本中,我们支持了**更多Python版本**(3.8-3.10),同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)(包括文本、图像和音频。更多模态也将会在之后支持)。
此外,我们的论文也更新到了[第三版](https://arxiv.org/abs/2309.02033) 。

- [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了!
@@ -53,7 +53,7 @@ Data-Juicer(包含[DJ-SORA](docs/DJ_SORA_ZH.md))正在积极更新和维护
* [Data-Juicer: 为大语言模型提供更高质量、更丰富、更易“消化”的数据](#data-juicer-为大语言模型提供更高质量更丰富更易消化的数据)
* [目录](#目录)
* [特点](#特点)
* [Documentation Index | 文档索引](#documentation-index--文档索引-a-namedocumentationindex)
* [文档索引](#documents)
* [演示样例](#演示样例)
* [前置条件](#前置条件)
* [安装](#安装)
@@ -93,20 +93,20 @@ Data-Juicer(包含[DJ-SORA](docs/DJ_SORA_ZH.md))正在积极更新和维护
* **灵活 & 易扩展**:支持大多数数据格式(如jsonl、parquet、csv等),并允许灵活组合算子。支持[自定义算子](docs/DeveloperGuide_ZH.md#构建自己的算子),以执行定制化的数据处理。


## Documentation Index | 文档索引 <a name="documentationindex"/>

* [Overview](README.md) | [概览](README_ZH.md)
* [Operator Zoo](docs/Operators.md) | [算子库](docs/Operators_ZH.md)
* [Configs](configs/README.md) | [配置系统](configs/README_ZH.md)
* [Developer Guide](docs/DeveloperGuide.md) | [开发者指南](docs/DeveloperGuide_ZH.md)
* ["Bad" Data Exhibition](docs/BadDataExhibition.md) | [“坏”数据展览](docs/BadDataExhibition_ZH.md)
* Dedicated Toolkits | 专用工具箱
* [Quality Classifier](tools/quality_classifier/README.md) | [质量分类器](tools/quality_classifier/README_ZH.md)
* [Auto Evaluation](tools/evaluator/README.md) | [自动评测](tools/evaluator/README_ZH.md)
* [Preprocess](tools/preprocess/README.md) | [前处理](tools/preprocess/README_ZH.md)
* [Postprocess](tools/postprocess/README.md) | [后处理](tools/postprocess/README_ZH.md)
* [Third-parties (LLM Ecosystems)](thirdparty/README.md) | [第三方库(大语言模型生态)](thirdparty/README_ZH.md)
* [API references](https://alibaba.github.io/data-juicer/)
## 文档索引 <a name="documents"/>

* [概览](README_ZH.md)
* [算子库](docs/Operators_ZH.md)
* [配置系统](configs/README_ZH.md)
* [开发者指南](docs/DeveloperGuide_ZH.md)
* [“坏”数据展览](docs/BadDataExhibition_ZH.md)
* 专用工具箱
* [质量分类器](tools/quality_classifier/README_ZH.md)
* [自动评测](tools/evaluator/README_ZH.md)
* [前处理](tools/preprocess/README_ZH.md)
* [后处理](tools/postprocess/README_ZH.md)
* [第三方库(大语言模型生态)](thirdparty/README_ZH.md)
* [API 参考](https://alibaba.github.io/data-juicer/)
* [Awesome LLM-Data](docs/awesome_llm_data.md)
* [DJ-SORA](docs/DJ_SORA_ZH.md)

2 changes: 1 addition & 1 deletion data_juicer/__init__.py
@@ -1,4 +1,4 @@
__version__ = '0.1.3'
__version__ = '0.2.0'

import os
import subprocess
2 changes: 1 addition & 1 deletion setup.py
@@ -55,7 +55,7 @@ def get_install_requirements(require_f_paths, env_dir='environments'):
name='py-data-juicer',
version=version,
url='https://github.com/alibaba/data-juicer',
author='SysML team of Alibaba DAMO Academy',
author='SysML Team of Alibaba Tongyi Lab',
description='A One-Stop Data Processing System for Large Language '
'Models.',
long_description=readme_md,