
Support megatron dataset for T5 #6659

Merged 1 commit into develop on Sep 13, 2023
Conversation

LaiXinyi823
Contributor

PR types

New features

PR changes

APIs

Description

Support megatron dataset for T5

@paddle-bot

paddle-bot bot commented Aug 9, 2023

Thanks for your contribution!

@codecov

codecov bot commented Aug 9, 2023

Codecov Report

Merging #6659 (6a41b11) into develop (e49842c) will decrease coverage by 0.17%.
Report is 2 commits behind head on develop.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           develop    #6659      +/-   ##
===========================================
- Coverage    60.06%   59.90%   -0.17%     
===========================================
  Files          552      554       +2     
  Lines        81755    81975     +220     
===========================================
  Hits         49105    49105              
- Misses       32650    32870     +220     
Files Changed Coverage Δ
paddlenlp/experimental/transformers/__init__.py 0.00% <0.00%> (ø)
...dlenlp/experimental/transformers/bloom/__init__.py 0.00% <0.00%> (ø)
...dlenlp/experimental/transformers/bloom/modeling.py 0.00% <0.00%> (ø)
...erimental/transformers/fused_transformer_layers.py 0.00% <0.00%> (ø)
...enlp/experimental/transformers/generation_utils.py 0.00% <0.00%> (ø)

@LaiXinyi823 LaiXinyi823 force-pushed the develop branch 2 times, most recently from a47f2dc to 3bed5b1 Compare August 9, 2023 03:43
@CLAassistant

CLAassistant commented Aug 9, 2023

CLA assistant check
All committers have signed the CLA.

@LaiXinyi823 LaiXinyi823 force-pushed the develop branch 5 times, most recently from e843a53 to 37ce41e Compare August 9, 2023 09:12
@@ -26,11 +26,13 @@
python -u create_pretraining_data.py \
Collaborator

In the data ID conversion step, we need to set tokenizer_name to select the tokenizer matching the T5 model. Running the script below produces the processed pretraining data: the token ids baike_sample_ids.npy and the article index information baike_sample_idx.npz. (A processed pretraining dataset is provided here and can be downloaded via the link.)

A sample dataset needs to be prepared for this step.
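As a placeholder for the requested sample data, a tiny corpus could be generated like this. The file name and the one-json-object-per-line layout are assumptions for illustration, not the repo's actual sample format:

```python
import json

def write_sample_corpus(path, texts):
    """Write one JSON object per line, a common jsonl corpus layout."""
    with open(path, "w", encoding="utf-8") as f:
        for t in texts:
            f.write(json.dumps({"text": t}, ensure_ascii=False) + "\n")

# Hypothetical file name, chosen to match the baike_sample naming above.
write_sample_corpus("baike_sample.jsonl", [
    "Paddle is a deep learning framework.",
    "T5 is a text-to-text transformer model.",
])
```

A file like this could then be fed to the preprocessing script to produce the bin/idx pair.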

Contributor

@KB-Ding KB-Ding left a comment

  1. There are merge conflicts that need to be resolved.
  2. Has the data been verified to match between the old and new versions? For example, process the same data into the old-format npy and the new-format bin, fix the seed, run the first few steps, and check whether the two versions yield identical data.
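The suggested alignment check could be sketched roughly as follows. The arrays here stand in for token batches drawn from the two pipelines under a fixed seed; all names are hypothetical:

```python
import numpy as np

def first_batches_match(old_tokens, new_tokens, batch_size=4, steps=3):
    """Compare the first `steps` batches drawn from both pipelines."""
    for i in range(steps):
        a = old_tokens[i * batch_size:(i + 1) * batch_size]
        b = new_tokens[i * batch_size:(i + 1) * batch_size]
        if not np.array_equal(a, b):
            return False
    return True

# Synthetic stand-in data; in the real check, `old` would come from the
# npy pipeline and `new` from the bin/idx pipeline, with the same seed.
rng = np.random.default_rng(seed=42)
tokens = rng.integers(0, 32000, size=(64, 128))
identical = first_batches_match(tokens, tokens.copy())
```

If the first few batches match token for token, the loss curves of the first steps should match as well, which is what the author later confirms.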

@@ -95,6 +98,7 @@ python -u -m paddle.distributed.launch \
- `dataloader_num_workers` Number of DataLoader worker processes; when data input is the bottleneck, try increasing it.
- `eval_steps` Interval between model evaluations.
- `device` Training device; defaults to GPU.
- `data_impl` Format of the prepared input data files; defaults to mmap, and can be set to mmap or lazy.
Contributor

Please document the difference between mmap and lazy: the "mmap" format builds a memory map when reading the data, while the "lazy" format reads directly from the file.

Contributor Author

Changed to: specifies the format of the prepared input data files; defaults to mmap, and can be set to mmap or lazy. The "mmap" format builds a memory map when reading the data, while the "lazy" format reads directly from the file.
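The mmap/lazy distinction described here can be sketched with NumPy standing in for the dataset implementation; the file name and contents below are made up for illustration:

```python
import numpy as np

# Write a small token file so both read paths have something to read.
tokens = np.arange(1000, dtype=np.int32)
tokens.tofile("sample_tokens.bin")

# "mmap" style: build a memory map once; the OS pages data in on access.
mapped = np.memmap("sample_tokens.bin", dtype=np.int32, mode="r")

# "lazy" style: seek and read the requested slice directly from the file.
def read_slice(path, start, count, dtype=np.int32):
    with open(path, "rb") as f:
        f.seek(start * np.dtype(dtype).itemsize)
        return np.fromfile(f, dtype=dtype, count=count)

same = np.array_equal(mapped[100:110], read_slice("sample_tokens.bin", 100, 10))
```

Both paths return the same tokens; they differ in when and how bytes are pulled from disk, which is why mmap is the usual default for large pretraining corpora.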

@@ -120,6 +120,7 @@ class DataArguments:
default=3,
metadata={"help": "Max N Grams"},
)
data_impl: str = field(default="mmap", metadata={"help": "Data implementation."})
Contributor

Suggest keeping this consistent with llama: help="mmap/lazy format converted from preprocessed data."

Contributor Author

Changed to: "help": "mmap/lazy format converted from preprocessed data."
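The agreed-upon field, shown in isolation as a minimal sketch; the other fields of the real DataArguments class are omitted here:

```python
from dataclasses import dataclass, field

@dataclass
class DataArguments:
    # Help string matches the wording agreed on in this review thread.
    data_impl: str = field(
        default="mmap",
        metadata={"help": "mmap/lazy format converted from preprocessed data."},
    )

args = DataArguments()
```

PaddleNLP's argument parser reads the `metadata["help"]` entry when generating the CLI help text, which is why the wording is kept identical to the llama config.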

@@ -32,7 +32,7 @@ def parse_args(MODEL_CLASSES):
parser.add_argument("--input_dir", default=None, type=str, required=True, help="The input directory where the data will be read from.", )
parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the training logs and checkpoints will be written.")
parser.add_argument("--split", type=str, default='949,50,1', help="Train/valid/test data split.")

parser.add_argument("--data_impl", type=str, default='mmap', help="mmap/lazy format converted from json.")
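The flag above can be exercised on its own. Note that this sketch adds a `choices=` constraint for illustration; the original line does not have one:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_impl",
    type=str,
    default="mmap",
    choices=["mmap", "lazy"],  # illustration only; not in the original line
    help="mmap/lazy format converted from preprocessed data.",
)

defaults = parser.parse_args([])                       # no flag: mmap
lazy = parser.parse_args(["--data_impl", "lazy"])      # explicit lazy
```

With `choices=`, an unsupported value such as `--data_impl foo` fails at parse time instead of deep inside the dataset code.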
Contributor

Best to keep this consistent with llama: "mmap/lazy format converted from preprocessed data"

Contributor Author

Changed to: help="mmap/lazy format converted from preprocessed data."

Contributor

In the data preparation section, the preset sample token ids baike_sample_ids.npy and article index information baike_sample_idx.npz should be changed to the bin and idx formats. For making the data you can refer here, and pay attention to the parameter configuration, see here.

@LaiXinyi823
Contributor Author

> 1. There are merge conflicts that need to be resolved.
> 2. Has the data been verified to match between the old and new versions? For example, process the same data into the old-format npy and the new-format bin, fix the seed, run the first few steps, and check whether the two versions yield identical data.

  1. The conflicts have been resolved.
  2. Tested by running a few steps: the new and old versions read identical data and the loss matches.

Member

@gongel gongel left a comment

I don't see any changes in paddlenlp/data/indexed_dataset.py; it doesn't need to be included in this PR.

@@ -20,17 +20,19 @@

The data flow is a very important part of pretraining. The [preprocessing documentation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md) gives an overview of how the data is transformed; users can consult it for the details of data preparation.

In the data ID conversion step, we need to set tokenizer_name to select the tokenizer matching the T5 model. Running the script below produces the processed pretraining data: token ids [`baike_sample_ids.npy`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data//baike_sample_ids.npy) and article index information [`baike_sample_idx.npz`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data//baike_sample_idx.npz). (A processed pretraining dataset is provided here and can be downloaded via the links.)
In the data ID conversion step, we need to set tokenizer_name to select the tokenizer matching the T5 model. Running the script below produces the processed pretraining data: token ids [`gpt_openwebtext.bin`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/gpt_openwebtext.bin) and article index information [`gpt_openwebtext.idx`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/gpt_openwebtext.idx). (A processed pretraining dataset is provided here and can be downloaded via the links.)
Member

Why does the resulting dataset have a gpt prefix?

Contributor Author

In the end, T5 uses GPT's openwebtext dataset, so the prefix is gpt. Should it be changed to t5?

Member

What name does the dataset get when users follow your steps? Keep it consistent; the result must be reproducible.

Contributor Author

> What name does the dataset get when users follow your steps? Keep it consistent; the result must be reproducible.

Fixed.

examples/language_model/t5/t5_run_pretrain_trainer.py (outdated review comment, resolved)
@LaiXinyi823
Contributor Author

> I don't see any changes in paddlenlp/data/indexed_dataset.py; it doesn't need to be included in this PR.

ok

Support megatron dataset for T5
@gongel gongel merged commit a43138b into PaddlePaddle:develop Sep 13, 2023
4 checks passed