Support megatron dataset for T5 #6659
Thanks for your contribution!
Codecov Report
@@            Coverage Diff             @@
##           develop    #6659     +/-   ##
===========================================
- Coverage    60.06%   59.90%   -0.17%
===========================================
  Files          552      554       +2
  Lines        81755    81975     +220
===========================================
  Hits         49105    49105
- Misses       32650    32870     +220
Force-pushed: a47f2dc to 3bed5b1
Force-pushed: e843a53 to 37ce41e
@@ -26,11 +26,13 @@
python -u create_pretraining_data.py \
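In the data ID-conversion step, we need to set tokenizer_name to the tokenizer that matches the t5 model. Running the script below produces the processed pretraining data: the token ids baike_sample_ids.npy and the article index information baike_sample_idx.npz. (A processed pretraining dataset is provided here; click the link to download it.)
We need to produce a sample dataset for this part.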
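- There are currently merge conflicts that need to be resolved.
- Have the old and new data formats been checked against each other? For example, take the old-format npy and the new-format bin produced from the same data, fix the seed, run the first few steps, and check whether both versions yield the same data.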
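The alignment check suggested above can be sketched roughly as follows. This is a minimal illustration, not code from the PR: `first_mismatch` is a hypothetical helper, and the two inputs stand in for whatever sample iterators the old npy pipeline and the new bin pipeline produce under the same seed.

```python
import itertools


def first_mismatch(old_samples, new_samples, num_steps=10):
    """Compare the first `num_steps` samples of two pipelines.

    Each sample is assumed to be a 1-D sequence of token ids.
    Returns the index of the first differing step, or None if all
    compared steps match.
    """
    pairs = itertools.islice(zip(old_samples, new_samples), num_steps)
    for step, (old, new) in enumerate(pairs):
        if list(old) != list(new):
            return step
    return None


# Stand-in data: in practice these would come from the old (npy) and
# new (bin) dataset loaders built with the same seed.
old_version = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
new_version = [[1, 2, 3], [4, 5, 6], [7, 8, 0]]
print(first_mismatch(old_version, new_version))
```

A result of `None` over the first few steps is only weak evidence that the two formats are aligned; it does not replace comparing the full datasets.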
examples/language_model/t5/README.md
Outdated
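@@ -95,6 +98,7 @@ python -u -m paddle.distributed.launch \
- `dataloader_num_workers` Number of DataLoader sampling processes. If data input is the bottleneck, try increasing it.
- `eval_steps` Interval between model evaluations.
- `device` Training device; defaults to GPU.
- `data_impl` Format of the prepared input data files. Defaults to mmap; may be mmap or lazy.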
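Please add the difference between mmap and lazy: the "mmap" format sets up a memory mapping when the data is read, while the "lazy" format reads directly from the file.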
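Changed to: Format of the prepared input data files. Defaults to mmap; may be mmap or lazy. The "mmap" format sets up a memory mapping when the data is read, while the "lazy" format reads directly from the file.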
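To make the mmap/lazy distinction concrete, here is a small standard-library sketch. It is not PaddleNLP's `indexed_dataset` implementation: the flat little-endian int32 file layout and the helper names `write_tokens`, `read_lazy`, and `read_mmap` are assumptions made purely for illustration.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout for this sketch: a flat file of little-endian int32 token ids.
ITEM = struct.Struct("<i")


def write_tokens(path, token_ids):
    """Write token ids to `path` in the assumed flat int32 layout."""
    with open(path, "wb") as f:
        for token in token_ids:
            f.write(ITEM.pack(token))


def read_lazy(path, start, count):
    """'lazy' style: seek to the offset and read from the file on each access."""
    with open(path, "rb") as f:
        f.seek(start * ITEM.size)
        data = f.read(count * ITEM.size)
    return [value for (value,) in ITEM.iter_unpack(data)]


def read_mmap(path, start, count):
    """'mmap' style: map the file into memory and slice the mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            data = mapped[start * ITEM.size:(start + count) * ITEM.size]
    return [value for (value,) in ITEM.iter_unpack(data)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.bin")
    write_tokens(path, range(10))
    # Both access styles return the same tokens; they differ in how bytes are fetched.
    print(read_lazy(path, 3, 4), read_mmap(path, 3, 4))
```

A real loader would keep the mapping or file handle open across reads rather than reopening per call; the sketch reopens each time only to stay short.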
@@ -120,6 +120,7 @@ class DataArguments:
        default=3,
        metadata={"help": "Max N Grams"},
    )
    data_impl: str = field(default="mmap", metadata={"help": "Data implementation."})
Suggest keeping this consistent with llama: help="mmap/lazy format converted from preprocessed data."
Changed to: "help": "mmap/lazy format converted from preprocessed data."
model_zoo/ernie-1.0/args.py
Outdated
@@ -32,7 +32,7 @@ def parse_args(MODEL_CLASSES):
    parser.add_argument("--input_dir", default=None, type=str, required=True, help="The input directory where the data will be read from.", )
    parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the training logs and checkpoints will be written.")
    parser.add_argument("--split", type=str, default='949,50,1', help="Train/valid/test data split.")

    parser.add_argument("--data_impl", type=str, default='mmap', help="mmap/lazy format converted from json.")
Better to keep this consistent with llama: "mmap/lazy format converted from preprocessed data"
Changed to: help="mmap/lazy format converted from preprocessed data."
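For reference, the agreed help text could sit in a standalone parser like the sketch below. The `choices` restriction is an addition of this sketch, not part of the PR; it assumes mmap and lazy are the only accepted values, as the README text states. `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Build a minimal parser containing only the --data_impl option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_impl",
        type=str,
        default="mmap",
        # Assumption of this sketch: reject anything other than the two documented values.
        choices=["mmap", "lazy"],
        help="mmap/lazy format converted from preprocessed data.",
    )
    return parser


args = build_parser().parse_args(["--data_impl", "lazy"])
print(args.data_impl)
```

With `choices` set, a misspelled value fails at argument parsing instead of surfacing later when the dataset is loaded.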
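In the data preparation section, the bundled samples (token ids baike_sample_ids.npy and article index information baike_sample_idx.npz) should be replaced with files in the bin and idx formats. For how to build the data, see here; pay attention to the parameter settings, see here.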
Force-pushed: e7ab798 to 9cd67e7
Force-pushed: 9cd67e7 to 08ce7df
paddlenlp/data/indexed_dataset.py
I don't see any changes in this file; it doesn't need to be included in the PR.
examples/language_model/t5/README.md
Outdated
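@@ -20,17 +20,19 @@

The data pipeline is a very important part of pretraining. The [preprocessing document](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md) gives an overview of the whole data flow, and users can consult it for the details of data preparation.

In the data ID-conversion step, we need to set tokenizer_name to the tokenizer that matches the t5 model. Running the script below produces the processed pretraining data: the token ids [`baike_sample_ids.npy`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data//baike_sample_ids.npy) and the article index information [`baike_sample_idx.npz`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data//baike_sample_idx.npz). (A processed pretraining dataset is provided here; click the links to download it.)
In the data ID-conversion step, we need to set tokenizer_name to the tokenizer that matches the t5 model. Running the script below produces the processed pretraining data: the token ids [`gpt_openwebtext.bin`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/gpt_openwebtext.bin) and the article index information [`gpt_openwebtext.idx`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/gpt_openwebtext.idx). (A processed pretraining dataset is provided here; click the links to download it.)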
Why does the dataset here come out with a gpt prefix?
In the end t5 uses GPT's openwebtext dataset, which is why the prefix is gpt. Should I change it to t5?
What is the dataset called when a user follows your steps? Keep the names consistent; the steps must be reproducible.
> What is the dataset called when a user follows your steps? Keep the names consistent; the steps must be reproducible.
Fixed.
ok
Force-pushed: 06d6b9a to 673d356
Force-pushed: 1c9e3c4 to d225406
Support megatron dataset for T5
Force-pushed: d225406 to 6a41b11
PR types
New features
PR changes
APIs
Description
Support megatron dataset for T5