Add model Prohetnet #1698
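Please take another look at the dataset loading. Both cnn_dailymail and gigaword can be loaded by passing the dataset name to load_dataset; the difference is that the former is a paddlenlp dataset and the latter is a HuggingFace dataset. Accessing and processing them should be essentially the same, though.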
--epochs=6 \
--lr=0.0001 \
--warmup_init_lr=1e-07 \
--warmup_updates=1000 \
It would be better to name this warmup_steps.
Fixed.
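For readers unfamiliar with these flags, here is a minimal sketch of how a warmup schedule with these three values could behave. It assumes the flags follow the convention of fairseq's inverse_sqrt scheduler (linear warmup from warmup_init_lr to lr, then inverse square root decay); the scheduler actually used in this PR is not shown in the thread, and `inverse_sqrt_lr` is a hypothetical helper name.

```python
import math


def inverse_sqrt_lr(step, lr=1e-4, warmup_init_lr=1e-7, warmup_steps=1000):
    """Learning rate at a given update step (hypothetical helper).

    Assumption: linear warmup followed by inverse square root decay,
    as in fairseq's inverse_sqrt scheduler.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Linear ramp from warmup_init_lr (step 0) towards lr (step warmup_steps).
        return warmup_init_lr + (lr - warmup_init_lr) * step / warmup_steps
    # From warmup_steps onwards, decay with the inverse square root of the step.
    return lr * math.sqrt(warmup_steps / step)


if __name__ == "__main__":
    for step in (0, 500, 1000, 4000):
        print(step, inverse_sqrt_lr(step))
```

With the defaults above, the rate reaches lr exactly at step warmup_steps and halves by the time the step count has quadrupled.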
test_data_src = 'data/' + args.dataset + '_data/uncased_tok_data/test.src'
test_data_tgt = 'data/' + args.dataset + '_data/uncased_tok_data/test.tgt'

test_dataset = load_dataset(
Can the cnn_dailymail dataset built into paddlenlp be used directly here?
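The original source code uses the cnn_dailymail data from the GLGE baseline, which differs slightly from paddlenlp's cnn_dailymail: the GLGE text contains an extra [S_SEP] tag, and I am not sure whether that would affect the results.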
Then are these two GLGE baseline datasets the same as the two Hugging Face datasets?
Both cnndm and gigaword differ somewhat.
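To make the [S_SEP] difference concrete, the sketch below shows one way GLGE-style text could be normalized for comparison with a dataset that lacks the tag. Whether the tag should be removed at all is left open in this thread, since a model trained on text containing it may depend on it. `strip_sentence_separators` is a hypothetical helper and is not part of paddlenlp or GLGE.

```python
import re

S_SEP = "[S_SEP]"  # sentence-separator tag mentioned in the thread


def strip_sentence_separators(text, sep=S_SEP):
    """Replace each separator tag with a space, then collapse whitespace.

    Hypothetical helper for comparing GLGE-style text with untagged text.
    """
    without_tags = text.replace(sep, " ")
    return re.sub(r"\s+", " ", without_tags).strip()


if __name__ == "__main__":
    print(strip_sentence_separators("a cat sat . [S_SEP] it slept ."))
```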
from .. import PretrainedTokenizer, BasicTokenizer, WordpieceTokenizer


class Trie:
This Trie already exists in the base class, so it should not need to be redefined here.
Fixed.
dev_data_src = 'data/' + args.dataset + '_data/uncased_tok_data/dev.src'
dev_data_tgt = 'data/' + args.dataset + '_data/uncased_tok_data/dev.tgt'

train_dataset = load_dataset(
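It should be possible to read the built-in cnn_dailymail dataset directly here. The gigaword dataset is also available on HuggingFace, and paddlenlp's load_dataset can read HF datasets as well.
If both can be loaded simply by passing the dataset name, some of the data processing code can probably be dropped.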
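If the GLGE-format files do need to be kept, the file-based path can at least stay small. The sketch below builds the .src/.tgt paths with os.path.join instead of string concatenation and pairs the two files line by line. The helper names and the "article"/"highlights" field names are assumptions for illustration, not taken from the PR.

```python
import os


def split_paths(dataset, split, root="data"):
    """Return (src_path, tgt_path) following the layout used in the diff."""
    folder = os.path.join(root, dataset + "_data", "uncased_tok_data")
    return (os.path.join(folder, split + ".src"),
            os.path.join(folder, split + ".tgt"))


def read_parallel(src_path, tgt_path):
    """Yield one example per aligned line pair (hypothetical reader).

    Note: zip stops at the shorter file, so misaligned files are
    silently truncated rather than reported.
    """
    with open(src_path, encoding="utf-8") as src_file, \
            open(tgt_path, encoding="utf-8") as tgt_file:
        for src_line, tgt_line in zip(src_file, tgt_file):
            yield {"article": src_line.strip(), "highlights": tgt_line.strip()}
```

A generator of this shape is the kind of read function that paddlenlp's load_dataset accepts in place of a dataset name, with the paths passed as keyword arguments and lazy=False to materialize the examples.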
Please remove the __init__.py under example.
Fixed.
LGTM
Description
Add new model Prophetnet
The model weight:
Link: https://pan.baidu.com/s/1FOnd01rNvDJoONYegacq1Q
Extraction code: o28q
The tokenizer vocab file:
Link: https://pan.baidu.com/s/1pUxLy6eGTZFqzf85OlIzUg
Extraction code: ltp6