pre-training corpus #80

Open
Humorloos opened this issue Jun 2, 2022 · 0 comments


Humorloos commented Jun 2, 2022

Hello @autoliuweijie, thank you for your amazing and inspiring work!

I would like to pre-train a K-BERT model on an English-language corpus. To make that work, I am currently trying to get train_and_validate() to run with args.target set to "bert". With this setting, BertDataLoader is used to load the data, but I am not sure what exact format the dataset file at dataset_path needs to have. From the code, I can see that it has to be a pickle file, but I am having trouble reconstructing one that works with the data loader.

It would be very helpful to have access to the data file originally used for pre-training. Could you provide a link or instructions on how to construct it myself?
