BERT Work-in-progress [edoardo] #336

Open
ehoelzl opened this issue Feb 26, 2021 · 0 comments

Comments

@ehoelzl
Contributor

ehoelzl commented Feb 26, 2021

The BERT task is currently being added to MLBench on this branch. Pre-processing works, and all pre-processed data is already stored in a bucket. However, pre-training requires scaling the data by 10x, which results in almost 370 GB of data. Having each worker download the full dataset is not feasible, as it would require very large disks on every node.

One way around this would be to mount the bucket containing all preprocessed shards and download them on demand.
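
To make the on-demand idea concrete, here is a minimal sketch of what a worker-side shard cache could look like. It is not what the branch implements. It assumes the bucket is already mounted as a local directory (for example through a FUSE mount) and that shards are flat files in that directory. The names `OnDemandShardCache`, `shards_for_worker`, `mount_dir`, `cache_dir` and `max_shards` are all hypothetical.

```python
import shutil
from collections import OrderedDict
from pathlib import Path


def shards_for_worker(shard_names, rank, world_size):
    """Assign shards round-robin so each worker only touches its own subset."""
    return list(shard_names)[rank::world_size]


class OnDemandShardCache:
    """Copy shards from a mounted bucket to local disk only when requested.

    Hypothetical helper: keeps at most `max_shards` local copies and evicts
    the least recently used one, so local disk usage stays bounded.
    Assumes shard names are flat file names (no sub-directories).
    """

    def __init__(self, mount_dir, cache_dir, max_shards=2):
        if max_shards < 1:
            raise ValueError("max_shards must be at least 1")
        self.mount_dir = Path(mount_dir)
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.max_shards = max_shards
        self._cached = OrderedDict()  # shard name -> local path, oldest first

    def get(self, shard_name):
        """Return a local path for `shard_name`, fetching it if needed."""
        if shard_name in self._cached:
            # Already local: mark it as most recently used.
            self._cached.move_to_end(shard_name)
            return self._cached[shard_name]

        # Make room before copying, so the cache never exceeds max_shards.
        while len(self._cached) >= self.max_shards:
            _, old_path = self._cached.popitem(last=False)
            old_path.unlink(missing_ok=True)  # requires Python 3.8+

        local_path = self.cache_dir / shard_name
        shutil.copyfile(self.mount_dir / shard_name, local_path)
        self._cached[shard_name] = local_path
        return local_path
```

With this layout, a worker would call `shards_for_worker(all_shards, rank, world_size)` once and then `cache.get(name)` for each of its shards in turn, so local disk usage is bounded by `max_shards` shard files rather than the full 370 GB.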
