Add enwiki-20200101 streaming dataset #23

Merged
knighton merged 8 commits into main from james/enwiki
Sep 29, 2022

Contributor

@knighton knighton commented Sep 28, 2022

Command to perform conversion to MDS:

streaming$ python3 -m streaming.text.convert.enwiki --in_root /workdisk/vlad/mlperf-bert-nvidia/workspace/bert_data/download/results4/ --out_root /workdisk/james/enwiki/

@dskhudia
Contributor

Can you add the command you used to call this script to the description so that it's available for future reference?

@karan6181
Collaborator

Looks good overall. Can you please address the two points below? Thanks!

  1. Add a description to the PR explaining what it is about.
  2. pre-commit is failing with an unused import and a missing docstring.

@knighton knighton changed the title from "enwiki processing" to "Add enwiki-20200101 streaming dataset" Sep 29, 2022
streaming/text/convert/enwiki.py (outdated, resolved)
streaming/text/convert/enwiki.py (resolved)
streaming/text/enwiki.py (outdated, resolved)
Collaborator

@karan6181 karan6181 left a comment

LGTM. Thanks!

@knighton knighton merged commit 311a94b into main Sep 29, 2022
@knighton knighton deleted the james/enwiki branch September 29, 2022 21:03
3 participants