[nlp_data] Add BookCorpus #1406
Comments
The data source, Smashwords, has terms of service that prohibit redistribution. Neither the links above nor soskek/bookcorpus#27 mention getting approval from Smashwords or from the authors. We should clarify the legal risks before proceeding. |
There is no legal risk in linking to the dataset. All risk is being taken on by The Eye. The sole reason not to merge it is that someone doesn't like the idea of using the dataset, which is fine. But anyone who says there is risk is mistaken. |
(In other words, don't host the data yourself. Rely on the URL from The Eye. So, for example, all dataset preparation scripts should download from https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz, and books1.tar.gz itself should not be hosted anywhere else. By following this pattern, all risk is transferred to The Eye.) |
US law recognizes secondary infringement liability. One can be found liable for affirmatively encouraging or inducing known copyright violations. |
The datasets are hosted by The Eye, which fully respects DMCA: http://the-eye.eu/dmca If anyone were to file a DMCA notice against books1 or books3, they would extract the tarball, remove the infringing content, then re-upload the modified tarball. There is no risk linking to The Eye. |
In GluonNLP we store a hash of the tarball in source to ensure reproducibility. Linking to a source that will periodically change the contents of the file may not be optimal. |
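To make the reproducibility concern concrete, here is a minimal sketch of pinning a tarball hash and checking a download against it. It is not GluonNLP's actual code: the function names `file_sha256` and `verify_tarball` are hypothetical, and SHA-256 is an assumption, since the thread does not say which hash is stored.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a multi-gigabyte tarball never sits fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected_sha256):
    """Raise ValueError if the file does not match the pinned digest.

    `expected_sha256` is the value a preparation script would keep in source
    (hypothetical here; no real digest for books1.tar.gz is given in the thread).
    """
    actual = file_sha256(path)
    if actual != expected_sha256:
        raise ValueError(
            f"Hash mismatch for {path}: expected {expected_sha256}, got {actual}. "
            "The upstream file may have changed."
        )
    return True
```

If the upstream tarball were ever re-uploaded with content removed, a check like this would fail loudly instead of silently producing a different corpus.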
We may try to first add it and later figure out if we can hold a snapshot of BookCorpus by ourselves. What do you think? |
Happy to announce that bookcorpus was just merged into huggingface's Datasets library as So, huggingface is officially supporting this dataset now. The Eye also seems to be a trustworthy steward; I mentioned that "the tarball might change due to DMCA" as more of a theoretical concern rather than a practical reality. I doubt this tarball is going to change. |
@shawwn Really appreciate the information! I've tried out huggingface/datasets and find that it's quite good. In fact we can add it even if the tarball changes. It's the same as the strategy of the wikipedia corpus that we added: https://github.com/dmlc/gluon-nlp/blob/master/scripts/datasets/pretrain_corpus/prepare_wikipedia.py. Part of the purpose of |
Description
The book corpus can now have a reliable, stable download link from https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz. Also, there are more links in https://the-eye.eu/public/AI/pile_preliminary_components/ that are worth including in nlp_data. We may try to download from their link and provide the corresponding license.