pretrain_corpus

We provide the following three pretraining data files, extracted from Wikipedia and the OpenWebText Corpus:

  • openwebtext_questions.txt contains questions extracted from a subset of the OpenWebText Corpus downloaded here.
  • wiki_long.txt contains long Wikipedia sequences (between 20 and 70 words) extracted from the 1M Wikipedia sentences downloaded with this script.
  • wiki_short.txt contains short Wikipedia sequences (between 5 and 30 words) extracted from the 1M Wikipedia sentences downloaded with this script.
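
The word-count ranges above suggest a simple length filter over the downloaded sentences. The following is a minimal sketch of such a filter, not the script actually used to build these files: the function names `word_count` and `filter_by_length` are hypothetical, words are assumed to be whitespace-separated tokens, and both bounds are assumed to be inclusive.

```python
from typing import Iterable, List


def word_count(sentence: str) -> int:
    """Count whitespace-separated tokens (assumed definition of a 'word')."""
    return len(sentence.split())


def filter_by_length(
    sentences: Iterable[str], min_words: int, max_words: int
) -> List[str]:
    """Keep sentences whose word count lies in [min_words, max_words].

    Inclusive bounds are an assumption; the README only says 'between'.
    """
    return [s for s in sentences if min_words <= word_count(s) <= max_words]


if __name__ == "__main__":
    # Toy input standing in for the 1M Wikipedia sentences.
    sentences = [
        "Too short .",
        "This sentence has exactly seven words here .",
        " ".join(["word"] * 25),
    ]
    short = filter_by_length(sentences, 5, 30)   # range stated for wiki_short.txt
    long_ = filter_by_length(sentences, 20, 70)  # range stated for wiki_long.txt
    print(len(short), len(long_))
```

Because the two ranges overlap (20 to 30 words), a sentence in that band would pass both filters under these assumptions.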