
v0.4.0 - monolingual data, mtdata echo, add new datasets at runtime

@thammegowda thammegowda released this 27 Mar 04:09
· 7 commits to master since this release
  • Fix: allenai_nllb.json is now included in MANIFEST.in #137. Also fixed CI: migrated from Travis to GitHub Actions
  • Update ELRC datasets #138. Thanks @AlexUmnov
  • Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
  • Add Flores200 dev and devtests #145. Thanks @ZenBel
  • Add support for mtdata echo <ID>
  • Dataset entries now store only BibTeX keys, not the full citation text
    • Creates the index cache as a JSON Lines file (WIP towards dataset statistics)
  • Simplified index loading
  • Simplified compression format handlers. Added support for opening .bz2 files without creating temp files.
  • All resources are moved to the mtdata/resource dir, and any new additions to that dir are automatically included in the Python package (a safeguard against future issues like #137)
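
To illustrate what opening .bz2 files without temp files can look like, here is a minimal sketch using only Python's standard `bz2` module. It is not mtdata's actual handler; the function name `read_bz2_lines` is hypothetical, and UTF-8 text is an assumption.

```python
import bz2
from pathlib import Path
from typing import Iterator, Union


def read_bz2_lines(path: Union[str, Path]) -> Iterator[str]:
    """Yield text lines from a .bz2 file, decompressing on the fly.

    Hypothetical helper for illustration only (not part of mtdata).
    Assumes the compressed content is UTF-8 text.
    """
    # bz2.open in text mode streams decompressed data directly,
    # so no temporary decompressed copy is written to disk.
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            yield line.rstrip("\n")
```

Because the function is a generator, lines are decompressed lazily as the caller iterates, which keeps memory use low for large corpora.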

New and exciting features:

  • Support for adding new datasets at runtime (loaded from mtdata*.py files in the run dir). Note: you have to reindex by calling mtdata -ri list
  • Monolingual datasets support in progress (currently testing)
    • Dataset IDs are now Group-name-version-lang1-lang2 for bitext and Group-name-version-lang for monolingual
    • mtdata list is updated: use mtdata list -l eng-deu for bitext and mtdata list -l eng for monolingual
    • Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
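
The two dataset ID shapes described above can be told apart by counting fields. The sketch below is an illustration only, not mtdata's own parser: `DatasetID` and `parse_dataset_id` are hypothetical names, and it assumes that no individual field contains a hyphen.

```python
from typing import NamedTuple, Tuple


class DatasetID(NamedTuple):
    """Hypothetical container for the fields of a dataset ID."""
    group: str
    name: str
    version: str
    langs: Tuple[str, ...]  # two entries for bitext, one for monolingual

    @property
    def is_monolingual(self) -> bool:
        return len(self.langs) == 1


def parse_dataset_id(did: str) -> DatasetID:
    """Split Group-name-version-lang1-lang2 or Group-name-version-lang.

    Assumption: fields themselves contain no '-' characters.
    """
    parts = did.split("-")
    if len(parts) not in (4, 5):
        raise ValueError(f"Expected 4 or 5 hyphen-separated fields: {did!r}")
    group, name, version, *langs = parts
    return DatasetID(group, name, version, tuple(langs))


# Placeholder IDs that follow the two shapes; not real dataset entries.
bitext = parse_dataset_id("Group-name-1-eng-deu")
mono = parse_dataset_id("Group-name-1-eng")
```

Here `bitext.langs` holds both language codes, while `mono.is_monolingual` is true because only one language field is present.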

Skipped 0.3.9: the changes were significant enough to warrant a bump from 0.3.x to 0.4.x.