Release v0.4.0 - monolingual data, mtdata echo, add new datasets at runtime · thammegowda/mtdata

Fix: allenai_nllb.json is now included in MANIFEST.in #137. Also migrated CI from Travis to GitHub Actions
Update ELRC datasets #138. Thanks @AlexUmnov
Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
Add Flores200 dev and devtests #145. Thanks @ZenBel
Add support for mtdata echo <ID>
Dataset entries now store only BibTeX keys, not the full citation text
- Creates the index cache as a JSON Lines file (WIP towards dataset statistics)
Simplified index loading
Simplified compression format handlers. Added support for opening .bz2 files without creating temp files.
All resources are moved to the mtdata/resource dir, and any new additions to that dir are automatically included in the Python package (a safeguard against future issues like #137)
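
As an illustration of the .bz2 item above, here is a minimal sketch of reading a .bz2 file as a text stream, without first decompressing it to a temporary file. It uses only the Python standard library; the function names `read_bz2_lines` and `roundtrip_demo` are hypothetical and this is not mtdata's actual handler code.

```python
import bz2
import os
import tempfile
from typing import Iterator, List


def read_bz2_lines(path: str) -> Iterator[str]:
    """Yield lines from a .bz2 file, decompressing on the fly.

    bz2.open in text mode returns a file-like object, so no
    decompressed copy of the file is written to disk.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            yield line.rstrip("\n")


def roundtrip_demo() -> List[str]:
    """Write a small .bz2 file, then stream it back (illustration only)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.txt.bz2")
        with bz2.open(path, mode="wt", encoding="utf-8") as out:
            out.write("hello\nworld\n")
        return list(read_bz2_lines(path))


if __name__ == "__main__":
    print(roundtrip_demo())
```

The temporary directory in `roundtrip_demo` exists only to create a sample input; the reading path itself never writes a decompressed copy.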

New and exciting features:

Support for adding new datasets at runtime (mtdata*.py from run dir). Note: you have to reindex by calling mtdata -ri list
Monolingual datasets support in progress (currently testing)
- Dataset IDs are now Group-name-version-lang1-lang2 for bitext and Group-name-version-lang for monolingual
- mtdata list is updated. mtdata list -l eng-deu for bitext and mtdata list -l eng for monolingual
- Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
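
The ID scheme described above can be sketched as a small parser. This is an assumption-laden illustration, not mtdata's own implementation: it assumes no field contains a hyphen, and the names `ParsedID`, `parse_dataset_id`, and the example IDs are hypothetical.

```python
from typing import NamedTuple, Tuple


class ParsedID(NamedTuple):
    group: str
    name: str
    version: str
    langs: Tuple[str, ...]  # two codes for bitext, one for monolingual

    @property
    def is_monolingual(self) -> bool:
        return len(self.langs) == 1


def parse_dataset_id(dataset_id: str) -> ParsedID:
    """Split an ID of the form Group-name-version-lang1[-lang2].

    Assumption: hyphens appear only as field separators.
    """
    parts = dataset_id.split("-")
    if len(parts) not in (4, 5):
        raise ValueError(f"Expected 4 or 5 hyphen-separated fields: {dataset_id!r}")
    group, name, version, *langs = parts
    return ParsedID(group, name, version, tuple(langs))


if __name__ == "__main__":
    # Made-up IDs that only follow the documented shape
    print(parse_dataset_id("Group-name-1-eng-deu"))
    print(parse_dataset_id("Group-name-1-eng").is_monolingual)
```

Counting fields is enough to tell the two kinds apart: five fields mean a bitext ID, four mean a monolingual one.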

Skipped 0.3.9: the changes were significant enough to warrant a bump from 0.3.x to 0.4.x
