v0.4.0 - monolingual data, mtdata echo, add new datasets at runtime
- Fix: allenai_nllb.json is now included in MANIFEST.in #137. Also fixed CI: Travis -> GitHub Actions
- Update ELRC datasets #138. Thanks @AlexUmnov
- Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
- Add Flores200 dev and devtest #145. Thanks @ZenBel
- Add support for `mtdata echo <ID>`
- Dataset entries now store only BibTeX keys, not full citation text
- Index cache is created as a JSONLines file (WIP towards dataset statistics)
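The JSONLines cache format stores one JSON object per line, which makes the index easy to stream and append to. A minimal sketch of the pattern (function and field names here are illustrative, not mtdata's actual API):

```python
import json

def save_index(entries, path):
    # One JSON object per line: append-friendly and streamable.
    with open(path, "w", encoding="utf-8") as f:
        for e in entries:
            f.write(json.dumps(e) + "\n")

def load_index(path):
    # Skip blank lines so a trailing newline is harmless.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```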
- Simplified index loading
- Simplified compression format handlers; added support for reading .bz2 files without creating temp files
- All resources are moved to the `mtdata/resource` dir, and any new additions to that dir are automatically included in the Python package (fail-proof against future issues like #137)
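One common way to achieve this with setuptools is a glob in `package_data`, so future files under the resource dir ship automatically. This is a sketch under that assumption, not necessarily mtdata's exact packaging setup:

```python
# setup.py sketch (illustrative): ship every file under mtdata/resource
# with the package, so new additions need no MANIFEST.in edits.
from setuptools import setup, find_packages

setup(
    name="mtdata",
    packages=find_packages(),
    package_data={"mtdata": ["resource/*"]},  # glob covers future files too
    include_package_data=True,
)
```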
New and exciting features:
- Support for adding new datasets at runtime (`mtdata*.py` from the run dir). Note: you have to reindex by calling `mtdata -ri list`
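The runtime-extension pattern above amounts to discovering and importing any `mtdata*.py` in the run directory, letting each module register its datasets on import. A self-contained sketch of that discovery step (the hook mtdata actually uses may differ; names here are assumptions):

```python
import importlib.util
from pathlib import Path

def load_runtime_modules(run_dir="."):
    # Find and import every mtdata*.py in the run dir; importing a
    # module gives it a chance to register new dataset entries.
    modules = []
    for path in sorted(Path(run_dir).glob("mtdata*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        mod = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(mod)
        modules.append(mod)
    return modules
```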
- Monolingual datasets support in progress (currently testing)
- Dataset IDs are now `Group-name-version-lang1-lang2` for bitext and `Group-name-version-lang` for monolingual. `mtdata list` is updated: `mtdata list -l eng-deu` for bitext and `mtdata list -l eng` for monolingual
- Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
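The new ID scheme can be illustrated with a naive parser: five hyphen-separated fields mean bitext, four mean monolingual. This is a sketch only (it assumes no field itself contains a hyphen, which real names like `news-discussions` would violate; mtdata's own parsing handles this):

```python
def parse_dataset_id(did: str) -> dict:
    # Group-name-version-lang1-lang2 (bitext) or
    # Group-name-version-lang (monolingual); fields assumed hyphen-free.
    parts = did.split("-")
    if len(parts) == 5:
        group, name, version, lang1, lang2 = parts
        return dict(group=group, name=name, version=version, langs=(lang1, lang2))
    if len(parts) == 4:
        group, name, version, lang = parts
        return dict(group=group, name=name, version=version, langs=(lang,))
    raise ValueError(f"Unrecognized dataset ID: {did}")
```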
- Skipped 0.3.9 because the changes are too significant and we wanted to bump 0.3.x -> 0.4.x