Releases · thammegowda/mtdata

26 Apr 05:08

thammegowda

v0.4.1

9579e11

v0.4.1 Latest


Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
mtdata cache added. Improves concurrency by supporting multiple recipes
Added WMT general test 2022 and 2023
Added News Commentary 18.1 and News Crawl 2023
mtdata-bcp47 : -p/--pipe to map codes from stdin -> stdout
mtdata-bcp47 : --script {suppress-default,suppress-all,express}
Uses pigz to read and write gzip files by default when pigz is in PATH. Set USE_PIGZ=0 (export USE_PIGZ=0) to disable
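
The pigz switch described above can be approximated as follows. This is a minimal sketch, not mtdata's actual code: the helper names `should_use_pigz` and `read_gzip_lines` are hypothetical, and only the PATH lookup and the `USE_PIGZ=0` opt-out come from the release note.

```python
import gzip
import os
import shutil
import subprocess


def should_use_pigz(env=None, which=shutil.which):
    """Return True when pigz is on PATH and USE_PIGZ is not set to 0.

    Hypothetical helper; mtdata's real check may differ.
    """
    env = os.environ if env is None else env
    if env.get("USE_PIGZ", "1").strip() == "0":
        return False
    return which("pigz") is not None


def read_gzip_lines(path, env=None, which=shutil.which):
    """Yield text lines from a gzip file, through pigz when allowed."""
    if should_use_pigz(env, which):
        # `pigz -dc` decompresses to stdout, like `gzip -dc`.
        proc = subprocess.Popen(["pigz", "-dc", str(path)], stdout=subprocess.PIPE)
        try:
            for raw in proc.stdout:
                yield raw.decode("utf-8")
        finally:
            proc.stdout.close()
            proc.wait()
    else:
        # Pure-Python fallback when pigz is missing or disabled.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            yield from f
```

The `env` and `which` parameters exist only so the decision can be exercised without touching the real environment.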


27 Mar 04:09

thammegowda

v0.4.0

a50b916

v0.4.0 - monolingual data, mtdata echo, add new datasets at runtime

Fix: allenai_nllb.json is now included in MANIFEST.in #137. Also fixed CI by moving from Travis to GitHub Actions
Update ELRC datasets #138. Thanks @AlexUmnov
Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
Add Flores200 dev and devtests #145. Thanks @ZenBel
Add support for mtdata echo <ID>
Dataset entries store only BibTeX keys, not the full citation text
- Creates the index cache as a JSON Lines file (WIP towards dataset statistics)
Simplified index loading
Simplified compression format handlers. Added support for opening .bz2 files without creating temp files.
All resources are moved to the mtdata/resource dir, and any new additions to that dir are automatically included in the Python package (a safeguard against future issues like #137)

New and exciting features:

Support for adding new datasets at runtime (mtdata*.py from run dir). Note: you have to reindex by calling mtdata -ri list
Monolingual datasets support in progress (currently testing)
- Dataset IDs are now Group-name-version-lang1-lang2 for bitext and Group-name-version-lang for monolingual
- mtdata list is updated. mtdata list -l eng-deu for bitext and mtdata list -l eng for monolingual
- Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
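
The two ID layouts above (five fields for bitext, four for monolingual) can be told apart by counting hyphen-separated fields. The sketch below is illustrative only: `DatasetId` and `parse_dataset_id` are hypothetical names, not part of mtdata, and the code assumes that no individual field contains a hyphen.

```python
from typing import NamedTuple, Optional


class DatasetId(NamedTuple):
    """Hypothetical container for the fields of a dataset ID."""

    group: str
    name: str
    version: str
    lang1: str
    lang2: Optional[str]  # None for monolingual datasets

    @property
    def is_monolingual(self) -> bool:
        return self.lang2 is None


def parse_dataset_id(did: str) -> DatasetId:
    """Split an ID on hyphens; assumes no field contains a hyphen itself."""
    parts = did.split("-")
    if len(parts) == 5:  # Group-name-version-lang1-lang2
        return DatasetId(*parts)
    if len(parts) == 4:  # Group-name-version-lang
        return DatasetId(*parts, None)
    raise ValueError(
        f"expected 4 or 5 hyphen-separated fields, got {len(parts)}: {did!r}"
    )
```

For example, `parse_dataset_id("Statmt-commoncrawl-wmt19-fra-deu")` yields a bitext entry, while a four-field ID is treated as monolingual.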

Version 0.3.9 was skipped: the changes were significant enough to warrant a bump from 0.3.x to 0.4.x


25 Nov 03:31

thammegowda

v0.3.8

d04438d

0.3.8 - log level, progress bar, refresh OPUS and ELRC; stats

CLI arg --log-level with default set to WARNING
The progress bar can be disabled from the CLI with --no-pbar; it is enabled by default (--pbar)
mtdata stats --quick does HTTP HEAD and shows content length; e.g. mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu
python -m mtdata.scripts.recipe_stats to read stats from output directory
Security fix with tar extract | Thanks @TrellixVulnTeam
Added NLLB datasets prepared by AllenAI | Thanks @AlexUmnov
Opus and ELRC datasets update | Thanks @ZenBel
The default for fail_on_error is now true: mtdata get returns a non-zero exit code on error. Pass the --no-fail flag to mtdata get to ignore errors
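
The idea behind `mtdata stats --quick` (an HTTP HEAD request that reports the content length without downloading the body) can be sketched with the standard library. This is not mtdata's implementation: `build_head_request` and `remote_size` are hypothetical names, and `urllib` is used here only to keep the example self-contained.

```python
import urllib.request
from typing import Optional


def build_head_request(url: str) -> urllib.request.Request:
    """Create a request that asks for headers only (no response body)."""
    return urllib.request.Request(url, method="HEAD")


def remote_size(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the Content-Length reported by the server, or None if absent."""
    req = build_head_request(url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        length = resp.headers.get("Content-Length")
    return int(length) if length is not None else None
```

Servers are not required to send `Content-Length`, so a caller should be prepared for `None`.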

Contributors

ZenBel, AlexUmnov, and TrellixVulnTeam


11 Jul 20:43

thammegowda

v0.3.7

b1c0b21

0.3.7

Update ELRC data, including EU acts, which are used for WMT22 (thanks @kpu)

Contributors

kpu


08 Jul 22:37

thammegowda

v0.3.6

f26eda9

v0.3.6 : fixes and additions for wmt22

Fixed KECL-JParaCrawl
added Paracrawl bonus for ukr-eng
added Yandex rus-eng corpus
added Yakut sah-eng
update recipe for wmt22 constrained eval


11 Mar 03:20

thammegowda

v0.3.5

5a9c034

disable JW300; add WMT22 recipes; auto generate references.bib

Parallel download support via the -j/--n-jobs argument (default: 4)
Automatically create references.bib file based on datasets selected
Add histogram to web search interface (Thanks, @sgowdaks)
ELRC index updates; (Thanks @kpu)
Update OPUS index. Use OPUS API to download all datasets
- A lot of new datasets added.
- WARNING: Some OPUS IDs are not backward compatible (version number mismatch)
Fix: JESC dataset language IDs were wrong
New datasets:
- jpn-eng: add paracrawl v3, and wmt19 TED
- backtranslation datasets for en2ru ru2en en2ru
Option to set MTDATA_RECIPES dir (default is $PWD). All files matching the glob ${MTDATA_RECIPES}/mtdata.recipes*.yml are loaded
WMT22 recipes added
JW300 is disabled #77
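
The recipe lookup rule in this release (use `MTDATA_RECIPES` if set, otherwise the current directory, and load every file matching `mtdata.recipes*.yml`) can be sketched as below. The function name `find_recipe_files` is hypothetical, and the sorted return order is an assumption of this sketch rather than documented behaviour.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def find_recipe_files(env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """List recipe files under $MTDATA_RECIPES, defaulting to the current dir."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the working directory ($PWD).
    recipes_dir = Path(env.get("MTDATA_RECIPES") or Path.cwd())
    # Sorting is a choice made here for reproducible output.
    return sorted(recipes_dir.glob("mtdata.recipes*.yml"))
```

Passing a mapping instead of reading `os.environ` directly makes it easy to check which files a given setting would pick up.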

Contributors

kpu and sgowdaks


28 Jan 06:58

thammegowda

v0.3.3

9990d94

v0.3.3

Bug fix: reading XML inside tar files; ElementTree complained about TarPath
mtdata list has -g/--groups and -ng/--not-groups as include/exclude filters on group name | closes #91
mtdata list has -id/--id flag to print only dataset IDs | closes #91
add WMT21 tests | closes #90
add ccaligned datasets wmt21 | closes #89
add ParIce datasets | closes #88
add wmt21 en-ha | closes #87
add wmt21 wikititles v3 | closes #86
Add train and test sets from StanfordNLP NMT page (large: en-cs, medium: en-de, small: en-vi) | closes #84
- Add support for two URLs for a single dataset (i.e. without zip/tar files)
Fixed a language match bug #92 / #93
Fix: language compatibility checks; Closes #94
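
The include/exclude semantics of `-g/--groups` and `-ng/--not-groups` can be pictured as a filter over the group field of each dataset ID. The sketch below is an assumption-laden illustration, not mtdata's code: `filter_by_group` is a hypothetical name, the group is taken to be the text before the first hyphen, and matching is made case-insensitive as a design choice of this example.

```python
from typing import Iterable, List, Optional


def filter_by_group(
    dataset_ids: Iterable[str],
    groups: Optional[Iterable[str]] = None,
    not_groups: Optional[Iterable[str]] = None,
) -> List[str]:
    """Keep IDs whose group is included (if given) and not excluded."""
    include = {g.lower() for g in groups} if groups else None
    exclude = {g.lower() for g in not_groups} if not_groups else set()
    kept = []
    for did in dataset_ids:
        # Assumption: the group is the first hyphen-separated field.
        group = did.split("-", 1)[0].lower()
        if include is not None and group not in include:
            continue
        if group in exclude:
            continue
        kept.append(did)
    return kept
```

With no filters given, every ID passes through unchanged.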


06 Dec 17:41

thammegowda

v0.3.2

ca6615a

v0.3.2 - 20211205

Fix: recipes.yml was missing from the pip-installed package
Add Project Anuvaad: 196 datasets belonging to Indian languages
mtdata get now has --fail / --no-fail arguments that control whether to crash or continue upon errors


29 Oct 01:58

thammegowda

v0.3.1

4380a10

faster tar reading; recipes, stats; multilingual source or target support

mtdata [list|get]-recipe :: Add support for recipes; list-recipe and get-recipe subcommands added
mtdata stats :: Add support for viewing dataset stats: words, chars, segs
FIX url for UN dev and test sets (source was updated so we updated too)
Multilingual experiment support; ISO 639-3 code mul implies multilingual; e.g. mul-eng or eng-mul
--dev accepts multiple datasets and merges them (useful for multilingual experiments)
tar files are extracted before read (performance improvements)
setup.py: version and descriptions accessed via regex


21 Oct 22:39

thammegowda

v0.3.0

9546f20

v0.3.0 - BCP47, new dataset-id, dataset compression; JW300 v1c

Big Changes: BCP-47, data compression

BCP47: (Language, Script, Region)
- Our implementation is not strictly BCP-47. We differ on the following
  - We use ISO 639-3 codes (i.e. three letters) for all languages, whereas BCP-47 uses two letters for some (e.g. en) and three letters for many.
  - We use _ (underscore) to join language, script, region whereas BCP-47 uses - (hyphen)
Dataset IDs (aka did in short) are standardized <group>-<name>-<version>-<lang1>-<lang2>
- <group> can have mixed case, <name> has to be lowercase
The CLI now accepts dids.
mtdata get --dev <did> now accepts a single dataset ID; creates dev.{xxx,yyy} links at the root of out dir
mtdata get --test <did1> ... <did3> creates test{1..4}.{xxx,yyy} links at the root of out dir
--compress option to store compressed datasets under output dir
zip and tar files are no longer extracted. we read directly from compressed files without extracting them
._lock files are removed after download job is done
Add JESC, jpn paracrawl, news commentary 15 and 16
Force unicode encoding; make it work on windows (Issue #71)
JW300 -> JW300_v1 (tokenized); Added JW300_v1c (raw) (Issue #70)
Add all Wikititle datasets from Linguatools (Issue #63)
Progress bar: enlighten is used
wget is replaced with requests. User-Agent header along with mtdata version is sent in HTTP request headers
Paracrawl v9 added
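
The two deviations from BCP-47 listed above (three-letter ISO 639-3 language codes everywhere, and underscore instead of hyphen as the joiner) can be illustrated with a small converter. This is a toy sketch, not the `mtdata-bcp47` implementation: `to_underscore_tag` is a hypothetical name, the lookup table covers only a handful of languages, and no script or region normalization is attempted.

```python
# Tiny illustrative subset of the ISO 639-1 -> ISO 639-3 mapping.
ISO_639_1_TO_3 = {"en": "eng", "de": "deu", "fr": "fra", "ja": "jpn", "ru": "rus"}


def to_underscore_tag(bcp47_tag: str) -> str:
    """Rewrite a hyphenated tag such as 'en-US' in the underscore style."""
    lang, *rest = bcp47_tag.split("-")
    lang = lang.lower()
    if len(lang) == 2:
        # Raises KeyError for codes outside the toy table above.
        lang = ISO_639_1_TO_3[lang]
    # Script and region subtags are passed through untouched.
    return "_".join([lang, *rest])
```

Tags that already use a three-letter language code, such as `sah`, pass through with only the joiner changed.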
