Release v0.3.0 - BCP47, new dataset-id, dataset compression; JW300 v1c · thammegowda/mtdata

Big Changes: BCP-47, data compression

BCP47: (Language, Script, Region)
- Our implementation is not strictly BCP-47. We differ in the following ways:
  - We use ISO 639-3 codes (i.e., three letters) for all languages, whereas BCP-47 uses two letters for some (e.g., en) and three letters for many others.
  - We use _ (underscore) to join language, script, and region, whereas BCP-47 uses - (hyphen).
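
The two differences listed above can be illustrated with a minimal sketch. This is not mtdata's own code: the function name `to_underscore_tag` and the tiny `ISO_639_1_TO_3` table are hypothetical, and the sketch performs no normalization beyond widening the language code and swapping the separator.

```python
# Hypothetical, deliberately tiny ISO 639-1 -> ISO 639-3 table for illustration.
# A real implementation would need the full ISO 639 registry.
ISO_639_1_TO_3 = {"en": "eng", "de": "deu", "fr": "fra", "pt": "por", "zh": "zho"}


def to_underscore_tag(bcp47_tag: str) -> str:
    """Rewrite a hyphen-joined tag such as 'en-US' as 'eng_US' (illustrative only)."""
    lang, *rest = bcp47_tag.split("-")
    lang = lang.lower()
    if len(lang) == 2:
        # Two-letter codes are widened to their three-letter equivalent.
        # Raises KeyError for codes missing from the toy table above.
        lang = ISO_639_1_TO_3[lang]
    # Script and region subtags are kept as-is and joined with underscores.
    return "_".join([lang, *rest])


print(to_underscore_tag("en-US"))       # eng_US
print(to_underscore_tag("zh-Hans-CN"))  # zho_Hans_CN
print(to_underscore_tag("deu"))         # deu
```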
Dataset IDs (did for short) are standardized as <group>-<name>-<version>-<lang1>-<lang2>
- <group> can have mixed case, <name> has to be lowercase
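
As a rough sketch of how such an ID could be taken apart, the snippet below splits a did into its five fields and checks the lowercase rule for <name>. The names `DatasetId` and `parse_did` are hypothetical, the example ID is made up, and the sketch assumes that no individual field contains a hyphen (language tags are joined with underscores, so they are safe to split around).

```python
from typing import NamedTuple


class DatasetId(NamedTuple):
    group: str
    name: str
    version: str
    lang1: str
    lang2: str


def parse_did(did: str) -> DatasetId:
    """Split '<group>-<name>-<version>-<lang1>-<lang2>' into its five fields."""
    parts = did.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated fields, got {len(parts)}: {did!r}")
    parsed = DatasetId(*parts)
    # <group> may be mixed case, but <name> has to be lowercase.
    if parsed.name != parsed.name.lower():
        raise ValueError(f"dataset name must be lowercase: {parsed.name!r}")
    return parsed


# Made-up ID, used only to show the shape of the result.
print(parse_did("MyGroup-mycorpus-1-eng-deu_DE"))
```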
The CLI now accepts dids.
mtdata get --dev <did> now accepts a single dataset ID; creates dev.{xxx,yyy} links at the root of out dir
mtdata get --test <did1> ... <did3> creates test{1..4}.{xxx,yyy} links at the root of out dir
--compress option to store compressed datasets under output dir
zip and tar files are no longer extracted; we read directly from the compressed files
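
Reading a member straight out of an archive can be done with the Python standard library alone. The sketch below shows the general technique, not mtdata's actual reader: the function names are hypothetical, and it assumes the member is UTF-8 text.

```python
import io
import tarfile
import zipfile
from typing import Iterator


def read_lines_from_zip(archive_path: str, member: str) -> Iterator[str]:
    """Yield text lines of one zip member without extracting it to disk."""
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so lines are decoded as UTF-8 text.
            with io.TextIOWrapper(raw, encoding="utf-8") as text:
                for line in text:
                    yield line.rstrip("\n")


def read_lines_from_tar(archive_path: str, member: str) -> Iterator[str]:
    """Yield text lines of one tar member; compression is auto-detected."""
    with tarfile.open(archive_path) as archive:
        raw = archive.extractfile(member)
        if raw is None:
            # extractfile() returns None for members that are not regular files.
            raise ValueError(f"{member!r} is not a regular file in {archive_path!r}")
        with io.TextIOWrapper(raw, encoding="utf-8") as text:
            for line in text:
                yield line.rstrip("\n")
```

Both functions are generators, so a large corpus file is streamed line by line instead of being loaded into memory.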
._lock files are removed after the download job is done
Add JESC, jpn paracrawl, news commentary 15 and 16
Force unicode encoding; make it work on windows (Issue #71)
JW300 -> JW300_v1 (tokenized); Added JW300_v1c (raw) (Issue #70)
Add all Wikititle datasets from lingual tool (Issue #63)
Progress bar: enlighten is used
wget is replaced with requests. A User-Agent header that includes the mtdata version is sent with each HTTP request
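
A download with a custom User-Agent might look like the sketch below. The helper names, the `mtdata/<version>` header format, and the hard-coded version string are assumptions made for illustration; the actual header value sent by mtdata may differ.

```python
MTDATA_VERSION = "0.3.0"  # assumption: taken from the release title, not read from the package


def build_headers(version: str = MTDATA_VERSION) -> dict:
    """Build request headers; the 'mtdata/<version>' format is an assumption."""
    return {"User-Agent": f"mtdata/{version}"}


def download(url: str, dest: str, chunk_size: int = 1 << 16) -> None:
    """Stream a URL to a local file, sending the custom User-Agent header."""
    # Imported here so build_headers() stays usable without requests installed.
    import requests

    with requests.get(url, headers=build_headers(), stream=True, timeout=30) as resp:
        resp.raise_for_status()  # fail early on HTTP 4xx/5xx responses
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                out.write(chunk)
```

Streaming in chunks keeps memory use flat for large corpora, which is one practical reason to prefer this over reading the whole response body at once.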
Paracrawl v9 added
