v0.3.0 - BCP47, new dataset-id, dataset compression; JW300 v1c

@thammegowda thammegowda released this 21 Oct 22:39
· 200 commits to master since this release

Big Changes: BCP-47, data compression

  • BCP47: (Language, Script, Region)

    • Our implementation is not strictly BCP-47. We differ in the following ways:
      • We use ISO 639-3 codes (i.e. three letters) for all languages, whereas BCP-47 uses two letters for some (e.g. en) and three letters for many others.
      • We use _ (underscore) to join language, script, and region, whereas BCP-47 uses - (hyphen).
  • Dataset IDs (did for short) are standardized as <group>-<name>-<version>-<lang1>-<lang2>

    • <group> can have mixed case, <name> has to be lowercase
  • The CLI now accepts dids.

  • mtdata get --dev <did> now accepts a single dataset ID and creates dev.{xxx,yyy} links at the root of the output directory

  • mtdata get --test <did1> ... <did3> creates test{1..4}.{xxx,yyy} links at the root of the output directory

  • New --compress option to store compressed datasets under the output directory

  • zip and tar files are no longer extracted; we read directly from the compressed files

  • ._lock files are removed after the download job is done

  • Add JESC, jpn paracrawl, news commentary 15 and 16

  • Force Unicode encoding; make it work on Windows (Issue #71)

  • JW300 -> JW300_v1 (tokenized); Added JW300_v1c (raw) (Issue #70)

  • Add all Wikititle datasets from lingual tool (Issue #63)

  • Progress bar: enlighten is now used

  • wget is replaced with requests. A User-Agent header that includes the mtdata version is sent with each HTTP request

  • Paracrawl v9 added
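
To make the naming conventions above concrete, here is a minimal Python sketch of how a did of the form <group>-<name>-<version>-<lang1>-<lang2> and an underscore-joined language tag could be parsed. This is not mtdata's actual code: parse_tag, parse_did, LangTag, and DatasetID are hypothetical names, and the sketch assumes a 4-letter script, a 2-letter region, and that no did field contains a hyphen. The IDs used in the test are illustrative only.

```python
import re
from typing import NamedTuple, Optional

# Assumed tag shape: 3-letter ISO 639-3 language, optional 4-letter script,
# optional 2-letter region, joined by underscores (e.g. "srp_Cyrl_RS").
TAG_RE = re.compile(r"^([a-z]{3})(?:_([A-Z][a-z]{3}))?(?:_([A-Z]{2}))?$")


class LangTag(NamedTuple):
    lang: str
    script: Optional[str]
    region: Optional[str]


class DatasetID(NamedTuple):
    group: str
    name: str
    version: str
    lang1: LangTag
    lang2: LangTag


def parse_tag(tag: str) -> LangTag:
    """Split an underscore-joined tag into (language, script, region)."""
    m = TAG_RE.match(tag)
    if not m:
        raise ValueError(f"not a language_script_region tag: {tag!r}")
    return LangTag(*m.groups())


def parse_did(did: str) -> DatasetID:
    """Split a did on hyphens; tags use underscores, so hyphens stay unambiguous."""
    parts = did.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated fields, got {len(parts)}: {did!r}")
    group, name, version, lang1, lang2 = parts
    if name != name.lower():
        # <group> may be mixed case, but <name> has to be lowercase
        raise ValueError(f"<name> must be lowercase: {name!r}")
    return DatasetID(group, name, version, parse_tag(lang1), parse_tag(lang2))
```

Because tags are joined with underscores, a plain split on hyphens is enough to separate the five did fields. A standard BCP-47 tag such as en-US is rejected by this sketch on purpose.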