v0.3.1 #79

Merged
merged 13 commits on Oct 29, 2021
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
*.egg-info
*.pytest_cache

# Byte-compiled / optimized / DLL files
/tmp
__pycache__/
14 changes: 13 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,18 @@
# Change Log

## v0.3.0 - WIP
## v0.3.1 - 20211028

- Add support for recipes; `list-recipe` and `get-recipe` subcommands added
- Add support for viewing stats of a dataset: words, chars, segs
- Fix URL for UN dev and test sets (the source was updated, so we updated too)
- Multilingual experiment support; ISO 639-3 code `mul` implies multilingual; e.g. mul-eng or eng-mul
- `--dev` accepts multiple datasets and merges them (useful for multilingual experiments)
- Tar files are extracted before being read (performance improvement)
- setup.py: version and description accessed via regex

---

## v0.3.0 - 20211021

> Big Changes: BCP-47, data compression

24 changes: 22 additions & 2 deletions README.md
@@ -240,6 +240,23 @@ $ tree data/deu-eng/
└── train.stats.json
```

## Recipes

> Since v0.3.1

A recipe is a set of datasets nominated for train, dev, and test, and is meant to improve the reproducibility of experiments.
Recipes are loaded from:
1. Default: [`mtdata/recipe/recipes.yml`](mtdata/recipe/recipes.yml) from the source code
2. Cache dir: `$MTDATA/mtdata.recipe.yml`, where `$MTDATA` defaults to `~/.mtdata`
3. Current dir: `$PWD/mtdata.recipe.yml`

See [`mtdata/recipe/recipes.yml`](mtdata/recipe/recipes.yml) for format and examples.

```bash
mtdata list-recipe # see all recipes
mtdata get-recipe -ri <recipe_id> -o <out_dir> # get recipe, recreate dataset
```
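
For instance, a minimal end-to-end run might look like the sketch below; `example_recipe` is a placeholder ID, so substitute an ID printed by `mtdata list-recipe`.

```bash
# A sketch of the recipe workflow; "example_recipe" is a hypothetical recipe ID.
# Recipes are picked up from the built-in recipes.yml, plus
# ~/.mtdata/mtdata.recipe.yml and ./mtdata.recipe.yml when those files exist.
mtdata list-recipe
mtdata get-recipe -ri example_recipe -o data/example_recipe
```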

## Language Name Standardization
### ISO 639 3
Internally, all language codes are mapped to ISO 639-3 codes.
@@ -287,6 +304,9 @@ print(iso3_code('eNgLIsH', fail_error=True))  # case doesn't matter
```

### BCP-47

> Since v0.3.0

We used ISO 639-3 from the beginning; however, we soon faced the limitation that ISO 639-3 cannot distinguish script and region variants of a language. So we upgraded to BCP-47-like language tags in `v0.3.0`.

* BCP-47 uses two-letter codes for some languages and three-letter codes for the rest; we use three-letter codes for all languages.
@@ -305,9 +325,9 @@ Our tags are of form `xxx_Yyyy_ZZ` where
Notes:
* Region is preserved when available and left blank when unavailable
* Script `Yyyy` is forcibly suppressed in obvious cases. E.g. `eng` is written using the `Latn` script; writing `eng-Latn` is awkward to read, and since `Latn` is the default, we suppress the `Latn` script for English. On the other hand, a language like Kannada is usually written using the `Knda` script (`kan-Knda` -> `kan`), but it is occasionally written using the `Latn` script, so `kan-Latn` is not suppressed.
* The information about whats default script is obtained from IANA language code registry
* The information about the default script is obtained from the IANA language code registry
* Language code `mul` stands for _multiple languages_, and is used as a placeholder for multilingual datasets (see `mul-eng` representing many-to-English dataset recipes in [mtdata/recipe/recipes.yml](mtdata/recipe/recipes.yml))


#### Example:
To inspect parsing/mapping, use `python -m mtdata.iso.bcp47 <args>`
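
For instance, a quick check might look like the following sketch; the tags passed here are arbitrary illustrations, and the exact output format depends on the installed version.

```bash
# Inspect how a few language tags are parsed/normalized; the tags are arbitrary examples.
python -m mtdata.iso.bcp47 eng en-GB kan-Latn
```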

15 changes: 6 additions & 9 deletions mtdata/__init__.py
@@ -3,27 +3,24 @@
# Author: Thamme Gowda [tg (at) isi (dot) edu]
# Created: 4/4/20

__version__ = '0.3.0'
__version__ = '0.3.1'
__description__ = 'mtdata is a tool to download datasets for machine translation'
__author__ = 'Thamme Gowda'

import logging as log
from pathlib import Path
import os
import enlighten
from ruamel.yaml import YAML


yaml = YAML()
debug_mode = False
_log_format = '%(asctime)s %(module)s.%(funcName)s:%(lineno)s %(levelname)s:: %(message)s'
log.basicConfig(level=log.INFO, datefmt='%Y-%m-%d %H:%M:%S', format=_log_format)
cache_dir = Path(os.environ.get('MTDATA', '~/.mtdata')).expanduser()
cached_index_file = cache_dir / f'mtdata.index.{__version__}.pkl'

try:
    import enlighten
    pbar_man = enlighten.get_manager()
except ImportError as e:
    log.warning("enlighten package maybe required. please run 'pip install englighten'")
    log.warning(e)
FILE_LOCK_TIMEOUT = 2 * 60 * 60 # 2 hours
pbar_man = enlighten.get_manager()


class MTDataException(Exception):
9 changes: 6 additions & 3 deletions mtdata/__main__.py
@@ -2,12 +2,15 @@
#
# Author: Thamme Gowda [tg (at) isi (dot) edu]
# Created: 4/4/20
import errno
from mtdata import main, log

if __name__ == '__main__':
def main():
    from mtdata import main
    try:
        main.main()
    except BrokenPipeError as e:
        # this happens when piped to '| head' which aborts pipe after limit. And that is okay.
        pass


if __name__ == '__main__':
    main()
60 changes: 49 additions & 11 deletions mtdata/cache.py
@@ -8,8 +8,9 @@
from dataclasses import dataclass
from pathlib import Path
from mtdata.index import Entry
from mtdata import log, __version__, pbar_man, MTDataException
from mtdata import log, __version__, pbar_man, MTDataException, FILE_LOCK_TIMEOUT
from mtdata.utils import ZipPath, TarPath
from mtdata.parser import Parser
from typing import List, Union

import portalocker
@@ -19,7 +20,6 @@
import requests
import math

MAX_TIMEOUT = 2 * 60 * 60 # 2 hours

headers = {'User-Agent': f'mtdata downloader {__version__}; cURL and wget like.'}

@@ -43,6 +43,46 @@ def get_entry(self, entry: Entry, fix_missing=True) -> Union[Path, List[Path]]:
        local = self.get_local_in_paths(path=local, entry=entry)
        return local

    def get_stats(self, entry: Entry):
        path = self.get_entry(entry)
        parser = Parser(path, ext=entry.in_ext or None, ent=entry)
        count, skips, noise = 0, 0, 0
        toks = [0, 0]
        chars = [0, 0]
        for rec in parser.read_segs():
            if len(rec) < 2 or not rec[0] or not rec[1]:
                skips += 1
                continue
            if entry.is_noisy(seg1=rec[0], seg2=rec[1]):
                noise += 1
                skips += 1
                continue
            count += 1
            s1, s2 = rec[:2]  # get the first two fields
            chars[0] += len(s1)
            chars[1] += len(s2)
            s1_tok, s2_tok = s1.split(), s2.split()
            toks[0] += len(s1_tok)
            toks[1] += len(s2_tok)

        l1, l2 = entry.did.langs
        l1, l2 = l1.lang, l2.lang
        assert count > 0, f'No valid records are found for {entry.did}'
        if l2 < l1:
            l1, l2 = l2, l1
            toks = toks[1], toks[0]
            chars = chars[1], chars[0]
        return {
            'id': str(entry.did),
            'segs': count,
            'segs_err': skips,
            'segs_noise': noise,
            f'{l1}_toks': toks[0],
            f'{l2}_toks': toks[1],
            f'{l1}_chars': chars[0],
            f'{l2}_chars': chars[1]
        }

    def get_flag_file(self, file: Path):
        return file.with_name(file.name + '._valid')

@@ -74,39 +114,37 @@ def opus_xces_format(self, entry, fix_missing=True) -> List[Path]:
        l2_path = self.get_local_path(l2_url, fix_missing=fix_missing)
        return [align_file, l1_path, l2_path]

    def get_local_in_paths(self, path:Path, entry: Entry,):
    def get_local_in_paths(self, path: Path, entry: Entry,):
        in_paths = entry.in_paths
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as root:
                in_paths = self.match_globs(names=root.namelist(), globs=in_paths)
            return [ZipPath(path, p) for p in in_paths]  # stdlib is buggy, so I made a workaround
        elif tarfile.is_tarfile(path):
            with tarfile.open(path, encoding='utf-8') as root:
                in_paths = self.match_globs(names=root.getnames(), globs=in_paths)
            return [TarPath(path, p) for p in in_paths]
        else:
            raise Exception(f'Unable to read {entry.did}; the file is neither zip nor tar')

    def download(self, url: str, save_at: Path):
    def download(self, url: str, save_at: Path, timeout=(5, 10)):
        valid_flag = self.get_flag_file(save_at)
        lock_file = valid_flag.with_suffix("._lock")
        if valid_flag.exists() and save_at.exists():
            return save_at
        save_at.parent.mkdir(parents=True, exist_ok=True)

        log.info(f"Acquiring lock on {lock_file}")
        with portalocker.Lock(lock_file, 'w', timeout=MAX_TIMEOUT) as fh:
        with portalocker.Lock(lock_file, 'w', timeout=FILE_LOCK_TIMEOUT) as fh:
            # check if downloaded by other parallel process
            if valid_flag.exists() and save_at.exists():
                return save_at
            log.info(f"Downloading {url} --> {save_at}")
            resp = requests.get(url=url, allow_redirects=True, headers=headers, stream=True)
            log.info(f"GET {url} {save_at}")
            resp = requests.get(url=url, allow_redirects=True, headers=headers, stream=True, timeout=timeout)
            assert resp.status_code == 200, resp.status_code
            buf_size = 2 ** 10
            n_buffers = math.ceil(int(resp.headers.get('Content-Length', '0')) / buf_size) or None
            desc = url
            if len(desc) > 40:
                desc = desc[:30] + '...' + desc[-10:]
            if len(desc) > 60:
                desc = desc[:30] + '...' + desc[-28:]
            with pbar_man.counter(color='green', total=n_buffers, unit='KiB', leave=False,
                                  desc=f"{desc}") as pbar, open(save_at, 'wb', buffering=2**24) as out:
                for chunk in resp.iter_content(chunk_size=buf_size):