Implement merge-devset, train-backwards, and evaluate-backwards pipeline steps (#125)

* Add support for `merge-devset` pipeline step in Taskcluster

  I ended up reworking (and renaming) the `merge-corpus` kind for this. It now handles both `merge-corpus` and `merge-devset`, and should be able to handle `merge-mono` fairly easily in the future as well. The `find_upstreams` transform handles most of the difference between `merge-corpus` and `merge-devset`; each needs only a little custom `command-context` (to make sure the input file prefix argument is correct).

* Remove outdated comment from train vocab kind

* Implement `train-backwards` pipeline step in Taskcluster

  This is another fairly simple and straightforward pipeline step to add. Because it's a single task per locale pair (instead of per dataset + locale pair), we can trivially add our dependencies and fetches in the kind.

  In addition to the extra `experiment` parameter being introduced (`best-model`), there is also an entirely new `marian-args` section. This section will eventually have many subsections, each used for a different pipeline step. Ultimately, the keys and values of each subsection need to be translated into command line arguments that are passed to a script (and then on to marian). This is accomplished with a new transform that pulls a dictionary from the training config and translates its keys and values into a single `marian_arg` string that is available in `command_context`.

* Support zstd in train.sh pipeline script

  Unlike previous pipeline scripts we've worked with, this one can't (AFAICT) take its input through a pipe, and marian doesn't support decoding zstd files. Instead, we simply decompress all of the input files and let marian have them in plaintext format. (This is skipped for gz, to avoid modifying the existing pipeline.)

* Fix caching of tasks with a `/` in them
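The `marian-args` translation described above can be sketched roughly as follows. This is a minimal illustration, not the actual transform: the function name and the exact flag formatting (e.g. how empty values are handled) are assumptions.

```python
def dict_to_marian_args(section):
    """Flatten a marian-args subsection from the training config into a single
    command line string, e.g. {"beam-size": "12"} -> "--beam-size 12".

    Hypothetical sketch: keys become --flags, non-empty values follow them."""
    args = []
    for key, value in section.items():
        args.append(f"--{key}")
        # Boolean-style flags may have no value; only emit one when present.
        if value not in (None, ""):
            args.append(str(value))
    return " ".join(args)
```

The resulting string would then be placed into `command_context` under a key like `marian_arg`, so the task's command template can interpolate it.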
* Add support for filtering datasets by category in from_datasets kind

  Beginning with the `evaluate` steps, we need to be able to generate tasks that consume a subset of datasets matching a given category. We don't categorize datasets in `ci/config.yml` (that's just a record of all supported datasets), so this information must be pulled from the parameters instead (as we do in other transforms for other runtime parameters). This change also moves the `substitute` helper function elsewhere, to make it more reusable.

* Add new transform that allows substituting parameters into arbitrary parts of the task definition

  This is pretty similar to `command-context-from-parameters`, except that it allows substitution into any place in the task definition. Arguably, it could replace `command-context-from-parameters` entirely, which I may look at doing in the future.

* Fix bug in target tasks when multiple datasets exist for a category

* Add support for zstd to eval.sh pipeline script

* Fix eval.sh pipeline script to properly create the result directory if it does not exist

  Right now, it creates a directory named after `res_prefix` in the current directory. I'm pretty sure that's wrong - we want to create its `dirname` instead (a similar thing is already done in a bunch of other pipeline scripts...).

* Add requirements file for eval step, which depends on `sacrebleu`

* Add support for `evaluate-backwards` in Taskcluster

  Besides the usual bits, I tried to write this in a fairly forward-compatible way, as we'll be adding `evaluate` steps for other parts of the pipeline in the near future. To that end, I kept all the `backwards`-specific parts in a specific task, and tried to write everything in the `task-defaults` section in a general way. I'm sure some tweaks will be needed later, but this is a head start for next time.
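The parameter-substitution transform described above might look something like the sketch below. This is an assumed illustration of the idea (recursively walking the task definition and filling `{placeholder}`-style fields), not the actual code; the function name and placeholder syntax are guesses.

```python
def substitute(item, **params):
    """Recursively substitute {param} placeholders anywhere in a task
    definition: dicts and lists are walked, strings are formatted,
    and everything else is returned unchanged."""
    if isinstance(item, dict):
        return {key: substitute(value, **params) for key, value in item.items()}
    if isinstance(item, list):
        return [substitute(entry, **params) for entry in item]
    if isinstance(item, str):
        return item.format(**params)
    return item
```

Making the helper operate on any nested structure is what lets the same function serve both `command-context-from-parameters` and substitution into arbitrary parts of the task definition.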