Implement merge-devset, train-backwards, and evaluate-backwards pipeline steps (#125)

* Add support for `merge-devset` pipeline step in Taskcluster

I ended up reworking (and renaming) the `merge-corpus` kind for this. It now handles `merge-corpus` and `merge-devset`, and should be able to handle `merge-mono` fairly easily in the future as well.

The `find_upstreams` transform handles most of the difference between `merge-corpus` and `merge-devset`, with only a little custom `command-context` needed for both of them (to make sure the input file prefix argument is correct).
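The transform itself isn't shown in this summary, but the idea can be sketched roughly as follows. This is a simplified illustration, not the real `find_upstreams` code; the task and attribute layout here are assumptions:

```python
# Simplified sketch of what an upstream-finding transform does. The real
# `find_upstreams` transform is more involved; key names are illustrative.
def find_upstreams(task, all_tasks, upstream_kind, artifact_names):
    """Wire up dependencies + fetches on `task` for every task of
    `upstream_kind` that produced data for the same dataset."""
    for upstream in all_tasks:
        if upstream["kind"] != upstream_kind:
            continue
        # Only consume upstreams that match this task's dataset.
        if upstream["attributes"].get("dataset") != task["attributes"].get("dataset"):
            continue
        label = upstream["label"]
        task.setdefault("dependencies", {})[label] = label
        task.setdefault("fetches", {})[label] = [
            {"artifact": name, "extract": False} for name in artifact_names
        ]
    return task
```

The per-kind differences (like the input file prefix) then reduce to a little custom `command-context`, as described above.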

* Remove outdated comment from train vocab kind

* Implement `train-backwards` pipeline step in Taskcluster

This is another fairly simple and straightforward pipeline step to add. Because it's a single task per locale pair (instead of per dataset + locale pair) we can trivially add our dependencies and fetches in the kind.

In addition to the extra `experiment` parameter being introduced (`best-model`), we also have an entirely new `marian-args` section. This section will eventually have many subsections, with each subsection used by a different pipeline step. Ultimately, the keys + values of each subsection need to be translated into command line arguments that are passed to a script (and then passed on to marian). This is accomplished with a new transform that pulls a dictionary from the training config and translates its keys + values into a single `marian_arg` string that is made available in `command_context`.
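For illustration, the key/value-to-argument-string translation might look something like this (a minimal sketch; the function name and config layout are assumptions, not the actual transform code):

```python
def marian_args_from_config(section):
    """Flatten a {key: value} mapping from the training config into one
    string of command line arguments, later passed through to marian."""
    parts = []
    for key, value in section.items():
        # e.g. {"beam-size": 12} becomes "--beam-size 12"
        parts.append(f"--{key} {value}")
    return " ".join(parts)
```

For example, `marian_args_from_config({"beam-size": 12, "early-stopping": 20})` would yield `--beam-size 12 --early-stopping 20`.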

* Support zstd in train.sh pipeline script

Unlike previous pipeline scripts we've worked with, this one can't (AFAICT) take the input through a pipe, and marian doesn't support decoding zstd files. Instead, we simply decompress all of the input files, and let marian have them in plaintext format instead. (This is skipped for gz, to avoid modifying the existing pipeline.)

* Fix caching of tasks with a `/` in them.

* Add support for filtering datasets by category in from_datasets kind

Beginning with `evaluate` steps we need to be able to generate tasks that consume a subset of datasets matching a given category. We don't categorize datasets in `ci/config.yml` (that's just a record of all supported datasets), so this information must be pulled from the parameters instead (as we do in other transforms for other runtime parameters).

This change also moves the `substitute` helper function elsewhere, to make it more reusable.
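As a rough sketch of the category filtering, something like the following; the parameter layout here is an assumption based on the description above, not the exact schema:

```python
def datasets_for_categories(parameters, include_categories):
    """Select only the datasets whose category (e.g. "test") is wanted.
    Categories live in the runtime parameters, not ci/config.yml."""
    by_category = parameters["training_config"]["datasets"]
    selected = []
    for category, datasets in by_category.items():
        if category in include_categories:
            selected.extend(datasets)
    return selected
```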

* Add new transform that allows substituting parameters into arbitrary parts of the task definition

This is pretty similar to `command-context-from-parameters`, except it allows substitution into any place in the task definition. Arguably, this could replace `command-context-from-parameters` entirely, which I may look at doing in the future.
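The core of such a transform is a recursive walk over the task definition; something like this minimal sketch (simplified; the field selection done via `substitution-fields` is omitted here):

```python
def substitute(item, **params):
    """Recursively apply str.format substitution to every string found in
    a nested task definition fragment (dicts, lists, plain strings)."""
    if isinstance(item, dict):
        return {key: substitute(value, **params) for key, value in item.items()}
    if isinstance(item, list):
        return [substitute(value, **params) for value in item]
    if isinstance(item, str):
        return item.format(**params)
    # Non-string leaves (ints, bools, None) pass through untouched.
    return item
```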

* Fix bug in target tasks when multiple datasets exist for a category

* Add support for zstd to eval.sh pipeline script

* Fix eval.sh pipeline script to properly create the result directory if it does not exist.

Right now, this creates a directory named after the `basename` of `res_prefix` in the current directory. I'm pretty sure that's wrong; we want to create its `dirname` instead (a similar thing is already done in a number of other pipeline scripts).
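To illustrate the difference with a made-up prefix (the path below is hypothetical):

```shell
# Hypothetical results prefix, as eval.sh would receive it:
res_prefix="${TMPDIR:-/tmp}/artifacts/flores_devtest"

basename "${res_prefix}"   # prints: flores_devtest (the wrong thing to mkdir)
dirname "${res_prefix}"    # the containing directory, which is what we need

# Create the containing directory so ${res_prefix}.${trg}.ref etc. can be written:
mkdir -p "$(dirname "${res_prefix}")"
```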

* Add requirements file for eval step, which depends on `sacrebleu`

* Add support for `evaluate-backwards` in Taskcluster

Besides the usual bits, I tried to write this in a fairly forward-compatible way, as we'll be adding `evaluate` steps for other parts of the pipeline in the near future. To that end, I kept all of the `backwards`-specific parts in the task definition itself, and tried to keep everything in the `task-defaults` section general. I'm sure some tweaks will be needed later, but this is a head start for next time.
  • Loading branch information
bhearsum authored Jun 2, 2023
1 parent 6ace8b2 commit b014c27
Showing 17 changed files with 770 additions and 124 deletions.
9 changes: 6 additions & 3 deletions pipeline/eval/eval.sh
@@ -16,13 +16,16 @@ marian=$5
 decoder_config=$6
 args=( "${@:7}" )
 
-mkdir -p "$(basename "${res_prefix}")"
+COMPRESSION_CMD="${COMPRESSION_CMD:-pigz}"
+ARTIFACT_EXT="${ARTIFACT_EXT:-gz}"
+
+mkdir -p "$(dirname "${res_prefix}")"
 
 echo "### Evaluating dataset: ${dataset_prefix}, pair: ${src}-${trg}, Results prefix: ${res_prefix}"
 
-pigz -dc "${dataset_prefix}.${trg}.gz" > "${res_prefix}.${trg}.ref"
+${COMPRESSION_CMD} -dc "${dataset_prefix}.${trg}.${ARTIFACT_EXT}" > "${res_prefix}.${trg}.ref"
 
-pigz -dc "${dataset_prefix}.${src}.gz" |
+${COMPRESSION_CMD} -dc "${dataset_prefix}.${src}.${ARTIFACT_EXT}" |
   tee "${res_prefix}.${src}" |
   "${marian}"/marian-decoder \
     -c "${decoder_config}" \
1 change: 1 addition & 0 deletions pipeline/eval/requirements/eval.in
@@ -0,0 +1 @@
sacrebleu
221 changes: 221 additions & 0 deletions pipeline/eval/requirements/eval.txt

Large diffs are not rendered by default.

18 changes: 16 additions & 2 deletions pipeline/train/train.sh
@@ -19,6 +19,8 @@ vocab=$8
 best_model_metric=$9
 extra_params=( "${@:10}" )
 
+ARTIFACT_EXT="${ARTIFACT_EXT:-gz}"
+
 test -v GPUS
 test -v MARIAN
 test -v WORKSPACE
@@ -32,10 +34,22 @@ echo "### Training ${model_dir}"
 
 # if doesn't fit in RAM, remove --shuffle-in-ram and add --shuffle batches
 
+# Marian doesn't support zst natively; we need to decompress before passing them
+# along.
+if [ "${ARTIFACT_EXT}" = "zst" ]; then
+  zstdmt --rm -d "${train_set_prefix}.${src}.${ARTIFACT_EXT}"
+  zstdmt --rm -d "${train_set_prefix}.${trg}.${ARTIFACT_EXT}"
+  zstdmt --rm -d "${valid_set_prefix}.${src}.${ARTIFACT_EXT}"
+  zstdmt --rm -d "${valid_set_prefix}.${trg}.${ARTIFACT_EXT}"
+  ARTIFACT_EXT=""
+else
+  ARTIFACT_EXT=".gz"
+fi
+
 "${MARIAN}/marian" \
   --model "${model_dir}/model.npz" \
   -c "configs/model/${model_type}.yml" "configs/training/${model_type}.${training_type}.yml" \
-  --train-sets "${train_set_prefix}".{"${src}","${trg}"}.gz \
+  --train-sets "${train_set_prefix}".{"${src}","${trg}"}${ARTIFACT_EXT} \
   -T "${model_dir}/tmp" \
   --shuffle-in-ram \
   --vocabs "${vocab}" "${vocab}" \
@@ -44,7 +58,7 @@ echo "### Training ${model_dir}"
   --sharding local \
   --sync-sgd \
   --valid-metrics "${best_model_metric}" ${all_model_metrics[@]/$best_model_metric} \
-  --valid-sets "${valid_set_prefix}".{"${src}","${trg}"}.gz \
+  --valid-sets "${valid_set_prefix}".{"${src}","${trg}"}${ARTIFACT_EXT} \
   --valid-translation-output "${model_dir}/devset.out" \
   --quiet-translation \
   --overwrite \
133 changes: 133 additions & 0 deletions taskcluster/ci/evaluate/kind.yml
@@ -0,0 +1,133 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
---

loader: taskgraph.loader.transform:loader

transforms:
    - translations_taskgraph.transforms.task_substitution_from_params:transforms
    - translations_taskgraph.transforms.from_datasets:per_dataset
    - taskgraph.transforms.job:transforms
    - translations_taskgraph.transforms.cache:transforms
    - taskgraph.transforms.cached_tasks:transforms
    - taskgraph.transforms.task:transforms

kind-dependencies:
    - dataset
    - train
    - train-vocab
    - fetch
    - toolchain

task-defaults:
    attributes:
        cache:
            resources:
                - pipeline/eval/eval-gpu.sh
                - pipeline/eval/eval.sh
    dataset-config:
        include-categories:
            - test
        substitution-fields:
            - description
            - name
            - dependencies
            - fetches
            - treeherder.symbol
            - worker.env
            - run.command-context.pipeline_args1
    task-substitution:
        from-parameters:
            best_model: training_config.experiment.best-model
            src_locale: training_config.experiment.src_locale
            trg_locale: training_config.experiment.trg_locale
        substitution-fields:
            - run.command-context.pipeline_args2
            - fetches.train-backwards
    worker-type: t-linux-v100-gpu
    worker:
        artifacts:
            - name: public/build
              path: artifacts
              type: directory
        max-run-time: 3600
        env:
            SRC: "{src_locale}"
            TRG: "{trg_locale}"
            GPUS: "0"
            WORKSPACE: "8000"
            COMPRESSION_CMD: zstdmt
            ARTIFACT_EXT: zst

    # Don't run unless explicitly scheduled
    run-on-tasks-for: []

    treeherder:
        platform: evaluate/opt

    run:
        using: run-task
        command:
            - bash
            - -c
            # The two sed commands here are the unfortunate result of us consuming
            # a marian config that was produced by an earlier step. These configs
            # have hardcoded absolute paths to the models they were trained on,
            # and end up invalid when used on a different machine. In theory it is
            # possible to adjust them at generation time to use relative paths,
            # but in practice we have not been able to make this work.
            - >-
                pip install -r $VCS_PATH/pipeline/eval/requirements/eval.txt &&
                export PATH=$PATH:~/.local/bin &&
                export MARIAN=$MOZ_FETCHES_DIR &&
                sed -i -e "s,- .*fetches,- $MOZ_FETCHES_DIR," $TASK_WORKDIR/fetches/*.yml &&
                sed -i -e "s,- .*artifacts,- $MOZ_FETCHES_DIR," $TASK_WORKDIR/fetches/*.yml &&
                {pipeline_script}
                {pipeline_args1}
                {pipeline_args2}
    fetches:
        toolchain:
            - marian

tasks:
    backward-{provider}-{dataset}-{src_locale}-{trg_locale}:
        description: backwards evaluation for {dataset} {src_locale}-{trg_locale}
        attributes:
            stage: evaluate-backwards
            dataset-category: test
            cache:
                type: evaluate-backwards
        treeherder:
            symbol: "bw-{provider}-{dataset_short}-{src_locale}-{trg_locale}"
        run:
            command-context:
                pipeline_script: $VCS_PATH/pipeline/eval/eval-gpu.sh
                pipeline_args1: >-
                    $TASK_WORKDIR/artifacts/{dataset_sanitized}
                    $MOZ_FETCHES_DIR/{dataset_sanitized}
                    {trg_locale}
                    {src_locale}
                pipeline_args2: >-
                    $MOZ_FETCHES_DIR/final.model.npz.best-{best_model}.npz.decoder.yml
                    $MOZ_FETCHES_DIR/final.model.npz.best-{best_model}.npz
        dependencies:
            dataset: dataset-{provider}-{dataset_sanitized}-{src_locale}-{trg_locale}
            train-backwards: train-backwards-{src_locale}-{trg_locale}
            train-vocab: train-vocab-{src_locale}-{trg_locale}
        fetches:
            dataset:
                - artifact: "{dataset_sanitized}.{src_locale}.zst"
                  extract: false
                - artifact: "{dataset_sanitized}.{trg_locale}.zst"
                  extract: false
            train-backwards:
                - artifact: final.model.npz.best-{best_model}.npz
                  extract: false
                - artifact: final.model.npz.best-{best_model}.npz.decoder.yml
                  extract: false
            train-vocab:
                - artifact: vocab.spm
                  extract: false
94 changes: 0 additions & 94 deletions taskcluster/ci/merge-corpus/kind.yml

This file was deleted.
