Implement merge-devset, train-backwards, and evaluate-backwards pipeline steps (#125)

* Add support for `merge-devset` pipeline step in Taskcluster

I ended up reworking (and renaming) the `merge-corpus` kind for this. It now handles `merge-corpus` and `merge-devset`, and should be able to handle `merge-mono` fairly easily in the future as well.

The `find_upstreams` transform handles most of the difference between `merge-corpus` and `merge-devset`, with only a little custom `command-context` needed for both of them (to make sure the input file prefix argument is correct).
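The transform itself isn't shown in this summary, but the idea can be sketched roughly as follows. This is a simplified illustration, not the real `find_upstreams` code; the task and attribute layout here are assumptions:

```python
# Simplified sketch of what an upstream-finding transform does. The real
# `find_upstreams` transform is more involved; key names are illustrative.
def find_upstreams(task, all_tasks, upstream_kind, artifact_names):
    """Wire up dependencies + fetches on `task` for every task of
    `upstream_kind` that produced data for the same dataset."""
    for upstream in all_tasks:
        if upstream["kind"] != upstream_kind:
            continue
        # Only consume upstreams that match this task's dataset.
        if upstream["attributes"].get("dataset") != task["attributes"].get("dataset"):
            continue
        label = upstream["label"]
        task.setdefault("dependencies", {})[label] = label
        task.setdefault("fetches", {})[label] = [
            {"artifact": name, "extract": False} for name in artifact_names
        ]
    return task
```

The per-kind differences (like the input file prefix) then reduce to a little custom `command-context`, as described above.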

* Remove outdated comment from train vocab kind

* Implement `train-backwards` pipeline step in Taskcluster

This is another fairly simple and straightforward pipeline step to add. Because it's a single task per locale pair (instead of per dataset + locale pair) we can trivially add our dependencies and fetches in the kind.

In addition to the extra `experiment` parameter being introduced (`best-model`), we also have an entirely new `marian-args` section. This section will eventually have many subsections, with each subsection used by a different pipeline step. Ultimately, the keys + values of each subsection need to be translated into command line arguments that are passed to a script (and then passed on to marian). This is accomplished with a new transform that pulls a dictionary from the training config and translates its keys + values into a single `marian_arg` string that is made available in `command_context`.
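For illustration, the key/value-to-argument-string translation might look something like this (a minimal sketch; the function name and config layout are assumptions, not the actual transform code):

```python
def marian_args_from_config(section):
    """Flatten a {key: value} mapping from the training config into one
    string of command line arguments, later passed through to marian."""
    parts = []
    for key, value in section.items():
        # e.g. {"beam-size": 12} becomes "--beam-size 12"
        parts.append(f"--{key} {value}")
    return " ".join(parts)
```

For example, `marian_args_from_config({"beam-size": 12, "early-stopping": 20})` would yield `--beam-size 12 --early-stopping 20`.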

* Support zstd in train.sh pipeline script

Unlike previous pipeline scripts we've worked with, this one can't (AFAICT) take the input through a pipe, and marian doesn't support decoding zstd files. Instead, we simply decompress all of the input files, and let marian have them in plaintext format instead. (This is skipped for gz, to avoid modifying the existing pipeline.)

* Fix caching of tasks with a `/` in them.

* Add support for filtering datasets by category in from_datasets kind

Beginning with `evaluate` steps we need to be able to generate tasks that consume a subset of datasets matching a given category. We don't categorize datasets in `ci/config.yml` (that's just a record of all supported datasets), so this information must be pulled from the parameters instead (as we do in other transforms for other runtime parameters).

This change also moves the `substitute` helper function elsewhere, to make it more reusable.
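As a rough sketch of the category filtering, something like the following; the parameter layout here is an assumption based on the description above, not the exact schema:

```python
def datasets_for_categories(parameters, include_categories):
    """Select only the datasets whose category (e.g. "test") is wanted.
    Categories live in the runtime parameters, not ci/config.yml."""
    by_category = parameters["training_config"]["datasets"]
    selected = []
    for category, datasets in by_category.items():
        if category in include_categories:
            selected.extend(datasets)
    return selected
```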

* Add new transform that allows substituting parameters into arbitrary parts of the task definition

This is pretty similar to `command-context-from-parameters`, except it allows substitution into any place in the task definition. Arguably, this could replace `command-context-from-parameters` entirely, which I may look at doing in the future.
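The core of such a transform is a recursive walk over the task definition; something like this minimal sketch (simplified; the field selection done via `substitution-fields` is omitted here):

```python
def substitute(item, **params):
    """Recursively apply str.format substitution to every string found in
    a nested task definition fragment (dicts, lists, plain strings)."""
    if isinstance(item, dict):
        return {key: substitute(value, **params) for key, value in item.items()}
    if isinstance(item, list):
        return [substitute(value, **params) for value in item]
    if isinstance(item, str):
        return item.format(**params)
    # Non-string leaves (ints, bools, None) pass through untouched.
    return item
```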

* Fix bug in target tasks when multiple datasets exist for a category

* Add support for zstd to eval.sh pipeline script

* Fix eval.sh pipeline script to properly create the result directory if it does not exist.

Right now, this creates a directory named after the `basename` of `res_prefix` in the current directory. I'm pretty sure that's wrong; we want to create its `dirname` instead (a similar thing is already done in a number of other pipeline scripts).
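To illustrate the difference with a made-up prefix (the path below is hypothetical):

```shell
# Hypothetical results prefix, as eval.sh would receive it:
res_prefix="${TMPDIR:-/tmp}/artifacts/flores_devtest"

basename "${res_prefix}"   # prints: flores_devtest (the wrong thing to mkdir)
dirname "${res_prefix}"    # the containing directory, which is what we need

# Create the containing directory so ${res_prefix}.${trg}.ref etc. can be written:
mkdir -p "$(dirname "${res_prefix}")"
```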

* Add requirements file for eval step, which depends on `sacrebleu`

* Add support for `evaluate-backwards` in Taskcluster

Besides the usual bits, I tried to write this in a fairly forward-compatible way, as we'll be adding `evaluate` steps for other parts of the pipeline in the near future. To that end, I kept all of the `backwards`-specific parts in the task definition itself, and tried to keep everything in the `task-defaults` section general. I'm sure some tweaks will be needed later, but this is a head start for next time.
  • Loading branch information
bhearsum authored Jun 2, 2023
1 parent 6ace8b2 commit b014c27
Showing 17 changed files with 770 additions and 124 deletions.
9 changes: 6 additions & 3 deletions pipeline/eval/eval.sh
@@ -16,13 +16,16 @@ marian=$5
 decoder_config=$6
 args=( "${@:7}" )
 
-mkdir -p "$(basename "${res_prefix}")"
+COMPRESSION_CMD="${COMPRESSION_CMD:-pigz}"
+ARTIFACT_EXT="${ARTIFACT_EXT:-gz}"
+
+mkdir -p "$(dirname "${res_prefix}")"
 
 echo "### Evaluating dataset: ${dataset_prefix}, pair: ${src}-${trg}, Results prefix: ${res_prefix}"
 
-pigz -dc "${dataset_prefix}.${trg}.gz" > "${res_prefix}.${trg}.ref"
+${COMPRESSION_CMD} -dc "${dataset_prefix}.${trg}.${ARTIFACT_EXT}" > "${res_prefix}.${trg}.ref"
 
-pigz -dc "${dataset_prefix}.${src}.gz" |
+${COMPRESSION_CMD} -dc "${dataset_prefix}.${src}.${ARTIFACT_EXT}" |
   tee "${res_prefix}.${src}" |
   "${marian}"/marian-decoder \
     -c "${decoder_config}" \
1 change: 1 addition & 0 deletions pipeline/eval/requirements/eval.in
@@ -0,0 +1 @@
sacrebleu
221 changes: 221 additions & 0 deletions pipeline/eval/requirements/eval.txt

Large diffs are not rendered by default.

18 changes: 16 additions & 2 deletions pipeline/train/train.sh
@@ -19,6 +19,8 @@ vocab=$8
 best_model_metric=$9
 extra_params=( "${@:10}" )
 
+ARTIFACT_EXT="${ARTIFACT_EXT:-gz}"
+
 test -v GPUS
 test -v MARIAN
 test -v WORKSPACE
@@ -32,10 +34,22 @@ echo "### Training ${model_dir}"
 
 # if doesn't fit in RAM, remove --shuffle-in-ram and add --shuffle batches
 
+# Marian doesn't support zst natively; we need to decompress before passing them
+# along.
+if [ "${ARTIFACT_EXT}" = "zst" ]; then
+  zstdmt --rm -d "${train_set_prefix}.${src}.${ARTIFACT_EXT}"
+  zstdmt --rm -d "${train_set_prefix}.${trg}.${ARTIFACT_EXT}"
+  zstdmt --rm -d "${valid_set_prefix}.${src}.${ARTIFACT_EXT}"
+  zstdmt --rm -d "${valid_set_prefix}.${trg}.${ARTIFACT_EXT}"
+  ARTIFACT_EXT=""
+else
+  ARTIFACT_EXT=".gz"
+fi
+
 "${MARIAN}/marian" \
   --model "${model_dir}/model.npz" \
   -c "configs/model/${model_type}.yml" "configs/training/${model_type}.${training_type}.yml" \
-  --train-sets "${train_set_prefix}".{"${src}","${trg}"}.gz \
+  --train-sets "${train_set_prefix}".{"${src}","${trg}"}${ARTIFACT_EXT} \
   -T "${model_dir}/tmp" \
   --shuffle-in-ram \
   --vocabs "${vocab}" "${vocab}" \
@@ -44,7 +58,7 @@ echo "### Training ${model_dir}"
   --sharding local \
   --sync-sgd \
   --valid-metrics "${best_model_metric}" ${all_model_metrics[@]/$best_model_metric} \
-  --valid-sets "${valid_set_prefix}".{"${src}","${trg}"}.gz \
+  --valid-sets "${valid_set_prefix}".{"${src}","${trg}"}${ARTIFACT_EXT} \
   --valid-translation-output "${model_dir}/devset.out" \
   --quiet-translation \
   --overwrite \
133 changes: 133 additions & 0 deletions taskcluster/ci/evaluate/kind.yml
@@ -0,0 +1,133 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
---

loader: taskgraph.loader.transform:loader

transforms:
    - translations_taskgraph.transforms.task_substitution_from_params:transforms
    - translations_taskgraph.transforms.from_datasets:per_dataset
    - taskgraph.transforms.job:transforms
    - translations_taskgraph.transforms.cache:transforms
    - taskgraph.transforms.cached_tasks:transforms
    - taskgraph.transforms.task:transforms

kind-dependencies:
    - dataset
    - train
    - train-vocab
    - fetch
    - toolchain

task-defaults:
    attributes:
        cache:
            resources:
                - pipeline/eval/eval-gpu.sh
                - pipeline/eval/eval.sh
    dataset-config:
        include-categories:
            - test
        substitution-fields:
            - description
            - name
            - dependencies
            - fetches
            - treeherder.symbol
            - worker.env
            - run.command-context.pipeline_args1
    task-substitution:
        from-parameters:
            best_model: training_config.experiment.best-model
            src_locale: training_config.experiment.src_locale
            trg_locale: training_config.experiment.trg_locale
        substitution-fields:
            - run.command-context.pipeline_args2
            - fetches.train-backwards
    worker-type: t-linux-v100-gpu
    worker:
        artifacts:
            - name: public/build
              path: artifacts
              type: directory
        max-run-time: 3600
        env:
            SRC: "{src_locale}"
            TRG: "{trg_locale}"
            GPUS: "0"
            WORKSPACE: "8000"
            COMPRESSION_CMD: zstdmt
            ARTIFACT_EXT: zst

    # Don't run unless explicitly scheduled
    run-on-tasks-for: []

    treeherder:
        platform: evaluate/opt

    run:
        using: run-task
        command:
            - bash
            - -c
            # The two sed commands here are the unfortunate result of us consuming
            # a marian config that was produced by an earlier step. These configs
            # have hardcoded absolute paths to the models they were trained on,
            # and end up invalid when used on a different machine. In theory it is
            # possible to adjust them at generation time to use relative paths,
            # but in practice we have not been able to make this work.
            - >-
                pip install -r $VCS_PATH/pipeline/eval/requirements/eval.txt &&
                export PATH=$PATH:~/.local/bin &&
                export MARIAN=$MOZ_FETCHES_DIR &&
                sed -i -e "s,- .*fetches,- $MOZ_FETCHES_DIR," $TASK_WORKDIR/fetches/*.yml &&
                sed -i -e "s,- .*artifacts,- $MOZ_FETCHES_DIR," $TASK_WORKDIR/fetches/*.yml &&
                {pipeline_script}
                {pipeline_args1}
                {pipeline_args2}
    fetches:
        toolchain:
            - marian

tasks:
    backward-{provider}-{dataset}-{src_locale}-{trg_locale}:
        description: backwards evaluation for {dataset} {src_locale}-{trg_locale}
        attributes:
            stage: evaluate-backwards
            dataset-category: test
            cache:
                type: evaluate-backwards
        treeherder:
            symbol: "bw-{provider}-{dataset_short}-{src_locale}-{trg_locale}"
        run:
            command-context:
                pipeline_script: $VCS_PATH/pipeline/eval/eval-gpu.sh
                pipeline_args1: >-
                    $TASK_WORKDIR/artifacts/{dataset_sanitized}
                    $MOZ_FETCHES_DIR/{dataset_sanitized}
                    {trg_locale}
                    {src_locale}
                pipeline_args2: >-
                    $MOZ_FETCHES_DIR/final.model.npz.best-{best_model}.npz.decoder.yml
                    $MOZ_FETCHES_DIR/final.model.npz.best-{best_model}.npz
        dependencies:
            dataset: dataset-{provider}-{dataset_sanitized}-{src_locale}-{trg_locale}
            train-backwards: train-backwards-{src_locale}-{trg_locale}
            train-vocab: train-vocab-{src_locale}-{trg_locale}
        fetches:
            dataset:
                - artifact: "{dataset_sanitized}.{src_locale}.zst"
                  extract: false
                - artifact: "{dataset_sanitized}.{trg_locale}.zst"
                  extract: false
            train-backwards:
                - artifact: final.model.npz.best-{best_model}.npz
                  extract: false
                - artifact: final.model.npz.best-{best_model}.npz.decoder.yml
                  extract: false
            train-vocab:
                - artifact: vocab.spm
                  extract: false
94 changes: 0 additions & 94 deletions taskcluster/ci/merge-corpus/kind.yml

This file was deleted.
