Use nextstrain/ingest (git subtree) #162

Closed
30 commits
4c0264e Initial (empty) commit (tsibley, Jul 6, 2023)
71fbe29 README: Set the scene (tsibley, Jul 6, 2023)
92a8868 README: Link to @joverlee521's workflows document (tsibley, Jul 6, 2023)
f27560e Copy s3-object-exists from ncov-ingest (joverlee521, Jul 13, 2023)
a70ac51 Copy trigger from ncov-ingest (joverlee521, Jul 13, 2023)
51970b7 Copy sha256sum from ncov-ingest (joverlee521, Jul 13, 2023)
6d39c87 Copy cloudfront-invalidate from ncov-ingest (joverlee521, Jul 13, 2023)
154d16a Copy merge-user-metadata from monkeypox (joverlee521, Jul 14, 2023)
608a3e0 Copy transform-authors from monkeypox (joverlee521, Jul 14, 2023)
b4034d6 Copy transform-field-names from monkeypox (joverlee521, Jul 14, 2023)
41f137c Copy tranform-genbank-location from monkeypox (joverlee521, Jul 14, 2023)
9c9acd6 Merge pull request #6 from nextstrain/identical-scripts (joverlee521, Jul 17, 2023)
84047cc Add CI workflow (joverlee521, Jul 18, 2023)
a77e727 Merge pull request #7 from nextstrain/add-ci (joverlee521, Jul 18, 2023)
193c311 Copy notify-slack from ncov-ingest (joverlee521, Jul 14, 2023)
ba0769b notify-slack: support threaded messages (joverlee521, Jul 29, 2022)
83e871d Copy notify-on-job-start from monkeypox (joverlee521, Jul 15, 2023)
98b235a notify-on-job-start: Add job_name and repo_name args (joverlee521, Jul 15, 2023)
0047255 notify-on-job-start: Add optional build_dir arg (joverlee521, Jul 18, 2023)
1926008 Copy notify-on-job-fail from ncov-ingest (joverlee521, Jul 17, 2023)
309ebbf notify-on-job-fail: stylistic updates (joverlee521, Jul 18, 2023)
b4b406f notify-on-job-fail: Add job_name and repo_name args (joverlee521, Jul 18, 2023)
b663e17 notify-slack: remove reply_broadcast option for uploads (joverlee521, Jul 18, 2023)
b2a0de7 notify-on-job-start/fail: replace repo_name with github_repo (joverlee521, Jul 26, 2023)
9082700 Merge pull request #8 from nextstrain/notify-slack (joverlee521, Jul 26, 2023)
c640ff5 Add 'ingest/vendored/' from commit '9082700fdbb99007d5e852e657cedf372… (victorlin, Jul 28, 2023)
7d261c7 Describe subtree setup (victorlin, Jul 28, 2023)
e773635 Use centralized scripts that are functionally identical (victorlin, Jul 28, 2023)
30279f9 Use centralized Slack notification scripts (victorlin, Jul 28, 2023)
a8dd9f6 Remove unused bin variable (victorlin, Jul 28, 2023)
2 changes: 1 addition & 1 deletion .github/workflows/fetch-and-ingest.yaml
@@ -73,4 +73,4 @@ jobs:

 - name: notify_pipeline_failed
 if: ${{ failure() }}
-run: ./ingest/bin/notify-on-job-fail
+run: ./ingest/vendored/notify-on-job-fail Ingest nextstrain/monkeypox
2 changes: 1 addition & 1 deletion .github/workflows/rebuild-all.yaml
@@ -9,6 +9,6 @@ jobs:
 steps:
 - uses: actions/checkout@v3
 - name: Repository Dispatch
-run: ./ingest/bin/trigger monkeypox rebuild
+run: ./ingest/vendored/trigger monkeypox rebuild
 env:
 PAT_GITHUB_DISPATCH: ${{ secrets.GH_TOKEN_NEXTSTRAIN_BOT_WORKFLOW_DISPATCH }}
4 changes: 2 additions & 2 deletions bin/notify-on-deploy
@@ -5,11 +5,11 @@ set -euo pipefail
 : "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

 base="$(realpath "$(dirname "$0")/..")"
-ingest_bin="$base/ingest/bin"
+ingest_vendored="$base/ingest/vendored"

 deployment_url="${1:?A deployment url is required as the first argument.}"
 slack_ts_file="${2:?A Slack thread timestamp file is required as the second argument.}"

 echo "Notifying Slack about deployed builds."
-"$ingest_bin"/notify-slack "Deployed this build to $deployment_url" \
+"$ingest_vendored"/notify-slack "Deployed this build to $deployment_url" \
 --thread-ts="$(cat "$slack_ts_file")"
4 changes: 2 additions & 2 deletions bin/notify-on-error
@@ -8,7 +8,7 @@ set -euo pipefail
 : "${GITHUB_RUN_ID:=}"

 base="$(realpath "$(dirname "$0")/..")"
-ingest_bin="$base/ingest/bin"
+ingest_vendored="$base/ingest/vendored"

 slack_ts_file="${1:-}"

@@ -26,6 +26,6 @@ elif [[ -n "${GITHUB_RUN_ID}" ]]; then
 message+="See GitHub Action <https://github.com/nextstrain/monkeypox/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}> for error details."
 fi

-"$ingest_bin"/notify-slack "$message" \
+"$ingest_vendored"/notify-slack "$message" \
 --thread-ts="$thread_ts" \
 --broadcast
4 changes: 2 additions & 2 deletions bin/notify-on-start
@@ -8,7 +8,7 @@ set -euo pipefail
 : "${GITHUB_RUN_ID:=}"

 base="$(realpath "$(dirname "$0")/..")"
-ingest_bin="$base/ingest/bin"
+ingest_vendored="$base/ingest/vendored"

 build_name="${1:?A build name is required as the first argument.}"
 slack_ts_output="${2:?A Slack thread timestamp file is required as the second argument}"
@@ -29,7 +29,7 @@ if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then
 message+=" Follow along in your local \`monkeypox\` repo with: "'```'"nextstrain build --aws-batch --no-download --attach ${AWS_BATCH_JOB_ID} . "'```'
 fi

-"$ingest_bin"/notify-slack "$message" --output="$slack_response"
+"$ingest_vendored"/notify-slack "$message" --output="$slack_response"

 echo "Saving Slack thread timestamp to '$slack_ts_output'."
4 changes: 2 additions & 2 deletions bin/notify-on-success
@@ -5,10 +5,10 @@ set -euo pipefail
 : "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

 base="$(realpath "$(dirname "$0")/..")"
-ingest_bin="$base/ingest/bin"
+ingest_vendored="$base/ingest/vendored"

 slack_ts_file="${1:?A Slack thread timestamp file is required as the first argument.}"

 echo "Notifying Slack about successful build."
-"$ingest_bin"/notify-slack "✅ This pipeline has successfully finished 🎉" \
+"$ingest_vendored"/notify-slack "✅ This pipeline has successfully finished 🎉" \
 --thread-ts="$(cat "$slack_ts_file")"
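
The four `bin/notify-on-*` scripts above share one Slack thread: `notify-on-start` saves a thread timestamp to a file, and the others read it back with `--thread-ts="$(cat "$slack_ts_file")"`. The lines that extract the timestamp are not part of this diff, so the following is only a sketch of one way to do it, assuming the saved response is a Slack `chat.postMessage`-style JSON object with a top-level `"ts"` field. The function name `extract_thread_ts` and the sample response are hypothetical.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: pull the first "ts" value out of a Slack JSON response.
# A regex is enough for a sketch; a real script would more likely use a JSON parser.
extract_thread_ts() {
    local response="$1" re='"ts":[[:space:]]*"([0-9.]+)"'
    [[ "$response" =~ $re ]] || return 1
    printf '%s\n' "${BASH_REMATCH[1]}"
}

# Made-up response body, shaped like a successful message post.
response='{"ok":true,"channel":"C0000000000","ts":"1690000000.000100","message":{"text":"started"}}'

ts_file="$(mktemp)"
extract_thread_ts "$response" > "$ts_file"

# Later scripts would pass the saved value on to notify-slack like this:
echo "--thread-ts=$(cat "$ts_file")"
rm -f "$ts_file"
```

Keeping the timestamp in a file rather than a variable is what lets separate script invocations (start, success, error, deploy) reply in the same thread.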
13 changes: 13 additions & 0 deletions ingest/README.md
@@ -85,3 +85,16 @@ These are optional environment variables used in our automated pipeline for prov

GenBank sequences and metadata are fetched via NCBI Virus.
The exact URL used to fetch data is constructed in `bin/genbank-url`.

## `ingest/vendored`

This repository uses `git subtree` to manage copies of ingest scripts in `ingest/vendored`, from [nextstrain/ingest](https://github.com/nextstrain/ingest). To pull new changes from the central ingest repository, run:

```sh
git subtree pull --prefix ingest/vendored https://github.com/nextstrain/ingest HEAD
```

Changes should not be pushed using `git subtree push`.

1. For pathogen-specific changes, make them in this repository via a pull request.
2. For pathogen-agnostic changes, make them on [nextstrain/ingest](https://github.com/nextstrain/ingest) via pull request there, then use `git subtree pull` to add those changes to this repository.
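
The README section above documents only the pull. Here is a minimal sketch of the full add-then-pull cycle, assuming the vendored copy was created with a plain `git subtree add`. The message of commit c640ff5 matches the default message that `git subtree add` writes, but the exact command is not shown in this PR. The function names and the `VENDORED_*` variables are hypothetical.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical defaults; override them to point at a fork or a local clone.
VENDORED_PREFIX="${VENDORED_PREFIX:-ingest/vendored}"
VENDORED_REMOTE="${VENDORED_REMOTE:-https://github.com/nextstrain/ingest}"

# One-time setup: import the central repo's HEAD under the prefix.
vendored_add() {
    git subtree add --prefix "$VENDORED_PREFIX" "$VENDORED_REMOTE" HEAD
}

# Routine update: merge new upstream commits into the prefix.
vendored_pull() {
    git subtree pull --prefix "$VENDORED_PREFIX" "$VENDORED_REMOTE" HEAD
}
```

Neither function is called here; run `vendored_add` once in a repository that already has at least one commit, then `vendored_pull` whenever the central repository changes.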
4 changes: 2 additions & 2 deletions ingest/bin/download-from-s3
@@ -2,7 +2,7 @@
 # Originally copied from nextstrain/ncov-ingest repo
 set -euo pipefail

-bin="$(dirname "$0")"
+vendored="$(dirname "$0")"/../vendored

 main() {
 local src="${1:?A source s3:// URL is required as the first argument.}"
@@ -13,7 +13,7 @@ main() {
 local key="${s3path#*/}"

 local src_hash dst_hash no_hash=0000000000000000000000000000000000000000000000000000000000000000
-dst_hash="$("$bin/sha256sum" < "$dst" || true)"
+dst_hash="$("$vendored/sha256sum" < "$dst" || true)"
 src_hash="$(aws s3api head-object --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"

 echo "[ INFO] Downloading $src → $dst"
7 changes: 4 additions & 3 deletions ingest/bin/notify-on-diff
@@ -6,6 +6,7 @@ set -euo pipefail
 : "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

 bin="$(dirname "$0")"
+vendored="$(dirname "$0")"/../vendored

 src="${1:?A source file is required as the first argument.}"
 dst="${2:?A destination s3:// URL is required as the second argument.}"
@@ -16,7 +17,7 @@ diff="$(mktemp -t diff-XXXXXX)"
 trap "rm -f '$dst_local' '$diff'" EXIT

 # if the file is not already present, just exit
-"$bin"/s3-object-exists "$dst" || exit 0
+"$vendored"/s3-object-exists "$dst" || exit 0

 "$bin"/download-from-s3 "$dst" "$dst_local"

@@ -26,10 +27,10 @@ diff "$dst_local" "$src" > "$diff" || diff_exit_code=$?

 if [[ "$diff_exit_code" -eq 1 ]]; then
 echo "Notifying Slack about diff."
-"$bin"/notify-slack --upload "$src.diff" < "$diff"
+"$vendored"/notify-slack --upload "$src.diff" < "$diff"
 elif [[ "$diff_exit_code" -gt 1 ]]; then
 echo "Notifying Slack about diff failure"
-"$bin"/notify-slack "Diff failed for $src"
+"$vendored"/notify-slack "Diff failed for $src"
 else
 echo "No change in $src."
 fi
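
The last hunk of `notify-on-diff` branches on `diff`'s exit status: 0 for identical files, 1 for differences, greater than 1 for an error. Because the script runs under `set -e`, the status has to be captured with `|| diff_exit_code=$?` or a status of 1 would abort the script. A small sketch of that pattern, assuming `diff_exit_code` starts at 0 (the initialization is outside the lines shown in this diff). The function name `classify_diff` and the temporary files are hypothetical.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: report how two files compare without aborting under `set -e`.
classify_diff() {
    local old="$1" new="$2" diff_exit_code=0
    # `|| ...` keeps a non-zero status from terminating the script.
    diff "$old" "$new" > /dev/null || diff_exit_code=$?

    if [[ "$diff_exit_code" -eq 1 ]]; then
        echo "changed"
    elif [[ "$diff_exit_code" -gt 1 ]]; then
        echo "error"
    else
        echo "identical"
    fi
}

tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
printf 'a\n' > "$tmp/old"
printf 'b\n' > "$tmp/new"

classify_diff "$tmp/old" "$tmp/old"
classify_diff "$tmp/old" "$tmp/new"
classify_diff "$tmp/old" "$tmp/missing" 2>/dev/null
```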
21 changes: 0 additions & 21 deletions ingest/bin/notify-on-job-fail

This file was deleted.

5 changes: 3 additions & 2 deletions ingest/bin/notify-on-record-change
@@ -6,13 +6,14 @@ set -euo pipefail
 : "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

 bin="$(dirname "$0")"
+vendored="$(dirname "$0")"/../vendored

 src="${1:?A source ndjson file is required as the first argument.}"
 dst="${2:?A destination ndjson s3:// URL is required as the second argument.}"
 source_name=${3:?A record source name is required as the third argument.}

 # if the file is not already present, just exit
-"$bin"/s3-object-exists "$dst" || exit 0
+"$vendored"/s3-object-exists "$dst" || exit 0

 s3path="${dst#s3://}"
 bucket="${s3path%%/*}"
@@ -51,4 +52,4 @@ fi

 slack_message+=" (Total record count: $src_record_count)"

-"$bin"/notify-slack "$slack_message"
+"$vendored"/notify-slack "$slack_message"
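
Several scripts in this diff split an `s3://` URL into bucket and key with the same three parameter expansions (`s3path`, `bucket`, `key`). The sketch below isolates that step. The function name `split_s3_url` and the example URL are hypothetical.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: print the bucket and the key of an s3:// URL on separate lines.
split_s3_url() {
    local url="${1:?An s3:// URL is required as the first argument.}"
    local s3path="${url#s3://}"     # drop the scheme
    local bucket="${s3path%%/*}"    # everything before the first slash
    local key="${s3path#*/}"        # everything after the first slash
    printf '%s\n%s\n' "$bucket" "$key"
}

split_s3_url "s3://example-bucket/some/dir/metadata.tsv"
```

One caveat: for a URL with no key at all (for example `s3://example-bucket`), the `#*/` expansion finds no slash and leaves the string unchanged, so the key comes out equal to the bucket name.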
6 changes: 3 additions & 3 deletions ingest/bin/trigger-on-new-data
@@ -3,7 +3,7 @@ set -euo pipefail

 : "${PAT_GITHUB_DISPATCH:?The PAT_GITHUB_DISPATCH environment variable is required.}"

-bin="$(dirname "$0")"
+vendored="$(dirname "$0")"/../vendored

 metadata="${1:?A metadata upload output file is required as the first argument.}"
 sequences="${2:?An sequence FASTA upload output file is required as the second argument.}"
@@ -17,14 +17,14 @@ slack_message=""
 # grep exit status 0 for found match, 1 for no match, 2 if an error occurred
 if [[ $new_metadata -eq 1 || $new_sequences -eq 1 ]]; then
 slack_message="Triggering new builds due to updated metadata and/or sequences"
-"$bin"/trigger "monkeypox" "rebuild"
+"$vendored"/trigger "monkeypox" "rebuild"
 elif [[ $new_metadata -eq 0 && $new_sequences -eq 0 ]]; then
 slack_message="Skipping trigger of rebuild: Both metadata TSV and sequences FASTA are identical to S3 files."
 else
 slack_message="Skipping trigger of rebuild: Unable to determine if data has been updated."
 fi


-if ! "$bin"/notify-slack "$slack_message"; then
+if ! "$vendored"/notify-slack "$slack_message"; then
 echo "Notifying Slack failed, but exiting with success anyway."
 fi
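
The body of the vendored `trigger` script does not appear in this diff; the vendored README only says it sends `repository_dispatch` events through the GitHub API. As a rough sketch of what a call like `trigger "monkeypox" "rebuild"` could translate to, assuming the repository owner is fixed to `nextstrain` and the second argument becomes the event type: the function names `dispatch_request` and `send_dispatch` are hypothetical, and only the endpoint shape and the `event_type` field come from the public GitHub REST API.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: print the endpoint and JSON body for a repository_dispatch event.
dispatch_request() {
    local github_repo="${1:?A repository name is required as the first argument.}"
    local dispatch_type="${2:?An event type is required as the second argument.}"
    # Assumption: the owner is always the nextstrain organization.
    printf 'https://api.github.com/repos/nextstrain/%s/dispatches\n' "$github_repo"
    printf '{"event_type":"%s"}\n' "$dispatch_type"
}

# Hypothetical sender. Not called here because it needs network access and a token.
send_dispatch() {
    local endpoint payload
    { read -r endpoint; read -r payload; } < <(dispatch_request "$@")
    curl --fail --silent --show-error \
        --request POST \
        --header "Accept: application/vnd.github+json" \
        --header "Authorization: Bearer ${PAT_GITHUB_DISPATCH:?The PAT_GITHUB_DISPATCH environment variable is required.}" \
        --data "$payload" \
        "$endpoint"
}

dispatch_request monkeypox rebuild
```

Separating request construction from sending keeps the part that can be checked offline free of credentials.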
8 changes: 4 additions & 4 deletions ingest/bin/upload-to-s3
@@ -2,7 +2,7 @@
 # Originally copied from nextstrain/ncov-ingest repo
 set -euo pipefail

-bin="$(dirname "$0")"
+vendored="$(dirname "$0")"/../vendored

 main() {
 local quiet=0
@@ -26,7 +26,7 @@ main() {
 local key="${s3path#*/}"

 local src_hash dst_hash no_hash=0000000000000000000000000000000000000000000000000000000000000000
-src_hash="$("$bin/sha256sum" < "$src")"
+src_hash="$("$vendored/sha256sum" < "$src")"
 dst_hash="$(aws s3api head-object --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"

 if [[ $src_hash != "$dst_hash" ]]; then
@@ -46,7 +46,7 @@ main() {

 if [[ -n $cloudfront_domain ]]; then
 echo "Creating CloudFront invalidation for $cloudfront_domain/$key"
-if ! "$bin"/cloudfront-invalidate "$cloudfront_domain" "/$key"; then
+if ! "$vendored"/cloudfront-invalidate "$cloudfront_domain" "/$key"; then
 echo "CloudFront invalidation failed, but exiting with success anyway."
 fi
 fi
@@ -56,7 +56,7 @@ main() {
 exit 0
 fi

-if ! "$bin"/notify-slack "Updated $dst available."; then
+if ! "$vendored"/notify-slack "Updated $dst available."; then
 echo "Notifying Slack failed, but exiting with success anyway."
 fi
 else
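
`upload-to-s3` and `download-from-s3` both decide whether to transfer a file by comparing a local SHA-256 against the `sha256sum` stored in the S3 object's metadata, falling back to an all-zero sentinel when the lookup fails. The sketch below shows only that fallback-and-compare step. The two `lookup_*` functions are hypothetical stand-ins for the `aws s3api head-object` call, and `should_upload` is a hypothetical name.

```bash
#!/bin/bash
set -euo pipefail

# 64 zeros: a value no real SHA-256 of a file is expected to equal.
no_hash="$(printf '0%.0s' {1..64})"

# Hypothetical stand-ins for the remote metadata lookup.
lookup_existing() { echo "abc123"; }
lookup_missing()  { echo "An error occurred (404)" >&2; return 1; }

# Succeeds (exit 0) when the local hash differs from the remote one.
should_upload() {
    local src_hash="$1" lookup="$2" dst_hash
    # A failed lookup yields the sentinel instead of aborting under `set -e`.
    dst_hash="$("$lookup" 2>/dev/null || echo "$no_hash")"
    [[ "$src_hash" != "$dst_hash" ]]
}

if should_upload "abc123" lookup_existing; then echo "upload"; else echo "skip"; fi
if should_upload "abc123" lookup_missing;  then echo "upload"; else echo "skip"; fi
```

With the sentinel, a missing object and a changed object take the same branch, so the first upload of a new file needs no special case.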
13 changes: 13 additions & 0 deletions ingest/vendored/.github/workflows/ci.yaml
@@ -0,0 +1,13 @@
name: CI

on:
  - push
  - pull_request
  - workflow_dispatch

jobs:
  shellcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: nextstrain/.github/actions/shellcheck@master
60 changes: 60 additions & 0 deletions ingest/vendored/README.md
@@ -0,0 +1,60 @@
# ingest

Shared internal tooling for pathogen data ingest. Used by our individual
pathogen repos which produce Nextstrain builds. Expected to be vendored by
each pathogen repo using `git subtree` (or `git subrepo`).

Some tools may only live here temporarily before finding a permanent home in
`augur curate` or Nextstrain CLI. Others may happily live out their days here.

## History

Much of this tooling originated in
[ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged thru
[monkeypox's ingest/](https://github.com/nextstrain/monkeypox/tree/@/ingest/).
It subsequently proliferated from [monkeypox][] to other pathogen repos
([rsv][], [zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily
thru copying. To [counter that
proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079),
this repo was made.

[monkeypox]: https://github.com/nextstrain/monkeypox
[rsv]: https://github.com/nextstrain/rsv
[zika]: https://github.com/nextstrain/zika/pull/24
[dengue]: https://github.com/nextstrain/dengue/pull/10
[hepatitisB]: https://github.com/nextstrain/hepatitisB
[forecasts-ncov]: https://github.com/nextstrain/forecasts-ncov

## Elsewhere

The creation of this repo, in both the abstract and concrete, and the general
approach to "ingest" has been discussed in various internal places, including:

- https://github.com/nextstrain/private/issues/59
- @joverlee521's [workflows document](https://docs.google.com/document/d/1rLWPvEuj0Ayc8MR0O1lfRJZfj9av53xU38f20g8nU_E/edit#heading=h.4g0d3mjvb89i)
- [5 July 2023 Slack thread](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079)
- [6 July 2023 team meeting](https://docs.google.com/document/d/1FPfx-ON5RdqL2wyvODhkrCcjgOVX3nlXgBwCPhIEsco/edit)
- _…many others_

## Scripts

Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.

- [notify-on-job-fail](notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
- [notify-on-job-start](notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
- [notify-slack](notify-slack) - Send message or file to Slack
- [s3-object-exists](s3-object-exists) - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
- [trigger](trigger) - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.

Potential Nextstrain CLI scripts

- [sha256sum](sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
- [cloudfront-invalidate](cloudfront-invalidate) - CloudFront invalidation is already supported in the [nextstrain remote command for S3 files](https://github.com/nextstrain/cli/blob/a5dda9c0579ece7acbd8e2c32a4bbe95df7c0bce/nextstrain/cli/remote/s3.py#L104).
This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.

Potential augur curate scripts

- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
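
The `transform-genbank-location` entry above describes the expected pattern `"<country_value>[:<region>][, <locality>]"`. The script itself is not shown in this diff, so the following is only an illustrative sketch of one reading of that pattern: split on the first colon, then split the remainder on the first comma. The function names `trim` and `parse_genbank_location` and the sample inputs are hypothetical.

```bash
#!/bin/bash
set -euo pipefail

# Strip leading and trailing whitespace using only parameter expansion.
trim() {
    local s="$1"
    s="${s#"${s%%[![:space:]]*}"}"
    s="${s%"${s##*[![:space:]]}"}"
    printf '%s' "$s"
}

# Hypothetical parser: print country, region and locality as three tab-separated fields.
parse_genbank_location() {
    local location="${1:?A location string is required as the first argument.}"
    local country="${location%%:*}" region="" locality="" rest
    if [[ "$location" == *:* ]]; then
        rest="${location#*:}"
        region="${rest%%,*}"
        if [[ "$rest" == *,* ]]; then
            locality="${rest#*,}"
        fi
    fi
    printf '%s\t%s\t%s\n' "$(trim "$country")" "$(trim "$region")" "$(trim "$locality")"
}

parse_genbank_location "USA: California, San Diego"
parse_genbank_location "Nigeria"
```

In this reading a comma is only treated as a separator after a colon has been seen, so a value without a colon is kept whole as the country.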
File renamed without changes.
File renamed without changes.
23 changes: 23 additions & 0 deletions ingest/vendored/notify-on-job-fail
@@ -0,0 +1,23 @@
#!/bin/bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

: "${AWS_BATCH_JOB_ID:=}"
: "${GITHUB_RUN_ID:=}"

bin="$(dirname "$0")"
job_name="${1:?A job name is required as the first argument}"
github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}"

echo "Notifying Slack about failed ${job_name} job."
message="❌ ${job_name} job has FAILED 😞 "

if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then
message+="See AWS Batch job \`${AWS_BATCH_JOB_ID}\` (<https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/${AWS_BATCH_JOB_ID}|link>) for error details. "
elif [[ -n "${GITHUB_RUN_ID}" ]]; then
message+="See GitHub Action <https://github.com/${github_repo}/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}> for error details. "
fi

"$bin"/notify-slack "$message"
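
The new `notify-on-job-fail` takes the job name and the `owner/repo` slug as required positional arguments, which is why the workflow call became `notify-on-job-fail Ingest nextstrain/monkeypox`. The sketch below mirrors how those arguments feed into the message, in simplified form (no emoji, no Slack link markup, no AWS Batch branch) and without sending anything to Slack. The function name `build_fail_message` and the run ID are hypothetical.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: assemble a simplified failure message instead of posting it.
build_fail_message() {
    # `${N:?message}` aborts with the message when the argument is missing or empty.
    local job_name="${1:?A job name is required as the first argument}"
    local github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}"
    local run_id="${GITHUB_RUN_ID:-}"   # empty when not running on GitHub Actions

    local message="${job_name} job has FAILED. "
    if [[ -n "$run_id" ]]; then
        message+="See https://github.com/${github_repo}/actions/runs/${run_id} for error details."
    fi
    printf '%s\n' "$message"
}

# Made-up run ID, passed only for this one call.
GITHUB_RUN_ID=12345 build_fail_message Ingest nextstrain/monkeypox
```

Passing the repository slug as an argument is what removes the hard-coded `nextstrain/monkeypox` URL and lets other pathogen repositories reuse the script unchanged.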
11 changes: 7 additions & 4 deletions ingest/bin/notify-on-job-start → ingest/vendored/notify-on-job-start
@@ -8,17 +8,20 @@ set -euo pipefail
 : "${GITHUB_RUN_ID:=}"

 bin="$(dirname "$0")"
+job_name="${1:?A job name is required as the first argument}"
+github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}"
+build_dir="${3:-ingest}"

-echo "Notifying Slack about started ingest job."
-message="🐵 Monkeypox ingest job has started."
+echo "Notifying Slack about started ${job_name} job."
+message="${job_name} job has started."

 if [[ -n "${GITHUB_RUN_ID}" ]]; then
-message+=" The job was submitted by GitHub Action <https://github.com/nextstrain/monkeypox/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}>."
+message+=" The job was submitted by GitHub Action <https://github.com/${github_repo}/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}>."
 fi

 if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then
 message+=" The job was launched as AWS Batch job \`${AWS_BATCH_JOB_ID}\` (<https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/${AWS_BATCH_JOB_ID}|link>)."
-message+=" Follow along in your local \`monkeypox\` repo with: "'```'"nextstrain build --aws-batch --no-download --attach ${AWS_BATCH_JOB_ID} ingest/"'```'
+message+=" Follow along in your local clone of ${github_repo} with: "'```'"nextstrain build --aws-batch --no-download --attach ${AWS_BATCH_JOB_ID} ${build_dir}"'```'
 fi

 "$bin"/notify-slack "$message"
2 changes: 0 additions & 2 deletions ingest/bin/notify-slack → ingest/vendored/notify-slack
@@ -1,5 +1,4 @@
 #!/bin/bash
-# Originally copied from nextstrain/ncov-ingest repo
 set -euo pipefail

 : "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
@@ -38,7 +37,6 @@ if [[ "$upload" == 1 ]]; then
 --form-string title="$text" \
 --form-string filename="$text" \
 --form-string thread_ts="$thread_ts" \
---form-string reply_broadcast="$broadcast" \
 --form file=@/dev/stdin \
 --form filetype=text \
 --fail --silent --show-error \
1 change: 0 additions & 1 deletion ingest/bin/s3-object-exists → ingest/vendored/s3-object-exists
@@ -1,5 +1,4 @@
 #!/bin/bash
-# Originally copied from nextstrain/ncov-ingest
 set -euo pipefail

 url="${1#s3://}"
1 change: 0 additions & 1 deletion ingest/bin/sha256sum → ingest/vendored/sha256sum
@@ -1,5 +1,4 @@
 #!/usr/bin/env python3
-# Originally copied from nextstrain/ncov-ingest repo
 """
 Portable sha256sum utility.
 """
File renamed without changes.
File renamed without changes.
File renamed without changes.