Initial commit

nextstrain · Jun 10, 2024 · commit ad0b045

Showing 68 changed files with 2,715 additions and 0 deletions.
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,3 @@
+# Allow Git to decide if file is text or binary
+# Always use LF line endings even on Windows.
+* text=auto eol=lf
diff --git a/.github/workflows/pre-commit.yaml b/.github/workflows/pre-commit.yaml
@@ -0,0 +1,14 @@
+name: pre-commit
+
+on:
+  - push
+
+jobs:
+  pre-commit:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - uses: pre-commit/action@v3.0.1
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,51 @@
+# Files created by workflows that we usually want to keep out of git
+auspice/
+builds/
+data/
+results/
+logs/
+benchmarks/
+
+# Sensitive environment variables
+environment*
+env.d/
+
+# Snakemake
+.snakemake/
+
+# For Python #
+##############
+*.pyc
+.tox/
+.cache/
+
+# Compiled source #
+###################
+*.com
+*.class
+*.dll
+*.exe
+*.o
+*.so
+
+# OS generated files #
+######################
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+Icon?
+ehthumbs.db
+Thumbs.db
+*~
+
+# IDE generated files #
+######################
+.vscode/
+
+# nohup output
+nohup.out
+
+# cluster logs
+slurm-*
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,41 @@
+default_language_version:
+  python: python3
+exclude: '\.(tsv|fasta|gb)$|^ingest/vendored/'
+repos:
+  - repo: https://github.com/pre-commit/sync-pre-commit-deps
+    rev: v0.0.1
+    hooks:
+      - id: sync-pre-commit-deps
+  - repo: https://github.com/shellcheck-py/shellcheck-py
+    rev: v0.10.0.1
+    hooks:
+      - id: shellcheck
+  - repo: https://github.com/rhysd/actionlint
+    rev: v1.6.27
+    hooks:
+      - id: actionlint
+        entry: env SHELLCHECK_OPTS='--exclude=SC2027' actionlint
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.6.0
+    hooks:
+      - id: trailing-whitespace
+      - id: check-ast
+      - id: check-case-conflict
+      - id: check-docstring-first
+      - id: check-json
+      - id: check-executables-have-shebangs
+      - id: check-merge-conflict
+      - id: check-shebang-scripts-are-executable
+      - id: check-symlinks
+      - id: check-toml
+      - id: check-yaml
+      - id: destroyed-symlinks
+      - id: detect-private-key
+      - id: end-of-file-fixer
+      - id: fix-byte-order-marker
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    # Ruff version.
+    rev: v0.4.6
+    hooks:
+      # Run the linter.
+      - id: ruff
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,7 @@
+# CHANGELOG
+
+We use this CHANGELOG to document breaking changes, new features, bug fixes,
+and config value changes that may affect both the usage of the workflows and
+the outputs of the workflows. See the [changelog for the ncov
+repository](https://github.com/nextstrain/ncov/blob/HEAD/docs/src/reference/change_log.md)
+for an example of formatting.
diff --git a/README.md b/README.md
@@ -0,0 +1,9 @@
+# Pathogen Repo Guide
+
+This guide describes how to set up a Nextstrain pathogen repository that holds
+the files necessary to run and maintain a Nextstrain pathogen build.
+
+Using this guide will allow you to start with the general repository
+and workflow organization that is expected of a Nextstrain-maintained pathogen.
+However, the workflows will require customization to support your specific pathogen
+and should not be expected to "just work".
diff --git a/ingest/README.md b/ingest/README.md
@@ -0,0 +1,93 @@
+# Ingest
+
+This workflow ingests public data from NCBI and outputs curated metadata and
+sequences that can be used as input for the phylogenetic workflow.
+
+If you have another data source or private data that needs to be formatted for
+the phylogenetic workflow, then you can use a similar workflow to curate your
+own data.
+
+## Workflow Usage
+
+The workflow can be run from the top level pathogen repo directory:
+```
+nextstrain build ingest
+```
+
+Alternatively, the workflow can be run from within the ingest directory:
+```
+cd ingest
+nextstrain build .
+```
+
+This produces the default outputs of the ingest workflow:
+
+- metadata      = results/metadata.tsv
+- sequences     = results/sequences.fasta
+
+### Dumping the full raw metadata from NCBI Datasets
+
+The workflow has a target for dumping the full raw metadata from NCBI Datasets.
+
+```
+nextstrain build ingest dump_ncbi_dataset_report
+```
+
+This will produce the file `ingest/data/ncbi_dataset_report_raw.tsv`,
+which you can inspect to determine what fields and data to use if you want to
+configure the workflow for your pathogen.
+
+## Defaults
+
+The defaults directory contains all of the default configurations for the ingest workflow.
+
+[defaults/config.yaml](defaults/config.yaml) contains all of the default configuration parameters
+used for the ingest workflow. Use Snakemake's `--configfile`/`--config`
+options to override these default values.
+
+## Snakefile and rules
+
+The rules directory contains separate Snakefiles (`*.smk`) as modules of the core ingest workflow.
+The modules of the workflow are in separate files to keep the main ingest [Snakefile](Snakefile) succinct and organized.
+
+The `workdir` is hardcoded to be the ingest directory so all filepaths for
+inputs/outputs should be relative to the ingest directory.
+
+Modules are all [included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
+in the main Snakefile in the order that they are expected to run.
+
+### Nextclade
+
+Nextstrain is pushing to standardize ingest workflows with Nextclade runs to include Nextclade outputs in our publicly
+hosted data. However, if a Nextclade dataset does not already exist, creating one requires curated data as input, so we are making
+Nextclade steps optional here.
+
+If Nextclade config values are included, the Nextclade rules will create the final metadata TSV by joining the Nextclade
+output with the metadata. If Nextclade configs are not included, we rename the subset metadata TSV to the final metadata TSV.
+
+To run Nextclade rules, include the `defaults/nextclade_config.yaml` config file with:
+
+```
+nextstrain build ingest --configfile defaults/nextclade_config.yaml
+```
+
+> [!TIP]
+> If the Nextclade dataset is stable and you always want to run the Nextclade rules as part of ingest, we recommend
+> moving the Nextclade-related config parameters from the `defaults/nextclade_config.yaml` file to the default config file
+> `defaults/config.yaml`.
+
+## Build configs
+
+The build-configs directory contains custom configs and rules that override and/or
+extend the default workflow.
+
+- [nextstrain-automation](build-configs/nextstrain-automation/) - automated internal Nextstrain builds.
+
+
+## Vendored
+
+This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo)
+to manage copies of ingest scripts in [vendored](vendored), from [nextstrain/ingest](https://github.com/nextstrain/ingest).
+
+See [vendored/README.md](vendored/README.md#vendoring) for instructions on how to update
+the vendored scripts.
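
The "Defaults" section of `ingest/README.md` above says that Snakemake's `--configfile`/`--config` options override the values in `defaults/config.yaml`. The following Python sketch models that layering. It is a simplified illustration, not Snakemake's implementation: it assumes later layers are merged recursively over earlier ones (the workflow's `configfile:` first, then `--configfile` files, then `--config` values). The key names `example_param`, `other_param`, and `dataset_name` are hypothetical and do not come from this repository.

```python
from copy import deepcopy


def merge_config(base, override):
    """Return a new dict with `override` merged recursively over `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Otherwise the later layer replaces the earlier value.
            merged[key] = deepcopy(value)
    return merged


def effective_config(*layers):
    """Apply config layers in order; later layers take precedence."""
    config = {}
    for layer in layers:
        config = merge_config(config, layer)
    return config


# Hypothetical layers standing in for defaults/config.yaml,
# defaults/nextclade_config.yaml, and a --config override.
defaults = {"example_param": "default", "other_param": 1}
nextclade_layer = {"nextclade": {"dataset_name": "hypothetical/dataset"}}
cli_layer = {"example_param": "from-cli"}

config = effective_config(defaults, nextclade_layer, cli_layer)
print(config)
```

Under this model, adding `defaults/nextclade_config.yaml` as an extra layer introduces a `nextclade` key without touching the other defaults, which is what lets the Snakefile switch the Nextclade rules on.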
diff --git a/ingest/Snakefile b/ingest/Snakefile
@@ -0,0 +1,73 @@
+"""
+This is the main ingest Snakefile that orchestrates the full ingest workflow
+and defines its default outputs.
+"""
+# The workflow filepaths are written relative to this Snakefile's base directory
+workdir: workflow.current_basedir
+
+# Use default configuration values. Override with Snakemake's --configfile/--config options.
+configfile: "defaults/config.yaml"
+
+# This is the default rule that Snakemake will run when there are no specified targets.
+# The default output of the ingest workflow is usually the curated metadata and sequences.
+# Nextstrain-maintained ingest workflows will produce metadata files with the
+# standard Nextstrain fields and additional fields that are pathogen specific.
+# We recommend using these standard fields in custom ingests as well to minimize
+# the customizations you will need for the downstream phylogenetic workflow.
+# TODO: Add link to centralized docs on standard Nextstrain metadata fields
+rule all:
+    input:
+        "results/sequences.fasta",
+        "results/metadata.tsv",
+
+
+# Note that only PATHOGEN-level customizations should be added to these
+# core steps, meaning they are custom rules necessary for all builds of the pathogen.
+# If there are build-specific customizations, they should be added with the
+# custom_rules imported below to ensure that the core workflow is not complicated
+# by build-specific rules.
+include: "rules/fetch_from_ncbi.smk"
+include: "rules/curate.smk"
+
+
+# We are pushing to standardize ingest workflows with Nextclade runs to include
+# Nextclade outputs in our publicly hosted data. However, if a Nextclade dataset
+# does not already exist, creating one requires curated data as input, so we are making
+# Nextclade steps optional here.
+#
+# If Nextclade config values are included, the nextclade rules will create the
+# final metadata TSV by joining the Nextclade output with the metadata.
+# If Nextclade configs are not included, we rename the subset metadata TSV
+# to the final metadata TSV.
+# To run nextclade.smk rules, include the `defaults/nextclade_config.yaml`
+# config file with `nextstrain build ingest --configfile defaults/nextclade_config.yaml`.
+if "nextclade" in config:
+
+    include: "rules/nextclade.smk"
+
+else:
+
+    rule create_final_metadata:
+        input:
+            metadata="data/subset_metadata.tsv"
+        output:
+            metadata="results/metadata.tsv"
+        shell:
+            """
+            mv {input.metadata} {output.metadata}
+            """
+
+# Allow users to import custom rules provided via the config.
+# This allows users to run custom rules that can extend or override the workflow.
+# A concrete example of using custom rules is the extension of the workflow with
+# rules to support the Nextstrain automation that uploads files and sends internal
+# Slack notifications.
+# For extensions, the user will have to specify the custom rule targets when
+# running the workflow.
+# For overrides, the custom Snakefile will have to use the `ruleorder` directive
+# to allow Snakemake to handle ambiguous rules
+# https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#handling-ambiguous-rules
+if "custom_rules" in config:
+    for rule_file in config["custom_rules"]:
+
+        include: rule_file
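
The conditional logic in the Snakefile above can be summarized in plain Python. This is only a sketch of which rule files end up included for a given config, assuming the config behaves like an ordinary dictionary; `planned_includes` is a hypothetical helper, not part of Snakemake or this repository.

```python
def planned_includes(config):
    """Return the rule files the ingest Snakefile would include, in order."""
    # Core pathogen-level steps are always included.
    includes = ["rules/fetch_from_ncbi.smk", "rules/curate.smk"]

    # Nextclade rules are included only when the config has a "nextclade" key.
    # Otherwise the Snakefile defines the inline `create_final_metadata` rule,
    # which adds no extra file to this list.
    if "nextclade" in config:
        includes.append("rules/nextclade.smk")

    # Custom rule files from the config are included last, in the order given.
    if "custom_rules" in config:
        for rule_file in config["custom_rules"]:
            includes.append(rule_file)

    return includes


print(planned_includes({}))
print(planned_includes({
    "nextclade": {},
    "custom_rules": ["build-configs/nextstrain-automation/upload.smk"],
}))
```

Because custom rule files come last, a custom Snakefile that overrides a core rule still needs `ruleorder` to resolve the ambiguity, as the comment in the Snakefile notes.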
diff --git a/ingest/build-configs/nextstrain-automation/README.md b/ingest/build-configs/nextstrain-automation/README.md
@@ -0,0 +1,38 @@
+# Nextstrain automation
+
+> [!NOTE]
+> External users can ignore this directory!
+> This build config/customization is tailored for the internal Nextstrain team
+> to extend the core ingest workflow for automated workflows.
+
+## Update the config
+
+Update the [config.yaml](config.yaml) for your pathogen:
+
+1. Edit the `s3_dst` param to add the pathogen repository name.
+2. Edit the `files_to_upload` param to a mapping of files you need to upload for your pathogen.
+The default includes suggested files for uploading curated data and Nextclade outputs.
+
+## Run the workflow
+
+Pass the additional config file with Snakemake's `--configfile` option to
+include the custom rules from [upload.smk](upload.smk) in the workflow.
+Specify the `upload_all` target in order to run the additional upload rules.
+
+The upload rules will require AWS credentials for a user that has permissions
+to upload to the Nextstrain data bucket.
+
+The customized workflow can be run from the top level pathogen repo directory with:
+```
+nextstrain build \
+    --env AWS_ACCESS_KEY_ID \
+    --env AWS_SECRET_ACCESS_KEY \
+    ingest \
+        upload_all \
+        --configfile build-configs/nextstrain-automation/config.yaml
+```
+
+## Automated GitHub Action workflows
+
+Additional instructions on how to use this with the shared `pathogen-repo-build`
+GitHub Action workflow to come!
diff --git a/ingest/build-configs/nextstrain-automation/config.yaml b/ingest/build-configs/nextstrain-automation/config.yaml
@@ -0,0 +1,23 @@
+# This configuration file should contain all required configuration parameters
+# for the ingest workflow to run with additional Nextstrain automation rules.
+
+# Custom rules to run as part of the Nextstrain automated workflow
+# The paths should be relative to the ingest directory.
+custom_rules:
+  - build-configs/nextstrain-automation/upload.smk
+
+# Nextstrain CloudFront domain to ensure that we invalidate CloudFront after the S3 uploads
+# This is required as long as we are using the AWS CLI for uploads
+cloudfront_domain: "data.nextstrain.org"
+
+# Nextstrain AWS S3 Bucket with pathogen prefix
+# Replace <pathogen> with the pathogen repo name.
+s3_dst: "s3://nextstrain-data/files/workflows/<pathogen>"
+
+# Mapping of files to upload
+files_to_upload:
+  ncbi.ndjson.zst: data/ncbi.ndjson
+  metadata.tsv.zst: results/metadata.tsv
+  sequences.fasta.zst: results/sequences.fasta
+  alignments.fasta.zst: results/alignment.fasta
+  translations.zip: results/translations.zip
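
One way to read the `files_to_upload` mapping is sketched below in Python. It assumes each key is the remote file name appended to `s3_dst` and each value is the local path relative to the ingest directory. The actual handling lives in `upload.smk`, which is not shown in this section, so `upload_plan` is a hypothetical helper for illustration only.

```python
def upload_plan(s3_dst, files_to_upload):
    """Pair each local file with the remote URL it would be uploaded to.

    Assumption: keys are remote names under `s3_dst`, values are local paths.
    """
    prefix = s3_dst.rstrip("/")
    return [
        (local_path, f"{prefix}/{remote_name}")
        for remote_name, local_path in files_to_upload.items()
    ]


# Values copied from the config above; <pathogen> is left as the placeholder.
s3_dst = "s3://nextstrain-data/files/workflows/<pathogen>"
files_to_upload = {
    "ncbi.ndjson.zst": "data/ncbi.ndjson",
    "metadata.tsv.zst": "results/metadata.tsv",
    "sequences.fasta.zst": "results/sequences.fasta",
    "alignments.fasta.zst": "results/alignment.fasta",
    "translations.zip": "results/translations.zip",
}

for local_path, remote_url in upload_plan(s3_dst, files_to_upload):
    print(f"{local_path} -> {remote_url}")
```

Read this way, the `.zst` suffix on a remote name suggests the local file is compressed before upload, though that step is also an assumption here.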