Modernize/cleanup 2 #3

Merged
merged 7 commits into from
May 29, 2024
Changes from 5 commits
55 changes: 55 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,55 @@
default_language_version:
  python: python3
# TODO remove _LEGACY once it's gone
exclude: '\.(tsv|fasta|gb)$|^ingest/vendored/|^_LEGACY'
repos:
  - repo: https://github.com/snakemake/snakefmt
    rev: v0.10.1
    hooks:
      - id: snakefmt
        language_version: python3
  - repo: https://github.com/rhysd/actionlint
    rev: v1.6.27
    hooks:
      - id: actionlint
        entry: env SHELLCHECK_OPTS='--exclude=SC2027' actionlint
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.6
    hooks:
      - id: codespell
        additional_dependencies:
          - tomli
  - repo: https://github.com/google/yamlfmt
    rev: v0.12.1
    hooks:
      - id: yamlfmt
  - repo: https://github.com/pappasam/toml-sort
    rev: v0.23.1
    hooks:
      - id: toml-sort-fix
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: check-ast
      - id: check-case-conflict
      - id: check-docstring-first
      - id: check-json
      - id: check-executables-have-shebangs
      - id: check-merge-conflict
      - id: check-shebang-scripts-are-executable
      - id: check-symlinks
      - id: check-toml
      - id: check-yaml
      - id: destroyed-symlinks
      - id: detect-private-key
      - id: end-of-file-fixer
      - id: fix-byte-order-marker
  - repo: https://github.com/pre-commit/sync-pre-commit-deps
    rev: v0.0.1
    hooks:
      - id: sync-pre-commit-deps
  - repo: https://github.com/shellcheck-py/shellcheck-py
    rev: v0.10.0.1
    hooks:
      - id: shellcheck
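
The top-level `exclude` pattern in this config decides which files every hook skips. As a rough illustration, the sketch below applies the same regular expression to a few repo-relative paths, on the understanding that pre-commit treats `exclude` as a Python regular expression searched against each path. The paths themselves are hypothetical and were picked only to exercise each alternative in the pattern.

```python
import re

# Pattern copied verbatim from the `exclude` key in .pre-commit-config.yaml.
EXCLUDE = re.compile(r"\.(tsv|fasta|gb)$|^ingest/vendored/|^_LEGACY")


def is_excluded(path: str) -> bool:
    """Return True if the pattern matches anywhere in the repo-relative path."""
    return EXCLUDE.search(path) is not None


# Hypothetical paths, for illustration only.
examples = [
    "results/metadata.tsv",         # data file extension
    "ingest/vendored/some-script",  # vendored directory
    "_LEGACY/Snakefile",            # legacy directory at the repo root
    "ingest/Snakefile",             # matches none of the alternatives
    "README.md",                    # matches none of the alternatives
]
for path in examples:
    print(f"{path}: {'skipped' if is_excluded(path) else 'checked'}")
```

Because `^_LEGACY` is anchored to the start of the path, only a top-level `_LEGACY` directory is skipped; a directory with the same name nested deeper in the tree would still be checked.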
5 changes: 5 additions & 0 deletions .yamlfmt
@@ -0,0 +1,5 @@
formatter:
  type: basic
  line_ending: lf
  retain_line_breaks: true
  max_line_length: 120
88 changes: 9 additions & 79 deletions README.md
@@ -1,82 +1,12 @@
# Nextstrain build for Yellow Fever

This build is currently designed for in-the-field running and is not yet generalised for a stable, updated nextstrain.org page.

## Install augur + auspice using conda

```
curl http://data.nextstrain.org/nextstrain.yml --compressed -o nextstrain.yml
conda env create -f nextstrain.yml
conda activate nextstrain
npm install --global auspice
```

When you're inside the "nextstrain" environment (via `conda activate nextstrain`) you should have both `augur` & `auspice` installed.
You can test this by running `augur --version` and `auspice --version`.
Currently, augur is around `v5.1.1` and auspice is around `v1.36.6`.


## Clone this repo

```
git clone https://github.com/nextstrain/yellow-fever.git
cd yellow-fever
mkdir data results auspice
```


## Make input files available

The bioinformatics "pipeline" for YFV starts with 4 main files as input:
* `./data/genbankReleased.fasta`
* `./data/genbankReleased.csv`
* `./data/newSequences.fasta`
* `./data/newSequences.csv`

These are not committed to the github repo (they're "gitignored"), so you'll have to put them there.

These files are specified in the first few lines of the Snakemake file, which contains all the commands necessary to run the pipeline.
It's possible to use >2 sets of input files, or different file names, but you'll have to add / change them in the snakemake file.


Additionally, there are 2 other input-like files, also defined in the snakemake file:
* `./config/auspice_config.json` which contains options -- such as what traits to display as the color-by's on the tree -- which are used to control how auspice will visualise the data.
* `./config/YFV112.gb` the YFV reference used here -- currently [YF112](https://www.ncbi.nlm.nih.gov/nuccore/1269012770). Please replace this if needed & update the snakemake file accordingly.


## Run the pipeline

```
snakemake --printshellcmds
```

This will run all the steps defined in the Snakefile 🎉

These steps are (roughly):
1. __parse__ convert the (potentially multiple) CSV + FASTA files into the correct format for augur (TSV + FASTA). Also performs some field manipulation, such as extracting "country" from the "Sequence_name", extracting collection year, storing which file a sequence came from etcetera.
2. __align__ Using mafft
3. __tree__ Using IQ-TREE (can change this to RAxML or FastTree if needed)
4. __refine__ Normally this is where we date the internal nodes, but I haven't enabled this here. It is needed however to label the internal nodes & reroot the tree (see below).
5. __ancestral__ Infer ancestral mutations on the tree. This step could easily be dropped if desired!
6. __traits__ Use DTA to infer some traits across the tree. Currently used for "country" only. You can easily add fields to the snakemake file which will perform this for additional traits.
7. __export__ Create the final JSON for auspice to visualise.

Steps 1-6 produce output in `./results`, while step 7 (export) produces the JSONs in `./auspice`. Both of these directories are gitignored (as well as `./data`) so that files here won't be pushed up to GitHub.

## Visualise the data

```
auspice view --datasetDir auspice
```
Then open a browser at http://localhost:4000

Current color-bys include most of the metadata provided, as well as which file the samples came from and the year of collection.
The GPS co-ordinates are per-strain, so that is the geographic-resolution available. We could also aggregate municipalities if desired (we'd need GPS coordinates for each one if so).


## To-Do
* The tree is rooted on the oldest available sequence ("JF912179", 1980), but there may be a better choice? This is defined in the Snakefile and is really easy to change.
* Reference sequence used may not be ideal.
# Nextstrain repository for yellow fever virus

##TODO## finish updating this
This repository is in the process of being upgraded to follow the
[pathogen repo
guide](https://github.com/nextstrain/pathogen-repo-guide/).

## Installation

Follow the [standard installation
instructions](https://docs.nextstrain.org/en/latest/install.html) for
Nextstrain's suite of software tools.
File renamed without changes.


FWIW, the move of the top level auspice directory will break the link to the community build that can be viewed at https://nextstrain.org/community/nextstrain/yellow-fever.

Not sure how much it gets used since this build was updated ~5 years ago.

Contributor Author

Ah, you mean that community build is directly coming out of the GitHub repo?

In that case, what I might do instead is branch a new main off the current master, and then use that new main branch as the place where in-progress "modernization" work is collected, and then only loop back and remove master when the modernization work is complete and we're ready to move yellow-fever into the officially supported builds.

Thoughts?

Member

(I don't have insight into whether we want to maintain the community build here - for our canonical pathogen repos I wouldn't expect usage of /community URLs but 🤷 )

Community URLs will, in the absence of an explicit branch, use the default branch.


> for our canonical pathogen repos I wouldn't expect usage of /community URLs

Right, just wanted to point this out in case there was usage of it. I also didn't realize we had a yellow-fever build at https://nextstrain.org/yellow-fever (although it's two years older than the community URL).

Should we just upload this community dataset as a core yellow-fever dataset and remove from here? (Saying that with no historical context of who produced this community dataset)

Contributor Author

The community dataset links to http://evolve.zoo.ox.ac.uk/Evolve/Sarah_Hill.html -- which 404s -- but Google suggests that Sarah Hill is https://www.rvc.ac.uk/about/our-people/sarah-hill; there is at least yellow fever research in her publication trail.

I think (unless I'm very confused), my goal here is replacing/updating our yellow fever build.

I'm not sure what sort of guarantees we provide (or even should provide) to folks who have set up community builds, or what we do when a community build "breaks".

Guidance appreciated. 😁


> I think (unless I'm very confused), my goal here is replacing/updating our yellow fever build.

Totally, sorry for the confusion. Since the community build was within our Nextstrain repo, I had thought this was our yellow fever build and was wondering if we should upload it as an official Nextstrain build.

> I'm not sure what sort of guarantees we provide (or even should provide) to folks who have set up community builds, or what we do when a community build "breaks".

You're right, we don't want to put any guarantees on community builds. I just thought it would be nice to preserve this particular build if this was created by the Nextstrain team.


All that said, it seems like we don't have to dwell on it now and can just proceed with moving it under _LEGACY. We can revisit what to do with it when the time comes to completely delete it from the repo.

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
105 changes: 105 additions & 0 deletions ingest/README.md
@@ -0,0 +1,105 @@
# Ingest

This workflow ingests public data from NCBI and outputs curated
metadata and sequences that can be used as input for the phylogenetic
workflow.

## Workflow Usage

The workflow can be run from the top level pathogen repo directory:

```bash
nextstrain build ingest
```

Alternatively, the workflow can also be run from within the ingest
directory:

```bash
cd ingest
nextstrain build .
```

This produces the default outputs of the ingest workflow:

- metadata = results/metadata.tsv
- sequences = results/sequences.fasta

### Dumping the full raw metadata from NCBI Datasets

The workflow has a target for dumping the full raw metadata from NCBI
Datasets.

```bash
nextstrain build ingest dump_ncbi_dataset_report
```

This will produce the file `ingest/data/ncbi_dataset_report_raw.tsv`,
which you can inspect to determine what fields and data to use if you
want to configure the workflow for your pathogen.

## Defaults

The defaults directory contains all of the default configurations for
the ingest workflow.

[defaults/config.yaml](defaults/config.yaml) contains all of the
default configuration parameters used for the ingest workflow. Use
Snakemake's `--configfile`/`--config` options to override these
default values.
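
As a rough mental model of how overrides layer on top of `defaults/config.yaml`, the sketch below merges a dictionary of overrides into a dictionary of defaults. It is a simplified illustration in plain Python, not Snakemake's own code, and it assumes nested mappings are merged key by key while any other value is replaced. The config keys shown are hypothetical and are not taken from this workflow's config file.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` layered on top.

    Nested dicts are merged key by key; any other value is replaced outright.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical keys and values, for illustration only.
defaults = {
    "taxon_id": "0000",
    "curate": {"id_field": "accession", "date_fields": ["date"]},
}
overrides = {"curate": {"id_field": "strain"}}

config = merge_config(defaults, overrides)
print(config)
```

In this model, an override file only needs to list the values that differ from the defaults; everything else is inherited.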

## Snakefile and rules

The rules directory contains separate Snakefiles (`*.smk`) as modules
of the core ingest workflow. The modules of the workflow are in
separate files to keep the main ingest [Snakefile](Snakefile) succinct
and organized.

The `workdir` is hardcoded to be the ingest directory so all filepaths
for inputs/outputs should be relative to the ingest directory.

Modules are all
[included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
in the main Snakefile in the order that they are expected to run.

### Nextclade

Nextstrain is pushing to standardize ingest workflows with Nextclade
runs to include Nextclade outputs in our publicly hosted data.
However, if a Nextclade dataset does not already exist, creating one requires
curated data as input, so we are making Nextclade steps optional here.

If Nextclade config values are included, the Nextclade rules will
create the final metadata TSV by joining the Nextclade output with the
metadata. If Nextclade configs are not included, we rename the subset
metadata TSV to the final metadata TSV.
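
The join described above can be pictured as a left join on a shared sequence ID. The following is a minimal sketch in plain Python and is not the implementation in `rules/nextclade.smk`; the column names (`accession`, `seqName`, `clade`) and all values are hypothetical placeholders.

```python
import csv
import io


def read_tsv(text: str) -> list[dict]:
    """Parse TSV text into a list of row dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_nextclade(
    metadata: list[dict],
    nextclade: list[dict],
    metadata_id: str,
    nextclade_id: str,
) -> list[dict]:
    """Left-join Nextclade rows onto metadata rows by ID.

    Every metadata row is kept. Nextclade columns are filled in when a row
    with the same ID exists and left empty otherwise.
    """
    by_id = {row[nextclade_id]: row for row in nextclade}
    first_row = nextclade[0] if nextclade else {}
    extra_fields = [field for field in first_row if field != nextclade_id]
    joined = []
    for row in metadata:
        match = by_id.get(row[metadata_id], {})
        joined.append({**row, **{field: match.get(field, "") for field in extra_fields}})
    return joined


# Hypothetical inputs, for illustration only.
metadata = read_tsv("accession\tcountry\nA1\tBrazil\nA2\tGhana\n")
nextclade = read_tsv("seqName\tclade\nA1\tX\n")

for row in join_nextclade(metadata, nextclade, "accession", "seqName"):
    print(row)
```

A left join keeps sequences that Nextclade produced no output for, which matches the idea that the final metadata TSV should never have fewer records than the curated metadata.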

To run Nextclade rules, include the `defaults/nextclade_config.yaml`
config file with:

```bash
nextstrain build ingest --configfile defaults/nextclade_config.yaml
```

> [!TIP]
> If the Nextclade dataset is stable and you always want to run the
> Nextclade rules as part of ingest, we recommend moving the Nextclade
> related config parameters from the `defaults/nextclade_config.yaml`
> file to the default config file `defaults/config.yaml`.

## Build configs

The build-configs directory contains custom configs and rules that
override and/or extend the default workflow.

- [nextstrain-automation](build-configs/nextstrain-automation/) - automated internal Nextstrain builds.

## Vendored

This repository uses
[`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies
of ingest scripts in [vendored](vendored), from
[nextstrain/ingest](https://github.com/nextstrain/ingest).

See [vendored/README.md](vendored/README.md#vendoring) for
instructions on how to update the vendored scripts.
86 changes: 86 additions & 0 deletions ingest/Snakefile
@@ -0,0 +1,86 @@
"""
This is the main ingest Snakefile that orchestrates the full ingest workflow
and defines its default outputs.
"""


# The workflow filepaths are written relative to this Snakefile's base
# directory
workdir: workflow.current_basedir


# Use default configuration values. Override with Snakemake's
# --configfile/--config options.
configfile: "defaults/config.yaml"


# This is the default rule that Snakemake will run when there are no
# specified targets. The default output of the ingest workflow is
# usually the curated metadata and sequences. Nextstrain-maintained
# ingest workflows will produce metadata files with the standard
# Nextstrain fields and additional fields that are pathogen specific.
# We recommend using these standard fields in custom ingests as well
# to minimize the customizations you will need for the downstream
# phylogenetic workflow.


# TODO: Add link to centralized docs on standard Nextstrain metadata fields
rule all:
    input:
        "results/sequences.fasta",
        "results/metadata.tsv",


# Note that only PATHOGEN-level customizations should be added to
# these core steps, meaning they are custom rules necessary for all
# builds of the pathogen. If there are build-specific customizations,
# they should be added with the custom_rules imported below to ensure
# that the core workflow is not complicated by build-specific rules.
include: "rules/fetch_from_ncbi.smk"
include: "rules/curate.smk"


# We are pushing to standardize ingest workflows with Nextclade runs
# to include Nextclade outputs in our publicly hosted data. However,
# if a Nextclade dataset does not already exist, creating one requires
# curated data as input, so we are making Nextclade steps optional
# here.
#
# If Nextclade config values are included, the nextclade rules will
# create the final metadata TSV by joining the Nextclade output with
# the metadata. If Nextclade configs are not included, we rename the
# subset metadata TSV to the final metadata TSV. To run nextclade.smk
# rules, include the `defaults/nextclade_config.yaml` config file with
# `nextstrain build ingest --configfile
# defaults/nextclade_config.yaml`.
if "nextclade" in config:

    include: "rules/nextclade.smk"

else:

    rule create_final_metadata:
        input:
            metadata="data/subset_metadata.tsv",
        output:
            metadata="results/metadata.tsv",
        shell:
            """
            mv {input.metadata} {output.metadata}
            """


# Allow users to import custom rules provided via the config.
# This allows users to run custom rules that can extend or override
# the workflow. A concrete example of using custom rules is the
# extension of the workflow with rules to support the Nextstrain
# automation that uploads files and sends internal Slack
# notifications. For extensions, the user will have to specify the
# custom rule targets when running the workflow. For overrides, the
# custom Snakefile will have to use the `ruleorder` directive to allow
# Snakemake to handle ambiguous rules
# https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#handling-ambiguous-rules
if "custom_rules" in config:
    for rule_file in config["custom_rules"]:

        include: rule_file
38 changes: 38 additions & 0 deletions ingest/build-configs/nextstrain-automation/README.md
@@ -0,0 +1,38 @@
# Nextstrain automation

> [!NOTE]
> External users can ignore this directory!
> This build config/customization is tailored for the internal Nextstrain team
> to extend the core ingest workflow for automated workflows.

## Update the config

Update the [config.yaml](config.yaml) for your pathogen:

1. Edit the `s3_dst` param to add the pathogen repository name.
2. Edit the `files_to_upload` param to a mapping of files you need to upload for your pathogen.
The default includes suggested files for uploading curated data and Nextclade outputs.
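
To make the relationship between the two params concrete, here is a small sketch of how an `s3_dst` prefix and a `files_to_upload` mapping could be combined into upload pairs. It is not the logic in `upload.smk`. It assumes the mapping goes from remote file name to local path, and the bucket name, prefix, and file names are hypothetical, so check `config.yaml` for the actual layout.

```python
def plan_uploads(s3_dst: str, files_to_upload: dict[str, str]) -> list[tuple[str, str]]:
    """Return (local_path, remote_url) pairs for each configured file.

    Assumes `files_to_upload` maps remote file name -> local path.
    """
    base = s3_dst.rstrip("/")
    return [(local, f"{base}/{remote}") for remote, local in files_to_upload.items()]


# Hypothetical config values, for illustration only.
s3_dst = "s3://example-bucket/files/workflows/yellow-fever/"
files_to_upload = {
    "metadata.tsv": "results/metadata.tsv",
    "sequences.fasta": "results/sequences.fasta",
}

for local, remote in plan_uploads(s3_dst, files_to_upload):
    print(f"{local} -> {remote}")
```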

## Run the workflow

Provide the additional config file to the Snakemake options in order to
include the custom rules from [upload.smk](upload.smk) in the workflow.
Specify the `upload_all` target in order to run the additional upload rules.

The upload rules will require AWS credentials for a user that has permissions
to upload to the Nextstrain data bucket.

The customized workflow can be run from the top level pathogen repo directory with:
```
nextstrain build \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    ingest \
        upload_all \
        --configfile build-configs/nextstrain-automation/config.yaml
```

## Automated GitHub Action workflows

Additional instructions on how to use this with the shared `pathogen-repo-build`
GitHub Action workflow to come!