
Download data from S3 to start workflow #22

Merged · huddlej merged 8 commits into master from use-s3-storage on May 3, 2024
Conversation

@huddlej (Contributor) commented Apr 29, 2024

Description of proposed changes

Replaces the workflow logic for downloading data from fauna with 1) a custom workflow that downloads from fauna, parses sequences and metadata, and uploads to S3, and 2) new main workflow logic that downloads the parsed sequences/metadata from S3 and filters to the requested subtype before continuing the rest of the workflow.

One major change in this implementation is the replacement of one metadata file per subtype and segment with a single metadata file across all subtypes and segments. The metadata file includes an n_segments column with the number of segment sequences available for each metadata record, which allows the original "same strains" path through the phylogenetic workflow to work.
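As a rough sketch (file names are hypothetical), a column like this can be derived by tallying how many per-segment metadata files mention each strain:

# Hypothetical sketch: count how many of the eight per-segment metadata
# files contain each strain, assuming strain is the first column.
awk -F'\t' 'FNR > 1 { n[$1]++ } END { for (s in n) print s "\t" n[s] }' \
    metadata_{pb2,pb1,pa,ha,np,na,mp,ns}.tsv > n_segments.tsv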

To run the upload to S3:

cd ingest
nextstrain build \
    --env RETHINK_HOST \
    --env RETHINK_AUTH_KEY \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    .

See the ingest README for more details.

After the upload, S3 will have one metadata file for all subtypes and segments and one sequences file per gene segment across all subtypes, like:

  • s3://nextstrain-data-private/files/workflows/avian-flu/metadata.tsv.zst
  • s3://nextstrain-data-private/files/workflows/avian-flu/ha/sequences.fasta.zst
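To spot-check the uploaded files, something like the following should work (a sketch assuming the AWS CLI and zstd are installed and AWS credentials are configured):

# Stream the compressed metadata from S3 and peek at the first records.
aws s3 cp s3://nextstrain-data-private/files/workflows/avian-flu/metadata.tsv.zst - \
    | zstd -dc \
    | head -n 5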

What this means for users

The changes in this PR will be breaking changes for some users, including people who currently have credentials to access fauna but do not have AWS credentials to access the private bucket above. We will need to issue these users AWS credentials that provide at least read access to nextstrain-data-private, and they will need to learn how to pass those credentials to tools like the Nextstrain CLI (e.g., via the --env flags shown above or through the envdir argument, as sketched below).

Users who want to run the upload workflow will need read/write access to the private bucket. Ideally, we could limit the number of users who need these permissions by building the GitHub Action described in the next steps below.
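For example, a read-only user could store credentials in an envdir and pass the directory to the CLI (a sketch assuming the CLI's --envdir option; the directory path and placeholder values are hypothetical):

# Store each credential as a file named after the environment variable.
mkdir -p ~/.config/avian-flu-env
echo "<access key id>" > ~/.config/avian-flu-env/AWS_ACCESS_KEY_ID
echo "<secret access key>" > ~/.config/avian-flu-env/AWS_SECRET_ACCESS_KEY

# Pass the whole directory of variables to the workflow run.
nextstrain build --envdir ~/.config/avian-flu-env .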

Next steps

One immediate improvement to the user experience of running the "upload" workflow would be to expose it through a GitHub Action in this repository, such that running the workflow only entails an authorized GitHub user clicking a "Run" button. Once this Action is in place, it could easily be expanded to automatically trigger new phylogenetic builds when the upload completes, just like we do in the seasonal-flu workflow.
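Once such an Action exists, an authorized user could also dispatch it from the GitHub CLI (a sketch; the workflow file name is hypothetical):

# Manually trigger the hypothetical ingest Action on the default branch.
gh workflow run ingest.yaml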

Checklist

  • Checks pass


This approach keeps a separate metadata file per segment to simplify
replacement of fauna download logic in the original workflow and allow
existing rules that expect segment-specific metadata (e.g., add segment
counts, etc.) to work without additional changes.
@jameshadfield (Member) commented:
> This approach keeps a separate metadata file per segment to simplify replacement of fauna download logic in the original workflow and allow existing rules that expect segment-specific metadata (e.g., add segment counts, etc.) to work without additional changes.

This doesn't have to be part of this PR, but a nicer interface to aim towards would be using a single metadata file and adding the segment counts to that file. Would simplify the snakemake workflow a bit. I'm not sure whether metadata fields would have to be joined across the inputs (i.e. is there metadata that's only supplied for some segments and not others).
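A minimal sketch of building that single file, assuming every per-segment metadata file shares identical columns (file names hypothetical); if some fields exist only for certain segments, a real join would be needed instead:

# Keep the header from the first file only, then the first row seen for
# each strain, assuming identical columns across all eight files.
awk -F'\t' 'FNR == 1 { if (NR == 1) print; next } !seen[$1]++' \
    metadata_{pb2,pb1,pa,ha,np,na,mp,ns}.tsv > metadata.tsv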

@huddlej (Contributor, Author) commented Apr 29, 2024

@jameshadfield Good call. The first commit was my attempt to get S3-based data working without breaking any downstream steps in the workflow. But @trvrb had the same request for a single metadata file, so I'll try this out for this PR. Maybe we can chat tomorrow about specifics, though?

In the meantime, I'll also fix the paths to input data for the CI builds.

Replaces unparsed sequences (with metadata in headers) with parsed
sequences and metadata as separate files. This change allows the CI
workflow to copy example data into the data directory and run the
workflow from these subtype- and segment-specific files, bypassing the
new download and filter-by-subtype rules.

One side effect of this change is that the subtype- and segment-specific
sequences and metadata now live in the `data/` directory instead of the
`results/` directory. This change makes this workflow more consistent
with other Nextstrain workflows like Zika, etc.
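In other words, CI can seed the workflow's starting points directly, along the lines of (paths hypothetical):

# Hypothetical sketch of the CI setup step: copy example sequences and
# metadata into data/ so the new download and filter-by-subtype rules
# are bypassed.
mkdir -p data
cp tests/example_data/*.fasta tests/example_data/*.tsv data/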
@trvrb (Member) commented Apr 30, 2024

I think we can plan to merge this PR, when we're happy with it, to include a single metadata file on S3. Then, in a separate PR, we can update the workflow to use the S3 files and switch to using the single metadata file.

Updates the "upload" workflow to create a single metadata file from the
8 individual metadata files by moving the "add segment counts" rule from
the main phylogenetic workflow to the upload workflow. As a result, all
subtypes have segment counts in their metadata regardless of whether the
"same strains" path through the phylogenetic workflow is used or not.

This commit updates the phylogenetic workflow to use a single metadata
file for all segments and retains the conditional input logic for adding
H5 clades for specific subtypes.
@huddlej huddlej marked this pull request as ready for review May 1, 2024 17:54
upload.smk (outdated):
sequences = "upload/results/sequences_{segment}.fasta",
metadata = "upload/results/metadata_{segment}.tsv",
params:
    fasta_fields = "strain virus isolate_id date region country division location host domestic_status subtype originating_lab submitting_lab authors PMID gisaid_clade h5_clade",
Member:
I dropped PMID in efb9e8b because it was almost fully empty. I think now would be the most appropriate time to slim columns from the S3 metadata that we're not actually using. Okay to drop here @lmoncla?

Member:
The metadata entry for h5_clade is also very incomplete. It's not used as a coloring and instead we're using either GISAID clade or LABEL clade. This would seem to just add confusion. How about dropping this as well?

@huddlej (Contributor, Author):
Happy to leave this change to someone who knows the data better. I think we could merge this PR first and refine metadata in future commits/PRs, though.

huddlej added 3 commits May 3, 2024 10:04
  • Adds an initial README and moves the upload workflow Snakefile into the standard structure for an ingest workflow.
  • Updates the upload workflow to work as a top-level ingest workflow through a standard Snakefile entry point. As part of this standardization, moves the script for adding segment counts into the ingest directory and updates the README to reflect constraints on how we need to run this workflow with the Nextstrain CLI (i.e., with the Docker runtime).
  • Avoids a situation where a user tries to run the ingest workflow with a different Nextstrain runtime that doesn't have fauna installed.
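A sketch of pinning the run to the Docker runtime, assuming the Nextstrain CLI's --docker selector (credentials as in the command above):

cd ingest
# Use the Docker runtime, which includes fauna, instead of an ambient setup.
nextstrain build --docker \
    --env RETHINK_HOST \
    --env RETHINK_AUTH_KEY \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    .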
@huddlej huddlej merged commit 8c3957b into master May 3, 2024
6 checks passed
@huddlej huddlej deleted the use-s3-storage branch May 3, 2024 23:05
jameshadfield added a commit that referenced this pull request May 6, 2024
Usage shifted to ingest workflow in <#22>
jameshadfield added a commit that referenced this pull request May 6, 2024
Benchmarks are newly added as of <#22>
jameshadfield added a commit that referenced this pull request May 6, 2024
This section wasn't updated with <#22>.
References to fauna are removed as they are now covered in the ingest's README.
trvrb added a commit that referenced this pull request May 7, 2024
With PR #22 merged there is a single metadata TSV under data/ that can be used in the genome workflow rather than relying on the HA metadata.
jameshadfield added a commit that referenced this pull request May 8, 2024
Data source paths had changed via #22