
Download data from S3 to start workflow #22

Merged · huddlej merged 8 commits into master from use-s3-storage on May 3, 2024
Conversation

@huddlej (Contributor) commented Apr 29, 2024

Description of proposed changes

Replaces the workflow logic for downloading data from fauna with 1) a custom workflow that downloads from fauna, parses sequences and metadata, and uploads to S3, and 2) new main workflow logic that downloads the parsed sequences/metadata from S3 and filters to the requested subtype before continuing the rest of the workflow.

One major change in this implementation is the replacement of one metadata file per subtype and segment with a single metadata file across all subtypes and segments. The metadata file includes an n_segments column with the number of segment sequences available for each metadata record, which allows the original "same strains" path through the phylogenetic workflow to work.
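As a rough sketch (file names are hypothetical), a column like this can be derived by tallying how many per-segment metadata files mention each strain:

# Hypothetical sketch: count how many of the eight per-segment metadata
# files contain each strain, assuming strain is the first column.
awk -F'\t' 'FNR > 1 { n[$1]++ } END { for (s in n) print s "\t" n[s] }' \
    metadata_{pb2,pb1,pa,ha,np,na,mp,ns}.tsv > n_segments.tsv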

To run the upload to S3:

cd ingest
nextstrain build \
    --env RETHINK_HOST \
    --env RETHINK_AUTH_KEY \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    .

See the ingest README for more details.

After the upload, S3 will have one metadata file for all subtypes and segments and one sequences file per gene segment across all subtypes, like:

  • s3://nextstrain-data-private/files/workflows/avian-flu/metadata.tsv.zst
  • s3://nextstrain-data-private/files/workflows/avian-flu/ha/sequences.fasta.zst
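To spot-check the uploaded files, something like the following should work (a sketch assuming the AWS CLI and zstd are installed and AWS credentials are configured):

# Stream the compressed metadata from S3 and peek at the first records.
aws s3 cp s3://nextstrain-data-private/files/workflows/avian-flu/metadata.tsv.zst - \
    | zstd -dc \
    | head -n 5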

What this means for users

The changes in this PR will be breaking changes for some users, including people who currently have credentials to access fauna but do not have AWS credentials to access the private bucket above. We will need to issue these users AWS credentials that provide at least read access to nextstrain-data-private, and they will need to learn how to pass those credentials to tools like the Nextstrain CLI (e.g., via the --env flags shown above or through the envdir argument, as sketched below).

Users who want to run the upload workflow will need read/write access to the private bucket. Ideally, we could limit the number of users who need these permissions by building the GitHub Action described in the next steps below.
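For example, a read-only user could store credentials in an envdir and pass the directory to the CLI (a sketch assuming the CLI's --envdir option; the directory path and placeholder values are hypothetical):

# Store each credential as a file named after the environment variable.
mkdir -p ~/.config/avian-flu-env
echo "<access key id>" > ~/.config/avian-flu-env/AWS_ACCESS_KEY_ID
echo "<secret access key>" > ~/.config/avian-flu-env/AWS_SECRET_ACCESS_KEY

# Pass the whole directory of variables to the workflow run.
nextstrain build --envdir ~/.config/avian-flu-env .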

Next steps

One immediate improvement to the user experience of running the "upload" workflow would be to expose it through a GitHub Action in this repository, such that running the workflow only entails an authorized GitHub user clicking a "Run" button. Once this Action is in place, it could easily be expanded to automatically trigger new phylogenetic builds when the upload completes, just like we do in the seasonal-flu workflow.
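Once such an Action exists, an authorized user could also dispatch it from the GitHub CLI (a sketch; the workflow file name is hypothetical):

# Manually trigger the hypothetical ingest Action on the default branch.
gh workflow run ingest.yaml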

Checklist

  • Checks pass


This approach keeps a separate metadata file per segment to simplify
replacement of fauna download logic in the original workflow and allow
existing rules that expect segment-specific metadata (e.g., add segment
counts, etc.) to work without additional changes.
@jameshadfield (Member) commented:
> This approach keeps a separate metadata file per segment to simplify replacement of fauna download logic in the original workflow and allow existing rules that expect segment-specific metadata (e.g., add segment counts, etc.) to work without additional changes.

This doesn't have to be part of this PR, but a nicer interface to aim towards would be using a single metadata file and adding the segment counts to that file. Would simplify the snakemake workflow a bit. I'm not sure whether metadata fields would have to be joined across the inputs (i.e. is there metadata that's only supplied for some segments and not others).
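A minimal sketch of building that single file, assuming every per-segment metadata file shares identical columns (file names hypothetical); if some fields exist only for certain segments, a real join would be needed instead:

# Keep the header from the first file only, then the first row seen for
# each strain, assuming identical columns across all eight files.
awk -F'\t' 'FNR == 1 { if (NR == 1) print; next } !seen[$1]++' \
    metadata_{pb2,pb1,pa,ha,np,na,mp,ns}.tsv > metadata.tsv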

@huddlej (Contributor, Author) commented Apr 29, 2024

@jameshadfield Good call. The first commit was my attempt to get S3-based data working without breaking any downstream steps in the workflow. But @trvrb had the same request for a single metadata file, so I'll try this out for this PR. Maybe we can chat tomorrow about specifics, though?

In the meantime, I'll also fix the paths to input data for the CI builds.

Replaces unparsed sequences (with metadata in headers) with parsed
sequences and metadata as separate files. This change allows the CI
workflow to copy example data into the data directory and run the
workflow from these subtype- and segment-specific files, bypassing the
new download and filter-by-subtype rules.

One side effect of this change is that the subtype- and segment-specific
sequences and metadata now live in the `data/` directory instead of the
`results/` directory. This change makes this workflow more consistent
with other Nextstrain workflows like Zika, etc.
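In other words, CI can seed the workflow's starting points directly, along the lines of (paths hypothetical):

# Hypothetical sketch of the CI setup step: copy example sequences and
# metadata into data/ so the new download and filter-by-subtype rules
# are bypassed.
mkdir -p data
cp tests/example_data/*.fasta tests/example_data/*.tsv data/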
@trvrb (Member) commented Apr 30, 2024

I think we can plan to merge this PR, when we're happy with it, to include a single metadata file on S3. Then, in a separate PR, we can update the workflow to use the S3 files and switch to using the single metadata file.

Updates the "upload" workflow to create a single metadata file from the
8 individual metadata files by moving the "add segment counts" rule from
the main phylogenetic workflow to the upload workflow. As a result, all
subtypes have segment counts in their metadata regardless of whether the
"same strains" path through the phylogenetic workflow is used or not.

This commit updates the phylogenetic workflow to use a single metadata
file for all segments and retains the conditional input logic for adding
H5 clades for specific subtypes.
@huddlej huddlej marked this pull request as ready for review May 1, 2024 17:54
upload.smk (outdated):
sequences = "upload/results/sequences_{segment}.fasta",
metadata = "upload/results/metadata_{segment}.tsv",
params:
    fasta_fields = "strain virus isolate_id date region country division location host domestic_status subtype originating_lab submitting_lab authors PMID gisaid_clade h5_clade",
Member:
I dropped PMID in efb9e8b because it was almost fully empty. I think now would be the most appropriate time to slim columns from the S3 metadata that we're not actually using. Okay to drop here @lmoncla?

Member:
The metadata entry for h5_clade is also very incomplete. It's not used as a coloring and instead we're using either GISAID clade or LABEL clade. This would seem to just add confusion. How about dropping this as well?

@huddlej (Contributor, Author):
Happy to leave this change to someone who knows the data better. I think we could merge this PR first and refine metadata in future commits/PRs, though.

huddlej added 3 commits May 3, 2024 10:04
  • Adds an initial README and moves the upload workflow Snakefile into the standard structure for an ingest workflow.
  • Updates the upload workflow to work as a top-level ingest workflow through a standard Snakefile entry point. As part of this standardization, moves the script for adding segment counts into the ingest directory and updates the README to reflect constraints on how we need to run this workflow with the Nextstrain CLI (i.e., with the Docker runtime).
  • Avoids a situation where a user tries to run the ingest workflow with a different Nextstrain runtime that doesn't have fauna installed.
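A sketch of pinning the run to the Docker runtime, assuming the Nextstrain CLI's --docker selector (credentials as in the command above):

cd ingest
# Use the Docker runtime, which includes fauna, instead of an ambient setup.
nextstrain build --docker \
    --env RETHINK_HOST \
    --env RETHINK_AUTH_KEY \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    .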
@huddlej huddlej merged commit 8c3957b into master May 3, 2024
6 checks passed
@huddlej huddlej deleted the use-s3-storage branch May 3, 2024 23:05
jameshadfield added a commit that referenced this pull request May 6, 2024
Usage shifted to ingest workflow in <#22>
jameshadfield added a commit that referenced this pull request May 6, 2024
Benchmarks are newly added as of <#22>
jameshadfield added a commit that referenced this pull request May 6, 2024
This section wasn't updated with <#22>.
References to fauna are removed as they are now covered in the ingest's README.
trvrb added a commit that referenced this pull request May 7, 2024
With PR #22 merged there is a single metadata TSV under data/ that can be used in the genome workflow rather than relying on the HA metadata.
jameshadfield added a commit that referenced this pull request May 8, 2024
Data source paths had changed via #22