augur subsample command #635

Open
Tracked by #1481
jameshadfield opened this issue Dec 2, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@jameshadfield
Member

jameshadfield commented Dec 2, 2020

Tasks

@victorlin to fill this out

Links


Original issue

A common use case is versatile subsampling of datasets to suit a particular research question. The current best example of this is the (wonderful) SARS-CoV-2 pipeline, which leverages an augur filter rule, a script to calculate priorities, and snakemake wizardry to allow versatile, declarative subsampling schemes to be simply and intuitively defined.

This allows a simple-to-reason-with YAML file to result in a very bespoke subsampling scheme:
image

The question arises: how do we do this for a different pathogen?

As the SARS-CoV-2 example leverages snakemake, one solution would be to abstract that logic into an importable snakemake rule. The alternative approach would be a new augur command, augur subsample, which takes a YAML file declaring the desired subsampling settings. Learning from our work on nCoV, this would essentially replace the snakemake-controlled augur filter commands with a single augur subsample command. The YAML file would look similar or identical to the current snakemake implementation. The subcommand would leverage the functions used by augur filter as well as the priorities script from nCoV.

Thoughts?

Examples

subsampling.yaml:

schemes:
  switzerland:
    # Focal samples for country
    country:
      group_by: "division year month"
      max_sequences: 1500
      exclude: "--exclude-where 'country!={country}'"
    # Contextual samples from country's region
    region:
      group_by: "country year month"
      seq_per_group: 20
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    # Contextual samples from the rest of the world,
    # excluding the current region to avoid resampling.
    global:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"

augur subsample --include <TXT> --sequences <FASTA> \
    --metadata <TSV> --schemes <YAML> --output <FASTA>
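
To make the proposal concrete, here is a rough sketch of how one sample from the scheme above might be translated into an augur filter invocation. It is only an illustration under several assumptions: the function name `build_filter_command`, the output file naming, and the mapping of `seq_per_group` and `max_sequences` onto `--sequences-per-group` and `--subsample-max-sequences` are hypothetical choices, not a settled design. The scheme is written as a plain dict so the sketch does not depend on a YAML parser, and the `priorities` entries are ignored.

```python
import shlex

# Two samples from the "switzerland" scheme above, as a plain dict so the
# sketch needs no YAML parser. The `priorities` settings are left out here.
SCHEME = {
    "country": {
        "group_by": "division year month",
        "max_sequences": 1500,
        "exclude": "--exclude-where 'country!={country}'",
    },
    "region": {
        "group_by": "country year month",
        "seq_per_group": 20,
        "exclude": "--exclude-where 'country={country}' 'region!={region}'",
    },
}


def build_filter_command(name, sample, params, sequences, metadata):
    """Return an argument list for one sample (hypothetical helper)."""
    cmd = ["augur", "filter", "--sequences", sequences, "--metadata", metadata]
    cmd += ["--group-by", *sample["group_by"].split()]
    if "seq_per_group" in sample:
        cmd += ["--sequences-per-group", str(sample["seq_per_group"])]
    if "max_sequences" in sample:
        cmd += ["--subsample-max-sequences", str(sample["max_sequences"])]
    if "exclude" in sample:
        # Fill in {country} / {region}, then split the string shell-style.
        cmd += shlex.split(sample["exclude"].format(**params))
    # Output naming is an assumption made for this sketch.
    cmd += ["--output", f"sample-{name}.fasta"]
    return cmd


if __name__ == "__main__":
    params = {"country": "Switzerland", "region": "Europe"}
    for name, sample in SCHEME.items():
        print(shlex.join(build_filter_command(
            name, sample, params, "sequences.fasta", "metadata.tsv")))
```

An augur subsample command would presumably run the equivalent of each of these internally and then merge the per-sample outputs, rather than shelling out.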
@huddlej
Contributor

huddlej commented Jul 29, 2021

After our recent conversations internally and with @dpark01 about reducing the complexity of the ncov workflow and improving the portability of the existing workflow with other workflow languages and/or platforms, I'm bumping this here as a higher priority issue and moving it from the "backlog" to the "next up".

huddlej added the "priority: high" and "needs triage" labels on Jul 29, 2021
@dpark01

dpark01 commented Jul 29, 2021

Here is my current hack--would love to replace all that with augur subsample

It would be nice if a command like this could emit as output a numeric count of selected samples in each deme.
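
As a rough illustration of the per-deme count described here, the following sketch tallies selected strains by a metadata column. The function name `count_by_deme`, the `strain` ID column, and the choice of `country` as the deme column are assumptions made for the example, not part of any existing augur interface.

```python
import csv
import io
from collections import Counter


def count_by_deme(metadata_tsv, selected, deme_column="country", id_column="strain"):
    """Count selected strains per deme from a tab-separated metadata file object."""
    reader = csv.DictReader(metadata_tsv, delimiter="\t")
    return Counter(
        row[deme_column] for row in reader if row[id_column] in selected
    )


if __name__ == "__main__":
    # Tiny made-up metadata table, for illustration only.
    metadata = io.StringIO(
        "strain\tcountry\n"
        "A\tSwitzerland\n"
        "B\tSwitzerland\n"
        "C\tFrance\n"
        "D\tItaly\n"
    )
    for deme, n in sorted(count_by_deme(metadata, {"A", "B", "C"}).items()):
        print(f"{deme}\t{n}")
```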

@jameshadfield
Member Author

PR #762 begins an implementation of augur subsample

@victorlin
Member

Update: we've had internal discussions considering this again with a different YAML schema and the addition of weighted sampling (#1318).

victorlin removed the "priority: high" and "needs triage" labels on Apr 18, 2024
victorlin changed the title from "RFC: augur subsample command" to "augur subsample command" on Apr 18, 2024
8 participants