Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement subsampling via a script #711

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Implement subsampling via a script #711

wants to merge 3 commits into from

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Aug 23, 2021

Description of proposed changes

This implements subsampling via a single script, rather than a series of snakemake commands. The behavior is intended to be the same, and the the subsampling definitions (in builds.yaml) have not changed. This script was initially developed as augur subsample in nextstrain/augur#762, however we are choosing to move development here to allow us to improve the proximity/priority calculations here. The eventual aim is for this functionality to be part of augur.

Verbosity
The subsample script contains overly verbose print statements. Some of these should be removed before merge, but they may be helpful for review.

Subsampling syntax
We introduce a new rule, extract_subsampling_scheme, which takes the subsampling definition (in builds.yaml), applies wildcard expansion, and then transforms this into a syntax more in line with how augur filter is called. This rule replaces a number of functions formerly in the snakefiles. As an example:

## builds.yaml
subsampling:
  custom-scheme:
    sampleName:
      exclude: "--exclude-where 'aus!=yes'"
      group_by: year month
      seq_per_group: 5

# output from rule extract_subsampling_scheme
sampleName:
  exclude-where:
  - aus!=yes
  group-by:
  - year
  - month
  sequences-per-group: 5

Going forward, I suggest slowly updaating our nCoV subsampling definitions to the latter format. This syntax is more in line with the rest of the augur ecosystem and avoids the sticking point where some values need to specify the --argument and some don't (this has tripped up people). It is also easier (I think) to reason with YAML arrays than strings which are then coerced into arrays.

DAG simplification

By essentially bringing the complexity into the subsample script (eventually to be augur subsample), we have a simpler snakemake DAG, with far fewer rules, which should be easier to port to other workflow languages. The DAG for subsampling now only contains two steps per build: extract_subsampling_scheme (see above) and subsample.

image

DAG for Nextstrain Open (GenBank) builds. Top: current master, bottom: this PR

Future improvements

  • Instead of using augur filter’s run() function, we could refactor that function slightly to have a function which returns a strain list for inclusion, as well as logging data etc. Currently the run() function writes data to disk which we immediately read in.
  • Similarly, this work doesn't change the behavior of the priorities.py and get_distance_to_focal_set.py scripts. These can be refactored to return data rather than writing to disk.
  • Include / exclude files are defined by arguments to augur subsample, however they could be set within the scheme definition, which would allow per-sample differences. (Breaking change.)
  • Allow a log file to be written explaining why strains were not included (or why they were!) which leverages the new logging functionality of augur filter
  • Expand priorities schema definition so that sample definitions define ignore_seqs, which is currently hardcoded to be "Wuhan/Hu-1/2019". Similarly for proximity / priority config parameters. (Breaking change.)
  • For samples which can be computed in parallel (e.g. those which don’t need priorities), in principle we could refactor the code in augur filter to allow independent sets of (sets of) filters to be applied each time we iterate over a chunk of metadata. This should speed things up quite a bit, but would come at increased code complexity. I don’t think this is immediately necessary.
  • Proximity calculations can be slow for large datasets. We can speed this up by prefiltering based on certain settings - for instance, if we are computing a sample with a --min-date and proximity to another sample, we can do this in two stages so that the proximity calculations are run against a pre-filtered set (via `--min-date).

Related issue(s)

Related to nextstrain/augur#635

Testing

Test builds triggered (via GitHub). I will update this section when they complete. Update: they failed. Updated PR, and AWS info below updated.

## open
 nextstrain build --aws-batch --attach 8008fccf-24de-4d64-86c8-d5e0a887cc54   .
https://nextstrain.org/staging/ncov/open/trial/15426a2dbfbb0739560c431c1db1da7d0f60184b/{global, ...}

AWS batch console link

Release checklist

This PR should not introduce any backwards incompatible changes

  • Update docs/change_log.md in this pull request to document these changes by the date they were added.

This is in preparation for a  ubsample script to call these
python functions directly, as opposed to scripts called from the
command line. In time, this script will become `augur subsample`.

The work done by the functions is unchanged, thus they still don't
return any information rather they write information to disk. Future
work will change this so that information can flow back to the calling
program without needing to be written to disk.
This implements a new script to encapsulate the subsampling logic formerly encoded in snakemake rules. This is in preparation for moving this script to the augur repo where it will become `augur subsample`. (We have chosen to develop this in the ncov repo for simplicity.)

The script currently uses the same approach as the former snakemake rules, however python functions are called rather than scripts / augur commands. Briefly, the steps are:

1. A subsampling scheme is provided, parsed, validated, and turned into a simple graph to indicate which samples rely on other samples having been computed (i.e. which are needed for priorities)
2. Each sample is computed by calling the run function of augur filter
3. If priorities need to be calculated for a sample to be computed, this is achieved by calling functions from the two existing scripts.
4. The set of sequences to include in each sample is combined, and outputs written.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

2 participants