
Use Snakemake HTTP remote to download starting points #15

Merged: trvrb merged 5 commits into master from http-download on Mar 22, 2022
Conversation

trvrb (Member) commented Mar 22, 2022

Here, I chose to use snakemake.remote.HTTP as it reads from the CloudFront-backed https://data.nextstrain.org rather than the S3 bucket nextstrain-data. I did this for two reasons:

  1. The CloudFront-backed data.nextstrain.org should be more efficient to download from and sets a good pattern.
  2. I liked surfacing the simple URLs data.nextstrain.org/files/zika/sequences.fasta.xz and data.nextstrain.org/files/zika/metadata.tsv.gz directly in the workflow.

I also used metadata.tsv.gz rather than metadata.tsv.xz to mirror what we do for ncov.
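A minimal sketch of this pattern, with assumed rule and output names (the Snakefile in this PR may differ):

```python
# Sketch only: fetch starting points via Snakemake's HTTP remote provider.
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()  # defaults to https://

rule download:
    input:
        sequences=HTTP.remote("data.nextstrain.org/files/zika/sequences.fasta.xz", keep_local=True),
        metadata=HTTP.remote("data.nextstrain.org/files/zika/metadata.tsv.gz", keep_local=True),
    output:
        sequences="data/sequences.fasta.xz",  # assumed local paths
        metadata="data/metadata.tsv.gz",
    shell:
        """
        mv {input.sequences} {output.sequences}
        mv {input.metadata} {output.metadata}
        """
```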

trvrb requested a review from j23414 on March 22, 2022 at 01:16
j23414 (Contributor) commented Mar 22, 2022

Ah, I agree with your changes to use HTTP.remote instead of S3!

I like having the same compression method for metadata and sequences, but am willing to give this a pass so that metadata.tsv.gz is consistent with ncov.

Optional: Do we want to modify the .travis test to run on a smaller dataset (split out the example_data/zika.fasta into its own metadata.tsv.gz and sequences.fasta.xz)? (I'm happy to add this but no pressure.)

Right now the test runs on the full dataset, which is perfectly okay if we're adding a final deploy step at some point.

trvrb (Member, Author) commented Mar 22, 2022

> I like having the same compression method for metadata and sequences, but am willing to give this a pass so that metadata.tsv.gz is consistent with ncov.

We can revisit with the larger group, but I thought there was a reason that .gz was chosen for metadata in this case; I just don't remember what it was. In splitting between .xz and .gz I assumed there was some rationale. Maybe let's start with .gz for metadata and follow up.

> Optional: Do we want to modify the .travis test to run on a smaller dataset (split out the example_data/zika.fasta into its own metadata.tsv.gz and sequences.fasta.xz)? (I'm happy to add this but no pressure.)
>
> Right now the test runs on the full dataset, which is perfectly okay if we're adding a final deploy step at some point.

Thanks for catching this. I think Travis should run quickly on small example data; this is also how ncov works. If you could update the PR, that would be great.

j23414 (Contributor) commented Mar 22, 2022

Finished connecting the smaller dataset, and checks passed (down to 4 minutes).

Looks good to merge on my end, though feel free to make changes and/or suggestions.

This swaps to downloading via curl rather than the Snakemake remote input through the HTTP provider. This is more straightforward and avoids an issue with identification of gzip encoding by the HTTP provider.

Switching to uncompressed example data to make it easier for someone to understand the file format via GitHub inspection.
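A rough sketch of the curl-based replacement described above (output paths are assumptions, not taken from the diff; plain curl leaves the response bytes exactly as served, sidestepping the HTTP provider's transparent gzip decoding):

```python
# Sketch only: download compressed starting points with curl.
rule download:
    output:
        sequences="data/sequences.fasta.xz",
        metadata="data/metadata.tsv.gz",
    shell:
        """
        curl -fsSL https://data.nextstrain.org/files/zika/sequences.fasta.xz --output {output.sequences}
        curl -fsSL https://data.nextstrain.org/files/zika/metadata.tsv.gz --output {output.metadata}
        """
```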
trvrb (Member, Author) commented Mar 22, 2022

This is now working and documented. I'm going to merge this PR.

trvrb merged commit d98d4c7 into master on Mar 22, 2022
trvrb deleted the http-download branch on March 22, 2022 at 23:21
tsibley (Member) left a comment:

A few post-merge notes.

Comment on lines +53 to +55:

```
[https://data.nextstrain.org/files/zika/sequences.fasta.xz](data.nextstrain.org/files/zika/sequences.fasta.xz)
and metadata from
[https://data.nextstrain.org/files/zika/metadata.tsv.gz](data.nextstrain.org/files/zika/metadata.tsv.gz).
```
tsibley (Member):

These links are broken because the URL part doesn't include the scheme (https://). When the link text should be the same as the URL, I'd suggest relying on auto-linking of bare URLs.

Suggested change:

```diff
-[https://data.nextstrain.org/files/zika/sequences.fasta.xz](data.nextstrain.org/files/zika/sequences.fasta.xz)
-and metadata from
-[https://data.nextstrain.org/files/zika/metadata.tsv.gz](data.nextstrain.org/files/zika/metadata.tsv.gz).
+https://data.nextstrain.org/files/zika/sequences.fasta.xz
+and metadata from
+https://data.nextstrain.org/files/zika/metadata.tsv.gz.
```

Comment on lines +57 to +59:

```
from NCBI GenBank via ViPR and performing additional bespoke curation. Our
curation is described
[here](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md).
```
tsibley (Member):

Links like "here" and "click here" are an anti-pattern. Among their issues is that they impede the accessibility of the links. I'd suggest linking the previous reference to curation instead:

Suggested change:

```diff
-from NCBI GenBank via ViPR and performing additional bespoke curation. Our
-curation is described
-[here](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md).
+from NCBI GenBank via ViPR and performing
+[additional bespoke curation](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md).
```

Snakefile: resolved review comment (collapsed)
Comment on lines +44 to +45:

```
gzip --decompress --keep {input.metadata}
xz --decompress --keep {input.sequences}
```
tsibley (Member):

Was there a reason you chose to a) decompress separately and b) keep the compressed copies around? My instinct would be to decompress on the fly during download, making this whole decompress rule unnecessary and avoiding the double disk-space usage (which, while insignificant for Zika, sets what I think is a bad precedent).

j23414 (Contributor):

I vote to drop --keep in order to remove unnecessary intermediate files. Although I'd also vote for xz over gzip for its smaller footprint on disk. ;) I have no strong opinions on "decompress on the fly during download" and will follow the group decision. @trvrb feel free to comment on your flag decisions.
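For concreteness, the decompress-on-the-fly alternative raised above could look roughly like this sketch (paths are assumptions), dropping both the separate decompress rule and the on-disk compressed copies:

```python
# Sketch only: stream each download straight through its decompressor.
rule download:
    output:
        sequences="data/sequences.fasta",
        metadata="data/metadata.tsv",
    shell:
        """
        curl -fsSL https://data.nextstrain.org/files/zika/sequences.fasta.xz | xz --decompress > {output.sequences}
        curl -fsSL https://data.nextstrain.org/files/zika/metadata.tsv.gz | gzip --decompress > {output.metadata}
        """
```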
