Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
The old instructions were written for ViPR, which became obsolete and was replaced by BV-BRC. They no longer work, and we have since moved to NCBI Datasets for downloading sequences and metadata files. The filtering steps are already part of the phylogenetic build steps, so they are no longer a consideration during ingest.

Point team members to the instructions for ingesting recent Zika data and pushing it to the Nextstrain data endpoint, and to the current phylogenetic build steps.
Showing 1 changed file with 25 additions and 71 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,93 +1,47 @@ | ||
# ZIKA Pipeline Notes | ||
|
||
## Setup | ||
## Ingest data from NCBI GenBank | ||
|
||
1. Make sure environment variables for connecting to fauna are set. | ||
|
||
## Upload via ViPR and update citations | ||
|
||
### [ViPR sequences](https://www.viprbrc.org/brc/vipr_genome_search.spg?method=ShowCleanSearch&decorator=flavi_zika) | ||
|
||
1. Download sequences | ||
* Select year >= 2013 and genome length >= 5000 | ||
* Download as Genome Fasta | ||
* Set Custom Format Fields to 0: GenBank Accession, 1: Strain Name, 2: Segment, 3: Date, 4: Host, 5: Country, 6: Subtype, 7: Virus Species | ||
* May also use the [ViPR API](https://www.viprbrc.org/brc/staticContent.spg?decorator=reo&type=ViprInfo&subtype=API) | ||
|
||
``` | ||
curl "https://www.viprbrc.org/brc/api/sequence?datatype=genome&family=flavi&species=Zika%20virus&fromyear=2013&minlength=5000&metadata=genbank,strainname,segment,date,host,country,genotype,species&output=fasta" |\ | ||
tr '-' '_' |\ | ||
tr ' ' '_' |\ | ||
sed 's:N/A:NA:g' >\ | ||
GenomicFastaResults.fasta | ||
``` | ||
|
||
The search-and-replace commands (`tr`, `sed`) are necessary because the API downloads fasta headers similar to: | ||
|
||
`>KY241742|ZIKV_SG_072|N/A|2016-08-28|Human|Singapore|Asian|Zika virus` | ||
|
||
but need to match the GUI downloaded headers similar to: | ||
|
||
`>KY241742|ZIKV_SG_072|NA|2016_08_28|Human|Singapore|Asian|Zika_virus` | ||
|
||
|
||
2. Move downloaded sequences to `fauna/data` | ||
3. Extract `GenomicFastaResults.tar.gz` and rename the extracted file to `GenomicFastaResults.fasta` | ||
4. Upload to vdb database | ||
* `python3 vdb/zika_upload.py -db vdb -v zika --source genbank --locus genome --fname GenomicFastaResults.fasta` | ||
|
||
### Update | ||
|
||
* Update citation fields | ||
* `python3 vdb/zika_update.py -db vdb -v zika --update_citations` | ||
* updates `authors`, `title`, `url`, `journal` and `puburl` fields from genbank files | ||
* If you get `ERROR: Couldn't connect with entrez, please run again` just run command again | ||
|
||
## Download from Fauna, parse, compress and push to S3 | ||
|
||
### Download from Fauna | ||
Navigate to the nextstrain/zika repository and [follow the instructions for ingest](https://github.com/nextstrain/zika/tree/persephone/ingest). | ||
|
||
``` | ||
python3 vdb/download.py \ | ||
--database vdb \ | ||
--virus zika \ | ||
--fasta_fields strain virus accession collection_date region country division location source locus authors url title journal puburl \ | ||
--resolve_method choose_genbank \ | ||
--fstem zika | ||
git clone https://github.com/nextstrain/zika.git | ||
cd zika | ||
git checkout persephone | ||
cd ingest | ||
nextstrain build . | ||
``` | ||
|
||
This results in the file `data/zika.fasta` with FASTA header ordered as above. | ||
This results in the files `results/metadata.tsv` and `results/sequences.fasta` | ||
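
Before compressing, it may be worth sanity-checking the two ingest outputs. The following is a minimal sketch, not part of the nextstrain/zika repository: `check_ingest_outputs` is a hypothetical helper name, and it assumes that the ingest writes exactly one metadata row per FASTA record and that `metadata.tsv` has a single header line. Neither assumption is stated in the instructions above.

```shell
# Hypothetical helper (an assumption, not part of nextstrain/zika):
# compare the number of FASTA records with the number of metadata rows.
check_ingest_outputs() {
    sequences="$1"
    metadata="$2"

    # Each FASTA record starts with a ">" header line.
    # grep -c exits non-zero when it counts 0 matches, hence "|| true".
    n_seqs=$(grep -c '^>' "$sequences" || true)

    # Assumes one TSV header line; every later line is one metadata row.
    n_meta=$(tail -n +2 "$metadata" | wc -l | tr -d ' ')

    echo "sequences: $n_seqs, metadata rows: $n_meta"

    # Succeed only when the two counts agree.
    [ "$n_seqs" -eq "$n_meta" ]
}
```

From the `ingest` directory this could be called as `check_ingest_outputs results/sequences.fasta results/metadata.tsv`. A non-zero exit status means the counts differ and the outputs deserve a closer look before upload.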
|
||
### Parse | ||
## Compress | ||
|
||
``` | ||
augur parse \ | ||
--sequences data/zika.fasta \ | ||
--output-sequences data/sequences.fasta \ | ||
--output-metadata data/metadata.tsv \ | ||
--fields strain virus accession date region country division city db segment authors url title journal paper_url \ | ||
--prettify-fields region country division city | ||
zstd -T0 results/sequences.fasta | ||
zstd -T0 results/metadata.tsv | ||
``` | ||
|
||
This results in the files `data/sequences.fasta` and `data/metadata.tsv`. | ||
|
||
### Compress | ||
|
||
``` | ||
zstd -T0 data/sequences.fasta | ||
zstd -T0 data/metadata.tsv | ||
``` | ||
This results in the files `results/sequences.fasta.zst` and `results/metadata.tsv.zst`. | ||
|
||
This results in the files `data/sequences.fasta.zst` and `data/metadata.tsv.zst`. | ||
## Upload data to S3 | ||
|
||
### Push to S3 | ||
Make sure environment variables for connecting to nextstrain remote are set. | ||
|
||
``` | ||
nextstrain remote upload s3://nextstrain-data/files/zika/ data/sequences.fasta.zst data/metadata.tsv.zst | ||
nextstrain remote upload s3://nextstrain-data/files/zika/ results/sequences.fasta.zst | ||
nextstrain remote upload s3://nextstrain-data/files/zika/ results/metadata.tsv.zst | ||
``` | ||
|
||
This pushes files to S3 to be made available at https://data.nextstrain.org/files/zika/sequences.fasta.zst and https://data.nextstrain.org/files/zika/metadata.tsv.zst. | ||
|
||
## Run zika workflow | ||
|
||
See instructions at https://github.com/nextstrain/zika. | ||
See instructions at https://github.com/nextstrain/zika/tree/persephone/phylogenetic | ||
|
||
``` | ||
git clone https://github.com/nextstrain/zika.git | ||
cd zika | ||
git checkout persephone | ||
cd phylogenetic | ||
nextstrain build . | ||
``` |