Update zika instructions
The old instructions were written for ViPR, which became obsolete and was
replaced by BV-BRC. They no longer work, and we have since moved to NCBI
Datasets for downloading sequences and metadata files.

The filtering steps are already part of the phylogenetic build, so they are
no longer a consideration during ingest. Point team members to the instructions
for ingesting recent Zika data and pushing it to the Nextstrain data endpoint,
and to the current phylogenetic build steps.
j23414 committed Nov 28, 2023
1 parent 5e967a6 commit 34236c4
Showing 1 changed file with 25 additions and 71 deletions.
96 changes: 25 additions & 71 deletions builds/ZIKA.md
@@ -1,93 +1,47 @@
 # ZIKA Pipeline Notes
 
-## Setup
+## Ingest data from NCBI GenBank
 
-1. Make sure environment variables for connecting to fauna are set.
-
-## Upload via ViPR and update citations
-
-### [ViPR sequences](https://www.viprbrc.org/brc/vipr_genome_search.spg?method=ShowCleanSearch&decorator=flavi_zika)
-
-1. Download sequences
-* Select year >= 2013 and genome length >= 5000
-* Download as Genome Fasta
-* Set Custom Format Fields to 0: GenBank Accession, 1: Strain Name, 2: Segment, 3: Date, 4: Host, 5: Country, 6: Subtype, 7: Virus Species
-* May also use the [ViPR API](https://www.viprbrc.org/brc/staticContent.spg?decorator=reo&type=ViprInfo&subtype=API)
-
-```
-curl "https://www.viprbrc.org/brc/api/sequence?datatype=genome&family=flavi&species=Zika%20virus&fromyear=2013&minlength=5000&metadata=genbank,strainname,segment,date,host,country,genotype,species&output=fasta" |\
-tr '-' '_' |\
-tr ' ' '_' |\
-sed 's:N/A:NA:g' >\
-GenomicFastaResults.fasta
-```
-
-The search-and-replace commands (`tr`, `sed`) are necessary because the API downloads fasta headers similar to:
-
-`>KY241742|ZIKV_SG_072|N/A|2016-08-28|Human|Singapore|Asian|Zika virus`
-
-but need to match the GUI downloaded headers similar to:
-
-`>KY241742|ZIKV_SG_072|NA|2016_08_28|Human|Singapore|Asian|Zika_virus`
-
-
-2. Move downloaded sequences to `fauna/data`
-3. Extract `GenomicFastaResults.tar.gz` and rename the extracted file to `GenomicFastaResults.fasta`
-4. Upload to vdb database
-* `python3 vdb/zika_upload.py -db vdb -v zika --source genbank --locus genome --fname GenomicFastaResults.fasta`
-
-### Update
-
-* Update citation fields
-* `python3 vdb/zika_update.py -db vdb -v zika --update_citations`
-* updates `authors`, `title`, `url`, `journal` and `puburl` fields from genbank files
-* If you get `ERROR: Couldn't connect with entrez, please run again` just run command again
-
-## Download from Fauna, parse, compress and push to S3
-
-### Download from Fauna
+Navigate to the nextstrain/zika repository and [follow the instructions for ingest](https://github.com/nextstrain/zika/tree/persephone/ingest).
 
 ```
-python3 vdb/download.py \
---database vdb \
---virus zika \
---fasta_fields strain virus accession collection_date region country division location source locus authors url title journal puburl \
---resolve_method choose_genbank \
---fstem zika
+git clone https://github.com/nextstrain/zika.git
+cd zika
+git checkout persephone
+cd ingest
+nextstrain build .
 ```
 
-This results in the file `data/zika.fasta` with FASTA header ordered as above.
+This results in the files `results/metadata.tsv` and `results/sequences.fasta`
 
-### Parse
+## Compress
 
 ```
-augur parse \
---sequences data/zika.fasta \
---output-sequences data/sequences.fasta \
---output-metadata data/metadata.tsv \
---fields strain virus accession date region country division city db segment authors url title journal paper_url \
---prettify-fields region country division city
+zstd -T0 results/sequences.fasta
+zstd -T0 results/metadata.tsv
 ```
 
-This results in the files `data/sequences.fasta` and `data/metadata.tsv`.
-
-### Compress
-
-```
-zstd -T0 data/sequences.fasta
-zstd -T0 data/metadata.tsv
-```
+This results in the files `results/sequences.fasta.zst` and `results/metadata.tsv.zst`.
 
-This results in the files `data/sequences.fasta.zst` and `data/metadata.tsv.zst`.
+## Upload data to s3
 
-### Push to S3
+Make sure environment variables for connecting to nextstrain remote are set.
 
 ```
-nextstrain remote upload s3://nextstrain-data/files/zika/ data/sequences.fasta.zst data/metadata.tsv.zst
+nextstrain remote upload s3://nextstrain-data/files/zika/ results/sequences.fasta.zst
+nextstrain remote upload s3://nextstrain-data/files/zika/ results/metadata.tsv.zst
 ```
 
 This pushes files to S3 to be made available at https://data.nextstrain.org/files/zika/sequences.fasta.zst and https://data.nextstrain.org/files/zika/metadata.tsv.zst.
 
 ## Run zika workflow
 
-See instructions at https://github.com/nextstrain/zika.
+See instructions at https://github.com/nextstrain/zika/tree/persephone/phylogenetic
+
+```
+git clone https://github.com/nextstrain/zika.git
+cd zika
+git checkout persephone
+cd phylogenetic
+nextstrain build .
+```

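A note alongside the commit, not part of the committed file: before compressing and uploading, it may be worth checking that the two ingest outputs agree with each other. The sketch below assumes that `results/metadata.tsv` has a single header line and one row per record in `results/sequences.fasta`. The function names are hypothetical helpers written for illustration; the nextstrain/zika repository does not provide them.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check for the ingest outputs (illustrative only).
# Assumption: metadata.tsv has one header line and one row per FASTA record.

# Count FASTA records, i.e. lines that start with ">".
count_fasta_records() {
  awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Count non-empty metadata rows, skipping the header line.
count_metadata_rows() {
  awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}

# Print both counts and succeed only when they are equal.
check_ingest_outputs() {
  local seqs rows
  seqs=$(count_fasta_records "$1")
  rows=$(count_metadata_rows "$2")
  echo "sequences=${seqs} metadata_rows=${rows}"
  [ "$seqs" -eq "$rows" ]
}
```

After sourcing these functions, running `check_ingest_outputs results/sequences.fasta results/metadata.tsv` from the `ingest` directory would return a non-zero status if the counts differ, which makes it easy to chain in front of the `zstd` and `nextstrain remote upload` steps with `&&`.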