A curated cyanobacterial 16S rRNA gene reference package for sequence placement and de novo phylogenetic analysis
Daniel Roush, Ana Giraldo-Silva, and Ferran Garcia-Pichel
1School of Life Sciences, Arizona State University, 85282 Tempe, Arizona, USA
2Center for Fundamental and Applied Microbiomics, Biodesign Institute, Arizona State University, 85282 Tempe, Arizona, USA
Cydrasil is a comprehensive, curated, and community-built cyanobacterial reference package containing over 1300 16S rRNA gene sequences with lengths exceeding 1100 base pairs. Cydrasil offers a curated alignment and a maximum-likelihood phylogenetic tree that can be used for sequence placement or de novo phylogenetic reconstructions. Cydrasil utilizes phylogenetic placement to give you a complete phylogenetic perspective of your new isolate or amplicon survey.
Total Sequences: | 1327 |
---|---|
Cyanobacteria: | 1288 |
Sibling Clades: | 34 |
Plastids: | 5 |
File/Folder Name | Description |
---|---|
cydrasil-v3-reference-sequence-list.fasta | FASTA file containing the entire Cydrasil reference sequence set used as input for phylogenetic analysis |
cydrasil-v3-database.json | JSON formatted database containing metadata about Cydrasil reference sequences. The file includes a link to the source repository and any warnings about the sequence. |
cydrasil-v3-database.tsv | TSV formatted database containing metadata about Cydrasil reference sequences. The file includes a link to the source repository and any warnings about the sequence. |
cydrasil-v3-reference-alignment.afa | Aligned FASTA formatted version of the Cydrasil reference alignment. |
cydrasil-v3-reference-alignment.phy | Aligned relaxed PHYLIP formatted version of the Cydrasil reference alignment. |
cydrasil-v3-ssu-align.bacteria.mask | SSU-Align mask file for use with the SSU-Align version of the pipeline |
cydrasil-v3-reference-tree.nwk | Newick formatted Cydrasil reference phylogenetic tree. |
cydrasil-v3.bestModel | File containing model parameters for the generation of the reference tree. Needed for the model parameter specification (--model) for EPA-ng. |
Cydrasil was created originally as a comprehensive cyanobacterial 16S rRNA gene reference package to be used with phylogenetic placement algorithms like pplacer,
RAxML epa, and
EPA-ng. When we originally populated the database,
we had strict inclusion criteria for reference sequences.
- The sequence must come from an isolated strain or a single-cell genome. Exceptions were made for metagenome-assembled genomes on a case by case basis after manual review of the genome (needed for most sibling clade sequences).
- Each reference sequence must be 1100 base pairs or longer. The minimum length was chosen as a compromise between strain coverage and sequence information for phylogenetic reconstruction. Of note, this excludes all cyanobacteria sequenced using the Nübel et al. 1997 cyanobacteria-specific primers.
Cydrasil is populated with sequences collected from NCBI, JGI-IMG/M, and user submissions. The reference alignment was generated using SSU-Align1 with default parameters. This aligner uses a profile-based alignment strategy, in which each target sequence is aligned independently to a covariance model that uses the 16S rRNA gene secondary structure, and then masked using SSUMASK with the automatically computed alignment confidence values (posterior probabilities). A maximum-likelihood phylogenetic tree was then generated using the RAxML-HPC22 Workflow on XSEDE (8.2.12) on the CIPRES Science Gateway3. The ML + thorough bootstrap workflow was used with the following modified parameters: 1000 bootstraps (-N 1000) and the GTRGAMMA model. All other parameters were left at default values.
-
A FASTA file containing sequences of interest (typically Cyanobacterial sequences extracted from the reference sequences file output from Qiime 1/2) is aligned to the reference alignment.
Here you can use either PaPaRa4 or SSU-Align1 to align your query sequences to the Cydrasil reference alignment. The reference package contains a phylip (.phy) alignment (Cydrasil reference alignment) to use with PaPaRa.
PaPaRa Information
PaPaRa GithibFlag Description Cydrasil Filename -t Reference tree in newick (.nwk) format cydrasil-v3-reference-tree.nwk -s Reference alignment in phylip (.phy) format cydrasil-v3-reference-alignment.phy -q FASTA file containing sequences to be aligned (query sequences) to the reference alignment -n Name of output alignment -r Prevent PaPaRa from adding gaps in the reference alignment papara -t cydrasil-v3-reference-tree.nwk -s cydrasil-v3-reference-alignment.phy -q query-seqs.fasta -r -n combined-aln
SSU-Align uses a workflow instead of just one command, but has the advantage of using the same alignment model as the Cydrasil reference alignment.
Importantly, SSU-Align uses input order instead of input flags (-t) for parameters.
The commandssu-align
will align your query sequences to the SSU-Align 16S model.ssu-align query-seqs.fasta output-directory-name
ssu-align query-seqs.fasta aln
Mask the alignment using the mask file in the Cydrasil reference package
ssu-mask -s alignment.bacteria.mask aln/
Reformat the SSU-Align alignment from stockholm (.stk) to aligned fasta (.afa)
ssu-esl-reformat -o aligned-query-sequences-masked.afa afa aln/aln.bacteria.mask.stk
Alternatively, the command
ssu-prep -x
allows for a multi-threaded alignment run on a desktop computer, but requires a few extra steps.
This is recommended for large query sequence files.ssu-prep -x query-seqs.fasta output-directory-name number-of-threads
ssu-prep -x query-seqs.fasta aln 1
Output is a bash script: aln.ssu-align.sh
Running the script will execute the SSU-Align alignmentbash aln.ssu-align.sh
Cleanup the bash script
rm aln.ssu-align.sh
Mask the alignment using the mask file in the Cydrasil reference package
ssu-mask -s alignment.bacteria.mask aln/
Reformat the SSU-Align alignment from stockholm (.stk) to aligned fasta (.afa)
ssu-esl-reformat -o aligned-query-sequences-masked.afa afa aln/aln.bacteria.mask.stk
-
Conduct a sequence placement run using EPA-ng
Next, we use the aligned query sequences as the input into a sequence placement algorithm. We recommend EPA-ng.
Flag Description Filename -s Cydrasil reference alignment file in afa format cydrasil-v3-reference-alignment.afa -t Reference tree file cydrasil-v3-reference-tree.nwk -q Query sequence alignment SSU-Align: aligned-query-sequences-masked.afa | PaPaRa: query.fasta --model Cydrasil model parameter specification file cydrasil-v3.bestModel -w Output directory -T Number of threads for multithreading IMPORTANT NOTE If you used PaPaRa to align your query sequences, you need to run an extra command before you do placements to remove the reference sequences from the query alignment. This is a product of how PaPaRa was coded. If you skip this step, your output file will contain both query and reference sequences as placements.
epa-ng --split cydrasil-v3-reference-alignment.phy papara_alignment.combined-aln
The output file from this command is
query.fasta
Example EPA-ng Command
epa-ng -s cydrasil-v3-reference-alignment.afa -t cydrasil-v3-reference-tree.nwk -q aligned-query-sequences-masked.afa --model
Output
EPA-ng outputs a .JPLACE file named
placements.jplace
. -
Tree Visualization
To visualize the .JPLACE containing the sequences of interest, the placement file ```placements.jplace is uploaded onto [iTOL] (https://itol.embl.de/)5. Placements act as a dataset within iTOL and can be toggled. Nodes with sequences of interest are visualized with red circles and clicking a node will show a breakdown of sequence ids and the corresponding confidence values for that node.
Steps to visualize tree in iTOL:
-
Create an user account with iTOL
-
Upload (drag and drop) .jplace file into an iTOL project
-
Reroot the tree by searching for WOR1 and then clicking on its parent node (I2504) and navigating to Tree structure > Re-root the tree here
-
From the Controls box set parameters as follows:
Basic
- Display mode: Normal
- Parameters: 0 degree rotation
- Invert: No
- Slanted: No
- Branch lengths: Use
- Labels: At tips NOTE: You can toggle labels to off if the tree slows your computer.
- Label rotation: On
- Label alignment: Left
ADVANCED: No changes. If you are having trouble reading your results, you can use the scaling factors to separate leaves.
DATASETS
- Turn on phylogenetic placements
- Use “Show query form” to search placement of individiual query sequences
- Insert query sequence ID and and click on the sequence ID name to display red circles indicating phylogenic placement of a given query sequence.
-
- Nawrocki E. 2009. Structural RNA Homology Search and Alignment Using Covariance Models. Washington University School of Medicine.
- Stamatakis A. 2014. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.
- Miller MA, Pfeiffer W, Schwartz T. 2010. Creating the CIPRES Science Gateway for inference of large phylogenetic trees. 2010 Gatew Comput Environ Work GCE 2010.
- Berger SA, Stamatakis A. 2011. Aligning short reads to reference alignments and trees. Bioinformatics 27:2068–2075.
- Letunic I, Bork P. 2016. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res 44:W242–W245.