-
Notifications
You must be signed in to change notification settings - Fork 7
Taxonomic profiling with the GTDB‐R214 database
In this tutorial, we will use sylph to do
- Multi-sample prokaryotic metagenomic profiling with the GTDB-R214 database
- output a taxonomic output like MetaPhlAn.
- sylph is installed.
- Python3 is installed with pandas
Note
This tutorial uses the GTDB-R214 database, an older database. A new GTDB version (R220) has 35% more genomes. Refer to the pre-built databases and updated metadata files to swap out the R214 database with R220 during this tutorial.
Download the pre-sketched (i.e., indexed) GTDB-R214 database provided here. For example,
wget https://storage.googleapis.com/sylph-stuff/v0.3-c1000-gtdb-r214.syldb -O gtdb_database.syldb
# OR
#wget https://storage.googleapis.com/sylph-stuff/v0.3-c200-gtdb-r214.syldb -O gtdb_database.syldb
- The
c1000
database is smaller and faster, but less sensitive for low-abundance genomes compared to thec200
database.
Assuming you have the GTDB-R214 genome database present in a folder called gtdb_genomes_reps_r214/
, do the following:
- Compile all GTDB genomes into a list file:
find gtdb_genomes_reps_r214 | grep .fna > gtdb_all.txt
- Create a database with:
sylph sketch -l gtdb_all.txt -t 50 -o gtdb_database
Let's download and index two mouse gut metagenomes:
wget https://storage.googleapis.com/sylph-stuff/mouse_1.fq.gz
wget https://storage.googleapis.com/sylph-stuff/B-mouse_1.fq.gz
wget https://storage.googleapis.com/sylph-stuff/mouse_2.fq.gz
wget https://storage.googleapis.com/sylph-stuff/B-mouse_2.fq.gz
We now have two sets of paired-reads. For paired-end reads, sylph must sketch the reads first as follows:
sylph sketch -1 *mouse_1.fq.gz -2 *mouse_2.fq.gz
sylph can sketch multiple samples at once. This allows for multi-threaded sketching and is more efficient.
This outputs two files called mouse_1.fq.gz.paired.sylsp
and B-mouse_1.fq.gz.paired.sylsp
. The paired
indicates the sketch was created with the -1,-2
options, and the sylsp
indicates that it is indexed.
Caution
Do not interleave paired-end reads. sylph will give more accurate results if you do not concatenate/interleave your reads.
To multi-sample profile with sylph, run the following:
sylph profile gtdb_database.syldb *mouse_1.fq.gz.paired.sylsp -t 10 -o results.tsv
This uses 10 threads to profile the metagenome against our database into a file called results.tsv
.
Tip
Since sylph v0.6, direct profiling of paired-end reads without sketching is possible:
sylph profile gtdb_database.syldb -1 *_1.fq.gz -2 *_2.fq.gz -t 10 -o results.tsv
Inspecting the results.tsv
file gives:
head results.tsv
Sample_file Genome_file Taxonomic_abundance Sequence_abundance Adjusted_ANI Eff_cov ANI_5-95_percentile Eff_lambda Lambda_5-95_percentile Median_cov Mean_cov_geq1 Containment_ind Naive_ANI Contig_name
mouse_1.fq gtdb_genomes_reps_r214/database/GCA/910/577/315/GCA_910577315.1_genomic.fna.gz 10.6258 10.1684 97.66 0.656 97.33-98.07 0.656 0.55-0.76 1 1.456 616/2624 95.43 CAJTME010000001.1 TPA_asm: uncultured Muribaculaceae bacterium isolate MGBC104416 genome assembly, contig: MGBC104416.1, whole genome shotgun sequence
mouse_1.fq gtdb_genomes_reps_r214/database/GCF/001/945/605/GCF_001945605.1_genomic.fna.gz 9.1588 7.5114 98.36 0.565 97.92-98.84 0.565 0.47-0.66 1 1.421 553/2100 95.79 NZ_MPKA01000006.1 Dubosiella newyorkensis strain NYU-BL-A4 NODE_100_length_1023_cov_1250.21_ID_199, whole genome shotgun sequence
mouse_1.fq gtdb_genomes_reps_r214/database/GCA/910/589/675/GCA_910589675.1_genomic.fna.gz 8.3282 5.1014 95.18 0.514 94.47-95.99 0.514 0.37-0.67 1 1.299 144/1629 92.47 CAJUTO010000001.1 uncultured Lactobacillus sp. isolate MGBC166701 genome assembly, contig: MGBC166701.1, whole genome shotgun sequence
mouse_1.fq gtdb_genomes_reps_r214/database/GCA/910/588/855/GCA_910588855.1_genomic.fna.gz 8.1294 11.9822 98.20 0.502 97.87-98.59 0.502 0.43-0.56 1 1.345 892/3903 95.35 CAJUSR010000001.1 TPA_asm: uncultured Kineothrix sp. isolate MGBC162921 genome assembly, contig: MGBC162921.1, whole genome shotgun sequence
mouse_1.fq gtdb_genomes_reps_r214/database/GCA/910/579/675/GCA_910579675.1_genomic.fna.gz 7.7809 7.5080 98.71 0.480 98.27-99.11 0.480 0.40-0.55 1 1.303 654/2522 95.74 CAJTUF010000001.1 TPA_asm: uncultured Muribaculaceae bacterium isolate MGBC114255 genome assembly, contig: MGBC114255.1, whole genome shotgun sequence
mouse_1.fq gtdb_genomes_reps_r214/database/GCF/001/591/705/GCF_001591705.1_genomic.fna.gz 7.1538 6.0999 96.41 0.441 95.86-97.20 0.441 0.32-0.54 1 1.216 264/2258 93.31 NZ_BCVK01000001.1 Lactococcus lactis subsp. cremoris NBRC 100676, whole genome shotgun sequence
mouse_1.fq gtdb_genomes_reps_r214/database/GCF/000/403/395/GCF_000403395.2_genomic.fna.gz 6.9472 9.1490 95.75 0.429 95.27-96.45 0.429 0.32-0.51 1 1.251 299/3240 92.60 NZ_KE159657.1 Anaerotruncus sp. G3(2012) strain G3 acPFl-supercont1.1, whole genome shotgun sequence
mouse_1.fq gtdb_genomes_reps_r214/database/GCA/003/833/075/GCA_003833075.1_genomic.fna.gz 5.7036 7.2196 97.02 0.352 96.46-97.75 0.352 0.27-0.42 1 1.234 394/3324 93.35 RIAY01000001.1 Muribaculaceae bacterium Isolate-036 (Harlan) seq1, whole genome shotgun sequence
mouse_1.fq gtdb_genomes_reps_r214/database/GCA/910/587/685/GCA_910587685.1_genomic.fna.gz 5.0394 3.2464 97.82 0.311 97.06-98.58 0.311 0.24-0.39 1 1.200 230/1672 93.80 CAJUMY010000001.1 TPA_asm: uncultured Christensenellaceae bacterium isolate MGBC161649 genome assembly, contig: MGBC161649.1, whole genome shotgun sequence
You can check https://github.com/bluenote-1577/sylph/wiki/Output-format for the definitions of each of the columns. Various metrics, including (effective) coverage estimates, ANIs, abundances, etc are output.
Sylph's TSV output has no taxonomic information, only genome information. To get a taxonomic profile with ranks, we integrate GTDB's taxonomy with sylph's output. See here for information on how this works and how to customize taxonomy/databases.
Run the following commands:
git clone https://github.com/bluenote-1577/sylph-utils
python sylph-utils/sylph_to_taxprof.py -s results.tsv -m sylph-utils/prokaryote/gtdb_r214_metadata.tsv.gz -o prefix_
ls prefix_mouse_1.fq.gz.sylphmpa
ls prefix_B-mouse_2.fq.gz.sylphmpa
The python script takes in a sylph result (-s
), a taxonomy metadata file (-m
).
Important
-m
's metadata file must correspond to database used. We provide these custom metadata files in the sylph-utils
repository. If you use GTDB-R220 or R214, you must use the correct R220 or R214 metadata file.
The script outputs a new file called prefix_MYSAMPLENAME.sylphmpa
for each sample in the results file. Investigating one file gives the following:
head -n 20 prefix_mouse_1.fq.sylphmpa
#SampleID mouse_1.fq
clade_name relative_abundance sequence_abundance ANI (if strain-level)
d__Bacteria 100.00010000000002 99.99999999999999 NA
d__Bacteria|p__Bacillota 24.640800000000002 18.712699999999998 NA
d__Bacteria|p__Bacillota_A 47.333499999999994 52.5969 NA
d__Bacteria|p__Bacillota_A|c__Clostridia 47.333499999999994 52.5969 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales 9.0019 6.252800000000001 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae 3.9625 3.0064 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae|g__Pumilibacter 3.9625 3.0064 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae|g__Pumilibacter|s__Pumilibacter 3.9625 3.0064 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae|g__Pumilibacter|s__Pumilibacter|t__GCA_910587595.1 3.9625 3.0064 95.45
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700 5.0394 3.2464 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700|g__MGBC161649 5.0394 3.2464 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700|g__MGBC161649|s__MGBC161649 5.0394 3.2464 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700|g__MGBC161649|s__MGBC161649|t__GCA_910587685.1 5.0394 3.2464 97.82
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales 20.3153 24.426900000000003 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae 20.3153 24.426900000000003 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__1XD42-69 4.1703 4.1626 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__1XD42-69|s__1XD42-69 4.1703 4.1626 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__1XD42-69|s__1XD42-69|t__GCA_910589105.1 4.1703 4.1626 97.47
This file is a taxonomic profile similar to what MetaPhlAn outputs. Each taxonomic rank has an associated taxonomic or sequence abundance. Taxonomic abundance normalizes by genome sizes, whereas sequence abundance is the % of reads assigned to a taxonomic rank.
Note that the t__GCA_...
ranks, which are strain/genome-level ranks, have an associated ANI output. The other ranks do not.
We have just shown how sylph can do multi-sample profiling against the GTDB-R214 database very efficiently while giving ANI information as well.
To do taxonomic profiling with other databases, see https://github.com/bluenote-1577/sylph/wiki/Integrating-taxonomic-information-with-sylph. Alternatively, consider just analyzing sylph's resulting TSV file; it contains some useful information not present in the taxonomic profile.