This tutorial explains, step by step, how to add a new dataset to curatedMetagenomicData (cmd).
A publication of metagenomic samples can only be included in cmd if metadata exists for each sample and can be linked distinctively to publicly available raw data for each sample. The metadata of each dataset has to be curated and standardized manually in order to create a data repository whose samples can be compared across studies and conditions.
At the same time, the raw data of each sample are used to calculate the bacterial, fungal, archaeal, and viral taxonomic abundances with MetaPhlAn2 and the metabolic functional potential with HUMAnN2. Finally, the curated metadata and the profiles are structured and documented as a Bioconductor object.
- Overview
- 1. Requirements a new dataset has to meet
- 2. Downloading raw reads and linking them to sampleIDs
- 3. Calculating sequencing statistics from raw data
- 4. Joining all metadata information
- 5. Template table on GitHub
- 6. Metadata curation and standardization
- 6.1. Required columns in the metadata table
- 6.2. The study_condition and disease columns
- 6.3. Westernized and non-westernized samples
- 7. Add the metadata table to the GitHub repository
- 8. Run HUMAnN2
- 8.1. Curate the HUMAnN2 output
A dataset to be included in cmd needs to meet the following requirements:
- Availability of metagenomic raw reads on the NCBI BioProject site. Look for an accession code for the metagenomic raw reads; often you can find it either in the methods section or in the final paragraph of the study.
- Existing metadata. Search for a metadata table in the study and in the supplementary files. Often, the authors provide some phenotypic characteristics for each sample, such as the age, weight, sex, or country of the study participant, clinical values such as blood levels, or descriptions of diseases. In addition, some metadata that applies to all samples can be extracted from the main text of the study.
- Metadata can be assigned to raw reads. It is not sufficient to have metadata; we also have to be able to link the metadata to each raw read via a distinctive sample ID. You can check this to a certain degree by looking whether the metadata tables provide a column with unique sample IDs.
- The authors used an Illumina® application as their sequencing method. Simply search for the word Illumina in the study text to make sure that they used an Illumina® application like MiSeq, HiSeq, or NextSeq. This is mostly noted in the methods section of the study.
The name of a dataset in cmd is composed of the surname of the first author, the first letter of the first author’s given name, an underscore, and the publication year. Example: LassalleF_2017
Once the dataset meets all requirements, you can start to download the raw data with the python script ncbi_downloader.py, available in the bitbucket repository of the Computational Metagenomics Lab.
You can use it as follows on the command line:
python ncbi_downloader.py name_of_the_dataset BioProject_accession_code -xv 1
For the BioProject accession code, it is best to use the ID shown in the upper right of the corresponding BioProject page. The script lists the sampleID together with the corresponding NCBI accession code for each sample directly in the shell. Copy this output and save it as a tab-separated text file; in the following, this file is called the mapping file.
Next, the script will start to download all raw reads. If you only want the mapping information for now, without downloading the raw reads, you can run the script with the additional option “-m xml”.
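For example, a mapping-only run might look like the following sketch. The accession PRJNA000000 is a placeholder, and the redirect assumes the script prints only the mapping to standard output; otherwise, copy and save it by hand as described above.
# hypothetical accession; redirect the printed mapping into a tab-separated file
python ncbi_downloader.py LassalleF_2017 PRJNA000000 -m xml > mapping.tsv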
During the download, the script creates a folder named after the dataset. It contains a reads folder, which in turn holds a folder for each sample; each of these contains the corresponding raw reads in an appropriate format (.sra, .fastq).
For further use of the downloaded raw reads, unzip them into a new folder, for example as sketched below. In addition, the script creates a file name-of-the-dataset_init_metadata.csv, which contains the metadata deposited for each sample on the NCBI BioProject site.
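A minimal sketch of the unpacking step, assuming bzip2-compressed FASTQ reads (as in the read-statistics example in the next section); the folder names are placeholders:
# decompress each sample's reads into a common new folder
mkdir -p LassalleF_2017/reads_unpacked
for f in LassalleF_2017/reads/*/*.bz2; do
    bunzip2 -kc "$f" > "LassalleF_2017/reads_unpacked/$(basename "$f" .bz2)"
done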
In the metadata table, the number of reads, the number of bases, the minimum read length, and the median read length of the sample sequencing are noted as well. To calculate these statistics from the raw data, you can use the python script fna_len.py, available in the PyPhlAn bitbucket repository.
This script can be used on single samples. Since datasets usually contain multiple samples, it is recommended to run several calculations in parallel. A bash script for a Linux OS could look like the following example for the LassalleF_2017 study:
#!/bin/bash
parallel -j15 'bzcat /.../LassalleF_2017/reads/{}/{}*.bz2 | python /.../pyphlan/fna_len.py -q --stat > .../my_curation/all_read_stats/{}.stats' ::: SID0002_gti SID0004_cle SID0005_aaf SID0006_evh SID0008_kcu SID0009_rhg SID0011_vkb SID0012_qdz SID0013_eop
The script will create a file sample-name.stats for each sample.
Usually, we now have three or more files containing metadata:
- the mapping file from the ncbi_downloader.py script
- the init_metadata.csv file from the ncbi_downloader.py script
- metadata tables from the study supplementary
- the statistics file .stats for each sample.
The goal is one final standardized metadata table, stored as a tab-separated file, in which each row represents one sample and each column one metadata property.
Unfortunately, the approach to metadata acquisition differs for each study, since the way the metadata is presented depends on the authors. Therefore, the curator of a new cmd dataset has to become familiar with the metadata and the study design at the latest now. The curator has to figure out how to join all data tables into one big final metadata table without losing the uniqueness of the rows (samples). The join can only be done if it is guaranteed that each row of one table is matched to exactly one row of the other table. There are multiple ways to merge all those tables. As an example, the “Metadata_curation_example” file shows the curation of the metadata for the LassalleF_2017 study, done in RStudio using the packages readxl, readr, tibble, dplyr, and tidyr from CRAN.
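As a language-agnostic alternative to the R-based example, a minimal shell sketch of such a one-to-one join could look as follows; the file names are hypothetical and the sampleID is assumed to be the first column of both files:
# join two tab-separated tables on a shared sampleID column;
# join requires inputs sorted on the join field, and header lines
# should be removed first and re-added separately
join -t $'\t' <(sort -k1,1 mapping.tsv) <(sort -k1,1 study_metadata.tsv) > joined_metadata.tsv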
With the cmd package, the end-user wants to compare metagenomic samples between studies. Therefore, the metadata has to be standardized according to a template table which summarizes all descriptions used so far for sample features.
The template table tells us which values are allowed in each metadata column. It contains the following columns:
- col.name. This column lists the allowed column names for the metadata table.
- uniqueness. If “non-unique”, duplicated values are allowed in the column. If “unique”, every duplicated value is an error.
- requiredness. If a feature is “required”, NA values are not allowed in the column. If it is “optional”, NA values are allowed.
- multiplevalues. If TRUE, a sample can have multiple values in this column, separated by semicolons. If FALSE, multiple values are not allowed.
- allowedvalues. A regular expression, abbreviations, or written-out words defining the legal values for a single cell.
- description. Additional information on how to use the feature: explanations of the column names or abbreviations, descriptions of where to find this information, or annotations of which units to use for clinical values.
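To illustrate how these rules translate into concrete checks, here is a small sketch that flags entries violating an allowedvalues rule; it assumes, hypothetically, that the country column (an ISO3 code, see section 6.1) sits in the fifth column of the curated table:
# print any country value that is not three capital letters
awk -F'\t' 'NR > 1 && $5 !~ /^[A-Z][A-Z][A-Z]$/ {print "line " NR ": " $5}' LassalleF_2017_metadata.tsv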
It is possible that some metadata features of the new study are not yet described in the template table. In this case, one can change the template table directly in the GitHub repository, with adequate care. Click the pencil symbol (“edit this file”) on the right above the table to edit the comma-separated file. You can then either add an entirely new row for a new feature or add allowed values to an existing feature (for example, adding a new country code to the country row).
The joined metadata table now has to be standardized following the guidelines from the template table on GitHub. To obtain a fully curated metadata table, the curator may have to change column names and column content and recalculate values to fit the default units. The “Metadata_curation_example” file shows some examples of this curation. But again, this differs for each dataset, and the curator has to use his or her biological and statistical expertise to curate the metadata. Below is some information about selected metadata features.
- sampleID. The sampleID has to be unique and distinctive for each sample. The study usually already provides sample IDs to identify each sample, which can be used most of the time. The sampleID has to start with a letter (see the sanity-check sketch after this list).
- subjectID. The subjectID refers to the study participant. In longitudinal studies and similar designs, each participant provides more than one sample; in this case, the sampleID is still unique for each sample, but the subjectID stays the same. It identifies the participant and also has to start with a letter.
- body_site. This column describes the body site of sample acquisition and can be one of the following defined words: stool, skin, vagina, oralcavity, nasalcavity, lung, milk.
- country. The country in which the participant lives and was enrolled, noted as an ISO3 code. The country codes already in use are listed in the template table.
- sequencing_platform. Since the use of an Illumina® application is required for a dataset to be included in cmd, the sequencing platform is also a required column.
- PMID. The PMID accession code for the PubMed database, which contains only numbers. You can also put “unpublished” as a value here.
- number_reads. Number of final reads, calculated from the raw data.
- number_bases. Total number of bases sequenced in the sample, calculated from the raw data.
- minimum_read_length. Calculated from the raw data.
- median_read_length. Calculated from the raw data.
- curator. The name of the curator, with an underscore between given and family name. If you are a new curator, you have to add your name to the template table.
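A small sketch of pre-checking the sampleID rules before uploading, assuming the sampleID is the first column of the curated table:
# duplicated sampleIDs; the output should be empty
cut -f1 LassalleF_2017_metadata.tsv | tail -n +2 | sort | uniq -d
# sampleIDs not starting with a letter; the output should be empty
cut -f1 LassalleF_2017_metadata.tsv | tail -n +2 | grep '^[^A-Za-z]'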
At first glance, the study condition and the disease seem very similar, but there are a few minor yet important differences to consider.
- study_condition. The study condition describes the main condition under study. This does not necessarily have to be a disease; it can also relate to premature birth, transmission studies, or evolutionary aspects. For all control samples, the study condition is “control”.
- disease. The disease a participant is suffering from. This column can also hold multiple values. A control sample is not necessarily healthy; therefore, “healthy” should only be noted if the participant is described as healthy in the study.
The allowed values for the study_condition and disease columns can be seen and extended in the template table.
Another feature, added to most of the datasets in cmd, is the determination of whether a sample comes from a human living in western civilization or from a human living a more traditional lifestyle. As the human species evolved on our planet, the microbiome coevolved with our body. In the last few hundred years, urbanization and westernization have had a great impact on the lifestyle of modern-era humans. Changes like the use of antibiotics, exposure to xenobiotics, more hygienic households, and reduced contact with the natural world also affect our microbiome, since it is deeply interconnected with the human body. In addition, the food we eat often consists of high-calorie, high-fat meals prepared by heating and highly controlled in the production process. Pasolli et al. 2018 (unpublished) introduce umbrella terms to distinguish microbiome samples coming from humans in industrialized populations who are exposed to the majority of the changes described above (“westernized”) from samples coming from humans living a more traditional, natural hunter-gatherer lifestyle (“non-westernized”).
Understandably, studies that do not consider this evolutionary aspect do not state whether their samples are westernized or non-westernized. It is the task of the cmd curator to decide this and add the column to the curated metadata by evaluating the content and design of the study.
Once the curated metadata table is prepared, one can upload it to the curatedMetagenomicDataCuration GitHub repository.
In the repository path curatedMetagenomicDataCuration/inst/curated/ you can find a folder for each already curated dataset. Create a directory for the new dataset (named after the dataset name) and upload the curated metadata table there as a tab-separated text document (.tsv) named dataset-name_metadata.tsv.
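A minimal sketch of this step in a local clone of the curation repository, using LassalleF_2017 as the example dataset name:
# create the dataset folder and place the curated table in it
mkdir curatedMetagenomicDataCuration/inst/curated/LassalleF_2017
cp LassalleF_2017_metadata.tsv curatedMetagenomicDataCuration/inst/curated/LassalleF_2017/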
A vignette in the GitHub repository checks all files ending in _metadata.tsv against the template file. Two types of errors can appear:
- “Column name error”: a column name is not found in the template file at all
- “Entry error”: at least one value in a column does not match the rules defined for that column
After a few minutes, any errors are listed in the curatedMetagenomicData curation report.
Run the HUMAnN2 profiling with the raw data. For this, have a look at the HUMAnN2 Tutorial and the HUMAnN2 User Manual.
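A minimal sketch of profiling a single sample, with paths as placeholders; see the HUMAnN2 User Manual for the full set of options. MetaPhlAn2 is run internally by HUMAnN2, which is why the bugs list appears among the temp files below.
# profile one sample; the output lands in the dataset folder
humann2 --input reads_unpacked/SID0002_gti.fastq --output LassalleF_2017/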
The output of HUMAnN2 will be organized like this:
- Dataset_name [folder]
  - humann2_temp [folder]
    - samplename_humann2_temp [folder for each sample]
      - samplename_metaphlan_bugs_list.tsv [file]
      - samplename.marker_ab_table [file]
      - samplename.marker_pres_table [file]
      - samplename_bowtie2_aligned.sam [file]
      - … more files which are not listed here
  - samplename [folder for each sample]
    - samplename_genefamilies.tsv [file]
    - samplename_pathabundance.tsv [file]
    - samplename_pathcoverage.tsv [file]
The main output (genefamilies, pathabundance, and pathcoverage) is stored in a separate folder for each sample. It is recommended to unzip the files into common folders, one per output type. As a result, you should end up with a genefamilies folder containing all _genefamilies.tsv files, a pathabundance folder containing all _pathabundance.tsv files, and a pathcoverage folder containing all _pathcoverage.tsv files. The files in these folders should be named solely by their sample name, i.e. without the _genefamilies.tsv, _pathabundance.tsv, and _pathcoverage.tsv endings.
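A minimal sketch of collecting one output type and stripping the type suffix, assuming the per-sample files are already unzipped; repeat analogously for genefamilies and pathabundance (paths are placeholders):
# gather all pathcoverage files into one folder, named by sample only
mkdir -p pathcoverage
for f in LassalleF_2017/*/*_pathcoverage.tsv; do
    cp "$f" "pathcoverage/$(basename "$f" _pathcoverage.tsv).tsv"
done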
In addition, HUMAnN2 creates a _metaphlan_bugs_list.tsv file for each sample and stores it in the humann2_temp folder. For cmd purposes, we want to copy those files to a separate metaphlan_bugs_list folder in the same directory as the other folders described above.
The genefamilies and pathabundance files have to be converted to relative abundances so that the data can be compared with other datasets. For this purpose, you can use the script humann2_renorm_table (have a look at the HUMAnN2 User Manual) and create additional folders genefamilies_relab and pathabundance_relab with the output files.
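A minimal sketch for one sample, to be repeated for each sample and analogously for the pathabundance files; the sample name is a placeholder:
# renormalize gene family abundances to relative abundances ("relab")
mkdir -p genefamilies_relab
humann2_renorm_table --input genefamilies/SID0002_gti.tsv --output genefamilies_relab/SID0002_gti.tsv --units relab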
The files in the folders marker_abundance and marker_presence are created by the script __run_markers2.py.
After these steps, the folder for the dataset should be organized as follows:
- Dataset_name [folder]
  - humann2_temp [folder] ← can be removed now
  - genefamilies [folder]
    - samplename.tsv [file for each sample]
  - genefamilies_relab [folder]
    - samplename.tsv [file for each sample]
  - marker_abundance [folder]
    - samplename.marker_ab_table [file for each sample]
    - samplename.marker_pres_table [file for each sample]
    - samplename.profile [file for each sample]
  - marker_presence [folder]
    - samplename.tsv [file for each sample]
  - metaphlan_bugs_list [folder]
    - samplename.tsv [file for each sample]
  - pathabundance [folder]
    - samplename.tsv [file for each sample]
  - pathabundance_relab [folder]
    - samplename.tsv [file for each sample]
  - pathcoverage [folder]
    - samplename.tsv [file for each sample]
After the curation, you end up with a metadata table and the corresponding MetaPhlAn2 and HUMAnN2 profiles for each sample.