Skip to content

HimesGroup/brocade

Repository files navigation

brocade

Reproducible analysis of ChIP-Seq data

Authors: Mengyuan Kan, Avantika Diwadkar, Blanca Himes

Overview

The goal of brocade pipeline is to preform reproducible analysis of ChIP-Seq data:

  • Download SRA .fastq data and prepare phenotype file
  • Align reads to a reference genome
  • Call peaks enriched in protein-DNA binding regsions
  • Perform QC on aligned files
  • Create a report that can be used to verify that sequencing was successful and/or identify sample outliers
  • Perform differential binding analysis of reads aligned to genome according to a given reference genome
  • Annotate genomic features to peaks
  • Create a report that summarizes the differential binding results Automatically generate LSF scripts in each step for HPC use.

Bioinformatics Tools

  • Alignment: BWA
  • Peak Calling: macs2
  • QC: fastqc, trimmomatic, samtools, bamtools, picardtools, bedtools, bedGraphToBigWig
  • Differential binding analysis: DiffBind
  • Peak genomic feature annotation: Corresponding R genome database packages and ChIPseeker
  • Pipeline scripts: the Python scripts make use of various modules including subprocess, os, argparse, sys
  • R markdown scripts for summary reporot: Require various R libraries such as DT, gplots, ggplot2, rmarkdown, RColorBrewer, dplyr. Note that the current RMD scripts require pandoc version 1.12.3 or higher to generate HTML report.

Prerequisite Files

  • Species-specific genome reference files: reference genome fasta files (.fa) and corresponding BWA index files for alignment, and chromosome length file for creating bigwig files.
  • Blacklisted region files: bed files containing "blacklisted regions", i.e. uniquely mappable regions are found at specific types of repeats such as centromeres, telomeres and satellite repeats. These lists are downloaded from Kundaje Lab, and converted to .bed file format and provided in template_files/[ref_genome]_blacklist.bed.
  • Adapter and primer sequences: a list of adapter and primer sequences is provided in template_files/adapter_primer_sequences.txt. For fastqc, replace . For trimming we provide adapter and primer sequences for the following types: Ilumina TruSeq single index, Illumina unique dual (UD) index adpter and PrepX. Users can tailor this file by adding sequences from other protocols.

Workflow

Download data from GEO/SRA

Run pipeline_scripts/chipseq_sra_download.py to download .fastq files from SRA. Read in template_files/chipseq_sra_download_Rmd_template.txt from specified directory template_dir to create a RMD script. Ftp addresses for corresponding samples are obtained from SRA SQLite database using R package SRAdb.

chipseq_sra_download.py --geo_id GEO_Accession --path_start output_path --project_name output_prefix --template_dir templete_file_directory --fastqc

bsub < project_name_download.lsf

The option --pheno_info refers to using user provided SRA ID for download which is included in the SRA_ID column in the provided phenotype file. If the phenotype file is not provided, use phenotype information from GEO. SRA_ID is retrieved from the field relation.1.

The option --fastqc refers to running FastQC for downloaded .fastq files.

Output files:

Fastq files downloaded from SRA are saved in the following directory.

path_start/project_name_SRAdownload/

Fastqc results of corresponding sample are saved in:

path_start/sample_name/fastq_name_fastqc.zip

The RMD and corresponding HTML report files:

path_start/project_name_SRAdownload/project_name__SRAdownload_ChIPSeqReport.Rmd path_start/project_name_SRAdownload/project_name__SRAdownload_ChIPSeqReport.html

GEO phenotype file generated by RMD reports:

path_start/project_name_SRAdownload/GEO_Accession_GEO_phenotype.txt

SRA information file generated by RMD reports:

path_start/project_name_SRAdownload/project_name_sraFile.info

User-tailored phentoype file

The sample info file used in the following steps should be provided by users.

Required columns:

  • 'Sample' column containing sample ID
  • 'Status' column containing variables of comparison state
  • 'Antibody' column containing antibody used for ChIP
  • 'Tissue' column containing tissue used
  • 'Treatment' column containing treatment conditions
  • 'Disease' column containing diseased conditions
  • 'Input' column containing sample ID used for DNA intput control that did not undergo ChIPSeq
  • 'Peak' column specifying 'narrow' or 'broad' peak type. Generally apply narrow peak calling for transcription binding site. Encode provides a list of narrow and broad marks for histone modifications (in the Target-specific Standards section).
  • 'R1' and/or 'R2' columns containing full paths of .fastq files

For DNA intput controls, specify Antibody='Input', Input='NA', and Peak='NA'

Other columns:

'Disease', 'Donor' (i.e. cell line ID if the same donor underwent treated vs. untreated comparison), and 'protocol' designating sample preparation kit information.

'Index' column contains index sequence for each sample. If provided, trim raw .fastq files based on corresponding adapter sequences.

If use data from GEO, most GEO phenotype data do not have index information. However, FastQC is able to detect them as "Overrepresented sequences". Users can tailor the 'Index' column based on FastQC results. We provide a file template_files/adapter_primer_sequences.txt with most updated adapter and primer sequences for FastQC detection.

In this example, we used a customized adapter file example_files/adapter_primer_sequences.txt.

An example phenotype file can be found here: example_files/GSE95632_Phenotype_withoutQC.txt. Note that column naming is rigid for the following columns: 'Sample', 'Status', 'Antibody', 'Tissue', 'Treatment', 'Input', 'Index', 'R1', 'R2', 'Disease', 'Donor', because pipeline scripts will recognize these name strings, but the column order can be changed.

Alignment, peak calling and QC

  1. Run pipeline_scripts/chipseq_align_and_qc.py to: 1) trim adapter and primer sequences if index information is available, 2) run FastQC for (un)trimmed .fastq files, 3) align reads to reference genome, and 5) obtain various QC metrics from .bam files.

Edit pipeline_scripts/chipseq_userdefine_variables.py with a list user-defined variables (e.g. paths of genome reference file, paths of bioinformatics tools, versions of bioinformatics tools), and save the file under an executable search path.

If perform adapter trimming, read in template_files/adapter_primer_sequences.txt from specified directory template_dir used as a reference list of index and primer sequences for various library preparation kits. In this example, a customized adapter file example_files/adapter_primer_sequences.txt was used. To replicate this result, save this file example_files/adapter_primer_sequences.txt in user-defined templete_file_directory.

chipseq_align_and_qc.py --project_name output_prefix --samples_in sample_info_file.txt --ref_genome hg38 --library_type SE --index_type GSE95632 --path_start output_path --template_dir templete_file_directory --bam2bw

for i in *align.lsf; do bsub < $i; done

The "--ref_genome" option refers to using selected version of genome reference.

The "--library_type" option refers to PE (paired-end) or SE (single-end) library.

The "--index_type" option refers to index used in sample library preparation.

Common index types provided in template_files/chipseq_adapter_primer_sequences.txt are:

  • truseq_single_index (TruSeq Single Indexes)
  • illumina_ud_sys1 (Illumina UD indexes for NovaSeq, MiSeq, HiSeq 2000/2500)
  • illumina_ud_sys2 (Illumina UD indexed for MiniSeq, NextSeq, HiSeq 3000/4000)
  • prepX (PrepX for Apollo 324 NGS Library Prep System)

The "--bam2bw" option refers to enabling the step of generating bigwig (.bw) files and creating a ucsc track annotation file for visualization.

template_files/chipaseq_adapter_primer_sequences.txt contains four columns (i.e. Type, Index, Description, Sequence). Sequences in the Index column is used to match those in Index column in sample info file. This column naming is rigid.

The list is based on the following resources:

If users provide new sequences, add the new index type in the 1st column 'Type' and specify it in "--index_type".

Read sample preparation protocal carefully. Reads not in the specified strand will be discarded. Double check proprotion of reads mapped to no feature category in QC report. If a lot of reads are mapped to 'no feature', the strand option setting is likely incorrect.

Output files:

Various output files will be written for each sample in directories structured as:

path_start/sample_name/sample_name_R1_Trimmed.fastq
path_start/sample_name/sample_name_R2_Trimmed.fastq
path_start/sample_name/sample_name_R1_Trimmed_fastqc.zip
path_start/sample_name/sample_name_R2_Trimmed_fastqc.zip
path_start/sample_name/sample_name_ReadCount
path_start/projectname_name_ucsc_track.txt
path_start/sample_name/bwa_out

  1. Run pipeline_scripts/chipseq_peakcaller.py to call peaks.

chipseq_peakcaller.py --project_name output_prefix --samples_in output_prefix --ref_genome hg38 --path_start output_path --template_dir templete_file_directory

for i in *_macs2.lsf; do bsub < $i; done

To use narrow or broad peak calling was specified in Peak column of sample info file.

Filtering out peaks within 'blacklist regions'. Blacklist regions for hg38 and hg19 are provided in templete_files: hg19_blacklist.bed and hg38_blacklist.bed.

Output files:

Peak files will be written for samples that underwent ChIPSeq in the macs2 output directory:

path_start/sample_name/macs2_out

  1. Run pipeline_scripts/chipseq_align_and_qc_report.py to create an HTML report of QC and alignment summary statistics for ChIP-Seq samples. Read in template_files/chipseq_align_and_qc_report_Rmd_template.txt from specified directory template_dir to create a RMD script.

chipseq_align_and_qc_report.py --project_name output_prefix --sample_in sample_info_file.txt --ref_genome hg38 --library_type SE --path_start output_path --template_dir templete_file_directory

bsub < project_name_qc.lsf

Output files:

This script uses the many output files created in step 1), converts these sample-specific files into matrices that include data for all samples, and then creates an Rmd document.

The report and accompanying files are contained in:

path_start/project_name_Alignment_QC_Report/

The RMD and corresponding HTML report files:

path_start/project_name_Alignment_QC_Report/project_name_QC_ChIPSeqReport.Rmd path_start/project_name_Alignment_QC_Report/project_name_QC_ChIPSeqReport.html

Differential binding analysis and genomic feature annotation

Run pipeline_scripts/chipseq_diffbind_report.py to perform differential binding analysis and create an HTML report of differential binding summary statistics. Read in template_files/chipseq_diffbind_report_Rmd_template.txt from specified directory template_dir to create a RMD script.

chipseq_diffbind_report.py --project_name output_prefix --samples_in sample_info_file_withQC.txt --comp sample_comp_file.txt --ref_genome hg38 --path_start output_path --template_dir templete_file_directory

bsub < project_name_diffbind.lsf

The "--sample_in" option specifies user provided phenotype file for differential binding analysis (example_files/GSE95632_Phenotype_withQC.txt). The columns are the same as example_files/GSE95632_Phenotype_withoutQC.txt but with an additional column "QC_Pass" designating samples to be included (QC_Pass=1) or excluded (QC_Pass=0) after QC. This column naming is rigid which will be recoganized in pipeline scripts, but column order can be changed.

The "--comp" option specifies comparisons of interest in a tab-delimited text file with one comparison per line with three columns (i.e. Condition1, Condition0, Design), designating Condition1 vs. Condition2. The current version of differential analysis does not support paired analysis.

Find the example comp file here example_files/GSE95632_comp_file.txt.

Output files:

Files generated for this step are contained in:

project_name/project_name_diffbind_out/

The sample sheet .csv files used as DiffBind input:

project_name_Condition1_vs_Condition2.sampleinfo.csv

The pairwise differential binding results and normalized counts for significant binding sites:

project_nameCondition1_vs_Condition2_sig_diffbind_results.csv project_nameCondition1_vs_Condition2_sig_counts_normalized_by_diffbind.csv

The RMD and corresponding HTML report file:

path_start/project_name_deseq2_out/project_name_DiffBind_Report.Rmd path_start/project_name_deseq2_out/project_name_DiffBind_Report.html

About

Reproducible analysis of CHIP-Seq data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published