Snakemake workflow: Analysis of epigenetic signals captured by fragmentation patterns of cell-free DNA
This repository contains snakemake workflows for analysing epigenetic signals captured by fragmentation patterns of cell-free DNA.
The following analyses are based on the following publication:
Snyder MW, Kircher M, Hill AJ, Daza RM, Shendure J. Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin. Cell. 2016 Jan 14;164(1-2):57-68. doi: 10.1016/j.cell.2015.11.050. PubMed PMID: 26771485
More details on analyses in this repository.
All scripts and binaries are provided as is, without any warrenty and for use at your own risk. This is not the release of a software package. We are only providing this information and code in addition to a description of methods for making it easier to reproduce our analyses. We are not providing any support for these scripts.
- Snakemake workflow: Analysis of epigenetic signals captured by fragmentation patterns of cell-free DNA
- Sebastian Röner (@sroener)
Our scripts largely depend on Python 2, the pysam, bx and numpy libraries. Here a list of versions that we used:
dependencies:
- python=2.7.15
- numpy=1.7.2
- pysam=0.7.6
- bx-python=0.8.9
We were also using R 4.0.2, UCSC binaries for working with bigWig and bigBed files and tools like tabix and samtools from HTSlib.
Please note that a samtools binary is included with these scripts. Among other things, this samtools binary allows filtering reads based on insert size/read length. This is an early version of the samtools branch released on https://github.com/mpieva/samtools-patched. For samtool calls that are not filtering for read length/insert size, other (more recent) versions of samtools might be used. Please note though that parameters might be named differently and that the ASCII encoding of read filters is also special to the samtools version that we used.
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Clone the repository to your local system, into the place where you want to perform the data analysis.
Configure the workflow according to your needs via editing the files in the config/
folder. Adjust config.yaml
to configure the workflow execution, samples.tsv
to specify your sample setup and regions.tsv
to specify target regions.
Install Snakemake using conda:
conda create -c bioconda -c conda-forge -n snakemake snakemake
For installation details, see the instructions in the Snakemake documentation.
The workflows are executed from the repository root folder. The different analyses have to be executed separately. To specify the respective workflow use the -s
switch followed by the path of the Snakefile
(e.g.: ./snakefile_WPS.smk
)
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via
snakemake -s path/to/Snakefile --use-conda -n
Execute the workflow locally via
snakemake -s path/to/Snakefile --use-conda --cores $N
using $N
cores or run it in a cluster environment via
snakemake -s path/to/Snakefile --use-conda --cluster qsub --jobs 100
or
snakemake -s path/to/Snakefile --use-conda --drmaa --jobs 100
If you not only want to fix the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
Fragment endpoint coordinates were extracted from BAM files with the pysam API.Both outer alignment coordinates of PE data were extracted for properly paired reads. Both end coordinates of SR alignments were extracted when PE data was collapsed to SR data by adapter trimming. A fragment’s coverage is defined as all positions between the two (inferred) fragment ends, inclusive of endpoints. We define the Windowed Protection Score (WPS) of a window of size k as the number of molecules spanning the window minus those with an endpoint within the window. We assign the determined WPS to the center of the window.
Windowed protection scores are calculated for all provided regions with additional 1000 bp up- and downstream. The values are then averaged over all provided regions and normalized by substracting the trimmed mean (excluding the outer 25% on both sides of the distribution).
- configured by the user (samples.tsv):
- analysis ID
- samples
- path to sample .bam files
- reference samples fro plotting
- genome build per sample
- configured by the user (regions.tsv):
- bed file containing regions of interest (e.g. TFBS), all having the same length
Note: More information about config files here
- table containing bp specific WPS for regions listed in bed
- line plot showing normalized WPS of multiple samples
The analysis is contained in the snakefile_GE_analysis.smk
workflow.
Human Protein Atlas
FPKM gene expression (GE) values measured for 20,344 Ensembl gene identifiers in 44 human cell lines and 32 primary tissues by the Human Protein Atlas (Uhlén et al., 2015) was downloaded from http://www.proteinatlas.org/download/rna.csv.zip. Genes with 3 or more non-zero expression values were retained (n=19,378 genes). The GE data set is provided with one decimal precision for the FPKM values. Thus, a zero GE value (0.0) indicates expression in the interval [0, 0.05) Unless otherwise noted, we set the minimum GE value to 0.04 FPKM before log2-transformation.
Fast Fourier transformation (FFT) and smoothing of trajectories
We use parameters to smooth (3 bp Daniell smoother; moving average giving half weight to the end values) and de-trend the data (i.e. subtract the mean of the series and remove a linear trend). A recursive time series filter implemented in R was used to remove high frequency variation from trajectories. 24 filter frequencies (1/seq(5,100,4)) were used, and the first 24 values of the trajectory were taken as init values. The 24-value shift in the resulting trajectories was corrected by repeating the last 24 values of the trajectory.
FFT intensity correlation with expression
WPS was used to calculate periodograms of genomic regions using Fast Fourier Transform (FFT, spec.pgram in R) with frequencies between 1/500 and 1/100 bases. Intensity values for the 120-280 bp frequency range were determined from smooth FFT periodograms. S-shaped Pearson correlation between GE values and FFT intensities was observed around the major inter-nucleosome distance peak, along with a pronounced negative correlation in the 193-199 bp frequency range. The mean intensity in this frequency range was correlated with the average intensity with log2-transformed GE values for downstream analysis.
- included in the repository:
- annotations
- labels
- RNAtable from Protein Atlas
- blood atlas ["Blood"]
- protein atlas tissues ["Tissue"]
- protein atlas tissues + cell lines ["Extended"]
- configured by the user (samples.tsv):
- analysis ID
- samples
- path to sample bam files
- genome build per sample
Note: More information about config files here
- fft_summary tables (results/intermediate/body/fft_summaries)
- plots showing intensities across tissues (results/plots)
- table showing correlation with tissues/cell lines
- table showing correlation rank difference to reference sample
Workflow containing utility functions to calculate and visualize similarities between samples.
- FFT tables by gene expression analysis
Note: FFT tables will be generated automatically if gene expression workflow was not exicuted before.
- heatmaps showing spearman correlation between samples from the same experiment (ID) based on configured FFT thresholds
- clustermaps showing spearman correlation between samples from the same experiment (ID) based on configured FFT thresholds with added hierarchical clustering
- Plots showing 2D UMAP projection labeled by kmeans clustering with specified number of clusters
- Plots showing 2D UMAP projection labeled by HDBSCAN clustering