Home

Welcome to the iCLIP wiki!

Getting Started

The iCLIP GitHub repository is stored locally and is used for project deployment. Multiple projects can be deployed simultaneously from this single location.

Change the working directory to the iCLIP repository.

cd /data/RBL_NCI/Pipelines/iCLIP/[version number]

Load Snakemake to your environment.

# Recommend running snakemake>=5.19
module load snakemake
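
To confirm that the loaded module meets the recommended version, a check along these lines can help. This is a minimal sketch: `version_ge` is a hypothetical helper name, and it assumes a `sort` that supports `-V` (GNU coreutils).

```shell
# Returns success (0) when version $1 is >= version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# `snakemake --version` prints the installed version string.
if command -v snakemake >/dev/null 2>&1; then
    installed="$(snakemake --version)"
    if version_ge "$installed" "5.19"; then
        echo "snakemake $installed meets the recommended minimum"
    else
        echo "snakemake $installed is older than 5.19" >&2
    fi
else
    echo "snakemake not found; run 'module load snakemake' first" >&2
fi
```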

Snakemake Options

The Snakemake workflow has multiple options:

Usage: /home/sevillas2/git/iCLIP/run_snakemake.sh -p pipeline -o output_dir
    -p options: initialize, checks, dry-run, cluster, local, unlock, git, DAG, report
    -o path to output directory

Example initiation:

#Initialize Pipeline
sh run_snakemake.sh -p initialize -o /path/to/output/dir

#Check manifests
sh run_snakemake.sh -p check -o /path/to/output/dir

#Dry-Run
sh run_snakemake.sh -p dry -o /path/to/output/dir

#Execute pipeline on the cluster
sh run_snakemake.sh -p cluster -o /path/to/output/dir

#Execute pipeline locally
sh run_snakemake.sh -p local -o /path/to/output/dir

#Unlock directory (after failed partial run)
sh run_snakemake.sh -p unlock -o /path/to/output/dir

#GIT Action
sh run_snakemake.sh -p git -o /path/to/output/dir

#Create report
sh run_snakemake.sh -p report -o /path/to/output/dir

#Create DAG of pipeline
sh run_snakemake.sh -p DAG -o /path/to/output/dir
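
The pre-processing commands above are usually run in order, and there is little point continuing once one of them fails. The sketch below chains them and stops at the first failure. `run_stages` and the `RUNNER` variable are hypothetical names introduced for illustration; the stage names are taken from the example commands above.

```shell
# Run the given stages in order, stopping at the first one that fails.
# RUNNER defaults to the documented invocation; it can be overridden,
# for example to use an absolute path to run_snakemake.sh.
run_stages() {
    out_dir="$1"; shift
    for stage in "$@"; do
        echo "== $stage =="
        ${RUNNER:-sh run_snakemake.sh} -p "$stage" -o "$out_dir" || {
            echo "stage '$stage' failed; fix the error before continuing" >&2
            return 1
        }
    done
}

# Example (run from the iCLIP repository directory):
# run_stages /path/to/output/dir initialize check dry
```

Once the dry run succeeds, the pipeline can be launched with the `cluster` or `local` option as shown above.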

Explanation of pre-processing steps:

initialize (required): This must be performed before any Snakemake run (dry, local, cluster) can be performed. This will copy the necessary config, manifest and Snakefiles needed to run the pipeline to the provided output directory.
checks (optional): This is an optional step, to be performed before any Snakemake run (dry, local, cluster). This will check for errors in the snakemake_config files, as well as your input manifests. If there are errors they will be printed to the command line OR printed in a text file to your output dir.
dry-run (optional): This is an optional step, to be performed before any Snakemake run (local, cluster). This will check for errors within the pipeline, and ensure that you have read/write access to the files needed to run the full pipeline.

Explanation of processing steps:

local - This will run the pipeline on a local node. NOTE: This should only be performed on an interactive node.
cluster - This will submit a master job to the cluster, and subsequent sub-jobs as needed to complete the workflow. An email will be sent when the pipeline begins, if there are any errors, and when it completes.

Explanation of other steps:

unlock - This will unlock the pipeline if an error caused it to stop in the middle of a run.
git - This is only used for GitHub Actions testing.
DAG - This will produce a DAG of the workflow and dependencies, saved to the /output/dir/log directory.
report - This will produce a report generated from the snakemake statistics produced by your pipeline, saved to the /output/dir/log directory.

Preparing Configs

There are three config requirements for this pipeline, found in the /output/dir/config directory, after initialization. These files are:

cluster_config.yml - this file will contain the config default settings for analysis. This file does not require edits, unless processing requirements dictate it.
snakemake_config.yaml - this file will contain directory paths and user parameters for analysis:
- sourceDir: path to repo; this does not need to be changed if running local version; example: '/data/RBL_NCI/Pipelines/iCLIP/v2.0'
- outputDir: path to created output directory, where output will be stored; example: '/path/to/output/'
- sampleManifest: path to sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
- multiplexManifest: path to multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
- contrastManifest: path to contrast manifest (see specific details below); example: '/path/to/contrast_manifest.tsv'
- fastqDir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
- reference: selection of reference database ['hg38', 'mm10']
- filterlength: minimum read length to include in analysis [any int >20]
- spliceaware: whether to run splice_aware part of the pipeline ['y', 'n']
- includerRNA: whether to include refseq rRNAs in annotations ["Y", "N"]
- splice_bp_length: length of splice index to use [50, 75, 150]
- multiplexflag: whether samples are multiplexed ["Y","N"]
- mismatch: number of bp mismatches allowed in demultiplexing [1,2,3]
- mincount: minimum number of matches to count as a peak, as an integer [1,2,3]
- ntmerge: minimum distance of nucleotides to merge peaks [10,20,30,40,50,60]
- peakid: report peaks for unique peaks only or unique and fractional mm ["unique","all"]
- DEmethod: choose DE method ["manorm","none"]
- splicejunction: whether to include splice junctions in peak calls when DEmethod is "manorm"; example: "Y"
- condenseexon: whether to collapse exons; example: "Y"
index_config.yaml - this file will contain directory paths for index files. This file does not require edits, unless processing requirements dictate it.
- organism:
  - std: '/path/to/index/'
  - spliceaware:
    - valuebp1: '/path/to/index1/'
    - valuebp2: '/path/to/index2/'
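
Before a run, it can be useful to confirm that the manifest and fastq paths named in snakemake_config.yaml actually exist. The sketch below is not part of the pipeline: `cfg_get` and `check_config_paths` are hypothetical helpers, and they assume the config stores each setting as a flat `key: value` line, which may not match the real file layout.

```shell
# Read the value of a "key: value" line from a YAML file, dropping any
# trailing comment and surrounding quotes. Assumes a flat layout.
cfg_get() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" "$2" | head -n 1 \
        | sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//' \
              -e "s/^[\"']//" -e "s/[\"']$//"
}

# Report any required manifest or fastq path that is unset or missing.
# contrastManifest is skipped because it is optional.
check_config_paths() {
    cfg="$1"; missing=0
    for key in sampleManifest multiplexManifest fastqDir; do
        path="$(cfg_get "$key" "$cfg")"
        if [ -z "$path" ] || [ ! -e "$path" ]; then
            echo "missing: $key -> '$path'" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_config_paths /path/to/output/dir/config/snakemake_config.yaml
```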

Preparing Manifests

There are two required, and one optional, manifests for this pipeline. The paths of these files are defined in the snakemake_config.yaml file. Example files are placed in the /output/dir/manifest directory after initialization. You can edit these example files or choose to create your own. These files are:

multiplexManifest (required) - this manifest will include information to map fastq files to their multiplex ID
- file_name: the full file name of the multiplexed sample, which must be unique; example: 'test_1.fastq.gz'
- multiplex: the multiplexID associated with the fastq file, which must be unique. These names must match the multiplex column of the sampleManifest. example: 'test_1'
An example multiplex_manifest.tsv file:
```
file_name		multiplex
test_1.fastq.gz	test_1
test_2.fastq.gz	test_2
```
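sampleManifest (required) - this manifest will include information to map each multiplexID to its samples, barcodes, adaptors, and groups
- multiplex: the multiplexID associated with the fastq file; this need not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'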
- sample: the final sample name; this column must be unique. example: 'Ro_Clip'
- barcode: the barcode to identify the multiplexed sample; this must be unique within each multiplexID but can repeat between multiplexIDs. example: 'NNNTGGCNN'
- adaptor: the adaptor sequence, to be removed from sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
- group: groupings for samples, may or may not be unique values. example: 'CNTRL'
An example sampleManifest file with multiplexing of one sample. Notice that the multiplexID test_1 is repeated, as Ro_Clip and Control_Clip are both found in the same fastq file, whereas test_2 is not multiplexed:
```
multiplex	sample		group		barcode		adaptor
test_1	Ro_Clip		CLIP		NNNTGGCNN	AGATCGGAAGAGCGGTTCAG
test_1	Control_Clip	CNTRL		NNNCGGANN	AGATCGGAAGAGCGGTTCAG
test_2	Ro_Clip2	CLIP		NNNCGTANN	AGATCGGAAGAGCGGTTCAG
```
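
The two manifests must agree with each other, so a quick cross-check before running can catch mistakes early. The sketch below tests two of the rules stated above: every multiplex ID in the sampleManifest appears in the multiplexManifest, and every sample name is unique. `check_manifests` is a hypothetical helper, and it assumes tab-separated files with a header row and the column order shown in the examples.

```shell
# Usage: check_manifests multiplex_manifest.tsv sample_manifest.tsv
# Prints each problem found and returns non-zero if there are any.
check_manifests() {
    multiplex_manifest="$1"; sample_manifest="$2"
    awk -F '\t+' '
        NF == 0   { next }                       # ignore blank lines
        FNR == 1  { next }                       # skip each header row
        NR == FNR { known[$2] = 1; next }        # first file: column 2 is multiplex
        !($1 in known) { print "unknown multiplex ID: " $1; bad = 1 }
        seen[$2]++     { print "duplicate sample: " $2; bad = 1 }
        END { exit bad + 0 }
    ' "$multiplex_manifest" "$sample_manifest"
}
```

Runs of tabs are treated as a single separator, so manifests padded with extra tabs for alignment are read the same way as single-tab files.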
contrastManifest (optional; required when DEmethod is "manorm")
- contrast_1: the sample name, identified in the sampleManifest [sample] column, of the sample to compare. example: 'Ro_Clip'
- contrast_2: the sample name, identified in the sampleManifest [sample] column, of the background to remove. example: 'Control_Clip'
An example contrastManifest file:
```
contrast_1,contrast_2
Ro_Clip,Control_Clip
```
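
Because both contrast columns must name samples from the sample manifest, a mismatch can be caught with a short check. This is a sketch under assumptions: `check_contrasts` is a hypothetical helper, the sample manifest is taken to be tab-separated with the sample name in the second column, and the contrast file may be comma- or tab-separated.

```shell
# Usage: check_contrasts sample_manifest.tsv contrast_manifest
# Prints each contrast entry that is not a known sample name and
# returns non-zero if any are found.
check_contrasts() {
    sample_manifest="$1"; contrast_manifest="$2"
    awk '
        NF == 0   { next }                                  # ignore blank lines
        FNR == 1  { next }                                  # skip each header row
        NR == FNR { split($0, f, /\t+/); samples[f[2]] = 1; next }
        {
            n = split($0, c, /[,\t]+/)                      # comma or tab separated
            for (i = 1; i <= n; i++)
                if (!(c[i] in samples)) {
                    print "unknown sample in contrast: " c[i]
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$sample_manifest" "$contrast_manifest"
}
```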

Expected Outputs

The following directories are created under the output_directory:

01_remove_adaptor: zipped, fastq files with adaptors removed
02_unzip: unzipped fastq files, with adaptors removed
03_split: unzipped fastq files, split into smaller files to increase processing speed
04_sam:
- 01_alignment: intermediate split sam files, aligned to reference
- if splice_aware: 02_cleanup:
- if splice_aware: 03_genomic: converted transcriptome coordinates to genomic coordinates sam files, zipped
- 04_unmapped
05_reads: header, unique and multi-mapped text files
06_bam_unique: unsorted, sorted, and indexed bam files generated from unique split sam files
06_bam_mm: unsorted, sorted, and indexed bam files generated from multi-mapped split sam files
07_bam_merged_splits: merged sorted, indexed splits of unique and multi-mapped bam files
08_bam_merged: merged sorted, indexed unique and multi-mapped bam files
09_dedup: sorted and indexed deduplicated merged bam file; logs and headers
10_bed: unique bed files
11_SAF: unique peak SAF annotation files
12_counts: feature counts for all and unique reads
13_annotation:
- 01_project
- 02_peaks
- [sample].html
- [sample].txt
14_MAnorm:
- input:
[multiplexid]:
- 00_qc_pre: fastqc reports for each sample (summarized in /qc/multiqc_report.html)
- 00_qc_post: barcode statistics for each sample (summarized in /qc/qc_report.html)
- 01_renamed: demultiplexed files, renamed to match sampleid
qc:
- 00_qc_post
- 00_qc_screen_species
- multiqc_data
- qc_report.html
- 00_qc_screen_rrna
- manifest_clean.txt
- multiqc_report.html
- split_params.tsv
log: slurm output files, copies of config and manifest files
workflow: saved Snakefile(s) used in run(s)

Troubleshooting

Check your email for a message regarding the pipeline failure
Review the logs to determine what rule failed (logs are named by Snakemake rule)
Review /qc/qc_report.html to determine if poor performance was related to barcode mismatching or alignment

cd /path/to/output/dir/log

Address the error, unlock the directory (the unlock option under Snakemake Options), and re-execute the pipeline (the local or cluster option)
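
To narrow down which log to read first, the log directory can be searched for files that mention an error. This is a minimal sketch: `find_error_logs` is a hypothetical helper, it makes no assumption about how the log files are named, and a match on the word "error" is only a hint, not proof that the rule failed.

```shell
# List files under the given directory that contain "error"
# (case-insensitive), one path per line, sorted.
find_error_logs() {
    grep -r -l -i 'error' "$1" 2>/dev/null | sort
}

# Example: find_error_logs /path/to/output/dir/log
```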
