Samantha edited this page Dec 1, 2021 · 24 revisions

Welcome to the iCLIP wiki!

Getting Started

The iCLIP GitHub repository is stored locally and is used for project deployment. Multiple projects can be deployed from this single location simultaneously.

  1. Change working directory to the iCLIP repository
cd /data/RBL_NCI/Pipelines/iCLIP/[version number]
  2. Review the Tutorial WikiPage and run a test!

Snakemake Options

The Snakemake workflow has multiple options:

Usage: /data/RBL_NCI/Pipelines/iCLIP/[version number]/run_snakemake.sh -p pipeline
	-p options: initialize, checks, dry-run, cluster, local, unlock, git, DAG, report
Usage:  -o output_dir
	-o path to output directory

Example commands:

#Initialize Pipeline
sh run_snakemake.sh -p initialize -o /path/to/output/dir

#Check manifests
sh run_snakemake.sh -p check -o /path/to/output/dir

#Dry-Run
sh run_snakemake.sh -p dry -o /path/to/output/dir

#Execute pipeline on the cluster
sh run_snakemake.sh -p cluster -o /path/to/output/dir

#Execute pipeline locally
sh run_snakemake.sh -p local -o /path/to/output/dir

#Unlock directory (after failed partial run)
sh run_snakemake.sh -p unlock -o /path/to/output/dir

#GIT Action
sh run_snakemake.sh -p git -o /path/to/output/dir

#Create report
sh run_snakemake.sh -p report -o /path/to/output/dir

#Create DAG of pipeline
sh run_snakemake.sh -p DAG -o /path/to/output/dir

Explanation of pre-processing steps:

  • initialize (required): This must be performed before any Snakemake run (dry, local, cluster) can be performed. This will copy the necessary config, manifest and Snakefiles needed to run the pipeline to the provided output directory.
  • checks (optional): This is an optional step, to be performed before any Snakemake run (dry, local, cluster). This will check for errors in the snakemake_config files, as well as your input manifests. If there are errors, they will be printed to the command line or written to a text file in your output directory.
  • dry-run (optional): This is an optional step, to be performed before any Snakemake run (local, cluster). This will check for errors within the pipeline, and ensure that you have read/write access to the files needed to run the full pipeline.

Explanation of processing steps:

  • local - This will run the pipeline on a local node. NOTE: This should only be performed on an interactive node.
  • cluster - This will submit a master job to the cluster, and subsequent sub-jobs as needed to complete the workflow. An email will be sent when the pipeline begins, if there are any errors, and when it completes.

Explanation of other steps:

  • unlock - This will unlock the pipeline if an error caused it to stop in the middle of a run.
  • git - This is only utilized for GITHUB Actions testing.
  • DAG - This will produce a DAG of the workflow and dependencies, saved to the /output/dir/log directory
  • report - This will produce a report generated from the snakemake statistics produced by your pipeline, saved to the /output/dir/log directory.
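
The pre-processing and processing steps above are usually run in a fixed order. The following Python sketch only builds the command strings for that order; it does not execute anything. The helper name `plan_commands` is hypothetical, and the `-p` values are taken from the example commands above (the usage text spells two of them `checks` and `dry-run`, so confirm against run_snakemake.sh before relying on them).

```python
import shlex

# Option names as written in the example commands above; the usage text
# spells two of them "checks" and "dry-run", so verify before use.
PRE_STEPS = ["initialize", "check", "dry"]
RUN_STEPS = {"cluster", "local"}


def plan_commands(output_dir, run_mode="cluster"):
    """Return the command strings for one full run, in the documented order."""
    if run_mode not in RUN_STEPS:
        raise ValueError(f"run_mode must be one of {sorted(RUN_STEPS)}")
    steps = PRE_STEPS + [run_mode]
    return [
        f"sh run_snakemake.sh -p {step} -o {shlex.quote(output_dir)}"
        for step in steps
    ]


for command in plan_commands("/path/to/output/dir"):
    print(command)
```

Building the strings first makes it easy to review the sequence before pasting the commands into a shell.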

Preparing Configs

There are three config requirements for this pipeline, found in the /output/dir/config directory, after initialization. These files are:

  1. cluster_config.yml - this file will contain the config default settings for analysis. This file does not require edits, unless processing requirements dictate it.

  2. snakemake_config.yaml - this file will contain directory paths and user parameters for analysis:

    • sourceDir: path to repo; this does not need to be changed if running local version; example: '/data/RBL_NCI/Pipelines/iCLIP/v2.0'
    • outputDir: path to created output directory, where output will be stored; example: '/path/to/output/'
    • sampleManifest: path to sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
    • multiplexManifest: path to multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
    • contrastManifest: path to contrast manifest (see specific details below); example: '/path/to/contrast_manifest.tsv'
    • fastqDir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
    • filterlength: minimum read length to include in analysis [any int >20]
    • multiplexflag: whether samples are multiplexed ["Y","N"]
    • mismatch: if multiplexed, number of bp mismatches allowed in demultiplexing [1,2,3]
    • reference: reference organism['hg38', 'mm10']
    • spliceaware: whether to include the spliceaware feature for alignment ["Y","N"]
    • includerRNA: if spliceaware, whether to include RefSeq rRNAs in annotations ["Y", "N"]
    • spliceBPlength: if spliceaware, length of splice index to use [50, 75, 150]
    • splicejunction: if spliceaware, whether to include splice junctions in peak calls for DE_METHOD MANORM or DIFFBIND ["Y", "N"]
    • condenseexon: if spliceaware and there are multiple peaks in the same transcript, whether to combine them into one feature ["Y", "N"]
    • mincount: minimum number of reads to count as a peak [1,2,3]
    • ntmerge: minimum distance of nucleotides to merge peaks [10,20,30,40,50,60]
    • peakid: report peaks for unique peaks only or unique and fractional mm ["unique","all"]
    • DEmethod: which differential expression method to run, if any ["diffbind", "manorm", "none"]
    • sampleoverlap: if DEmethod is DIFFBIND, minimum number of samples a peak must be found in to be counted [>1]; example: 1
    • pval: if a DEmethod is selected, p-value cutoff for significance; example: 0.05
    • fc: if a DEmethod is selected, fold change cutoff for significance; example: 1
  3. index_config.yaml - this file will contain directory paths for index files. This file does not require edits, unless processing requirements dictate it.

    • organism:
      • std: '/path/to/index/'
      • spliceaware:
        • valuebp1: '/path/to/index1/'
        • valuebp2: '/path/to/index2/'
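
As a rough illustration of the allowed values listed for snakemake_config.yaml above, the following Python sketch checks a dictionary of settings against them. The function name `check_config`, the dictionary form, and the example values are assumptions made for illustration; the pipeline's own checks step remains the authoritative validation.

```python
# Allowed values copied from the parameter list above; the checker is hypothetical.
ALLOWED = {
    "multiplexflag": {"Y", "N"},
    "mismatch": {1, 2, 3},
    "reference": {"hg38", "mm10"},
    "spliceaware": {"Y", "N"},
    "includerRNA": {"Y", "N"},
    "spliceBPlength": {50, 75, 150},
    "splicejunction": {"Y", "N"},
    "condenseexon": {"Y", "N"},
    "mincount": {1, 2, 3},
    "ntmerge": {10, 20, 30, 40, 50, 60},
    "peakid": {"unique", "all"},
    "DEmethod": {"diffbind", "manorm", "none"},
}


def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key not in config:
            problems.append(f"{key}: missing")
        elif config[key] not in allowed:
            options = sorted(allowed, key=str)
            problems.append(f"{key}: {config[key]!r} not in {options}")
    # filterlength is documented as any integer greater than 20
    length = config.get("filterlength")
    if not isinstance(length, int) or length <= 20:
        problems.append(f"filterlength: {length!r} must be an int > 20")
    return problems


# Illustrative values only, not recommended defaults
example = {
    "filterlength": 30,
    "multiplexflag": "Y",
    "mismatch": 1,
    "reference": "hg38",
    "spliceaware": "Y",
    "includerRNA": "Y",
    "spliceBPlength": 75,
    "splicejunction": "Y",
    "condenseexon": "Y",
    "mincount": 3,
    "ntmerge": 50,
    "peakid": "all",
    "DEmethod": "manorm",
}
print(check_config(example))
```

The path-style settings (sourceDir, outputDir, manifests, fastqDir) are not checked here because their validity depends on the file system.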

Preparing Manifests

There are two required, and one optional, manifests for this pipeline. The paths of these files are defined in the snakemake_config.yaml file. Example files are placed in the /output/dir/manifest directory after initialization. You can edit these example files or choose to create your own. These files are:

  1. multiplexManifest (required) - this manifest will include information to map fastq files to their multiplex ID

    • file_name: the full file name of the multiplexed sample, which must be unique; example: 'test_1.fastq.gz'
    • multiplex: the multiplexID associated with the fastq file, which must be unique. These names must match the multiplex column of the sampleManifest. example: 'test_1'

    An example multiplex_manifest.tsv file:

    file_name		multiplex
    test_1.fastq.gz		test_1
    test_2.fastq.gz		test_2
    
  2. sampleManifest (required)

    • multiplex: the multiplexID associated with the fastq file; this need not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'
    • sample: the final sample name; this column must be unique. example: 'Ro_Clip'
    • barcode: the barcode to identify the multiplexed sample; this must be unique within each multiplexID but can repeat between multiplexIDs. example: 'NNNTGGCNN'
    • adaptor: the adaptor sequence, to be removed from sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
    • group: groupings for samples, may or may not be unique values. example: 'CNTRL'

    An example sampleManifest file with multiplexing of one sample. Notice that the multiplexID test_1 is repeated, as Ro_Clip and Control_Clip are both found in the same fastq file, whereas test_2 is not multiplexed:

    multiplex	sample		group		barcode		adaptor
    test_1		Ro_Clip		CLIP		NNNTGGCNN	AGATCGGAAGAGCGGTTCAG
    test_1		Control_Clip	CNTRL		NNNCGGANN	AGATCGGAAGAGCGGTTCAG
    test_2		Ro_Clip2	CLIP		NNNCGTANN	AGATCGGAAGAGCGGTTCAG
    
  3. contrastManifest (Optional - required with DE_Method of MANORM or DIFFBIND)

    • if MANORM:
      • sample: the sample name, identified in the sampleManifest [sample] column, of the sample to compare. example: 'Ro_Clip'
      • background: the background sample name, identified in the sampleManifest [sample] column, of the background to remove. example: 'Control_Clip'
    • if DIFFBIND:
      • sample: the sample group, identified in the sampleManifest [group] column, of the sample group to compare. example: 'CLIP' will include samples 'Ro_Clip' and 'Ro_Clip2'
      • background: the background group name, identified in the sampleManifest [group] column, of the background group to remove. example: 'CNTRL' will include sample 'Control_Clip'

    An example contrastManifest file for MANORM:

    sample,background
    Ro_Clip,Control_Clip
    

    An example contrastManifest file for DIFFBIND:

    sample,background
    CLIP,CNTRL
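
The uniqueness and cross-reference rules described for the three manifests can be sketched in Python as follows. This is a minimal illustration, not the pipeline's own check: the helper names `read_table` and `check_manifests` are hypothetical, and the delimiters (tabs for the multiplex and sample manifests, commas for the contrast manifest) are assumed from the example files shown above.

```python
import csv
import io


def read_table(text, delimiter):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def check_manifests(multiplex_rows, sample_rows, contrast_rows, de_method):
    """Return a list of problems found across the three manifests."""
    problems = []
    file_names = [r["file_name"] for r in multiplex_rows]
    multiplex_ids = [r["multiplex"] for r in multiplex_rows]
    if len(set(file_names)) != len(file_names):
        problems.append("multiplexManifest: file_name values must be unique")
    if len(set(multiplex_ids)) != len(multiplex_ids):
        problems.append("multiplexManifest: multiplex values must be unique")

    samples = [r["sample"] for r in sample_rows]
    if len(set(samples)) != len(samples):
        problems.append("sampleManifest: sample values must be unique")
    seen = set()
    for r in sample_rows:
        if r["multiplex"] not in multiplex_ids:
            problems.append(f"sampleManifest: unknown multiplex {r['multiplex']!r}")
        pair = (r["multiplex"], r["barcode"])
        if pair in seen:
            problems.append(f"sampleManifest: barcode {r['barcode']!r} repeats in {r['multiplex']!r}")
        seen.add(pair)

    # MANORM contrasts name samples; DIFFBIND contrasts name groups
    column = {"MANORM": "sample", "DIFFBIND": "group"}.get(de_method.upper())
    if column:
        known = {r[column] for r in sample_rows}
        for r in contrast_rows:
            for field in ("sample", "background"):
                if r[field] not in known:
                    problems.append(f"contrastManifest: {r[field]!r} is not a known {column}")
    return problems


# The example manifests from this page, written with single delimiters
multiplex_text = "file_name\tmultiplex\ntest_1.fastq.gz\ttest_1\ntest_2.fastq.gz\ttest_2\n"
sample_text = (
    "multiplex\tsample\tgroup\tbarcode\tadaptor\n"
    "test_1\tRo_Clip\tCLIP\tNNNTGGCNN\tAGATCGGAAGAGCGGTTCAG\n"
    "test_1\tControl_Clip\tCNTRL\tNNNCGGANN\tAGATCGGAAGAGCGGTTCAG\n"
    "test_2\tRo_Clip2\tCLIP\tNNNCGTANN\tAGATCGGAAGAGCGGTTCAG\n"
)
contrast_text = "sample,background\nRo_Clip,Control_Clip\n"

problems = check_manifests(
    read_table(multiplex_text, "\t"),
    read_table(sample_text, "\t"),
    read_table(contrast_text, ","),
    "manorm",
)
print(problems)
```

Passing the same MANORM contrast file with a DE method of DIFFBIND would report both entries, since DIFFBIND contrasts are matched against the group column rather than the sample column.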
    

General Workflow

The following are the rule processes run, depending on the config selection:

  • rule check_manifest:
  • rule qc_barcode:
  • if demux_flag == "Y":
    • rule demultiplex:
    • rule rename_demux:
  • if demux_flag == "N":
    • rule copy_nondemux:
  • rule remove_adaptors:
  • rule qc_fastq_pre:
  • rule qc_fastq_post:
  • rule qc_screen_validator:
  • rule determine_splits:
  • rule split_files:
  • rule novoalign:
  • rule create_bam_mm_unique:
  • rule merge_splits_unique_mm:
  • rule merge_mm_and_unique:
  • rule multiqc:
  • rule qc_alignment:
  • rule dedup:
  • if splice_aware== "Y":
    • rule mapq_recalc:
    • rule mapq_stats:
  • rule create_beds_safs:
  • rule feature_counts:
  • rule project_annotations:
  • rule peak_annotations:
  • rule annotation_report:
  • if DE_Method == "MANORM":
    • rule MANORM_beds:
    • rule MANORM_analysis:
    • rule MANORM_post_processing:
    • rule MANORM_RMD:
  • if DE_Method == "DIFFBIND":
    • rule DIFFBIND_beds:
    • rule DIFFBIND_preprocess:
    • rule DIFFBIND_analysis:
    • rule DIFFBIND_report:
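
To make the conditional branches above concrete, this Python sketch returns the rule names that would apply for a given set of flags. The function `expected_rules` is hypothetical and simply mirrors the list above in the order it is written, which may not be the order in which Snakemake schedules the rules.

```python
def expected_rules(demux_flag, splice_aware, de_method):
    """List the rule names from this page that apply to the given settings."""
    rules = ["check_manifest", "qc_barcode"]
    if demux_flag == "Y":
        rules += ["demultiplex", "rename_demux"]
    elif demux_flag == "N":
        rules += ["copy_nondemux"]
    rules += [
        "remove_adaptors", "qc_fastq_pre", "qc_fastq_post",
        "qc_screen_validator", "determine_splits", "split_files",
        "novoalign", "create_bam_mm_unique", "merge_splits_unique_mm",
        "merge_mm_and_unique", "multiqc", "qc_alignment", "dedup",
    ]
    if splice_aware == "Y":
        rules += ["mapq_recalc", "mapq_stats"]
    rules += [
        "create_beds_safs", "feature_counts", "project_annotations",
        "peak_annotations", "annotation_report",
    ]
    # The config lists lowercase method names; the rule list uses uppercase
    method = de_method.upper()
    if method == "MANORM":
        rules += ["MANORM_beds", "MANORM_analysis",
                  "MANORM_post_processing", "MANORM_RMD"]
    elif method == "DIFFBIND":
        rules += ["DIFFBIND_beds", "DIFFBIND_preprocess",
                  "DIFFBIND_analysis", "DIFFBIND_report"]
    return rules


plan = expected_rules(demux_flag="Y", splice_aware="N", de_method="manorm")
print(len(plan), plan[-1])
```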

Expected Outputs

The following directories are created under the output_directory:

  • 01_preprocess: this directory includes intermediate files to be deleted upon pipeline completion
  • 02_bam: this directory includes the bam files for the pipeline, sorted by:
    • 01_unmapped: unmapped reads
    • 02_merged: unique and multi-mapped reads, sorted and indexed
    • 03_dedup: 02_merged file deduplicated
  • 03_peaks: this directory includes the bed and SAF files for the pipeline, sorted by:
    • 01_bed: bed files sorted by all reads or unique reads
    • 02_SAF: SAF files sorted by all reads or unique reads
    • 03_allreadpeaks: peaks for all reads split by unique and MM peaks
    • 03_alluniquereads: peaks for unique reads split by unique or MM peaks
  • 04_annotation: this directory includes the annotation files at a project and sample level, sorted by:
    • 01_project: includes project level annotation information
    • 02_peaks: includes annotation bed files, complete annotated peak text files
    • final annotation report (HTML) and table (TXT)
  • 05_demethod: this directory is only produced when MANORM or DIFFBIND is selected from DE_METHOD
    • 01_input: this includes bed files for any samples being compared
    • 02_analysis: this includes raw DE files (excel MANORM, text DIFFBIND) by comparison
    • 03_report: this includes the final reports (HTML) by comparison
  • qc: this directory includes the qc reports, sorted by:
    • multiqc_report: this includes the fastqc results, as well as fastq screen results of each sample before and after filtering
    • qc_report: this includes barcode and alignment information of each sample before and after filtering
  • log: this includes the slurm output files of the pipeline sorted by pipeline start time; copies of config and manifest files used in this specific pipeline run; error reporting script
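
A quick way to confirm that a finished run produced the directories listed above is to test for each one. The sketch below is an illustration only: the helper `missing_output_dirs` is hypothetical, it checks top-level directories only, and it is demonstrated on a temporary directory rather than a real pipeline output.

```python
import tempfile
from pathlib import Path

# Top-level directories named in the list above. 01_preprocess is left out
# because it holds intermediate files that are deleted on completion.
ALWAYS_EXPECTED = ["02_bam", "03_peaks", "04_annotation", "qc", "log"]


def missing_output_dirs(output_dir, de_method="none"):
    """Return the expected top-level directories absent from output_dir."""
    expected = list(ALWAYS_EXPECTED)
    if de_method.upper() in {"MANORM", "DIFFBIND"}:
        expected.append("05_demethod")
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_dir()]


# Demonstration on a throwaway directory rather than a real pipeline run
with tempfile.TemporaryDirectory() as tmp:
    for name in ["02_bam", "03_peaks", "qc"]:
        (Path(tmp) / name).mkdir()
    demo_missing = missing_output_dirs(tmp, de_method="manorm")
    print(demo_missing)
```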

Troubleshooting

  1. Check your inbox for an email regarding pipeline failure. You will receive an email from slurm@biowulf.nih.gov with the subject: Slurm Job_id=[#] Name=iCLIP Failed, Run time [time], FAILED, ExitCode 1
  2. Run the error report script
cd /[output_dir]/log/[time_of_run]
sh 00_create_error_report.sh
cat error.log

Review the report for the rules that erred, and the sample information. An example report is listed below:

The following error(s) were found in rules:
*********************************************
Error in rule rule1:
Error in rule rule2:
Error in rule rule3:

The following samples are affected by memory and must be deleted:


rule1.[sbatchid].sp=[sample_name].err:[E::hts_open_format] Disc quota exceeded

The following samples are affected by missing input files/output dir and should be reviewed:


rule2.[sbatchid].sp=[sample_name].err:[E::hts_open_format] Failed to open file "[file_name]" : No such file or directory

The following samples are affected by other error_rules and should be reviewed:


rule3.[sbatchid].sp=[sample_name].err:[E::hts_open_format] TIMEOUT

  3. Address the error(s) and restart the run:
cd /data/RBL_NCI/Pipelines/iCLIP/[version number]
module load snakemake

#unlock dir
sh run_snakemake.sh -p unlock -o /path/to/output/dir

#perform dry-run
sh run_snakemake.sh -p dry -o /path/to/output/dir

#submit to cluster
sh run_snakemake.sh -p cluster -o /path/to/output/dir
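
The example error report in step 2 sorts failing .err lines into three groups. The Python sketch below shows one way such a grouping could work, keyed on the message text in the example lines. It is an assumption made for illustration; the actual logic of 00_create_error_report.sh is not shown on this page and may differ.

```python
def classify_error(line):
    """Map one .err line to the report group used in the example report."""
    if "quota exceeded" in line:
        return "memory"
    if "No such file or directory" in line:
        return "missing input files/output dir"
    return "other"


# Lines copied from the example report, placeholders included
example_lines = [
    'rule1.[sbatchid].sp=[sample_name].err:[E::hts_open_format] Disc quota exceeded',
    'rule2.[sbatchid].sp=[sample_name].err:[E::hts_open_format] '
    'Failed to open file "[file_name]" : No such file or directory',
    'rule3.[sbatchid].sp=[sample_name].err:[E::hts_open_format] TIMEOUT',
]

groups = {}
for line in example_lines:
    rule = line.split(".", 1)[0]  # text before the first dot is the rule name
    groups.setdefault(classify_error(line), []).append(rule)
print(groups)
```

Grouping by rule name this way gives a short list of which rules to revisit before unlocking and resubmitting.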