Home
Welcome to the iCLIP wiki!
The iCLIP GitHub repository is stored locally and is used for project deployment. Multiple projects can be deployed from this single location simultaneously without conflict.
- Change working directory to the iCLIP repository:

```
cd /data/RBL_NCI/Pipelines/iCLIP/[version number]
```
- Review the Tutorial WikiPage and run a test!
The Snakemake workflow has multiple options:

```
Usage: /data/RBL_NCI/Pipelines/iCLIP/[version number]/run_snakemake.sh -p pipeline -o output_dir
  -p    pipeline options: initialize, checks, dry-run, cluster, local, unlock, git, DAG, report
  -o    path to output directory
```
Example commands:
```
#Initialize Pipeline
sh run_snakemake.sh -p initialize -o /path/to/output/dir

#Check manifests
sh run_snakemake.sh -p check -o /path/to/output/dir

#Dry-Run
sh run_snakemake.sh -p dry -o /path/to/output/dir

#Execute pipeline on the cluster
sh run_snakemake.sh -p cluster -o /path/to/output/dir

#Execute pipeline locally
sh run_snakemake.sh -p local -o /path/to/output/dir

#Unlock directory (after failed partial run)
sh run_snakemake.sh -p unlock -o /path/to/output/dir

#GIT Action
sh run_snakemake.sh -p git -o /path/to/output/dir

#Create report
sh run_snakemake.sh -p report -o /path/to/output/dir

#Create DAG of pipeline
sh run_snakemake.sh -p DAG -o /path/to/output/dir
```
Explanation of pre-processing steps:
- initialize (required): This must be performed before any Snakemake run (dry, local, cluster). It copies the config, manifest, and Snakefiles needed to run the pipeline into the provided output directory.
- checks (optional): To be performed before any Snakemake run (dry, local, cluster). This checks for errors in the snakemake_config files and in your input manifests; any errors are printed to the command line or written to a text file in your output directory.
- dry-run (optional): To be performed before any Snakemake run (local, cluster). This checks for errors within the pipeline and ensures that you have read/write access to the files needed to run the full pipeline. A typical pre-processing sequence is sketched after this list.
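Taken together, a typical first-time sequence for these pre-processing steps might look like this (a sketch only; paths are placeholders, and the commands match the examples above):

```bash
# Sketch of a typical pre-processing sequence (paths are placeholders)
cd /data/RBL_NCI/Pipelines/iCLIP/[version number]

# 1. Copy configs, manifests, and Snakefiles into the output directory
sh run_snakemake.sh -p initialize -o /path/to/output/dir

# 2. Edit snakemake_config.yaml and the manifests created under the
#    output directory (see the config and manifest sections below)

# 3. Validate the configs/manifests, then dry-run the workflow
sh run_snakemake.sh -p check -o /path/to/output/dir
sh run_snakemake.sh -p dry -o /path/to/output/dir
```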
Explanation of processing steps:
- local - This will run the pipeline on a local node. NOTE: this should only be performed on an interactive node (see the sketch after this list).
- cluster - This will submit a master job to the cluster, and subsequent sub-jobs as needed to complete the workflow. An email will be sent when the pipeline begins, if there are any errors, and when it completes.
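For a local run on Biowulf, that typically means requesting an interactive allocation first, for example (a sketch; the sinteractive resource values are illustrative assumptions, not pipeline requirements):

```bash
# Request an interactive node (resource values are illustrative)
sinteractive --cpus-per-task=8 --mem=32g

# Run the pipeline locally on the allocated node
cd /data/RBL_NCI/Pipelines/iCLIP/[version number]
sh run_snakemake.sh -p local -o /path/to/output/dir
```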
Explanation of other steps:
- unlock - This will unlock the pipeline if an error caused it to stop in the middle of a run.
- git - This is only utilized for GITHUB Actions testing.
- DAG - This will produce a DAG of the workflow and dependencies, saved to the /output/dir/log directory.
- report - This will produce a report generated from the snakemake statistics produced by your pipeline, saved to the /output/dir/log directory.
There are three config files required for this pipeline, found in the /output/dir/config directory after initialization. These files are:
- cluster_config.yml - this file contains the default cluster settings for analysis. It does not require edits unless processing requirements dictate it.
- snakemake_config.yaml - this file contains directory paths and user parameters for analysis (an illustrative excerpt follows this parameter list):
- sourceDir: path to repo; this does not need to be changed if running local version; example: '/data/RBL_NCI/Pipelines/iCLIP/v2.0'
- outputDir: path to created output directory, where output will be stored; example: '/path/to/output/'
- sampleManifest: path to sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
- multiplexManifest: path to multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
- contrastManifest: path to contrast manifest (see specific details below); example: '/path/to/contrast_manifest.tsv'
- fastqDir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
- filterlength: minimum read length to include in analysis [any int >20]
- multiplexflag: whether samples are multiplexed ["Y","N"]
- mismatch: if multiplexed, number of bp mismatches allowed in demultiplexing [1,2,3]
- reference: reference organism ['hg38', 'mm10']
- spliceaware: whether to include the splice-aware feature for alignment ["Y","N"]
- includerRNA: if spliceaware, whether to include RefSeq rRNAs in annotations ["Y", "N"]
- spliceBPlength: if spliceaware, length of splice index to use [50, 75, 150]
- splicejunction: if spliceaware, whether to include splice junctions in peak calls for DE_METHOD MANORM or DIFFBIND ["Y", "N"]
- condenseexon: if spliceaware, whether to combine multiple peaks in the same transcript into one feature ["Y", "N"]
- mincount: minimum number of reads to count as a peak [1,2,3]
- ntmerge: minimum distance of nucleotides to merge peaks [10,20,30,40,50,60]
- peakid: report peaks for unique reads only, or for unique and fractional multi-mapped (MM) reads ["unique","all"]
- DEmethod: whether to run differential expression ["diffbind", "manorm", "none"]
- sampleoverlap: if DEmethod DIFFBIND, minimum number of samples a peak must be found in to be counted [>1]
- pval: if DEmethod, p-value cutoff for significance; example: 0.05
- fc: if DEmethod, fold-change cutoff for significance; example: 1
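For orientation, an illustrative snakemake_config.yaml excerpt is shown below. The keys are the parameters documented above; the values are placeholders and assumptions, not defaults. Always start from the copy created by the initialize step.

```yaml
# Illustrative excerpt only - values are placeholders, not defaults.
# Start from the snakemake_config.yaml created by `-p initialize`.
sourceDir: '/data/RBL_NCI/Pipelines/iCLIP/v2.0'
outputDir: '/path/to/output/'
sampleManifest: '/path/to/sample_manifest.tsv'
multiplexManifest: '/path/to/multiplex_manifest.tsv'
contrastManifest: '/path/to/contrast_manifest.tsv'
fastqDir: '/path/to/raw/fastq/files'
filterlength: 35
multiplexflag: 'Y'
mismatch: 1
reference: 'hg38'
spliceaware: 'Y'
includerRNA: 'N'
spliceBPlength: 75
splicejunction: 'N'
condenseexon: 'Y'
mincount: 1
ntmerge: 50
peakid: 'all'
DEmethod: 'manorm'
sampleoverlap: 2
pval: 0.05
fc: 1
```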
- index_config.yaml - this file contains directory paths for index files. It does not require edits unless processing requirements dictate it. The structure is (one block per reference organism):
```
organism:
  std: '/path/to/index/'
  spliceaware:
    valuebp1: '/path/to/index1/'
    valuebp2: '/path/to/index2/'
organism:
  ...
```
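As a purely hypothetical illustration of that structure, an hg38 entry might look like the following; the spliceaware keys would correspond to the spliceBPlength values listed above (50, 75, 150), and every key name and path here is an assumption, not the shipped config:

```yaml
# Hypothetical hg38 entry - key names and paths are assumptions,
# not the values shipped with the pipeline
hg38:
  std: '/path/to/hg38/index/'
  spliceaware:
    50: '/path/to/hg38_spliceaware_50bp/index/'
    75: '/path/to/hg38_spliceaware_75bp/index/'
    150: '/path/to/hg38_spliceaware_150bp/index/'
```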
There are two required manifests and one optional manifest for this pipeline. The paths to these files are defined in the snakemake_config.yaml file. Example files are placed in the /output/dir/manifest directory after initialization. You can edit these example files or create your own. These files are:
- multiplexManifest (required) - this manifest maps each multiplexed fastq file to its multiplexID
- file_name: the full file name of the multiplexed sample, which must be unique; example: 'test_1.fastq.gz'
- multiplex: the multiplexID associated with the fastq file, which must be unique. These names must match the multiplex column of the sampleManifest; example: 'test_1'
An example multiplex_manifest.tsv file:
```
file_name          multiplex
test_1.fastq.gz    test_1
test_2.fastq.gz    test_2
```
- samplesManifest (required)
- multiplex: the multiplexID associated with the fastq file; this need not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file; example: 'SIM_iCLIP_S1'
- sample: the final sample name; this column must be unique. example: 'Ro_Clip'
- barcode: the barcode used to identify the multiplexed sample; this must be unique within each multiplex sample name but can repeat between multiplexIDs; example: 'NNNTGGCNN'
- adaptor: the adaptor sequence to be removed from the sample; this may or may not be unique; example: 'AGATCGGAAGAGCGGTTCAG'
- group: groupings for samples, may or may not be unique values. example: 'CNTRL'
An example sampleManifest file with multiplexing of one sample. Notice that the multiplexID test_1 is repeated, as Ro_Clip and Control_Clip are both found in the same fastq file, whereas test_2 is not multiplexed:
```
multiplex    sample          group    barcode      adaptor
test_1       Ro_Clip         CLIP     NNNTGGCNN    AGATCGGAAGAGCGGTTCAG
test_1       Control_Clip    CNTRL    NNNCGGANN    AGATCGGAAGAGCGGTTCAG
test_2       Ro_Clip2        CLIP     NNNCGTANN    AGATCGGAAGAGCGGTTCAG
```
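Since the multiplex values must match between the two manifests, a quick consistency check can help before running `-p check`. This is just a convenience sketch, assuming tab-separated files with the column orders shown above:

```bash
# List multiplex IDs in the sample manifest (column 1) that are missing
# from the multiplex manifest (column 2); no output means they all match.
comm -13 <(cut -f2 multiplex_manifest.tsv | sort -u) \
         <(cut -f1 sample_manifest.tsv | sort -u)
```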
- contrastManifest (Optional - required with DE_Method of MANORM or DIFFBIND)
- if MANORM:
- sample: the sample name, identified in the samplesManifest [sample] column, of the sample to compare. example: 'Ro_Clip'
- background: the background sample name, identified in the samplesManifest [sample] column, of the background to remove. example: 'Control_Clip'
- if DIFFBIND:
- sample: the sample group, identified in the samplesManifest [group] column, of the sample group to compare. example: 'CLIP' will include samples 'Ro_Clip' and 'Ro_Clip2'
- background: the background group name, identified in the samplesManifest [group] column, of the background group to remove. example: 'CNTRL' will include sample 'Control_Clip'
An example contrastManifest file for MANORM:
```
sample,background
Ro_Clip,Control_Clip
```
An example contrastManifest file for DIFFBIND:
```
sample,background
CLIP,CNTRL
```
The following are the rule processes run, depending on the config selection:
- rule check_manifest:
- rule qc_barcode:
- if demux_flag == "Y":
  - rule demultiplex:
  - rule rename_demux:
- if demux_flag == "N":
  - rule copy_nondemux:
- rule remove_adaptors:
- rule qc_fastq_pre:
- rule qc_fastq_post:
- rule qc_screen_validator:
- rule determine_splits:
- rule split_files:
- rule novoalign:
- rule create_bam_mm_unique:
- rule merge_splits_unique_mm:
- rule merge_mm_and_unique:
- rule multiqc:
- rule qc_alignment:
- rule dedup:
- if splice_aware == "Y":
  - rule mapq_recalc:
  - rule mapq_stats:
- rule create_beds_safs:
- rule feature_counts:
- rule project_annotations:
- rule peak_annotations:
- rule annotation_report:
- if DE_Method == "MANORM":
- rule MANORM_beds:
- rule MANORM_analysis:
- rule MANORM_post_processing:
- rule MANORM_RMD:
- if DE_Method == "DIFFBIND":
- rule DIFFBIND_beds:
- rule DIFFBIND_preprocess:
- rule DIFFBIND_analysis:
- rule DIFFBIND_report:
The following directories are created under the output_directory:
- 01_preprocess: this directory includes intermediate files to be deleted upon pipeline completion
- 02_bam: this directory includes the bam files for the pipeline, sorted by:
- 01_unmapped: unmapped reads
- 02_merged: unique and multi-mapped reads, sorted and indexed
- 03_dedup: 02_merged file deduplicated
- 03_peaks: this directory includes the bed and SAF files for the pipeline, sorted by:
- 01_bed: bed files sorted by all reads or unique reads
- 02_SAF: SAF files sorted by all reads or unique reads
- 03_allreadpeaks: peaks for all reads split by unique and MM peaks
- 03_alluniquereads: peaks for unique reads split by unique or MM peaks
- 04_annotation: this directory includes the annotation files at a project and sample level, sorted by:
- 01_project: includes project level annotation information
- 02_peaks: includes annotation bed files, complete annotated peak text files
- final annotation report (HTML) and table (TXT)
- 05_demethod: this directory is only produced when MANORM or DIFFBIND is selected as the DE_METHOD
- 01_input: this includes bed files for any samples being compared
- 02_analysis: this includes raw DE files (Excel for MANORM, text for DIFFBIND) by comparison
- 03_report: this includes the final reports (HTML) by comparison
- qc: this directory includes the qc reports, sorted by:
- multiqc_report: this includes the fastqc results, as well as fastq screen results of each sample before and after filtering
- qc_report: this includes barcode and alignment information of each sample before and after filtering
- log: this includes the slurm output files of the pipeline sorted by pipeline start time; copies of config and manifest files used in this specific pipeline run; error reporting script
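After a successful run, a quick listing should show this layout (a sketch; the directory names are those documented above):

```bash
# Top-level output directories after a completed run
ls /path/to/output/dir
# expected: 01_preprocess  02_bam  03_peaks  04_annotation  05_demethod  qc  log
# (05_demethod only appears when DEmethod is manorm or diffbind)
```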
- Check your email for a message regarding pipeline failure. You will receive an email from slurm@biowulf.nih.gov with the subject: Slurm Job_id=[#] Name=iCLIP Failed, Run time [time], FAILED, ExitCode 1
- Run the error report script
```
cd /[output_dir]/log/[time_of_run]
sh 00_create_error_report.sh
cat error.log
```
Review the report for the rules that errored and for the affected sample information. An example report is shown below:
```
The following error(s) were found in rules:
*********************************************
Error in rule rule1:
Error in rule rule2:
Error in rule rule3:

The following samples are affected by memory and must be deleted:
rule1.[sbatchid].sp=[sample_name].err:[E::hts_open_format] Disc quota exceeded

The following samples are affected by missing input files/output dir and should be reviewed:
rule2.[sbatchid].sp=[sample_name].err:[E::hts_open_format] Failed to open file "[file_name]" : No such file or directory

The following samples are affected by other error_rules and should be reviewed:
rule3.[sbatchid].sp=[sample_name].err:[E::hts_open_format] TIMEOUT
```
- Address the error(s) and restart the run:
```
cd /data/RBL_NCI/Pipelines/iCLIP/[version number]
module load snakemake

#unlock dir
sh run_snakemake.sh -p unlock -o /path/to/output/dir

#perform dry-run
sh run_snakemake.sh -p dry -o /path/to/output/dir

#submit to cluster
sh run_snakemake.sh -p cluster -o /path/to/output/dir
```