kfdrc_RNAseq_workflow.cwl
cwlVersion: v1.2
class: Workflow
id: kfdrc-rnaseq-workflow
label: Kids First DRC RNAseq Workflow
doc: |
# Kids First RNA-Seq Workflow V4
This is the Kids First RNA-Seq pipeline, which calculates gene and transcript isoform expression and detects fusions and splice junctions.
We have transitioned to this current version which upgrades several software components.
Our legacy workflow is still available as [v3.0.1](https://github.com/kids-first/kf-rnaseq-workflow/tree/v3.0.1), and on CAVATICA, [revision 8](https://cavatica.sbgenomics.com/public/apps/cavatica/apps-publisher/kfdrc-rnaseq-workflow/8)
<p align="center">
<img src="docs/kids_first_logo.svg" alt="Kids First repository logo" width="660px" />
</p>
<p align="center">
<a href="https://github.com/kids-first/kf-rnaseq-workflow/blob/main/LICENSE"><img src="https://img.shields.io/github/license/kids-first/kf-rnaseq-workflow.svg?style=for-the-badge"></a>
</p>
## Introduction
This pipeline optionally runs Cutadapt to trim adapters from the raw reads, converts alignment input to fastq if necessary, and passes the reads to STAR for alignment.
The alignment output is used by RSEM for gene expression abundance estimation and by rMATS for detection of differential alternative splicing events.
Additionally, Kallisto is used for quantification; it uses pseudoalignments to estimate gene abundance from the raw data.
Fusion calling is performed using Arriba and STAR-Fusion detection tools on the STAR alignment outputs.
Filtering and prioritization of fusion calls is done by annoFuse.
Metrics for the workflow are generated by RNA-SeQC.
Junction files for the workflow are generated by rMATS.
If you would like to run this workflow using the CAVATICA public app, a basic primer on running public apps can be found [here](https://www.notion.so/d3b/Starting-From-Scratch-Running-Cavatica-af5ebb78c38a4f3190e32e67b4ce12bb).
Alternatively, if you would like to run it locally using `cwltool`, a basic primer on that can be found [here](https://www.notion.so/d3b/Starting-From-Scratch-Running-CWLtool-b8dbbde2dc7742e4aff290b0a878344d) and combined with app-specific info from the readme below.
This workflow is the current production workflow, equivalent to this [CAVATICA public app](https://cavatica.sbgenomics.com/public/apps#cavatica/apps-publisher/kfdrc-rnaseq-workflow).
### Cutadapt
[Cutadapt v3.4](https://github.com/marcelm/cutadapt) Cut adapter sequences from raw reads if needed.
### [STAR](docs/STAR_2.7.10a.md)
[STAR v2.7.10a](https://doi.org/f4h523) RNA-Seq raw data alignment.
### [RSEM](docs/RSEM_1.3.1.md)
[RSEM v1.3.1](https://doi:10/cwg8n5) Calculation of gene expression.
### Kallisto
[Kallisto v0.43.1](https://doi:10.1038/nbt.3519) Raw data pseudoalignment to estimate gene abundance.
### [STAR-Fusion](docs/STAR-Fusion_1.10.1.md)
[STAR-Fusion v1.10.1](https://doi:10.1101/120295) Fusion detection for `STAR` chimeric reads.
### [Arriba](docs/ARRIBA_2.2.1.md)
[Arriba v2.2.1](https://github.com/suhrig/arriba/) Fusion caller that uses `STAR` aligned reads and chimeric reads output.
### [annoFuse](docs/D3B_ANNOFUSE.md)
[annoFuse 0.92.0](https://github.com/d3b-center/annoFuse/releases/tag/v0.92.0) Filter and prioritize fusion calls. For more information, please see the following [paper](https://www.biorxiv.org/content/10.1101/839738v3).
### RNA-SeQC
[RNA-SeQC v2.3.4](https://github.com/broadinstitute/rnaseqc) Generate metrics such as gene and transcript counts, sense/antisense mapping, mapping rates, etc.
### [rMATS](docs/D3B_RMATS.md)
[rMATS turbo v4.1.2](https://github.com/Xinglab/rmats-turbo) Computational tool to detect differential alternative splicing events from RNA-Seq data
### [T1k](docs/T1K_README.md)
[T1k v1.0.5](https://github.com/mourisl/T1K/) Genotype highly polymorphic genes (e.g. HLA) with bulk RNA-seq data.
## Usage
### Runtime Estimates:
Based on a test set of five input BAMs, CAVATICA compute and storage estimates:
- Typical 2 hour run time, 10 hours is a higher end possibility
- Cost:
- Pure spot instances with no terminations: $2.37 mean
- Pure on-demand: $5.19 mean
- Warning: If spot instance kill rate is high, especially for `c5.9xlarge` instance type, the cost could end up greater than on-demand
- Storage:
- Total output size 6GB mean
- Storage estimate ~ $0.14 per month
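For rough budgeting, the means above can be scaled to a cohort. The following is a minimal sketch, not part of the workflow: it assumes costs scale linearly with sample count, that the storage estimate applies to one sample's ~6GB of outputs, and that the means from the five-BAM test set hold for your data. The function name `estimate_cohort_cost` is hypothetical.

```python
# Per-sample means quoted in the runtime estimates above (USD).
SPOT_COST_MEAN = 2.37
ON_DEMAND_COST_MEAN = 5.19
OUTPUT_GB_MEAN = 6
# Assumption: the ~$0.14/month figure covers one sample's ~6 GB of outputs.
STORAGE_COST_PER_SAMPLE_MONTH = 0.14


def estimate_cohort_cost(n_samples: int, months_stored: int = 1) -> dict:
    """Scale the per-sample means linearly (an assumption) to n_samples."""
    return {
        "compute_spot_usd": round(n_samples * SPOT_COST_MEAN, 2),
        "compute_on_demand_usd": round(n_samples * ON_DEMAND_COST_MEAN, 2),
        "output_gb": n_samples * OUTPUT_GB_MEAN,
        "storage_usd": round(
            n_samples * STORAGE_COST_PER_SAMPLE_MONTH * months_stored, 2
        ),
    }


if __name__ == "__main__":
    # 100 samples, outputs kept for a year.
    print(estimate_cohort_cost(100, months_stored=12))
```

Because spot terminations can push the real cost above the on-demand figure, treat the spot number as a lower bound rather than an expectation.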
### Inputs common:
```yaml
inputs:
output_basename: { type: 'string?', doc: "String to use as basename for outputs. Will use read1 file basename if null." }
reads1: { type: File, doc: "Input fastq file, gzipped or uncompressed OR alignment file" }
reads2: { type: 'File?', doc: "If paired end, R2 reads files, gzipped or uncompressed" }
is_paired_end: {type: 'boolean?', doc: "For BAM inputs, are the reads paired end?"}
wf_strand_param: { type: ['null', {type: 'enum', name: wf_strand_param, symbols: ["default",
"rf-stranded", "fr-stranded"]}], doc: "use 'default' for unstranded/auto, 'rf-stranded' if read1 in the fastq read pairs is reverse complement to the transcript, 'fr-stranded' if read1 same sense as transcript" }
gtf_anno: { type: 'File', doc: "General transfer format (gtf) file with gene models corresponding to fasta reference" }
star_fusion_genome_untar_path: {type: 'string?', doc: "This is what the path will be when genome_tar is unpackaged", default: "GRCh38_v39_CTAT_lib_Mar242022.CUSTOM"}
reference_fasta: {type: 'File', doc: "GRCh38.primary_assembly.genome.fa", "sbg:suggestedValue": {
class: File, path: 5f500135e4b0370371c051b4, name: GRCh38.primary_assembly.genome.fa,
secondaryFiles: [{class: File, path: 62866da14d85bc2e02ba52db, name: GRCh38.primary_assembly.genome.fa.fai}]},
secondaryFiles: ['.fai']}
```
### Alignment (SAM/BAM/CRAM) input-specific:
```yaml
inputs:
reads1: File
```
### PE Fastq input-specific:
```yaml
inputs:
reads1: File
reads2: File
```
### SE Fastq input-specific:
```yaml
inputs:
reads1: File
```
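The three input modes above differ only in which reads inputs are set. As an illustration, the sketch below builds the reads portion of a CWL input object for each mode and prints it as JSON, which `cwltool` accepts as a job file. All file paths are hypothetical, the helper names are not part of the workflow, and the required reference inputs (for example `gtf_anno`, `STARgenome`) are deliberately left out.

```python
import json
from typing import Optional


def file_obj(path: str) -> dict:
    """CWL File object as written in an input (job) document."""
    return {"class": "File", "path": path}


def make_reads_inputs(
    reads1: str,
    reads2: Optional[str] = None,
    is_paired_end: Optional[bool] = None,
) -> dict:
    """Build only the reads-related inputs; reference inputs are omitted."""
    job = {"reads1": file_obj(reads1)}
    if reads2 is not None:
        job["reads2"] = file_obj(reads2)
    if is_paired_end is not None:
        # Only meaningful for alignment (SAM/BAM/CRAM) input.
        job["is_paired_end"] = is_paired_end
    return job


if __name__ == "__main__":
    # Hypothetical paths, one per input mode.
    examples = {
        "alignment": make_reads_inputs("sample.bam", is_paired_end=True),
        "pe_fastq": make_reads_inputs("sample_R1.fastq.gz", "sample_R2.fastq.gz"),
        "se_fastq": make_reads_inputs("sample.fastq.gz"),
    }
    print(json.dumps(examples, indent=2))
```

After adding the reference inputs, one of these objects can be saved to a file and passed as the second argument to `cwltool`, after the workflow file.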
### Samtools fastq:
```yaml
samtools_fastq_cores: { type: 'int?', doc: "Num cores for align2fastq conversion, if input is an alignment file", default: 16 }
cram_reference: { type: 'File?', secondaryFiles: [.fai], doc: "If input align is cram and you are uncertain all contigs are registered at http://www.ebi.ac.uk/ena/cram/md5/, provide here" }
```
### cutadapt:
```yaml
r1_adapter: { type: 'string?', doc: "Optional input. If the input reads have already been trimmed, leave these as null. If they do need trimming, supply the adapters." }
r2_adapter: { type: 'string?', doc: "Optional input. If the input reads have already been trimmed, leave these as null. If they do need trimming, supply the adapters." }
min_len: { type: 'int?', doc: "If you do not use this option, reads that have a length of zero (empty reads) are kept in the output", default: 20 }
quality_base: { type: 'int?', doc: "Phred scale used", default: 33 }
quality_cutoff: {type: 'int[]?', doc: "Quality trim cutoff, see https://cutadapt.readthedocs.io/en/v3.4/guide.html#quality-trimming for how 5' 3' is handled" }
```
### STAR:
This section may seem overwhelming.
Many defaults are set.
Kids First favors setting/overriding defaults with the "arriba-heavy" values specified in the [STAR docs](docs/STAR_2.7.10a.md); however, if the sample is not a tumor sample, then GTEx is preferred.
```yaml
outSAMattrRGline: {type: 'string?', doc: "Suggested setting format is: ID:sample_name LB:aliquot_id PL:platform SM:BSID for example ID:7316-242 LB:750189 PL:ILLUMINA SM:BS_W72364MN. STAR will automatically convert unquoted spaces into tabs. If you wish to have a value with whitespace, the KEY:VALUE must be enclosed in double quotes. Refer to the STAR documentation for complete input details. If not provided, value will be autogenerated based on the reads1 file basename."}
STARgenome: {type: File, doc: "Tar gzipped reference that will be unzipped at run time", "sbg:suggestedValue": {class: File, path: 62853e7ad63f7c6d8d7ae5a7,
name: STAR_2.7.10a_GENCODE39.tar.gz}}
runThreadN: {type: 'int?', default: 36, doc: "Adjust this value to change number of cores used."}
twopassMode: {type: ['null', {type: enum, name: twopassMode, symbols: ["Basic",
"None"]}], default: "Basic", doc: "Enable two pass mode to detect novel splice events. Default is basic (on)."}
alignSJoverhangMin: {type: 'int?', default: 8, doc: "minimum overhang for unannotated junctions. ENCODE default used."}
outFilterMismatchNoverLmax: {type: 'float?', default: 0.1, doc: "alignment will be output only if its ratio of mismatches to *mapped* length is less than or equal to this value"}
outFilterType: {type: ['null', {type: enum, name: outFilterType, symbols: ["BySJout",
"Normal"]}], default: "BySJout", doc: "type of filtering. Normal: standard filtering using only current alignment. BySJout (default): keep only those reads that contain junctions that passed filtering into SJ.out.tab."}
outFilterScoreMinOverLread: {type: 'float?', default: 0.33, doc: "alignment will be output only if its score is higher than or equal to this value, normalized to read length (sum of mate's lengths for paired-end reads)"}
outFilterMatchNminOverLread: {type: 'float?', default: 0.33, doc: "alignment will be output only if the number of matched bases is higher than or equal to this value, normalized to the read length (sum of mates' lengths for paired-end reads)"}
outReadsUnmapped: {type: ['null', {type: enum, name: outReadsUnmapped, symbols: [
"None", "Fastx"]}], default: "None", doc: "output of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s). None (default): no output. Fastx: output in separate fasta/fastq files, Unmapped.out.mate1/2."}
limitSjdbInsertNsj: {type: 'int?', default: 1200000, doc: "maximum number of junction to be inserted to the genome on the fly at the mapping stage, including those from annotations and those detected in the 1st step of the 2-pass run"}
outSAMstrandField: {type: ['null', {type: enum, name: outSAMstrandField, symbols: [
"intronMotif", "None"]}], default: "intronMotif", doc: "Cufflinks-like strand field flag. None: not used. intronMotif (default): strand derived from the intron motif. This option changes the output alignments: reads with inconsistent and/or non-canonical introns are filtered out."}
outFilterIntronMotifs: {type: ['null', {type: enum, name: outFilterIntronMotifs,
symbols: ["None", "RemoveNoncanonical", "RemoveNoncanonicalUnannotated"]}],
default: "None", doc: "filter alignment using their motifs. None (default): no filtering. RemoveNoncanonical: filter out alignments that contain non-canonical junctions RemoveNoncanonicalUnannotated: filter out alignments that contain non-canonical unannotated junctions when using annotated splice junctions database. The annotated non-canonical junctions will be kept."}
alignSoftClipAtReferenceEnds: {type: ['null', {type: enum, name: alignSoftClipAtReferenceEnds,
symbols: ["Yes", "No"]}], default: "Yes", doc: "allow the soft-clipping of the alignments past the end of the chromosomes. Yes (default): allow. No: prohibit, useful for compatibility with Cufflinks"}
quantMode: {type: ['null', {type: enum, name: quantMode, symbols: [TranscriptomeSAM
GeneCounts, '-', TranscriptomeSAM, GeneCounts]}], default: TranscriptomeSAM
GeneCounts, doc: "types of quantification requested. -: none. TranscriptomeSAM: output SAM/BAM alignments to transcriptome into a separate file GeneCounts: count reads per gene. Choices are additive, so default is 'TranscriptomeSAM GeneCounts'"}
outSAMtype: {type: ['null', {type: enum, name: outSAMtype, symbols: ["BAM Unsorted",
"None", "BAM SortedByCoordinate", "SAM Unsorted", "SAM SortedByCoordinate"]}],
default: "BAM Unsorted", doc: "type of SAM/BAM output. None: no SAM/BAM output. Otherwise, first word is output type (BAM or SAM), second is sort type (Unsorted or SortedByCoordinate)"}
outSAMunmapped: {type: ['null', {type: enum, name: outSAMunmapped, symbols: ["Within",
"None", "Within KeepPairs"]}], default: "Within", doc: "output of unmapped reads in the SAM format. None: no output. Within (default): output unmapped reads within the main SAM file (i.e. Aligned.out.sam) Within KeepPairs: record unmapped mate for each alignment, and, in case of unsorted output, keep it adjacent to its mapped mate. Only affects multi-mapping reads"}
genomeLoad: {type: ['null', {type: enum, name: genomeLoad, symbols: ["NoSharedMemory",
"LoadAndKeep", "LoadAndRemove", "LoadAndExit"]}], default: "NoSharedMemory",
doc: "mode of shared memory usage for the genome file. In this context, the default value makes the most sense, the others are there as a courtesy."}
chimMainSegmentMultNmax: {type: 'int?', default: 1, doc: "maximum number of multi-alignments for the main chimeric segment. =1 will prohibit multimapping main segments"}
outSAMattributes: {type: 'string?', default: 'NH HI AS nM NM ch', doc: "a string of desired SAM attributes, in the order desired for the output SAM. Tags can be listed in any combination/order. Please refer to the STAR manual, as there are numerous combinations: https://raw.githubusercontent.com/alexdobin/star_2.7.10a/master/doc/STARmanual.pdf"}
alignInsertionFlush: {type: ['null', {type: enum, name: alignInsertionFlush, symbols: [
"None", "Right"]}], default: "None", doc: "how to flush ambiguous insertion positions. None (default): insertions not flushed. Right: insertions flushed to the right. STAR Fusion recommended (SF)"}
alignIntronMax: {type: 'int?', default: 1000000, doc: "maximum intron size. SF recommends 100000"}
alignMatesGapMax: {type: 'int?', default: 1000000, doc: "maximum genomic distance between mates, SF recommends 100000 to avoid readthru fusions within 100k"}
alignSJDBoverhangMin: {type: 'int?', default: 1, doc: "minimum overhang for annotated junctions. SF recommends 10"}
outFilterMismatchNmax: {type: 'int?', default: 999, doc: "maximum number of mismatches per pair, large number switches off this filter"}
alignSJstitchMismatchNmax: {type: 'string?', default: "5 -1 5 5", doc: "maximum number of mismatches for stitching of the splice junctions. Value '5 -1 5 5' improves SF chimeric junctions, also recommended by arriba (AR)"}
alignSplicedMateMapLmin: {type: 'int?', default: 0, doc: "minimum mapped length for a read mate that is spliced. SF recommends 30"}
alignSplicedMateMapLminOverLmate: {type: 'float?', default: 0.5, doc: "alignSplicedMateMapLmin normalized to mate length. SF recommends 0, AR 0.5"}
chimJunctionOverhangMin: {type: 'int?', default: 10, doc: "minimum overhang for a chimeric junction. SF recommends 8, AR 10"}
chimMultimapNmax: {type: 'int?', default: 50, doc: "maximum number of chimeric multi-alignments. SF recommends 20, AR 50."}
chimMultimapScoreRange: {type: 'int?', default: 1, doc: "the score range for multi-mapping chimeras below the best chimeric score. Only works with chimMultimapNmax > 1. SF recommends 3"}
chimNonchimScoreDropMin: {type: 'int?', default: 20, doc: "int>=0: to trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be greater than this value. SF recommends 10"}
chimOutJunctionFormat: {type: 'int?', default: 1, doc: "formatting type for the Chimeric.out.junction file, value 1 REQUIRED for SF"}
chimOutType: {type: ['null', {type: enum, name: chimOutType, symbols: ["Junctions SeparateSAMold WithinBAM SoftClip", "Junctions", "SeparateSAMold", "WithinBAM SoftClip", "WithinBAM HardClip", "Junctions SeparateSAMold", "Junctions WithinBAM SoftClip", "Junctions WithinBAM HardClip", "Junctions SeparateSAMold WithinBAM HardClip", "SeparateSAMold WithinBAM SoftClip", "SeparateSAMold WithinBAM HardClip"]}], default: "Junctions WithinBAM SoftClip", doc: "type of chimeric output. Args are additive, and defined as such - Junctions: Chimeric.out.junction. SeparateSAMold: output old SAM into separate Chimeric.out.sam file. WithinBAM: output into main aligned BAM files (Aligned.*.bam). WithinBAM HardClip: hard-clipping in the CIGAR for supplemental chimeric alignments. WithinBAM SoftClip: soft-clipping in the CIGAR for supplemental chimeric alignments"}
chimScoreDropMax: {type: 'int?', default: 30, doc: "max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length. AR recommends 30"}
chimScoreJunctionNonGTAG: {type: 'int?', default: -1, doc: "penalty for a non-GT/AG chimeric junction. default -1, SF recommends -4, AR -1"}
chimScoreSeparation: {type: 'int?', default: 1, doc: "int>=0: minimum difference (separation) between the best chimeric score and the next one. AR recommends 1"}
chimSegmentMin: {type: 'int?', default: 10, doc: "minimum length of chimeric segment length, if ==0, no chimeric output. REQUIRED for SF, 12 is their default, AR recommends 10"}
chimSegmentReadGapMax: {type: 'int?', default: 3, doc: "maximum gap in the read sequence between chimeric segments. AR recommends 3"}
outFilterMultimapNmax: {type: 'int?', default: 50, doc: "max number of multiple alignments allowed for a read: if exceeded, the read is considered unmapped. ENCODE value is default. AR recommends 50"}
peOverlapMMp: {type: 'float?', default: 0.01, doc: "maximum proportion of mismatched bases in the overlap area. SF recommends 0.1"}
peOverlapNbasesMin: {type: 'int?', default: 10, doc: "minimum number of overlap bases to trigger mates merging and realignment. Specify >0 value to switch on the 'merging of overlapping mates' algorithm. SF recommends 12, AR recommends 10"}
```
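Several of the STAR inputs above are single strings that hold multiple space-separated values (for example `alignSJstitchMismatchNmax` and `chimOutType`). STAR itself takes options as `--parameterName value(s)`, so such strings end up as separate tokens on the command line. The sketch below illustrates that expansion for a handful of the defaults listed above. How the CWL tool in this repository actually assembles its command is not shown here, so treat `to_star_args` as a hypothetical helper rather than the workflow's implementation.

```python
import shlex

# A few of the workflow defaults listed above.
star_params = {
    "runThreadN": 36,
    "twopassMode": "Basic",
    "outSAMtype": "BAM Unsorted",
    "alignSJstitchMismatchNmax": "5 -1 5 5",
    "chimOutType": "Junctions WithinBAM SoftClip",
    "peOverlapMMp": 0.01,
}


def to_star_args(params: dict) -> list:
    """Expand {name: value} into STAR-style ['--name', 'v1', 'v2', ...] tokens."""
    args = []
    for name, value in params.items():
        args.append(f"--{name}")
        # Multi-value settings are split on whitespace into separate tokens.
        args.extend(shlex.split(str(value)))
    return args


if __name__ == "__main__":
    print(" ".join(["STAR"] + to_star_args(star_params)))
```

`outSAMattrRGline` is intentionally left out of this example because its quoting rules differ; values containing whitespace must stay inside a quoted `KEY:VALUE` pair.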
### arriba:
```yaml
arriba_memory: {type: 'int?', doc: "Mem intensive tool. Set in GB", default: 64}
```
### STAR Fusion:
```yaml
FusionGenome: {type: 'File', doc: "STAR-Fusion CTAT Genome lib", "sbg:suggestedValue": {
class: File, path: 62853e7ad63f7c6d8d7ae5a8, name: GRCh38_v39_CTAT_lib_Mar242022.CUSTOM.tar.gz}}
compress_chimeric_junction: {type: 'boolean?', default: true, doc: 'If part of a
workflow, recommend compressing this file as final output'}
```
### RNAseQC:
```yaml
RNAseQC_GTF: {type: 'File', doc: "gtf file from `gtf_anno` that has been collapsed GTEx-style", "sbg:suggestedValue": {class: File, path: 62853e7ad63f7c6d8d7ae5a3,
name: gencode.v39.primary_assembly.rnaseqc.stranded.gtf}}
```
### kallisto
```yaml
kallisto_idx: {type: 'File', doc: "Specialized index of a **transcriptome** fasta file for kallisto", "sbg:suggestedValue": {class: File, path: 62853e7ad63f7c6d8d7ae5a6,
name: RSEM_GENCODE39.transcripts.kallisto.idx}}
```
### RSEM:
```yaml
RSEMgenome: {type: 'File', doc: "RSEM reference tar ball", "sbg:suggestedValue": {
class: File, path: 62853e7ad63f7c6d8d7ae5a5, name: RSEM_GENCODE39.tar.gz}}
estimate_rspd: {type: 'boolean?', doc: "Set this option if you want to estimate the read start position distribution (RSPD) from data", default: true}
```
### annoFuse:
```yaml
sample_name: {type: 'string?', doc: "Sample ID of the input reads. If not provided, will use reads1 file basename."}
annofuse_col_num: {type: 'int?', doc: "0-based column number in file of fusion name."}
fusion_annotator_ref: { type: 'File', doc: "Tar ball with fusion_annot_lib.idx and blast_pairs.idx from STAR-Fusion CTAT Genome lib. Can be same as FusionGenome, but only two files needed from that package", "sbg:suggestedValue": { class: 'File', path: '63cff818facdd82011c8d6fe', name: 'GRCh38_v39_fusion_annot_custom.tar.gz' }}
```
### rmats
```yaml
rmats_variable_read_length: {type: 'boolean?', doc: "Allow reads with lengths that differ from --readLength to be processed. --readLength will still be used to determine IncFormLen and SkipFormLen."}
rmats_novel_splice_sites: {type: 'boolean?', doc: "Select for novel splice site detection or unannotated splice sites. 'true' to detect or add this parameter, 'false' to disable denovo detection. Tool Default: false"}
rmats_stat_off: {type: 'boolean?', doc: "Select to skip statistical analysis, either between two groups or on single sample group. 'true' to add this parameter. Tool default: false"}
rmats_allow_clipping: {type: 'boolean?', doc: "Allow alignments with soft or hard clipping to be used."}
rmats_threads: {type: 'int?', doc: "Threads to allocate to RMATs."}
rmats_ram: {type: 'int?', doc: "GB of RAM to allocate to RMATs."}
```
### T1k
```yaml
run_t1k: { type: 'boolean?', default: true, doc: "Set to false to disable T1k HLA typing" }
hla_rna_ref_seqs: { type: 'File?', doc: "FASTA file containing the HLA allele reference sequences for RNA." }
hla_rna_gene_coords: { type: 'File?', doc: "FASTA file containing the coordinates of the HLA genes for RNA." }
```
### Run:
1) Reads inputs:
- For PE fastq input, please enter the reads 1 file in `reads1` and the reads 2 file in `reads2`.
- For SE fastq input, enter the single ends reads file in `reads1` and leave `reads2` empty as it is optional.
- For alignment input (SAM/BAM/CRAM), please enter the reads file in `reads1` and leave `reads2` empty as it is optional.
2) `r1_adapter` and `r2_adapter` are OPTIONAL:
- If the input reads have already been trimmed, leave these as null and the cutadapt step will simply pass the fastq files on to STAR.
- If they do need trimming, supply the adapters and the cutadapt step will trim, and pass trimmed fastqs along.
- `min_len` if adapter is trimmed, currently set to min `20` bp. Change this as you see fit
- `quality_base` is set to phred scale `33` by default if trimming. Some older data uses phred `64`; change this if your data differs
- `quality_cutoff` if adapter is trimmed and you want to set a min bp quality. A single value will apply to both paired ends, 2 values will allow you to assign a different one to each (unusual)
3) `wf_strand_param` is now *optional* as the workflow will try to determine strandedness for you. Note: if the workflow fails to detect a strandedness, it will fail. If you'd like to override autodetect, it is a workflow convenience param so that, if you input the following, the equivalent will propagate to the four tools that use that parameter:
- `default`: 'rsem_std': null, 'kallisto_std': null, 'rnaseqc_std': null, 'arriba_std': null. This means unstranded or auto in the case of arriba.
- `rf-stranded`: 'rsem_std': 0, 'kallisto_std': 'rf-stranded', 'rnaseqc_std': 'rf', 'arriba_std': 'reverse'. This means if read1 in the input fastq/bam is reverse complement to the transcript that it maps to.
- `fr-stranded`: 'rsem_std': 1, 'kallisto_std': 'fr-stranded', 'rnaseqc_std': 'fr', 'arriba_std': 'yes'. This means if read1 in the input fastq/bam is the same sense (maps 5' to 3') to the transcript that it maps to.
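The mapping in the three bullets above can be written as a lookup table. This is a minimal sketch of that table in Python; the values are copied from the bullets, while the `expand_strand_param` name and the error handling are illustrative and not taken from the workflow.

```python
# wf_strand_param -> per-tool strandedness settings, as listed above.
STRAND_PARAMS = {
    "default": {
        "rsem_std": None,
        "kallisto_std": None,
        "rnaseqc_std": None,
        "arriba_std": None,
    },
    "rf-stranded": {
        "rsem_std": 0,
        "kallisto_std": "rf-stranded",
        "rnaseqc_std": "rf",
        "arriba_std": "reverse",
    },
    "fr-stranded": {
        "rsem_std": 1,
        "kallisto_std": "fr-stranded",
        "rnaseqc_std": "fr",
        "arriba_std": "yes",
    },
}


def expand_strand_param(wf_strand_param: str) -> dict:
    """Return a copy of the tool-level settings for one wf_strand_param value."""
    if wf_strand_param not in STRAND_PARAMS:
        allowed = ", ".join(sorted(STRAND_PARAMS))
        raise ValueError(f"unknown wf_strand_param {wf_strand_param!r}; use one of: {allowed}")
    return dict(STRAND_PARAMS[wf_strand_param])


if __name__ == "__main__":
    print(expand_strand_param("rf-stranded"))
```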
4) Suggested STAR `outSAMattrRGline` format is `ID:sample_name LB:aliquot_id PL:platform SM:BSID`:
For example, `ID:7316-242 LB:750189 PL:ILLUMINA SM:BS_W72364MN`
These `KEY:VALUE` fields can be separated by either a whitespace or tab
character. Any unquoted whitespace will be automatically converted to a tab
value by STAR. If you wish to include whitespaces in your `VALUE`, you must put
double quotes around the `KEY:VALUE`. For example if you wanted a `DS` key with a
`I love read groups` value, the entry would look like: `ID:xxx "DS:I love read
groups"`. See the STAR documentation on `outSAMattrRGline` for complete details.
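To make the quoting rule concrete, here is a small sketch that assembles an `outSAMattrRGline` string from key/value pairs and wraps any `KEY:VALUE` whose value contains whitespace in double quotes. The `build_rg_line` helper is hypothetical and only encodes the rule described above; it does not validate read group keys.

```python
def build_rg_line(fields: dict) -> str:
    """Join KEY:VALUE pairs with spaces, quoting pairs whose value has whitespace."""
    parts = []
    for key, value in fields.items():
        pair = f"{key}:{value}"
        # STAR converts unquoted whitespace to tabs, so a value with spaces
        # must travel inside a double-quoted KEY:VALUE pair.
        if any(ch.isspace() for ch in str(value)):
            pair = f'"{pair}"'
        parts.append(pair)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_rg_line({"ID": "7316-242", "LB": "750189", "PL": "ILLUMINA", "SM": "BS_W72364MN"}))
    print(build_rg_line({"ID": "xxx", "DS": "I love read groups"}))
```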
5) Suggested REFERENCE inputs are:
- `reference_fasta`: [GRCh38.primary_assembly.genome.fa](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh38.primary_assembly.genome.fa.gz), will need to unzip
- `gtf_anno`: [gencode.v39.primary_assembly.annotation.gtf](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.primary_assembly.annotation.gtf.gz), will need to unzip
- `FusionGenome`: GRCh38_v39_CTAT_lib_Mar242022.CUSTOM.tar.gz. A custom library built using instructions from (https://github.com/STAR-Fusion/STAR-Fusion/wiki/installing-star-fusion#preparing-the-genome-resource-lib), using GENCODE 39 reference.
- `RNAseQC_GTF`: gencode.v39.primary_assembly.rnaseqc.stranded.gtf OR gencode.v39.primary_assembly.rnaseqc.unstranded.gtf, built using `gtf_anno` and following build instructions [here](https://github.com/broadinstitute/rnaseqc#usage) and [here](https://github.com/broadinstitute/gtex-pipeline/tree/master/gene_model)
- `RSEMgenome`: RSEM_GENCODE39.tar.gz, built using the `reference_fasta` and `gtf_anno`, following `GENCODE` instructions from [here](https://deweylab.github.io/RSEM/README.html), then creating a tar ball of the results.
- `STARgenome`: STAR_2.7.10a_GENCODE39.tar.gz, created using the star_2.7.10a_genome_generate.cwl tool, using the `reference_fasta`, `gtf_anno`, and setting `sjdbOverhang` to 100
- `kallisto_idx`: RSEM_GENCODE39.transcripts.kallisto.idx, built from RSEM GENCODE 39 transcript fasta, in `RSEMgenome` tar ball, following instructions from [here](https://pachterlab.github.io/kallisto/manual)
- `hla_rna_ref_seqs`: hla_v3.43.0_gencode_v39_rna_seq.fa, created using https://github.com/mourisl/T1K/blob/master/t1k-build.pl with [hla.dat v3.43.0](http://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla.dat) and [GENCODE v39 primary assembly GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.primary_assembly.annotation.gtf.gz)
- `hla_rna_gene_coords`: hla_v3.43.0_gencode_v39_rna_coord.fa, created using https://github.com/mourisl/T1K/blob/master/t1k-build.pl with [hla.dat v3.43.0](http://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla.dat) and [GENCODE v39 primary assembly GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.primary_assembly.annotation.gtf.gz)
6) rMATS requires the length of the reads in the sample. This workflow will attempt to estimate the read length based on a polling of reads. If the user wishes to override this value they can set `read_length_median` to their desired read length. Additionally, there is a `rmats_variable_read_length` boolean that users can set if their reads are not uniform in length. This workflow will poll the reads and set that value to true if it observes multiple read lengths. Like read length, user-provided input will override this guess.
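If you want to check what values to expect before overriding them, the idea of polling reads can be sketched as follows. This is an illustration only: the workflow's own polling method and sample size are not described here, so the `max_reads` default, the use of the median, and both function names are assumptions. The sketch also assumes four-line FASTQ records.

```python
import gzip
import statistics
from itertools import islice
from typing import Iterable, Tuple


def poll_read_lengths(lines: Iterable[str], max_reads: int = 10000) -> Tuple[int, bool]:
    """Return (median read length, True if more than one length was seen)."""
    lengths = []
    # Assumes four-line FASTQ records; the sequence is the second line of each.
    for i, line in enumerate(islice(lines, max_reads * 4)):
        if i % 4 == 1:
            lengths.append(len(line.strip()))
    if not lengths:
        raise ValueError("no reads found")
    return int(statistics.median(lengths)), len(set(lengths)) > 1


def poll_fastq(path: str, max_reads: int = 10000) -> Tuple[int, bool]:
    """Open a plain or gzipped FASTQ and poll its first max_reads records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return poll_read_lengths(handle, max_reads)


if __name__ == "__main__":
    demo = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGTAC", "+", "IIIIII", "@r3", "ACGT", "+", "IIII"]
    print(poll_read_lengths(demo))
```

The two returned values correspond to what you would supply as `read_length_median` and `rmats_variable_read_length` if you chose to override the workflow's guess.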
7) While `output_basename`, `sample_name`, and `outSAMattrRGline` are optional, it is strongly recommended that the user provide these values for data quality purposes. If the user does not provide these values, the basename of the reads1 file will be substituted in their place.
- `output_basename` and `sample_name` values will become `reads1.basename.split('.')[0]`
- `outSAMattrRGline` value will become `ID:reads1.basename.split('.')[0]_1 LB:reads1.basename.split('.')[0] SM:reads1.basename.split('.')[0] PL:Illumina`
- Additionally, if no `outSAMattrRGline` input is provided a disclaimer will be added to the `@RG` header line that reads: `DS:Values for this read group were auto-generated and may not reflect the true read group information.`
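The substitution rules above can be previewed with a few lines of Python. This sketch mirrors the `split('.')[0]` expressions quoted in the bullets; the `default_names` helper and the example file names are hypothetical, and the `DS` disclaimer is not reproduced because its exact placement in the header is not specified here.

```python
def default_names(reads1_basename: str) -> dict:
    """Values substituted when output_basename, sample_name, and outSAMattrRGline are unset."""
    # Everything before the first '.' in the reads1 file name.
    stem = reads1_basename.split(".")[0]
    return {
        "output_basename": stem,
        "sample_name": stem,
        "outSAMattrRGline": f"ID:{stem}_1 LB:{stem} SM:{stem} PL:Illumina",
    }


if __name__ == "__main__":
    print(default_names("BS_W72364MN.R1.fastq.gz"))
    # A dot inside the sample identifier truncates the name.
    print(default_names("sample.v2.bam"))
```

Note that the split happens at the first `.`, so a file named `sample.v2.bam` yields `sample`; this is one reason to provide these values explicitly.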
### Outputs:
```yaml
cutadapt_stats: {type: 'File?', outputSource: cutadapt_3-4/cutadapt_stats, doc: "Cutadapt stats output, only if adapter is supplied."}
STAR_sorted_genomic_cram: {type: 'File', outputSource: samtools_bam_to_cram/output,
doc: "STAR sorted and indexed genomic alignment cram"}
STAR_chimeric_junctions: {type: 'File?', outputSource: star_fusion_1-10-1/chimeric_junction_compressed,
doc: "STAR chimeric junctions"}
STAR_gene_count: {type: 'File', outputSource: star_2-7-10a/gene_counts, doc: "STAR genecounts"}
STAR_junctions_out: {type: 'File', outputSource: star_2-7-10a/junctions_out, doc: "STAR junction reads"}
STAR_final_log: {type: 'File', outputSource: star_2-7-10a/log_final_out, doc: "STAR metrics log file of unique, multi-mapping, unmapped, and chimeric reads"}
STAR-Fusion_results: {type: 'File', outputSource: star_fusion_1-10-1/abridged_coding,
doc: "STAR fusion detection from chimeric reads"}
arriba_fusion_results: {type: 'File', outputSource: arriba_fusion_2-2-1/arriba_fusions,
doc: "Fusion output from Arriba"}
arriba_fusion_viz: {type: 'File', outputSource: arriba_draw_2-2-1/arriba_pdf, doc: "pdf output from Arriba"}
RSEM_isoform: {type: 'File', outputSource: rsem/isoform_out, doc: "RSEM isoform expression estimates"}
RSEM_gene: {type: 'File', outputSource: rsem/gene_out, doc: "RSEM gene expression estimates"}
RNASeQC_Metrics: {type: 'File', outputSource: rna_seqc/Metrics, doc: "Metrics on mapping, intronic, exonic rates, count information, etc"}
RNASeQC_counts: {type: 'File', outputSource: supplemental/RNASeQC_counts, doc: "Contains gene tpm, gene read, and exon counts"}
kallisto_Abundance: {type: 'File', outputSource: kallisto/abundance_out, doc: "Gene abundance output from STAR genomic bam file"}
annofuse_filtered_fusions_tsv: {type: 'File?', outputSource: annofuse/annofuse_filtered_fusions_tsv,
doc: "Filtered fusions called by annoFuse."}
rmats_filtered_alternative_3_prime_splice_sites_jc: {type: 'File', outputSource: rmats/filtered_alternative_3_prime_splice_sites_jc,
doc: "Alternative 3 prime splice sites JC.txt output from RMATs containing only those calls with 10 or more junction spanning read counts of support"}
rmats_filtered_alternative_5_prime_splice_sites_jc: {type: 'File', outputSource: rmats/filtered_alternative_5_prime_splice_sites_jc,
doc: "Alternative 5 prime splice sites JC.txt output from RMATs containing only those calls with 10 or more junction spanning read counts of support"}
rmats_filtered_mutually_exclusive_exons_jc: {type: 'File', outputSource: rmats/filtered_mutually_exclusive_exons_jc,
doc: "Mutually exclusive exons JC.txt output from RMATs containing only those calls with 10 or more junction spanning read counts of support"}
rmats_filtered_retained_introns_jc: {type: 'File', outputSource: rmats/filtered_retained_introns_jc,
doc: "Retained introns JC.txt output from RMATs containing only those calls with 10 or more junction spanning read counts of support"}
rmats_filtered_skipped_exons_jc: {type: 'File', outputSource: rmats/filtered_skipped_exons_jc,
doc: "Skipped exons JC.txt output from RMATs containing only those calls with 10 or more junction spanning read counts of support"}
t1k_genotype_tsv: {type: 'File?', outputSource: t1k/genotype_tsv, doc: "Genotyping results from T1k" }
```
## Reference build notes:
- STAR-Fusion reference built with command `/usr/local/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl --gtf gencode.v39.primary_assembly.annotation.gtf --annot_filter_rule ../AnnotFilterRule.pm --CPU 36 --fusion_annot_lib ../fusion_lib.Mar2021.dat.gz --genome_fa ../GRCh38.primary_assembly.genome.fa --output_dir GRCh38_v39_CTAT_lib_Mar242022.CUSTOM --human_gencode_filter --pfam_db current --dfam_db human 2> build.errs > build.out &`
- fusion_annotator_ref built by placing `GRCh38_v39_CTAT_lib_Mar242022.CUSTOM/fusion_annot_lib.idx` and `GRCh38_v39_CTAT_lib_Mar242022.CUSTOM/blast_pairs.idx` into their own tarball
- kallisto index built using RSEM `RSEM_GENCODE39.transcripts.fa` file as transcriptome fasta, using command: `kallisto index -i RSEM_GENCODE39.transcripts.kallisto.idx RSEM_GENCODE39.transcripts.fa`
- RNA-SeQC reference built using [collapse gtf script](https://github.com/broadinstitute/gtex-pipeline/blob/master/gene_model/collapse_annotation.py)
  - Two references are needed: one for stranded data and one for unstranded data
  - Flag `--collapse_only` used for the stranded reference
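
The last two notes above can be sketched as a small script. This is a hedged illustration, not the exact commands used to build the published references. Assumptions not stated in the notes: `collapse_annotation.py` has been downloaded into the working directory, the tarball keeps the `GRCh38_v39_CTAT_lib_Mar242022.CUSTOM/` path prefix, the unstranded output file name is hypothetical, and the unstranded reference is built by simply omitting `--collapse_only`. By default the script only prints the commands; set `RUN=1` to execute them.

```shell
#!/usr/bin/env bash
# Sketch only: file locations and the unstranded output name are assumptions.
set -euo pipefail

RUN="${RUN:-0}"   # 0 = print commands only (default), 1 = also execute them
CTAT_DIR="GRCh38_v39_CTAT_lib_Mar242022.CUSTOM"
GTF="gencode.v39.primary_assembly.annotation.gtf"

run() {
  # Print the command, then execute it only when RUN=1.
  echo "$*"
  if [ "$RUN" = "1" ]; then "$@"; fi
}

# fusion_annotator_ref: only two files from the CTAT genome lib are packaged.
run tar -czf GRCh38_v39_fusion_annot_custom.tar.gz \
  "$CTAT_DIR/fusion_annot_lib.idx" "$CTAT_DIR/blast_pairs.idx"

# RNA-SeQC reference for stranded data: --collapse_only, as noted above.
run python3 collapse_annotation.py --collapse_only \
  "$GTF" gencode.v39.primary_assembly.rnaseqc.stranded.gtf

# RNA-SeQC reference for unstranded data (assumed: same script, flag omitted).
run python3 collapse_annotation.py \
  "$GTF" gencode.v39.primary_assembly.rnaseqc.unstranded.gtf
```

Printing first makes it easy to check paths before committing to a build; the stranded GTF name matches the suggested value of `RNAseQC_GTF`.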
requirements:
- class: ScatterFeatureRequirement
- class: MultipleInputFeatureRequirement
- class: SubworkflowFeatureRequirement
- class: InlineJavascriptRequirement
- class: StepInputExpressionRequirement
inputs:
# many tools
reference_fasta: {type: 'File', doc: "GRCh38.primary_assembly.genome.fa", "sbg:suggestedValue": {class: File, path: 5f500135e4b0370371c051b4,
name: GRCh38.primary_assembly.genome.fa, secondaryFiles: [{class: File, path: 62866da14d85bc2e02ba52db, name: GRCh38.primary_assembly.genome.fa.fai}]},
secondaryFiles: ['.fai']}
output_basename: {type: 'string?', doc: "String to use as basename for outputs. Will use reads1 file basename if null"}
reads1: {type: File, doc: "Input fastq file, gzipped or uncompressed OR alignment file"}
reads2: {type: 'File?', doc: "If paired end, R2 reads files, gzipped or uncompressed"}
is_paired_end: {type: 'boolean?', doc: "For BAM inputs, are the reads paired end?"}
wf_strand_param: {type: ['null', {type: 'enum', name: wf_strand_param, symbols: ["default", "rf-stranded", "fr-stranded"]}], doc: "use
'default' for unstranded/auto, 'rf-stranded' if read1 in the fastq read pairs is reverse complement to the transcript, 'fr-stranded'
if read1 same sense as transcript"}
gtf_anno: {type: 'File', doc: "General transfer format (gtf) file with gene models corresponding to fasta reference", "sbg:suggestedValue": {
class: File, path: 62853e7ad63f7c6d8d7ae5a4, name: gencode.v39.primary_assembly.annotation.gtf}}
star_fusion_genome_untar_path: {type: 'string?', doc: "This is what the path will be when genome_tar is unpackaged", default: "GRCh38_v39_CTAT_lib_Mar242022.CUSTOM"}
read_length_median: {type: 'int?', doc: "The median read length for the reads provided."}
read_length_stddev: {type: 'float?', doc: "Standard Deviation of the median read length."}
samtools_fastq_cores: {type: 'int?', doc: "Num cores for align2fastq conversion, if input is an alignment file", default: 16}
cram_reference: {type: 'File?', secondaryFiles: [.fai], doc: "If the input alignment file is CRAM and you are uncertain all contigs are registered
at http://www.ebi.ac.uk/ena/cram/md5/, provide here"}
r1_adapter: {type: 'string?', doc: "Optional input. If the input reads have already been trimmed, leave these as null. If they do
need trimming, supply the adapters."}
r2_adapter: {type: 'string?', doc: "Optional input. If the input reads have already been trimmed, leave these as null. If they do
need trimming, supply the adapters."}
min_len: {type: 'int?', doc: "If you do not use this option, reads that have a length of zero (empty reads) are kept in the output",
default: 20}
quality_base: {type: 'int?', doc: "Phred scale used", default: 33}
quality_cutoff: {type: 'int[]?', doc: "Quality trim cutoff, see https://cutadapt.readthedocs.io/en/v3.4/guide.html#quality-trimming
for how 5' 3' is handled"}
outSAMattrRGline: {type: 'string?', doc: "Suggested setting format is: ID:sample_name LB:aliquot_id PL:platform SM:BSID for example
ID:7316-242 LB:750189 PL:ILLUMINA SM:BS_W72364M N. STAR will automatically convert unquoted spaces into tabs. If you wish to
have a value with whitespace, the KEY:VALUE must be enclosed in double quotes. Refer to the STAR documentation for complete
input details. If not provided, value will be autogenerated based on the reads1 file basename."}
STARgenome: {type: File, doc: "Tar gzipped reference that will be unzipped at run time", "sbg:suggestedValue": {class: File, path: 62853e7ad63f7c6d8d7ae5a7,
name: STAR_2.7.10a_GENCODE39.tar.gz}}
runThreadN: {type: 'int?', default: 36, doc: "Adjust this value to change number of cores used by STAR."}
twopassMode: {type: ['null', {type: enum, name: twopassMode, symbols: ["Basic", "None"]}], default: "Basic", doc: "Enable two pass
mode to detect novel splice events. Default is basic (on)."}
alignSJoverhangMin: {type: 'int?', default: 8, doc: "minimum overhang for unannotated junctions. ENCODE default used."}
outFilterMismatchNoverLmax: {type: 'float?', default: 0.1, doc: "alignment will be output only if its ratio of mismatches to *mapped*
length is less than or equal to this value"}
outFilterType: {type: ['null', {type: enum, name: outFilterType, symbols: ["BySJout", "Normal"]}], default: "BySJout", doc: "type
of filtering. Normal: standard filtering using only current alignment. BySJout (default): keep only those reads that contain
junctions that passed filtering into SJ.out.tab."}
outFilterScoreMinOverLread: {type: 'float?', default: 0.33, doc: "alignment will be output only if its score is higher than or equal
to this value, normalized to read length (sum of mates' lengths for paired-end reads)"}
outFilterMatchNminOverLread: {type: 'float?', default: 0.33, doc: "alignment will be output only if the number of matched bases
is higher than or equal to this value, normalized to the read length (sum of mates' lengths for paired-end reads)"}
outReadsUnmapped: {type: ['null', {type: enum, name: outReadsUnmapped, symbols: ["None", "Fastx"]}], default: "None", doc: "output
of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s). None (default):
no output. Fastx: output in separate fasta/fastq files, Unmapped.out.mate1/2."}
limitSjdbInsertNsj: {type: 'int?', default: 1200000, doc: "maximum number of junctions to be inserted to the genome on the fly at
the mapping stage, including those from annotations and those detected in the 1st step of the 2-pass run"}
outSAMstrandField: {type: ['null', {type: enum, name: outSAMstrandField, symbols: ["intronMotif", "None"]}], default: "intronMotif",
doc: "Cufflinks-like strand field flag. None: not used. intronMotif (default): strand derived from the intron motif. This option
changes the output alignments: reads with inconsistent and/or non-canonical introns are filtered out."}
outFilterIntronMotifs: {type: ['null', {type: enum, name: outFilterIntronMotifs, symbols: ["None", "RemoveNoncanonical", "RemoveNoncanonicalUnannotated"]}],
default: "None", doc: "filter alignments using their motifs. None (default): no filtering. RemoveNoncanonical: filter out alignments
that contain non-canonical junctions. RemoveNoncanonicalUnannotated: filter out alignments that contain non-canonical unannotated
junctions when using annotated splice junctions database. The annotated non-canonical junctions will be kept."}
alignSoftClipAtReferenceEnds: {type: ['null', {type: enum, name: alignSoftClipAtReferenceEnds, symbols: ["Yes", "No"]}], default: "Yes",
doc: "allow the soft-clipping of the alignments past the end of the chromosomes. Yes (default): allow. No: prohibit, useful for
compatibility with Cufflinks"}
quantMode: {type: ['null', {type: enum, name: quantMode, symbols: [TranscriptomeSAM GeneCounts, '-', TranscriptomeSAM, GeneCounts]}],
default: TranscriptomeSAM GeneCounts, doc: "types of quantification requested. -: none. TranscriptomeSAM: output SAM/BAM alignments
to transcriptome into a separate file. GeneCounts: count reads per gene. Choices are additive, so default is 'TranscriptomeSAM
GeneCounts'"}
outSAMtype: {type: ['null', {type: enum, name: outSAMtype, symbols: ["BAM Unsorted", "None", "BAM SortedByCoordinate", "SAM Unsorted",
"SAM SortedByCoordinate"]}], default: "BAM Unsorted", doc: "type of SAM/BAM output. None: no SAM/BAM output. Otherwise,
first word is output type (BAM or SAM), second is sort type (Unsorted or SortedByCoordinate)"}
outSAMunmapped: {type: ['null', {type: enum, name: outSAMunmapped, symbols: ["Within", "None", "Within KeepPairs"]}], default: "Within",
doc: "output of unmapped reads in the SAM format. None: no output. Within (default): output unmapped reads within the main SAM
file (i.e. Aligned.out.sam). Within KeepPairs: record unmapped mate for each alignment, and, in case of unsorted output, keep
it adjacent to its mapped mate. Only affects multi-mapping reads"}
genomeLoad: {type: ['null', {type: enum, name: genomeLoad, symbols: ["NoSharedMemory", "LoadAndKeep", "LoadAndRemove", "LoadAndExit"]}],
default: "NoSharedMemory", doc: "mode of shared memory usage for the genome file. In this context, the default value makes the
most sense, the others are there as a courtesy."}
chimMainSegmentMultNmax: {type: 'int?', default: 1, doc: "maximum number of multi-alignments for the main chimeric segment. =1 will
prohibit multimapping main segments"}
outSAMattributes: {type: 'string?', default: 'NH HI AS nM NM ch', doc: "a string of desired SAM attributes, in the order desired
for the output SAM. Tags can be listed in any combination/order. Please refer to the STAR manual, as there are numerous combinations:
https://raw.githubusercontent.com/alexdobin/star_2.7.10a/master/doc/STARmanual.pdf"}
alignInsertionFlush: {type: ['null', {type: enum, name: alignInsertionFlush, symbols: ["None", "Right"]}], default: "None", doc: "how
to flush ambiguous insertion positions. None (default): insertions not flushed. Right: insertions flushed to the right. STAR
Fusion recommended (SF)"}
alignIntronMax: {type: 'int?', default: 1000000, doc: "maximum intron size. SF recommends 100000"}
alignMatesGapMax: {type: 'int?', default: 1000000, doc: "maximum genomic distance between mates, SF recommends 100000 to avoid readthru
fusions within 100k"}
alignSJDBoverhangMin: {type: 'int?', default: 1, doc: "minimum overhang for annotated junctions. SF recommends 10"}
outFilterMismatchNmax: {type: 'int?', default: 999, doc: "maximum number of mismatches per pair, large number switches off this
filter"}
alignSJstitchMismatchNmax: {type: 'string?', default: "5 -1 5 5", doc: "maximum number of mismatches for stitching of the splice
junctions. Value '5 -1 5 5' improves SF chimeric junctions, also recommended by arriba (AR)"}
alignSplicedMateMapLmin: {type: 'int?', default: 0, doc: "minimum mapped length for a read mate that is spliced. SF recommends 30"}
alignSplicedMateMapLminOverLmate: {type: 'float?', default: 0.5, doc: "alignSplicedMateMapLmin normalized to mate length. SF recommends
0, AR 0.5"}
chimJunctionOverhangMin: {type: 'int?', default: 10, doc: "minimum overhang for a chimeric junction. SF recommends 8, AR 10"}
chimMultimapNmax: {type: 'int?', default: 50, doc: "maximum number of chimeric multi-alignments. SF recommends 20, AR 50."}
chimMultimapScoreRange: {type: 'int?', default: 1, doc: "the score range for multi-mapping chimeras below the best chimeric score.
Only works with chimMultimapNmax > 1. SF recommends 3"}
chimNonchimScoreDropMin: {type: 'int?', default: 20, doc: "int>=0: to trigger chimeric detection, the drop in the best non-chimeric
alignment score with respect to the read length has to be greater than this value. SF recommends 10"}
chimOutJunctionFormat: {type: 'int?', default: 1, doc: "formatting type for the Chimeric.out.junction file, value 1 REQUIRED for
SF"}
chimOutType: {type: ['null', {type: enum, name: chimOutType, symbols: ["Junctions SeparateSAMold WithinBAM SoftClip", "Junctions",
"SeparateSAMold", "WithinBAM SoftClip", "WithinBAM HardClip", "Junctions SeparateSAMold", "Junctions WithinBAM SoftClip",
"Junctions WithinBAM HardClip", "Junctions SeparateSAMold WithinBAM HardClip", "SeparateSAMold WithinBAM SoftClip", "SeparateSAMold
WithinBAM HardClip"]}], default: "Junctions WithinBAM SoftClip", doc: "type of chimeric output. Args are additive, and
defined as such - Junctions: Chimeric.out.junction. SeparateSAMold: output old SAM into separate Chimeric.out.sam file. WithinBAM:
output into main aligned BAM files (Aligned.*.bam). WithinBAM HardClip: hard-clipping in the CIGAR for supplemental chimeric
alignments. WithinBAM SoftClip: soft-clipping in the CIGAR for supplemental chimeric alignments"}
chimScoreDropMax: {type: 'int?', default: 30, doc: "max drop (difference) of chimeric score (the sum of scores of all chimeric segments)
from the read length. AR recommends 30"}
chimScoreJunctionNonGTAG: {type: 'int?', default: -1, doc: "penalty for a non-GT/AG chimeric junction. default -1, SF recommends
-4, AR -1"}
chimScoreSeparation: {type: 'int?', default: 1, doc: "int>=0: minimum difference (separation) between the best chimeric score and
the next one. AR recommends 1"}
chimSegmentMin: {type: 'int?', default: 10, doc: "minimum length of chimeric segment, if ==0, no chimeric output. REQUIRED
for SF, 12 is their default, AR recommends 10"}
chimSegmentReadGapMax: {type: 'int?', default: 3, doc: "maximum gap in the read sequence between chimeric segments. AR recommends
3"}
outFilterMultimapNmax: {type: 'int?', default: 50, doc: "max number of multiple alignments allowed for a read: if exceeded, the
read is considered unmapped. ENCODE value is default. AR recommends 50"}
peOverlapMMp: {type: 'float?', default: 0.01, doc: "maximum proportion of mismatched bases in the overlap area. SF recommends 0.1"}
peOverlapNbasesMin: {type: 'int?', default: 10, doc: "minimum number of overlap bases to trigger mates merging and realignment.
Specify >0 value to switch on the 'merging of overlapping mates' algorithm. SF recommends 12, AR recommends 10"}
arriba_memory: {type: 'int?', doc: "Mem intensive tool. Set in GB", default: 64}
FusionGenome: {type: 'File', doc: "STAR-Fusion CTAT Genome lib", "sbg:suggestedValue": {class: File, path: 62853e7ad63f7c6d8d7ae5a8,
name: GRCh38_v39_CTAT_lib_Mar242022.CUSTOM.tar.gz}}
compress_chimeric_junction: {type: 'boolean?', default: true, doc: 'If part of a workflow, recommend compressing this file as final
output'}
RNAseQC_GTF: {type: 'File', doc: "gtf file from `gtf_anno` that has been collapsed GTEx-style", "sbg:suggestedValue": {class: File,
path: 62853e7ad63f7c6d8d7ae5a3, name: gencode.v39.primary_assembly.rnaseqc.stranded.gtf}}
kallisto_idx: {type: 'File', doc: "Specialized index of a **transcriptome** fasta file for kallisto", "sbg:suggestedValue": {class: File,
path: 62853e7ad63f7c6d8d7ae5a6, name: RSEM_GENCODE39.transcripts.kallisto.idx}}
RSEMgenome: {type: 'File', doc: "RSEM reference tar ball", "sbg:suggestedValue": {class: File, path: 62853e7ad63f7c6d8d7ae5a5, name: RSEM_GENCODE39.tar.gz}}
estimate_rspd: {type: 'boolean?', doc: "Set this option if you want to estimate the read start position distribution (RSPD) from
data", default: true}
sample_name: {type: 'string?', doc: "Sample ID of the input reads. If not provided, will use reads1 file basename."}
annofuse_col_num: {type: 'int?', doc: "0-based column number in file of fusion name.", default: 30}
fusion_annotator_ref: {type: 'File', doc: "Tar ball with fusion_annot_lib.idx and blast_pairs.idx from STAR-Fusion CTAT Genome lib.
Can be same as FusionGenome, but only two files needed from that package", "sbg:suggestedValue": {class: 'File', path: '63cff818facdd82011c8d6fe',
name: 'GRCh38_v39_fusion_annot_custom.tar.gz'}}
rmats_variable_read_length: {type: 'boolean?', doc: "Allow reads with lengths that differ from --readLength to be processed. --readLength
will still be used to determine IncFormLen and SkipFormLen."}
rmats_novel_splice_sites: {type: 'boolean?', doc: "Select for novel splice site detection or unannotated splice sites. 'true' to
detect or add this parameter, 'false' to disable denovo detection. Tool Default: false"}
rmats_stat_off: {type: 'boolean?', doc: "Select to skip statistical analysis, either between two groups or on single sample group.
'true' to add this parameter. Tool default: false"}
rmats_allow_clipping: {type: 'boolean?', doc: "Allow alignments with soft or hard clipping to be used."}
rmats_threads: {type: 'int?', doc: "Threads to allocate to RMATs."}
rmats_ram: {type: 'int?', doc: "GB of RAM to allocate to RMATs."}
run_t1k: {type: 'boolean?', default: true, doc: "Set to false to disable T1k HLA typing"}
hla_rna_ref_seqs: {type: 'File?', doc: "FASTA file containing the HLA allele reference sequences for RNA.", "sbg:suggestedValue": {
class: File, path: 6669ac8127374715fc3ba3c3, name: hla_v3.43.0_gencode_v39_rna_seq.fa}}
hla_rna_gene_coords: {type: 'File?', doc: "FASTA file containing the coordinates of the HLA genes for RNA.", "sbg:suggestedValue": {
class: File, path: 6669ac8127374715fc3ba3c1, name: hla_v3.43.0_gencode_v39_rna_coord.fa}}
t1k_abnormal_unmap_flag: {type: 'boolean?', doc: "Set if the flag in BAM for the unmapped read-pair is nonconcordant"}
t1k_ram: {type: 'int?', doc: "GB of RAM to allocate to T1k." }
outputs:
cutadapt_stats: {type: 'File?', outputSource: cutadapt_3-4/cutadapt_stats, doc: "Cutadapt stats output, only if adapter is supplied."}
STAR_sorted_genomic_cram: {type: 'File', outputSource: samtools_bam_to_cram/output, doc: "STAR sorted and indexed genomic alignment
cram"}
STAR_chimeric_junctions: {type: 'File?', outputSource: star_fusion_1-10-1/chimeric_junction_compressed, doc: "STAR chimeric junctions"}
STAR_gene_count: {type: 'File', outputSource: star_2-7-10a/gene_counts, doc: "STAR gene counts"}
STAR_junctions_out: {type: 'File', outputSource: star_2-7-10a/junctions_out, doc: "STAR junction reads"}
STAR_final_log: {type: 'File', outputSource: star_2-7-10a/log_final_out, doc: "STAR metrics log file of unique, multi-mapping, unmapped,
and chimeric reads"}
STAR-Fusion_results: {type: 'File', outputSource: star_fusion_1-10-1/abridged_coding, doc: "STAR fusion detection from chimeric
reads"}
arriba_fusion_results: {type: 'File', outputSource: arriba_fusion_2-2-1/arriba_fusions, doc: "Fusion output from Arriba"}
arriba_fusion_viz: {type: 'File', outputSource: arriba_draw_2-2-1/arriba_pdf, doc: "pdf output from Arriba"}
RSEM_isoform: {type: 'File', outputSource: rsem/isoform_out, doc: "RSEM isoform expression estimates"}
RSEM_gene: {type: 'File', outputSource: rsem/gene_out, doc: "RSEM gene expression estimates"}
RNASeQC_Metrics: {type: 'File', outputSource: rna_seqc/Metrics, doc: "Metrics on mapping, intronic, exonic rates, count information,
etc"}
RNASeQC_counts: {type: 'File', outputSource: supplemental/RNASeQC_counts, doc: "Contains gene tpm, gene read, and exon counts"}
kallisto_Abundance: {type: 'File', outputSource: kallisto/abundance_out, doc: "Transcript abundance output from kallisto"}
annofuse_filtered_fusions_tsv: {type: 'File?', outputSource: annofuse/annofuse_filtered_fusions_tsv, doc: "Filtered fusions called
by annoFuse."}
rmats_filtered_alternative_3_prime_splice_sites_jc: {type: 'File', outputSource: rmats/filtered_alternative_3_prime_splice_sites_jc,
doc: "Alternative 3 prime splice sites JC.txt output from RMATs containing only those calls with 10 or more junction spanning
read counts of support"}
rmats_filtered_alternative_5_prime_splice_sites_jc: {type: 'File', outputSource: rmats/filtered_alternative_5_prime_splice_sites_jc,
doc: "Alternative 5 prime splice sites JC.txt output from RMATs containing only those calls with 10 or more junction spanning
read counts of support"}
rmats_filtered_mutually_exclusive_exons_jc: {type: 'File', outputSource: rmats/filtered_mutually_exclusive_exons_jc, doc: "Mutually
exclusive exons JC.txt output from RMATs containing only those calls with 10 or more junction spanning read counts of support"}
rmats_filtered_retained_introns_jc: {type: 'File', outputSource: rmats/filtered_retained_introns_jc, doc: "Retained introns JC.txt
output from RMATs containing only those calls with 10 or more junction spanning read counts of support"}
rmats_filtered_skipped_exons_jc: {type: 'File', outputSource: rmats/filtered_skipped_exons_jc, doc: "Skipped exons JC.txt output
from RMATs containing only those calls with 10 or more junction spanning read counts of support"}
t1k_genotype_tsv: {type: 'File?', outputSource: t1k/genotype_tsv, doc: "Genotyping results from T1k"}
steps:
basename_picker:
run: ../tools/basename_picker.cwl
in:
root_name:
source: reads1
valueFrom: $(self.basename.split('.')[0])
output_basename: output_basename
sample_name: sample_name
star_rg_line: outSAMattrRGline
out: [outname, outsample, outrg]
alignmentfile_pairedness:
run: ../tools/alignmentfile_pairedness.cwl
when: $(inputs.input_reads.basename.search(/\.(b|cr|s)am$/) != -1)
in:
input_reads: reads1
input_reference: cram_reference
out: [is_paired_end]
align2fastq:
# Skip if input is FASTQ already
run: ../tools/samtools_fastq.cwl
when: $(inputs.input_reads_1.basename.search(/\.(b|cr|s)am$/) != -1)
in:
input_reads_1: reads1
SampleID: basename_picker/outname
cores: samtools_fastq_cores
is_paired_end:
source: [is_paired_end, alignmentfile_pairedness/is_paired_end]
pickValue: first_non_null
cram_reference: cram_reference
out: [fq1, fq2]
cutadapt_3-4:
# Skip if no adapter given, get fastq from prev step if not null or wf input
run: ../tools/cutadapter_3.4.cwl
when: $(inputs.r1_adapter != null)
in:
readFilesIn1:
source: [align2fastq/fq1, reads1]
pickValue: first_non_null
readFilesIn2:
source: [align2fastq/fq2, reads2]
pickValue: first_non_null
r1_adapter: r1_adapter
r2_adapter: r2_adapter
min_len: min_len
quality_base: quality_base
quality_cutoff: quality_cutoff
sample_name: basename_picker/outname
out: [trimmedReadsR1, trimmedReadsR2, cutadapt_stats]
star_2-7-10a:
# will get fastq from first non-null in this order - cutadapt, align2fastq, wf input
run: ../tools/star_2.7.10a_align.cwl
in:
outSAMattrRGline: basename_picker/outrg
genomeDir: STARgenome
readFilesIn1:
source: [cutadapt_3-4/trimmedReadsR1, align2fastq/fq1, reads1]
pickValue: first_non_null
readFilesIn2:
source: [cutadapt_3-4/trimmedReadsR2, align2fastq/fq2, reads2]
pickValue: first_non_null
outFileNamePrefix: basename_picker/outname
runThreadN: runThreadN
twopassMode: twopassMode
alignSJoverhangMin: alignSJoverhangMin
outFilterMismatchNoverLmax: outFilterMismatchNoverLmax
outFilterType: outFilterType
outFilterScoreMinOverLread: outFilterScoreMinOverLread
outFilterMatchNminOverLread: outFilterMatchNminOverLread
outReadsUnmapped: outReadsUnmapped
limitSjdbInsertNsj: limitSjdbInsertNsj
outSAMstrandField: outSAMstrandField
outFilterIntronMotifs: outFilterIntronMotifs
alignSoftClipAtReferenceEnds: alignSoftClipAtReferenceEnds
quantMode: quantMode
outSAMtype: outSAMtype
outSAMunmapped: outSAMunmapped
genomeLoad: genomeLoad
chimMainSegmentMultNmax: chimMainSegmentMultNmax
outSAMattributes: outSAMattributes
alignInsertionFlush: alignInsertionFlush
alignIntronMax: alignIntronMax
alignMatesGapMax: alignMatesGapMax
alignSJDBoverhangMin: alignSJDBoverhangMin
outFilterMismatchNmax: outFilterMismatchNmax
alignSJstitchMismatchNmax: alignSJstitchMismatchNmax
alignSplicedMateMapLmin: alignSplicedMateMapLmin
alignSplicedMateMapLminOverLmate: alignSplicedMateMapLminOverLmate
chimJunctionOverhangMin: chimJunctionOverhangMin
chimMultimapNmax: chimMultimapNmax
chimMultimapScoreRange: chimMultimapScoreRange
chimNonchimScoreDropMin: chimNonchimScoreDropMin
chimOutJunctionFormat: chimOutJunctionFormat
chimOutType: chimOutType
chimScoreDropMax: chimScoreDropMax
chimScoreJunctionNonGTAG: chimScoreJunctionNonGTAG
chimScoreSeparation: chimScoreSeparation
chimSegmentMin: chimSegmentMin
chimSegmentReadGapMax: chimSegmentReadGapMax
outFilterMultimapNmax: outFilterMultimapNmax
peOverlapMMp: peOverlapMMp
peOverlapNbasesMin: peOverlapNbasesMin
out: [chimeric_junctions, chimeric_sam_out, gene_counts, genomic_bam_out, junctions_out, log_final_out, log_out, log_progress_out,
transcriptome_bam_out]
samtools_sort:
run: ../tools/samtools_sort.cwl
in:
unsorted_bam: star_2-7-10a/genomic_bam_out
chimeric_sam_out: star_2-7-10a/chimeric_sam_out
out: [sorted_bam, sorted_bai, chimeric_bam_out]
t1k:
run: ../tools/t1k.cwl
when: $(inputs.run_t1k)
in:
run_t1k: run_t1k
bam:
source: [samtools_sort/sorted_bam, samtools_sort/sorted_bai]
valueFrom: |
${
var bundle = self[0];
bundle.secondaryFiles = [self[1]];
return bundle;
}
reference: hla_rna_ref_seqs
gene_coordinates: hla_rna_gene_coords
abnormal_unmap_flag: t1k_abnormal_unmap_flag
preset:
valueFrom: "hla"
output_basename:
source: output_basename
valueFrom: $(self).t1k_hla
skip_post_analysis:
valueFrom: $(1 == 1)
ram: t1k_ram
out: [genotype_tsv]
bam_strandness:
run: ../tools/bam_strandness.cwl
in:
input_bam: samtools_sort/sorted_bam
annotation_gtf: gtf_anno
kallisto_idx: kallisto_idx
paired_end:
source: [reads2, is_paired_end, alignmentfile_pairedness/is_paired_end]
valueFrom: |
$(self[0] != null ? true : self[1] != null ? self[1] : self[2])
out: [output, strandedness, read_length_median, read_length_stddev, is_paired_end]
rmats:
run: ../workflow/rmats_wf.cwl
in:
gtf_annotation: gtf_anno
sample_1_bams:
source: samtools_sort/sorted_bam
valueFrom: |
$([self])
read_length: read_length_median
variable_read_length: rmats_variable_read_length
read_type:
source: [is_paired_end, bam_strandness/is_paired_end]
pickValue: first_non_null
valueFrom: |
$(self ? "paired" : "single")
strandedness:
source: [wf_strand_param, bam_strandness/strandedness]
pickValue: first_non_null
valueFrom: |
$(self == "rf-stranded" ? "fr-firststrand" : self == "fr-stranded" ? "fr-secondstrand" : "fr-unstranded")
novel_splice_sites: rmats_novel_splice_sites
stat_off: rmats_stat_off
allow_clipping: rmats_allow_clipping
output_basename: basename_picker/outname
rmats_threads: rmats_threads
rmats_ram: rmats_ram
out: [filtered_alternative_3_prime_splice_sites_jc, filtered_alternative_5_prime_splice_sites_jc, filtered_mutually_exclusive_exons_jc,
filtered_retained_introns_jc, filtered_skipped_exons_jc]
strand_parse:
run: ../tools/expression_parse_strand_param.cwl
in:
wf_strand_param:
source: [wf_strand_param, bam_strandness/strandedness]
pickValue: first_non_null
out: [rsem_std, kallisto_std, rnaseqc_std, arriba_std]
star_fusion_1-10-1:
run: ../tools/star_fusion_1.10.1_call.cwl
in:
Chimeric_junction: star_2-7-10a/chimeric_junctions
genome_tar: FusionGenome
output_basename: basename_picker/outname
genome_untar_path: star_fusion_genome_untar_path
compress_chimeric_junction: compress_chimeric_junction
out: [abridged_coding, chimeric_junction_compressed]
arriba_fusion_2-2-1:
run: ../tools/arriba_fusion_2.2.1.cwl
in:
genome_aligned_bam:
source: [samtools_sort/sorted_bam, samtools_sort/sorted_bai]
valueFrom: |
${
var bundle = self[0];
bundle.secondaryFiles = [self[1]];
return bundle;
}
memory: arriba_memory
reference_fasta: reference_fasta
gtf_anno: gtf_anno
outFileNamePrefix: basename_picker/outname
arriba_strand_flag: strand_parse/arriba_std
out: [arriba_fusions]
arriba_draw_2-2-1:
run: ../tools/arriba_draw_2.2.1.cwl
in:
fusions: arriba_fusion_2-2-1/arriba_fusions
genome_aligned_bam:
source: [samtools_sort/sorted_bam, samtools_sort/sorted_bai]
valueFrom: |
${
var bundle = self[0];
bundle.secondaryFiles = [self[1]];
return bundle;
}
gtf_anno: gtf_anno
memory: arriba_memory
out: [arriba_pdf]
rsem:
run: ../tools/rsem_calc_expression.cwl
in:
bam: star_2-7-10a/transcriptome_bam_out
paired_end:
source: [is_paired_end, bam_strandness/is_paired_end]
pickValue: first_non_null
estimate_rspd: estimate_rspd
genomeDir: RSEMgenome
outFileNamePrefix: basename_picker/outname
strandedness: strand_parse/rsem_std
out: [gene_out, isoform_out]
rna_seqc:
run: ../tools/rnaseqc_2.4.2.cwl
in:
aligned_sorted_reads: samtools_sort/sorted_bam
collapsed_gtf: RNAseQC_GTF
stranded: strand_parse/rnaseqc_std
unpaired:
source: [is_paired_end, bam_strandness/is_paired_end]
pickValue: first_non_null
valueFrom: |
$(!self)
out: [Metrics, Gene_TPM, Gene_count, Exon_count]
supplemental:
run: ../tools/supplemental_tar_gz.cwl
in:
outFileNamePrefix: basename_picker/outname
Gene_TPM: rna_seqc/Gene_TPM
Gene_count: rna_seqc/Gene_count
Exon_count: rna_seqc/Exon_count
out: [RNASeQC_counts]
kallisto:
run: ../tools/kallisto_calc_expression.cwl
in:
transcript_idx: kallisto_idx
strand: strand_parse/kallisto_std
reads1:
source: [cutadapt_3-4/trimmedReadsR1, align2fastq/fq1, reads1]
pickValue: first_non_null
reads2:
source: [cutadapt_3-4/trimmedReadsR2, align2fastq/fq2, reads2]
pickValue: first_non_null
SampleID: basename_picker/outname
avg_frag_len:
source: [read_length_median, bam_strandness/read_length_median]
valueFrom: |
$(self.some(function(e){ return e != null }) ? self.filter(function(e) { return e != null })[0] : null)
std_dev:
source: [read_length_stddev, bam_strandness/read_length_stddev]
valueFrom: |
$(self.some(function(e){ return e != null }) ? self.filter(function(e) { return e != null })[0] : null)
out: [abundance_out]
annofuse:
run: ../workflow/kfdrc_annoFuse_wf.cwl
in:
sample_name: basename_picker/outsample
FusionGenome: fusion_annotator_ref
genome_untar_path: star_fusion_genome_untar_path
rsem_expr_file: rsem/gene_out
arriba_output_file: arriba_fusion_2-2-1/arriba_fusions
star_fusion_output_file: star_fusion_1-10-1/abridged_coding
col_num: annofuse_col_num
output_basename: basename_picker/outname
out: [annofuse_filtered_fusions_tsv]
samtools_bam_to_cram:
run: ../tools/samtools_bam_to_cram.cwl
in:
reference: reference_fasta
input_bam:
source: [samtools_sort/sorted_bam, samtools_sort/sorted_bai]
valueFrom: |
${
var bundle = self[0];
bundle.secondaryFiles = [self[1]];
return bundle;
}
out: [output]
$namespaces:
sbg: https://sevenbridges.com
hints:
- class: "sbg:maxNumberOfParallelInstances"
value: 3
"sbg:license": Apache License 2.0
"sbg:publisher": KFDRC
"sbg:categories":
- ALIGNMENT
- ANNOFUSE
- ARRIBA
- BAM
- CRAM
- CUTADAPT
- FASTQ
- KALLISTO
- PE
- RNASEQ
- RNASEQC
- RMATS
- RSEM
- SE
- STAR
"sbg:links":
- id: 'https://github.com/kids-first/kf-rnaseq-workflow/releases/tag/v4.8.0'
label: github-release