Sample IDs based on filenames causing errors in computation #20

tdido · 2024-06-07T12:39:26Z

I have a pipeline that uses clair3 for haplotyping and severus afterward. I noticed that even though I have control samples, the output only included all_SVs. I suspected the cause to be that the input bams have the same name (as clair3 generates them). So I ran some tests, and that seems to be the case.

This is the log for the run with files named the same:

[2024-06-07 12:37:38] root: INFO: Starting Severus 1.0
[2024-06-07 12:37:38] root: DEBUG: Cmd: severus --target-bam results/clair3/tumour/phased_output.bam --control-bam results/clair3/normal/phased_output.bam --out-dir severus_test_samename -t 64 --phasing-vcf results/clair3/tumour/phased_merge_output.vcf.gz --vntr-bed human_GRCh38_no_alt_analysis_set.trf.bed
[2024-06-07 12:37:38] root: DEBUG: Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
[2024-06-07 12:37:38] root: INFO: Parsing reads from phased_output.bam
[2024-06-07 12:40:41] root: INFO:       Total read length: 96463990191
[2024-06-07 12:40:41] root: INFO:       Total aligned length: 93395728985 (0.97)
[2024-06-07 12:40:41] root: INFO:       Read N50 / N90: 17207 / 4360
[2024-06-07 12:40:41] root: INFO:       Alignments N50 / N90: 15474 / 3734
[2024-06-07 12:40:41] root: INFO:       Read error rate (Q25 / Q50 / Q75): 0.0057 / 0.0095 / 0.0194
[2024-06-07 12:40:41] root: INFO:       Read mismatch rate (Q25 / Q50 / Q75): 0.0019 / 0.0034 / 0.0070
[2024-06-07 12:40:42] root: INFO: Parsing reads from phased_output.bam
[2024-06-07 12:44:23] root: INFO:       Total read length: 91950759372
[2024-06-07 12:44:23] root: INFO:       Total aligned length: 87270955492 (0.95)
[2024-06-07 12:44:23] root: INFO:       Read N50 / N90: 9758 / 3341
[2024-06-07 12:44:23] root: INFO:       Alignments N50 / N90: 8771 / 2967
[2024-06-07 12:44:23] root: INFO:       Read error rate (Q25 / Q50 / Q75): 0.0051 / 0.0085 / 0.0169
[2024-06-07 12:44:23] root: INFO:       Read mismatch rate (Q25 / Q50 / Q75): 0.0016 / 0.0030 / 0.0060
[2024-06-07 12:44:24] root: INFO: Computing read quality
[2024-06-07 12:46:45] root: INFO: Annotating reads
[2024-06-07 12:48:52] root: INFO: Computing coverage histogram
[2024-06-07 12:49:48] root: INFO:       Median coverage by PASS reads for phased_output.bam (H1 / H2 / H0): 22.0 / 22.0 / 1.0
[2024-06-07 12:49:48] root: INFO:       Median coverage by PASS reads for phased_output.bam (H1 / H2 / H0): 22.0 / 22.0 / 1.0
[2024-06-07 12:49:48] root: INFO: Extracting split alignments
[2024-06-07 12:49:56] root: INFO: Extracting clipped reads
[2024-06-07 12:50:17] root: INFO: Starting breakpoint detection
[2024-06-07 12:50:40] root: INFO: Clustering unmapped insertions
[2024-06-07 12:50:46] root: INFO: Starting compute_bp_coverage
[2024-06-07 12:52:03] root: INFO: Filtering breakpoints
[2024-06-07 12:52:05] root: INFO: Writing breakpoints
[2024-06-07 12:52:06] root: INFO: Preparing outputs for all_SVs
[2024-06-07 12:52:06] root: INFO:       Computing segment coverage
[2024-06-07 12:52:36] root: INFO:       Total phased length: 2341332850
[2024-06-07 12:52:36] root: INFO:       Phase blocks N50: 994777
[2024-06-07 12:52:37] root: INFO:       Preparing graph
[2024-06-07 12:52:38] root: INFO:       Writing vcf

And this is the available output:

.
├── all_SVs
│   ├── breakpoint_clusters_list.tsv
│   ├── breakpoint_clusters.tsv
│   ├── plots
│   │   ├── severus_0.html
│   │   ├── severus_1.html
│   │   ├── severus_4.html
│   │   ├── severus_5.html
│   │   └── severus_7.html
│   └── severus_all.vcf
├── breakpoints_double.csv
├── read_qual.txt
└── severus.log

And this is the log for the run where I created symlinks for the same bams, to have different filenames:

[2024-06-07 12:02:41] root: INFO: Starting Severus 1.0
[2024-06-07 12:02:41] root: DEBUG: Cmd: severus --target-bam results/clair3/tumour/target.bam --control-bam results/clair3/normal/control.bam --out-dir severus_test_diffname -t 64 --phasing-vcf results/clair3/tumour/phased_merge_output.vcf.gz --vntr-bed human_GRCh38_no_alt_analysis_set.trf.bed
[2024-06-07 12:02:41] root: DEBUG: Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
[2024-06-07 12:02:42] root: INFO: Parsing reads from target.bam
[2024-06-07 12:10:58] root: INFO:       Total read length: 96463990191
[2024-06-07 12:10:58] root: INFO:       Total aligned length: 93395728985 (0.97)
[2024-06-07 12:10:58] root: INFO:       Read N50 / N90: 17207 / 4360 
[2024-06-07 12:10:58] root: INFO:       Alignments N50 / N90: 15474 / 3734
[2024-06-07 12:10:58] root: INFO:       Read error rate (Q25 / Q50 / Q75): 0.0057 / 0.0095 / 0.0194
[2024-06-07 12:10:58] root: INFO:       Read mismatch rate (Q25 / Q50 / Q75): 0.0019 / 0.0034 / 0.0070
[2024-06-07 12:10:59] root: INFO: Parsing reads from control.bam
[2024-06-07 12:18:30] root: INFO:       Total read length: 91950759372
[2024-06-07 12:18:30] root: INFO:       Total aligned length: 87270955492 (0.95)
[2024-06-07 12:18:30] root: INFO:       Read N50 / N90: 9758 / 3341
[2024-06-07 12:18:30] root: INFO:       Alignments N50 / N90: 8771 / 2967
[2024-06-07 12:18:30] root: INFO:       Read error rate (Q25 / Q50 / Q75): 0.0051 / 0.0085 / 0.0169
[2024-06-07 12:18:30] root: INFO:       Read mismatch rate (Q25 / Q50 / Q75): 0.0016 / 0.0030 / 0.0060
[2024-06-07 12:18:31] root: INFO: Computing read quality
[2024-06-07 12:21:08] root: INFO: Annotating reads
[2024-06-07 12:23:01] root: INFO: Computing coverage histogram
[2024-06-07 12:23:59] root: INFO:       Median coverage by PASS reads for target.bam (H1 / H2 / H0): 11.0 / 11.0 / 0.0
[2024-06-07 12:24:00] root: INFO:       Median coverage by PASS reads for control.bam (H1 / H2 / H0): 10.0 / 10.0 / 1.0
[2024-06-07 12:24:00] root: INFO: Extracting split alignments
[2024-06-07 12:24:08] root: INFO: Extracting clipped reads
[2024-06-07 12:24:49] root: INFO: Starting breakpoint detection
[2024-06-07 12:25:15] root: INFO: Clustering unmapped insertions
[2024-06-07 12:25:18] root: INFO: Starting compute_bp_coverage
[2024-06-07 12:34:41] root: INFO: Filtering breakpoints 
[2024-06-07 12:34:43] root: INFO: Writing breakpoints
[2024-06-07 12:34:45] root: INFO: Preparing outputs for all_SVs
[2024-06-07 12:34:45] root: INFO:       Computing segment coverage
[2024-06-07 12:35:15] root: INFO:       Total phased length: 2341332850
[2024-06-07 12:35:15] root: INFO:       Phase blocks N50: 994777
[2024-06-07 12:35:15] root: INFO:       Preparing graph 
[2024-06-07 12:35:21] root: INFO:       Writing vcf
[2024-06-07 12:35:22] root: INFO: Preparing outputs for somatic_SVs
[2024-06-07 12:35:22] root: INFO:       Computing segment coverage
[2024-06-07 12:35:51] root: INFO:       Total phased length: 2341332850
[2024-06-07 12:35:51] root: INFO:       Phase blocks N50: 994777
[2024-06-07 12:35:51] root: INFO:       Preparing graph
[2024-06-07 12:35:51] root: INFO:       Writing vcf

And the corresponding output tree:

.
├── all_SVs
│   ├── breakpoint_clusters_list.tsv
│   ├── breakpoint_clusters.tsv
│   ├── plots
│   │   ├── severus_0.html
│   │   ├── severus_1.html
│   │   ├── severus_2.html
│   │   ├── severus_3.html
│   │   ├── severus_5.html
│   │   └── severus_7.html
│   └── severus_all.vcf
├── breakpoints_double.csv
├── read_qual.txt
├── severus.log
└── somatic_SVs
    ├── breakpoint_clusters_list.tsv
    ├── breakpoint_clusters.tsv
    ├── plots
    └── severus_somatic.vcf

The text was updated successfully, but these errors were encountered:

aysegokce · 2024-06-07T15:05:24Z

Thank you for the report. This will be solved in our next release.
Ayse

aysegokce · 2024-07-01T16:28:57Z

Thank you for the feedback. The issue is solved in version 1.1

villena-francis · 2024-08-19T09:08:59Z

Hi @aysegokce ! I also have an issue related to file names in Severus v1.1.

All my BAM files are generated with the same name, and I use the storage directory as their ID. However, Severus appears to only consider the file name, and if they share the same name (even in different directories), the execution is canceled, warning that I'm using the control as the target too:

[2024-07-24 01:30:32] root: INFO: Starting Severus 1.1
[2024-07-24 01:30:32] root: DEBUG: Cmd: /storage/scratch01/groups/bu/simSVs/visor-simulations/.snakemake/conda/2a5c70c705b132ee6ccd896c702286f5_/bin/severus --target-bam results/v1_30x_B_LASeR/sim.srt.bam --control-bam results/normal_LASeR/sim.srt.bam --out-dir results/v1_30x_B_severus -t 24
[2024-07-24 01:30:32] root: DEBUG: Python version: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
[2024-07-24 01:30:32] root: ERROR: Error: Control bam also inputted as target bam

For now, I am avoiding the problem by creating symbolic links for the tumor files:

ln -sf $(basename {input.tumor_bam}) $(dirname {input.tumor_bam})/tumor.bam
ln -sf $(basename {input.tumor_bam}).bai $(dirname {input.tumor_bam})/tumor.bam.bai
    severus \
        --target-bam $(dirname {input.tumor_bam})/tumor.bam \
        --control-bam {input.normal_bam} \
        --out-dir {output.dir} \
        -t {threads} \
        > {log} 2>&1

aysegokce closed this as completed Jul 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sample IDs based on filenames causing errors in computation #20

Sample IDs based on filenames causing errors in computation #20

tdido commented Jun 7, 2024

aysegokce commented Jun 7, 2024

aysegokce commented Jul 1, 2024

villena-francis commented Aug 19, 2024

Sample IDs based on filenames causing errors in computation #20

Sample IDs based on filenames causing errors in computation #20

Comments

tdido commented Jun 7, 2024

aysegokce commented Jun 7, 2024

aysegokce commented Jul 1, 2024

villena-francis commented Aug 19, 2024