Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample IDs based on filenames causing errors in computation #20

Closed
tdido opened this issue Jun 7, 2024 · 3 comments
Closed

Sample IDs based on filenames causing errors in computation #20

tdido opened this issue Jun 7, 2024 · 3 comments

Comments

@tdido
Copy link

tdido commented Jun 7, 2024

I have a pipeline that uses clair3 for haplotyping and severus afterward. I noticed that even though I have control samples, the output only included all_SVs. I suspected the cause to be that the input bams have the same name (as clair3 generates them). So I ran some tests, and that seems to be the case.

This is the log for the run with files named the same:

[2024-06-07 12:37:38] root: INFO: Starting Severus 1.0
[2024-06-07 12:37:38] root: DEBUG: Cmd: severus --target-bam results/clair3/tumour/phased_output.bam --control-bam results/clair3/normal/phased_output.bam --out-dir severus_test_samename -t 64 --phasing-vcf results/clair3/tumour/phased_merge_output.vcf.gz --vntr-bed human_GRCh38_no_alt_analysis_set.trf.bed
[2024-06-07 12:37:38] root: DEBUG: Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
[2024-06-07 12:37:38] root: INFO: Parsing reads from phased_output.bam
[2024-06-07 12:40:41] root: INFO:       Total read length: 96463990191
[2024-06-07 12:40:41] root: INFO:       Total aligned length: 93395728985 (0.97)
[2024-06-07 12:40:41] root: INFO:       Read N50 / N90: 17207 / 4360
[2024-06-07 12:40:41] root: INFO:       Alignments N50 / N90: 15474 / 3734
[2024-06-07 12:40:41] root: INFO:       Read error rate (Q25 / Q50 / Q75): 0.0057 / 0.0095 / 0.0194
[2024-06-07 12:40:41] root: INFO:       Read mismatch rate (Q25 / Q50 / Q75): 0.0019 / 0.0034 / 0.0070
[2024-06-07 12:40:42] root: INFO: Parsing reads from phased_output.bam
[2024-06-07 12:44:23] root: INFO:       Total read length: 91950759372
[2024-06-07 12:44:23] root: INFO:       Total aligned length: 87270955492 (0.95)
[2024-06-07 12:44:23] root: INFO:       Read N50 / N90: 9758 / 3341
[2024-06-07 12:44:23] root: INFO:       Alignments N50 / N90: 8771 / 2967
[2024-06-07 12:44:23] root: INFO:       Read error rate (Q25 / Q50 / Q75): 0.0051 / 0.0085 / 0.0169
[2024-06-07 12:44:23] root: INFO:       Read mismatch rate (Q25 / Q50 / Q75): 0.0016 / 0.0030 / 0.0060
[2024-06-07 12:44:24] root: INFO: Computing read quality
[2024-06-07 12:46:45] root: INFO: Annotating reads
[2024-06-07 12:48:52] root: INFO: Computing coverage histogram
[2024-06-07 12:49:48] root: INFO:       Median coverage by PASS reads for phased_output.bam (H1 / H2 / H0): 22.0 / 22.0 / 1.0
[2024-06-07 12:49:48] root: INFO:       Median coverage by PASS reads for phased_output.bam (H1 / H2 / H0): 22.0 / 22.0 / 1.0
[2024-06-07 12:49:48] root: INFO: Extracting split alignments
[2024-06-07 12:49:56] root: INFO: Extracting clipped reads
[2024-06-07 12:50:17] root: INFO: Starting breakpoint detection
[2024-06-07 12:50:40] root: INFO: Clustering unmapped insertions
[2024-06-07 12:50:46] root: INFO: Starting compute_bp_coverage
[2024-06-07 12:52:03] root: INFO: Filtering breakpoints
[2024-06-07 12:52:05] root: INFO: Writing breakpoints
[2024-06-07 12:52:06] root: INFO: Preparing outputs for all_SVs
[2024-06-07 12:52:06] root: INFO:       Computing segment coverage
[2024-06-07 12:52:36] root: INFO:       Total phased length: 2341332850
[2024-06-07 12:52:36] root: INFO:       Phase blocks N50: 994777
[2024-06-07 12:52:37] root: INFO:       Preparing graph
[2024-06-07 12:52:38] root: INFO:       Writing vcf

And this is the available output:

.
├── all_SVs
│   ├── breakpoint_clusters_list.tsv
│   ├── breakpoint_clusters.tsv
│   ├── plots
│   │   ├── severus_0.html
│   │   ├── severus_1.html
│   │   ├── severus_4.html
│   │   ├── severus_5.html
│   │   └── severus_7.html
│   └── severus_all.vcf
├── breakpoints_double.csv
├── read_qual.txt
└── severus.log

And this is the log for the run where I created symlinks for the same bams, to have different filenames:

[2024-06-07 12:02:41] root: INFO: Starting Severus 1.0
[2024-06-07 12:02:41] root: DEBUG: Cmd: severus --target-bam results/clair3/tumour/target.bam --control-bam results/clair3/normal/control.bam --out-dir severus_test_diffname -t 64 --phasing-vcf results/clair3/tumour/phased_merge_output.vcf.gz --vntr-bed human_GRCh38_no_alt_analysis_set.trf.bed
[2024-06-07 12:02:41] root: DEBUG: Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
[2024-06-07 12:02:42] root: INFO: Parsing reads from target.bam
[2024-06-07 12:10:58] root: INFO:       Total read length: 96463990191
[2024-06-07 12:10:58] root: INFO:       Total aligned length: 93395728985 (0.97)
[2024-06-07 12:10:58] root: INFO:       Read N50 / N90: 17207 / 4360 
[2024-06-07 12:10:58] root: INFO:       Alignments N50 / N90: 15474 / 3734
[2024-06-07 12:10:58] root: INFO:       Read error rate (Q25 / Q50 / Q75): 0.0057 / 0.0095 / 0.0194
[2024-06-07 12:10:58] root: INFO:       Read mismatch rate (Q25 / Q50 / Q75): 0.0019 / 0.0034 / 0.0070
[2024-06-07 12:10:59] root: INFO: Parsing reads from control.bam
[2024-06-07 12:18:30] root: INFO:       Total read length: 91950759372
[2024-06-07 12:18:30] root: INFO:       Total aligned length: 87270955492 (0.95)
[2024-06-07 12:18:30] root: INFO:       Read N50 / N90: 9758 / 3341
[2024-06-07 12:18:30] root: INFO:       Alignments N50 / N90: 8771 / 2967
[2024-06-07 12:18:30] root: INFO:       Read error rate (Q25 / Q50 / Q75): 0.0051 / 0.0085 / 0.0169
[2024-06-07 12:18:30] root: INFO:       Read mismatch rate (Q25 / Q50 / Q75): 0.0016 / 0.0030 / 0.0060
[2024-06-07 12:18:31] root: INFO: Computing read quality
[2024-06-07 12:21:08] root: INFO: Annotating reads
[2024-06-07 12:23:01] root: INFO: Computing coverage histogram
[2024-06-07 12:23:59] root: INFO:       Median coverage by PASS reads for target.bam (H1 / H2 / H0): 11.0 / 11.0 / 0.0
[2024-06-07 12:24:00] root: INFO:       Median coverage by PASS reads for control.bam (H1 / H2 / H0): 10.0 / 10.0 / 1.0
[2024-06-07 12:24:00] root: INFO: Extracting split alignments
[2024-06-07 12:24:08] root: INFO: Extracting clipped reads
[2024-06-07 12:24:49] root: INFO: Starting breakpoint detection
[2024-06-07 12:25:15] root: INFO: Clustering unmapped insertions
[2024-06-07 12:25:18] root: INFO: Starting compute_bp_coverage
[2024-06-07 12:34:41] root: INFO: Filtering breakpoints 
[2024-06-07 12:34:43] root: INFO: Writing breakpoints
[2024-06-07 12:34:45] root: INFO: Preparing outputs for all_SVs
[2024-06-07 12:34:45] root: INFO:       Computing segment coverage
[2024-06-07 12:35:15] root: INFO:       Total phased length: 2341332850
[2024-06-07 12:35:15] root: INFO:       Phase blocks N50: 994777
[2024-06-07 12:35:15] root: INFO:       Preparing graph 
[2024-06-07 12:35:21] root: INFO:       Writing vcf
[2024-06-07 12:35:22] root: INFO: Preparing outputs for somatic_SVs
[2024-06-07 12:35:22] root: INFO:       Computing segment coverage
[2024-06-07 12:35:51] root: INFO:       Total phased length: 2341332850
[2024-06-07 12:35:51] root: INFO:       Phase blocks N50: 994777
[2024-06-07 12:35:51] root: INFO:       Preparing graph
[2024-06-07 12:35:51] root: INFO:       Writing vcf

And the corresponding output tree:

.
├── all_SVs
│   ├── breakpoint_clusters_list.tsv
│   ├── breakpoint_clusters.tsv
│   ├── plots
│   │   ├── severus_0.html
│   │   ├── severus_1.html
│   │   ├── severus_2.html
│   │   ├── severus_3.html
│   │   ├── severus_5.html
│   │   └── severus_7.html
│   └── severus_all.vcf
├── breakpoints_double.csv
├── read_qual.txt
├── severus.log
└── somatic_SVs
    ├── breakpoint_clusters_list.tsv
    ├── breakpoint_clusters.tsv
    ├── plots
    └── severus_somatic.vcf
@aysegokce
Copy link
Contributor

Thank you for the report. This will be solved in our next release.
Ayse

@aysegokce
Copy link
Contributor

Thank you for the feedback. The issue is solved in version 1.1

@villena-francis
Copy link

Hi @aysegokce ! I also have an issue related to file names in Severus v1.1.

All my BAM files are generated with the same name, and I use the storage directory as their ID. However, Severus appears to only consider the file name, and if they share the same name (even in different directories), the execution is canceled, warning that I'm using the control as the target too:

[2024-07-24 01:30:32] root: INFO: Starting Severus 1.1
[2024-07-24 01:30:32] root: DEBUG: Cmd: /storage/scratch01/groups/bu/simSVs/visor-simulations/.snakemake/conda/2a5c70c705b132ee6ccd896c702286f5_/bin/severus --target-bam results/v1_30x_B_LASeR/sim.srt.bam --control-bam results/normal_LASeR/sim.srt.bam --out-dir results/v1_30x_B_severus -t 24
[2024-07-24 01:30:32] root: DEBUG: Python version: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
[2024-07-24 01:30:32] root: ERROR: Error: Control bam also inputted as target bam

For now, I am avoiding the problem by creating symbolic links for the tumor files:

ln -sf $(basename {input.tumor_bam}) $(dirname {input.tumor_bam})/tumor.bam
ln -sf $(basename {input.tumor_bam}).bai $(dirname {input.tumor_bam})/tumor.bam.bai
    severus \
        --target-bam $(dirname {input.tumor_bam})/tumor.bam \
        --control-bam {input.normal_bam} \
        --out-dir {output.dir} \
        -t {threads} \
        > {log} 2>&1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants