Single breakend SV
Usually, structural variations (SVs) are those in which the coordinates and directions of two breakpoints are identified (hereafter, we call them canonical SVs). However, reference sequences have not been well established for some repeat sequences (such as LINE1, telomere, and centromere sequences). When the breakpoints of SVs are located in these regions, those SVs are not well represented by the current reference genome scheme (e.g., GRCh38).
In order to capture and represent some of these SVs, we propose a class of SVs, which we call single breakend SVs, and have added a new functionality (the single breakend SV module). Single breakend SVs are characterized by the chromosome (Chr), coordinate (Pos), and direction (Dir) of one breakpoint, as well as the contig sequence beyond the breakpoint (Contig). Examples of single breakend SVs (minimal essential representation without any annotations) are as follows:
Chr Pos Dir Contig
chr1 150776209 + TGAATGGAATAATCATTGAACGGAATCGTTTCCATCCGATGATGATTCCATTCGATTCCGTTCAATGATTATTCCATTCGAGTCCATTCGA
chr15 26802064 - CATTCCCTTCTATTCAACTCGGAATGATTCCATTCCATTCCATTCCATTCCATTCCATTCCATTCCATTCCATTTCGTTCCATTCCATTCT
chr21 41639140 - AATGAAGCGGGTAAACGGCGGGAGTAACTATGACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGAGCGCATGAATGGATGAA
chr6 154151473 + TTCCATTCCATTAAATTTCATTCCATTCCATTCCATTCCTTCCATTTCATTCAACATCCATTCCACTCCAGTCCATTCCATTCGTGTCCAT
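The four-column representation above is easy to sanity-check before further processing. The following is a minimal sketch, assuming a whitespace-delimited table with a header line; the function name `check_sbnd_rows` is hypothetical and not part of nanomonsv, and the contig sequences are truncated for brevity.

```shell
# Report rows of a Chr/Pos/Dir/Contig table that look malformed.
# Reads the table (with a header line) from standard input.
# Returns non-zero if any row fails a check.
check_sbnd_rows() {
    awk '
        NR == 1 { next }    # skip the header line
        NF != 4 { print "line " NR ": expected 4 fields"; bad = 1; next }
        $2 !~ /^[0-9]+$/ { print "line " NR ": Pos is not an integer"; bad = 1 }
        $3 != "+" && $3 != "-" { print "line " NR ": Dir must be + or -"; bad = 1 }
        $4 !~ /^[ACGTN]+$/ { print "line " NR ": Contig has non-ACGTN characters"; bad = 1 }
        END { exit bad + 0 }
    '
}

# Example with two of the rows shown above (contigs truncated).
check_sbnd_rows <<'EOF' && echo "all rows look well-formed"
Chr Pos Dir Contig
chr1 150776209 + TGAATGGAATAATCATTGAACGG
chr15 26802064 - CATTCCCTTCTATTCAACTCGG
EOF
```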
Single breakend SVs can be identified even from short-read sequencing data (e.g., by GRIDSS2). However, long-read sequencing data make it possible to obtain a more extended contig sequence beyond the breakpoint, thus enabling a more precise understanding of the features of SVs.
One limitation of the single breakend SV module is that it requires at least one of the breakpoints to be located in a well-established region of the reference genome. SVs in which both breakpoints are in highly ambiguous regions (such as centromere sequences) cannot be detected even with this method.
To identify single breakend SVs, please add the --single_bnd and --use_racon options to the get command.
Please see the tutorial page for an example. We will then obtain the *.nanomonsv.sbnd.result.txt result file.
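As a rough illustration of the two options, the sketch below assembles a `nanomonsv get` command line and prints it as a dry run. The wrapper name `run_get_sbnd`, the `DRY_RUN` variable, and the BAM file name are assumptions made for this example, and the positional order (output prefix, BAM, reference) should be checked against `nanomonsv get --help`. It also assumes the preceding tutorial steps have already been run.

```shell
# Build (and optionally run) a nanomonsv get command with single breakend detection.
# $1: tumor output prefix, $2: tumor BAM, $3: reference FASTA
# Any further arguments (e.g. control sample options) are passed through unchanged.
run_get_sbnd() {
    tumor_prefix=$1; tumor_bam=$2; reference=$3
    shift 3
    set -- nanomonsv get "$tumor_prefix" "$tumor_bam" "$reference" "$@" \
        --single_bnd --use_racon
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"      # dry run (default): only print the command
    else
        "$@"           # DRY_RUN=0: execute it (requires nanomonsv and racon)
    fi
}

# COLO829.bam is a placeholder file name for the tutorial data.
run_get_sbnd output/COLO829/COLO829 COLO829.bam reference/Homo_sapiens_assembly38.fasta
```

Printing the command first makes it easy to confirm the paths before starting a long-running job; setting `DRY_RUN=0` executes it.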
We provide a script (postprocess_sbnd.sh) for integrating, classifying, and annotating single breakend SVs. We assume that users have already installed bwa and RepeatMasker. Please also install tidyverse and ggrepel for the R script.
wget https://raw.githubusercontent.com/friend1ws/nanomonsv/master/misc/postprocess_sbnd.sh
mkdir subscript_postprocess_sbnd
wget -P subscript_postprocess_sbnd https://raw.githubusercontent.com/friend1ws/nanomonsv/master/misc/subscript_postprocess_sbnd/add_simple_repeat.py
wget -P subscript_postprocess_sbnd https://raw.githubusercontent.com/friend1ws/nanomonsv/master/misc/subscript_postprocess_sbnd/add_simple_repeat_sbnd.py
wget -P subscript_postprocess_sbnd https://raw.githubusercontent.com/friend1ws/nanomonsv/master/misc/subscript_postprocess_sbnd/integrate_sbnd.py
wget -P subscript_postprocess_sbnd https://raw.githubusercontent.com/friend1ws/nanomonsv/master/misc/subscript_postprocess_sbnd/plot_sbnd_vis.R
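Before running the script, it may help to confirm that the external tools are on the PATH. This is a small sketch, not part of nanomonsv: the helper name `missing_tools` is hypothetical, and the assumption that the R script is launched through `Rscript` should be checked against postprocess_sbnd.sh.

```shell
# Print the names of any required commands that are not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools bwa RepeatMasker Rscript)
if [ -n "$missing" ]; then
    echo "Please install before running postprocess_sbnd.sh:" $missing
fi

# The R packages can be checked in a similar way once Rscript is available:
# Rscript -e 'library(tidyverse); library(ggrepel)'
```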
In the following, we will describe how to execute this script in the setting of the tutorial page. First, please create a bgzip'ed and tabix'ed simple repeat bed file (again, please consult the tutorial page).
bash postprocess_sbnd.sh $PWD/output/COLO829/COLO829 $PWD/reference/Homo_sapiens_assembly38.fasta simpleRepeat.bed.gz
This script performs the following procedures:
- For each single breakend SV contig sequence, perform alignment to the human reference genome and annotation by RepeatMasker.
- Classify single breakend SVs based on the above results.
- Run the nanomonsv insert_classify command.
- Add simple repeat annotation.
- Generate visualizations of human genome alignment and RepeatMasker of single breakend SV contig sequences.
We will get the following result files:
- *.nanomonsv.annot.proc.result.txt: The annotation result of canonical SVs. Canonical SVs and LINE1 mediated rearrangements identified by the single breakend SV module are included here.
- *.nanomonsv.sbnd.annot.proc.result.txt: The annotation result of single breakend SVs.
- *.nanomonsv.sbnd_vis: Directory including visualizations of human genome alignment and RepeatMasker of single breakend SV contig sequences.
We recommend removing SVs not tagged with PASS.
head -n 1 $PWD/output/COLO829/COLO829.nanomonsv.annot.proc.result.txt > $PWD/output/COLO829/COLO829.nanomonsv.annot.proc.pass.result.txt
tail -n +2 $PWD/output/COLO829/COLO829.nanomonsv.annot.proc.result.txt | grep PASS >> $PWD/output/COLO829/COLO829.nanomonsv.annot.proc.pass.result.txt
head -n 1 $PWD/output/COLO829/COLO829.nanomonsv.sbnd.annot.proc.result.txt > $PWD/output/COLO829/COLO829.nanomonsv.sbnd.annot.proc.pass.result.txt
tail -n +2 $PWD/output/COLO829/COLO829.nanomonsv.sbnd.annot.proc.result.txt | grep PASS >> $PWD/output/COLO829/COLO829.nanomonsv.sbnd.annot.proc.pass.result.txt
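The `grep PASS` commands above keep any line that contains the string PASS anywhere. A stricter alternative is to compare one named column exactly. The sketch below assumes a tab-delimited file with a header line and a filter column called `Is_Filter`; that column name is an assumption, so please check it against the header of your own result file. The helper name `filter_pass` and the toy rows are made up for illustration.

```shell
# Keep the header plus the rows whose filter column is exactly PASS.
# $1: tab-delimited result file with a header line
# $2: name of the filter column (assumed here to be Is_Filter)
filter_pass() {
    awk -F '\t' -v col="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        $idx == "PASS"
    ' "$1"
}

# Toy input (not real nanomonsv output) to show the behaviour.
tmp=$(mktemp)
printf 'Chr_1\tPos_1\tIs_Filter\nchr1\t100\tPASS\nchr2\t200\tFILTERED\n' > "$tmp"
filter_pass "$tmp" Is_Filter
rm -f "$tmp"
```

On a real result file the call would look like `filter_pass in.result.txt Is_Filter > out.pass.result.txt`, with the file names replaced by your own.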
[Figure: left panel, LINE1 mediated rearrangement; right panel, SVs involving centromere sequence]
The left panel is a typical example of LINE1 mediated rearrangement, where the first portion of the contig sequence matches the LINE1 sequence, and the remaining portion unambiguously matches the human genome sequence distant from the breakpoints. From this panel, we can discern that this SV corresponds to the 1,105 bp deletion (chr5:2,227,133-2,228,237) mediated by the 1,328 bp LINE1 sequence.
The right panel shows an SV involving centromere sequences, where the contig sequence is annotated as centromere sequence. The breakpoint is located on chr4, and the contig sequence is aligned, although ambiguously, to chromosome 18 (in GRCh38). Therefore, the SV is probably an interchromosomal translocation.
The single breakend SV module of nanomonsv consists of the following four steps.
- Parsing: the reads putatively supporting single breakend SVs are extracted from both tumor and matched control BAM files using soft clipping information in the CIGAR strings.
- Clustering: the reads from the tumor sample that presumably support the same single breakend SVs are clustered. The candidates are removed if apparent supporting reads are detected in the matched control sample (or non-matched control panel samples when they are available).
- Refinement: the soft-clipped parts of the reads are gathered with 100 bp margins inside the breakpoints, and an error-corrected consensus sequence is generated by two rounds of all-vs-all alignment by minimap2 and polishing with racon. Then, by aligning the consensus sequence to the sequences around the possible breakpoint regions with the Smith-Waterman algorithm, we detect the breakpoints at single-base resolution and the consensus sequence after the breakpoint.
- Validation: from the breakpoint determined in the previous step and the error-corrected consensus sequence after the breakpoint, we generate the “putative SV segment sequence.” Then, the reads around the breakpoint of putative single breakend SVs are classified into “variant supporting read” or “reference read” for both tumor and matched control. Finally, candidate SVs with >=3 variant supporting reads in the tumor and no variant supporting reads in the matched control sample are kept as the final single breakend SVs. See the Method section of the preprint for details.
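Two ideas from the steps above can be sketched in a few lines of shell: reading soft-clip lengths from a CIGAR string (Parsing) and applying the final read-count rule (Validation). This is an illustration only and not the nanomonsv implementation. The function names and the three-column input layout (ID, tumor supporting reads, control supporting reads) are hypothetical.

```shell
# Print the leading and trailing soft-clip lengths (tab-separated) of one CIGAR string.
softclip_lengths() {
    echo "$1" | awk '{
        lead = 0; trail = 0
        c = $0
        sub(/^[0-9]+H/, "", c)    # ignore a leading hard clip
        sub(/[0-9]+H$/, "", c)    # ignore a trailing hard clip
        if (match(c, /^[0-9]+S/)) lead  = substr(c, 1, RLENGTH - 1)
        if (match(c, /[0-9]+S$/)) trail = substr(c, RSTART, RLENGTH - 1)
        print lead "\t" trail
    }'
}

# Keep candidates with at least $1 (default 3) variant supporting reads in the tumor
# and none in the matched control.
# Input on stdin (hypothetical layout): ID, tumor supporting reads, control supporting reads.
keep_validated() {
    awk -v min_tumor="${1:-3}" '$2 >= min_tumor && $3 == 0'
}

softclip_lengths 30S70M
printf 'sv1 5 0\nsv2 2 0\nsv3 7 1\n' | keep_validated 3
```

In the toy input, only the first candidate satisfies both conditions; the second has too few tumor reads and the third has support in the control.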
After removing SVs that share a breakpoint with SVs already detected by the canonical SV module, the remaining SVs are classified by integrating the alignment of contig sequences to the human reference genome (HG) and the annotation results by RepeatMasker (RM). The right panel shows the typical pattern of an alignment to HG and an annotation result by RM of the contig for each category. L1HS stands for the human LINE-1 (L1) element L1 Homo sapiens (L1Hs).