Underestimation of Repeat Number following Split-and-Align #44

SeAudet · 2021-09-24T17:33:36Z

Hello, I'm trying to statistically predict the number of repeats for the CANVAS chr4 RFC1 expansion (STR=AAAAG) in a sample for which over 600k reads are available following a barcoded-PCR ONT sequencing. The reads have a pretty good quality for Nanopore data, and since the barcode was required on both ends during the demultiplexing steps, most reads do align to the ROI using minimap2. The only problem is that RepeatHMM seems to predict less than 90 repeats on both alleles, even if around 125 can clearly be seen in the fastq or bam. Even in the trf step of RepeatHmm, the temp file *.res outputs nice results like these:
@00000efc-9aae-4aa3-ab75-9d80d8e8f87a
282 929 5 127.8 5 89 9 1084 79 0 19 0 0.75 AAAAG AAAAGAAAAGAAAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAAAGAAAAGAAAAGAAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAGAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAGAAAAAGAAAAGAAAAGAAAAGAAAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAACAGTGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAGAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAA TTGCGCCACTGCACTCCAGTCTGGGCAACAGGAGCAAGACTCTGTTTCAA AGCATGTTCTAAAGAGAATAGCAACGGTGTAGCTGGAGTATCACTTGATA
@00005ca3-9a41-458d-94b8-246ea8556823
314 953 5 124.8 5 91 8 1065 80 0 19 0 0.73 AAAAG AAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAAGAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAAGAAAAAGAAAAGAAAAGAACAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAGAAAAGAAAAGAAAGAAAAGAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAA ATTGCGCCACTGCACTCCAGTCTGGGCAACAGAGCAAGACTCTGTTTCAA AGCATGTTCTAAAGAGAATAGCAACGGTGTGGCTGGGTATCCTTGATAAA

There is about 250bp that align to the ROI on both sides of the expansion, which should allow a decent alignment score. Here's the output scores for RFC1 :
INFO: expRegionInLongRead= 0 0
INFO: nonRepeatinLongRead= 37631 539660
p2sp ['rfc1', 11.8, [0, 0], 'allocr:0:37514, 1:22, 2:76, 3:17, 4:8, 5:8, 6:6, 7:3, 8:4, 9:3, 10:3, 12:1, 13:2, 14:1, 15:1, 16:1, 19:2, 20:4, 22:2, 23:1, 24:2, 26:2, 27:4, 29:1, 31:2, 35:2, 36:1, 55:1,', 37694, 539660]

I'm guessing I might not be running the script properly, but here's how I ran it (I know some of the options I've entered are already default, but that's what I used when it "worked") :
python {TOOL}/repeatHMM.py BAMinput --Onebamfile input_sorted.bam --hg hg19 --hgfile hg19.fa --Patternfile reference_sts/hg19/hg19.predefined.pa --repeatName rfc1 --GapCorrection 1 --FlankLength 30 --RepeatTime 10 --SeqTech Nanopore --UserDefinedUniqID {SAMPLE}_rfc1 --outlog {OUTPUT}/{SAMPLE}

I've also tried a more simple command where SplitAndReAlign is supposed to remain as Default=0 (Off) but it did not work :
python {TOOL}/repeatHMM.py BAMinput --Onebamfile input_sorted.bam --hg hg19 --hgfile hg19.fa --repeatName rfc1 --UserDefinedRepeat chr4/39350045/39350103/AAAAG/-11// --UserDefinedUniqID {SAMPLE}_{REPEAT}

In the .pa file, I've added my expansion as rfc1,chr4,39350045,39350103,AAAAG,-11,, - I'm not sure about the last number, but I've read in another issue that only the +/- is used. If it's wrong, let me know!

liuqianhn · 2021-10-27T13:45:46Z

@SeAudet Sorry for the late reply. "INFO: expRegionInLongRead= 0 0" indicates that there is no much longer reads. I need to investigate it to see why. Could you please share a smaller set of long reads with longer repeat so that I can debug it? Thanks.

SeAudet · 2021-10-30T01:02:52Z

Thank you for the reply! I've shared a subset of the data from two samples (with and without repeat expansion) with you. There should be thousands of reads that align to RFC1, but let me know if you need more than what I've shared, or additional information about what I've tried before.

mengya98 · 2022-10-05T06:51:12Z

Hi, I had similar underestimation problem. May I know how the problem was solved?
My log:
INFO: expRegionInLongRead= 0 0
INFO: nonRepeatinLongRead= 52460 32948
INFO: p2sp ['atxn3', 14.0, [27, 27], 'allocr:0:4, 1:1, 2:1, 3:6, 4:6, 5:12, 6:6, 7:22, 8:26, 9:29, 10:46, 11:48, 12:52, 13:71, 14:85, 15:174, 16:272, 17:166, 18:216, 19:286, 20:425, 21:649, 22:1103, 23:1928, 24:4015, 25:8034, 26:13999, 27:15579, 28:4540, 29:534, 30:74, 31:20, 32:14, 33:8, 34:3, 35:1, 39:1, 41:1, 42:2, 48:1,', 52460, 32948]
However, 70-80 repeats can clearly be seen in the fastq or bam.
Thanks!

SeAudet · 2022-10-05T18:20:07Z

@mengya98 This specific issue has not been solved yet, although playing with the parameters allowed me from nothing called, to a few repeats, to about 60-70% of repeats called. The latest attempt was done with the following parameters (pattern file was also changed so that the last number is +11 instead of -11) :
python {TOOL}/repeatHMM.py BAMinput --Onebamfile {INPUT} --hg {MODE} --hgfile {GENOME} --Patternfile {PATTERN} --repeatName {REPEAT} --GapCorrection 1 --FlankLength 30 --RepeatTime 5 --SeqTech Nanopore --UserDefinedUniqID {SAMPLE}_{REPEAT} --outlog {OUTPUT}/{SAMPLE}

I've had to put this project aside because it was taking up too much time, but the most interesting observation was that the intermediate "TRF scan" step seems to accurately define the number of repeats for relevant reads. Therefore I could just subset reads that perfectly aligned to my region of interest, quantify repeats with the TRF scan, extract the value for each read, and plot the information for a grant figure. Alternatively, I was suggested to try DeepRepeat (from same author), but failed to make it work.

For a more standard repeat microsatellite (ATXN2) where there's more training data, the tool was pretty accurate to call the repeat number with the aforementioned parameters (with RepeatTime 10), so perhaps a little tweaking will suffice in your case! Otherwise, tool author is far more knowledgeable to try and help you.
Best of luck!

liuqianhn · 2022-10-08T07:00:19Z

@SeAudet Thank you for sharing the 'TRF scan' investigation, and My apology for not solving it. May I know which email you used to share the data? For DeepRepeat, I suggest using docker and install glibc-source. If you still have issues with docker after installing glibc-source, please kindly share your issues and we can check it.

@mengya98 It seems that split-alignment step does not work. Again, showing the bam of the subregion alignment will help the debug. Thanks.

mengya98 · 2022-10-09T13:00:14Z

@SeAudet Thank you for sharing!

@liuqianhn Thank you for the reply! I'll have more try.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Underestimation of Repeat Number following Split-and-Align #44

Underestimation of Repeat Number following Split-and-Align #44

SeAudet commented Sep 24, 2021 •

edited

Loading

liuqianhn commented Oct 27, 2021

SeAudet commented Oct 30, 2021

mengya98 commented Oct 5, 2022

SeAudet commented Oct 5, 2022 •

edited

Loading

liuqianhn commented Oct 8, 2022

mengya98 commented Oct 9, 2022

Underestimation of Repeat Number following Split-and-Align #44

Underestimation of Repeat Number following Split-and-Align #44

Comments

SeAudet commented Sep 24, 2021 • edited Loading

liuqianhn commented Oct 27, 2021

SeAudet commented Oct 30, 2021

mengya98 commented Oct 5, 2022

SeAudet commented Oct 5, 2022 • edited Loading

liuqianhn commented Oct 8, 2022

mengya98 commented Oct 9, 2022

SeAudet commented Sep 24, 2021 •

edited

Loading

SeAudet commented Oct 5, 2022 •

edited

Loading