Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Underestimation of Repeat Number following Split-and-Align #44

Open
SeAudet opened this issue Sep 24, 2021 · 6 comments
Open

Underestimation of Repeat Number following Split-and-Align #44

SeAudet opened this issue Sep 24, 2021 · 6 comments

Comments

@SeAudet
Copy link

SeAudet commented Sep 24, 2021

Hello, I'm trying to statistically predict the number of repeats for the CANVAS chr4 RFC1 expansion (STR=AAAAG) in a sample for which over 600k reads are available following a barcoded-PCR ONT sequencing. The reads have a pretty good quality for Nanopore data, and since the barcode was required on both ends during the demultiplexing steps, most reads do align to the ROI using minimap2. The only problem is that RepeatHMM seems to predict less than 90 repeats on both alleles, even if around 125 can clearly be seen in the fastq or bam. Even in the trf step of RepeatHmm, the temp file *.res outputs nice results like these:
@00000efc-9aae-4aa3-ab75-9d80d8e8f87a
282 929 5 127.8 5 89 9 1084 79 0 19 0 0.75 AAAAG AAAAGAAAAGAAAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAAAGAAAAGAAAAGAAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAGAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAGAAAAAGAAAAGAAAAGAAAAGAAAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAACAGTGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAGAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAA TTGCGCCACTGCACTCCAGTCTGGGCAACAGGAGCAAGACTCTGTTTCAA AGCATGTTCTAAAGAGAATAGCAACGGTGTAGCTGGAGTATCACTTGATA
@00005ca3-9a41-458d-94b8-246ea8556823
314 953 5 124.8 5 91 8 1065 80 0 19 0 0.73 AAAAG AAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAAGAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAAGAAAAAGAAAAGAAAAGAACAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAGAAAAGAAAAGAAAGAAAAGAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAA ATTGCGCCACTGCACTCCAGTCTGGGCAACAGAGCAAGACTCTGTTTCAA AGCATGTTCTAAAGAGAATAGCAACGGTGTGGCTGGGTATCCTTGATAAA

There is about 250bp that align to the ROI on both sides of the expansion, which should allow a decent alignment score. Here's the output scores for RFC1 :
INFO: expRegionInLongRead= 0 0
INFO: nonRepeatinLongRead= 37631 539660
p2sp ['rfc1', 11.8, [0, 0], 'allocr:0:37514, 1:22, 2:76, 3:17, 4:8, 5:8, 6:6, 7:3, 8:4, 9:3, 10:3, 12:1, 13:2, 14:1, 15:1, 16:1, 19:2, 20:4, 22:2, 23:1, 24:2, 26:2, 27:4, 29:1, 31:2, 35:2, 36:1, 55:1,', 37694, 539660]

I'm guessing I might not be running the script properly, but here's how I ran it (I know some of the options I've entered are already default, but that's what I used when it "worked") :
python {TOOL}/repeatHMM.py BAMinput --Onebamfile input_sorted.bam --hg hg19 --hgfile hg19.fa --Patternfile reference_sts/hg19/hg19.predefined.pa --repeatName rfc1 --GapCorrection 1 --FlankLength 30 --RepeatTime 10 --SeqTech Nanopore --UserDefinedUniqID {SAMPLE}_rfc1 --outlog {OUTPUT}/{SAMPLE}

I've also tried a more simple command where SplitAndReAlign is supposed to remain as Default=0 (Off) but it did not work :
python {TOOL}/repeatHMM.py BAMinput --Onebamfile input_sorted.bam --hg hg19 --hgfile hg19.fa --repeatName rfc1 --UserDefinedRepeat chr4/39350045/39350103/AAAAG/-11// --UserDefinedUniqID {SAMPLE}_{REPEAT}

In the .pa file, I've added my expansion as rfc1,chr4,39350045,39350103,AAAAG,-11,, - I'm not sure about the last number, but I've read in another issue that only the +/- is used. If it's wrong, let me know!

@liuqianhn
Copy link
Collaborator

@SeAudet Sorry for the late reply. "INFO: expRegionInLongRead= 0 0" indicates that there is no much longer reads. I need to investigate it to see why. Could you please share a smaller set of long reads with longer repeat so that I can debug it? Thanks.

@SeAudet
Copy link
Author

SeAudet commented Oct 30, 2021

Thank you for the reply! I've shared a subset of the data from two samples (with and without repeat expansion) with you. There should be thousands of reads that align to RFC1, but let me know if you need more than what I've shared, or additional information about what I've tried before.

@mengya98
Copy link

mengya98 commented Oct 5, 2022

Hi, I had similar underestimation problem. May I know how the problem was solved?
My log:
INFO: expRegionInLongRead= 0 0
INFO: nonRepeatinLongRead= 52460 32948
INFO: p2sp ['atxn3', 14.0, [27, 27], 'allocr:0:4, 1:1, 2:1, 3:6, 4:6, 5:12, 6:6, 7:22, 8:26, 9:29, 10:46, 11:48, 12:52, 13:71, 14:85, 15:174, 16:272, 17:166, 18:216, 19:286, 20:425, 21:649, 22:1103, 23:1928, 24:4015, 25:8034, 26:13999, 27:15579, 28:4540, 29:534, 30:74, 31:20, 32:14, 33:8, 34:3, 35:1, 39:1, 41:1, 42:2, 48:1,', 52460, 32948]
However, 70-80 repeats can clearly be seen in the fastq or bam.
Thanks!

@SeAudet
Copy link
Author

SeAudet commented Oct 5, 2022

@mengya98 This specific issue has not been solved yet, although playing with the parameters allowed me from nothing called, to a few repeats, to about 60-70% of repeats called. The latest attempt was done with the following parameters (pattern file was also changed so that the last number is +11 instead of -11) :
python {TOOL}/repeatHMM.py BAMinput --Onebamfile {INPUT} --hg {MODE} --hgfile {GENOME} --Patternfile {PATTERN} --repeatName {REPEAT} --GapCorrection 1 --FlankLength 30 --RepeatTime 5 --SeqTech Nanopore --UserDefinedUniqID {SAMPLE}_{REPEAT} --outlog {OUTPUT}/{SAMPLE}

I've had to put this project aside because it was taking up too much time, but the most interesting observation was that the intermediate "TRF scan" step seems to accurately define the number of repeats for relevant reads. Therefore I could just subset reads that perfectly aligned to my region of interest, quantify repeats with the TRF scan, extract the value for each read, and plot the information for a grant figure. Alternatively, I was suggested to try DeepRepeat (from same author), but failed to make it work.

For a more standard repeat microsatellite (ATXN2) where there's more training data, the tool was pretty accurate to call the repeat number with the aforementioned parameters (with RepeatTime 10), so perhaps a little tweaking will suffice in your case! Otherwise, tool author is far more knowledgeable to try and help you.
Best of luck!

@liuqianhn
Copy link
Collaborator

@SeAudet Thank you for sharing the 'TRF scan' investigation, and My apology for not solving it. May I know which email you used to share the data? For DeepRepeat, I suggest using docker and install glibc-source. If you still have issues with docker after installing glibc-source, please kindly share your issues and we can check it.

@mengya98 It seems that split-alignment step does not work. Again, showing the bam of the subregion alignment will help the debug. Thanks.

@mengya98
Copy link

mengya98 commented Oct 9, 2022

@SeAudet Thank you for sharing!

@liuqianhn Thank you for the reply! I'll have more try.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants