-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Underestimation of Repeat Number following Split-and-Align #44
Comments
@SeAudet Sorry for the late reply. "INFO: expRegionInLongRead= 0 0" indicates that there is no much longer reads. I need to investigate it to see why. Could you please share a smaller set of long reads with longer repeat so that I can debug it? Thanks. |
Thank you for the reply! I've shared a subset of the data from two samples (with and without repeat expansion) with you. There should be thousands of reads that align to RFC1, but let me know if you need more than what I've shared, or additional information about what I've tried before. |
Hi, I had similar underestimation problem. May I know how the problem was solved? |
@mengya98 This specific issue has not been solved yet, although playing with the parameters allowed me from nothing called, to a few repeats, to about 60-70% of repeats called. The latest attempt was done with the following parameters (pattern file was also changed so that the last number is +11 instead of -11) : I've had to put this project aside because it was taking up too much time, but the most interesting observation was that the intermediate "TRF scan" step seems to accurately define the number of repeats for relevant reads. Therefore I could just subset reads that perfectly aligned to my region of interest, quantify repeats with the TRF scan, extract the value for each read, and plot the information for a grant figure. Alternatively, I was suggested to try DeepRepeat (from same author), but failed to make it work. For a more standard repeat microsatellite (ATXN2) where there's more training data, the tool was pretty accurate to call the repeat number with the aforementioned parameters (with RepeatTime 10), so perhaps a little tweaking will suffice in your case! Otherwise, tool author is far more knowledgeable to try and help you. |
@SeAudet Thank you for sharing the 'TRF scan' investigation, and My apology for not solving it. May I know which email you used to share the data? For DeepRepeat, I suggest using docker and install @mengya98 It seems that split-alignment step does not work. Again, showing the bam of the subregion alignment will help the debug. Thanks. |
@SeAudet Thank you for sharing! @liuqianhn Thank you for the reply! I'll have more try. |
Hello, I'm trying to statistically predict the number of repeats for the CANVAS chr4 RFC1 expansion (STR=AAAAG) in a sample for which over 600k reads are available following a barcoded-PCR ONT sequencing. The reads have a pretty good quality for Nanopore data, and since the barcode was required on both ends during the demultiplexing steps, most reads do align to the ROI using minimap2. The only problem is that RepeatHMM seems to predict less than 90 repeats on both alleles, even if around 125 can clearly be seen in the fastq or bam. Even in the trf step of RepeatHmm, the temp file *.res outputs nice results like these:
@00000efc-9aae-4aa3-ab75-9d80d8e8f87a
282 929 5 127.8 5 89 9 1084 79 0 19 0 0.75 AAAAG AAAAGAAAAGAAAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAAAGAAAAGAAAAGAAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAGAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAGAAAAAGAAAAGAAAAGAAAAGAAAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAACAGTGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAGAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAA TTGCGCCACTGCACTCCAGTCTGGGCAACAGGAGCAAGACTCTGTTTCAA AGCATGTTCTAAAGAGAATAGCAACGGTGTAGCTGGAGTATCACTTGATA
@00005ca3-9a41-458d-94b8-246ea8556823
314 953 5 124.8 5 91 8 1065 80 0 19 0 0.73 AAAAG AAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAAGAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAAGAAAAAGAAAAGAAAAGAACAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAGAAAAGAAAAGAAAGAAAAGAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAA ATTGCGCCACTGCACTCCAGTCTGGGCAACAGAGCAAGACTCTGTTTCAA AGCATGTTCTAAAGAGAATAGCAACGGTGTGGCTGGGTATCCTTGATAAA
There is about 250bp that align to the ROI on both sides of the expansion, which should allow a decent alignment score. Here's the output scores for RFC1 :
INFO: expRegionInLongRead= 0 0
INFO: nonRepeatinLongRead= 37631 539660
p2sp ['rfc1', 11.8, [0, 0], 'allocr:0:37514, 1:22, 2:76, 3:17, 4:8, 5:8, 6:6, 7:3, 8:4, 9:3, 10:3, 12:1, 13:2, 14:1, 15:1, 16:1, 19:2, 20:4, 22:2, 23:1, 24:2, 26:2, 27:4, 29:1, 31:2, 35:2, 36:1, 55:1,', 37694, 539660]
I'm guessing I might not be running the script properly, but here's how I ran it (I know some of the options I've entered are already default, but that's what I used when it "worked") :
python {TOOL}/repeatHMM.py BAMinput --Onebamfile input_sorted.bam --hg hg19 --hgfile hg19.fa --Patternfile reference_sts/hg19/hg19.predefined.pa --repeatName rfc1 --GapCorrection 1 --FlankLength 30 --RepeatTime 10 --SeqTech Nanopore --UserDefinedUniqID {SAMPLE}_rfc1 --outlog {OUTPUT}/{SAMPLE}
I've also tried a more simple command where SplitAndReAlign is supposed to remain as Default=0 (Off) but it did not work :
python {TOOL}/repeatHMM.py BAMinput --Onebamfile input_sorted.bam --hg hg19 --hgfile hg19.fa --repeatName rfc1 --UserDefinedRepeat chr4/39350045/39350103/AAAAG/-11// --UserDefinedUniqID {SAMPLE}_{REPEAT}
In the .pa file, I've added my expansion as rfc1,chr4,39350045,39350103,AAAAG,-11,, - I'm not sure about the last number, but I've read in another issue that only the +/- is used. If it's wrong, let me know!
The text was updated successfully, but these errors were encountered: