Alignment outside of contig #2
Comments
This is strange; I will take a closer look at this over the weekend. It would help to have at least one read (sequence) whose aligned coordinates fall outside this contig, as well as the command line you used to call uLTRA. Would this be possible?
I have decided to hold off on tracking down this bug until I have a sequence that produces out-of-range alignment coordinates, as the bug may be nontrivial.
I prepared a minimal example by taking every 100th read from the affected contig:
samtools view CTL_S2_sorted.bam KI270827.1 | perl -lane '$c+=1; if ($c%100==0){print "\@read$c\n$F[9]\n+\n$F[10]"}' > test.fastq
This resulted in 10 reads. As the aligned reads did not contain quality strings, there are no quality values in the FASTQ file, but I think it does not matter. The reference genome and annotation I used are from GENCODE.
# prepare the index
uLTRA index GRCh38.genome.fa gencode.v36.chr_patch_hapl_scaff.annotation.gtf uLTRA_index --disable_infer
# align the reads
uLTRA align GRCh38.genome.fa test.fastq . --prefix test --index uLTRA_index --isoseq
Strangely, the resulting alignment contained only 4 reads (even though all were aligned to KI270827.1 when originally processed together with all reads). For each read, two alignments are found, one on chr11 and one on KI270827.1. Sometimes chr11 is reported as primary, sometimes KI270827.1. Except for the strand [1], the alignment to KI270827.1 is the same as in my original uLTRA alignment with all reads.
[1] Apparently the sequence in the SAM file is reverse-complemented if the read aligned to the - strand, which I did not consider when creating the FASTQ file.
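The two pitfalls mentioned in this comment (missing quality strings and reverse-strand reads stored reverse-complemented) can be handled when subsampling. The following is a minimal sketch, not part of uLTRA: the function name `sam_to_fastq_records`, the placeholder quality character, and the choice to skip secondary and supplementary records are all assumptions made for illustration.

```python
# Complement table for plain DNA bases (other IUPAC codes are left unchanged).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq_records(sam_lines, every=100, placeholder_qual="I"):
    """Yield (name, seq, qual) for every `every`-th alignment record,
    restored to the original read orientation.

    Assumption: secondary (0x100) and supplementary (0x800) records are
    skipped, because they may not carry the full read sequence.
    """
    count = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue
        count += 1
        if count % every:
            continue
        seq, qual = fields[9], fields[10]
        if qual == "*":
            # No qualities stored: emit a placeholder string of equal length
            # so the FASTQ record stays well formed.
            qual = placeholder_qual * len(seq)
        if flag & 0x10:
            # SEQ is stored reverse-complemented for reverse-strand alignments.
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        yield f"read{count}", seq, qual


def format_fastq(name, seq, qual):
    """Format one record as four FASTQ lines."""
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

The generator could be fed from the output of `samtools view` on standard input. `samtools fastq` also restores the original orientation of reverse-strand reads, so it may be a simpler route when no subsampling is needed.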
Hi Matthias, thanks for the effort in producing a minimal example; it really helps. I just wanted to double-check with you that you are using v36 of the GENCODE annotation. The reason that I ask is that I cannot find This may be precisely what is causing the bug, but I thought I would check that I have the correct GTF annotation file before I go on a bug hunt.
I can confirm that there is no annotation on KI270827.1 in the version I was using (v36).
So I looked into this and here are my observations:
I'm therefore wondering whether, in your case, uLTRA made use of an old If you do such a "complete rerun" (and also use a FASTA file or a properly formatted FASTQ file) and still observe this behaviour, I would have to take more serious debugging action. Let me know.
I reran the complete pipeline, but the issue persists...
As you found, the transcript is actually on KI270831.1, and the position seems to match. However, for me KI270827.1 is reported.
Thanks for your time with this! It really helps, as the bug may be difficult. I will rerun it on another machine to try to reproduce this bug. Are you sure that you didn't specify the full path to the I ran with these commands specifically:
I also tried the alignment step using only one core, i.e., with I ran on my MacBook. Python's dictionary iteration order changes across runs, and I use a dictionary to map reference names to ref_ID (used internally in the program), so I will start by examining this property.
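For context on the ordering hypothesis: in Python 3.7 and later, a `dict` iterates in insertion order, while the iteration order of a `set` of strings can differ between runs because string hashing is randomized unless `PYTHONHASHSEED` is fixed. One way to rule out ordering effects is to assign IDs strictly in the order the reference names are read and to keep an explicit inverse table. The sketch below is illustrative only; `build_ref_ids` is a hypothetical helper and is not taken from uLTRA's source.

```python
def build_ref_ids(ref_names):
    """Assign integer IDs to reference names in input order.

    Returns (name_to_id, id_to_name). Raises ValueError on duplicate names,
    since a duplicate would silently shift or overwrite IDs.
    Hypothetical helper for illustration; not uLTRA's implementation.
    """
    name_to_id = {}
    for name in ref_names:
        if name in name_to_id:
            raise ValueError(f"duplicate reference name: {name}")
        name_to_id[name] = len(name_to_id)
    # dict preserves insertion order, so position in this list equals the ID.
    id_to_name = list(name_to_id)
    return name_to_id, id_to_name
```

Because the inverse table is built from the same object as the forward mapping, a lookup by ID cannot return a name from a differently ordered container.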
You are right, I used absolute paths. I removed them as I thought it would make things easier to read. I noticed a mistake on my side, though I am not sure whether it is relevant: I reran the index and alignment steps with GRCh38.p13.genome.fa and got different results, but with the same issue. This time I got 45 alignments on KI270827.1, of which 42 are beyond the end position of the sequence. I also noticed that other sequences are affected, but I guess this one is very obvious, as it is relatively short. If you need more examples (e.g. the index I produced), let me know. As you mention that the dict implementation might be related to it, here are my exact Python version and Linux kernel:
I have not forgotten about this issue. I will take a look at it again after my vacation.
I can finally close this issue, as it has in all likelihood been resolved by a fix I just made for a very similar bug report, #17. If you still have this instance easily set up, a verification would be welcome, but I'm not expecting anything as it was so long ago now.
I used your tool (version 0.3, installed with pip) to align isoseq data to the human reference genome. For at least one contig I got reads beyond the last position (I renamed, sorted and binarized the output file reads.sam to CTL_S2_sorted.bam):
e.g. Contig KI270827.1 has 67707 bases.
These are the start positions of the aligned reads:
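A check for alignments that extend beyond the end of their contig can be sketched as follows. It reads contig lengths from the `@SQ` header lines and derives each alignment's end position from `POS` and the reference-consuming CIGAR operations (`M`, `D`, `N`, `=`, `X`). The function name `out_of_range_alignments` is an assumption made for illustration, and the input is expected to be SAM text that includes the header (for example from `samtools view -h`).

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance the reference


def out_of_range_alignments(sam_lines):
    """Return (read_name, contig, start, end) for alignments whose 1-based
    end position exceeds the contig length declared in the @SQ header."""
    contig_len = {}
    hits = []
    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            if line.startswith("@SQ"):
                tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
                contig_len[tags["SN"]] = int(tags["LN"])
            continue
        fields = line.split("\t")
        name, rname, pos, cigar = fields[0], fields[2], int(fields[3]), fields[5]
        if rname not in contig_len:  # unmapped ("*") or contig missing from header
            continue
        span = sum(int(n) for n, op in _CIGAR_OP.findall(cigar)
                   if op in _REF_CONSUMING)
        end = pos + max(span, 1) - 1  # treat a missing CIGAR as a 1-base span
        if end > contig_len[rname]:
            hits.append((name, rname, pos, end))
    return hits
```

With the contig length quoted above (67707 bases for KI270827.1), any record on that contig whose computed end exceeds 67707 would be reported.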