Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TAMA collapse IndexError #126

Open
mainciburu opened this issue Feb 12, 2024 · 4 comments
Open

TAMA collapse IndexError #126

mainciburu opened this issue Feb 12, 2024 · 4 comments

Comments

@mainciburu
Copy link

Hi,
I'm trying to run TAMA collapse on ONT aligned reads that have gone through the following pipeline:
-Adapter trimming and reorienting with Pychopper

  • PolyA trimming with TAMA
  • Alignment to genome with uLTRA

Now, I'm using this command to run TAMA collapse

python tama/tama_collapse.py -s reads.filtered.sorted.sam -f $ref_genome_dir/$ref_genome -x no_cap -a 100 -z 100 -m 10 -p tama/collapse/mysample

And I'm getting this error

sam count 1530000 Traceback (most recent call last): File "tama/tama_collapse.py", line 5190, in <module> [h_count,s_count,i_count,d_count,mis_count,nomatch_dict,sj_pre_error_list,sj_post_error_list] = calc_error_rate(start_pos,cigar,seq_list,scaff_name,read_id) File "tama/tama_collapse.py", line 847, in calc_error_rate next_cig_flag = cig_char_list[i + 1] IndexError: list index out of ran

I've guess it has something to do with the cigar string and errors around splice junction. I've also found the first read in my sam that causes the error (I think there's more than one). It's this one:

99:1914|a1ac6b88-3c74-4652-8b19-b682f6e0e40d 0 NC_027836.1 1435 60 2X3=1D1X1=1X4=4I3=3X3=3X166=1X8=1X11=3D104=1I84=2D56=1X111=2X113=1X74=1I1X12=2X6=1X8=1X26=1X2=1X17=1X6=1X88=1X15=1X58=1X137=1X38=1X11=1X5=1X14=1X7=1X10=1X31=1I3=1I32=44I1X2=1I2=1D1X2=1D2=180I1=1X2=1X3=1X2=1D1=1X43=1X2=1X3=1D1=1X3=1X45=1D112=1X7D2992N * 0 0 CGACTTGGCATTAGAATTAGGCTTTGGGGCGAAAATGACTTTATTCAACAAATCATAAAGATATTGGAACATTATATTTTATTTTTGGAATTTGAGCAGGGATAGTAGGTACTTCTTTAAGTTTATTAATTCGAGCTGAATTAGGGACTCCAGGATCTTTAATTGGAGATGATCAAATTTATAATACTATTGTAGCAGCTCATACTTTTATTATATTTTTTATAGTTATACCTATTATAATTGGAGGATTTGGAAATTGACTTGTACCTTTAATATTAGGAGCCCCTGATATAGCTTTCCCACGTATAAATAATATAAGTTTTTTGACTTTTACCCCCATCTTTAACTTTATTAATTTCTAGTAGCATTGTAGAAAATGGAGCAGGAACTGGATGAACAGTTTACCCCCCTCTCCTCTAATATTGCTCATGGTGGTAGTTCAGTAGATTTAGCTATTTTCCCACTTCATTTAGCTGGAATTTCATCTATTTTAGGAGCTATTAACTTTATTACTACTATTATTAATATACGATTAAATAATTTATCATTTGATCAAATACCTTTATTTATTTAGGCTGTAGGTATTACTGCATTCTTATTATTATTATCTTTACCTGTTTTAGCCGGAGCTATTACTATATTACTTACTGATCGAAATTTAAATACATCATTTTTCGATCCTGCAGGTGGAGGTGATCCTATTCTTTATCAACATTTATTTTGATTTTTTGGACATCCTGAAGTATATATTTTAATTTTACCAAGGATTTGGTATACATTCTCATATTATTTCCCAAGAAAGAGGTAAAAAGGAAACATTCGGGTGTTTAGGTATAATTTACGCTATACTAGCAATTGGTTTATTAGGATTTATTGTTTGAGCTCATCATATATTTACTGTAGGAATAGATATTGATACACGAGCATATTTTACATCAGCAACAATAATTATTACTGTACCAACAGGTATTAAAATTTTTAGTTGATTAGCTACTTTCCATGGAACTCAAATTAATTATTCCCCATCTATTTTATGAAGATTAGGATTTGTATTTTTATTTACTGTAGGAGGATTAACAGGTGTAATTTTATCTAATTCTTCTATTGATATTACTTTACATGATACTTACTATGTAGTTGCTCATTTCCATTATGTTTTATCAATAGGAGCTGTATTTGCTATTTTAGGGGGATTTATTCATTGATACCCATTATTTACTGGTTTATCTTCAAATCCTTATTTATTAAAAATTCAATTTTTTATTATATTTATCCGGAAGTAAATTTAACTTTCTTCCCACAACATTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGATGCATGCCTGGTGGTATTCTGATTATCCTGATTCTTATATTTCATGAAATATTATTTTATTATTGAATCATATATTTCATTATTAGCAATCATATTTATATTAATTATTATTTGAAATCTATAATTAATCAACGAATTTCTTTATTTACTTTAAATTTATCTTCTTCAATTGAATGATATCAAAATTTACCACCAGCTGAACATTCATATAATGAATTGCCTATTTTT * XA:Z: XC:Z:Insufficient_junction_coverage_unclassified NM:i:297

Could you help me out understand what's going on? How could I filter the sam file to get rid of problematic reads?
Thank you!

@ksahlin
Copy link

ksahlin commented Feb 12, 2024

Hi,

My bad. I am the developer of uLTRA. What you observe is an error in how uLTRA output the cigar (problematic cigar string as it ends with 112=1X7D2992N). It would take me some time to track down and fix this error. I do see that the read looks problematic.

In the meantime, perhaps it is possible to handle such errors gracefully in TAMA with a try except statemrnt and printing a warning if an error occurs in this part of the code?

Best,
Kristoffer

@mainciburu
Copy link
Author

Thanks a lot for the fast reply! I've been trying to make sense of this error for a few days and I would have never found the answer. When you say that read I showed looks problematic, is it a formatting issue or does it look bad quality / bad alignment...?
In any case, I found and removed the reads with cigar string ending in N (it was just 3). I run TAMA again and the original error disappeared! Unfortunately, now I have a new error, probably not related to uLTRA anymore (?). I've seen it has been reported a couple of times here, but didn't get a clear idea of what to do about it in my case. It's the following:

Error with exon end earlier than exon start 738633 738633 ['105:518|689ca4ad-c28a-4bc4-aa8c-51cede2beb56', '99:514|04e01855-0675-4357-bac9-8cc36ad5497b', '133:524|25e08f18-7d09-4bb4-aab9-bfe7f4d66be7', '132:413|493618ca-d927-44c7-931e-c99a14ff3822', '105:523|9fbfd390-4964-454d-881d-3441cb994cfa', '119:412|dc5a1936-d9f9-4944-af15-ade882739d93', '121:325|54554b0c-84d6-4611-bdbf-c0fc12a887d3', '131:562|8a715b1c-eb42-4789-b79e-787b500e2b8c', '138:488|9b56f899-461f-4407-a590-8fa04b572ad4', '128:537|6655c133-3d0b-45d3-9e7d-3f1c09128914', '105:383|5c6fe978-7fba-46bb-8a79-b29c1f020d63', '108:395|79a13c40-b842-4f9f-b895-4a5fb21f61e0', '126:383|7cd9ba40-065a-4ee3-b6be-cbc70c24f81d', '139:519|20ff1d7f-5853-41bc-95ba-67141321180f', '126:415|09635947-bb94-4922-a3cf-acaff07eeac9', '130:541|45d993fe-091a-4a45-8944-adee3ab2e009', '136:549|9b0186ba-411a-4699-b259-552a1b257141', '98:311|250ea703-de96-40e3-9724-2084768bb525', '127:509|4ebcafbb-83af-4fdc-ba1b-b5523f43d734', '128:374|6026b742-a4d9-4d27-b33b-36956afd61fd', '129:379|8dea2747-d088-46f6-99a0-861dcd40c336', '129:384|2af9aaea-c4b1-477f-89e3-5b44629e65d1', '121:591|700b6c2b-320c-4a22-8c54-3327954e1be6', '126:507|de27e02f-378c-42c6-8174-4efa6d594b58', '104:411|f2bc7335-a24b-4a0d-9feb-c0835e17be5c', '136:545|e12af295-37af-4ef3-a6aa-2e895b22a41b', '129:417|68ad3d6c-e02a-4819-8f42-ca404a023041', '98:309|80ef9789-6713-4838-afd3-b33a35c91918', '136:574|f2dc187d-0b30-467b-8fbd-c0cf66bff3e5', '137:398|1c647ab7-c18a-4373-8d28-e8adb668983f', '131:532|e2e37bc5-1f95-4ecb-8ba8-b5d90f22e17e', '121:531|8bc49d2e-ac63-41c4-ab4f-77f9532d78d3']

So I still have a bunch of problematic reads... I attach the sam file with the reads mentioned in the error message here:
error_subset_sam.txt
I'll appreciate any hint on how to proceed. Thank you!!

@ksahlin
Copy link

ksahlin commented Feb 16, 2024

When you say that read I showed looks problematic, is it a formatting issue or does it look bad quality / bad alignment...?

I was referring to the long homopolymer/STR. However, a mature aligner should handle these things seamlessly (diss to my own aligner). Although I now see that also other established aligners are struggling with the cigar output as reported in this repo.

I've seen it has been reported a couple of times here, but didn't get a clear idea of what to do about it in my case.

Not sure. Perhaps you can try to remove also those reads (as done here: #113). The 30-40 reads affected is only a tiny subset of your data I assume(?), so shouldn't change your results notably unless they are all from some gene(s) that you are interested in.

@mainciburu
Copy link
Author

Thanks for the clarification! And yeh, I ended up removing those reads too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants