Keep duplicates "contigs hitting multiple probes" #328

Open
sbabbbie opened this issue Feb 14, 2024 · 3 comments

Comments

@sbabbbie

Hi there, I have a duplicates file produced with the --keep-duplicates flag, but I'm confused about how to automate retrieving the contigs that map to multiple UCEs. Because I am working with very small genomes, many of my UCEs seem to be close enough together that a single assembled contig covers several of them. I would still like to include these loci in my downstream analysis rather than dropping them, but I'm not sure of the most efficient way to do this. I see you have a script for the opposite issue: phyluce_assembly_parse_duplicates_file.py retrieves contigs under "probes hitting multiple contigs" rather than "contigs hitting multiple probes", which is what I need. I've tried editing this script to look at contigs hitting multiple probes instead, but I keep getting blank output files.

Would appreciate any advice!
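
For reference, here is a minimal sketch of the kind of parsing described above. It is not phyluce code. It assumes the duplicates file consists of bracketed section headers of the form "[taxon - contigs hitting multiple probes]" followed by "contig:locus, locus" lines; that layout, the function name, and the example taxon, contig, and locus names are all assumptions for illustration and should be checked against a real duplicates file.

```python
from collections import defaultdict


def parse_contigs_hitting_multiple_probes(text):
    """Return {taxon: {contig: [locus, ...]}} from duplicates-file text.

    Assumed layout (verify against your own file):
        [taxon - contigs hitting multiple probes]
        contig_name:locus_a, locus_b
    """
    result = defaultdict(dict)
    taxon = None
    wanted = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # Split "taxon - section kind" on the last " - " in the header.
            taxon, _, kind = line[1:-1].rpartition(" - ")
            wanted = kind == "contigs hitting multiple probes"
            continue
        if wanted and ":" in line:
            contig, _, loci = line.partition(":")
            result[taxon][contig.strip()] = [
                name.strip() for name in loci.split(",") if name.strip()
            ]
    return dict(result)


# Hypothetical example content, not real output.
example = """\
[taxon_a - probes hitting multiple contigs]
uce-10:NODE_1, NODE_7

[taxon_a - contigs hitting multiple probes]
NODE_3:uce-20, uce-21
"""
dupes = parse_contigs_hitting_multiple_probes(example)
print(dupes)
```

The result only lists which contigs span which loci; extracting and splitting the sequences themselves is a separate step.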

@brantfaircloth
Member

Howdy,

What types of data are you inputting? If loci are proximate to one another in the assemblies you have, it might be worthwhile to consider following the "harvesting loci from genomes" approach (e.g. Tutorial 3) and reducing the distance sliced from the "core" of each UCE locus identified (within a given contig). Then, input those genome slices to the normal approach.

Just keep in mind that if the loci are VERY proximate to one another, you are not getting an independent-ish draw from the genome.
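
To illustrate the idea of reducing the distance sliced around each UCE "core", here is a small sketch, not phyluce's implementation. The function names, the 0-based half-open coordinates, and the example hit positions are hypothetical. It computes the slice window for each core on one contig for a given flank and reports whether any neighboring windows overlap.

```python
def slice_windows(contig_length, hits, flank):
    """Return {locus: (start, end)} windows clamped to the contig.

    hits: list of (locus, core_start, core_end), 0-based half-open
    coordinates on a single contig (hypothetical input format).
    """
    windows = {}
    for locus, start, end in hits:
        windows[locus] = (max(0, start - flank), min(contig_length, end + flank))
    return windows


def windows_overlap(windows):
    """True if any two windows on the contig share bases."""
    spans = sorted(windows.values())
    return any(prev[1] > cur[0] for prev, cur in zip(spans, spans[1:]))


# Hypothetical contig with two UCE cores 480 bp apart.
hits = [("uce-20", 300, 420), ("uce-21", 900, 1020)]
wide = slice_windows(2000, hits, flank=500)
narrow = slice_windows(2000, hits, flank=200)
print(wide, windows_overlap(wide))
print(narrow, windows_overlap(narrow))
```

With the wider flank the two slices share sequence, while the narrower flank keeps them separate, which is the trade-off between flank length and locus independence mentioned above.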

@sbabbbie
Author

sbabbbie commented Feb 14, 2024

Thank you, that's a very useful suggestion! I am working with contigs assembled in SPAdes from raw next-gen sequencing data, trying to identify which UCEs I have represented. Luckily I have many UCEs from all over the genome and they are not ALL very proximate to one another, but there are definitely some that are close enough together that they get assembled into a single contig that then hits multiple probes. Having them all dropped messes up my analysis because I end up underestimating locus representation across taxa. I will try the harvesting loci from genomes approach and see if that solves my issue!

@brantfaircloth
Member

Another option would be to switch to guided assembly of your contigs based on the probe sequence (e.g. as in aTRAM or itero), but that might not work so well if your reads are divergent from your baits/loci.
