Advice on multi-sample, multi-species, multi-contig analyses for both πN/πS and dN/dS #39
Thanks a lot for your interest in SNPGenie, @andreaswallberg! Regarding the questions:
I think this speaks to your question, as it sounds like your "sample" columns each represent one individual. Because SNPGenie expects each to represent a pool, this approach will not work. It will instead be necessary to bioinformatically "pool" the individuals you've sequenced, e.g., summarize them in a single VCF file for use with whichever SNPGenie VCF format is most convenient.

Ah! Looks like I got ahead of myself. Indeed, if Freebayes provides the bioinformatically pooled "summary" data in this way (i.e., the DP and AF tags in the INFO column), then using --vcfformat=2 should ignore the "sample" columns altogether and just use the DP and AF flags, so you should be good to go!
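As a quick sanity check before running with --vcfformat=2, you could confirm that every record in the Freebayes VCF actually carries DP and AF in its INFO column. A rough, untested Python sketch (the file name "variants.vcf" is just a placeholder):

```python
# Sanity check: confirm each variant record has the DP and AF INFO tags
# that --vcfformat=2 relies on. Assumes an uncompressed, tab-delimited VCF.

def info_to_dict(info_field):
    """Parse a VCF INFO string like 'DP=120;AF=0.35;TYPE=snp' into a dict."""
    pairs = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            pairs[key] = value
        else:
            pairs[entry] = True  # flag-type entries without a value
    return pairs

missing = 0
with open("variants.vcf") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        info = info_to_dict(fields[7])  # INFO is the 8th column
        if "DP" not in info or "AF" not in info:
            missing += 1

print(f"records lacking DP or AF in INFO: {missing}")
```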
I now see your question about PAML, etc.; that is fine, too. What you suggest would be ideal, and if I'd been a better coder when writing SNPGenie I surely would have used a modular approach that would allow another dN/dS engine to be "plugged in", as it were. Unfortunately, SNPGenie is anything but modular, and such a change would require a major rewrite that I (sadly) currently don't have time for. I do hope to take a crack at this whole thing again in coming years, using Python this time, and will make modularization a priority!

About the question of orthology, I think that your approach of mapping reads from different species to the same reference is a clever one. Depending on how distant the species are, you might run into problems with indels and rearrangements, etc. In my ideal world, it would be best to first align the references of all species together using a pipeline like Enredo/Pecan/Ortheus to determine which sites are homologous; exclude any sites with multiple hits in the alignment (i.e., exclude any sites that are even possibly duplications/paralogs); map reads separately to each reference; and finally convert the coordinates from each reference to one of the species' references using the multi-species alignment. This is of course too perfectionistic and terribly tedious; it's probably fine to use your approach so long as you're somehow trying to correct for possible biases, e.g., read-mapping bias that could occur due to regions of divergence or rearrangements.
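If you did go the whole-genome-alignment route, the final coordinate-conversion step could be as simple as a lookup. A rough, untested Python sketch; the "site_map.tsv" table and its columns are hypothetical, something you would derive yourself from the alignment after dropping multi-hit sites:

```python
# Hypothetical per-site lookup derived from a whole-genome alignment.
# Tab-separated columns: species_contig, species_pos, ref_contig, ref_pos.
# Sites with multiple hits (possible paralogs) are assumed to have been removed already.

lift = {}
with open("site_map.tsv") as handle:
    for line in handle:
        s_contig, s_pos, r_contig, r_pos = line.rstrip("\n").split("\t")
        lift[(s_contig, int(s_pos))] = (r_contig, int(r_pos))

def to_reference(contig, pos):
    """Return the reference coordinate for a species-specific site, or None if unalignable."""
    return lift.get((contig, pos))

# Example: convert one species-specific position before merging across species.
print(to_reference("scaffold_12", 3451))
```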
Could you explain what you mean by "accessions" when you ask whether the revcom script will "handle multiple accessions as is"? Hope at least some of this is helpful!
Thanks for the feedback! I fully appreciate the complexity and time needed to add new functions to old but working code :-) By "handle multiple accessions as is", I mean whether the script would work on files that include many separate contigs, but I think you answered my question. Basically, I need one set of trio files per contig (or scaffold). My current draft assembly spans thousands of contigs/scaffolds in the FASTA file, as do the variation data in the VCF file. So I had better split these into separate files before running the program?
Ah, understood now! Yes, unfortunately, it will be necessary to put each contig in its own FASTA file, and the SNPs and genes within each contig in their own corresponding VCF and GTF files, respectively. The positions of the SNPs (in the VCF) and the CDS entries (in the GTF) should be accurate with respect to that contig's (1-based) coordinates. Let me know if that makes sense!
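If it helps, here is a rough, untested Python sketch of how the splitting might be done. The input file names are placeholders, and it assumes contig names match exactly across the three files; since VCF and GTF positions are already relative to their own contig, no coordinate shifting should be needed:

```python
# Sketch: split a multi-contig FASTA, VCF, and GTF into per-contig trio files.
# "assembly.fasta", "variants.vcf", "genes.gtf" are placeholder names; output is
# one file per contig (e.g. contig1.fasta, contig1.vcf, contig1.gtf).

from collections import defaultdict

# 1. FASTA: one file per contig (header line starts each new output file).
current = None
for line in open("assembly.fasta"):
    if line.startswith(">"):
        name = line[1:].split()[0]
        if current:
            current.close()
        current = open(f"{name}.fasta", "w")
    current.write(line)
if current:
    current.close()

# 2. VCF: group records by contig (column 1), copying the header into every split VCF.
header_lines = []
records = defaultdict(list)
for line in open("variants.vcf"):
    if line.startswith("#"):
        header_lines.append(line)
    else:
        records[line.split("\t")[0]].append(line)
for contig, lines in records.items():
    with open(f"{contig}.vcf", "w") as out:
        out.writelines(header_lines + lines)

# 3. GTF: group entries by contig (column 1).
gtf_records = defaultdict(list)
for line in open("genes.gtf"):
    if line.startswith("#") or not line.strip():
        continue
    gtf_records[line.split("\t")[0]].append(line)
for contig, lines in gtf_records.items():
    with open(f"{contig}.gtf", "w") as out:
        out.writelines(lines)
```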
Makes a lot of sense! I set up a pipeline for dN/dS in PAML using high-quality pairwise alignments a few years ago. Perl is my main language, so perhaps I can dust off what I did back then, and see if it can be put to use in SNPGenie. I will let you know if I start playing around with your code.
Dear @cwnelson88,
Thanks for implementing these solutions in these scripts! They are hugely time-saving for anyone interested in doing gene-centered population genomics!
I have some questions which I think were partly answered in #34 and #35, but I will take the liberty of asking again, just to be sure.
Let's say I work with discrete libraries from individuals sampled from one population (one library, one read-group sample, and one column per individual following the "FORMAT" column on every line with a variant) and produce a VCF with a tool like Freebayes (VCFv4.2). The intention is to compute πN/πS across all these individuals, i.e., to treat them as a single population.
If run with --vcfformat=4, will these samples be treated as a single population or as separate populations?
Freebayes also outputs the global DP and AF tags in the INFO column. Will SNPGenie use these instead of the "SAMPLE" columns? Can I specify --vcfformat=2 to enforce using these tags and effectively parse the multi-sample VCF as a single-population VCF?
I suspect one answer to this is to map data from other species to the same reference and generate one VCF per species. Depending on genetic distance, each such VCF will contain quite a few gene regions with variants fixed for a non-reference alternate allele. Using the script from #34, species-specific sequences can then be exported and easily concatenated. This mapping approach sidesteps the issue of orthology detection using external tools.
Does this sound reasonable to you?
For dN/dS, is there any correction for multiple substitutions when estimating dN and dS?
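For context, the kind of correction I have in mind is the Jukes-Cantor-style correction used in the Nei-Gojobori method, d = -(3/4) ln(1 - (4/3) p). A minimal Python sketch with purely illustrative values (not real data, and not a claim about what SNPGenie does internally):

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor correction for multiple substitutions: d = -3/4 * ln(1 - 4/3 * p).
    p is the observed proportion of differing sites; must be < 0.75 for the log to be defined."""
    if p >= 0.75:
        raise ValueError("proportion too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

# Illustrative values only: pN and pS as proportions of differences at
# nonsynonymous and synonymous sites, respectively.
pN, pS = 0.02, 0.10
dN, dS = jukes_cantor(pN), jukes_cantor(pS)
print(f"dN = {dN:.4f}, dS = {dS:.4f}, dN/dS = {dN / dS:.3f}")
```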
Corrected dN and dS could possibly be computed separately using PAML. Would it be possible for you to implement a way of optionally side-loading those stats from a simple table, such that they can replace or complement the internal computation of dN/dS and be incorporated in the standard summary statistics computed by the program?
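To illustrate the kind of side-loading I mean, a rough Python sketch that merges a gene-keyed table of PAML dN/dS values into a per-gene results table. All file names and column names here are made up for illustration, not existing SNPGenie or PAML formats:

```python
import csv

# Hypothetical inputs: "paml_dnds.tsv" with columns gene, dN, dS (collected from codeml runs),
# and "gene_results.tsv" with a "product" column naming each gene.

paml = {}
with open("paml_dnds.tsv") as handle:
    for row in csv.DictReader(handle, delimiter="\t"):
        paml[row["gene"]] = (row["dN"], row["dS"])

with open("gene_results.tsv") as src, open("gene_results_with_paml.tsv", "w", newline="") as dst:
    reader = csv.DictReader(src, delimiter="\t")
    fieldnames = reader.fieldnames + ["paml_dN", "paml_dS"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    for row in reader:
        # Attach the externally computed values; "NA" when a gene has no PAML estimate.
        row["paml_dN"], row["paml_dS"] = paml.get(row["product"], ("NA", "NA"))
        writer.writerow(row)
```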
Must all included contigs have a gene model?
Will the reverse-complement script handle multiple accessions as is, or will I need to use alternative solutions?
Cheers!