-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding auto-annotate and inventory compilation scripts, plus new annotation files #52
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @nbokulich,
Some high-level comments...
The auto-annotate functionality looks extremely handy, but I think it's beyond the scope of this specific repo to house this utility code. I can see people wanting to use the code, and then you having to support it, and that's not really what we designed this repo to be.
A few motivations for putting the code in another repo:
- It will help keep mockrobiota focused as data resource. It wasn't intended to be a software package, and there are different things that you need to do to manage a software package versus a data resource (e.g., releasing on package managers like pypi is important for software, but not for data resources). The only code that is currently in this repo is to check for data integrity, and I think we should stick with that.
- As of now there isn't any test code for this, so you really need to be careful with it. If you put it in this repo, which we point users at, we're sort of endorsing use of the code and people will start using it. We can't let that happen unless the code has been fully unit tested.
So, I think you should remove the code directory from this PR, and create a new repository with this code, for example github.com/nbokulich/mockrobiota-utilities
. That doesn't mean that we could never re-visit this decision, or that you shouldn't use it to help with management of mockrobiota, but it would clearly distinguish the stable data resource from the ultility code that you use to help generate the data files. Once you make that transition, I could give more feedback on the code if you'd like (though transitioning from getopt
to click
as I mention in my inline comments would be helpful to do first, since that will reduce the amount of code by a lot).
Same goes for the other utility scripts that you've included in this pull request. After unit testing, supporting fuzzy matching, etc, I could see the auto-annotate script being publishable as an applications note if we wanted to go that route. So I think you should put these in another repo, use them for a while, and then we should discuss whether it's worth trying to make them into a tool that we'd publicly support/publish (in which case it might be time to put them into mockrobiota
, though that's a decision we'd want to weigh at the time) or whether we just want them to remain utilities that are not support for use by others.
Other high-level comments:
- could you rename the otu directories to use dashes instead of underscores (so
99-otus
rather than99_otus
) for consistency with the other directory names in the repo? It's a pain to mix both as you can never remember/guess which will be used at any given time. (We like dashes because you don't have to hit shift... yep, programmers are that lazy...) - remove
data/mock-12/.DS_Store
, and any other.DS_store
files that are hiding around the repo. If you add.DS_store
to the.gitignore
file, git won't try to track them anymore. Here's an example. - I think you should rename
data/mock-12/source/ExtremeRefSeqs.fasta
todata/mock-12/source/expected-sequences.fasta
, for consistency withexpected-taxonomy.tsv
. What do you think of that? Simiarly,HMP_MOCK.fasta
should becomemock.1.fasta
(and in general, the names should be .fasta).
The additional data sets are fantastic, and it's awesome that we have sequence data for some now!
One other comment: it's generally a good idea to keep your pull requests more modular. In this case, I would have made the changes to all of the data one PR, the addition of the new data sets another PR, and the addition of the code into a third PR. This makes them easier to review, and in this case would mean less work because we could just not merge the PR that adds the code
directory, rather than having to revert the addtion in this PR.
I can review again when you've made these changes. Thanks for the work on this!
from os import makedirs | ||
import re | ||
import sys | ||
import getopt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally use click for our command line interfaces. You might want to look into it, as I think it would reduce a lot of the boilerplate code here, and it would make it easier for others to help with maintenance of this. I think we should transition this from getopt
to click, but it doesn't necessarily have to happen in this PR. If you want to do it in a different PR, could you create an issue for that?
import getopt | ||
|
||
|
||
def add_lists(l1, l2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a convention, you should rename all of your functions to begin with an _
(so this would become _add_lists
), unless you want them to be public (meaning that you support other Python programs importing these). If functions are public, then you have to think about backward compatibility, not changing interfaces, etc, so you generally want to make as little public as you can get away with. In this case, since your users would access this functionality though the CLI that you're creating, I highly recommend having no public functions. Users can still import functions that begin with _
, but it's a general rule that they shouldn't (so it's their fault if they do and their workflows then break).
return lineage | ||
|
||
|
||
usage = '''usage: autoannotate-taxa.py -i <source_fp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of this is what you wouldn't have to write with click.
sp = arg | ||
|
||
# generate dict of {name: (genus, species, abundances)} | ||
with open(source_fp, "r") as source: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If source_fp
doesn't exist, the user is going to see a traceback error. You should detect these types of issues and present more useful error messages.
|
||
# search for match at genus, then species level | ||
for full, partial in ref_taxa.items(): | ||
if t[0][0] in partial[0]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exact matching is really strict - I wonder if you want to use fuzzy matching instead, to help you identify slight differences in spelling or spelling errors. Fuzzywuzzy is one package that does this. This isn't essential, but is a "nice to have" feature that would probably save you some time.
if not exists(destination_dir): | ||
makedirs(destination_dir) | ||
|
||
with open(join(destination_dir, 'expected-taxonomy.tsv'), "w") as dest: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will overwrite the file if it already exists - is that what you want to do? There's not a right or wrong way to handle this, it's just worth thinking about whether you would want to error if the file already exists to avoid losing work if the user doesn't realize they're about to overwrite a file, or if you just want to overwrite. This decision generally depends on how hard it would be to regenerate the overwritten file if the user didn't really want to overwrite it.
@@ -0,0 +1,11 @@ | |||
#mock-12 | |||
|
|||
Composed of 27 bacterial strains amplified with 515f/806r primers. Intended to include more fine-scale variation than the other mock communities, the members of which were chosen in part for their well-separated 16S rRNA gene sequences. The Extreme strains are all distinguishable over the sequenced region of the 16S rRNA gene, but some pairs of strains differ by as little as one nucleotide. Generated by Benjamin Callahan at Stanford University in 2015. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more fine-scale variation -> more closely related taxa
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Extreme strains -> These strains, because you haven't mentioned the name extreme yet.
Thanks @gregcaporaso ! I have pushed all requested changes (with the exception of code edits, as these are now moved to a new repo). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks good, but I noticed one more thing:
-
mockrobiota/data/example-1/greengenes/13_8/97/expected-taxonomy.tsv
have97-otus
as the lowest level directory (ie., not just97
?
Once that change is made this is ready for merge.
Thanks @gregcaporaso ! Have made that fix and pushed the changes. |
Thanks @nbokulich, looks good! |
@gregcaporaso would you mind reviewing?
Some big changes here — I generated two draft scripts,
autoannotate-taxonomy.py
andinventory-compiler.py
, which automatically generate the expected-taxonomy.tsv files, database-identifiers.tsv, and inventory.tsv.Then I did just that — all bacterial communities have new greengens and silva 97% and 99% OTU expected-taxonomy.tsv files and database-identifiers.tsv files. Others coming next.
This is a complete fix for #22 and #46 and a partial fix for #24
Probably need some clean up and
inventory-compiler.py
needs some work to automate whenever specific files are added (if it's possible — see notes).