how might we distribute "diff" or patch databases? #985
Comments
I think we should use the multi-DB capabilities in
Taxonomy-wise, include the
I think we should continue reporting the dataset ID (be it
These two go together, I suspect: we can remove files from full builds, and maybe provide the 'screen' in weekly builds (as an additional file in the DB?) to indicate what matches to skip? Note: Why would we want to remove signatures? We always provide the latest version of a genome, so do we need to remove the old one? Can genbank/refseq submissions be retracted?
Incoming brain dump! #456 is a fairly old PR, and I don't even know how to properly rebase it for today's codebase, but most of it migrated to other PRs:
One thing still left (and connected to this issue) is the
There are a bunch of optimizations that can be done to avoid consuming too much memory:
So: I think this connects with IPFS and this issue because, instead of providing full ZIP files, we could provide only the description and change instruction to run
wow, that went in a direction all right. Not sure how to respond to the IPFS stuff, have to re-read that or maybe brainstorm in person :). re
yes, some genomes are just broken and get removed or deprecated, and I don't think they should be available for search. Note, for the genomeRxiv work, we will face similar questions of how to provide regular database updates. Since we should have actual funding for that, maybe that's a place to dig in!
Feedback from personal comm:
#1477 could add support for "masking" arbitrary signatures from search and gather.
see also #433
a few quick thoughts -
this is a fascinating situation where we could actually use manifests. just thinking out loud:

- my first (bad) idea is that we could simply edit manifests, since (as noted in #1849) there are situations where they don't necessarily contain all signatures, anyway.
- a second (better?) idea is to add a 'deprecated' field that marks the signature as something to ignore.
- a third (maybe actually good?) idea is to add a 'deprecated by' column that points at another signature (maybe an md5?).
- a fourth (also maybe actually good) idea is to add a 'deprecates' column in database manifests that would support ignoring signatures in older databases.

not sure how to best indicate which signature to ignore - md5 + identifier, maybe?

the first three ideas all involve modifying old databases. boo. the fourth only involves modifying new databases.
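To make the fourth idea concrete, here is a rough sketch of how a 'deprecates' column could be used to skip signatures from older databases at load time. This is not sourmash's manifest format or API: the column layout, the `;` separator, the helper names (`load_manifest`, `collect_deprecated`, `active_rows`), and the toy md5 values are all assumptions made up for illustration, using only the Python standard library.

```python
import csv
import io


def load_manifest(text):
    """Parse CSV manifest text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def collect_deprecated(manifests):
    """Gather every md5 named in any manifest's (hypothetical) 'deprecates' column."""
    deprecated = set()
    for rows in manifests:
        for row in rows:
            # Older manifests have no 'deprecates' column at all, so default to "".
            field = row.get("deprecates") or ""
            # Assumption: multiple md5s are separated by ';'.
            deprecated.update(m for m in field.split(";") if m)
    return deprecated


def active_rows(manifests):
    """Return rows from all manifests, minus those deprecated by any manifest."""
    skip = collect_deprecated(manifests)
    return [row for rows in manifests for row in rows if row["md5"] not in skip]


# Toy data: the old database is left untouched; only the new one carries 'deprecates'.
OLD = "md5,name\naaa111,genome A v1\nbbb222,genome B v1\n"
NEW = "md5,name,deprecates\nccc333,genome A v2,aaa111\n"

manifests = [load_manifest(OLD), load_manifest(NEW)]
kept = [row["md5"] for row in active_rows(manifests)]
print(kept)  # ['bbb222', 'ccc333']
```

The point of the sketch is that the old manifest is never edited: the new database alone says which older entries to ignore, and the filtering happens when the databases are combined.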
keyword search bait: database updates, update databases, incremental database updates
re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?
random thoughts -
this could tie into some of the work that @luizirber is doing with IPFS, I suspect.