Better conceptual definition on minhash/signature/query #616
Comments
agreed with all things :) |
@luizirber and I recently had a conversation via Slack about naming conventions for signatures. His explanation was helpful, and I think it would be worth adding to the documentation. At the least, now it won't be lost to the ages in disappearing Slack threads: Taylor: "what’s the correct terminology for a signature that contains multiple minhashes in sourmash? like if I create a signature for each contig in a fasta, and put them all in the same signature…what is the baby sig called?" Luiz: so, how I like to think about it (with examples):
it's a bit silly to have so many hierarchies when we only have one implementation for |
We also have a big conceptual dichotomy, where the code computing signatures behaves as if each |
I'm well aware... But the idea of a Signature as a collection of Sketches came later anyway (during the refactoring), so don't blame yourself too much =] On the bright side, I tried to keep the Rust codebase doing operations over Signatures instead of over Sketches. For now the first thing these operations do is extract the matching MinHash in each sig and do the regular MinHash operations, but I think it is quite doable to "hide" most of the operations (especially Containment and Similarity) inside the Signature impl in the same way. This can be lifted to Python too, with the awesome benefit of avoiding a new copy of the MinHash in |
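To make the "hide the operations inside the Signature impl" idea concrete, here is a minimal Python sketch. `ToyMinHash` and `ToySignature` are hypothetical stand-ins, not the sourmash classes, and the sketches are treated as plain hash sets for simplicity. The point is only that callers ask the signature for similarity or containment, and the signature picks the matching sketch internally.

```python
class ToyMinHash:
    """Hypothetical stand-in for a MinHash sketch: a k-mer size plus a set of hashes."""

    def __init__(self, ksize, hashes):
        self.ksize = ksize
        self.hashes = frozenset(hashes)

    def similarity(self, other):
        """Jaccard similarity of the two hash sets."""
        union = self.hashes | other.hashes
        if not union:
            return 0.0
        return len(self.hashes & other.hashes) / len(union)

    def containment(self, other):
        """Fraction of this sketch's hashes that are also in `other`."""
        if not self.hashes:
            return 0.0
        return len(self.hashes & other.hashes) / len(self.hashes)


class ToySignature:
    """Hypothetical collection of sketches, indexed by k-mer size."""

    def __init__(self, name, sketches):
        self.name = name
        self._by_ksize = {mh.ksize: mh for mh in sketches}

    def _matching(self, other, ksize):
        # Extract the compatible sketch from each signature, or fail loudly.
        try:
            return self._by_ksize[ksize], other._by_ksize[ksize]
        except KeyError:
            raise ValueError(f"no sketch with ksize={ksize} in both signatures") from None

    def similarity(self, other, ksize):
        mine, theirs = self._matching(other, ksize)
        return mine.similarity(theirs)

    def containment(self, other, ksize):
        mine, theirs = self._matching(other, ksize)
        return mine.containment(theirs)


sig_a = ToySignature("a", [ToyMinHash(21, {1, 2, 3, 4}), ToyMinHash(31, {10, 11})])
sig_b = ToySignature("b", [ToyMinHash(21, {3, 4, 5, 6})])
print(sig_a.similarity(sig_b, ksize=21), sig_a.containment(sig_b, ksize=21))
```

With this shape, the caller never touches (or copies) the underlying sketch; asking for a k-mer size that one side lacks raises an error instead of silently comparing incompatible sketches.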
I like how you avoided the word "just" in that comment in favor of "quite doable." |
as I dug in to |
I think this should be punted to 5.0, or later. |
:) Over in #1392, I'm starting to make the clear distinction that search and gather return the original This also makes #198 even more interesting, because then you could build an SBT that lets you find matches based on the kind of sketch that was indexed, but returns a richer set of sketches (as part of the returned signature object). |
in re #2039, @luizirber reminded me:
and now I'm curious, what operations does the HLL (as implemented) currently support? definitely cardinality counting, and merge. does it support containment as a byproduct of merge? |
since we are using the estimators from https://arxiv.org/abs/1706.07290, we can do similarity and containment. Caveat: the estimators don't work well with wildly different cardinalities, so genome containment in a metagenome (or vice-versa) doesn't work well =( |
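For intuition on the cardinality caveat, here is a small arithmetic sketch. It uses plain inclusion-exclusion, which is a simpler technique than the estimators in the linked paper, and the sizes and the 1% error are assumed figures chosen for illustration. It only shows why a small relative error on the large set can swamp the small one.

```python
def containment_from_cardinalities(card_a, card_b, card_union):
    """Containment of A in B via inclusion-exclusion: |A ∩ B| / |A|."""
    intersection = card_a + card_b - card_union
    return intersection / card_a


# Assumed sizes: a 5M k-mer genome fully contained in a 1B k-mer metagenome.
genome = 5_000_000
metagenome = 1_000_000_000
union = metagenome  # genome is a subset, so the union is just the metagenome

exact = containment_from_cardinalities(genome, metagenome, union)

# Now let the metagenome cardinality estimate be off by 1% in either direction.
overestimate = containment_from_cardinalities(genome, metagenome * 101 // 100, union)
underestimate = containment_from_cardinalities(genome, metagenome * 99 // 100, union)

print(exact, overestimate, underestimate)
```

A 1% error on the metagenome is 10M k-mers, twice the size of the whole genome, so the "containment" lands outside the valid range of 0 to 1 in both directions.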
ok, thx - this makes me less enthusiastic about putting a lot of UX time into adding it, but I think it is an excellent "nth" method to add to expand internal support for more sketches, because a lot of sourmash power comes from the containment stuff. |
ref #1616 |
…ture` in a JSON record (#3333)

This PR was originally about debugging sourmash-bio/sourmash_plugin_branchwater#445, but that's going to require more work to fix properly. For now, I would like to nominate it for merge because sourmash fails silently in this situation, and that's Bad.

In brief, the main thing this PR does is panic with an `unimplemented!` when `FSStorage::load_sig` encounters more than one `Signature` in a JSON record.

This PR also adds a bit of documentation to `InnerStorage`, per the bottom of [this comment](sourmash-bio/sourmash_plugin_branchwater#445 (comment)).

---

The problem at hand: when loading a `SigStore`/`Signature` from a `Storage`, sourmash only loads the first one and ignores any others.

https://github.com/sourmash-bio/sourmash/blob/26b50f3e3566006fd6356a4f8b4d47c5e381aeec/src/core/src/storage/mod.rs#L34-L38

This results from the concept of a `Signature` as containing one or more sketches; the history of this is described [here](#616 (comment)), and it leads to some interesting silliness [in the Python layer](https://github.com/sourmash-bio/sourmash/blob/d63c464e825529fa54bb7e8b81faa53b858b09de/src/sourmash/save_load.py#L297).

The contrapositive is that, in Rust, a single `Signature` can include multiple sketches, e.g. with different ksizes. So this works fine for the wort case where we have a single `.sig` file with k=21, k=31, k=51.

Note that the Python layer (and hence the entire sourmash CLI) fully supports multiple `Signature`s in JSON: this is well tested and well covered behavior. The branchwater plugin runs into it because it is using the Rust layer and the API is not fully fleshed out there.

---
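The "fail loudly instead of silently taking the first one" behavior can be sketched in Python as a rough analogue of the Rust `unimplemented!` panic. The function name `load_single_signature` and the record layout (a JSON list of objects with a `name` key) are hypothetical simplifications, not the actual sourmash file format or API.

```python
import json


def load_single_signature(record_text):
    """Return the only signature in a JSON record, refusing ambiguous input.

    Assumes a simplified, hypothetical layout: a JSON list of signature objects.
    """
    records = json.loads(record_text)
    if not isinstance(records, list):
        records = [records]
    if not records:
        raise ValueError("no signature found in JSON record")
    if len(records) > 1:
        # Silently returning records[0] would drop data without telling anyone.
        raise NotImplementedError(
            f"found {len(records)} signatures in one JSON record; only one is supported"
        )
    return records[0]


single = json.dumps([{"name": "genome-a", "sketches": [{"ksize": 21}, {"ksize": 31}]}])
print(load_single_signature(single)["name"])
```

Note that the single record above still carries two sketches with different ksizes; only multiple top-level signatures are rejected.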
I think we should separate `query`, `signature` and `minhash` better around the codebase. It's all pretty entangled, but they are different things!

We can also lift up some functionality from MinHash into Signature: `add_sequence` can be a `Signature` method, and will call `add_kmer` (or equivalent) in all the MinHash defined for that signature. At the moment we do all this parsing during compute (for example), where for each sequence we need to iterate with the appropriate k size and so on.

Another note: `Signature` is a collection of MinHash at the moment, but it would be pretty interesting to allow it to keep HLL/BF/CMS/HistoSketch representations of the data too.

Some examples

feature/bf_query: I'm attaching a Nodegraph with the Sig/MH content to the query to make search faster (but it is weird, because Signature shouldn't have a Nodegraph attached to it!)

(things to consider on #556)
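A rough sketch of the `add_sequence` proposal, to show the shape of the idea rather than the sourmash implementation. `ToyMinHash`, `ToySignature` and `hash_kmer` are hypothetical names, the hash function is a stand-in, and canonical k-mers (forward vs. reverse complement) are ignored. The signature owns the k-mer iteration and feeds every sketch at its own k size.

```python
import hashlib


def hash_kmer(kmer):
    """Hash a k-mer to a 64-bit integer (stand-in hash, chosen for illustration)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyMinHash:
    """Hypothetical bottom-k MinHash sketch for one k-mer size."""

    def __init__(self, ksize, num):
        self.ksize = ksize
        self.num = num
        self.hashes = set()

    def add_kmer(self, kmer):
        if len(kmer) != self.ksize:
            raise ValueError(f"expected a {self.ksize}-mer, got {kmer!r}")
        self.hashes.add(hash_kmer(kmer))
        # Keep only the `num` smallest hashes.
        if len(self.hashes) > self.num:
            self.hashes.remove(max(self.hashes))


class ToySignature:
    """Hypothetical signature: a named collection of sketches."""

    def __init__(self, name, sketches):
        self.name = name
        self.sketches = list(sketches)

    def add_sequence(self, sequence):
        """Feed every sketch with the k-mers of `sequence` at that sketch's k size."""
        sequence = sequence.upper()
        for sketch in self.sketches:
            k = sketch.ksize
            for start in range(len(sequence) - k + 1):
                sketch.add_kmer(sequence[start:start + k])


sig = ToySignature("contig1", [ToyMinHash(ksize=3, num=10), ToyMinHash(ksize=5, num=10)])
sig.add_sequence("ACGTACGTAC")
print([len(sketch.hashes) for sketch in sig.sketches])
```

Because `add_sequence` only relies on each sketch exposing `ksize` and `add_kmer`, other sketch types (HLL, Bloom filter, and so on) could sit in the same collection as long as they offer the same two members.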