[EXP] add support for rapid signature selection from Zipfile collections by md5 picklists #1589
NOTE: PR into #1588
This PR is an experimental PR that does some terrible things to ZipfileLinearIndex in order to support rapid picklist extraction. In exchange for those terrible things, it enables cool stuff like rapid extraction of signatures from large collections per #1365.

a demonstration
Below, we run a prefetch of a signature against 48,000 signatures in a zipfile collection, which yields 13 matches (for this query). We then use the picklist functionality in sourmash sig extract with the match_md5 column from the prefetch results to extract just the relevant signatures.

With this PR, sourmash sig extract takes about 2 seconds. With #1588 (which supports picklists but not any special zipfile interaction), it takes a few minutes.
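
For readers unfamiliar with picklists, the idea can be sketched in a few lines of plain Python. This is only an illustration of selecting records by an md5 column, not the code in this PR or in sourmash: the helper names `load_picklist` and `select_by_md5`, the sample CSV contents, and the md5 strings are all hypothetical, and the zipfile-specific speedup is not modeled here.

```python
import csv
import io


def load_picklist(csv_file, column="match_md5"):
    """Collect the values of one CSV column into a set (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    # Skip rows where the column is missing or empty.
    return {row[column] for row in reader if row.get(column)}


def select_by_md5(records, picklist):
    """Yield only the (md5, payload) records whose md5 is in the picklist."""
    for md5, payload in records:
        if md5 in picklist:
            yield md5, payload


# Stand-in for a prefetch results CSV; real output has more columns.
prefetch_csv = io.StringIO(
    "match_name,match_md5\n"
    "sigA,aaa111\n"
    "sigC,ccc333\n"
)
picklist = load_picklist(prefetch_csv)

# Stand-in for a signature collection, keyed by made-up md5 strings.
collection = [
    ("aaa111", "signature A"),
    ("bbb222", "signature B"),
    ("ccc333", "signature C"),
]
selected = list(select_by_md5(collection, picklist))
```

Because the picklist is a set, each membership check is constant time on average; the cost that remains is how many signatures must be loaded before their md5 can be checked, which is the part this PR targets for zipfile collections.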