[EXP] add support for rapid signature selection from Zipfile collections by md5 picklists #1589

Closed
wants to merge 2 commits into from

@ctb ctb commented Jun 13, 2021

NOTE: PR into #1588

This experimental PR does some terrible things to ZipfileLinearIndex in order to support rapid picklist extraction. In exchange, it enables cool stuff like rapid extraction of signatures from large collections, per #1365.
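This page does not show the PR's diff, so the following is only a sketch of one way such a speedup could be obtained, not the code in this PR. It assumes (hypothetically) that each signature is stored in the zipfile under a member name that begins with its md5, for example signatures/<md5>.sig. Under that assumption, a picklist of md5 prefixes can be checked against the member names alone, and only the matching members are read. The names pick_members and build_demo_zip, and the md5 values, are made up for illustration.

```python
import io
import zipfile
from pathlib import PurePosixPath


def pick_members(zf, prefixes, prefix_len=8):
    """Yield (name, data) for members whose file stem starts with a wanted prefix.

    Assumption: the file stem of each member is the signature's md5.
    Only member names are scanned; non-matching members are never read.
    """
    wanted = {p[:prefix_len].lower() for p in prefixes}
    for name in zf.namelist():
        stem = PurePosixPath(name).name.split(".", 1)[0]
        if stem[:prefix_len].lower() in wanted:
            yield name, zf.read(name)


def build_demo_zip():
    """Build a small in-memory zipfile with md5-named members (illustrative data)."""
    md5s = (
        "09a08691ce52952152f0e866a59f6261",
        "38729c6374925585db28916b82a6f513",
        "4e94e60265e04f0763142e20b52c0da1",
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for md5 in md5s:
            zf.writestr(f"signatures/{md5}.sig", f'{{"md5": "{md5}"}}')
    buf.seek(0)
    return buf


with zipfile.ZipFile(build_demo_zip()) as zf:
    for name, data in pick_members(zf, ["38729c63"]):
        print(name, len(data))
```

The cost of this approach is one pass over the zipfile's member names plus one read per match, which is why it would scale with the size of the picklist rather than the size of the collection.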

a demonstration

Below, we run a prefetch of a signature against 48,000 signatures in a zipfile collection, which yields 13 matches (for this query). We then use the picklist functionality in sourmash sig extract with the match_md5 column from the prefetch results to extract just the relevant signatures.

With this PR, the sourmash sig extract takes about 2 seconds. With #1588 (which supports picklists but not any special zipfile interaction), it takes a few minutes.

% sourmash prefetch podar-ref/63.fa.sig gtdb-r202.genomic-reps.k31.zip -o 63.prefetch.csv
...
(takes a few minutes, yields 63.prefetch.csv with 13 results)
...
% sourmash sig extract --picklist 63.prefetch.csv:match_md5:md5prefix8 \
         gtdb-r202.genomic-reps.k31.zip -o /tmp/abc.zip
picking column 'match_md5' of type 'md5prefix8' from '63.prefetch.csv'
loaded 13 distinct values into picklist.
loaded 13 sigs from 'gtdb-r202.genomic-reps.k31.zip'
loaded 13 total that matched ksize & molecule type
extracted 13 signatures from 1 file(s)
for given picklist, found 13 matches of 13 total
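
To make the picklist step above concrete, here is a minimal sketch of the idea behind --picklist 63.prefetch.csv:match_md5:md5prefix8: read one CSV column, reduce each value to its first 8 characters, and keep only the signatures whose md5 starts with one of those prefixes. This is not sourmash's implementation; load_picklist and select_by_picklist are hypothetical helpers, and the CSV rows and md5 values are invented for the example.

```python
import csv
import io


def load_picklist(csv_file, column, prefix_len=8):
    """Collect distinct md5 prefixes from one column of a CSV file.

    Values are truncated to prefix_len characters, so the column may
    hold either full md5 strings or already-shortened prefixes.
    """
    reader = csv.DictReader(csv_file)
    return {row[column][:prefix_len].lower() for row in reader if row.get(column)}


def select_by_picklist(md5s, picklist, prefix_len=8):
    """Return the md5 strings whose prefix is in the picklist, in input order."""
    return [m for m in md5s if m[:prefix_len].lower() in picklist]


# Illustrative stand-in for a prefetch CSV; only the match_md5 column is used.
demo_csv = io.StringIO(
    "match_name,match_md5\n"
    "genome A,38729c6374925585db28916b82a6f513\n"
    "genome B,4e94e602\n"
)
picklist = load_picklist(demo_csv, "match_md5")

collection = [
    "09a08691ce52952152f0e866a59f6261",
    "38729c6374925585db28916b82a6f513",
    "4e94e60265e04f0763142e20b52c0da1",
]
matches = select_by_picklist(collection, picklist)
print(f"found {len(matches)} matches of {len(picklist)} total")
```

Storing the prefixes in a set keeps each membership check constant-time, so the selection cost is dominated by how the collection itself is scanned.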

codecov bot commented Jun 13, 2021

Codecov Report

Merging #1589 (e205e64) into add/picklist_selectors (a88b66d) will decrease coverage by 0.28%.
The diff coverage is 37.14%.

@@                    Coverage Diff                     @@
##           add/picklist_selectors    #1589      +/-   ##
==========================================================
- Coverage                   89.05%   88.77%   -0.29%     
==========================================================
  Files                          76       76              
  Lines                        6736     6763      +27     
  Branches                     1209     1218       +9     
==========================================================
+ Hits                         5999     6004       +5     
- Misses                        520      540      +20     
- Partials                      217      219       +2     
Flag Coverage Δ
python 88.77% <37.14%> (-0.29%) ⬇️

Impacted Files Coverage Δ
src/sourmash/index.py 94.11% <37.14%> (-4.92%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a88b66d...e205e64.

ctb commented Jun 17, 2021

Closing in favor of #1590, which is more general.

@ctb ctb closed this Jun 17, 2021
@ctb ctb deleted the add/picklist_zf branch June 17, 2021 03:53