[MRG] add flexible & iterative support for outputting signatures in variety of formats #1493
Codecov Report
@@ Coverage Diff @@
## latest #1493 +/- ##
==========================================
+ Coverage 89.75% 94.93% +5.17%
==========================================
Files 123 97 -26
Lines 19579 16266 -3313
Branches 1498 1515 +17
==========================================
- Hits 17574 15442 -2132
+ Misses 1778 595 -1183
- Partials 227 229 +2
This also now fixes #1428 by specifying utf-8 as the default output encoding. Is this a good idea, @luizirber? All the tests pass with only one rather simple fix needed, so... 🤷 A remaining nit in this PR is a Rust panic triggered in the error triggered by test
...
ah-hah, found the old issue - #1167 - and also #1159 (comment) and #616. I thought it was familiar! I think I'm going to hack and slash in a fix at the Python layer, for now.
ready for review!
Auto-detect desired output format and make it happen: a beautiful UX that I'm looking forward to using!!
Mostly made some minor comment suggestions. My main concern is duplicated md5sums, since I seem to run into that fairly often, even with small(er) databases.
I don't have much to say on the implementation -- looks clean and clear to me, and I learned about super() and class inheritance :). Not sure if that means Luiz's eyes will be more helpful!
assert self.zf
super().add(ss)

md5 = ss.md5sum()
I'm worried about using md5sums for zipfiles, given that we have duplicate md5sums reasonably often in our databases.
Can we implement something to avoid overwriting duplicates?
k, will work on it as a PR into this 'un!
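One way the duplicate-md5sum concern above could be handled is sketched below: a minimal, self-contained example that counts how often each md5sum has been written and appends a numeric suffix to repeats. The class name DedupZipWriter, the signatures/ prefix, and the _N suffix scheme are all assumptions made for illustration, not the fix that was actually implemented.

```python
import io
import zipfile
from collections import Counter


class DedupZipWriter:
    """Hypothetical helper for illustration; not sourmash's actual class."""

    def __init__(self, zf):
        self.zf = zf
        self.seen = Counter()  # md5 -> number of entries already written

    def add(self, md5, payload):
        n = self.seen[md5]
        self.seen[md5] += 1
        # Assumed naming scheme: the first copy keeps the plain name,
        # later copies get a numeric suffix so names never collide.
        suffix = "" if n == 0 else f"_{n}"
        name = f"signatures/{md5}{suffix}.sig.gz"
        self.zf.writestr(name, payload)
        return name


# Usage: two payloads with the same md5 end up under distinct names.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    writer = DedupZipWriter(zf)
    names = [
        writer.add(md5, data)
        for md5, data in [("38729c63", b"first"), ("38729c63", b"second")]
    ]
```

Keeping the counter in memory costs one integer per distinct md5sum, so it should not undo the benefit of streaming signatures out as they are found.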
Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
LGTM. I think this could also support the scanpy-like usage described in #1353 (comment) and #939 (comment)
🎉
This PR adds support for a variety of output match formats using sourmash_args.SaveSignaturesToLocation(...) in:
- sourmash search and sourmash gather, when using --save-matches
- sourmash signature functions
- sourmash compute and sourmash sketch

This does not support indexed format output like .sbt.zip or .lca.json, because those only work with one ksize/moltype/etc.

This functionality was requested by @bluegenes in #1440.
usage

If --save-matches is given a filename with a trailing /, it saves matches as files in that directory. For example, giving it bar/ saves the matching signature as bar/38729c63.sig.gz. Similarly, this works for .zip and .sig and .sig.gz files, making the appropriate format choices as you'd expect.

This class also lets us avoid keeping matching signatures in memory where possible, which is extra nice for large databases / long-running queries such as those supported by prefetch in #1370; ref also better UX for massive collections #1350.
misc TODO:
- search --save-matches
- sourmash sig too?