[MRG] standardize and simplify search, prefetch, gather results by using dataclasses #1955

Merged
merged 31 commits into from
Apr 20, 2022

Conversation

bluegenes
Contributor

@bluegenes bluegenes commented Apr 15, 2022

This PR replaces the SearchResult, PrefetchResult, and GatherResult namedtuples with dataclasses, while maintaining the current CSV output. It's now possible to calculate attributes for each result directly within the dataclass. I've tried to move calculations that were previously done outside the result objects into the classes, but there is certainly more optimization that can be done.

To facilitate these comparison calculations, this PR also introduces MinHashComparison dataclasses (BaseMinHashComparison, FracMinHashComparison, and NumMinHashComparison), which contain properties for all the comparisons we'd like to make when comparing two minhashes, including automatically downsampling to the lower-resolution sketch for comparison.
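
As a rough illustration of the "downsample to the lower-resolution sketch, then compare" idea, here is a minimal sketch in plain Python. `ToySketch`, `ToyComparison`, and `MAX_HASH` are hypothetical names invented for this example; they are not the sourmash classes or API, and the real comparison classes may work differently.

```python
from dataclasses import dataclass

MAX_HASH = 2**64  # assumed size of the hash space for this toy example


@dataclass(frozen=True)
class ToySketch:
    """Hypothetical stand-in for a scaled sketch: a set of hashes plus a scaled value."""
    hashes: frozenset
    scaled: int

    def downsample(self, scaled):
        # A larger scaled value means lower resolution; going the other way is not possible.
        if scaled < self.scaled:
            raise ValueError("cannot upsample a sketch")
        threshold = MAX_HASH // scaled
        return ToySketch(frozenset(h for h in self.hashes if h < threshold), scaled)


@dataclass
class ToyComparison:
    """Hypothetical comparison object (not the FracMinHashComparison from this PR)."""
    mh1: ToySketch
    mh2: ToySketch

    def __post_init__(self):
        # Bring both sketches to the lower resolution (the larger scaled value).
        self.cmp_scaled = max(self.mh1.scaled, self.mh2.scaled)
        self.mh1_cmp = self.mh1.downsample(self.cmp_scaled)
        self.mh2_cmp = self.mh2.downsample(self.cmp_scaled)

    @property
    def intersect(self):
        return self.mh1_cmp.hashes & self.mh2_cmp.hashes

    @property
    def jaccard(self):
        union = self.mh1_cmp.hashes | self.mh2_cmp.hashes
        return len(self.intersect) / len(union) if union else 0.0

    @property
    def mh1_containment(self):
        n = len(self.mh1_cmp.hashes)
        return len(self.intersect) / n if n else 0.0
```

With this layout, callers only hand over the two sketches; the downsampling and every derived value live in one place.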

It would be great to think about and/or standardize column output a bit more, ref #1555, #1737, but at least now the output is only specified in a single spot, so it should be easier to change. Columns can be easily added to the written output for each class by adding to the *write_cols list in the class.
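
To make the "columns specified in a single spot" point concrete, here is a minimal sketch, assuming a single class-level `write_cols` list. The class name, field names, and the `writedict` / `write_results` helpers are hypothetical and only illustrate the pattern; the actual classes in this PR use their own column lists.

```python
import csv
import io
from dataclasses import dataclass
from typing import ClassVar, List


@dataclass
class ToySearchResult:
    """Hypothetical result class; field and column names are illustrative only."""
    name: str
    md5: str
    similarity: float

    # The one place where the written columns (and their order) are declared.
    write_cols: ClassVar[List[str]] = ["similarity", "name", "md5"]

    @property
    def writedict(self):
        # Select only the declared columns, in the declared order.
        return {col: getattr(self, col) for col in self.write_cols}


def write_results(results, fp):
    """Write results as CSV using the columns declared on the result class."""
    if not results:
        return
    w = csv.DictWriter(fp, fieldnames=results[0].write_cols)
    w.writeheader()
    for r in results:
        w.writerow(r.writedict)


buf = io.StringIO()
write_results([ToySearchResult("genomeA", "abc123", 0.5)], buf)
print(buf.getvalue())
```

Adding a column then means adding an attribute or property to the class and appending its name to `write_cols`; the writer itself does not change.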

related:

motivation: I need to add several columns (and calculations) for ANI estimation, and namedtuples were getting repetitive and unruly.

benefit: Many of the params we're passing into our search results are calculated from the two sketches. We can automate this within the dataclass to avoid needing to write out the calculation each time.
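
The two options for automating such calculations (computing up front in `__post_init__` versus computing on demand in a property) can be sketched as follows. The class and attribute names are hypothetical, and `functools.cached_property` is one possible way to get compute-once-when-needed behavior, not necessarily what this PR uses.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass
class EagerResult:
    """Hypothetical: derived value computed at construction time."""
    query_hashes: frozenset
    match_hashes: frozenset

    def __post_init__(self):
        # Always computed, whether or not anything ever reads it.
        self.n_shared = len(self.query_hashes & self.match_hashes)


@dataclass
class LazyResult:
    """Hypothetical: derived value computed on first access, then cached."""
    query_hashes: frozenset
    match_hashes: frozenset

    @cached_property
    def n_shared(self):
        # Runs once, on first access; the value is then stored on the instance.
        return len(self.query_hashes & self.match_hashes)
```

Both classes expose the same `n_shared` attribute to callers; the lazy version only pays for values that are actually read.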

@bluegenes bluegenes changed the base branch from latest to add-cols April 15, 2022 20:57
@ctb
Contributor

ctb commented Apr 15, 2022

> benefit: Many of the params we're passing into our search results are calculated from the two sketches. We can automate this within the dataclass to avoid needing to write out the calculation each time. I'm currently doing this in __post_init__, but we could make these properties instead to have them only be calculated when needed.

YOU R GENIUS 🤩

@codecov

codecov bot commented Apr 15, 2022

Codecov Report

Merging #1955 (4e06116) into latest (efc700b) will increase coverage by 8.16%.
The diff coverage is 99.65%.

@@            Coverage Diff             @@
##           latest    #1955      +/-   ##
==========================================
+ Coverage   83.25%   91.42%   +8.16%     
==========================================
  Files         126       96      -30     
  Lines       13990     9932    -4058     
  Branches     1913     1946      +33     
==========================================
- Hits        11647     9080    -2567     
+ Misses       2071      579    -1492     
- Partials      272      273       +1     
Flag Coverage Δ
python 91.42% <99.65%> (+0.19%) ⬆️
rust ?

Impacted Files Coverage Δ
src/sourmash/search.py 97.69% <99.40%> (+0.68%) ⬆️
src/sourmash/commands.py 88.37% <100.00%> (-0.19%) ⬇️
src/sourmash/minhash.py 93.77% <100.00%> (+0.56%) ⬆️
src/sourmash/sketchcomparison.py 100.00% <100.00%> (ø)
src/core/src/index/sbt/mhbt.rs
src/core/src/encodings.rs
src/core/src/errors.rs
src/core/src/ffi/storage.rs
src/core/src/from.rs
src/core/tests/test.rs
... and 26 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update efc700b...4e06116. Read the comment docs.

@bluegenes
Contributor Author

> > benefit: Many of the params we're passing into our search results are calculated from the two sketches. We can automate this within the dataclass to avoid needing to write out the calculation each time. I'm currently doing this in __post_init__, but we could make these properties instead to have them only be calculated when needed.
>
> YOU R GENIUS 🤩

except I forgot that __post_init__ can't be inherited, sigh ... will repeat for now and then clean it all up by doing it in a redefined __init__ method :)

@ctb
Contributor

ctb commented Apr 15, 2022

> except I forgot that __post_init__ can't be inherited, sigh ... will repeat for now and then clean it all up by doing it in a redefined __init__ method :)

pardon my lack of knowledge, but: can't you have the code in a method that IS inherited, and just call that method in __post_init__?
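
The suggestion above can be sketched as follows, with hypothetical class and attribute names. For context, in standard Python dataclasses `__post_init__` is an ordinary method: a subclass that does not define its own inherits the parent's, while a subclass that does define one replaces it unless it calls the shared logic (or `super().__post_init__()`) explicitly.

```python
from dataclasses import dataclass


@dataclass
class BaseResult:
    """Hypothetical base result class."""
    query_size: int
    shared: int

    def init_result(self):
        # Shared setup lives in an ordinary method, which every subclass inherits.
        self.containment = self.shared / self.query_size if self.query_size else 0.0

    def __post_init__(self):
        self.init_result()


@dataclass
class PrefetchLikeResult(BaseResult):
    """Defines no __post_init__ of its own, so the base version runs."""


@dataclass
class GatherLikeResult(BaseResult):
    remaining: int = 0

    def __post_init__(self):
        # This overrides the base __post_init__, so call the shared setup explicitly.
        self.init_result()
        self.fraction_remaining = self.remaining / self.query_size if self.query_size else 0.0
```

Calling `super().__post_init__()` in the subclass would work equally well here; the named helper just makes the shared step explicit.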

@bluegenes bluegenes changed the base branch from add-cols to latest April 15, 2022 23:28
@bluegenes
Contributor Author

@ctb I think this is ready for preliminary review.

In particular, I want to make sure everything for gather is being handled properly and I could use some pointers on how to test these classes.

I think we could enable additional things, e.g. calculating jaccard/average containment/max containment/ANI directly in the search classes (this might be especially helpful for ANI). But gather results make this a bit trickier (we need to distinguish between results from the original vs. subtracted search mh), so I wanted to get your thoughts. For example, we could pass in the subtracted minhash to the gatherresults, and calculate everything in the class...

Base automatically changed from add-cols to latest April 16, 2022 01:38
src/sourmash/search.py: 3 review threads (outdated, resolved)
@ctb
Contributor

ctb commented Apr 16, 2022

> In particular, I want to make sure everything for gather is being handled properly and I could use some pointers on how to test these classes.

On a quick skim, this looks beautifully simple and straightforward, and much cleaner than what we were doing before! (And thank you - it must have been tedious as heck 😆 )

This may be famous last words, but I'm not sure you need much more in the way of tests. The search, gather, and prefetch code is thoroughly tested, and while there's always the opportunity for more testing, the consistency checks we have in the code already should be good enough - your added tests here are just nice gravy.

> I think we could enable additional things, e.g. calculating jaccard/average containment/max containment/ANI directly in the search classes (this might be especially helpful for ANI). But gather results make this a bit trickier (we need to distinguish between results from the original vs. subtracted search mh), so I wanted to get your thoughts. For example, we could pass in the subtracted minhash to the gatherresults, and calculate everything in the class...

This would definitely be nice and doesn't strike me as hard, just more tedium - I agree with your idea of passing in the subtracted minhash.

Other than addressing the remaining missed code coverage and the questions in there, this is a nice PR already and I'd love to see it merged.

@bluegenes
Contributor Author

well, this turned into a bit of a beast (😅), but it's ready for review @ctb @sourmash-bio/devs

@bluegenes bluegenes changed the title [WIP] standardize and simplify search, prefetch, gather results by using dataclasses [MRG] standardize and simplify search, prefetch, gather results by using dataclasses Apr 20, 2022
Contributor

@ctb ctb left a comment

A few minor cleanup issues, otherwise looks good!

src/sourmash/sketchcomparison.py: 5 review threads (4 outdated, all resolved)
Contributor

@ctb ctb left a comment

Very nice! 🎉


2 participants