Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gather is calculating remaining_bp incorrectly #3194

Closed
ctb opened this issue Jun 9, 2024 · 2 comments · Fixed by #3195
Closed

gather is calculating remaining_bp incorrectly #3194

ctb opened this issue Jun 9, 2024 · 2 comments · Fixed by #3195

Comments

@ctb
Copy link
Contributor

ctb commented Jun 9, 2024

It looks like it is not taking into account the unidentified minhashes that are subtracted from the query sketch before entering into GatherResult.

this may be the source of the difference noticed between Rust-based gather in sourmash-bio/sourmash_plugin_branchwater#331 and sourmash gather output.

more here soon!

@ctb
Copy link
Contributor Author

ctb commented Jun 9, 2024

Using the same brutal debugging tactics from #3193, the following script:

import sourmash

combined_sig = list(sourmash.load_file_as_signatures('combined.sig',
                                                     ksize=21))[0]
combined_mh = combined_sig.minhash
combined_set = set(combined_mh.hashes)

match1_sig = list(sourmash.load_file_as_signatures('match1_NC_000853.1.sig',
                                                   ksize=21))[0]
match1_mh = match1_sig.minhash
match1_set = set(match1_mh.hashes)

match2_sig = list(sourmash.load_file_as_signatures('match2_NC_011978.1.sig',
                                                   ksize=21))[0]

match2_mh = match2_sig.minhash
match2_set = set(match2_mh.hashes)

match5_sig = list(sourmash.load_file_as_signatures('match5_NC_009486.1.sig',
                                                   ksize=21))[0]
match5_mh = match5_sig.minhash
match5_set = set(match5_mh.hashes)

###

print('orig len:', len(combined_set) * combined_mh.scaled)

f_match_1 = len(match1_set.intersection(combined_set)) / len(match1_set)
combined_set -= match1_set

print('after first match removed:', len(combined_set) * combined_mh.scaled)

f_match_2 = len(match2_set.intersection(combined_set)) / len(match2_set)
combined_set -= match2_set
print('after second match removed:', len(combined_set) * combined_mh.scaled)

f_match_5 = len(match5_set.intersection(combined_set)) / len(match5_set)
combined_set -= match5_set

print('after third match removed:', len(combined_set) * combined_mh.scaled)

yields:

orig len: 14660000
after first match removed: 12740000
after second match removed: 11050000
after third match removed: 10130000

which is what calculate_gather_stats returns (tested in commit 6abb5cf in #3193).

However, sourmash gather returns numbers based on the identified portion of the sketch only:

image

@ctb ctb changed the title gather may be calculating remaining_bp incorrectly gather is calculating remaining_bp incorrectly Jun 9, 2024
@ctb ctb transferred this issue from sourmash-bio/sourmash_plugin_directsketch Jun 9, 2024
@ctb
Copy link
Contributor Author

ctb commented Jun 9, 2024

Fixing in #3195

ctb added a commit that referenced this issue Jun 9, 2024
…d by plugins (#3193)

This PR fixes a bug in `disk_revindex.rs::RevIndexOps::gather` where
RocksDB-based `gather` was
incorrectly subtracting hashes multiple times from the counter in
situations of high redundancy.

For example, consider this Venn diagram of the 3-way intersection
between three sketches:


![image](https://github.com/sourmash-bio/sourmash/assets/51016/f98bf6a0-acc4-415f-bf39-1c48a599996a)

When a metagenome contains the union of all three of these sketches, the
broken implementation would subtract the `10` at the center multiple
times. This was caused by removing hashes from the matches, rather than
the intersection, each pass through the counter.

Of note, this made RocksDB-based `fastmultigather` return incorrect
results, ref
sourmash-bio/sourmash_plugin_branchwater#322;
first discovered in
#3138 (comment).

The PR fixes this, and adds a more complete pair of tests, based on
`test_gather_metagenome_num_results` in the Python tests for sourmash.

This PR also adjusts the hash function string for DNA sketches in Rust
to be uppercase `DNA` rather than lowercase `dna`, ref
sourmash-bio/sourmash_plugin_directsketch#49

And remember, it's not *just* the destination - it's the friends you
make along the way, like `env_logger`.

* Fixes #3139
* Fixes
sourmash-bio/sourmash_plugin_directsketch#49

For consideration:

Right now we are calculating the intersection twice, once in
`disk_revindex.rs` and once in `calculate_gather_stats` in
`revindex/mod.rs`. This is unnecessary, of course. But the function
signature for `calculate_gather_stats` would need to change to take the
intersection as an argument. We could:
* keep calculating it twice, just for simplicity;
* change `calculate_gather_stats` to take an _optional_ intersection,
and calculate it if not provided;
* change `calculate_gather_stats` to require an intersection.

TODO:
- [x] add at least one explicit test for the moltype fix

Other notes:

* there is a discrepancy between the Python (`sourmash gather`) and Rust
(`sourmash_plugin_branchwater` results) calculations for `remaining_bp`.
It seems to me like the Python one is definitely wrong; not yet sure
about Rust. Viz #3194.
@ctb ctb closed this as completed in #3195 Jun 10, 2024
ctb added a commit that referenced this issue Jun 10, 2024
Adjusts `remaining_bp` from `sourmash gather` to take into account the
original size of the query sketch, not just the identified ones. Adds a
specific test, and updates documentation to provide a bit more detail,
too.

Fixes #3194

TODO:
- [x] actually fixme :)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant