-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gather is calculating remaining_bp
incorrectly
#3194
Comments
1 task
Using the same brutal debugging tactics from #3193, the following script: import sourmash
combined_sig = list(sourmash.load_file_as_signatures('combined.sig',
ksize=21))[0]
combined_mh = combined_sig.minhash
combined_set = set(combined_mh.hashes)
match1_sig = list(sourmash.load_file_as_signatures('match1_NC_000853.1.sig',
ksize=21))[0]
match1_mh = match1_sig.minhash
match1_set = set(match1_mh.hashes)
match2_sig = list(sourmash.load_file_as_signatures('match2_NC_011978.1.sig',
ksize=21))[0]
match2_mh = match2_sig.minhash
match2_set = set(match2_mh.hashes)
match5_sig = list(sourmash.load_file_as_signatures('match5_NC_009486.1.sig',
ksize=21))[0]
match5_mh = match5_sig.minhash
match5_set = set(match5_mh.hashes)
###
print('orig len:', len(combined_set) * combined_mh.scaled)
f_match_1 = len(match1_set.intersection(combined_set)) / len(match1_set)
combined_set -= match1_set
print('after first match removed:', len(combined_set) * combined_mh.scaled)
f_match_2 = len(match2_set.intersection(combined_set)) / len(match2_set)
combined_set -= match2_set
print('after second match removed:', len(combined_set) * combined_mh.scaled)
f_match_5 = len(match5_set.intersection(combined_set)) / len(match5_set)
combined_set -= match5_set
print('after third match removed:', len(combined_set) * combined_mh.scaled) yields:
which is what However, |
ctb
changed the title
gather may be calculating
gather is calculating Jun 9, 2024
remaining_bp
incorrectlyremaining_bp
incorrectly
This was referenced Jun 9, 2024
Fixing in #3195 |
ctb
added a commit
that referenced
this issue
Jun 9, 2024
…d by plugins (#3193) This PR fixes a bug in `disk_revindex.rs::RevIndexOps::gather` where RocksDB-based `gather` was incorrectly subtracting hashes multiple times from the counter in situations of high redundancy. For example, consider this Venn diagram of the 3-way intersection between three sketches: ![image](https://github.com/sourmash-bio/sourmash/assets/51016/f98bf6a0-acc4-415f-bf39-1c48a599996a) When a metagenome contains the union of all three of these sketches, the broken implementation would subtract the `10` at the center multiple times. This was caused by removing hashes from the matches, rather than the intersection, each pass through the counter. Of note, this made RocksDB-based `fastmultigather` return incorrect results, ref sourmash-bio/sourmash_plugin_branchwater#322; first discovered in #3138 (comment). The PR fixes this, and adds a more complete pair of tests, based on `test_gather_metagenome_num_results` in the Python tests for sourmash. This PR also adjusts the hash function string for DNA sketches in Rust to be uppercase `DNA` rather than lowercase `dna`, ref sourmash-bio/sourmash_plugin_directsketch#49 And remember, it's not *just* the destination - it's the friends you make along the way, like `env_logger`. * Fixes #3139 * Fixes sourmash-bio/sourmash_plugin_directsketch#49 For consideration: Right now we are calculating the intersection twice, once in `disk_revindex.rs` and once in `calculate_gather_stats` in `revindex/mod.rs`. This is unnecessary, of course. But the function signature for `calculate_gather_stats` would need to change to take the intersection as an argument. We could: * keep calculating it twice, just for simplicity; * change `calculate_gather_stats` to take an _optional_ intersection, and calculate it if not provided; * change `calculate_gather_stats` to require an intersection. TODO: - [x] add at least one explicit test for the moltype fix Other notes: * there is a discrepancy between the Python (`sourmash gather`) and Rust (`sourmash_plugin_branchwater` results) calculations for `remaining_bp`. It seems to me like the Python one is definitely wrong; not yet sure about Rust. Viz #3194.
ctb
added a commit
that referenced
this issue
Jun 10, 2024
Adjusts `remaining_bp` from `sourmash gather` to take into account the original size of the query sketch, not just the identified ones. Adds a specific test, and updates documentation to provide a bit more detail, too. Fixes #3194 TODO: - [x] actually fixme :) --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
It looks like it is not taking into account the unidentified minhashes that are subtracted from the query sketch before entering into
GatherResult
.this may be the source of the difference noticed between Rust-based gather in sourmash-bio/sourmash_plugin_branchwater#331 and sourmash gather output.
more here soon!
The text was updated successfully, but these errors were encountered: