MRG: write full gather result from `fastgather` and non-rocksdb `fastmultigather` #298

bluegenes · 2024-04-02T19:50:01Z

This PR adds utilities for building a full gather results file for fastgather and non-rocksdb fastmultigather, and makes full output the default.

- Fixes #287 (fastmultigather should report full gather results for non rocksdb databases!)
- Fixes #187 (optionally enable full gather csv output from fastmultigather)
- Fixes #254 (fastgather CSV output for intersect_bp is in hashes, not in hashes * scaled as it is in sourmash gather output)
- Includes a local fix for #318 (fastmultigather returns different results from sourmash gather in Python, in PR #298), which means that the fastgather and non-rocksdb fastmultigather full output here matches sourmash gather. Issues with rocksdb gather are being handled elsewhere.
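For context on #254: the issue title says sourmash gather reports `intersect_bp` as the number of overlapping hashes multiplied by `scaled`, while fastgather reported the raw hash count. A minimal sketch of that conversion follows; the function name and the example numbers are hypothetical and not taken from the plugin code.

```python
def hashes_to_bp(n_hashes: int, scaled: int) -> int:
    """Estimate base pairs represented by a number of scaled-sketch hashes."""
    if scaled <= 0:
        raise ValueError("scaled must be a positive integer")
    return n_hashes * scaled


# Hypothetical example: 250 overlapping hashes at scaled=1000.
intersect_hashes = 250
print(hashes_to_bp(intersect_hashes, scaled=1000))
```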

Benchmarking

| software/version | command | details | time | max RAM |
| --- | --- | --- | --- | --- |
| branchwater v0.9.3 | `fastgather` | minimal result | 1m 47s | 14 GB |
| branchwater v0.9.3-dev | `fastgather` | full result | 1m 57s | 14 GB |
| branchwater v0.9.3 | `fastmultigather` | minimal result | 8m 3s | 25 GB |
| branchwater v0.9.3-dev | `fastmultigather` | full result | 8m 9s | 25 GB |
| branchwater v0.9.3 | `fastmultigather` | rocksdb full result | 24s | 600 MB |

progress/separate PRs:

- Fill out match_filename in full results (#303, "use Record.filename to get match filename for full gather outputs"; requires a new sourmash core release with sourmash#3121, "MRG: allow get/set record.filename").
- Switch to using KmerMinHashBTree for hash subtraction, plus benchmark. Per Luiz, KmerMinHashBTree is better for any situation where we'll be subtracting/adding hashes to a sketch (#310, "WIP: use KmerMinHashBTree for hash subtraction").
- sourmash: make getting Record.filename public in order to keep match_filename and write it to full results (sourmash#3121, "MRG: allow get/set record.filename").
- Remove --full-results and make full results the default (#327, "WIP: Remove minimal results option").

bluegenes · 2024-04-19T23:14:54Z

benchmarking with this branch:

session:

srun -p bmh -J bench -t 24:00:00  --nodes=1 -c 64 --mem=50G --pty bash

location: /home/ntpierce/SRR606249-feb13-2024
query: SRR606249.trim.k31.sig.zip
db: /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k31.zip

hackmd version: https://hackmd.io/IgakZLceRyeTCXuCgH0iDA?edit

| software/version | command | details | time | max RAM |
| --- | --- | --- | --- | --- |
| branchwater v0.9.3 | `fastgather` | minimal result | 1m 47s | 14 GB |
| branchwater v0.9.3 | `fastgather` | full result | 1m 57s | 14 GB |
| branchwater v0.9.3 | `fastmultigather` | minimal result | 8m 3s | 25 GB |
| branchwater v0.9.3 | `fastmultigather` | full result | 8m 9s | 25 GB |
| branchwater v0.9.3 | `fastmultigather` | rocksdb full result | 24s | 600 MB |

I ran non-rocksdb fastgather and fastmultigather a couple of times each. Sometimes the full result was faster than the minimal result, depending on multithreading efficiency (percent of CPU). I kept the faster times here.

Just looking at the number of lines in the gather result:

     85 SRR606249.gather.csv
     85 SRR606249.x.gtdb-rs214.gather.csv
     85 srr.fg.csv
     85 srr.fg.full.csv
     79 srr.fmg-rdb.csv

... so rocksdb fastmultigather produces 79 result lines, while the rest had 85 🤔
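The line counts above include whatever header row each file has. A small sketch for counting only the data rows, assuming each CSV starts with exactly one header line (the function name is hypothetical):

```python
import csv
from pathlib import Path


def count_result_rows(csv_path):
    """Count data rows in a gather CSV, assuming the first row is a header."""
    with open(csv_path, newline="") as fp:
        reader = csv.reader(fp)
        next(reader, None)  # skip the header row, if any
        return sum(1 for _ in reader)


if __name__ == "__main__":
    # File names taken from the listing above; missing files are skipped.
    for name in ["SRR606249.x.gtdb-rs214.gather.csv", "srr.fg.full.csv", "srr.fmg-rdb.csv"]:
        if Path(name).exists():
            print(name, count_result_rows(name))
```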

bluegenes · 2024-04-20T00:05:20Z

@ctb ready for review.

Note: When I did column-by-column comparisons with python-based sourmash gather, the remaining_bp entries do not match. Can you see anything I'm doing wrong? I think the calculation here may actually be correct?
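A column-by-column comparison like the one described can be sketched as follows. This is only an illustration: it assumes both CSVs can be keyed on a shared identifier column (for example an md5 column, whose name may differ between the two files), and the helper names are hypothetical.

```python
import csv
import math


def load_rows(path, key_col):
    """Load a gather CSV into a dict keyed on one column (e.g. an md5 column)."""
    with open(path, newline="") as fp:
        return {row[key_col]: row for row in csv.DictReader(fp)}


def compare_column(rows_a, rows_b, column, rel_tol=1e-6):
    """Return (key, value_a, value_b) for shared keys whose `column` values differ."""
    mismatches = []
    for key in sorted(rows_a.keys() & rows_b.keys()):
        a, b = rows_a[key][column], rows_b[key][column]
        try:
            # Compare numerically when both values parse as floats.
            same = math.isclose(float(a), float(b), rel_tol=rel_tol)
        except ValueError:
            # Fall back to exact string comparison for non-numeric columns.
            same = a == b
        if not same:
            mismatches.append((key, a, b))
    return mismatches
```

Calling `compare_column(rows_a, rows_b, "remaining_bp")` on rows loaded from the two files would then list the matches where that column disagrees.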

ctb · 2024-05-01T23:41:01Z

A few notes and questions -

if I infer correctly, the results of sourmash gather are in SRR606249.x.gtdb-rs214.gather.csv, yes?
this PR sets the output columns to be {'match_name', 'match_filename', 'match_md5'} rather than {'name', 'filename', 'md5'}, right?

ctb · 2024-05-01T23:51:10Z

when I look at the differences in the matching names between srr.fg.full.csv and SRR606249.x.gtdb-rs214.gather.csv, I see:

To me this looks like it's simply the difference between different implementations of the underlying gather algorithm: there are different matches, but they're all the same species, so I bet there is a tie at some point and sourmash gather decides differently from fastgather.

See https://github.com/ctb/2024-debug-gather-difference for notebook for exploring the differences.
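The name comparison described here amounts to two set differences. A minimal sketch, with hypothetical placeholder names standing in for the real match names (this is not the code from the linked notebook):

```python
def name_differences(names_a, names_b):
    """Return (only_in_a, only_in_b) as sorted lists of match names."""
    a, b = set(names_a), set(names_b)
    return sorted(a - b), sorted(b - a)


# Hypothetical match names; real ones would come from the two CSVs.
fastgather_names = ["genome_1", "genome_2", "genome_3a"]
sourmash_names = ["genome_1", "genome_2", "genome_3b"]
only_fg, only_sm = name_differences(fastgather_names, sourmash_names)
print("only in fastgather:", only_fg)
print("only in sourmash gather:", only_sm)
```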

ctb · 2024-05-01T23:59:38Z

I think the column name update is good, per sourmash-bio/sourmash#1555, since we want to use the prefetch-like column names. Just, y'know, we should put them in the PR description and also document it.

I'll look at the set of column names here and see if there's something else we should change here in prep for sourmash v5.
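When comparing against older sourmash gather output, the renamed columns need to be mapped first. A small sketch, assuming the three renames mentioned in this thread are the only ones that matter for the comparison (the mapping and function names are hypothetical):

```python
# Old sourmash gather column name -> prefetch-like name used in this PR.
PREFETCH_STYLE_NAMES = {
    "name": "match_name",
    "filename": "match_filename",
    "md5": "match_md5",
}


def rename_columns(row: dict) -> dict:
    """Return a copy of a CSV row with old column names mapped to the new ones."""
    return {PREFETCH_STYLE_NAMES.get(key, key): value for key, value in row.items()}
```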

ctb · 2024-05-02T14:46:51Z

> Just looking at the number of lines in the gather result:
>
>      85 SRR606249.gather.csv
>      85 SRR606249.x.gtdb-rs214.gather.csv
>      85 srr.fg.csv
>      85 srr.fg.full.csv
>      79 srr.fmg-rdb.csv
>
> ... so rocksdb fastmultigather produces 79 result lines, while the rest had 85 🤔

Actually, I'm finding (on a stripped-down subset of 90 sketches that contains the union of all matches across the various CSVs) that my newly generated srr.fmg.csv only has 79 matches. You missed including it in the table above, too, but your CSV srr.fmg.csv also only has 79 matches.

That suggests that maybe fastmultigather is the problem, not rocksdb.

ctb · 2024-05-02T14:47:46Z

I see the same 79-line output from fastmultigather against my combined-matches-k31.sig.zip whether I use rocksdb or the zipfile, so I have a small, reproducible example of the differences. 🎉 . Will commence the digging!

ctb · 2024-05-02T19:46:39Z

> That suggests that maybe fastmultigather is the problem, not rocksdb.

ok, I'm wrong, because of the way fastmultigather -o works. Digging in more.

ctb · 2024-05-04T11:27:54Z

Debugged one mismatch here: #318 - this was causing f_unique_to_query to be incorrectly calculated in the Rust code.

Fixed by #319, which can be merged into this PR.
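To illustrate the denominator that #319 settles on (the original query size), here is a toy gather loop over plain Python sets. It is a simplified stand-in and not the Rust implementation: the function name, the data, and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
def toy_gather(query_hashes, database):
    """Greedy gather over plain sets; a toy stand-in for the real algorithm."""
    orig_query = set(query_hashes)
    remaining = set(orig_query)
    candidates = {name: set(hashes) for name, hashes in database.items()}
    results = []
    while remaining and candidates:
        # Pick the candidate with the largest overlap with the remaining query.
        # Ties go to the alphabetically first name (an arbitrary choice here).
        best = max(sorted(candidates), key=lambda n: len(candidates[n] & remaining))
        overlap = candidates[best] & remaining
        if not overlap:
            break
        results.append({
            "match_name": best,
            # Denominator is the ORIGINAL query size, not the shrinking remainder.
            "f_unique_to_query": len(overlap) / len(orig_query),
        })
        remaining -= overlap
        del candidates[best]
    return results


query = set(range(1, 11))
db = {"A": {1, 2, 3, 4, 5, 6}, "B": {5, 6, 7, 8}, "C": {9, 100}}
for row in toy_gather(query, db):
    print(row)
```

Because every round divides by the same original query size, the `f_unique_to_query` values across rounds sum to at most 1. The explicit tie-breaking rule also shows why two implementations can legitimately pick different matches when overlaps are equal.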

There's another set of mismatches occurring from the same error in the RocksDB code, which uses sourmash-rs core code to calculate f_unique_to_query - see sourmash-bio/sourmash#3137, and a fix in sourmash-bio/sourmash#3138.

And there's an additional discrepancy in the RocksDB-based fastmultigather, as noted above. I've zeroed in on one specific problem that shows up when looking at gather results (from Python, or fastgather, for fastmultigather against a sig list ;). In brief, the 12th match in RocksDB-fastmultigather is returning too small an overlap. I can spend more time debugging this in the future, but it suggests to me that something is rotten in RevIndex::gather.

ctb · 2024-05-04T12:07:55Z

Random additional thought - is there any reason not to make --full-results the default, given that there seems to be no significant extra processing cost?

That change would also fix #254

**Note:** PR into #298 Fixes #318 Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

bluegenes · 2024-05-09T17:52:24Z

> Random additional thought - is there any reason not to make --full-results the default, given that there seems to be no significant extra processing cost?
>
> That change would also fix #254

Yes, we can do that. I set it up this way b/c you were originally concerned about slowdown :)

ctb · 2024-05-09T19:16:55Z

> Yes, we can do that. I set it up this way b/c you were originally concerned about slowdown :)

and I appreciate it, but I think you've done a great job of showing that that's not a concern!

NOTE: PR into #298 Removes `--full-results` and updates tests for switch to full results by default.

ctb · 2024-05-10T18:00:47Z

some things to do before merge -

- put benchmarks in top level description
- if it does fix #254 (fastgather CSV output for intersect_bp is in hashes, not in hashes * scaled as it is in sourmash gather output), indicate as such in top level description
- reference #318 (fastmultigather returns different results from sourmash gather in Python, in PR #298) and rocksdb breakage stuff in top level description

bluegenes · 2024-05-10T19:15:59Z

> some things to do before merge -
>
> - put benchmarks in top level description
> - if it does fix #254 (fastgather CSV output for intersect_bp is in hashes, not in hashes * scaled as it is in sourmash gather output), indicate as such in top level description
> - reference #318 (fastmultigather returns different results from sourmash gather in Python, in PR #298) and rocksdb breakage stuff in top level description

done! ready for re-review, etc

ctb

I ran some separate validation scripts and it all looks great! Thank you!! It's a go for merge!

bluegenes and others added 11 commits March 29, 2024 17:59

init full gatherresult

7f24ab7

compiling version

b578d7c

restore default minimal gather result writing

32e9791

Merge branch 'main' into fmg-fullgather

d00d58d

downsample as needed; add test

414ed57

init fullres testing

5c1a73d

fix scaling for full result calculations

562d463

test more cols

9441c58

test more cols

33a4bec

test more columns

ee7d523

Merge branch 'main' into fmg-fullgather

22570c3

bluegenes changed the title ~~EXP: explore outputting full gather result from fastgather/fastmultigather~~ MRG: explore outputting full gather result from fastgather/fastmultigather Apr 20, 2024

bluegenes mentioned this pull request Apr 20, 2024

fastgather is faster than fastmultigather in loading the database #268

Open

bluegenes changed the title ~~MRG: explore outputting full gather result from fastgather/fastmultigather~~ MRG: optionally write full gather result from fastgather and non-rocksdb fastmultigather Apr 20, 2024

also use fullres test in fmg

91a0d8f

bluegenes mentioned this pull request Apr 20, 2024

use Record.filename to get match filename for full gather outputs #303

Open

bluegenes and others added 4 commits April 21, 2024 14:31

fill out match_filename in the non-rocksdb full-result gathers (#304)

367840a

Merge branch 'main' into fmg-fullgather

6854115

clean up

98254d3

Merge branch 'main' into fmg-fullgather

8880e04

This was referenced May 4, 2024

fastmultigather returns different results from sourmash gather in Python, in PR #298 #318

Closed

MRG: Adjust denominator of f_unique_to_query in #298 #319

Merged

This was referenced May 4, 2024

fastgather CSV output for intersect_bp is in hashes, not in hashes * scaled as it is in sourmash gather output #254

Closed

RocksDB-based fastmultigather is broken. #322

Closed

bluegenes and others added 3 commits May 8, 2024 09:05

Merge branch 'main' into fmg-fullgather

976f18f

use orig_query size as denominator in f_unique_to_query (#319)

caa0b79

**Note:** PR into #298 Fixes #318 Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

Merge branch 'main' into fmg-fullgather

e4e54bc

ctb mentioned this pull request May 10, 2024

WIP: Remove minimal results option #327

Merged

WIP: Remove minimal results option (#327)

1dbbf96

NOTE: PR into #298 Removes `--full-results` and updates tests for switch to full results by default.

bluegenes changed the title ~~MRG: optionally write full gather result from fastgather and non-rocksdb fastmultigather~~ MRG: write full gather result from fastgather and non-rocksdb fastmultigather May 10, 2024

ctb approved these changes May 10, 2024


bluegenes merged commit 38845e9 into main May 10, 2024
1 check passed

bluegenes deleted the fmg-fullgather branch May 10, 2024 19:51

This was referenced Jun 20, 2024

How can I use fastmultigather to get the abundance values? #167

Closed

Calculate ANI based on a CLI flag #164

Closed
