MRG: Uniquify csv output from multigather #2065

olgabot · 2022-05-28T20:27:48Z

Hello!
Hope you and yours are doing well. I am using multigather to query protein sequences against each other, and primarily use the csv file for downstream processing. As is, if the signatures in --query query.sig were created from one fasta file with e.g. --singleton, then all results iteratively overwrite each other into the same csv file. This proposed change adds the md5sum to the query file to ensure uniqueness. Having seen the plans to deprecate multigather, #1614, maybe this is not the right way to go about this. Other suggestions are welcome!

(not tested yet)

Warmest,
Olga

Hello! Hope you and yours are doing well. I am using multigather to query protein sequences against each other, and primarily use the csv file for downstream processing. As is, if the signatures in `--query query.sig` were created from one fasta file with e.g. `--singleton`, then all results iteratively overwrite each other into the same csv file. This proposed change adds the md5sum to the query file to ensure uniqueness. Other suggestions are welcome! (not tested yet) Warmest, Olga

codecov · 2022-05-28T20:35:59Z

Codecov Report

Merging #2065 (e732f30) into latest (d0e3726) will decrease coverage by 59.55%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           latest    #2065       +/-   ##
===========================================
- Coverage   85.34%   25.80%   -59.55%     
===========================================
  Files         133      104       -29     
  Lines       15196    12402     -2794     
  Branches     2614     2390      -224     
===========================================
- Hits        12969     3200     -9769     
- Misses       1926     9002     +7076     
+ Partials      301      200      -101

Flag	Coverage Δ
hypothesis-py	`25.80% <0.00%> (-0.01%)`	⬇️
python	`?`
rust	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
src/sourmash/commands.py	`3.11% <0.00%> (-88.30%)`	⬇️

... and 128 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

ctb · 2022-06-02T13:54:09Z

hi @olgabot, sounds good to me! I'm only in favor of removing things if people aren't using them 😆

Hot take: it looks like this PR so far does munging of an existing output column to include md5sum, which presents three challenges -

first, we shouldn't change the format or meaning of existing columns b/c semantic versioning
second, this involves weird after-the-fact parsing by consumers of the CSV
third, it wouldn't allow downstream consumption by other sourmash code, especially the picklist code.

I think a better solution might be to output the full md5sum as a separate column. Based on the work @bluegenes did in #1955, I suggest naming it query_md5.

(Also going to tag in #1555; I think we're heading towards standardizing on the PrefetchResult column names as much as possible.)

olgabot · 2023-08-18T22:34:24Z

Hi @ctb, revisiting this PR one year later :)

If I recall correctly, this PR doesn't change any of the columns of the csv output, only the filenames. Let me see if I can explain my thought process.

For a command like this:

sourmash gather --query query_filename.sig --db db_filename.sig

if query_filename.sig actually contains 1000s of signatures made with --singleton (e.g. made from PacBio long reads and you want to do sourmash gather on each individual read against a database), then what happens now is that a single set of *csv, *matches.sig and *.unassigned.sig files get created, and overwritten with each signature.

query_filename.csv
query_filename.matches.sig
query_filename.unassigned.sig

What I was hoping to see was a separate file for each query_md5 in the query_filename.sig file, like this:

query_filename.abc123.csv
query_filename.abc123.matches.sig
query_filename.abc123.unassigned.sig
...
query_filename.xyz789.csv
query_filename.xyz789.matches.sig
query_filename.xyz789.unassigned.sig

Another option could be to concatenate all outputs into a single file, so have 3 outputs like the original, but have ALL signatures, and add a query_md5 field in the column names as you mention

query_filename.csv
query_filename.matches.sig
query_filename.unassigned.sig

Let me know what you think! Open to either many files or one mega-file.

FWIW, I have been using the hack above, plus this StackOverflow answer which benchmarked a pure Python approach to concatenating 100s of csv files with headers for a bit and it's been working for me.

ctb · 2023-08-19T15:21:08Z

hi @olgabot please see #2721, in which I update this PR to the code in latest (including the most recent release, sourmash v4.8.3). I think that merge should be uncontroversial ;). More discussion coming!

ctb · 2023-08-19T15:26:31Z

This PR may fix #2328.

ctb · 2023-08-19T16:16:50Z

OK, discussion over in #2328 (comment). Let me know what you think, @olgabot! Commentary in #1614 might be equally interesting, since we now have a plugin that provides really substantial speedups...

Note: PR into #2065. This PR updates #2065 with all the changes from `latest`. This includes the fix & update to multigather in #2322, which: - fixes the output of multigather to include more than one line 😆 - prints out the output filename - supports `--output-dir` No functional changes are made beyond the merge, so some tests are still failing; will discuss fixes in yet a new PR :). --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Keya Barve <53328492+keyabarve@users.noreply.github.com> Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com> Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Taylor Reiter <taylorreiter@gmail.com> Co-authored-by: Erik Young <eeyoung@ucdavis.edu> Co-authored-by: David Koslicki <dmkoslicki@gmail.com> Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com> Co-authored-by: Colton Baumler <baumlerc@farm.ucdavis.edu> Co-authored-by: Luiz Irber <contact+github@luizirber.org> Co-authored-by: N. Tessa Pierce-Ward <ntpierce@gmail.com> Co-authored-by: Peter Cock <p.j.a.cock@googlemail.com> Co-authored-by: Francesco Beghini <francesco.beghini@yale.edu> Co-authored-by: Jason Stajich <jason.stajich@ucr.edu> Co-authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com>

olgabot · 2023-08-21T01:03:46Z

Thanks for looking into this! I didn't realize that adding md5 would cause other problems, but glad they're not TOO huge.

…o olgabot-patch-2

ctb · 2024-02-28T14:34:44Z

closing this in favor of #2722, which contains these commits but is directly against latest :)

…t-add-query-md5sum` (#2722) This PR: - adds documentation for `multigather` to sourmash docs! - builds on #2065 / #2721 so that tests pass. - adds an option `-U/--output-add-query-md5sum` to `sourmash multigather` - adds an option `--force-allow-overwrite-output` to `sourmash multigather` - **CHANGES BEHAVIOR** of multigather by treating `query.filename == '-'` as if `query.filename` is empty, thus replacing it with md5sum - **CHANGES BEHAVIOR** of multigather by failing loudly and clearly if output files are going to be overwritten - adds `-E/--extension` to allow output to files other than `.sig` See discussion over in [#2328: `multigather` CSV output uses signature `filename` as basename](#2328). To add: - [x] tests for `-U`; - [x] implement and test `-E/--extension` - [x] implement and test `--force-allow-overwrite-output` - [x] fix for `query_filename` being None/empty in `-U` branch - [x] documentation update for changed output behavior for multigather: '-' => using md5sum - [x] documentation update for changed output behavior for multigather: fails if overwrite happens - [x] fix multigather link in docs --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Olga Botvinnik <olga.botvinnik@gmail.com> Co-authored-by: Keya Barve <53328492+keyabarve@users.noreply.github.com> Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com> Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Taylor Reiter <taylorreiter@gmail.com> Co-authored-by: Erik Young <eeyoung@ucdavis.edu> Co-authored-by: David Koslicki <dmkoslicki@gmail.com> Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com> Co-authored-by: Colton Baumler <baumlerc@farm.ucdavis.edu> Co-authored-by: Luiz Irber <contact+github@luizirber.org> Co-authored-by: N. Tessa Pierce-Ward <ntpierce@gmail.com> Co-authored-by: Peter Cock <p.j.a.cock@googlemail.com> Co-authored-by: Francesco Beghini <francesco.beghini@yale.edu> Co-authored-by: Jason Stajich <jason.stajich@ucr.edu> Co-authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

ctb changed the title ~~Uniquify csv output from multigather~~ MRG: Uniquify csv output from multigather Aug 19, 2023

ctb mentioned this pull request Aug 19, 2023

MRG: update #2065 (uniquify CSV output from multigather) #2721

Merged

ctb mentioned this pull request Aug 19, 2023

deprecate sourmash multigather for 5.0 #1614

Open

This was referenced Aug 19, 2023

MRG: fix multigather output by adding md5sum along with -U/--output-add-query-md5sum #2722

Merged

multigather CSV output uses signature filename as basename. #2328

Open

ctb and others added 2 commits August 20, 2023 19:47

Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…

4e148a0

…o olgabot-patch-2

Merge branch 'latest' into olgabot-patch-2

e732f30

bluegenes mentioned this pull request Feb 26, 2024

fastmultigather --output is only being used when searching against a rocksdb sourmash-bio/sourmash_plugin_branchwater#239

Open

ctb closed this Feb 28, 2024

ctb mentioned this pull request Jul 13, 2024

using fastmultigather to do contig-level gather and taxonomy assignment - a brief tutorial #3095

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MRG: Uniquify csv output from multigather #2065

MRG: Uniquify csv output from multigather #2065

olgabot commented May 28, 2022

codecov bot commented May 28, 2022 •

edited

Loading

ctb commented Jun 2, 2022

olgabot commented Aug 18, 2023

ctb commented Aug 19, 2023

ctb commented Aug 19, 2023

ctb commented Aug 19, 2023

olgabot commented Aug 21, 2023

ctb commented Feb 28, 2024

MRG: Uniquify csv output from multigather #2065

MRG: Uniquify csv output from multigather #2065

Conversation

olgabot commented May 28, 2022

codecov bot commented May 28, 2022 • edited Loading

Codecov Report

ctb commented Jun 2, 2022

olgabot commented Aug 18, 2023

ctb commented Aug 19, 2023

ctb commented Aug 19, 2023

ctb commented Aug 19, 2023

olgabot commented Aug 21, 2023

ctb commented Feb 28, 2024

codecov bot commented May 28, 2022 •

edited

Loading