Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: Uniquify csv output from multigather #2065

Closed
wants to merge 4 commits into from
Closed

Conversation

olgabot
Copy link
Collaborator

@olgabot olgabot commented May 28, 2022

Hello!
Hope you and yours are doing well. I am using multigather to query protein sequences against each other, and primarily use the csv file for downstream processing. As is, if the signatures in --query query.sig were created from one fasta file with e.g. --singleton, then all results iteratively overwrite each other into the same csv file. This proposed change adds the md5sum to the query file to ensure uniqueness. Having seen the plans to deprecate multigather, #1614, maybe this is not the right way to go about this. Other suggestions are welcome!

(not tested yet)

Warmest,
Olga

Hello!
Hope you and yours are doing well. I am using multigather to query protein sequences against each other, and primarily use the csv file for downstream processing. As is, if the signatures in `--query query.sig` were created from one fasta file with e.g. `--singleton`, then all results iteratively overwrite each other into the same csv file. This proposed change adds the md5sum to the query file to ensure uniqueness. Other suggestions are welcome!

(not tested yet)

Warmest,
Olga
@codecov
Copy link

codecov bot commented May 28, 2022

Codecov Report

Merging #2065 (e732f30) into latest (d0e3726) will decrease coverage by 59.55%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           latest    #2065       +/-   ##
===========================================
- Coverage   85.34%   25.80%   -59.55%     
===========================================
  Files         133      104       -29     
  Lines       15196    12402     -2794     
  Branches     2614     2390      -224     
===========================================
- Hits        12969     3200     -9769     
- Misses       1926     9002     +7076     
+ Partials      301      200      -101     
Flag Coverage Δ
hypothesis-py 25.80% <0.00%> (-0.01%) ⬇️
python ?
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
src/sourmash/commands.py 3.11% <0.00%> (-88.30%) ⬇️

... and 128 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ctb
Copy link
Contributor

ctb commented Jun 2, 2022

hi @olgabot, sounds good to me! I'm only in favor of removing things if people aren't using them 😆

Hot take: it looks like this PR so far does munging of an existing output column to include md5sum, which presents three challenges -

  • first, we shouldn't change the format or meaning of existing columns b/c semantic versioning
  • second, this involves weird after-the-fact parsing by consumers of the CSV
  • third, it wouldn't allow downstream consumption by other sourmash code, especially the picklist code.

I think a better solution might be to output the full md5sum as a separate column. Based on the work @bluegenes did in #1955, I suggest naming it query_md5.

(Also going to tag in #1555; I think we're heading towards standardizing on the PrefetchResult column names as much as possible.)

@olgabot
Copy link
Collaborator Author

olgabot commented Aug 18, 2023

Hi @ctb, revisiting this PR one year later :)

If I recall correctly, this PR doesn't change any of the columns of the csv output, only the filenames. Let me see if I can explain my thought process.

For a command like this:

sourmash gather --query query_filename.sig --db db_filename.sig

if query_filename.sig actually contains 1000s of signatures made with --singleton (e.g. made from PacBio long reads and you want to do sourmash gather on each individual read against a database), then what happens now is that a single set of *csv, *matches.sig and *.unassigned.sig files get created, and overwritten with each signature.

query_filename.csv
query_filename.matches.sig
query_filename.unassigned.sig

What I was hoping to see was a separate file for each query_md5 in the query_filename.sig file, like this:

query_filename.abc123.csv
query_filename.abc123.matches.sig
query_filename.abc123.unassigned.sig
...
query_filename.xyz789.csv
query_filename.xyz789.matches.sig
query_filename.xyz789.unassigned.sig

Another option could be to concatenate all outputs into a single file, so have 3 outputs like the original, but have ALL signatures, and add a query_md5 field in the column names as you mention

query_filename.csv
query_filename.matches.sig
query_filename.unassigned.sig

Let me know what you think! Open to either many files or one mega-file.

FWIW, I have been using the hack above, plus this StackOverflow answer which benchmarked a pure Python approach to concatenating 100s of csv files with headers for a bit and it's been working for me.

@ctb ctb changed the title Uniquify csv output from multigather MRG: Uniquify csv output from multigather Aug 19, 2023
@ctb
Copy link
Contributor

ctb commented Aug 19, 2023

hi @olgabot please see #2721, in which I update this PR to the code in latest (including the most recent release, sourmash v4.8.3). I think that merge should be uncontroversial ;). More discussion coming!

@ctb
Copy link
Contributor

ctb commented Aug 19, 2023

This PR may fix #2328.

@ctb
Copy link
Contributor

ctb commented Aug 19, 2023

OK, discussion over in #2328 (comment). Let me know what you think, @olgabot! Commentary in #1614 might be equally interesting, since we now have a plugin that provides really substantial speedups...

Note: PR into #2065.

This PR updates #2065 with all the changes from `latest`. This includes
the fix & update to multigather in
#2322, which:
- fixes the output of multigather to include more than one line 😆 
- prints out the output filename
- supports `--output-dir`

No functional changes are made beyond the merge, so some tests are still
failing; will discuss fixes in yet a new PR :).

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Keya Barve <53328492+keyabarve@users.noreply.github.com>
Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com>
Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Taylor Reiter <taylorreiter@gmail.com>
Co-authored-by: Erik Young <eeyoung@ucdavis.edu>
Co-authored-by: David Koslicki <dmkoslicki@gmail.com>
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Co-authored-by: Colton Baumler <baumlerc@farm.ucdavis.edu>
Co-authored-by: Luiz Irber <contact+github@luizirber.org>
Co-authored-by: N. Tessa Pierce-Ward <ntpierce@gmail.com>
Co-authored-by: Peter Cock <p.j.a.cock@googlemail.com>
Co-authored-by: Francesco Beghini <francesco.beghini@yale.edu>
Co-authored-by: Jason Stajich <jason.stajich@ucr.edu>
Co-authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com>
@olgabot
Copy link
Collaborator Author

olgabot commented Aug 21, 2023

Thanks for looking into this! I didn't realize that adding md5 would cause other problems, but glad they're not TOO huge.

@ctb
Copy link
Contributor

ctb commented Feb 28, 2024

closing this in favor of #2722, which contains these commits but is directly against latest :)

@ctb ctb closed this Feb 28, 2024
ctb added a commit that referenced this pull request Feb 29, 2024
…t-add-query-md5sum` (#2722)

This PR:
- adds documentation for `multigather` to sourmash docs!
- builds on #2065 /
#2721 so that tests pass.
- adds an option `-U/--output-add-query-md5sum` to `sourmash
multigather`
- adds an option `--force-allow-overwrite-output` to `sourmash
multigather`
- **CHANGES BEHAVIOR** of multigather by treating `query.filename ==
'-'` as if `query.filename` is empty, thus replacing it with md5sum
- **CHANGES BEHAVIOR** of multigather by failing loudly and clearly if
output files are going to be overwritten
- adds `-E/--extension` to allow output to files other than `.sig`

See discussion over in [#2328: `multigather` CSV output uses signature
`filename` as
basename](#2328).

To add:
- [x] tests for `-U`;
- [x] implement and test `-E/--extension`
- [x] implement and test `--force-allow-overwrite-output`
- [x] fix for `query_filename` being None/empty in `-U` branch
- [x] documentation update for changed output behavior for multigather:
'-' => using md5sum
- [x] documentation update for changed output behavior for multigather:
fails if overwrite happens
- [x] fix multigather link in docs

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Olga Botvinnik <olga.botvinnik@gmail.com>
Co-authored-by: Keya Barve <53328492+keyabarve@users.noreply.github.com>
Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com>
Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Taylor Reiter <taylorreiter@gmail.com>
Co-authored-by: Erik Young <eeyoung@ucdavis.edu>
Co-authored-by: David Koslicki <dmkoslicki@gmail.com>
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Co-authored-by: Colton Baumler <baumlerc@farm.ucdavis.edu>
Co-authored-by: Luiz Irber <contact+github@luizirber.org>
Co-authored-by: N. Tessa Pierce-Ward <ntpierce@gmail.com>
Co-authored-by: Peter Cock <p.j.a.cock@googlemail.com>
Co-authored-by: Francesco Beghini <francesco.beghini@yale.edu>
Co-authored-by: Jason Stajich <jason.stajich@ucr.edu>
Co-authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants