-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: Uniquify csv output from multigather #2065
Conversation
Hello! Hope you and yours are doing well. I am using multigather to query protein sequences against each other, and primarily use the csv file for downstream processing. As is, if the signatures in `--query query.sig` were created from one fasta file with e.g. `--singleton`, then all results iteratively overwrite each other into the same csv file. This proposed change adds the md5sum to the query file to ensure uniqueness. Other suggestions are welcome! (not tested yet) Warmest, Olga
Codecov Report
@@ Coverage Diff @@
## latest #2065 +/- ##
===========================================
- Coverage 85.34% 25.80% -59.55%
===========================================
Files 133 104 -29
Lines 15196 12402 -2794
Branches 2614 2390 -224
===========================================
- Hits 12969 3200 -9769
- Misses 1926 9002 +7076
+ Partials 301 200 -101
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 128 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
hi @olgabot, sounds good to me! I'm only in favor of removing things if people aren't using them 😆 Hot take: it looks like this PR so far does munging of an existing output column to include md5sum, which presents three challenges -
I think a better solution might be to output the full md5sum as a separate column. Based on the work @bluegenes did in #1955, I suggest naming it (Also going to tag in #1555; I think we're heading towards standardizing on the |
Hi @ctb, revisiting this PR one year later :) If I recall correctly, this PR doesn't change any of the columns of the csv output, only the filenames. Let me see if I can explain my thought process. For a command like this:
if
What I was hoping to see was a separate file for each query_md5 in the
Another option could be to concatenate all outputs into a single file, so have 3 outputs like the original, but have ALL signatures, and add a
Let me know what you think! Open to either many files or one mega-file. FWIW, I have been using the hack above, plus this StackOverflow answer which benchmarked a pure Python approach to concatenating 100s of csv files with headers for a bit and it's been working for me. |
This PR may fix #2328. |
OK, discussion over in #2328 (comment). Let me know what you think, @olgabot! Commentary in #1614 might be equally interesting, since we now have a plugin that provides really substantial speedups... |
Note: PR into #2065. This PR updates #2065 with all the changes from `latest`. This includes the fix & update to multigather in #2322, which: - fixes the output of multigather to include more than one line 😆 - prints out the output filename - supports `--output-dir` No functional changes are made beyond the merge, so some tests are still failing; will discuss fixes in yet a new PR :). --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Keya Barve <53328492+keyabarve@users.noreply.github.com> Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com> Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Taylor Reiter <taylorreiter@gmail.com> Co-authored-by: Erik Young <eeyoung@ucdavis.edu> Co-authored-by: David Koslicki <dmkoslicki@gmail.com> Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com> Co-authored-by: Colton Baumler <baumlerc@farm.ucdavis.edu> Co-authored-by: Luiz Irber <contact+github@luizirber.org> Co-authored-by: N. Tessa Pierce-Ward <ntpierce@gmail.com> Co-authored-by: Peter Cock <p.j.a.cock@googlemail.com> Co-authored-by: Francesco Beghini <francesco.beghini@yale.edu> Co-authored-by: Jason Stajich <jason.stajich@ucr.edu> Co-authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com>
Thanks for looking into this! I didn't realize that adding md5 would cause other problems, but glad they're not TOO huge. |
closing this in favor of #2722, which contains these commits but is directly against latest :) |
…t-add-query-md5sum` (#2722) This PR: - adds documentation for `multigather` to sourmash docs! - builds on #2065 / #2721 so that tests pass. - adds an option `-U/--output-add-query-md5sum` to `sourmash multigather` - adds an option `--force-allow-overwrite-output` to `sourmash multigather` - **CHANGES BEHAVIOR** of multigather by treating `query.filename == '-'` as if `query.filename` is empty, thus replacing it with md5sum - **CHANGES BEHAVIOR** of multigather by failing loudly and clearly if output files are going to be overwritten - adds `-E/--extension` to allow output to files other than `.sig` See discussion over in [#2328: `multigather` CSV output uses signature `filename` as basename](#2328). To add: - [x] tests for `-U`; - [x] implement and test `-E/--extension` - [x] implement and test `--force-allow-overwrite-output` - [x] fix for `query_filename` being None/empty in `-U` branch - [x] documentation update for changed output behavior for multigather: '-' => using md5sum - [x] documentation update for changed output behavior for multigather: fails if overwrite happens - [x] fix multigather link in docs --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Olga Botvinnik <olga.botvinnik@gmail.com> Co-authored-by: Keya Barve <53328492+keyabarve@users.noreply.github.com> Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com> Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Taylor Reiter <taylorreiter@gmail.com> Co-authored-by: Erik Young <eeyoung@ucdavis.edu> Co-authored-by: David Koslicki <dmkoslicki@gmail.com> Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com> Co-authored-by: Colton Baumler <baumlerc@farm.ucdavis.edu> Co-authored-by: Luiz Irber <contact+github@luizirber.org> Co-authored-by: N. Tessa Pierce-Ward <ntpierce@gmail.com> Co-authored-by: Peter Cock <p.j.a.cock@googlemail.com> Co-authored-by: Francesco Beghini <francesco.beghini@yale.edu> Co-authored-by: Jason Stajich <jason.stajich@ucr.edu> Co-authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Hello!
Hope you and yours are doing well. I am using multigather to query protein sequences against each other, and primarily use the csv file for downstream processing. As is, if the signatures in
--query query.sig
were created from one fasta file with e.g.--singleton
, then all results iteratively overwrite each other into the same csv file. This proposed change adds the md5sum to the query file to ensure uniqueness. Having seen the plans to deprecatemultigather
, #1614, maybe this is not the right way to go about this. Other suggestions are welcome!(not tested yet)
Warmest,
Olga