[MRG] add flexible & iterative support for outputting signatures in variety of formats #1493

Merged
merged 19 commits into from
May 5, 2021
33 changes: 33 additions & 0 deletions doc/command-line.md
@@ -620,6 +620,8 @@ with fields:

If `--outdir` is specified, all of the signatures are placed in outdir.

Note: `split` only saves files in the JSON `.sig` format.

### `sourmash signature merge` - merge two or more signatures into one

Merge two (or more) signatures.
@@ -637,6 +639,9 @@ then the merged signature will have the sum of all abundances across
the individual signatures. The `--flatten` flag will override this
behavior and allow merging of mixtures by removing all abundances.

Note: `merge` only creates one output file, with one signature in it,
in the JSON `.sig` format.

### `sourmash signature rename` - rename a signature

Rename the display name for one or more signatures - this is the name
@@ -666,6 +671,9 @@ will subtract all of the hashes in `file2.sig` and `file3.sig` from
To use `subtract` on signatures calculated with
`-p abund`, you must specify `--flatten`.

Note: `subtract` only creates one output file, with one signature in it,
in the JSON `.sig` format.

### `sourmash signature intersect` - intersect two (or more) signatures

Output the intersection of the hash values in multiple signature files.
@@ -682,6 +690,9 @@ The `intersect` command flattens all signatures, i.e. the abundances
in any signatures will be ignored and the output signature will have
`track_abundance` turned off.

Note: `intersect` only creates one output file, with one signature in it,
in the JSON `.sig` format.

### `sourmash signature downsample` - decrease the size of a signature

Downsample one or more signatures.
@@ -773,6 +784,9 @@ sourmash signature import filename.msh.json -o imported.sig
```
will import the contents of `filename.msh.json` into `imported.sig`.

Note: `import` only creates one output file, with one signature in it,
in the JSON `.sig` format.

### `sourmash signature export` - export signatures to mash.

Export signatures from sourmash format. Currently only supports
@@ -860,6 +874,25 @@ signatures from zip files. You can create a compressed collection of
signatures using `zip -r collection.zip *.sig` and then specify
`collection.zip` on the command line.

### Saving signatures, more generally

As of sourmash 4.1, most signature saving arguments (`--save-matches`
for `search` and `gather`, `-o` for `sourmash sketch`, and most of the
`sourmash signature` commands) support flexible saving of collections of
signatures into JSON text, Zip files, and/or directories.

The output format is chosen based on the requested output filename:

* to save to JSON signature files, use `.sig`; `-` will send JSON to stdout.
* to save to gzipped JSON signature files, use `.sig.gz`.
* to save to a Zip file collection, use `.zip`.
* to save signature files to a directory, use a name ending in `/`; the directory will be created if it doesn't exist.

If none of these file extensions is detected, output will be written in the JSON `.sig` format, either to the provided output filename or to stdout.

All of these save formats can be loaded by sourmash commands, too.
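
As a rough illustration of the filename rules listed in this section, the sketch below maps an output name to a save format. `guess_save_format` and its return labels are hypothetical names made up for this example; they are not sourmash functions, and the actual dispatch in sourmash may differ in its details.

```python
def guess_save_format(location):
    """Return a label for the save format implied by an output name.

    Hypothetical helper for illustration only; it mirrors the rules
    in the documentation text, not sourmash's actual code.
    """
    if location == '-':
        return 'json-stdout'       # '-' sends JSON to stdout
    if location.endswith('/'):
        return 'directory'         # a name ending in '/' is a directory
    if location.endswith('.zip'):
        return 'zip'               # Zip file collection
    if location.endswith('.sig.gz'):
        return 'json-gzip'         # gzipped JSON signature file
    # '.sig' and any unrecognized name fall back to plain JSON
    return 'json'


for name in ['-', 'out.sig', 'out.sig.gz', 'collection.zip', 'sigs/', 'out.txt']:
    print(name, '->', guess_save_format(name))
```

The fallback branch covers both `.sig` and unknown extensions, matching the statement that unrecognized names are written in the JSON `.sig` format.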

### Loading all signatures under a directory

All of the `sourmash` commands support loading signatures from
2 changes: 1 addition & 1 deletion src/sourmash/cli/sig/cat.py
@@ -12,7 +12,7 @@ def subparser(subparsers):
         help='suppress non-error output'
     )
     subparser.add_argument(
-        '-o', '--output', metavar='FILE',
+        '-o', '--output', metavar='FILE', default='-',
         help='output signature to this file (default stdout)'
     )
     subparser.add_argument(
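
The effect of adding `default='-'` can be checked with plain `argparse`. The parser below is a hypothetical stand-in, not the sourmash subparser: with the default in place, `args.output` is `'-'` when `-o` is omitted, instead of argparse's implicit `None`, so downstream saving code always receives a string.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for one `sourmash signature` subcommand parser.
    parser = argparse.ArgumentParser(prog='sig-demo')
    parser.add_argument('signatures', nargs='+')
    parser.add_argument(
        '-o', '--output', metavar='FILE', default='-',
        help='output signature to this file (default stdout)'
    )
    return parser


parser = build_parser()
no_flag = parser.parse_args(['a.sig'])
with_flag = parser.parse_args(['a.sig', '-o', 'out.zip'])

print(repr(no_flag.output))    # value used when -o is omitted
print(repr(with_flag.output))  # value supplied on the command line
```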
3 changes: 2 additions & 1 deletion src/sourmash/cli/sig/downsample.py
@@ -22,7 +22,8 @@ def subparser(subparsers):
     )
     subparser.add_argument(
         '-o', '--output', metavar='FILE',
-        help='output signature to this file (default stdout)'
+        help='output signature to this file (default stdout)',
+        default='-',
     )
     add_ksize_arg(subparser, 31)
     add_moltype_args(subparser)
3 changes: 2 additions & 1 deletion src/sourmash/cli/sig/extract.py
@@ -14,7 +14,8 @@ def subparser(subparsers):
     )
     subparser.add_argument(
         '-o', '--output', metavar='FILE',
-        help='output signature to this file (default stdout)'
+        help='output signature to this file (default stdout)',
+        default='-',
     )
     subparser.add_argument(
         '--md5', default=None,
3 changes: 2 additions & 1 deletion src/sourmash/cli/sig/filter.py
@@ -14,7 +14,8 @@ def subparser(subparsers):
     )
     subparser.add_argument(
         '-o', '--output', metavar='FILE',
-        help='output signature to this file (default stdout)'
+        help='output signature to this file (default stdout)',
+        default='-'
     )
     subparser.add_argument(
         '--md5', type=str, default=None,
3 changes: 2 additions & 1 deletion src/sourmash/cli/sig/flatten.py
@@ -14,7 +14,8 @@ def subparser(subparsers):
     )
     subparser.add_argument(
         '-o', '--output', metavar='FILE',
-        help='output signature to this file (default stdout)'
+        help='output signature to this file (default stdout)',
+        default='-',
     )
     subparser.add_argument(
         '--md5', default=None,
4 changes: 3 additions & 1 deletion src/sourmash/cli/sig/rename.py
@@ -16,7 +16,9 @@ def subparser(subparsers):
         help='print debugging output'
     )
     subparser.add_argument(
-        '-o', '--output', metavar='FILE', help='output to this file'
+        '-o', '--output', metavar='FILE',
+        help='output renamed signature to this file (default stdout)',
+        default='-'
     )
     add_ksize_arg(subparser, 31)
     add_moltype_args(subparser)
20 changes: 17 additions & 3 deletions src/sourmash/command_compute.py
@@ -9,7 +9,7 @@
 import time

 from . import sourmash_args
-from .signature import SourmashSignature, save_signatures
+from .signature import SourmashSignature
 from .logging import notify, error, set_quiet
 from .utils import RustObject
 from ._lowlevel import ffi, lib
@@ -267,9 +267,23 @@ def set_sig_name(sigs, filename, name=None):


 def save_siglist(siglist, sigfile_name):
+    import sourmash
+
     # save!
-    with sourmash_args.FileOutput(sigfile_name, 'w') as fp:
-        save_signatures(siglist, fp)
+    with sourmash_args.SaveSignaturesToLocation(sigfile_name) as save_sig:
+        for ss in siglist:
+            try:
+                save_sig.add(ss)
+            except sourmash.exceptions.Panic:
+                # this deals with a disconnect between the way Rust
+                # and Python handle signatures; Python expects one
+                # minhash (and hence one md5sum) per signature, while
+                # Rust supports multiple. For now, go through serializing
+                # and deserializing the signature! See issue #1167 for more.
+                json_str = sourmash.save_signatures([ss])
+                for ss in sourmash.load_signatures(json_str):
+                    save_sig.add(ss)
+
     notify('saved signature(s) to {}. Note: signature license is CC0.',
            sigfile_name)
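
The calling pattern in this diff (a `with` block plus `add()`, or explicit `open()` / `add()` / `close()` with `len()` afterwards) can be sketched with a small stand-in class. `ToySaver` below is hypothetical and only mimics the interface visible in these hunks; it is not the `SaveSignaturesToLocation` implementation and writes plain strings rather than signatures.

```python
import io


class ToySaver:
    """Hypothetical saver exposing open/add/close, len(), and 'with'."""

    def __init__(self, stream):
        self.stream = stream
        self.count = 0
        self.is_open = False

    def open(self):
        self.is_open = True

    def add(self, record):
        # Write each record as it arrives instead of building a list first.
        if not self.is_open:
            raise RuntimeError('saver is not open')
        self.stream.write(record + '\n')
        self.count += 1

    def close(self):
        self.is_open = False

    def __len__(self):
        return self.count

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions


# Context-manager style, as in save_siglist / search / gather.
buf = io.StringIO()
with ToySaver(buf) as saver:
    for name in ['sig1', 'sig2']:
        saver.add(name)

# Explicit open/close style, as in the `sourmash signature` commands.
buf2 = io.StringIO()
saver2 = ToySaver(buf2)
saver2.open()
saver2.add('sig3')
saver2.close()

print(len(saver), len(saver2))
```

Adding records one at a time is what lets the caller avoid holding the full output list in memory, which is the "iterative" part of the PR title.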

14 changes: 9 additions & 5 deletions src/sourmash/commands.py
@@ -15,7 +15,8 @@
 from . import signature as sig
 from . import sourmash_args
 from .logging import notify, error, print_results, set_quiet
-from .sourmash_args import DEFAULT_LOAD_K, FileOutput, FileOutputCSV
+from .sourmash_args import (DEFAULT_LOAD_K, FileOutput, FileOutputCSV,
+                            SaveSignaturesToLocation)

 WATERMARK_SIZE = 10000

@@ -524,8 +525,10 @@ def search(args):
     # save matching signatures upon request
     if args.save_matches:
         notify('saving all matched signatures to "{}"', args.save_matches)
-        with FileOutput(args.save_matches, 'wt') as fp:
-            sig.save_signatures([ sr.match for sr in results ], fp)
+
+        with SaveSignaturesToLocation(args.save_matches) as save_sig:
+            for sr in results:
+                save_sig.add(sr.match)


 def categorize(args):
@@ -714,8 +717,9 @@ def gather(args):
     # save matching signatures?
     if found and args.save_matches:
         notify(f"saving all matches to '{args.save_matches}'")
-        with FileOutput(args.save_matches, 'wt') as fp:
-            sig.save_signatures([ r.match for r in found ], fp)
+        with SaveSignaturesToLocation(args.save_matches) as save_sig:
+            for sr in found:
+                save_sig.add(sr.match)

     # save unassigned hashes?
     if args.output_unassigned:
73 changes: 39 additions & 34 deletions src/sourmash/sig/__main__.py
@@ -70,9 +70,10 @@ def cat(args):
     encountered_md5sums = defaultdict(int)   # used by --unique
     progress = sourmash_args.SignatureLoadingProgress()

-    siglist = []
+    save_sigs = sourmash_args.SaveSignaturesToLocation(args.output)
+    save_sigs.open()
+
     for sigfile in args.signatures:
-        this_siglist = []
         try:
             loader = sourmash_args.load_file_as_signatures(sigfile,
                                                            progress=progress)
@@ -85,19 +86,18 @@ def cat(args):
                 if args.unique and encountered_md5sums[md5] > 1:
                     continue

-                siglist.append(sig)
+                save_sigs.add(sig)
         except Exception as exc:
             error(str(exc))
             error('(continuing)')

         notify('loaded {} signatures from {}...', n_loaded, sigfile, end='\r')

-    notify('loaded {} signatures total.', len(siglist))
+    notify('loaded {} signatures total.', len(save_sigs))

-    with FileOutput(args.output, 'wt') as fp:
-        sourmash.save_signatures(siglist, fp=fp)
+    save_sigs.close()

-    notify('output {} signatures', len(siglist))
+    notify('output {} signatures', len(save_sigs))

     multiple_md5 = [ 1 for cnt in encountered_md5sums.values() if cnt > 1 ]
     if multiple_md5:
@@ -523,7 +523,9 @@ def rename(args):

     progress = sourmash_args.SignatureLoadingProgress()

-    outlist = []
+    save_sigs = sourmash_args.SaveSignaturesToLocation(args.output)
+    save_sigs.open()
+
     for filename in args.sigfiles:
         debug('loading {}', filename)
         siglist = sourmash_args.load_file_as_signatures(filename,
@@ -533,12 +535,11 @@ def rename(args):

         for sigobj in siglist:
             sigobj._name = args.name
-            outlist.append(sigobj)
+            save_sigs.add(sigobj)

-    with FileOutput(args.output, 'wt') as fp:
-        sourmash.save_signatures(outlist, fp=fp)
+    save_sigs.close()

-    notify("set name to '{}' on {} signatures", args.name, len(outlist))
+    notify("set name to '{}' on {} signatures", args.name, len(save_sigs))


 def extract(args):
@@ -550,7 +551,9 @@ def extract(args):

     progress = sourmash_args.SignatureLoadingProgress()

-    outlist = []
+    save_sigs = sourmash_args.SaveSignaturesToLocation(args.output)
+    save_sigs.open()
+
     total_loaded = 0
     for filename in args.signatures:
         siglist = sourmash_args.load_file_as_signatures(filename,
@@ -567,18 +570,18 @@ def extract(args):
         if args.name is not None:
             siglist = [ ss for ss in siglist if args.name in str(ss) ]

-        outlist.extend(siglist)
+        for ss in siglist:
+            save_sigs.add(ss)

     notify("loaded {} total that matched ksize & molecule type",
            total_loaded)
-    if not outlist:
+    if not save_sigs:
         error("no matching signatures!")
         sys.exit(-1)

-    with FileOutput(args.output, 'wt') as fp:
-        sourmash.save_signatures(outlist, fp=fp)
+    save_sigs.close()

-    notify("extracted {} signatures from {} file(s)", len(outlist),
+    notify("extracted {} signatures from {} file(s)", len(save_sigs),
            len(args.signatures))
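
In `extract`, `if not save_sigs:` now tests the saver object rather than a list. In Python, an object that defines no `__bool__` is falsy when its `__len__` returns 0, so the check keeps its old meaning provided the saver defines `__len__`, which the `len(save_sigs)` calls in this diff suggest. A minimal sketch with a hypothetical `CountingSaver` class (not sourmash code):

```python
class CountingSaver:
    """Hypothetical saver that only tracks how many items were added."""

    def __init__(self):
        self._n = 0

    def add(self, item):
        self._n += 1

    def __len__(self):
        # With no __bool__ defined, truth testing falls back to __len__.
        return self._n


saver = CountingSaver()
empty_is_falsy = not saver         # nothing added yet

saver.add('some signature')
nonempty_is_truthy = bool(saver)   # one item added

print(empty_is_falsy, nonempty_is_truthy)
```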


@@ -591,7 +594,9 @@ def filter(args):

     progress = sourmash_args.SignatureLoadingProgress()

-    outlist = []
+    save_sigs = sourmash_args.SaveSignaturesToLocation(args.output)
+    save_sigs.open()
+
     total_loaded = 0
     for filename in args.signatures:
         siglist = sourmash_args.load_file_as_signatures(filename,
@@ -628,27 +633,28 @@ def filter(args):

             ss.minhash = filtered_mh

-        outlist.extend(siglist)
+            save_sigs.add(ss)

-    with FileOutput(args.output, 'wt') as fp:
-        sourmash.save_signatures(outlist, fp=fp)
+    save_sigs.close()

     notify("loaded {} total that matched ksize & molecule type",
            total_loaded)
-    notify("extracted {} signatures from {} file(s)", len(outlist),
+    notify("extracted {} signatures from {} file(s)", len(save_sigs),
            len(args.signatures))


 def flatten(args):
     """
-    flatten a signature, removing abundances.
+    flatten one or more signatures, removing abundances.
     """
     set_quiet(args.quiet)
     moltype = sourmash_args.calculate_moltype(args)

     progress = sourmash_args.SignatureLoadingProgress()

-    outlist = []
+    save_sigs = sourmash_args.SaveSignaturesToLocation(args.output)
+    save_sigs.open()
+
     total_loaded = 0
     for filename in args.signatures:
         siglist = sourmash_args.load_file_as_signatures(filename,
@@ -667,15 +673,13 @@ def flatten(args):

         for ss in siglist:
             ss.minhash = ss.minhash.flatten()
+            save_sigs.add(ss)

-        outlist.extend(siglist)
-
-    with FileOutput(args.output, 'wt') as fp:
-        sourmash.save_signatures(outlist, fp=fp)
+    save_sigs.close()

     notify("loaded {} total that matched ksize & molecule type",
            total_loaded)
-    notify("extracted {} signatures from {} file(s)", len(outlist),
+    notify("extracted {} signatures from {} file(s)", len(save_sigs),
            len(args.signatures))


@@ -694,9 +698,11 @@ def downsample(args):
         error('cannot specify both --num and --scaled')
         sys.exit(-1)

+    save_sigs = sourmash_args.SaveSignaturesToLocation(args.output)
+    save_sigs.open()
+
     progress = sourmash_args.SignatureLoadingProgress()

-    output_list = []
     total_loaded = 0
     for sigfile in args.signatures:
         siglist = sourmash_args.load_file_as_signatures(sigfile,
@@ -734,10 +740,9 @@ def downsample(args):

             sigobj.minhash = mh_new

-            output_list.append(sigobj)
+            save_sigs.add(sigobj)

-    with FileOutput(args.output, 'wt') as fp:
-        sourmash.save_signatures(output_list, fp=fp)
+    save_sigs.close()

     notify("loaded and downsampled {} signatures", total_loaded)