[MRG] Explore addition of a sketch command. #1159
Codecov Report
@@             Coverage Diff             @@
##           latest    #1159       +/-   ##
===========================================
+ Coverage   67.86%   93.01%   +25.15%
===========================================
  Files          24       76       +52
  Lines        3314     5986     +2672
===========================================
+ Hits         2249     5568     +3319
+ Misses       1065      418      -647
OK. I think this addresses most of the UX problems with |
This is really exciting!! |
Interesting! We've found that for "realistic" 10x data we need to do several steps before getting to the sketch part. Check out the kmermaid pipeline for implementation details, and @pranathivemuri feel free to correct me where I'm wrong.
Skipping step (3) results in ~700,000 cell barcode files per bam, because that's the total number of theoretical cell barcodes per 10x run (even more with the v3 chemistry I think) so it is necessary. The "real" number of cell barcodes per run is closer to 2,000-5,000. If you allow all ~700,000 through, then most of your "cells" have ~1-10 reads which usually get removed from downstream analyses anyway. Min 1000 UMIs is a pretty reasonable threshold. I've seen pipelines that filter for min 1000 UMIs/cell and min 100 genes/cell.
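The barcode filtering described above can be sketched in Python. This is a minimal illustration and not the kmermaid implementation: the function name `filter_cell_barcodes`, the `(barcode, umi)` input pairs, and the `min_umis` parameter are all hypothetical, and the default of 1000 simply mirrors the threshold mentioned above.

```python
from collections import defaultdict


def filter_cell_barcodes(records, min_umis=1000):
    """Return the set of cell barcodes with at least `min_umis` distinct UMIs.

    `records` is any iterable of (barcode, umi) pairs. This input shape is an
    assumption made for illustration, not a real pipeline format.
    """
    umis_per_barcode = defaultdict(set)
    for barcode, umi in records:
        # A set de-duplicates UMIs, so PCR duplicates are counted once.
        umis_per_barcode[barcode].add(umi)
    return {
        barcode
        for barcode, umis in umis_per_barcode.items()
        if len(umis) >= min_umis
    }
```

With a threshold like this, the many barcodes that carry only a handful of reads are dropped before any per-cell files are written.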
This would be great!
Maybe the same as now, where it does the 6-frame translation?
I think this should either:
Interesting! Maybe
Does this assume that the RNA library prep is strand-specific? If it's not strand-specific, would both sides need to be used?
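For reference, six-frame translation reads three frames on each strand, so sequence from either strand gets translated. A rough Python sketch, assuming the standard genetic code and plain A/C/G/T input; the helper names are made up for illustration and this is not sourmash's own translation code:

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate one frame; a trailing partial codon is ignored."""
    usable = len(seq) - len(seq) % 3
    # Codons with non-ACGT characters become "X" (a choice made here).
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


def six_frame_translation(seq):
    """Return three forward frames followed by three reverse frames."""
    seq = seq.upper()
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames
```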
This would be awesome!
👍
👍 EDIT: Forgot trimming & ribosomal RNA removal steps |
OK, after talking to @bluegenes she's 👍 on this functionality and already started using the terminology in slides 😂 So, I'm planning to put this up for merge; remaining items:
ok, so currently:
is identical to
which makes sense! I might have it throw an error anyway, but it's nice the code "just works"... conveniently,
throws an error already :). I'll have to make this a nicer error message, is all!
|
Yikes, a

    self.signatures
        .iter_mut()
        .for_each(|sketch| {
            sketch.add_protein(&seq).unwrap();
        });

I didn't know about

    self.signatures
        .iter_mut()
        .try_for_each(|sketch| {
            sketch.add_protein(&seq)
        })?;

This is going to raise a |
🎉 |
A new sourmash sketch command

In this experimental PR, I implement a new command-line submodule, sketch, which has three main subcommands: sketch dna, sketch protein, and sketch translate.

Note that:
- sketch dna is a replacement for compute.
- sketch protein is a replacement for compute --input-is-protein --no-dna.
- sketch translate replaces compute --protein --no-dna.
- sketch protein --dayhoff|--hp replaces compute --dayhoff|--hp --no-dna --input-is-protein.
- sketch translate --dayhoff|--hp replaces compute --dayhoff|--hp --no-dna.
- the k-mer sizes for sketch protein and sketch translate are now 1/3rd of what they are for compute!!!
- k=31,scaled=1000,noabund
- k=21,scaled=200,noabund
- k=19,scaled=200,noabund
- k=30,scaled=200,noabund
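The ksize change can be illustrated with a small conversion helper. This assumes that compute takes protein ksizes in nucleotides, so the equivalent sketch ksize is the number of amino acids; the function name is hypothetical and not part of sourmash.

```python
def compute_ksize_to_sketch_ksize(compute_ksize):
    """Convert a nucleotide-based protein ksize to an amino acid ksize.

    Assumption: the old value counts nucleotides, three per amino acid.
    """
    if compute_ksize <= 0 or compute_ksize % 3 != 0:
        raise ValueError(
            f"protein ksize {compute_ksize} is not a positive multiple of 3"
        )
    return compute_ksize // 3
```

For example, a ksize of 30 under the old convention corresponds to 10 under the new one.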
Sketch parameters

The <sketch params> arguments cover the common sketch config params: ksize, num/scaled, and track_abund. We use good defaults per #219, and use explicit per-MinHash notation, e.g. -p k=31,scaled=1000,abund to construct each MinHash (with support for multiple -p).

Examples of param strings that work :):
- -p k=31,scaled=1000,abund
- -p k=31,noabund
- -p k=51 - default to a scaled and abund (based on moltype/command)
- -p k=31,k=51,k=21 - compute multiple ksizes, defaults otherwise
- -p k=20,num=500,protein -p k=19,num=400,dayhoff,abund -p k=30,scaled=200,hp -p k=30,scaled=200,seed=58 - computes multiple ksizes, moltypes, scaled/num, and even uses a different seed.

Notes and brainstorming
- sourmash sketch reads, sourmash sketch genome, and sourmash sketch ncbi, that choose good defaults for those kinds of inputs
- scaled=1000,abund
- sourmash sketch 10x too!
- sourmash sketch dayhoff and sourmash sketch hp.
- sourmash sketch 16s (see e.g. classification with full length 16S gene #1000)
- sketch dna -p k=21,protein and sketch protein -p k=21,dna are errors.

Additional thoughts:
- sourmash sketch dna --passthrough <input> | sourmash sketch protein <input> - this would yield two signatures. Not sure how to do the output tho - where does the signature file go? Do we need to specify multiple different -o options?!
- sketch module suitable for direct use by Python, per Idea: imports match CLI? #1112? sourmash.sketch.dna(filenames=...)
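The -p parameter strings described under Sketch parameters, together with the rule that a mismatched moltype is an error, could be handled along these lines. This is only a sketch under stated assumptions, not the PR's parser: the function name, the returned dict layout, the table of moltypes each subcommand accepts, and the num/scaled exclusivity check are all illustrative guesses.

```python
MOLTYPES = ("dna", "protein", "dayhoff", "hp")

# Assumption: which moltypes each subcommand accepts.
ALLOWED_MOLTYPES = {
    "dna": {"dna"},
    "protein": {"protein", "dayhoff", "hp"},
    "translate": {"protein", "dayhoff", "hp"},
}


def parse_param_string(param_str, command):
    """Parse one -p value such as 'k=31,scaled=1000,abund'.

    Values left as None (or empty lists) mean "use the command's default".
    """
    params = {"ksizes": [], "num": None, "scaled": None, "seed": None,
              "track_abundance": None, "moltypes": []}
    for item in param_str.split(","):
        item = item.strip()
        if item in MOLTYPES:
            if item not in ALLOWED_MOLTYPES[command]:
                raise ValueError(f"'{item}' is not valid for sketch {command}")
            params["moltypes"].append(item)
        elif item in ("abund", "noabund"):
            params["track_abundance"] = item == "abund"
        elif item.startswith("k="):
            # k= may repeat, so every ksize is collected.
            params["ksizes"].append(int(item[2:]))
        elif "=" in item:
            key, value = item.split("=", 1)
            if key not in ("num", "scaled", "seed"):
                raise ValueError(f"unknown parameter '{key}'")
            params[key] = int(value)
        else:
            raise ValueError(f"unknown parameter '{item}'")
    # Assumption: a single MinHash uses either num or scaled, not both.
    if params["num"] is not None and params["scaled"] is not None:
        raise ValueError("num and scaled cannot both be set")
    return params
```

Under these assumptions, `parse_param_string("k=21,protein", "dna")` raises a `ValueError`, matching the intent that such combinations are errors.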
- make test Did it pass the tests?
- make coverage Is the new code covered?
- without a major version increment. Changing file formats also requires a major version number increment.
- changes were made?