Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Better molecule type checks #782

Merged
merged 6 commits into from
Dec 11, 2019
Merged

[MRG] Better molecule type checks #782

merged 6 commits into from
Dec 11, 2019

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Nov 30, 2019

This triggers bug #781, still need to fix it

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@ctb
Copy link
Contributor

ctb commented Nov 30, 2019

This is missing tests/test-data/protein_781.sig :)

@ctb
Copy link
Contributor

ctb commented Nov 30, 2019

(added the signature from @bluegenes example)

@luizirber
Copy link
Member Author

luizirber commented Dec 1, 2019

so, I looked into this and... Currently we have two types of preprocessing of sequences (before adding to minhash): DNA and protein. For proteins, we can do three encodings: aminoacid, dayhoff and hp. So, something like this:

  • DNA
  • protein
    • aa
    • dayhoff
    • hp

The main issue for this PR is that load_signatures with a select_moltype=protein will also bring the hp encoding, on top of the aa encoding (the 'default' for protein). I think we should break that hierarchy and do one moltype per encoding (even if aa, dayhoff and hp are all protein).

This is also very relevant for #751.

@luizirber
Copy link
Member Author

The good thing is that molecule is always one of dna, protein, dayhoff or hp in the sig JSON file, so no changes needed there.

We would need to change kmer_min_hash.hh, because nowadays is_protein is true for dayhoff and hp too (and it should be only true for aa?)

@luizirber
Copy link
Member Author

luizirber commented Dec 1, 2019

(also pinging @olgabot and @pranathivemuri to see if this breaks any of their pipelines)

tests/test_bugs.py Outdated Show resolved Hide resolved
@pranathivemuri
Copy link
Contributor

The changes look good to me. I don't see how it might break a pipeline as long as all the tests are passing.

@olgabot
Copy link
Collaborator

olgabot commented Dec 3, 2019

Thanks for finding this! testing it now.

@luizirber
Copy link
Member Author

The changes look good to me. I don't see how it might break a pipeline as long as all the tests are passing.

So.. there is no way of making the tests pass with the changes I proposed. This is the main culprit:
https://github.com/dib-lab/sourmash/blob/57f6efc7dfe298e521529549e588c4745d0bb159/tests/test_sourmash_compute.py#L496

The main change in this PR is making is_protein represent only the protein molecule type (the original aa translation), but for this test the x.minhash.is_molecule_type('protein') returns True for both protein and hp molecule types. That's why I'm asking if this breaks any pipelines, because if it doesn't we change it from == 4 to == 2, and release 2.3.1.

@luizirber luizirber changed the title [WIP] trigger bug 781 [WIP] Better molecule type checks Dec 3, 2019
@codecov
Copy link

codecov bot commented Dec 4, 2019

Codecov Report

Merging #782 into master will increase coverage by 0.02%.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #782      +/-   ##
==========================================
+ Coverage   89.35%   89.38%   +0.02%     
==========================================
  Files          29       29              
  Lines        4605     4608       +3     
  Branches       49       49              
==========================================
+ Hits         4115     4119       +4     
+ Misses        486      485       -1     
  Partials        4        4
Impacted Files Coverage Δ
sourmash/sourmash_args.py 96.72% <100%> (+0.81%) ⬆️
sourmash/signature.py 95.02% <66.66%> (-0.48%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 57f6efc...760d375. Read the comment docs.

@luizirber
Copy link
Member Author

The main change in this PR is making is_protein represent only the protein molecule type (the original aa translation), but for this test the x.minhash.is_molecule_type('protein') returns True for both protein and hp molecule types. That's why I'm asking if this breaks any pipelines, because if it doesn't we change it from == 4 to == 2, and release 2.3.1.

I went ahead changed the test value. That should mean a major version bump, but since mostly @olgabot and team are using this feature I think we should treat as a behavior bug instead and release a quick fix.

Ready for review and merge @ctb @olgabot

@luizirber luizirber changed the title [WIP] Better molecule type checks [MRG] Better molecule type checks Dec 5, 2019
@luizirber luizirber merged commit f7987a5 into master Dec 11, 2019
@luizirber luizirber deleted the bug_781 branch December 11, 2019 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants