ValueError: HMMER added identifier prefixes to alignment because of non-unique sequence identifiers #189

Closed
sacdallago opened this issue Oct 6, 2018 · 7 comments

@sacdallago
Member

Ref: #175 #151

A job for alpha-synuclein failed on compare with the following log:

Traceback (most recent call last):
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/utils/pipeline.py", line 389, in execute_wrapped
    outcfg = execute(**config)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/utils/pipeline.py", line 185, in execute
    outcfg = runner(**incfg)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/compare/protocol.py", line 1044, in run
    return PROTOCOLS[kwargs["protocol"]](**kwargs)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/compare/protocol.py", line 500, in standard
    "prefix": aux_prefix,
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/compare/protocol.py", line 102, in _identify_structures
    **kwargs
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/compare/sifts.py", line 723, in by_alignment
    **kwargs
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/compare/sifts.py", line 172, in find_homologs
    ali = Alignment.from_file(a, "stockholm")
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/align/alignment.py", line 648, in from_file
    raise_hmmer_prefixes=raise_hmmer_prefixes
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/align/alignment.py", line 176, in read_stockholm
    "HMMER added identifier prefixes to alignment because of non-unique "
ValueError: HMMER added identifier prefixes to alignment because of non-unique sequence identifiers. Either some sequence identifier is present twice in the sequence database, or your target sequence identifier is the same as an identifier in the database. In the first case, please fix your sequence database. In the second case, please choose a different sequence identifier for your target sequence that does not overlap with the sequence database.

This is different from the issues reported above, but seems related.

It is worth mentioning that this is running from the branch of #166, which, if I'm not mistaken, doesn't include the changes of #180.

I'll try to merge the latest develop into Management and run this again. Maybe @thomashopf, you can immediately see what is happening and confirm or deny that this is the issue.
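
The error message names two possible causes: an identifier that occurs twice in the sequence database, or a target identifier that also exists in the database. Below is a minimal sketch of how one might check a FASTA file for both, assuming the identifier is the first whitespace-delimited token of each header line. The helper names fasta_ids and find_id_clashes and the toy records are illustrative only and are not part of evcouplings:

```python
from collections import Counter
import io


def fasta_ids(handle):
    """Yield the identifier (first whitespace-delimited token) of each FASTA record."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def find_id_clashes(handle, target_id=None):
    """Return (sorted duplicated identifiers, whether target_id occurs in the file)."""
    counts = Counter(fasta_ids(handle))
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    target_clash = target_id is not None and target_id in counts
    return duplicates, target_clash


# Toy database with one repeated record (sequences shortened for illustration)
example = io.StringIO(
    ">sp|P37840|SYUA_HUMAN\nMDVFMKGLSK\n"
    ">sp|P01112|RASH_HUMAN\nMTEYKLVVVG\n"
    ">sp|P37840|SYUA_HUMAN\nMDVFMKGLSK\n"
)
duplicates, target_clash = find_id_clashes(example, target_id="sp|P01112|RASH_HUMAN")
print(duplicates, target_clash)
```

Running something like this against the database file used by the failing stage would show which of the two cases applies.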

@sacdallago sacdallago added the bug label Oct 6, 2018
@thomashopf
Contributor

thomashopf commented Oct 8, 2018

This is the more informative error message I added as part of the solution to #175 so this appears to be in your branch already.

My guess is that you are still using outdated SIFTS database files that were made with a version that doesn't have the fix (in the fixed version, the sequences in the FASTA file are prefixed with >evsp and >evtr instead of >sp and >tr to avoid the identifier clashes).

If you remake these using the most recent develop or master branch version, the problem should be gone.
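
To illustrate the renaming described above, here is a rough sketch that rewrites >sp and >tr headers to >evsp and >evtr. It is not the code evcouplings uses to build the SIFTS files; the names PREFIX_MAP and rename_header are made up for this example, and it assumes headers of the form >db|accession|name:

```python
import io

# Assumed mapping, taken from the prefixes mentioned in this thread
PREFIX_MAP = {"sp": "evsp", "tr": "evtr"}


def rename_header(line):
    """Swap the database prefix of a FASTA header line; leave other lines unchanged."""
    if not line.startswith(">"):
        return line
    db, sep, rest = line[1:].partition("|")
    if sep and db in PREFIX_MAP:
        return ">" + PREFIX_MAP[db] + sep + rest
    return line


def rewrite_fasta(src, dst):
    """Copy a FASTA stream from src to dst, renaming headers on the way."""
    for line in src:
        dst.write(rename_header(line))


src = io.StringIO(">sp|P01112|RASH_HUMAN\nMTEYKLVVVG\n")
dst = io.StringIO()
rewrite_fasta(src, dst)
rewritten = dst.getvalue()
print(rewritten)
```

Headers that already carry the new prefix pass through untouched, so running it twice on the same file does no harm.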

@sacdallago
Member Author

sacdallago commented Oct 22, 2018

That's fair. Problem: currently the .current version of the databases on o2 points to the 06/2018 release, which is indexed >sp as you correctly noted:

lrwxrwxrwx  1 cd174 marks   67 Aug  9 11:47 pdb_chain_uniprot_plus_current.o2.csv -> /n/groups/marks/databases/SIFTS/pdb_chain_uniprot_plus_2018_6_1.csv
lrwxrwxrwx  1 cd174 marks   66 Aug  9 11:17 pdb_chain_uniprot_plus_current.o2.fa -> /n/groups/marks/databases/SIFTS/pdb_chain_uniprot_plus_2018_7_1.fa
lrwxrwxrwx  1 cd174 marks   69 Aug  9 11:21 pdb_chain_uniprot_plus_current.o2.fasta -> /n/groups/marks/databases/SIFTS/pdb_chain_uniprot_plus_2018_6_1.fasta

there are later versions on o2, but they are broken:

[cd174@login03 SIFTS]$ more pdb_chain_uniprot_plus_2018_9_1.fasta
<!DOCTYPE html SYSTEM "about:legacy-compat">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>Retrieve/ID mapping</title><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><meta content="width=device-width, initial-scale=1" name="viewport"/><link href="/" rel="home"/><link href="http://creativecommons.org/licenses/by-nd/3.0/" rel="license"/><link type="image/vnd.microsoft.icon" href="/favicon.ico" rel="shortcut icon"/><link href="/uniprot.min.css2018_07" type="text/css" rel="stylesheet"/><script type="text/javascript">
			var BASE = '/';
			var isInternal = false;
		</script><script src="/scripts/frontier/d3/d3.v3.min.js" type="text/javascript"></script><script src="/js-compr.js2018_07" type="text/javascript"></script><script type="text/javascript">
				uniprot.namespace = 'uniprot';
				onRdyFn(function(){
					if (false) {
						var shouldShowBasket = true;
						switch(uniprot.namespace) {
....

I suppose the db_update script needs to be rewritten, in addition to being moved to o2's cron (vs. orchestra). This ties in with #179 and #178.

Pinging: @b-schubert @aggreen

This needs fixing ASAP because about 50% of the runs that I see being submitted via web fail at this stage.
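
Since the broken files above contain an HTML page where FASTA records were expected, a cheap sanity check before repointing anything could look like the sketch below. It only assumes that a valid FASTA file starts with ">" after optional whitespace; looks_like_fasta is a hypothetical helper, not something the update script provides:

```python
import os
import tempfile


def looks_like_fasta(path, probe_bytes=512):
    """Return True if the first non-whitespace byte of the file is '>'."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes).lstrip()
    return head.startswith(b">")


# Demonstration on two throwaway files
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.fasta")
    bad = os.path.join(tmp, "bad.fasta")
    with open(good, "w") as f:
        f.write(">evsp|P01112|RASH_HUMAN\nMTEYKLVVVG\n")
    with open(bad, "w") as f:
        f.write('<!DOCTYPE html SYSTEM "about:legacy-compat">\n<html></html>\n')
    good_ok = looks_like_fasta(good)
    bad_ok = looks_like_fasta(bad)

print(good_ok, bad_ok)
```

This would not catch every kind of corruption, but it would have flagged the HTML error page shown in the listing.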

@sacdallago
Member Author

sacdallago commented Oct 22, 2018

Plan:

High prio

  • Update stable&unstable conda envs on o2
  • Manually update dbs on o2 and create new symlinks for .currents

Medium prio

  • Figure out where script for monthly updates is located
  • Move the monthly update to o2 (vs. orchestra)
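
For the manual symlink step in the plan above, one possible approach is to create the new link under a temporary name and rename it over the old one, so that jobs never see a missing .current file. This is a sketch under the assumption of a POSIX filesystem (where renaming over an existing link is atomic); repoint_symlink and the file names are placeholders:

```python
import os
import tempfile


def repoint_symlink(link_path, new_target):
    """Point link_path at new_target by creating a temporary link and renaming it."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    tmp_link = os.path.join(link_dir, "." + os.path.basename(link_path) + ".tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up a leftover from an interrupted run
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)  # rename replaces the old link in one step


# Demonstration in a scratch directory with placeholder file names
with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "db_old.fasta")
    new = os.path.join(tmp, "db_new.fasta")
    for path in (old, new):
        with open(path, "w") as f:
            f.write(">placeholder\nAAAA\n")
    link = os.path.join(tmp, "db_current.fasta")
    os.symlink(old, link)
    repoint_symlink(link, new)
    points_to_new = os.readlink(link) == new

print(points_to_new)
```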

@sacdallago
Member Author

sacdallago commented Oct 22, 2018

Update: I mistakenly committed directly to develop, but it's a super small update, see 1f9f35d

Necessary, because evcouplings_dbupdate -v --sifts /n/groups/marks/databases/SIFTS/ will result in:

Updating SIFTS
Updating uniprot
Traceback (most recent call last):
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 1226, in mkdir
    self._accessor.mkdir(self, mode)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: '/groups/marks/databases/jackhmmer/uniprot'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 1226, in mkdir
    self._accessor.mkdir(self, mode)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: '/groups/marks/databases/jackhmmer'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 1226, in mkdir
    self._accessor.mkdir(self, mode)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: '/groups/marks/databases'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 1226, in mkdir
    self._accessor.mkdir(self, mode)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: '/groups/marks'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/bin/evcouplings_dbupdate", line 11, in <module>
    load_entry_point('evcouplings==0.0.5', 'console_scripts', 'evcouplings_dbupdate')()
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/site-packages/evcouplings/utils/update_database.py", line 189, in app
    run(**kwargs)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/site-packages/evcouplings/utils/update_database.py", line 149, in run
    dir.mkdir(parents=True, exist_ok=True)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 1230, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 1230, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 1230, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 1226, in mkdir
    self._accessor.mkdir(self, mode)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_stable/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
PermissionError: [Errno 13] Permission denied: '/groups'

which apparently means that one way or another, evcouplings_dbupdate will always try to update all dbs, instead of selectively going for the ones that are defined.
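
A sketch of the selective behaviour one would expect instead: only databases whose target directory was given explicitly get updated, and no default path is filled in for the others. This is not the actual evcouplings_dbupdate code or the change in 1f9f35d; run_selected, the updaters mapping, and the stub functions are all hypothetical:

```python
def run_selected(requested, updaters):
    """Run only the updaters whose target directory was explicitly provided.

    requested maps a database name to a directory, or to None if it was not requested.
    updaters maps a database name to a callable that takes the directory.
    """
    updated = []
    for name, target_dir in requested.items():
        if target_dir is None:
            continue  # not requested, so do not fall back to any default path
        updaters[name](target_dir)
        updated.append(name)
    return updated


# Stub updaters that only record how they were called
calls = []
updaters = {
    "sifts": lambda path: calls.append(("sifts", path)),
    "uniprot": lambda path: calls.append(("uniprot", path)),
}
requested = {"sifts": "/n/groups/marks/databases/SIFTS/", "uniprot": None}
updated = run_selected(requested, updaters)
print(updated, calls)
```

With only the SIFTS directory supplied, the uniprot updater is never called, so no attempt is made to create a directory under /groups.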

@sacdallago
Member Author

This is now done. If this problem persists, I'll re-open.

@sacdallago
Member Author

Seems like this is still a problem?

Traceback (most recent call last):
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/utils/pipeline.py", line 389, in execute_wrapped
    outcfg = execute(**config)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/utils/pipeline.py", line 185, in execute
    outcfg = runner(**incfg)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/align/protocol.py", line 1673, in run
    return PROTOCOLS[kwargs["protocol"]](**kwargs)
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/align/protocol.py", line 1471, in standard
    ali_raw = Alignment.from_file(a, "stockholm")
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/align/alignment.py", line 648, in from_file
    raise_hmmer_prefixes=raise_hmmer_prefixes
  File "/n/groups/marks/software/anaconda_o2/envs/evcouplings_backend_develop/lib/python3.5/site-packages/evcouplings/align/alignment.py", line 176, in read_stockholm
    "HMMER added identifier prefixes to alignment because of non-unique "
ValueError: HMMER added identifier prefixes to alignment because of non-unique sequence identifiers. Either some sequence identifier is present twice in the sequence database, or your target sequence identifier is the same as an identifier in the database. In the first case, please fix your sequence database. In the second case, please choose a different sequence identifier for your target sequence that does not overlap with the sequence database.

This happened for just one threshold in a web-submitted job (the two other ones went through, so maybe it's just a problem in the SIFTS file?). See /n/scratch2/d0aa2114a3501fe4cac5724db04eefed, file: d0aa2114a3501fe4cac5724db04eefed_b0.4.failed.

Using all the .current databases and the latest version from this repo... So, I don't know :(

@thomashopf
Contributor

Nope, if you check the stack trace, this happens in the align stage. The reason is that a full Uniprot identifier (like sp|P01112|RASH_HUMAN) is searched against Uniprot, and the region of the hit is exactly that of the query, which triggers the HMMER renumbering.

I mentioned this problem and the solution in the original issue #175 (comment) - fixing the identifier on the server is one line of code vs. introducing massive ugliness in the pipeline, so this one is on the server.
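
As an illustration of what such a server-side fix could look like (the actual fix is described in #175 and is not reproduced here), the sketch below reduces a full Uniprot-style identifier to its bare accession before it is used as the target sequence identifier. It assumes the database identifiers keep the full db|accession|name form, so the bare accession cannot collide with them; safe_target_id is a hypothetical name:

```python
def safe_target_id(raw_id):
    """Return the bare accession for identifiers of the form db|accession|name.

    Identifiers in any other form are returned unchanged.
    """
    parts = raw_id.split("|")
    if len(parts) == 3 and parts[0] in ("sp", "tr") and parts[1]:
        return parts[1]
    return raw_id


full_id = "sp|P01112|RASH_HUMAN"
target_id = safe_target_id(full_id)
print(target_id)
```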
