[FEATURE] improved/corrected `pdb_selaltloc` tool #156

joaomcteixeira · 2023-03-30T22:16:35Z

Improves pdb_selaltloc tool:

closes pdb_selaltloc removes residues with (only) alternate locations #153
respects all previous tests
add more tests beyond pdb_selaltloc removes residues with (only) alternate locations #153
improved documentation
complete rewrite of the algorithm. It's difficult to follow the track changes, clean file here.

I think we should bump minor because of the importance of this update.
Thanks to everyone participating in the discussions 👏

I cannot add you as a reviewer @mgiulini; consider yourselves a reviewer 😉

TODO: confirm tests

joaomcteixeira · 2023-03-30T22:25:44Z

Actions will work after #157

pdbtools/pdb_selaltloc.py

JoaoRodrigues · 2023-04-02T22:16:01Z

pdbtools/pdb_selaltloc.py

-                altloc_lines = dict()
-                res_per_loc = dict()
+            # flush lines because we enter a new chain
+            if chain != prev_chain:


I'd really avoid storing all residues in a chain, especially since this operation is residue based. You can trigger a flush when you see a new residue. That saves a lot of memory, not that it matters much because chains are tiny, but keeps with the idea of generators/iterators.

Expanding on this a little, I wonder if you can simplify the overall logic by doing the following. All ATOM/HETATM/ANISOU records have the same identifiers (chain, resnum, inscode). You can record these are you iterate since they must be in order. If they're not, it's really not our problem.

Once you hit a new residue, look back at the lines you accumulated for the residue. Track altlocs, if any in any of those lines. If there are any altlocs we want to keep/discard, do it then while flushing and ignoring those lines. This requires 3x iterations for a residue block if there are altlocs - once to store, once to process, once to flush - and you could cut this down to two by doing the processing as you store (basically keep a dict of {pdbname: {altloc: occupancy}}).

I think this might be simpler than what you have and has the added benefit that you also filter ANISOU lines that belong to discarded altlocs. Let me know what you think.

I'd really avoid storing all residues in a chain, especially since this operation is residue based. You can trigger a flush when you see a new residue. That saves a lot of memory, not that it matters much because chains are tiny, but keeps with the idea of generators/iterators.

While this is formally true, it is also true that there are some cases where altlocs are organized by altloc instead of residues. See test example. I know this is not the official format but if we added it to tests it is because we have seen it at some point.

Expanding on this a little (...)

You are right; we may optimize the code further, but I don't know if it can be simplified. In this PR, I aimed to reach the most robust algorithm, disregarding fine-tuned performance. I am positive there won't be memory issues because lines will be flushed on every chain, model or terminal line. Because pdb-tools is for PDBs and not for CIFs, there won't be more than 100k lines stored simultaneously in the dictionary; that's about 10MB of memory.

Considering the pressing nature of this issue, I suggest we merge this PR with [FEATURE] so all bugs are addressed, and then we work calmly on performance issues. I quickly measured the performance on a large PDB:

· time 'pdb_selaltloc 7bqx.pdb' pdb_selaltloc 7bqx.pdb: command not found real 0m0,110s user 0m0,090s sys 0m0,025s

What do you think? 😉 🎾

mgiulini · 2023-07-03T15:34:19Z

hey there, I don't know whether this PR is closed, but I found a bug in the new implementation of the code. Take pdb file 2qxf: if you run the new selaltloc you'll get duplicate rows for all the residues that have 3 alternative locations, two of which with equal probability weight. An example is MET A 235:

ATOM   1924  N  AMET A 235      30.258  16.409   7.504  0.35 13.81           N
ATOM   1925  N  BMET A 235      30.226  16.401   7.511  0.35 13.80           N
ATOM   1926  N  CMET A 235      30.284  16.403   7.495  0.30 13.83           N
ATOM   1927  CA AMET A 235      31.514  15.799   7.009  0.35 13.83           C
ATOM   1928  CA BMET A 235      31.457  15.715   7.065  0.35 13.83           C
ATOM   1929  CA CMET A 235      31.587  15.874   6.971  0.30 13.85           C

becomes

ATOM   1924  N   MET A 235      30.258  16.409   7.504  0.35 13.81           N
ATOM   1925  N   MET A 235      30.226  16.401   7.511  0.35 13.80           N
ATOM   1927  CA  MET A 235      31.514  15.799   7.009  0.35 13.83           C
ATOM   1928  CA  MET A 235      31.457  15.715   7.065  0.35 13.83           C

I think that the problem lies here https://github.com/joaomcteixeira/pdb-tools/blob/0981c5a174e79490068bec327dc337cc3cb492d6/pdbtools/pdb_selaltloc.py#L287-L292

hope it helps!

mgiulini and others added 12 commits February 8, 2023 15:27

first draft of solution to selaltloc bug

3c82825

added test to selaltloc

e172b64

Working beta

9669693

TODO: confirm tests

added function to remove alt loc char

580f292

pass all previous tests

ac38e0f

add missing test pdb

ad4e66d

take tests from @mgiulini

7f0f303

add test from 3B9F suggested by @rvhonorato

ebcffea

add MODEL test

66cec87

clean, docs, test the comments

6c29270

test to flush on chains

6f37e82

remove unnecessary prints

dd27826

joaomcteixeira added bug enhancement labels Mar 30, 2023

joaomcteixeira requested review from JoaoRodrigues, amjjbonvin and rvhonorato March 30, 2023 22:16

joaomcteixeira self-assigned this Mar 30, 2023

joaomcteixeira mentioned this pull request Mar 30, 2023

first draft of solution to selaltloc bug #154

Closed

rvhonorato requested review from mgiulini and removed request for rvhonorato March 31, 2023 09:49

Merge branch 'master' into selaltlocnew

8196f07

JoaoRodrigues requested changes Apr 2, 2023

View reviewed changes

remove strip

0981c5a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] improved/corrected `pdb_selaltloc` tool #156

[FEATURE] improved/corrected `pdb_selaltloc` tool #156

joaomcteixeira commented Mar 30, 2023

joaomcteixeira commented Mar 30, 2023

JoaoRodrigues Apr 2, 2023

JoaoRodrigues Apr 2, 2023

joaomcteixeira Apr 3, 2023

mgiulini commented Jul 3, 2023

[FEATURE] improved/corrected pdb_selaltloc tool #156

Are you sure you want to change the base?

[FEATURE] improved/corrected pdb_selaltloc tool #156

Conversation

joaomcteixeira commented Mar 30, 2023

joaomcteixeira commented Mar 30, 2023

JoaoRodrigues Apr 2, 2023

Choose a reason for hiding this comment

JoaoRodrigues Apr 2, 2023

Choose a reason for hiding this comment

joaomcteixeira Apr 3, 2023

Choose a reason for hiding this comment

mgiulini commented Jul 3, 2023

[FEATURE] improved/corrected `pdb_selaltloc` tool #156

[FEATURE] improved/corrected `pdb_selaltloc` tool #156