Improve altloc handling #263

Merged
merged 87 commits into from
Mar 11, 2023
54613aa
Fix bug in `add_k_nn_edges`.
anton-bushuiev Nov 2, 2022
1de248f
Extend `add_k_nn_edges`.
anton-bushuiev Nov 2, 2022
27ee8af
Add types to docstring
a-r-j Nov 2, 2022
018fd9c
Update changelog
a-r-j Nov 2, 2022
71fee7a
Add `kind_name` argument
anton-bushuiev Nov 2, 2022
74968ce
Test `filter_distmat`
anton-bushuiev Nov 3, 2022
77a89c6
Merge branch 'master' of https://github.com/anton-bushuiev/graphein
anton-bushuiev Nov 3, 2022
c91cede
Merge branch 'a-r-j:master' into master
anton-bushuiev Nov 3, 2022
713d0e3
Merge branch 'master' of https://github.com/anton-bushuiev/graphein
anton-bushuiev Nov 3, 2022
beb15d3
Set default value of `long_interaction_threshold` to 0
anton-bushuiev Nov 3, 2022
584c9f9
Fix filtering bug in `add_k_nn_edges`
anton-bushuiev Nov 4, 2022
b9cc99b
Test `add_k_nn_edges`
anton-bushuiev Nov 4, 2022
fd1b36b
Refactor with `add_edge`
anton-bushuiev Nov 4, 2022
fdc8b96
Fix bug for empty `edges_to_excl`
anton-bushuiev Nov 10, 2022
5075462
Improve `convert_nx_to_pyg`
anton-bushuiev Nov 10, 2022
21f10a1
Fix bug in `plot_pyg_data`
anton-bushuiev Nov 10, 2022
febaa2b
Test `convert_nx_to_pyg` on multimers
anton-bushuiev Nov 10, 2022
48941fa
Merge
anton-bushuiev Nov 10, 2022
e856693
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 10, 2022
9b89b44
Update `CHANGELOG.md`
anton-bushuiev Nov 10, 2022
c3a5e84
Merge branch 'master' of https://github.com/anton-bushuiev/graphein
anton-bushuiev Nov 10, 2022
a80a387
Fix version in `CHANGELOG.md`
anton-bushuiev Nov 10, 2022
629a61c
Handle corner cases
anton-bushuiev Nov 10, 2022
f1fcc29
Handle NaNs in coordinatess
anton-bushuiev Nov 10, 2022
f54a41f
Add PyG install to CI
a-r-j Nov 13, 2022
05f2ef0
typo in CI config
a-r-j Nov 13, 2022
b5156d8
bump torch versions in CI
a-r-j Nov 13, 2022
7f8c9c1
make pyg-related tests conditional pyg installation
a-r-j Nov 13, 2022
daa5c96
Try fixing graph attributes
a-r-j Dec 7, 2022
421e628
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 7, 2022
6b126a8
Merge branch 'master' into master
a-r-j Dec 18, 2022
326442d
Fix typo and extend amino acid 3to1, 1to3 mappings
anton-bushuiev Jan 25, 2023
b3dc713
Merge remote-tracking branch 'origin/master'
anton-bushuiev Jan 25, 2023
decac66
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 25, 2023
fb42f6b
Merge branch 'a-r-j:master' into master
anton-bushuiev Jan 25, 2023
57d0c97
Adapt imports of amino acid codes
anton-bushuiev Jan 26, 2023
e16fd21
Merge branch 'master' of https://github.com/anton-bushuiev/graphein
anton-bushuiev Jan 26, 2023
ed1504a
Merge branch 'a-r-j:master' into master
anton-bushuiev Jan 28, 2023
9ac298a
add semicolon to version
a-r-j Jan 28, 2023
82c6e7f
remove wildcard version number for pyyaml
a-r-j Jan 28, 2023
5e80ee0
Merge branch 'master' into master
a-r-j Jan 30, 2023
eca0cfa
fix typo
a-r-j Jan 30, 2023
3930a53
fix additonal typos
a-r-j Jan 30, 2023
a5903cf
Extend aggregation to vectors
anton-bushuiev Feb 9, 2023
e39a5a7
Implement `aggregate_feature_over_residues`
anton-bushuiev Feb 9, 2023
7179825
Merge remote-tracking branch 'origin/master'
anton-bushuiev Feb 9, 2023
53430aa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 9, 2023
edd58ef
Add docstring and aggregation type
a-r-j Feb 9, 2023
cc8fa07
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 9, 2023
e538cc8
import literal from typing extensions
a-r-j Feb 9, 2023
df9f9fe
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 9, 2023
c459122
Merge branch 'a-r-j:master' into master
anton-bushuiev Feb 9, 2023
9092358
Add missing `median` in exception message
anton-bushuiev Feb 9, 2023
dc679ae
Fix `nullcontext`
anton-bushuiev Feb 9, 2023
bd1f4fa
fix dataset test
a-r-j Feb 9, 2023
d1f1c8c
fix division by zero errors in edge colouring
a-r-j Feb 9, 2023
00f99fd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 9, 2023
c3f8554
update changlelog
a-r-j Feb 10, 2023
2ed033b
Separate and improve `remove_alt_locs`
anton-bushuiev Feb 10, 2023
042819f
Test `remove_alt_locs`
anton-bushuiev Feb 10, 2023
bf107b6
Merge branch 'a-r-j:master' into master
anton-bushuiev Feb 10, 2023
23df2a7
Merge branch 'master' of https://github.com/anton-bushuiev/graphein
anton-bushuiev Feb 10, 2023
a4b3dc1
Rename test
anton-bushuiev Feb 10, 2023
0bdaf0a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 10, 2023
7fc256f
Set `insertions=True` by default
anton-bushuiev Feb 10, 2023
e2b93b5
Merge remote-tracking branch 'origin/master'
anton-bushuiev Feb 10, 2023
ecf3b22
Make `alt_locs` configurable (TODO `include` case)
anton-bushuiev Feb 10, 2023
1df862f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 10, 2023
ec1d31c
use typing_extensions literal for 3.7 compatibility
a-r-j Feb 10, 2023
6d99475
use typing extensions literal for 3.7 compatibility
a-r-j Feb 10, 2023
f455288
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 10, 2023
3943541
Merge branch 'a-r-j:master' into master
anton-bushuiev Feb 13, 2023
28f9538
Merge branch 'master' into master
a-r-j Mar 11, 2023
5abd73a
improve hbond donor/acceptor assignment robustnness
a-r-j Mar 11, 2023
219d471
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 11, 2023
39c66b8
replace trailing ":" in insertions
a-r-j Mar 11, 2023
365a1d6
fix test and hbond granularity inference
a-r-j Mar 11, 2023
8f01ec3
Add altloc identifer to node ID
a-r-j Mar 11, 2023
09ef16a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 11, 2023
2ac0c3b
fix tests
a-r-j Mar 11, 2023
9548cd3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 11, 2023
43fae72
fix test
a-r-j Mar 11, 2023
1021eab
fix test
a-r-j Mar 11, 2023
7850d53
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 11, 2023
f6563b8
actually fix test
a-r-j Mar 11, 2023
d29ac48
update changelog
a-r-j Mar 11, 2023
ad7dff4
Fix typo
anton-bushuiev Mar 11, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -13,6 +13,7 @@
* [Logging] - [#242](https://github.com/a-r-j/graphein/pull/242) Adds control of protein graph construction logging. Resolves [#238](https://github.com/a-r-j/graphein/issues/238)

#### Protein
* [Feature] - [#263](https://github.com/a-r-j/graphein/pull/263) Adds control of Alt Loc selection strategy. N.b. Default `ProteinGraphConfig` changed to include insertions by default (`insertions=True`) and `alt_locs="max_occupancy"`.
* [Feature] - [#264](https://github.com/a-r-j/graphein/pull/264) Adds entrypoint to `graphein.protein.graphs.construct_graph` for passing in a BioPandas dataframe directly.
* [Feature] - [#229](https://github.com/a-r-j/graphein/pull/220) Adds support for filtering KNN edges based on self-loops and chain membership. Contribution by @anton-bushuiev.
* [Feature] - [#234](https://github.com/a-r-j/graphein/pull/234) Adds support for aggregating node features over residues (`graphein.protein.features.sequence.utils.aggregate_feature_over_residues`).
28 changes: 26 additions & 2 deletions graphein/protein/config.py
@@ -12,7 +12,7 @@
from typing import Any, Callable, List, Optional, Union

from deepdiff import DeepDiff
-from pydantic import BaseModel
from pydantic import BaseModel, validator
from typing_extensions import Literal

from graphein.protein.edges.distance import add_peptide_bonds
@@ -97,6 +97,14 @@ class GetContactsConfig(BaseModel):
GranularityOpts = Literal["atom", "centroids"]
"""Allowable granularity options for nodes in the graph."""

AltLocsOpts = Union[
bool,
Literal[
"max_occupancy", "min_occupancy", "first", "last", "exclude", "include"
],
]
"""Allowable altlocs options for alternative locations handling."""


class ProteinGraphConfig(BaseModel):
"""
@@ -118,6 +126,12 @@ class ProteinGraphConfig(BaseModel):
:type keep_hets: List[str]
:param insertions: Controls whether or not insertions are allowed.
:type insertions: bool
:param alt_locs: Controls how alternative locations are handled. The supported values are
``"max_occupancy"``, ``"min_occupancy"``, ``"first"``, ``"last"``, ``"exclude"`` and ``"include"``. The first
two keep the altlocs with the highest/lowest occupancies; the next two keep the first/last in PDB file order.
``"exclude"`` drops atoms that have altlocs entirely and ``"include"`` keeps all of them. Boolean values are
aliases for the last two options: ``False`` for ``"exclude"`` and ``True`` for ``"include"``. Default is ``"max_occupancy"``.
:type alt_locs: AltLocsOpts
:param pdb_dir: Specifies path to download protein structures into.
:type pdb_dir: pathlib.Path. Optional.
:param verbose: Specifies verbosity of graph creation process.
@@ -151,7 +165,8 @@ class ProteinGraphConfig(BaseModel):

granularity: Union[GraphAtoms, GranularityOpts] = "CA"
keep_hets: List[str] = []
-insertions: bool = False
insertions: bool = True
alt_locs: AltLocsOpts = "max_occupancy"
pdb_dir: Optional[Path] = None
verbose: bool = False
exclude_waters: bool = True
@@ -172,6 +187,15 @@ class ProteinGraphConfig(BaseModel):
get_contacts_config: Optional[GetContactsConfig] = None
dssp_config: Optional[DSSPConfig] = None

@validator("alt_locs")
def convert_alt_locs_aliases(cls, v):
if v is True:
return "include"
elif v is False:
return "exclude"
else:
return v

def __eq__(self, other: Any) -> bool:
"""Overwrites the BaseModel __eq__ function in order to check more specific cases (like partial functions)."""
if isinstance(other, ProteinGraphConfig):
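
The alias handling added by the `convert_alt_locs_aliases` validator can be illustrated without pydantic. The following is a minimal plain-Python sketch under that assumption; `normalize_alt_locs` and `ALT_LOC_OPTIONS` are hypothetical names used only for illustration and are not part of Graphein. Unlike the validator in the diff, the sketch also rejects unknown strings itself, a job that pydantic does through the `AltLocsOpts` type.

```python
from typing import Union

# Option strings listed in ``AltLocsOpts`` in the diff above.
ALT_LOC_OPTIONS = (
    "max_occupancy",
    "min_occupancy",
    "first",
    "last",
    "exclude",
    "include",
)


def normalize_alt_locs(value: Union[bool, str]) -> str:
    """Map boolean aliases to option strings (hypothetical helper)."""
    # Identity checks mirror the validator: only real booleans are aliases.
    if value is True:
        return "include"
    if value is False:
        return "exclude"
    if value not in ALT_LOC_OPTIONS:
        raise ValueError(f"Unsupported alt_locs option: {value!r}")
    return value


print(normalize_alt_locs(True))             # include
print(normalize_alt_locs(False))            # exclude
print(normalize_alt_locs("max_occupancy"))  # max_occupancy
```

Because the booleans are normalized up front, downstream code only needs to compare against the string options.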
44 changes: 24 additions & 20 deletions graphein/protein/features/nodes/amino_acid.py
@@ -200,20 +200,22 @@ def hydrogen_bond_donor(
returns a ``pd.Series``. Default is ``True``.
:type return_array: bool
"""
-node_id = n.split(":")
-res = node_id[1]
res = d["residue_name"]

-if len(node_id) == 4: # Atomic graph
-atom = node_id[-1]
# Hack to determine graph type
# If last ID component is atom type, assume graph is atomic
if n.split(":")[-1] == d["atom_type"]:
granularity = "atom"
else:
granularity = "residue"

if granularity == "atom": # Atomic graph
atom = d["atom_type"]
try:
features = HYDROGEN_BOND_DONORS[res][atom]
except KeyError:
-try: # Handle insertions
-atom = node_id[-2]
-features = HYDROGEN_BOND_DONORS[res][atom]
-except KeyError:
-features = 0
-elif len(node_id) == 3: # Residue graph
features = 0
else: # Residue graph
if res not in HYDROGEN_BOND_DONORS.keys():
features = 0
else:
@@ -247,19 +249,21 @@ def hydrogen_bond_acceptor(
returns a ``pd.Series``. Default is ``True``.
:type return_array: bool
"""
-node_id = n.split(":")
-res = node_id[1]
-if len(node_id) == 4: # Atomic graph
-atom = node_id[-1]
res = d["residue_name"]
# Hack to determine graph type
# If last ID component is atom type, assume graph is atomic
if n.split(":")[-1] == d["atom_type"]:
granularity = "atom"
else:
granularity = "residue"

if granularity == "atom": # Atomic graph
atom = d["atom_type"]
try:
features = HYDROGEN_BOND_ACCEPTORS[res][atom]
except KeyError:
-try: # Handle insertions
-atom = node_id[-2]
-features = HYDROGEN_BOND_ACCEPTORS[res][atom]
-except KeyError:
-features = 0
-elif len(node_id) == 3: # Residue graph
features = 0
else: # Residue graph
if res not in HYDROGEN_BOND_ACCEPTORS.keys():
features = 0
else:
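
The granularity check that both hydrogen-bond functions now share can be isolated as follows. This is a minimal sketch, assuming node IDs are colon-separated and that the node attribute dict carries an `atom_type` entry as in the diff; `infer_granularity` is a hypothetical helper name, not a Graphein function.

```python
def infer_granularity(node_id: str, node_data: dict) -> str:
    """Guess graph granularity from a node ID and its attribute dict."""
    # Atom-level node IDs end with the atom type, e.g. "A:SER:10:OG".
    # Residue-level IDs end with a residue number, insertion code or altloc.
    if node_id.split(":")[-1] == node_data["atom_type"]:
        return "atom"
    return "residue"


print(infer_granularity("A:SER:10:OG", {"atom_type": "OG"}))  # atom
print(infer_granularity("A:SER:10", {"atom_type": "CA"}))     # residue
print(infer_granularity("A:SER:10:A", {"atom_type": "CA"}))   # residue
```

The diff itself labels this a hack. One plausible weakness is a residue-level ID whose trailing insertion or altloc code happens to equal the node's atom type, which would be read as atomic.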
68 changes: 60 additions & 8 deletions graphein/protein/graphs.py
@@ -20,6 +20,7 @@
from loguru import logger as log
from rich.progress import Progress
from tqdm.contrib.concurrent import process_map
from typing_extensions import Literal

from graphein.protein.config import (
DSSPConfig,
@@ -151,6 +152,11 @@ def label_node_id(

if insertions:
df["node_id"] = df["node_id"] + ":" + df["insertion"].apply(str)
# Remove the trailing ":" left when there is no insertion code
df["node_id"] = df["node_id"].str.replace(":$", "", regex=True)
# Add alt loc identifiers
df["node_id"] = df["node_id"] + ":" + df["alt_loc"].apply(str)
# Remove the trailing ":" left when there is no alt loc
df["node_id"] = df["node_id"].str.replace(":$", "", regex=True)
df["residue_id"] = df["node_id"]
if granularity == "atom":
df["node_id"] = df["node_id"] + ":" + df["atom_name"]
@@ -222,7 +228,49 @@ def subset_structure_to_atom_type(
)


-def remove_insertions(df: pd.DataFrame, keep: str = "first") -> pd.DataFrame:
def remove_alt_locs(
    df: pd.DataFrame,
    keep: Literal[
        "max_occupancy", "min_occupancy", "first", "last", "exclude"
    ] = "max_occupancy",
) -> pd.DataFrame:
    """
    This function removes alternatively located atoms from PDB DataFrames
    (see https://proteopedia.org/wiki/index.php/Alternate_locations). Which of
    the alternative locations is kept is controlled by ``keep``.

    :param df: Protein structure dataframe to remove alternatively located
        atoms from.
    :type df: pd.DataFrame
    :param keep: Controls which altloc to keep: highest/lowest occupancy,
        first/last in file order, or none (``"exclude"``). Default is
        ``"max_occupancy"``.
    :type keep: Literal["max_occupancy", "min_occupancy", "first", "last", "exclude"]
    :return: Protein structure dataframe with alternatively located atoms
        removed.
    :rtype: pd.DataFrame
    """
    # Sort by occupancy so that "first"/"last" selects the min/max occupancy
    sort_by_occupancy = keep in ["max_occupancy", "min_occupancy"]
    if keep == "max_occupancy":
        df = df.sort_values("occupancy")
        keep = "last"
    elif keep == "min_occupancy":
        df = df.sort_values("occupancy")
        keep = "first"
    elif keep == "exclude":
        keep = False

    # Filter
    duplicates = df.duplicated(
        subset=["chain_id", "residue_number", "atom_name", "insertion"],
        keep=keep,
    )
    df = df[~duplicates]

    # Restore the original row order. The flag is needed because ``keep``
    # has been reassigned above.
    if sort_by_occupancy:
        df = df.sort_index()

    return df


def remove_insertions(
df: pd.DataFrame, keep: Literal["first", "last"] = "first"
) -> pd.DataFrame:
"""
This function removes insertions from PDB DataFrames.

@@ -231,13 +279,14 @@ def remove_insertions(df: pd.DataFrame, keep: str = "first") -> pd.DataFrame:
:param keep: Specifies which insertion to keep. Options are ``"first"`` or
``"last"``.
Default is ``"first"``
-:type keep: str
:type keep: Literal["first", "last"]
:return: Protein structure dataframe with insertions removed
:rtype: pd.DataFrame
"""
# Catches unnamed insertions
duplicates = df.duplicated(
-subset=["chain_id", "residue_number", "atom_name"], keep=keep
subset=["chain_id", "residue_number", "atom_name", "alt_loc"],
keep=keep,
)
df = df[~duplicates]

@@ -246,11 +295,6 @@ def remove_insertions(df: pd.DataFrame, keep: str = "first") -> pd.DataFrame:
df, by_column="insertion", list_of_values=[""], boolean=True
)

-# Remove alt_locs
-df = filter_dataframe(
-df, by_column="alt_loc", list_of_values=["", "A"], boolean=True
-)

return df


@@ -275,6 +319,7 @@ def process_dataframe(
granularity: str = "centroids",
chain_selection: str = "all",
insertions: bool = False,
alt_locs: bool = False,
deprotonate: bool = True,
keep_hets: List[str] = [],
verbose: bool = False,
@@ -303,6 +348,8 @@ def process_dataframe(
:type granularity: str
:param insertions: Whether or not to keep insertions.
:type insertions: bool
:param alt_locs: Whether or not to keep alternatively located atoms.
:type alt_locs: bool
:param deprotonate: Whether or not to remove hydrogen atoms (i.e.
deprotonation).
:type deprotonate: bool
@@ -372,6 +419,10 @@ def process_dataframe(
protein_df = atoms

# Remove alt_loc residues
if alt_locs != "include":
protein_df = remove_alt_locs(protein_df, keep=alt_locs)

# Remove inserted residues
if not insertions:
protein_df = remove_insertions(protein_df)

@@ -763,6 +814,7 @@ def construct_graph(
chain_selection=chain_selection,
granularity=config.granularity,
insertions=config.insertions,
alt_locs=config.alt_locs,
keep_hets=config.keep_hets,
)
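
The selection rule behind `remove_alt_locs` can be sketched in plain Python, without pandas. This is only an illustration under stated assumptions: `select_alt_locs` is a hypothetical helper, atoms are represented as dicts keyed by the same column names as the dataframe, and the occupancy values below are made up. Tie-breaking between equal occupancies may differ from the sort-based pandas implementation.

```python
from typing import Dict, List


def select_alt_locs(atoms: List[dict], keep: str = "max_occupancy") -> List[dict]:
    """Keep one record per (chain, residue number, atom name, insertion)."""
    # Group records that describe the same atom at alternative locations.
    groups: Dict[tuple, List[dict]] = {}
    for atom in atoms:
        key = (
            atom["chain_id"],
            atom["residue_number"],
            atom["atom_name"],
            atom["insertion"],
        )
        groups.setdefault(key, []).append(atom)

    selected = []
    for records in groups.values():
        if keep == "include":
            selected.extend(records)
        elif keep == "exclude":
            # Drop every atom that has alternative locations.
            if len(records) == 1:
                selected.extend(records)
        elif keep == "first":
            selected.append(records[0])
        elif keep == "last":
            selected.append(records[-1])
        elif keep == "max_occupancy":
            selected.append(max(records, key=lambda a: a["occupancy"]))
        elif keep == "min_occupancy":
            selected.append(min(records, key=lambda a: a["occupancy"]))
        else:
            raise ValueError(f"Unknown option: {keep!r}")
    return selected


def make_atom(number: int, alt_loc: str, occupancy: float) -> dict:
    """Build an illustrative CA atom record (values are made up)."""
    return {
        "chain_id": "A",
        "residue_number": number,
        "atom_name": "CA",
        "insertion": "",
        "alt_loc": alt_loc,
        "occupancy": occupancy,
    }


atoms = [make_atom(194, "", 1.0), make_atom(195, "A", 0.6), make_atom(195, "B", 0.4)]

print([a["alt_loc"] for a in select_alt_locs(atoms, "max_occupancy")])  # ['', 'A']
print([a["alt_loc"] for a in select_alt_locs(atoms, "min_occupancy")])  # ['', 'B']
print([a["alt_loc"] for a in select_alt_locs(atoms, "exclude")])        # ['']
```

Note that `"exclude"` removes both records of residue 195 and does not fall back to one of them, which matches `DataFrame.duplicated(keep=False)` marking every duplicate.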

71 changes: 70 additions & 1 deletion tests/protein/test_graphs.py
@@ -4,6 +4,7 @@
from pathlib import Path

import networkx as nx
import numpy as np
import pytest

from graphein.protein.config import DSSPConfig, ProteinGraphConfig
@@ -356,11 +357,12 @@ def test_graph_sequence_feature():
assert g_atom.graph[f"sequence_{c}"] == g_res.graph[f"sequence_{c}"]


-def test_insertion_handling():
def test_insertion_and_alt_loc_handling():
configs = {
"granularity": "CA",
"keep_hets": [],
"insertions": False,
"alt_locs": "max_occupancy",
"verbose": False,
"node_metadata_functions": [meiler_embedding, expasy_protein_scale],
"edge_construction_functions": [
@@ -384,6 +386,73 @@ def test_insertion_handling():
assert g.graph["coords"].shape[0] == len(g)


def test_alt_loc_exclusion():
configs = {
"granularity": "CA",
"keep_hets": [],
"insertions": True,
"alt_locs": "max_occupancy",
"verbose": False,
"node_metadata_functions": [meiler_embedding, expasy_protein_scale],
"edge_construction_functions": [
add_peptide_bonds,
add_hydrogen_bond_interactions,
add_ionic_interactions,
add_aromatic_sulphur_interactions,
add_hydrophobic_interactions,
add_cation_pi_interactions,
],
}

config = ProteinGraphConfig(**configs)

# This is a PDB with three altlocs
g = construct_graph(config=config, pdb_code="2VVI")

# Test altlocs are dropped
assert len(set(g.nodes())) == len(g.nodes())

# Test the correct one is left
for opt, expected_coords, node_id in (
("max_occupancy", [5.850, -9.326, -42.884], "A:CYS:195:A"),
("min_occupancy", [5.864, -9.355, -42.943], "A:CYS:195:B"),
("first", [5.850, -9.326, -42.884], "A:CYS:195:A"),
("last", [5.864, -9.355, -42.943], "A:CYS:195:B"),
):
config.alt_locs = opt
g = construct_graph(config=config, pdb_code="2VVI")
assert np.array_equal(g.nodes[node_id]["coords"], expected_coords)


def test_alt_loc_inclusion():
configs = {
"granularity": "CA",
"keep_hets": [],
"insertions": False,
"alt_locs": True,
"verbose": False,
"node_metadata_functions": [meiler_embedding, expasy_protein_scale],
"edge_construction_functions": [
add_peptide_bonds,
add_hydrogen_bond_interactions,
add_ionic_interactions,
add_aromatic_sulphur_interactions,
add_hydrophobic_interactions,
add_cation_pi_interactions,
],
}

config = ProteinGraphConfig(**configs)

# This is a PDB with an altloc leading to different residues
g = construct_graph(config=config, pdb_code="1ALX")

# Test both are present
assert "A:TYR:11:A" in g.nodes() and "A:TRP:11:B" in g.nodes()

# TODO Test on other PDBs where altlocs are of the same residues

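The node IDs asserted in these tests, such as `A:CYS:195:A`, follow from the `label_node_id` change: the insertion code and then the altloc identifier are appended, and an empty component leaves no trailing colon. The following is a rough single-string sketch of that scheme using the standard `re` module in place of pandas string methods; `build_node_id` is a hypothetical helper, and the `chain:residue_name:residue_number` base format is assumed from the IDs in the tests.

```python
import re
from typing import Optional


def build_node_id(
    chain_id: str,
    residue_name: str,
    residue_number: int,
    insertion: str = "",
    alt_loc: str = "",
    atom_name: Optional[str] = None,
) -> str:
    """Assemble a node ID from its components (hypothetical helper)."""
    node_id = f"{chain_id}:{residue_name}:{residue_number}"
    # Append the insertion code; drop the trailing ":" if it was empty.
    node_id = re.sub(":$", "", f"{node_id}:{insertion}")
    # Append the altloc identifier in the same way.
    node_id = re.sub(":$", "", f"{node_id}:{alt_loc}")
    # Atom-level graphs additionally append the atom name.
    if atom_name is not None:
        node_id = f"{node_id}:{atom_name}"
    return node_id


print(build_node_id("A", "CYS", 195, alt_loc="A"))                   # A:CYS:195:A
print(build_node_id("A", "CYS", 195))                                # A:CYS:195
print(build_node_id("A", "SER", 10, insertion="B", atom_name="OG"))  # A:SER:10:B:OG
```

The `":$"` pattern is a regular expression, so an equivalent pandas `str.replace` call has to treat it as one and not as a literal string.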

def test_edges_do_not_add_nodes_for_chain_subset():
new_funcs = {
"edge_construction_functions": [