Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Biopolymer rdkit smarts #1301

Merged
merged 3 commits into from
May 30, 2022
Merged

[WIP] Biopolymer rdkit smarts #1301

merged 3 commits into from
May 30, 2022

Conversation

richardjgowers
Copy link
Contributor

FAO @j-wags

This is how I'd propose doing the standard residue recognition with a cheminformatics toolkit. It's not 100% working yet, I think I might have to tweak the smarts patterns, but this might also improve them (e.g. could do something like [#7;D3;H1] to get the non-cap N ends, and this might resolve the race condition on patterns that there's currently a TODO about). I also think my attempt at string mashing to get the smarts to generalise is weak and should be replaced with some TOOLKIT_REGISTRY.fuzzy_match(SMARTS) to generate queries that are bond/charge/arom insensitive, will have to check OEChem support for that first.

Some questions:

  • Before I dive into hacking the smarts patterns, is this approach viable?
  • Is the general future outlook for polymer-likes that people can provide a list of monomer-SMARTS to load molecules and apply sufficient information to them?
  • Out of curiosity is anyone working on general polymers yet?

@j-wags
Copy link
Member

j-wags commented May 12, 2022

Thanks for putting this together @richardjgowers!

TL;DR

Before I dive into hacking the smarts patterns, is this approach viable?

Yes, I think this is viable. The big pluses are:

  • This may be significantly faster (though I can't get this to run locally on T4, so I'd like to verify this!)
  • It seems cleaner to use a cheminformatics toolkit for molecule stuff
  • It removes the jank around negative atomic nums for hydrogen
  • It may allow us to directly query on number of hydrogens/degree

The functional liabilities with the current code, which I think can be totally fixed, are:

  • The current code seems to stray away from handling inter-monomer bonds. We do eventually intend to support general polymers and one day the inter-monomer bond may not be single. I don't think this is a fundamental limitation of the approach, though, and it shouldn't be too hard to match these (if it's not happening already)
  • The removal of checks for unassigned edges makes me nervous - Even if we let peptide bonds slide, we may be missing a sidechain and not knowing it

Details

I might have to tweak the smarts patterns, but this might also improve them (e.g. could do something like [#7;D3;H1] to get the non-cap N ends, and this might resolve the race condition on patterns that there's currently a TODO about).

Oh, I really like that we could get rid of that race condition, and start explicitly stating the number of hydrogens and atom degrees.

The code that generates the substructure dict should be generalizable enough to write SMARTS with that info, though the exact implementation may take some thought. I'll need to stew on this a little bit but could probably get that working if we choose to pursue this.

I also think my attempt at string mashing to get the smarts to generalise is weak and should be replaced with some TOOLKIT_REGISTRY.fuzzy_match(SMARTS) to generate queries that are bond/charge/arom insensitive, will have to check OEChem support for that first.

Yup, pretty much agree with all of this.

Thinking a bit more about the method signature above - It should probably be TOOLKIT_REGISTRY.fuzzy_match(SMARTS, molecule), but where the molecule isn't yet chemically valid. This is tricky because our Molecule class isn't made to hold chemically invalid molecules, so this could get brittle (hence why I was using a networkx graph). Maybe the method could take an entire substructure dict and an OpenMM topology, and then let the toolkit wrapper do the complete assignment and return an OFFMol?

Is the general future outlook for polymer-likes that people can provide a list of monomer-SMARTS to load molecules and apply sufficient information to them?
Out of curiosity is anyone working on general polymers yet?

Yes. @connordavel is an undergrad in the Shirts group currently looking at handling more complex polymers. Right now the "substructure dictionary" files are "private" (we don't guarantee their stability or format) and Connor is hacking around directly on the contents. We will make them public eventually, but not in the 0.11.0 release (case in point, this PR might rejigger the substructure dict to query on n_hydrogens).

@richardjgowers
Copy link
Contributor Author

@j-wags I've updated this branch to be my best attempt so far, my working gist here: https://gist.github.com/richardjgowers/c40186e1fe18ebafb22958e18827dd2f

It's getting most but not all AAs, using the patterns in the existing json (and eventually we'll want a synthetic test that iterates through the json and creates one of each monomer to check we correctly assign to it). If I change the input patterns to not include explicit hydrogens it seems to work, but then that's annoying if you want to assign atom names to the hydrogens as they won't show up in the resulting match.

@j-wags
Copy link
Member

j-wags commented May 24, 2022

Thanks for documenting your progress, @richardjgowers - I'll take over this PR!

@j-wags j-wags changed the base branch from topology-biopolymer-refactor to biopolymer_rdkit_smarts May 30, 2022 16:36
@j-wags
Copy link
Member

j-wags commented May 30, 2022

I'm going to repurpose this PR to port these changes into a branch rather than a fork so CI will be happy. I'll open a new PR momentarily!

@j-wags j-wags merged commit c12cfed into openforcefield:biopolymer_rdkit_smarts May 30, 2022
j-wags added a commit that referenced this pull request Jun 8, 2022
* [WIP] Biopolymer rdkit smarts (#1301)

* first pass at doing structure matching with rdkit instead of networkx

* current best attempt at fuzzy rdkit query

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix LEU hitting max matches, merge conflict errors

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* porting rdkit-dependent logic to RDKitToolkitWrapper

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* get rdkit implementation working again. add check for unassigned atoms and bonds. get dipeptide+disulfide assignment going again. Remove unneeded code.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make mypy happy

Co-authored-by: Richard Gowers <richardjgowers+github@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Yoshanuikabundi added a commit that referenced this pull request Jun 20, 2022
* [WIP] Biopolymer rdkit smarts (#1301)

* first pass at doing structure matching with rdkit instead of networkx

* current best attempt at fuzzy rdkit query

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix LEU hitting max matches, merge conflict errors

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* porting rdkit-dependent logic to RDKitToolkitWrapper

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* get rdkit implementation working again. add check for unassigned atoms and bonds. get dipeptide+disulfide assignment going again. Remove unneeded code.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Docs updates for hierarchy schemes

* Fix sphinx warnings

* Factor default chain and residue schemes to own methods

* Add perceive_hierarchy to add_hierarchy_scheme and remove now-duplicated calls to it

* Add residue hierarchy scheme when perceiving residues

* Improve docs of hierarchy scheme stuff

* Improve docs of hierarchy-adjacent objects

* Changes to tests to account for moving perceive calls around

* Re-export Hierarchy types from openff.toolkit.topology

* Streamline links in docs

* perceive_hierarchy -> update_hierarchy_schemes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Avoid overwriting existing methods with hierarchy iterators

* Update tests to give correct output

- Return correct second triplet of residues to test (CYX is two
  ACE-CYS-NME chains with a disulfide bond)
- Add second chain (each `Molecule` in the `Topology` has its own
  HierarchyScheme, and elements within them are not consolidated)

* Add new hierarchy scheme directly in _initialize_from_dict

* Work around case where self._hierarchy_schemes is undefined in __getattr__

* Add type hint required by mypy

* atom.properties -> atom.metadata in add_hierarchy_scheme docstring

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* Document reason behind manual addition of hierarchy scheme

Add comment describing why we don't call `add_hierarchy_scheme` from `_initialize_from_dict`

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Richard Gowers <richardjgowers+github@gmail.com>
Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants