[WIP] Biopolymer rdkit smarts #1301

richardjgowers · 2022-05-11T23:35:07Z

This is how I'd propose doing the standard residue recognition with a cheminformatics toolkit. It's not 100% working yet, I think I might have to tweak the smarts patterns, but this might also improve them (e.g. could do something like [#7;D3;H1] to get the non-cap N ends, and this might resolve the race condition on patterns that there's currently a TODO about). I also think my attempt at string mashing to get the smarts to generalise is weak and should be replaced with some TOOLKIT_REGISTRY.fuzzy_match(SMARTS) to generate queries that are bond/charge/arom insensitive, will have to check OEChem support for that first.

Some questions:

Before I dive into hacking the smarts patterns, is this approach viable?
Is the general future outlook for polymer-likes that people can provide a list of monomer-SMARTS to load molecules and apply sufficient information to them?
Out of curiosity is anyone working on general polymers yet?

j-wags · 2022-05-12T04:48:59Z

Thanks for putting this together @richardjgowers!

TL;DR

Before I dive into hacking the smarts patterns, is this approach viable?

Yes, I think this is viable. The big pluses are:

This may be significantly faster (though I can't get this to run locally on T4, so I'd like to verify this!)
It seems cleaner to use a cheminformatics toolkit for molecule stuff
It removes the jank around negative atomic nums for hydrogen
It may allow us to directly query on number of hydrogens/degree

The functional liabilities with the current code, which I think can be totally fixed, are:

The current code seems to stray away from handling inter-monomer bonds. We do eventually intend to support general polymers and one day the inter-monomer bond may not be single. I don't think this is a fundamental limitation of the approach, though, and it shouldn't be too hard to match these (if it's not happening already)
The removal of checks for unassigned edges makes me nervous - Even if we let peptide bonds slide, we may be missing a sidechain and not knowing it

Details

I might have to tweak the smarts patterns, but this might also improve them (e.g. could do something like [#7;D3;H1] to get the non-cap N ends, and this might resolve the race condition on patterns that there's currently a TODO about).

Oh, I really like that we could get rid of that race condition, and start explicitly stating the number of hydrogens and atom degrees.

The code that generates the substructure dict should be generalizable enough to write SMARTS with that info, though the exact implementation may take some thought. I'll need to stew on this a little bit but could probably get that working if we choose to pursue this.

I also think my attempt at string mashing to get the smarts to generalise is weak and should be replaced with some TOOLKIT_REGISTRY.fuzzy_match(SMARTS) to generate queries that are bond/charge/arom insensitive, will have to check OEChem support for that first.

Yup, pretty much agree with all of this.

Thinking a bit more about the method signature above - It should probably be TOOLKIT_REGISTRY.fuzzy_match(SMARTS, molecule), but where the molecule isn't yet chemically valid. This is tricky because our Molecule class isn't made to hold chemically invalid molecules, so this could get brittle (hence why I was using a networkx graph). Maybe the method could take an entire substructure dict and an OpenMM topology, and then let the toolkit wrapper do the complete assignment and return an OFFMol?

Is the general future outlook for polymer-likes that people can provide a list of monomer-SMARTS to load molecules and apply sufficient information to them?
Out of curiosity is anyone working on general polymers yet?

Yes. @connordavel is an undergrad in the Shirts group currently looking at handling more complex polymers. Right now the "substructure dictionary" files are "private" (we don't guarantee their stability or format) and Connor is hacking around directly on the contents. We will make them public eventually, but not in the 0.11.0 release (case in point, this PR might rejigger the substructure dict to query on n_hydrogens).

richardjgowers · 2022-05-24T17:29:25Z

@j-wags I've updated this branch to be my best attempt so far, my working gist here: https://gist.github.com/richardjgowers/c40186e1fe18ebafb22958e18827dd2f

It's getting most but not all AAs, using the patterns in the existing json (and eventually we'll want a synthetic test that iterates through the json and creates one of each monomer to check we correctly assign to it). If I change the input patterns to not include explicit hydrogens it seems to work, but then that's annoying if you want to assign atom names to the hydrogens as they won't show up in the resulting match.

j-wags · 2022-05-24T17:59:22Z

Thanks for documenting your progress, @richardjgowers - I'll take over this PR!

j-wags · 2022-05-30T16:39:12Z

I'm going to repurpose this PR to port these changes into a branch rather than a fork so CI will be happy. I'll open a new PR momentarily!

* [WIP] Biopolymer rdkit smarts (#1301) * first pass at doing structure matching with rdkit instead of networkx * current best attempt at fuzzy rdkit query Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix LEU hitting max matches, merge conflict errors * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * porting rdkit-dependent logic to RDKitToolkitWrapper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * get rdkit implementation working again. add check for unassigned atoms and bonds. get dipeptide+disulfide assignment going again. Remove unneeded code. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make mypy happy Co-authored-by: Richard Gowers <richardjgowers+github@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [WIP] Biopolymer rdkit smarts (#1301) * first pass at doing structure matching with rdkit instead of networkx * current best attempt at fuzzy rdkit query Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix LEU hitting max matches, merge conflict errors * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * porting rdkit-dependent logic to RDKitToolkitWrapper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * get rdkit implementation working again. add check for unassigned atoms and bonds. get dipeptide+disulfide assignment going again. Remove unneeded code. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Docs updates for hierarchy schemes * Fix sphinx warnings * Factor default chain and residue schemes to own methods * Add perceive_hierarchy to add_hierarchy_scheme and remove now-duplicated calls to it * Add residue hierarchy scheme when perceiving residues * Improve docs of hierarchy scheme stuff * Improve docs of hierarchy-adjacent objects * Changes to tests to account for moving perceive calls around * Re-export Hierarchy types from openff.toolkit.topology * Streamline links in docs * perceive_hierarchy -> update_hierarchy_schemes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Avoid overwriting existing methods with hierarchy iterators * Update tests to give correct output - Return correct second triplet of residues to test (CYX is two ACE-CYS-NME chains with a disulfide bond) - Add second chain (each `Molecule` in the `Topology` has its own HierarchyScheme, and elements within them are not consolidated) * Add new hierarchy scheme directly in _initialize_from_dict * Work around case where self._hierarchy_schemes is undefined in __getattr__ * Add type hint required by mypy * atom.properties -> atom.metadata in add_hierarchy_scheme docstring Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com> * Document reason behind manual addition of hierarchy scheme Add comment describing why we don't call `add_hierarchy_scheme` from `_initialize_from_dict` Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Co-authored-by: Richard Gowers <richardjgowers+github@gmail.com> Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

first pass at doing structure matching with rdkit instead of networkx

b97901f

current best attempt at fuzzy rdkit query

35b351f

j-wags changed the base branch from topology-biopolymer-refactor to biopolymer_rdkit_smarts May 30, 2022 16:36

Merge branch 'biopolymer_rdkit_smarts' into biopolymer_rdkit_smarts

0d073c7

j-wags merged commit c12cfed into openforcefield:biopolymer_rdkit_smarts May 30, 2022

j-wags mentioned this pull request May 30, 2022

[WIP] Biopolymer rdkit smarts (continuing from #1301) #1306

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Biopolymer rdkit smarts #1301

[WIP] Biopolymer rdkit smarts #1301

richardjgowers commented May 11, 2022

j-wags commented May 12, 2022

richardjgowers commented May 24, 2022

j-wags commented May 24, 2022

j-wags commented May 30, 2022

[WIP] Biopolymer rdkit smarts #1301

[WIP] Biopolymer rdkit smarts #1301

Conversation

richardjgowers commented May 11, 2022

j-wags commented May 12, 2022

TL;DR

Details

richardjgowers commented May 24, 2022

j-wags commented May 24, 2022

j-wags commented May 30, 2022