-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Biopolymer rdkit smarts #1301
[WIP] Biopolymer rdkit smarts #1301
Conversation
Thanks for putting this together @richardjgowers! TL;DR
Yes, I think this is viable. The big pluses are:
The functional liabilities with the current code, which I think can be totally fixed, are:
Details
Oh, I really like that we could get rid of that race condition, and start explicitly stating the number of hydrogens and atom degrees. The code that generates the substructure dict should be generalizable enough to write SMARTS with that info, though the exact implementation may take some thought. I'll need to stew on this a little bit but could probably get that working if we choose to pursue this.
Yup, pretty much agree with all of this. Thinking a bit more about the method signature above - It should probably be
Yes. @connordavel is an undergrad in the Shirts group currently looking at handling more complex polymers. Right now the "substructure dictionary" files are "private" (we don't guarantee their stability or format) and Connor is hacking around directly on the contents. We will make them public eventually, but not in the 0.11.0 release (case in point, this PR might rejigger the substructure dict to query on n_hydrogens). |
@j-wags I've updated this branch to be my best attempt so far, my working gist here: https://gist.github.com/richardjgowers/c40186e1fe18ebafb22958e18827dd2f It's getting most but not all AAs, using the patterns in the existing json (and eventually we'll want a synthetic test that iterates through the json and creates one of each monomer to check we correctly assign to it). If I change the input patterns to not include explicit hydrogens it seems to work, but then that's annoying if you want to assign atom names to the hydrogens as they won't show up in the resulting match. |
Thanks for documenting your progress, @richardjgowers - I'll take over this PR! |
I'm going to repurpose this PR to port these changes into a branch rather than a fork so CI will be happy. I'll open a new PR momentarily! |
* [WIP] Biopolymer rdkit smarts (#1301) * first pass at doing structure matching with rdkit instead of networkx * current best attempt at fuzzy rdkit query Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix LEU hitting max matches, merge conflict errors * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * porting rdkit-dependent logic to RDKitToolkitWrapper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * get rdkit implementation working again. add check for unassigned atoms and bonds. get dipeptide+disulfide assignment going again. Remove unneeded code. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make mypy happy Co-authored-by: Richard Gowers <richardjgowers+github@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [WIP] Biopolymer rdkit smarts (#1301) * first pass at doing structure matching with rdkit instead of networkx * current best attempt at fuzzy rdkit query Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix LEU hitting max matches, merge conflict errors * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * porting rdkit-dependent logic to RDKitToolkitWrapper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * get rdkit implementation working again. add check for unassigned atoms and bonds. get dipeptide+disulfide assignment going again. Remove unneeded code. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Docs updates for hierarchy schemes * Fix sphinx warnings * Factor default chain and residue schemes to own methods * Add perceive_hierarchy to add_hierarchy_scheme and remove now-duplicated calls to it * Add residue hierarchy scheme when perceiving residues * Improve docs of hierarchy scheme stuff * Improve docs of hierarchy-adjacent objects * Changes to tests to account for moving perceive calls around * Re-export Hierarchy types from openff.toolkit.topology * Streamline links in docs * perceive_hierarchy -> update_hierarchy_schemes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Avoid overwriting existing methods with hierarchy iterators * Update tests to give correct output - Return correct second triplet of residues to test (CYX is two ACE-CYS-NME chains with a disulfide bond) - Add second chain (each `Molecule` in the `Topology` has its own HierarchyScheme, and elements within them are not consolidated) * Add new hierarchy scheme directly in _initialize_from_dict * Work around case where self._hierarchy_schemes is undefined in __getattr__ * Add type hint required by mypy * atom.properties -> atom.metadata in add_hierarchy_scheme docstring Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com> * Document reason behind manual addition of hierarchy scheme Add comment describing why we don't call `add_hierarchy_scheme` from `_initialize_from_dict` Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Co-authored-by: Richard Gowers <richardjgowers+github@gmail.com> Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
FAO @j-wags
This is how I'd propose doing the standard residue recognition with a cheminformatics toolkit. It's not 100% working yet, I think I might have to tweak the smarts patterns, but this might also improve them (e.g. could do something like
[#7;D3;H1]
to get the non-cap N ends, and this might resolve the race condition on patterns that there's currently a TODO about). I also think my attempt at string mashing to get the smarts to generalise is weak and should be replaced with someTOOLKIT_REGISTRY.fuzzy_match(SMARTS)
to generate queries that are bond/charge/arom insensitive, will have to check OEChem support for that first.Some questions: