Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Feature branch: Topology Biopolymer Refactor #951

Closed
wants to merge 117 commits into from

Conversation

j-wags
Copy link
Member

@j-wags j-wags commented May 25, 2021

Feature branch plans: https://openforcefield.atlassian.net/wiki/spaces/IN/pages/1579122724/2021+Topology+Refactor

Do not edit this branch directly -- All changes should go through PR review!

Notes and to-dos inherited from #881

Adds cachetools as a requirement

  • Addresses Adding lur_cache to toolkits #536
  • (temporary, should be replaced by a more general solution) Adds a lightweight, index-sensitive hash method for molecules (FrozenMolecule.ordered_connection_table_hash)
  • Adds caching for to_openeye and to_rdkit using cachelib
  • Adds caching for Atom.molecule_atom_index, since the data structure makes self-index lookup an O(n_atoms_in_molecule) operation. Molecule._invalidate_cached_properties now recursively deletes this cached property on the molecule's Atoms.
  • Removes @classmethod designation from OpenEyeToolkitWrapper.to_openeye and RDKitToolkitWrapper.to_rdkit (I don't see a need for them to be classmethods, this was probably from before when I knew what I was doing)
  • Makes Molecule.to_openeye and to_rdkit accept a toolkit registry kwarg. It's unclear why they didn't before (a user could reasonably want to divert Molecule.to_rdkit through a custom ToolkitWrapper)
  • Incorporate suggestions from Improving SMARTS matching performance with OpenEye backend #992
  • Add tests
    • appropriate cached molecule hash invalidation (eg when an atom/bond is added)
    • appropriate cached atom index invalidation (necessary? Atoms can only be appended at present)
    • cache for to_openeye / to_rdkit actually detects when a aromaticity model is requested
    • test that operations on cached outputs don't mutate the object in the cache
  • Update docstrings/documentation, if applicable
  • Lint codebase
  • Update changelog

Notes and to-dos inherited from #973

Remaining things that this doesn't do are:

  • Standard caps
  • uncapped C terminal
  • Main-chain prolines
    • (possibly related) Figure out what's up with the proline substructure that gets an N- in it I'm pretty sure this is a problem with components.cif

Notes and to-dos inherited from #982

  • Add pycifrw as dependency to some recipes
  • Add tests for substructure dictionary creation

Notes and to-dos from #1072

Notes and to-dos from #1085

  • Implement Molecule-level HierarchySchemes and HierarchyElements, providing ways to generate a static index where atoms are grouped according to their common metadata.
  • In our original API design/plan and there is an option where we want to initialize a Molecule with some hierarchy flavor.
  • Add an example that both details how the default behavior is implemented and also implements custom hierarchy scheme.
  • Refactor to make Topology objects consist of explicit copies of Molecules, deprecating TopologyMolecule, TopologyAtom, TopologyVirtualSite, etc
  • Implement access to HierarchySchemes from the Topology level.
  • Implement to/from_dict methods for Topology
    protein_param_test.tar.gz

Notes and to-dos from #1097

  • Refactor to make Topology objects consist of explicit copies of Molecules, removing TopologyMolecule, TopologyAtom, TopologyVirtualSite, etc
  • Refactor test_topology.py to use basic pytest features
  • Implement access to HierarchySchemes from the Topology level, using offtop.hierarchy_iterator (I'm happy to change the name of this in a future PR)
  • Implement to/from_dict methods for Topology
  • Add Molecule.particle_index(particle), atom(index), atom_index(atom), virtual_site(index), virtual_site_index(vsite), virtual_site_particle_start_index(vsite), and bond(index).
  • Add Topology.atoms, atom_index(atom), particle_index(particle), virtual_site_particle_start_index(virtual_site), molecule_index(molecule), molecule_atom_start_index(molecule), molecule_virtual_particle_start_index(molecule)
  • Fixes Topologies cannot be instantiated from other topologies #946 (but SHOULDN'T close it until the refactor is merged to master)
  • Implement duplicate-molecule-detection to save time in ToolkitAM1BCCHandler (done in Improve topology refactor performance, offer API point to group molecules #1140)
  • Implement duplicate-molecule-detection to save time during WBO assignment
  • Test speed of atom/particle indexing operations on new Topology class and implement heuristics to recover original performance
  • Have all deleted API points now either redirect to a equivalent new API point or raise a descriptive error
  • Decide whether a topology should have an aromaticity model, or if it's redundant with the Molecule and ForceField aromaticity models
  • Ensure that all API points for permutations of {Molecule,Topology}.{atom,particle,virtual_site}{,s,_index} and counters like n_atomsare exposed and tested.
  • Wire up residue and chain information to play nice with Topology.{to,from}_openmm
  • Update docstrings/documentation, if applicable
  • Lint codebase

Notes and to-dos from #1105

  • Prototype implementation of reading a molecule from a PDB consisting of entirely "standard" components and assigning bond orders and formal charges from aa_variants.cif (and some manually-added caps)
  • Fix issue where our peptide bond SMARTS could mislabel PRO ring-closing carbon as CA (maybe go over metadata assignment substructures in some sort of order that puts peptide+disulfide bonds last, and make earlier matches take precedence)
  • Code cleanup
  • More extensive testing
  • Ensure that perceive_residues still works with newly-generated substructure files
  • Resolve whether the same file can be used for BOTH metadata assignment (perceive_residues) AND bond order/formal charge assignment (from_pdb)
    • It can't -- The atom-naming dict needs to remove imidazole+guanidinium bond orders and formal charges to catch resonance states. The chemistry-assigning dict needs to know that exact information.
  • Make stereo warnings quiet (maybe don't do from_rdkit at all and just make offmol directly in OFFTK API?)
    • Loading process now ingests coordinates and perceives stereo from that.
  • Remove direct rdkit dependence from molecule.py
  • Add extensive tests for PDB loading (take PDBs from PLBenchmarks, COVID moonshot?, other sources?)
  • Add logic to gracefully error for unrecognized residues in PDB (like TPO)
  • Add API points to support adding additional residues to the substructure dict
  • Handle loading PDBs with multiple components (even if they're a mix of polymer and small molecules) (do we really want to support this? It's easy to split PDBs) This is out-of-scope for the 0.11.0 release
  • Offer API point for residue substructure perception when loading OpenMM topologies This is out of scope for the 0.11.0 release
  • Fill residue_name, residue_number metadata
  • Fill atom_name metadata
  • Add tests for metadata loading
  • Test roundtrips to/from PDB (Guarantee that we keep atom name, atom order, atom number(maybe not this one, if original indexing doesn't start at 0?), atom name, residue name, residue number, chain name, element, coordinates)
  • Make sure perceive_residues assigns appropriate residue numbers at disulfide bridges (numbering should increment at each peptide bond, NOT jump disulfide bonds)
  • Schedule meeting with RCSB to talk about ways to improve aa-variants file

Notes and to-dos from #1140

  • Add functionality to efficiently group identical molecules
  • Use new functionality to recover SMARTS matching performance
  • Polish Topology.identical_molecule_groups docstring so actual humans can read it
  • Use new functionality to deduplicate molecules in special charge assignment cases (like ToolkitAM1BCC and friends)
  • Add tests
  • Update docstrings/documentation, if applicable

Notes and to-dos from #1195

  • Refactor PDB loading into toolkit wrappers
  • Refactor (to/from) (openmm,rdkit,openeye) to handle hierarchy metadata
    • Document spec for hierarchy info transfer, edge cases, and roundtrip reliability
    • Document spec for other info transfer (basically get Preserve Molecule core properties through RDMol/OEMol conversions #135 out of issue tracker and into docs)
    • Update docstrings to give basic info about hierarchy info transfer spec
    • Make public API point for loading custom substructure dicts
    • (maybe?) Don't assign info to non-tagged atoms in substructure SMARTS

Notes and to-dos from #1285

  • Cherry pick vsite refactor
  • (blocking) Ensure that tests from test_parameters highlighted at this link get ported to Interchange
  • (blocking) Re-enable vsite_showcase notebook and get it to a working state

* improve performance of smarts matching on proteins

* add cachetools to deps
@codecov
Copy link

codecov bot commented May 25, 2021

Codecov Report

Merging #951 (088fc9f) into master (97462b8) will increase coverage by 2.78%.
The diff coverage is 64.70%.

❗ Current head 088fc9f differs from pull request most recent head 8acf6ad. Consider uploading reports for the commit 8acf6ad to get more accurate results

j-wags and others added 4 commits June 1, 2021 10:16
* Initial implementation of atom metadata + tests

* add docstrings and update releasehistory

* update CI to run on topology biopolymer refactor branch
* fix metadata dict serialization

* Only cache connection table in to_openeye and to_rdkit

* have find_smarts_matches only convert connection table to destination format

* black
* Initial implementation of atom metadata + tests

* black

* black

* add docstrings and update releasehistory

* update CI to run on topology biopolymer refactor branch

* isort

* Adding serialized substructure dictionary file

* Molecule perceive residues capability prototype.

* Initial tests for perceive residue substructures

* After working session with JW. Not working.

* Correct atom names in serialized substructure file.

* Residue perception with atom type data implementation.

* Tests for residue perception.

* Changing test for conflicting histidine HIP

* Discard match if it is smaller.

* Adding test sdf file for HIP

* Making Atom with metadata serializable again.

* Adding substructure library file with explicit bond orders

Co-authored-by: j-wags <jwagnerjpl@gmail.com>
* Class for creating substructure library from CCD/cif files.

* Override some entries from aa variants. Clean before filling data.

* Better naming, documentation and cleaning up a bit.

* Now with atom name information in substructure dictionary.

* Dealing with atom names correctly when no leaving atoms.

* Support for optional explicit bond orders.
@lgtm-com
Copy link

lgtm-com bot commented Jun 11, 2021

This pull request introduces 2 alerts when merging 519f69c into b2b76bb - view on LGTM.com

new alerts:

  • 1 for Unused local variable
  • 1 for Syntax error

@lgtm-com
Copy link

lgtm-com bot commented Jun 12, 2021

This pull request introduces 2 alerts when merging 7fb46c7 into b2b76bb - view on LGTM.com

new alerts:

  • 1 for Unused local variable
  • 1 for Unused import

import networkx as nx
import numpy as np
from cached_property import cached_property
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from cached_property import cached_property
from functools import cached_property

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be related to cached_property only being available in 3.8

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're usiing https://pypi.org/project/cached-property/ I'm not sure if https://pypi.org/project/backports.cached-property/ would be better, and the answer comes from how each works with 3.8+; a solution in the standard library should be used over a third-party backport. I have not looked into this but it should be a quick check.

@mattwthompson
Copy link
Member

With the T4 lysozyme, I'm getting a single atom with missing metadata, causing issues with export:

In:

import re
from collections import Counter

from openff.toolkit.topology import Molecule, Topology
from openff.toolkit.typing.engines.smirnoff import ForceField
from openff.toolkit.utils import get_data_file_path
from openff.units import unit
from simtk import unit as simtk_unit
from simtk.openmm.app import PDBFile

from openff.interchange.components.interchange import Interchange
from openff.interchange.drivers import get_gromacs_energies, get_openmm_energies
from openff.interchange.utils import get_test_file_path


def _fix_openff_impropers(force_field: ForceField):
    impropers = force_field.get_parameter_handler("ImproperTorsions")
    for imp in impropers.parameters:
        imp.smirks = re.sub("(:[23])(.+)(:[23])", r"\3\2\1", imp.smirks)

    return force_field


protein = Molecule.from_file(get_data_file_path("proteins/T4-protein.sdf"))
protein.perceive_residues()

top = protein.to_topology()

print("Counter(Length of metadata dict : number of atoms)")
print(Counter([len(a.atom.metadata) for a in top.topology_atoms]))

print("The atom with no metadata:")
for top_atom in top.topology_atoms:
    if len(top_atom.atom.metadata) == 0:
        print(top_atom)

Out:

Counter(Length of metadata dict : number of atoms)
Counter({3: 2633, 0: 1})
The atom with no metadata:
TopologyAtom 2633 with reference atom <Atom name='' atomic number='8'> and parent TopologyMolecule <openff.toolkit.topology.topology.TopologyMolecule object at 0x11993baf0>

Resolve conflicts, folding in changes from #1026

# Conflicts:
#	.github/workflows/CI.yml
#	devtools/conda-envs/test_env.yaml
#	docs/releasehistory.md
#	openff/toolkit/topology/molecule.py
#	openff/toolkit/utils/toolkits.py
@lgtm-com
Copy link

lgtm-com bot commented Aug 13, 2021

This pull request introduces 2 alerts when merging 58c4284 into d78912b - view on LGTM.com

new alerts:

  • 1 for Unused local variable
  • 1 for Unused import

j-wags and others added 8 commits May 11, 2022 18:51
* Deprecate particle methods from `Topology` API

* Deprecate particles methods from `Molecule`
)

* Remove `use_interchange` argument

Remove `ParameterHandler.create_force` and associated tools

Add back `ParameterHandler_check_all_valence_terms_assigned`

Also add back `ParameterHandler._assert_correct_connectivity`

Update developer docs

Remove allow_nonintegral_charges

Fix typos

Typo

Remove unused exception

Skip charge increment tests for now

Fix JSON serialization

Add to docstring of LibraryChargeType.from_molecule

Cleanup

Fix version serialization test

Update CI

Cleanup, add test for create_force

Turn on some charge increment tests again

* Turn on duplicate partial bond order molecule test

* Re-implement `ForceField.get_partial_charges`

* Re-implement `allow_nonintegral_charges`

* Switch to Interchange's main branch

* Revert dev-facing CI simplifications

* Update openff/toolkit/tests/utils.py

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* Turn back on a 1-4 test

* Add back missing kwarg exception and test

* Update release history

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>
* [WIP] Biopolymer rdkit smarts (#1301)

* first pass at doing structure matching with rdkit instead of networkx

* current best attempt at fuzzy rdkit query

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix LEU hitting max matches, merge conflict errors

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* porting rdkit-dependent logic to RDKitToolkitWrapper

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* get rdkit implementation working again. add check for unassigned atoms and bonds. get dipeptide+disulfide assignment going again. Remove unneeded code.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make mypy happy

Co-authored-by: Richard Gowers <richardjgowers+github@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add FAQ section on `allow_undefined_stereo=True`

* Fix typos in FAQ

* Fix link to cookbook using MyST magic

Co-authored-by: Josh A. Mitchell <yoshanuikabundi@gmail.com>

* Fix typo

* Apply suggestions from code review

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* Apply suggestions from code review

Co-authored-by: Josh A. Mitchell <yoshanuikabundi@gmail.com>

Co-authored-by: Josh A. Mitchell <yoshanuikabundi@gmail.com>
Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>
* Fix more calls to deprecated methods

* Revert escapes on docstrings

* `from_pdb` -> `from_polymer_pdb` in tests

* Apply suggestions from code review

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove particle stuff from `_mm_molecule`, run `pre-commit`

* Clean up BuiltInToolkitWrapper charge assignment

* Require RDKit on another test with `from_polymber_pdb`

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
mattwthompson and others added 6 commits June 10, 2022 14:51
* Assorted fixes to increase test coverage

* Fix  test

* Move some functions to Interchange

* Remove un-used code

* Skip coverage on non-public utilities

* Use `MissingOptionalDependency` from openff-utilities

* Use MissingOptionalDependencyError
* Add test reproducing #1319 with T4 lysozyme

* Force Molecule deepcopies to run through dicts

* Fix test

* Fix tests

* Fix tests

* Tinker with environment

* Fix installs in examples workflow

* Update openff/toolkit/tests/test_molecule.py

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>
* fix 1308 and add test for default hierarchy scheme correctness

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* `chain` -> `chain_id` in some tests

* Update openff/toolkit/topology/molecule.py

Co-authored-by: Matt Thompson <mattwthompson@protonmail.com>

* Update openff/toolkit/topology/molecule.py

Co-authored-by: Matt Thompson <mattwthompson@protonmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthew W. Thompson <mattwthompson@protonmail.com>
* [WIP] Biopolymer rdkit smarts (#1301)

* first pass at doing structure matching with rdkit instead of networkx

* current best attempt at fuzzy rdkit query

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix LEU hitting max matches, merge conflict errors

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* porting rdkit-dependent logic to RDKitToolkitWrapper

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* get rdkit implementation working again. add check for unassigned atoms and bonds. get dipeptide+disulfide assignment going again. Remove unneeded code.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Docs updates for hierarchy schemes

* Fix sphinx warnings

* Factor default chain and residue schemes to own methods

* Add perceive_hierarchy to add_hierarchy_scheme and remove now-duplicated calls to it

* Add residue hierarchy scheme when perceiving residues

* Improve docs of hierarchy scheme stuff

* Improve docs of hierarchy-adjacent objects

* Changes to tests to account for moving perceive calls around

* Re-export Hierarchy types from openff.toolkit.topology

* Streamline links in docs

* perceive_hierarchy -> update_hierarchy_schemes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Avoid overwriting existing methods with hierarchy iterators

* Update tests to give correct output

- Return correct second triplet of residues to test (CYX is two
  ACE-CYS-NME chains with a disulfide bond)
- Add second chain (each `Molecule` in the `Topology` has its own
  HierarchyScheme, and elements within them are not consolidated)

* Add new hierarchy scheme directly in _initialize_from_dict

* Work around case where self._hierarchy_schemes is undefined in __getattr__

* Add type hint required by mypy

* atom.properties -> atom.metadata in add_hierarchy_scheme docstring

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* Document reason behind manual addition of hierarchy scheme

Add comment describing why we don't call `add_hierarchy_scheme` from `_initialize_from_dict`

Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Richard Gowers <richardjgowers+github@gmail.com>
Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

mattwthompson and others added 7 commits June 24, 2022 13:44
* Warn when using AM1-BCC or similar methods on large molecules

* Fix typo

* Point to `stable` tag in docs
* Use `pytest.approx` to safeguard against WBO flakiness

* Fix interaction between `openmm.unit.Quantity` and `pytest.approx`
* Drop ParmEd from protein-ligand workflows

* Do not uninstall openff-toolkit-base when not installed

* Turn off broken cells, update calls to deprecated methods

* Update environment, fix typo

* Cleanup and fixes

* Require RDKit for loading protein PDBs

* Move stuff around with /examples/external/

* Update action

* Update action

* Re-run notebook without exceptions

* Move notebook

* Add back old BRD4 example

* Update workflow for new notebook path
* Improve error description when combining bond handlers (#719)

* Add reproducing test
* Correct warning about allow_undefined_stereo argument in cookbook

* Revise remapping and from topology sections

* Increase cell execution timeout so QCArchive cell completes

* Increase timeout further

* Increase minimum myst-nb version

* Add details about force field stereochemistry dependence to FAQ

* Pare down undefined stereochemistry warning and link to FAQ
* progress toward oe implementation

* initial implementation of OE graph matching

* Note why code doesn't use OEQMol

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* final cleanups

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Start `Topology.to_openmm` vs `Interchange.to_openmm_topology`

* Add note to self
@mattwthompson
Copy link
Member

This can probably be closed now that all of these changes are in the default branch.

@mattwthompson
Copy link
Member

I think we're forced to "close" this but it has been implicitly merged in - the default branch is now main, which was seeded from topology-biopolymer-refactor. (Updating the base branch to main would effectively move us back a month or two in time.) Changes that would have gone into this PR are now going straight into main, just as planned, and landing in beta/RC builds of version 0.11.0.

Of course, please re-open if I'm wrong! But I don't see any other ending here.

@j-wags
Copy link
Member Author

j-wags commented Jul 29, 2022

You served us well, giant PR!

image

@mattwthompson mattwthompson deleted the topology-biopolymer-refactor branch January 2, 2024 16:50
@mattwthompson mattwthompson restored the topology-biopolymer-refactor branch January 2, 2024 16:50
@mattwthompson mattwthompson deleted the topology-biopolymer-refactor branch February 5, 2024 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Topologies cannot be instantiated from other topologies
6 participants