
Refactor to openff.units.elements #1182

Merged: 12 commits merged from minimal-elements into topology-biopolymer-refactor on Feb 10, 2022

Conversation

@mattwthompson (Member) commented Jan 27, 2022

Resolves #1178 if accepted.

This PR proposes not using an external elements package and, further, removing the concept of atoms containing elements; it seems that every use case can be covered by knowing only the atomic number. Concretely (a minimal sketch follows the list below):

  • Atom.element is removed
  • Atom.element.symbol becomes Atom.symbol
  • Atom.element.mass becomes Atom.mass
  • Atom.element.atomic_number probably should never have been used, but was also probably always in sync with Atom.atomic_number
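
A minimal before/after sketch of the change (illustrative only; assumes a toolkit backend such as RDKit is available for Molecule.from_smiles):

from openff.toolkit.topology import Molecule

molecule = Molecule.from_smiles("CO")
atom = molecule.atoms[0]

# Before this PR, these lookups went through an Element object:
#   atom.element.symbol, atom.element.mass, atom.element.atomic_number
# After this PR, the same information lives directly on Atom:
print(atom.symbol)         # "C"
print(atom.mass)           # an openff.units Quantity, reported in daltons
print(atom.atomic_number)  # 6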

Upsides:

  • No new dependencies
  • Quick import (~6 ms on my aging hardware), which could probably be optimized
  • Fast (< 100 ns) atomic mass and symbol lookups

Downsides:

  • Breaks the API (though we're allowed to for 0.11.0!), which was already going to happen for Atom.mass
  • Potentially feature-light for feature development (ambiguous)

Quick profiling script

import time

# Importing openff.units here keeps its own import cost out of the timing below.
from openff.units import unit

start_time = time.time()
from openff.units.elements import MASSES, SYMBOLS

print(time.time() - start_time)  # elapsed import time for the elements module, in seconds
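
For reference, a small sketch of the resulting lookups, assuming MASSES and SYMBOLS are plain dictionaries keyed by atomic number (as their use in this PR suggests):

from openff.units import unit
from openff.units.elements import MASSES, SYMBOLS

# Assumed: both mappings are keyed by atomic number.
print(SYMBOLS[6])                   # "C"
print(MASSES[6].m_as(unit.dalton))  # ~12.011
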
  • Check examples - some are not passing on the refactor branch; it's not clear whether they should be fixed here or later while prepping an RC build
  • Tag issue being addressed
  • Update tests
  • Update docstrings/documentation, if applicable
  • Lint codebase
  • Update changelog

@openforcefield deleted three comments from the lgtm-com bot on Jan 27, 2022
@codecov bot commented Jan 27, 2022

Codecov Report

❗ No coverage uploaded for pull request base (topology-biopolymer-refactor@0beef75).
The diff coverage is n/a.

@mattwthompson (Member Author):

@j-wags it might take multiple iterations of review, but it's ready for a first pass - I hope it's substantive enough to make a decision on whether this is a direction we want to proceed in.

One new test (openff/toolkit/tests/test_molecule.py::TestMolecule::test_chemical_environment_matches_OE) is failing for reasons that aren't obvious to me.

@j-wags (Member) left a comment:

Thanks, @mattwthompson. This looks really solid. I left a few comments, mostly non-blocking. The one blocking item is probably a misunderstanding on my part; let's discuss at our check-in tomorrow.

The only additional thing I'd request is a few sentences of description/migration guide in the release notes. This is the first mention of openff-units in this repo, so it would be good to give some background: differences in getting and setting unit-bearing quantities, how to adapt scripts if users are already using OpenMM units, and what openff-units (with a link) and Pint (with a link) are.
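
As a sketch of what such a migration note might cover (assuming the to_openmm/from_openmm helpers provided by openff.units.openmm), converting between the two unit packages looks roughly like this:

from openff.units import unit
from openff.units.openmm import from_openmm, to_openmm

mass = 12.011 * unit.dalton        # an openff.units (Pint) quantity
mass_omm = to_openmm(mass)         # converted to an openmm.unit.Quantity
mass_back = from_openmm(mass_omm)  # converted back to a Pint quantity

assert abs(mass_back.m_as(unit.dalton) - 12.011) < 1e-10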

@@ -17,8 +17,7 @@ dependencies:
- coverage
- numpy
- networkx
- parmed # ???
- mendeleev
- parmed # ??? what ???
Member

(not blocking) Maybe this was for the example tests (from before they were separated into their own CI)? Feel free to try removing it if you have time to let CI run.

Resolved (outdated) review threads: devtools/conda-envs/test_env.yaml; openff/toolkit/topology/molecule.py (two threads)
@@ -1306,7 +1306,8 @@ def create_openmm_system(self, topology, **kwargs):
# This means that even though virtual sites may have been created via
# the molecule API, an empty VirtualSites tag must exist in the FF
for atom in topology.topology_atoms:
system.addParticle(atom.mass)
# addParticle(mass.m_as(unit.dalton)) would be safer but slower
Member

(blocking) Would the explicit units be substantially slower - like, more than 1 second for 100,000 atoms? And won't atom.mass.m involve a conversion to openmm units anyway? I'd feel much safer if the conversion were explicit. I'm probably misunderstanding something; let's touch base on this at our check-in this week.

Member Author

Surprisingly, there is no conversion to an OpenMM quantity; internally the quickest way through this method is to pass it an int. Here is a snippet from the SWIG-generated file I found in .../site-packages/openmm/openmm.py:

    def addParticle(self, mass):
        r"""
        addParticle(self, mass) -> int
        Add a particle to the System. If the mass is 0, Integrators will ignore the particle and not modify its position or velocity. This is most often used for virtual sites, but can also be used as a way to prevent a particle from moving.

        Parameters
        ----------
        mass : double
            the mass of the particle (in atomic mass units)

        Returns
        -------
        int
            the index of the particle that was added
        """

        if unit.is_quantity(mass):
            mass = mass.value_in_unit(unit.amu)


        return _openmm.System_addParticle(self, mass)

So, giving it something that looks like a number is fastest. If it thinks it sees something wrapped with OpenMM units, it will do the necessary conversion.


In [1]: from openff.units import unit

In [2]: from openmm import System, unit as openmm_unit

In [3]: dummy_mass_openff = 1.0 * unit.dalton

In [4]: dummy_mass_openmm = 1.0 * openmm_unit.amu

In [5]: system = System()

In [6]: %timeit system.addParticle(mass=1.0)
592 ns ± 81 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [7]: %timeit system.addParticle(mass=dummy_mass_openff.m)
756 ns ± 136 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [8]: %timeit system.addParticle(mass=dummy_mass_openmm)
4.76 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %timeit system.addParticle(mass=dummy_mass_openff.m_as(unit.dalton))
17.2 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Quantity.m is fast because it does no unit checking; it just reports the magnitude in whatever units the quantity currently carries. Explicitly ensuring that the values are in daltons (m_as) is slower still - more so, in fact, than letting OpenMM do the analogous check internally.

Of course, the problem here is dealing with masses in other units - not something I've ever run into, but it could be considered. The current implementation will always report daltons as long as we don't change this upstream. There is no setter on the Atom class, so I don't see how it could get into that state without other changes that would themselves be significant behavior changes.

If we can't assume daltons, I'm having a hard time finding any way around the time it takes to do this conversion. The only middle ground I can think of right now is adding a setter that enforces the units; this would safeguard against similar cases in which we write code without considering the non-dalton case. It has the downside of pushing the cost elsewhere, of course.

In conclusion, I think the current approach is the best one (if accidentally so - I did not think this all through while working on it). It assumes the units, but those are hardcoded in openff-units. It's also about as performant as possible, clocking in at roughly ~750 ms for adding 1,000,000 atoms, compared to ~600 ms if everything were already plain floats and ~4,700 ms if OpenMM-wrapped quantities were passed instead.

Member

Thanks for the detailed response. 10 us * 100,000 atoms = 1 second, so using explicit units will become a performance bottleneck at some point.

As a compromise, how about we leave this as-is, but make it so that we guarantee that Atom.mass always returns values in daltons/amus? I've added a comment for where we could test that here: https://github.com/openforcefield/openff-toolkit/pull/1182/files#r803188577

Adding the test described above would resolve the blocker here, and then feel free to merge. Optionally, it'd be great to update the Atom.mass docstring to echo the unit guarantee.
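
A sketch of the kind of unit-guarantee test being discussed (hypothetical test name; assumes Molecule.from_smiles and that Atom.mass returns an openff.units Quantity after this refactor):

from openff.toolkit.topology import Molecule
from openff.units import unit


def test_atom_mass_is_reported_in_daltons():
    """Hypothetical: Atom.mass should always carry dalton units."""
    molecule = Molecule.from_smiles("CCO")
    for atom in molecule.atoms:
        # The guarantee under discussion: the magnitude can then be handed
        # straight to openmm.System.addParticle without conversion.
        assert atom.mass.units == unit.dalton
        assert atom.mass.m > 0.0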

Member Author

I take "but make it so that we guarantee" to mean we'll add a unit test and document it as such, which sounds like a good idea to me.

Comment on lines +169 to +174
# Is this de-duplicated? i.e. can it be ["C", "C", "H"] or only ["C", "H"] ?
atom_symbol_list: List[str] = cif_entry["_chem_comp_atom.type_symbol"]

# Is this dict a de-duplicated mapping between element symbols and atomic numbers
# for the elements in this cif file? If so, can maybe do it in one step
"""
Member

(not blocking) In the interest of time I'm not going to give feedback on this. I forgot exactly what this function expects and we don't have tests for this file yet. This is not expected to be run by users and so runtime isn't super critical as long as it's under an hour. If you're interested in optimizing/refactoring this, feel free to do so at your discretion. The best test I can suggest would be ensuring that the openff/toolkit/utils/make_*_substructures.py scripts yield the same outputs.

@mattwthompson (Member Author) commented Feb 8, 2022:

I'll not tinker with this any further, then - except maybe to suggest that it shouldn't be in the API, and should instead be tucked away in another folder?

Member

I was under the impression that the underscore prepended to the file name communicated the privateness/here-be-dragons-ness of the contents. Though I'm not sure if that's a python convention in reality.

Collaborator

As far as I remember, when we first planned this we didn't want to support public infrastructure for it, only what's "enough" for the toolkit. I think I'm still in favor of this stance, since we know this problem can get pretty complex. It's also worth noting that this .py file is expected to be used only once in a while. My 2 cents :). BTW, I'm loving how this is evolving - thanks for the good work!

Collaborator

Though I'm not sure if that's a python convention in reality.

It is :) https://www.python.org/dev/peps/pep-0008/?#public-and-internal-interfaces

Member Author

I'd like to clarify the distinction between two different things. I'm not sure of the right jargon (happy to be told what the terminology is; I can't think of anything but "private API" and "scripts"), so I'll resort to being descriptive:

  1. Code that's for core toolkit functionality and might get called at runtime while interacting with the public API, but is either too dangerous to be explicitly exposed to users or meant to hide away implementation details. Stuff like Molecule._add_trivalent_lone_pair_virtual_site or OpenEyeToolkitWrapper._find_smarts_matches. These code paths have intrinsic value (in the sense that they are the product), should be covered by unit tests, will be included in conda tarballs, and fall under the umbrella of code we're responsible for maintaining/documenting/etc.
  2. Code that's for one-off data cleaning or developer-facing operations, but not a part of a core feature and will not be called by a user (directly or indirectly) at runtime. These files are valuable in large part for what they produce (i.e. big CIF file I don't understand the details of), but aren't covered by unit tests, may or may not be included in conda tarballs, and don't have the same maintenance burden.

What I'm getting at here is that code which cannot be accessed at runtime should be moved to /utilities/ or somewhere else that would not be packaged as part of the Python module. That folder predates my time here, but it looks like those files didn't contribute as much to the maintenance burden.

There's no danger to users as-is, since it's definitely not part of the public API. But I think functions that are only used once in a while should not live in the core module (though they're no less important, of course).

Member

Hmm, moving it to /utilities/ isn't a bad idea... I'm just thinking about things like @connordavel's possible future project with custom polymers, where having runtime access to the substructure generator from a stable package may enable some important functionality for advanced users. There's also a possible future where we do make this public.

So, I understand the arguments about maintenance burden, but overall there's a lot of uncertainty about the future of this functionality, so for now I'd pick "no change" as a course of action while we have so many other things going on.

Member Author

Taking no action here is certainly reasonable given its temporary nature at the moment. 👍

@mattwthompson marked this pull request as ready for review February 8, 2022 22:25
@mattwthompson (Member Author):

The failing test is also failing upstream on #951

openff/toolkit/tests/test_molecule.py::TestMolecule::test_chemical_environment_matches_OE

@mattwthompson (Member Author):

I also can't reproduce the collection errors in the "CI / Test on ubuntu-latest, Python 3.7, RDKit=false, OpenEye=true" job.

@mattwthompson (Member Author):

There are two failures in CI, each of which I believe is unrelated to the changes here:

  1. openff/toolkit/tests/test_molecule.py::TestMolecule::test_chemical_environment_matches_OE - reported upstream; cause unclear
  2. Tests with RDKit not installed and OpenEye installed are failing to collect the molecule fixtures. I can't reproduce this locally

I'm going to merge this, and we can fix those later.

@mattwthompson mentioned this pull request Feb 10, 2022
@mattwthompson merged commit 1115d25 into topology-biopolymer-refactor Feb 10, 2022
@mattwthompson deleted the minimal-elements branch February 10, 2022 21:38