-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor to openff.units.elements #1182
Refactor to openff.units.elements #1182
Conversation
Codecov Report
|
@j-wags it might take multiple iterations of reviews but it's ready for a first review - I hope it's material enough to make a decision on if this is a direction we want to proceed with. One new test ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @mattwthompson. This looks really solid. I left a few comments, mostly non-blocking. The one blocking thing is probably a misunderstanding on my part, let's discuss at our check in tomorrow.
The only additional thing I'd request is a few sentences of description/migration guide in the releasenotes. This is the first mention of openff-units
in this repo, so it'd be good to give some info (differences in getting and setting unit bearing quantities, how to adapt scripts if users are already using openmm units, what is openff units (with a link) and pint (with a link) )
@@ -17,8 +17,7 @@ dependencies: | |||
- coverage | |||
- numpy | |||
- networkx | |||
- parmed # ??? | |||
- mendeleev | |||
- parmed # ??? what ??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not blocking) Maybe this was for example tests (from before they were separated into their own CI)? Feel free to try removing if you have time to let CI run.
@@ -1306,7 +1306,8 @@ def create_openmm_system(self, topology, **kwargs): | |||
# This means that even though virtual sites may have been created via | |||
# the molecule API, an empty VirtualSites tag must exist in the FF | |||
for atom in topology.topology_atoms: | |||
system.addParticle(atom.mass) | |||
# addParticle(mass.m_as(unit.dalton)) would be safer but slower |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(blocking) Would the explicit units be substantially slower? Like, more than 1 second for 100,000 atoms? And won't atom.mass.m
involve a conversion to openmm units anyway? I'd feel way safer if the conversion were explicit. I'm probably misunderstanding something - Let's touch base on this at our check in this week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surprisingly, there is no conversion to an OpenMM quantity; internally the quickest way through this method is to pass it an int. Here is a snippet from the SWIG-generated file I found in .../site-packages/openmm/openmm.py
:
def addParticle(self, mass):
r"""
addParticle(self, mass) -> int
Add a particle to the System. If the mass is 0, Integrators will ignore the particle and not modify its position or velocity. This is most often used for virtual sites, but can also be used as a way to prevent a particle from moving.
Parameters
----------
mass : double
the mass of the particle (in atomic mass units)
Returns
-------
int
the index of the particle that was added
"""
if unit.is_quantity(mass):
mass = mass.value_in_unit(unit.amu)
return _openmm.System_addParticle(self, mass)
So, giving it something that looks like a number is fastest. If it thinks it sees something wrapped with OpenMM units, it will do the necessary conversion.
In [1]: from openff.units import unit
In [2]: from openmm import System, unit as openmm_unit
In [3]: dummy_mass_openff = 1.0 * unit.dalton
In [4]: dummy_mass_openmm = 1.0 * openmm_unit.amu
In [5]: system = System()
In [6]: %timeit system.addParticle(mass=1.0)
592 ns ± 81 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [7]: %timeit system.addParticle(mass=dummy_mass_openff.m)
756 ns ± 136 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [8]: %timeit system.addParticle(mass=dummy_mass_openmm)
4.76 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [9]: %timeit system.addParticle(mass=dummy_mass_openff.m_as(unit.dalton))
17.2 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Quantity.m
is fast because it does no unit checking, it just reports the magnitude in whatever it's currently set as. Ensuring that the values are in Dalton is slower, but only so much slower than allowing OpenMM do the analogous check internally.
Of course, the problem here is dealing with other masses - not something I've ever run into, but it could be considered. The current implementation will always be in Daltons as long as we don't change this upstream. There is no setter on the Atom
class, so I don't even know it could get to that state without other changes that themselves might significant behavior changes.
If we can't assume Daltons, I'm having a hard time figuring out any way around the time it takes to do this conversion. The only middle ground I can think of now is using the setter to ensure its units; this would safeguard against similar cases in which we wrote code without considering the non-Dalton case. It has the downside of pushing the cost elsewhere, of course.
In conclusion, I think the current approach is the best one (if accidentally - I did not think this all through while working on this). It assumes the units, but those are hardcoded in openff-units
. It's also virtually as performant as possible, clocking in around ~750 ms for adding 1,000,000 atoms compared to ~600 ms if everything was already as floats and ~4700 ms with the current approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed response. 10 us * 100,000 atoms = 1 second, so using explicit units will become a performance bottleneck at some point.
As a compromise, how about we leave this as-is, but make it so that we guarantee that Atom.mass
always returns values in daltons/amus? I've added a comment for where we could test that here: https://github.com/openforcefield/openff-toolkit/pull/1182/files#r803188577
Adding the test described above would resolve the blocker here, and you could feel free to merge. Optionally, it'd be great if you update the atom.mass
docstring to echo the unit guarantee.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take
but make it so that we guarantee
to mean we'll add a unit test and document it as such, which sounds like a good idea to me.
# Is this de-duplicated? i.e. can it be ["C", "C", "H"] or only ["C", "H"] ? | ||
atom_symbol_list: List[str] = cif_entry["_chem_comp_atom.type_symbol"] | ||
|
||
# Is this dict a de-duplicated mapping between element symbols and atomic numbers | ||
# for the elements in this cif file? If so, can maybe do it in one step | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not blocking) In the interest of time I'm not going to give feedback on this. I forgot exactly what this function expects and we don't have tests for this file yet. This is not expected to be run by users and so runtime isn't super critical as long as it's under an hour. If you're interested in optimizing/refactoring this, feel free to do so at your discretion. The best test I can suggest would be ensuring that the openff/toolkit/utils/make_*_substructures.py
scripts yield the same outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll not tinker with this more, then. Except maybe to suggest it shouldn't be in the API, and instead tucked away in another folder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the impression that the underscore prepended to the file name communicated the privateness/here-be-dragons-ness of the contents. Though I'm not sure if that's a python convention in reality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I remember when we first planned this, we didn't want to support a public infrastructure to do this, only just what's "enough" for the toolkit. I think I'm still in favor of this stance, since we know how this problem can be pretty complex. Also, I guess it's worth noticing that this .py
file is expected to be used only once in a while. My 2 cents :). BTW, I'm loving how this is evolving, thanks for the good work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though I'm not sure if that's a python convention in reality.
It is :) https://www.python.org/dev/peps/pep-0008/?#public-and-internal-interfaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to clarify a bit the distinction between two different things. I'm not sure the right jargon (happy to be told what the terminology is, I can't think of anything but "private API" and "scripts") so I'll resort to being descriptive:
- Code that's for core toolkit functionality and might get called at runtime while interacting with the public API, but is either too dangerous to be explicitly exposed to users or meant to hide away implementation details. Stuff like
Molecule._add_trivalent_lone_pair_virtual_site
orOpenEyeToolkitWrapper._find_smarts_matches
. These code paths have intrinsic value (in the sense that they are the product), should be covered by unit tests, will be included in conda tarballs, and fall under the umbrella of code we're responsible for maintaining/documenting/etc. - Code that's for one-off data cleaning or developer-facing operations, but not a part of a core feature and will not be called by a user (directly or indirectly) at runtime. These files are valuable in large part for what they produce (i.e. big CIF file I don't understand the details of), but aren't covered by unit tests, may or may not be included in conda tarballs, and don't have the same maintenance burden.
What I'm getting at here is that code that cannot be accessed at runtime should be moved /utilities/
or somewhere that would not be packaged as part of the Python module. That folder is from before my time, but it looks like those files didn't contribute as much to maintenance burden.
There's no danger to the user as is since it's definitely not part of the public API. But I think functions that should be used once in a while should not be in the core module (though no less important, of course).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, moving it to /utilities/
isn't a bad idea... I'm just thinking about stuff like @connordavel's possible future project with custom polymers, and that having runtime access to the substructure generator from a stable package may enable some important functionality for advanced users. Also that there's a possible future where we do make this public.
So, I understand the arguments about maintenance burden, but overall there's a lot of uncertainty about the future of this functionality, so for now I'd pick "no change" as a course of action while we have so many other things going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking no action here is certainly reasonable given its temporary nature at the moment. 👍
Co-authored-by: Jeff Wagner <jwagnerjpl@gmail.com>
The failing test is also failing upstream on #951
|
I also can't reproduce the collection errors in the "CI / Test on ubuntu-latest, Python 3.7, RDKit=false, OpenEye=true". |
There are two failures in CI, each of which I believe are unrelated to these changes here
I'm going to merge this and we can fix those later |
Resolves #1178 if accepted.
This PR proposes to not use an external elements package and further remove the concept of atoms containing elements. It seems that every use case can be covered by knowing only the atomic number.
Atom.element
is removedAtom.element.symbol
becomesAtom.symbol
Atom.element.mass
becomesAtom.mass
Atom.element.atomic_number
probably should never have been used, but was also probably always in sync withAtom.atomic_number
Upsides:
Downsides:
Atom.mass
Quick profiling script