Add support for NequIP models #60

Open
wants to merge 46 commits into
base: main

Contributor

@sef43 sef43 commented Oct 4, 2023

This PR adds support for NequIP models to openmm-ml. No pre-trained models are available, but the model framework is well defined, so users will be able to run OpenMM simulations with NequIP models they have trained themselves.

It also adds code to compute neighbor lists with PyTorch, which will be used for MACE models too. (The NNPOps neighbor list can be added later.)

Addresses #48; see mir-group/nequip#288 for further discussion.

TODO: Tests still need to be added, but it is not clear how to do this cleanly in CI given that NequIP has to be installed via pip.

sef43 and others added 18 commits February 15, 2023 12:01
* create openmmml/models/nequippotential.py
* create example for toluene example/run_nequip.py
* implement PBC
* cleanup nequip types
* user specified unit conversions
* should work on GPU and CPU
* add example for toy model with PBC
* uses nequip 0.6.0 @develop branch
* uses torch-nl compute_neighborlist
* sets torch dtype from the loaded model metadata
Member

@peastman peastman left a comment

This will be a really nice feature to have. I did a first pass through the code and made some comments.

Reading through this, it occurs to me that we really need proper documentation. The README gives a brief overview, but as we expand to more options than just ANI, and especially as we add models that are more complex to use and require installing other packages, that won't be enough.

return NequIPPotentialImpl(name, model_path, distance_to_nm, energy_to_kJ_per_mol, atom_types)

class NequIPPotentialImpl(MLPotentialImpl):
"""This is the MLPotentialImpl implementing the NequIP potential.
Member

We should clarify what this means, since there isn't really such a thing as "the NequIP potential". NequIP is a code, not a potential. Presumably this class can be used for any model implemented with that code, including Allegro models for example?

Contributor

The class can be used with any model generated by NequIP. I'm currently verifying if this also applies to Allegro, though I anticipate it should.

input_dict["pos"] = positions

# compute edges
mapping, shifts_idx = simple_nl(positions, input_dict["cell"], pbc, self.r_max)
Member

Is there a reason not to use the NNPOps neighbor list? In mir-group/nequip#288 (comment) you found it was much faster. NNPOps is already a dependency of this module.
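For readers unfamiliar with what the neighbor search in the snippet above has to produce, here is a minimal sketch. It is not the torch-nl or NNPOps implementation: it is a brute-force O(N²) search without periodic boundary conditions, the function name `brute_force_pairs` is hypothetical, and listing each pair in both directions is an assumption made for illustration.

```python
from itertools import combinations
from math import dist


def brute_force_pairs(positions, r_max):
    """Return directed (i, j) index pairs for atoms closer than r_max.

    Illustration only: no periodic boundaries, O(N^2) cost.
    """
    pairs = []
    for i, j in combinations(range(len(positions)), 2):
        if dist(positions[i], positions[j]) < r_max:
            # Assumption: edges are stored in both directions.
            pairs.append((i, j))
            pairs.append((j, i))
    return pairs


# Three atoms on a line; only the first two are within the cutoff.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
pairs = brute_force_pairs(positions, r_max=2.0)
```

A production neighbor list avoids the quadratic loop (for example with cell lists) and also returns the periodic shift for each pair, which is where an optimized implementation is likely to make a difference.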


"""

def __init__(self, name, model_path, distance_to_nm, energy_to_kJ_per_mol, atom_types):
Member

Does NequIP have default units it uses, or are the units arbitrary? If it has a preferred set of units, we can put the conversion factors here as default values.

Contributor

Based on the discussion at mir-group/nequip#288 (comment), it appears that NequIP/Allegro is entirely agnostic to units and preserves those of the training dataset. I think it's better for users to receive a TypeError indicating missing arguments rather than potentially proceeding with incorrect conversions.
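As a rough sketch of what "user-specified conversions" could look like in practice, the snippet below looks up factors that convert a model's units into OpenMM's nanometers and kJ/mol. The dictionaries and the helper `conversion_factors` are hypothetical and not part of openmm-ml. The factors themselves are standard (1 Å = 0.1 nm, 1 kcal/mol = 4.184 kJ/mol, 1 eV ≈ 96.485 kJ/mol), and an unknown unit raises an error instead of falling back to a default.

```python
# Factors that convert model units into OpenMM's units (nm, kJ/mol).
LENGTH_TO_NM = {"angstrom": 0.1, "nm": 1.0}
ENERGY_TO_KJ_PER_MOL = {"kcal/mol": 4.184, "eV": 96.485, "kJ/mol": 1.0}


def conversion_factors(length_unit, energy_unit):
    """Return (length factor, energy factor); raise KeyError on unknown units."""
    return LENGTH_TO_NM[length_unit], ENERGY_TO_KJ_PER_MOL[energy_unit]


# A model trained on a dataset in angstroms and eV.
length_scale, energy_scale = conversion_factors("angstrom", "eV")
```

These two numbers are what a user would pass as the conversion arguments (named `distance_to_nm` and `energy_to_kJ_per_mol` in the signature under review here).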

Comment on lines 105 to 106
self.register_buffer('nm_to_distance', torch.tensor(1.0/distance_to_nm))
self.register_buffer('distance_to_nm', torch.tensor(distance_to_nm))
Member

Why register two redundant buffers with the same information?

Comment on lines 52 to 53
model_path: str
path to deployed NequIP model
Member

Should we also allow the user to directly pass in the model as a PyTorch object?

@sef43 sef43 mentioned this pull request Nov 3, 2023
@jchodera
Member

Can we train a NequIP model on SPICE and make it usable through openmm-ml?

@svarner9

Hello,

Has there been any further progress on this? I have used NequIP in LAMMPS, but I would prefer to use OpenMM because it is more compatible with the enhanced sampling packages that I use.

I have tried running simulations with a NequIP potential using openmm-ml in its current state, but it is significantly slower than LAMMPS. Both simulations run on a single GPU, although in LAMMPS I also use 32 CPU threads and Kokkos.

I am not sure whether I am doing something wrong when running openmm-ml, but at present it is unusable for my rather simple system of 645 atoms. Is it expected to be this slow for a system of this size in its current state?

I can provide further information if needed. Thank you so much in advance!

Best,
Sam

@JMorado
Contributor

JMorado commented Apr 23, 2024

@svarner9, could you try the current implementation available here? It uses the NNPOps neighbor list, so I anticipate it might be slightly faster for a system of the size you're working with. You can create the MLPotential using something along these lines:

potential = MLPotential('nequip', modelPath='model.pth', lengthScale=0.1, energyScale=4.184)

What speed-up did you observe in your LAMMPS simulations compared to OpenMM/OpenMM-ML?

@JMorado
Contributor

JMorado commented May 7, 2024

This is done from my side. If someone could take a look and review the changes, that would be great. Performance benchmarks on test models can be found here.

Many thanks!

Comment on lines 75 to 76
``atomTypes`` parameter. This must be a list containing an integer specifying
the atom type of each particle in the system. Note that by default the model
Member

Based on the code, I think this description is incorrect. It actually should contain the atom type for each particle that will be modeled with the ML potential. So if you call createMixedSystem(), it should contain one element for each element of the atoms argument, not one for each particle in the System. Can you make this clear both here and in the description of the atomTypes argument below?

Contributor

Thanks for noticing this. I have now clarified it. Now, atomTypes is also an argument passed during system creation. This allows creating systems with varying ML regions from the same MLPotential in cases where custom nequip atom types are being used.
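To make the distinction concrete, here is a small sketch of building one type index per ML atom rather than one per particle. The helper `types_for_ml_atoms` and the example data are hypothetical; only the idea (the list follows the `atoms` subset, not the whole System) comes from the discussion above.

```python
def types_for_ml_atoms(symbols, ml_atoms, type_names):
    """Return one type index per ML atom, in the order given by ml_atoms."""
    type_index = {name: i for i, name in enumerate(type_names)}
    return [type_index[symbols[i]] for i in ml_atoms]


symbols = ["C", "H", "H", "O", "H", "H"]  # every particle in the System
ml_atoms = [0, 1, 2]                      # subset modeled by the ML potential
atom_types = types_for_ml_atoms(symbols, ml_atoms, type_names=["H", "C", "O"])
```

The resulting list has three entries, one per element of `ml_atoms`, even though the System contains six particles.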

Parameters
----------
name : str
The name of the deployed model.
Member

Actually, this is the name that was specified in the MLPotential constructor, which in this case will always be nequip.

Contributor

Fixed.

typeNameToTypeIndex = {
typeNames: i for i, typeNames in enumerate(typeNames)
}
self.atomTypes = [
Member

Modifying the object this method is called on is dangerous. If you create a MLPotential and then create multiple Systems from it, this will lead to incorrect results on the second and later calls.

Contributor

Agreed. Fixed.
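The hazard described above can be shown with a toy example. Both classes below are hypothetical stand-ins, not the PR's code, and the thread does not show how the real bug manifested. The first caches per-call data on `self`, so a second call silently reuses stale values; the second keeps per-call data in local variables.

```python
class MutatingPotential:
    """Anti-pattern: create_system() stores per-call state on self."""

    def __init__(self):
        self.atom_types = None

    def create_system(self, atom_types):
        if self.atom_types is None:
            # Only the first call's value is ever stored.
            self.atom_types = list(atom_types)
        return {"atom_types": self.atom_types}


class StatelessPotential:
    """Safer: per-call data lives in local variables only."""

    def create_system(self, atom_types):
        local_types = list(atom_types)
        return {"atom_types": local_types}


bad = MutatingPotential()
first_bad = bad.create_system([0, 1])
second_bad = bad.create_system([2, 2, 2])    # still reports [0, 1]

good = StatelessPotential()
first_good = good.create_system([0, 1])
second_good = good.create_system([2, 2, 2])  # reports [2, 2, 2]
```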

Comment on lines 225 to 236
model : str
The path to the deployed NequIP model.
lengthScale : float
The energy conversion factor from the model units to kJ/mol.
energyScale : float
The length conversion factor from the model units to nanometers.
dtype : torch.dtype
The precision of the model.
r_max : torch.Tensor
The maximum distance for the neighbor search.
inputDict : dict
The input dictionary passed to the model.
Member

This doesn't match the actual list of attributes.

Contributor

Fixed. Since buffers can also be accessed as attributes, should those be included in the docstring?

Comment on lines 172 to 174
self.model, metadata = nequip.scripts.deploy.load_deployed_model(
self.modelPath, device="cpu", freeze=False
)
Member

Here is another instance of modifying self inside a method that should treat it as immutable.

Contributor

Fixed.



@pytest.mark.parametrize("implementation,platform_int", list(itertools.product(['nnpops', 'torchani'], list(platform_ints))))
class TestMLPotential:

def testCreateMixedSystem(self, implementation, platform_int):
pdb = app.PDBFile('alanine-dipeptide-explicit.pdb')
pdb = app.PDBFile(os.path.join(test_data_dir, 'alanine-dipeptide/alanine-dipeptide-explicit.pdb'))
Member

The hardcoded unix path separator will fail on Windows.

Contributor

Fixed.

Comment on lines 15 to 16
test_data_dir = os.path.dirname(os.path.abspath(__file__))
test_data_dir = os.path.join(test_data_dir, "data")
Member

The first line is confusing: you set test_data_dir to a directory that doesn't contain the test data. It's better to give it a different name, or just combine these two lines.

Contributor

Fixed.
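One way to address both path comments at once is sketched below: the data directory is computed in a single expression, and each path component is passed to `os.path.join` separately so no separator is hardcoded. The helper `data_path` and the test file name are hypothetical; the thread does not show what the actual fix looked like.

```python
import os


def data_path(test_file, *parts):
    """Build a path under the 'data' directory next to test_file."""
    test_data_dir = os.path.join(os.path.dirname(os.path.abspath(test_file)), "data")
    return os.path.join(test_data_dir, *parts)


# In a real test module, __file__ would be passed as the first argument.
pdb_path = data_path(
    os.path.join("tests", "test_mlpotential.py"),
    "alanine-dipeptide",
    "alanine-dipeptide-explicit.pdb",
)
```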

@@ -0,0 +1,23 @@
HETATM 1 C1 UNL 1 2.199 -0.143 0.062 1.00 0.00 C
Member

The correct residue name for toluene is MBN. See http://ligand-expo.rcsb.org/reports/M/MBN/index.html.

Contributor

Fixed.

conda install -c conda-forge openmm-torch nnpops
```

Then install the development versions of NequIP and `openmm-ml` using pip:
Member

We shouldn't be telling people to install pre-release versions of packages unless there's a really good reason. Any release of OpenMM-ML that doesn't include NequIP support also doesn't include these examples, so we should always direct people to the latest release. Why is the pre-release NequIP needed, and when will the necessary features be in a release?

Contributor

Makes sense. I rewrote the instructions so that the packages from the OpenMM ecosystem, namely NNPOps and openmm-ml, are installed from conda-forge.

Regarding the pre-release of NequIP, the version currently available through pip is 0.5.6, while pip install git+https://github.com/mir-group/nequip@develop installs version 0.6.0. There's some discussion in this thread as to why the development version of NequIP might be better (or necessary) to use with this interface. I am sure @Linux-cpp-lisp is in a better position to provide more informed answers to your questions. @Linux-cpp-lisp, could you please clarify why the development version of NequIP is required and whether there are any plans to make it available through pip and/or conda? Many thanks!

Sorry for coming around to this thread late, and thanks both for your efforts on this!

This is something I need to fix and hope to have fixed and available normally on PyPI in the very near future; I'll let you know as soon as I have that up. Since you are using load_deployed_model, this will probably work with the current main as well, but ideally I will just get that released and we will restrict to >=0.6.0. Thanks for your flexibility while I clean this up.

OK, 0.6.0 is now available from PyPI: https://pypi.org/project/nequip/

Let me know if this resolves the issues for you.

(I would recommend restricting the nequip version to 0.6.0+ just for simplicity's sake)
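A minimal sketch of how such a lower bound could be checked at import time is shown below. The helpers `version_tuple` and `meets_minimum` are hypothetical and only handle plain numeric versions such as "0.6.0"; pre-release or development version strings would need a proper parser such as the `packaging` library.

```python
def version_tuple(version):
    """Parse a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, minimum="0.6.0"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# The two versions mentioned in this thread.
old_ok = meets_minimum("0.5.6")
new_ok = meets_minimum("0.6.0")
```

The installed version string could be obtained with `importlib.metadata.version("nequip")`. Comparing integer tuples rather than strings matters because "0.10.0" sorts before "0.6.0" as text.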

Contributor

@JMorado JMorado May 14, 2024

Many thanks @Linux-cpp-lisp. That's really useful! I'll integrate it soon.

Great!

simulation.minimizeEnergy()

# Run the simulation
simulation.step(1000)
Member

Before starting the simulation it's a good idea to call

simulation.context.setVelocitiesToTemperature(300*unit.kelvin)

Assume people will be copying your code, so we want to make sure it follows best practices. For the same reason, it's better to use a DCDReporter instead of a PDBReporter (PDB being a terrible format for trajectories).

Contributor

Fixed.

"!mamba install -c conda-forge openmm-torch nnpops pytorch=*=cuda*\n",
"\n",
"!pip install git+https://github.com/mir-group/nequip@develop\n",
"!pip install git+https://github.com/sef43/openmm-ml@nequip"
Member

Let's not have the example install OpenMM-ML from your personal fork! Remember that whatever you do in the example, users will copy it.

Contributor

Fixed. And thanks for the heads up, will keep that in mind :)

@svarner9

svarner9 commented May 8, 2024

> @svarner9, could you try the current implementation available here? It uses the NNPOps neighbor list, so I anticipate it might be slightly faster for a system of the size you're working with. You can create the MLPotential using something along these lines:
>
> potential = MLPotential('nequip', modelPath='model.pth', lengthScale=0.1, energyScale=4.184)
>
> What speed-up did you observe in your LAMMPS simulations compared to OpenMM/OpenMM-ML?

@JMorado
I went ahead and tested out the version on the nequip branch, however I am unable to get it to run on a GPU. When I specify the potential and the platform in the following way,

potential = MLPotential("nequip",
                            modelPath='model.pth',
                            lengthScale=0.1,
                            energyScale=96.48)
...

plat = openmm.Platform.getPlatformByName("CUDA")
properties = {"Precision": "double", "DeviceIndex": "0",
              "UseBlockingSync": "false"}
simulation = app.Simulation(topology, system, integrator, plat, properties)

I get the following set of warnings and errors:

/home/svarner/miniconda3/envs/practicum/lib/python3.11/site-packages/torchani/aev.py:16: UserWarning: cuaev not installed
  warnings.warn("cuaev not installed")
/home/svarner/miniconda3/envs/practicum/lib/python3.11/site-packages/nequip/scripts/deploy.py:138: UserWarning: Models deployed before v0.6.0 don't contain information about their default_dtype or model_dtype; assuming the old default of float32 for both, but this might not be right if you had explicitly set default_dtype=float64.
  warnings.warn(
/home/svarner/miniconda3/envs/practicum/lib/python3.11/site-packages/nequip/utils/_global_options.py:59: UserWarning: !! Upstream issues in PyTorch versions >1.11 have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. At present we *strongly* recommend the use of PyTorch 1.11 if using CUDA devices; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
  warnings.warn(
/home/svarner/miniconda3/envs/practicum/lib/python3.11/site-packages/nequip/utils/_global_options.py:70: UserWarning: Setting the GLOBAL value for jit fusion strategy to `[('DYNAMIC', 3)]` which is different than the previous value of `[('STATIC', 2), ('DYNAMIC', 10)]`
  warnings.warn(
Traceback (most recent call last):
  File "/home/svarner/Practicum/sim.py", line 174, in <module>
    run(1,1,1,1,1)
  File "/home/svarner/Practicum/sim.py", line 145, in run
    simulation = app.Simulation(topology, system, integrator, plat, properties)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/svarner/miniconda3/envs/practicum/lib/python3.11/site-packages/openmm/app/simulation.py", line 106, in __init__
    self.context = mm.Context(self.system, self.integrator, platform, platformProperties)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/svarner/miniconda3/envs/practicum/lib/python3.11/site-packages/openmm/openmm.py", line 12171, in __init__
    _openmm.Context_swiginit(self, _openmm.new_Context(*args))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
openmm.OpenMMException: Specified a Platform for a Context which does not support all required kernels

Here is my mamba list:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
ase                       3.22.1             pyhd8ed1ab_1    conda-forge
blinker                   1.8.2              pyhd8ed1ab_0    conda-forge
brotli                    1.1.0                hd590300_1    conda-forge
brotli-bin                1.1.0                hd590300_1    conda-forge
brotli-python             1.1.0           py311hb755f60_1    conda-forge
bzip2                     1.0.8                hd590300_5    conda-forge
c-ares                    1.28.1               hd590300_0    conda-forge
ca-certificates           2024.2.2             hbcca054_0    conda-forge
cached-property           1.5.2                hd8ed1ab_1    conda-forge
cached_property           1.5.2              pyha770c72_1    conda-forge
certifi                   2024.2.2           pyhd8ed1ab_0    conda-forge
charset-normalizer        3.3.2              pyhd8ed1ab_0    conda-forge
click                     8.1.7           unix_pyh707e725_0    conda-forge
contourpy                 1.2.1           py311h9547e67_0    conda-forge
cudatoolkit               11.5.2              hbdc67f6_13    conda-forge
cycler                    0.12.1             pyhd8ed1ab_0    conda-forge
e3nn                      0.5.1                    pypi_0    pypi
filelock                  3.14.0             pyhd8ed1ab_0    conda-forge
flask                     3.0.3              pyhd8ed1ab_0    conda-forge
fonttools                 4.51.0          py311h459d7ec_0    conda-forge
freetype                  2.12.1               h267a509_2    conda-forge
fsspec                    2024.3.1           pyhca7485f_0    conda-forge
gmp                       6.3.0                h59595ed_1    conda-forge
gmpy2                     2.1.5           py311he48d604_0    conda-forge
h5py                      3.11.0          nompi_py311hebc2b07_100    conda-forge
hdf5                      1.14.3          nompi_h4f84152_101    conda-forge
idna                      3.7                pyhd8ed1ab_0    conda-forge
importlib-metadata        7.1.0              pyha770c72_0    conda-forge
importlib_metadata        7.1.0                hd8ed1ab_0    conda-forge
itsdangerous              2.2.0              pyhd8ed1ab_0    conda-forge
jinja2                    3.1.3              pyhd8ed1ab_0    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
kiwisolver                1.4.5           py311h9547e67_1    conda-forge
krb5                      1.21.2               h659d440_0    conda-forge
lark-parser               0.12.0             pyhd8ed1ab_0    conda-forge
lcms2                     2.16                 hb7c19ff_0    conda-forge
ld_impl_linux-64          2.40                 h55db66e_0    conda-forge
lerc                      4.0.0                h27087fc_0    conda-forge
libabseil                 20230802.1      cxx17_h59595ed_0    conda-forge
libaec                    1.1.3                h59595ed_0    conda-forge
libblas                   3.9.0           22_linux64_openblas    conda-forge
libbrotlicommon           1.1.0                hd590300_1    conda-forge
libbrotlidec              1.1.0                hd590300_1    conda-forge
libbrotlienc              1.1.0                hd590300_1    conda-forge
libcblas                  3.9.0           22_linux64_openblas    conda-forge
libcurl                   8.7.1                hca28451_0    conda-forge
libdeflate                1.20                 hd590300_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 hd590300_2    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0               h77fa898_7    conda-forge
libgfortran-ng            13.2.0               h69a702a_7    conda-forge
libgfortran5              13.2.0               hca663fb_7    conda-forge
libgomp                   13.2.0               h77fa898_7    conda-forge
libjpeg-turbo             3.0.0                hd590300_1    conda-forge
liblapack                 3.9.0           22_linux64_openblas    conda-forge
libnghttp2                1.58.0               h47da74e_1    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libopenblas               0.3.27          pthreads_h413a1c8_0    conda-forge
libpng                    1.6.43               h2797004_0    conda-forge
libprotobuf               4.25.1               hf27288f_2    conda-forge
libsqlite                 3.45.3               h2797004_0    conda-forge
libssh2                   1.11.0               h0841786_0    conda-forge
libstdcxx-ng              13.2.0               hc0a3c3a_7    conda-forge
libtiff                   4.6.0                h1dd3fc0_3    conda-forge
libtorch                  2.1.2           cpu_generic_ha017de0_3    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libuv                     1.48.0               hd590300_0    conda-forge
libwebp-base              1.4.0                hd590300_0    conda-forge
libxcb                    1.15                 h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
markupsafe                2.1.5           py311h459d7ec_0    conda-forge
matplotlib-base           3.8.4           py311h54ef318_0    conda-forge
mpc                       1.3.1                hfe3b2da_0    conda-forge
mpfr                      4.2.1                h9458935_1    conda-forge
mpmath                    1.3.0              pyhd8ed1ab_0    conda-forge
munkres                   1.1.4              pyh9f0ad1d_0    conda-forge
ncurses                   6.4.20240210         h59595ed_0    conda-forge
nequip                    0.6.0                    pypi_0    pypi
networkx                  3.3                pyhd8ed1ab_1    conda-forge
nnpops                    0.6             cpu_py311h7697b17_7    conda-forge
nomkl                     1.0                  h5ca1d4c_0    conda-forge
numpy                     1.26.4          py311h64a7726_0    conda-forge
ocl-icd                   2.3.2                hd590300_1    conda-forge
ocl-icd-system            1.0.0                         1    conda-forge
openjpeg                  2.5.2                h488ebb8_0    conda-forge
openmm                    8.1.1           py311h28d7ac7_1    conda-forge
openmm-torch              1.4             cpu_py311h446247e_4    conda-forge
openmmml                  1.1                      pypi_0    pypi
openssl                   3.3.0                hd590300_0    conda-forge
opt-einsum                3.3.0                    pypi_0    pypi
opt-einsum-fx             0.1.4                    pypi_0    pypi
packaging                 24.0               pyhd8ed1ab_0    conda-forge
pillow                    10.3.0          py311h18e6fac_0    conda-forge
pip                       24.0               pyhd8ed1ab_0    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
pyparsing                 3.1.2              pyhd8ed1ab_0    conda-forge
pysocks                   1.7.1              pyha2e5f31_6    conda-forge
python                    3.11.9          hb806964_0_cpython    conda-forge
python-dateutil           2.9.0              pyhd8ed1ab_0    conda-forge
python_abi                3.11                    4_cp311    conda-forge
pytorch                   2.1.2           cpu_generic_py311h1584bb0_3    conda-forge
pyyaml                    6.0.1                    pypi_0    pypi
readline                  8.2                  h8228510_1    conda-forge
requests                  2.31.0             pyhd8ed1ab_0    conda-forge
scipy                     1.13.0          py311h517d4fd_1    conda-forge
setuptools                65.3.0             pyhd8ed1ab_1    conda-forge
setuptools-scm            6.3.2              pyhd8ed1ab_0    conda-forge
setuptools_scm            6.3.2                hd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sleef                     3.5.1                h9b69904_2    conda-forge
sympy                     1.12            pypyh9d50eac_103    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
torch-ema                 0.3                      pypi_0    pypi
torch-runstats            0.2.0                    pypi_0    pypi
torchani                  2.2.4           cpu_py311h12a0d1d_3    conda-forge
tqdm                      4.66.4                   pypi_0    pypi
typing_extensions         4.11.0             pyha770c72_0    conda-forge
tzdata                    2024a                h0c530f3_0    conda-forge
urllib3                   2.2.1              pyhd8ed1ab_0    conda-forge
werkzeug                  3.0.3              pyhd8ed1ab_0    conda-forge
wheel                     0.43.0             pyhd8ed1ab_1    conda-forge
xorg-libxau               1.0.11               hd590300_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zipp                      3.17.0             pyhd8ed1ab_0    conda-forge
zstd                      1.5.6                ha6fb4c9_0    conda-forge

If I don't specify any platform, then the simulation runs, but extremely slowly since it is on CPU.

Thank you so much in advance!

Best,
Sam

@peastman
Member

peastman commented May 8, 2024

That means a plugin couldn't be loaded. Try printing the value of Platform.getPluginLoadFailures(). It will tell you which ones failed, and what the errors were.

Usually it's because some library they depended on couldn't be found, and it can be fixed by adding the directory containing the library to LD_LIBRARY_PATH.

@svarner9

svarner9 commented May 8, 2024

Thank you for the quick response!

I tried that based on some previous replies of yours that I found. I ran the following:

print(pluginLoadedLibNames)
print(Platform.getPluginLoadFailures())

and the output was:

('/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMPME.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMCPU.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMCUDA.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMOpenCL.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMRPMDCUDA.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMDrudeCUDA.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMAmoebaCUDA.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMRPMDOpenCL.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMTorchOpenCL.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMDrudeOpenCL.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMAmoebaOpenCL.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMRPMDReference.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMTorchReference.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMDrudeReference.so', '/home/svarner/miniconda3/envs/practicum/lib/plugins/libOpenMMAmoebaReference.so')

()

The failures command returned an empty tuple.

Best,
Sam

@peastman
Member

peastman commented May 8, 2024

The versions of PyTorch and OpenMM-Torch you have installed are CPU only:

openmm-torch              1.4             cpu_py311h446247e_4    conda-forge
pytorch                   2.1.2           cpu_generic_py311h1584bb0_3    conda-forge

That might be because you have an older version of cudatoolkit:

cudatoolkit               11.5.2              hbdc67f6_13    conda-forge

If you upgrade it to 11.8, you might be able to get it to install the CUDA version of PyTorch. Conda installation issues like this tend to be frustrating and hard to figure out. They often depend on the precise order you install packages in.
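
The check described above can be sketched as a small parser over `conda list` output. It assumes the columns are name, version, build, and channel, as in the lines quoted, and that CPU-only builds carry a build string starting with `cpu`. That convention is inferred from these conda-forge entries and may not hold for other channels. The function name `cpu_only_packages` is hypothetical.

```python
def cpu_only_packages(conda_list_text):
    """Return names of packages whose build string starts with 'cpu'."""
    cpu_only = []
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Skip blank lines, comment lines, and rows without a build column.
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        name, _version, build = fields[:3]
        if build.lower().startswith("cpu"):
            cpu_only.append(name)
    return cpu_only


# The three lines quoted in the comment above.
listing = """\
openmm-torch              1.4             cpu_py311h446247e_4    conda-forge
pytorch                   2.1.2           cpu_generic_py311h1584bb0_3    conda-forge
cudatoolkit               11.5.2              hbdc67f6_13    conda-forge
"""
print(cpu_only_packages(listing))
```

At runtime, `torch.cuda.is_available()` gives a complementary check on the PyTorch side.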

@svarner9

svarner9 commented May 8, 2024

> The versions of PyTorch and OpenMM-Torch you have installed are CPU only:
>
> openmm-torch              1.4             cpu_py311h446247e_4    conda-forge
> pytorch                   2.1.2           cpu_generic_py311h1584bb0_3    conda-forge
>
> That might be because you have an older version of cudatoolkit:
>
> cudatoolkit               11.5.2              hbdc67f6_13    conda-forge
>
> If you upgrade it to 11.8, you might be able to get it to install the CUDA version of PyTorch. Conda installation issues like this tend to be frustrating and hard to figure out. They often depend on the precise order you install packages in.

Ahhh I see. Thank you!

I went ahead and uninstalled openmm-torch and pytorch. I upgraded the cudatoolkit, and then installed the cuda version of pytorch:

install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia

Installing openmm-torch downgraded it back to the cpu version, but then installing nnpops upgraded it back to the cuda version. I agree, conda installations are very frustrating.

It is working on GPU now, but I am only getting about 0.2 ns/day, whereas with LAMMPS I was getting 1.5 ns/day. To your knowledge, could any of the following warnings be related to the slow performance?

/home/svarner/miniconda3/envs/practicum/lib/python3.11/site-packages/nequip/scripts/deploy.py:138: UserWarning: Models deployed before v0.6.0 don't contain information about their default_dtype or model_dtype; assuming the old default of float32 for both, but this might not be right if you had explicitly set default_dtype=float64.
  warnings.warn(
/home/svarner/miniconda3/envs/practicum/lib/python3.11/site-packages/nequip/utils/_global_options.py:59: UserWarning: !! Upstream issues in PyTorch versions >1.11 have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. At present we *strongly* recommend the use of PyTorch 1.11 if using CUDA devices; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
  warnings.warn(
/home/svarner/miniconda3/envs/practicum/lib/python3.11/site-packages/nequip/utils/_global_options.py:70: UserWarning: Setting the GLOBAL value for jit fusion strategy to `[('DYNAMIC', 3)]` which is different than the previous value of `[('STATIC', 2), ('DYNAMIC', 10)]`
  warnings.warn(

I tried to install the packages in a way that would let me use pytorch 1.11.0 (which, according to the warning, is the most stable version with nequip). However, as far as I can tell, there is no way to use pytorch 1.11.0 with openmm-torch: every time I installed openmm-torch, it installed pytorch 2.1.2.

This is the order that I did everything:

mamba create -n env
mamba activate env
mamba install python=3.10
mamba install -c conda-forge openmm cudatoolkit=11.8
pip install git+https://github.com/mir-group/nequip@develop
pip install git+https://github.com/sef43/openmm-ml@nequip
mamba install pytorch=1.11 pytorch-cuda=11.8 -c pytorch -c nvidia
mamba install -c conda-forge openmm-torch nnpops

@JMorado
Contributor

JMorado commented May 8, 2024

Many thanks for the thorough review, @peastman! Most of it should be now resolved.

Thanks for testing, @svarner9. I think the slow performance you're seeing is not related to that warning, the underlying issue of which is described here. You could test if the issue that underlies that warning is indeed present by identifying a slowdown in performance over time. I ran some performance benchmarks on systems much smaller than yours and did not see any decrease in performance over time, and the simulation speed is around what I would expect.

If that is your baseline OpenMM performance, I wonder what could be causing that. Do you remember by any chance what was the performance you were getting with the previous neighbor list? Does anyone have any ideas about whether it's possible to improve performance here?
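
One way to look for a slowdown over time is to time fixed-size blocks of steps and compare the throughput of the first and last blocks. The sketch below is generic Python. The function names are hypothetical, and `fake_step` is only a stand-in so the example runs on its own. In a real test, `step_fn` would be the `step` method of an existing OpenMM `Simulation` object.

```python
import time


def ns_per_day(n_steps, timestep_fs, wall_seconds):
    """Convert a timed block of MD steps into a throughput in ns/day."""
    simulated_ns = n_steps * timestep_fs * 1e-6  # 1 fs = 1e-6 ns
    return simulated_ns * 86400.0 / wall_seconds


def time_blocks(step_fn, n_blocks, steps_per_block, timestep_fs):
    """Run n_blocks blocks of steps and return the ns/day measured for each."""
    rates = []
    for _ in range(n_blocks):
        start = time.perf_counter()
        step_fn(steps_per_block)
        elapsed = time.perf_counter() - start
        rates.append(ns_per_day(steps_per_block, timestep_fs, elapsed))
    return rates


def slowdown_fraction(rates):
    """Fractional drop from the first block to the last; positive means slower."""
    return 1.0 - rates[-1] / rates[0]


def fake_step(n_steps):
    """Stand-in for simulation.step(n_steps): sleeps about 1 ms per step."""
    time.sleep(0.001 * n_steps)


rates = time_blocks(fake_step, n_blocks=3, steps_per_block=5, timestep_fs=1.0)
print(rates, slowdown_fraction(rates))
```

A `slowdown_fraction` that grows steadily as more blocks are run would point to the degradation described in the warning, whereas a flat but low rate would point elsewhere.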

@svarner9

svarner9 commented May 8, 2024

Yes many thanks @peastman for the help!

@JMorado I am not sure, but I can think of a few things that might be the issue. I am not an expert and have not looked through the code, so these might be a bit naive.

  1. In LAMMPS the nequip pairstyle works with Kokkos, so in that case I was using 1 gpu + 32 cpus.
mpiexec -n 1 ./lmp -in in.script -k on g 1 t 32 -sf kk -pk kokkos newton on neigh full
  2. The LAMMPS nequip pairstyle uses libtorch instead of pytorch, which could make a difference?
  3. When reading in the model, is the cutoff set to the cutoff of the MLP? Most of them have very short cutoffs of around 5 Angstroms, so if that cutoff is not being used for neighborlists, then that could be leading to slow performance. Is that something that should be set separately?
  4. I am getting this warning for jit but I am not sure if it is important or could be affecting performance. I have seen the NequIP devs say that it can usually be silently ignored.
/home/svarner/miniconda3/envs/practicum/lib/python3.10/site-packages/nequip/utils/_global_options.py:70: UserWarning: Setting the GLOBAL value for jit fusion strategy to `[('DYNAMIC', 3)]` which is different than the previous value of `[('STATIC', 2), ('DYNAMIC', 10)]`
  warnings.warn(

Best,
Sam
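
To illustrate why the cutoff matters for cost, here is a minimal brute-force neighbor list for an orthorhombic periodic box. This is not the code in this PR, which computes neighbor lists with PyTorch. It is a plain-Python sketch under the minimum image convention, with a hypothetical function name and made-up coordinates, and its only purpose is to show how the number of pairs grows with the cutoff.

```python
def neighbor_pairs(positions, box_lengths, cutoff):
    """Return (i, j) pairs with i < j that are closer than cutoff under PBC.

    Uses the minimum image convention, so the cutoff must not exceed half of
    the shortest box length.
    """
    if cutoff > 0.5 * min(box_lengths):
        raise ValueError("cutoff must be <= half the shortest box length")
    cutoff_sq = cutoff * cutoff
    pairs = []
    n_atoms = len(positions)
    for i in range(n_atoms):
        for j in range(i + 1, n_atoms):
            dist_sq = 0.0
            for axis in range(3):
                d = positions[i][axis] - positions[j][axis]
                # Wrap the displacement to the nearest periodic image.
                d -= box_lengths[axis] * round(d / box_lengths[axis])
                dist_sq += d * d
            if dist_sq < cutoff_sq:
                pairs.append((i, j))
    return pairs


# Made-up coordinates in a 10 x 10 x 10 box (arbitrary length units).
positions = [(0.5, 0.0, 0.0), (9.8, 0.0, 0.0), (5.0, 0.0, 0.0)]
box = (10.0, 10.0, 10.0)
print(neighbor_pairs(positions, box, cutoff=1.0))
print(neighbor_pairs(positions, box, cutoff=5.0))
```

The number of pairs, and with it the work done by the model, grows roughly with the cube of the cutoff at fixed density, which is why a neighbor list built with a cutoff larger than the model's would be wasteful.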

@Linux-cpp-lisp

> Is there an option to predict a formation energy instead of total energy, or to subtract off per-atom mean energies? That leads to a much smaller output value and better accuracy.

We actually do this internally, at least from develop onward: single-precision calculations are done in a more numerically favorable range, and the final energy scalings, shiftings, and sums are done in float64, regardless of the precision of the weights. The final predictions you get should be float64, and if they aren't, something might be off.

Regarding the reproducibility of energies between ASE and OpenMM: as a sanity check that this is just numerics, you can try turning off TF32 or, even better, using a fully F64 model (default_dtype: float64 and model_dtype: float64).
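
The benefit of working in a smaller numerical range can be seen with a toy calculation. The sketch below is not NequIP's implementation. It rounds Python floats to float32 with the standard `struct` module and uses made-up energies to show that a 0.001 difference disappears at a magnitude of 100000 in single precision but survives once a reference energy has been subtracted in double precision.

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical numbers, chosen only to illustrate the effect.
reference = -100000.0    # e.g. a sum of per-atom reference energies
energy_a = -100000.000   # total energy of configuration A
energy_b = -100000.001   # total energy of configuration B

# Stored directly in float32, the two totals become indistinguishable.
same_total = to_float32(energy_a) == to_float32(energy_b)

# Shifted in float64 first, the small difference survives in float32.
shifted_a = to_float32(energy_a - reference)
shifted_b = to_float32(energy_b - reference)

print(same_total, shifted_a, shifted_b)
```

The spacing between adjacent float32 values near 100000 is about 0.008, which is why the unshifted totals collapse to the same number.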

@Linux-cpp-lisp

@svarner9 a few questions on performance:

  • What are the actual LAMMPS vs OpenMM numbers? Not sure where they were in this thread.
  • Yes, there will be additional Python and doubled neighborlist overhead in OpenMM, both of which are absent in pair_allegro. This should be more important for smaller models and smaller systems.
  • You can safely ignore that particular warning about the fusion strategy; it is just there to ensure that nequip never silently sets global state when called from someone else's program.

@peastman
Member

There shouldn't be any overhead from Python. The model gets compiled to torchscript, and the simulation gets run by C++ code.

@Linux-cpp-lisp

Do you call TorchScript from Python here, or directly from C++? Not that I would expect a roundtrip through Python to matter much, just curious.

@peastman
Member

It's called directly from C++.

@JMorado
Contributor

JMorado commented May 14, 2024

@peastman @Linux-cpp-lisp, I've trained a model with these settings:

default_dtype: float64
model_dtype: float64
allow_tf32: true  

and the energy and force differences between ASE and OpenMM are indeed very small, on the order of $10^{-10}$, when combined with {"Precision": "double"} in the simulation settings.
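
A comparison like this needs both sets of forces in the same units. ASE works in eV and Å by default, while OpenMM reports kJ/mol and nm. The sketch below is not the script used for the numbers above. It assumes forces are available as plain nested lists, and the function names are hypothetical.

```python
EV_IN_KJ_PER_MOL = 96.48533212  # 1 eV expressed in kJ/mol
ANGSTROM_PER_NM = 10.0


def forces_to_ev_per_angstrom(forces_kj_mol_nm):
    """Convert forces from kJ/mol/nm (OpenMM) to eV/Angstrom (ASE)."""
    factor = 1.0 / (EV_IN_KJ_PER_MOL * ANGSTROM_PER_NM)
    return [[component * factor for component in force]
            for force in forces_kj_mol_nm]


def max_abs_difference(forces_a, forces_b):
    """Largest absolute component-wise difference between two force sets."""
    return max(abs(x - y)
               for fa, fb in zip(forces_a, forces_b)
               for x, y in zip(fa, fb))


# Made-up values: one atom, with the OpenMM force equal to 1 eV/Angstrom in x.
openmm_forces = [[964.8533212, 0.0, 0.0]]
ase_forces = [[1.0, 0.0, 0.0]]
print(max_abs_difference(forces_to_ev_per_angstrom(openmm_forces), ase_forces))
```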

@Linux-cpp-lisp

@JMorado great!

(Note that allow_tf32: true is a no-op when model_dtype: float64 and we should probably error on this configuration, but that doesn't change the results.)

@svarner9

> @svarner9 a few questions on performance:
>
>   • What are the actual LAMMPS vs OpenMM numbers? Not sure where they were in this thread.
>   • Yes, there will be additional Python and doubled neighborlist overhead in OpenMM, both of which are absent in pair_allegro. This should be more important for smaller models and smaller systems.
>   • You can ignore that particular warning about the fusion strategy safely, it is just there to ensure that nequip never silently sets global state when called from someone else's program

@Linux-cpp-lisp I was getting 1.5 ns/day on lammps and 0.2 ns/day on openmm for a system with 645 atoms.
