LillyMol is a set of Linux executables for Cheminformatics. These tools are built on a high performance C++ library for Cheminformatics.
LillyMol does only a subset of Cheminformatics tasks, but tries to do those tasks efficiently and correctly.
LillyMol has some novel approaches to substructure searching, reaction enumeration and chemical similarity. These have been developed over many years, driven by the needs of Computational and Medicinal Chemists at Lilly and elsewhere.
Recent work has focused on de novo molecule construction, and there are several tools designed to either support or complement AI-driven molecule generation.
LillyMol is fast and scalable, with modest memory requirements.
This release includes a number of C++ unit tests. All tests can be run with address sanitizer, with no problems reported.
The file Molecule_Tools/introduction.cc provides an introduction to LillyMol for anyone wishing to develop with C++.
This release includes first steps towards more extensive documentation of LillyMol, see the docs directory. More work is needed on this front. Most parts of LillyMol have been feature stable for a long time.
This release includes a python interface to LillyMol via pybind11. This first release includes most Molecule related functionality, substructure searching and reaction enumeration. In the pybind directory there are some *_test.py files that exemplify much of the current functionality. Documentation is in docs. This should be sufficient support for a great many tasks involving querying or manipulation of molecules at the connection table level.
The current roadmap for the python interface primarily involves two directions:
- Enabling gfp fingerprints for similarity calculations.
- Making existing LillyMol applications available.
There is already a Julia interface to an earlier version of LillyMol, and this release will soon be adapted to support Julia.
Note: This is significantly different from prior versions.
LillyMol is primarily developed on RedHat and Ubuntu systems.
The primary build system used for LillyMol is bazel. You might also choose to use bazelisk which makes keeping bazel up to date easier. That is strongly encouraged.
Within a GitHub CodeSpace, this worked to install bazelisk.
sudo apt install npm
sudo npm install -g @bazel/bazelisk
If you use the module system
module load bazelisk
If you are NOT building the python bindings, bazel and bazelisk are equivalent.
The software requires gcc version 10 or later. This version of LillyMol uses some fairly recent c++ features, which require a recent compiler. The software has been tested with gcc13.
If you use the module system
module load gcc10
module load bazelisk
module load git
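The compiler requirement above can be checked before starting a build. The following is a minimal sketch, not part of LillyMol: the function name gcc_major_at_least_10 is hypothetical, and it assumes the compiler to be used is the 'gcc' found on PATH.

```shell
# Return success if the given gcc version string has a major version >= 10.
gcc_major_at_least_10() {
  major=${1%%.*}              # "13.2.0" -> "13", "10" -> "10"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;  # empty or non-numeric: treat as unusable
  esac
  [ "$major" -ge 10 ]
}

# gcc -dumpversion prints the compiler version, for example "13" or "10.2.1".
if command -v gcc >/dev/null 2>&1; then
  version=$(gcc -dumpversion)
  if gcc_major_at_least_10 "$version"; then
    echo "gcc $version is recent enough"
  else
    echo "gcc $version is too old, version 10 or later is needed"
  fi
else
  echo "gcc not found on PATH"
fi
```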
Other system components that are needed
- wget
- unzip
- libz-dev
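A quick way to see whether the command line tools listed above are present is sketched below. The helper name missing_commands is hypothetical. libz-dev is a development package rather than a command, so it is not covered by this check and needs to be verified with your package manager.

```shell
# Print the names of any commands from the argument list that are not on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

missing=$(missing_commands wget unzip git)
if [ -n "$missing" ]; then
  # Unquoted on purpose so that the names are joined on one line.
  echo "Missing tools:" $missing
else
  echo "wget, unzip and git are all available"
fi
```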
If you wish to build the python bindings, you will need a recent version of python. Development was done with python3.11; other versions have not been tested, although we have no reason to believe they will not work. You will need to install:
pip install pybind11 absl-py protobuf
apt install python-dev
If you wish to use the xgboost QSAR model building tools in LillyMol, also pip install xgboost, scikit-learn, matplotlib and pandas.
Make sure that python-dev and libblas-dev are installed at system level.
sudo apt install python-dev libblas-dev
Things seem to work seamlessly in virtualenv.
Note that with the default build (below) Python bindings are not built, but 'make all' will build them.
If you have bazelisk and gcc installed, there is a reasonable possibility that issuing 'make' in the top level directory will work (but see note below about NFS filesystems).
# Inside Lilly use the private repo
git clone https://github.com/EliLillyCo/LillyMol
cd /path/to/LillyMol
make
Executables will be in bin/$(uname) and libraries in lib. More details below. There is no concept of installation prefix; everything remains in the repo.
Note that by default neither Python bindings nor Berkeley DB dependencies are built. If you wish to build either of those:
make python
make berkeleydb
or
make all
If you look at Makefile you will see that all it is doing is sequentially invoking the three scripts discussed below, with different shell variables set.
Within the src directory, the file WORKSPACE configures the build environment for bazel. If you are building python bindings, this file needs to be updated to reflect the location of your local python. The script update_bazel_configs.sh, which is called from the Makefile, does this automatically.
There is an 'install' target in the BUILD files, defined in build_deps/install.bzl. That file specifies where LillyMol executables will be installed when the 'install' target is run. Again, update_bazel_configs.sh will update this to '/path/to/LillyMol/bin/$(uname)'. Check that the update has been done correctly, and adjust it if not or if you wish to set another location.
tail build_deps/install.bzl
That file contains other mechanisms for specifying the install directory.
But remember, every time 'make' is run, that file will be automatically updated again. Remove the calls to update_bazel_configs.sh from the Makefile if needed.
There are several dependencies which could be installed on the system, which would considerably simplify the build configuration, but during development we have frequently found ourselves on machines that could not be updated to the versions we needed, or where we lacked privileges, or... So external dependencies are downloaded and managed explicitly. The preferred way of using third party software is via the Bazel Module system. Most of the external dependencies needed are handled via that mechanism. Today that includes
- absl: Google's c++ library - we use crc32c and some data structures.
- eigen: matrix operations
- googletest: Google's c++ unit tests
- protobuf: Google's Protocol Buffers
- re2: Google's regular expression library
- tbb: Threaded Building Blocks for multi-threading
- zlib: compression
The complete listing is in the file MODULE.bazel.
Other third party dependencies are downloaded and built by the
build_third_party.sh script, which will create a third_party
directory (next to src) and then download, build and install the following dependencies
- BerkeleyDb: used for key/value databases
- f2c/libf2c: there is some fortran in LillyMol.
Running 'build_third_party.sh' needs to be done once.
Note that BerkeleyDB and Python bindings are only built if requested. In Makefile you will see use of the shell variables 'BUILD_PYTHON' and 'BUILD_BDB' which, if set, enable building of these optional features. These can be set at any time.
Running build_third_party.sh may be a lengthy process. It can be re-run at any time thereafter. For those dependencies that are cloned GitHub repos, it will pull a new version and build. To download and rebuild all dependencies, remove the entire third_party directory and re-run the script. To rebuild an individual dependency, just remove it from the third_party directory and run the script again.
Note too that installing these external dependencies and running bazel may require considerable amounts of disk space. For example at the time of writing my 'third_party' directory contains 1.2GB and my bazel temporary area contains 2.2GB.
Note that .bazelrc contains a hardware restriction to quite old Intel hardware. You should update the --cxxopt settings to reflect your hardware. Using --cxxopt=-march=native --cxxopt=-mtune=native is likely what you want: build for the local hardware.
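Before editing, it can help to see which architecture flags are currently set. The sketch below lists lines mentioning -march or -mtune in a bazelrc file. The helper name show_arch_flags is hypothetical, and the exact form of the existing lines in .bazelrc is not assumed here.

```shell
# List, with line numbers, the lines of a bazelrc file that set -march or -mtune.
show_arch_flags() {
  grep -n -e '-march' -e '-mtune' "$1" || echo "no -march or -mtune settings in $1"
}

# Run from the directory that contains .bazelrc.
if [ -f .bazelrc ]; then
  show_arch_flags .bazelrc
fi

# A replacement line in bazelrc syntax, using the flags suggested above:
#   build --cxxopt=-march=native --cxxopt=-mtune=native
```

Note that executables built with -march=native may not run on machines with older processors than the build host.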
During building of external dependencies (with build_third_party.sh and if BUILD_PYTHON is set) the script update_python_in_workspace.py will examine your python installation and get information about the include path. With that info it will update WORKSPACE with new values for the 'path' attributes of the python related features.
Note that if it does not find a pybind11 installation, the build will continue, but the python related parts of the build will subsequently fail.
You can of course manually update WORKSPACE to point to your python installation. See the 'new_local_repository' sections for 'python' and 'pybind11'.
Note that we recently observed a change in how shared libraries are handled by bazel. For now, there is a .bazelversion file that freezes the bazel version until we figure out how to handle shared libraries with newer versions of bazel. Having a .bazelversion file makes use of bazelisk superfluous, but once we figure out the new shared library stuff, the bazel version will again be allowed to float via bazelisk. The current way shared libraries are handled is not ideal, and causes some undesirable behaviour in LillyMol/python.
Once the third party dependencies have been built, and WORKSPACE and install.bzl configured, LillyMol building can begin.
Bazel needs to be able to store its cache on a local disk, not NFS. When building inside Lilly, I have used --output_user_root=/node/scratch/${USER} to use local scratch storage for bazel's cache. Note that if there is a recycling policy in place for the cache, you may see unexpected outcomes. Purge the cache completely to start afresh.
If outside Lilly, the 'build_from_src.sh' script (below) will check to see if your HOME directory is on an NFS mounted file system, and if so, will specify /tmp for bazel's cache. This is almost certainly not what you want, so edit 'build_from_src.sh' to specify a local directory for --output_user_root. Again, this is only needed if you are on an NFS file system. You can also enter this value in bazel's configuration file .bazelrc.
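To find out in advance whether this applies to you, the filesystem type of your home directory can be queried. This is a minimal sketch that assumes GNU coreutils 'stat', as found on Linux. It is not necessarily how 'build_from_src.sh' performs its check, and the helper name is_nfs_type is hypothetical.

```shell
# Return success if the filesystem type name looks like NFS (nfs, nfs4, ...).
is_nfs_type() {
  case "$1" in
    nfs*) return 0 ;;
    *) return 1 ;;
  esac
}

dir=${HOME:-/tmp}
# GNU stat: -f queries the filesystem, %T prints its type in human readable form.
fstype=$(stat -f -c %T "$dir" 2>/dev/null || echo unknown)
if is_nfs_type "$fstype"; then
  echo "$dir is on NFS ($fstype): choose a local directory for --output_user_root"
else
  echo "$dir is on $fstype: no NFS workaround appears to be needed"
fi
```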
By default, bazel will use all cores available on the local machine. If needed, limit the number of cores with the --jobs option inside 'build_from_src.sh' (sorry, no command line options here).
Optionally set shell variables BUILD_BDB and BUILD_PYTHON to enable building of optional features.
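The sketch below shows one way to confirm which optional features are requested in the current shell before starting a build. It treats any non-empty value as "set", which is an assumption about how the build scripts test these variables. The helper name optional_features is hypothetical.

```shell
# Report which optional features are requested in the current environment.
# Assumption: any non-empty value counts as "set".
optional_features() {
  features=""
  if [ -n "${BUILD_PYTHON:-}" ]; then
    features="python"
  fi
  if [ -n "${BUILD_BDB:-}" ]; then
    features="${features:+$features }berkeleydb"
  fi
  echo "${features:-none}"
}

echo "optional features requested: $(optional_features)"

# To request both for one build, export them first, for example:
#   export BUILD_PYTHON=1 BUILD_BDB=1
```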
Once the bazel preconditions are set, do the build, test and install:
cd src # you might already be here
./build_from_src.sh # takes a while
The script will
1. run the C++ unit tests
2. build all executables
3. build the python bindings
4. install executables into the /path/to/LillyMol/bin/$(uname) directory (build_deps/install.bzl)
5. copy python related shared libraries to /path/to/LillyMol/lib (if BUILD_PYTHON)
6. run python unit tests (if BUILD_PYTHON)
Step 5 is done via the copy_shared_libraries.sh script. It also copies some python compiled protos. Adjust as needed.
For anyone interested in doing their own development, a typical build inside Lilly might be (change the path for test_env)
bazelisk --output_user_root=/node/scratch/${USER} \
  build \
  --jobs=8 \
  -c opt \
  --cxxopt=-DGIT_HASH=\"$(git rev-parse --short --verify HEAD)\" \
  --cxxopt=-DTODAY=\"$(date +%Y-%b-%d)\" \
  --test_env=${C3TK_DATA_PERSISTENT}=/full/path/to/LillyMolQueries \
  Molecule_Tools:all   # or some other target
Most will want to put this in a small shell script, and/or add to .bazelrc where possible.
When building for release, it is convenient to include the git hash and the date of the build in the executables. That is not necessary, omit those if not needed. Note that because the date is included with cxxopt, this will cause a daily recompile. While this is hardly desirable, the benefits are many.
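To see what those two values look like before they are passed to the compiler, they can be computed on their own. This is a small sketch using the same 'git' and 'date' invocations as the command above. The fallback value "unknown" is an addition here, not something the LillyMol build uses.

```shell
# Short git hash of HEAD, or "unknown" when not inside a git repository.
git_hash=$(git rev-parse --short --verify HEAD 2>/dev/null || echo unknown)

# Build date in year-month-day form; %b is the abbreviated month name.
today=$(date +%Y-%b-%d)

echo "GIT_HASH=${git_hash} TODAY=${today}"
```

Because TODAY changes every day, any target compiled with it is rebuilt daily, which is the recompile noted above.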
The distribution contains cmake infrastructure that is currently not functional. Within Lilly we have not been able to make it work, usually as a result of conflicting protocol buffer versions on the system. Work is ongoing to get cmake working for the public release.