Sort & Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints (ECFPs)
Code repository for the paper Sort & Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints.
This repository contains:
- A simple, self-contained and fast function to transform RDKit mol objects into vectorial extended-connectivity fingerprints (ECFPs) via Sort & Slice (using only NumPy and RDKit).
- The code base and data sets to fully reproduce the computational results from the paper.
- Original numerical results from the experiments conducted in the paper.
- The function in sort_and_slice_ecfp_featuriser.py constitutes a computationally efficient, easy-to-use and self-contained method to create a featuriser that can transform RDKit mol objects into vectorial ECFPs based on substructure pooling via Sort & Slice (rather than via classical hash-based folding).
- It only relies on RDKit and NumPy and can be readily employed for molecular feature extraction and other ECFP-based applications.
- This function should be all you need in case you want to employ vectorial Sort & Slice ECFPs.
- An extensive series of strict computational experiments indicates that ECFPs pooled via Sort & Slice regularly lead to higher (and sometimes substantially higher) predictive performance than ECFPs pooled via classical hash-based folding across a wide variety of molecular property prediction scenarios.
EXAMPLE:
First select a training set of RDKit mol objects
mols_train = [mol_1, mol_2, ...]
that should be used to calibrate the Sort & Slice operator. This training set can then be employed along with a set of desired ECFP hyperparameter settings to construct a molecular featurisation function:
ecfp_featuriser = construct_sort_and_slice_ecfp_featuriser(mols_train = mols_train,
max_radius = 2,
pharm_atom_invs = False,
bond_invs = True,
chirality = False,
sub_counts = True,
vec_dimension = 1024)
Then ecfp_featuriser(mol) is a 1-dimensional numpy array of length vec_dimension representing the vectorial ECFP for mol pooled via a Sort & Slice operator calibrated on mols_train.
More specifically, the function ecfp_featuriser can be thought of as
- first generating the (multi)set of integer ECFP-substructure identifiers for mol based on the ECFP hyperparameters (max_radius, pharm_atom_invs, bond_invs, chirality, sub_counts) and then
- vectorising this (multi)set via a Sort & Slice operator calibrated on mols_train with output dimension vec_dimension (rather than vectorising it via classical hash-based folding).
To now turn any list of RDKit mol objects mols_list into a feature matrix X whose rows correspond to vectorial Sort & Slice ECFPs one can simply run
X = np.array([ecfp_featuriser(mol) for mol in mols_list])
- The computational experiments from the paper can be reproduced and visualised using the Jupyter notebook substructure_pooling_experiments.ipynb which provides an easy way to interface with the code base in modules.
- The data folder contains the five (cleaned) chemical data sets for the diverse set of prediction tasks investigated in the paper: lipophilicity, aqueous solubility, SARS-CoV-2 main protease inhibition, mutagenicity, and estrogen receptor alpha antagonism. Each data set is given as a set of labelled SMILES strings.
- The computational environment in which the original results were conducted is given in environment.yml.
- The original numerical results from the paper can be found in results and are additionally backed up in results_original. If new computational results are generated via the Jupyter notebook, then by default they are only saved in (and overwrite the content of) results.