Development of easy-to-use, reproducible ML scripts for chemistry.
Currently we are using three datasets from MoleculeNet.ai: Lipophilicity, FreeSolv, and ESOL, and two datasets retrieved from tutorials: LogP14k and jak2-pIC50.
Our program supports random forest (RF), gradient descent boosting (GDB), support vector machines (SVM), AdaBoost, and k-nearest neighbors (KNN).
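For illustration, here is a minimal sketch of how these algorithm labels could map onto scikit-learn regressors. The dictionary, function name, and label strings are hypothetical and only stand in for the selection logic inside our MLModels class.

```python
# Hypothetical sketch: mapping algorithm labels to scikit-learn regressors.
# The actual selection logic lives in the MLModels class; names here are illustrative.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

ALGORITHMS = {
    "rf": RandomForestRegressor,
    "gdb": GradientBoostingRegressor,
    "svm": SVR,
    "ada": AdaBoostRegressor,
    "knn": KNeighborsRegressor,
}

def make_model(name: str, **params):
    """Return an untrained regressor for the requested algorithm label."""
    return ALGORITHMS[name.lower()](**params)

model = make_model("rf", n_estimators=100, random_state=42)
```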
We should all be using the same conda environment so that we do not run into the issue of "Well, it works on my machine." To do this, we will host a .yml file for the shared environment in our repo (mlapp.yml).
- Create a conda virtual environment from the mlapp.yml file: `conda env create -f mlapp.yml`
- Update the virtual environment as necessary using `conda install`
- Update the mlapp.yml file using `conda env export --no-builds --from-history > mlapp.yml`. Make sure that you add the mlapp.yml file to git if it is not already being tracked. Note: some packages, such as descriptastorus, cannot be installed from conda; in that case you may need to use pip to install from a GitHub link. See the descriptastorus entry in mlapp.yml for an example of how to account for this:
  - pip:
    - "git+git://github.com/bp-kelley/descriptastorus.git#egg=descriptastorus"
- Commit your changes, which include the mlapp.yml file: `git commit -m "your commit message here"`
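After creating and activating the environment, a quick sanity check like the one below can confirm that the key packages resolve. The package list is an assumption based on the tools mentioned in this README, not an exhaustive requirements list.

```python
# Hypothetical sanity check: confirm key packages import inside the mlapp environment.
# The package list is an assumption based on the tools mentioned in this README.
import importlib

for pkg in ("sklearn", "rdkit", "descriptastorus", "pandas", "numpy"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as err:
        print(f"{pkg}: MISSING ({err})")
```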
This is an overview of our MLModels Python class functions. main.py essentially just runs this workflow iteratively with different input algorithms, data sets, and featurization methods.
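As a rough sketch of that driver loop: only the class name MLModels and the idea of iterating over algorithms, data sets, and featurizations come from this README; the import path, constructor arguments, and method names below are assumptions.

```python
# Rough sketch of the main.py driver loop; the import path and MLModels
# constructor/method names are assumptions, not the repo's actual API.
from itertools import product

from models import MLModels  # assumed module path

algorithms = ["rf", "gdb", "svm", "ada", "knn"]
datasets = ["lipophilicity", "esol", "freesolv", "logp14k", "jak2_pic50"]
featurizations = ["rdkit2d", "morgan"]

for algorithm, dataset, featurization in product(algorithms, datasets, featurizations):
    run = MLModels(algorithm=algorithm, dataset=dataset, featurization=featurization)
    run.featurize()   # compute descriptors/fingerprints for the chosen data set
    run.train()       # fit the chosen algorithm
    run.analyze()     # collect metrics and plots
    run.store()       # write results to disk
```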