Author : Lewis Mervin, lhm30@cam.ac.uk
Supervisor : Dr. A. Bender
Protein target prediction using Random Forests (RFs) trained on bioactivity data from PubChem (extracted 07/06/18) and ChEMBL (version 24), implemented with RDKit and Scikit-learn. The models employ a modification of the reliability-density neighbourhood Applicability Domain (AD) analysis by Aniceto et al. [1]. This project is the successor to PIDGIN version 1 [2] and PIDGIN version 2 [3]. Target prediction with extended NCBI pathway and DisGeNET disease enrichment calculation is available as implemented in [4].
- Molecular Descriptors: 2048-bit RDKit Extended Connectivity FingerPrints (ECFP) [5]
- Algorithm: Random Forests with dynamic number of trees (see docs for details), class weight = 'balanced', sample weight = ratio Inactive:Active
- Models generated at four different activity cut-offs: 100μM, 10μM, 1μM and 0.1μM
- Models generated both with and without mapping to orthologues, as implemented in [3]
- Pathway information from NCBI BioSystems
- Disease information from DisGeNET
- Target/pathway/disease enrichment calculated using Fisher's exact test and the Chi-squared test
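The enrichment step listed above can be illustrated with a minimal, standard-library sketch of a one-sided Fisher's exact test. This is not PIDGIN's own code: the function name `fisher_exact_greater`, the layout of the 2x2 table (query set versus background set, predicted active versus not), and the counts in the usage lines are all assumptions made for illustration.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed (hypothetical) layout for a single target:
        a = query compounds predicted active
        b = query compounds not predicted active
        c = background compounds predicted active
        d = background compounds not predicted active

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the probability of seeing at least this many actives in the
    query set by chance.
    """
    row1 = a + b          # size of the query set
    col1 = a + c          # total predicted actives
    n = a + b + c + d     # all compounds
    # Largest value the top-left cell can take given the margins
    max_a = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, max_a + 1)
    )
    return tail / comb(n, row1)


# Hypothetical counts: 12 of 20 query compounds and 30 of 200 background
# compounds are predicted active against the same target.
p_value = fisher_exact_greater(12, 8, 30, 170)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value here would suggest the target is predicted more often in the query set than in the background. In practice one such test is run per target, pathway or disease, so some multiple-testing correction is usually worth considering.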
Details for sizes across all activity cut-offs:
Statistic | Without orthologues | With orthologues |
---|---|---|
Distinct Models | 10,446 | 14,678 |
Distinct Targets [exhaustive total] | 7,075 [7,075] | 16,623 [60,437] |
Total Bioactivities over all models | 39,424,168 | 398,340,769 |
Actives | 3,204,038 | 35,009,629 |
Inactives [Of which are Sphere Exclusion (SE)] | 36,220,130 [27,435,133] | 363,331,140 [248,782,698] |
Full details on all models are provided in the uniprot_information.txt files in the orthologue and no_orthologue directories.
Development occurs on GitHub.
Documentation, installation and instructions are on ReadtheDocs.
- Use the ReadtheDocs! You MUST download the models before running!
- The program accepts either line-separated SMILES files (.smi/.smiles) or SD files (.sdf) as input
- If a SMILES input line contains data in addition to the SMILES string, the first entries after the SMILES are automatically interpreted as identifiers (see the OpenSMILES specification §4.5), although there are options to change this behaviour
- Molecules are automatically standardized when running models (can be turned off)
- Do not modify the 'pkls', 'ad_data' etc. names or directories
- Files in the examples directory are included for testing, as used in the ReadtheDocs tutorials.
- For installation and usage instructions, see the documentation.
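As a rough illustration of the SMILES input convention described above (one molecule per line, with the SMILES string followed by an optional identifier), the following sketch splits each line on whitespace. It is not PIDGIN's parser: the function name `parse_smiles_lines`, the whitespace delimiter, and the fallback identifier `mol_<index>` are assumptions, and the real program exposes its own options for delimiters and columns.

```python
def parse_smiles_lines(lines):
    """Split line-separated SMILES input into (smiles, identifier) pairs.

    Assumptions (hypothetical, for illustration only):
      - fields are separated by whitespace;
      - the first field is the SMILES string;
      - the first field after the SMILES, if any, is the identifier;
      - lines without an identifier get 'mol_<line index>'.
    """
    records = []
    for idx, line in enumerate(lines):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        smiles = fields[0]
        identifier = fields[1] if len(fields) > 1 else f"mol_{idx}"
        records.append((smiles, identifier))
    return records


# Example input: two named molecules, a blank line, and an unnamed molecule.
example = [
    "CCO ethanol",
    "CC(=O)Oc1ccccc1C(=O)O aspirin extra_column",
    "",
    "c1ccccc1",
]
for smiles, identifier in parse_smiles_lines(example):
    print(identifier, smiles)
```

Note that this sketch does not check whether the SMILES strings are chemically valid; in a real workflow that check would be left to RDKit.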
PIDGINv3 is available under the GNU General Public License v3.0 (GPLv3).
[1] | Aniceto, N, et al. A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: Reliability-density neighbourhood. J. Cheminform. 8: 69 (2016). |
[2] | Mervin, L H., et al. Target prediction utilising negative bioactivity data covering large chemical space. J. Cheminform. 7: 51 (2015). |
[3] | Mervin, L H., et al. Orthologue chemical space and its influence on target prediction. Bioinformatics. 34: 72–79 (2018). |
[4] | Mervin, L H., et al. Understanding Cytotoxicity and Cytostaticity in a High-Throughput Screening Collection. ACS Chem. Biol. 11: 11 (2016). |
[5] | Rogers D & Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50: 742–754 (2010). |