Hierarchical Augmented Multi-Task Multi-Modal classifiER for fingerprint-based SMILES classifications 🔨

earth-metabolome-initiative/hammer

🔨 Hammer

Hammer is a Hierarchical Augmented Multi-modal Multi-task classifiER that, given a SMILES string as input, computes selected fingerprints and predicts its associated taxonomical ranking.

The classifier can employ a diverse set of molecular fingerprints and descriptors to capture the unique features of the SMILES and enable accurate predictions across multiple tasks.

Furthermore, the model can be beheaded (that is, stripped of its output layers) and used either as a feature extractor or as a pre-trained model for transfer learning on other tasks. This package also provides tooling to extract and visualize all of the features used in the model, which can be used to train other models or to perform downstream analyses. If you intend to use this model for transfer learning, take care not to include in your test set any SMILES used for training this model, to avoid biasing your evaluations!
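
As a minimal sketch of the precaution above, the following snippet drops from a candidate test set every SMILES that also appears in the training data. The function name and the toy SMILES lists are hypothetical, and the comparison is a plain string match: the same molecule can be written as several different SMILES, so in practice you would likely canonicalize both lists first (for example with RDKit) before comparing.

```python
def remove_training_overlap(candidate_smiles, training_smiles):
    """Return the candidates that do not appear in the training set.

    Assumption: SMILES are compared as exact strings after stripping
    whitespace. Canonicalize beforehand if your sources differ in notation.
    """
    seen = {smiles.strip() for smiles in training_smiles}
    return [smiles for smiles in candidate_smiles if smiles.strip() not in seen]


# Toy data, for illustration only.
training = ["CCO", "c1ccccc1"]
candidates = ["CCO", "CC(=O)O", " c1ccccc1 "]

clean_test_set = remove_training_overlap(candidates, training)
print(clean_test_set)
```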

Installation

This library will be available to install via pip, but for now you can install it by cloning the repository and running the following command:

pip install .

Command line interface and usage

While the package can be entirely used as a library, it also provides a command line interface that can be used to perform a variety of tasks and reproduce the experiments that we have conducted or design new ones.

In the following sections, we describe the usage of the command line interface of the Hammer package. These commands are readily available after installing the package; no additional setup is required.

Feature visualization

To visualize the features used in the model using PCA and t-SNE, you can run the following command:

hammer visualize --verbose --dataset NPC --output-directory "data_visualizations" --image-format "png"

This will generate a set of plots that show the distribution of the features used in the model. The plots will be saved in the data_visualizations directory in the png format. You can change the output directory and image format by changing the --output-directory and --image-format arguments, respectively. The resulting plots will look like the following (this one illustrates the t-SNE and PCA decomposition of the Topological Torsion 1024 bits):

Topological Torsion (1024 bits)

It is also possible to visualize specific feature sets, for example the MAP4 features, by using the --include-map4 argument:

hammer visualize --verbose \
    --dataset NPC \
    --include-map4 \
    --output-directory "data_visualizations" \
    --image-format "png"

hammer visualize --verbose \
    --dataset GNPS \
    --output-directory "data_visualizations/gnps" \
    --image-format "png"

DAG Coverage

One of the goals of this project is to increase, over time and with the help of the community, the overall number of pathways, superclasses, and classes that the model can predict. The model employs a DAG as a form of static attention that harmonizes the predictions of the different tasks. At this time, the dataset we are using DOES NOT cover all of the combinations of pathways, superclasses and classes that the DAG allows for. We aim to increase the coverage of the DAG over time, and we welcome contributions to the dataset that can help us achieve this goal. We are starting out from the dataset made available by NP Classifier.

You can compute a summary of the coverage of the DAG using the following command:

hammer dag-coverage --dataset NPC --verbose
hammer dag-coverage --dataset NPCHarmonized --verbose

At the time of writing, the coverage of the DAG is as follows:

| Dataset | Layer | Coverage |
|---|---|---|
| NPC | pathways | 1 |
| NPC | superclasses | 0.922078 |
| NPC | classes | 0.936782 |
| NPC | DAG | 0.730228 |
| NPCHarmonized | pathways | 1 |
| NPCHarmonized | superclasses | 0.948052 |
| NPCHarmonized | classes | 0.95546 |
| NPCHarmonized | DAG | 0.813651 |
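
To make the difference between layer coverage and DAG coverage concrete, here is a small sketch on a toy DAG. It assumes that layer coverage is the fraction of the labels of a layer that appear in the dataset, and that DAG coverage is the fraction of the allowed pathway, superclass and class combinations that appear in it. This definition, the toy labels and all function names are assumptions made for illustration; the `hammer dag-coverage` command may compute coverage differently.

```python
# Toy DAG (made-up labels): pathway -> superclasses -> classes.
PATHWAY_TO_SUPERCLASSES = {"P1": ["S1", "S2"], "P2": ["S2"]}
SUPERCLASS_TO_CLASSES = {"S1": ["C1"], "S2": ["C2", "C3"]}


def allowed_paths():
    """Enumerate every (pathway, superclass, class) path the DAG allows."""
    return {
        (pathway, superclass, klass)
        for pathway, superclasses in PATHWAY_TO_SUPERCLASSES.items()
        for superclass in superclasses
        for klass in SUPERCLASS_TO_CLASSES[superclass]
    }


def coverage(observed_paths):
    """Return the per-layer and whole-DAG coverage of the observed paths."""
    paths = allowed_paths()
    observed = set(observed_paths) & paths
    report = {}
    for index, layer in enumerate(("pathways", "superclasses", "classes")):
        all_labels = {path[index] for path in paths}
        seen_labels = {path[index] for path in observed}
        report[layer] = len(seen_labels) / len(all_labels)
    report["DAG"] = len(observed) / len(paths)
    return report


# Paths annotated in a hypothetical dataset.
dataset_paths = [("P1", "S1", "C1"), ("P2", "S2", "C2"), ("P1", "S2", "C2")]
report = coverage(dataset_paths)
print(report)
```

In this toy case every pathway and superclass is covered, yet only 3 of the 5 allowed paths are observed, which mirrors how the DAG coverage in the table can be noticeably lower than the coverage of each individual layer.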

Feature sets evaluation

To evaluate the feature sets used in the model, you can run the following command. This will evaluate each feature set over 10 holdouts. The performance for all holdouts and all considered features will be saved in the CSV file given by --performance-path, while the barplots will be saved in the directory given by --barplot-directory.

The dataset is first split into training and test sets with a stratified split by the rarest class; the training set is then split into sub-training and validation sets with as many stratified Monte Carlo splits as requested by --holdouts. The test set is not touched during this evaluation process, as we will use it to evaluate the model over the selected feature set.
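
The splitting scheme described above can be sketched in plain Python as follows. This is an illustration under stated assumptions and not the package's implementation: each multi-label sample is assigned to the stratum of its rarest label, and all function names, the rounding rule and the seed handling are hypothetical.

```python
import random
from collections import Counter, defaultdict


def rarest_label(labels, counts):
    """Pick the label of a multi-label sample that is rarest in the dataset."""
    return min(labels, key=lambda label: (counts[label], label))


def stratified_split(indices, strata, fraction, rng):
    """Split indices so each stratum sends about `fraction` to the second part."""
    groups = defaultdict(list)
    for index in indices:
        groups[strata[index]].append(index)
    first, second = [], []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        cut = round(len(members) * fraction)
        second.extend(members[:cut])
        first.extend(members[cut:])
    return first, second


def make_holdouts(sample_labels, holdouts, test_size, validation_size, seed=0):
    """Return the fixed test indices and `holdouts` (sub-train, validation) splits."""
    rng = random.Random(seed)
    counts = Counter(label for labels in sample_labels for label in labels)
    strata = [rarest_label(labels, counts) for labels in sample_labels]
    # A single test split, set aside and never used during the evaluation.
    train, test = stratified_split(range(len(sample_labels)), strata, test_size, rng)
    # Independent Monte Carlo re-splits of the remaining training samples.
    splits = [
        stratified_split(train, strata, validation_size, rng)
        for _ in range(holdouts)
    ]
    return test, splits


# Toy multi-label dataset: label "B" is rarer than label "A".
sample_labels = [["A"]] * 10 + [["A", "B"]] * 10
test, splits = make_holdouts(sample_labels, holdouts=3, test_size=0.2, validation_size=0.2)
print(len(test), [(len(sub), len(val)) for sub, val in splits])
```

Unlike k-fold cross-validation, Monte Carlo splits are drawn independently, so a sample may land in the validation set of several holdouts or of none.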

The model used for these evaluations is the same Hammer model that is used for the predictions, changing only the number of input feature sets.

hammer feature-sets-evaluation \
    --verbose \
    --holdouts 10 \
    --dataset NPC \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "performance/feature_sets_evaluation.csv" \
    --training-directory "training/feature_selection" \
    --barplot-directory "barplots/feature_sets_evaluation"

hammer feature-sets-evaluation \
    --verbose \
    --holdouts 10 \
    --dataset NPCHarmonized \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "performance/feature_sets_evaluation_harmonized.csv" \
    --training-directory "training/feature_selection_harmonized" \
    --barplot-directory "barplots/feature_sets_evaluation_harmonized"

hammer holdouts \
    --verbose \
    --holdouts 10 \
    --dataset GNPS \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "performance/feature_sets_evaluation_gnps.csv" \
    --training-directory "training/feature_selection_gnps" \
    --barplot-directory "barplots/feature_sets_evaluation_gnps"

Executing this command will generate the barplots you can find in this directory. The following barplot shows the AUPRC for each class and each feature set, on the validation and test subsets, averaged over all holdouts:

In the following table, we illustrate the mean and standard deviation of the validation AUPRC for the different feature sets.

| Feature Set | V2 Mean | V2 Std | V1 Mean | V1 Std | V0 Mean | V0 Std |
|---|---|---|---|---|---|---|
| Atom Pair (2048b) | 0.937478 | 0.031032 | 0.927321 | 0.009292 | 0.857813 | 0.002735 |
| Auto-Correlation | 0.842348 | 0.039879 | 0.869845 | 0.021085 | 0.811050 | 0.004208 |
| Avalon (2048b) | 0.944524 | 0.023624 | 0.919810 | 0.044258 | 0.900124 | 0.003449 |
| Extended Connectivity (2r, 2048b) | 0.952927 | 0.001753 | 0.935591 | 0.002339 | 0.884863 (r=1) | 0.003651 (r=1) |
| Functional Groups | 0.577002 | 0.015732 | 0.572862 | 0.019137 | 0.589513 | 0.015186 |
| Ghose-Crippen | 0.642189 | 0.020406 | 0.629073 | 0.021072 | 0.659410 | 0.001983 |
| Laggner | 0.812406 | 0.010464 | 0.805593 | 0.014127 | 0.759160 | 0.012833 |
| Layered (2048b) | 0.946703 | 0.002667 | 0.929109 | 0.013407 | 0.898108 | 0.003496 |
| Lingo (1024b) | 0.927148 | 0.001787 | 0.915304 | 0.002246 | 0.837630 | 0.002075 |
| MACCS | 0.849523 | 0.016972 | 0.848236 | 0.015830 | 0.810575 | 0.005415 |
| MAP4 | 0.945194 | 0.002993 | 0.934615 | 0.002147 | 0.855033 | 0.005111 |
| MinHashed (2r, 2048b) | 0.939163 | 0.001409 | 0.928356 | 0.002958 | 0.839570 | 0.009128 |
| Molecular Quantum Numbers | 0.617027 | 0.024393 | 0.614444 | 0.029922 | 0.672846 | 0.008695 |
| Pattern (2048b) | 0.934383 | 0.036935 | 0.892704 | 0.043349 | 0.893990 | 0.005862 |
| PubChem | 0.936316 | 0.005425 | 0.911609 | 0.025582 | 0.885968 | 0.003264 |
| RDKit (2048b) | 0.942030 | 0.002838 | 0.927816 | 0.003440 | 0.871136 | 0.008087 |
| SMILES Extended Connectivity (1r, 2048b) | 0.881445 | 0.004091 | 0.870877 | 0.007103 | 0.827976 | 0.004601 |
| Topological Torsion (1024b) | 0.946040 | 0.002092 | 0.931260 | 0.002015 | 0.863581 | 0.002356 |
| Van Der Waals Surface Area | 0.769931 | 0.084615 | 0.817110 | 0.063946 | 0.795765 | 0.006069 |

AUPRC barplot
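
For readers unfamiliar with the metric, the sketch below shows one common way to estimate the AUPRC of a single class, namely average precision, and how a mean and standard deviation over holdouts can be derived from it. It is an assumption that this matches the estimator and the (sample) standard deviation behind the table above; the labels and scores are toy values for illustration only.

```python
from statistics import mean, stdev


def average_precision(y_true, y_score):
    """Average precision: mean of the precision measured at each true positive."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("at least one positive label is required")
    # Rank the samples from the highest to the lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits, total = 0, 0.0
    for rank, index in enumerate(order, start=1):
        if y_true[index]:
            hits += 1
            total += hits / rank
    return total / positives


# Toy holdouts: (binary labels, predicted scores) for one class.
holdouts = [
    ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    ([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.2]),
]
scores = [average_precision(labels, predictions) for labels, predictions in holdouts]
print(f"mean={mean(scores):.4f} std={stdev(scores):.4f}")
```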

It is also possible to run the feature-sets-evaluation on a subset of features:

hammer feature-sets-evaluation \
    --verbose \
    --holdouts 5 \
    --dataset NPC \
    --include-map4 \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "performance/map4_feature_evaluation.csv" \
    --training-directory "training/map4_feature" \
    --barplot-directory "barplots/map4_feature_evaluation"

Feature sets synergy

After having evaluated the feature sets for a given dataset, the question of how the feature sets interact with each other remains open. It may very well be that the performance of the model is not simply the sum of the performance of the individual feature sets, but that there is a synergy between them. Conversely, extending the input space with redundant features may actually decrease the performance of the model by excessively increasing the dimensionality of the input space, making the model more prone to overfitting.

This approach fixes a subset of the feature sets as the base feature sets, and then iterates over all of the low-dimensionality feature sets (fewer than 1024 features), adding them one by one to the base feature sets. The performance of the model is then evaluated on the validation set and saved in the CSV file given by --performance-path, while the barplots are saved in the directory given by --barplot-directory.

We pick the Layered feature set as the base feature set: it is nearly the best-performing feature set and, unlike the Avalon fingerprints, for which there is no paper, we know exactly how it is computed.

hammer feature-sets-synergy \
    --verbose \
    --holdouts 10 \
    --dataset NPC \
    --base-feature-sets "extended_connectivity" \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "performance/synergy/extended_connectivity.csv" \
    --training-directory "trainings/synergy/extended_connectivity" \
    --barplot-directory "barplots/synergy/extended_connectivity"

For the NPC dataset, the secondary feature set that best combines synergy with the base feature sets (validation AUPRC) and a small feature size is the Van Der Waals Surface Area: its mean validation AUPRC is within one standard deviation of the best one (Auto-Correlation), with 47 features instead of 192, as illustrated in the following table and barplot:

| Feature Set + Layered | Mean | Std | Feature size |
|---|---|---|---|
| Auto-Correlation | 0.914921 | 0.002188 | 192 |
| Functional Groups | 0.905355 | 0.004740 | 85 |
| Ghose-Crippen | 0.905958 | 0.004123 | 110 |
| Laggner | 0.907647 | 0.004016 | 307 |
| MACCS | 0.906761 | 0.010752 | 166 |
| Molecular Quantum Numbers | 0.909889 | 0.003788 | 42 |
| PubChem | 0.913653 | 0.003556 | 881 |
| Van Der Waals Surface Area | 0.914309 | 0.004208 | 47 |

Synergy barplot
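
One way to make the trade-off between validation AUPRC and feature size explicit is sketched below, using the values from the table above. The selection rule (keep the candidates whose mean is within one standard deviation of the best mean, then pick the smallest) is an assumption made for illustration and is not necessarily the criterion the authors applied.

```python
# Feature set -> (mean validation AUPRC, std, feature size), from the table above.
RESULTS = {
    "Auto-Correlation": (0.914921, 0.002188, 192),
    "Functional Groups": (0.905355, 0.004740, 85),
    "Ghose-Crippen": (0.905958, 0.004123, 110),
    "Laggner": (0.907647, 0.004016, 307),
    "MACCS": (0.906761, 0.010752, 166),
    "Molecular Quantum Numbers": (0.909889, 0.003788, 42),
    "PubChem": (0.913653, 0.003556, 881),
    "Van Der Waals Surface Area": (0.914309, 0.004208, 47),
}


def pick_compact_feature_set(results):
    """Pick the smallest feature set scoring within one std of the best mean."""
    # Tuples compare by their first element, so this finds the best mean.
    best_mean, best_std, _ = max(results.values())
    threshold = best_mean - best_std
    eligible = {
        name: size
        for name, (score, _, size) in results.items()
        if score >= threshold
    }
    return min(eligible, key=eligible.get)


choice = pick_compact_feature_set(RESULTS)
print(choice)
```

Under this rule, Auto-Correlation, PubChem and Van Der Waals Surface Area are all eligible, and the last one is retained because it is the smallest of the three.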

We can now proceed to identify the tertiary feature set that is most synergistic with the base feature sets and the secondary feature set. We cannot simply pick the next secondary feature set that is most synergistic with the base feature sets, as this would not take into account the interaction between the secondary and tertiary feature sets, and the increased dimensionality of the input space. We need to evaluate the performance of the model on the validation set for all possible combinations of the base, secondary and tertiary feature sets, and select the one that has the best performance, if there is still an improvement in the performance of the model.

hammer feature-sets-synergy \
    --verbose \
    --holdouts 10 \
    --dataset NPC \
    --base-feature-sets "layered" "van_der_waals_surface_area" \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "tertiary_feature_sets_synergy_training.csv" \
    --training-directory "tertiary_feature_sets_synergy_training" \
    --barplot-directory "tertiary_feature_sets_synergy_barplots"
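
The iterative procedure described in this section amounts to a greedy forward selection, which can be sketched as follows. The `evaluate` callable is a hypothetical stand-in for a full `hammer feature-sets-synergy` run that returns the mean validation AUPRC of a combination of feature sets; the toy scores are made up to show the stopping condition and are not measured results.

```python
def forward_selection(base, candidates, evaluate):
    """Greedily add the candidate that most improves the score; stop when none does."""
    selected = list(base)
    best_score = evaluate(selected)
    remaining = list(candidates)
    while remaining:
        # Re-evaluate every remaining candidate together with what is already
        # selected, so interactions and added dimensionality are accounted for.
        score, winner = max((evaluate(selected + [name]), name) for name in remaining)
        if score <= best_score:
            break
        selected.append(winner)
        remaining.remove(winner)
        best_score = score
    return selected, best_score


# Toy scores (illustrative only), keyed by the combination of feature sets.
TOY_SCORES = {
    frozenset({"base"}): 0.90,
    frozenset({"base", "a"}): 0.91,
    frozenset({"base", "b"}): 0.92,
    frozenset({"base", "a", "b"}): 0.915,
}


def toy_evaluate(feature_sets):
    """Hypothetical evaluator: look up the score of a combination."""
    return TOY_SCORES[frozenset(feature_sets)]


selected, score = forward_selection(["base"], ["a", "b"], toy_evaluate)
print(selected, score)
```

In the toy example, "a" is the second-best secondary feature set, yet adding it on top of "b" lowers the score, so the selection stops after the first addition.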

Train a model variant

hammer train \
    --verbose \
    --dataset NPCHarmonized \
    --include-extended-connectivity \
    --include-layered \
    --include-topological-torsion \
    --include-van-der-waals-surface-area \
    --include-molecular-quantum-numbers \
    --test-size 0.2 \
    --training-directory "npc.harmonized.v2.tar.gz"

Predict

You can run predictions for a single SMILES using the following command:

hammer predict \
    --input "CN1[C@H]2CC[C@@H]1[C@@H](C(OC)=O)[C@@H](OC(C3=CC=CC=C3)=O)C2" \
    --version npc.harmonized.v1

which will output:

SMILES: CN1[C@H]2CC[C@@H]1[C@@H](C(OC)=O)[C@@H](OC(C3=CC=CC=C3)=O)C2
Alkaloids (0.9942)
└── Ornithine alkaloids (0.9988)
    └── Tropane alkaloids (0.9999)

Analogously, by running the following command for a multi-class compound:

hammer predict \
    --input "CCC(C)C1NC(=O)C(Cc2ccccc2)N(C)C(=O)C(C(C)CC)N2C(=O)C(CCC2OC)NC(=O)C(CCCN=C(N)N)NC(=O)C(NC(=O)C(CO)OS(=O)(=O)O)C(C)OC1=O" \
    --version npc.harmonized.v1

you will get the following output:

SMILES: CCC(C)C1NC(=O)C(Cc2ccccc2)N(C)C(=O)C(C(C)CC)N2C(=O)C(CCC2OC)NC(=O)C(CCCN=C(N)N)NC(=O)C(NC(=O)C(CO)OS(=O)(=O)O)C(C)OC1=O
Amino acids and Peptides (0.9807)
└── Oligopeptides (0.9994)
Polyketides (0.9821)
└── Oligopeptides (0.9994)
    ├── Cyclic peptides (0.9999)
    ├── Depsipeptides (0.9998)
    └── Ahp-containing cyclodepsipeptides (0.9385)

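If you need the printed tree in a machine-readable form, the following sketch parses the textual output into (depth, label, score) tuples. It assumes the layout shown in the examples above, with four characters of indentation per level and the score in parentheses at the end of each line; the function name is hypothetical, and for anything beyond quick inspection the files written with --output-dir are likely a more robust source.

```python
import re

# Tree-drawing prefix, label, then the score in parentheses.
LINE = re.compile(r"^([\s│├└─]*)(.+?) \((\d+(?:\.\d+)?)\)$")


def parse_prediction_tree(text):
    """Turn the printed tree into a list of (depth, label, score) tuples."""
    rows = []
    for line in text.splitlines():
        if line.startswith("SMILES:"):
            continue  # Header line, not part of the tree.
        match = LINE.match(line)
        if match is None:
            continue
        prefix, label, score = match.groups()
        # Assumption: each nesting level adds four characters of prefix.
        rows.append((len(prefix) // 4, label, float(score)))
    return rows


OUTPUT = """SMILES: CN1[C@H]2CC[C@@H]1[C@@H](C(OC)=O)[C@@H](OC(C3=CC=CC=C3)=O)C2
Alkaloids (0.9942)
└── Ornithine alkaloids (0.9988)
    └── Tropane alkaloids (0.9999)"""

rows = parse_prediction_tree(OUTPUT)
for depth, label, score in rows:
    print(depth, label, score)
```
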
You can also run predictions for SMILES from a CSV, TSV or SSV file:

hammer predict \
    --input "divergent_npc_entries/divergent_pathways.csv" \
    --version npc.harmonized.v1 \
    --verbose \
    --output-dir "divergent_npc_entries/npc.harmonized.v1/"

It is even possible to run predictions directly on the SMILES stored in the metadata of an MGF file. In that case, you must specify that the prediction should run on the MGF metadata, and not on the spectra themselves, by using the --only-smiles flag:

hammer predict \
    --input "matchms.mgf" \
    --model-path "npc.harmonized.v2.tar.gz" \
    --verbose \
    --output-dir "matchms_predictions" \
    --only-smiles
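
To illustrate what "the SMILES in the MGF metadata" refers to, here is a minimal sketch that collects the SMILES fields from the text of an MGF file. It is not the parser used by Hammer: the function name is hypothetical, and it assumes that each spectrum sits between BEGIN IONS and END IONS lines and stores its structure under a `SMILES=` key, which is a common convention but may differ in your files.

```python
def smiles_from_mgf(lines):
    """Yield the SMILES metadata value found in each BEGIN IONS / END IONS block."""
    inside = False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            inside = True
        elif line == "END IONS":
            inside = False
        elif inside and "=" in line:
            # Split on the first "=" only: SMILES use "=" for double bonds.
            key, value = line.split("=", 1)
            if key.strip().upper() == "SMILES" and value.strip():
                yield value.strip()


# Toy MGF content, for illustration only.
MGF_TEXT = """BEGIN IONS
PEPMASS=61.03
SMILES=CC(=O)O
43.02 100.0
END IONS
BEGIN IONS
PEPMASS=47.05
SMILES=CCO
31.02 100.0
END IONS
"""

found = list(smiles_from_mgf(MGF_TEXT.splitlines()))
print(found)
```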

Citation

If you use this model in your research, please cite us:

[TODO: we still need to properly publish the model, so this is a placeholder and will be updated in the future]

@software{hammer,
  author = {Cappelletti, Luca and others},
  title = {Hammer: Hierarchical Augmented Multi-modal Multi-task classifiER},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/LucaCappelletti94/hammer}},
}

Contributing

If you want to contribute to this project, please read the CONTRIBUTING file for details on our code of conduct, and the process for submitting pull requests to us.

License

This project is licensed under the MIT License - see the LICENSE file for details.
