Molecular input #234

simonsung06 · 2023-07-01T04:03:38Z

Adds a MolecularInput feature with the ability to define molecular feature types, and introduces a new kernel type, TanimotoKernel. The implementations here are designed to work with SingleTaskGP. This is analogous to the previous PR #194, except that it is a more robust implementation covering just the MolecularInput feature. Categorical molecular inputs are planned for a separate PR, #210.
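
For readers unfamiliar with the kernel, here is a minimal pure-Python sketch of the Tanimoto (Jaccard) similarity on binary fingerprints. It is not the TanimotoKernel added in this PR, which operates on tensors inside the GP; the function names and the toy fingerprints are made up for illustration.

```python
from typing import List, Sequence


def tanimoto(x: Sequence[int], y: Sequence[int]) -> float:
    """Tanimoto similarity <x,y> / (|x|^2 + |y|^2 - <x,y>) for two bit vectors."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # Convention assumed here: two all-zero fingerprints count as identical.
    return 1.0 if denom == 0 else dot / denom


def tanimoto_matrix(fps: Sequence[Sequence[int]]) -> List[List[float]]:
    """Pairwise similarity matrix, i.e. what a kernel would evaluate on the data."""
    return [[tanimoto(a, b) for b in fps] for a in fps]


# Toy 6-bit fingerprints (hypothetical, not produced by RDKit).
fingerprints = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0],
]
K = tanimoto_matrix(fingerprints)
```

The first two fingerprints share two of four distinct on-bits, so their similarity is 0.5, while the third shares none with either and gets 0.0.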

jduerholt

Hi Simon,

thank you very much. Overall it looks good to me.

Besides the inline comments, I have one main point:

If I remember correctly, we agreed to keep the SMILES out of MolecularInput and put them into CategoricalMolecularInput, together with things such as get_bounds, so that MolecularInput can be used just for predictions and not for optimization. I think the molfeatures attribute can then also be removed from MolecularInput.

In addition, we can think of replacing the MolecularEncodingEnum with Molfeatures.

We can set up a call to discuss this in more detail.

Best and thanks,

Johannes

bofire/data_models/molfeatures/molfeatures.py

bofire/data_models/molfeatures/types.py

bofire/data_models/enum.py

bofire/data_models/surrogates/single_task_gp.py

bofire/data_models/surrogates/botorch.py

bofire/kernels/fingerprint_kernels/base_fingerprint_kernel.py

bofire/utils/cheminformatics.py

bofire/data_models/features/molecular.py

simonsung06 · 2023-07-10T08:00:20Z

Hi Johannes. Adjustments have been made based on your suggestions above and on our private conversations. The notable points are:

Enums are no longer used to define the desired molecular features. Instead, the individual MolFeatures classes should be instantiated with the desired parameters and then passed via input_processing_specs.
MolecularInput therefore no longer needs smiles or molfeatures to be instantiated.
A separate data model, TanimotoGPSurrogate, has been created that defaults to the ideal kernel and scaler. It still maps to the SingleTaskGPSurrogate surrogate though.
MolFeatures have been slightly redesigned to generate descriptor names on the fly, as suggested.
Docstrings for the TanimotoKernel have been updated to the correct style.
Note: Bag of characters is currently not enabled because the number of generated features can vary, i.e. test data containing unseen molecules can have a different number of features from the training set, so predictions cannot be made on the test data. I do not know of a way around this issue at the moment, so I am open to ideas/corrections.
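
The bag-of-characters problem can be reproduced without any chemistry library. Below is a minimal sketch, assuming the featurizer builds its vocabulary from whatever SMILES it is handed; `bag_of_characters` is a hypothetical helper and not the BoFire implementation.

```python
from collections import Counter
from typing import Dict, List


def bag_of_characters(smiles: List[str]) -> Dict[str, List[int]]:
    """Count characters per SMILES; one column per character seen in *this* batch."""
    vocabulary = sorted(set("".join(smiles)))
    counts = [Counter(s) for s in smiles]
    # Counter returns 0 for characters a given molecule does not contain.
    return {ch: [c[ch] for c in counts] for ch in vocabulary}


# Hypothetical training and test molecules.
train = bag_of_characters(["CCO", "CC(=O)O"])
test = bag_of_characters(["CCN", "c1ccccc1"])

train_columns = list(train)  # columns derived from the training batch
test_columns = list(test)    # columns derived from the test batch
```

Here the training batch yields five columns and the test batch four, with different meanings, so a model fitted on the first feature matrix cannot consume the second.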

jduerholt · 2023-07-10T11:43:42Z

Hi Simon, thank you very much. I will have a look in the next few days ;)

jduerholt · 2023-07-10T20:36:35Z

Can you just close (resolve) the comments that you addressed? This makes it easier to follow ;)

bofire/data_models/features/molecular.py

jduerholt

Hi Simon, thank you very much! It is almost done. In addition to the inline comments, can you also create tests for the functionality in molfeatures, and then include it in specs and the respective serialization and deserialization tests?

Furthermore, I think some merge conflicts need to be resolved.

jduerholt · 2023-07-14T11:47:51Z

bofire/data_models/domain/features.py

-        ]
+        # next check that only Categoricalwithdescriptor have the value DESCRIPTOR or are of type MolFeatures
+        descriptor_keys = []
+        for key, value in specs.items():


I am not sure: does this also raise an error if one assigns a molfeatures transform to a categoricaldescriptorinput?

Can you then also write tests for the additions in this method?

Normally the validation of input_processing_specs would have caught what you describe, I think. Nonetheless, _validate_transform_specs has been improved to make sure it catches these cases too. Compared with my earlier version, the checks for CategoricalEncodingEnum.DESCRIPTOR and for MolFeatures are now separate, which avoids a bug that could occur when several categorical inputs each have a different mistake in their transform type. Furthermore, MolecularInputs now require a MolFeatures in the transform specs. Hopefully that's fine with you too.
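
The checks described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in bofire/data_models/domain/features.py: every class below is a local stand-in that only borrows the BoFire names, and `validate_transform_specs` is a hypothetical free function.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Union


class CategoricalEncodingEnum(Enum):  # stand-in for the BoFire enum
    ONE_HOT = "ONE_HOT"
    DESCRIPTOR = "DESCRIPTOR"


@dataclass
class MolFeatures:  # stand-in for a molecular featurizer spec
    name: str = "fingerprints"


@dataclass
class CategoricalInput:  # stand-in
    key: str


@dataclass
class CategoricalDescriptorInput(CategoricalInput):  # stand-in
    pass


@dataclass
class MolecularInput:  # stand-in
    key: str


Spec = Union[CategoricalEncodingEnum, MolFeatures]
Feature = Union[CategoricalInput, MolecularInput]


def validate_transform_specs(features: List[Feature], specs: Dict[str, Spec]) -> None:
    by_key = {f.key: f for f in features}
    unknown = set(specs) - set(by_key)
    if unknown:
        raise ValueError(f"Unknown feature keys in specs: {sorted(unknown)}")
    # Check 1: DESCRIPTOR is only valid for descriptor-carrying categoricals.
    for key, spec in specs.items():
        if spec == CategoricalEncodingEnum.DESCRIPTOR and not isinstance(
            by_key[key], CategoricalDescriptorInput
        ):
            raise ValueError(f"DESCRIPTOR is not allowed for feature '{key}'")
    # Check 2 (kept separate from check 1): MolFeatures only for molecular inputs.
    for key, spec in specs.items():
        if isinstance(spec, MolFeatures) and not isinstance(by_key[key], MolecularInput):
            raise ValueError(f"MolFeatures is not allowed for feature '{key}'")
    # Check 3: every molecular input needs a MolFeatures transform.
    for feat in features:
        if isinstance(feat, MolecularInput) and not isinstance(
            specs.get(feat.key), MolFeatures
        ):
            raise ValueError(f"MolecularInput '{feat.key}' requires a MolFeatures transform")


features = [CategoricalDescriptorInput(key="solvent"), MolecularInput(key="ligand")]
validate_transform_specs(
    features,
    {"solvent": CategoricalEncodingEnum.DESCRIPTOR, "ligand": MolFeatures()},
)
```

Under this sketch, assigning a `MolFeatures` transform to the `CategoricalDescriptorInput` raises a `ValueError` in the second check, which is the case asked about above.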

jduerholt · 2023-07-14T12:08:32Z

tests/bofire/data_models/specs/surrogates.py

Can you also create a new specs thingy for the Molfeatures data models and put it also into the serialization and deserialization tests?

jduerholt · 2023-07-14T12:11:50Z

bofire/data_models/domain/features.py

@@ -348,6 +350,16 @@ def _get_transform_info(
                    [f"{feat.key}{_CAT_SEP}{d}" for d in feat.descriptors]
                )
                counter += len(feat.descriptors)
+            elif isinstance(specs[feat.key], MolFeatures):


Can you also include this in the tests?

Added MolecularInput to test_inputs_get_transform_info
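
The bookkeeping that the new MolFeatures branch extends can be sketched as below: each feature claims a contiguous block of column indices, and column names are built from the feature key plus a descriptor name. This is a simplified illustration, not the actual `_get_transform_info`; the value of `_CAT_SEP`, the function signature, and the descriptor names are assumptions.

```python
from typing import Dict, List, Tuple

_CAT_SEP = "_"  # separator between feature key and descriptor name (assumed value)


def get_transform_info(
    descriptor_names: Dict[str, List[str]],
) -> Tuple[Dict[str, Tuple[int, ...]], Dict[str, List[str]]]:
    """Map each feature key to its column indices and column names after encoding."""
    features2idx: Dict[str, Tuple[int, ...]] = {}
    features2names: Dict[str, List[str]] = {}
    counter = 0
    for key, names in descriptor_names.items():
        # Each feature occupies the next len(names) columns.
        features2idx[key] = tuple(range(counter, counter + len(names)))
        features2names[key] = [f"{key}{_CAT_SEP}{name}" for name in names]
        counter += len(names)
    return features2idx, features2names


# Hypothetical example: a solvent with two descriptors and a molecule
# encoded by a 3-bit fingerprint.
idx, names = get_transform_info(
    {
        "solvent": ["polarity", "boiling_point"],
        "molecule": ["fingerprint_0", "fingerprint_1", "fingerprint_2"],
    }
)
```

A test along the lines of test_inputs_get_transform_info would then compare the returned indices and names against the expected layout for a domain that contains a molecular input.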

jduerholt · 2023-07-14T12:12:14Z

bofire/data_models/domain/features.py

@@ -383,6 +395,9 @@ def transform(
            elif specs[feat.key] == CategoricalEncodingEnum.DESCRIPTOR:
                assert isinstance(feat, CategoricalDescriptorInput)
                transformed.append(feat.to_descriptor_encoding(s))
+            elif isinstance(specs[feat.key], MolFeatures):


Can you also include this in the tests?

Testing for this can be found in test_inputs_transform_molecular. This only tests the transform in the forward direction. This is kept separate from test_inputs_transform for now because the inverse transform for molecular inputs is not implemented yet.

simonsung06 · 2023-07-16T04:22:53Z

Apologies for those merge conflicts. I probably should have worked more quickly on this... Mind if we resolve them together in a call?

jduerholt · 2023-07-17T19:56:06Z

Hi Simon, nothing to be sorry about. It is not your fault; it is because it always took me so long to work on it. I am currently on vacation, but I will try to resolve them some evening. Or is it super urgent?

simonsung06 · 2023-07-18T05:37:40Z

Hi Johannes. Not urgent. Enjoy your vacation ;)

jduerholt · 2023-08-01T14:47:44Z

@simonsung06: should be ok now from my perspective. Just have a look ;)

simonsung06 · 2023-08-02T01:55:03Z

Thanks! Looks good 👍

Simon added 7 commits May 24, 2023 07:43

working. only tests remaining

fff267b

updated botorch validator and black

8f40ee7

tests

cab7db2

tests complete

050a526

general fixes and more tests

4121c59

added skipif lines related to no rdkit to tests

29356d2

rdkit-related skips to tests

6df508b

simonsung06 requested a review from jduerholt July 1, 2023 07:25

jduerholt requested changes Jul 5, 2023

jduerholt requested changes Jul 6, 2023

Simon Sung added 3 commits July 8, 2023 23:53

refactored

dfe64f2

refactored

208af95

Clean up

225ee76

jduerholt reviewed Jul 10, 2023

bofire/data_models/features/molecular.py (outdated, resolved)

removed descriptor_values for MolecularInput

e252160

jduerholt requested changes Jul 14, 2023

Simon Sung added 3 commits July 16, 2023 10:05

added tests and improved validate transform_specs

2114de5

more robust tests against rdkit

6c6d8cd

more robust tests against rdkit

bc62b87

jduerholt added 5 commits August 1, 2023 15:22

Merge branch 'main' into basic_molecular_input

fca6a94

fix test

06f4844

set default descriptors for tanimoto

fd12a81

fix issue with annotated

7d49979

fix bug

189fcae

jduerholt added 3 commits August 1, 2023 15:55

skip tests if rdkit not available

781f0dd

remove dependency on rdkit in molfeatures

22441fa

take test into account

857be19

jduerholt added 2 commits August 1, 2023 16:57

fix pyright

1254648

fix pyright2

ed83a3c

jduerholt added 3 commits August 2, 2023 16:18

fix test

a139854

update list of names

49013c3

bump rdkit dependency

c734620

jduerholt self-requested a review August 2, 2023 17:22

jduerholt approved these changes Aug 2, 2023

jduerholt merged commit e313881 into main Aug 2, 2023
10 checks passed

simonsung06 deleted the basic_molecular_input branch August 25, 2023 07:34
