Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Molecular input #234

Merged
merged 27 commits into from
Aug 2, 2023
Merged

Molecular input #234

merged 27 commits into from
Aug 2, 2023

Conversation

simonsung06
Copy link
Contributor

MolecularInput feature with the ability to define molecular feature types. Introduced a new kernel type, TanimotoKernel too. The implementations here are designed to work with SingleTaskGP. Analogous to a previous PR #194, except that this one is a more robust implementation for just the MolecularInput feature. Categorical Molecular inputs planned to be implemented in a separate PR #210.

Copy link
Contributor

@jduerholt jduerholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Simon,

thank you very much. Overall it looks good for me.

Besides the inline comments, I have one main point:

If I remember correctly, we have agreed to keep the smiles out of the MolecularInput and put it into CategoricalMolecularInput. togehter with stuff as get_bounds etc., so that MolecularInput can just be used for predictions and not for optimization. I think that also the molfeatures attribute can be removed from MolecularInput.

In addition, we can think of replacing the MolecularEncodingEnum with Molfeatures.

We can setup a call to discuss this in more detail.

Best and thanks,

Johannes

bofire/data_models/molfeatures/molfeatures.py Outdated Show resolved Hide resolved
bofire/data_models/molfeatures/types.py Outdated Show resolved Hide resolved
bofire/data_models/molfeatures/types.py Outdated Show resolved Hide resolved
bofire/data_models/molfeatures/types.py Outdated Show resolved Hide resolved
bofire/data_models/molfeatures/types.py Outdated Show resolved Hide resolved
bofire/data_models/enum.py Outdated Show resolved Hide resolved
bofire/data_models/surrogates/single_task_gp.py Outdated Show resolved Hide resolved
bofire/data_models/surrogates/botorch.py Outdated Show resolved Hide resolved
bofire/utils/cheminformatics.py Outdated Show resolved Hide resolved
bofire/data_models/features/molecular.py Outdated Show resolved Hide resolved
bofire/data_models/features/molecular.py Outdated Show resolved Hide resolved
bofire/data_models/features/molecular.py Outdated Show resolved Hide resolved
bofire/data_models/features/molecular.py Show resolved Hide resolved
@simonsung06
Copy link
Contributor Author

simonsung06 commented Jul 10, 2023

Hi Johannes. Adjustments have been made based on your suggestions above and from our conversations in private, the following are the notable points:

  1. Enums are no longer used to define the molecular feature desired. Instead, the individual MolFeatures classes should be instantiated with the desired parameters and then passed via input_processing_specs.
  2. MolecularInput therefore no longer need smiles or molfeatures to be instantiated
  3. Separate data_model for TanimotoGPSurrogate has been created to default to the ideal kernel and scaler. This will still map to the SingleTaskGPSurrogate surrogate though.
  4. MolFeatures have been slightly re-designed to generate descriptor names on the fly as suggested
  5. docstrings updated to the correct style for the TanimotoKernel
  6. Note: Bag of characters is currently not enabled because the number of generated features can vary, i.e. test data containing unseen molecules can have a different number of features to the training set. Therefore, predictions cannot be made on the test data. To my knowledge, I do not know a way around this issue at the moment so I am open to ideas/corrections.

@jduerholt
Copy link
Contributor

Hi Simon, thank you very much. I will have a look these days ;)

@jduerholt
Copy link
Contributor

Can you just close (resolve) the comments that you adressed? This makes it easier to follow ;)

Copy link
Contributor

@jduerholt jduerholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Simon, thank you very much! It is almost done. In addtion to the comments inline can you also create tests for the functionality in molfeatures and then also include it in specs and the respective serialization and deserialization tests?

Furthermore, I think also some merge conflicts needs to be resolved.

]
# next check that only Categoricalwithdescriptor have the value DESCRIPTOR or are of type MolFeatures
descriptor_keys = []
for key, value in specs.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, is this also raisig an error if one assigns a molfeatures transform to a categoricaldescriptorinput?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you then also write tests for the addtions in this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally the validate input_processing_specs would have caught what you describe I think But nonetheless, def _validate_transform_specs has been improved to make sure that it will solve these types of issues too. It has been changed from how I had it before so that checking of CategoricalEncodingEnum.DESCRIPTOR and MolFeatures is separate to avoid a bug that can occur in case of user errors that can happen when there are multiple categorical variables inputs with various mistakes in transform type. Furthermore, MolecularInputs require a MolFeatures in the transform specs. Hopefully that's fine with you too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also create a new specs thingy for the Molfeatures data models and put it also into the serialization and deserialization tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -348,6 +350,16 @@ def _get_transform_info(
[f"{feat.key}{_CAT_SEP}{d}" for d in feat.descriptors]
)
counter += len(feat.descriptors)
elif isinstance(specs[feat.key], MolFeatures):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also include this in the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added MolecularInput to test_inputs_get_transform_info

@@ -383,6 +395,9 @@ def transform(
elif specs[feat.key] == CategoricalEncodingEnum.DESCRIPTOR:
assert isinstance(feat, CategoricalDescriptorInput)
transformed.append(feat.to_descriptor_encoding(s))
elif isinstance(specs[feat.key], MolFeatures):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also include this in the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing for this can be found in test_inputs_transform_molecular. This only tests the transform in the forward direction. This is kept separate from test_inputs_transform for now because the inverse transform for molecular inputs is not implemented yet.

@simonsung06
Copy link
Contributor Author

Apologies for those merge conflicts. Probably should have worked more quickly on this... Mind if we resolve them together in a call ?

@jduerholt
Copy link
Contributor

Apologies for those merge conflicts. Probably should have worked more quickly on this... Mind if we resolve them together in a call ?

Hi Simon, nothing to be sorry. It is not your fault. It is because it took me always so long to work on it. Currently I am in vacation. But I try to resolve them some evening or is it super urgent?

@simonsung06
Copy link
Contributor Author

Apologies for those merge conflicts. Probably should have worked more quickly on this... Mind if we resolve them together in a call ?

Hi Simon, nothing to be sorry. It is not your fault. It is because it took me always so long to work on it. Currently I am in vacation. But I try to resolve them some evening or is it super urgent?

Hi Johannes. Not urgent. Enjoy your vacation ;)

@jduerholt
Copy link
Contributor

@simonsung06: should be ok now from my perspective. Just have a look ;)

@simonsung06
Copy link
Contributor Author

@simonsung06: should be ok now from my perspective. Just have a look ;)

Thanks! Looks good 👍

@jduerholt jduerholt self-requested a review August 2, 2023 17:22
@jduerholt jduerholt merged commit e313881 into main Aug 2, 2023
10 checks passed
@simonsung06 simonsung06 deleted the basic_molecular_input branch August 25, 2023 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants