
Maybe rename "Coverage" to "Rediscovery" #38

Open
sgbaird opened this issue Jul 30, 2022 · 6 comments

Comments

@sgbaird
Member

sgbaird commented Jul 30, 2022

https://www.benevolent.com/guacamol

@kjappelbaum

Aren't coverage and rediscovery different things?

  • coverage: describes the "shape" of the generated distribution (some of the "diversity" metrics in https://www.nature.com/articles/s41467-020-17755-8#Sec9 might be interesting; they originally come from the ecology literature on ecosystem diversity)
  • rediscovery: how many "known" materials did the model generate? The "knowledge" base here would typically be the training set.

@sgbaird
Member Author

sgbaird commented Aug 2, 2022

The notion of coverage came from the CDVAE paper:

Coverage (COV). Inspired by Xu et al. (2021a); Ganea et al. (2021), we define two coverage metrics, COV-R (Recall) and COV-P (Precision), to measure the similarity between ensembles of generated materials and ground truth materials in test set. Intuitively, COV-R measures the percentage of ground truth materials being correctly predicted, and COV-P measures the percentage of predicted materials having high quality (details in Appendix G).
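For reference, here is a rough sketch of what threshold-based recall/precision coverage looks like. It assumes featurized materials, a pairwise distance, and a cutoff (the names `gen_feats`, `test_feats`, and `threshold` are illustrative); it is not the CDVAE implementation, whose details are in their Appendix G.

```python
# Minimal sketch of recall/precision-style coverage (in the spirit of
# COV-R / COV-P), assuming featurized materials and a distance cutoff.
# Illustrative only, not the CDVAE implementation.
import numpy as np
from scipy.spatial.distance import cdist

def coverage(gen_feats, test_feats, threshold=0.5):
    """Return (recall-like, precision-like) coverage fractions."""
    d = cdist(test_feats, gen_feats)  # (n_test, n_gen) pairwise distances
    cov_r = (d.min(axis=1) < threshold).mean()  # test materials near >= 1 generated material
    cov_p = (d.min(axis=0) < threshold).mean()  # generated materials near >= 1 test material
    return cov_r, cov_p
```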

"Rediscovery" based on the word itself seems applicable since the metric implemented in matbench-genmetrics is the match rate between a held-out test set and the generated materials "how many 'known' materials did the model generate" as you mentioned. However, guacamol only uses this in a goal-directed setting whereas matbench-genmetrics (right now) does not assume goal direction other than generating realistic materials from the distribution of the training set.

From guacamol paper:

Rediscovery benchmarks are closely related to the similarity benchmarks described above. The major difference is that the rediscovery task explicitly aims to rediscover the target molecule, not to generate many molecules similar to it.

@sgbaird
Member Author

sgbaird commented Aug 2, 2022

Guacamol also uses what they call similarity metrics:

Similarity is one of the core concepts of chemoinformatics. (73,74) It serves multiple purposes and is an interesting objective for optimization. First, it is a surrogate for machine learning models, since it mimics an interpretable nearest neighbor model. However, it has the strong advantage over more complex machine learning (ML) algorithms that deficiencies in the ML models, stemming from training on small data sets or activity cliffs, cannot be as easily exploited by the generative models. Second, it is directly related to virtual screening: de novo design with a similarity objective can be interpreted as a form of inverse virtual screening, where molecules similar to a given target compound are generated on the fly instead of looking them up in a large database. In the similarity benchmarks, models aim to generate molecules similar to a target that was removed from the training set. Models perform well for the similarity benchmarks, if they are able to generate many molecules that are closely related to a given target molecule. Alternatively, the concept of similarity can be applied to exclude molecules that are too similar to other molecules.

I think this is also only used in the context of goal-directed generation.
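For molecules, that similarity objective essentially reduces to a fingerprint similarity against a held-out target. A minimal sketch with RDKit, assuming Morgan fingerprints and Tanimoto similarity (one common choice, not necessarily guacamol's exact scoring configuration):

```python
# Sketch of a similarity score against a held-out target molecule, using
# Morgan fingerprints + Tanimoto similarity (a common choice; not
# necessarily guacamol's exact scoring setup).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def similarity_to_target(gen_smiles, target_smiles, radius=2, n_bits=2048):
    target_fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(target_smiles), radius, nBits=n_bits
    )
    scores = []
    for smi in gen_smiles:
        fp = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(smi), radius, nBits=n_bits
        )
        scores.append(DataStructs.TanimotoSimilarity(target_fp, fp))
    return scores
```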

@sgbaird
Member Author

sgbaird commented Aug 2, 2022

  • coverage: describing the "shape" of the generated distribution (some of the "diversity" metrics in nature.com/articles/s41467-020-17755-8#Sec9 might be interesting - they really come from the discussion of the diversity in ecosystems)

Some excerpts from the paper you linked:

We use diversity metrics37 to quantify the coverage of these databases in terms of variety (V), balance (B) and disparity (D)

Variety measures the number of bins that are sampled, balance the evenness of the distribution of materials among the sampled bins, and disparity the spread of the sampled bins

To compute the diversity metrics, we first split the high-dimensional spaces into a fixed number of bins by assigning all the structures to their closest centroid found from k-means clustering. Here, we use the percentage of all the bins sampled by a database as the variety metric. Furthermore, we use Pielou’s evenness65 to measure the balance of a database, i.e., how even the structures are distributed among the sampled bins. Other metrics, including relative entropy and Kullback–Leibler divergence are a transformation of Pielou’s evenness and provide the same information (see Supplementary Note 16 for comparison). Here, we use 1000 bins for these analyses (see sensitivity analysis to the number of bins in Supplementary Note 16). Lastly, we compute disparity, a measure of spread of the sampled bins, based on the area of the concave hull of the first two principal components of the structures in a database normalized with the area of the concave hull of the current design space. The areas were computed using Shapely66 with circumference to area ratio cutoff of 1.

Interesting that it says KL divergence provides the same information as Pielou's evenness (the balance (B) metric), since KL divergence is one of the distribution metrics used by guacamol. Not sure I understand what "spread" means in the context of the disparity (D) metric. If I'm understanding correctly, a more reliable metric would be to compute the concave hull in high-dimensional space (i.e., approximating the hypervolume of the sampled points in some sense), but they do it in a low-dimensional projection for simplicity.

Variety (V) seems similar to what I've been calling uniqueness, i.e., measuring the dissimilarity of the generated compounds among themselves.
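A rough sketch of how V, B, and D could be computed along the lines of the quoted passage, assuming feature matrices `X` for the database being scored and `X_reference` for the full design space (illustrative names); scipy's convex hull stands in for the paper's Shapely concave hull purely for brevity:

```python
# Rough sketch of the variety (V), balance (B), and disparity (D) metrics
# described above. `X` and `X_reference` are (n_samples, n_features)
# feature matrices. The paper uses a Shapely concave hull for disparity;
# a convex hull is used here for simplicity.
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def diversity_metrics(X, X_reference, n_bins=1000, random_state=0):
    # Bin the design space via k-means on the reference set
    km = KMeans(n_clusters=n_bins, random_state=random_state).fit(X_reference)
    counts = np.bincount(km.predict(X), minlength=n_bins)
    sampled = counts > 0

    # Variety: fraction of all bins sampled by the database
    variety = sampled.sum() / n_bins

    # Balance: Pielou's evenness over the sampled bins
    p = counts[sampled] / counts.sum()
    balance = -(p * np.log(p)).sum() / np.log(sampled.sum())

    # Disparity: area spanned in the first two principal components,
    # normalized by the area of the reference design space
    pca = PCA(n_components=2).fit(X_reference)
    area = ConvexHull(pca.transform(X)).volume  # .volume is area in 2D
    ref_area = ConvexHull(pca.transform(X_reference)).volume
    disparity = area / ref_area

    return variety, balance, disparity
```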

I think matbench-genmetrics could evolve into something more like https://github.com/uncertainty-toolbox/uncertainty-toolbox where you can choose which metrics you want to evaluate. The diversity metrics in that paper seem like a good candidate for another set of metrics to implement. I think there are some similar tools for non-materials-specific generative modeling geared towards calculating generative metrics.

@sgbaird
Member Author

sgbaird commented Aug 18, 2022

From the following article:

Wei, L.; Li, Q.; Song, Y.; Stefanov, S.; Siriwardane, E. M. D.; Chen, F.; Hu, J. Crystal Transformer: Self-Learning Neural Language Model for Generative and Tinkering Design of Materials. arXiv April 25, 2022. http://arxiv.org/abs/2204.11953

They use the term "recovery rate":

The recovery rate measures the percentage of samples from the training or testing set that have been re-generated by the generator model. The high recovery rate over the test set indicates that a generator has high discovery performance since the test set samples are known crystals that actually exist.
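This looks like the same match-rate machinery viewed from the reference set's side; a minimal sketch, again assuming a StructureMatcher-style boolean comparison for what counts as "re-generated":

```python
# Sketch of a recovery rate: the fraction of known (training or test)
# structures that the generator re-generated, assuming the same
# StructureMatcher-style comparison as above.
from pymatgen.analysis.structure_matcher import StructureMatcher

def recovery_rate(gen_structures, known_structures, matcher=None):
    matcher = matcher or StructureMatcher()
    n_recovered = sum(
        any(matcher.fit(known, gen) for gen in gen_structures)
        for known in known_structures
    )
    return n_recovered / len(known_structures)
```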

@kjappelbaum

Not sure I understand what "spread" means in the context of the disparity (D) metric. If I'm understanding correctly, a more reliable metric would be to compute the concave hull in high-dimensional space (i.e., approximating the hypervolume of the sampled points in some sense)

Yeah, I know that Mohammad played a bit with the binning for those metrics (and one would need to check for convergence). That's the reason I don't like them too much.
