- Employs the power of a pre-trained RoBERTa encoder to predict energy levels using textual inputs.
- Processes human-interpretable text to embed target features for energy prediction.
- Analyzes attention scores to reveal how CatBERTa focuses on the incorporated features.
- Achieves a mean absolute error (MAE) of 0.75 eV, comparable to earlier Graph Neural Networks (GNNs).
- Enhances energy-difference predictions by effectively canceling out systematic errors shared by chemically similar systems (for more details, refer to the paper "Beyond Independent Error Assumptions in Large GNNs").
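The error-cancellation point above can be illustrated with a toy simulation (this is not CatBERTa itself; the energies and error magnitudes below are made up for illustration). When two predictions share a systematic bias, that bias drops out of their difference:

```python
# Toy illustration: a systematic error shared by two chemically similar
# systems cancels in the predicted energy *difference*.
import random

random.seed(0)
true_e1, true_e2 = -1.20, -0.80          # hypothetical adsorption energies (eV)

abs_err_single, abs_err_diff = [], []
for _ in range(10_000):
    bias = random.gauss(0, 0.5)          # systematic error, shared by both systems
    noise1 = random.gauss(0, 0.1)        # small independent noise per system
    noise2 = random.gauss(0, 0.1)
    pred1 = true_e1 + bias + noise1
    pred2 = true_e2 + bias + noise2
    abs_err_single.append(abs(pred1 - true_e1))
    abs_err_diff.append(abs((pred1 - pred2) - (true_e1 - true_e2)))

mae_single = sum(abs_err_single) / len(abs_err_single)
mae_diff = sum(abs_err_diff) / len(abs_err_diff)
print(f"MAE on single energies:    {mae_single:.3f} eV")
print(f"MAE on energy differences: {mae_diff:.3f} eV")  # smaller: the bias cancels
```

The difference MAE is driven only by the independent noise terms, which is why correlated errors help rather than hurt when comparing similar systems.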
Follow these steps to start using CatBERTa for predicting catalyst adsorption energy:
Before you begin, ensure you have the following prerequisites installed:
- Python 3.8
- PyTorch 1.11
- transformers 4.29
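A quick way to confirm the prerequisites are in place is a short environment check (a generic sketch, not part of the CatBERTa codebase; the package names are the standard PyPI import names):

```python
# Check the prerequisites listed above before running CatBERTa.
import sys
import importlib.util

assert sys.version_info >= (3, 8), "CatBERTa expects Python 3.8+"

# torch and transformers may not be installed yet; report rather than fail.
for pkg in ("torch", "transformers"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'missing -- install it first'}")
```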
Clone the CatBERTa repository:
```shell
# clone the source code of CatBERTa
git clone https://github.com/hoon-ock/CatBERTa.git
cd CatBERTa
```
- **Preprocessed textual data**: The `data` folder houses the preprocessed textual data derived from the Open Catalyst 2020 dataset. Due to storage limitations, we offer a small subset of our training and validation data as an illustrative example. This subset showcases the format and structure of the data that CatBERTa utilizes for energy prediction. For access to the full dataset, please visit the following link: Google Drive - Full Dataset.
- **Structural data**: The Open Catalyst Project dataset is the source from which CatBERTa's textual inputs are generated. This comprehensive dataset comprises a diverse collection of structural relaxation trajectories of adsorbate-catalyst systems, each accompanied by its corresponding energy. To access the dataset and learn more about its attributes, please refer to the official repository: Open Catalyst Project Dataset.
For access to the model checkpoints, please reach out to us.
The training configurations for CatBERTa can be found in the `config/ft_config.yaml` file.
During training, CatBERTa automatically creates and manages checkpoints to track model progress. Checkpoints are saved in the `checkpoint/finetune` folder, which is created automatically if it doesn't exist.
```shell
$ python finetune_regression.py
```
To analyze energy and embedding predictions using CatBERTa, you can utilize the `catberta_prediction.py` script. This script generates predictions for either energy or embedding values.
```shell
$ python catberta_prediction.py --target energy --base --ckpt_dir "Path/to/checkpoint" --data_path "Path/to/data"
```

or

```shell
$ python catberta_prediction.py --target embed --base --ckpt_dir "Path/to/checkpoint" --data_path "Path/to/data"
```
The AttentionVisualizer repository provides a robust toolkit for visualizing and analyzing attention scores. To use this tool effectively with CatBERTa, load the fine-tuned RoBERTa encoder into the AttentionVisualizer package.
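As a minimal sketch of where the attention scores come from, the Hugging Face `transformers` API exposes per-layer attention tensors directly. The tiny randomly initialized config below is only a stand-in for the actual fine-tuned checkpoint (which you would load instead with `RobertaModel.from_pretrained("Path/to/checkpoint")`):

```python
# Sketch: extracting attention scores from a RoBERTa encoder.
# A tiny random model stands in for the fine-tuned CatBERTa checkpoint.
import torch
from transformers import RobertaConfig, RobertaModel

config = RobertaConfig(num_hidden_layers=2, num_attention_heads=2,
                       hidden_size=64, intermediate_size=128,
                       vocab_size=1000, max_position_embeddings=40)
model = RobertaModel(config)
model.eval()

input_ids = torch.randint(0, 1000, (1, 10))   # stand-in for a tokenized text input
with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# out.attentions holds one tensor per layer, shaped (batch, heads, seq, seq),
# which is the input a tool like AttentionVisualizer renders as heatmaps.
print(len(out.attentions), tuple(out.attentions[0].shape))
```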
```bibtex
@article{ock2023catberta,
  author  = {Ock, Janghoon and Guntuboina, Chakradhar and Barati Farimani, Amir},
  title   = {Catalyst Energy Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models},
  journal = {ACS Catalysis},
  volume  = {13},
  number  = {24},
  pages   = {16032--16044},
  year    = {2023},
  doi     = {10.1021/acscatal.3c04956},
  url     = {https://doi.org/10.1021/acscatal.3c04956}
}
```
For questions or support, feel free to contact us at jock@andrew.cmu.edu.