This repository contains an implementation of entropy-targeted active learning (ET-AL) for materials data bias mitigation, associated with our paper.
This code is open-sourced under the MIT license. Feel free to use all or portions for your research or related projects so long as you provide the following citation information:
Hengrui Zhang, Wei (Wayne) Chen, James M. Rondinelli, and Wei Chen, ET-AL: Entropy-targeted active learning for bias mitigation in materials data, Applied Physics Reviews 10, 021403 (2023).
@article{zhang2023etal,
author = {Zhang, Hengrui and Chen, Wei Wayne and Rondinelli, James M. and Chen, Wei},
title = {ET-AL: entropy-targeted active learning for bias mitigation in materials data},
journal = {Applied Physics Reviews},
volume = {10},
number = {2},
pages = {021403},
year = {2023},
doi = {10.1063/5.0138913},
url = {https://doi.org/10.1063/5.0138913}
}
- etal_main.py implements the ET-AL algorithm and demonstrates it on the Jarvis-CFID dataset.
- ML_comparison.ipynb compares several ML models on different training sets.
- plot_data.ipynb creates the plots used for visualization.
- datasets/ provides the data required to reproduce the results in our paper.
- results/ contains data generated in the ET-AL demonstration on the Jarvis-CFID dataset.
- utils/ contains tools for data pre-processing:
  - Jarvis_data.ipynb retrieves and cleans the Jarvis CFID data and generates graph embeddings.
  - Jarvis_featurize.ipynb generates physical descriptors for the Jarvis CFID data.
  - compound_featurizer.py is an automatic tool for physical descriptors.
  - cgcnn/ contains the CGCNN model for graph embeddings.
Navigate to the code directory and create the environment:
conda env create -f environment.yml
Then activate the new environment:
conda activate gp-torch
Organize the dataset in a DataFrame and change the data paths in etal_main.py. For demonstration purposes, a dataset derived from the Jarvis CFID data is provided in datasets/: the crystal structures and properties are in data_cleaned.pkl, and the graph embeddings are in cgcnn_embeddings.pkl.

Note: Git LFS is required for data_cleaned.pkl to be downloaded properly. Please download the file manually if you do not have Git LFS.
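
One quick way to check whether data_cleaned.pkl was actually downloaded is to look at the start of the file. When a repository is cloned without Git LFS, LFS-tracked files are typically left as small text pointer stubs whose first line is "version https://git-lfs.github.com/spec/v1". The sketch below is a minimal check under that assumption; the helper name is_lfs_pointer and the path used in the example are illustrative and not part of this repository.

```python
from pathlib import Path

# First bytes of a Git LFS pointer file (the small text stub left in place
# of the real file when Git LFS is not installed).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` looks like a Git LFS pointer stub.

    Hypothetical helper for illustration; not part of this repository.
    """
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


if __name__ == "__main__":
    # Assumed location, following the file names mentioned above.
    data_file = Path("datasets") / "data_cleaned.pkl"
    if not data_file.exists():
        print(f"{data_file} not found")
    elif is_lfs_pointer(data_file):
        print(f"{data_file} is a Git LFS pointer; download the real file manually")
    else:
        print(f"{data_file} does not look like an LFS pointer")
```

If the check reports a pointer, the real file still needs to be fetched, either through Git LFS or by downloading it manually as described in the note.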
- Set up the experimental parameters in etal_main.py: n_iter for the maximum number of ET-AL iterations, n_test for the number of data points held out as the test set, and n_unlabeled for the number of data points left as unlabeled. To change how the unlabeled data are selected, edit the corresponding part of etal_main.py.
- Run the ET-AL model:
    python etal_main.py
- Run ML_comparison to compare ML models on the training sets generated by ET-AL sampling and by random sampling.
- Use plot_data to visualize the results and reproduce the plots in the paper.
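
To make the roles of n_test and n_unlabeled concrete, here is a minimal sketch that partitions dataset indices into a test set, an unlabeled pool, and an initial labeled set. It assumes a uniform random split with a fixed seed. The function split_indices and the numbers used are hypothetical, and the actual selection logic in etal_main.py may differ.

```python
import random


def split_indices(n_total, n_test, n_unlabeled, seed=0):
    """Partition range(n_total) into test, unlabeled, and labeled index lists.

    Hypothetical helper: a uniform random split, shown only to illustrate
    what n_test and n_unlabeled control. Not the logic used in etal_main.py.
    """
    if n_test < 0 or n_unlabeled < 0 or n_test + n_unlabeled > n_total:
        raise ValueError("n_test and n_unlabeled must be non-negative and sum to at most n_total")
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    test = indices[:n_test]
    unlabeled = indices[n_test:n_test + n_unlabeled]
    labeled = indices[n_test + n_unlabeled:]
    return test, unlabeled, labeled


if __name__ == "__main__":
    # Illustrative numbers only; set the real values in etal_main.py.
    test, unlabeled, labeled = split_indices(n_total=1000, n_test=100, n_unlabeled=500)
    print(len(test), len(unlabeled), len(labeled))  # 100 500 400
```

In pool-based active learning generally, each iteration moves selected points from the unlabeled pool into the labeled set, so under this reading n_iter bounds how many such selections are made.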