msg-byu/ML-for-CurieTemp-Predictions

ML-for-Magnetic-Predictions

Code for reproducing all the results in [give ref here]

Figures

Figure 1

Generated by running "Code/T-SNE-DS1.py"

Figure 2

Generated by running "Code/T-SNE-DS2.py"

Figure 3

2NN plot generated by running "Code/KNN-DS1.py"
Random forest plot generated by running "Code/RandomForest-DS1.py"

Figure 4

Generated by running "Code/RandomForest-DS1_train-DS2_test.py"

Figure 5

Generated by running "Code/RandomForest-DS1-MASTML.py"

Figure 6

Generated by running "Code/EvenDistributionRandomForest-DS1.py"

Figure 7

Generated by running "Code/DatavsMAE-RF.py"

Figure 8

Generated by running "Code/SweepBinaryPredictions.py"

Figure 9

Generated by running "Code/RandomForest-DS1.py"

Figure 10

Generated by running "Code/TernaryPredictionPlot.py"

Figure 11

Generated by running "Code/TernaryPredictionPlotZoom.py"

Data

DS1-RAW.csv

The raw data set of ferromagnetic materials and the corresponding Curie temperatures compiled by James Nelson and Stefano Sanvito.

DS1.csv

Cleaned version of DS1-RAW.csv. The feature vector has 85 features, each one describing a distinct element found in the data. Each compound is characterized by placing the percentage that each element occupies in the compound in the appropriate feature.
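As a rough illustration of the encoding described above, the sketch below turns a chemical formula into a vector of element percentages. The function name `composition_features`, the four-element `ELEMENTS` list, and the choice of atomic percent are assumptions made for the example; the real files use 85 element columns, and the cleaning scripts may parse formulas differently.

```python
import re

# Hypothetical column order; the real DS1.csv has 85 element columns.
ELEMENTS = ["Co", "Fe", "Ni", "O"]

def composition_features(formula, elements=ELEMENTS):
    """Return the percentage of each element in `formula`, ordered as `elements`."""
    counts = {}
    # An element symbol followed by an optional (possibly fractional) amount.
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    # Elements absent from the formula get 0; assumes atomic percent.
    return [100.0 * counts.get(el, 0.0) / total for el in elements]

features = composition_features("Fe3Co")
print(features)  # [25.0, 75.0, 0.0, 0.0]
```

Elements that appear in the formula but not in `elements` still count toward the total in this sketch, so the returned row would then sum to less than 100.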

DS1-Compounds.csv

Version of DS1.csv that only includes the compound names and Curie temperatures. Used for feature generation.

DS1-Incompatible.csv

Version of DS1.csv that only includes features found in DS1-RAW.csv. Cannot be used with DS2.csv in machine learning models.

DS1-MASTML-Features.csv

Version of DS1.csv with only 20 features. These features were generated and selected by the MAST-ML Python library.

DS2-RAW.txt

The raw data set of ferromagnetic materials and the corresponding Curie temperatures compiled by Valentin Taufour.

DS2.csv

Cleaned version of DS2-RAW.txt. The feature vector has 85 features, each one describing a distinct element found in the data. Each compound is characterized by placing the percentage that each element occupies in the compound in the appropriate feature.

DS1+DS2.csv

Combination of DS1.csv and DS2.csv. Any overlapping magnetic compounds were only included once.

Code

Clean-DS1-RAW.py

Cleans DS1-RAW.csv and saves cleaned data to Data/DS1.csv

Clean-DS2-RAW.py

Cleans DS2-RAW.txt and saves cleaned data to Data/DS2.csv

Combine-DS1+DS2.py

Combines DS1.csv and DS2.csv into one dataset. Any overlapping magnetic compounds are only included once. Saves combined dataset to Data/DS1+DS2.csv.
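One plausible way to combine two datasets while keeping overlapping compounds only once is sketched below with pandas. The toy frames, the column names, and the rule of keeping the first occurrence are all assumptions; the script may identify overlaps differently (for example by compound name) and may resolve conflicting Tc values in another way.

```python
import pandas as pd

# Toy stand-ins for DS1.csv and DS2.csv; values are illustrative only.
ds1 = pd.DataFrame({"Fe": [100.0, 50.0], "Co": [0.0, 50.0], "Tc": [1043.0, 1250.0]})
ds2 = pd.DataFrame({"Fe": [50.0, 0.0], "Co": [50.0, 100.0], "Tc": [1240.0, 1388.0]})

feature_cols = ["Fe", "Co"]

# Stack both sets, then drop rows whose composition already appeared.
# keep="first" retains the DS1 entry when a compound is in both sets (an assumption).
combined = (
    pd.concat([ds1, ds2], ignore_index=True)
    .drop_duplicates(subset=feature_cols, keep="first")
    .reset_index(drop=True)
)
print(combined)
```

Exact float comparison works for this toy data; real compositions may need rounding before duplicates are dropped.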

DatavsMAE-RF-Above-600K.py

Extracts all the compounds from DS1+DS2.csv with a Tc > 600 K. Uses this new data to determine how the amount of training data used in a random forest model affects the mean absolute error of the predictions. Uses 1/3 of the data as test data and samples different sizes of training data from the remaining compounds. Saves a plot of training data size vs. MAE to "Plots/MAE vs Training Data Size Above 600K.png" and the same plot with a log-log scale to "Plots/MAE vs Training Data Above 600K loglog.png".

DatavsMAE-RF.py

Uses DS1+DS2.csv to determine how the amount of training data used in a random forest model affects the mean absolute error of the predictions. Uses 1/3 of the data as test data and samples different sizes of training data from the remaining compounds. Saves a plot of training data size vs. MAE to "Plots/MAE vs Training Data Size.png" and the same plot with a log-log scale to "Plots/MAE vs Training Data loglog.png".
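The procedure described above can be sketched as follows, assuming scikit-learn is the modelling library (the README does not name it). The data is synthetic, and the training sizes and `n_estimators` value are arbitrary choices for the example rather than the script's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((300, 5))                  # synthetic features
y = 1000.0 * X[:, 0] + 200.0 * X[:, 1]    # synthetic stand-in for Tc

# Hold out 1/3 of the data as a fixed test set.
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

sizes = [25, 50, 100, 200]                # hypothetical training sizes
maes = []
for n in sizes:
    # Sample n training points from the remaining 2/3 without replacement.
    idx = rng.choice(len(X_pool), size=n, replace=False)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_pool[idx], y_pool[idx])
    maes.append(mean_absolute_error(y_test, model.predict(X_test)))

for n, mae in zip(sizes, maes):
    print(n, round(mae, 1))
```

Plotting `sizes` against `maes` on linear and log-log axes would give the two figures the script saves.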

EvenDistributionRandomForest-DS1.py

Creates an evenly distributed sub-dataset of DS1. Sorts data into 10 bins based on Curie temperature. Takes a random sample of 100 points from each bin and adds them all to a new set of data. 2/3 of this data is then used to train a random forest model. The model predicts on the remaining 1/3 of the data. MAE is printed in the terminal. A plot with the predicted Tc values vs. the experimental Tc values is saved to "Plots/Even Distribution Random Forest DS1.png". It also creates two error plots. One plots the experimental Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/Even Distribution Random Forest Experimental Error DS1.png". The other plots the predicted Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/Even Distribution Random Forest Predicted Error DS1.png".
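A minimal sketch of the binning and sampling step is shown below. It assumes equal-width bins over the Tc range and takes every point from a bin that holds fewer than the requested number; the script may define bins or handle sparse bins differently. The helper name `even_subsample` is hypothetical.

```python
import numpy as np

def even_subsample(tc, n_bins=10, per_bin=100, seed=0):
    """Return sorted row indices sampled roughly evenly across Tc bins."""
    rng = np.random.default_rng(seed)
    tc = np.asarray(tc, dtype=float)
    edges = np.linspace(tc.min(), tc.max(), n_bins + 1)
    # Use interior edges only so the maximum value falls in the last bin.
    bin_ids = np.digitize(tc, edges[1:-1])
    chosen = []
    for b in range(n_bins):
        members = np.flatnonzero(bin_ids == b)
        take = min(per_bin, len(members))   # assumption: sparse bins give all they have
        if take == 0:
            continue
        chosen.extend(rng.choice(members, size=take, replace=False))
    return np.array(sorted(chosen), dtype=int)

# Skewed synthetic Tc values: many low-Tc points, few high-Tc points.
tc = np.random.default_rng(1).exponential(scale=200.0, size=2000)
idx = even_subsample(tc, n_bins=10, per_bin=20)
print(len(idx))
```

The resulting indices can then be split 2/3 and 1/3 for training and testing as described above.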

Generate-MASTML-Features.ipynb

Python notebook used to create DS1-MASTML-Features.csv. Uses MAST-ML python library to generate many features for DS1-Compounds.csv then selects the 20 most meaningful features. Saves new dataset with generated features to "Data/DS1-MASTML-Features.csv".

GenerateTernaryMaterialsCo+Fe+X.py

Generates all possible ternary combinations that include Cobalt, Iron, and any other element found in compounds with a Tc > 600 K. The amount of each element in a generated compound changes in increments of 1%. Eliminates any duplicates in the generated data, formats it to be compatible with DS1.csv and DS2.csv and saves the new data to "Data/Generated Materials/GC_Ternary_Co+Fe+X.csv".
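To give a sense of the size of the generated grid, the sketch below enumerates ternary compositions in 1% steps. It assumes every element must be present at 1% or more; the script may also include compositions where one fraction is zero. The `others` list and the function name are hypothetical.

```python
def ternary_compositions(step=1):
    """Yield (a, b, c) integer percentages with a + b + c == 100 and each >= step."""
    for a in range(step, 100, step):
        for b in range(step, 100 - a, step):
            yield a, b, 100 - a - b

# Hypothetical third elements; the script draws them from compounds with Tc > 600 K.
others = ["Ni", "Al"]
rows = [
    {"Co": a, "Fe": b, x: c}
    for x in others
    for a, b, c in ternary_compositions()
]
print(len(rows))  # 9702
```

Each third element contributes 4851 compositions at a 1% step, which may explain why the generated CSV grows large once many elements are included.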

GenerateTernaryMaterialsFe+XX+X.py

Generates all possible ternary combinations of Iron with two other elements found in compounds with a Tc > 600 K, excluding Cobalt. Holds the Iron concentration at a minimum of 80% for all generated compounds. The amount of each element in a generated compound changes in increments of 1%. Eliminates any duplicates in the generated data, formats it to be compatible with DS1.csv and DS2.csv and saves the new data to "Data/Generated Materials/GC_Ternary_Fe80+XX+X.csv".

GroupedRandomForest.py

Creates a subset of data containing all compounds with a specified majority element found in DS1+DS2.csv. The chosen majority element must be specified for the variable MAJORITY_ELEMENT near the top of the script. 2/3 of this data is then used to train a random forest model. The model predicts on the remaining 1/3 of the data. MAE is printed in the terminal. A plot with the predicted Tc values vs. the experimental Tc values is saved to "Plots/Random Forest " + MAJORITY_ELEMENT + "-majority.png". It also creates two error plots. One plots the experimental Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/Random Forest Experimental Error " + MAJORITY_ELEMENT + "-majority.png". The other plots the predicted Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/Random Forest Predicted Error " + MAJORITY_ELEMENT + "-majority.png".
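Selecting compounds by majority element could look like the sketch below. The three-element column list and the toy compositions and Tc values are made up for illustration, and ties are resolved in favor of the first column, which may not match the script.

```python
import numpy as np

ELEMENTS = ["Co", "Fe", "Ni"]   # hypothetical column order
MAJORITY_ELEMENT = "Fe"

# Toy compositions (percent) and Tc values; illustrative only.
X = np.array([
    [50.0, 40.0, 10.0],
    [10.0, 80.0, 10.0],
    [0.0, 60.0, 40.0],
    [20.0, 20.0, 60.0],
])
tc = np.array([900.0, 1000.0, 700.0, 500.0])

# The majority element is the column with the largest percentage in each row.
majority = np.array(ELEMENTS)[X.argmax(axis=1)]
mask = majority == MAJORITY_ELEMENT
X_group, tc_group = X[mask], tc[mask]
print(majority, X_group.shape)
```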

KfoldCV.py

Performs 50 rounds of 3-fold cross validation on DS1.csv. Prints the mean MAE and the standard deviation in the terminal.
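Repeated k-fold cross validation of this kind can be sketched with scikit-learn's `RepeatedKFold`, assuming scikit-learn and a random forest are what the script uses. The data is synthetic, and the standard deviation here is taken over all 150 folds; the script may instead average within each round first.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((120, 4))                  # synthetic features
y = 800.0 * X[:, 0] + 100.0 * X[:, 1]     # synthetic stand-in for Tc

# 50 rounds of 3-fold cross validation gives 150 train/test splits.
cv = RepeatedKFold(n_splits=3, n_repeats=50, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0)

# scikit-learn reports negated MAE so that larger is better; flip the sign back.
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
maes = -scores
print(f"MAE: {maes.mean():.1f} +/- {maes.std():.1f}")
```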

KNN-DS1-Random.py

Imports DS1.csv and randomly shuffles the Tc values. 2/3 of the data is then used to train a 2-nearest-neighbors model. The model predicts on the remaining 1/3 of the data. MAE is printed in the terminal. A plot with the predicted Tc values vs. the experimental Tc values is saved to "Plots/2 Nearest Neighbors Random.png". It also creates two error plots. One plots the experimental Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/2 Nearest Neighbors Experimental Error Random.png". The other plots the predicted Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/2 Nearest Neighbors Predicted Error Random.png".

KNN-DS1.py

Imports DS1.csv. 2/3 of the data is then used to train a 2-nearest-neighbors model. The model predicts on the remaining 1/3 of the data. MAE is printed in the terminal. A plot with the predicted Tc values vs. the experimental Tc values is saved to "Plots/2 Nearest Neighbors.png". It also creates two error plots. One plots the experimental Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/2 Nearest Neighbors Experimental Error.png". The other plots the predicted Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/2 Nearest Neighbors Predicted Error.png".
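The train, predict, and error steps described above might be sketched as follows, assuming scikit-learn's `KNeighborsRegressor` with default distance settings. The data is synthetic, and the sign convention for the error (experimental minus predicted) is an assumption.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((150, 5))          # synthetic features
y = 1000.0 * X[:, 0]              # synthetic stand-in for Tc

# 2/3 of the data for training, 1/3 for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

model = KNeighborsRegressor(n_neighbors=2).fit(X_train, y_train)
predicted = model.predict(X_test)
mae = mean_absolute_error(y_test, predicted)

# Per-point error used on the y-axis of the two error plots.
errors = y_test - predicted
print(f"MAE: {mae:.1f}")
```

Plotting `errors` against `y_test` and against `predicted` would correspond to the experimental-error and predicted-error plots respectively.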

RandomForest-DS1_train-DS2_test.py

Imports DS1.csv and DS2.csv. DS1 is then used to train a random forest model. The model predicts on all the data in DS2. MAE is printed in the terminal. A plot with the predicted Tc values vs. the experimental Tc values is saved to "Plots/DS1_train DS2_test Random Forest.png". It also creates two error plots. One plots the experimental Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/DS1_train DS2_test Random Forest Experimental Error.png". The other plots the predicted Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/DS1_train DS2_test Random Forest Predicted Error.png".

RandomForest-DS1-MASTML.py

Imports DS1.csv and DS1-MASTML-Features.csv. 2/3 of DS1 is then used to train a random forest model. The model predicts on the remaining 1/3 of the data. MAE is printed in the terminal. This same process is repeated with DS1-MASTML-Features. Plots with the predicted Tc values vs. the experimental Tc values from both models are overlaid for comparison and saved to "Plots/MASTML Random Forest.png". It also creates two error plots for the model that uses the MAST-ML features. One plots the experimental Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/MASTML Random Forest Experimental Error.png". The other plots the predicted Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/MASTML Random Forest Predicted Error.png".

RandomForest-DS1-PCA.py

Imports DS1.csv. Uses principal component analysis to reduce the number of features in DS1. Decreases the number of features from 85 to 5 in increments of 1. At each increment, 2/3 of the data is used to train a random forest model. The model predicts on the remaining 1/3 of the data then moves on to the next increment. Finally, it creates a plot with the number of features vs. the MAE of the random forest predictions and saves it to "Plots/PCA Random Forest - DS1.png".
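A scaled-down sketch of the sweep is given below, assuming scikit-learn's `PCA`. It uses 10 synthetic features reduced down to 2 instead of 85 down to 5, and it fits PCA on the full dataset before splitting, which is one reading of the description and may differ from the script.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((150, 10))                 # synthetic features
y = 900.0 * X[:, 0] + 300.0 * X[:, 1]     # synthetic stand-in for Tc

results = {}
for n_components in range(10, 1, -1):     # 10, 9, ..., 2
    # Assumption: PCA is fitted on all rows before the train/test split.
    X_reduced = PCA(n_components=n_components).fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X_reduced, y, test_size=1/3, random_state=0
    )
    model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X_train, y_train)
    results[n_components] = mean_absolute_error(y_test, model.predict(X_test))

for n, mae in results.items():
    print(n, round(mae, 1))
```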

RandomForest-DS1-Random.py

Imports DS1.csv and randomly shuffles the Tc values. 2/3 of the data is then used to train a random forest model. The model predicts on the remaining 1/3 of the data. MAE is printed in the terminal. A plot with the predicted Tc values vs. the experimental Tc values is saved to "Plots/Random Forest DS1 Random.png". It also creates two error plots. One plots the experimental Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/Random Forest Experimental Error DS1 Random.png". The other plots the predicted Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/Random Forest Predicted Error DS1 Random.png".
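Shuffling the targets breaks any link between composition and Tc, so the resulting MAE serves as a baseline for a model that has learned nothing. The sketch below compares the two cases on synthetic data, assuming scikit-learn; the helper `holdout_mae` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def holdout_mae(X, y, seed=0):
    """Train on 2/3 of the data and return the MAE on the remaining 1/3."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed).fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))

rng = np.random.default_rng(0)
X = rng.random((300, 5))                              # synthetic features
y = 1000.0 * X[:, 0] + rng.normal(0.0, 20.0, 300)     # synthetic stand-in for Tc

mae_real = holdout_mae(X, y)
# Permute the targets so they no longer correspond to their features.
mae_shuffled = holdout_mae(X, rng.permutation(y))
print(f"real: {mae_real:.1f}  shuffled: {mae_shuffled:.1f}")
```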

RandomForest-DS1.py

Imports DS1.csv. 2/3 of the data is then used to train a random forest model. The model predicts on the remaining 1/3 of the data. MAE is printed in the terminal. A plot with the predicted Tc values vs. the experimental Tc values is saved to "Plots/Random Forest DS1.png". It also creates two error plots. One plots the experimental Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/Random Forest Experimental Error DS1.png". The other plots the predicted Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/Random Forest Predicted Error DS1.png".

RandomForest-DS1+DS2.py

Imports DS1+DS2.csv. 2/3 of the data is then used to train a random forest model. The model predicts on the remaining 1/3 of the data. MAE is printed in the terminal. A plot with the predicted Tc values vs. the experimental Tc values is saved to "Plots/DS1+DS2 Random Forest DS1.png". It also creates two error plots. One plots the experimental Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/DS1+DS2 Random Forest Experimental Error.png". The other plots the predicted Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/DS1+DS2 Random Forest Predicted Error DS1.png".

RandomForest-DS2.py

Imports DS2.csv. 2/3 of the data is then used to train a random forest model. The model predicts on the remaining 1/3 of the data. MAE is printed in the terminal. A plot with the predicted Tc values vs. the experimental Tc values is saved to "Plots/Random Forest DS2.png". It also creates two error plots. One plots the experimental Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/Random Forest Experimental Error DS2.png". The other plots the predicted Tc value vs the difference between the experimental and predicted value for each point in the test set. This plot is saved to "Plots/Random Forest Predicted Error DS2.png".

SweepBinaryPredictions.py

Imports DS1.csv. Trains a random forest model on all the data. Uses the random forest model to make sweeping predictions over every combination for 8 different binary pairs. These pairs are: [Co,Fe],[Fe,Ni],[Fe,O],[Mn,Ni],[Ni,Sm],[Ni,Tb],[Se,U],[Ga,U]. For a pair of elements, every combination of the 2 elements is generated with 1% increments. The random forest makes Tc predictions on all these combinations. The predictions are plotted along with the experimental values of any real combinations of the two elements found in DS1. This process is repeated 8 times, once for each pair of elements. All 8 plots are put together in one plot and saved to "Plots/Random Forest Prediction Sweep - DS1.png".
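The sweep for a single pair can be sketched as below, assuming scikit-learn. The grid runs from 0% to 100% inclusive (101 compositions), which is an assumption about the endpoints; the three-element column list, the training data, and the helper `binary_sweep` are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

ELEMENTS = ["Co", "Fe", "Ni"]   # hypothetical; DS1.csv has 85 element columns

def binary_sweep(el_a, el_b, elements=ELEMENTS):
    """Return feature rows for el_a at 0..100% with el_b making up the rest."""
    pct = np.arange(0, 101, 1, dtype=float)
    grid = np.zeros((len(pct), len(elements)))
    grid[:, elements.index(el_a)] = pct
    grid[:, elements.index(el_b)] = 100.0 - pct
    return grid

rng = np.random.default_rng(0)
# Synthetic training compositions that sum to 100% per row.
X_train = rng.dirichlet(np.ones(len(ELEMENTS)), size=200) * 100.0
y_train = 10.0 * X_train[:, 0] + 8.0 * X_train[:, 1]   # synthetic stand-in for Tc

# Train on all available data, then predict across the whole binary range.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
grid = binary_sweep("Co", "Fe")
predicted_tc = model.predict(grid)
print(predicted_tc[:3])
```

Repeating this for each of the 8 pairs and overlaying the known experimental points would reproduce the structure of the combined plot.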

T-SNE-DS1.py

Creates two 2-dimensional t-SNE plots to visualize the data in DS1. One plot is colored according to the majority element in each compound and is saved to "Plots/ELEMTSNE-DS1.png". The other plot is colored according to Curie temperature and is saved to "Plots/TempTSNE-DS1.png".
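A minimal sketch of the embedding and the two colorings is shown below, assuming scikit-learn's `TSNE`. The compositions and Tc values are synthetic, and the perplexity is an arbitrary choice; the script's settings are not stated in this README.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic compositions (percent, rows sum to 100) and Tc values.
X = rng.dirichlet(np.ones(6), size=100) * 100.0
tc = rng.uniform(0.0, 1400.0, size=100)

# Project the element-percentage vectors down to 2 dimensions.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Color key for the first plot: index of the majority element in each row.
majority_idx = X.argmax(axis=1)
# Color key for the second plot: the Tc values themselves.
print(embedding.shape, majority_idx[:5], tc[:2])
```

Passing `embedding[:, 0]` and `embedding[:, 1]` to a scatter plot with either color key would give the two figures.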

T-SNE-DS2.py

Creates two 2-dimensional t-SNE plots to visualize the data in DS2. One plot is colored according to the majority element in each compound and is saved to "Plots/ELEMTSNE-DS2.png". The other plot is colored according to Curie temperature and is saved to "Plots/TempTSNE-DS2.png".

TernaryPredictionPlot.py

MUST RUN GenerateTernaryMaterialsCo+Fe+X.py BEFORE RUNNING THIS SCRIPT! The csv generated by GenerateTernaryMaterialsCo+Fe+X.py is too large for Github so it must be generated manually by the user.

Imports DS1+DS2.csv. Filters out all the compounds in the data that have a Tc less than 600 K. The remaining data are used to train a random forest model. The model then makes Tc predictions for all of the generated ternary compounds in "Data/Generated Materials/GC_Ternary_Co+Fe+X.csv". The predicted Curie temperatures are then plotted in a ternary heatmap and saved to "Plots/GC_Ternary_PlotCo+Fe+X.png".

*The plots can be made using other elements, but they must be specified in the ELEMENT1, ELEMENT2, and ELEMENT3 variables at the top of the script. Any changes to these variables must also be made in the GenerateTernaryMaterialsCo+Fe+X.py script, which must then be rerun to generate the data that the ternary plot requires.

TernaryPredictionPlotZoom.py

MUST RUN GenerateTernaryMaterialsFe+XX+X.py BEFORE RUNNING THIS SCRIPT! The csv generated by GenerateTernaryMaterialsFe+XX+X.py is too large for Github so it must be generated manually by the user.

Imports DS1+DS2.csv. Filters out all the compounds in the data that have a Tc less than 600 K. The remaining data are used to train a random forest model. The model then makes Tc predictions for all of the generated ternary compounds in "Data/Generated Materials/GC_Ternary_Fe80+XX+X.csv". The predicted Curie temperatures are then plotted in a ternary heatmap and saved to "Plots/GC_Ternary_PlotFe80+XX+X.png". This plot forces the first axis to have a range from 80% to 100% while the other 2 axes have ranges from 0% to 20%.

*The plots can be made using other elements, but they must be specified in the ELEMENT1, ELEMENT2, and ELEMENT3 variables at the top of the script. Any changes to these variables must also be made in the GenerateTernaryMaterialsFe+XX+X.py script, which must then be rerun to generate the data that the ternary plot requires.

UMAP-DS1.py

Creates two 2-dimensional UMAP plots to visualize the data in DS1. One plot is colored according to the majority element in each compound and is saved to "Plots/UMAP-ELEM-DS1.png". The other plot is colored according to Curie temperature and is saved to "Plots/UMAP-Temp-DS1.png".

UMAP-DS2.py

Creates two 2-dimensional UMAP plots to visualize the data in DS2. One plot is colored according to the majority element in each compound and is saved to "Plots/UMAP-ELEM-DS2.png". The other plot is colored according to Curie temperature and is saved to "Plots/UMAP-Temp-DS2.png".
