Exploring the representation of high-entropy alloys for screening electrocatalysis of oxygen reduction reaction via feature engineering
This is the repository of high-entropy alloy (HEA) experiments for my graduation project, "Exploring the Application of High-Entropy Alloy Representations in Oxygen Reduction Electrocatalyst Screening via Feature Engineering".
A regression model is proposed to predict the *OH adsorption energy of HEAs (high-entropy alloys). It handles the problem of "input disorder" and performs well: the mean absolute error is within 0.038 eV compared with conventional DFT calculations. Moreover, feature engineering is used for data augmentation, and Shapley values are used to analyse the features selected by the genetic algorithm. It is worth noting that the adsorbed atoms' molar mass and the coordination numbers of the atoms constituting the HEAs contribute most to the model's predictions. Finally, WGAN-GP (Wasserstein GAN with gradient penalty) is used to generate HEA environments and compositions.
Beyond predicting the adsorption energy of HEAs, this method can also be applied to other multi-atomic systems that are similarly constrained by dataset shortages.
The prominent packages are:
- SHAP
- numpy
- pandas
- seaborn
- matplotlib
- scikit-learn
- pytorch 1.8.1
To install all the dependencies quickly and easily, use pip with the provided requirements file:
pip install -r requirements.txt
I built up my dataset based on neural-network-design-of-HEA; you can refer to that repository for more information.
Because of dataset ownership, this repository does not provide the HEA dataset, so you have to collect your own data!
The data structure is shown below.
| | Atom | Ru | Rh | Pd | Ir | Pt |
|---|---|---|---|---|---|---|
| A | Period | 5 | 5 | 5 | 6 | 6 |
| | Group | 8 | 9 | 10 | 9 | 10 |
| B | Radius | 1.338 | 1.345 | 1.375 | 1.357 | 1.387 |
| C | CN | | | | | |
| D | AtSite | | | | | |
| E | Pauling electronegativity | 2.20 | 2.28 | 2.20 | 2.20 | 2.28 |
| | VEC | 8 | 9 | 10 | 9 | 10 |
| F | M | 101.07 | 102.906 | 106.42 | 192.2 | 195.08 |
| | Atomic number | 44 | 45 | 46 | 77 | 78 |
where CN is the coordination number, AtSite is the active site, and M is the molar mass. The remaining features are the descriptors we desired; they are denoted 'A, B, C, D, E, F' in the table above.
You have to follow `coord_nums` to fill in the blanks (the CN and AtSite rows).
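Purely as an illustration (not code from this repository), the per-element properties above could be collected in a pandas DataFrame, with the structure-dependent rows (CN, AtSite) filled in afterwards from `coord_nums`; all names below are assumptions:

```python
import pandas as pd

# Per-element descriptors from the table above. CN and AtSite depend on the
# local structure, so they are left to be filled in from coord_nums.
elements = ["Ru", "Rh", "Pd", "Ir", "Pt"]
features = pd.DataFrame(
    {
        "Period":            [5, 5, 5, 6, 6],
        "Group":             [8, 9, 10, 9, 10],
        "Radius":            [1.338, 1.345, 1.375, 1.357, 1.387],
        "Electronegativity": [2.20, 2.28, 2.20, 2.20, 2.28],
        "VEC":               [8, 9, 10, 9, 10],
        "M":                 [101.07, 102.906, 106.42, 192.2, 195.08],
        "AtomicNumber":      [44, 45, 46, 77, 78],
    },
    index=elements,
)
print(features)
```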
If you use the dataset from neural-network-design-of-HEA, you should follow the steps below:
After building up the dataset with 9 features, use the Pearson correlation coefficient to drop highly correlated features and reduce the computation cost; run the following code:
cd utils
python PearsonSelection.py
PearsonSelection.py uses the Pearson correlation coefficient to drop highly correlated features.
The result will show which features are highly correlated and can therefore be dropped.
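If you want to reproduce the idea outside PearsonSelection.py, a minimal sketch of Pearson-based filtering might look like the following (the DataFrame argument and the 0.9 threshold are assumptions, not the script's actual settings):

```python
import pandas as pd

def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair whose |Pearson r| exceeds the threshold."""
    corr = df.corr(method="pearson").abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])      # keep the earlier column, drop the later one
    return df.drop(columns=sorted(to_drop))
```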
The model can handle any number of atoms: its input is defined only by the number of features, so there is effectively no limitation on the input dimension.
To train the model, you can simply use the following command, and you will get a checkpoint:
# training a model for downstream tasks
python K_fold.py
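K_fold.py is the repository's actual training script; the snippet below is only a generic sketch of k-fold training to show the idea (the `build_model` factory, data arrays, and hyperparameters are placeholders, not the repository's code):

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

def k_fold_train(X, y, build_model, k=5, epochs=500):
    """Train one model per fold and report the mean validation MAE."""
    maes = []
    for fold, (tr, va) in enumerate(KFold(n_splits=k, shuffle=True, random_state=0).split(X)):
        model = build_model()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        Xtr, ytr = torch.tensor(X[tr], dtype=torch.float32), torch.tensor(y[tr], dtype=torch.float32)
        Xva, yva = torch.tensor(X[va], dtype=torch.float32), torch.tensor(y[va], dtype=torch.float32)
        for _ in range(epochs):
            opt.zero_grad()
            loss = torch.nn.functional.l1_loss(model(Xtr).squeeze(), ytr)
            loss.backward()
            opt.step()
        mae = torch.nn.functional.l1_loss(model(Xva).squeeze(), yva).item()
        maes.append(mae)
        print(f"fold {fold}: validation MAE = {mae:.4f} eV")
    return float(np.mean(maes))
```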
To obtain the plot of MAE and RMSE compared with the DFT-calculated adsorption energies:
# you need to update the checkpoint path first!
python main.py
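For reference, the two reported metrics are straightforward to compute from the predictions and the DFT values; a small helper (not taken from main.py):

```python
import numpy as np

def mae_rmse(y_dft, y_pred):
    """Mean absolute error and root-mean-square error against DFT adsorption energies (eV)."""
    err = np.asarray(y_pred) - np.asarray(y_dft)
    return float(np.mean(np.abs(err))), float(np.sqrt(np.mean(err ** 2)))
```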
You can also simply use the checkpoint provided in checkpoint/6_500epochs_5_model.pth.
To visualize the data and the features processed by the model, run:
python t_SNE.py
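t_SNE.py does this in the repository; a generic sketch of the technique with scikit-learn (the feature matrix `X` and the colouring property are placeholders) is:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(X, color, perplexity=30.0):
    """Project high-dimensional features to 2-D with t-SNE and colour points by a property."""
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    plt.scatter(emb[:, 0], emb[:, 1], c=color, cmap="viridis", s=10)
    plt.colorbar(label="*OH adsorption energy (eV)")
    plt.tight_layout()
    plt.show()
```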
To augment the data with feature engineering, run:
python data_augment.py
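The actual augmentation scheme lives in data_augment.py; purely as a hypothetical example of feature-level augmentation for multi-atom inputs, one could permute the order of equivalent neighbouring atoms so every ordering of the same environment is seen:

```python
import itertools
import numpy as np

def permute_augment(sample):
    """Hypothetical augmentation: return all row permutations of a (n_atoms, n_features) sample.

    Each permutation describes the same physical environment, which also
    encourages robustness to 'input disorder'.
    """
    n_atoms = sample.shape[0]
    return np.stack([sample[list(p)] for p in itertools.permutations(range(n_atoms))])
```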
For genetic-algorithm feature selection and Shapley-value analysis, use:
python Feature_selection.py
python SHAP.py
After running the code, you will get a best_result.csv file that tells you the best combination of the 90 features.
The Shapley-value analysis will tell you which features affect the model's prediction of the *OH adsorption energy the most.
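SHAP.py performs this analysis; a generic sketch with the shap package (the explainer type, background-sample size, and function names are assumptions) is:

```python
import shap

def explain(predict_fn, X, feature_names):
    """Rank feature importance with Shapley values using a model-agnostic explainer."""
    background = shap.sample(X, 50)                      # small background set keeps KernelExplainer tractable
    explainer = shap.KernelExplainer(predict_fn, background)
    shap_values = explainer.shap_values(X)
    shap.summary_plot(shap_values, X, feature_names=feature_names)  # ranks features by mean |SHAP value|
```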
To train the WGAN-GP that generates HEA environments and compositions, run the command below; you can switch the mode to choose whether the regression model is also trained. The loss plot shows that the GAN training did not go well :(
python Joint_training.py
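Joint_training.py holds the actual implementation; as a reminder of what the WGAN-GP gradient-penalty term looks like, here is a generic sketch (not the repository's code):

```python
import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    """WGAN-GP penalty: push the critic's gradient norm towards 1 on interpolated samples."""
    eps = torch.rand(real.size(0), 1, device=device)          # one mixing coefficient per sample
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```

The critic loss is then the fake score minus the real score plus λ times this penalty, with λ commonly set to 10.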