We provide the data in two formats: a processed version containing only the variables used for ML model evaluation in our paper, and the full raw VASP output.
The processed data contain the unrelaxed structures, energies, formation energies, HOMO, LUMO and derived variables.
The archive can be downloaded and viewed directly at the Constructor Research Platform.
Alternatively, the data are available in DVC:
- Clone the repository
- Ensure that DVC[S3] is installed, for example by running
pip install dvc[s3]
- Download the datasets
dvc pull -R processed-high-density processed-low-density datasets/processed/{high,low}_density_defects datasets/csv_cif/high_density_defects/{MoS2,WSe2,BP_spin,GaSe_spin,InSe_spin,hBN_spin}_500 datasets/csv_cif/low_density_defects/{MoS2,WSe2}
- _id: unique structure identifier
- descriptor_id: identifier of the defect type as specified in descriptors.csv
- defect_id: unused
- energy: total potential energy of the system as reported by VASP, eV
- energy_per_atom: total potential energy of the system divided by the number of atoms, eV
- fermi_level: Fermi level, eV
- homo: highest occupied molecular orbital (HOMO) energy, eV
- lumo: lowest unoccupied molecular orbital (LUMO) energy, eV
- normalized_homo: HOMO value normalised with respect to the host valence band maximum (VBM) (see section "DFT computations" in the paper), eV
- normalized_lumo: LUMO value normalised with respect to the host valence band maximum (VBM) (see section "DFT computations" in the paper), eV
- E_1: energy of the first Kohn–Sham orbital of the structure with the defect (see section "DFT computations" in the paper), eV
- homo_lumo_gap: band gap, LUMO - HOMO, eV
- total_mag: total magnetisation
- *_{majority,minority}: the corresponding quantities computed for the majority and minority spin channels, for materials computed with spin
- band_gap: OBSOLETE
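As an illustration of how these columns might be consumed, here is a minimal sketch that reads a gzip-compressed CSV with the Python standard library and checks that homo_lumo_gap agrees with LUMO - HOMO. It assumes the file is a comma-separated table with a header row; the function names and the sample values are hypothetical and are not part of the dataset tooling.

```python
import csv
import gzip
import io


def read_defects(path_or_file):
    """Read a gzip-compressed CSV into a list of dicts, one per structure."""
    with gzip.open(path_or_file, mode="rt", newline="") as handle:
        return list(csv.DictReader(handle))


def gap_mismatch(row, tolerance=1e-6):
    """Return True if homo_lumo_gap differs from lumo - homo by more than tolerance."""
    expected = float(row["lumo"]) - float(row["homo"])
    return abs(float(row["homo_lumo_gap"]) - expected) > tolerance


# Tiny synthetic table standing in for the real file (values are made up).
_sample = "_id,descriptor_id,homo,lumo,homo_lumo_gap\nabc,d1,-5.5,-4.0,1.5\n"
_buffer = io.BytesIO(gzip.compress(_sample.encode("utf-8")))
rows = read_defects(_buffer)
```

With the real data, read_defects would be given the path of the downloaded CSV instead of the in-memory buffer.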
Same as defects.csv.gz, plus additional derived variables:
- formation_energy: defect formation energy, computed according to equation 1 from the paper
- formation_energy_per_site: defect formation energy divided by the number of defects, according to equation 2 from the paper
- *_{min,max}: minimum and maximum of the quantities with respect to the different spin channels
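The two simpler derived quantities can be sketched as follows. This follows only the verbal descriptions given here (division by the number of defects, and min/max over the two spin channels); the equations in the paper remain the authoritative definitions. The function names are hypothetical, and the column names homo_majority and homo_minority are an assumed expansion of the *_{majority,minority} pattern.

```python
def spin_extrema(row, quantity):
    """Return (min, max) of a quantity over the majority and minority spin channels."""
    values = (float(row[f"{quantity}_majority"]), float(row[f"{quantity}_minority"]))
    return min(values), max(values)


def formation_energy_per_site(formation_energy, n_defects):
    """Divide the defect formation energy (eV) by the number of defect sites."""
    if n_defects <= 0:
        raise ValueError("n_defects must be positive")
    return formation_energy / n_defects


example_row = {"homo_majority": "-5.2", "homo_minority": "-5.6"}  # made-up values
homo_min, homo_max = spin_extrema(example_row, "homo")
```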
The archive initial.tar.gz contains the unrelaxed structures in the CIF format. Names correspond to the unique identifiers _id in defects.csv.gz. Note that the structures were relaxed prior to computing the properties.
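A single structure could be pulled out of the archive by its identifier roughly as shown below. This is a sketch under the assumption that each CIF is stored as a file whose stem equals the _id (for example abc.cif, possibly inside a subfolder); the actual layout of the archive may differ, and read_cif is a hypothetical helper.

```python
import io
import tarfile
from pathlib import PurePosixPath


def read_cif(archive, structure_id):
    """Return the text of the CIF whose file stem equals structure_id."""
    for member in archive:
        path = PurePosixPath(member.name)
        if member.isfile() and path.suffix == ".cif" and path.stem == structure_id:
            return archive.extractfile(member).read().decode("utf-8")
    raise KeyError(structure_id)


# Build a tiny in-memory archive to stand in for initial.tar.gz.
_payload = b"data_example\n"
_buffer = io.BytesIO()
with tarfile.open(fileobj=_buffer, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="initial/abc.cif")
    info.size = len(_payload)
    tar.addfile(info, io.BytesIO(_payload))
_buffer.seek(0)

with tarfile.open(fileobj=_buffer, mode="r:gz") as tar:
    cif_text = read_cif(tar, "abc")
```

The returned text can then be handed to any CIF parser of your choice.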
- _id: unique identifier of the defect type, corresponds to the descriptor_id column in defects.csv
- description: short semantic abbreviation of the defect type
- base: chemical formula of the pristine material
- cell: supercell size
- defects: dictionary describing each point defect
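Since descriptor_id in the defects table points at _id in the descriptors table, the two can be joined to label every structure with its defect type. The sketch below does this with plain dictionaries; attach_descriptors is a hypothetical helper, and the sample rows (including the way the cell value is written) are invented for illustration.

```python
def attach_descriptors(defect_rows, descriptor_rows):
    """Add descriptor fields to each defect row, matched on descriptor_id == _id."""
    by_id = {d["_id"]: d for d in descriptor_rows}
    merged = []
    for row in defect_rows:
        descriptor = by_id[row["descriptor_id"]]
        merged.append({
            **row,  # keeps the structure's own _id
            "description": descriptor["description"],
            "base": descriptor["base"],
            "cell": descriptor["cell"],
        })
    return merged


# Made-up rows using the column names described in this section.
descriptors = [{"_id": "d1", "description": "vacancy_S", "base": "MoS2", "cell": "[8, 8, 1]"}]
defects = [{"_id": "abc", "descriptor_id": "d1", "energy": "-100.0"}]
merged = attach_descriptors(defects, descriptors)
```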
Contains chemical potentials (in eV) of the elements, to be used in formation energy computation.
Contains the properties of pristine material.
- base: chemical formula of the pristine material
- cell_size: supercell size
- energy: total potential energy of the system, eV
- fermi: Fermi level, eV
- E_1: energy of the first Kohn–Sham orbital of the pristine structure (see section "DFT computations" in the paper), eV
- E_VBM: energy of the valence band maximum of the pristine structure
Unit cells of the pristine materials used to produce the structures in the folder.
The raw VASP output, including the relaxation trajectories, is available in DVC:
- Clone the repository
- Ensure that DVC[S3] is installed, for example by running
pip install dvc[s3]
- Download the VASP output:
dvc pull -R datasets/raw_vasp/high_density_defects datasets/raw_vasp/dichalcogenides8x8_vasp_nus_202110
- Some of the data are packed into tar.gz archives, as their unpacked size is ~300 GB. You might want to use ratarmount to work with them.
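If mounting the archives with ratarmount is not an option, one alternative (not the approach recommended above) is to stream through an archive with Python's standard tarfile module and pick out only the files of interest, without unpacking anything to disk. The sketch below assumes the archives contain standard VASP output files such as OUTCAR; the member names in the demo are hypothetical.

```python
import io
import tarfile


def iter_members(fileobj, suffix):
    """Yield (name, bytes) for regular files whose name ends with suffix.

    The archive is read as a forward-only stream, so nothing is unpacked to disk.
    """
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                yield member.name, tar.extractfile(member).read()


# Small in-memory archive standing in for one of the raw VASP archives.
_buffer = io.BytesIO()
with tarfile.open(fileobj=_buffer, mode="w:gz") as tar:
    for name, payload in [("run_1/OUTCAR", b"outcar"), ("run_1/INCAR", b"incar")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
_buffer.seek(0)

found = list(iter_members(_buffer, "OUTCAR"))
```

Streaming reads the whole archive once from start to end, so for repeated random access a mounted view is likely to be more convenient.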