ALF database design doc
Structure:

Compressed_data/
    Each stoichiometry has its own entry, with an outer batching axis (see the h5py sketch after this outline).
    Al28/
        'species', 'cell', 'coordinates', 'energy', 'test', and other metadata, etc.
    Al128/
        'species', 'cell', 'coordinates', 'energy', 'test', etc.
Uncompressed_data/
    Similar to the above, but each geometry is its own entry.
Data_reductions/
    Reduction_1/
        Complete=True/False
        0/
        1/
        2/
            Al28/
                index: array taking values -1, 0, or 1 (other values are possible):
                    -1: data is anomalous and should not be used
                    0: data is not in the training/test set
                    1: data is in the training/test set
            Al128/
    Reduction_2/
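A minimal h5py sketch of this layout. The file name, dtypes, and example values are illustrative assumptions, not a fixed schema:

```python
import h5py
import numpy as np

n_atoms = 28
with h5py.File("alf_database.h5", "w") as f:
    # Compressed data: one group per stoichiometry; axis 0 is the batch axis.
    g = f.create_group("compressed_data/Al28")
    # maxshape=(None, ...) keeps each array resizable so data can be appended.
    g.create_dataset("species", shape=(0, n_atoms), maxshape=(None, n_atoms), dtype="i4")
    g.create_dataset("cell", shape=(0, 3, 3), maxshape=(None, 3, 3), dtype="f8")
    g.create_dataset("coordinates", shape=(0, n_atoms, 3), maxshape=(None, n_atoms, 3), dtype="f8")
    g.create_dataset("energy", shape=(0,), maxshape=(None,), dtype="f8")
    g.create_dataset("test", shape=(0,), maxshape=(None,), dtype=bool)

    # Uncompressed data: one group per geometry under each stoichiometry.
    f.create_group("uncompressed_data/Al28/geom_000000").create_dataset("energy", data=-93.7)

    # Data reductions: per-iteration, per-stoichiometry index arrays of -1/0/1.
    r = f.create_group("data_reductions/reduction_1")
    r.attrs["complete"] = False
    r.create_dataset("0/Al28/index", data=np.zeros(0, dtype=np.int8))
```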
Functions:
Compress_dataset():
Reallocates the compressed_data arrays to accommodate the data from uncompressed_data, moves the uncompressed data into the new arrays, and removes the old compressed_data arrays.
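A sketch of how this might look, assuming the h5py layout above; resizable datasets stand in for the reallocate-copy-delete sequence described here:

```python
import h5py
import numpy as np

FIELDS = ("species", "cell", "coordinates", "energy", "test")

def compress_dataset(path="alf_database.h5"):
    """Fold the per-geometry uncompressed entries into the batched arrays."""
    with h5py.File(path, "a") as f:
        for stoich in list(f["uncompressed_data"]):
            geoms = list(f["uncompressed_data"][stoich].values())
            dst = f["compressed_data"].require_group(stoich)
            for key in FIELDS:
                new = np.stack([g[key][...] for g in geoms])  # add batch axis
                if key in dst:
                    dset = dst[key]
                    dset.resize(dset.shape[0] + new.shape[0], axis=0)
                    dset[-new.shape[0]:] = new                # append in place
                else:
                    dst.create_dataset(key, data=new,
                                       maxshape=(None,) + new.shape[1:])
            del f["uncompressed_data"][stoich]  # drop the per-geometry copies
```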
Build_strided_train_cache(directory, data_reduction, train_fraction, valid_fraction, n_caches, use_uncompressed_data, format, seed, return_direct):
Uses the compressed_data, together with the specified data_reduction array, to generate a series of train/test sets for training an ensemble. Data is dumped to the preferred format (h5py, npy, json, etc.). seed is a random-number seed for the train/valid splits. return_direct returns the database in memory rather than writing it to a disk location. Should add a 'path' entry to the dataset that points each entry back to the original dataset entry (so that reduction bits can be manipulated).
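A sketch of the split logic only (no file I/O); the 'strided' interpretation, the row-index argument, and the default seed are assumptions, and each cached entry in a real implementation would also carry the 'path' pointer described above:

```python
import numpy as np

def build_strided_train_cache(rows, train_fraction, valid_fraction,
                              n_caches, seed=0):
    # Shuffle the eligible rows once; `seed` makes the splits reproducible.
    rng = np.random.default_rng(seed)
    order = rng.permutation(np.asarray(rows))
    n_train = int(train_fraction * len(order))
    n_valid = int(valid_fraction * len(order))
    caches = []
    for i in range(n_caches):
        # "Strided": each ensemble member starts at a different offset into
        # the shuffled ordering, so its train/valid split differs.
        member = np.roll(order, -i * max(len(order) // n_caches, 1))
        caches.append({"train": member[:n_train],
                       "valid": member[n_train:n_train + n_valid]})
    return caches
```

With train_fraction=1.0 and valid_fraction=0.0, every row lands in a single 'train' cache, which is what the last note below exploits.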
ASE_evaluate_unused_data(ase_calculator, reduction):
Iterates through all data in 'compressed_data' that has a '0' in the current reduction/X/ array. Evaluates each configuration with an ASE calculator (yes, inefficient, but general) and returns side-by-side arrays of predicted vs. actual properties. Should return a 'path' entry for each data point that points back to the original dataset location, for manipulating reduction bits. This function may implement restartability by operating on a single stoichiometry at a time and saving predicted energies and forces at the appropriate reduction point; these should be deleted when moving to the next step.
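A sketch of the evaluation loop, assuming the layout above; it returns (stoichiometry, row) 'path' tuples, and the per-stoichiometry restart bookkeeping is omitted:

```python
import h5py
import numpy as np
from ase import Atoms

def ase_evaluate_unused_data(calc, reduction, iteration, path="alf_database.h5"):
    """Evaluate every configuration whose reduction index is 0."""
    preds, trues, paths = [], [], []
    with h5py.File(path, "r") as f:
        for stoich, grp in f["compressed_data"].items():
            idx = f[f"data_reductions/{reduction}/{iteration}/{stoich}/index"][...]
            for row in np.flatnonzero(idx == 0):
                atoms = Atoms(numbers=grp["species"][row],
                              positions=grp["coordinates"][row],
                              cell=grp["cell"][row], pbc=True)
                atoms.calc = calc                 # one ASE call per config
                preds.append(atoms.get_potential_energy())
                trues.append(grp["energy"][row])
                paths.append((stoich, int(row)))  # pointer back into the db
    return np.asarray(preds), np.asarray(trues), paths
```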
Update_data_reductions(reduction, prediction_arrays, true_arrays, fraction, exclude_high_error, procedure):
Takes the data from ASE_evaluate_unused_data and adds the highest-error data, or random data that exceeds some threshold, to the training set. It generates a new iteration under the given reduction and sets the appropriate indices.
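A sketch of the selection rule for a single stoichiometry; the procedure names, and the assumption that `errors` is aligned (in order) with the rows whose index was 0, are illustrative:

```python
import numpy as np

def update_data_reductions(index, errors, fraction,
                           exclude_high_error=None, procedure="max_error",
                           seed=0):
    """Return the next iteration's -1/0/1 array from a copy of the last one."""
    rows = np.flatnonzero(index == 0)             # rows that were evaluated
    if exclude_high_error is not None:            # mark outliers anomalous
        bad = errors > exclude_high_error
        index[rows[bad]] = -1
        rows, errors = rows[~bad], errors[~bad]
    n_add = int(fraction * len(rows))
    if procedure == "max_error":                  # promote worst predictions
        chosen = rows[np.argsort(errors)[::-1][:n_add]]
    else:                                         # or a random subset
        chosen = np.random.default_rng(seed).choice(rows, n_add, replace=False)
    index[chosen] = 1                             # now in the training set
    return index                                  # store as the next iteration
```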
Notes:
The 'data_reductions' arrays will be shorter than the compressed_data arrays because the compressed arrays are expanded over time. It is important that new data is always appended, so that earlier indices do not break. You should be able to recover old datasets from complete or incomplete reductions.
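A small numpy illustration of why this works (values are made up): an index array written at some iteration covers only the rows that existed then, so it stays valid as the compressed arrays grow.

```python
import numpy as np

energy = np.arange(10.0)                           # compressed array, now 10 rows
index = np.array([1, 0, -1, 1, 0], dtype=np.int8)  # reduction saved at 5 rows

old_train = energy[np.flatnonzero(index == 1)]     # recover the old training set
appended = np.arange(len(index), len(energy))      # rows added after the reduction
```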
The 'test' parameter within the database itself indicates whether data should be permanently held out as part of a unique test set.
Individual neural networks can implement a faster version of 'ASE_evaluate_unused_data' by using 'build_strided_train_cache' with train_fraction=1.0 and valid_fraction=0.0 and reading in the preferred database format.
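Hypothetical usage, reusing the build_strided_train_cache sketch above:

```python
import numpy as np

# With train_fraction=1.0 and valid_fraction=0.0, the single "train" cache is
# just every unused row, which a network can then evaluate in bulk from its
# preferred on-disk format instead of one ASE call per configuration.
index = np.array([1, 0, -1, 1, 0], dtype=np.int8)  # current reduction iteration
cache = build_strided_train_cache(np.flatnonzero(index == 0),
                                  train_fraction=1.0, valid_fraction=0.0,
                                  n_caches=1, seed=0)
```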