ALF database design doc

Structure:

Compressed_data/

Each Stoichiometry has its own entry, with an outer batching axis

Al28

	'species', 'cell', 'coordinates', 'energy', 'test', other metadata, etc.

Al128

	'species', 'cell', 'coordinates', 'energy', 'test', etc.

Uncompressed_data/

Similar to above, but each geometry will be its own entry

Data_reductions/

Reduction_1/

	Complete=True/False

	0/

	1/

	2/

		Al28/

			index: array whose entries take the values -1, 0, or 1 (other values are possible)

			-1: data is anomalous and should not be used

			0: data is not in training/test set

			1: data is in training/test set

		Al128/

Reduction_2/
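
As a concrete illustration, here is a minimal sketch of this layout, assuming the database is stored as HDF5 via h5py (the file name and array shapes are assumptions; the group and dataset names follow the structure above):

```python
import h5py
import numpy as np

# A minimal sketch of the layout above. Shapes are illustrative:
# ten Al28 configurations, batched on the outer axis.
with h5py.File("alf_database.h5", "w") as f:
    al28 = f.create_group("compressed_data/Al28")
    al28.create_dataset("species", data=np.full((10, 28), 13))      # atomic numbers
    al28.create_dataset("cell", data=np.zeros((10, 3, 3)))          # lattice vectors
    al28.create_dataset("coordinates", data=np.zeros((10, 28, 3)))
    al28.create_dataset("energy", data=np.zeros(10))
    al28.create_dataset("test", data=np.zeros(10, dtype=bool))      # permanent hold-out flag

    red = f.create_group("data_reductions/Reduction_1")
    red.attrs["complete"] = False
    # Iteration 0 of Reduction_1: one index array per stoichiometry,
    # with -1 = anomalous, 0 = unused, 1 = in the train/test set.
    red.create_dataset("0/Al28/index", data=np.zeros(10, dtype=np.int8))
```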

Functions:

Compress_dataset(): 

Reallocates the compressed_data arrays to accommodate the data in uncompressed_data, moves the uncompressed data into the new arrays, and removes the old compressed_data arrays.
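
A minimal sketch of this step, assuming the h5py layout above (the per-geometry entry names under uncompressed_data are an assumption):

```python
import h5py
import numpy as np

def compress_dataset(path="alf_database.h5"):
    # Sketch: fold per-geometry entries from uncompressed_data into the batched
    # compressed_data arrays, always appending so existing row indices survive.
    with h5py.File(path, "a") as f:
        for stoich in list(f["uncompressed_data"]):
            geoms = f["uncompressed_data"][stoich]
            names = sorted(geoms)                       # deterministic entry order
            for key in ("species", "cell", "coordinates", "energy", "test"):
                new = np.stack([geoms[g][key][()] for g in names])
                target = f"compressed_data/{stoich}/{key}"
                if target in f:
                    new = np.concatenate([f[target][()], new])  # append, never reorder
                    del f[target]                       # remove the old array
                f.create_dataset(target, data=new)
            del f["uncompressed_data"][stoich]          # uncompressed copies are gone
```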

Build_strided_train_cache(directory,data_reduction,train_fraction,valid_fraction,n_caches,use_uncompressed_data,format,seed,return_direct):

Uses compressed_data, with the specified data_reduction array, to generate a series of train/test sets for training an ensemble. Data is dumped to the preferred format (h5py, npy, json, etc.). seed is a random-number seed for the train/valid splits. return_direct returns the database in memory rather than writing it to a disk location. Should add a 'path' entry to the dataset that points each entry back to its original dataset entry (so that reduction bits can be manipulated).
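
The split logic might look like the following sketch (the per-member seeding scheme and the in-memory return shape are assumptions; the real function would also write each cache to the chosen format):

```python
import numpy as np

def build_strided_train_cache_sketch(index, train_fraction, valid_fraction,
                                     n_caches, seed):
    # Sketch of the split logic only: for each ensemble member, shuffle the
    # rows whose reduction bit is 1 and cut them by the requested fractions.
    # 'index' is one stoichiometry's reduction array of -1/0/1 values.
    eligible = np.nonzero(index == 1)[0]       # row numbers double as 'path' entries
    caches = []
    for i in range(n_caches):
        rng = np.random.default_rng(seed + i)  # assumed per-member seeding
        order = rng.permutation(eligible)
        n_train = int(train_fraction * len(order))
        n_valid = int(valid_fraction * len(order))
        caches.append({"train": order[:n_train],
                       "valid": order[n_train:n_train + n_valid]})
    return caches
```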

ASE_evaluate_unused_data(ase_calculator,reduction): 

Iterates through all data in 'compressed_data' that has a '0' in the current reduction/X/ array, evaluates each configuration with an ASE calculator (inefficient, but general), and returns side-by-side arrays of predicted vs. actual properties. Should return a 'path' entry for each datapoint that points back to the original dataset location, for manipulating reduction bits. This function may implement restartability by operating on a single stoichiometry at a time and saving predicted energies and forces at the appropriate reduction point; these should be deleted when moving to the next step.
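
One stoichiometry's worth of that loop might look like this sketch (f is an open h5py file in the layout above; pbc=True and the shape of the 'path' tuples are assumptions):

```python
import numpy as np
from ase import Atoms

def ase_evaluate_unused_data_sketch(f, stoich, index, calculator):
    # Sketch: evaluate every row of one stoichiometry whose reduction bit is 0
    # with a generic ASE calculator, keeping a 'path' back to the source row.
    grp = f[f"compressed_data/{stoich}"]
    predicted, actual, paths = [], [], []
    for i in np.nonzero(index == 0)[0]:
        atoms = Atoms(numbers=grp["species"][i],
                      positions=grp["coordinates"][i],
                      cell=grp["cell"][i], pbc=True)
        atoms.calc = calculator                         # one configuration at a time
        predicted.append(atoms.get_potential_energy())
        actual.append(grp["energy"][i])
        paths.append((stoich, int(i)))                  # for flipping reduction bits later
    return np.array(predicted), np.array(actual), paths
```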

Update_data_reductions(reduction,prediction_arrays,true_arrays,fraction,exclude_high_error,procedure):

Takes the output of ASE_evaluate_unused_data and adds the highest-error data, or random data exceeding some threshold, to the training set. It generates a new iteration under the given reduction and sets the appropriate indices.
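
A sketch of the highest-error variant of this selection, for one stoichiometry (the random-above-threshold procedure and multi-stoichiometry bookkeeping are omitted; treating exclude_high_error as an error cutoff that marks points anomalous is an assumption):

```python
import numpy as np

def update_data_reductions_sketch(index, rows, predicted, actual,
                                  fraction, exclude_high_error=None):
    # Sketch of the selection step: rank the evaluated points by absolute
    # error, promote the worst 'fraction' of them into the training set
    # (0 -> 1), and optionally flag extreme outliers as anomalous (-1).
    new_index = index.copy()                     # becomes the next iteration's array
    errors = np.abs(np.asarray(predicted) - np.asarray(actual))
    order = np.argsort(errors)[::-1]             # highest error first
    n_add = int(fraction * len(order))
    for rank, j in enumerate(order):
        row = rows[j]                            # 'path' back into the index array
        if exclude_high_error is not None and errors[j] > exclude_high_error:
            new_index[row] = -1                  # anomalous: never train on it
        elif rank < n_add:
            new_index[row] = 1                   # promote into the training set
    return new_index
```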

Notes:

'data_reductions' arrays will be shorter than the compressed_data arrays because the compressed arrays are expanded over time. It is important that new data is always appended, so that earlier indices do not break. You should be able to recover old datasets from completed or incomplete reductions.
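
For example, a workable convention (an assumption, not fixed by this document) is that an index array of length N describes the first N rows of the longer compressed arrays, and rows appended later default to 0 when the array is extended:

```python
import numpy as np

def extend_reduction(index, n_compressed):
    # Assumed convention: rows appended to compressed_data after this
    # reduction was created default to 0 (not yet in any training/test set).
    pad = np.zeros(n_compressed - len(index), dtype=index.dtype)
    return np.concatenate([index, pad])
```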

The ‘test’ parameter within the database itself indicates whether data should be permanently held out as part of a unique test set.

Individual neural networks can implement a faster version of 'ASE_evaluate_unused_data' by calling 'build_strided_train_cache' with train_fraction=1.0 and valid_fraction=0.0 and reading the result in the preferred database format, as in the sketch below.
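
Using the illustrative names from the earlier sketches, that fast path is roughly:

```python
# Illustrative fast path: collect every eligible row into a single cache with
# no validation split, then let the network batch-evaluate it via its native
# data reader instead of going through ASE one configuration at a time.
caches = build_strided_train_cache_sketch(index, train_fraction=1.0,
                                          valid_fraction=0.0, n_caches=1, seed=0)
rows = caches[0]["train"]    # every eligible row; caches[0]["valid"] is empty
```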
