Each data type (e.g. proteins, protein families, enrichments) should have its own folder named {data_type} (e.g. "gb1" for the gb1 protein). The sequence data for each data type should be saved as a .csv file with standardized columns, named {data_type}_data_full.csv. Specifically, each data type folder should contain:
- {data_type}_data_full.csv : file that stores the amino acid (AA) sequences with standardized column names and structure. Column names:
  - "aa_seq" : the AA sequence
  - "len" : the length of the AA sequence
  - "score" / "binary_score" / "label" : a fitness score or, more generally, a label for the specific AA sequence (score = regression, binary_score = binary classification, label = multiclass classification)
  - "{split_name1}", "{split_name2}" : columns holding the values 'train', 'test' and 'validation', to facilitate the dataset splits during experiments
- data_parse.py : a Python script that parses the data corresponding to the data type. It performs data preprocessing (duplicate removal, denoising) and formats the data to produce {data_type}_data_full.csv
- "embeddings" folder : contains .pt files named "{data_type}_{model_name}embs_layer{layer}{reduction}.pt", where the name identifies the protein language model, the layer and the reduction used to calculate the embeddings for the amino acid sequences of the respective data type
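
As a rough illustration of the column conventions listed above, the following sketch checks a parsed {data_type}_data_full.csv against them using only the Python standard library. It is not part of the repository: the function name check_rows, the split column name "random_split" and the sample sequences are hypothetical, and the sketch assumes that every row carries one of the three split values.

```python
import csv
import io

# Column conventions taken from the list above.
REQUIRED_COLUMNS = {"aa_seq", "len"}
TARGET_COLUMNS = {"score", "binary_score", "label"}
SPLIT_VALUES = {"train", "test", "validation"}


def check_rows(rows, split_columns):
    """Return a list of problems found in rows read with csv.DictReader.

    split_columns is the list of split column names to check
    (hypothetical names chosen by the caller).
    """
    if not rows:
        return ["file has no data rows"]

    problems = []
    header = set(rows[0].keys())

    missing = REQUIRED_COLUMNS - header
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if not (TARGET_COLUMNS & header):
        problems.append("no target column (score, binary_score or label)")
    for col in split_columns:
        if col not in header:
            problems.append(f"missing split column: {col}")

    for i, row in enumerate(rows):
        # "len" is read as text, so compare against the stringified length.
        if "aa_seq" in row and "len" in row:
            if row["len"].strip() != str(len(row["aa_seq"])):
                problems.append(f"row {i}: len does not match aa_seq")
        # Assumption: every row holds one of the three split values.
        for col in split_columns:
            if col in row and row[col] not in SPLIT_VALUES:
                problems.append(f"row {i}: unexpected value in {col}")
    return problems


# Made-up sample standing in for a {data_type}_data_full.csv file.
sample = io.StringIO(
    "aa_seq,len,score,random_split\n"
    "MKV,3,0.5,train\n"
    "MKVL,4,1.2,test\n"
)
rows = list(csv.DictReader(sample))
print(check_rows(rows, split_columns=["random_split"]))  # prints []
```

A check of this kind could be run at the end of data_parse.py, before the file is written, so that malformed rows are caught early.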