Multiple Representation DeepInsight package
A step-by-step guide to run MRep-DeepInsight codes. Language: MATLAB
MRep-DeepInsight is built upon the previous package DeepInsight3D. Therefore, all preceding packages (DeepInsight, DeepFeature and DeepInsight3D) can be executed using the MRep-DeepInsight package.
Setting up the `Parameters.m` file enables one to run the package in various ways, including all the previously developed packages.
The MRep-DeepInsight package has 2 main components: 1) conversion of tabular data to image samples, and 2) processing of the images by a convolutional neural network (CNN).
Figure 1 depicts the MRep-DeepInsight approach, where part a shows the transformation phase, part b shows the model estimation phase, and part c illustrates the model analysis phase.
Figure 1: An overview of the MRep-DeepInsight approach
OS: Linux Ubuntu 20.04; Matlab version: 2022b; GPU A100 (4 parallel);
Sharma A, Lopez Y, Jia S, Lysenko A, Boroevich KA, Tsunoda T, Multi-representation DeepInsight: an improvement on tabular data analysis, 2023 Paper link
-
Download the Matlab package MRep-DeepInsight.tar.gz or the entire directory from the link above. Store it in your working directory. Gunzip and untar as follows:
>> gunzip MRep-DeepInsight.tar.gz
>> tar -xvf MRep-DeepInsight.tar
Note:
- Install R/Python software to use UMAP (see `umap_Rmatlab.m`).
- This package also uses liblinear tools. Therefore, load the liblinear package and set the path correctly at Line 54 of Integrated_Test.m. Alternatively, comment out the call for weighted integrated accuracy and AUC (Lines 38-75 of Integrated_Test.m).
-
Download the example dataset from the following link (caution: data size is 17MB):
Move dataset4.mat to the folder `MRep-DeepInsight/Data/`. The dataset is given in the struct format of Matlab. Use any other data (binary class or multi-class) in a similar struct format for MRep-DeepInsight.
-
Download and install an example CNN such as ResNet-50 in Matlab; see the MathWorks link for details about ResNet-50. You may use different nets as desired.
-
Executing MRep-DeepInsight: all the codes should be run in the folder ../MRep-DeepInsight/. If you want to run them from a different folder, use addpath in Matlab to add the appropriate directories.
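If the scripts are launched from another folder, the package directories can be placed on the Matlab path first. The following is a minimal sketch; `pkgDir` is a hypothetical install location, not something defined by the package, and should be replaced with the folder where the tarball was unpacked.

```matlab
% Hypothetical install location; replace with your own folder.
pkgDir = fullfile(getenv('HOME'), 'MRep-DeepInsight');

% Add the package folder and all of its subfolders to the search path.
if isfolder(pkgDir)
    addpath(genpath(pkgDir));
else
    warning('Package folder not found: %s', pkgDir);
end
```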
The following mapping techniques can be used for MRep-DeepInsight
The supplement techniques modify the mappings of the manifold techniques. They cannot be run independently, so at least one manifold technique must be used with them. The supplement techniques are:
1) Gabor filtering
2) Blurring technique
3) Assignment distribution algorithm
In order to use one or a combination of the above techniques, please set the following parameter correctly.
- Open the `Parameters.m` file.
- Change (Line 5) `Parm.UseIntegrate='yes';`. Options are either `yes` or `no`. This will trigger the `MRep-DeepInsight` methodology and override the `Parm.Method` option.
- Change (Line 133) `Parm.integrate` as required. Some examples are given hereunder:
ex-1) Use both tSNE with hamming distance and tSNE with Euclidean distance:
`Parm.integrate={'tsne','hamming','tsne','euclidean'};`
i.e. Define distance after tSNE technique {tsne, distance,...}
ex-2) Use tSNE with hamming distance and UMAP technique:
`Parm.integrate={'tsne','hamming','umap'};`
i.e. UMAP does not require a distance to be defined. The same is true for KPCA and PCA.
ex-3) Use UMAP, Kernel PCA and PCA:
`Parm.integrate={'umap','kpca','pca'};`
ex-4) Use tSNE with cosine, UMAP, Gabor, Blurring, Assignment and tsne with Chebychev:
`Parm.integrate={'tsne','cosine','umap','gabor','blur','assignment','tsne','chebychev'};`
Please note that the term `blur` is used for the Blurring technique, and `assignment` is used for the Assignment distribution technique.
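The ordering rule above (a distance follows each `tsne` entry, while other techniques stand alone) can be checked with a short script before running the pipeline. This is only an illustrative sketch and not the package's own parser; the variable names `integrate` and `techs` are hypothetical.

```matlab
% Setting taken from ex-4 above.
integrate = {'tsne','cosine','umap','gabor','blur','assignment','tsne','chebychev'};

techs = {};   % one entry per listed technique
k = 1;
while k <= numel(integrate)
    name = lower(integrate{k});
    if strcmp(name, 'tsne')
        % A tsne entry must be followed by its distance.
        assert(k < numel(integrate), 'tsne must be followed by a distance');
        techs{end+1} = sprintf('%s (%s)', name, integrate{k+1}); %#ok<SAGROW>
        k = k + 2;
    else
        % umap, kpca, pca and the supplement techniques take no distance.
        techs{end+1} = name; %#ok<SAGROW>
        k = k + 1;
    end
end

fprintf('%d techniques listed:\n', numel(techs));
fprintf('  %s\n', techs{:});
```

For the ex-4 setting this lists six entries: two tSNE mappings, UMAP, and the three supplement techniques.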
In this example, tabular data with 2539 dimensions is used. It has 1178 training samples and 131 test samples, divided into two classes, namely Alzheimer's Disease (AD) and Normal Control (NC). First, the dataset is converted to images by the MRep-DeepInsight converter. Then the CNN (resnet50) is trained. The performance, in terms of accuracy, is evaluated on the test set.
- File: open the Example1.m file in the Matlab Editor.
- To activate the MRep-DeepInsight pipeline, set `Parm.UseIntegrate='yes'` in the `Parameters.m` file.
- Depending on how many representations are required, set up `Parm.integrate` in the `Parameters.m` file. For example, define `Parm.integrate={'tsne','hamming','tsne','cosine'}`, i.e., two representations (m=2).
- For a quick test of the codes, use 1 objective function; i.e., `Parm.MaxObj=1`. The recommended MaxObj value is 25 or over.
- Set up other parameters as required by changing the `Parameters.m` file; otherwise leave all as default. However, depending on your hardware, change `Parm.miniBatchSize` to a lower value if you encounter memory problems (the default value is 1024), and also `Parm.ExecutionEnvironment` (default is multi-gpu). If you don't want to see the training progress plot produced by CNN training, set `Parm.trainingPlot=none`.
- Dataset calling: since the dataset name is `dataset4.mat`, the variable `DSETnum=4` has been set (at Line 17 of Example1.m). If the name of the dataset is `datasetX.mat`, then the variable `DSETnum` should be set to `X`.
- The Example1.m file uses the updated function DeepInsight3D.m. This function has two parts: 1) tabular data to image conversion using `func_Prepare_Data.m` (supports previously developed converters) and `func_integrate.m` (supports MRep-DeepInsight), and 2) CNN training with resnet50 (the default; change as required) using `func_TrainModel.m`.
- The output is AUC (for 2-class problems only), C (confusion matrix) and Accuracy of the test set (at Line 28). It also gives ValErr, which is the validation error.
- By default, trained CNN models (such as model.mat, 0*.mat) and the tabular data converted to images (either Out1.mat or Out2.mat) will be saved in the folder /Models/Run4/ (since DSETnum=4; if DSETnum=N then they are saved in ../RunN/), and figures will be stored in the folder /FIGS/Run4/ (since DSETnum=4). The files are saved by calling the functions `func_SaveModels.m` and `func_SaveFigs.m`.
- The execution results are stored in the file `DeepInsight3D_Results.txt`, which is located in the folder /MRep-DeepInsight/.
- Running Example1.m displays a few messages in the Command Window of Matlab, such as:
    Dataset: Alzheimer 1 and 5
    NORM-2
    tSNE with exact algorithm is used
    Distance: hamming
    Pixels: 224 x 224
    Dataset: Alzheimer 1 and 5
    NORM-2
    tsne with exact algorithm is used
    Distance: cosine
    Pixels: 224 x 224
    Integrated conversion finished and saved as Out1.mat or Out2.mat!
    Training model begins: Net1
    ...
    |Iter | Eval result | Objective | ...
    |1 | Best | 0.18345 | ...
    ....
    Optimization completed
    MaxObjectiveEvaluations of 1 reached.
    Total function evaluations: 1
    Total elapsed time: 1785.2313 seconds
    Total objective function evaluation time: 1784.7876
    Best observed feasible point:
    InitialLearnRate Momentum L2Regularization
    4.9866e-05 0.80103 0.012516
    Training model ends
    weighted integrated accuracy: 84.73
    weighted integrated AUC: 0.8702
    model = struct with fields:
    bestIdx: 1
    fileName: "0.18343.mat"
    prob: [1x1 struct]
    valError: 0.1834
    Model Files Saved ...
    Figures Saved in the FIGS folder...
    End of script Example1.
Note that the above values might differ.
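For reference, the settings described in the steps of this example can be collected in one place. This is a sketch of the relevant assignments only, assuming they are edited inside `Parameters.m` (and `DSETnum` inside Example1.m). The reduced batch size and the single-GPU environment are illustrative values for smaller hardware, not the package defaults.

```matlab
% In Example1.m
DSETnum = 4;                                            % selects Data/dataset4.mat

% In Parameters.m
Parm.UseIntegrate = 'yes';                              % activate MRep-DeepInsight
Parm.integrate    = {'tsne','hamming','tsne','cosine'}; % two representations (m=2)
Parm.MaxObj       = 1;                                  % quick test, no Bayesian optimization
Parm.miniBatchSize        = 256;     % illustrative; default is 1024
Parm.ExecutionEnvironment = 'gpu';   % illustrative; default is multi-gpu
Parm.trainingPlot         = 'none';  % suppress the training progress plot
```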
The following training plot (Figure 2) can be seen if the `Parm.trainingPlot` option is set to `training-progress`.
Figure 2: Training progress plot
The objective function figure will be shown for the Bayesian Optimization Technique (BOT). By default, no BOT will be applied; i.e. `Parm.MaxObj=1`. However, if BOT is required, change the parameter `Parm.MaxObj` to a value higher than 1. If it is set as `Parm.MaxObj=25`, then 25 objective functions will be searched for tuning hyperparameters and the best one (with the minimum validation error) will be selected.
Results file: check `DeepInsight3D_Results.txt` for more information, such as
AUC: 0.8692
ConfusionMatrix 99 3 18 11
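As a quick sanity check, the plain test accuracy can be recomputed from the confusion matrix counts shown above. The sketch assumes that the four numbers form a 2 x 2 matrix whose diagonal holds the correctly classified samples; for a 2 x 2 matrix the diagonal is the same whether the numbers are read row-wise or column-wise.

```matlab
% Counts as listed in the results file excerpt.
C = [99 3; 18 11];             % assumed 2 x 2 layout

nTest    = sum(C(:));          % total number of test samples
accuracy = trace(C) / nTest;   % fraction of samples on the diagonal

fprintf('Test samples: %d, accuracy: %.2f%%\n', nTest, 100*accuracy);
```

The total of 131 agrees with the test set size stated for this example, and the accuracy works out to 110/131, about 83.97%. This plain accuracy need not equal the weighted integrated accuracy printed in the Command Window.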
- All the results will be stored in the current stage folder `~/DeepInsight3D_pkg/Models/Run4/StageX`, where X is the current stage.
- Similarly, all the figures will be stored in the folder `~/DeepInsight3D_pkg/FIGS/Run4/StageX`, where X is the current stage.
- For feature selection: if the loop continues, the value of X will increment to 1, 2, 3, …; i.e., the model is repeated to find a smaller subset of features/genes.
For hyperparameter tuning, the Bayesian Optimization Technique (BOT) can be used. If `Parm.MaxObj=1`, then no BOT will be applied. If it is N>1 (i.e. greater than 1), then N objective functions will be created and the best hyperparameters (for which the validation error is the minimum) will be selected.
Therefore, for BOT, use `Parm.MaxObj=N`, where N is any number greater than 1; e.g. N=10 gives 10 objective functions.
For no BOT, use `Parm.MaxObj=1`.
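To illustrate what a budget of N objective evaluations means, here is a self-contained toy run of Matlab's `bayesopt` (Statistics and Machine Learning Toolbox). It is not the package's training code: the objective `toyValErr` is a hypothetical stand-in for the CNN validation error, and the search ranges are assumptions. Only the three hyperparameter names are taken from the Command Window messages shown earlier.

```matlab
rng(1);   % reproducible toy run

% Hypothetical search ranges; the package's actual ranges may differ.
vars = [optimizableVariable('InitialLearnRate', [1e-5 1e-1], 'Transform', 'log')
        optimizableVariable('Momentum', [0.8 0.95])
        optimizableVariable('L2Regularization', [1e-10 1e-2], 'Transform', 'log')];

% Toy stand-in for the validation error returned after training a CNN.
% bayesopt passes x as a one-row table with one column per variable.
toyValErr = @(x) (log10(x.InitialLearnRate) + 3)^2 + (x.Momentum - 0.9)^2;

N = 5;    % plays the role of Parm.MaxObj
results = bayesopt(toyValErr, vars, ...
    'MaxObjectiveEvaluations', N, ...
    'IsObjectiveDeterministic', true, ...
    'Verbose', 0, 'PlotFcn', []);

best = results.XAtMinObjective;   % hyperparameters with the lowest observed error
disp(best);
```

With a real CNN each evaluation is a full training run, which is why a small value is suggested for quick tests and a larger one (25 or over) for actual tuning.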
- `MRep-DeepInsight` has 4 folders: Data, DeepResults, FIGS, and Models. It has several .m files. However, the main file is `DeepInsight3D.m`, which performs tabular data to image conversion and CNN modelling. The codes of MRep-DeepInsight are developed on the DeepInsight3D package and can therefore perform all tasks of previously developed models such as DeepInsight, DeepFeature and DeepInsight3D. All the parameter settings can be done in the `Parameters.m` file.
- DeepInsight3D.m has the following functions:
  - `func_integrated`: This function supports transforming tabular data to image data using the MRep-DeepInsight methodology. It loads the data, splits the training data into the Train and Validation sets, normalizes all 3 sets (including the Test set), and converts samples to image form using the Training set. The Test and Validation sets are not used to find pixel locations. The image datasets are stored as Out1.mat or Out2.mat depending on whether norm1 or norm2 was selected.
  - `Integrated_Test`: This function computes the integrated performance (as shown in Figure 1c: model analysis phase).
  - `func_Prepare_Data`: This function supports previous models (DeepFeature, DeepInsight and DeepInsight3D). It loads the data, splits the training data into the Train and Validation sets, normalizes all 3 sets (including the Test set), and converts multi-layered non-image samples to 3D image form using the Training set. The Test and Validation sets are not used to find pixel locations. Once the pixel locations are obtained, all the non-image samples are converted to 3D image samples. The image datasets are stored as Out1.mat or Out2.mat depending on whether norm1 or norm2 was selected.
  - `func_TrainModel`: This function executes the convolutional neural network (CNN) using many pretrained and custom nets. The user may change the net as required. The default values of the CNN hyperparameters are used. However, if `Parm.MaxObj` is greater than 1, it optimizes the hyperparameters using the Bayesian Optimization Technique. It uses the Training set and Validation set to tune and evaluate the model hyperparameters.
    Note: To tune the hyperparameters of the CNN automatically, use a higher value of `Parm.MaxObj`.
    The best model (in case Parm.MaxObj>1) is stored in the DeepResults folder as .mat files, where the file name depicts the best validation error achieved. For example, the file 0.32624.mat in the DeepResults folder holds the hyperparameters at validation error 0.32624. Also, the model file `model.mat` details the weights file and other relevant information to be stored.
- Feature selection functions
  - `func_FeatureSelection`: This finds activation maps at the ReLU layer, and performs the Region Accumulation (RA) step and the Element Decoder step to find the element/gene subset. The input is model.mat (from `func_TrainModel`) and the related .mat file from the folder DeepResults. This function finds the CAM for each sample and provides the union of all maps.
  - `func_FS_class_basedCAM`: This function performs class-based CAM, i.e., each class will have a distinct CAM.
  - `func_FeatureSelection_avgCAM`: This function finds the common CAM across all the samples.
- Non-image to image conversion: two core sub-functions of `func_Prepare_Data` and `func_integrated` are used to convert samples from non-image to image. These are described below.
  - `Cart2Pixel`: The input to this function is the entire Training set. The output is the feature or gene locations Z in the pixel frame. The size of the pixel frame is pre-defined by the user.
  - `ConvPixel`: The input is a non-image sample or feature vector and Z (from above). The output is an image sample corresponding to the input sample.
- Compression Snow-fall algorithm (SnowFall.m): Not used in this package. However, this compression algorithm is used to provide more space for features in the given pixel frame. Since the conversion from the Cartesian coordinate system to the pixel frame depends on the pixel resolution, it becomes difficult to fit all the features without overlap. This algorithm tries to create more space so that the overlapping of feature or gene locations is minimized. The input is the locations of genes or features together with the pixel size information. The output is the readjusted image. It is up to the user to use Snow-fall compression or not by setting `Parm.SnowFall` to either `0` (not use) or `1` (use).
- Extraction of Gene Names (optional): This option is useful for enrichment analysis. The two files for the extraction of genes are GeneNames_Extract.m and GeneNames.m. The list of gene names is stored in the `~/DeepInsight3D_pkg/Models/RunY/StageX/` folder.
  After running the feature selection function, the results will be stored in the corresponding RunY and StageX folders (where X and Y are integers 1,2,3…). If it is required to find the gene IDs/names of the obtained subset for each cancer type, execute the `GeneNames_Extract` function. Go to Line 4 and set the `Out_Stages` variable. For example, if Stage 2 has been saved inside Run1 after executing `func_FS_class_basedCAM`, use `Out_Stages = 2`. Then go to Line 6 and define `FileRun`. For MRep-DeepInsight, we have not used feature selection.
  The gene list per class will be generated. If there are 10 cancer types, then 10 files will be generated. In addition, one file listing all genes will be generated (e.g. GeneList_UnCmprss.txt). The results will be stored in `~/Models/RunY/StageX` as RunYStageX.tar.gz, and a folder with the same results will also be created as RunYStageX. In this example, they will be stored in the folder `Run1Stage2` and in Run1Stage2.tar.gz.
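The two conversion steps described above for `Cart2Pixel` and `ConvPixel` can be illustrated on toy data. The sketch below is not the package's implementation: the random 2-D coordinates stand in for the output of a dimensionality reduction technique, the sizes are arbitrary, and averaging the features that land on the same pixel is an assumption made for this example.

```matlab
rng(0);
d = 6;                 % number of features (toy size)
P = 4;                 % pixel frame is P x P (toy size)
xy = rand(d, 2);       % stand-in for 2-D feature coordinates from a DRT

% Step 1 (in the spirit of Cart2Pixel): map coordinates to pixel indices 1..P.
mn = min(xy, [], 1);
mx = max(xy, [], 1);
Z  = round(1 + (xy - mn) ./ (mx - mn) * (P - 1));   % d x 2 pixel locations

% Step 2 (in the spirit of ConvPixel): place one sample's feature values at Z.
sample = (1:d)';                                    % toy feature vector
idx = sub2ind([P P], Z(:,1), Z(:,2));               % linear pixel index per feature
img = accumarray(idx, sample, [P*P 1], @mean);      % average features sharing a pixel
img = reshape(img, P, P);

disp(img);
```

Pixels that receive no feature stay at zero. The pixel locations `Z` are computed once and then reused for every sample, which matches the description that only the Training set is used to find the locations.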
A number of parameters/variables are used to control the DeepFeature_pkg. The details are given hereunder.
- `Parm.Method` (select dimensionality reduction technique)
  The dimensionality reduction technique (DRT) can be one of the following methods: 1) tSNE, 2) principal component analysis (PCA), 3) kernel PCA, 4) uniform manifold approximation and projection (UMAP). For UMAP you can use Python or R scripts (please see umap_Rmatlab.m). Please note that these DRTs are not used in the conventional manner. Only the element locations are obtained by the DRTs, and the reduction of features or dimensions is NOT performed.
  Select this variable in the `Parameters.m` file, or after calling `Parm = Parameters(DSETnum)` change `Parm.Method` to 'tSNE', 'kpca', 'pca' or 'umap'.
  Default is tSNE.
- `Parm.UseIntegrate`: can be `'yes'` or `'no'`. If 'yes', then the following `Parm.integrate` variable will be used; otherwise `Parm.Method` will be used.
- `Parm.integrate`: This supports more than one representation (which is not possible with `Parm.Method`). Various manifold techniques (with their respective distances, especially for tSNE) and supplement methods can be listed here to integrate the performance of these techniques. See Line 133 in the `Parameters.m` file. The usage is:
  `Parm.integrate = {Manifold1,Distance1,Manifold1,Distance2,... Manifold3,Manifold4,..., Supplement1,Supplement2,Supplement3}`
  where Manifold1 is tSNE and Distance1...Distance11 are tSNE distances. Manifold3, Manifold4,... are other manifold techniques such as KPCA, UMAP and PCA. Supplement1, Supplement2,.. are Blur, Gabor and Assignment. All or any of these combinations can be used for `Parm.integrate`, as long as more than 1 technique is selected to render the multiple representation strategy.
- `Parm.Dist` (distance selection, only for tSNE). This parameter is NOT used when `Parm.UseIntegrate=yes`.
  If tSNE is used, then one of the following distances can be used. The default distance is 'euclidean'.
  Parm.Dist = 'cosine', 'hamming', 'mahalanobis', 'euclidean', 'chebychev', 'correlation', 'minkowski', 'jaccard', or 'seuclidean' (standardized Euclidean distance).
- `Parm.Max_Px_Size` (maximum pixel frame, either row or column)
  The default value is 224, as required by the ResNet-50 architecture.
- `Parm.ValidRatio` (ratio of validation data to training data)
  The fraction of the training data to be used as a validation set. Default is 0.1; i.e., 10% of the training data is kept aside as a validation set. The new training set will be 90% of the original size. Note: If `Parm.ValidRatio=0`, then no validation set will be kept aside. In this case, the entire training set will be used for model estimation.
- `Parm.Seed`
  Random seed used to split the data.
- `Parm.NetName`: use pre-trained nets such as `resnet50`, `inceptionresnetv2`, `nasnetlarge`, `efficientnetb0`, `googlenet` and so on. See the list of pre-trained nets at the Matlab link here.
- `Parm.ExecutionEnvironment`: execution environment based on your hardware. Options are `cpu`, `gpu`, `multi-gpu`, `parallel`, and `auto`. Please check trainingOptions (Matlab) for further details.
- `Parm.ParallelNet`: if '1', this option overrides `Parm.NetName`. The custom-made net from `makeObjFcn2.m` will be used.
- `Parm.miniBatchSize`: defines the miniBatchSize; default is 1024 (for 4 parallel A100 GPUs of 40GB each).
- `Parm.Augment`: augment samples during training; select '1' for yes and '0' for no.
- `Parm.AugMeth`: select method '1' or '2'. Method 1 automatically augments samples, whereas Method 2 is controlled by the user.
- `Parm.aug_tr`: if `Parm.AugMeth=2`, then `Parm.aug_tr=500` will augment the training set to 500 samples for any class that has fewer than 500 samples.
- `Parm.aug_val`: if `Parm.AugMeth=2`, then `Parm.aug_val=50` will augment the validation set to 50 samples for any class that has fewer than 50 samples.
- `Parm.ApplyFS`: if '1', it applies a feature selection process using Logistic Regression before applying the DeepInsight transformation.
- `Parm.FeatureMap`: has the following options.
  - '0' means use 'all' omics or multi-layered data for conversion.
  - '1' means use the 1st layer for conversion (e.g. expression).
  - '2' means use the 2nd layer for conversion (e.g. methylation).
  - '3' means use the 3rd layer for conversion (e.g. mutation).
- `Parm.TransLearn`: if '1', the CNN learns from nets previously trained on your other datasets. Please save the `model.mat` and pretrained model `0*.mat` files generated from the previous run to the `Models/Run32/Stage1` folder. The current execution of the CNN will train on the pretrained model `Models/Run32/Stage1/0*.mat`. This will render transfer learning from the `0*.mat` and `model.mat` files.
- `Parm.FileRun`
  Change the name as RunX, where X is an integer defining the run of DeepFeature on your data.
  Change the value X for new runs.
- `Parm.SnowFall` (compression algorithm)
  If the SnowFall compression algorithm is to be used, set the value to 1, otherwise 0. Default is set as 1.
- `Parm.Threshold` (for Class Activation Maps)
  Set the threshold of class activation maps (CAMs) by changing the value between 0 and 1. If the value is high (towards 1), then the region of the activation maps will be very fine. On the other hand, the region will be broader towards value 0. Default is 0.3.
- `Parm.DesiredGenes`
  Expected number of genes to be selected. Default is set as 1200. However, change as required.
- `Parm.UsePrevModel`
  The iterative way runs in multiple stages. If you want to avoid running the CNN multiple times, then set this value to 'y' (yes); i.e., the previous weights of the CNN will be used for the current model. This way the processing time is shorter; however, performance (in terms of selection and accuracy) would be lower. The default setting is 'n' (no).
- `Parm.SaveModels`
  For saving models type 'y', otherwise 'n'. Default is set as yes 'y'.
- `Parm.Stage`
  Defines the stage of execution. The default value is `Parm.Stage=1`. All the results will be saved in RunXStage1. If the iterative process is executed, then results will be stored in Stage2, Stage3… and so on.
- `Parm.PATH`
  Default paths for FIGS, Models and Data are `~/MRep-DeepInsight/FIGS/`, `~/MRep-DeepInsight/Models/` and `~/MRep-DeepInsight/Data/`, respectively. Runtime parameters will be stored in the `~/MRep-DeepInsight/` folder (such as model.mat, Out1.mat or Out2.mat).
- Log and performance file (including an overview of parameter information)
  The runtime results will be stored in `~/MRep-DeepInsight/DeepInsight3D_Results.txt` with complete information about the run.
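As a worked illustration of how `Parm.ValidRatio` and `Parm.Seed` interact, the sketch below holds out a validation subset from the 1178 training samples of Example 1. It is a simplified stand-in, not the package's splitting code: the seed value is hypothetical, and the package may round differently or split per class.

```matlab
validRatio = 0.1;     % mirrors Parm.ValidRatio (default 0.1)
seed       = 108;     % hypothetical value standing in for Parm.Seed
nTrain     = 1178;    % training samples in Example 1

rng(seed);                              % same seed gives the same split
perm   = randperm(nTrain);
nVal   = round(validRatio * nTrain);    % samples held out for validation
valIdx = perm(1:nVal);                  % validation indices
trIdx  = perm(nVal+1:end);              % remaining training indices

fprintf('Train: %d, Validation: %d\n', numel(trIdx), numel(valIdx));
```

Under these assumptions, 118 samples are held out and 1060 remain for training. Setting `validRatio = 0` leaves `valIdx` empty, so the whole training set is used for model estimation.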
A YouTube video about the original DeepInsight method is available here. A Matlab page on DeepInsight can be viewed from here.
Sharma A, Vans E, Shigemizu D, Boroevich KA, Tsunoda T, DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture, Scientific Reports, 9(1), 1-7, 2019.
Sharma A, Lysenko A, Boroevich K, Vans E, Tsunoda T, DeepFeature: feature selection in nonimage data using convolutional neural network, Briefings in Bioinformatics, 22(6), 2021.
Sharma A, Lysenko A, Boroevich K, Tsunoda T, DeepInsight-3D architecture for anti-cancer drug response prediction with deep-learning on multi-omics, Scientific Reports, 13(2483), 2023.
Jia S, Lysenko A, Boroevich K, Sharma A, Tsunoda T, scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning, Briefings in Bioinformatics, 2023. https://doi.org/10.1093/bib/bbad266
Overall weblink here