
ImaGene Workflow

[ImaGene workflow diagram]

Models supported (starting from v3.2):

  1) Random Forest
  2) DecisionTree Classifier
  3) Multi-layer Perceptron (MLP) Classifier
  4) Support Vector Classifier (SVC)
  5) Logistic Regression
  6) Decision Tree Regressor
  7) Linear Regression
  8) LinearModel
  9) multiTaskLinearModel
  10) LASSO
  11) multiTaskLASSO
  
  Note: Please refer to the README page as well.
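Most of these names correspond directly to scikit-learn estimators. The mapping below is an orientation sketch only, not ImaGene's actual code; LinearModel and multiTaskLinearModel are omitted because their scikit-learn counterparts are not obvious from the names alone.

```python
# Hypothetical mapping from config model names to scikit-learn estimators.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.linear_model import (LogisticRegression, LinearRegression,
                                  Lasso, MultiTaskLasso)

MODELS = {
    "RandomForest": RandomForestClassifier,
    "DecisionTreeClassifier": DecisionTreeClassifier,
    "MLPClassifier": MLPClassifier,
    "SVC": SVC,
    "LogisticRegression": LogisticRegression,
    "DecisionTreeRegressor": DecisionTreeRegressor,
    "LinearRegression": LinearRegression,
    "LASSO": Lasso,
    "multiTaskLASSO": MultiTaskLasso,
}
```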

ImaGene's workflow consists of the following functions:

  1. Python main
  2. Process
  3. Read_dataset
  4. Preprocessing
  5. Correlation
  6. Splitdata
  7. Normal_dataframe
  8. BuildModel
  9. Evaluate

1. Python main:


This function is the main entry point for Python. It takes the data, label, config.ini, model, and prediction_out files as inputs. The data and label files are CSVs sharing a common ID column (see the supplementary files on GitHub: the IBC and HNSCC radiomics and gene FPKM CSVs). The config.ini file holds user-defined parameters such as model type, mode (Train, validate, or predict), test data size (for the train-test split in Train mode), scoring, K-fold cross-validation splitter, normalization method, p-value adjustment method, correlation method, correlation coefficient threshold, and other model hyperparameters. An example config file is on GitHub: config_IBC_LR.ini (for training a Linear Regression model on the IBC case).

The model (.pkl) and prediction_out arguments need not be specified in "Train" mode, but are required in "validate" and "predict" modes. "Train" mode generates a model.pkl, while "validate" and "predict" use that model for validation and prediction, respectively. Note that testing is part of "Train" mode. The mode is set in config.ini as stated above.
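For orientation, a hypothetical config.ini sketch is shown below. The key names and values are illustrative assumptions only; consult config_IBC_LR.ini on GitHub for the actual schema and accepted values.

```ini
; Illustrative sketch only -- the key names below are hypothetical.
; See config_IBC_LR.ini on GitHub for the real schema.
[DEFAULT]
model = LinearRegression     ; model type (see supported models above)
mode = Train                 ; Train | validate | predict
test_size = 0.2              ; train-test split fraction (Train mode only)
scoring = r2                 ; scoring metric
cv_splits = 5                ; K-fold cross-validation splitter
normalization = min-max      ; data/label normalization method
p_adjust = fdr_bh            ; p-value adjustment method
correlation = pearson        ; pearson | spearman
corr_threshold = 0.5         ; correlation coefficient threshold
gridsearch = False           ; enable/disable hyperparameter grid search
```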

Note that when you operate ImaGene through the web platform (www.imagene.pgxguide.org) rather than the command-line interface on Linux/Mac via the Docker container, you do not need a config.ini: the platform generates one automatically from the parameter values you select on the web platform for the respective experiment/run.

The main function calls the process function twice: 1) with the FeatureSelection flag set to "0", where the regular workflow of correlations plus model training and testing executes; 2) with the flag set to "1", where the correlation step is skipped entirely and training and testing run on the features selected using correlation weights or feature importances obtained from the initial training of the Regression- or DecisionTree-based models, respectively (per the model type the user selects in the config). The FeatureSelection run with the flag set to "1" executes only when the mode is set to "Train" and gridsearch is set to "False" (to avoid over-complicating model training).

The results without feature selection are stored in the main run directory, whereas those with feature selection are stored in a subdirectory named/prefixed "FeatureSelection".
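A minimal sketch of this two-pass flow follows, assuming a hypothetical process() signature (ImaGene's real internals may differ):

```python
# Sketch only: process() is a stand-in stub, not ImaGene's real signature.
def process(data_csv, label_csv, config, feature_selection, features=None):
    """Stub: correlate (when feature_selection == 0), then train and test."""
    return ["featureA", "featureB"]  # pretend these features were selected

def main(data_csv, label_csv, config):
    # Pass 1: regular workflow -- correlations, then model training/testing.
    selected = process(data_csv, label_csv, config, feature_selection=0)
    # Pass 2 (Train mode with gridsearch=False only): skip correlations and
    # retrain on the features chosen via correlation weights / importances.
    if config["mode"] == "Train" and not config["gridsearch"]:
        process(data_csv, label_csv, config, feature_selection=1,
                features=selected)  # results land in the FeatureSelection/ subdir
```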

2. Process


The process function covers the entire pipeline from training through testing and validation, per the mode of operation specified by the user. It starts by reading the data and label CSVs and preprocessing them to check for matching sample IDs across files and to remove any feature columns with NaN values in one or more cells. It then computes correlations between data and label features (when the mode is set to "Train"), and calls Preprocessing, splitdata for the train-test split ("Train" mode), and BuildModel (to build a model with or without grid search, depending on whether the user sets gridsearch to True or False). It also conducts validation (in "validate" mode) and prediction (in "predict" mode). As each step (i.e., each function call) completes, an HTML report is constructed, which can later be converted to PDF.

3. Read_dataset


Reads datasets (for both data and labels, as called by the process function) and converts each to a pandas DataFrame.
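In pandas terms this is essentially a one-liner; the sketch below assumes a column literally named "ID" (the actual column name follows the input CSVs):

```python
import pandas as pd

def read_dataset(path, id_col="ID"):
    # Load a CSV and index it by the shared sample-ID column, so the data
    # and label frames can later be aligned on that index.
    return pd.read_csv(path).set_index(id_col)
```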

4. Preprocessing


Preprocesses the data and label dataframes to check SampleID concordance across both and eliminates feature columns containing NaN values in one or more cells.
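A minimal pandas sketch of those two checks (sample-ID concordance, NaN-column removal):

```python
import pandas as pd

def preprocess(data: pd.DataFrame, labels: pd.DataFrame):
    # Keep only the sample IDs present in both frames (SampleID concordance).
    common = data.index.intersection(labels.index)
    data, labels = data.loc[common], labels.loc[common]
    # Drop any feature column containing a NaN in one or more cells.
    return data.dropna(axis=1), labels.dropna(axis=1)
```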

5. Correlation


Conducts correlations between the given data and label features based on the correlation method (Pearson or Spearman) and correlation coefficient threshold chosen by the user, and returns significantly (p_adjust < 0.05) correlated features. Users can select the p-value adjustment method in the config. Correlation plots are rendered in the report. Additionally, text files are written listing the significantly correlated features along with their correlation coefficients and the respective adjusted/corrected p-values.
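A sketch of this step using SciPy and statsmodels, assuming the 0.05 cutoff on adjusted p-values described above (parameter defaults are illustrative):

```python
from scipy.stats import pearsonr, spearmanr
from statsmodels.stats.multitest import multipletests

def correlate(data, labels, method="pearson", r_threshold=0.5, adjust="fdr_bh"):
    corr = pearsonr if method == "pearson" else spearmanr
    pairs, rs, ps = [], [], []
    for f in data.columns:            # every data-feature / label-feature pair
        for g in labels.columns:
            r, p = corr(data[f], labels[g])
            pairs.append((f, g)); rs.append(r); ps.append(p)
    # Adjust p-values across all tests, then apply both significance filters.
    _, p_adj, _, _ = multipletests(ps, method=adjust)
    return [(f, g, r, q) for (f, g), r, q in zip(pairs, rs, p_adj)
            if q < 0.05 and abs(r) >= r_threshold]
```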

6. Splitdata


Splits the data (and labels) into train and test sets and calls normalization on each of those sets, depending on the normalization method set by the user in the config. Returns normalized train and test data and labels.
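In scikit-learn terms this reduces to train_test_split plus the normalization helper sketched in the next section (test_size and random_state are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split

def splitdata(data, labels, test_size=0.2, norm="min-max"):
    X_tr, X_te, y_tr, y_te = train_test_split(
        data, labels, test_size=test_size, random_state=0)
    # Normalize train and test sets separately, per the user-chosen method
    # (normal_dataframe is sketched in the next section).
    return (normal_dataframe(X_tr, norm), normal_dataframe(X_te, norm),
            normal_dataframe(y_tr, norm), normal_dataframe(y_te, norm))
```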

7. Normal_dataframe


Normalizes the data and label dataframes (as called from within the splitdata function for the train and test sets) based on the data and label normalization methods defined by the user. Set the method to None or 'none' to skip normalization. Returns the normalized dataframe.
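A sketch supporting two common methods plus the documented skip behavior (the method tokens here are assumptions; ImaGene's config may use different names):

```python
def normal_dataframe(df, method="min-max"):
    if method in (None, "none"):
        return df                                   # normalization skipped
    if method == "min-max":
        return (df - df.min()) / (df.max() - df.min())
    if method == "z-score":
        return (df - df.mean()) / df.std()
    raise ValueError(f"unknown normalization method: {method}")
```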

8. BuildModel


Initializes the model based on whether gridsearch is set to False or True and executes training accordingly. Tests the model on the test data via the evaluate method, which returns the labels tested successfully at AUC > 0.9 and R-squared > 0.25 (more details on the evaluate method are in the Evaluate section below). The returned labels are then permuted randomly multiple times (currently n=20), and the AUCs and R-squared values are recorded to derive p-values for both metrics downstream. The permutation outputs appear both in the report and in individual text files listing the AUCs and R-squared values, respectively.
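A simplified sketch of that logic: model.score stands in for ImaGene's evaluate method, and the test labels are permuted to build the null distribution (the cv=5 and seed are assumptions; n_perm=20 mirrors the text):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV

def build_model(estimator, param_grid, X_tr, y_tr, X_te, y_te,
                gridsearch=False, n_perm=20):
    # Fit with or without an exhaustive hyperparameter grid search.
    if gridsearch:
        model = GridSearchCV(estimator, param_grid, cv=5).fit(X_tr, y_tr)
    else:
        model = estimator.fit(X_tr, y_tr)
    score = model.score(X_te, y_te)      # stand-in for the evaluate() call
    # Permute the test labels n_perm times to record a null distribution of
    # scores, from which empirical p-values are derived downstream.
    rng = np.random.default_rng(0)
    null = [model.score(X_te, rng.permutation(y_te)) for _ in range(n_perm)]
    p_value = float(np.mean([s >= score for s in null]))
    return model, score, p_value
```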

9. Evaluate


Evaluates the model's performance and reports a variety of metrics, such as AUC, R-squared, RMSE, Stdev, and the RMSE:Stdev ratio, for the predicted labels. It also produces a scatter plot of observed vs. expected values for each label, a bar plot of the RMSE:Stdev ratio per label, and a plot of AUC vs. decision threshold, obtained by binarizing the target variable (from continuous) at each decision threshold (e.g., normalized FPKM values 0.1-0.9) while leaving the predicted variable continuous for each label. From a biological perspective, this lets the user inspect the model's AUC across low-to-high gene-expression cut-offs and decipher the expression level at which the model can confidently predict the respective gene label from imaging features of the tumor region of interest (ROI). The AUCs and R-squared values for each such label are validated further through the random permutations described in the BuildModel section above, to aid the determination of their p-values. Finally, per user review, the gene labels with p-value = 0.0 for both AUC and R-squared are reported as results in the manuscript. The biological significance of these genes is then assessed by citing prior results from non-radiogenomic domains (such as wet-lab, pure bioinformatics, and systems-biology approaches), which further validates ImaGene's findings.
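The AUC-versus-threshold scan is the distinctive piece here; a minimal sketch follows (the 0.1-0.9 grid mirrors the example in the text):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_vs_threshold(y_true, y_pred, thresholds=np.arange(0.1, 1.0, 0.1)):
    # Binarize the observed (continuous) label at each cutoff while leaving
    # the model's predictions continuous, then compute the AUC per cutoff.
    aucs = {}
    for t in thresholds:
        y_bin = (np.asarray(y_true) >= t).astype(int)
        if 0 < y_bin.sum() < len(y_bin):     # both classes must be present
            aucs[round(float(t), 1)] = roc_auc_score(y_bin, y_pred)
    return aucs
```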

ImaGene aims to test as many cancer types as possible across multiple organizations around the globe, in order to reach a consensus on imaging->omic and omic->imaging predictions and thereby achieve the most non-invasive tumor diagnosis to date.

We welcome collaborations with labs everywhere to run experiments through ImaGene with us (with expert-guided support) using multimodal datasets such as imaging, genomics, proteomics, therapy outcomes, and patient survival, to boost and innovate treatment strategies for cancer patients worldwide.