From 90eb44754f56a78fd6bc2c0fc1ee6dc450487f14 Mon Sep 17 00:00:00 2001 From: amit-samal Date: Wed, 13 Nov 2024 15:58:53 +0530 Subject: [PATCH 1/2] updated tutorials --- .../differential_gene_expression/dge.ipynb | 2 +- tutorials/pipeline/scalr_pipeline.ipynb | 24 +- .../preprocessing/batch_correction.ipynb | 206 +----- tutorials/preprocessing/normalization.ipynb | 661 +++++++++++++++++- 4 files changed, 683 insertions(+), 210 deletions(-) diff --git a/tutorials/analysis/differential_gene_expression/dge.ipynb b/tutorials/analysis/differential_gene_expression/dge.ipynb index c6f90fd..65c1d47 100644 --- a/tutorials/analysis/differential_gene_expression/dge.ipynb +++ b/tutorials/analysis/differential_gene_expression/dge.ipynb @@ -165,7 +165,7 @@ }, "source": [ "## Downloading data\n", - "- Downloading an anndata from `cellxgene` and making a subset anndata with 1000 genes for the downstream analysis." + "- Downloading an anndata from `cellxgene`([Jin et al. (2021) iScience](https://doi.org/10.1016/j.isci.2021.103115)) and making a subset anndata with 1000 genes for the downstream analysis." ] }, { diff --git a/tutorials/pipeline/scalr_pipeline.ipynb b/tutorials/pipeline/scalr_pipeline.ipynb index 6826c0e..610c8ee 100644 --- a/tutorials/pipeline/scalr_pipeline.ipynb +++ b/tutorials/pipeline/scalr_pipeline.ipynb @@ -8,7 +8,14 @@ "source": [ "\n", "\n", - "# Single-cell analysis using Low Resource (scaLR)" + "# Single-cell analysis using Low Resource (scaLR)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cloning scaLR" ] }, { @@ -36,6 +43,13 @@ "!git clone https://github.com/infocusp/scaLR.git" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Install all requirements after cloning the repository, excluding packages that are pre-installed in Colab." + ] + }, { "cell_type": "code", "execution_count": null, @@ -58,8 +72,12 @@ }, "outputs": [], "source": [ - "# Install all requirements after cloning the repository, excluding the torch package since it is pre-installed in Colab.\n", - "!pip install $(grep -ivE \"torch\" scaLR/requirements.txt)" + "import sys\n", + "imported_packages = {pkg.split('.')[0] for pkg in sys.modules.keys()}\n", + "ignore_libraries = \"|\".join(imported_packages)\n", + "\n", + "!pip install $(grep -ivE \"$ignore_libraries\" scaLR/requirements.txt)\n", + "!pip install memory-profiler==0.61.0" ] }, { diff --git a/tutorials/preprocessing/batch_correction.ipynb b/tutorials/preprocessing/batch_correction.ipynb index 2b54596..eae5dd4 100644 --- a/tutorials/preprocessing/batch_correction.ipynb +++ b/tutorials/preprocessing/batch_correction.ipynb @@ -1,205 +1 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Batch Correction using scaLR" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Keypoints\n", - "\n", - "1. This notebook is designed as a tutorial for using batch correction using metadatalaoder from a scaLR library.\n", - "2. The dataloader is extensible to add any column from metadata as one hot-encoded vectors which can be useful for model training." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## What is the batch correction?\n", - "\n", - "- Single-cell genomic datasets related to a specific disease or trait are often compiled from multiple experiments, each sequenced from different single cells under varying conditions, such as capturing times, handling personnel, reagent lots, equipment, and even technology platforms.\n", - "- These differences lead to large variations also known as batch effects in the data. When combining these datasets for analysis and modeling, it's crucial to ensure that the model isn't biased towards data from certain batches due to their higher or lower value ranges. Therefore, it's necessary to eliminate batch effects from these datasets.\n", - "\n", - "\n", - "## How to perform batch correction?\n", - "\n", - "- Many statistical tools such as Scanpy, Seurat, Harmony, Combat, etc. performed batch correction by performing normalization and dimensional reduction and then removing the batch effect by fitting linear or mixed models, calculating the k-nearest neighbor or mutual nearest neighbors distance, canonical correlation analysis, etc. \n", - "\n", - "- While traditional batch corrections are robust and widely used, they have limitations in handling non-linear relationships, scalability, flexibility, and the preservation of biological signals, AI/ML-based batch correction methods offer significant advantages in these areas, making them a powerful alternative in complex and large-scale single-cell genomic datasets. \n", - "\n", - "- The batch correction approach in the scaLR platform is inspired by the scGPT tool, where batch information is integrated into the feature data to inform the model about the origin of each sample. Since batch is a categorical variable, directly including it as a label-encoded feature is not appropriate, as no batch is inherently superior to another. Instead, the solution is to use a one-hot encoded vector to represent batch information.\n", - "\n", - "- For example, if we have four batches in the dataset, the one-hot encoding would work as follows:\n", - " - Batch 1 -> 0 0 0 1\n", - " - Batch 2 -> 0 0 1 0\n", - " - Batch 3 -> 0 1 0 0\n", - " - Batch 4 -> 1 0 0 0\n", - "\n", - "- These encoded vectors represent the batches and are added to the feature data. In this case, four additional columns will be included in the feature data, ensuring that the model is aware of the batch information while training on samples from different batches.\n", - "\n", - "\n", - "## How is it implemented in the scaLR platform?\n", - "\n", - "- In the scaLR platform, we've implemented SimpleMetaDataLoader data loader that handles this process automatically.\n", - "- You can specify the metadata column you want to one-hot encode and add it to the feature data, and the data loader will take care of the encoding. \n", - "- We've also extended the functionality to allow users to include multiple columns from the metadata as one-hot encoded vectors in the feature data. Bypassing a list of columns, you can easily incorporate additional information.\n", - "- This approach is particularly useful in scenarios like predicting disease vs. non-disease outcomes, where certain metadata, such as cell type, might enhance the model's predictive power.\n", - "- By adding the cell type information to the feature data using this method, you can improve the model's performance.\n", - "- Generally, this won't be used as a library utility - it will be mostly used as a part of pipeline. 
Please find below a explaination a code snippet to understand its basics.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import sys\n", - "sys.path.append('/path/to/scaLR')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Required imports\n", - "import anndata\n", - "import numpy as np\n", - "import pandas as pd\n", - "\n", - "from scalr.nn.dataloader import build_dataloader, simple_metadataloader\n", - "\n", - "# Setting seed for reproducibility\n", - "np.random.seed(0)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Creating anndata object.\n", - "adata = anndata.AnnData(X=np.random.rand(15, 7))\n", - "adata.obs = pd.DataFrame.from_dict({\n", - " 'celltype': np.random.choice(['B', 'C', 'DC', 'T'], size=15),\n", - " 'batch': np.random.choice(['batch1', 'batch2'], size=15),\n", - " 'env': np.random.choice(['env1', 'env2', 'env3'], size=15)\n", - "})\n", - "adata.obs.index = adata.obs.index.astype('O')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'batch_size': int, 'target': str, 'mappings': dict, 'padding': int}" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Below are the required params for metadataloader. \n", - "simple_metadataloader.SimpleDataLoader.__init__.__annotations__" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Defining required parameters for metadataloader.\n", - "\n", - "# For batch correction you can pass the `batch` column inside `metadata_col`. 
\n", - "metadata_col = ['batch', ]\n", - "\n", - "# We need to use `SimpleMetaDataLoader` for doing batch correction.\n", - "dataloader_config = {\n", - " 'name': 'SimpleMetaDataLoader',\n", - " 'params': {\n", - " 'batch_size': 3,\n", - " 'metadata_col': metadata_col\n", - " }\n", - "}\n", - "\n", - "# Generating mappings for anndata obs columns.\n", - "mappings = {}\n", - "for column_name in adata.obs.columns:\n", - " mappings[column_name] = {}\n", - "\n", - " id2label = []\n", - " id2label += adata.obs[column_name].astype(\n", - " 'category').cat.categories.tolist()\n", - "\n", - " label2id = {id2label[i]: i for i in range(len(id2label))}\n", - " mappings[column_name]['id2label'] = id2label\n", - " mappings[column_name]['label2id'] = label2id" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Creating dataloader object.\n", - "dataloader, _ = build_dataloader(dataloader_config=dataloader_config,\n", - " adata=adata,\n", - " target='celltype',\n", - " mappings=mappings)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Features shape : torch.Size([3, 9])\n" - ] - } - ], - "source": [ - "# We can check if `batch` is added as a one hot-encoded vectors in features data.\n", - "# Initially features shape is (batch_size, 7) & there are 2 batches in data.\n", - "# So number of features after adding this column to features data should be 7+2=9.\n", - "# Hence features shape has to be (batch_size, 9) post doing batch correction.\n", - "\n", - "for feature, _ in dataloader:\n", - " print('Features shape :', feature.shape)\n", - " break" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "scalr_minerva", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.9.19" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} +{"cells":[{"cell_type":"markdown","metadata":{"id":"WZtTrD0KNRJy"},"source":["# Batch Correction using scaLR"]},{"cell_type":"markdown","metadata":{"id":"dEBM-AOGNRJ1"},"source":["Keypoints\n","\n","1. This notebook is designed as a tutorial for using batch correction using **`SimpleMetaDataLoader`** from **`scaLR`** library.\n","2. The dataloader is extensible to add any column from metadata as one hot-encoded vectors which can be useful for model training."]},{"cell_type":"markdown","metadata":{"id":"z90HlqMeNRJ2"},"source":["## What is the batch correction?\n","\n","- Single-cell genomic datasets related to a specific disease or trait are often compiled from multiple experiments, each sequenced from different single cells under varying conditions, such as capturing times, handling personnel, reagent lots, equipment, and even technology platforms.\n","- These differences lead to large variations also known as `batch effects` in the data. When combining these datasets for analysis and modeling, it's crucial to ensure that the model isn't biased towards data from certain batches due to their higher or lower value ranges. Therefore, it's necessary to eliminate `batch effects` from these datasets.\n","\n","\n","## How to perform batch correction?\n","\n","- Many statistical tools such as `Scanpy`, `Seurat`, `Harmony`, `Combat`, etc. 
perform `batch correction` by first applying `normalization` and `dimensionality reduction`, and then removing the `batch effect` by fitting `linear or mixed models`, calculating `k-nearest neighbor` or `mutual nearest neighbors` distances, performing `canonical correlation analysis`, etc.\n","\n","- While traditional `batch correction` methods are robust and widely used, they have limitations in handling non-linear relationships, scalability, flexibility, and the preservation of biological signals. AI/ML-based batch correction methods offer significant advantages in these areas, making them a powerful alternative for complex and large-scale single-cell genomic datasets.\n","\n","- The `batch correction` approach in the **`scaLR`** platform is inspired by the [scGPT](https://www.nature.com/articles/s41592-024-02201-0) (Cui et al.) tool, where batch information is integrated into the feature data to inform the model about the origin of each sample. Since batch is a `categorical` variable, directly including it as a `label-encoded` feature is not appropriate, as no batch is inherently superior to another. Instead, the solution is to use a `one-hot encoded vector` to represent batch information.\n","\n","- For example, if we have four batches in the dataset, the one-hot encoding would work as follows:\n","  - Batch 1 -> 0 0 0 1\n","  - Batch 2 -> 0 0 1 0\n","  - Batch 3 -> 0 1 0 0\n","  - Batch 4 -> 1 0 0 0\n","\n","- These encoded vectors represent the batches and are added to the feature data. In this case, four additional columns will be included in the feature data, ensuring that the model is aware of the batch information while training on samples from different batches.\n","\n","\n","## How is it implemented in the scaLR platform?\n","\n","- In the **`scaLR`** platform, we've implemented the **`SimpleMetaDataLoader`** data loader, which handles this process automatically.\n","- You can specify the `metadata` column you want to one-hot encode and add to the feature data, and the data loader will take care of the encoding.\n","- We've also extended the functionality to allow users to include multiple columns from the `metadata` as `one-hot encoded vectors` in the feature data. By passing a list of columns, you can easily incorporate additional information.\n","- This approach is particularly useful in scenarios like predicting `disease vs. non-disease` outcomes, where certain `metadata`, such as `cell type`, might enhance the model's predictive power.\n","- By adding the `cell type` information to the feature data using this method, you can improve the model's performance.\n","- Generally, this won't be used as a library utility - it will mostly be used as part of the pipeline. Please find below an explanation and a code snippet to understand the basics.\n"]},
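{"cell_type":"markdown","metadata":{},"source":["Before using the pipeline utilities, the next cell is a minimal, self-contained sketch (plain `numpy`/`pandas`, independent of scaLR) of what appending one-hot encoded batch information to a feature matrix looks like. The toy shapes and batch labels are illustrative assumptions, not scaLR internals."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# A minimal, self-contained sketch of the idea (not scaLR's implementation):\n","# one-hot encode a toy `batch` column and append it to a feature matrix.\n","import numpy as np\n","import pandas as pd\n","\n","features = np.random.rand(6, 4)  # 6 cells x 4 genes\n","batches = pd.Series(['batch1', 'batch2', 'batch1', 'batch2', 'batch1', 'batch2'])\n","\n","one_hot = pd.get_dummies(batches).to_numpy(dtype=float)  # shape: (6, 2)\n","features_with_batch = np.hstack([features, one_hot])  # shape: (6, 6)\n","print(features_with_batch.shape)"]},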
{"cell_type":"markdown","source":["## Cloning scaLR"],"metadata":{"id":"Lyk1yNRW9iMI"}},{"cell_type":"code","source":["!git clone https://github.com/infocusp/scaLR.git"],"metadata":{"id":"ufzA5OylNWsA","collapsed":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Library Installation and Imports\n","\n"],"metadata":{"id":"__FIQG1T9r-V"}},{"cell_type":"code","source":["!pip install anndata==0.10.9 pydeseq2==0.4.11 scanpy==1.10.3"],"metadata":{"collapsed":true,"id":"59UdOT0nRro6"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"nZwdNgQ7NRJ4"},"outputs":[],"source":["# Required imports.\n","import sys\n","sys.path.append('./scaLR')\n","\n","import anndata\n","import numpy as np\n","import pandas as pd\n","\n","from scalr.nn.dataloader import build_dataloader, simple_metadataloader\n","\n","# Setting seed for reproducibility\n","np.random.seed(0)"]},{"cell_type":"markdown","source":["## Downloading data\n","\n","For this tutorial, we will use two datasets from `cellxgene` ([Jin et al. (2021) iScience](https://doi.org/10.1016/j.isci.2021.103115)). The first dataset will serve as batch 1, and the second as batch 2."],"metadata":{"id":"RgQ7PSXy_dtY"}},{"cell_type":"code","source":["!wget -P ./data https://datasets.cellxgene.cziscience.com/16acb1d0-4108-4767-9615-0b42abe09992.h5ad\n","!wget -P ./data https://datasets.cellxgene.cziscience.com/8651e63c-0f98-4a87-bdbd-2da41bdf6de5.h5ad"],"metadata":{"id":"xYf5O0MbdcJ7"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Loading datasets and merging"],"metadata":{"id":"6wZhhGrwClT7"}},{"cell_type":"code","source":["data_1 = anndata.read_h5ad('/content/data/16acb1d0-4108-4767-9615-0b42abe09992.h5ad')\n","data_2 = anndata.read_h5ad('/content/data/8651e63c-0f98-4a87-bdbd-2da41bdf6de5.h5ad')\n","print(f'\\nDataset-1 has {data_1.n_obs} cells and {data_1.n_vars} genes\\nDataset-2 has {data_2.n_obs} cells and {data_2.n_vars} genes')"],"metadata":{"id":"u8fBz_7Zdc7g"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["print(f'Shape of \"obs\" before adding batch\\n\\nDataset-1 : {data_1.obs.shape}\\nDataset-2 : {data_2.obs.shape}')"],"metadata":{"id":"R1_oqEnRDBEu"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Adding batch information.\n","data_1.obs['batch'] = 'batch1'\n","data_2.obs['batch'] = 'batch2'\n","print(f'Shape of \"obs\" after adding batch\\n\\nDataset-1 : {data_1.obs.shape}\\nDataset-2 : {data_2.obs.shape}')"],"metadata":{"id":"5OZ99T9jfOMR"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Combining two datasets.\n","adata = anndata.concat([data_1, data_2])\n","print(f'Combined dataset has shape : {adata.shape}')"],"metadata":{"id":"g1urx3oLfcLY"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Batch correction"],"metadata":{"id":"ZJTpOrIXGcAK"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"cUMPGYJ-NRJ5"},"outputs":[],"source":["# Below are the required params for SimpleMetaDataLoader.\n","simple_metadataloader.SimpleMetaDataLoader.__init__.__annotations__"]},
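{"cell_type":"markdown","metadata":{},"source":["The next cell builds a `mappings` dictionary for every `obs` column, pairing an `id2label` list with its inverse `label2id` lookup. For the two batches defined above, the `batch` entry would look like the sketch below (shown for illustration; the actual dictionary is built by the loop):\n","```python\n","mappings['batch'] = {'id2label': ['batch1', 'batch2'],\n","                     'label2id': {'batch1': 0, 'batch2': 1}}\n","```"]},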
{"cell_type":"code","execution_count":null,"metadata":{"id":"kmCGPXeLNRJ6"},"outputs":[],"source":["# Defining required parameters for SimpleMetaDataLoader.\n","\n","# For batch correction you can pass the `batch` column inside `metadata_col`.\n","metadata_col = ['batch', ]\n","\n","# Generating mappings for anndata obs columns.\n","mappings = {}\n","for column_name in adata.obs.columns:\n","    mappings[column_name] = {}\n","\n","    id2label = []\n","    id2label += adata.obs[column_name].astype(\n","        'category').cat.categories.tolist()\n","\n","    label2id = {id2label[i]: i for i in range(len(id2label))}\n","    mappings[column_name]['id2label'] = id2label\n","    mappings[column_name]['label2id'] = label2id"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"BtTzBQ8oNRJ6"},"outputs":[],"source":["# Creating dataloader object.\n","dataloader = simple_metadataloader.SimpleMetaDataLoader(batch_size=3,\n","                                                        target='cell_type',\n","                                                        mappings=mappings,\n","                                                        metadata_col=metadata_col)"]},{"cell_type":"markdown","source":["We can check whether `batch` has been added as one-hot encoded vectors in the features data `(i.e. 23586 genes + 2 batches)`.\n","Initially, the features shape is `(batch_size, 23586)`, and there are 2 batches in the data.\n","So the number of features after adding this column to the features data should be `23586+2=23588`.\n","Hence, the features shape has to be `(batch_size, 23588)` after batch correction."],"metadata":{"id":"gzv3mvFYOl5S"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"1O2AyKFxNRJ6"},"outputs":[],"source":["# Verifying features in the dataloader\n","for feature, _ in dataloader(adata):\n","    print('Features shape :', feature.shape)\n","    print('Features :', feature)\n","    break"]},{"cell_type":"markdown","source":["We observe that the feature tensor has a shape of `[3, 23588]`, representing `3` samples with `23,588` features each. The batch information is appended to the gene expression values using one-hot encoding: in this case, each feature vector ends with the values `[1, 0]`, representing samples from `batch 1`."],"metadata":{"id":"OVeWJS8h_Ctu"}},{"cell_type":"code","source":["# Checking the one-hot encoding vector for each batch.\n","onehotencode_batches = dataloader.metadata_onehotencoder['batch'].transform(np.array(['batch1', 'batch2']).reshape(-1, 1))\n","onehotencode_batches.A"],"metadata":{"id":"WiQYdHjhzoIW"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["It has been verified that the one-hot encoding vector for batch 1 is `[1, 0]` and for batch 2 is `[0, 1]`."],"metadata":{"id":"7t9oD5BoBl1t"}}],"metadata":{"kernelspec":{"display_name":"scalr_minerva","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.9.19"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":0}
\ No newline at end of file
diff --git a/tutorials/preprocessing/normalization.ipynb b/tutorials/preprocessing/normalization.ipynb
index 5abc892..d511784 100644
--- a/tutorials/preprocessing/normalization.ipynb
+++ b/tutorials/preprocessing/normalization.ipynb
@@ -1 +1,660 @@
-{"cells":[{"cell_type":"markdown","id":"a7e5bdf5","metadata":{"id":"a7e5bdf5"},"source":["# Normalization using scaLR"]},{"cell_type":"markdown","id":"0b3cbafc-1741-4535-91c1-75185c5afc08","metadata":{"id":"0b3cbafc-1741-4535-91c1-75185c5afc08"},"source":["Keypoints\n","\n","1. This notebook is designed as a tutorial for using normalization from a scaLR library.\n","2. Also, we have compared results using standard library like sklearn, scanpy for normalization etc.\n","3. 
These packages are built so to handle very large data say lakhs of samples with low resource constraints, which standard libraries can't handle at once."]},{"cell_type":"markdown","id":"e6012f9b-e45a-4401-9349-0c1fa2a3c81a","metadata":{"id":"e6012f9b-e45a-4401-9349-0c1fa2a3c81a"},"source":["## Cloning scaLR"]},{"cell_type":"code","source":["!git clone https://github.com/infocusp/scaLR.git"],"metadata":{"id":"boKYqxU-QSYV"},"id":"boKYqxU-QSYV","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Library Installation and Import"],"metadata":{"id":"3866F5bHNiHC"},"id":"3866F5bHNiHC"},{"cell_type":"code","source":["!pip install anndata==0.10.9 scanpy==1.10.3 pydeseq2==0.4.11"],"metadata":{"collapsed":true,"id":"WT_923FIQu_H"},"id":"WT_923FIQu_H","execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"id":"de5a0712-dff6-4b01-a5cb-f248d944355e","metadata":{"id":"de5a0712-dff6-4b01-a5cb-f248d944355e"},"outputs":[],"source":["from copy import deepcopy\n","import sys\n","sys.path.append('scaLR')\n","\n","import anndata\n","import pandas as pd\n","import numpy as np\n","\n","# scalr library normalization modules.\n","from scalr.data.preprocess import standard_scale, sample_norm\n","from scalr.data_ingestion_pipeline import DataIngestionPipeline\n","from scalr.utils.file_utils import read_data, write_data, write_chunkwise_data\n","\n","# Scanpy library for sample-norm\n","import scanpy as sc\n","# Sklearn library for standard scaler object\n","from sklearn.preprocessing import StandardScaler\n","from os import path\n","\n","%reload_ext autoreload\n","%autoreload 2"]},{"cell_type":"markdown","source":["## Downloading data\n","- Downloading an anndata from `cellxgene`."],"metadata":{"id":"PXTHtcJwN6fk"},"id":"PXTHtcJwN6fk"},{"cell_type":"code","source":["# This shell will take approximately 00:00:24 (hh:mm:ss) to run.\n","!wget -P data https://datasets.cellxgene.cziscience.com/16acb1d0-4108-4767-9615-0b42abe09992.h5ad"],"metadata":{"id":"s3uaC_9bN8dk"},"id":"s3uaC_9bN8dk","execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Reading data\n","adata = anndata.read_h5ad('data/16acb1d0-4108-4767-9615-0b42abe09992.h5ad')\n","print(f\"\\nThe anndata has '{adata.n_obs}' cells and '{adata.n_vars}' genes\")"],"metadata":{"id":"GCWWSGSYU1sS"},"id":"GCWWSGSYU1sS","execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Verifying expression values of 1-10th gene in first 10 cells\n","adata.X[:10,:10].A"],"metadata":{"id":"sfoNuTBHVABc"},"id":"sfoNuTBHVABc","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["- In the current `AnnData` object, the gene expression data in `X` has already been normalized. 
Ideally, normalization should be applied only if the raw data is present in `X`.\n","- For this tutorial, we will create a new `AnnData` object using the raw gene expression values."],"metadata":{"id":"XsoN2jlLXlNe"},"id":"XsoN2jlLXlNe"},{"cell_type":"code","source":["# Checking for raw gene expression\n","print(f\"Raw expression data in anndata : {adata.raw is not None}\")"],"metadata":{"id":"DhupF7iacYuL"},"id":"DhupF7iacYuL","execution_count":null,"outputs":[]},{"cell_type":"code","source":["adata.raw.X[:10,:10].A"],"metadata":{"id":"mDVcW5TScnch"},"id":"mDVcW5TScnch","execution_count":null,"outputs":[]},{"cell_type":"code","source":["raw_adata = anndata.AnnData(X=adata.raw.X,var=adata.var,obs=adata.obs)\n","sc.write('/content/data/raw_adata.h5ad',raw_adata)"],"metadata":{"id":"ju4QSAYbX5Br"},"id":"ju4QSAYbX5Br","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":[],"metadata":{"id":"gzAvbXfeVONf"},"id":"gzAvbXfeVONf"},{"cell_type":"markdown","id":"35ef2c7b-e7ae-44f4-b5e2-429f515d8108","metadata":{"id":"35ef2c7b-e7ae-44f4-b5e2-429f515d8108"},"source":["## Data Generation\n","\n","- In this section, the downloaded anndata will be split into train, validation, and test sets.\n","- To accomplish this, we’ll implement the `generate_train_val_test_split` method in the `DataIngestionPipeline` of scaLR.\n","- We need the required parameters in data config in the form of a dictionary. For more information, please refer to the `DATA CONFIG` section in the [config.yaml](https://github.com/infocusp/scaLR/blob/main/config/config.yaml) file of scaLR.\n"]},{"cell_type":"code","source":["# Parameters of `DataIngestionPipeline`\n","data_config = {'sample_chunksize': 1000,\n"," 'train_val_test': {'full_datapath': '/content/data/raw_adata.h5ad',\n"," 'splitter_config': {'name': 'GroupSplitter',\n"," 'params': {'split_ratio': [7, 1, 2.5],'stratify': 'donor_id'}}},\n"," 'target': 'cell_type'}"],"metadata":{"id":"52RqqHCf5bo1"},"id":"52RqqHCf5bo1","execution_count":null,"outputs":[]},{"cell_type":"code","source":["data_config"],"metadata":{"id":"a5yZRuSR844S"},"id":"a5yZRuSR844S","execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Splitting data\n","data_split = DataIngestionPipeline(data_config=data_config,\n"," dirpath = './data')\n","data_split.generate_train_val_test_split()\n","# Data splits can be found at `./data/train_val_test_split`"],"metadata":{"id":"HvGS6Op85TDI"},"id":"HvGS6Op85TDI","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### Verifying `train`, `val`, and `test` data"],"metadata":{"id":"eYtJoocq7rcL"},"id":"eYtJoocq7rcL"},{"cell_type":"code","source":["datapath = path.join('./data', 'train_val_test_split')\n","\n","train_adata = read_data(path.join(datapath, 'train'))[:,:].to_adata()\n","val_adata = read_data(path.join(datapath, 'val'))[:,:].to_adata()\n","test_adata = read_data(path.join(datapath, 'test'))[:,:].to_adata()"],"metadata":{"id":"2Hjjtc5E7kzR"},"id":"2Hjjtc5E7kzR","execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Gene expression data for the first 10 cells and genes in `train data`.\n","train_adata.X[:10, :10].A"],"metadata":{"id":"goHDw7W-8xBu"},"id":"goHDw7W-8xBu","execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Gene expression data for the first 10 cells and genes in `val data`.\n","val_adata.X[:10, :10].A"],"metadata":{"id":"J-brkrpQ9zHH"},"id":"J-brkrpQ9zHH","execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Gene expression data for the first 10 cells 
and genes in `test data`.\n","test_adata.X[:10, :10].A"],"metadata":{"id":"MCeHgomX94f2"},"id":"MCeHgomX94f2","execution_count":null,"outputs":[]},{"cell_type":"markdown","id":"11b34e08-7022-4202-b993-31d33e25e47d","metadata":{"id":"11b34e08-7022-4202-b993-31d33e25e47d"},"source":["## Normalization\n","## 1. StandardScaler\n","This method used to normalize the data so that each gene has a mean of 0 and a standard deviation of 1. This standardization balances the data, reducing biases from genes with larger ranges or higher average expression, and improves the consistency of downstream analyses."]},{"cell_type":"markdown","id":"c6053926-c6be-4fac-a32a-5fc0fa2a6f9e","metadata":{"id":"c6053926-c6be-4fac-a32a-5fc0fa2a6f9e"},"source":["### scalr package - how to to use it?"]},{"cell_type":"code","execution_count":null,"id":"00377fe1-fb3e-4558-a492-d5561187a29f","metadata":{"id":"00377fe1-fb3e-4558-a492-d5561187a29f"},"outputs":[],"source":["# Creating object for standard scaling normalization.\n","scalr_std_scaler = standard_scale.StandardScaler(with_mean=False)\n","\n","print('\\n1. `fit()` function parameters :', scalr_std_scaler.fit.__annotations__)\n","print('\\n2. `transform()` function parameters :', scalr_std_scaler.transform.__annotations__)"]},{"cell_type":"code","execution_count":null,"id":"57e7ba69","metadata":{"id":"57e7ba69"},"outputs":[],"source":["# Datapath to store processed_data\n","processed_datapath = './processed_data_ss'"]},{"cell_type":"code","execution_count":null,"id":"92cf9094-9a2c-4f96-a6cf-66f211c4d1f4","metadata":{"scrolled":true,"id":"92cf9094-9a2c-4f96-a6cf-66f211c4d1f4"},"outputs":[],"source":["# Fitting object on train data.\n","## chunk size to process data in chunks - to extract required parameters from data. Enter value that can fit in your memory.\n","## It can be 2k, 3k , 5k, 10k etc...\n","sample_chunksize = 1000\n","scalr_std_scaler.fit(read_data(path.join(datapath, 'train')), sample_chunksize)\n","\n","# Transforming the test data using above created object & storing it at `preprocessed_datapath`.\n","scalr_std_scaler.process_data(read_data(path.join(datapath, 'test')),\n"," sample_chunksize,\n"," path.join(processed_datapath, 'test'))"]},{"cell_type":"code","execution_count":null,"id":"5d421f92","metadata":{"id":"5d421f92"},"outputs":[],"source":["# Reading transformed test data\n","test_adata_pipeline = read_data(path.join(processed_datapath, 'test'))\n","test_adata_pipeline[:, :].X[:10, :10]"]},{"cell_type":"markdown","id":"36566aec-54ce-4c54-80c9-86a557418f29","metadata":{"id":"36566aec-54ce-4c54-80c9-86a557418f29"},"source":["### sklearn package for standardscaling\n","- Developers can ignore this section"]},{"cell_type":"code","execution_count":null,"id":"a54c31bd-9de2-4772-ba68-405605d65cf4","metadata":{"id":"a54c31bd-9de2-4772-ba68-405605d65cf4"},"outputs":[],"source":["# Standard scaling using sklearn package\n","sklearn_std_scaler = StandardScaler(with_mean=False)\n","sklearn_std_scaler.fit(train_adata.X.A)\n","test_adata_sklearn = sklearn_std_scaler.transform(test_adata.X.A)\n","test_adata_sklearn[:10, :10]"]},{"cell_type":"markdown","id":"5a592691-0f87-4eb8-aee2-0af44623871e","metadata":{"id":"5a592691-0f87-4eb8-aee2-0af44623871e"},"source":["### Comparing scalr library results with sklearn's library results"]},{"cell_type":"code","execution_count":null,"id":"3b269d5f-976e-4d99-8ae5-6f822fa7c68d","metadata":{"id":"3b269d5f-976e-4d99-8ae5-6f822fa7c68d"},"outputs":[],"source":["# Checking if error is less than 1e-9\n","assert 
sum(\n","abs(scalr_std_scaler.train_mean[0] -\n"," sklearn_std_scaler.mean_).flatten() < 1e-9\n",") == train_adata.shape[1], \"Train data mean is not correctly calculated...\"\n","\n","assert sum(\n","abs(scalr_std_scaler.train_std[0] - sklearn_std_scaler.scale_).flatten() <\n","1e-9) == train_adata.shape[\n"," 1], \"Train data standard deviation is not correctly calculated...\""]},{"cell_type":"markdown","id":"c053b40b-d170-4666-b6ec-b6e1415f2fbb","metadata":{"id":"c053b40b-d170-4666-b6ec-b6e1415f2fbb"},"source":["## 2. SampleNorm\n","- In scRNA-seq, each cell may have a different sequencing depth, resulting in some cells having higher total counts (or reads) than others. Normalizing each cell by its total gene count using `SampleNorm` addresses this variability, ensuring consistent expression levels across the dataset and enabling reliable cell-to-cell comparisons.\n","\n","- After normalization, the default sum of gene expression in each cell becomes one. This can be adjusted by specifying a different total using the `scaling_factor` parameter, as in `sample_norm.SampleNorm(scaling_factor='intended sum value')`."]},{"cell_type":"markdown","id":"2def9f30-1d9d-4167-9b18-bb2012b2c6b7","metadata":{"id":"2def9f30-1d9d-4167-9b18-bb2012b2c6b7"},"source":["### scalr package - how to to use it?"]},{"cell_type":"code","execution_count":null,"id":"ebaf5460-f7cd-4cbb-a3f8-502c5fb73861","metadata":{"id":"ebaf5460-f7cd-4cbb-a3f8-502c5fb73861"},"outputs":[],"source":["# Sample norm using pipeline\n","scalr_sample_norm = sample_norm.SampleNorm()\n","\n","print('\\n1. `transform()` function parameters :', scalr_sample_norm.transform.__annotations__)"]},{"cell_type":"code","execution_count":null,"id":"44abc1dd","metadata":{"id":"44abc1dd"},"outputs":[],"source":["# Datapath to store processed_data\n","processed_datapath = './processed_data_sn'"]},{"cell_type":"code","execution_count":null,"id":"18f0b716-ff8d-4ecb-9591-195217f9ac27","metadata":{"id":"18f0b716-ff8d-4ecb-9591-195217f9ac27"},"outputs":[],"source":["# Fitting is not required on train data for sample-norm.\n","sample_chunksize = 1000\n","\n","# Transforming on test data.\n","scalr_sample_norm.process_data(read_data(path.join(datapath, 'test')),\n"," sample_chunksize,\n"," path.join(processed_datapath, 'test'))"]},{"cell_type":"code","execution_count":null,"id":"161952ae","metadata":{"id":"161952ae"},"outputs":[],"source":["# Reading transformed test data\n","test_data_sample_norm_pipeline = read_data(path.join(processed_datapath, 'test'))"]},{"cell_type":"markdown","id":"d5b86a86-7e97-4a40-85e3-6558c082c8f2","metadata":{"id":"d5b86a86-7e97-4a40-85e3-6558c082c8f2"},"source":["### Scanpy package for sample-norm\n","- Developers can ignore this section"]},{"cell_type":"code","execution_count":null,"id":"b795b67e","metadata":{"id":"b795b67e"},"outputs":[],"source":["test_adata = read_data(path.join(datapath, 'test'), backed=None)\n","test_adata = test_adata[:, :].to_adata()\n","test_adata"]},{"cell_type":"code","execution_count":null,"id":"0d6bed81-105b-4f66-9c87-ca1ae5f60412","metadata":{"id":"0d6bed81-105b-4f66-9c87-ca1ae5f60412"},"outputs":[],"source":["# Sample norm using scanpy package\n","test_data_sample_norm_sc = sc.pp.normalize_total(test_adata, target_sum=1, inplace=False)\n","test_data_sample_norm_sc['X'][:10, :10].A"]},{"cell_type":"markdown","id":"1ec3be1c-0031-49dc-8783-e1c769baf24d","metadata":{"id":"1ec3be1c-0031-49dc-8783-e1c769baf24d"},"source":["### Comparing scalr library results with scanpy library 
results"]},{"cell_type":"code","execution_count":null,"id":"ea3e91e9-50c8-49d5-a916-8620975e01d7","metadata":{"id":"ea3e91e9-50c8-49d5-a916-8620975e01d7"},"outputs":[],"source":["# Checking if error is less than 1e-15\n","(abs(test_data_sample_norm_sc['X'] - test_data_sample_norm_pipeline[:, :].X) < 1e-15)[:10, :10]"]},{"cell_type":"code","execution_count":null,"id":"19387599","metadata":{"id":"19387599"},"outputs":[],"source":[]}],"metadata":{"kernelspec":{"display_name":"scalr_minerva","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.9.19"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":5} \ No newline at end of file +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a7e5bdf5", + "metadata": { + "id": "a7e5bdf5" + }, + "source": [ + "# Normalization using scaLR" + ] + }, + { + "cell_type": "markdown", + "id": "0b3cbafc-1741-4535-91c1-75185c5afc08", + "metadata": { + "id": "0b3cbafc-1741-4535-91c1-75185c5afc08" + }, + "source": [ + "Keypoints\n", + "\n", + "1. This notebook is designed as a tutorial for using normalization from a scaLR library.\n", + "2. Also, we have compared results using standard library like sklearn, scanpy for normalization etc.\n", + "3. These packages are built so to handle very large data say lakhs of samples with low resource constraints, which standard libraries can't handle at once." + ] + }, + { + "cell_type": "markdown", + "id": "e6012f9b-e45a-4401-9349-0c1fa2a3c81a", + "metadata": { + "id": "e6012f9b-e45a-4401-9349-0c1fa2a3c81a" + }, + "source": [ + "## Cloning scaLR" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "boKYqxU-QSYV", + "metadata": { + "id": "boKYqxU-QSYV" + }, + "outputs": [], + "source": [ + "!git clone https://github.com/infocusp/scaLR.git" + ] + }, + { + "cell_type": "markdown", + "id": "3866F5bHNiHC", + "metadata": { + "id": "3866F5bHNiHC" + }, + "source": [ + "## Library Installation and Import" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "WT_923FIQu_H", + "metadata": { + "collapsed": true, + "id": "WT_923FIQu_H" + }, + "outputs": [], + "source": [ + "!pip install anndata==0.10.9 scanpy==1.10.3 pydeseq2==0.4.11" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "de5a0712-dff6-4b01-a5cb-f248d944355e", + "metadata": { + "id": "de5a0712-dff6-4b01-a5cb-f248d944355e" + }, + "outputs": [], + "source": [ + "from copy import deepcopy\n", + "import sys\n", + "sys.path.append('scaLR')\n", + "\n", + "import anndata\n", + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "# scalr library normalization modules.\n", + "from scalr.data.preprocess import standard_scale, sample_norm\n", + "from scalr.data_ingestion_pipeline import DataIngestionPipeline\n", + "from scalr.utils.file_utils import read_data, write_data, write_chunkwise_data\n", + "\n", + "# Scanpy library for sample-norm\n", + "import scanpy as sc\n", + "# Sklearn library for standard scaler object\n", + "from sklearn.preprocessing import StandardScaler\n", + "from os import path\n", + "\n", + "%reload_ext autoreload\n", + "%autoreload 2" + ] + }, + { + "cell_type": "markdown", + "id": "PXTHtcJwN6fk", + "metadata": { + "id": "PXTHtcJwN6fk" + }, + "source": [ + "## Downloading data\n", + "- Downloading an anndata from `cellxgene`([Jin et al. (2021) iScience](https://doi.org/10.1016/j.isci.2021.103115))." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "s3uaC_9bN8dk",
+ "metadata": {
+ "id": "s3uaC_9bN8dk"
+ },
+ "outputs": [],
+ "source": [
+ "# This cell will take approximately 00:00:24 (hh:mm:ss) to run.\n",
+ "!wget -P data https://datasets.cellxgene.cziscience.com/16acb1d0-4108-4767-9615-0b42abe09992.h5ad"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "GCWWSGSYU1sS",
+ "metadata": {
+ "id": "GCWWSGSYU1sS"
+ },
+ "outputs": [],
+ "source": [
+ "# Reading data\n",
+ "adata = anndata.read_h5ad('data/16acb1d0-4108-4767-9615-0b42abe09992.h5ad')\n",
+ "print(f\"\\nThe anndata has '{adata.n_obs}' cells and '{adata.n_vars}' genes\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "sfoNuTBHVABc",
+ "metadata": {
+ "id": "sfoNuTBHVABc"
+ },
+ "outputs": [],
+ "source": [
+ "# Verifying expression values of the first 10 genes in the first 10 cells\n",
+ "adata.X[:10,:10].A"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "XsoN2jlLXlNe",
+ "metadata": {
+ "id": "XsoN2jlLXlNe"
+ },
+ "source": [
+ "- In the current `AnnData` object, the gene expression data in `X` has already been normalized. Ideally, normalization should be applied only if the raw data is present in `X`.\n",
+ "- For this tutorial, we will create a new `AnnData` object using the raw gene expression values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "DhupF7iacYuL",
+ "metadata": {
+ "id": "DhupF7iacYuL"
+ },
+ "outputs": [],
+ "source": [
+ "# Checking for raw gene expression\n",
+ "print(f\"Raw expression data in anndata : {adata.raw is not None}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "mDVcW5TScnch",
+ "metadata": {
+ "id": "mDVcW5TScnch"
+ },
+ "outputs": [],
+ "source": [
+ "adata.raw.X[:10,:10].A"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ju4QSAYbX5Br",
+ "metadata": {
+ "id": "ju4QSAYbX5Br"
+ },
+ "outputs": [],
+ "source": [
+ "raw_adata = anndata.AnnData(X=adata.raw.X, var=adata.var, obs=adata.obs)\n",
+ "sc.write('/content/data/raw_adata.h5ad', raw_adata)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "35ef2c7b-e7ae-44f4-b5e2-429f515d8108",
+ "metadata": {
+ "id": "35ef2c7b-e7ae-44f4-b5e2-429f515d8108"
+ },
+ "source": [
+ "## Data Generation\n",
+ "\n",
+ "- In this section, the downloaded anndata will be split into train, validation, and test sets.\n",
+ "- To accomplish this, we'll use the `generate_train_val_test_split` method of scaLR's `DataIngestionPipeline`.\n",
+ "- We need to provide the required parameters as a data config dictionary (see the group-aware splitting sketch below). For more information, please refer to the `DATA CONFIG` section in the [config.yaml](https://github.com/infocusp/scaLR/blob/main/config/config.yaml) file of scaLR.\n"
+ ]
+ },
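+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Before running the pipeline, the next cell is a minimal, self-contained sketch of why a group-aware split matters: all cells from one donor should land on the same side of a split, so no donor leaks across train/val/test. It uses plain scikit-learn's `GroupShuffleSplit` on toy donor labels as a rough analogue of what stratifying scaLR's `GroupSplitter` by `donor_id` achieves; the donor names and split sizes are illustrative assumptions, not values from this dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustration only: a group-aware split on toy donors (not scaLR's GroupSplitter).\n",
+ "import numpy as np\n",
+ "from sklearn.model_selection import GroupShuffleSplit\n",
+ "\n",
+ "donors = np.repeat(['donor1', 'donor2', 'donor3', 'donor4'], 5)  # 20 toy cells\n",
+ "cells = np.arange(len(donors))\n",
+ "\n",
+ "splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)\n",
+ "train_idx, test_idx = next(splitter.split(cells, groups=donors))\n",
+ "\n",
+ "# Each donor's cells end up entirely on one side of the split.\n",
+ "print('Train donors :', sorted(set(donors[train_idx])))\n",
+ "print('Test donors  :', sorted(set(donors[test_idx])))"
+ ]
+ },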
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "52RqqHCf5bo1",
+ "metadata": {
+ "id": "52RqqHCf5bo1"
+ },
+ "outputs": [],
+ "source": [
+ "# Parameters of `DataIngestionPipeline`\n",
+ "data_config = {'sample_chunksize': 1000,\n",
+ "               'train_val_test': {'full_datapath': '/content/data/raw_adata.h5ad',\n",
+ "                                  'splitter_config': {'name': 'GroupSplitter',\n",
+ "                                                      'params': {'split_ratio': [7, 1, 2.5], 'stratify': 'donor_id'}}},\n",
+ "               'target': 'cell_type'}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a5yZRuSR844S",
+ "metadata": {
+ "id": "a5yZRuSR844S"
+ },
+ "outputs": [],
+ "source": [
+ "data_config"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "HvGS6Op85TDI",
+ "metadata": {
+ "id": "HvGS6Op85TDI"
+ },
+ "outputs": [],
+ "source": [
+ "# Splitting data\n",
+ "data_split = DataIngestionPipeline(data_config=data_config,\n",
+ "                                   dirpath='./data')\n",
+ "data_split.generate_train_val_test_split()\n",
+ "# Data splits can be found at `./data/train_val_test_split`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "eYtJoocq7rcL",
+ "metadata": {
+ "id": "eYtJoocq7rcL"
+ },
+ "source": [
+ "### Verifying `train`, `val`, and `test` data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2Hjjtc5E7kzR",
+ "metadata": {
+ "id": "2Hjjtc5E7kzR"
+ },
+ "outputs": [],
+ "source": [
+ "datapath = path.join('./data', 'train_val_test_split')\n",
+ "\n",
+ "train_adata = read_data(path.join(datapath, 'train'))[:,:].to_adata()\n",
+ "val_adata = read_data(path.join(datapath, 'val'))[:,:].to_adata()\n",
+ "test_adata = read_data(path.join(datapath, 'test'))[:,:].to_adata()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "goHDw7W-8xBu",
+ "metadata": {
+ "id": "goHDw7W-8xBu"
+ },
+ "outputs": [],
+ "source": [
+ "# Gene expression data for the first 10 cells and genes in `train data`.\n",
+ "train_adata.X[:10, :10].A"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "J-brkrpQ9zHH",
+ "metadata": {
+ "id": "J-brkrpQ9zHH"
+ },
+ "outputs": [],
+ "source": [
+ "# Gene expression data for the first 10 cells and genes in `val data`.\n",
+ "val_adata.X[:10, :10].A"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "MCeHgomX94f2",
+ "metadata": {
+ "id": "MCeHgomX94f2"
+ },
+ "outputs": [],
+ "source": [
+ "# Gene expression data for the first 10 cells and genes in `test data`.\n",
+ "test_adata.X[:10, :10].A"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "11b34e08-7022-4202-b993-31d33e25e47d",
+ "metadata": {
+ "id": "11b34e08-7022-4202-b993-31d33e25e47d"
+ },
+ "source": [
+ "## Normalization\n",
+ "## 1. StandardScaler\n",
+ "This method is used to normalize the data so that each gene has a mean of 0 and a standard deviation of 1. This standardization balances the data, reducing biases from genes with larger ranges or higher average expression, and improves the consistency of downstream analyses. The next cell sketches how these statistics can be computed chunk by chunk."
+ ]
+ },
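+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The keypoints above mention handling very large datasets under low resource constraints. The cell below is a minimal `numpy` sketch of the underlying idea: the per-gene mean and standard deviation can be accumulated chunk by chunk instead of loading the full matrix at once. The toy matrix and chunk size are illustrative assumptions; scaLR's `StandardScaler.fit` handles this internally."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Minimal sketch (illustration only): per-gene mean/std accumulated in chunks.\n",
+ "import numpy as np\n",
+ "\n",
+ "X = np.random.rand(10_000, 50)       # toy 'cells x genes' matrix\n",
+ "chunk = 1000                         # rows processed per step\n",
+ "n, s, ss = 0, 0.0, 0.0               # running count, sum, sum of squares\n",
+ "\n",
+ "for start in range(0, X.shape[0], chunk):\n",
+ "    part = X[start:start + chunk]\n",
+ "    n += part.shape[0]\n",
+ "    s += part.sum(axis=0)\n",
+ "    ss += (part ** 2).sum(axis=0)\n",
+ "\n",
+ "mean = s / n\n",
+ "std = np.sqrt(ss / n - mean ** 2)    # population std, matching the scalers\n",
+ "\n",
+ "print(np.allclose(mean, X.mean(axis=0)), np.allclose(std, X.std(axis=0)))"
+ ]
+ },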
+ {
+ "cell_type": "markdown",
+ "id": "c6053926-c6be-4fac-a32a-5fc0fa2a6f9e",
+ "metadata": {
+ "id": "c6053926-c6be-4fac-a32a-5fc0fa2a6f9e"
+ },
+ "source": [
+ "### scalr package - how to use it?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "00377fe1-fb3e-4558-a492-d5561187a29f",
+ "metadata": {
+ "id": "00377fe1-fb3e-4558-a492-d5561187a29f"
+ },
+ "outputs": [],
+ "source": [
+ "# Creating object for standard scaling normalization.\n",
+ "scalr_std_scaler = standard_scale.StandardScaler(with_mean=False)\n",
+ "\n",
+ "print('\\n1. `fit()` function parameters :', scalr_std_scaler.fit.__annotations__)\n",
+ "print('\\n2. `transform()` function parameters :', scalr_std_scaler.transform.__annotations__)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "57e7ba69",
+ "metadata": {
+ "id": "57e7ba69"
+ },
+ "outputs": [],
+ "source": [
+ "# Datapath to store processed_data\n",
+ "processed_datapath = './processed_data_ss'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "92cf9094-9a2c-4f96-a6cf-66f211c4d1f4",
+ "metadata": {
+ "id": "92cf9094-9a2c-4f96-a6cf-66f211c4d1f4",
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "# Fitting object on train data.\n",
+ "## Chunk size used to process the data in chunks when extracting the required parameters.\n",
+ "## Choose a value that fits in your memory, e.g. 2k, 3k, 5k, 10k, etc.\n",
+ "sample_chunksize = 1000\n",
+ "scalr_std_scaler.fit(read_data(path.join(datapath, 'train')), sample_chunksize)\n",
+ "\n",
+ "# Transforming the test data using the above object & storing it at `processed_datapath`.\n",
+ "scalr_std_scaler.process_data(read_data(path.join(datapath, 'test')),\n",
+ "                              sample_chunksize,\n",
+ "                              path.join(processed_datapath, 'test'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5d421f92",
+ "metadata": {
+ "id": "5d421f92"
+ },
+ "outputs": [],
+ "source": [
+ "# Reading transformed test data\n",
+ "test_adata_pipeline = read_data(path.join(processed_datapath, 'test'))\n",
+ "test_adata_pipeline[:, :].X[:10, :10]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "36566aec-54ce-4c54-80c9-86a557418f29",
+ "metadata": {
+ "id": "36566aec-54ce-4c54-80c9-86a557418f29"
+ },
+ "source": [
+ "### sklearn package for standard scaling\n",
+ "- Developers can ignore this section"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a54c31bd-9de2-4772-ba68-405605d65cf4",
+ "metadata": {
+ "id": "a54c31bd-9de2-4772-ba68-405605d65cf4"
+ },
+ "outputs": [],
+ "source": [
+ "# Standard scaling using sklearn package\n",
+ "sklearn_std_scaler = StandardScaler(with_mean=False)\n",
+ "sklearn_std_scaler.fit(train_adata.X.A)\n",
+ "test_adata_sklearn = sklearn_std_scaler.transform(test_adata.X.A)\n",
+ "test_adata_sklearn[:10, :10]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5a592691-0f87-4eb8-aee2-0af44623871e",
+ "metadata": {
+ "id": "5a592691-0f87-4eb8-aee2-0af44623871e"
+ },
+ "source": [
+ "### Comparing scalr library results with sklearn library results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3b269d5f-976e-4d99-8ae5-6f822fa7c68d",
+ "metadata": {
+ "id": "3b269d5f-976e-4d99-8ae5-6f822fa7c68d"
+ },
+ "outputs": [],
+ "source": [
+ "# Checking if error is less than 1e-9\n",
+ "assert sum(\n",
+ "    abs(scalr_std_scaler.train_mean[0] - sklearn_std_scaler.mean_).flatten() < 1e-9\n",
+ ") == train_adata.shape[1], \"Train data mean is not correctly calculated...\"\n",
+ "\n",
+ "assert sum(\n",
+ "    abs(scalr_std_scaler.train_std[0] - sklearn_std_scaler.scale_).flatten() < 1e-9\n",
+ ") == train_adata.shape[1], \"Train data standard deviation is not correctly calculated...\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c053b40b-d170-4666-b6ec-b6e1415f2fbb",
+ "metadata": {
+ "id": "c053b40b-d170-4666-b6ec-b6e1415f2fbb"
+ },
+ "source": [
+ "## 2. SampleNorm\n",
+ "- In scRNA-seq, each cell may have a different sequencing depth, resulting in some cells having higher total counts (or reads) than others. Normalizing each cell by its total gene count using `SampleNorm` addresses this variability, ensuring consistent expression levels across the dataset and enabling reliable cell-to-cell comparisons.\n",
+ "\n",
+ "- After normalization, the sum of gene expression in each cell becomes one by default. This can be adjusted by specifying a different total using the `scaling_factor` parameter, as in `sample_norm.SampleNorm(scaling_factor=<intended sum value>)`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2def9f30-1d9d-4167-9b18-bb2012b2c6b7",
+ "metadata": {
+ "id": "2def9f30-1d9d-4167-9b18-bb2012b2c6b7"
+ },
+ "source": [
+ "### scalr package - how to use it?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ebaf5460-f7cd-4cbb-a3f8-502c5fb73861",
+ "metadata": {
+ "id": "ebaf5460-f7cd-4cbb-a3f8-502c5fb73861"
+ },
+ "outputs": [],
+ "source": [
+ "# Sample norm using pipeline\n",
+ "scalr_sample_norm = sample_norm.SampleNorm()\n",
+ "\n",
+ "print('\\n1. `transform()` function parameters :', scalr_sample_norm.transform.__annotations__)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "44abc1dd",
+ "metadata": {
+ "id": "44abc1dd"
+ },
+ "outputs": [],
+ "source": [
+ "# Datapath to store processed_data\n",
+ "processed_datapath = './processed_data_sn'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18f0b716-ff8d-4ecb-9591-195217f9ac27",
+ "metadata": {
+ "id": "18f0b716-ff8d-4ecb-9591-195217f9ac27"
+ },
+ "outputs": [],
+ "source": [
+ "# Fitting is not required on train data for sample-norm.\n",
+ "sample_chunksize = 1000\n",
+ "\n",
+ "# Transforming on test data.\n",
+ "scalr_sample_norm.process_data(read_data(path.join(datapath, 'test')),\n",
+ "                               sample_chunksize,\n",
+ "                               path.join(processed_datapath, 'test'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "161952ae",
+ "metadata": {
+ "id": "161952ae"
+ },
+ "outputs": [],
+ "source": [
+ "# Reading transformed test data\n",
+ "test_data_sample_norm_pipeline = read_data(path.join(processed_datapath, 'test'))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d5b86a86-7e97-4a40-85e3-6558c082c8f2",
+ "metadata": {
+ "id": "d5b86a86-7e97-4a40-85e3-6558c082c8f2"
+ },
+ "source": [
+ "### Scanpy package for sample-norm\n",
+ "- Developers can ignore this section"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b795b67e",
+ "metadata": {
+ "id": "b795b67e"
+ },
+ "outputs": [],
+ "source": [
+ "test_adata = read_data(path.join(datapath, 'test'), backed=None)\n",
+ "test_adata = test_adata[:, :].to_adata()\n",
+ "test_adata"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0d6bed81-105b-4f66-9c87-ca1ae5f60412",
+ "metadata": {
+ "id": "0d6bed81-105b-4f66-9c87-ca1ae5f60412"
+ },
+ "outputs": [],
+ "source": [
+ "# Sample norm using scanpy package\n",
+ "test_data_sample_norm_sc = sc.pp.normalize_total(test_adata, target_sum=1, inplace=False)\n",
+ "test_data_sample_norm_sc['X'][:10, :10].A"
+ ]
+ },
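+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For intuition, sample-norm boils down to dividing each cell's counts by their total and multiplying by the target sum. The cell below is a tiny hand-rolled check on a toy dense matrix (illustrative values only, independent of both scaLR and Scanpy)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Toy illustration of sample-norm: scale each row to sum to a target total.\n",
+ "import numpy as np\n",
+ "\n",
+ "counts = np.array([[2., 2., 6.],\n",
+ "                   [1., 0., 3.]])  # 2 cells x 3 genes\n",
+ "target_sum = 1.0\n",
+ "\n",
+ "normed = target_sum * counts / counts.sum(axis=1, keepdims=True)\n",
+ "print(normed)\n",
+ "print(normed.sum(axis=1))  # each row now sums to 1.0"
+ ]
+ },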
+ {
+ "cell_type": "markdown",
+ "id": "1ec3be1c-0031-49dc-8783-e1c769baf24d",
+ "metadata": {
+ "id": "1ec3be1c-0031-49dc-8783-e1c769baf24d"
+ },
+ "source": [
+ "### Comparing scalr library results with scanpy library results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ea3e91e9-50c8-49d5-a916-8620975e01d7",
+ "metadata": {
+ "id": "ea3e91e9-50c8-49d5-a916-8620975e01d7"
+ },
+ "outputs": [],
+ "source": [
+ "# Checking if error is less than 1e-15\n",
+ "(abs(test_data_sample_norm_sc['X'] - test_data_sample_norm_pipeline[:, :].X) < 1e-15)[:10, :10]"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "scalr_minerva",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.19"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

From e4bbf93df4ba02ea2df17904aef6c3e07d1ed597 Mon Sep 17 00:00:00 2001
From: amit-samal
Date: Wed, 13 Nov 2024 16:46:52 +0530
Subject: [PATCH 2/2] updated README

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index eaf6dd0..f99c7ea 100644
--- a/README.md
+++ b/README.md
@@ -72,7 +72,7 @@ Detailed tutorials have been made on how to use some functionalities as a scaLR
 - **Differential gene expression analysis** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/differential_gene_expression/dge.ipynb)
 - **Gene recall curve** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/gene_recall_curve/gene_recall_curve.ipynb)
 - **Normalization** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/preprocessing/normalization.ipynb)
-- [Batch correction](https://github.com/infocusp/scaLR/blob/main/tutorials/preprocessing/batch_correction.ipynb)
+- **Batch correction** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/preprocessing/batch_correction.ipynb)
- [SHAP analysis](https://github.com/infocusp/scaLR/blob/main/tutorials/analysis/shap_analysis/shap_heatmap.ipynb)

## Experiment Output Structure