From 611b4f3eeba222b5234dd02792dbd56af904b9cf Mon Sep 17 00:00:00 2001
From: Anthony Christidis <anthony.christidis@stat.ubc.ca>
Date: Sat, 7 Sep 2024 16:24:45 -0400
Subject: [PATCH] Add details about datasets in vignette and pkgdown website.

---
 _pkgdown.yml                             |  7 ++++++-
 inst/script/ReferenceQueryData.R         |  1 -
 vignettes/AnnotationAnomalies.Rmd        |  2 +-
 vignettes/CellDistancesDiagnostics.Rmd   |  2 +-
 vignettes/DatasetMarkerGeneAlignment.Rmd |  4 ++--
 vignettes/StatisticalMeasures.Rmd        | 12 ++++++------
 vignettes/scDiagnostics.Rmd              | 12 ++++++++----
 7 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/_pkgdown.yml b/_pkgdown.yml
index b04b392..53ce521 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -81,7 +81,12 @@ reference:
   - title: Misc
     contents:
       - projectPCA 
-      - calculateCategorizationEntropy     
+      - calculateCategorizationEntropy
+  - title: Datasets
+    contents:
+      - reference_data
+      - query_data
+      - qc_data
 
 right:
   - icon: fa-github
diff --git a/inst/script/ReferenceQueryData.R b/inst/script/ReferenceQueryData.R
index 5f80672..1295d59 100644
--- a/inst/script/ReferenceQueryData.R
+++ b/inst/script/ReferenceQueryData.R
@@ -10,7 +10,6 @@ set.seed(100)
 indices <- sample(ncol(SummarizedExperiment::assay(sce)), 
                   size = floor(0.8 * ncol(SummarizedExperiment::assay(sce))), 
                   replace = FALSE)
-ref_indices <- sample(indices, 0.7*length(indices))
 reference_data <- sce[, sample(indices, 1500)]
 query_data <- sce[, -indices]
 
diff --git a/vignettes/AnnotationAnomalies.Rmd b/vignettes/AnnotationAnomalies.Rmd
index 1359082..1fbf797 100644
--- a/vignettes/AnnotationAnomalies.Rmd
+++ b/vignettes/AnnotationAnomalies.Rmd
@@ -99,7 +99,7 @@ The function also provides detailed visualizations and statistical outputs to he
 
 ### Parameters
 
-The function takes a `SingleCellExperiment` object as `reference_data` and trains an isolation forest model on the reference PCA-projected data, with an optional `query_data` for projecting onto this PCA space for anomaly detection. You can specify cell type annotations through `ref_cell_type_col` and `query_cell_type_col`, and limit the analysis to certain cell types using the `cell_types` parameter. The function allows you to select specific principal components to use to train the isolation forest via `pc_subset`, adjust the number of trees with `n_tree`, and set an `anomaly_threshold` for classifying anomalies.
+The function takes a `r BiocStyle::Biocpkg("SingleCellExperiment")` object as `reference_data` and trains an isolation forest model on the reference PCA-projected data, with an optional `query_data` for projecting onto this PCA space for anomaly detection. You can specify cell type annotations through `ref_cell_type_col` and `query_cell_type_col`, and limit the analysis to certain cell types using the `cell_types` parameter. The function allows you to select specific principal components to use to train the isolation forest via `pc_subset`, adjust the number of trees with `n_tree`, and set an `anomaly_threshold` for classifying anomalies.
 
 
 ### Return Value
diff --git a/vignettes/CellDistancesDiagnostics.Rmd b/vignettes/CellDistancesDiagnostics.Rmd
index e9bb6c6..3760b85 100644
--- a/vignettes/CellDistancesDiagnostics.Rmd
+++ b/vignettes/CellDistancesDiagnostics.Rmd
@@ -91,7 +91,7 @@ distance_data <- calculateCellDistances(
 ```
 
 In the code above:
-- `query_data` and reference_data: These are `SingleCellExperiment` objects containing the respective datasets for analysis.
+- `query_data` and `reference_data`: These are `r BiocStyle::Biocpkg("SingleCellExperiment")` objects containing the respective datasets for analysis.
 - `query_cell_type_col` and ref_cell_type_col: These arguments specify the columns in the `colData` of each dataset that contain cell type annotations.
 - `pc_subset`: Specifies which principal components (1 to 10) are used to compute distances. PCA is applied for dimensionality reduction before calculating distances.
 
diff --git a/vignettes/DatasetMarkerGeneAlignment.Rmd b/vignettes/DatasetMarkerGeneAlignment.Rmd
index 024477a..4d275e1 100644
--- a/vignettes/DatasetMarkerGeneAlignment.Rmd
+++ b/vignettes/DatasetMarkerGeneAlignment.Rmd
@@ -101,7 +101,7 @@ data("query_data")
 set.seed(0)
 ```
 
-Some functions in the vignette are designed to work with `SingleCellExperiment` objects that contain data from only one cell type. We will create separate `SingleCellExperiment` objects that only CD4 cells, to ensure compatibility with these functions.
+Some functions in the vignette are designed to work with `r BiocStyle::Biocpkg("SingleCellExperiment")` objects that contain data from only one cell type. We will create separate `r BiocStyle::Biocpkg("SingleCellExperiment")` objects that contain only CD4 cells, to ensure compatibility with these functions.
 ```{r, message=FALSE, fig.show='hide'}
 # Load library
 library(scran)
@@ -233,7 +233,7 @@ The `plotPairwiseDistancesDensity()` function is designed to calculate and visua
 
 ### Functionality
 
-The function operates on `SingleCellExperiment` objects, which are commonly used to store single-cell data, including expression matrices and associated metadata. Users specify the cell types of interest in both the query and reference datasets, and the function computes either the distances or correlation coefficients between these cells.
+The function operates on `r BiocStyle::Biocpkg("SingleCellExperiment")` objects, which are commonly used to store single-cell data, including expression matrices and associated metadata. Users specify the cell types of interest in both the query and reference datasets, and the function computes either the distances or correlation coefficients between these cells.
 
 When principal component analysis (PCA) is applied, the function projects the expression data into a lower-dimensional PCA space, which can be specified by the user. This allows for a more focused analysis of the major sources of variation in the data. Alternatively, if no dimensionality reduction is desired, the function can directly use the expression data for computation.
 
diff --git a/vignettes/StatisticalMeasures.Rmd b/vignettes/StatisticalMeasures.Rmd
index 78c7dd5..e0ef715 100644
--- a/vignettes/StatisticalMeasures.Rmd
+++ b/vignettes/StatisticalMeasures.Rmd
@@ -95,7 +95,7 @@ set.seed(0)
 
 The calculateCramerPValue function is designed to perform the Cramer test for comparing multivariate empirical cumulative distribution functions (ECDFs) between two samples in single-cell data. This test is particularly useful for assessing whether the distributions of principal components (PCs) differ significantly between the reference and query datasets for specific cell types.
 
-To use this function, you first need to provide two key inputs: `reference_data` and `query_data`, both of which should be `SingleCellExperiment` objects containing numeric expression matrices. If `query_data` is not supplied, the function will only use the `reference_data.` You should also specify the column names for cell type annotations in both datasets via `ref_cell_type_col` and `query_cell_type_col.` If `cell_types` is not provided, the function will automatically include all unique cell types found in the datasets. The `pc_subset` parameter allows you to define which principal components to include in the analysis, with the default being the first five PCs.
+To use this function, you first need to provide two key inputs: `reference_data` and `query_data`, both of which should be `r BiocStyle::Biocpkg("SingleCellExperiment")` objects containing numeric expression matrices. If `query_data` is not supplied, the function will only use the `reference_data`. You should also specify the column names for cell type annotations in both datasets via `ref_cell_type_col` and `query_cell_type_col`. If `cell_types` is not provided, the function will automatically include all unique cell types found in the datasets. The `pc_subset` parameter allows you to define which principal components to include in the analysis, with the default being the first five PCs.
 
 The function performs the following steps: it first projects the data into PCA space, subsets the data according to the specified cell types and principal components, and then applies the Cramer test to compare the ECDFs between the reference and query datasets. The result is a named vector of p-values from the Cramer test for each cell type, which indicates whether there is a significant difference in the distributions of PCs between the two datasets.
 
@@ -123,7 +123,7 @@ In this example, the function compares the distributions of the first five princ
 
 The `calculateHotellingPValue()` function is designed to compute Hotelling's T-squared test statistic and corresponding p-values for comparing multivariate means between reference and query datasets in the context of single-cell RNA-seq data. This statistical test is particularly useful for assessing whether the mean vectors of principal components (PCs) differ significantly between the two datasets, which can be indicative of differences in the cell type distributions.
 
-To use this function, you need to provide two `SingleCellExperiment` objects: `query_data` and `reference_data`, each containing numeric expression matrices. You also need to specify the column names for cell type annotations in both datasets with `query_cell_type_col` and `ref_cell_type_col.` The `cell_types` parameter allows you to choose which cell types to include in the analysis, and if not specified, the function will automatically include all cell types present in the datasets. The `pc_subset` parameter determines which principal components to consider, with the default being the first five PCs. Additionally, `n_permutation` specifies the number of permutations for calculating p-values, with a default of 500.
+To use this function, you need to provide two `r BiocStyle::Biocpkg("SingleCellExperiment")` objects: `query_data` and `reference_data`, each containing numeric expression matrices. You also need to specify the column names for cell type annotations in both datasets with `query_cell_type_col` and `ref_cell_type_col`. The `cell_types` parameter allows you to choose which cell types to include in the analysis, and if not specified, the function will automatically include all cell types present in the datasets. The `pc_subset` parameter determines which principal components to consider, with the default being the first five PCs. Additionally, `n_permutation` specifies the number of permutations for calculating p-values, with a default of 500.
 
 The function works by first projecting the data into PCA space and then performing Hotelling's T-squared test for each specified cell type to compare the means between the reference and query datasets. It uses permutation testing to determine the p-values, indicating whether the observed differences in means are statistically significant. The result is a named numeric vector of p-values for each cell type.
 
@@ -153,7 +153,7 @@ The function begins by reducing the dimensionality of both the query and referen
 
 Here is a detailed explanation of how to use the `calculateNearestNeighborProbabilities()` function:
 
-1. **Loading the Data**: First, ensure you have the data available in the form of `SingleCellExperiment` objects for both query and reference datasets.
+1. **Loading the Data**: First, ensure you have the data available in the form of `r BiocStyle::Biocpkg("SingleCellExperiment")` objects for both query and reference datasets.
 
 2. **Function Call**: The function `calculateNearestNeighborProbabilities()` is called with several parameters including the query and reference datasets, the names of the columns containing cell type annotations, the subset of principal components to use, and the number of nearest neighbors to consider.
 
@@ -187,9 +187,9 @@ In summary, the `calculateNearestNeighborProbabilities()` function provides a ro
 
 # `calculateAveragePairwiseCorrelation()` 
 
-The `calculateAveragePairwiseCorrelation()` function is designed to compute the average pairwise correlations between specified cell types in single-cell gene expression data. This function operates on `SingleCellExperiment` objects and is ideal for single-cell analysis workflows. It calculates pairwise correlations between query and reference cells using a specified correlation method, and then averages these correlations for each cell type pair. This helps in assessing the similarity between cells in the reference and query datasets and provides insights into the reliability of cell type annotations.
+The `calculateAveragePairwiseCorrelation()` function is designed to compute the average pairwise correlations between specified cell types in single-cell gene expression data. This function operates on `r BiocStyle::Biocpkg("SingleCellExperiment")` objects and is ideal for single-cell analysis workflows. It calculates pairwise correlations between query and reference cells using a specified correlation method, and then averages these correlations for each cell type pair. This helps in assessing the similarity between cells in the reference and query datasets and provides insights into the reliability of cell type annotations.
 
-To use the `calculateAveragePairwiseCorrelation()` function, you need to supply it with two `SingleCellExperiment` objects: one for the query cells and one for the reference cells. The function also requires column names specifying cell type annotations in both datasets, and optionally a vector of cell types to include in the analysis. Additionally, you can specify a subset of principal components to use, or use the raw data directly if `pc_subset` is set to `NULL`. The correlation method can be either "spearman" or "pearson".
+To use the `calculateAveragePairwiseCorrelation()` function, you need to supply it with two `r BiocStyle::Biocpkg("SingleCellExperiment")` objects: one for the query cells and one for the reference cells. The function also requires column names specifying cell type annotations in both datasets, and optionally a vector of cell types to include in the analysis. Additionally, you can specify a subset of principal components to use, or use the raw data directly if `pc_subset` is set to `NULL`. The correlation method can be either "spearman" or "pearson".
 
 Here's an example of how to use this function:
 ```{r, fig.height=5, fig.width=10, fig.show='hide'}
@@ -215,7 +215,7 @@ Note that there is also a plot method for the object return for `calculateAverag
 
 # `regressPC()`
 
-The `regressPC()` function performs linear regression of a covariate of interest onto one or more principal components using data from a `SingleCellExperiment` object. This method helps quantify the variance explained by a covariate, which can be useful in applications such as quantifying batch effects, assessing clustering homogeneity, and evaluating alignment between query and reference datasets in cell type annotation settings.
+The `regressPC()` function performs linear regression of a covariate of interest onto one or more principal components using data from a `r BiocStyle::Biocpkg("SingleCellExperiment")` object. This method helps quantify the variance explained by a covariate, which can be useful in applications such as quantifying batch effects, assessing clustering homogeneity, and evaluating alignment between query and reference datasets in cell type annotation settings.
 
 The function calculates the R-squared value from the linear regression of the covariate onto each principal component. The variance contribution of the covariate effect per principal component is computed as the product of the variance explained by the principal component and the R-squared value. The total variance explained by the covariate is obtained by summing these contributions across all principal components.
 
diff --git a/vignettes/scDiagnostics.Rmd b/vignettes/scDiagnostics.Rmd
index 1edb9fa..2ef4f19 100644
--- a/vignettes/scDiagnostics.Rmd
+++ b/vignettes/scDiagnostics.Rmd
@@ -108,7 +108,11 @@ To explore the full capabilities of the `scDiagnostics` package, you have the op
 
 ## Loading Datasets
 
-In these datasets available in the `scDiagnostics` package, `reference_data`, `query_data`, and `qc_data` are all `SingleCellExperiment` objects that include a `logcounts` assay, which stores the log-transformed expression values for the genes.
+The datasets available in the `scDiagnostics` package, `reference_data`, `query_data`, and `qc_data`, are all `r BiocStyle::Biocpkg("SingleCellExperiment")` objects that include a `logcounts` assay, which stores the log-transformed expression values for the genes.
+
+The `reference_data` and `query_data` objects both originate from scRNA-seq experiments on hematopoietic tissues, specifically bone marrow samples, as provided by the `r BiocStyle::Biocpkg("scRNAseq")` package. These datasets have undergone comprehensive processing and cleaning, ensuring high-quality data for downstream analysis. Log-normalized counts were added to both datasets using the `r BiocStyle::Biocpkg("scuttle")` package. The `query_data` object has been further annotated with cell type assignments using the `r BiocStyle::Biocpkg("SingleR")` package, and it includes `annotation_scores` that reflect the confidence in these annotations. Additionally, gene set scores were computed and incorporated into the `query_data` using the `r BiocStyle::Biocpkg("AUCell")` package. For feature selection, the top 500 highly variable genes (HVGs) common to both datasets were identified and retained using the `r BiocStyle::Biocpkg("scran")` package. Finally, dimensionality reduction techniques including PCA, t-SNE, and UMAP were applied to both datasets, with the results stored within each object using the `r BiocStyle::Biocpkg("scater")` package.
+
+The `qc_data` dataset in this package is derived from the `hpca` dataset available in the `r BiocStyle::Biocpkg("celldex")` package. Like the other datasets, `qc_data` has undergone significant cleaning and processing to ensure high data quality. Quality control (QC) metrics were added using the `r BiocStyle::Biocpkg("scuttle")` package. Cell type annotations and associated `annotation_scores` were generated using the `r BiocStyle::Biocpkg("SingleR")` package. Additionally, the top highly variable genes were selected using the `r BiocStyle::Biocpkg("scran")` package to enhance the dataset’s utility for downstream analyses.
 
 ```{r, message=FALSE, fig.show='hide'}
 # Load datasets
@@ -119,13 +123,13 @@ data("qc_data")
 # Set seed for reproducibility
 set.seed(0)
 ```
-The `reference_data` object is a curated dataset that has been cleaned and processed, and it contains column data labeled `expert_annotation`, which provides cell type annotations assigned by experts. On the other hand, `query_data` also includes `expert_annotation`, but it additionally features `SingleR_annotation`, which is the cell type annotation generated by the `r BiocStyle::Biocpkg("SingleR")` package, a popular package for cell type assignment based on reference datasets. The `qc_data` object contains a special column called `annotation_scores`, which holds the scores from the `SingleR` annotations, providing a measure of confidence or relevance for the assigned cell types.
+The `reference_data` object contains column data labeled `expert_annotation`, which provides cell type annotations assigned by experts. On the other hand, `query_data` also includes `expert_annotation`, but it additionally features `SingleR_annotation`, which is the cell type annotation generated by the `r BiocStyle::Biocpkg("SingleR")` package, a popular package for cell type assignment based on reference datasets. The `qc_data` object contains a special column called `annotation_scores`, which holds the scores from the `SingleR` annotations, providing a measure of confidence or relevance for the assigned cell types.
 
 By working with these datasets, you can gain hands-on experience with the various diagnostic tools and functions offered by `scDiagnostics`, allowing you to better understand how well it aligns query and reference datasets, assesses annotation ambiguity, and evaluates cluster heterogeneity and marker gene alignment.
 
 ## Subsetting the Datasets
 
-Some functions in the vignette are designed to work with `SingleCellExperiment` objects that contain data from only one cell type. We will create separate `SingleCellExperiment` objects that only CD4 cells, to ensure compatibility with these functions.
+Some functions in the vignette are designed to work with `r BiocStyle::Biocpkg("SingleCellExperiment")` objects that contain data from only one cell type. We will create separate `r BiocStyle::Biocpkg("SingleCellExperiment")` objects that contain only CD4 cells, to ensure compatibility with these functions.
 ```{r, message=FALSE, fig.show='hide'}
 # Load library
 library(scran)
@@ -274,7 +278,7 @@ plot(subspace_comparison)
 
 ## `plotPairwiseDistancesDensity()`
 
-The `plotPairwiseDistancesDensity()` function calculates and visualizes pairwise distances or correlations between cell types in query and reference datasets, aiding in the evaluation of cell type annotation consistency in single-cell RNA sequencing (scRNA-seq) analysis. Operating on `SingleCellExperiment` objects, it allows users to specify cell types of interest and compute either distances or correlation coefficients, with the option to project data into PCA space for focused analysis. The function generates a density plot using `ggplot2`, comparing cell relationships within and between datasets. 
+The `plotPairwiseDistancesDensity()` function calculates and visualizes pairwise distances or correlations between cell types in query and reference datasets, aiding in the evaluation of cell type annotation consistency in single-cell RNA sequencing (scRNA-seq) analysis. Operating on `r BiocStyle::Biocpkg("SingleCellExperiment")` objects, it allows users to specify cell types of interest and compute either distances or correlation coefficients, with the option to project data into PCA space for focused analysis. The function generates a density plot using `ggplot2`, comparing cell relationships within and between datasets. 
 
 ```{r, fig.height=5, fig.width=10, fig.show='hide'}
 # Example usage of the function