Add details about datasets in vignette and pkgdown website.

ccb-hms · Sep 7, 2024 · 611b4f3
1 parent 92b6e27
Showing 7 changed files with 24 additions and 16 deletions.
diff --git a/_pkgdown.yml b/_pkgdown.yml
@@ -81,7 +81,12 @@ reference:
   - title: Misc
     contents:
       - projectPCA 
-      - calculateCategorizationEntropy     
+      - calculateCategorizationEntropy
+  - title: Datasets
+    contents:
+      - reference_data 
+      - query_data
+      - qc_data
 
 right:
   - icon: fa-github

diff --git a/inst/script/ReferenceQueryData.R b/inst/script/ReferenceQueryData.R
@@ -10,7 +10,6 @@ set.seed(100)
 indices <- sample(ncol(SummarizedExperiment::assay(sce)), 
                   size = floor(0.8 * ncol(SummarizedExperiment::assay(sce))), 
                   replace = FALSE)
-ref_indices <- sample(indices, 0.7*length(indices))
 reference_data <- sce[, sample(indices, 1500)]
 query_data <- sce[, -indices]
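
Note on this hunk: the removed `ref_indices` line is not used anywhere in the lines shown, so dropping it should not change how the reference and query sets are built. The split logic can be sketched in base R as follows. This is only an illustration on simulated cell indices: `n_cells`, `cell_ids`, `reference_ids`, and `query_ids` are hypothetical names standing in for the columns of `sce`, which is not shown in this diff.

```r
# Simulated stand-in for the columns of `sce`; n_cells is a hypothetical size.
set.seed(100)
n_cells <- 3000
cell_ids <- seq_len(n_cells)

# 80% of the cells form the pool from which the reference is drawn
indices <- sample(n_cells, size = floor(0.8 * n_cells), replace = FALSE)

# Reference: 1500 cells sampled from the pool
reference_ids <- cell_ids[sample(indices, 1500)]

# Query: every cell outside the pool (negative indexing drops the pool)
query_ids <- cell_ids[-indices]

# The two sets share no cells, because the query excludes the whole pool
overlap <- intersect(reference_ids, query_ids)
```

Because the query is defined as the complement of the full pool rather than of the sampled reference, the two datasets are disjoint by construction.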
 

diff --git a/vignettes/AnnotationAnomalies.Rmd b/vignettes/AnnotationAnomalies.Rmd
@@ -99,7 +99,7 @@ The function also provides detailed visualizations and statistical outputs to he
 
 ### Parameters
 
-The function takes a `SingleCellExperiment` object as `reference_data` and trains an isolation forest model on the reference PCA-projected data, with an optional `query_data` for projecting onto this PCA space for anomaly detection. You can specify cell type annotations through `ref_cell_type_col` and `query_cell_type_col`, and limit the analysis to certain cell types using the `cell_types` parameter. The function allows you to select specific principal components to use to train the isolation forest via `pc_subset`, adjust the number of trees with `n_tree`, and set an `anomaly_threshold` for classifying anomalies.
+The function takes a `r BiocStyle::Biocpkg("SingleCellExperiment")` object as `reference_data` and trains an isolation forest model on the reference PCA-projected data, with an optional `query_data` for projecting onto this PCA space for anomaly detection. You can specify cell type annotations through `ref_cell_type_col` and `query_cell_type_col`, and limit the analysis to certain cell types using the `cell_types` parameter. The function allows you to select specific principal components to use to train the isolation forest via `pc_subset`, adjust the number of trees with `n_tree`, and set an `anomaly_threshold` for classifying anomalies.
 
 
 ### Return Value

diff --git a/vignettes/CellDistancesDiagnostics.Rmd b/vignettes/CellDistancesDiagnostics.Rmd
@@ -91,7 +91,7 @@ distance_data <- calculateCellDistances(
 ```
 
 In the code above:
-- `query_data` and reference_data: These are `SingleCellExperiment` objects containing the respective datasets for analysis.
+- `query_data` and `reference_data`: These are `r BiocStyle::Biocpkg("SingleCellExperiment")` objects containing the respective datasets for analysis.
 - `query_cell_type_col` and ref_cell_type_col: These arguments specify the columns in the `colData` of each dataset that contain cell type annotations.
 - `pc_subset`: Specifies which principal components (1 to 10) are used to compute distances. PCA is applied for dimensionality reduction before calculating distances.
 

diff --git a/vignettes/DatasetMarkerGeneAlignment.Rmd b/vignettes/DatasetMarkerGeneAlignment.Rmd
@@ -101,7 +101,7 @@ data("query_data")
 set.seed(0)
 ```
 
-Some functions in the vignette are designed to work with `SingleCellExperiment` objects that contain data from only one cell type. We will create separate `SingleCellExperiment` objects that only CD4 cells, to ensure compatibility with these functions.
+Some functions in the vignette are designed to work with `r BiocStyle::Biocpkg("SingleCellExperiment")` objects that contain data from only one cell type. We will create separate `r BiocStyle::Biocpkg("SingleCellExperiment")` objects that contain only CD4 cells, to ensure compatibility with these functions.
 ```{r, message=FALSE, fig.show='hide'}
 # Load library
 library(scran)
@@ -233,7 +233,7 @@ The `plotPairwiseDistancesDensity()` function is designed to calculate and visua
 
 ### Functionality
 
-The function operates on `SingleCellExperiment` objects, which are commonly used to store single-cell data, including expression matrices and associated metadata. Users specify the cell types of interest in both the query and reference datasets, and the function computes either the distances or correlation coefficients between these cells.
+The function operates on `r BiocStyle::Biocpkg("SingleCellExperiment")` objects, which are commonly used to store single-cell data, including expression matrices and associated metadata. Users specify the cell types of interest in both the query and reference datasets, and the function computes either the distances or correlation coefficients between these cells.
 
 When principal component analysis (PCA) is applied, the function projects the expression data into a lower-dimensional PCA space, which can be specified by the user. This allows for a more focused analysis of the major sources of variation in the data. Alternatively, if no dimensionality reduction is desired, the function can directly use the expression data for computation.
 

diff --git a/vignettes/StatisticalMeasures.Rmd b/vignettes/StatisticalMeasures.Rmd
@@ -95,7 +95,7 @@ set.seed(0)
 
 The calculateCramerPValue function is designed to perform the Cramer test for comparing multivariate empirical cumulative distribution functions (ECDFs) between two samples in single-cell data. This test is particularly useful for assessing whether the distributions of principal components (PCs) differ significantly between the reference and query datasets for specific cell types.
 
-To use this function, you first need to provide two key inputs: `reference_data` and `query_data`, both of which should be `SingleCellExperiment` objects containing numeric expression matrices. If `query_data` is not supplied, the function will only use the `reference_data.` You should also specify the column names for cell type annotations in both datasets via `ref_cell_type_col` and `query_cell_type_col.` If `cell_types` is not provided, the function will automatically include all unique cell types found in the datasets. The `pc_subset` parameter allows you to define which principal components to include in the analysis, with the default being the first five PCs.
+To use this function, you first need to provide two key inputs: `reference_data` and `query_data`, both of which should be `r BiocStyle::Biocpkg("SingleCellExperiment")` objects containing numeric expression matrices. If `query_data` is not supplied, the function will only use the `reference_data`. You should also specify the column names for cell type annotations in both datasets via `ref_cell_type_col` and `query_cell_type_col`. If `cell_types` is not provided, the function will automatically include all unique cell types found in the datasets. The `pc_subset` parameter allows you to define which principal components to include in the analysis, with the default being the first five PCs.
 
 The function performs the following steps: it first projects the data into PCA space, subsets the data according to the specified cell types and principal components, and then applies the Cramer test to compare the ECDFs between the reference and query datasets. The result is a named vector of p-values from the Cramer test for each cell type, which indicates whether there is a significant difference in the distributions of PCs between the two datasets.
 
@@ -123,7 +123,7 @@ In this example, the function compares the distributions of the first five princ
 
 The `calculateHotellingPValue()` function is designed to compute Hotelling's T-squared test statistic and corresponding p-values for comparing multivariate means between reference and query datasets in the context of single-cell RNA-seq data. This statistical test is particularly useful for assessing whether the mean vectors of principal components (PCs) differ significantly between the two datasets, which can be indicative of differences in the cell type distributions.
 
-To use this function, you need to provide two `SingleCellExperiment` objects: `query_data` and `reference_data`, each containing numeric expression matrices. You also need to specify the column names for cell type annotations in both datasets with `query_cell_type_col` and `ref_cell_type_col.` The `cell_types` parameter allows you to choose which cell types to include in the analysis, and if not specified, the function will automatically include all cell types present in the datasets. The `pc_subset` parameter determines which principal components to consider, with the default being the first five PCs. Additionally, `n_permutation` specifies the number of permutations for calculating p-values, with a default of 500.
+To use this function, you need to provide two `r BiocStyle::Biocpkg("SingleCellExperiment")` objects: `query_data` and `reference_data`, each containing numeric expression matrices. You also need to specify the column names for cell type annotations in both datasets with `query_cell_type_col` and `ref_cell_type_col`. The `cell_types` parameter allows you to choose which cell types to include in the analysis, and if not specified, the function will automatically include all cell types present in the datasets. The `pc_subset` parameter determines which principal components to consider, with the default being the first five PCs. Additionally, `n_permutation` specifies the number of permutations for calculating p-values, with a default of 500.
 
 The function works by first projecting the data into PCA space and then performing Hotelling's T-squared test for each specified cell type to compare the means between the reference and query datasets. It uses permutation testing to determine the p-values, indicating whether the observed differences in means are statistically significant. The result is a named numeric vector of p-values for each cell type.
 
@@ -153,7 +153,7 @@ The function begins by reducing the dimensionality of both the query and referen
 
 Here is a detailed explanation of how to use the `calculateNearestNeighborProbabilities()` function:
 
-1. **Loading the Data**: First, ensure you have the data available in the form of `SingleCellExperiment` objects for both query and reference datasets.
+1. **Loading the Data**: First, ensure you have the data available in the form of `r BiocStyle::Biocpkg("SingleCellExperiment")` objects for both query and reference datasets.
 
 2. **Function Call**: The function `calculateNearestNeighborProbabilities()` is called with several parameters including the query and reference datasets, the names of the columns containing cell type annotations, the subset of principal components to use, and the number of nearest neighbors to consider.
 
@@ -187,9 +187,9 @@ In summary, the `calculateNearestNeighborProbabilities()` function provides a ro
 
 # `calculateAveragePairwiseCorrelation()` 
 
-The `calculateAveragePairwiseCorrelation()` function is designed to compute the average pairwise correlations between specified cell types in single-cell gene expression data. This function operates on `SingleCellExperiment` objects and is ideal for single-cell analysis workflows. It calculates pairwise correlations between query and reference cells using a specified correlation method, and then averages these correlations for each cell type pair. This helps in assessing the similarity between cells in the reference and query datasets and provides insights into the reliability of cell type annotations.
+The `calculateAveragePairwiseCorrelation()` function is designed to compute the average pairwise correlations between specified cell types in single-cell gene expression data. This function operates on `r BiocStyle::Biocpkg("SingleCellExperiment")` objects and is ideal for single-cell analysis workflows. It calculates pairwise correlations between query and reference cells using a specified correlation method, and then averages these correlations for each cell type pair. This helps in assessing the similarity between cells in the reference and query datasets and provides insights into the reliability of cell type annotations.
 
-To use the `calculateAveragePairwiseCorrelation()` function, you need to supply it with two `SingleCellExperiment` objects: one for the query cells and one for the reference cells. The function also requires column names specifying cell type annotations in both datasets, and optionally a vector of cell types to include in the analysis. Additionally, you can specify a subset of principal components to use, or use the raw data directly if `pc_subset` is set to `NULL`. The correlation method can be either "spearman" or "pearson".
+To use the `calculateAveragePairwiseCorrelation()` function, you need to supply it with two `r BiocStyle::Biocpkg("SingleCellExperiment")` objects: one for the query cells and one for the reference cells. The function also requires column names specifying cell type annotations in both datasets, and optionally a vector of cell types to include in the analysis. Additionally, you can specify a subset of principal components to use, or use the raw data directly if `pc_subset` is set to `NULL`. The correlation method can be either "spearman" or "pearson".
 
 Here's an example of how to use this function:
 ```{r, fig.height=5, fig.width=10, fig.show='hide'}
@@ -215,7 +215,7 @@ Note that there is also a plot method for the object return for `calculateAverag
 
 # `regressPC()`
 
-The `regressPC()` function performs linear regression of a covariate of interest onto one or more principal components using data from a `SingleCellExperiment` object. This method helps quantify the variance explained by a covariate, which can be useful in applications such as quantifying batch effects, assessing clustering homogeneity, and evaluating alignment between query and reference datasets in cell type annotation settings.
+The `regressPC()` function performs linear regression of a covariate of interest onto one or more principal components using data from a `r BiocStyle::Biocpkg("SingleCellExperiment")` object. This method helps quantify the variance explained by a covariate, which can be useful in applications such as quantifying batch effects, assessing clustering homogeneity, and evaluating alignment between query and reference datasets in cell type annotation settings.
 
 The function calculates the R-squared value from the linear regression of the covariate onto each principal component. The variance contribution of the covariate effect per principal component is computed as the product of the variance explained by the principal component and the R-squared value. The total variance explained by the covariate is obtained by summing these contributions across all principal components.
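
Note on the paragraph above: the variance calculation it describes can be sketched in base R. This is a minimal illustration on simulated data, not the package's `regressPC()` implementation. The names `expr`, `batch`, `n_pcs`, `pc_var_pct`, `r_squared`, `var_contrib`, and `total_var_explained` are hypothetical, and the sketch assumes each principal component's scores are regressed on the covariate with `lm()`.

```r
# Illustrative sketch only; uses base R (stats) on simulated data.
set.seed(1)
n <- 200
batch <- factor(rep(c("a", "b"), each = n / 2))   # hypothetical covariate
expr <- matrix(rnorm(n * 20), nrow = n)           # cells x genes, simulated
shifted <- batch == "b"
expr[shifted, 1:5] <- expr[shifted, 1:5] + 2      # inject a batch shift

# PCA on the simulated expression matrix
pca <- prcomp(expr, center = TRUE, scale. = FALSE)
n_pcs <- 5
pc_var_pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)  # % variance per PC

# R-squared from regressing each PC's scores on the covariate
r_squared <- vapply(seq_len(n_pcs), function(i) {
  summary(lm(pca$x[, i] ~ batch))$r.squared
}, numeric(1))

# Per-PC contribution: variance explained by the PC times its R-squared
var_contrib <- pc_var_pct[seq_len(n_pcs)] * r_squared

# Total variance attributed to the covariate across the selected PCs
total_var_explained <- sum(var_contrib)
```

Since each R-squared lies between 0 and 1, the total can never exceed the combined variance of the selected principal components, which gives a quick sanity check on the result.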