Create signature matrix with average_clusters() using bulkRNAseq data #397

saphir746 · 2023-03-13T11:30:59Z

hello,

I am trying to create a cell-type signature matrix from bulk RNA-seq of FACS-sorted, single-cell-type samples:

expr_matrix %>% head()

	Sample_1	Sample_2	Sample_3	Sample_4
TSPAN6	0.6621047	0.6621047	8.4720554	0.6621047
TNMD	0.6621047	0.6621047	2.771366	6.9039605
DPM1	9.2066392	8.8886292	0.6621047	10.17191
SCYL3	5.5968998	3.0201094	9.9043603	8.514964
C1orf112	3.6115171	7.806794	5.371021	4.5565736

design_matrix %>% head()

Sample_name	Sex	Cell_type
Sample_1	F	Neutrophils_BoneMarrow
Sample_2	F	MCs_BoneMarrow
Sample_3	M	Neutrophils_BoneMarrow
Sample_4	M	MCs_BoneMarrow

Here the columns are samples (from different patients) and the rows are GRCh38-annotated gene names. expr_matrix is derived from raw counts, after alignment to GRCh38 with the standard nf-core/rnaseq pipeline, and then normalised with varianceStabilizingTransformation() in DESeq2.
I understand that in clustifyr the best approach would be to use average_clusters()?

new_ref_matrix <- clustifyr::average_clusters(
  mat = expr_matrix,
  metadata = design_matrix$Cell_type,
  cluster_col = "Sample_name",
  method = 'median',
  cut_n = TRUE
)

But then I am unsure if:

1. What I have done in terms of pre-processing the bulk RNA-seq count data is right?
2. I'm calling average_clusters() correctly?
3. I need to perform some extra scaling / normalisation on the resulting reference matrix new_ref_matrix?
4. I can then integrate new_ref_matrix with other reference signature matrices derived from single-cell data?

Any advice, insights or pointers to helpful material will be very much appreciated.

Thanks all


kriemo · 2023-03-15T22:42:34Z

Thanks for your interest in the package:

1. In general we recommend using similar normalization approaches for the reference data and the scRNA-seq dataset, so I would recommend just using log-transformed normalized counts from DESeq2. I haven't tried using transformed counts (from varianceStabilizingTransformation() or rlog()); however, I suspect you would see similar results to using log-transformed normalized counts.
2. I wouldn't recommend using the cut_n parameter, as it is a crude method to exclude low-abundance genes and in most cases isn't necessary. The cluster_col = "Sample_name" argument is also unnecessary if you pass a vector to metadata.
3. No additional scaling should be necessary.
4. Yes, you can combine multiple references into the same matrix (making sure that the genes are compatible), or run clustifyr independently for each reference.
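
For illustration, here is a minimal sketch of what the advice above could look like in practice. It is not tested on the original data: the toy values, the `sc_ref` matrix, and the `combine_refs()` helper are hypothetical, and `if_log = TRUE` assumes the input is natural-log-transformed (e.g. something like `log1p(DESeq2::counts(dds, normalized = TRUE))` for a DESeqDataSet `dds`). The base-R fallback is only a rough stand-in so the sketch runs without clustifyr installed.

```r
# Toy stand-in for log-transformed normalized counts (genes x samples).
expr_matrix <- matrix(
  c(0.7, 0.7, 8.5, 0.7,
    0.7, 0.7, 2.8, 6.9,
    9.2, 8.9, 0.7, 10.2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("TSPAN6", "TNMD", "DPM1"), paste0("Sample_", 1:4))
)
design_matrix <- data.frame(
  Sample_name = paste0("Sample_", 1:4),
  Cell_type = rep(c("Neutrophils_BoneMarrow", "MCs_BoneMarrow"), 2)
)

# Order the cell-type labels to match the matrix columns.
cell_types <- design_matrix$Cell_type[
  match(colnames(expr_matrix), design_matrix$Sample_name)
]

if (requireNamespace("clustifyr", quietly = TRUE)) {
  # metadata is a plain vector, so cluster_col is omitted; cut_n is left unset.
  new_ref_matrix <- clustifyr::average_clusters(
    mat = expr_matrix,
    metadata = cell_types,
    if_log = TRUE
  )
} else {
  # Fallback (assumption): plain per-cell-type mean of the log values.
  groups <- split(seq_along(cell_types), cell_types)
  new_ref_matrix <- sapply(
    groups,
    function(i) rowMeans(expr_matrix[, i, drop = FALSE])
  )
}

# Hypothetical helper: combine two references on their shared genes only.
combine_refs <- function(ref_a, ref_b) {
  shared <- intersect(rownames(ref_a), rownames(ref_b))
  cbind(ref_a[shared, , drop = FALSE], ref_b[shared, , drop = FALSE])
}

# Hypothetical single-cell-derived reference (genes x cell types).
sc_ref <- matrix(
  c(1.2, 0.3,
    0.1, 2.4,
    3.0, 3.1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("DPM1", "TSPAN6", "SCYL3"), c("T_cell", "B_cell"))
)

combined_ref <- combine_refs(new_ref_matrix, sc_ref)
```

Restricting to the intersection of gene names is one simple way to keep the genes compatible; both references should also be on a comparable scale, in line with the normalization advice above.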

kriemo closed this as completed Aug 23, 2024
