Create signature matrix with average_clusters() using bulkRNAseq data #397

Closed
saphir746 opened this issue Mar 13, 2023 · 1 comment

@saphir746

Hello,

I am trying to create a cell-type signature matrix from bulk RNA-seq of FACS-sorted, single-cell-type samples:

expr_matrix %>% head()
         Sample_1  Sample_2  Sample_3  Sample_4
TSPAN6   0.6621047 0.6621047 8.4720554 0.6621047
TNMD     0.6621047 0.6621047 2.771366  6.9039605
DPM1     9.2066392 8.8886292 0.6621047 10.17191
SCYL3    5.5968998 3.0201094 9.9043603 8.514964
C1orf112 3.6115171 7.806794  5.371021  4.5565736

design_matrix %>% head()
Sample_name Sex Cell_type
Sample_1    F   Neutrophils_BoneMarrow
Sample_2    F   MCs_BoneMarrow
Sample_3    M   Neutrophils_BoneMarrow
Sample_4    M   MCs_BoneMarrow

Here the columns are samples (different patient samples) and the rows are annotated GRCh38 gene names. expr_matrix is derived from raw counts, after alignment to GRCh38 using the standard nf-core/rnaseq pipeline, and then normalised using varianceStabilizingTransformation() in DESeq2.
As I understand it, the best approach in clustifyr would be to use average_clusters():

new_ref_matrix <- clustifyr::average_clusters(
  mat = expr_matrix,
  metadata = design_matrix$Cell_type,
  cluster_col = "Sample_name",
  method = 'median',
  cut_n = TRUE
)

But I am unsure whether:

  1. What I have done to pre-process the bulk RNA-seq count data is right?
  2. I'm calling average_clusters() correctly?
  3. I need to perform some extra scaling / normalisation on the resulting reference matrix new_ref_matrix?
  4. I can then integrate new_ref_matrix with other reference signature matrices derived from single-cell data?

Any advice, insights or pointers to helpful material would be very much appreciated.

Thanks all

@kriemo
Member

kriemo commented Mar 15, 2023

Thanks for your interest in the package:

  1. In general we recommend using similar normalization approaches for the reference data and the scRNA-seq dataset, so I would recommend just using log-transformed normalized counts from DESeq2. I haven't tried using transformed counts (from varianceStabilizingTransformation() or rlog()); however, I would suspect that you would see similar results to those from log-transformed normalized counts.

  2. I wouldn't recommend using the cut_n parameter, as it is a crude method to exclude low-abundance genes and in most cases isn't necessary. The cluster_col = "Sample_name" argument is also unnecessary if you pass a vector to metadata.

  3. No additional scaling should be necessary.

  4. Yes, you can combine multiple references into the same matrix (making sure that the genes are compatible), or run clustifyr independently for each different reference.
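
For readers following along, points 2 and 4 can be sketched in R roughly as follows. This is a minimal sketch on made-up toy data, not a tested recipe. The toy matrices, the `combine_refs()` helper and the single-cell cell-type names are hypothetical and exist only for illustration. The `average_clusters()` call assumes the `mat`, `metadata` and `if_log` arguments behave as described in the clustifyr documentation, and it is skipped if the package is not installed.

```r
# Toy stand-ins for the objects in the question (all values are made up).
# log1p() mimics natural-log-transformed normalized counts; with real data
# this could be, for example, log1p(DESeq2::counts(dds, normalized = TRUE)).
set.seed(1)
genes <- c("TSPAN6", "TNMD", "DPM1", "SCYL3", "C1orf112")
expr_matrix <- log1p(matrix(
  runif(20, min = 0, max = 1000),
  nrow = 5,
  dimnames = list(genes, paste0("Sample_", 1:4))
))
design_matrix <- data.frame(
  Sample_name = paste0("Sample_", 1:4),
  Sex = c("F", "F", "M", "M"),
  Cell_type = rep(c("Neutrophils_BoneMarrow", "MCs_BoneMarrow"), 2)
)

# A metadata vector is matched to matrix columns by position, so check the order.
stopifnot(identical(colnames(expr_matrix), design_matrix$Sample_name))

if (requireNamespace("clustifyr", quietly = TRUE)) {
  # metadata is a plain character vector, so cluster_col is not passed,
  # and cut_n is left at its default, as suggested in point 2.
  new_ref_matrix <- clustifyr::average_clusters(
    mat = expr_matrix,
    metadata = as.character(design_matrix$Cell_type),
    if_log = TRUE # assumption: input is on a natural-log scale
  )
}

# Hypothetical helper (not part of clustifyr): restrict two references to
# their shared genes, then bind them column-wise into one matrix (point 4).
combine_refs <- function(ref_a, ref_b) {
  shared <- intersect(rownames(ref_a), rownames(ref_b))
  cbind(ref_a[shared, , drop = FALSE], ref_b[shared, , drop = FALSE])
}

# Two small made-up references with partially overlapping genes.
bulk_ref <- matrix(
  c(1, 2, 3, 4, 5, 6),
  nrow = 3,
  dimnames = list(c("TSPAN6", "DPM1", "SCYL3"),
                  c("Neutrophils_BoneMarrow", "MCs_BoneMarrow"))
)
sc_ref <- matrix(
  c(7, 8, 9, 10, 11, 12),
  nrow = 3,
  dimnames = list(c("DPM1", "SCYL3", "CD3E"),
                  c("T_cells", "B_cells"))
)
combined <- combine_refs(bulk_ref, sc_ref)
```

Only genes present in both references survive the merge, so it is worth checking how many rows remain before passing the combined matrix on. If the two references were normalized very differently, running clustifyr separately against each one, as mentioned in point 4, may be the safer option.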

kriemo closed this as completed on Aug 23, 2024