possible to compute distances on a subset of genes? #640

Open
aterceros opened this issue Aug 14, 2024 · 3 comments

Comments

@aterceros

Description of feature

Hi!
Thank you for making this package available! I was wondering whether it is possible to compute distances between groups of cells using only a subset of genes (for example, genes that are differentially expressed between two groups)?
Thanks in advance.

@aterceros aterceros added the enhancement New feature or request label Aug 14, 2024
@Zethson Zethson self-assigned this Aug 14, 2024
@stefanpeidli
Collaborator

Great suggestion! We usually calculate most distances in lower-dimensional spaces (such as PCA) since distances become less informative in high dimensions. Depending on how large your set of genes is, you can either:

  • (If many genes and partially redundant) Calculate PCA on a subset of genes, then use that subset PCA for calculating distances. You can use the mask_var argument in scanpy.pp.pca for this.
  • (If few genes) Directly calculate distances on the subset. In this case I would just store your subset with adata.obsm['X_subset'] = adata[:, gene_subset].X.copy(), then specify pt.tl.Distance(metric="euclidean", obsm_key="X_subset").
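To make the second option concrete, here is a minimal NumPy-only sketch of what "distance on a gene subset" amounts to. It is an illustration, not pertpy's implementation: the toy matrix, gene names, and condition labels are made up and stand in for adata.X, adata.var_names, and adata.obs["condition"], and the Euclidean distance is taken between group centroids, which is one common definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical data): 6 cells x 5 genes.
X = rng.normal(size=(6, 5))                            # plays the role of adata.X
gene_names = np.array(["g1", "g2", "g3", "g4", "g5"])  # plays the role of adata.var_names
condition = np.array(["A", "A", "A", "B", "B", "B"])   # plays the role of adata.obs["condition"]

gene_subset = ["g2", "g4"]  # e.g. differentially expressed genes

# Boolean column mask for the subset. Columns keep the order of gene_names,
# not the order of gene_subset.
mask = np.isin(gene_names, gene_subset)
X_subset = X[:, mask]  # this is what would be stored in adata.obsm["X_subset"]

# Euclidean distance between the two group centroids, restricted to the subset.
centroid_a = X_subset[condition == "A"].mean(axis=0)
centroid_b = X_subset[condition == "B"].mean(axis=0)
dist = float(np.linalg.norm(centroid_a - centroid_b))

print(X_subset.shape, dist)
```

The same masking idea carries over to the first option: there the mask selects which genes enter the PCA rather than which columns are compared directly.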

@Zethson since our distance function is already flexible enough to handle this case by specifying a different key in obsm, I think there is no need to implement this feature here directly. We could add a small example on this to the docs though, because this approach is quite useful for analysis.

@aterceros
Author

Thank you for the comment! I'll try the second option!

@aterceros
Author

Hi!
Thank you for the suggestion above. I tried the second option and it seems to work well. However, when I run the bootstrap option, I get very large variances (i.e. between 120 and 160), but only for some comparisons. Would you say that such large variance values can occur?

What I ran:
adata.obsm['X_subset'] = adata[:, geneset].X.copy()
distance = pt.tl.Distance(metric="wasserstein", obsm_key="X_subset")
X = adata.obsm["X_subset"][adata.obs["condition"] == "A"]
Y = adata.obsm["X_subset"][adata.obs["condition"] == "B"]
D = distance.bootstrap(X,Y)

  • my gene subsets are ~ 100 genes (DEGs).

Thank you!
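For context on how to read a bootstrap variance, here is a small NumPy-only sketch of the general technique: resample cells with replacement in each group, recompute the distance, and summarise the resampled values. It is not pertpy's implementation. The helper names (centroid_distance, bootstrap_distance), the toy data, and the use of a centroid-based Euclidean distance instead of Wasserstein are all assumptions made for illustration. The point it shows is that a variance is expressed in squared distance units, so it is easier to judge relative to the mean distance, for example through the coefficient of variation.

```python
import numpy as np


def centroid_distance(x, y):
    """Euclidean distance between group centroids (a stand-in metric)."""
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))


def bootstrap_distance(x, y, n_bootstrap=100, seed=0):
    """Resample cells with replacement and return (mean, variance) of the distance."""
    rng = np.random.default_rng(seed)
    values = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        xb = x[rng.integers(0, len(x), size=len(x))]  # resampled group A
        yb = y[rng.integers(0, len(y), size=len(y))]  # resampled group B
        values[i] = centroid_distance(xb, yb)
    return float(values.mean()), float(values.var())


# Hypothetical data: two groups of 50 cells over ~100 genes, with a shift in group B.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 100))
group_b = rng.normal(loc=0.5, scale=1.0, size=(50, 100))

mean, variance = bootstrap_distance(group_a, group_b)

# The standard deviation is in the same units as the distance itself,
# so std / mean gives a scale-free view of the spread.
coefficient_of_variation = np.sqrt(variance) / mean
print(mean, variance, coefficient_of_variation)
```

Under this view, a variance of 120-160 is large or small only in relation to the mean distance of the same comparison and to the number of cells being resampled in each group.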


3 participants