Dropulation doublet rate very high #339

Open
terencewtli opened this issue Jun 23, 2023 · 8 comments

Comments

@terencewtli

Hi,

Thanks for providing this tool! We have four individuals multiplexed in a 10X Multiome fibroblast to stem cell reprogramming experiment. I'm writing because I have some strange demultiplexing results - almost all of the doublet p-values are >= 0.9 for both ATAC and RNA. I used the parameters as recommended in this discussion (namely MAX_ERROR_RATE = 0.05): #321. I've attached the parameters for RNA/ATAC below.

drop_params.txt

In the previous discussion, you mentioned that doublets are the cells where doublet_pval >= 0.9. However, using that threshold, we get ~98% of the cells called as doublets. Even using the doublet_pval threshold as 1.0, we see a high doublet proportion:

image

I was wondering if you had ideas regarding this behavior, perhaps further tuning MAX_ERROR_RATE? Let me know if you'd like me to provide additional details, such as the full output of dropulation.
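For anyone wanting to reproduce the doublet proportions quoted above, here is a minimal sketch of computing the fraction of cells at or above a given doublet_pval threshold. It assumes a tab-separated DetectDoublets output with a doublet_pval column; the `cell` column name and the toy rows are illustrative assumptions, and any leading `#` comment lines in a real output file would need to be skipped first.

```python
import csv
import io


def doublet_fraction(rows, threshold=0.9):
    """Fraction of cells whose doublet_pval is at or above the threshold."""
    pvals = [float(r["doublet_pval"]) for r in rows]
    if not pvals:
        return 0.0
    return sum(p >= threshold for p in pvals) / len(pvals)


# Toy stand-in for a DetectDoublets output table (hypothetical values).
example = io.StringIO(
    "cell\tdoublet_pval\n"
    "AAAC\t0.02\n"
    "AAAG\t0.95\n"
    "AAAT\t1.0\n"
    "AACA\t0.91\n"
)
rows = list(csv.DictReader(example, delimiter="\t"))

frac_09 = doublet_fraction(rows, 0.9)
frac_10 = doublet_fraction(rows, 1.0)
print(frac_09, frac_10)
```

Comparing the fraction at 0.9 and at 1.0 shows how much of the doublet call rate comes from cells sitting exactly at the upper bound.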

@jamesnemesh
Collaborator

jamesnemesh commented Jun 23, 2023 via email

@terencewtli
Author

Thanks for the fast response!

These results are coming from three different runs (across different timepoints in the experiment) - here are the cell numbers from CellRanger:

Timepoint 1: 9487
Timepoint 2: 21010
Timepoint 3: 37402

There are some issues with the mapping for timepoints 2/3: the coverage per cell is much lower in those than in timepoint 1.
I think that's what's affecting the doublet detection. Here are the plots of the number of informative UMIs and of total vs. informative UMIs. The top left panel covers all ~70k cells from all three timepoints, and the following panels are split by timepoint:

image
image

We have four individuals multiplexed in each timepoint, and the donor VCF I'm giving as input only includes those four individuals. It's array data, and we imputed SNPs with the TopMed server.

If I'm remembering correctly, we originally expected around 12,000-15,000 cells, but CellRanger ends up calling a lot more for timepoints 2/3 - would it be informative to limit the number of total cells that it can detect?

@jamesnemesh
Collaborator

jamesnemesh commented Jun 23, 2023 via email

@Angel-Wei

Hi @jamesnemesh and @terencewtli ,

Thank you for your input on this post. I wonder whether @terencewtli has any updates or a solution to this problem, as I have very similar observations. Take one of my multiplexed samples as an example (sequenced on a NovaSeq X, ~25K cells after initial QC filtering when creating a Seurat object): the cells in this sample come from 4 individuals (4 cell lines). Here's a breakdown of my procedure and how I used the AssignCellsToSamples and DetectDoublets pipelines:

  • We completed preprocessing using CellRanger. When using the tool demuxlet to demultiplex cells, we found a really high doublet rate (DBL, around 70%).
  • To get more information for troubleshooting, we followed some tips from the previous discussion, Dropulation high doublet call rate #321, and ran the AssignCellsToSamples and DetectDoublets pipelines. MAX_ERROR_RATE was set to 0.05. SAMPLE_FILE was used to restrict the donor lists. After examining the bestSample calls from both outputs, we noticed that although we still get a decent number of confident (FDR_pvalue <= 0.05) cells with donors assigned, very few cells are identified as singlets by the DetectDoublets pipeline, and even fewer are "Confident singlet cells".
    venn
    After checking the contingency table of the cells' bestSample calls from the two pipelines, we found that cells classified as "Donor1" by AssignCellsToSamples map in large proportion to doublet mixtures such as Donor1:Donor2, Donor1:Donor3 and Donor1:Donor4. The sampleOneMixtureRatio metric (assuming sampleOne refers to the first member of the mixture, i.e. Donor1) ranges from 0.2 to 0.8. We would greatly appreciate help interpreting these results!
  • I compiled the outputs from these 2 pipelines and plotted nCount_RNA, num_umis (a column in the AssignCellsToSamples output) and num_inform_umis (a column in the DetectDoublets output) for the cells retained (~25K) in the Seurat object. The rightmost panel is quite similar to what Terence observed. Taken together, do these panels indicate that the sequencing depth for this data is really low?
    hist
  • Here's a density plot of our doublet_pval, which is far from the ideal bimodal distribution. The quantiles are as follows:
    0%         25%        50%        75%        100%
    0.06006853 0.99999980 1.00000000 1.00000000 1.00000000
    density

Any suggestions/inputs on these observations are much appreciated! Thank you so much!
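The contingency table of bestSample calls described above can be sketched as follows. This is a minimal illustration, assuming the bestSample values from each output have already been loaded into dictionaries keyed by cell barcode; the barcodes and calls below are made up, and doublet calls are assumed to be written as `DonorA:DonorB` as in the description.

```python
from collections import Counter


def contingency(assign_calls, doublet_calls):
    """Count (assignment call, doublet call) pairs for cells present in both tables."""
    shared = assign_calls.keys() & doublet_calls.keys()
    return Counter((assign_calls[c], doublet_calls[c]) for c in shared)


# Hypothetical bestSample values per cell barcode from the two pipelines.
assign_calls = {"c1": "Donor1", "c2": "Donor1", "c3": "Donor2", "c4": "Donor1"}
doublet_calls = {
    "c1": "Donor1:Donor2",
    "c2": "Donor1",
    "c3": "Donor2",
    "c4": "Donor1:Donor2",
}

table = contingency(assign_calls, doublet_calls)
for (assigned, detected), n in sorted(table.items()):
    print(assigned, detected, n)
```

Rows where the assigned donor appears in many different mixtures are the ones worth cross-checking against the per-cell informative UMI counts.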

@jamesnemesh
Collaborator

jamesnemesh commented Feb 7, 2024

SAMPLE_FILE was used to restrict the donor lists.

We suggest running AssignCellsToSamples without this argument. The program should be able to figure out who is in your pool without hints. Further, if you made a mistake in your donor list and provided the wrong set of donors, the algorithm will be restricted to the wrong answers and try to maximize them anyway, resulting in miserable FDR pvalues. You should use SAMPLE_FILE for DetectDoublets, to force the program to consider those donors as possible contributors to doublets in addition to any donors discovered by AssignCellsToSamples. This covers an odd edge case where a donor is never called as the most likely for any cell but somehow is the smaller cell of a doublet. You'd have to invoke some strange situation of cells with very different sizes for this to happen, but better safe. For the most part, the programs infer this information, and you'll feel better about the calls knowing you didn't have to tell the program which donors to expect in your pool.

Your number of UMIs (nCount_RNA) seems reasonable, but translates to very few informative UMIs in the second panel (num_umis). The algorithm is starved for data for most cells, so you're not getting very good results. We typically see that something on the order of 20-40% of our total UMIs per cell have transcribed SNPs, but it looks like your conversion is far lower.
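One way to put a number on that conversion is to compute, per cell, the ratio of informative UMIs to total UMIs and compare the median with the 20-40% range mentioned above. The sketch below assumes nCount_RNA and num_umis have already been joined per cell; the values are invented for illustration.

```python
from statistics import median


def informative_fraction(cells):
    """Per-cell ratio of SNP-informative UMIs to total UMIs, skipping empty cells."""
    return [c["num_umis"] / c["nCount_RNA"] for c in cells if c["nCount_RNA"] > 0]


# Hypothetical per-cell values joined from the Seurat object and the
# AssignCellsToSamples output.
cells = [
    {"nCount_RNA": 4000, "num_umis": 120},
    {"nCount_RNA": 2500, "num_umis": 50},
    {"nCount_RNA": 6000, "num_umis": 300},
]

fractions = informative_fraction(cells)
med = median(fractions)
starved = med < 0.20  # below the lower end of the typical 20-40% range
print(f"median informative fraction: {med:.3f}, starved: {starved}")
```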

Is your VCF from whole genome sequencing? If not, is it a SNP array where you've also performed imputation? Imputation is a great way to boost the number of sites you can interrogate, but it's critical to filter to high quality sites (INFO>=0.8). Is this nuclei or whole cell data? Have you enabled intronic reads via LOCUS_FUNCTION_LIST=INTRONIC, which can have a dramatic impact on the number of UMIs when most UMIs are intronic? Cell Ranger by default uses the equivalent of that option to calculate expression, which might account for some of the difference if you aren't using it - you might be losing as much as 70-80% of your UMIs.
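As a rough illustration of the imputation-quality filter, the sketch below keeps only VCF records whose quality score in the INFO column is at least 0.8. The key name is an assumption: it is set to `R2` here, which is what Minimac-style imputation output typically uses, so check the header of your own VCF for the actual tag. Records that lack the key are dropped in this sketch, which may not be what you want for directly genotyped sites.

```python
def info_value(info_field, key):
    """Return the float value of key in a VCF INFO column, or None if absent."""
    for entry in info_field.split(";"):
        name, _, value = entry.partition("=")
        if name == key and value:
            try:
                return float(value)
            except ValueError:
                return None
    return None


def filter_vcf_lines(lines, key="R2", min_score=0.8):
    """Yield header lines unchanged and records whose score is >= min_score."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        score = info_value(fields[7], key)  # INFO is the 8th VCF column
        if score is not None and score >= min_score:
            yield line


# Tiny made-up VCF: one well-imputed site, one poor site, one without the key.
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\trs1\tA\tG\t.\tPASS\tAF=0.2;R2=0.95\n",
    "chr1\t200\trs2\tC\tT\t.\tPASS\tAF=0.1;R2=0.40\n",
    "chr1\t300\trs3\tG\tA\t.\tPASS\tAF=0.3\n",
]
kept = list(filter_vcf_lines(vcf))
print("".join(kept), end="")
```

For real data, a dedicated VCF tool is usually the better choice for files of this size; the point here is only to show what the threshold does.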

To set your expectations, here's the UMI yield from a random experiment (the same kind of plot as your nCount_RNA panel):

image

And the number of informative UMIs (should be the same as your num_umis panel):
image

@Angel-Wei

Hi James,

Thank you so much for your input! While I'm testing some changes and checking VCFs, I have a quick question regarding the --VCF_OUTPUT flag in the AssignCellsToSamples pipeline.
Initially, I used the same input VCF file for both AssignCellsToSamples and DetectDoublets. However, I noticed that in drneavin's commands in #321, AssignCellsToSamples writes another VCF file when it completes, and DetectDoublets then uses that file from --VCF_OUTPUT.
I tried incorporating this in a test run; I'm still seeing many doublets, but the number of Singlet calls increased a bit. Is it mandatory for DetectDoublets to use the --VCF_OUTPUT file, or can both pipelines use the same VCF file?

@jamesnemesh
Collaborator

AssignCellsToSamples streams the VCF file and does not store all genotypes in memory. It looks through both the BAM file and VCF, and emits only variants that are transcribed and donors that are relevant to the pool to the output VCF.

DetectDoublets needs to optimize over all genotypes, so it stores the genotype information in memory. Using the output VCF from AssignCellsToSamples as the input to DetectDoublets will save you memory and computation time.

Technically, using either the 'raw' or processed VCF as input to DetectDoublets should result in the same answer, but I haven't tested that at any length in recent years - in larger cell and pool sizes the memory gets prohibitively expensive. All of my validation work on the pipelines assumes that you funnel the outputs from Assign into Detect.
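The Assign-into-Detect hand-off can be sketched as two argument lists, where the VCF written by AssignCellsToSamples becomes the VCF read by DetectDoublets, and SAMPLE_FILE is passed only to DetectDoublets as suggested earlier in this thread. VCF_OUTPUT, SAMPLE_FILE and MAX_ERROR_RATE are named in this thread; the other argument names (INPUT_BAM, VCF, OUTPUT, SINGLE_DONOR_LIKELIHOOD_FILE), the file names and the helper functions are assumptions to check against each tool's help output. Other arguments a real run needs are omitted.

```python
def build_assign_cmd(bam, vcf, prefix, max_error_rate=0.05):
    """Argument list for AssignCellsToSamples plus the path of its reduced VCF."""
    reduced_vcf = prefix + ".assign.vcf.gz"
    argv = [
        "AssignCellsToSamples",
        "--INPUT_BAM", bam,            # assumed argument name
        "--VCF", vcf,                  # assumed argument name
        "--OUTPUT", prefix + ".assign.txt",
        "--VCF_OUTPUT", reduced_vcf,   # transcribed variants, relevant donors only
        "--MAX_ERROR_RATE", str(max_error_rate),
    ]
    return argv, reduced_vcf


def build_detect_cmd(bam, reduced_vcf, prefix, sample_file, max_error_rate=0.05):
    """Argument list for DetectDoublets, reading the reduced VCF from Assign."""
    return [
        "DetectDoublets",
        "--INPUT_BAM", bam,
        "--VCF", reduced_vcf,
        "--SINGLE_DONOR_LIKELIHOOD_FILE", prefix + ".assign.txt",  # assumed name
        "--SAMPLE_FILE", sample_file,  # donor list given to DetectDoublets only
        "--OUTPUT", prefix + ".doublets.txt",
        "--MAX_ERROR_RATE", str(max_error_rate),
    ]


# Hypothetical file names; nothing is executed here, the commands are only built.
assign_argv, reduced = build_assign_cmd("pool.bam", "donors.vcf.gz", "pool")
detect_argv = build_detect_cmd("pool.bam", reduced, "pool", "donors.txt")
print(" ".join(assign_argv))
print(" ".join(detect_argv))
```

Building the commands as lists keeps the shared paths in one place, so the reduced VCF and the likelihood file cannot drift apart between the two steps.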

@Angel-Wei

Angel-Wei commented Feb 8, 2024

Thank you for the information! I left out --VCF_OUTPUT in my initial attempts, assuming it shouldn't affect the results, but I can certainly add it back to save computing time and memory for the downstream pipeline! Thanks again!
