Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dropulation high doublet call rate #321

Open
drneavin opened this issue Oct 12, 2022 · 9 comments
Open

Dropulation high doublet call rate #321

drneavin opened this issue Oct 12, 2022 · 9 comments

Comments

@drneavin
Copy link

Hello,

I'm writing because I have recently trialed dropulation on a couple datasets and have found higher doublet rate estimations than anticipated and want to know if some of the parameters that I am using might be incorrect or if additional parameters should be set. I have run dropulation on two different datasets:

  • 74 pools of PBMCs with ~16 individuals per pool and ~20,000 cells captured per pool
  • 11 pools of fibroblasts with ~8 individuals per pool and ~10,000 cells captured per pool

I find that doublet estimation to be ~60-70% but it should be between 5 and 20%. I have also run other demultiplexing methods and find the doublet rate to be within the expected range. The commands that I'm currently running are:

AssignCellsToSamples --INPUT_BAM tagged_bam.bam --VCF Merged_MAF0.01.dose_GeneFiltered_hg38_nochr_NoAdditionalchr.vcf --OUTPUT assignments.tsv.gz --VCF_OUTPUT out_vcf.vcf --CELL_BARCODE_TAG CB --MOLECULAR_BARCODE_TAG UB --CELL_BC_FILE barcodes.tsv --SAMPLE_FILE Individuals.txt --FUNCTION_TAG XF --EDIT_DISTANCE 1 --READ_MQ 10 --GQ_THRESHOLD 30 --RETAIN_MONOMORPIC_SNPS false --FRACTION_SAMPLES_PASSING 0.5 --IGNORED_CHROMOSOMES X --IGNORED_CHROMOSOMES Y -IGNORED_CHROMOSOMES MT --ADD_MISSING_VALUES true --DNA_MODE false --SNP_LOG_RATE 1000 --GENE_NAME_TAG gn --GENE_STRAND_TAG gs --GENE_FUNCTION_TAG gf --STRAND_STRATEGY SENSE --LOCUS_FUNCTION_LIST CODING --LOCUS_FUNCTION_LIST UTR --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false

and

DetectDoublets --INPUT_BAM tagged_bam.bam --VCF out_vcf.vcf --SINGLE_DONOR_LIKELIHOOD_FILE assignments.tsv.gz --OUTPUT likelihoods.tsv.gz --CELL_BARCODE_TAG CB --MOLECULAR_BARCODE_TAG UB --CELL_BC_FILE barcodes.tsv --SAMPLE_FILE Individuals.txt --EDIT_DISTANCE 1 --READ_MQ 10 --GQ_THRESHOLD 30 --FRACTION_SAMPLES_PASSING 0.5 --IGNORED_CHROMOSOMES X --IGNORED_CHROMOSOMES Y --IGNORED_CHROMOSOMES MT --FORCED_RATIO 0.8 --TEST_UNEXPECTED_DONORS true --SCALE_LIKELIHOODS true --DNA_MODE false --GENE_NAME_TAG gn --GENE_STRAND_TAG gs --GENE_FUNCTION_TAG gf --STRAND_STRATEGY SENSE --LOCUS_FUNCTION_LIST CODING --LOCUS_FUNCTION_LIST UTR --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false

And then I consider the calls from the DetectDoublets command in the bestSample column to be the droplet calls. Please let me know if you identify anything that could be optimized with this submission or if calling the final cell types should be altered.

Thanks so much for your help!

@jamesnemesh
Copy link
Collaborator

jamesnemesh commented Oct 12, 2022 via email

@jamesnemesh
Copy link
Collaborator

I should include some lame R code here to give you a sense of how we generate a mapping from cell barcode to donor ID. Someday I should actually document this whole process with a cookbook as I've done for some of our other tools.

createDonorMap<-function (singletFile, doubletFile, outMapFile, fdrThreshold=0.05, doubletPvalueThreshold=0.9) {
    a=read.table(singletFile, header=T, stringsAsFactors = F, sep="\t")
    b=read.table(doubletFile, header=T, stringsAsFactors = F, sep="\t")
    
    confidentAssignmentCells=a[a$FDR_pvalue<=fdrThreshold,"cell"]
    singletCells=b[b$doublet_pval<doubletPvalueThreshold, "cell"]
    confidentAssignmentSingletCells=intersect(confidentAssignmentCells, singletCells)
    map=a[match(confidentAssignmentSingletCells, a$cell), c("cell", "bestSample")]
    write.table(map, outMapFile, row.names=F, col.names = T, quote=F, sep="\t")
}

@drneavin
Copy link
Author

Great, thank you @jamesnemesh! That code will help and I'll try changing the MAX_ERROR_RATE as well.

For what it's worth, I don't think that these pools have unusually high ambient RNA (< 10%). Would that be enough to impact the doublet calling?

Also, I noticed that the default ratio for doublet proportions is 0.8:0.2. I was curious about the reason for this ratio as opposed to 0.5:0.5 and am wondering if altering this would have a large impact?

@jamesnemesh
Copy link
Collaborator

jamesnemesh commented Oct 12, 2022 via email

@drneavin
Copy link
Author

Hi @jamesnemesh,

I'm happy to report that changing the MAX_ERROR_RATE to 0.05 had a dramatic impact on calling doublets! It's still slightly higher (1-2%) than some of the other methods I've tried but it's now well within the expected range (and much speedier than most other methods).

You're right, I'm comparing the different demultiplexing and doublet detecting methods - let me know if there are any additional parameters that you think I should test for dropulation.

Thanks for pointing out GenerateSyntheticDoublets, I'm definitely going to try it out. I was looking at the different parameters and wondered how it deals with barcodes that might be in different bam files when supplying a list of bams and a list of barcodes to include? I guess I'm wondering how it knows which bam file to search for the correct barcode in? Or it just surveys all of them? And what happens if the barcode is in two of the bams?

Thanks again for all your help!

@jamesnemesh
Copy link
Collaborator

Hi,

GenerateSyntheticDoublets is not that smart, so it's not going to deal well with barcode collisions. I believe what I did was pre-select barcodes from a few BAM files that didn't have collisions (for example, the top 5000 largest cells from 5 different BAMS), filter each BAM to the subset of cells via FilterBamByTag, merge them into a single BAM to test, and use GenerateSyntheticDoublets to generate an additional doublets BAM file that could then be merged into the singlets BAM.

On the speed front: because we treat each cell as an independent observation, you can run the donor assignment programs on subsets of the data. SplitBamByCell will take a BAM file and split it into a number of chunks, each of which contains the complete set of reads for the cell barcodes in the BAM. By doing that early in the process, you can run many chunks of data in parallel, which really speeds things up - we frequently use 2Gb chunks of data in our workflows.

That program's documentation is a little hard to parse the first time, so here's an example:

SplitBamByCell I=input.bam TARGET_BAM_SIZE=2G OUTPUT_LIST=output.bam_list SPLIT_TAG=CB OUTPUT=output.__SPLITNUM__.bam OVERWRITE_EXISTING=true

The other thing that can help is to use a BCF instead of a VCF file. Because I'm using the HTSJDK implementation of a BCF reader, you need to use Picard's VcfFormatConverter. That can net a significant speedup of the genotype data, which winds up being the bottleneck once you have smaller BAM files.

@jamesnemesh
Copy link
Collaborator

jamesnemesh commented Oct 20, 2022

@drneavin If you don't mind, I had a few more thoughts on additional bake off scenarios you might consider.

  1. Related individuals: How well does donor assignment work when you have related individuals or full trios in your data set: That is, mother/father/one or more children.

Our programs work just as well on related individuals and unrelated individuals, including full trios.

  1. Sample swap issues - If your lab is generating relatively small pools (~20 donors) but you're working with hundreds or even thousands of samples, what happens when someone grabs the wrong tube and an unexpected donor is introduced into the pool? Then what is the behavior if you do have genotype information, or you don't have information for the sample (IE: are they in the VCF?)

We run AssignCellsToSamples without a SAMPLE_FILE. Because our program runs fairly quickly, we can use a VCF with 1000+ individuals (and pool sizes in excess of 100 individuals) easily. When a donor is introduced that is in the VCF, we'll identify that donor properly for the set of all possible donors. When the donor introduced is not present in the VCF (a cryptic donor), the single donor assignment qualities will be poor, and the cells will all be flagged as mixtures of two known donors. This makes sense, as a combination of two random donors would better explain a random set of alleles than a single individual.

It's possible Vireo would do really well with cryptic donors as it has fewer priors - perhaps you want to keep the cells from that donor and know that they all come from one true (but unknown) source instead of being thrown away.

We've encountered plenty of plate swap and cryptic donor issues, especially when working with 3rd party sources of samples, so this sort of utility is incredibly important for a production environment.

@drneavin
Copy link
Author

Thanks @jamesnemesh! I think that the first scenario will be hard for us to test with our current dataset but I think there may be some publicly available that we might try.

The second question is a good one and not one I've done a comparison with but I think some of the methods will clearly just call the missing individuals as doublets but depends on how the parameters are set. I'll have to think about the best way to compare this across methods

Also, I was going to test out the GenerateSyntheticDoublets script but can't seem to find the executable available in Drop-seq_tools-2.5.1. Is the executable something I should be able to generate or somewhere else I could find it? If I can generate it, could you give me the general commands to do it (sorry, it's not something I've done before so I'm not as familiar with it)

@jamesnemesh
Copy link
Collaborator

For the first scenario, if you have a VCF with more samples in it than the set in your pool, you can run on the entire VCF. I imagine it's hard to get the wrong donor assignment and see where your algorithm is making mistakes if you only allow the algorithm to pick valid choices.

As for GenerateSyntheticDoublets - looks like we’re not exposing a command line stand-alone for it, but it can be accessed from the jar:

java -jar Drop-seq_tools-2.5.1/jar/dropseq.jar GenerateSyntheticDoublets
USAGE: GenerateSyntheticDoublets [arguments]

Takes a fraction of the cell barcodes and merges the barcode labels and reads of those cells into synthetic doublets. 
For example if cell A and B wereselected to be a doublet, then a new cell called A_B would be generated that contained
all the reads from the A and B cell barcodes.
Version:2.5.1(680c2ea_1642084299)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants