isContaminant with batch option #130

Open
Martinique-F opened this issue Jun 28, 2023 · 2 comments

Comments

@Martinique-F

Hello everybody,
I have 10 MiSeq runs, each with 5 negative controls and 30 samples. I want to analyze them all together with QIIME 2, but the decontam step has to be done for each run individually. I therefore tried the "batch" option of the "isContaminant" command. However, the resulting table, which reports whether each ASV is classified as a contaminant, no longer distinguishes between the runs and sums the prevalence of each ASV across the runs.
For example:
I have the runs R1 and R2, which were both analyzed with QIIME 2 and resulted in the phyloseq object physeq1. I first created a data frame df listing the sample names and run IDs:

>head(df)
  sample      batch
  sample1     R1
  sample2     R1
  sample3     R2
  sample4     R2

I created a named vector V1 like this:

V1 <- df$batch
V1 <- setNames(V1, df$sample)

Then I ran the following commands:

sample_data(physeq1)$is.neg <- sample_data(physeq1)$Sample_or_Control == "Control Sample"
contam <- isContaminant(physeq1, method="prevalence", neg="is.neg", threshold=0.5, batch = V1)

The result "contam" looks like this:

>head(contam)
                                        freq prev p.freq    p.prev         p contaminant
6f3f68e5c8e2a11b388ddbbea9fa182d 0.052955868   49     NA 0.4278075 0.4278075        TRUE
3ebe761bfb1238c87195d431f41bf976 0.010071447   29     NA 0.6080978 0.6080978       FALSE
e8386d3a307c208c4b9f0a756259cd6b 0.006279860   11     NA 0.6898396 0.6898396       FALSE
6de4a253e36f3d4e6a2e3acb26a0c030 0.022417636   45     NA 0.9721925 0.9721925       FALSE
ad4cba5280fb47cbebea01e7031af61c 0.007089322   28     NA 0.8013751 0.8013751       FALSE
651d9a773d5fc2e2b20411b9d7c28e0b 0.018931103   50     NA 0.9908327 0.9908327       FALSE

So, the batch information isn't included anymore. I checked one ASV that occurs in both R1 and R2: in the "contam" table its prevalence is summed across the runs, unlike when I run the analysis separately for each run. The problem is that this ASV is flagged as a contaminant in R1 but not in R2. So it shouldn't be deleted from all runs, only from the samples that belong to R1.
I wasn't able to find more details about the "batch" option or how I could modify it so that I can run "prune_taxa" per run in the end.

I hope I was able to explain my problem and that someone has a solution to that.
Thank you in advance!
Martinique

@benjjneb
Owner

> So, the batch information isn't included anymore. I checked for one ASV that is included in R1 and R2 and the prevalence of this ASV is summed up in the "contam" table in comparison to when I run the analysis separately for each run. The problem is, that in R1 this ASV is detected as contamination, in R2 it is not. So it shouldn't be deleted from all runs, only from samples that are included in R1.

If this is the functionality you want, you will have to run decontam and ASV removal per batch "by hand", as you outlined here. There is no automated way to perform per-batch decontamination in the package; you'll need to do some simple R looping.
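For readers who want to try the per-run loop, here is a minimal sketch, not something the package provides. It assumes a plain samples x ASVs count matrix (from a phyloseq object this can be obtained with `as(otu_table(physeq1), "matrix")`, transposed if `taxa_are_rows(physeq1)` is TRUE), a logical vector marking the negative controls, and a batch vector in the same sample order. The names `decontam_per_batch` and `classify` are hypothetical; the default classifier simply wraps the same prevalence call used earlier in this thread.

```r
# Sketch: per-run contaminant removal on a samples x ASVs count matrix.
# `decontam_per_batch` and `classify` are hypothetical names, not part of decontam.
decontam_per_batch <- function(counts, is_neg, batch,
                               classify = function(m, neg) {
                                 # Default: decontam's prevalence method, run on one batch only.
                                 decontam::isContaminant(m, neg = neg,
                                                         method = "prevalence",
                                                         threshold = 0.5)$contaminant
                               }) {
  stopifnot(nrow(counts) == length(is_neg), nrow(counts) == length(batch))
  batch <- as.character(batch)
  cleaned <- counts
  flagged <- list()
  for (b in unique(batch)) {
    rows <- which(batch == b)
    # Only classify ASVs that were actually observed in this run.
    present <- colSums(counts[rows, , drop = FALSE]) > 0
    is_contam <- classify(counts[rows, present, drop = FALSE], is_neg[rows])
    is_contam[is.na(is_contam)] <- FALSE  # treat undetermined calls as "keep"
    hits <- colnames(counts)[present][is_contam]
    if (length(hits) > 0) {
      # Zero the flagged ASVs in this run's samples only; other runs are untouched.
      cleaned[rows, hits] <- 0
    }
    flagged[[b]] <- hits
  }
  list(counts = cleaned, contaminants = flagged)
}
```

A call such as `res <- decontam_per_batch(counts, is_neg, batch)` would return the masked matrix in `res$counts` and the per-run contaminant IDs in `res$contaminants`. Zeroing counts within a run, rather than dropping the ASV column, is one way to keep a single merged table while still treating each run separately.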

That said, I would probably not do this. If a contaminant is identified and removed in one batch, it should typically be removed from all batches, so as to keep a consistent set of non-contaminant ASVs in the potentially detected universe across all your batches. That is, treating contaminants consistently across batches is usually preferable to handling them on a per-batch basis.
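The consistent treatment described here can be sketched in a few lines of base R: once an ASV has been flagged in any run, it is dropped from the whole table. This is only an illustration under stated assumptions. The helper name `drop_flagged_everywhere`, the toy count matrix, and the per-run contaminant lists are all made up for the example and are not part of decontam.

```r
# Sketch: remove an ASV from every run if it was flagged in any run.
# `drop_flagged_everywhere` is a hypothetical helper, not part of decontam.
drop_flagged_everywhere <- function(counts, flagged_by_run) {
  flagged <- unique(unlist(flagged_by_run, use.names = FALSE))
  unknown <- setdiff(flagged, colnames(counts))
  if (length(unknown) > 0) {
    stop("Unknown ASV IDs: ", paste(unknown, collapse = ", "))
  }
  # Keep only the columns (ASVs) that were never flagged.
  counts[, !(colnames(counts) %in% flagged), drop = FALSE]
}

# Toy samples x ASVs matrix and made-up per-run contaminant calls.
counts <- matrix(c(0, 0, 5, 7, 8, 0,
                   3, 4, 0, 2, 1, 0,
                   0, 0, 0, 0, 0, 9),
                 nrow = 6,
                 dimnames = list(c("s1", "s2", "n1", "s4", "s5", "n2"),
                                 c("asvA", "asvB", "asvC")))
flagged_by_run <- list(R1 = "asvA", R2 = "asvC")

shared <- drop_flagged_everywhere(counts, flagged_by_run)
```

With a phyloseq object, the equivalent step would be a single `prune_taxa()` call on the merged object using the union of flagged ASV IDs, so every run ends up with the same set of retained ASVs.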

@Martinique-F
Author

Thank you for your answer.
I'm working with low-template samples, which are very likely to be highly diverse in their microbial communities. So we assume that what is a contaminant in one run might actually belong to the microbial community in another run. That's why it's important for us to run this analysis separately for each run. But it's good to know that the "batch" option isn't meant to achieve this, so thank you!
