An open-source R package containing decontamination pipelines for low-biomass microbiome data

rachelgriffard/micRoclean

micRoclean: Decontamination for low-biomass microbiome data

micRoclean contains two pipelines for decontaminating low-biomass microbiome data.

For questions on installation or usage, please submit an issue or discussion via GitHub.

Please download the vignette file in this repository for a detailed walkthrough of the package's functionality.

Installation

To install the micRoclean package, users should use the install_github function from the devtools package. The full command is as follows:

devtools::install_github("rachelgriffard/micRoclean")
library(micRoclean)

The latest micRoclean release is available for download from the repository.

Usage

micRoclean input

  1. Count matrix - A samples (n) by features (p) matrix
head(counts)
         taxa_1 taxa_2 taxa_3 taxa_4
Sample_1      0      5      0     20
Sample_2     15      5      0      0
Sample_3      0     13      0    200
Sample_4      4      5      0      0
Sample_5      0      1      6      0
Sample_6      6     14     21      2
  2. Metadata - A metadata matrix with samples (n) as rows and two required columns, is_control and sample_type. Two optional columns, batch and sample_well, can also be included. The metadata column names must match those shown below exactly.
  • is_control - Boolean variable where TRUE indicates an extraction negative control sample and FALSE otherwise.
  • sample_type - Sample types named by string, indicating which samples should be read together.
  • (optional) batch - String indicating batch.
  • (optional) sample_well - String that indicates the well location of the sample. Must be in LETTER-NUMBER format, e.g. A1.
head(metadata)
         is_control            sample_type batch sample_well
Sample_1      FALSE                 plasma     A          A2
Sample_2      FALSE                 plasma     B          A4
Sample_3       TRUE DNA extraction control     B          B3
Sample_4      FALSE                 plasma     A          B1
Sample_5       TRUE DNA extraction control     B          B4
Sample_6      FALSE                 plasma     B         C12
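
The two inputs above can be assembled in base R as follows. This is a minimal sketch that reuses the example values shown. Storing the metadata as a data frame, and the consistency checks at the end, are assumptions made for illustration rather than documented requirements of the package.

```r
# Count matrix: samples (rows) by features (columns), values from the example above
counts <- matrix(
  c( 0,  5,  0,  20,
    15,  5,  0,   0,
     0, 13,  0, 200,
     4,  5,  0,   0,
     0,  1,  6,   0,
     6, 14, 21,   2),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("Sample_", 1:6), paste0("taxa_", 1:4))
)

# Metadata: one row per sample, with the same row names as the count matrix
metadata <- data.frame(
  is_control  = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  sample_type = c("plasma", "plasma", "DNA extraction control",
                  "plasma", "DNA extraction control", "plasma"),
  batch       = c("A", "B", "B", "A", "B", "B"),
  sample_well = c("A2", "A4", "B3", "B1", "B4", "C12"),
  row.names   = rownames(counts),
  stringsAsFactors = FALSE
)

# Illustrative consistency checks before handing the data to a pipeline
stopifnot(
  identical(rownames(counts), rownames(metadata)),
  all(c("is_control", "sample_type") %in% colnames(metadata)),
  is.logical(metadata$is_control)
)
```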

Pipeline 1

This pipeline should be used when the user:

  1. Has sample well information available
  2. Wants to primarily characterize the original composition of the sample prior to contamination
  3. Has only one batch OR has multiple batches with controls in each batch

Furthermore, users must have control samples present in each batch for this method to be used.
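
The requirements above can be checked before running the pipeline. The sketch below uses a hypothetical helper, check_pipeline1_inputs, which is not part of micRoclean; it assumes the metadata column names described in the input section and only tests the well format and the presence of a control in every batch.

```r
# Hypothetical helper (not part of micRoclean): report problems that would
# conflict with the Pipeline 1 requirements listed above.
check_pipeline1_inputs <- function(meta) {
  problems <- character(0)

  # Well locations must be present and look like LETTER-NUMBER, e.g. "A1" or "C12"
  if (!"sample_well" %in% colnames(meta)) {
    problems <- c(problems, "sample_well column is missing")
  } else if (!all(grepl("^[A-Z][0-9]+$", meta$sample_well))) {
    problems <- c(problems, "sample_well values must be in LETTER-NUMBER format")
  }

  # Every batch needs at least one control sample
  if ("batch" %in% colnames(meta)) {
    has_control <- tapply(meta$is_control, meta$batch, any)
    missing <- names(has_control)[!has_control]
    if (length(missing) > 0) {
      problems <- c(problems,
                    paste("no control sample in batch:",
                          paste(missing, collapse = ", ")))
    }
  }
  problems
}

# Same is_control, batch, and sample_well values as the example metadata
meta_example <- data.frame(
  is_control  = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  batch       = c("A", "B", "B", "A", "B", "B"),
  sample_well = c("A2", "A4", "B3", "B1", "B4", "C12")
)
check_pipeline1_inputs(meta_example)
```

With the example metadata, both controls sit in batch B, so the helper reports batch A as lacking a control sample. An empty character vector means no problems were found.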

This pipeline implements the SCRuB method for decontamination (Austin et al., 2023). To run this pipeline, the user can input their data as such:

pipeline_1_results = pipeline1(counts = counts,
                               meta = metadata)

Once run, the pipeline will return a list object with:

  1. Decontaminated counts matrix (decontaminated_count) - A samples (n) by features (p) matrix with the decontaminated counts
  2. Filtering loss value (FL) - A numeric value between 0 and 1

Pipeline 2

This pipeline should be used when the user:

  1. Wants to primarily identify potential biomarkers
  2. Does not have sample well information available

Pipeline 2 contains multiple steps to identify potential contaminants.

[Figure: Pipeline 2 workflow]

To run this pipeline, the user can input their data as such:

pipeline_2_results = pipeline2(counts = counts,
                               meta = metadata,
                               blocklist = bl,
                               technical_replicates = tr,
                               remove_if = 1, #optional
                               step2_threshold = 0.5) #optional

Where:

  • blocklist - Character vector of taxa names to treat as known contaminants (see the default list below).
  • technical_replicates - Data frame indicating pairs of technical replicates across batches by sample name. See example below where Sample 1 and Sample 6 are technical replicates.
batch_1   batch_2
Sample_1  Sample_6
Sample_2  Sample_4
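
The replicate table above can be built as a data frame with one pair per row. In this minimal sketch, the sample_names vector stands in for rownames(counts), and the final check is an illustration rather than a documented requirement.

```r
# Pairs of technical replicates: one pair per row, one column per batch
tr <- data.frame(
  batch_1 = c("Sample_1", "Sample_2"),
  batch_2 = c("Sample_6", "Sample_4"),
  stringsAsFactors = FALSE
)

# Illustrative check: every replicate name should be a sample in the data
sample_names <- paste0("Sample_", 1:6)   # stands in for rownames(counts)
unknown <- setdiff(unlist(tr), sample_names)
stopifnot(length(unknown) == 0)
```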

Once run, the pipeline will return a list object with:

  1. Decontaminated counts matrix (decontaminated_count) - A samples (n) by non-contaminant features (p - c) matrix with the decontaminated counts
  2. Filtering loss value (FL) - A numeric value between 0 and 1
  3. Contaminant ID (contaminant_id) - Data frame of features (p) by removal steps, with a Boolean value that is TRUE if the feature was tagged as a contaminant in that step and FALSE otherwise.
  4. Removed (removed) - Character vector of all features tagged as contaminants and removed from the decontaminated count matrix

For convenience, the default blocklist from Eisenhofer et al. (2019) is included below and can be copied, if desired:

bl = c('Actinomyces','Corynebacterium','Arthrobacter',
       'Rothia','Propionibacterium','Atopobium',
       'Sediminibacterium','Porphyromonas','Prevotella',
       'Chryseobacterium','Capnocytophaga','Chryseobacterium',
       'Flavobacterium','Pedobacter','UnclassifiedTM7',
       'Bacillus','Geobacillus','Brevibacillus','Paenibacillus',
       'Staphylococcus','Abiotrophia','Granulicatella',
       'Enterococcus','Lactobacillus','Streptococcus',
       'Clostridium','Coprococcus','Anaerococcus','Dialister','Megasphaera',
       'Veillonella','Fusobacterium','Leptotrichia','Brevundimonas','Afipia',
       'Bradyrhizobium','Devosia','Methylobacterium','Mesorhizobium','Phyllobacterium',
       'Rhizobium','Methylobacterium','Phyllobacterium','Roseomonas','Novosphingobium',
       'Sphingobium','Sphingomonas','Achromobacter','Burkholderia','Acidovorax',
       'Comamonas','Curvibacter','Pelomonas','Cupriavidus','Duganella',
       'Herbaspirillum','Janthinobacterium','Massilia','Oxalobacter','Ralstonia',
       'Leptothrix','Kingella','Neisseria','Escherichia','Haemophilus',
       'Acinetobacter','Enhydrobacter','Pseudomonas','Stenotrophomonas','Xanthomonas')
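
If blocklist entries are compared against feature names as plain strings, stray whitespace and repeated entries are worth cleaning up before use. The sketch below assumes genus-level feature names; both vectors are shortened stand-ins, and how pipeline2 matches names internally is not described here.

```r
# Shortened stand-in for a blocklist, with a duplicate and trailing whitespace
bl_raw <- c("Ralstonia", "Pseudomonas", "Novosphingobium\t", "Ralstonia")

# Trim surrounding whitespace and drop repeated entries
bl_clean <- unique(trimws(bl_raw))

# Hypothetical genus-level feature names, standing in for colnames(counts)
features <- c("Ralstonia", "Bacteroides", "Novosphingobium", "Faecalibacterium")

# Features that appear on the cleaned blocklist
intersect(features, bl_clean)
```

Here the match returns Ralstonia and Novosphingobium; without the trimws step, the entry with the trailing tab would not match.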

Optionally, users can pass the results list object from pipeline2 to the visualize_pipeline function. If interactive is set to TRUE, the resulting visualization is interactive.

visualize_pipeline(pipeline_2_results,
                   interactive = FALSE)

Filtering loss (FL)

First introduced in the PERFect filtering method by Smirnova, Huzurbazar, and Jafari (2019), the filtering loss (FL) statistic is implemented in the micRoclean package to quantify the impact of removing features in the pipelines above. The filtering loss value lies between zero and one, indicating a low to high contribution, respectively, of the removed reads to the total covariance structure. As the filtering loss value approaches one, users should be concerned about potential overfiltering.

Filtering loss for removal of reads $J$ is defined as

$$FL(J) = 1 - \frac{||Y^T Y||_F^2}{||X^TX||_F^2}$$

where $X$ is the unfiltered count matrix, $Y$ is the filtered count matrix, and $||\cdot||_F$ denotes the Frobenius norm, so the fraction is the ratio of the filtered to the unfiltered covariance magnitude.

For more detailed information, see Section 2.1 of the methods in Smirnova, Huzurbazar, and Jafari (2019).
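
The formula above can be evaluated directly in base R. This is a minimal sketch, not the package's implementation: filtering_loss is a hypothetical helper that applies the formula as written to an unfiltered matrix X and a filtered matrix Y, with no centering or scaling.

```r
# Hypothetical helper (not the package's implementation): filtering loss
# computed directly from the formula above.
filtering_loss <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  # crossprod(M) is t(M) %*% M; the squared Frobenius norm is the sum of
  # its squared entries
  1 - sum(crossprod(Y)^2) / sum(crossprod(X)^2)
}

# Toy unfiltered matrix: two samples by two features
X <- matrix(c(1, 0,
              0, 2),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("s1", "s2"), c("taxa_1", "taxa_2")))

# Remove taxa_2: t(X) %*% X is diag(1, 4) and t(Y) %*% Y is 1,
# so FL = 1 - 1/17
Y <- X[, "taxa_1", drop = FALSE]
filtering_loss(X, Y)

# Removing nothing gives a loss of 0
filtering_loss(X, X)
```

In this toy case, dropping the more abundant feature removes most of the covariance structure, so the loss is close to one (16/17).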

References

Austin, G. I., Park, H., Meydan, Y., Seeram, D., Sezin, T., Lou, Y. C., Firek, B. A., Morowitz, M. J., Banfield, J. F., Christiano, A. M., Pe'er, I., Uhlemann, A. C., Shenhav, L., & Korem, T. (2023). Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Nature biotechnology, 41(12), 1820–1828. https://doi.org/10.1038/s41587-023-01696-w

Davis, N. M., Proctor, D. M., Holmes, S. P., et al. (2018). Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome, 6, 226. https://doi.org/10.1186/s40168-018-0605-2

Smirnova, E., Huzurbazar, S., & Jafari, F. (2019). PERFect: PERmutation Filtering test for microbiome data. Biostatistics (Oxford, England), 20(4), 615–631. https://doi.org/10.1093/biostatistics/kxy020

Zozaya-Valdés, E., Wong, S. Q., Raleigh, J., Hatzimihalis, A., Ftouni, S., Papenfuss, A. T., Sandhu, S., Dawson, M. A., & Dawson, S. J. (2021). Detection of cell-free microbial DNA using a contaminant-controlled analysis framework. Genome biology, 22(1), 187. https://doi.org/10.1186/s13059-021-02401-3
