Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ggpicrust2(): Error in metadata[, matching_columns]: #64

Open
WillErickson1 opened this issue Oct 20, 2023 · 10 comments
Open

ggpicrust2(): Error in metadata[, matching_columns]: #64

WillErickson1 opened this issue Oct 20, 2023 · 10 comments

Comments

@WillErickson1
Copy link

WillErickson1 commented Oct 20, 2023

I runned this script:
library(readr)
library(ggpicrust2)
library(tibble)
library(tidyverse)
library(ggprism)
library(patchwork)

Load necessary data: abundance data and metadata

abundance_file <- "pred_metagenome_unstrat.tsv"
metadata_2 <- read_delim(
"metadata.txt",
delim = "\t",
escape_double = FALSE,
trim_ws = TRUE
)
rm(metadata_2)

Run ggpicrust2 with imported data.frame

abundance_data <- read_delim(abundance_file, delim = "\t", col_names = TRUE, trim_ws = TRUE)

Run ggpicrust2 with input data

results_data_input <- ggpicrust2(data = abundance_data,
metadata = metadata,
group = "Traitement", # For example dataset, group = "Environment"
pathway = "KO",
daa_method = "LinDA",
ko_to_kegg = TRUE,
order = "pathway_class",
p_values_bar = TRUE,
x_lab = "pathway_name")

but I got this error : Starting the ggpicrust2 analysis...
Converting KO to KEGG...
Processing provided data frame...
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|============================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 10.02 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.
Performing pathway differential abundance analysis...
Sample names extracted.
Identifying matching columns in metadata...
Matching columns identified: NA . This is important for ensuring data consistency.
Using all columns in abundance.
Converting abundance to a matrix...
Reordering metadata...
Error in metadata[, matching_columns]:
! Can't subset columns with matching_columns.
✖ Subscript matching_columns can't contain missing values.
✖ It has a missing value at location 1.
Run rlang::last_trace() to see where the error occurred.

this is how my input files looks like:
metadata_2:
imageabundance_data
image
PS: I executed this example that u shared and it has worked

data(ko_abundance)
data(metadata)
results_file_input <- ggpicrust2(data = ko_abundance,
metadata = metadata,
group = "Environment",
pathway = "KO",
daa_method = "LinDA",
ko_to_kegg = TRUE,
order = "pathway_class",
p_values_bar = TRUE,
x_lab = "pathway_name")

@cafferychen777
Copy link
Owner

Hi @WillErickson1,

Thank you for providing a detailed overview of your issue. Based on the error you encountered and the provided data structure, I suggest the following:

  1. Metadata Modification: Try removing the second and third columns from your metadata file. It appears there might be discrepancies between your dataset and the expected format for the ggpicrust2 function, and this step might help align your data more closely with what the function anticipates.

  2. Sample Size Concern: I'd like to point out a statistical consideration. With only 3 samples in your dataset, the statistical power and interpretability of the results may be limited. Even if you obtain a visualization, deriving a statistically meaningful interpretation could be challenging. It's crucial to ensure that your results can be supported by a robust statistical analysis. If possible, consider increasing your sample size to bolster the validity of your conclusions.

Please give the above suggestion a try and let me know if it resolves the issue or if you have any other questions!

Best regards,
Chen YANG

@WillErickson1
Copy link
Author

thank you for responding so fast
I runned this script now : # Load necessary data: abundance data and metadata
abundance_file <- "pred_metagenome_unstrat.tsv"
metadata_2 <- read_delim(
"metadata.txt",
delim = "\t",
escape_double = FALSE,
trim_ws = TRUE
)
rm(metadata_2)

Run ggpicrust2 with imported data.frame

abundance_data <- read_delim(abundance_file, delim = "\t", col_names = TRUE, trim_ws = TRUE)

metadata_2 <- metadata_2[, -c(2, 3)]

Run ggpicrust2 with input data

results_data_input <- ggpicrust2(data = abundance_data,
metadata = metadata_2,
group = "Traitement", # For example dataset, group = "Environment"
pathway = "KO",
daa_method = "LinDA",
ko_to_kegg = TRUE,
order = "pathway_class",
p_values_bar = TRUE,
x_lab = "pathway_name")

with input file :
image

I have this error now :
Starting the ggpicrust2 analysis...
Converting KO to KEGG...
Processing provided data frame...
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|============================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 14.2 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.
Performing pathway differential abundance analysis...
Sample names extracted.
Identifying matching columns in metadata...
Matching columns identified: sample-id . This is important for ensuring data consistency.
Using all columns in abundance.
Converting abundance to a matrix...
Reordering metadata...
Converting metadata to a matrix and data frame...
Extracting group information...
Running LinDA analysis...
Performing LinDA analysis...
0 features are filtered!
The filtered data has 3 samples and 197 features will be tested!
Fit linear models ...
Completed.
Processing LinDA results...
LinDA analysis is complete.
Success: Found 1 statistically significant biomarker(s) in the dataset.
Annotating pathways...
Starting pathway annotation...
DAA results data frame is not null. Proceeding...
KO to KEGG is set to TRUE. Proceeding with KEGG pathway annotations...
We are connecting to the KEGG database to get the latest results, please wait patiently.

Processing pathways individually...

| | 0%
Beginning annotation for pathway 1 of 1...

Annotated pathway 1 of 1. Time taken: 2.24 seconds.

|============================================================================| 100%

Pathway annotation completed.

Returning DAA results filtered annotation data frame...
Creating pathway error bar plots...
Error in $<-.data.frame(*tmp*, "group", value = c(2L, 2L, 1L)) :
le tableau de remplacement a 3 lignes, le tableau remplacé en a 1
In addition: Warning message:
In cbind(sample = colnames(sub_relative_abundance_mat), group = Group, :
number of rows of result is not a multiple of vector length (arg 2)

@cafferychen777
Copy link
Owner

Hi @WillErickson1,

Thank you for sharing the details.

I see that you've run the script to load the necessary abundance data and metadata. Following that, you have run the ggpicrust2 function with the provided input.

From the provided error:

Error in `$<-.data.frame`(`*tmp*`, "group", value = c(2L, 2L, 1L)) : 
  replacement has 3 rows, data has 1
In addition: Warning message:
In cbind(sample = colnames(sub_relative_abundance_mat), group = Group,  :
  number of rows of result is not a multiple of vector length (arg 2)

It seems like there's an inconsistency with the data dimensions. This might be due to only one statistically significant result being found. Such a result can possibly affect the visualization, making it hard for the visualization to proceed.

To troubleshoot this:

  1. You can follow the workflow provided here to obtain intermediate results. This might provide more insight into the issue.
  2. Alternatively, if you're comfortable, you can send me the dataset. I'll be able to take a closer look and help identify the root cause.
  3. However, I suspect the underlying issue might be due to a small sample size, which might be causing this error.

Let me know how you'd like to proceed or if there's any other way I can assist.

Best regards,

Chen YANG

@WillErickson1
Copy link
Author

thanks
what's the difference between ko_abundance and kegg_abundance
kegg_abundance <- ko2kegg_abundance("path/to/your/pred_metagenome_unstrat.tsv")
ko_abundance <- read.delim("path/to/your/pred_metagenome_unstrat.tsv")
what does these 2 inputs refers to in picrust2 pipeline : https://github.com/picrust/picrust2/wiki/Full-pipeline-script

@cafferychen777
Copy link
Owner

Hi @WillErickson1 ,

The difference between ko_abundance and kegg_abundance lies in their origins and how they represent the biological data.

  1. KO (Kegg Orthology) Abundance: ko_abundance refers to the abundance of specific gene orthologs in your sample. In the context of your script, it's being directly loaded from a .tsv file:

    ko_abundance <- read.delim("path/to/your/pred_metagenome_unstrat.tsv")
  2. KEGG Abundance: kegg_abundance, on the other hand, is derived from the KO abundances by converting them into pathway abundances. This means that kegg_abundance represents the predicted abundance of entire metabolic pathways, which are composed of multiple KO groups. Your script uses the ko2kegg_abundance function to achieve this conversion:

    kegg_abundance <- ko2kegg_abundance("path/to/your/pred_metagenome_unstrat.tsv")

In the context of the PICRUSt2 pipeline:

  • ko_abundance is a direct result from the prediction step. It gives you the abundance of different gene orthologs in your microbial community.

  • kegg_abundance is more of a downstream analysis that allows you to understand the functional potential of your microbial community at the pathway level. This is more interpretable because it groups gene orthologs (KOs) into broader functional categories or pathways.

I hope this clarifies the difference between the two. Let me know if you have any further questions or if there's anything else I can assist with.

Best regards,

Chen YANG

@WillErickson1
Copy link
Author

Sorry I'm new in this field. Could you tell me exactly what they refers to. this is the architecture of output folder of picrust2 pipeline
image
-I tested the pipeline on the dataset suggested by picrust2 in their tutorial and I used edgeR as daa_method. and it has worked my wondering why it didn't work with limma or deseq2 ?

  • what's the most suitable input when performing downstream analysis in other words should I use ko_abundance or kegg_abundance to generate pathway_errorbar() , pathway_heatmap() and pathway_pca().
  • one more if you don't mind : what's the interest of performing the same analysis but on metacyc_abundance

@cafferychen777
Copy link
Owner

Hello @WillErickson1,

  1. Regarding "exactly what they refers to":
    The reference is to the pred_metagenome_unstrat.tsv.gz file within the KO_metagenome_out folder.

  2. On why edgeR works but limma and deseq2 don't:
    When testing with the tutorial dataset provided by picrust2, edgeR, when used as the differential abundance analysis method (daa_method), functions as expected. However, limma and deseq2 do not. There could be various reasons for this, including issues with the format of the input data, software dependencies, or compatibility between versions.

  3. About the most appropriate input for downstream analysis:
    When moving to downstream analysis, the decision between using ko_abundance or kegg_abundance to generate pathway_errorbar(), pathway_heatmap(), and pathway_pca() depends on the specific objectives of your analysis. The term ko_abundance refers to the abundance of KEGG Orthology (KO), which represents genes or gene clusters. On the other hand, kegg_abundance may relate to specific KEGG pathways.

    • If you're interested in differences of specific genes or gene functions, then ko_abundance might be more relevant.
    • However, if you're looking to understand variations in overall metabolic pathways, then kegg_abundance could be the better choice.
  4. Regarding the interest in analyzing metacyc_abundance:
    MetaCyc is a database of non-redundant, experimentally validated metabolic pathways, distinct from KEGG. Analyzing metacyc_abundance offers an alternative perspective on the metabolic pathway activity in the samples. By analyzing both KEGG and MetaCyc abundances simultaneously, you can achieve a more comprehensive and multi-layered biological interpretation. For further clarity on the second part of the tutorial, you might want to refer to this section on the ggpicrust2 GitHub page.

Best regards,
Chen YANG

@WillErickson1
Copy link
Author

WillErickson1 commented Oct 23, 2023

@cafferychen777 thank you for your time

Where can I find MetaCyc abundance results in the PICRUSt2 output folder ? does it in pathways_out ?
ko abundance is used only for metacyc data that's why we set ko_to_kegg to FALSE ?? because when I tried to run this script
results_file_input <- ggpicrust2(data = ko_abundance,
metadata = metadata,
group = "Environment",
pathway = "KO",
daa_method = "edgeR",
ko_to_kegg = FALSE,
order = "pathway_class",
p_values_bar = TRUE,
x_lab = "pathway_name")
it didn't work
-pathway_pca and heatmap_heatmap are only performed on metacyc data ? we can perform them on ko abundance ? if it's the case how I should modify in the code ?

  • Does each daa_method have it's own statistical test ?

@cafferychen777
Copy link
Owner

Hello @WillErickson1 ,

  1. MetaCyc Abundance Results Location:

    • The MetaCyc abundance results are indeed located in the pathways_out folder in the PICRUSt2 output directory.
  2. KO Abundance and ko_to_kegg Setting:

    • The statement "ko abundance is used only for metacyc data that's why we set ko_to_kegg to FALSE??" might be a bit confusing. The ko_to_kegg parameter is likely a setting to determine whether to convert KO identifiers to KEGG identifiers. It's set to FALSE when you are working with MetaCyc data since MetaCyc and KEGG are different databases with different identifiers. Thus, KO (KEGG Orthology) to KEGG conversion setting doesn't pertain when working with MetaCyc data.
  3. pathway_pca and heatmap_heatmap Execution:

    • It is possible to perform pathway_pca and heatmap_heatmap on KO abundance data. While these analyses may typically be performed on MetaCyc data, they can also be executed on KO abundance data. Some users in the issue section have shared their code on how to do this, and you can refer to those examples to understand how to modify your code accordingly.
  4. Statistical Tests in daa_method:

    • Yes, each daa_method specifies a different statistical analysis method. You should refer to the respective papers or documentation for each method to understand the statistical tests being performed.

Best regards,

Chen YANG

@cafferychen777 cafferychen777 changed the title Error in metadata[, matching_columns]: ggpicrust2(): Error in metadata[, matching_columns]: Oct 23, 2023
@MaryaSkopina
Copy link

Hello!

Thank you for the opportunity to get the visual results of Picrust!
I have a similar problem with the one described above. I've done a step-by-step workflow :
library(reader)
library(ggpicrust2)
library (Bible)
library (neat verse)
library(ggprism)
library(patchwork)

kegg_abundance <- ko2kegg_abundance("E:/Озера метаданные/R/Picrust/pred_metagenome_unstrat.tsv")
Loading data from file...
Rows: 7066 Columns: 61
── Column specification ────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): function
dbl (60): BEl1, BEl2, BEl3, Ber1, Ber2, Ber3, Bez1, Bez2, Bez3, BKa1, BKa2, BKa3, BKu1, BKu2, BKu3, Che1...

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|==================================================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 1.55 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.

metadata <- read.csv("D:/CCA/forCCAchem.csv", header=TRUE, sep = ";")
str(metadata)
'data.frame': 60 obs. of 7 variables:
$ sample.id: chr "Kuch1" "Kuch2" "Kuch3" "Che1" ...
$ location : chr "pelag" "litor" "zarosli" "pelag" ...
$ area : chr "niznetavd" "niznetavd" "niznetavd" "niznetavd" ...
$ climZone : chr "subtaiga" "subtaiga" "subtaiga" "subtaiga" ...
$ Area : chr "small" "small" "small" "verysmall" ...
$ Salin : chr "freshwater" "freshwater" "freshwater" "freshwater" ...
$ soil : chr "pdggTn" "pdggTn" "pdggTn" "pdggTn" ...
daa_results_df <- pathway_daa(abundance = kegg_abundance, metadata = metadata, group = "climZone", daa_method = "ALDEx2", select = NULL, reference = NULL)
Converting metadata to tibble...
Sample names extracted.
Identifying matching columns in metadata...
Matching columns identified: sample.id . This is important for ensuring data consistency.
Using all columns in abundance.
Converting abundance to a matrix...
Reordering metadata...
Converting metadata to a matrix and data frame...
Extracting group information...
Running ALDEx2 with two groups. Performing t-test...
Error in Math.factor(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :
‘round’ not meaningful for factors

Please, help me!
With gratitude, Maria

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants