Data Structures

kevincjnixon edited this page Jan 8, 2021 · 2 revisions

BinfTools currently uses only two data structures:

  • Normalized Counts
  • Results

Normalized Counts

This object is a data frame of normalized gene counts in which each row is a gene and each column is an individual sample. The rownames should be the gene names. The normalization technique used is up to you. If you are using DESeq2 for differential gene expression analysis, you can use

norm_counts<-as.data.frame(counts(dds, normalized=TRUE))

to obtain the normalized gene counts for use with BinfTools.

Additionally, you should have a character vector describing the sample conditions. Its entries must be in the same order as the sample columns of 'norm_counts'. Again, if you are using DESeq2 for differential expression analysis and your design formula was '~condition', you can use

cond<-as.character(dds$condition)

to obtain and store the sample conditions in 'cond'.
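As a quick sanity check, the pairing between 'norm_counts' and 'cond' can be verified before calling any BinfTools function. The sketch below uses a small made-up data frame in place of real DESeq2 output; the sample names, gene names, and values are assumptions chosen only for illustration.

```r
# Toy stand-in for normalized counts: 3 genes x 4 samples (values are made up)
norm_counts <- data.frame(
  WT_1 = c(10.5, 200.1, 0.0),
  WT_2 = c(12.3, 180.7, 1.2),
  KO_1 = c(30.9, 190.4, 0.0),
  KO_2 = c(28.4, 210.2, 0.8),
  row.names = c("GeneA", "GeneB", "GeneC")
)

# One condition label per sample column, in column order
cond <- c("WT", "WT", "KO", "KO")

# Basic consistency checks on the two objects
stopifnot(
  is.data.frame(norm_counts),
  all(vapply(norm_counts, is.numeric, logical(1))),  # every column is numeric
  !anyDuplicated(rownames(norm_counts)),             # gene names are unique
  is.character(cond),
  length(cond) == ncol(norm_counts)                  # one label per sample
)

# Show which condition is assigned to which sample column
sample_map <- data.frame(sample = colnames(norm_counts), condition = cond)
print(sample_map)
```

Printing 'sample_map' makes it easy to spot a condition vector that is in the wrong order, which the length check alone would not catch.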

Results

This object can be either a DESeq2 results object obtained from:

res<-results(dds, contrast=c("condition","KO","WT")) #Get results from DESeq2 contrasting the "KO" to "WT" conditions

Or a data frame in which each row is a gene (rownames are the gene names) and which contains at least the following columns:

  • baseMean: Number representing the average normalized gene expression across all samples
  • log2FoldChange: Number representing the log2 fold-change in gene expression between the conditions compared. In the example above, it's between "KO" and "WT" where a positive log2 fold-change represents higher expression of that gene in KO compared to WT and negative log2 fold-change represents lower expression of that gene in KO compared to WT.
  • pvalue: The unadjusted p-value generated from the differential expression test
  • padj: The adjusted p-value generated from the differential expression test.
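For output from a tool other than DESeq2, a data frame with these four columns can be assembled by hand. The following is a minimal sketch using base R only; the input table 'other_res' and its column names ('mean_expr', 'lfc', 'p') are hypothetical, and the Benjamini-Hochberg adjustment is just one possible choice when the tool does not supply adjusted p-values.

```r
# Hypothetical output from some other differential expression tool (values are made up)
other_res <- data.frame(
  mean_expr = c(150.2, 35.7, 980.4),
  lfc       = c(1.8, -0.4, -2.3),
  p         = c(0.0004, 0.31, 0.00001),
  row.names = c("GeneA", "GeneB", "GeneC")
)

# Build a results data frame with the four required column names
res <- data.frame(
  baseMean       = other_res$mean_expr,
  log2FoldChange = other_res$lfc,
  pvalue         = other_res$p,
  padj           = p.adjust(other_res$p, method = "BH"),  # assumption: BH correction
  row.names      = rownames(other_res)
)

# Confirm that no required column is missing
required <- c("baseMean", "log2FoldChange", "pvalue", "padj")
missing_cols <- setdiff(required, colnames(res))
stopifnot(length(missing_cols) == 0)
```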

If you are not using DESeq2 to run your analysis, you can easily change the column names of your results object to correspond with these column names for them to be compatible with BinfTools. If you use the limma or edgeR packages for your analysis, BinfTools comes with built-in commands to convert your results objects for you:

fromLimma()

This command will generate a BinfTools-compatible results object from the output of limma's topTable(). Output is a data frame where baseMean is derived from AveExpr, log2FoldChange is derived from logFC, pvalue is derived from P.Value, and padj is derived from adj.P.Val. Note that not all columns are carried over from the limma results, so be sure to assign the output to a new variable:

lim_res<-topTable(fit, number=Inf)
res<-fromLimma(lim_res)
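To make the column mapping concrete, it can be written out by hand in base R. This is only an illustrative sketch, not the source of fromLimma(): the helper name 'limma_to_res' is hypothetical, the input is a made-up table that uses topTable()'s column names, and AveExpr is copied unchanged because the text above does not say whether any transformation is applied.

```r
# Toy table with the column names that limma's topTable() produces (values are made up)
lim_res <- data.frame(
  logFC     = c(2.1, -1.3),
  AveExpr   = c(7.4, 5.2),
  t         = c(6.0, -3.1),
  P.Value   = c(0.0001, 0.01),
  adj.P.Val = c(0.0002, 0.01),
  B         = c(4.2, -1.0),
  row.names = c("GeneA", "GeneB")
)

# Hypothetical helper that applies the mapping described above
limma_to_res <- function(x) {
  data.frame(
    baseMean       = x$AveExpr,    # copied as-is (assumption)
    log2FoldChange = x$logFC,
    pvalue         = x$P.Value,
    padj           = x$adj.P.Val,
    row.names      = rownames(x)
  )
}

res <- limma_to_res(lim_res)
print(colnames(res))  # the t and B columns are not carried over
```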

fromEdgeR()

This command will generate a BinfTools-compatible results object from the output of edgeR's topTags(). Output is a data frame where baseMean is derived from logCPM. The logCPM column in edgeR's output is log2(CPM); the fromEdgeR() function automatically converts it to CPM. log2FoldChange is derived from logFC, pvalue is derived from PValue, and padj is derived from FDR. Again, note that not all columns are carried over from the edgeR results, so be sure to assign the output to a new variable:

er_res<-topTags(de, n=Inf)
res<-fromEdgeR(er_res)
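The same mapping, including the conversion from log2(CPM) back to CPM, can be sketched by hand in base R. This is not the source of fromEdgeR(): the helper name 'edger_to_res' is hypothetical, and the input is a made-up data frame that uses the column names found in the 'table' element of a topTags() result.

```r
# Toy table with the column names found in topTags(...)$table (values are made up)
er_table <- data.frame(
  logFC  = c(1.5, -2.0),
  logCPM = c(3, 5),
  PValue = c(0.001, 0.02),
  FDR    = c(0.002, 0.02),
  row.names = c("GeneA", "GeneB")
)

# Hypothetical helper that applies the mapping described above
edger_to_res <- function(x) {
  data.frame(
    baseMean       = 2^x$logCPM,  # undo the log2 to recover CPM
    log2FoldChange = x$logFC,
    pvalue         = x$PValue,
    padj           = x$FDR,
    row.names      = rownames(x)
  )
}

res <- edger_to_res(er_table)
print(res$baseMean)  # logCPM values of 3 and 5 become CPM values of 8 and 32
```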