title | author | date | vignette | output
---|---|---|---|---
Metametrics: an R package with metadata metrics for annotation of genomic compendia | Vincent J. Carey, stvjc at channing.harvard.edu | `r format(Sys.time(), '%B %d, %Y')` | %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Semantic metrics for cancer corpus} %\VignetteEncoding{UTF-8} |
suppressPackageStartupMessages({
library(ggplot2)
library(plotly)
library(metametrics)
library(ssrch)
})
Using the OmicIDX system, we harvested metadata about human samples for which RNA-seq data were deposited in NCBI SRA.
We work with a subset of 1009 studies for which a cancer-related term was present in the study title as recorded at NCBI SRA.
library(ggplot2)
library(plotly)
library(metametrics)
data(study_publ_dates) # harvesting omicidx early 2019
library(lubridate)
ds_ca = DocSet_ca1009()
ds_ca
Ordering the studies by publication date, we accumulate the set of fields used in the sample annotation of the 1009 cancer studies.
# Drop rows with missing values, then index dates by study accession.
study_publ_dates = na.omit(study_publ_dates)
studs1009 = ls(docs2kw(ds_ca)) # in cancer corpus
stud_dates = as_datetime(study_publ_dates[,2])
names(stud_dates) = study_publ_dates[,1]
stud_dates = stud_dates[studs1009] # limit to corpus
stud_dates = sort(stud_dates)
# Field names used by each study, in date order.
ofields = lapply(names(stud_dates),
    function(x) names(retrieve_doc(x, ds_ca)))
freqs = table(unlist(ofields))
#sort(freqs,decreasing=TRUE)[1:20]
# Running union: cumfields[[i]] holds every field seen through study i.
cumfields = ofields
for (i in 2:length(cumfields)) cumfields[[i]] =
    union(cumfields[[i]], cumfields[[i-1]])
# Size of the cumulative field vocabulary after each study.
csiz = sapply(cumfields,length)
bag_fields_ca1009 = unique(unlist(cumfields))
nfields = length(bag_fields_ca1009)
mydf = data.frame(date_published=stud_dates, nfields=csiz)
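The accumulation step can be illustrated on a toy input. The following is a minimal sketch, assuming three hypothetical studies (`toy_fields`, with invented study names and field names) that are already ordered by date. It uses base R's `Reduce()` with `accumulate = TRUE` as an alternative to the explicit loop; it is not part of the metametrics package.

```r
# Hypothetical per-study field names, already sorted by publication date.
toy_fields <- list(
  studyA = c("tissue", "age"),
  studyB = c("tissue", "sex"),
  studyC = c("age", "tumor_stage", "sex")
)

# Running union: element i holds every field seen in studies 1..i.
toy_cum <- Reduce(union, toy_fields, accumulate = TRUE)

# Size of the cumulative vocabulary after each study.
toy_csiz <- lengths(toy_cum)
print(toy_csiz)
```

The resulting sizes never decrease, which is why the plot of `nfields` against date is monotone.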
The growth in size of the set of fields in use over time is displayed here:
ggplot(mydf, aes(x=date_published, y=nfields)) + geom_point()
library(plotly)
ddf = data.frame(date=stud_dates[-1], newly_introduced_fields=diff(csiz),
study=paste0(names(stud_dates[-1]), "\na"))
The next display is interactive -- hover over points to see study accession number and newly introduced field names.
incrs = lapply(2:length(cumfields), function(x) setdiff(cumfields[[x]],
cumfields[[x-1]]))
incrs = unlist(lapply(incrs, function(x) paste0(x, collapse="\n")))
sn = names(stud_dates[-1])
incrs = paste(sn, incrs, sep="\n")
dddf = cbind(ddf, incrs)
g2 = ggplot(dddf, aes(x=date, y=newly_introduced_fields, text=incrs)) + geom_point()
ggplotly(g2)
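How the hover labels are assembled may be easier to see on a small example. This sketch assumes hypothetical cumulative field sets (`toy_cum`) and study identifiers (`toy_ids`), invented for illustration. It mirrors the `setdiff()` and `paste()` steps used above without calling ggplot2 or plotly.

```r
# Hypothetical cumulative field sets, one per study in date order.
toy_cum <- list(
  c("tissue", "age"),
  c("tissue", "age", "sex"),
  c("tissue", "age", "sex", "tumor_stage", "grade")
)
toy_ids <- c("studyA", "studyB", "studyC")

# Fields first seen at study i are those in set i but not in set i-1.
toy_new <- lapply(seq_along(toy_cum)[-1], function(i)
  setdiff(toy_cum[[i]], toy_cum[[i - 1]]))

# One hover label per study: identifier on the first line, then new fields.
toy_labels <- paste(toy_ids[-1],
                    vapply(toy_new, paste, character(1), collapse = "\n"),
                    sep = "\n")
cat(toy_labels, sep = "\n---\n")
```

The first study has no predecessor, so it gets no label; this matches the use of `stud_dates[-1]` in the code above.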
Use of common data elements is promoted by various initiatives. Dictionaries, thesauri, and ontologies are all relevant. We have examples of each in the metametrics package.
A snapshot of the Genomic Data Commons gdcdictionary, with fields and values related to diagnosis and sample characteristics, is provided in `gdc_dx_sam`.
gdc_dx_sam
A table with all entries from several ontologies and the NCI Thesaurus is provided by `load_ontolookup`:
olook = load_ontolookup()
olook
We use robust linear modeling to estimate the growth over time in the vocabulary of fields employed. The data.frame `mydf` includes a variable `nfields` taking a value for each study publication date. The value of `nfields` associated with a given date is the number of distinct fields used by studies published up to and including that date.
library(MASS)
# Seconds per 365-day year; dividing by this puts the slope in fields per year.
nsecpy = 3600*24*365
summary( mm <- rlm(nfields~I(as.numeric(date_published)/nsecpy), data=mydf))
plot(nfields~I(as.numeric(date_published)/nsecpy), data=mydf)
abline(mm)
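To show how the slope of such a fit can be read, here is a minimal sketch on synthetic data. The dates, the growth rate of 20 fields per year, and the three injected outliers are all assumptions made for illustration and are unrelated to the cancer corpus. `MASS::rlm()` is called with its default M-estimation settings, as in the fit above.

```r
library(MASS)  # provides rlm()

set.seed(1)
secs_per_year <- 3600 * 24 * 365

# Synthetic dates: 101 time points spread over ten years, as POSIXct.
toy_dates <- as.POSIXct("2009-01-01", tz = "UTC") +
  seq(0, 10, by = 0.1) * secs_per_year

# Synthetic counts growing by 20 fields per year, plus noise and 3 outliers.
years <- as.numeric(toy_dates) / secs_per_year
toy_n <- 5 + 20 * (years - min(years)) + rnorm(length(years))
toy_n[c(10, 50, 90)] <- toy_n[c(10, 50, 90)] + 100

toy_df <- data.frame(date_published = toy_dates, nfields = toy_n)
toy_fit <- rlm(nfields ~ I(as.numeric(date_published) / secs_per_year),
               data = toy_df)

# Slope is in fields per year because the predictor is in years.
toy_slope <- unname(coef(toy_fit)[2])
print(toy_slope)
```

Because the predictor is seconds since the epoch divided by seconds per year, the second coefficient estimates the number of new fields per year. The intercept refers to the epoch origin and is not directly interpretable.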