---
title: "PRIDE Archive Project Tagger"
author: "Jose A. Dianes"
date: "6 April 2015"
output:
  html_document:
    fig_width: 9
    keep_md: yes
    theme: cerulean
    toc: yes
---
# Introduction and Goals
The [PRIDE Archive](http://www.ebi.ac.uk/pride/archive) uses a system of tags
to classify datasets. So far these tags have been assigned manually by curators.
The PRIDE team currently considers the following tags:
- Biological
- Biomedical
- Cardiovascular
- Chromosome-centric Human Proteome Project (C-HPP)
- Human Proteome Project
- Metaproteomics
- PRIME-XS Project
- Technical
In the present analysis, by trying to build a model that predicts tags, we want
to find out what lies behind those tags. Is information such as a project's
species or tissue relevant for identifying a project as biomedical? Is that
information actually hidden in the dataset's textual description?
# Getting and cleaning the data
Let's start by getting the latest `prideR` version from `GitHub`, together with
some other packages we will use to build our model.
```{r, message=FALSE}
library(devtools)
install_github("PRIDE-R/prideR")
library(prideR)
library(ggplot2)
theme_set(theme_linedraw())
library(tm)
library(caTools)
library(randomForest)
library(pander)
library(wordcloud)
```
Now we get all the projects available in the archive.
```{r, cache=TRUE}
archive.df <- as.data.frame(search.list.ProjectSummary("",0,20000))
```
Keep only those with a project tag assigned.
```{r}
archive.df <- subset(archive.df, project.tags!="Not available")
```
# Tag distribution
Now we have `r nrow(archive.df)` tagged projects in the dataset. Let's have a
quick look at how these tags are distributed.
```{r}
ggplot(data=archive.df, aes(x=as.factor(project.tags))) +
    geom_bar() +
    coord_flip() +
    ylab("Number of Projects") +
    xlab("Project Tag")
```
As we can see, the project tag distribution is very skewed. This will make it
difficult to build good models, especially since we currently have very few projects.
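For reference, here is the same distribution as a table; a minimal sketch using the counts directly:
```{r}
# Numeric view of the tag distribution shown in the plot above
tag.counts <- sort(table(archive.df$project.tags), decreasing=TRUE)
pander(data.frame(Tag=names(tag.counts), Projects=as.vector(tag.counts)),
       style = "rmarkdown", split.table = Inf)
```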
# Using meta-data
Let's start by trying to predict project tags based on meta-data only, that is,
by using information such as the *species*, *tissues*, and so on.
## Preparing meta-data
Right now we have all the metadata associated with PRIDE datasets in text format.
In order to use it to build a model we should convert it to categorical data.
However, most of these fields have too many different values for R machine
learning libraries to use them as predictors. For that reason we need to
aggregate the low-frequency values into a single `Other` factor level.
```{r}
# Collapse low-frequency species into "Other" and convert to a factor
species.counts <- table(archive.df$species)
low.freq.species <- as.factor(names(which(species.counts<3)))
archive.df[archive.df$species %in% low.freq.species,]$species <- "Other"
archive.df$species <- as.factor(archive.df$species)

# Same for tissues
tissues.counts <- table(archive.df$tissues)
low.freq.tissues <- as.factor(names(which(tissues.counts<3)))
archive.df[archive.df$tissues %in% low.freq.tissues,]$tissues <- "Other"
archive.df$tissues <- as.factor(archive.df$tissues)

# Same for PTM names
ptm.counts <- table(archive.df$ptm.names)
low.freq.ptm <- as.factor(names(which(ptm.counts<3)))
archive.df[archive.df$ptm.names %in% low.freq.ptm,]$ptm.names <- "Other"
archive.df$ptm.names <- as.factor(archive.df$ptm.names)

# Same for instrument names
instrument.counts <- table(archive.df$instrument.names)
low.freq.instrument <- as.factor(names(which(instrument.counts<3)))
archive.df[archive.df$instrument.names %in% low.freq.instrument,]$instrument.names <- "Other"
archive.df$instrument.names <- as.factor(archive.df$instrument.names)

# Remaining categorical columns
archive.df$project.tags <- as.factor(archive.df$project.tags)
archive.df$submissionType <- as.factor(archive.df$submissionType)
```
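The four blocks above repeat the same pattern. As a sketch only (not used below; `collapse_rare` is a name introduced here for illustration), the aggregation could be wrapped in a small helper:
```{r}
# Sketch: collapse values occurring fewer than min.count times into "Other"
collapse_rare <- function(x, min.count = 3) {
  x <- as.character(x)
  counts <- table(x)
  x[x %in% names(which(counts < min.count))] <- "Other"
  as.factor(x)
}
# Equivalent to the per-column code above, e.g.:
# archive.df$species <- collapse_rare(archive.df$species)
```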
## Exploring metadata
Let's explore a bit how each piece of metadata is distributed across each
project tag. This might give us some insight into useful predictors if we
find any species, tissue, etc., that is more present in a particular tag than in
others. For that we will fit a linear model and check the statistical significance
of each predictor.
```{r}
meta_log_model <- glm(
    project.tags ~ species + tissues + ptm.names + instrument.names + num.assays + submissionType,
    data=archive.df,
    family=binomial)
meta_log_model_summary <- summary(meta_log_model)
# Keep the predictors whose coefficients have p-values below 0.05
relevant_meta <- names(which(meta_log_model_summary$coefficients[,4]<0.05))
```
And we see that the following predictors are statistically significant when
predicting tags **assuming a linear relationship**:
```{r, echo=FALSE}
pander(data.frame(Predictor=relevant_meta), style = "rmarkdown", split.table = Inf)
```
We cannot conclude much, for two reasons. First, the relationship is surely
not linear. Second, our dataset is too small to model better relationships.
But at least there seems to be some relationship between *species*,
*submission type*, and the project tag.
## A meta-data only model
Let's now try a *Random Forest* model using just the metadata predictors. We know
our dependent variable is very skewed, so we shouldn't expect too much accuracy.
We choose this type of model due to its success with non-linear problems.
First of all, we prepare a train/test split so we can check accuracy. We also
have the problem of having too little data in general. Hopefully our models will
get better as PRIDE datasets become more numerous.
```{r}
set.seed(123)
# Stratified split on the outcome, keeping ~85% of projects for training
spl <- sample.split(archive.df$project.tags, SplitRatio=0.85)
evalTrain <- archive.df[spl==TRUE,]
evalTest <- archive.df[spl==FALSE,]
```
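As a quick sanity check (a sketch; nothing below depends on it), we can compare the tag proportions in the two partitions:
```{r}
# Tag proportions in the training and test partitions
round(prop.table(table(evalTrain$project.tags)), 2)
round(prop.table(table(evalTest$project.tags)), 2)
```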
And now we train the model.
```{r}
set.seed(123)
rfModelMeta <- randomForest(
    project.tags ~ species + tissues + ptm.names + instrument.names + submissionType,
    data = evalTrain)
# Calculate accuracy on the test set
rfPredMeta <- predict(rfModelMeta, newdata=evalTest)
t <- table(evalTest$project.tags, rfPredMeta)
meta_accuracy <- sum(diag(t)) / nrow(evalTest)
```
We obtain a prediction accuracy of **`r meta_accuracy`** on the test data. We can
also see that the model struggles with classes that have very few cases and
does better with the large classes.
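To make that per-class behaviour explicit, here is a minimal sketch computing per-tag recall from the confusion table (rows are the true tags):
```{r}
# Per-tag recall: correct predictions divided by the number of true cases of each tag
round(diag(t) / rowSums(t), 2)
```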
```{r, echo=FALSE}
pander(t, style = "rmarkdown", split.table = Inf)
```
# Using text fields
PRIDE curators surely rely on the textual descriptions of datasets when assigning
tags. Let's try to incorporate that information into our models in order to
predict a dataset's tag.
## Preparing the corpus
We have two textual fields, `project.title` and `project.description`. Let's
prepare a corpus with both of them. In both cases we reduce the text to
lowercase, remove punctuation and stopwords, and apply stemming.
```{r, message=FALSE}
# Combine title and description into a single text field per project
evalTrain$AllText <- do.call(paste, evalTrain[,c("project.title","project.description")])
evalTest$AllText <- do.call(paste, evalTest[,c("project.title","project.description")])
# Build the corpus: lowercase, remove punctuation and stopwords, stem
corpusAll <- Corpus(VectorSource(c(evalTrain$AllText, evalTest$AllText)))
corpusAll <- tm_map(corpusAll, tolower)
corpusAll <- tm_map(corpusAll, PlainTextDocument)
corpusAll <- tm_map(corpusAll, removePunctuation)
corpusAll <- tm_map(corpusAll, removeWords, stopwords("english"))
corpusAll <- tm_map(corpusAll, stripWhitespace)
corpusAll <- tm_map(corpusAll, stemDocument)
```
We will also keep only those terms appearing in at least 1 percent of the projects (a sparsity threshold of 0.99).
```{r, message=FALSE, warning=FALSE}
# Generate term matrix
dtmAll <- DocumentTermMatrix(corpusAll)
sparseAll <- removeSparseTerms(dtmAll, 0.99)
allWords <- data.frame(as.matrix(sparseAll))
colnames(allWords) <- make.names(colnames(allWords))
```
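To see how much the sparsity filter reduces the vocabulary, a quick sketch comparing the full and the filtered term matrices:
```{r}
# Number of terms before and after removing sparse terms
c(all.terms = ncol(dtmAll), retained.terms = ncol(sparseAll))
```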
## Word clouds
Another useful way to better understand our data is to visualise which words
appear most often for a given tag. Doing this visually lets us grasp the
information quickly.
We can use the same approach as before to get the data ready.
```{r, warning=FALSE}
corpusCloud <- Corpus(VectorSource(c(evalTrain$AllText, evalTest$AllText)))
corpusCloud <- tm_map(corpusCloud, tolower)
corpusCloud <- tm_map(corpusCloud, PlainTextDocument)
corpusCloud <- tm_map(corpusCloud, removePunctuation)
corpusCloud <- tm_map(corpusCloud, removeWords, stopwords("english"))
corpusCloud <- tm_map(corpusCloud, stripWhitespace)
dtmCloud <- DocumentTermMatrix(corpusCloud)
allCloud <- data.frame(as.matrix(dtmCloud))
# Tags aligned with the corpus rows (training rows first, then test rows)
cloudTags <- c(as.character(evalTrain$project.tags), as.character(evalTest$project.tags))
```
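Since the per-tag clouds below all follow the same pattern, the subsetting could also be wrapped in a small helper; a sketch only, with `plot_tag_cloud` being a name introduced here for illustration and `cloudTags` the tag vector defined above:
```{r}
# Sketch: render a word cloud for a single project tag
plot_tag_cloud <- function(tag) {
  tagDocs <- allCloud[cloudTags == tag, ]
  wordcloud(colnames(tagDocs), colSums(tagDocs), scale = c(2, 0.25))
}
# e.g. plot_tag_cloud("Biological")
```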
### Biological word cloud
```{r}
allCloudBiological <- allCloud[cloudTags=="Biological",]
wordcloud(colnames(allCloudBiological), colSums(allCloudBiological), scale=c(2,0.25))
```
### Biomedical word cloud
```{r}
allCloudBiomedical <- allCloud[cloudTags=="Biomedical",]
wordcloud(colnames(allCloudBiomedical), colSums(allCloudBiomedical), scale=c(2,0.25))
```
### Cardiovascular word cloud
```{r}
allCloudCardiovascular <- allCloud[cloudTags=="Cardiovascular",]
wordcloud(colnames(allCloudCardiovascular), colSums(allCloudCardiovascular), scale=c(2,0.25))
```
### C-HPP word cloud
```{r}
allCloudCHPP <- allCloud[cloudTags=="Chromosome-centric Human Proteome Project (C-HPP)",]
wordcloud(colnames(allCloudCHPP), colSums(allCloudCHPP), scale=c(2,0.25))
```
### Metaproteomics word cloud
```{r}
allCloudMetaproteomics <- allCloud[cloudTags=="Metaproteomics",]
wordcloud(colnames(allCloudMetaproteomics), colSums(allCloudMetaproteomics), scale=c(2,0.25))
```
### PRIME-XS Project word cloud
```{r}
allCloudPrimeXs <- allCloud[cloudTags=="PRIME-XS Project",]
wordcloud(colnames(allCloudPrimeXs), colSums(allCloudPrimeXs), scale=c(2,0.25))
```
### Technical word cloud
```{r}
allCloudTechnical <- allCloud[cloudTags=="Technical",]
wordcloud(colnames(allCloudTechnical), colSums(allCloudTechnical), scale=c(2,0.25))
```
### Some thoughts about word clouds
One obvious conclusion that emerges from these charts is that some tags are clearly
underrepresented. This will limit how well our models can do. However, we can
expect this situation to change as more and more projects are submitted to
PRIDE Archive.
We can also see that some words rank at the top for most project tags, and are
probably not very useful as predictors. In particular, we refer to the family of
terms related to proteomics, proteins, etc. We hope these terms will be filtered
out in our model selection process in the next section.
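One option, not applied here, would be to drop such domain-wide terms explicitly before building the clouds or the term matrix; a hedged sketch (the word list below is purely illustrative):
```{r, eval=FALSE}
# Sketch: remove domain-wide terms that appear under most project tags
domain.stopwords <- c("protein", "proteins", "proteome", "proteomics", "mass", "spectrometry")
corpusCloud <- tm_map(corpusCloud, removeWords, domain.stopwords)
```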
## Selecting significant terms
We have ended up with **`r ncol(allWords)` possible predictors**. But we can do
better than this. We are going to train a *linear model* using these terms, with
our dependent variable as the outcome. Then we will keep the variables that are
statistically significant and incorporate them into the main dataset that
we will use with our final model.
```{r, message=FALSE, warning=FALSE}
# Find the most significant terms using a linear model on the training documents
allWordsTrain2 <- head(allWords, nrow(evalTrain))
allWordsTrain2$Tag <- evalTrain$project.tags
logModelAllWords <- glm(Tag~., data=allWordsTrain2, family=binomial)
all_three_star_terms <- names(which(summary(logModelAllWords)$coefficients[,4]<0.001))
all_two_star_terms <- names(which(summary(logModelAllWords)$coefficients[,4]<0.01))
all_one_star_terms <- names(which(summary(logModelAllWords)$coefficients[,4]<0.05))
# Keep just the terms that are significant at the 0.05 level
allWords <- subset(allWords,
                   select=names(allWords) %in% all_one_star_terms)
# Split again
allWordsTrain <- head(allWords, nrow(evalTrain))
allWordsTest <- tail(allWords, nrow(evalTest))
# Add the term frequencies to the train and test data frames
evalTrain <- cbind(evalTrain, allWordsTrain)
evalTest <- cbind(evalTest, allWordsTest)
# Remove the combined text variable since we don't need it any more
evalTrain$AllText <- NULL
evalTest$AllText <- NULL
```
We ended up with just **`r ncol(allWords)` predictors** that will be
incorporated into our model. These are the terms, together with their mean
frequencies by project tag (remember they are **stemmed**):
```{r}
# Align tags with the term-frequency rows (training rows first, then test rows)
allWords$project.tag <- cloudTags
panderOptions('round', 2)
panderOptions('keep.trailing.zeros', TRUE)
pander(t(aggregate(.~project.tag, data=allWords, mean)), round=4, style = "rmarkdown", split.table = Inf)
```
## A meta-data and text model
Now let's train a model using our meta-data predictors together with the terms
selected in the previous step from the textual data.
```{r}
set.seed(123)
rfModelMetaText <- randomForest(
    project.tags ~ . - accession - publication.date - project.title - project.description,
    data = evalTrain)
# Calculate accuracy on the test set
rfPredMetaText <- predict(rfModelMetaText, newdata=evalTest)
t <- table(evalTest$project.tags, rfPredMetaText)
meta_text_accuracy <- sum(diag(t)) / nrow(evalTest)
```
We obtain an accuracy of **`r meta_text_accuracy`**. The improvement comes mostly
from slightly better predictions for the `Biological` and `Technical` classes.
```{r, echo=FALSE}
pander(t, style = "rmarkdown", split.table = Inf)
```
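To see which predictors drive the combined model (meta-data fields versus the selected terms), we can inspect the random forest's importance measures; a minimal sketch:
```{r}
# Importance of the predictors in the combined meta-data + text model
varImpPlot(rfModelMetaText, main = "Top predictors")
```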
# Conclusions
We have seen our two main problems:
- Our data is very unbalanced: there are many more `Biological` and `Biomedical`
projects than anything else.
- This imbalance is made worse by the small size of our training data.
Hopefully both problems will become less important over time, as PRIDE gets
more submissions.
We have also seen that incorporating textual data alongside the meta-data makes
our model more accurate. Not by a lot, but more accurate after all. In the
exploration and selection process we saw that a linear model can be used to
select significant terms, and that terms which did not look relevant in our
word-cloud charts were indeed removed by this selection process.
In terms of our original questions, we don't yet have enough confidence either
to predict tags or to profile each tag in terms of its associated meta-data
or textual description (although we have a starting list of candidate terms).
However, we have seen that there is a relationship, especially when we have
enough cases for a given tag to train a model. Additionally, we have seen that
both the meta-data and the textual description contribute to predicting a
dataset's tag more accurately.
# Future work
Our simple train/test split might be making our model overfit the data. A better
approach would be to use cross-validation, although using *random forests*
already attenuates this problem.
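For example, k-fold cross-validation could be set up with the `caret` package (not loaded above; a sketch under that assumption, not evaluated here):
```{r, eval=FALSE}
# Sketch: 5-fold cross-validation of the meta-data random forest using caret
library(caret)
cv.control <- trainControl(method = "cv", number = 5)
rfCV <- train(project.tags ~ species + tissues + ptm.names + instrument.names + submissionType,
              data = archive.df, method = "rf", trControl = cv.control)
rfCV$results
```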
We also need to try different models and do additional exploratory analysis.
By doing so we will gain additional insight into the nature of our data. This
will help with variable selection, and even with other techniques such as
ensembles or cluster-then-predict.