---
title: "PRIDE Archive Project Tagger"
author: "Jose A. Dianes"
date: "6 April 2015"
output:
  html_document:
    fig_width: 9
    keep_md: yes
    theme: cerulean
    toc: yes
---
# Introduction and Goals
The [PRIDE Archive](http://www.ebi.ac.uk/pride/archive) uses a system of tags
to classify datasets. So far these tags have been assigned manually by curators.
The PRIDE team currently considers the following tags:
- Biological
- Biomedical
- Cardiovascular
- Chromosome-centric Human Proteome Project (C-HPP)
- Human Proteome Project
- Metaproteomics
- PRIME-XS Project
- Technical
In the present analysis, by trying to build a model that predicts tags, we want
to find out what lies behind those tags. Is information such as a project's
species or tissue relevant for identifying a project as biomedical? Is that
information actually hidden in the dataset's textual description?
# Getting and cleaning the data
Let's start by getting the latest `prideR` version from `GitHub`, together with
some other packages we will use to build our model.
```{r, message=FALSE}
library(devtools)
install_github("PRIDE-R/prideR")
library(prideR)
library(ggplot2)
theme_set(theme_linedraw())
library(tm)
library(caTools)
library(randomForest)
library(pander)
library(wordcloud)
```
Now we get all the projects available in the archive.
```{r, cache=TRUE}
archive.df <- as.data.frame(search.list.ProjectSummary("",0,20000))
```
Keep only those with a project tag assigned.
```{r}
archive.df <- subset(archive.df, project.tags!="Not available")
```
# Tag distribution
Now we have `r nrow(archive.df)` tagged projects in the dataset. Let's have a
quick look at how these tags are distributed.
```{r}
ggplot(data=archive.df, aes(x=as.factor(project.tags))) +
    geom_bar() +
    coord_flip() +
    ylab("Number of Projects") +
    xlab("Project Tag")
```
As we can see, the project tag distribution is very skewed. This will make it
difficult to build good models, especially since we currently have very few projects.
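For reference, here is the same distribution as a table; a minimal sketch using the counts directly:
```{r}
# Numeric view of the tag distribution shown in the plot above
tag.counts <- sort(table(archive.df$project.tags), decreasing=TRUE)
pander(data.frame(Tag=names(tag.counts), Projects=as.vector(tag.counts)),
       style = "rmarkdown", split.table = Inf)
```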
# Using meta-data
Let's start by trying to predict project tags based on meta-data only, that is,
by using information such as the *species*, *tissues*, and so on.
## Preparing meta-data
Right now we have all the metadata associated with PRIDE datasets in text format.
In order to use it to build a model we should convert it to categorical data.
However, most of these fields have too many different values for R machine
learning libraries to use them as predictors. For that reason we need to
aggregate the low-frequency values into a single `Other` factor level.
```{r}
# Collapse low-frequency species into "Other" and convert to a factor
species.counts <- table(archive.df$species)
low.freq.species <- as.factor(names(which(species.counts<3)))
archive.df[archive.df$species %in% low.freq.species,]$species <- "Other"
archive.df$species <- as.factor(archive.df$species)

# Same for tissues
tissues.counts <- table(archive.df$tissues)
low.freq.tissues <- as.factor(names(which(tissues.counts<3)))
archive.df[archive.df$tissues %in% low.freq.tissues,]$tissues <- "Other"
archive.df$tissues <- as.factor(archive.df$tissues)

# Same for PTM names
ptm.counts <- table(archive.df$ptm.names)
low.freq.ptm <- as.factor(names(which(ptm.counts<3)))
archive.df[archive.df$ptm.names %in% low.freq.ptm,]$ptm.names <- "Other"
archive.df$ptm.names <- as.factor(archive.df$ptm.names)

# Same for instrument names
instrument.counts <- table(archive.df$instrument.names)
low.freq.instrument <- as.factor(names(which(instrument.counts<3)))
archive.df[archive.df$instrument.names %in% low.freq.instrument,]$instrument.names <- "Other"
archive.df$instrument.names <- as.factor(archive.df$instrument.names)

# Remaining categorical columns
archive.df$project.tags <- as.factor(archive.df$project.tags)
archive.df$submissionType <- as.factor(archive.df$submissionType)
```
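The four blocks above repeat the same pattern. As a sketch only (not used below; `collapse_rare` is a name introduced here for illustration), the aggregation could be wrapped in a small helper:
```{r}
# Sketch: collapse values occurring fewer than min.count times into "Other"
collapse_rare <- function(x, min.count = 3) {
  x <- as.character(x)
  counts <- table(x)
  x[x %in% names(which(counts < min.count))] <- "Other"
  as.factor(x)
}
# Equivalent to the per-column code above, e.g.:
# archive.df$species <- collapse_rare(archive.df$species)
```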
## Exploring metadata
Let's explore a bit how each piece of metadata is distributed across each
project tag. This might give us some insight into useful predictors if we
find any species, tissue, etc., that is more present in a particular tag than in
others. For that we will fit a linear model and check the statistical significance
of each predictor.
```{r}
meta_log_model <- glm(
    project.tags ~ species + tissues + ptm.names + instrument.names + num.assays + submissionType,
    data=archive.df,
    family=binomial)
meta_log_model_summary <- summary(meta_log_model)
# Keep the predictors whose coefficients have p-values below 0.05
relevant_meta <- names(which(meta_log_model_summary$coefficients[,4]<0.05))
```
And we see that the following predictors are statistically significant when
predicting tags **assuming a linear relationship**:
```{r, echo=FALSE}
pander(data.frame(Predictor=relevant_meta), style = "rmarkdown", split.table = Inf)
```
We cannot conclude much, for two reasons. First, the relationship is surely
not linear. Second, our dataset is too small to model better relationships.
But at least there seems to be some relationship between *species*,
*submission type*, and the project tag.
## A meta-data only model
Let's now try a *Random Forest* model using just the metadata predictors. We know
our dependent variable is very skewed, so we shouldn't expect too much accuracy.
We choose this type of model due to its success with non-linear problems.
First of all, we prepare a train/test split so we can check accuracy. We also
have the problem of having too little data in general. Hopefully our models will
get better as PRIDE datasets become more numerous.
```{r}
set.seed(123)
# Stratified split on the outcome, keeping ~85% of projects for training
spl <- sample.split(archive.df$project.tags, SplitRatio=0.85)
evalTrain <- archive.df[spl==TRUE,]
evalTest <- archive.df[spl==FALSE,]
```
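As a quick sanity check (a sketch; nothing below depends on it), we can compare the tag proportions in the two partitions:
```{r}
# Tag proportions in the training and test partitions
round(prop.table(table(evalTrain$project.tags)), 2)
round(prop.table(table(evalTest$project.tags)), 2)
```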
And now we train the model.
```{r}
set.seed(123)
rfModelMeta <- randomForest(
    project.tags ~ species + tissues + ptm.names + instrument.names + submissionType,
    data = evalTrain)
# Calculate accuracy on the test set
rfPredMeta <- predict(rfModelMeta, newdata=evalTest)
t <- table(evalTest$project.tags, rfPredMeta)
meta_accuracy <- sum(diag(t)) / nrow(evalTest)
```
We obtain a prediction accuracy of **`r meta_accuracy`** on the test data. We can
also see that the model struggles with classes that have very few cases and
does better with the large classes.
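To make that per-class behaviour explicit, here is a minimal sketch computing per-tag recall from the confusion table (rows are the true tags):
```{r}
# Per-tag recall: correct predictions divided by the number of true cases of each tag
round(diag(t) / rowSums(t), 2)
```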
```{r, echo=FALSE}
pander(t, style = "rmarkdown", split.table = Inf)
```
# Using text fields
PRIDE curators surely rely on the textual descriptions of datasets when assigning
tags. Let's try to incorporate that information into our models in order to
predict a dataset's tag.
## Preparing the corpus
We have two textual fields, `project.title` and `project.description`. Let's
prepare a corpus with both of them. In both cases we reduce the text to
lowercase, remove punctuation and stopwords, and apply stemming.
```{r, message=FALSE}
# Combine title and description into a single text field per project
evalTrain$AllText <- do.call(paste, evalTrain[,c("project.title","project.description")])
evalTest$AllText <- do.call(paste, evalTest[,c("project.title","project.description")])
# Build the corpus: lowercase, remove punctuation and stopwords, stem
corpusAll <- Corpus(VectorSource(c(evalTrain$AllText, evalTest$AllText)))
corpusAll <- tm_map(corpusAll, tolower)
corpusAll <- tm_map(corpusAll, PlainTextDocument)
corpusAll <- tm_map(corpusAll, removePunctuation)
corpusAll <- tm_map(corpusAll, removeWords, stopwords("english"))
corpusAll <- tm_map(corpusAll, stripWhitespace)
corpusAll <- tm_map(corpusAll, stemDocument)
```
We will also keep only those terms appearing in at least 1 percent of the projects (a sparsity threshold of 0.99).
```{r, message=FALSE, warning=FALSE}
# Generate term matrix
dtmAll <- DocumentTermMatrix(corpusAll)
sparseAll <- removeSparseTerms(dtmAll, 0.99)
allWords <- data.frame(as.matrix(sparseAll))
colnames(allWords) <- make.names(colnames(allWords))
```
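To see how much the sparsity filter reduces the vocabulary, a quick sketch comparing the full and the filtered term matrices:
```{r}
# Number of terms before and after removing sparse terms
c(all.terms = ncol(dtmAll), retained.terms = ncol(sparseAll))
```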
## Word clouds
Another useful way to better understand our data is to visualise which words
appear most often for a given tag. Doing this visually lets us grasp the
information quickly.
We can use the same approach as before to get the data ready.
```{r, warning=FALSE}
corpusCloud <- Corpus(VectorSource(c(evalTrain$AllText, evalTest$AllText)))
corpusCloud <- tm_map(corpusCloud, tolower)
corpusCloud <- tm_map(corpusCloud, PlainTextDocument)
corpusCloud <- tm_map(corpusCloud, removePunctuation)
corpusCloud <- tm_map(corpusCloud, removeWords, stopwords("english"))
corpusCloud <- tm_map(corpusCloud, stripWhitespace)
dtmCloud <- DocumentTermMatrix(corpusCloud)
allCloud <- data.frame(as.matrix(dtmCloud))
# Tags aligned with the corpus rows (training rows first, then test rows)
cloudTags <- c(as.character(evalTrain$project.tags), as.character(evalTest$project.tags))
```
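Since the per-tag clouds below all follow the same pattern, the subsetting could also be wrapped in a small helper; a sketch only, with `plot_tag_cloud` being a name introduced here for illustration and `cloudTags` the tag vector defined above:
```{r}
# Sketch: render a word cloud for a single project tag
plot_tag_cloud <- function(tag) {
  tagDocs <- allCloud[cloudTags == tag, ]
  wordcloud(colnames(tagDocs), colSums(tagDocs), scale = c(2, 0.25))
}
# e.g. plot_tag_cloud("Biological")
```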
### Biological word cloud
```{r}
allCloudBiological <- allCloud[cloudTags=="Biological",]
wordcloud(colnames(allCloudBiological), colSums(allCloudBiological), scale=c(2,0.25))
```
### Biomedical word cloud
```{r}
allCloudBiomedical <- allCloud[cloudTags=="Biomedical",]
wordcloud(colnames(allCloudBiomedical), colSums(allCloudBiomedical), scale=c(2,0.25))
```
### Cardiovascular word cloud
```{r}
allCloudCardiovascular <- allCloud[cloudTags=="Cardiovascular",]
wordcloud(colnames(allCloudCardiovascular), colSums(allCloudCardiovascular), scale=c(2,0.25))
```
### C-HPP word cloud
```{r}
allCloudCHPP <- allCloud[cloudTags=="Chromosome-centric Human Proteome Project (C-HPP)",]
wordcloud(colnames(allCloudCHPP), colSums(allCloudCHPP), scale=c(2,0.25))
```
### Metaproteomics word cloud
```{r}
allCloudMetaproteomics <- allCloud[cloudTags=="Metaproteomics",]
wordcloud(colnames(allCloudMetaproteomics), colSums(allCloudMetaproteomics), scale=c(2,0.25))
```
### PRIME-XS Project word cloud
```{r}
allCloudPrimeXs <- allCloud[cloudTags=="PRIME-XS Project",]
wordcloud(colnames(allCloudPrimeXs), colSums(allCloudPrimeXs), scale=c(2,0.25))
```
### Technical word cloud
```{r}
allCloudTechnical <- allCloud[cloudTags=="Technical",]
wordcloud(colnames(allCloudTechnical), colSums(allCloudTechnical), scale=c(2,0.25))
```
### Some thoughts about word clouds
One obvious conclusion that emerges from these charts is that some tags are clearly
underrepresented. This will limit how well our models can do. However, we can
expect this situation to change as more and more projects are submitted to
PRIDE Archive.
We can also see that some words rank at the top for most project tags, and are
probably not very useful as predictors. In particular, we refer to the family of
terms related to proteomics, proteins, etc. We hope these terms will be filtered
out in our model selection process in the next section.
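One option, not applied here, would be to drop such domain-wide terms explicitly before building the clouds or the term matrix; a hedged sketch (the word list below is purely illustrative):
```{r, eval=FALSE}
# Sketch: remove domain-wide terms that appear under most project tags
domain.stopwords <- c("protein", "proteins", "proteome", "proteomics", "mass", "spectrometry")
corpusCloud <- tm_map(corpusCloud, removeWords, domain.stopwords)
```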
## Selecting significant terms
We have ended up with **`r ncol(allWords)` possible predictors**. But we can do
better than this. We are going to train a *linear model* using these terms, with
our dependent variable as the outcome. Then we will keep the variables that are
statistically significant and incorporate them into the main dataset that
we will use with our final model.
```{r, message=FALSE, warning=FALSE}
# Find the most significant terms using a linear model on the training documents
allWordsTrain2 <- head(allWords, nrow(evalTrain))
allWordsTrain2$Tag <- evalTrain$project.tags
logModelAllWords <- glm(Tag~., data=allWordsTrain2, family=binomial)
all_three_star_terms <- names(which(summary(logModelAllWords)$coefficients[,4]<0.001))
all_two_star_terms <- names(which(summary(logModelAllWords)$coefficients[,4]<0.01))
all_one_star_terms <- names(which(summary(logModelAllWords)$coefficients[,4]<0.05))
# Keep just the terms that are significant at the 0.05 level
allWords <- subset(allWords,
                   select=names(allWords) %in% all_one_star_terms)
# Split again
allWordsTrain <- head(allWords, nrow(evalTrain))
allWordsTest <- tail(allWords, nrow(evalTest))
# Add the term frequencies to the train and test data frames
evalTrain <- cbind(evalTrain, allWordsTrain)
evalTest <- cbind(evalTest, allWordsTest)
# Remove the combined text variable since we don't need it any more
evalTrain$AllText <- NULL
evalTest$AllText <- NULL
```
We ended up with just **`r ncol(allWords)` predictors** that will be
incorporated into our model. These are the terms, together with their mean
frequencies by project tag (remember they are **stemmed**):
```{r}
# Align tags with the term-frequency rows (training rows first, then test rows)
allWords$project.tag <- cloudTags
panderOptions('round', 2)
panderOptions('keep.trailing.zeros', TRUE)
pander(t(aggregate(.~project.tag, data=allWords, mean)), round=4, style = "rmarkdown", split.table = Inf)
```
## A meta-data and text model
Now let's train a model using our meta-data predictors together with the terms
selected in the previous step from the textual data.
```{r}
set.seed(123)
rfModelMetaText <- randomForest(
    project.tags ~ . - accession - publication.date - project.title - project.description,
    data = evalTrain)
# Calculate accuracy on the test set
rfPredMetaText <- predict(rfModelMetaText, newdata=evalTest)
t <- table(evalTest$project.tags, rfPredMetaText)
meta_text_accuracy <- sum(diag(t)) / nrow(evalTest)
```
We obtain an accuracy of **`r meta_text_accuracy`**. The improvement comes mostly
from slightly better predictions for the `Biological` and `Technical` classes.
```{r, echo=FALSE}
pander(t, style = "rmarkdown", split.table = Inf)
```
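To see which predictors drive the combined model (meta-data fields versus the selected terms), we can inspect the random forest's importance measures; a minimal sketch:
```{r}
# Importance of the predictors in the combined meta-data + text model
varImpPlot(rfModelMetaText, main = "Top predictors")
```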
# Conclusions
We have seen our two main problems:
- Our data is very unbalanced: there are many more `Biological` and `Biomedical`
projects than anything else.
- This imbalance is made worse by the small size of our training data.
Hopefully both problems will become less important over time, as PRIDE gets
more submissions.
We have also seen that incorporating textual data alongside the meta-data makes
our model more accurate. Not by a lot, but more accurate after all. In the
exploration and selection process we saw that a linear model can be used to
select significant terms, and that terms which did not look relevant in our
word-cloud charts were indeed removed by this selection process.
In terms of our original questions, we don't yet have enough confidence either
to predict tags or to profile each tag in terms of its associated meta-data
or textual description (although we have a starting list of candidate terms).
However, we have seen that there is a relationship, especially when we have
enough cases for a given tag to train a model. Additionally, we have seen that
both the meta-data and the textual description contribute to predicting a
dataset's tag more accurately.
# Future work
Our simple train/test split might be making our model overfit the data. A better
approach would be to use cross-validation, although using *random forests*
already attenuates this problem.
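For example, k-fold cross-validation could be set up with the `caret` package (not loaded above; a sketch under that assumption, not evaluated here):
```{r, eval=FALSE}
# Sketch: 5-fold cross-validation of the meta-data random forest using caret
library(caret)
cv.control <- trainControl(method = "cv", number = 5)
rfCV <- train(project.tags ~ species + tissues + ptm.names + instrument.names + submissionType,
              data = archive.df, method = "rf", trControl = cv.control)
rfCV$results
```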
We also need to try different models and do additional exploratory analysis.
By doing so we will gain additional insight into the nature of our data. This
will help with variable selection, and even with other techniques such as
ensembles or cluster-then-predict.