Skip to content
forked from amcrisan/Adjutant

Runs a pubmed query, returns results and allows user to explore high-level structure of returned documents

License

Notifications You must be signed in to change notification settings

ceefluz/Adjutant

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build Status

  1. Overview
  2. Download
  3. Demo

Overview

Adjutant is an open-source, interactive, and R-based application to support mining PubMed for a systematic or a literature review. Given a PubMed-compatible search query, Adjutant downloads the relevant articles and allows the user to perform an unsupervised clustering analysis to identify data-driven topic clusters. Users can also sample documents using different strategies to obtain a more manageable dataset for further analysis. Adjutant makes explicit trade-offs between speed and accuracy, which are modifiable by the user, such that a complete analysis of several thousand documents can take a few minutes. All analytic datasets generated by Adjutant are saved, allowing users to easily conduct other downstream analyses that Adjutant does not explicitly support.

We've also provided a detailed video to you help install Adjutant and to give you a tour of its functionality : https://vimeo.com/259442489

Citation

If you use Adjutant, please cite:

Adjutant: an R-based tool to support topic discovery for systematic and literature reviews Anamaria Crisan, Tamara Munzner, Jennifer L. Gardy Oxford Bioinformatics; doi: 10.1093/bioinformatics/bty722

Important Adjutant Details

In addition to the publication we have provided some extensive documentation on Adjutant's inner workings, from the specific R packages used in implementation, the the quality of the clustering results using both real and synthetic data. A pdf document of these details have been made available online.

An R notebook is also available to download and run the analysis - due to the large size of the analysis files the additional data and notebook is not shipped with Adjutant, but can be downloaded as a compressed file.

Download

Download the latest development code of Adjutant from GitHub using devtools with

devtools::install_github("amcrisan/Adjutant")

If you've got any download problems, or spot some bugs, please log an issue in the github repo.

Download: Super new to R version

Getting R and Install Adjutant

Maybe you want to use Adjutant (you read about it, you saw it somewhere, you're my mom), but you don't really know what R is, so you're not sure where to start. Here's how to get going:

You may run into trouble if you work in a place that prevents you from downloading and installing applications to your computer - this means you might need to us a home computer (but also talk to your IT team about why R is the best thing ever).

Once you have downloaded R and R Studio, you will open RStudio in order to install Adjutant and some additional packages. You can also view the video tutorial of how to install Adjutant here: https://vimeo.com/259442489

Within R or the RStudio console window (left, bottom), type the following:

install.packages("devtools")

Now we can install Adjutant! So type the following:

devtools::install_github("amcrisan/Adjutant")

Demo

Adjutant can be used through its attendant Shiny Application or within your own R script, although the design of Adjutant is driven more towards the Shiny App usage. Adjutant can be used just for looking up and downloading articles from PubMed to your computer, or performing a topic clustering analysis and/or a sophisticated document sampling.

Using the Adjuntant Shiny App

Video coming soon

With Adjutant installed, you just need to type two commands to get it going.

library(adjutant) #this gets R ready to run Adjutant
runAdjutant() #this will launch Adjutant user interface

That's it! Have a lot of fun exploring!

Using Adjutant within a R script

It is also possible to use Adjutant within your own code, bypassing the Shiny application all together.

Step 0: Load Adjutant

To run this demo the first thing you need to do is load Adjutant (once you've downloaded it), and some additional packages for analysis.

library(adjutant)
library(dplyr)
library(ggplot2)
library(tidytext) #for stop words

#also set a seed - there is some randomness in the analysis.
set.seed(416)

Step 1: Downloading data from PubMed to your computer

processSearch is Adjutant's PubMed search function and is effectively a wrapper for RISmed that formats RISmed's output into a clean data frame, with additional PubMed metadata (PubMed central citation count, language, article type etc). You can pass RISmed's EUtilsSummary parameters to the Adjutant's processSearch function.

Please note that Adjutant's downstream methods expect a dataframe with the column names that are produced by the processSearch method.

Depending upon the size of the query, it can take a few minutes to download all of the data.

df<-processSearch("(outbreak OR epidemic OR pandemic) AND genom*",retmax=2000)

Step 2: Generating a tidy text corpus

Adjutant next constructs a per-article "bag-of-words" arranged in a tidy text format. This step is necessary for further analysis in Adjutant, but can also be an interesting data set for other kinds of analysis not supported by Adjutant.

tidy_df<-tidyCorpus(corpus = df)

Step 3: Performing a dimensionality reduction using t-SNE

To learn more about t-SNE, please consult this excellent article in Distill Pub. Generally, Adjutant's topic clustering works best when there are a large number of diverse articles. It is still possible to get reasonable results with a smaller number of articles (fewer than 1,000 for example), and with a very homogenous dataset (for example just recent zika virus articles), but more general searches with large number of documents work best.

tsneObj<-runTSNE(tidy_df,check_duplicates=FALSE)

#add t-SNE co-ordinates to df object
df<-inner_join(df,tsneObj$Y,by="PMID")

# plot the t-SNE results
ggplot(df,aes(x=tsneComp1,y=tsneComp2))+
  geom_point(alpha=0.2)+
  theme_bw()

Below is an initial plot of our documents! Each point is one article, darker points means many overlapping articles, and you'll notice that some articles already form visible groups (clusters), in the next step we'll suss out which clusters are reasonably well defined (according to the algorithm).

Step 4: Perform an unsupervised clustering using hdbscan

If you use hdbscan through Adjutant, the result will be the Adjuntant optimized hdbscan minPts parameter. You can also just use hdbscan directly on the result t-SNE dimensionally reduced data and sort out what the best hbdscan minPts parameter is for yourself.

#run HDBSCAN and select the optimal cluster parameters automaticallu
optClusters <- optimalParam(df)

#add the new cluster ID's the running dataset
df<-inner_join(df,optClusters,by="PMID") %>%
    mutate(tsneClusterStatus = ifelse(tsneCluster == 0, "not-clustered","clustered"))

# plot the HDBSCAN clusters (no names yet)
clusterNames <- df %>%
  dplyr::group_by(tsneCluster) %>%
  dplyr::summarise(medX = median(tsneComp1),
                   medY = median(tsneComp2)) %>%
  dplyr::filter(tsneCluster != 0)

ggplot(df,aes(x=tsneComp1,y=tsneComp2,group=tsneCluster))+
  geom_point(aes(colour = tsneClusterStatus),alpha=0.2)+
  geom_label(data=clusterNames,aes(x=medX,y=medY,label=tsneCluster),size=2,colour="red")+
  stat_ellipse(aes(alpha=tsneClusterStatus))+
  scale_colour_manual(values=c("black","blue"),name="cluster status")+
  scale_alpha_manual(values=c(1,0),name="cluster status")+ #remove the cluster for noise
  theme_bw()

Below is the same plot as before, but now we've got some clusters that Adjutant thinks are reasonable.

Step 5: Naming the clusters

Adjutant has a function called getTopTerms that will automatically name a cluster according to its the top two most commonly occuring terms. If there are ties, it will return all.

clustNames<-df %>%
          group_by(tsneCluster)%>%
          mutate(tsneClusterNames = getTopTerms(clustPMID = PMID,clustValue=tsneCluster,topNVal = 2,tidyCorpus=tidy_df)) %>%
          select(PMID,tsneClusterNames) %>%
          ungroup()
        
#update document corpus with cluster names
df<-inner_join(df,clustNames,by=c("PMID","tsneCluster"))

#re-plot the clusters

clusterNames <- df %>%
  dplyr::group_by(tsneClusterNames) %>%
  dplyr::summarise(medX = median(tsneComp1),
                   medY = median(tsneComp2)) %>%
  dplyr::filter(tsneClusterNames != "Noise")

ggplot(df,aes(x=tsneComp1,y=tsneComp2,group=tsneClusterNames))+
  geom_point(aes(colour = tsneClusterStatus),alpha=0.2)+
  stat_ellipse(aes(alpha=tsneClusterStatus))+
  geom_label(data=clusterNames,aes(x=medX,y=medY,label=tsneClusterNames),size=3,colour="red")+
  scale_colour_manual(values=c("black","blue"),name="cluster status")+
  scale_alpha_manual(values=c(1,0),name="cluster status")+ #remove the cluster for noise
  theme_bw()

Finally, here's the same plot again, but now with the names of the clusters. You'll notice that some clusters have very specific names (zika-viru; ebola viru) and some have more generic names (sequenc-isol-outbreak). My hypothesis for what is happening is a classic signal issue - some pathogens with only a few articles seem to cluster together into these more generic clusters, whereas pathogens with a lot of publications (and hence strong signal in the data) tend to form their own clusters.

Step 6: Go forth and analyze some more!

You can use df, tidy_df, or any other object produced by Adjutant in your own analysis!

About

Runs a pubmed query, returns results and allows user to explore high-level structure of returned documents

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • R 99.2%
  • CSS 0.8%