BioSolr.html

---
layout: default
title: BioSolr
group: "in_local_navigation"
order: 3
---
<div class="blurb">
	<h3>A BioSolr Ontology Expansion plugin for Solr</h3>
	<p>BioSolr aims to significantly advance the state of the art with regard to 
		indexing and querying biomedical data with freely available open source software
	</p>
	<p>This is a short interactive introduction to the BioSolr Ontology Expansion plugin for Solr.
		If you are interested in using elasticsearch instead, please refer to the BioSolr elasticsearch tutorial 
		<a href="https://github.com/EBISPOT/BioSolr/tree/master/tutorial/elasticsearch" target="_blank">here</a>
	</p>
	<p>In this tutorial you will learn how to:</p>
	<ul>
		<li><a href="#install_solr">Install Solr</a></li>
		<li><a href="#index_sample_data">Index some example data with embedded ontology annotations</a></li>
		<li><a href="#search_data">Use an example web application to search this data</a></li>
		<li><a href="#install_biosolr_plugins">Install BioSolr plugins to ontology-enrich your Solr index</a></li>
		<li><a href="#configure_biosolr">Configure BioSolr</a></li>
		<li><a href="#reindex_data">Reindex our data to take advantage of ontology enrichment</a></li>
		<li><a href="#ontology_powered_search">Perform ontology-powered searches</a></li>
		<li><a href="#conclusions">Conclusions</a></li>
	</ul>
	<h4>Prerequisites</h4>
	<p>Java 8 - both Solr and the example web application rely on Java 8 to run.</p>
	
	<h4 id="install_solr">Install Solr</h4>
	<p>Download Solr from <a href="http://lucene.apache.org/solr/" target="_blank">http://lucene.apache.org/solr/</a> (Pick .tgz or .zip file as you prefer)</p>
	<p>Unpack the Solr download into your preferred directory - for this demo, we're going to use ~/Applications/solr-5.3.1/</p>
	<p>Once you've done this, verify you can startup Solr:</p>
	<blockquote>cd ~/Applications/solr-5.3.1/<br/>
	bin/solr start</blockquote>
	<p>Now open a browser and check we can see the Solr admin page at <a href="http://localhost:8983/" target="_blank">http://localhost:8983/</a></p>
	<p>Now we're happy we can run Solr, so let's shut down our Solr server again...</p>
	<blockquote>bin/solr stop</blockquote>
	<p>...and now get the extra BioSolr stuff. We'll be using this later in the tutorial</p>
	<p>Check out the code for this demo into your preferred directory - we're going to be using ~/Projects/:</p>
	<blockquote>cd ~/Projects<br/>
	git clone https://github.com/EBISPOT/BioSolr.git</blockquote>
	<p>We've supplied the required configuration to get us up and running quickly, so let's start a new Solr instance that uses this config:</p>
	<blockquote>cd ~/Projects/BioSolr/tutorial/solr<br/>
	export SOLR_DIR=~/Applications/solr-5.3.1<br/>
	solr-start.sh</blockquote>
	<p>Now reload your page at <a href="http://localhost:8983/" target="_blank">http://localhost:8983/</a> - this time, you should see a new documents core. If so, great! Now we're ready to start indexing some data.</p>
	
	<h4 id="index_sample_data">Index some example data with embedded ontology annotations</h4>
	<p>We've supplied some sample data to demonstrate how BioSolr can index information annotated with ontology terms. 
		This data is stored in the file <i>tutorial/data/gwas-catalog-annotation-data.csv</i> - you can open this file in Excel and take a 
		look at it if you like. You'll see this file contains a column <i>efo_uri</i> which contains annotations to the Experimental 
		Factor Ontology.
	</p>
	<p> For reference: Data is taken from the GWAS Catalog: <a href="http://www.ebi.ac.uk/gwas" target="_blank">http://www.ebi.ac.uk/gwas</a> Ontology information is from the 
		Experimental Factor Ontology: <a href="http://www.ebi.ac.uk/efo" target="_blank">http://www.ebi.ac.uk/efo</a>
	</p>
	<p>Now let's index this data. You can upload CSV files directly to Solr using curl:</p>
	<blockquote>curl http://localhost:8983/solr/documents/update --data-binary @../data/gwas-catalog-annotation-data.csv -H 'Content-type:application/csv'<br/>
	curl http://localhost:8983/solr/documents/update --data '&lt;commit/&gt;' -H 'Content-type:text/xml; charset=utf-8'</blockquote>
	<p>The first command here uploads the data, the second one commits the changes and makes your newly indexed documents visible.</p>
	<p>Now, if you open the admin interface once more and look at the overview of the documents 
		core <a href="http://localhost:8983/solr/#/documents" target="_blank">http://localhost:8983/solr/#/documents</a>, you should see we now have 26,385 documents in our index.
	</p>
	<p>We can also inspect some of the data by browsing to the query page <a href="http://localhost:8983/solr/#/documents/query" target="_blank">http://localhost:8983/solr/#/documents/query</a>
		and running a standard query - you can see that the documents in our index correspond to rows in our spreadsheet.
	</p>
	
	<h4 id="search_data">Use an example web application to search this data</h4>
	<p>Next, we've supplied you with a simple web application to search our new Solr index. Let's try running this.</p>
	<blockquote>cd ~/Projects/BioSolr/tutorial/tools<br/>
	java -jar webapp-1.0-SNAPSHOT.jar server webapp.yml</blockquote>
	<p>You should see a bunch of logging information as the web application starts up. If all goes well, you'll eventually see a message like this:</p>
	<blockquote>INFO  [2015-12-06 18:50:49,002] org.eclipse.jetty.server.Server: Started @1637ms</blockquote>
	<p>Now we can open up our example application by browsing to <a href="http://localhost:8080/" target="_blank">http://localhost:8080/.</a></p>
	<p>Let's try some example searches. Most of the GWAS data is concerned with the links between SNPs and diseases - so let's try a search for <i>Lung cancer</i>. 
		Looks like we get 31 results, all containing lung cancer in the title or the association line, so for example:</p>
	<ol><li>Deciphering the impact of common genetic variation on lung cancer risk: a genome-wide association study. <br/>Broderick P - Cancer Res. <br/>rs4254535 is associated with <i>Lung cancer</i></li></ol>
	<p>But what if we want all lung diseases? We could try searching for <i>lung disease</i> - but this only gives us 3 results, 
		probably not what we want. We can try just searching for <i>lung</i>, which looks a little better - 49 results this time, some 
		of them are lung cancer but there's also stuff about lung function, so this isn't ideal either.
	</p>
	<p>Let's try another example. We could search for <i>schizophrenia</i> - this gives us nice results, we get 51 documents that seems to 
		be annotated to schizophrenia or treatment responses in schizophrenia. But what if we want other mental disorders? If we 
		search for <i>mental disorder</i> we get no results.
	</p>
	<p>BioSolr's ontology expansion plugin will help us improve these search results a lot. So let's kill our search web application and move on to installing BioSolr.</p>
	
	<h4 id="install_biosolr_plugins">Install BioSolr plugins to ontology-enrich your Solr index</h4>
	<p>Now we're going to try and improve our search results using the structure of the ontology. To do this, we need to add the BioSolr ontology expansion plugin.</p>
	<p>Before we start, let's shutdown our running Solr server</p>
	<blockquote>cd ~/Projects/BioSolr/tutorial/solr/<br/>
	solr_stop.sh</blockquote>
	<p>Now, take the BioSolr plugin jar file out of the <i>plugins/</i> directory and copy it into our Solr setup.</p>
	<blockquote>mkdir solr-conf/documents/lib<br/>
	cp ../plugins/solr-ontology-update-processor-0.5.jar solr-conf/documents/lib/</blockquote>
	<p>
		The code for this plugin is under the `ontology/ontology-annotator` directory. <br/>
		We've installed our plugin, but we need to do a bit of reconfiguration to make Solr use it.
	</p>
	
	<h4 id="configure_biosolr">Configure BioSolr</h4>
	<p>Now we need to modify our Solr configuration to make use of this plugin.</p>
	<p><b>Adding an Ontology Lookup Processor Chain</b></p>
	<p>We need to add our plugin - an OntologyUpdateProcessor - to our default processor chain in Solr, so that Solr will perform the ontology lookup from our configured field.</p>
	<p>You'll need to edit <i>solr-conf/documents/conf/solrconfig.xml</i>. Scroll down to line 619 and uncomment this block:</p>
	<pre>
		&lt;!-- Ontology lookup processor chain --&gt;

		  &lt;updateRequestProcessorChain name="ontology"&gt;
		    &lt;processor class="uk.co.flax.biosolr.solr.update.processor.OntologyUpdateProcessorFactory"&gt;
		      &lt;bool name="enabled"&gt;true&lt;/bool&gt;
		      &lt;str name="annotationField"&gt;efo_uri&lt;/str&gt;

		      &lt;str name="olsBaseURL"&gt;http://www.ebi.ac.uk/ols/api&lt;/str&gt;
		      &lt;str name="olsOntology"&gt;efo&lt;/str&gt;
		    &lt;/processor&gt;
		    &lt;processor class="solr.LogUpdateProcessorFactory" /&gt;
		    &lt;processor class="solr.DistributedUpdateProcessorFactory" /&gt;
		    &lt;processor class="solr.RunUpdateProcessorFactory" /&gt;
		  &lt;/updateRequestProcessorChain&gt;
	</pre>
	<p>Let's take a look at this in a bit of detail. We're setting up a processor chain called 'ontology' that uses the class 
		<i>OntologyUpdateProcessorFactory</i> that has been developed as part of BioSolr. We've told it to look for annotation in the 
		'efo_uri' field, and told it where the Ontology Lookup Service (OLS) is located. We've also told it to use the ontology 
		'efo' from OLS.
	</p>
	<p>Now, we need to add this update request processor chain to our update request handler, so it is used whenever we update data.</p>
	<p>Still editing <i>solr-conf/documents/conf/solrconfig.xml</i>, scroll back up to line 431 and uncomment this block:</p>
	<pre>
		&lt;requestHandler name="/update" class="solr.UpdateRequestHandler"&gt;
	   	  &lt;!--  See below for information on defining
		 	updateRequestProcessorChains that can be used by name
		 	on each Update Request
	      	  --&gt;
	          &lt;lst name="defaults"&gt;
		 	&lt;str name="update.chain"&gt;ontology&lt;/str&gt;
	          &lt;/lst&gt;
	  	&lt;/requestHandler&gt;
	</pre>
	<p>Now we've reconfigured our server, we just have to restart...</p>
	<blockquote>cd ~/Projects/BioSolr/tutorial/solr/<br/>
	solr-start.sh
	</blockquote>
	
	<h4 id="reindex_data">Reindex our data to take advantage of ontology enrichment</h4>
	<p>This bit is simple - we can just rerun our indexing process from earlier...</p>
	<blockquote>curl http://localhost:8983/solr/documents/update --data-binary @../data/gwas-catalog-annotation-data.csv -H 'Content-type:application/csv'<br/>
	curl http://localhost:8983/solr/documents/update --data '&lt;commit/&gt;' -H 'Content-type:text/xml; charset=utf-8'
	</blockquote>
	<p>You'll notice this takes a while longer than it did earlier - this is because this time, as we index our data we're calling out to OLS to expand our data using the ontology. Hopefully network access is up to it!</p>
	<p>If you open the Solr admin interface again <a href="http://localhost:8983/solr/#/documents" target="_blank">http://localhost:8983/solr/#/documents</a>, we should still have 26,385 documents. 
		But if we do a query - <a href="http://localhost:8983/solr/#/documents/query" target="_blank">http://localhost:8983/solr/#/documents/query</a> - you should see a lot more information than we had 
		earlier, including some new fields that weren't in our spreadsheet.
	</p>
	
	<h4 id="ontology_powered_search">Perform ontology-powered searches</h4>
	<p>Now let's go back to our web application and see if we can take advantage of all this extra information.</p>
	<p>Restart the application again:</p>
	<blockquote>cd ~/Projects/BioSolr/tutorial/tools<br/>
	java -jar webapp-1.0-SNAPSHOT.jar server webapp.yml
	</blockquote>
	<p>You'll straight away notice something new - lots of additional checkboxes (you might need to reload your page). These are present because our webapp has noticed that we have additional ontology fields in our data.</p>
	<p>Now let's go back to our earlier searches. If you remember, we tried looking for <i>lung cancer</i> and we got 31 results. We should be able to do the same search again, and get the same results.</p>
	<p>Then, we tried <i>lung disease</i> and only got 3 results. Again, we should be able to verify this. But now let's check the box to use ontology expansion:</p>
	<input type="checkbox" onclick="return false;" checked="true">Include parent labels <br/>
	<p>Now if we rerun the search, we should see 53 results, across a whole variety of lung disease. One of our results, for example, should look like this:</p>
	<ol><li>Variants in FAM13A are associated with chronic obstructive pulmonary disease.<br/>
	Cho MH - Nat Genet.<br/>
	rs7671167 is associated with Chronic obstructive pulmonary disease.<br/>
	Annotation chronic obstructive pulmonary disease [http://www.ebi.ac.uk/efo/EFO_0000341].<br/>
	Children chronic bronchitis.<br/>
	Parent(s) lung disease.<br/>
	Has disease location trachea lung.</li></ol>
	<p>This looks much more like it! But we can even go one better than this - let's try searching for <i>lung</i> again. Uncheck all the boxes so we get 49 results. This time, though, let's also include diseases which are located in the lung...</p>
	<input type="checkbox" onclick="return false;" checked="true">Has disease location <br/>
	<p>Now you should see that we have 75 results; we're using relationships in the ontology to improve our results. For example, one of our results looks like this:</p>
	<ol><li>Genome-wide association study identifies BICD1 as a susceptibility gene for emphysema.<br/>
	Kong X - Am J Respir Crit Care Med.<br/>
	rs641525 is associated with Emphysema-related traits.<br/>
	Annotation emphysema [http://www.ebi.ac.uk/efo/EFO_0000464].<br/>
	Parent(s) lung disease.<br/>
	Has disease location lung.<br/>
	</li></ol>
	<p>If you look closely, you'll see that "lung" is not mentioned anywhere in our data, only the extra fields that have come from the ontology. 
		We'd actually have picked this result up by including parents (<i>lung disease</i>), but this is only because EFO nicely defines 
		hierarchy. If we were using a different ontology with a different hierarchy (maybe one which doesn't use hierarchy in the ways 
		we'd like), we can use a relationship other than <i>is a</i> to find this result.
	</p>
	<p>Next, we tried searching for <i>schizophrenia</i>. Let's try this again - yep, still 51 results. 
		You'll notice if we include parent terms, we still get 51 results - our order might shuffle around a bit though. 
		This isn't unexpected - most of our data about schizophrenia should be nicely mapped to a specific term and would 
		include the text "schizophrenia" in the title or the annotation line. But last time we tried to search for 
		other <i>mental disorder</i> and found no results at all. Now, if we search including child and parent labels, we get more than 100 results! 
		This covers a wide range of disorders, like "schizophrenia", "bipolar disorder" and many more. For example:
	</p>
	<ol><li>Cross-disorder genomewide analysis of schizophrenia, bipolar disorder, and depression.<br/>
	Huang J - Am J Psychiatry<br/>
	rs1001021 is associated with Schizophrenia, bipolar disorder and depression (combined)<br/>
	Annotation bipolar disorder [http://www.ebi.ac.uk/efo/EFO_0000289]<br/>
	Parent(s) mental or behavioural disorder
	</li></ol>
	<p>This shows the power of including additional information from the ontology in your Solr index. You'll also see the search is just as fast as it was previously: by including extra data when we built our index, we have almost no penalty in search time - which is usually the best option for users.</p>
	
	<h4 id="conclusions">Conclusions</h4>
	<p>This is the end of the BioSolr ontology expansion tutorial. You've seen how to install Solr, load some example data, extend Solr with the ontology expansion plugin developed by BioSolr into a Solr installation, and you've seen some of the extra features this plugin can give you.</p>
	<p><i>BioSolr is funded by the BBSRC Flexible Interchange Programme (FLIP) grant number BB/M013146/1</i></p>
</div><!-- /.blurb -->