This repository will not be updated and will be kept available in read-only mode. Refer to https://github.com/IBM/sms-spam-filter-using-hortonworks for a similar example.
Recommendation engines are one of the most well known, widely used and highest value use cases for applying machine learning. Despite this, while there are many resources available for the basics of training a recommendation model, there are relatively few that explain how to actually deploy these models to create a large-scale recommender system.
This code pattern demonstrates the key elements of creating such a system by using Apache Spark and HDP Search (Solr). Note that this pattern is a port of Nick Pentreath's Recommender built with Elasticsearch and Apache Spark, but with a focus on using Hortonworks Data Platform (HDP) and IBM's Data Science Experience Local (DSX Local).
What is HDP and HDP Search? Hortonworks Data Platform (HDP) is a massively scalable platform for storing, processing and analyzing large volumes of data. HDP consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, Zookeeper and Ambari. HDP Search provides applications and tools for indexing content from your HDP cluster to Solr (an open source enterprise search platform).
What is DSX Local? DSX Local is an on premises solution for data scientists and data engineers. It offers a suite of data science tools that integrate with RStudio, Spark, Jupyter, and Zeppelin notebook technologies. And yes, it can be configured to use HDP, too.
This repo contains a Jupyter notebook illustrating how to use Spark for training a collaborative filtering recommendation model from ratings data stored in Solr, saving the model factors to Solr, and then using Solr to serve real-time recommendations using the model. The data you will use comes from MovieLens and is a common benchmark dataset in the recommendations community. The data consists of a set of ratings given by users of the MovieLens movie rating system to various movies. It also contains metadata (title and genres) for each movie.
When you have completed this code pattern, you will understand how to:
- Ingest and index user event data into Solr using the Solr Spark connector
- Load event data into Spark DataFrames and use Spark's machine learning library (MLlib) to train a collaborative filtering recommender model
- Export the trained model into Solr
- Using a custom Solr plugin, compute personalized user and similar item recommendations and combine recommendations with search and content filtering
- Load the movie dataset into Apache Hadoop HDFS.
- Use Spark DataFrame operations to clean the dataset and use Spark MLlib to train a collaborative filtering recommendation model.
- Save the resulting model into Apache Solr.
- The user can run the provided notebook in IBM's Data Science Experience Local.
- As the notebook runs, Apache Livy will be called to interact with the Spark service in HDP.
- Using Solr queries and a custom vector scoring plugin, generate example recommendations.
- When necessary, retrieve information about movies, such as poster images, using The Movie Database APIs.
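For example, the movie metadata and poster lookups in the last step are performed with the `tmdbsimple` package. Below is a minimal, illustrative sketch of such a call; the API key is a placeholder you obtain from The Movie Database, and the movie id is just an example value.

```python
# Minimal sketch of a TMDb metadata lookup using the tmdbsimple package.
# Assumes tmdbsimple is installed and you have your own TMDb API key (placeholder below).
import tmdbsimple as tmdb

tmdb.API_KEY = 'YOUR_API_KEY'   # replace with your own key

movie = tmdb.Movies(1893)       # example TMDb movie id
info = movie.info()             # returns title, overview, poster_path, etc.
print(info['title'], info.get('poster_path'))
```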
- IBM Data Science Experience Local: An out-of-the-box on premises solution for data scientists and data engineers. It offers a suite of data science tools that integrate with RStudio, Spark, Jupyter, and Zeppelin notebook technologies.
- Apache Spark: An open-source, fast and general-purpose cluster computing system.
- Hortonworks Data Platform (HDP): HDP is a massively scalable platform for storing, processing and analyzing large volumes of data. HDP consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, Zookeeper and Ambari.
- HDP Search: HDP Search provides applications and tools for indexing content from your HDP cluster to Solr.
- Apache Livy: Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface.
- Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
- Artificial Intelligence: Artificial intelligence can be applied to disparate solution spaces to deliver disruptive technologies.
- Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.
The core of this code pattern is integrating Hortonworks Data Platform (HDP) and IBM DSX Local. If you do not already have an HDP cluster available for use, you will need to install one before attempting to complete the code pattern.
To install HDP v2.6.4, please follow the installation guide provided by Hortonworks. It first requires the installation of the Apache Ambari management platform, which is then used to facilitate the HDP cluster installation. The Ambari Server is also required to complete a number of steps described in the following sections.
Note: Ensure that your Ambari Server is configured to use `Python v2.7`.
Once your HDP cluster is deployed, at a minimum, install the following services as listed in this Ambari Server UI screenshot:
Note:

- This code pattern requires that version `2.2.0` of the `Spark2` service be installed.
- Details on how to install and configure the `Solr` service are discussed in the Setup HDP Search section below.
From your Ambari Server UI, disable the Apache Livy server's CSRF protection by changing the `livy.server.csrf_protection.enable` variable to `false`. To navigate to the panel, click on the `Spark2` service, then the `Configs` tab, and then open the `Advanced livy2-conf` section.
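Once the change has taken effect, you can optionally confirm that Livy is reachable over its REST interface. This is a minimal sketch, assuming Livy listens on its default port `8999`; the host name is a placeholder.

```python
# Quick reachability check against the Livy REST API (GET /sessions).
import requests

LIVY_URL = 'http://livy_host:8999'   # placeholder host name

resp = requests.get(LIVY_URL + '/sessions',
                    headers={'Content-Type': 'application/json'})
print(resp.status_code)   # 200 means Livy is up and answering REST calls
print(resp.json())        # lists any currently active Spark sessions
```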
Follow these steps to set up the proper environment to run our Recommender notebook locally.
- Setup HDP Search
- Download and install the Solr Spark connector
- Setup DSX Desktop
- Clone the repo
- Download and move the data to HDFS
- Setup python plugins
- Launch the notebook
- Run the notebook
This code pattern was tested with HDP Search (`Solr`) v6.6.2 and assumes that it is started in `cloud mode`. Minimally, this can be accomplished by performing the following steps:
- Install the HDP Solr Management Pack.
- Install Solr using the Ambari Server UI. `Solr` should be listed as one of the available services after clicking the `Action -> Add Service` button.
- Verify that the Solr server is running. To accomplish this, either:
  - Access the Solr admin dashboard: `http://solr_host:8983/solr`

    OR

  - Run the `solr status` command:

    ```
    cd <solr_installation_dir>/bin
    ./solr status
    ```
- Install the `Solr vector scoring plugin`. The installation instructions and plugin jar file can be found in this repository.
To help simplify and consolidate the instructions, the important steps are repeated below:
Note: The following instructions assume the installation directory for Solr is `/opt/lucidworks-hdpsearch/solr`.

- Download the VectorPlugin.jar file.
- Create the directory `plugins` in the `dist` directory of the Solr installation location.
- Copy `VectorPlugin.jar` to `/opt/lucidworks-hdpsearch/solr/dist/plugins/`.
- Add the following lines to the `solrconfig.xml` file located in `/opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf`:

  ```
  <lib dir="${solr.install.dir:../../../..}/dist/plugins/" regex=".*\.jar" />
  <queryParser name="vp" class="com.github.saaay71.solr.VectorQParserPlugin" />
  ```
- Add the following lines to the `managed-schema` file located in `/opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf`:

  ```
  <fieldType name="VectorField" class="solr.TextField" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" termVectors="true" storeOffsetsWithPositions="true">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
    </analyzer>
  </fieldType>
  <field name="vector" type="VectorField" indexed="true" termOffsets="true" stored="true" termPositions="true" termVectors="true" multiValued="true"/>
  ```
Note: When making changes to the `solrconfig.xml` and `managed-schema` files, take care not to copy in any extraneous spaces or characters.

To test out your changes, try out the samples shown in the Examples section of the plugin's README. This will require you to create test collections - see the Running Solr Docs for instructions on how to do that.
- Restart the Ambari server: `ambari-server restart`
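Once Solr is up and the plugin is in place, you can sanity-check the installation from Python. The sketch below uses Solr's Collections API to list the collections the SolrCloud instance knows about; the host name is a placeholder.

```python
# List the collections known to the SolrCloud instance via the Collections API.
import requests

SOLR_URL = 'http://solr_host:8983/solr'   # placeholder host name

resp = requests.get(SOLR_URL + '/admin/collections',
                    params={'action': 'LIST', 'wt': 'json'})
print(resp.json().get('collections', []))
```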
This code pattern reads the movie data set and computes the model vectors using `Spark`. The Spark DataFrames representing the movie data set and model vectors are then written to `Solr` using the `Solr Spark connector`. The Solr connector for Spark can be downloaded from the Lucidworks Spark-Solr repository.
This code pattern was tested with version 3.5.1 of the connector. The Spark configurations need to be changed to add the connector jars in the Spark driver and executor classpath. This can be done by following the steps below.
Special consideration: The Solr connector jar is available in two flavors. The first one (non-shaded) only carries classes specific to the connector. The second one (shaded) carries the connector specific classes along with all its dependencies. Ideally, we would want the shaded version that contains the dependency classes. However, it contains the full set of Spark and Scala classes which will conflict with the Spark runtime when it is specified in the Spark classpath. To work around this problem, we will need to remove all of the Spark and Scala classes from this jar.
To download and install the connector:
- Download the jar to a backup directory
- Create a temporary work directory and `cd` to it
- Extract the jar using `jar -xvf <location of jar file>`
- Run `rm -rf scala`
- Run `rm -rf org/apache/spark`
- Rebuild the jar by using `jar -cvf spark-solr-3.5.1-shaded-modified.jar *`
- Log into the Ambari Server UI
- Select the `Spark2` service and select the `Configs` tab
- Open the `Custom spark2-defaults` properties section
- Add the following property keys:
  - `spark.driver.extraClassPath`
  - `spark.executor.extraClassPath`
  - `spark.jars`
- Set the key values to the path of the Spark Solr connector jar (e.g. `/home/user1/spark-solr-3.5.1-shaded-modified.jar`).
Note: In the example above, the path is a local system path. In case of a multi-node HDP cluster, please make sure that this jar is available under the same path on all of the nodes. Another option is to put the jar in HDFS and specify the HDFS location.
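To illustrate what the connector provides once it is on the classpath, the sketch below writes a small DataFrame to a Solr collection from PySpark. It is a sketch only: the ZooKeeper connection string and collection name are placeholders, and the `zkhost`/`collection` options follow the Lucidworks spark-solr documentation.

```python
# Illustrative sketch: write a DataFrame to Solr with the spark-solr connector.
# Assumes the modified connector jar is on the driver/executor classpath and the
# target SolrCloud collection already exists (placeholder names below).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("solr-connector-sketch").getOrCreate()

# Include an "id" column to satisfy Solr's default uniqueKey field.
df = spark.createDataFrame(
    [("1_31", 1, 31, 2.5), ("1_1029", 1, 1029, 3.0), ("2_10", 2, 10, 4.0)],
    ["id", "userId", "movieId", "rating"],
)

(df.write
   .format("solr")
   .option("zkhost", "zk_host:2181")      # placeholder ZooKeeper connection string
   .option("collection", "test_ratings")  # placeholder target collection
   .mode("overwrite")
   .save())
```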
This code pattern was tested using DSX Desktop, a lightweight version of DSX Local intended for standalone use and optimized for local development. For information on how to set it up, see the DSX Desktop Install Docs.
For production deployments it is recommended to use `DSX Local` with a three-node configuration. For more information, refer to the DSX Local Installation Guide.
When installing `DSX Desktop`, choose `Jupyter Notebooks` with `Anaconda (Python 2.7)`.
Now that HDP and DSX Local are installed and configured, we can proceed with making the two communicate with each other. We start by cloning the `hdp-search-spark-recommender` repository locally. In a terminal, run:
git clone https://github.com/IBM/hdp-search-spark-recommender.git
cd hdp-search-spark-recommender
In this code pattern we will be using the MovieLens dataset, which contains ratings given by a set of users for movies, as well as movie metadata. For the purpose of this code pattern, we can download the "latest small" version of the data set.
This code pattern is targeted to run against a multi-node HDP cluster. Therefore, the dataset needs to be moved to HDFS. Move the data to HDFS by issuing the following commands from the Ambari Server:
mkdir /tmp/data
cd /tmp/data
wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
unzip ml-latest-small.zip
hadoop fs -mkdir /movies
hadoop fs -put *.csv /movies
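With the files in HDFS, the notebook reads them into Spark DataFrames on the cluster. A minimal sketch of that step, run on the HDP side (for example inside a `%%spark` cell), using the paths created above:

```python
# Read the MovieLens CSV files from HDFS into Spark DataFrames.
ratings = spark.read.csv("hdfs:///movies/ratings.csv", header=True, inferSchema=True)
movies = spark.read.csv("hdfs:///movies/movies.csv", header=True, inferSchema=True)

ratings.printSchema()   # userId, movieId, rating, timestamp
ratings.show(5)
```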
This code pattern relies upon a few python plugins for everything to work together. Some plugins are required to be installed in the node where DSX Local or DSX Desktop is installed and the others need to be installed in all the HDP compute nodes. The following table lists which plugins are required and where they need to be installed:
Library | Install Location |
---|---|
tmdbsimple | DSX node and all nodes of HDP cluster |
IPython | DSX node |
paramiko | All nodes of HDP cluster |
numpy | All nodes of HDP cluster |
simplejson | DSX node and all nodes of HDP cluster |
urllib2 | DSX node and all nodes of HDP cluster |
solrcloudpy | DSX node and all nodes of HDP cluster |
Note regarding DSX node plugins: The Jupyter notebook running in DSX Desktop will attempt to install each of the required plugins, so no further action should be needed. If, however, there are issues and a plugin needs to be manually installed, see the Troubleshooting section below for instructions.
The plugins can be installed using the pip command.
If the `pip` command is not available on your platform, please follow the platform-specific instructions to install pip on your machine.
For example:
pip install numpy
Note: Some of the plugins may already be installed/present in your environment. In that case please skip that plugin and move on to the next plugin in the table. Similarly, installing the plugins may require installing some prerequisite plugins on your platform. In this case, please follow the platform-specific instructions to install the prerequisite plugins first.
The notebook provided in this repo should work with Python 2.7.x or 3.x (and has been tested on 2.7.11 and 3.6.1).
To run the notebook you will need to start DSX Local. Below are the steps that need to be performed before running the notebook the very first time.
- Create a new project.
- Create the notebook.
- Load the notebook.
Once created, you can click on the notebook to load and run it. Once running, you will see a number of variables that need to be replaced with values that match your installed environment. These variables are:
- `LIVY_URL`
- `HDFS_URL_FOR_DATA`
- `SOLR_INSTALL_DIR`
- `SOLR_HOST`
- `SOLR_PORT`
- `ZKHOST`
- `SSHUSER`
- `SSHPASSWORD`
- `tmdb.API_KEY`
Some variables, such as `tmdb.API_KEY` and `SOLR_HOST`, are defined twice in the notebook as they are needed on both the DSX node and the HDP cluster nodes.
Note that in the notebook, the HDP cluster nodes are accessed via the Livy interface. Cells that access the HDP cluster are distinguishable by the following tag located at the top of the cell:
%%spark
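For illustration, a cluster-side cell looks roughly like the sketch below. The `SOLR_HOST` value is a placeholder and is re-defined here because cells tagged with `%%spark` execute in the Livy session on the HDP cluster and do not see variables set in local (DSX-side) cells.

```python
%%spark
# Everything in this cell is shipped to the Spark session on the HDP cluster via Livy.
SOLR_HOST = "solr_host"   # placeholder; re-defined because cluster cells cannot
                          # see variables defined in local notebook cells

ratings = spark.read.csv("hdfs:///movies/ratings.csv", header=True, inferSchema=True)
print(ratings.count())
```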
When a notebook is executed, what is actually happening is that each code cell in the notebook is executed, in order, from top to bottom.
Each code cell is selectable and is preceded by a tag in the left margin. The tag format is `In [x]:`. Depending on the state of the notebook, the `x` can be:
- A blank, which indicates that the cell has never been executed.
- A number, which represents the relative order in which this code step was executed.
- An `*`, which indicates that the cell is currently executing.
There are several ways to execute the code cells in your notebook:
- One cell at a time.
  - Select the cell, and then press the `Play` button in the toolbar. You can also hit `Shift+Enter` to execute the cell and move to the next cell.
- Batch mode, in sequential order.
  - From the `Cell` menu bar, there are several options available. For example, you can `Run All` cells in your notebook, or you can `Run All Below`, which will start executing from the first cell under the currently selected cell, and then continue executing all cells that follow.
After classifying the data by movies, ratings and genres, we train our recommender model.
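The training step uses the alternating least squares (ALS) implementation in Spark MLlib. Below is a minimal sketch of that step; the column names follow the MovieLens ratings file, and the hyperparameter values are illustrative rather than the notebook's exact settings.

```python
# Train a collaborative filtering model with ALS on the ratings DataFrame.
from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    rank=20,        # size of the latent factor vectors later stored in Solr
    regParam=0.1,
    seed=42,
)
model = als.fit(ratings)

# model.userFactors and model.itemFactors hold the factor vectors that are
# written to Solr for serving recommendations.
model.itemFactors.show(5, truncate=False)
```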
After loading our model into Solr, we can predict which movies people will like based on ratings of similar movies in the same genre.
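Serving works by asking Solr to score documents against a factor vector using the vector scoring plugin's `vp` query parser configured earlier. The sketch below is illustrative: the host name is a placeholder, the vector values are made up, and the query syntax follows the plugin's README, so adjust it to your schema.

```python
# Query Solr for the items most similar to a given factor vector using the
# "vp" query parser registered in solrconfig.xml.
import requests

SOLR_URL = "http://solr_host:8983/solr"       # placeholder host name
query_vector = "0.12,0.77,0.45,0.91,0.33"     # e.g. a user factor from the ALS model

params = {
    "q": '{!vp f=vector vector="%s"}' % query_vector,
    "fl": "id,title,score",
    "rows": 10,
    "wt": "json",
}
resp = requests.get(SOLR_URL + "/movies_vector/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc)
```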
The example output in the data/examples folder shows the output of the notebook after running it in full. View it here.
- Error: DSX Desktop fails to respond.

  Solution: Perform the following steps:

  - Close IBM DSX Desktop
  - Kill all the processes related to DSX Local from a terminal:

    ```
    docker ps | grep dsx
    docker rm -f <container id>
    ```
  - Remove the DSX Desktop metadata:

    ```
    cd ~/Library/Application\ Support/
    rm -rf ibm-dsx-desktop
    ```
  - Restart DSX Desktop
  - Load the Jupyter notebook again
- Error: Missing Required Header for CSRF protection.

  If you see this error when trying to create a Spark session through Livy (the cell below), it means that you need to disable CSRF protection in the Livy properties.

  ```
  livyURL="http://host name:8999"
  %spark add -s spark -l python -u $livyURL
  ```

  Solution: Go to the Disable Apache Livy CSRF protection section for instructions.
- Error: 404 Client Error: Not Found for url: http://your hostname:8983/solr/ratings/schema

  If you see this error when running this cell:

  ```
  %%spark
  setupSchema()
  conn = SolrConnection(SOLR_HOST_PORT, version="6.6.0")
  ```

  go to Solr's UI, click `Logging` on the left side, and check the latest logs. If they show the error:

  ```
  org.apache.solr.common.SolrException: Error CREATEing SolrCore 'movies_vector_shard1_replica1': Unable to create core [movies_vector_shard1_replica1] Caused by: /solr/movies_vector/core_node 1/data/index/write.lock for client 172.17.0.2 already exists
  ```

  Solution: Find your Solr `data/index` directory, which is specified in `conf/solrconfig.xml`, and delete only the `write.lock` file in the `data/index` folder.
- Error: HTTPError: 401 Client Error: Unauthorized for url: https://api.themoviedb.org/3/movie/1893?api_key=...

  If you see this error in your notebook while testing your TMDb API access or generating recommendations, it means you have installed the `tmdbsimple` Python package, but have not set up your API key.

  Solution: You will need to access The Movie Database API and follow these instructions to get an API key. Then copy the key into the `tmdb.API_KEY = 'YOUR_API_KEY'` line in the notebook. Once you have done that, execute that cell to test your access to the TMDb API.
- Error: Notebook running in DSX Desktop is unable to install Python packages or plugins

  Occasionally, this can happen due to the presence of multiple Python environments in the Docker container that runs the notebook. If this occurs, we can install the plugins manually by performing the following steps:
  - Find out the Python environment that is used by the notebook by inserting a cell at the top of the notebook. Enter and run the following commands:

    ```
    !which python
    !which pip
    ```
  - In a terminal, type `docker ps` to list all the Docker containers.
  - Find out the container id of your notebook by looking for the image name `dsx-desktop:ana27`.
  - Start an interactive bash shell on the container by running the following command:

    ```
    docker exec -it <container_id> bash
    ```
  - At the prompt, enter the following command to install the plugin, for example `solrcloudpy` (this example assumes that the `pip` command location determined by Step 1 is `/opt/conda/bin/pip`):

    ```
    /opt/conda/bin/pip install --ignore-installed -U solrcloudpy
    ```
- Teaming on Data: IBM and Hortonworks Broaden Relationship
- Certification of IBM Data Science Experience (DSX) on HDP is a Win-Win for Customers
- An Exciting Data Science Experience on HDP
- Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
- AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos
- Watson Studio: Master the art of data science with IBM's Watson Studio
- Spark on IBM Cloud: Need a Spark cluster? Create up to 30 Spark executors on IBM Cloud with our Spark service