Benchmark Tools
HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput, and system resource utilization. This section describes how to set up and configure HiBench to run various Hadoop-centric test cases using the Pravega Hadoop connectors.
The following benchmark tests are available for measuring Hadoop performance. All of these workloads can potentially be run through the Pravega Hadoop Connector to measure its performance.
- Sort (sort): This workload sorts its text input data, which is generated using RandomTextWriter.
- WordCount (wordcount): This workload counts the occurrences of each word in the input data, which is also generated using RandomTextWriter. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set (see the shell sketch after this list).
- TeraSort (terasort): TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by the Hadoop TeraGen example program.
- Enhanced DFSIO (dfsioe): Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks performing writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster. Note: this benchmark does not have a corresponding Spark implementation.
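As a mental model only (not part of the benchmark suite), the wordcount workload computes the same result as the single-machine shell pipeline below, just distributed across map and reduce tasks; input.txt is a hypothetical local file standing in for the generated input data:
# split on whitespace, then count occurrences of each word
tr -s '[:space:]' '\n' < input.txt | sort | uniq -c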
The steps outlined below set up Hadoop and Pravega as a single-node deployment (quick start) for running the HiBench test suite. Ideally, the tests should be run at scale on a multi-node setup with appropriate resource (CPU/memory/disk) configurations.
Docker version: 1.13.1 or later
Refer here for more details on the Dockerfile configuration files and entry-point scripts to understand how the containers are configured and managed.
Create the ZooKeeper and Hadoop clusters by running the Docker commands below. To install a specific version of ZooKeeper and Hadoop, build the Docker images locally from source.
docker run --rm -ti --name zk -p 2181:2181 harisekhon/zookeeper
docker run --rm -ti --name hadoop -p 8020:8020 -p 8032:8032 -p 8088:8088 -p 9000:9000 -p 10020:10020 -p 19888:19888 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090 harisekhon/hadoop
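A quick sanity check, assuming the port mappings above, is to confirm both containers are running and that the HDFS NameNode web UI answers on port 50070:
docker ps --filter name=zk --filter name=hadoop
curl -s http://localhost:50070/ | head -n 5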
Follow the steps outlined here to run Pravega in standalone mode.
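For reference, when running from a Pravega source checkout, the quick start typically boils down to a single launcher task; the command below is an assumption based on Pravega's quick start and may differ by version, so follow the linked steps:
# start Pravega standalone (controller and segment store in a single process)
./gradlew startStandalone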
- Clone the repository.
- Make sure Maven is installed.
- Build HiBench by running the following command:
mvn -Dspark=2.1 -Dscala=2.11 clean package
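On a single-node setup you will likely want the smallest input sizes. In stock HiBench this is controlled by the scale profile in <HiBench_Root>/conf/hibench.conf; the key name below is taken from upstream HiBench, so verify it against your checkout:
# in <HiBench_Root>/conf/hibench.conf: tiny, small, large, huge, gigantic, or bigdata
hibench.scale.profile tiny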
After the HiBench build is complete, you can run the wordcount test program against a standard Hadoop deployment by following the steps outlined here. The test report can be viewed at <HiBench_Root>/report/hibench.report. It is a summarized workload report, including workload name, execution duration, data size, throughput per cluster, and throughput per node.
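The report is a plain-text file, so it can be inspected directly; each run appends one row. The header shown as a comment below is illustrative of stock HiBench and may vary by version:
cat <HiBench_Root>/report/hibench.report
# Type  Date  Time  Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node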
Currently, we have a sample application that mimics the standard wordcount program with slight modifications to support reading from and writing to Pravega streams. Support for running the additional tests (sort, terasort, dfsioe) will be added to the samples application.
- Clone and build the samples repository:
./gradlew clean installDist
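After installDist completes, locate the examples jar referenced by HADOOP_EXAMPLES_JAR below; the file-name pattern here is an assumption based on that variable's value, so adjust it if your build names the jar differently:
find . -name 'pravega-hadoop-examples-*-all.jar'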
- Configure the environment variables below:
export HDFS=hdfs://<HOSTNAME>:8020
export HADOOP_EXAMPLES_JAR=<PATH_TO_pravega-hadoop-examples-VERSION-all.jar>
export HADOOP_EXAMPLES_INPUT_DUMMY=${HDFS}/tmp/hadoop_examples_input_dummy
export HADOOP_EXAMPLES_OUTPUT=${HDFS}/tmp/hadoop_examples_output
export PRAVEGA_URI=tcp://<HOSTNAME>:9090
export PRAVEGA_SCOPE=myScope
export PRAVEGA_STREAM=myStream
export CMD=wordcount
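For the single-node quick start above, where Hadoop and Pravega both run locally, the host-specific values reduce to:
export HDFS=hdfs://localhost:8020
export PRAVEGA_URI=tcp://localhost:9090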
- Run the command below to ingest test data into the Pravega stream:
/usr/local/hadoop/bin/hadoop jar ${HADOOP_EXAMPLES_JAR} randomtextwriter -D mapreduce.randomtextwriter.totalbytes=32000 ${HADOOP_EXAMPLES_INPUT_DUMMY} ${PRAVEGA_URI} ${PRAVEGA_SCOPE} ${PRAVEGA_STREAM}
- Run the command below to perform the wordcount operation:
/usr/local/hadoop/bin/hadoop jar ${HADOOP_EXAMPLES_JAR} ${CMD} ${HADOOP_EXAMPLES_INPUT_DUMMY} ${PRAVEGA_URI} ${PRAVEGA_SCOPE} ${PRAVEGA_STREAM} ${HADOOP_EXAMPLES_OUTPUT}
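Once the job completes, the word counts can be sanity-checked straight from HDFS. Output file names depend on the number of reduce tasks, hence the wildcard; this assumes the final argument above is the job's output directory:
/usr/local/hadoop/bin/hadoop fs -cat ${HADOOP_EXAMPLES_OUTPUT}/*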
Note: A Hadoop client is required on the machine from which you will run the tests. You can follow these steps to install the Hadoop client binaries.
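A minimal sketch of such an install, assuming the Apache release archive and Hadoop 2.8.4 (pick the version matching your cluster), unpacked to the /usr/local/hadoop path used in the commands above:
# download and unpack a Hadoop release, then link it to the expected path
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz
tar -xzf hadoop-2.8.4.tar.gz -C /usr/local
ln -s /usr/local/hadoop-2.8.4 /usr/local/hadoop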