Benchmark Tools
HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput, and system resource utilization. This section describes how to set up and configure HiBench to run various Hadoop-centric test cases using the Pravega Hadoop connectors.
The following benchmark tests are available for measuring Hadoop performance. All of these workloads can potentially be run through the Pravega Hadoop Connector to measure its performance.
- Sort (sort): This workload sorts its text input data, which is generated using RandomTextWriter.
- WordCount (wordcount): This workload counts the occurrences of each word in the input data, which is also generated using RandomTextWriter. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set (see the shell sketch after this list).
- TeraSort (terasort): TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by the Hadoop TeraGen example program.
- Enhanced DFSIO (dfsioe): Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks performing writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster. Note: this benchmark does not have a corresponding Spark implementation.
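As a mental model only (not part of the benchmark suite), the wordcount workload computes the same result as the single-machine shell pipeline below, just distributed across map and reduce tasks; input.txt is a hypothetical local file standing in for the generated input data:
# split on whitespace, then count occurrences of each word
tr -s '[:space:]' '\n' < input.txt | sort | uniq -c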
The steps outlined below set up Hadoop and Pravega as a single-node deployment (quick start) for running the HiBench test suite. Ideally, the tests should be run at scale on a multi-node setup with appropriate resource (CPU/memory/disk) configurations.
Docker version: 1.13.1 or later
Refer here for more details on the Dockerfile configuration files and entry-point scripts to understand how the containers are configured and managed.
Create the ZooKeeper and Hadoop clusters by running the Docker commands below. To install a specific version of ZooKeeper and Hadoop, build the Docker images locally from source.
docker run --rm -ti --name zk -p 2181:2181 harisekhon/zookeeper
docker run --rm -ti --name hadoop -p 8020:8020 -p 8032:8032 -p 8088:8088 -p 9000:9000 -p 10020:10020 -p 19888:19888 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090 harisekhon/hadoop
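A quick sanity check, assuming the port mappings above, is to confirm both containers are running and that the HDFS NameNode web UI answers on port 50070:
docker ps --filter name=zk --filter name=hadoop
curl -s http://localhost:50070/ | head -n 5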
Follow the steps outlined here to run Pravega in standalone mode.
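For reference, when running from a Pravega source checkout, the quick start typically boils down to a single launcher task; the command below is an assumption based on Pravega's quick start and may differ by version, so follow the linked steps:
# start Pravega standalone (controller and segment store in a single process)
./gradlew startStandalone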
- Clone the repository.
- Make sure Maven is installed.
- Build HiBench by running the following command:
mvn -Dspark=2.1 -Dscala=2.11 clean package
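On a single-node setup you will likely want the smallest input sizes. In stock HiBench this is controlled by the scale profile in <HiBench_Root>/conf/hibench.conf; the key name below is taken from upstream HiBench, so verify it against your checkout:
# in <HiBench_Root>/conf/hibench.conf: tiny, small, large, huge, gigantic, or bigdata
hibench.scale.profile tiny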
After the HiBench build is complete, you can run the wordcount test program against a standard Hadoop deployment by following the steps outlined here. The test report can be viewed at <HiBench_Root>/report/hibench.report. It is a summarized workload report, including workload name, execution duration, data size, throughput per cluster, and throughput per node.
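The report is a plain-text file, so it can be inspected directly; each run appends one row. The header shown as a comment below is illustrative of stock HiBench and may vary by version:
cat <HiBench_Root>/report/hibench.report
# Type  Date  Time  Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node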
Currently, we have a sample application that mimics the standard wordcount program with slight modifications to support reading from and writing to Pravega streams. Support for running the additional tests (sort, terasort, dfsioe) will be added to the samples application.
- Clone and build the samples repository:
./gradlew clean installDist
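After installDist completes, locate the examples jar referenced by HADOOP_EXAMPLES_JAR below; the file-name pattern here is an assumption based on that variable's value, so adjust it if your build names the jar differently:
find . -name 'pravega-hadoop-examples-*-all.jar'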
- Configure the environment variables below:
export HDFS=hdfs://<HOSTNAME>:8020
export HADOOP_EXAMPLES_JAR=<PATH_TO_pravega-hadoop-examples-VERSION-all.jar>
export HADOOP_EXAMPLES_INPUT_DUMMY=${HDFS}/tmp/hadoop_examples_input_dummy
export HADOOP_EXAMPLES_OUTPUT=${HDFS}/tmp/hadoop_examples_output
export PRAVEGA_URI=tcp://<HOSTNAME>:9090
export PRAVEGA_SCOPE=myScope
export PRAVEGA_STREAM=myStream
export CMD=wordcount
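For the single-node quick start above, where Hadoop and Pravega both run locally, the host-specific values reduce to:
export HDFS=hdfs://localhost:8020
export PRAVEGA_URI=tcp://localhost:9090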
- Run the command below to ingest test data into the Pravega stream:
/usr/local/hadoop/bin/hadoop jar ${HADOOP_EXAMPLES_JAR} randomtextwriter -D mapreduce.randomtextwriter.totalbytes=32000 ${HADOOP_EXAMPLES_INPUT_DUMMY} ${PRAVEGA_URI} ${PRAVEGA_SCOPE} ${PRAVEGA_STREAM}
- Run the command below to perform the wordcount operation:
/usr/local/hadoop/bin/hadoop jar ${HADOOP_EXAMPLES_JAR} ${CMD} ${HADOOP_EXAMPLES_INPUT_DUMMY} ${PRAVEGA_URI} ${PRAVEGA_SCOPE} ${PRAVEGA_STREAM} ${HADOOP_EXAMPLES_OUTPUT}
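Once the job completes, the word counts can be sanity-checked straight from HDFS. Output file names depend on the number of reduce tasks, hence the wildcard; this assumes the final argument above is the job's output directory:
/usr/local/hadoop/bin/hadoop fs -cat ${HADOOP_EXAMPLES_OUTPUT}/*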
Note: A Hadoop client is required on the machine from which you will run the tests. You can follow these steps to install the Hadoop client binaries.
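A minimal sketch of such an install, assuming the Apache release archive and Hadoop 2.8.4 (pick the version matching your cluster), unpacked to the /usr/local/hadoop path used in the commands above:
# download and unpack a Hadoop release, then link it to the expected path
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz
tar -xzf hadoop-2.8.4.tar.gz -C /usr/local
ln -s /usr/local/hadoop-2.8.4 /usr/local/hadoop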