Yahoo streaming framework is the first result to create a benchmark for state-of-the-art Big Data streaming platforms. However, the currently available Yahoo streaming benchmark only provides Cloud configuration used in the experiemnts of the paper for Auto-DaSP 2017. When you use the repository, please cite the following paper:
@inproceedings{ScalabilityAndState,
title = {Scalability and State:
{A} Critical Assessment of Throughput Obtainable on
Big Data Streaming Frameworks for
Applications With and Without State Information},
author = {Shinhyung Yang and Yonguk Jeong and ChangWan Hong and Hyunje Jun
and Bernd Burgstaller},
booktitle = {Euro-Par 2017: Parallel Processing Workshops -
Euro-Par 2017 International Workshops, Santiago de Compostela, Spain,
August 28-29, 2017, Revised Selected Papers},
series = "Lecture Notes in Computer Science",
volume = "10659",
pages = "141--152",
year = "2017",
doi = "https://doi.org/10.1007/978-3-319-75178-8\_12",
}
From Yahoo's official repository on GitHub, the revision committed on Nov 23 2016 is used. This configuration requires 30 nodes on Cloud computing platform. I recommend to use Google Compute Engine. Each node is configured with 16 vCPUs and 24 GB main memory in my Cloud setup. Please note that each node has two properties: Instance_Name and Internal_IP. Instance_Name is specific to Google Compute Engine and is used as an alias for each node. You may want to create a Cloud node manually in order to manually allocate internal IPs. Otherwise, all occurrences of IP addresses in the provided patch files need to be replaced according to your setup, in which case refer to the table below.
Clone the benchmark from the official repository:
git clone https://github.com/yahoo/streaming-benchmarks.git
Reset to the specific revision from Nov 23 2016
cd streaming-benchmarks
git reset --hard b073202b04baa640840a09b206c101996c112b95
Do initial download and setup of the benchmark
./stream-bench.sh SETUP
Setup breaks after unsuccessful download of a package. If you encounter such problem directly download packages from archive.apache.org. E.g., have a look at the case below:
cd download-cache
rm flink-1.1.3-bin-hadoop27-scala_2.10.tgz spark-1.6.2-bin-hadoop2.6.tgz
wget http://archive.apache.org/dist/flink/flink-1.1.3/flink-1.1.3-bin-hadoop27-scala_2.10.tgz
wget http://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
# Continue with setup
cd ..
./stream-bench.sh SETUP
You may apply a patch file to your Yahoo streaming benchmark folder of each Cloud node according to the table below:
Instance_Name | Patch_Name | Internal_IP |
---|---|---|
streaming-group-0-0001 | zk.patch | 10.140.0.101 |
streaming-group-0-0002 | zk.patch | 10.140.0.102 |
streaming-group-0-0003 | zk.patch | 10.140.0.103 |
streaming-group-0-0004 | redis.patch | 10.140.0.104 |
streaming-group-0-0005 | kafka-1.patch | 10.140.0.105 |
streaming-group-0-0006 | kafka-2.patch | 10.140.0.106 |
streaming-group-0-0007 | kafka-3.patch | 10.140.0.107 |
streaming-group-0-0008 | kafka-4.patch | 10.140.0.108 |
streaming-group-0-0009 | kafka-5.patch | 10.140.0.109 |
streaming-group-0-0010 | master.patch | 10.140.0.110 |
streaming-group-0-0011 | slave.patch | 10.140.0.111 |
streaming-group-0-0012 | slave.patch | 10.140.0.112 |
streaming-group-0-0013 | slave.patch | 10.140.0.113 |
streaming-group-0-0014 | slave.patch | 10.140.0.114 |
streaming-group-0-0015 | slave.patch | 10.140.0.115 |
streaming-group-0-0016 | slave.patch | 10.140.0.116 |
streaming-group-0-0017 | slave.patch | 10.140.0.117 |
streaming-group-0-0018 | slave.patch | 10.140.0.118 |
streaming-group-0-0019 | slave.patch | 10.140.0.119 |
streaming-group-0-0020 | slave.patch | 10.140.0.120 |
streaming-group-0-0021 | dg.patch | 10.140.0.121 |
streaming-group-0-0022 | dg.patch | 10.140.0.122 |
streaming-group-0-0023 | dg.patch | 10.140.0.123 |
streaming-group-0-0024 | dg.patch | 10.140.0.124 |
streaming-group-0-0025 | dg.patch | 10.140.0.125 |
streaming-group-0-0026 | dg.patch | 10.140.0.126 |
streaming-group-0-0027 | dg.patch | 10.140.0.127 |
streaming-group-0-0028 | dg.patch | 10.140.0.128 |
streaming-group-0-0029 | dg.patch | 10.140.0.129 |
streaming-group-0-0030 | dg.patch | 10.140.0.130 |
Once you are ready with patch files, change current path to the parent of your Yahoo streaming benchmark directory and enter below command:
cd /path/to/your/streaming-benchmarks/..
patch -p0 --ignore-whitespace < /path/to/your/Patch_Name
In order to run Kafka cluster, each Kafka broker node needs a log directory. Refer to logs.dir
attribute of the Kafka configuration file in each of your Kafka broker nodes located at streaming-benchmarks/kafka_2.10-0.8.2.1/config
.
As soon as I was confronted with problems from using Storm's internal dev-zookeeper for setting up Zookeeper cluster, I added standalone zookeeper of the same version. Follow the instruction to add the application to the framework:
cd streaming-benchmarks
wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
tar xf zookeeper-3.4.6.tar.gz
rm zookeeper-3.4.6.tar.gz
In order to execute the benchmark's data-generator Clojure script, download and setup lein as follows:
mkdir ~/bin
cd ~/bin
wget https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein
chmod +x ./lein
Then, modify line # 19 of your streaming-benchmarks/stream-bench.sh
to correctly locate your copy of lein.
LEIN="/home/your_account_name/bin/lein"
Please open a new issue to contact the author.
Code licensed under the Apache 2.0 license. See LICENSE file for terms.
This work was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning under grant NRF-2015-M3C4A-7065522.