Refactor TPC-DS benchmark (#4)
kcheeeung authored Nov 10, 2020
1 parent 95b2a34 commit e191d3a
Showing 11 changed files with 307 additions and 276 deletions.
4 changes: 2 additions & 2 deletions .gitignore
@@ -1,6 +1,6 @@
target/
tpcds_kit.zip
tpch_kit.zip
# tpcds_kit.zip
# tpch_kit.zip
*.sql.log
derby.log
log_query/
60 changes: 24 additions & 36 deletions README.md
@@ -4,24 +4,11 @@
- Hadoop 2.2 or later cluster or Sandbox.
- Apache Hive.
- Between 15 minutes and 2 days to generate data (depending on the Scale Factor you choose and available hardware).
- If you plan to generate 1TB or more of data, using Apache Hive 13+ to generate the data is STRONGLY suggested.
- Have `gcc` installed in your system path. If your system does not have it, install it using yum or apt-get.

## Clone
```
git clone https://github.com/kcheeeung/hive-testbench.git
```

## New Cluster / Run Everything
Run all the individual steps. If you already have tables for a scale, just run step 3.

**TPC-DS**
```
nohup sh util_lazyrun.sh tpcds SCALE
```
**TPC-H**
```
nohup sh util_lazyrun.sh tpch SCALE
git clone https://github.com/kcheeeung/hive-testbench.git && cd hive-testbench/
```

# Individual Steps
@@ -39,29 +26,34 @@ Build the benchmark you want to use (do all the prerequisites)
```

## 2. Generate the tables
Decide how much data you want. SCALE approximately is about # ~GB.
Decide how much data you want. `SCALE` is approximately the dataset size in GB. Supported `FORMAT` values: `orc` and `parquet`.

**Generic Usage**
```
nohup sh script.sh SCALE FORMAT
```
**TPC-DS**
```
nohup sh util_tablegentpcds.sh SCALE
nohup sh util_tablegentpcds.sh 10 orc
```
**TPC-H**
```
nohup sh util_tablegentpch.sh SCALE
nohup sh util_tablegentpch.sh 10 orc
```

## 3. Run all the queries
- `SCALE` **must be the SAME as before or else it can't find the database name!**
- Add or change your desired `settings.sql` file or path
- `SCALE` **must match the scale of an existing database, or the scripts can't find it!**
- Modify your settings in `settings.sql`.
- By default each query has a timeout set to **2 hours!** Change it in `util_internalRunQuery.sh`, where `TIME_TO_TIMEOUT=120m`.
- Run the queries!

**TPC-DS Benchmark**
```
nohup sh util_runtpcds.sh SCALE
nohup sh util_runtpcds.sh 10 orc
```
**TPC-H Benchmark**
```
nohup sh util_runtpch.sh SCALE
nohup sh util_runtpch.sh 10 orc
```
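The 2-hour cap relies on GNU `timeout`, which kills the wrapped command and exits with status 124 on expiry. A minimal sketch of that guard, with the duration shortened for demonstration (the real script uses `TIME_TO_TIMEOUT=120m` and wraps a `beeline` invocation, not `sleep`):

```shell
# Illustrative timeout guard mirroring util_internalRunQuery.sh.
TIME_TO_TIMEOUT=2s               # the real script uses 120m
timeout "$TIME_TO_TIMEOUT" sleep 5
if [ $? -eq 124 ]; then          # GNU timeout exits 124 on expiry
  echo "query timed out"
fi
```

A query that finishes inside the window exits with its own status instead, so the log shows either the query result or the timeout marker.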

# Optional: Enable Performance Analysis Tool (PAT)
@@ -80,35 +72,31 @@ Switch the command by un/commenting. Example below.
```

# Optional: Run Queries using Different Connection
Go into `util_internalRunQuery.sh`
Switch the command by un/commenting. Example below.
Go into `util_internalRunQuery.sh` and switch the command by un/commenting, as in the example below.
Add the appropriate information (`CLUSTERNAME` and `PASSWORD`).
```
# beeline -u "jdbc:hive2://`hostname`:10001/$INTERNAL_DATABASE;transportMode=http" -i $INTERNAL_SETTINGSPATH -f $INTERNAL_QUERYPATH &>> $INTERNAL_LOG_PATH
# timeout $TIME_TO_TIMEOUT beeline -u "jdbc:hive2://`hostname -f`:10001/$INTERNAL_DATABASE;transportMode=http" -i $INTERNAL_SETTINGSPATH -f $INTERNAL_QUERYPATH &>> $INTERNAL_LOG_PATH
beeline -u "jdbc:hive2://CLUSTERNAME.azurehdinsight.net:443/$INTERNAL_DATABASE;ssl=true;transportMode=http;httpPath=/hive2" -n admin -p PASSWORD -i $INTERNAL_SETTINGSPATH -f $INTERNAL_QUERYPATH &>> $INTERNAL_LOG_PATH
timeout $TIME_TO_TIMEOUT beeline -u "jdbc:hive2://CLUSTERNAME.azurehdinsight.net:443/$INTERNAL_DATABASE;ssl=true;transportMode=http;httpPath=/hive2" -n admin -p PASSWORD -i $INTERNAL_SETTINGSPATH -f $INTERNAL_QUERYPATH &>> $INTERNAL_LOG_PATH
```

# Troubleshooting

## Did my X step finish?
Check the `aaa_clock.txt` or `aab_clock.txt` file.
OR
```
ps -ef | grep '\.sh'
```

## Some errors?
Add into the script you're running
```
export DEBUG_SCRIPT=X
ps -ef | grep .sh
ps -ef | grep beeline
```

## Could not find database?
In the `settings.sql` file, add
In the `settings.sql` file, add:
```
use DATABASENAME;
```

## TPC-H is more stable than TPC-DS
TPC-DS has some problems at large scales (100+). A fix is pending.
## How to debug
Uncomment the following line.
```
# DEBUG_SCRIPT=X
```
88 changes: 52 additions & 36 deletions parselog.py
@@ -5,64 +5,80 @@
BASE_LOG_NAME = "logquery"
LOG_EXT = ".txt"

def parse_log(path, cacheHitRatios):
"""
Parses the target log. File size is converted into OUT_FILE_SIZE
"""
found = 0
cacheHit, total = 0, 0
""" BASE PARAMS """
os.environ["TZ"]="US/Pacific"
time_id = datetime.datetime.now().strftime("%m.%d.%Y-%H.%M")
OUT_NAME = "llapio_summary" + time_id + ".csv"

def getCacheHitRatio(path):
""" Returns the cache hit ratio """
cacheHit, miss, total = 0, 0, 0

with open(path, "r") as file:
for line in file:
if "CACHE_HIT_BYTES" in line:
cacheHit = [int(item) for item in line.split() if item.isdigit()][0]
found += 1
cacheHit += [int(item) for item in line.split() if item.isdigit()][0]
elif "CACHE_MISS_BYTES" in line:
miss = [int(item) for item in line.split() if item.isdigit()][0]
total = cacheHit + miss
found += 1
miss += [int(item) for item in line.split() if item.isdigit()][0]

total = cacheHit + miss
if total != 0:
return cacheHit / total * 100
else:
# query fail
return 0.12345

# Number of items to find before stop parsing
if found == 2:
break
def getMetadataHitRatio(path):
""" Returns the metadata hit ratio. 'Cache retention rate basically' """
metadataHit, miss, total = 0, 0, 0

if total != 0 and cacheHit != 0:
# query success
cacheHitRatios.append(cacheHit / total * 100)
with open(path, "r") as file:
for line in file:
if "METADATA_CACHE_HIT" in line:
metadataHit += [int(item) for item in line.split() if item.isdigit()][0]
elif "METADATA_CACHE_MISS" in line:
miss += [int(item) for item in line.split() if item.isdigit()][0]

total = metadataHit + miss
if total != 0:
return metadataHit / total * 100
else:
# query fail
cacheHitRatios.append(0)
return 0.12345

def write_csv(cacheHitRatios):
def write_csv(cacheHitRatios, metadataHitRatio):
"""
Writes info to a csv file.
Modify by adding new columns and map of parsed data.
"""
queryNum = list(cacheHitRatios.keys())
queryNum.sort(key=float)

with open(OUT_NAME, "w", newline="") as output_csv:
writer = csv.writer(output_csv)
# header
head = ["Query#", "Cache Hit Ratio %"]
head = ["Query#", "Cache Hit %", "Metadata Hit %"]
writer.writerow(head)
# info
for i in range(len(cacheHitRatios)):
writer.writerow([i + 1, cacheHitRatios[i]])
for i in queryNum:
writer.writerow([float(i), cacheHitRatios[i], metadataHitRatio[i]])

os.environ["TZ"]="US/Pacific"
time_id = datetime.datetime.now().strftime("%m.%d.%Y-%H.%M")
OUT_NAME = "llapio_summary" + time_id + ".csv"
def main():
# Range of queries. Counts the files so you don't need to know which benchmark it is
START = 1
END = 0
for file in os.listdir(LOG_FOLDER):
if file.startswith(BASE_LOG_NAME) and file.endswith(LOG_EXT):
END += 1
querynum_to_cacheratio = {}
querynum_to_metadatahitratio = {}
for filename in os.listdir(LOG_FOLDER):
if filename.startswith(BASE_LOG_NAME) and filename.endswith(LOG_EXT):
query_runNum = re.findall(r"\d+\.\d+", filename)  # raw string avoids invalid-escape warnings
if len(query_runNum) == 1:
query_num = query_runNum[0]
filepath = LOG_FOLDER + filename

# parse all data
cacheHitRatios = list()
for i in range(START, END + 1):
parse_log(LOG_FOLDER + BASE_LOG_NAME + str(i) + LOG_EXT, cacheHitRatios)
querynum_to_cacheratio[query_num] = getCacheHitRatio(filepath)
querynum_to_metadatahitratio[query_num] = getMetadataHitRatio(filepath)
else:
raise Exception("Did not find query number in " + filename)

write_csv(cacheHitRatios)
write_csv(querynum_to_cacheratio, querynum_to_metadatahitratio)

if __name__ == "__main__":
start = time.time()
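The refactored parser keys each ratio by the query number pulled from the log filename, sums hit and miss byte counters, and lets `write_csv` sort keys numerically. A self-contained sketch of that flow, assuming filenames like `logquery12.1.txt`; the sample counter lines are invented (real LLAP log lines carry more fields), but the regex and ratio math mirror the diff above:

```python
import re

def cache_hit_ratio(lines):
    """Sum CACHE_HIT_BYTES / CACHE_MISS_BYTES counters and return the hit %."""
    hit = miss = 0
    for line in lines:
        nums = [int(tok) for tok in line.split() if tok.isdigit()]
        if "CACHE_HIT_BYTES" in line:
            hit += nums[0]
        elif "CACHE_MISS_BYTES" in line:
            miss += nums[0]
    total = hit + miss
    return hit / total * 100 if total else 0.12345  # sentinel for a failed query

# The query number comes from the log filename, e.g. logquery12.1.txt -> "12.1"
query_num = re.findall(r"\d+\.\d+", "logquery12.1.txt")[0]

sample = ["CACHE_HIT_BYTES 300", "CACHE_MISS_BYTES 100"]
print(query_num, cache_hit_ratio(sample))  # -> 12.1 75.0

# write_csv sorts keys numerically, so "10.1" lands after "2.1":
keys = ["10.1", "2.1", "1.1"]
keys.sort(key=float)  # -> ["1.1", "2.1", "10.1"]
```

Sorting with `key=float` matters because the keys are strings: a plain lexicographic sort would put `"10.1"` before `"2.1"`.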
6 changes: 3 additions & 3 deletions tpcds-gen/Makefile
@@ -8,9 +8,9 @@ target/tpcds_kit.zip: tpcds_kit.zip
mkdir -p target/
cp tpcds_kit.zip target/tpcds_kit.zip

tpcds_kit.zip:
curl https://public-repo-1.hortonworks.com/hive-testbench/tpcds/README
curl --output tpcds_kit.zip https://public-repo-1.hortonworks.com/hive-testbench/tpcds/TPCDS_Tools.zip
# tpcds_kit.zip:
# curl https://public-repo-1.hortonworks.com/hive-testbench/tpcds/README
# curl --output tpcds_kit.zip https://public-repo-1.hortonworks.com/hive-testbench/tpcds/TPCDS_Tools.zip

target/lib/dsdgen.jar: target/tools/dsdgen
cd target/; mkdir -p lib/; ( jar cvf lib/dsdgen.jar tools/ || gjar cvf lib/dsdgen.jar tools/ )
Binary file added tpcds-gen/tpcds_kit.zip
128 changes: 0 additions & 128 deletions tpcds-setup.sh

This file was deleted.

6 changes: 3 additions & 3 deletions tpch-gen/Makefile
@@ -9,9 +9,9 @@ target/tpch_kit.zip: tpch_kit.zip
mkdir -p target/
cp tpch_kit.zip target/tpch_kit.zip

tpch_kit.zip:
curl http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpch/README
curl --output tpch_kit.zip http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpch/tpch_kit.zip
# tpch_kit.zip:
# curl http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpch/README
# curl --output tpch_kit.zip http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpch/tpch_kit.zip

target/lib/dbgen.jar: target/tools/dbgen
cd target/; mkdir -p lib/; ( jar cvf lib/dbgen.jar tools/ || gjar cvf lib/dbgen.jar tools/ )
Binary file added tpch-gen/tpch_kit.zip
