Refactor TPC-DS benchmark (#4)
kcheeeung authored Nov 10, 2020
1 parent 95b2a34 commit e191d3a
Showing 11 changed files with 307 additions and 276 deletions.
4 changes: 2 additions & 2 deletions .gitignore
@@ -1,6 +1,6 @@
target/
tpcds_kit.zip
tpch_kit.zip
# tpcds_kit.zip
# tpch_kit.zip
*.sql.log
derby.log
log_query/
60 changes: 24 additions & 36 deletions README.md
@@ -4,24 +4,11 @@
- Hadoop 2.2 or later cluster or Sandbox.
- Apache Hive.
- Between 15 minutes and 2 days to generate data (depending on the Scale Factor you choose and available hardware).
- If you plan to generate 1TB or more of data, using Apache Hive 13+ to generate the data is STRONGLY suggested.
- Have `gcc` installed in your system path. If your system does not have it, install it using yum or apt-get.

## Clone
```
git clone https://github.com/kcheeeung/hive-testbench.git
```

## New Cluster / Run Everything
Run all the individual steps. If you already have tables for a scale, just run step 3.

**TPC-DS**
```
nohup sh util_lazyrun.sh tpcds SCALE
```
**TPC-H**
```
nohup sh util_lazyrun.sh tpch SCALE
git clone https://github.com/kcheeeung/hive-testbench.git && cd hive-testbench/
```

# Individual Steps
@@ -39,29 +26,34 @@ Build the benchmark you want to use (do all the prerequisites)
```

## 2. Generate the tables
Decide how much data you want. SCALE approximately is about # ~GB.
Decide how much data you want. `SCALE` is approximately the dataset size in GB. Supported `FORMAT` values: `orc` and `parquet`.

**Generic Usage**
```
nohup sh script.sh SCALE FORMAT
```
**TPC-DS**
```
nohup sh util_tablegentpcds.sh SCALE
nohup sh util_tablegentpcds.sh 10 orc
```
**TPC-H**
```
nohup sh util_tablegentpch.sh SCALE
nohup sh util_tablegentpch.sh 10 orc
```

## 3. Run all the queries
- `SCALE` **must be the SAME as before or else it can't find the database name!**
- Add or change your desired `settings.sql` file or path
- `SCALE` **must match the scale of an existing database, or the scripts can't find it!**
- Modify your settings in `settings.sql`.
- By default each query has a timeout set to **2 hours!** Change it in `util_internalRunQuery.sh`, where `TIME_TO_TIMEOUT=120m`.
- Run the queries!

**TPC-DS Benchmark**
```
nohup sh util_runtpcds.sh SCALE
nohup sh util_runtpcds.sh 10 orc
```
**TPC-H Benchmark**
```
nohup sh util_runtpch.sh SCALE
nohup sh util_runtpch.sh 10 orc
```
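The 2-hour cap relies on GNU `timeout`, which kills the wrapped command and exits with status 124 on expiry. A minimal sketch of that guard, with the duration shortened for demonstration (the real script uses `TIME_TO_TIMEOUT=120m` and wraps a `beeline` invocation, not `sleep`):

```shell
# Illustrative timeout guard mirroring util_internalRunQuery.sh.
TIME_TO_TIMEOUT=2s               # the real script uses 120m
timeout "$TIME_TO_TIMEOUT" sleep 5
if [ $? -eq 124 ]; then          # GNU timeout exits 124 on expiry
  echo "query timed out"
fi
```

A query that finishes inside the window exits with its own status instead, so the log shows either the query result or the timeout marker.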

# Optional: Enable Performance Analysis Tool (PAT)
@@ -80,35 +72,31 @@ Switch the command by un/commenting. Example below.
```

# Optional: Run Queries using Different Connection
Go into `util_internalRunQuery.sh`
Switch the command by un/commenting. Example below.
Go into `util_internalRunQuery.sh` and switch the command by un/commenting, as in the example below.
Add the appropriate information (`CLUSTERNAME` and `PASSWORD`).
```
# beeline -u "jdbc:hive2://`hostname`:10001/$INTERNAL_DATABASE;transportMode=http" -i $INTERNAL_SETTINGSPATH -f $INTERNAL_QUERYPATH &>> $INTERNAL_LOG_PATH
# timeout $TIME_TO_TIMEOUT beeline -u "jdbc:hive2://`hostname -f`:10001/$INTERNAL_DATABASE;transportMode=http" -i $INTERNAL_SETTINGSPATH -f $INTERNAL_QUERYPATH &>> $INTERNAL_LOG_PATH
beeline -u "jdbc:hive2://CLUSTERNAME.azurehdinsight.net:443/$INTERNAL_DATABASE;ssl=true;transportMode=http;httpPath=/hive2" -n admin -p PASSWORD -i $INTERNAL_SETTINGSPATH -f $INTERNAL_QUERYPATH &>> $INTERNAL_LOG_PATH
timeout $TIME_TO_TIMEOUT beeline -u "jdbc:hive2://CLUSTERNAME.azurehdinsight.net:443/$INTERNAL_DATABASE;ssl=true;transportMode=http;httpPath=/hive2" -n admin -p PASSWORD -i $INTERNAL_SETTINGSPATH -f $INTERNAL_QUERYPATH &>> $INTERNAL_LOG_PATH
```

# Troubleshooting

## Did my X step finish?
Check the `aaa_clock.txt` or `aab_clock.txt` file.
OR
```
ps -ef | grep '\.sh'
```

## Some errors?
Add into the script you're running
```
export DEBUG_SCRIPT=X
ps -ef | grep .sh
ps -ef | grep beeline
```

## Could not find database?
In the `settings.sql` file, add
In the `settings.sql` file, add:
```
use DATABASENAME;
```

## TPC-H is more stable than TPC-DS
TPC-DS has some problems at large scales (100+). A fix is pending.
## How to debug
Uncomment the following line.
```
# DEBUG_SCRIPT=X
```
88 changes: 52 additions & 36 deletions parselog.py
@@ -5,64 +5,80 @@
BASE_LOG_NAME = "logquery"
LOG_EXT = ".txt"

def parse_log(path, cacheHitRatios):
"""
Parses the target log. File size is converted into OUT_FILE_SIZE
"""
found = 0
cacheHit, total = 0, 0
""" BASE PARAMS """
os.environ["TZ"]="US/Pacific"
time_id = datetime.datetime.now().strftime("%m.%d.%Y-%H.%M")
OUT_NAME = "llapio_summary" + time_id + ".csv"

def getCacheHitRatio(path):
""" Returns the cache hit ratio """
cacheHit, miss, total = 0, 0, 0

with open(path, "r") as file:
for line in file:
if "CACHE_HIT_BYTES" in line:
cacheHit = [int(item) for item in line.split() if item.isdigit()][0]
found += 1
cacheHit += [int(item) for item in line.split() if item.isdigit()][0]
elif "CACHE_MISS_BYTES" in line:
miss = [int(item) for item in line.split() if item.isdigit()][0]
total = cacheHit + miss
found += 1
miss += [int(item) for item in line.split() if item.isdigit()][0]

total = cacheHit + miss
if total != 0:
return cacheHit / total * 100
else:
# query fail
return 0.12345

# Number of items to find before stop parsing
if found == 2:
break
def getMetadataHitRatio(path):
""" Returns the metadata hit ratio. 'Cache retention rate basically' """
metadataHit, miss, total = 0, 0, 0

if total != 0 and cacheHit != 0:
# query success
cacheHitRatios.append(cacheHit / total * 100)
with open(path, "r") as file:
for line in file:
if "METADATA_CACHE_HIT" in line:
metadataHit += [int(item) for item in line.split() if item.isdigit()][0]
elif "METADATA_CACHE_MISS" in line:
miss += [int(item) for item in line.split() if item.isdigit()][0]

total = metadataHit + miss
if total != 0:
return metadataHit / total * 100
else:
# query fail
cacheHitRatios.append(0)
return 0.12345

def write_csv(cacheHitRatios):
def write_csv(cacheHitRatios, metadataHitRatio):
"""
Writes info to a csv file.
Modify by adding new columns and map of parsed data.
"""
queryNum = list(cacheHitRatios.keys())
queryNum.sort(key=float)

with open(OUT_NAME, "w", newline="") as output_csv:
writer = csv.writer(output_csv)
# header
head = ["Query#", "Cache Hit Ratio %"]
head = ["Query#", "Cache Hit %", "Metadata Hit %"]
writer.writerow(head)
# info
for i in range(len(cacheHitRatios)):
writer.writerow([i + 1, cacheHitRatios[i]])
for i in queryNum:
writer.writerow([float(i), cacheHitRatios[i], metadataHitRatio[i]])

os.environ["TZ"]="US/Pacific"
time_id = datetime.datetime.now().strftime("%m.%d.%Y-%H.%M")
OUT_NAME = "llapio_summary" + time_id + ".csv"
def main():
# Range of queries. Counts the files so you don't need to know which benchmark it is
START = 1
END = 0
for file in os.listdir(LOG_FOLDER):
if file.startswith(BASE_LOG_NAME) and file.endswith(LOG_EXT):
END += 1
querynum_to_cacheratio = {}
querynum_to_metadatahitratio = {}
for filename in os.listdir(LOG_FOLDER):
if filename.startswith(BASE_LOG_NAME) and filename.endswith(LOG_EXT):
query_runNum = re.findall(r"\d+\.\d+", filename)  # raw string avoids invalid-escape warnings
if len(query_runNum) == 1:
query_num = query_runNum[0]
filepath = LOG_FOLDER + filename

# parse all data
cacheHitRatios = list()
for i in range(START, END + 1):
parse_log(LOG_FOLDER + BASE_LOG_NAME + str(i) + LOG_EXT, cacheHitRatios)
querynum_to_cacheratio[query_num] = getCacheHitRatio(filepath)
querynum_to_metadatahitratio[query_num] = getMetadataHitRatio(filepath)
else:
raise Exception("Did not find query number in " + filename)

write_csv(cacheHitRatios)
write_csv(querynum_to_cacheratio, querynum_to_metadatahitratio)

if __name__ == "__main__":
start = time.time()
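The refactored parser keys each ratio by the query number pulled from the log filename, sums hit and miss byte counters, and lets `write_csv` sort keys numerically. A self-contained sketch of that flow, assuming filenames like `logquery12.1.txt`; the sample counter lines are invented (real LLAP log lines carry more fields), but the regex and ratio math mirror the diff above:

```python
import re

def cache_hit_ratio(lines):
    """Sum CACHE_HIT_BYTES / CACHE_MISS_BYTES counters and return the hit %."""
    hit = miss = 0
    for line in lines:
        nums = [int(tok) for tok in line.split() if tok.isdigit()]
        if "CACHE_HIT_BYTES" in line:
            hit += nums[0]
        elif "CACHE_MISS_BYTES" in line:
            miss += nums[0]
    total = hit + miss
    return hit / total * 100 if total else 0.12345  # sentinel for a failed query

# The query number comes from the log filename, e.g. logquery12.1.txt -> "12.1"
query_num = re.findall(r"\d+\.\d+", "logquery12.1.txt")[0]

sample = ["CACHE_HIT_BYTES 300", "CACHE_MISS_BYTES 100"]
print(query_num, cache_hit_ratio(sample))  # -> 12.1 75.0

# write_csv sorts keys numerically, so "10.1" lands after "2.1":
keys = ["10.1", "2.1", "1.1"]
keys.sort(key=float)  # -> ["1.1", "2.1", "10.1"]
```

Sorting with `key=float` matters because the keys are strings: a plain lexicographic sort would put `"10.1"` before `"2.1"`.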
6 changes: 3 additions & 3 deletions tpcds-gen/Makefile
@@ -8,9 +8,9 @@ target/tpcds_kit.zip: tpcds_kit.zip
mkdir -p target/
cp tpcds_kit.zip target/tpcds_kit.zip

tpcds_kit.zip:
curl https://public-repo-1.hortonworks.com/hive-testbench/tpcds/README
curl --output tpcds_kit.zip https://public-repo-1.hortonworks.com/hive-testbench/tpcds/TPCDS_Tools.zip
# tpcds_kit.zip:
# curl https://public-repo-1.hortonworks.com/hive-testbench/tpcds/README
# curl --output tpcds_kit.zip https://public-repo-1.hortonworks.com/hive-testbench/tpcds/TPCDS_Tools.zip

target/lib/dsdgen.jar: target/tools/dsdgen
cd target/; mkdir -p lib/; ( jar cvf lib/dsdgen.jar tools/ || gjar cvf lib/dsdgen.jar tools/ )
Binary file added tpcds-gen/tpcds_kit.zip
128 changes: 0 additions & 128 deletions tpcds-setup.sh

This file was deleted.

6 changes: 3 additions & 3 deletions tpch-gen/Makefile
@@ -9,9 +9,9 @@ target/tpch_kit.zip: tpch_kit.zip
mkdir -p target/
cp tpch_kit.zip target/tpch_kit.zip

tpch_kit.zip:
curl http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpch/README
curl --output tpch_kit.zip http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpch/tpch_kit.zip
# tpch_kit.zip:
# curl http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpch/README
# curl --output tpch_kit.zip http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpch/tpch_kit.zip

target/lib/dbgen.jar: target/tools/dbgen
cd target/; mkdir -p lib/; ( jar cvf lib/dbgen.jar tools/ || gjar cvf lib/dbgen.jar tools/ )
Binary file added tpch-gen/tpch_kit.zip
