Review Hive Metastore Databases and Tables for upgrade, or simply to evaluate potential issues. The intent is to make that process prescriptive and consumable by Cloudera customers. The application is 'Apache Hive' based, so it works against 'HDP', 'CDH', and 'CDP' clusters. See additional details about upgrading HDP2 and CDH 5/6 to CDP.
Download, Requirements, and Configuration Details
This can be a long-running process. I recommend using screen or tmux sessions to run it. That way the session won't be terminated when you disconnect from the host, and you can easily reattach later to see the progress.
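For example, a minimal tmux workflow (the session name is arbitrary):

```bash
# start a named session and launch the report inside it
tmux new -s hive-sre
hive-sre u3 -cdh -o ./sre-out

# detach with Ctrl-b d; reattach later to check progress
tmux attach -t hive-sre
```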
The process should start registering counts pretty quickly for each of the sub-checks. If you do not see these counts moving, or they show 0, check your connection to the Metastore database (i.e.: URL, RDBMS driver (needs to be in $HOME/.hive-sre/aux_libs), username, password, and/or permissions). hive-sre versions 2.4.0.21.0 and above will catch some of these misconfigurations and exit quickly.
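For example, staging a JDBC driver where the tool can find it (the MySQL jar name below is illustrative; use the driver that matches your Metastore RDBMS and version):

```bash
mkdir -p $HOME/.hive-sre/aux_libs
cp mysql-connector-java-8.0.28.jar $HOME/.hive-sre/aux_libs/
```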
Logs for the application are at $HOME/.hive-sre/logs. Check those for details on progress and/or problems.
Use one of -cdh, -hdp2, or -hdp3, depending on your source platform, to run only the reports that make sense. If you don't use one of these options, that's fine; just know that you'll be running MORE reports than you need and the process will run much longer.
EACH report contains a description within the report on what actions should be taken.
If you are running the default upgrade and NOT doing the CDP Expedited Hive Upgrade Process for CDH5, CDH6, or HDP2, there is an HSMM (Hive Strict Managed Migration) process that runs AFTER the binaries and components have been upgraded to CDP's Hive 3. This process (HSMM) will fail if the issues/errors reported in report 1 - Table / Partition Location Scan - Missing Directories and report 5 - Questionable Serde's Check are not fixed BEFORE doing the upgrade.
Metadata Updates, like 3 - Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations are done after the components and configurations have been updated to CDP. Therefore, the report actions for 3 - Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations and 6 - Provide Script for ACID tables missing the 'bucketing_version' property will be done by HSMM.
If you are doing the Expedited Hive Upgrade process there are two options:
- Skip the HSMM process altogether. This means you will need to run the actions from the following reports manually. These actions can be done BEFORE the upgrade, BUT they may change behaviors expected in the legacy platform. They can be done AFTER the upgrade where there isn't a behavior impact. They will take time, depending on the number of actions you need to perform. The tables must be updated BEFORE they can be consumed by users, but this doesn't mean you need to hold the upgrade or increase the downtime of the cluster. Organize high-priority DBs first and work on less-accessed DBs later.
- Use the hsmm_includelist.yaml produced by u3 to direct HSMM toward targeted actions. See the Expedite Hive Upgrade docs for details.
Runs reports:
- 1 - Table / Partition Location Scan - Missing Directories
- 3 - Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations
- 4 - Hive 3 Upgrade Checks - Compaction Check
- 5 - Hive Metastore Check
  - Questionable Serde's Check
  - Legacy Kudu Serde Report
  - Legacy Decimal Scale/Precision Check
  - List Databases with Table/Partition Counts
- 6 - Provide Script for ACID tables missing the 'bucketing_version' property
- 7 - List tables using known Storage Handlers
- 8 - Hive 3 Upgrade Checks - List Managed Non-ACID tables that need to be converted. (Run u3e to process)
- 9 - List ACID tables missing the 'bucketing_version' property (Run u3e to process)
Runs reports:
- 1 - Table / Partition Location Scan - Missing Directories
- 5 - Hive Metastore Check
  - Questionable Serde's Check
  - Legacy Kudu Serde Report
  - Legacy Decimal Scale/Precision Check
  - List Databases with Table/Partition Counts
- 6 - Provide Script for ACID tables missing the 'bucketing_version' property
Runs reports:
- 1 - Table / Partition Location Scan - Missing Directories
- 3 - Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations
- 5 - Hive Metastore Check
  - Questionable Serde's Check
  - Legacy Kudu Serde Report
  - Legacy Decimal Scale/Precision Check
  - List Databases with Table/Partition Counts
- 6 - Provide Script for ACID tables missing the 'bucketing_version' property
- 7 - List tables using known Storage Handlers
- 8 - Hive 3 Upgrade Checks - List Managed Non-ACID tables that need to be converted. (Run u3e to process)
- 9 - List ACID tables missing the 'bucketing_version' property (Run u3e to process)
I have tested this against MariaDB 10.2, Postgres, and Oracle. I have seen reports of SQL issues against MySQL 5.6, so this process will certainly have issues there. If you run this in other DB configs, I would like to hear from you about it.
For a while during the evolution of Hive 3, there was a separate 'catalog' for Spark. The queries in this process do NOT consider this alternate catalog and may yield cross products in some areas if the 'spark' catalog was used at any point. Even though this tool was designed with Hive 1 and 2 in mind, it can be run against Hive 3. We do NOT include criteria regarding the 'catalog'.
- Run from a node on the cluster
  - that includes clients and configuration files for HDFS
- Run by a user that has READ privileges to all HDFS directories
- If the cluster is 'kerberized', the user has a valid Kerberos ticket BEFORE starting the application
- Drivers for the HMS database are available
- The configuration file has been defined
- The HMS Metastore DB is on a supported RDBMS for the platform (version matters!)
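A minimal pre-flight sanity check covering these requirements, assuming a kerberized cluster (the principal and warehouse paths are illustrative):

```bash
# valid Kerberos ticket before starting
kinit someuser@EXAMPLE.COM
klist

# HDFS client and configs work from this node, and the user can read the warehouse
hdfs dfs -ls /warehouse/tablespace 2>/dev/null || hdfs dfs -ls /apps/hive/warehouse

# RDBMS driver for the HMS database is staged for hive-sre
ls $HOME/.hive-sre/aux_libs/
```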
usage: hive-sre u3|sre|perf -cdh|-hdp2|-hdp3|-all|-i <proc[,proc...]> [options]
version:2.4.0.24.0-SNAPSHOT
Hive SRE Utility
-all,--all-reports Run ALL available processes.
-cdh,--cloudera-data-hub Run processes that make sense for CDH.
-cfg,--config <arg> Config with details for the Sre Job. Must match
either the sre or u3 selection. Default:
$HOME/.hive-sre/cfg/default.yaml
-db,--database <arg> Comma separated list of Databases. Will override
config. (up to 100)
-dbRegEx,--database-regex <arg> A RegEx of databases to process
-dp,--decrypt-password <encrypted-password> Use this in conjunction with '-pkey' to decrypt the
generated passcode from `-p`.
-edbRegEx,--exclude-database-regex <arg> A RegEx that will filter OUT matching databases from
processing
-h,--help Help
-hdp2,--hortonworks-data-platfrom-v2 Run processes that make sense for HDP2.
-hdp3,--hortonworks-data-platfrom-v3 Run processes that make sense for HDP3.
-hfw,--hive-framework <arg> The custom HiveFramework check configuration. Needs
to be in the 'Classpath'.
-i,--include <arg> Comma separated list of process id's to run. When
not specified, ALL processes are run.
-o,--output-dir <arg> Output Directory to save results from Sre.
-p,--password <password> Use this in conjunction with '-pkey' to generate the
encrypted password that you'll add to the configs for
the JDBC connections.
-pkey,--password-key <password-key> The key used to encrypt / decrypt the cluster jdbc
passwords. If not present, the passwords will be
processed as is (clear text) from the config file.
-scc,--skip-command-checks Don't process the command checks for the process.
-tsql,--test-sql Check SQL against target Metastore RDBMS
Visit https://github.com/cloudera-labs/hive-sre for detailed docs
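A sketch of the password-encryption workflow implied by the -p / -pkey / -dp options above (the key and password values are made up):

```bash
# generate an encrypted password using a key of your choosing
hive-sre u3 -pkey myclusterkey -p myDbPassword

# verify the encrypted value decrypts back correctly
hive-sre u3 -pkey myclusterkey -dp <encrypted-password-from-above>
```

Put the encrypted value in the config file's JDBC settings, then supply the same -pkey at runtime so the password can be decrypted; without -pkey, passwords in the config are treated as clear text.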
To limit which process runs, use the -i (include) option at the command line with a comma separated list of the ids (below) of the desired processes.
id (link to sample report) | process |
---|---|
1 | Table / Partition Location Scan - Missing Directories |
2 | Table / Partition Location Scan - Bad ACID ORC Filenames |
3 | Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations |
4 | Hive 3 Upgrade Checks - Compaction Check |
5 | Hive Metastore Check: Questionable Serde's Check, Legacy Kudu Serde Report, Legacy Decimal Scale/Precision Check, List Databases with Table/Partition Counts |
6 | Provide Script for ACID tables missing the 'bucketing_version' property |
Run the Hive Metastore Checks Report first (-i 5) to get a list of databases with table and partition counts. With this information, you can develop a strategy to run the tool in parts using the -db option. Multi-tasking can be controlled via the configuration file's parallelism setting.
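For example, a phased run (the database names are illustrative):

```bash
# first pass: metastore counts only, to scope the work
hive-sre u3 -hdp2 -o ./sre-out -i 5

# then run the remaining checks against a manageable slice of databases
hive-sre u3 -hdp2 -o ./sre-out -db finance_db,marketing_db
```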
To ease the launch of the application below, configure these core environment variables.
The output is a set of files with actions and errors (when encountered). The files may be txt or markdown files. You may want to use a markdown viewer for easier viewing of those reports. The markdown viewer needs to support github markdown tables.
All commands assume the config file is here: $HOME/.hive-sre/cfg/default.yaml
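For orientation, the config needs at least the Metastore JDBC connection details and the parallelism setting mentioned above. The key names below are assumptions, not the tool's actual schema; consult the project docs for the real structure:

```yaml
# illustrative shape only - key names are assumptions
metastore_direct:
  uri: "jdbc:mysql://metastore-db-host:3306/hive"   # your HMS database
  connectionProperties:
    user: "hive"
    password: "<encrypted-or-clear-password>"
parallelism: 4   # controls multi-tasking, per the note above
```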
All Hive Databases - Will run processes relevant to cdh.
hive-sre u3 -cdh -o ./sre-out
Targeted Hive Database - Will run ALL processes on the 'priv_dstreev' hive database and run only reports relative to hdp2.
hive-sre u3 -db priv_dstreev -hdp2 -o ./sre-out
Run ONLY the compaction check on all Hive databases - Using the -i option to run only the 'compaction check' subroutine.
hive-sre u3 -o ./sre-out -i 4
NO action is taken by this process. The output of each section will contain 'actions' for you to take when a scenario materializes. It is up to you to carry out those actions after reviewing them.
- Hive 3 Upgrade Checks - Locations Scan
  - Missing Directories
Missing directories cause the upgrade conversion process to fail. To prevent that failure, there are two choices for a 'missing directory': either create it or drop the table/partition.
- You have two choices based on the output of this process:
  - RECOMMENDED: Drop the table/partition, OR
  - Create the missing directory referenced by the table/partition (empty directories have been known to cause Hive table CBO issues, leading to performance slowdowns).
- The output file from this process will provide commands to accomplish whichever direction you choose, along the lines of the sketch below. Use 'hive' to run the SQL statements. Use hadoopcli to run the 'hdfs' commands in bulk.
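A sketch of what those remediations look like (the db/table/partition/path names are made up; take the actual commands from the report output):

```bash
# option 1 (recommended): drop the offending partition or table
hive -e "ALTER TABLE finance_db.ledger DROP IF EXISTS PARTITION (dt='2019-01-01')"

# option 2: recreate the missing directory the metastore expects
hdfs dfs -mkdir -p /warehouse/tablespace/managed/hive/finance_db.db/ledger/dt=2019-01-01
```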
- Hive 3 Upgrade Checks - Bad ORC Filenames
  - Bad Filename Format
Tables that would be converted from a Managed Non-ACID table to an ACID transactional table require the files to match a certain pattern. This process will scan the potential directories of these tables for bad filename patterns. When located, it will indicate which tables/partitions have file naming conventions that would prevent a successful conversion to ACID. The best and easiest way to correct these file names is to use Hive SQL to rewrite the contents of the table/partition with a simple 'INSERT OVERWRITE TABLE xxx SELECT * FROM xxx'. This type of statement will replace the current bad filenames with valid file names by rewriting the contents.
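For a non-partitioned table, the rewrite pattern quoted above looks like this (the table name is made up; partitioned tables additionally need dynamic partition settings):

```sql
-- rewriting the table onto itself regenerates the data files with
-- Hive-generated file names that are safe for ACID conversion
INSERT OVERWRITE TABLE finance_db.ledger
SELECT * FROM finance_db.ledger;
```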
- Hive 3 Upgrade Checks - Managed Table Migrations
  - Ownership Check
  - Conversion to ACID tables
This process will list tables that will, and 'could', be migrated to "Managed ACID" tables during the upgrade process. If these tables are used by Spark OR the data is managed by a separate process that interacts with the FileSystem, DO NOT LET THESE conversions happen. The output of this process will supply Hive DDL commands to convert these tables to "EXTERNAL / PURGE" tables in Hive 3, which behave the same as the 'classic' Hive 1/2 Managed Non-ACID table.
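A sketch of the kind of EXTERNAL / PURGE DDL the report supplies (the table name is made up; use the exact commands from the report output):

```sql
-- keep the table out of ACID management, but preserve drop-time
-- data cleanup like a classic managed table
ALTER TABLE finance_db.ledger
SET TBLPROPERTIES ('EXTERNAL'='TRUE', 'external.table.purge'='true');
```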
- Hive 3 Upgrade Checks - Compaction Check
  - Compaction Check
Review ACID tables for 'delta' directories. Where 'delta' directories are found, the report will flag the table/partition for a major compaction, which should be completed before the upgrade.
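Running a major compaction is standard Hive SQL; a sketch (the table/partition names are made up):

```sql
-- merge outstanding delta directories into base files
ALTER TABLE finance_db.ledger PARTITION (dt='2019-01-01') COMPACT 'major';

-- watch progress; wait for the compaction to report 'succeeded'
SHOW COMPACTIONS;
```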
- Metastore Report
  - Questionable Serde's Check
Will list tables using SERDE's that are not standard to the platform.
- Action here is to either:
  - Remove the table using the SERDE, if the SERDE isn't available, or
  - Ensure the SERDE is available during the upgrade so the table can be evaluated.
- Legacy Kudu Serde Report
Early versions of Hive/Impala tables using Kudu were built before Kudu became an Apache Project. Once it became an Apache Project, the base Kudu Storage Handler classname changed. This report locates and reports on tables using the legacy storage handler class.
- Legacy Decimal Scale and Precision Check
When the DECIMAL data type was first introduced in Hive 1, it did NOT include a Scale or Precision element. This causes issues in later integrations between Hive and Spark. We'll identify and suggest corrective action for tables where this condition exists (a sketch of the fix follows this list).
- Managed Table Shadows
In Hive 3, Managed tables are 'ACID' tables. Sharing a location between two 'ACID' tables will cause compaction issues and data issues. These need to be resolved before the upgrade.
- Database / Table and Partition Counts
Use this to understand the scope of what is in the metastore.
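For the Decimal Scale/Precision condition above, the corrective action is typically a column redefinition with an explicit precision and scale; a sketch (names and the (18,2) type are made up — match your data's actual range):

```sql
-- redefine a bare DECIMAL column with explicit precision/scale
ALTER TABLE finance_db.ledger
CHANGE COLUMN amount amount DECIMAL(18,2);
```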
The reports created by this application will provide you with a prescriptive set of actions that will prepare and accelerate the Hive 1/2 to Hive 3 upgrade.
Hive consists of two parts: metadata and data. Since these are not strictly linked, they can get out of sync. The 'Location Scans' process above finds tables and partitions that don't have a matching file system location. Use the details of this report to 'clean up' the metastore and filesystem.
Use the reports and scripts provided to build a plan for your upgrade. The latest versions of u3 will create a file called hsmm_includelist.yaml. This contains a list of databases and tables that need attention. Use the contents of this list to choose the actions you'll take for the upgrade. The choices are:
- Run the HiveStrictManagedMigration process against these targeted database and tables to avoid its default iteration over ALL DB's and tables.
- Run the scripts suggested by the reports to migrate your tables. NOTE: Scripts aren’t included to address legacy Kudu storage handler classes (CDH Only). Run the HSMM process on those DB’s and Tables identified in the report.
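For orientation only, the include list pairs databases with the tables needing attention. The key names below are assumptions; consult the file that u3 actually generates for the real structure:

```yaml
# hypothetical shape of hsmm_includelist.yaml - illustrative only
databaseIncludeLists:
  finance_db:
    - ledger
    - invoices
  marketing_db:
    - campaigns
```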
Note that the HiveStrictManagedMigration process DOES make adjustments to database definitions as part of the upgrade. The CDP upgrade docs will include examples of how to ensure that process is run against the DB's and not any tables, to minimize the time spent in that process.