
Hive Upgrade Check (u3)

Review Hive Metastore databases and tables for upgrade, or simply to evaluate potential issues. The intent is to make that process prescriptive and consumable by Cloudera customers. The application is 'Apache Hive' based, so it should work against 'HDP', 'CDH', and 'CDP' clusters. See additional details about upgrading HDP2 and CDH 5/6 to CDP.

Download, Requirements, and Configuration Details

Quick Start (What to do!!!)

This can be a long-running process. I recommend using screen or tmux sessions to run it. That way the session won't get terminated when you disconnect from the host, and you can easily reattach later to see the progress.
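For example, a minimal tmux workflow (assuming tmux is installed; the session name is arbitrary):

```sh
# Start a named session and launch the scan inside it; detach with Ctrl-b d
tmux new -s hive-sre
hive-sre u3 -cdh -o ./sre-out

# Reattach later to check progress
tmux attach -t hive-sre
```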

The process should start registering counts pretty quickly for each of the sub-checks. If you do not see these counts moving, or they show 0, check your connection to the Metastore database (i.e., URL, RDBMS driver (which needs to be in $HOME/.hive-sre/aux_libs), username, password, and/or permissions). hive-sre versions 2.4.0.21.0 and above will catch some of these misconfigurations and exit quickly.
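A minimal sketch of staging the driver, assuming a MySQL-backed Metastore (the jar name and version are illustrative); the -tsql flag from the help output below can then verify the connection before committing to a full run:

```sh
# The RDBMS JDBC driver must be in aux_libs for hive-sre to load it
mkdir -p $HOME/.hive-sre/aux_libs
cp mysql-connector-java-8.0.28.jar $HOME/.hive-sre/aux_libs/

# Check the SQL against the target Metastore RDBMS before a long run
hive-sre u3 -cdh -tsql -o ./sre-out
```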

Logs for the application are at $HOME/.hive-sre/logs. Check those for details on progress and/or problems.

Use the Platform flag to run ONLY the important reports.

Use one of -cdh, -hdp2, or -hdp3, depending on your source platform, to run only the reports that make sense. If you don't use one of these options, that's fine; just know that you'll be running MORE reports than you need and the process will run much longer.

EACH report contains a description of the actions that should be taken.

If you are running the default upgrade and NOT the CDP Expedited Hive Upgrade Process for CDH5, CDH6, or HDP2, there is an HSMM (Hive Strict Managed Migration) process that runs AFTER the binaries and components have been upgraded to CDP's Hive 3. HSMM will fail if the issues/errors reported in 1 - Table / Partition Location Scan - Missing Directories and 5 - Questionable Serde's Check are not fixed BEFORE doing the upgrade.

Metadata updates, like 3 - Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations, are done after the components and configurations have been updated to CDP. Therefore, the report actions for 3 - Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations and 6 - Provide Script for ACID tables missing the 'bucketing_version' property will be handled by HSMM.

If you are doing the Expedited Hive Upgrade process there are two options:

  1. Skip the HSMM process altogether. This means you will need to run the actions from the following reports manually. These actions can be done BEFORE the upgrade, BUT they may change behaviors expected in the legacy platform. They can also be done AFTER the upgrade, where there isn't a behavior impact. They will take time, depending on the number of actions you need to perform. The tables must be updated BEFORE they can be consumed by users, but this doesn't mean you need to hold the upgrade or increase the downtime of the cluster. Organize high-priority DBs first and work on less-accessed DBs later.
    1. 3 - Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations
    2. 6 - Provide Script for ACID tables missing the 'bucketing_version' property
  2. Use the hsmm_includelist.yaml produced by u3 to direct HSMM towards targeted actions. See the Expedite Hive Upgrade docs for details.
HDP2

Runs reports:

  • 1 - Table / Partition Location Scan - Missing Directories
  • 3 - Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations
  • 4 - Hive 3 Upgrade Checks - Compaction Check
  • 5 - Hive Metastore Check
    - Questionable Serde's Check
    - Legacy Kudu Serde Report
    - Legacy Decimal Scale/Precision Check
    - List Databases with Table/Partition Counts
  • 6 - Provide Script for ACID tables missing the 'bucketing_version' property
  • 7 - List tables using known Storage Handlers
  • 8 - Hive 3 Upgrade Checks - List Managed Non-ACID tables that need to be converted. (Run u3e to process)
  • 9 - List ACID tables missing the 'bucketing_version' property (Run u3e to process)
HDP3

Runs reports:

  • 1 - Table / Partition Location Scan - Missing Directories
  • 5 - Hive Metastore Check
    - Questionable Serde's Check
    - Legacy Kudu Serde Report
    - Legacy Decimal Scale/Precision Check
    - List Databases with Table/Partition Counts
  • 6 - Provide Script for ACID tables missing the 'bucketing_version' property
CDH

Runs reports:

  • 1 - Table / Partition Location Scan - Missing Directories
  • 3 - Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations
  • 5 - Hive Metastore Check
    - Questionable Serde's Check
    - Legacy Kudu Serde Report
    - Legacy Decimal Scale/Precision Check
    - List Databases with Table/Partition Counts
  • 6 - Provide Script for ACID tables missing the 'bucketing_version' property
  • 7 - List tables using known Storage Handlers
  • 8 - Hive 3 Upgrade Checks - List Managed Non-ACID tables that need to be converted. (Run u3e to process)
  • 9 - List ACID tables missing the 'bucketing_version' property (Run u3e to process)

Testing

I have tested this against MariaDB 10.2, Postgres, and Oracle. I have seen reports of SQL issues against MySQL 5.6, so this process will certainly have issues there. If you run this against other DB configurations, I would like to hear from you about it.

Known Issues

For a while during the evolution of Hive 3, there was a separate 'catalog' for Spark. The queries in this process do NOT consider this alternate catalog and may yield cross products in some areas if the 'spark' catalog was used at any point. Even though this tool was designed with Hive 1 and 2 in mind, it can be run against Hive 3. We do NOT include criteria regarding the 'catalog'.

Assumptions

  • Run from a node on the cluster
    • That node includes the clients and configuration files for HDFS
  • Run by a user that has READ privileges to all HDFS directories.
  • If the cluster is 'kerberized', the user has a valid Kerberos ticket BEFORE starting the application.
  • Drivers for the HMS database are available.
  • The configuration file has been defined.
  • The HMS Metastore DB is on a supported RDBMS for the platform (version matters!)
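A quick pre-flight sketch of those assumptions (the principal and warehouse path are illustrative; adjust for your cluster):

```sh
# Valid Kerberos ticket on a kerberized cluster
kinit myuser@EXAMPLE.COM
klist

# HDFS clients configured and warehouse paths readable
hdfs dfs -ls /apps/hive/warehouse

# JDBC driver staged and the configuration file defined
ls $HOME/.hive-sre/aux_libs
cat $HOME/.hive-sre/cfg/default.yaml
```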

Application Help

usage: hive-sre u3|sre|perf -cdh|-hdp2|-hdp3|-all|-i <proc[,proc...]> [options]
                version:2.4.0.24.0-SNAPSHOT
Hive SRE Utility
 -all,--all-reports                            Run ALL available processes.
 -cdh,--cloudera-data-hub                      Run processes that make sense for CDH.
 -cfg,--config <arg>                           Config with details for the Sre Job.  Must match
                                               either the sre or u3 selection. Default:
                                               $HOME/.hive-sre/cfg/default.yaml
 -db,--database <arg>                          Comma separated list of Databases.  Will override
                                               config. (up to 100)
 -dbRegEx,--database-regex <arg>               A RegEx of databases to process
 -dp,--decrypt-password <encrypted-password>   Use this in conjunction with '-pkey' to decrypt the
                                               generated passcode from `-p`.
 -edbRegEx,--exclude-database-regex <arg>      A RegEx that will filter OUT matching databases from
                                               processing
 -h,--help                                     Help
 -hdp2,--hortonworks-data-platfrom-v2          Run processes that make sense for HDP2.
 -hdp3,--hortonworks-data-platfrom-v3          Run processes that make sense for HDP3.
 -hfw,--hive-framework <arg>                   The custom HiveFramework check configuration. Needs
                                               to be in the 'Classpath'.
 -i,--include <arg>                            Comma separated list of process id's to run.  When
                                               not specified, ALL processes are run.
 -o,--output-dir <arg>                         Output Directory to save results from Sre.
 -p,--password <password>                      Use this in conjunction with '-pkey' to generate the
                                               encrypted password that you'll add to the configs for
                                               the JDBC connections.
 -pkey,--password-key <password-key>           The key used to encrypt / decrypt the cluster jdbc
                                               passwords.  If not present, the passwords will be
                                               processed as is (clear text) from the config file.
 -scc,--skip-command-checks                    Don't process the command checks for the process.
 -tsql,--test-sql                              Check SQL against target Metastore RDBMS

Visit https://github.com/cloudera-labs/hive-sre for detailed docs
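For example, a hedged sketch of the password workflow implied by the -p/-dp/-pkey options above (the key and password values are placeholders):

```sh
# Generate an encrypted passcode to paste into the config's JDBC connection settings
hive-sre u3 -pkey my-cluster-key -p 'MyMetastorePassword'

# Verify the passcode decrypts back to the original password
hive-sre u3 -pkey my-cluster-key -dp '<encrypted-passcode-from-above>'
```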

To limit which processes run, use the -i (include) option at the command line with a comma-separated list of the ids (below) of the desired processes.

u3 Processes

| id (link to sample report) | process |
|----------------------------|---------|
| 1 | Table / Partition Location Scan - Missing Directories |
| 2 | Table / Partition Location Scan - Bad ACID ORC Filenames |
| 3 | Hive 3 Upgrade Checks - Managed Non-ACID to ACID Table Migrations |
| 4 | Hive 3 Upgrade Checks - Compaction Check |
| 5 | Hive Metastore Check: Questionable Serde's Check, Legacy Kudu Serde Report, Legacy Decimal Scale/Precision Check, List Databases with Table/Partition Counts |
| 6 | Provide Script for ACID tables missing the 'bucketing_version' property |

Running

Run the Hive Metastore Check report first (-i 5) to get a list of databases with table and partition counts. With this information, you can develop a strategy to run the tool in parts using the -db option. Multi-tasking can be controlled with the parallelism setting in the configuration file.
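A sketch of that strategy (the database names are hypothetical):

```sh
# Pass 1: counts only, to size the work
hive-sre u3 -i 5 -o ./sre-out

# Pass 2+: full platform reports, a few databases at a time
hive-sre u3 -hdp2 -db sales_db,marketing_db -o ./sre-out-batch1
hive-sre u3 -hdp2 -db warehouse_db -o ./sre-out-batch2
```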

To ease the launch of the application below, configure these core environment variables.

Output

The output is a set of files with actions and errors (when encountered). The files may be text or markdown files. You may want to use a markdown viewer for easier viewing of those reports. The markdown viewer needs to support GitHub markdown tables.

Examples

All commands assume the config file is here: $HOME/.hive-sre/cfg/default.yaml

All Hive Databases - Runs the processes relevant to CDH.

hive-sre u3 -cdh -o ./sre-out

Targeted Hive Database - Runs only the reports relevant to HDP2, against only the 'priv_dstreev' Hive database.

hive-sre u3 -db priv_dstreev -hdp2 -o ./sre-out

Run ONLY the compaction check on all Hive databases - Uses the -i option to run only the 'compaction check' subroutine.

hive-sre u3 -o ./sre-out -i 4

Check and Validations Performed

NO action is taken by this process. The output of each section will contain 'actions' for you to take when a scenario materializes. It is up to you to carry out those actions after reviewing them.

  1. Hive 3 Upgrade Checks - Locations Scan

    • Missing Directories

    Missing directories cause the upgrade conversion process to fail. To prevent that failure, there are two choices for a 'missing directory': either create it or drop the table/partition.

    • You have two choices based on the output of this process.
      • RECOMMENDED: Drop the table/partition OR
      • Create the missing directory referenced by the table/partition (empty directories have been known to cause hive table CBO issues, leading to performance slowdowns).
    • The output file from this process will provide commands to accomplish whichever direction you choose. Use 'hive' to run the SQL statements. Use hadoopcli to run the 'hdfs' commands in bulk. Illustrative forms of these commands are sketched after this list.
  2. Hive 3 Upgrade Checks - Bad ORC Filenames

    • Bad Filename Format

    Tables that would be converted from a Managed Non-ACID table to an ACID transactional table require the files to match a certain pattern. This process will scan the potential directories of these tables for bad filename patterns. When located, it will indicate which tables/partitions have file naming conventions that would prevent a successful conversion to ACID. The best and easiest way to correct these filenames is to use HiveSQL to rewrite the contents of the table/partition with a simple 'INSERT OVERWRITE TABLE xxx SELECT * FROM xxx', which replaces the bad filenames with valid ones (see the sketch after this list).

  3. Hive 3 Upgrade Checks - Managed Table Migrations

    • Ownership Check
    • Conversion to ACID tables

    This process will list tables that will, or 'could', be migrated to "Managed ACID" tables during the upgrade process. If these tables are used by Spark, OR the data is managed by a separate process that interacts with the FileSystem, DO NOT LET THESE conversions happen. The output of this process will supply Hive DDL commands to convert these tables to "EXTERNAL / PURGE" tables in Hive 3, which behave the same as the 'classic' Hive 1/2 Managed Non-ACID table (see the sketch after this list).

  4. Hive 3 Upgrade Checks - Compaction Check

    • Compaction Check

    Review ACID tables for 'delta' directories. Where 'delta' directories are found, we'll flag the table/partition so a major compaction can be run before the upgrade (a sample compaction command appears in the sketch after this list).

  5. Metastore Report

    • Questionable Serde's Check

    Will list tables using SERDE's that are not standard to the platform.

    • The action here is to either:
      • Remove the table using the SERDE, if the SERDE isn't available, OR
      • Ensure the SERDE is available during the upgrade so the table can be evaluated.
    • Legacy Kudu Serde Report

      Early versions of Hive/Impala tables using Kudu were built before Kudu became an Apache Project. Once it became an Apache Project, the base Kudu Storage Handler classname changed. This report locates and reports on tables using the legacy storage handler class.

    • Legacy Decimal Scale and Precision Check

      When the DECIMAL data type was first introduced in Hive 1, it did NOT include a Scale or Precision element. This causes issues in later integrations between Hive and Spark. We'll identify and suggest corrective action for tables where this condition exists.

    • Managed Table Shadows

    In Hive 3, Managed tables are 'ACID' tables. Sharing a location between two 'ACID' tables will cause compaction and data issues. These need to be resolved before the upgrade.

    • Database / Table and Partition Counts

    Use this to understand the scope of what is in the metastore.
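The statements below are a hedged sketch of the kinds of actions checks 1 through 4 generate; the database, table, and partition names are hypothetical, and the reports emit the exact commands for your metastore:

```sh
# 1 - Missing directory: drop the dangling partition, OR create the path it references
hive -e "ALTER TABLE finance.tx_data DROP PARTITION (dt='2019-01-01');"
hdfs dfs -mkdir -p /apps/hive/warehouse/finance.db/tx_data/dt=2019-01-01

# 2 - Bad ACID ORC filenames: rewrite the data so Hive regenerates valid names
#     (partitioned tables need per-partition overwrites or dynamic partitioning enabled)
hive -e "INSERT OVERWRITE TABLE finance.tx_data SELECT * FROM finance.tx_data;"

# 3 - Keep a table from migrating to Managed ACID: mark it EXTERNAL / PURGE
hive -e "ALTER TABLE finance.spark_feed SET TBLPROPERTIES ('EXTERNAL'='TRUE','external.table.purge'='true');"

# 4 - Outstanding deltas: run a major compaction before the upgrade
hive -e "ALTER TABLE finance.acid_events COMPACT 'major';"
```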

Getting Ready for a Hive 1/2 to Hive 3 Upgrade

The reports created by this application will provide you with a prescriptive set of actions that will prepare and accelerate the Hive 1/2 to Hive 3 upgrade.

Hive consists of two parts: metadata and data. Since these are not strictly linked, they can get out of sync. The 'Location Scans' process above finds tables and partitions that don't have a matching filesystem location. Use the details of this report to 'clean up' the metastore and filesystem.

Use the reports and scripts provided to build a plan for your upgrade. The latest versions of u3 will create a file called hsmm_includelist.yaml. This contains a list of Databases and Tables that need attention. Use the contents of this list to choose the actions you'll take for the upgrade. The choices are:

  • Run the HiveStrictManagedMigration process against these targeted databases and tables to avoid its default iteration over ALL DBs and tables.
  • Run the scripts suggested by the reports to migrate your tables. NOTE: Scripts aren't included to address legacy Kudu storage handler classes (CDH only). Run the HSMM process on the DBs and tables identified in that report.

Note that the HiveStrictManagedMigration process DOES make adjustments to database definitions as part of the upgrade. The CDP upgrade docs include examples of how to ensure that process is run against the DBs only, and not any tables, to minimize the time spent in that process.