Compatibility with Apache Hive
pwendell edited this page Oct 13, 2012 · 5 revisions
Shark is designed to be "out of the box" compatible with existing Hive installations. When you deploy Shark in a cluster, you do not need to modify your existing Hive Metastore or change the data placement or partitioning of your tables.
Shark supports the vast majority of Hive features, such as:
- Hive query statements, including:
  - SELECT
  - GROUP BY
  - ORDER BY
  - CLUSTER BY
  - SORT BY
- All Hive operators, including:
  - Relational operators (=, <=>, ==, <>, <, >, >=, <=, etc)
  - Arithmetic operators (+, -, *, /, %, etc)
  - Logical operators (AND, &&, OR, ||, etc)
  - Complex type constructors
  - Mathematical functions (sign, ln, cos, etc)
  - String functions (instr, length, printf, etc)
- User defined functions (UDF)
- User defined aggregation functions (UDAF)
- User defined serialization formats (SerDes)
- Joins
  - JOIN
  - {LEFT|RIGHT|FULL} OUTER JOIN
  - LEFT SEMI JOIN
  - CROSS JOIN
- Unions
- Subqueries
  - SELECT col FROM ( SELECT a + b AS col FROM t1) t2
- Sampling
- Explain
- Partitioned tables
- All Hive DDL Functions, including:
  - CREATE TABLE
  - CREATE TABLE AS SELECT
  - ALTER TABLE
- All Hive Data types, including:
  - TINYINT
  - SMALLINT
  - INT
  - BIGINT
  - BOOLEAN
  - FLOAT
  - DOUBLE
  - STRING
  - BINARY
  - TIMESTAMP
  - ARRAY<>
  - MAP<>
  - STRUCT<>
  - UNIONTYPE<>
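As a rough illustration, the query below combines several of the features listed above (an equi-join, a subquery, GROUP BY, ORDER BY, and a filter on a partition column). The tables `page_views` and `users` and their columns are hypothetical and only serve as an example; any existing Hive tables with a similar shape should work the same way.

```sql
-- Hypothetical tables, assumed to already exist in the Hive Metastore:
--   page_views(user_id BIGINT, url STRING) PARTITIONED BY (dt STRING)
--   users(id BIGINT, country STRING)
SELECT t.country, COUNT(*) AS views
FROM (
  -- Subquery: join page views to users for a single partition
  SELECT u.country AS country
  FROM page_views pv
  JOIN users u ON (pv.user_id = u.id)
  WHERE pv.dt = '2012-10-01'
) t
GROUP BY t.country
ORDER BY views DESC;
```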
Below is a list of Hive features that we don't support yet. The vast majority of these features are rarely used in Hive deployments.
Major Hive Features
- Tables with buckets: A bucket is a hash partition within a Hive table partition. Shark doesn't support buckets.
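For reference, a bucketed table in Hive is one declared with a `CLUSTERED BY ... INTO n BUCKETS` clause, as in the sketch below. The table name, columns, and bucket count are made up for illustration. How Shark behaves when it encounters such a table is not described here, so treat this only as a way to recognize the affected tables.

```sql
-- Hypothetical bucketed table: rows in each dt partition are
-- hash-partitioned on user_id into 32 buckets.
CREATE TABLE user_events (user_id BIGINT, event STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```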
Esoteric Hive Features
- Tables with partitions using different input formats: In Shark, all table partitions need to have the same input format.
- Non-equi outer join: For the uncommon use case of outer joins with non-equi join conditions (e.g. the condition "key < 10"), Shark outputs wrong results for the NULL tuple.
- Unique join
- Single query multi insert
- Column statistics collecting: Shark does not piggyback scans to collect column statistics at the moment.
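To make two of the items above concrete, the sketch below shows what the affected Hive queries typically look like. All table and column names (`a`, `b`, `src`, `dest_small`, `dest_large`) are hypothetical. The first query is one reading of the "key < 10" case: an outer join whose ON clause carries a non-equality condition next to the equality condition. The second is Hive's multi-insert form, followed by a possible workaround that issues separate statements; the workaround is an assumption, not something this page prescribes, and it scans `src` twice.

```sql
-- Outer join with a non-equi condition in the ON clause.
-- Rows of a with key >= 10 should come back with NULLs for b's columns;
-- this is the case described above as producing wrong results.
SELECT a.key, b.value
FROM a LEFT OUTER JOIN b
  ON (a.key = b.key AND a.key < 10);

-- Hive single-query multi-insert (not supported):
FROM src
INSERT OVERWRITE TABLE dest_small SELECT key, value WHERE key < 100
INSERT OVERWRITE TABLE dest_large SELECT key, value WHERE key >= 100;

-- Possible workaround: one INSERT per destination table.
INSERT OVERWRITE TABLE dest_small SELECT key, value FROM src WHERE key < 100;
INSERT OVERWRITE TABLE dest_large SELECT key, value FROM src WHERE key >= 100;
```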
Hive Input/Output Formats
- File format for CLI: For results shown in the CLI, Shark only supports TextOutputFormat.
- Hadoop archive
Hive Optimizations
A handful of Hive optimizations are not yet included in Shark. Some of these (such as indexes) are not necessary due to Shark's in-memory computational model. Others are slated for future releases of Shark.
- Block level bitmap indexes and virtual columns (used to build indexes)
- Automatically convert a join to map join: For joining a large table with multiple small tables, Hive automatically converts the join into a map join. We are adding this auto conversion in the next release.
- Automatically determine the number of reducers for joins and groupbys: Currently in Shark, you need to control the degree of parallelism post-shuffle using "set mapred.reduce.tasks=[num_tasks];". We are going to add auto-setting of parallelism in the next release.
- Metadata-only query: For queries that can be answered using only metadata, Shark still launches tasks to compute the result.
- Skew data flag: Shark does not follow the skew data flags in Hive.
- STREAMTABLE hint in join: Shark does not follow the STREAMTABLE hint.
- Merge multiple small files for query results: If the result output contains multiple small files, Hive can optionally merge them into fewer large files to avoid overflowing the HDFS metadata. Shark does not support this.
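Since the number of reducers is not chosen automatically, a session might set it by hand before running a shuffle-heavy query, as sketched below. The value 16 and the `employees` table are placeholders; a suitable number depends on the cluster and the data size.

```sql
-- Set the post-shuffle degree of parallelism for subsequent queries.
set mapred.reduce.tasks=16;

-- Hypothetical aggregation that shuffles data by dept.
SELECT dept, COUNT(*) FROM employees GROUP BY dept;
```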