[SPARK-2523] [SQL] Hadoop table scan bug fixing #1439
Conversation
QA tests have started for PR 1439. This patch merges cleanly.
QA results for PR 1439:
Could you elaborate on when we will see an exception? Can you provide a case?
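For illustration, here is a minimal sketch of such a case, modeled on the regression test added later in this PR; the exact statements are illustrative rather than quoted from the patch:

```scala
import org.apache.spark.sql.hive.test.TestHive

// Partition ds='2010-01-01' is written with the table's original SerDe.
TestHive.hql("""CREATE TABLE part_scan_test (key STRING, value STRING)
             |PARTITIONED BY (ds STRING)
             |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
             |STORED AS RCFILE""".stripMargin)
TestHive.hql("FROM src INSERT INTO TABLE part_scan_test PARTITION (ds='2010-01-01') SELECT 100,100 LIMIT 1")

// Changing the table's SerDe before writing a second partition leaves the two
// partitions with different, incompatible SerDes.
TestHive.hql("ALTER TABLE part_scan_test SET SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'")
TestHive.hql("FROM src INSERT INTO TABLE part_scan_test PARTITION (ds='2010-01-02') SELECT 200,200 LIMIT 1")

// Before this patch a single table-level ObjectInspector was applied to all
// partitions, so scanning across both partitions could throw ClassCastException.
TestHive.hql("SELECT key, value FROM part_scan_test").collect()
```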
```scala
iter.map { value =>
  val raw = deserializer.deserialize(value)
  var idx = 0;
  while(idx < fieldRefs.length) {
```
nit: space after while
@yhuai & @concretevitamin thanks for the comments. I've updated the description in JIRA; can you please jump there and take a look?
Sorry, forgot to paste the link: https://issues.apache.org/jira/browse/SPARK-2523
QA tests have started for PR 1439. This patch merges cleanly.
QA results for PR 1439:
```scala
private[this] def castFromString(value: String, dataType: DataType) = {
  Cast(Literal(value), dataType).eval(null)
}
```
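For context: Hive keeps partition key values as strings, and this helper re-types them via a constant-folded Catalyst expression. A minimal sketch of the same evaluation (package paths are those of the Spark 1.x codebase this PR targets; treat them as assumptions):

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.catalyst.types.IntegerType

// The partition value "2010" arrives as a string; Cast folds it to an Int.
// eval(null) is safe here because neither Literal nor Cast consults the
// input row for a constant child.
val partitionValue = Cast(Literal("2010"), IntegerType).eval(null)  // 2010
```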
```scala
private def addColumnMetadataToConf(hiveConf: HiveConf) {
```
I would keep it. It is important to set the needed columns in the conf so that RCFile and ORC know which columns can be skipped. Also, it seems `hiveConf.set(serdeConstants.LIST_COLUMN_TYPES, columnTypeNames)` and `hiveConf.set(serdeConstants.LIST_COLUMNS, columnInternalNames)` will be used to push down filters.
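For readers following along, a rough sketch of what this column-metadata wiring looks like. Only the two `hiveConf.set` calls are quoted from the review; the input values and surrounding scaffolding are assumptions:

```scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.serde.serdeConstants

// Sketch: record the pruned column names/types in the job conf so columnar
// input formats (RCFile, ORC) can skip unreferenced columns at read time.
def addColumnMetadataToConf(hiveConf: HiveConf): Unit = {
  // Hypothetical pruned-scan inputs; the real code derives these from the
  // operator's needed column IDs.
  val columnInternalNames = "_col0,_col1"
  val columnTypeNames = "string,string"

  // The two properties quoted in the review comment above.
  hiveConf.set(serdeConstants.LIST_COLUMNS, columnInternalNames)
  hiveConf.set(serdeConstants.LIST_COLUMN_TYPES, columnTypeNames)
}
```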
Oh, yes, I didn't realize that. I will revert it.
I think we are not clear on the boundary between a […]. I think it makes sense to remove the trait of […]. @marmbrus, @liancheng, @rxin, @chenghao-intel, thoughts?
@chenghao-intel explained the root cause in https://issues.apache.org/jira/browse/SPARK-2523. Basically, we should use partition-specific `ObjectInspector`s.
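A rough sketch of the partition-specific approach: the `Deserializer` calls below are the standard Hive SerDe interface, but the helper's shape is illustrative rather than the code actually committed:

```scala
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.Deserializer

// Sketch: build one deserializer per partition from that partition's own
// SerDe class and properties, rather than reusing a single table-level
// instance whose ObjectInspector may not match every partition's layout.
def partitionDeserializer(
    serdeClass: Class[_ <: Deserializer],
    partitionProps: Properties,
    conf: Configuration): Deserializer = {
  val deserializer = serdeClass.newInstance()
  deserializer.initialize(conf, partitionProps)  // partition-specific schema
  deserializer
}
```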
@yhuai I agree with you that we should make a clear boundary between […].
@chenghao-intel I did not mean to introduce multiple […].
@yhuai I got your point eventually. I think you're right; some of the logic could be shared among TableScan operators.
I'll just add the […].
If we are going to do major refactoring, I'd want to see benchmarks showing that we aren't introducing any performance regressions. It would also be nice to see a test case that currently fails but passes after this PR is applied.
QA tests have started for PR 1439. This patch DID NOT merge cleanly!
QA tests have started for PR 1439. This patch merges cleanly.
Thank you guys, I've updated the code as suggested, and also provided the micro-benchmark results in the PR description.
QA results for PR 1439:
QA results for PR 1439:
@yhuai @concretevitamin @marmbrus @liancheng Can you take a look at this? I think the test results give us more confidence in the improvement.
```scala
// @@ -114,6 +77,7 @@ case class HiveTableScan(
val columnInternalNames = neededColumnIDs.map(HiveConf.getColumnInternalName(_)).mkString(",")

if (attributes.size == relation.output.size) {
  // TODO what if duplicated attributes queried?
```
A `HiveTableScan` will only be created by `SQLContext#pruneFilterProject`, and `pruneFilterProject` will make sure there are no duplicated attributes. Let's add a comment to explain it.
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.test.TestHive
import org.scalatest.{BeforeAndAfterAll, FunSuite}
```
@chenghao-intel Sorry for my late reply. Other than those minor comments and format issues, it looks good to me.
Also, can you delete […]?
""".stripMargin) | ||
TestHive.hql("""from src insert into table part_scan_test PARTITION (ds='2010-01-02') | ||
| select 200,200 limit 1 | ||
""".stripMargin) |
nit: let's make all SQL keywords capital here.
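For reference, the quoted statement with its keywords capitalized would read:

```scala
TestHive.hql("""FROM src INSERT INTO TABLE part_scan_test PARTITION (ds='2010-01-02')
             | SELECT 200, 200 LIMIT 1
           """.stripMargin)
```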
Thank you @yhuai, @liancheng, I've updated the code according to your comments; let's see if SparkQA is happy with the changes.
QA tests have started for PR 1439. This patch merges cleanly.
QA results for PR 1439:
LGTM
Thanks! I've merged this into master.
@marmbrus it seems this PR is still open. Can you double-check whether it's merged?
Apache mirroring is broken. We've filed a bug with infra.
I am worried this PR might have broken testing in Maven: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/228/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/testReport/junit/org.apache.spark.sql.hive.execution/HiveTableScanSuite/partition_based_table_scan_with_different_serde/ Can you please investigate?
I am looking at it. |
…maven test) This PR tries to resolve the broken Jenkins Maven test issue introduced by #1439. Now we create a single query test to run both the setup work and the test query.

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1669 from yhuai/SPARK-2523-fixTest and squashes the following commits:

358af1a [Yin Huai] Make partition_based_table_scan_with_different_serde run atomically.
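A hedged sketch of the shape of that fix, assuming the `createQueryTest` helper from Spark's Hive comparison test suite; the SQL statements are illustrative, not the committed golden test:

```scala
// Sketch (inside a suite extending HiveComparisonTest): fold the setup
// statements and the final query into one golden-file test, so the whole
// sequence runs inside a single test case instead of relying on setup done
// in a separate test whose ordering Maven does not guarantee.
createQueryTest("partition_based_table_scan_with_different_serde",
  """
    |CREATE TABLE part_scan_test (key STRING, value STRING) PARTITIONED BY (ds STRING)
    |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
    |STORED AS RCFILE;
    |FROM src INSERT INTO TABLE part_scan_test PARTITION (ds='2010-01-01') SELECT 100,100 LIMIT 1;
    |ALTER TABLE part_scan_test SET SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe';
    |FROM src INSERT INTO TABLE part_scan_test PARTITION (ds='2010-01-02') SELECT 200,200 LIMIT 1;
    |SELECT * FROM part_scan_test;
  """.stripMargin)
```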
In HiveTableScan.scala, ObjectInspector was created for all of the partition-based records, which probably causes ClassCastException if the object inspector is not identical among table & partitions.

This is the follow-up of:
apache#1408
apache#1390

I ran a micro-benchmark locally with 15,000,000 records in total, and got the results below:

| With This Patch | Partition-Based Table | Non-Partition-Based Table |
| --------------- | --------------------- | ------------------------- |
| No              | 1927 ms               | 1885 ms                   |
| Yes             | 1541 ms               | 1524 ms                   |

It shows this patch also improves performance.

PS: the benchmark code is attached as well. (thanks @liancheng)

```scala
package org.apache.spark.sql.hive

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object HiveTableScanPrepare extends App {
  case class Record(key: String, value: String)

  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  val rdd = sparkContext.parallelize((1 to 3000000).map(i => Record(s"$i", s"val_$i")))

  import hiveContext._

  hql("SHOW TABLES")
  hql("DROP TABLE if exists part_scan_test")
  hql("DROP TABLE if exists scan_test")
  hql("DROP TABLE if exists records")
  rdd.registerAsTable("records")

  hql("""CREATE TABLE part_scan_test (key STRING, value STRING) PARTITIONED BY (part1 string, part2 STRING)
       | ROW FORMAT SERDE
       | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
       | STORED AS RCFILE
     """.stripMargin)
  hql("""CREATE TABLE scan_test (key STRING, value STRING)
       | ROW FORMAT SERDE
       | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
       | STORED AS RCFILE
     """.stripMargin)

  for (part1 <- 2000 until 2001) {
    for (part2 <- 1 to 5) {
      hql(s"""from records
           | insert into table part_scan_test PARTITION (part1='$part1', part2='2010-01-$part2')
           | select key, value
         """.stripMargin)
      hql(s"""from records
           | insert into table scan_test select key, value
         """.stripMargin)
    }
  }
}

object HiveTableScanTest extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("SHOW TABLES")

  val part_scan_test = hql("select key, value from part_scan_test")
  val scan_test = hql("select key, value from scan_test")

  val r_part_scan_test = (0 to 5).map(i => benchmark(part_scan_test))
  val r_scan_test = (0 to 5).map(i => benchmark(scan_test))

  println("Scanning Partition-Based Table")
  r_part_scan_test.foreach(printResult)
  println("Scanning Non-Partition-Based Table")
  r_scan_test.foreach(printResult)

  def printResult(result: (Long, Long)) {
    println(s"Duration: ${result._1} ms Result: ${result._2}")
  }

  def benchmark(srdd: SchemaRDD) = {
    val begin = System.currentTimeMillis()
    val result = srdd.count()
    val end = System.currentTimeMillis()
    ((end - begin), result)
  }
}
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes apache#1439 from chenghao-intel/hadoop_table_scan and squashes the following commits:

888968f [Cheng Hao] Fix issues in code style
27540ba [Cheng Hao] Fix the TableScan Bug while partition serde differs
40a24a7 [Cheng Hao] Add Unit Test