
unnecessary serialize/deserialize for cases where no conversion was performed #329

Open
wants to merge 4 commits into base: branch-0.9

Conversation

chiragaggarwal

If the partition schema does not match the table schema, the row (formed by deserializing through the partition serde) is converted to match the table schema. If conversion was performed, convertedRow will be a standard Object, but if conversion wasn't necessary, it will still be lazy.

We can't have both (standard and lazy objects) across partitions, so we serialize and deserialize again to make it lazy.

This extra serialize/deserialize is currently performed irrespective of whether conversion was done or not.

There are two effects of this serialization / deserialization:

  1. Extra serialization / deserialization cost for cases where no conversion happened.

  2. If a table is created using ThriftDeserializer, the absence of a serialize function makes it unusable in this context.

The fix: when no conversion was done (i.e. the partition and table serdes match), this serialization and deserialization step is skipped, since the object is still lazy and the step is not required. This also allows users to use ThriftDeserializer in such a case.
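To make the intent concrete, here is a minimal Scala sketch of the idea, using the names that appear in this PR's discussion; it is a sketch, not the literal patch, and tblSerDe is a placeholder name for the table-level SerDe assumed here for illustration.

// Sketch only; `tblSerDe` is a hypothetical name for the table SerDe.
rowWithPartArr.update(1, partValues)  // partition values are constant for this split
if (partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
  // No conversion happened: the row from the partition SerDe is still lazy, so the
  // extra serialize/deserialize is skipped. This also keeps a deserialize-only SerDe
  // such as ThriftDeserializer usable here.
  iter.map { value =>
    rowWithPartArr.update(0, partSerDe.deserialize(value))
    rowWithPartArr.asInstanceOf[Object]
  }
} else {
  // Conversion produced a standard Object; serialize it with the table SerDe and
  // deserialize again so every partition yields lazy objects, as before.
  iter.map { value =>
    val convertedRow =
      partTblObjectInspectorConverter.convert(partSerDe.deserialize(value))
    rowWithPartArr.update(0,
      tblSerDe.deserialize(tblSerDe.serialize(convertedRow, tblConvertedOI)))
    rowWithPartArr.asInstanceOf[Object]
  }
}

The IdentityConverter check is evaluated once per partition, so rows on the fast path avoid both the converter call and the table SerDe roundtrip.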

@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Member

rxin commented May 12, 2014

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/Shark-Pull-Request-Builder/12193/

convertedRow, tblConvertedOI))
}
case _ =>
if (partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
Contributor

Maybe it's better to move the branch out of the iter.map.

Author

For the case where partTblObjectInspectorConverter.isInstanceOf[IdentityConverter] is true, the deserialized value needs to be computed anyway, which is done as part of iter.map, so would it make sense to pull it out of this iter.map and push it into another one?

Contributor

Sorry, I meant something like this:

rowWithPartArr.update(1, partValues)
if (!partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
  iter.map { value =>
    ... // do as previously
  }
} else {
  iter.map { value =>
    rowWithPartArr.update(0, partSerDe.deserialize(value))
    rowWithPartArr.asInstanceOf[Object]
  }
}

Author

ok, I will edit it.

@chiragaggarwal
Author

Incorporated the review feedback.

iter.map { value =>
val deserializedRow = partTblObjectInspectorConverter.convert(partSerDe.deserialize(value))
rowWithPartArr.update(0, deserializedRow)
rowWithPartArr.update(1, partValues)
Contributor

partValues doesn't change within a partition, so we can move it out of the iter.map; the same applies to the later one.
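For illustration, hoisting that invariant out of the per-row closure would look roughly like this (a sketch based on the snippet above, not the committed diff):

// partValues is fixed for the whole partition, so set it once per split
// rather than once per row inside iter.map.
rowWithPartArr.update(1, partValues)
iter.map { value =>
  val deserializedRow = partTblObjectInspectorConverter.convert(partSerDe.deserialize(value))
  rowWithPartArr.update(0, deserializedRow)
  rowWithPartArr.asInstanceOf[Object]
}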

@chenghao-intel
Contributor

@chiragaggarwal thank you for the revision. I've tested it in my cluster, and it does improve performance for partitioned table scans.
And sorry for the misleading previous comment; I was thinking we'd better keep the previous implementation (not change it) and just add the new logic on top of it.

I've also updated my code example, which had a critical typo.

@chiragaggarwal
Author

@chenghao-intel I am a little confused. Does your last comment imply that the code after the revision is fine, or do you think it needs to be changed further, or that the commit before the revision was better? Could you please clarify so that I can act accordingly?

@chenghao-intel
Contributor

@chiragaggarwal sorry for the confusion.
Your code improves the performance, and I've verified it on a real cluster. However, from a readability point of view, I think it's better to add the new logic without touching the original implementation (I am not sure whether the original logic is 100% right, and I will check that against Hive; it would be cool if you can do it too :) ). So, how about the whole implementation looking like this:

// This is done once per partition, so there is no need to put it in the iteration (in iter.map).
rowWithPartArr.update(1, partValues)
if (partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
  iter.map { value =>
    rowWithPartArr.update(0, partSerDe.deserialize(value))
    rowWithPartArr.asInstanceOf[Object]
  }
} else {
  iter.map { value =>
    ... // do as originally
  }
}

Let me know if you are still confused.

@chenghao-intel
Contributor

I've tested the latest code in my cluster; it improves performance by up to 30% for a query over partitioned tables.
The code looks good to me. @rxin can you also review it and start the unit tests?

@rxin
Member

rxin commented Jul 11, 2014

@chenghao-intel do you mind submitting a PR against Spark SQL to fix the same problem there? (assuming it also exists)
