
unnecessary serialize/deserialize for cases where no conversion was performed #329

Open
wants to merge 4 commits into base: branch-0.9

Conversation

chiragaggarwal

If the partition schema does not match the table schema, the row (formed by deserializing through the partition serde) is converted to match the table schema. If conversion was performed, convertedRow will be a standard Object, but if conversion wasn't necessary, it will still be lazy.

We can't have both (standard and lazy objects) across partitions, so we serialize and deserialize again to make it lazy.

This extra serialize/deserialize is currently performed irrespective of whether conversion was done or not.

There are two effects of this serialization / deserialization:

  1. Extra serialization / deserialization cost for cases where no conversion happened.

  2. If a table is created using ThriftDeserializer, the absence of a serialize function makes it unusable in this context.

The fix: when no conversion was done (i.e. the partition and table serdes match), this serialization and deserialization step is skipped, since the object is still lazy and the step is not required. This also allows users to use ThriftDeserializer in such a case.
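To make the intent concrete, here is a minimal Scala sketch of the idea, using the names that appear in this PR's discussion; it is a sketch, not the literal patch, and tblSerDe is a placeholder name for the table-level SerDe assumed here for illustration.

// Sketch only; `tblSerDe` is a hypothetical name for the table SerDe.
rowWithPartArr.update(1, partValues)  // partition values are constant for this split
if (partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
  // No conversion happened: the row from the partition SerDe is still lazy, so the
  // extra serialize/deserialize is skipped. This also keeps a deserialize-only SerDe
  // such as ThriftDeserializer usable here.
  iter.map { value =>
    rowWithPartArr.update(0, partSerDe.deserialize(value))
    rowWithPartArr.asInstanceOf[Object]
  }
} else {
  // Conversion produced a standard Object; serialize it with the table SerDe and
  // deserialize again so every partition yields lazy objects, as before.
  iter.map { value =>
    val convertedRow =
      partTblObjectInspectorConverter.convert(partSerDe.deserialize(value))
    rowWithPartArr.update(0,
      tblSerDe.deserialize(tblSerDe.serialize(convertedRow, tblConvertedOI)))
    rowWithPartArr.asInstanceOf[Object]
  }
}

The IdentityConverter check is evaluated once per partition, so rows on the fast path avoid both the converter call and the table SerDe roundtrip.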

@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Member

rxin commented May 12, 2014

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/Shark-Pull-Request-Builder/12193/

convertedRow, tblConvertedOI))
}
case _ =>
if (partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
Contributor

Maybe it's better to move the branch out of the iter.map.

Author

For the case where partTblObjectInspectorConverter.isInstanceOf[IdentityConverter] is true, the deserialized value needs to be computed anyway, which is done as part of iter.map, so would it make sense to pull it out of this iter.map and push it into another one?

Contributor

Sorry, I meant something like this:

rowWithPartArr.update(1, partValues)
if (!partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
  iter.map { value =>
    ... // do as previously
  }
} else {
  iter.map { value =>
    rowWithPartArr.update(0, partSerDe.deserialize(value))
    rowWithPartArr.asInstanceOf[Object]
  }
}

Author

ok, I will edit it.

@chiragaggarwal
Author

Incorporated the review feedback.

iter.map { value =>
val deserializedRow = partTblObjectInspectorConverter.convert(partSerDe.deserialize(value))
rowWithPartArr.update(0, deserializedRow)
rowWithPartArr.update(1, partValues)
Contributor

partValues doesn't change within a partition, so we can move it out of the iter.map; the same applies to the later one.
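For illustration, hoisting that invariant out of the per-row closure would look roughly like this (a sketch based on the snippet above, not the committed diff):

// partValues is fixed for the whole partition, so set it once per split
// rather than once per row inside iter.map.
rowWithPartArr.update(1, partValues)
iter.map { value =>
  val deserializedRow = partTblObjectInspectorConverter.convert(partSerDe.deserialize(value))
  rowWithPartArr.update(0, deserializedRow)
  rowWithPartArr.asInstanceOf[Object]
}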

@chenghao-intel
Contributor

@chiragaggarwal thank you for the revision. I've tested it in my cluster, and it does improve performance for partitioned table scans.
And sorry for the misleading previous comment; I was thinking we'd better keep the previous implementation (not change it) and just add the new logic on top of it.

I've also updated my code example, which had a critical typo.

@chiragaggarwal
Author

@chenghao-intel I am a little confused. Does your last comment imply that the code after the revision is fine, or do you think it needs to be changed further, or that the commit before the revision was better? Could you please clarify so that I can act accordingly?

@chenghao-intel
Contributor

@chiragaggarwal sorry for the confusion.
Your code improves the performance, and I've verified it on a real cluster. However, from a readability point of view, I think it's better to add the new logic without touching the original implementation (I am not sure whether the original logic is 100% right, and I will check that against Hive; it would be cool if you can do it too :) ). So, how about the whole implementation looking like this:

// This is done once per partition, so there is no need to put it in the iteration (in iter.map).
rowWithPartArr.update(1, partValues)
if (partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
  iter.map { value =>
    rowWithPartArr.update(0, partSerDe.deserialize(value))
    rowWithPartArr.asInstanceOf[Object]
  }
} else {
  iter.map { value =>
    ... // do as originally
  }
}

Let me know if you are still confused.

@chenghao-intel
Contributor

I've tested the latest code in my cluster; it improves performance by up to 30% for a query over partitioned tables.
The code looks good to me. @rxin can you also review it and start the unit tests?

@rxin
Member

rxin commented Jul 11, 2014

@chenghao-intel do you mind submitting a PR against Spark SQL to fix the same problem there? (assuming it also exists)
