
Compression codec change #754

Merged
merged 11 commits into from
Jul 31, 2013

Conversation

rxin
Member

rxin commented Jul 31, 2013

This is based on and subsumes @lyogavin's pull request #685.

  • Cleaned up the code and moved the compression codecs into a new package, spark.io
  • Added unit tests
  • Switched the default compression codec to Snappy, since it has lower memory overhead
  • Added configuration options to the documentation file
  • Properly resolved dependency conflicts in SBT for Snappy
  • Added the Snappy dependency in Maven
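For reference, the shape of the new abstraction is roughly the following. This is a hypothetical sketch, not the merged code: it uses the JDK's GZIP streams as a stand-in for the LZF and Snappy implementations (which depend on the ning-compress and snappy-java jars), and a simple match instead of the reflection-based lookup on the `spark.io.compression.codec` property.

```scala
import java.io.{InputStream, OutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Pluggable compression codec, as introduced in the new spark.io package.
trait CompressionCodec {
  def compressedOutputStream(s: OutputStream): OutputStream
  def compressedInputStream(s: InputStream): InputStream
}

// Stand-in codec; the PR itself ships LZF- and Snappy-backed implementations.
class GzipCompressionCodec extends CompressionCodec {
  override def compressedOutputStream(s: OutputStream): OutputStream = new GZIPOutputStream(s)
  override def compressedInputStream(s: InputStream): InputStream = new GZIPInputStream(s)
}

object CompressionCodec {
  // The merged code instantiates the codec class named by the
  // spark.io.compression.codec property; a match keeps this sketch self-contained.
  def createCodec(): CompressionCodec =
    System.getProperty("spark.io.compression.codec", "gzip") match {
      case "gzip" => new GzipCompressionCodec
      case other  => sys.error(s"unknown codec in this sketch: $other")
    }
}
```

Block managers, broadcast, and checkpointing can then ask for a codec instead of constructing LZF streams directly.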

@lyogavin
Contributor

Reynold, thanks very much for this. Sorry, I've been busy deploying our first release to production and haven't had time to handle this. Thanks for the help; I appreciate it!

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/374/

@mateiz
Member

mateiz commented Jul 31, 2013

Hey Reynold, one question: is there a native Snappy library this depends on? If so, where will it look for it?

@rxin
Member Author

rxin commented Jul 31, 2013

Yes.

See http://xerial.org/snappy-java/

Portable across various operating systems; snappy-java contains native libraries built for Windows/Mac/Linux (32/64-bit). At runtime, snappy-java loads one of these libraries according to your machine environment (it looks at the system properties os.name and os.arch).

@lyogavin
Contributor

Yes, the jar published in the Maven repository already contains native libraries for Windows/Mac/Linux (32/64-bit). Users on other OS/CPU architectures may need to build the native libraries from source, as described at http://xerial.org/snappy-java/.
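The platform detection boils down to two JDK system properties; the snippet below just prints what snappy-java would consult (it is not snappy-java's own code):

```scala
// snappy-java picks a bundled native library based on these two properties.
val osName = System.getProperty("os.name") // e.g. "Linux", "Mac OS X", "Windows 10"
val osArch = System.getProperty("os.arch") // e.g. "amd64", "aarch64", "x86"
println(s"snappy-java would load its native library for $osName / $osArch")
```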

@mateiz
Member

mateiz commented Jul 31, 2013

Ah okay, that's great.

*/
trait CompressionCodec {

def compressionOutputStream(s: OutputStream): OutputStream
Member

If you don't mind, rename these to compressedOutputStream and compressedInputStream -- just seems like a better naming convention for streams.
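With the suggested naming, a round trip through a codec reads naturally. A minimal sketch, again with the JDK's GZIP streams standing in for a real codec:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical codec using the compressed*Stream naming suggested above.
object GzipCodec {
  def compressedOutputStream(s: OutputStream): OutputStream = new GZIPOutputStream(s)
  def compressedInputStream(s: InputStream): InputStream = new GZIPInputStream(s)
}

// The kind of round-trip check the new unit tests perform.
val payload = "spark compression codec".getBytes("UTF-8")
val buf = new ByteArrayOutputStream()
val out = GzipCodec.compressedOutputStream(buf)
out.write(payload)
out.close()
val decoded = GzipCodec.compressedInputStream(
  new ByteArrayInputStream(buf.toByteArray)).readAllBytes()
assert(decoded.sameElements(payload))
```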

@mateiz
Member

mateiz commented Jul 31, 2013

Hey Reynold, FYI, there are two other places where we should use the codec instead of LZF:

  • HttpBroadcast.scala
  • Checkpoint.scala

Basically do a search through the codebase for LZF and make sure we're replacing it.
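A quick way to enumerate the remaining call sites (the sample tree below just makes the command self-contained; against the real checkout you would run the final grep from the repo root):

```shell
# List every Scala file that still references LZF directly.
mkdir -p /tmp/lzf-scan
printf 'val out = new LZFOutputStream(fos)\n' > /tmp/lzf-scan/Checkpoint.scala
printf 'val out = codec.compressedOutputStream(fos)\n' > /tmp/lzf-scan/HttpBroadcast.scala
grep -rln "LZF" /tmp/lzf-scan
```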

@mateiz
Member

mateiz commented Jul 31, 2013

BTW apart from this the patch looks good to me.

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/378/

@rxin
Member Author

rxin commented Jul 31, 2013

Changed according to feedback. PTAL.

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/384/

@@ -49,6 +50,11 @@ class Checkpoint(@transient ssc: StreamingContext, val checkpointTime: Time)
}
}

private[streaming]
object Checkpoint {
Member

Using singleton objects is kind of ugly for testing purposes; could we just create a new codec on each checkpoint? It doesn't seem like that big a deal.

Member Author

But then we introduce the possibility that the writer could be using one codec, and the reader uses another one.

Member

Hmm, I don't really get it. The reader is going to be a different instance of the program anyway, after a failure. It will have to be configured with the same codec class either way.

Contributor

Can we add a header to the output that specifies the codec used? Seems
brittle to require the configuration to be correct for simply reading the
checkpoint. (Not that it's an unreasonable assumption in >90% of cases, but
robustness is nice.)
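The header idea was not adopted in this PR, but a minimal version is easy to sketch (hypothetical code; writeUTF length-prefixes the codec name so the reader can recover it before touching the compressed payload):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream, InputStream, OutputStream}

// Prefix the checkpoint with the codec's class name, then the compressed bytes.
def writeWithHeader(out: OutputStream, codecClass: String, payload: Array[Byte]): Unit = {
  val d = new DataOutputStream(out)
  d.writeUTF(codecClass)
  d.write(payload)
  d.flush()
}

// The reader learns which codec to instantiate from the header alone,
// independent of its own configuration.
def readCodecName(in: InputStream): String = new DataInputStream(in).readUTF()

val buf = new ByteArrayOutputStream()
writeWithHeader(buf, "spark.io.SnappyCompressionCodec", Array[Byte](1, 2, 3))
val codecName = readCodecName(new ByteArrayInputStream(buf.toByteArray))
assert(codecName == "spark.io.SnappyCompressionCodec")
```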


@rxin
Member Author

rxin commented Jul 31, 2013

Ok I removed the singleton and squashed the last two commits.

Jenkins, test this please.

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/400/

mateiz added a commit that referenced this pull request Jul 31, 2013
mateiz merged commit a386ced into mesos:master Jul 31, 2013
@mateiz
Member

mateiz commented Jul 31, 2013

Alright, merged this in. Thanks @rxin and @lyogavin .

xiajunluan pushed a commit to xiajunluan/spark that referenced this pull request May 30, 2014
…addendum

Author: witgo <witgo@qq.com>

Closes mesos#754 from witgo/commons-lang and squashes the following commits:

3ebab31 [witgo] merge master
f3b8fa2 [witgo] merge master
2083fae [witgo] repeat definition
5599cdb [witgo] multiple version of sbt  dependency
c1b66a1 [witgo] fix different versions of commons-lang dependency