
Compression codec change #754

Merged
merged 11 commits into from
Jul 31, 2013

Conversation

rxin
Member

rxin commented Jul 31, 2013

This is based on and subsumes @lyogavin's pull request #685.

  • Cleaned up the code and moved the compression codecs into a new package, spark.io
  • Added unit tests
  • Switched the default compression codec to Snappy, since it has lower memory overhead
  • Added configuration options to the documentation file
  • Properly resolved dependency conflicts in SBT for Snappy
  • Added the Snappy dependency in Maven
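For reference, the shape of the new abstraction is roughly the following. This is a hypothetical sketch, not the merged code: it uses the JDK's GZIP streams as a stand-in for the LZF and Snappy implementations (which depend on the ning-compress and snappy-java jars), and a simple match instead of the reflection-based lookup on the `spark.io.compression.codec` property.

```scala
import java.io.{InputStream, OutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Pluggable compression codec, as introduced in the new spark.io package.
trait CompressionCodec {
  def compressedOutputStream(s: OutputStream): OutputStream
  def compressedInputStream(s: InputStream): InputStream
}

// Stand-in codec; the PR itself ships LZF- and Snappy-backed implementations.
class GzipCompressionCodec extends CompressionCodec {
  override def compressedOutputStream(s: OutputStream): OutputStream = new GZIPOutputStream(s)
  override def compressedInputStream(s: InputStream): InputStream = new GZIPInputStream(s)
}

object CompressionCodec {
  // The merged code instantiates the codec class named by the
  // spark.io.compression.codec property; a match keeps this sketch self-contained.
  def createCodec(): CompressionCodec =
    System.getProperty("spark.io.compression.codec", "gzip") match {
      case "gzip" => new GzipCompressionCodec
      case other  => sys.error(s"unknown codec in this sketch: $other")
    }
}
```

Block managers, broadcast, and checkpointing can then ask for a codec instead of constructing LZF streams directly.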

@lyogavin
Contributor

Reynold, thanks very much for this. Sorry, I've been busy deploying our first release to production and haven't had time to handle this. Thanks for the help; I appreciate it!

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/374/

@mateiz
Member

mateiz commented Jul 31, 2013

Hey Reynold, one question: is there a native Snappy library this depends on? If so, where will it look for it?

@rxin
Member Author

rxin commented Jul 31, 2013

Yes.

See http://xerial.org/snappy-java/

Portable across various operating systems; snappy-java contains native libraries built for Windows/Mac/Linux (32/64-bit). At runtime, snappy-java loads one of these libraries according to your machine environment (it looks at the system properties os.name and os.arch).

@lyogavin
Contributor

Yes, the jar published in the Maven repository already contains native libraries for Windows/Mac/Linux (32/64-bit). Users on other OS/CPU architectures may need to build the native libraries from source, as described at http://xerial.org/snappy-java/.
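The platform detection boils down to two JDK system properties; the snippet below just prints what snappy-java would consult (it is not snappy-java's own code):

```scala
// snappy-java picks a bundled native library based on these two properties.
val osName = System.getProperty("os.name") // e.g. "Linux", "Mac OS X", "Windows 10"
val osArch = System.getProperty("os.arch") // e.g. "amd64", "aarch64", "x86"
println(s"snappy-java would load its native library for $osName / $osArch")
```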

@mateiz
Member

mateiz commented Jul 31, 2013

Ah okay, that's great.

*/
trait CompressionCodec {

def compressionOutputStream(s: OutputStream): OutputStream
Member

If you don't mind, rename these to compressedOutputStream and compressedInputStream -- just seems like a better naming convention for streams.
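With the suggested naming, a round trip through a codec reads naturally. A minimal sketch, again with the JDK's GZIP streams standing in for a real codec:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical codec using the compressed*Stream naming suggested above.
object GzipCodec {
  def compressedOutputStream(s: OutputStream): OutputStream = new GZIPOutputStream(s)
  def compressedInputStream(s: InputStream): InputStream = new GZIPInputStream(s)
}

// The kind of round-trip check the new unit tests perform.
val payload = "spark compression codec".getBytes("UTF-8")
val buf = new ByteArrayOutputStream()
val out = GzipCodec.compressedOutputStream(buf)
out.write(payload)
out.close()
val decoded = GzipCodec.compressedInputStream(
  new ByteArrayInputStream(buf.toByteArray)).readAllBytes()
assert(decoded.sameElements(payload))
```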

@mateiz
Member

mateiz commented Jul 31, 2013

Hey Reynold, FYI, there are two other places where we should use the codec instead of LZF:

  • HttpBroadcast.scala
  • Checkpoint.scala

Basically do a search through the codebase for LZF and make sure we're replacing it.
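A quick way to enumerate the remaining call sites (the sample tree below just makes the command self-contained; against the real checkout you would run the final grep from the repo root):

```shell
# List every Scala file that still references LZF directly.
mkdir -p /tmp/lzf-scan
printf 'val out = new LZFOutputStream(fos)\n' > /tmp/lzf-scan/Checkpoint.scala
printf 'val out = codec.compressedOutputStream(fos)\n' > /tmp/lzf-scan/HttpBroadcast.scala
grep -rln "LZF" /tmp/lzf-scan
```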

@mateiz
Member

mateiz commented Jul 31, 2013

BTW apart from this the patch looks good to me.

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/378/

@rxin
Member Author

rxin commented Jul 31, 2013

Changed according to feedback. PTAL.

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/384/

@@ -49,6 +50,11 @@ class Checkpoint(@transient ssc: StreamingContext, val checkpointTime: Time)
}
}

private[streaming]
object Checkpoint {
Member

Using singleton objects is kind of ugly for testing purposes; could we just create a new codec on each checkpoint? It doesn't seem like that big a deal.

Member Author

But then we introduce the possibility that the writer could be using one codec, and the reader uses another one.

Member

Hmm, I don't really get it. The reader is going to be a different instance of the program anyway, after a failure. It will have to be configured with the same codec class either way.

Contributor

Can we add a header to the output that specifies the codec used? Seems
brittle to require the configuration to be correct for simply reading the
checkpoint. (Not that it's an unreasonable assumption in >90% of cases, but
robustness is nice.)
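The header idea was not adopted in this PR, but a minimal version is easy to sketch (hypothetical code; writeUTF length-prefixes the codec name so the reader can recover it before touching the compressed payload):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream, InputStream, OutputStream}

// Prefix the checkpoint with the codec's class name, then the compressed bytes.
def writeWithHeader(out: OutputStream, codecClass: String, payload: Array[Byte]): Unit = {
  val d = new DataOutputStream(out)
  d.writeUTF(codecClass)
  d.write(payload)
  d.flush()
}

// The reader learns which codec to instantiate from the header alone,
// independent of its own configuration.
def readCodecName(in: InputStream): String = new DataInputStream(in).readUTF()

val buf = new ByteArrayOutputStream()
writeWithHeader(buf, "spark.io.SnappyCompressionCodec", Array[Byte](1, 2, 3))
val codecName = readCodecName(new ByteArrayInputStream(buf.toByteArray))
assert(codecName == "spark.io.SnappyCompressionCodec")
```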


@rxin
Member Author

rxin commented Jul 31, 2013

Ok I removed the singleton and squashed the last two commits.

Jenkins, test this please.

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/400/

mateiz added a commit that referenced this pull request Jul 31, 2013
mateiz merged commit a386ced into mesos:master Jul 31, 2013
@mateiz
Member

mateiz commented Jul 31, 2013

Alright, merged this in. Thanks @rxin and @lyogavin .

xiajunluan pushed a commit to xiajunluan/spark that referenced this pull request May 30, 2014
…addendum

Author: witgo <witgo@qq.com>

Closes mesos#754 from witgo/commons-lang and squashes the following commits:

3ebab31 [witgo] merge master
f3b8fa2 [witgo] merge master
2083fae [witgo] repeat definition
5599cdb [witgo] multiple version of sbt  dependency
c1b66a1 [witgo] fix different versions of commons-lang dependency