
Frequently asked questions

Feel free to add new questions and to ping @Scalding for an answer.

Running Scalding

Who actually uses Scalding?

Twitter uses it in production all over the place!

Check out our Powered By page for more examples.

I'm having trouble with scald.rb, and I just want to run jars on my own system.

See this conversation on Twitter.

Can Scalding be run on Amazon's Elastic MapReduce?

Yes! See the cascading-user group discussion. We would like to see someone prepare a patch for scald.rb to handle submission of Scalding jobs to EMR.

Scalding complains when I use a TimePathedSource and some of the data is missing. How can I ignore that error?

Pass the option --tool.partialok to your job and it will ignore any missing data. It is safer, though, to work around missing data by filling in placeholder empty files or by writing sources that skip known-missing dates; relying on --tool.partialok by default is very dangerous.

I receive this error when running sbt update: Error occurred during initialization of VM. Incompatible minimum and maximum heap sizes specified

In your sbt script, set the minimum heap to half of the maximum: local min=$(( $mem / 2 )). The error means the JVM's minimum heap size (-Xms) was set larger than its maximum (-Xmx).

Writing Jobs

How do I perform windowed calculations (for example, moving average) in Scalding?

You want to use GroupBuilder.scanLeft. A scanLeft is like a foldLeft, except that it outputs each intermediate value, not just the final one. Both functions are also part of the standard Scala library. See StackOverflow for scanLeft examples, and the cascading-user group discussion for the specific case of moving averages in Scalding.
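As a minimal sketch, here is a running total per key; the pipe and the field names ('user, 'time, 'value, 'runningTotal) are made up for illustration:

// Assumes a pipe in scope with fields ('user, 'time, 'value).
val withRunningTotal = pipe.groupBy('user) {
  // Sort within each group, then emit every intermediate sum.
  _.sortBy('time)
   .scanLeft('value -> 'runningTotal)(0.0) { (sum: Double, v: Double) => sum + v }
}

Note that, as with Scala's scanLeft, the initial value is emitted too, so expect an extra row per group.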

How do I read a single reduced value from a pipe?

You can't do that directly. Instead, use RichPipe.crossWithTiny to efficiently take a cartesian product of a small set of values with a larger set. The small set might be a single output, say from pipe.groupAll { _.size }. Alternatively, you might kick off a subsequent job in Job.next and use Source.readAtSubmitter to read the value before you get going (or even in Job.next, to decide whether to kick off the next job).
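As a sketch (the pipe and field names are illustrative):

// A one-row pipe holding the total number of rows.
val total = pipe.groupAll { _.size('total) }
// Attach that single 'total value to every row of the original pipe.
val withTotal = pipe.crossWithTiny(total)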

How do I make simple records for use in my Scalding job?

We recommend case classes defined outside of your Job. Case classes defined inside your job capture an $outer member variable referencing the job, which is wasteful for serialization. If you have a use case this doesn't cover, email the cascading-user list or mention @scalding. Dealing well with serialization in systems like Hadoop is tricky, and we're still improving our approaches.

See the discussion on cascading-user.
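A minimal sketch of the recommended layout (the class and argument names are made up):

import com.twitter.scalding._

// Defined outside the Job, so no $outer reference is captured.
case class User(name: String, age: Int)

class UserJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('name, 'age)).read
    .mapTo(('name, 'age) -> 'user) { pair: (String, Int) => User(pair._1, pair._2) }
    // ... intermediate steps can now pass User values around ...
    .mapTo('user -> ('name, 'age)) { u: User => (u.name, u.age) }
    .write(Tsv(args("output")))
}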

How do I access the jobConf?

If you want to update the jobConf in your job, the way to do it is to override the config method in Job:

https://github.com/twitter/scalding/blob/develop/src/main/scala/com/twitter/scalding/Job.scala#L95
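As a hedged sketch (the exact signature of config has varied across Scalding versions, so check the line linked above for yours; the Hadoop property is just a stand-in):

import com.twitter.scalding._

class MyJob(args: Args) extends Job(args) {
  // Start from the default configuration and add or override entries.
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map("mapred.reduce.tasks" -> "10")
}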

If you just want to read from the jobConf, you can do it with code like:

implicitly[Mode] match {
  case Hdfs(_, configuration) => {
    // configuration is an org.apache.hadoop.conf.Configuration;
    // read whatever settings you need from it here
  }
  case _ => sys.error("Not running on Hadoop! (maybe Cascading local mode?)")
}

See this discussion: https://groups.google.com/forum/?fromgroups=#!topic/cascading-user/YppTLebWds8

What if I have more than 22 fields in my data-set?

Many of the examples (e.g. in the tutorial/ directory) specify the fields argument as a Scala Tuple when reading a delimited file. However, Scala Tuples are currently limited to a maximum of 22 elements. To read in a data set with more than 22 fields, you can use a List of Symbols as the fields specifier. E.g.

val mySchema = List('first, 'last, 'phone, 'age, 'country)

val input = Csv("/path/to/file.txt", separator = ",", fields = mySchema)
val output = Tsv("/path/to/out.txt")
input.read
     .project('age, 'country)
     .write(output)

Issues with Testing

How do I get my tests working with Specs2?

From Alex Dean, @alexatkeplar:

The problem was in how I was defining my tests. For Scalding, your Specs2 tests must look like this:

"A job which trys to do blah" should {
  <<RUN JOB>>
  "successfully do blah" in {
    expected.blah must_== actual.blah
  }
}

My problem was that my tests looked like this:

"A job which trys to do blah" should {
  "successfully do blah" in {
    <<RUN JOB>>
    expected.blah must_== actual.blah
  }
}

In other words, running the job was inside the in {}. For some reason, this led to multiple jobs running at the same time and conflicting with each other's output.
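Concretely, with Scalding's JobTest the working pattern looks something like this; WordCountJob and the test data are made up for illustration:

import com.twitter.scalding._
import org.specs2.mutable.Specification

class WordCountJobSpec extends Specification {
  "A job which counts words" should {
    // The job runs here, at the should level ...
    JobTest("com.example.WordCountJob")
      .arg("input", "in")
      .arg("output", "out")
      .source(TextLine("in"), List((0, "hello hello")))
      .sink[(String, Int)](Tsv("out")) { buf =>
        // ... while the assertion lives inside in {}.
        "count repeated words" in {
          buf.toMap must_== Map("hello" -> 2)
        }
      }
      .run
      .finish
  }
}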

If anyone is interested, the diff which fixed my tests is here:

https://github.com/snowplow/snowplow/commit/792ed2f9082b871ecedcf36956427a2f0935588c

Contributing code

Do you accept pull requests?

Yes! By requesting a pull, you are agreeing to license your code under the same license as Scalding.

To which branch do I make my pull request?

develop
