Skip to content

Frequently asked questions

johnynek edited this page Jul 17, 2012 · 34 revisions

Feel free to add new questions and to ping @Scalding for an answer.

Running Scalding

Can Scalding be run on Amazon's Elastic MapReduce?

Yes! See the cascading-user group discussion. We would like to see someone prepare a patch for scald.rb to handle submission of scalding jobs to EMR.

Scalding complains when I use a TimePathedSource and some of the data is missing. How can I ignore that error?

Pass the option --tool.partialok to your job and it will ignore any missing data. It's safer to work around by either filling with place-holder empty files, or writing sources that will skip known-missing dates. Using that option by default is very dangerous.

Writing Jobs

How do I perform windowed calculations (for example, moving average) in Scalding?

You want to use GroupBuilder.scanLeft. A scanLeft is like a foldLeft except that you output each intermediate value. Both of these functions are part of the standard Scala library as well. See StackOverflow for scanLeft examples. For the specific example of moving averages in Scalding, see the cascading-user group discussion.

How do I read a single reduced value from a pipe?

You can't do that. Instead you should use RichPipe.crossWithTiny to efficiently do a cartesian product of a small set of values to a larger set. The small set might be a single output, from say pipe.groupAll { _.size }. Alternatively, you might kick off a subsequent job in Job.next, and use Source.readAtSubmitter to read the value before you get going (or even in Job.next to see if you need to kick off the next job).

See the discussion on cascading-user.

Contributing code

Do you accept pull requests?

Yes! By requesting a pull, you are agreeing to license your code under the same license as Scalding.

Contents

Getting help

Documentation

Matrix API

Third Party Modules

Videos

How-tos

Tutorials

Articles

Other

Clone this wiki locally