Frequently asked questions
Feel free to add new questions and to ping @Scalding for an answer.
Twitter uses it in production all over the place!
Check out our Powered By page for more examples.
See this conversation on Twitter.
Yes! See the cascading-user group discussion. We would like to see someone prepare a patch for scald.rb to handle submission of scalding jobs to EMR.
Scalding complains when I use a TimePathedSource and some of the data is missing. How can I ignore that error?
Pass the option --tool.partialok to your job and it will ignore any missing data. It is safer to work around the problem, either by filling the gaps with empty placeholder files or by writing sources that skip known-missing dates. Using that option by default is very dangerous.
I receive this error when running sbt update: Error occurred during initialization of VM. Incompatible minimum and maximum heap sizes specified
In your sbt script, set local min=$(( $mem / 2 ))
You want to use GroupBuilder.scanLeft. A scanLeft is like a foldLeft except that you output each intermediate value. Both of these functions are part of the standard Scala library as well. See StackOverflow for scanLeft examples. For the specific example of moving averages in Scalding, see the cascading-user group discussion.
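To make the difference concrete, here is a minimal sketch using only the standard Scala collections, not Scalding's GroupBuilder. The object name ScanLeftDemo and the helper runningMean are made up for illustration; the helper computes a cumulative mean, which is the simplest relative of a moving average.

```scala
object ScanLeftDemo {
  val xs = List(1, 2, 3, 4)

  // foldLeft keeps only the final accumulated value.
  val total: Int = xs.foldLeft(0)(_ + _)

  // scanLeft emits the seed and then every intermediate accumulator.
  val running: List[Int] = xs.scanLeft(0)(_ + _)

  // Hypothetical helper: cumulative mean of the values seen so far.
  // The accumulator carries (sum, count); .tail drops the seed (0.0, 0).
  def runningMean(values: Seq[Double]): Seq[Double] =
    values
      .scanLeft((0.0, 0)) { case ((sum, n), v) => (sum + v, n + 1) }
      .tail
      .map { case (sum, n) => sum / n }
}
```

Here total is 10 while running is List(0, 1, 3, 6, 10), and runningMean(Seq(2.0, 4.0, 9.0)) yields 2.0, 3.0, 5.0. A windowed average would need an accumulator that also remembers the last few values.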
You can't do that. Instead, you should use RichPipe.crossWithTiny to efficiently compute the cartesian product of a small set of values with a larger set. The small set might be a single output, say from pipe.groupAll { _.size }. Alternatively, you might kick off a subsequent job in Job.next and use Source.readAtSubmitter to read the value before you get going (or even in Job.next itself, to see if you need to kick off the next job).
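The crossWithTiny approach might look like the sketch below, written against the Fields-based API. The job name FractionJob, the field names 'id, 'score and 'fraction, and the --input and --output arguments are all assumptions made for illustration, and the exact signatures may differ between Scalding versions.

```scala
import com.twitter.scalding._

// Hypothetical job: divide each score by the total number of rows.
class FractionJob(args: Args) extends Job(args) {
  val scores = Tsv(args("input"), ('id, 'score)).read

  // groupAll collapses everything into one group; size emits a single
  // row with a 'size field, so this pipe is tiny.
  val rowCount = scores.groupAll { _.size }

  scores
    // Attach the single 'size value to every row of the large pipe.
    .crossWithTiny(rowCount)
    .map(('score, 'size) -> 'fraction) { in: (Double, Long) =>
      in._1 / in._2
    }
    .write(Tsv(args("output")))
}
```

The single-row pipe is crossed onto every row of the larger pipe, so the aggregate is available as an ordinary field without a second job.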
We recommend case classes defined outside of your Job. A case class defined inside your job captures an $outer member variable that references the job, so the whole job gets pulled in whenever an instance is serialized, which is wasteful. If you have a use case this doesn't cover, email the cascading-user list or mention @scalding. Dealing with serialization issues well in systems like Hadoop is tricky, and we're still improving our approaches.
See the discussion on cascading-user.
If you want to update the jobConf in your job, the way to do it is to override the config method in Job:
https://github.com/twitter/scalding/blob/develop/src/main/scala/com/twitter/scalding/Job.scala#L95
If you really want to just read from the jobConf, you can do it with code like:
// Requires an implicit Mode in scope, as there is inside a Job.
implicitly[Mode] match {
  case Hdfs(_, configuration) => {
    // use the configuration which is an instance of Configuration
  }
  case _ => sys.error("Not running on Hadoop! (maybe cascading local mode?)")
}
See this discussion: https://groups.google.com/forum/?fromgroups=#!topic/cascading-user/YppTLebWds8
Yes! By requesting a pull, you are agreeing to license your code under the same license as Scalding.