Calling Scalding from inside your application
Starting in scalding 0.12, there is a clear API for doing this. See Execution[T], which describes a set of map/reduce operations that, when executed, return a Future[T]. Below is an example.
import scala.util.Try
import com.twitter.scalding._
import com.twitter.scalding.typed.TypedPipe
import org.apache.hadoop.mapred.JobConf

val job: Execution[Unit] =
  TypedPipe.from(TextLine("input"))
    .flatMap(_.split("\\s+"))
    .map { word => (word, 1L) }
    .sumByKey
    .writeExecution(TypedTsv("output"))

// Now we run it in Local mode; waitFor blocks and returns a scala.util.Try
val localResult: Try[Unit] = job.waitFor(Config.default, Local(true))

// Or for Hadoop (note Hdfs takes the strict flag first, then the Configuration):
val jobConf = new JobConf
val hadoopResult: Try[Unit] = job.waitFor(Config.hadoopWithDefaults(jobConf), Hdfs(strict = true, jobConf))

// If you want to be asynchronous, use run instead of waitFor and get a Future in return
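For instance, here is a minimal sketch of the asynchronous variant; it assumes the usual implicit scala.concurrent.ExecutionContext in scope, which run requires:

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

// run submits the work and returns immediately with a Future
val f: Future[Unit] = job.run(Config.default, Local(true))
f.onComplete { result => println("word count finished: " + result) }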
For testing, or for cases where you aggregate data down to a manageable level, .toIterableExecution on TypedPipe is very useful:
val job: Execution[Iterable[(String, Long)]] =
  TypedPipe.from(TextLine("input"))
    .flatMap(_.split("\\s+"))
    .map { word => (word, 1L) }
    .sumByKey
    .toIterableExecution

// Now we run it in Local mode; .get rethrows if the job failed
val counts: Map[String, Long] = job.waitFor(Config.default, Local(true)).get.toMap
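Executions also compose before anything runs: zip pairs two independent Executions into one, and flatMap (or a for-comprehension) sequences them. A minimal sketch, where writeJob and countJob are hypothetical names for the two Execution values defined above:

// Run both in a single execution and pair their results
val both: Execution[(Unit, Iterable[(String, Long)])] = writeJob.zip(countJob)

// Or sequence them: write first, then materialize the counts
val sequenced: Execution[Iterable[(String, Long)]] =
  for {
    _ <- writeJob
    counts <- countJob
  } yield counts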
We recommend the above approach to build composable jobs with Executions. But if you have an existing Job, you can also run that:
WordCountJob.scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
Runner.scala
import com.twitter.scalding._
import org.apache.hadoop.conf.Configuration

object Runner extends App {
  val hadoopConfiguration: Configuration = new Configuration
  hadoopConfiguration.set("mapred.job.tracker", "hadoop-master:8021")
  hadoopConfiguration.set("fs.defaultFS", "hdfs://hadoop-master:8020")

  val hdfsMode = Hdfs(strict = true, hadoopConfiguration)
  val arguments = Mode.putMode(hdfsMode, Args("--input in.txt --output counts.tsv"))

  // Now create the job after the mode is set up properly.
  val job: WordCountJob = new WordCountJob(arguments)
  val flow = job.buildFlow
  flow.complete() // blocks until the Cascading flow finishes
}
You can then run this app on any machine that has access to the Hadoop cluster.
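If you would rather take the paths from the command line than hard-code them, Args can also be built from the argument array that scala.App provides; RunnerFromCli is a hypothetical variant of the Runner above:

object RunnerFromCli extends App {
  val hadoopConfiguration = new Configuration
  hadoopConfiguration.set("fs.defaultFS", "hdfs://hadoop-master:8020")

  // args comes from scala.App, e.g. --input in.txt --output counts.tsv
  val arguments = Mode.putMode(Hdfs(strict = true, hadoopConfiguration), Args(args.toSeq))

  new WordCountJob(arguments).buildFlow.complete()
}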