-
Notifications
You must be signed in to change notification settings - Fork 706
Calling Scalding from inside your application
P. Oscar Boykin edited this page Jul 7, 2014
·
13 revisions
Starting in scalding 0.11, there is a clear API for doing this:
// Running in local mode:
import ExecutionContext._ // needed to get FlowDef and Mode implicitly from ExecutionContext
val (result, tryJobStats) = Execution.waitFor(Config.default, Local(true)) { implicit ec: ExecutionContext =>
TypedPipe.from(TextLine("input"))
.flatMap(_.split("\\s+"))
.map { word => (word, 1L) }
.sumByKey
.write(TypedTsv("output"))
}
// Or for Hadoop:
val jobConf = new JobConf
val (result, tryJobStats) = Execution.run(Config.hadoopWithDefaults(jobConf), Hdfs(jobConf, true)) { implicit ec: ExecutionContext =>
TypedPipe.from(TextLine("input"))
.flatMap(_.split("\\s+"))
.map { word => (word, 1L) }
.sumByKey
.write(TypedTsv("output"))
}
// If you want to be asynchronous, use run instead of waitFor and get a Future in return
You can also just run existing Jobs in your own code. For instance:
WordCountJob.scala
class WordCountJob(args: Args) extends Job(args) {
TextLine(args("input"))
.read
.flatMap('line -> 'word) { line: String => line.split("\\s+") }
.groupBy('word) { _.size }
.write(Tsv(args("output")))
}
Runner.scala
object Runner extends App {
val hadoopConfiguration: Configuration = new Configuration
hadoopConfiguration.set("mapred.job.tracker","hadoop-master:8021")
hadoopConfiguration.set("fs.defaultFS","hdfs://hadoop-master:8020")
val hdfsMode = Hdfs(strict = true, hadoopConfiguration)
val arguments = Mode.putMode(hdfsMode, Args("--input in.txt --output counts.tsv"))
// Now create the job after the mode is set up properly.
val job: WordCountJob = new WordCountJob(arguments)
val flow = job.buildFlow
flow.complete()
}
And then you can run your App on any server, that have access to Hadoop cluster
- Scaladocs
- Getting Started
- Type-safe API Reference
- SQL to Scalding
- Building Bigger Platforms With Scalding
- Scalding Sources
- Scalding-Commons
- Rosetta Code
- Fields-based API Reference (deprecated)
- Scalding: Powerful & Concise MapReduce Programming
- Scalding lecture for UC Berkeley's Analyzing Big Data with Twitter class
- Scalding REPL with Eclipse Scala Worksheets
- Scalding with CDH3U2 in a Maven project
- Running your Scalding jobs in Eclipse
- Running your Scalding jobs in IDEA intellij
- Running Scalding jobs on EMR
- Running Scalding with HBase support: Scalding HBase wiki
- Using the distributed cache
- Unit Testing Scalding Jobs
- TDD for Scalding
- Using counters
- Scalding for the impatient
- Movie Recommendations and more in MapReduce and Scalding
- Generating Recommendations with MapReduce and Scalding
- Poker collusion detection with Mahout and Scalding
- Portfolio Management in Scalding
- Find the Fastest Growing County in US, 1969-2011, using Scalding
- Mod-4 matrix arithmetic with Scalding and Algebird
- Dean Wampler's Scalding Workshop
- Typesafe's Activator for Scalding