Releases: twitter/scalding
Configure to require SUCCESS
This release is binary compatible with 0.17.3, so it should be safe to use. One behavior change is that skipping null counters is now opt-in (we regretted making it the default when shipping 0.17.3). See: #1716
Workaround null counters
This is a minor bugfix release that works around Hadoop giving us a null counter reporter. We work around it by ignoring counters. This may not be the best solution, but it unblocks some users. We don't yet know why Hadoop sometimes gives these users a null counter reporter.
See #1726
Scalding 0.17.2
Scalding 0.17.1 with 2.11 and 2.12 support
Changes of this release:
- Request for Scalding release 0.17.0 (#1641) b1bf1be
- Make ordered serialization stable across compilations (#1664) 0f95484
- Remove unnecessary semicolon (#1668) 14d3f76
- Add tailrec annotation (#1671) 693e6b7
- Be more paranoid about Kryo registration order (#1673) 0b291f2
- Update sbt version to 0.13.15 (#1677) bf0e724
- Register all Boxed classes in Kryo (#1678) c930bcd
- Fix serialization of KryoHadoop (#1685) a72fd72
- Merge pull request #1686 from ttim/cherry_pick_0.17.x_changes 9da38a1
- Fix stack overflow in typedPipeMonoid.zero (#1688) 6bcb169
- A couple of fixes into the 0.17.x branch (#1695) a6d3775
- Memory estimator changes to 0.17.x branch (#1700) 9b8ea00
Scalding 0.17.0 with 2.12 support!
This is the first Scalding release that publishes artifacts for Scala 2.12! Here are some of the changes that are part of this release:
- 2.12 related updates: #1663, #1646
- Use reflection over Jobs to find serialized classes: #1654, #1662
- Simplify match statement and use collection.breakOut: #1661
- Add explicit types to implicit methods and values: #1660
- Reducer estimation size fixes: #1652, #1650, #1645, #1644
- Use Combined*SequenceFile for VKVS, WritableSequenceFileScheme, SequenceFileScheme: #1647
- Improve Vertica support in scalding-db: #1655
- Add andThen to Mappable: #1656
- Expand libjars globs in ScaldingShell to match the behavior of Tool: #1651
- Use Batched in Sketch production: #1648
- Pick up Algebird 0.13.0: #1640
- Added API for Execution/Config to work with DistributedCache: #1635
- Bump chill version to 0.8.3: #1634
- Fixes a bug in how we use this stack: #1632
- Upgrade build to sbt 0.13.13: #1629
- Generate Scalding microsite via sbt-microsites: #1623
- FileSource support for empty directories: #1622, #1618, #1613, #1611, #1591
- Clean up temporary files created by forceToDiskExecution: #1621
- Moving the repl in wonderland to a dedicated md file: #1614
- Update Scala and sbt version: #1610
- REFACTOR: Fixed some compilation warnings: #1604
- REFACTOR: Rename parameter to reflect expectation: #1601
- Add partitioned sources for Parquet thrift / scrooge: #1590
- Add a test for sortBy: #1594
- Create COMMITTERS.md: #1589
- Use ExecutionContext in Execution.from/fromTry: #1587
- Support custom parquet field name strategies: #1580
- Deprecate reflection-based JobTest apply method: #1578
- Use Caching for FlowDefExecution: #1581
- [parquet tuple macros] listType was deprecated in favor of listOfElements: #1579
- Use Batched to speed up CMS summing on mappers: #1575
- Remove a TypedPipeFactory wrapper which seems unneeded: #1576
- Make Writeable sources Mappable to get toIterator: #1573
- case class implicit children: #1569
Scalding 0.16.0 Released!
28 Contributors to this release:
@Gabriel439, @JiJiTang, @MansurAshraf, @QuantumBear, @afsalthaj, @benpence, @danosipov, @epishkin, @gerashegalov, @ianoc, @isnotinvain, @jnievelt, @johnynek, @joshualande, @megaserg, @nevillelyh, @oeddyo, @piyushnarang, @reconditesea, @richwhitjr, @rubanm, @sid-kap, @sriramkrishnan, @stuhood, @tdyas, @tglstory, @vikasgorur, @zaneli
Release Notes
This release is a performance and correctness improvement release. The biggest improvements are to the Execution API and to OrderedSerialization.
Execution allows a reliable way to compose jobs and use scalding as a library, rather than running subclasses of Job
in a framework style. In this release we have improved the performance and added some methods for more control of Executions (such as .withNewCache for cases where caching in the whole flow is not desired).
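The composition model behind Execution can be sketched in plain Scala. This is an illustrative model only (`MiniExec` and its config type are invented for this sketch, not the scalding API): an execution is a deferred function from configuration to result, so jobs compose as pure descriptions before anything runs.

```scala
// Conceptual model of composing work before running it; the real
// com.twitter.scalding.Execution is richer (async, cached, fault-handling).
final case class MiniExec[A](run: Map[String, String] => A) {
  def map[B](f: A => B): MiniExec[B] = MiniExec(conf => f(run(conf)))
  def zip[B](that: MiniExec[B]): MiniExec[(A, B)] =
    MiniExec(conf => (run(conf), that.run(conf)))
}

val readInput  = MiniExec(conf => conf.getOrElse("input", "default"))
val countWords = readInput.map(_.split("\\s+").length)

// Nothing has executed yet; only at `run` does work happen.
val result = countWords.run(Map("input" -> "hello scalding world"))
```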
OrderedSerialization is a way to easily leverage binary comparators: comparators that act directly on serialized data, so they don't need to allocate nearly as much when the data is partitioned by key. These were discussed in a presentation at the Hadoop Summit [slides]. They are generated by macros, so most simple types (case classes, Scala collections, primitives, and recursive combinations of these) are easy to use with a single import (see this note).
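The byte-level comparison idea can be illustrated without any scalding machinery. Below is a minimal sketch under an assumed order-preserving encoding (big-endian ints with the sign bit flipped); `encode` and `byteCompare` are made up for this sketch, while the real macro-generated OrderedSerialization handles arbitrary types:

```scala
import java.nio.ByteBuffer

// Illustrative only: an order-preserving encoding for Int. Flipping the sign
// bit makes unsigned byte order agree with numeric order.
def encode(x: Int): Array[Byte] =
  ByteBuffer.allocate(4).putInt(x ^ Int.MinValue).array()

// Compare serialized keys byte-by-byte, never deserializing and never
// allocating per comparison. This is the core win of a binary comparator.
def byteCompare(a: Array[Byte], b: Array[Byte]): Int = {
  var i = 0
  while (i < a.length && i < b.length) {
    val c = (a(i) & 0xff) - (b(i) & 0xff)
    if (c != 0) return c
    i += 1
  }
  a.length - b.length
}

val cmpNeg = byteCompare(encode(-5), encode(3)) // negative: -5 < 3
```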
Here’s a list of some of the features we’ve added to Scalding in this release.
New Features
- OrderedSerialization (fast binary comparators for grouping and joining, plus macros to create them) is production ready. To use it and the other macros, [see this note](https://github.com/twitter/scalding/wiki/Automatic-Orderings,-Monoids-and-Arbitraries). Related updates: #1307, #1316, #1320, #1321, #1329, #1338, #1457
- Add TypedParquet macros for Parquet read/write support (note: this might not be ready for production use as it doesn't support schema evolution): #1303
- Naming of Executions is supported: #1334
- Add line numbers at .group and .toPipe boundaries: #1335
- Make some REPL components extensible to allow setting custom implicits and config to load at boot time: #1342
- Implement flatMapValues method: #1348
- Add NullSink, which can be used with .onComplete to drive a side-effecting (but idempotent) job: #1378
- Add monoid and semigroup for Execution: #1379
- Support nesting Options in TypeDescriptor: #1387
- Add .groupWith method to TypedPipe: #1406
- Add counter verification logic: #1409
- Scalding viz options: #1426
- Add TypedPipeChecker to assist in testing a TypedPipe: #1478
- Add withConfig API to allow running an Execution with a transformed config to override hadoop or source-level options in subsections: #1489
- Add a liftToTry function to Execution: #1499
- Utility methods for running Executions in parallel: #1507
- Add support for OrderedSerialization on sealed abstract classes: #1518
- Support for more formats to work with RichDate: #1522
Important Bug Fixes
- Add InvalidSourceTap to catch all cases with no good path: #1458
- SuccessFileSource: correctness for multi-dir globs: #1470
- Fix a serialization error we were seeing in REPL usage: #1376
- Fix lack of Externalizer in joins: #1421
- Require a DateRange's "end" to be after its "start": #1425
- Fix map-only jobs to accommodate both an lzo source and sink binary converter: #1431
- Fix bug with sketch joins and single keys: #1451
- Fix FileSystem.get issue: #1487
- Fix scrooge + OrderedSerialization for field names starting with `_`: #1534
- Add before() and after() to RichDate: #1538
Performance Improvements
- Change defaults for Scalding reducer estimator: #1333
- Add runtime-based reducer estimators: #1358
- When using WriteExecution and forceToDisk, share the same FlowDef closer in construction: #1414
- Cache the zipped-up write executions: #1415
- Cache counters for stat updates rather than doing a lookup for every increment: #1495
- Cache boxed classes: #1501
- Typed map-side reduce: #1508
- Add auto forceToDisk support to hashJoin in TypedPipe: #1529
- Fix performance bug in TypedPipeDiff: #1300
- Make sure Execution.zip fails fast: #1412
- Fix rounding bug in RatioBasedEstimator: #1542
The full change list is here.
v0.16.0-RC6
This is the candidate that we are considering for the 0.16.0 release. We will be testing this RC out internally at Twitter and if it looks good and other folks are on board, this can be promoted to 0.16.0
LzoGenericScheme/Source, Typed Parquet Tuple and Better Performance with a new Elephant-Bird API
- Typed Parquet Tuple #1198
- LzoGenericScheme and Source #1268
- Move OrderedSerialization into zero-dep scalding-serialization module #1289
- bump elephantbird to 4.8 #1292
- Fix OrderedSerialization for some forked graphs #1293
- Add serialization modules to aggregate list #1298
OrderedSerialization is work-in-progress and is not ready to be used.
Scalding 0.14.0 Released!
ExecutionApp tutorial
A new tutorial for ExecutionApp was added in #1196. You can check out ExecutionTutorial.scala for the source.
Simple HDFS local mode REPL
#1244 adds an easy-to-use useHdfsLocalMode method to the REPL for running Hadoop locally. useHdfsMode reverts the behavior.
TypedPipe conditional execution via #make
TypedPipe now exposes the make method for fallback computation/production of an optional store in an Execution. If the store already exists, the computation is skipped. Otherwise, the computation is performed and the store is created before proceeding with execution.
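The skip-if-exists logic can be sketched with plain local files. This is a hedged sketch only: `make`, the store path, and the compute function below are hypothetical stand-ins, not the actual TypedPipe API.

```scala
import java.nio.file.{Files, Path}

// Sketch of the "make" idea: if the store already exists, skip the
// computation; otherwise compute, write the store, then return the result.
def make(store: Path)(compute: () => String): String =
  if (Files.exists(store)) new String(Files.readAllBytes(store))
  else {
    val produced = compute()
    Files.write(store, produced.getBytes)
    produced
  }

val store = Files.createTempDirectory("make-demo").resolve("store.txt")
val first  = make(store)(() => "computed")                   // computation runs
val second = make(store)(() => sys.error("never evaluated")) // store exists, skipped
```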
TypedPipeDiff
#1266 adds TypedPipeDiff and helper enrichments for comparing the contents of two pipes.
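The underlying idea is a multiset comparison: count occurrences on each side and keep the rows whose counts differ. A minimal in-memory sketch (function name and shapes invented here; TypedPipeDiff does the equivalent over distributed TypedPipes):

```scala
// Compare two datasets as multisets: map each distinct row to its
// (leftCount, rightCount) pair and keep the rows where the counts differ.
def diffCounts[T](left: Seq[T], right: Seq[T]): Map[T, (Int, Int)] = {
  val l = left.groupBy(identity).map { case (k, v) => k -> v.size }
  val r = right.groupBy(identity).map { case (k, v) => k -> v.size }
  (l.keySet ++ r.keySet).iterator
    .map(k => k -> ((l.getOrElse(k, 0), r.getOrElse(k, 0))))
    .filter { case (_, (lc, rc)) => lc != rc }
    .toMap
}

val diff = diffCounts(Seq("a", "a", "b"), Seq("a", "b", "c"))
// "a" appears twice on the left but once on the right; "c" only on the right.
```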
RichPipe#skewJoinWithSmaller now works
A data bug in the Fields API method skewJoinWithSmaller was discovered and fixed. The API should now be functionally equivalent to joinWithSmaller.
See CHANGES.md for the full list of changes.
Scalding 0.13.1, the most convenient scalding we’ve ever released!
Scala 2.11 Support is here!
We’re now publishing scalding for scala 2.11! Get it while it’s hot!
Easier aggregation via the latest Algebird
Algebird now comes with some very powerful aggregators that make it easy to compose aggregations and apply them in a single pass.
For example, to find each customer's order with the max quantity, as well as the order with the min price, in a single pass:
val maxOp = maxBy[Order, Long](_.orderQuantity).andThenPresent(_.orderQuantity)
val minOp = minBy[Order, Long](_.orderPrice).andThenPresent(_.orderPrice)
TypedPipe.from(orders)
.groupBy(_.customerName)
.aggregate(maxOp.join(minOp))
For more examples and documentation see: Aggregation using Algebird Aggregators
And for a hands on walkthrough in the REPL, see Alice In Aggregator Land
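To see what the composed aggregator computes, here is the same per-customer result written as one fold per group in plain Scala (the Order fields follow the snippet above; the sample data is invented):

```scala
// Per customer, track the max orderQuantity and the min orderPrice together
// in a single pass over the group, which is what maxOp.join(minOp) achieves.
case class Order(customerName: String, orderQuantity: Long, orderPrice: Long)

val orders = Seq(
  Order("ada", 3, 10), Order("ada", 7, 4), Order("bob", 1, 9)
)

val perCustomer: Map[String, (Long, Long)] =
  orders.groupBy(_.customerName).map { case (name, os) =>
    name -> os.foldLeft((Long.MinValue, Long.MaxValue)) {
      case ((maxQ, minP), o) => (maxQ max o.orderQuantity, minP min o.orderPrice)
    }
  }
```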
Read-Eval-Print-Love
We’ve made some improvements that make day to day use of the REPL more convenient:
Easily switch between local and hdfs mode
#1113 Makes it easy to switch between local and hdfs mode in the REPL, without losing your session.
So you can iterate locally on some small data, and once that’s working, run a hadoop job on your real data, all from within the same REPL session. You can also sample some data down to fit into memory, then switch to local mode where you can really quickly get the answers you’re looking for.
For example:
$ ./sbt assembly
$ ./scripts/scald.rb --repl --hdfs --host <host to ssh to and launch jobs from>
scalding> useLocalMode()
scalding> def helper(x: Int) = (x * x) / 2
helper: (x: Int)Int
scalding> val dummyData = TypedPipe.from(Seq(10, 11, 12))
scalding> dummyData.map(helper).dump
50
60
72
scalding> useHdfsMode()
scalding> val realData = TypedPipe.from(MySource("/logs/some/real/data"))
scalding> realData.map(helper).dump
Easily save TypedPipes of case classes to disk
#1129 Lets you save any TypedPipe to disk from the REPL, regardless of format, so you can load it back up again later from another session. This is useful for saving an intermediate TypedPipe[MyCaseClass] without figuring out how to map it to a TSV or some other format. This works by serializing the objects to json behind the scenes.
For example:
$ ./scripts/scald.rb --json --repl --local
scalding> import com.twitter.scalding.TypedJson
import com.twitter.scalding.TypedJson
scalding> case class Bio(text: String, language: String)
defined class Bio
scalding> case class User(id: Long, bio: Bio)
defined class User
// in a real use case, getUsers might load a few sources, do some projections + joins, and then return
// a TypedPipe[User]
scalding> def getUsers() = TypedPipe.from(Seq( User(7, Bio("hello", "en")), User(8, Bio("hola", "es")) ))
getUsers: ()com.twitter.scalding.typed.TypedPipe[User]
scalding> getUsers().filter(_.bio.language == "en").save(TypedJson("/tmp/en-users"))
res0: com.twitter.scalding.TypedPipe[User] = com.twitter.scalding.typed.TypedPipeFactory@7cccf31c
scalding> exit
$ cat /tmp/en-users
{"id":7,"bio":{"text":"hello","language":"en"}}
$ ./scripts/scald.rb --json --repl --local
scalding> import com.twitter.scalding.TypedJson
import com.twitter.scalding.TypedJson
scalding> case class Bio(text: String, language: String)
defined class Bio
scalding> case class User(id: Long, bio: Bio)
defined class User
scalding> val filteredUsers = TypedPipe.from(TypedJson[User]("/tmp/en-users"))
filteredUsers: com.twitter.scalding.typed.TypedPipe[User] = com.twitter.scalding.typed.TypedPipeFactory@44bb1922
scalding> filteredUsers.dump
User(7,Bio(hello,en))
ValuePipe.dump
#1157 adds dump to ValuePipe, so now you can print the contents not only of TypedPipes but of ValuePipes as well (see above for examples of using dump in the REPL).
Execution Improvements
The scaladoc for Execution is complete, but some additional exposition was added to the wiki: Calling Scalding from inside your application. We added two helper methods to object Execution: Execution.failed creates an Execution from a Throwable (like Future.failed), and Execution.unit creates a successful Execution[Unit], which is handy in some branching loops.
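Since the text draws the analogy to Future, here is the corresponding Future behavior: Execution.failed plays the role of Future.failed, and Execution.unit plays the role of a pre-completed Future.successful(()).

```scala
import scala.concurrent.Future
import scala.util.{Failure, Success}

// Future.failed gives an already-failed value to feed into composition,
// just as Execution.failed does for Execution.
val failed: Future[Int] = Future.failed(new IllegalStateException("boom"))

// Future.successful(()) is an already-completed no-op, analogous to
// Execution.unit as a neutral element in branching loops.
val unit: Future[Unit] = Future.successful(())

val failedMsg = failed.value match {
  case Some(Failure(e)) => e.getMessage
  case other            => s"unexpected: $other"
}
```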
Bugfixes
The final bugs were finally removed from scalding*, including #1190, a bug that affected the hashCode of Args instances, and issue #1184, which made Stats unreliable for some users.
*some humor is used in scalding notes.
See CHANGES.md for a full change log.
Thanks to @avibryant, @danielhfrank, @DanielleSucher, @miguno, and the rest of the Algebird contributors for the new aggregations, as well as all the scalding contributors.