DList#groupBy and DList#groupWith cannot co-exist #221

tonymorris · 2013-03-19T02:20:07Z

Both DList#groupBy and DList#groupWith differ only in that one of them abuses implicit resolution. One of them must be deleted.

A quick-fix is to delete DList#groupWith where the abuse is manifested, but there may be a more significant improvement available by resolving the underlying tension that compels such an abuse.

The text was updated successfully, but these errors were encountered:

espringe · 2013-03-19T16:24:11Z

Imho Grouping has been, and is, really useful, but perhaps it should be retired completely. It seems like it tries to do too much (e.g. (secondary) sorting), while not being able to do enough (it's can't use any information from the values, unlike hadoop's Partitioner) ..and its a little awkward to use (you just hope that keys/values in the subsequent parallel do obey your constraints.

So my proposal would be to just jack hadoops interface verbatim. (And use implicits to provide one for all types)

trait Partioner[K, V] {
     def partition(key: K, value: V, totalPartitions: Int): Int = mod(key.hashCode(), totalPartitions)
}

And DList[(K,V)].groupByKey require a Ordering[K] along with a Partioner[K, V]. We could provide an implicit that provides a Partioner[K,V] for all K's and V's. Which would quite elegantly cover the basic uses.

Then to support the crazy-ass stuff, I think we could just have a single function like:

DList[(K,V)] { 
     def crazyDo[B](  // none of these implicit, since you really want to manually specify them
             partioner: Partioner[K,V],
             keyOrder:  Ordering[K], // order the keys are presented to the reducer
             grouping: KeyCollapser[K], // more on this below
     ): GroupedDList[K, V]
}

Where KeyCollapser might be defined as:
trait KeyCollapser[K]{ def shouldCollapse[K](a: K, b: K): Boolean }

which conceptually is the function that slides over the list of Keys, after they have gone to a partition, and after hadoop has sorted them -- decided if they should be compressed into one or not.

(Although, it might be safer to not go with a KeyCollapser, and just require another Ordering).

And GroupedList[K, V] would be a special kind of a DList that only accepts a doFn (or something) where it guarantees that it'll be in the sorted/grouped fasion

Thoughts?

espringe · 2013-03-19T17:24:10Z

And I'd assume the way I propose could result in performance advantages -- as we would only have to tell hadoop to sort the keys if it is actually needed

blever · 2013-04-08T05:22:51Z

@tonymorris - can you add a comment discussing your similar experience with scalaz, e.g. 'add' vs 'multiply' monoids.

ghost assigned tonymorris Apr 8, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DList#groupBy and DList#groupWith cannot co-exist #221

DList#groupBy and DList#groupWith cannot co-exist #221

tonymorris commented Mar 19, 2013

espringe commented Mar 19, 2013

espringe commented Mar 19, 2013

blever commented Apr 8, 2013

DList#groupBy and DList#groupWith cannot co-exist #221

DList#groupBy and DList#groupWith cannot co-exist #221

Comments

tonymorris commented Mar 19, 2013

espringe commented Mar 19, 2013

espringe commented Mar 19, 2013

blever commented Apr 8, 2013