Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DList#groupBy and DList#groupWith cannot co-exist #221

Open
tonymorris opened this issue Mar 19, 2013 · 3 comments
Open

DList#groupBy and DList#groupWith cannot co-exist #221

tonymorris opened this issue Mar 19, 2013 · 3 comments
Labels
Milestone

Comments

@tonymorris
Copy link
Contributor

Both DList#groupBy and DList#groupWith differ only in that one of them abuses implicit resolution. One of them must be deleted.

A quick-fix is to delete DList#groupWith where the abuse is manifested, but there may be a more significant improvement available by resolving the underlying tension that compels such an abuse.

@espringe
Copy link
Contributor

Imho Grouping has been, and is, really useful, but perhaps it should be retired completely. It seems like it tries to do too much (e.g. (secondary) sorting), while not being able to do enough (it's can't use any information from the values, unlike hadoop's Partitioner) ..and its a little awkward to use (you just hope that keys/values in the subsequent parallel do obey your constraints.

So my proposal would be to just jack hadoops interface verbatim. (And use implicits to provide one for all types)

trait Partioner[K, V] {
     def partition(key: K, value: V, totalPartitions: Int): Int = mod(key.hashCode(), totalPartitions)
} 

And DList[(K,V)].groupByKey require a Ordering[K] along with a Partioner[K, V]. We could provide an implicit that provides a Partioner[K,V] for all K's and V's. Which would quite elegantly cover the basic uses.

Then to support the crazy-ass stuff, I think we could just have a single function like:

DList[(K,V)] { 
     def crazyDo[B](  // none of these implicit, since you really want to manually specify them
             partioner: Partioner[K,V],
             keyOrder:  Ordering[K], // order the keys are presented to the reducer
             grouping: KeyCollapser[K], // more on this below
     ): GroupedDList[K, V]
}

Where KeyCollapser might be defined as:
trait KeyCollapser[K]{ def shouldCollapse[K](a: K, b: K): Boolean }

which conceptually is the function that slides over the list of Keys, after they have gone to a partition, and after hadoop has sorted them -- decided if they should be compressed into one or not.

(Although, it might be safer to not go with a KeyCollapser, and just require another Ordering).

And GroupedList[K, V] would be a special kind of a DList that only accepts a doFn (or something) where it guarantees that it'll be in the sorted/grouped fasion

Thoughts?

@espringe
Copy link
Contributor

And I'd assume the way I propose could result in performance advantages -- as we would only have to tell hadoop to sort the keys if it is actually needed

@blever
Copy link
Contributor

blever commented Apr 8, 2013

@tonymorris - can you add a comment discussing your similar experience with scalaz, e.g. 'add' vs 'multiply' monoids.

@ghost ghost assigned tonymorris Apr 8, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants