Traverse reduce-side "Iterables" more than once #91

Open
blever opened this issue Jun 6, 2012 · 3 comments

@blever
Contributor

blever commented Jun 6, 2012

An Iterable value that Scoobi user code operates over may be the "values" Iterable of a Hadoop reduce method. There are cases demonstrating that such an Iterable can only be traversed once from Scoobi user code: a second traversal yields an empty Iterable.

Need to investigate whether the Iterable provided by Hadoop's reduce can only be traversed once. If not, fix whatever Scoobi is doing to ensure user code can also traverse it more than once.
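
To make the symptom concrete, here is a minimal sketch that does not use Scoobi or Hadoop at all. It assumes the reduce-side values behave like an Iterable backed by one shared, single-pass iterator; `SinglePass` is a hypothetical stand-in written for this illustration, not a class from either library.

```scala
// Hypothetical stand-in for a reduce-side values Iterable: every call to
// `iterator` hands back the same underlying single-pass iterator.
final class SinglePass[A](underlying: Iterator[A]) extends Iterable[A] {
  def iterator: Iterator[A] = underlying
}

object SinglePassDemo {
  def main(args: Array[String]): Unit = {
    val vs = new SinglePass(Iterator(1, 2, 3))
    val first  = vs.foldLeft(0)(_ + _) // consumes the shared iterator
    val second = vs.foldLeft(0)(_ + _) // nothing is left to visit
    println(first)  // 6
    println(second) // 0
  }
}
```

The second fold silently returns its zero value, which matches the "empty Iterable" behaviour described above.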

@ghost ghost assigned espringe Jun 6, 2012
@etorreborre
Collaborator

This case would show the issue:

val xs: DList[(Int, Int)] = ...
xs.groupByKey.map { case (_, vs) => vs.sum / vs.size }

While a fix might not be trivial, we can at least try to throw an exception when we detect this situation (when the iterable is used twice).
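
A rough sketch of what such a check could look like, assuming the values can be wrapped before they reach user code. `TraverseOnce` is a hypothetical helper made up for this example and is not part of Scoobi.

```scala
// Hypothetical guard, not Scoobi code: wraps a single-pass iterator so that a
// second request for an iterator fails loudly instead of yielding nothing.
final class TraverseOnce[A](underlying: Iterator[A]) extends Iterable[A] {
  private var consumed = false

  def iterator: Iterator[A] = {
    if (consumed)
      throw new IllegalStateException("reduce-side values can only be traversed once")
    consumed = true
    underlying
  }
}

object TraverseOnceDemo {
  def main(args: Array[String]): Unit = {
    val vs = new TraverseOnce(Iterator(1, 2, 3))
    println(vs.foldLeft(0)(_ + _)) // first traversal succeeds
    try println(vs.foldLeft(0)(_ + _)) // second traversal is rejected
    catch { case e: IllegalStateException => println("caught: " + e.getMessage) }
  }
}
```

One caveat with this approach: some collection methods may call `iterator` more than once internally, depending on the Scala version, so a guard like this could reject code that looks like a single traversal.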

@ghost ghost assigned tonymorris Dec 8, 2012
@tonymorris
Contributor

I have spent considerable time on this issue. I cannot find any meaningful improvement in the short-term. The only possible improvement (that I can imagine) would require a significant alteration to the existing API and considerable code refactoring.

Some example improvements would be:

  1. Instead of passing an Iterable to reduce, pass a left-fold interface. This means that users accept responsibility for the consequences of the side effects on the iterable, whatever their implementation is. However, it is not completely general and so would exclude some valid use-cases that currently exist. It would also require a small, backward-incompatible alteration to the reduce API.
  2. Use iteratees, whereby the user passes a function describing "what to do" as each element is visited. This supports more valid use-cases; however, it would require significant API changes and code refactoring. It also comes with a small performance penalty and requires trampolining[1].
  3. Use scala-machines[2]. This is the ideal solution, however, it would require significant effort to implement, along with alterations to the existing API and major code refactoring.
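
As an illustration of option 1 only, here is a minimal sketch of what a left-fold interface might look like. The `LeftFold` trait and its `fromIterator` adapter are hypothetical names invented for this example; they are not the existing Scoobi API or a proposed final design.

```scala
// Hypothetical interface, not Scoobi's API: the reducer receives a fold
// capability instead of an Iterable, so each traversal is an explicit call.
trait LeftFold[A] {
  def foldLeft[B](zero: B)(f: (B, A) => B): B
}

object LeftFold {
  // Adapter over a single-pass iterator, standing in for the reduce values.
  def fromIterator[A](it: Iterator[A]): LeftFold[A] = new LeftFold[A] {
    def foldLeft[B](zero: B)(f: (B, A) => B): B = it.foldLeft(zero)(f)
  }
}

object LeftFoldDemo {
  // Integer average in one pass: accumulate the sum and the count together.
  def average(vs: LeftFold[Int]): Int = {
    val (sum, count) = vs.foldLeft((0, 0)) { case ((s, c), v) => (s + v, c + 1) }
    if (count == 0) 0 else sum / count
  }

  def main(args: Array[String]): Unit =
    println(average(LeftFold.fromIterator(Iterator(2, 4, 9)))) // 15 / 3 = 5
}
```

With this shape, the `vs.sum / vs.size` case has to be written as one fold. Nothing stops a caller from invoking `foldLeft` twice on the adapter, which is the sense in which users accept responsibility for the side effects.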

All other apparent improvements result in either meaninglessness (they do not provide any safety benefit) or bugs (improper operation of the scoobi library). This may be a limit of my imagination, but I believe I have exhausted this pursuit to the extent of my ability.

[1] Stackless Scala With Free Monads, Rúnar Óli Bjarnason, The Third Scala Workshop, London, Apr 17th 2012.
[2] https://github.com/runarorama/scala-machines/

@blever
Contributor Author

blever commented Jan 25, 2013

Moving out to 0.8 for now - a solution may sneak back into 0.7.
