IndexedRDD for accelerated joins #848
Conversation
There appears to be an issue with subtract by key tests that needs to be investigated further.
Thank you for your pull request. An admin will review this request soon.
This is still a prototype but I wanted to open the discussion on the design so I can incorporate feedback early.
index.cache
valuesRDD.cache
return this
}
You should be overriding `persist()` instead of `cache()` here to work with all storage levels.
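A minimal sketch of that suggestion, assuming the `index` and `valuesRDD` members shown in the diff above and that `StorageLevel` is imported (this is an illustration, not the PR's code):

```scala
// Sketch: honor whichever StorageLevel the caller requests, rather than
// hard-coding the MEMORY_ONLY level implied by cache().
override def persist(newLevel: StorageLevel): IndexedRDD[K, V] = {
  index.persist(newLevel)      // persist the shared index at the requested level
  valuesRDD.persist(newLevel)  // persist the values at the same level
  this
}

// cache() can then simply delegate to persist().
override def cache(): IndexedRDD[K, V] = persist(StorageLevel.MEMORY_ONLY)
```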
I have not looked into this in great detail, but shouldn't cogroup, etc. take the partitioning into account?

Regards
I have made some organizational changes based on suggestions from @rxin and @mateiz. To RDD I added:

```scala
def pairRDDFunctions[K, V](
    implicit t: T <:< (K, V), k: ClassManifest[K], v: ClassManifest[V]):
  PairRDDFunctions[K, V] = {
  new PairRDDFunctions(this.asInstanceOf[RDD[(K,V)]])
}
```

This interesting piece of code returns a PairRDDFunctions wrapper for this RDD. By delegating the construction of the PairRDDFunctions object to the RDD class, specializations of RDD can return different implementations of PairRDDFunctions. For example, the new IndexedRDD class overrides this function to return:

```scala
override def pairRDDFunctions[K1, V1](
    implicit t: (K, V) <:< (K1,V1), k: ClassManifest[K1], v: ClassManifest[V1]):
  PairRDDFunctions[K1, V1] = {
  new IndexedRDDFunctions[K1,V1](this.asInstanceOf[IndexedRDD[K1,V1]])
}
```

I updated the implicit construction of PairRDDFunctions to call this new method. I also modified PairRDDFunctions to provide a ClassManifest for intermediate types. This is needed for some PairRDD implementations that are array backed (i.e., IndexedRDD).
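A hypothetical illustration of the dispatch described above (the conversion name and exact signature are assumptions, not verbatim from this PR):

```scala
// Sketch: the implicit conversion asks the RDD itself for its
// PairRDDFunctions, so a subclass such as IndexedRDD can substitute its own
// specialized implementation without callers changing their code.
implicit def rddToPairRDDFunctions[K: ClassManifest, V: ClassManifest](
    rdd: RDD[(K, V)]): PairRDDFunctions[K, V] =
  rdd.pairRDDFunctions

// plainRDD.join(other)   -> default PairRDDFunctions.join (shuffle-based)
// indexedRDD.join(other) -> IndexedRDDFunctions.join (index-aware)
```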
In response to @mridulm: the IndexedRDD exploits the partitioning of the index. If both RDDs are IndexedRDDs and share the same index, then they must be partitioned identically and even arranged identically within each partition, enabling cogroup to be achieved using zip. If the two RDDs are indexed with different indexes, then the standard shuffle logic is applied, though the index is used to help pre-construct the hash maps.
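Roughly, the shared-index fast path amounts to something like the following standalone sketch (not the PR's IndexedRDDFunctions code; it assumes each partition stores grouped values in index order and that `RDD` is imported):

```scala
// Sketch: when two RDDs are partitioned identically and their records are
// arranged identically within each partition, cogroup reduces to a
// per-partition zip and no shuffle is needed.
def zipCogroup[K: ClassManifest, V: ClassManifest, W: ClassManifest](
    a: RDD[(K, Seq[V])], b: RDD[(K, Seq[W])]): RDD[(K, (Seq[V], Seq[W]))] =
  a.zipPartitions(b) { (aIter, bIter) =>
    // Positions correspond because both sides follow the same index order.
    aIter.zip(bIter).map { case ((k, vs), (_, ws)) => (k, (vs, ws)) }
  }
```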
To give an example of what I was referring to: take a look at cogroup in IndexedRDDFunctions. The `self` and `other` RDDs need not be partitioned by the same partitioner, so couldn't direct index (value) comparisons in the case match result in incorrectness? Unfortunately, I do not have time to go over this PR, so I will defer to someone else to comment on this further.
Hey Joey, this looks nice. Do you happen to have some microbenchmarks for the performance gains of indexed versus non-indexed cogroup? Also, you mentioned the applications to graph algorithms. What other operations on RDDs do you think could benefit from having an index in this context?
Unfortunately, I don't have any good benchmarks to evaluate joins, so I am in the process of creating some. Does anyone have suggestions?
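One possible microbenchmark along these lines (a sketch only; the data sizes are arbitrary, `sc` is an existing SparkContext, and `indexed()` is the method this PR adds to PairRDDFunctions):

```scala
// Sketch: join a "parameter" table against a larger "data" table, comparing
// an unindexed join with a join through an IndexedRDD.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(label + ": " + (System.nanoTime() - start) / 1e6 + " ms")
  result
}

val params = sc.parallelize(0 until 1000000).map(i => (i, i.toDouble)).cache()
val data   = sc.parallelize(0 until 5000000).map(i => (i % 1000000, 1.0)).cache()

time("plain join")   { params.join(data).count() }

val indexedParams = params.indexed()   // assumed: the method added by this PR
time("indexed join") { indexedParams.join(data).count() }
```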
In many applications (especially graph computation and machine learning) we are iteratively joining model parameters (vertices) with data (edges). In these cases it can be beneficial to pre-organize the records within each partition to share a similar structure.
Logically the `IndexedRDD[K,V]` extends the `RDD[(K,V)]` and provides the same functionality. An IndexedRDD is constructed by using the `PairRDDFunctions.indexed` method. The resulting `tbl.index` can then be applied to other RDDs that share the same key set. By sharing the same index across multiple RDDs, we guarantee that the values within each RDD are organized identically, enabling fast join operations. Furthermore, because the index is pre-materialized (and cached), we can use it to accelerate join operations with an `RDD[(K,V)]` that has not been indexed.
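A hedged usage sketch of the flow described above (the exact signatures of `indexed`, `index`, and `join` are assumptions based on this description; `sc` is an existing SparkContext):

```scala
// Two pair RDDs over the same key set.
val weights = sc.parallelize((0 until 1000).map(i => (i, i * 0.5)))
val ranks   = sc.parallelize((0 until 1000).map(i => (i, 1.0)))

// Build an IndexedRDD with the indexed() method added to PairRDDFunctions.
val tbl = weights.indexed()

// Reuse tbl's pre-materialized index for an RDD with the same key set, so
// both tables share partitioning and within-partition layout.
val indexedRanks = ranks.indexed(tbl.index)

// Joins between tables that share an index avoid the shuffle entirely; a
// join against a plain RDD[(K, V)] can still use the cached index.
val fastJoin  = tbl.join(indexedRanks)
val mixedJoin = tbl.join(ranks)
```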