Overhead when exporting to PSOCK cluster #98
A very related issue: mschubert/clustermq#47
Thanks for this, and thanks for spending all the time digging through the details and thinking about this problem. And, yes, as you discovered in PR #99, this is a much harder issue than it first appears to be. We are aware of it, and it's on the to-do list to see if it can be improved further. It might be that static-code analysis of the future expression can help here.

I'm sure the documentation can be improved to raise awareness, e.g. a vignette with examples of the problem, and best-practice suggestions on how to minimize it, for example by removing large objects that the parallelized function does not need from the calling environment.

FWIW, here's some related work that will help move this forward: I've started to work on a profiling framework for futures (part of the roadmap) that will allow us to study what is going on under the hood, e.g. timing profiles for when futures are created, globals are exported, future expressions are evaluated and finished, results are collected, and so on. This will help the user to better understand, troubleshoot, and work around problems like this. It will also help me identify bottlenecks and benchmark the improvements made.

Another related milestone on the roadmap is the future.mapreduce package, which is meant to provide future.apply, furrr, and doFuture (and other similar projects) with a common, core map-reduce framework. This means that any improvements made will benefit all those packages. Similarly, user feedback from one of them is likely to be beneficial for the other packages.
@HenrikBengtsson @odelmarcelle I think I just experienced the same situation, except I did not compare the
@odelmarcelle I tried your following line of code before putting it in the
@odelmarcelle, your troubleshooting looks spot on. Only one thing to comment on:
This is relevant to all operating systems (e.g. see the example below, executed on Linux). It's less of a problem when using forked parallel processing. This is a known problem that is general to all parallel frameworks in R, not just the future ecosystem. For example:

```r
my_fcn <- function(cl, cargo = 0L) {
  huge_object_also_exported <- rnorm(cargo)
  parallel::parLapply(cl, X = 1L, fun = function(x) x)
}

cl <- parallel::makeCluster(1L)
trace(parallel:::postNode,
      tracer = quote(message(sprintf("Size of exported data: %d bytes",
                                     lobstr::obj_size(value)))),
      print = FALSE)

y <- my_fcn(cl = cl, cargo = 0L)
#> Size of exported data: 9704 bytes

y <- my_fcn(cl = cl, cargo = 1e6L)
#> Size of exported data: 8010880 bytes
```

It's rather complicated to fix this problem automatically. Basically, we need to figure out how to prune the environment of the function so that it does not carry the extra cargo of objects in the local environment that are not necessary for evaluating the function in another, external process. It's easy to tweak for a few simple toy examples, but making it work in general is much more complicated. Also, solving it is unfortunately in a bit of conflict with other objectives (e.g. the ones outlined in futureverse/future#608). I am working towards something that improves on the current state, but it's not easy, and if not done extremely carefully, it's very easy to break something else. Thankfully, I've got tons of tests in place that somewhat protect against that.
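To make the "extra cargo" concrete, here is a minimal sketch (not from the original thread) showing that serializing a closure also serializes the objects captured in its enclosing environment, whether or not the function uses them:

```r
## A function defined at top level has globalenv() as its environment,
## which is serialized as a tiny reference.
f_plain <- function(x) x

## A function created inside local() captures everything defined there,
## including 'big', even though it never uses it.
f_cargo <- local({
  big <- rnorm(1e6)  # ~8 MB of unused "cargo"
  function(x) x
})

length(serialize(f_plain, NULL))  # a few hundred bytes
length(serialize(f_cargo, NULL))  # roughly 8 MB
```

This is the same mechanism that inflates the payload when a closure created inside another function is shipped to a PSOCK worker.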
Describe the bug
I believe that a significant (and avoidable) overhead is present when using `future_lapply()` inside another function. I think this might be related to the exportation of `...future.FUN`, which is serialized together with its enclosing environment. I experience this issue on Windows and have not had the chance to try it on macOS. I suspect this is only relevant to Windows.
Expected behavior
The overhead of deploying a function on multiple cores should remain limited.
Reproducible example
Consider the following (lengthy) example, where I apply `identity()` to a large list of characters. The second function nests the `future_lapply()` call, and this appears to have important implications for the parallel-processing overhead. I quickly benchmarked the functions on a sequential backend.
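The benchmark code itself is not reproduced above; as a rough sketch (function names hypothetical), the non-nested and nested setups might look like this:

```r
library(future.apply)

large_list <- as.list(as.character(runif(1e5)))

## Non-nested: future_lapply() is called directly at top level, so the
## calling frame carries little besides the arguments themselves.
res1 <- future_lapply(large_list, identity)

## Nested: future_lapply() is called inside another function, so the
## future created there is associated with some_function2()'s local
## frame, which contains the large input 'x'.
some_function2 <- function(x) {
  future_lapply(x, identity)
}
res2 <- some_function2(large_list)
```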
All apply functions perform more or less similarly, and foreach is slower, apparently due to some memory allocation. The results change drastically once a `multisession` plan is registered: the nested call of `future_lapply()` inside `some_function2()` appears to suffer compared to the first two alternatives. The foreach option is relatively unaffected by the changed plan. The gap between the nested and non-nested `future_lapply()` increases as the number of cores rises: now `future_lapply()` is even slower than foreach! foreach is, in contrast with the other options, barely affected by the increase in the number of processes.

To investigate the matter further, I implemented a very dirty replacement of the internal function `sendData.SOCK0node()` of the `parallel` package. In doing so, I added some lines to write to a log file the time elapsed for each exportation to a cluster node. I know that `future` already implements many monitoring tools when setting `options(future.debug = TRUE)`, but I felt something was off when observing the log of exported values. Instead, I implemented this logging system using `pryr::object_size()`, which is supposed to be more reliable than `utils::object.size()`. The results suggest that the enclosing environment of `future_lapply()` is unnecessarily exported to each node.

Here is the dirty replacement of the `parallel` function. Although this throws an error, it was enough (for me at least) to replace the existing function in the `parallel` package. I know this seems barbaric, but my limited knowledge of `future`'s internals led me to this.

Hence, I was able to observe the data transferred to each node by monitoring its size with `pryr::object_size()` for the nested `future_lapply()` call. The interesting parts are lines 20 and 27 of the log, corresponding to the export of the function of `future_lapply()` to each of the nodes. According to `pryr::object_size()`, a large amount of data is sent to the worker at that point, and the elapsed time for the function's exportation is consequently significant. In contrast, the same analysis on the non-nested call to `future_lapply()` does not show these large exports: for the non-nested case, the size of the function's exportation is much smaller.
After investigating the `...future.FUN` object in more detail, it seems that the issue is related to the function's environment. When the function is serialized, its environment is serialized as well, which basically forces the exportation of the entire input of `future_lapply()` to each node. To prove this point, I attempted to replace the environment of the exported function with the global environment:
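The code that swapped in the global environment is not shown above; a sketch of the idea (hypothetical names, and unsafe in general, since the function loses access to its enclosing scope) looks like this:

```r
some_function2 <- function(x) {
  FUN <- function(el) identity(el)
  ## Detach FUN from some_function2()'s local frame (which holds the
  ## large 'x') so that serializing FUN does not drag 'x' along.
  ## Only safe here because FUN uses nothing from that frame.
  environment(FUN) <- globalenv()
  future.apply::future_lapply(x, FUN)
}
```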
Despite the nested call to `future_lapply()`, it now seems that the enclosing environment of the function is no longer exported. I'm not sure of the implications for other functions (since this example only involves `identity()`), but the improvement brought by this little change indicates that work in that direction could significantly speed up the use of `future_lapply()` in other packages.

Finally, I re-benchmarked the functions with 10 cores and this adjusted `...future.FUN` environment: the nested loop is drastically improved and is now comparable with the first two options! Of course, I'm sure that naively setting the global environment on the function is not the best way to handle this issue. But since a good use case of `future_lapply()` is essentially to replace `lapply()` in packages, leaving the user free to set the parallelism with `plan()`, it is a shame that using `future_lapply()` inside a function suffers from this large overhead.

My session info: