
Add support for set.seed in parallel #15

Open
gabrielodom opened this issue Aug 10, 2018 · 5 comments

@gabrielodom
Owner

gabrielodom commented Aug 10, 2018

See Section 6 of the parallel package vignette:
https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf

We need to add the ability to set seeds over multiple computing cores so that our AES-PCA and Supervised PCA function results are reproducible.
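As a rough illustration of what Section 6 of the vignette describes, here is a minimal sketch using only the base `parallel` package. The helper name `draw_on_cluster` and its arguments are hypothetical and are not part of this package; `clusterSetRNGStream()` gives each worker its own L'Ecuyer-CMRG stream derived from a single seed.

```r
library(parallel)

# Hypothetical helper (not part of this package): draw `n` normal variates
# on each of `n_workers` PSOCK workers after seeding the whole cluster.
draw_on_cluster <- function(seed, n_workers = 2L, n = 3L) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  # Switch every worker to L'Ecuyer-CMRG and hand each one its own stream,
  # all derived from the single integer `seed`.
  clusterSetRNGStream(cl, iseed = seed)
  # Run rnorm(n) once on every worker; returns one vector per worker.
  clusterCall(cl, rnorm, n)
}

run1 <- draw_on_cluster(seed = 123)
run2 <- draw_on_cluster(seed = 123)

# Same seed and same number of workers: the two runs should match.
print(identical(run1, run2))
```

This gives one stream per worker, so results should repeat only when the seed, the number of workers, and the way work is split across them all stay the same.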

@gabrielodom
Owner Author

gabrielodom commented Aug 20, 2018

Escalate to @lxw391 at the next meeting. @jamesban2015 reminded me that the supervised PCA would need the same fix (because of the randomly-generated parametric bootstrap samples).

@gabrielodom changed the title from "Add support for set.seed in AES-PCA" to "Add support for set.seed in parallel" on Aug 20, 2018
@lxw391
Collaborator

lxw391 commented Nov 30, 2018

Could you try 10,000 permutations and see how much the results vary? My hypothesis is that the variation would decrease as the number of permutations increases.

@gabrielodom
Owner Author

The variation has nothing to do with the number of permutations; it comes from the random number generation algorithm within R itself. This is not a trivial problem, because the behavior depends on which random number generator is used. This article discusses parallelization with the Mersenne-Twister, Marsaglia-Multicarry, and L’Ecuyer-CMRG random number generators:
https://rpubs.com/Jouni_Helske/225931

@gabrielodom
Owner Author

gabrielodom commented Jan 30, 2019

@lxw391 forwarded a Bioc-devel conversation to me:

On Mon, Jan 7, 2019 at 3:26 PM Henrik Bengtsson henrik.bengtsson@gmail.com wrote:

  1. To achieve fully numerically reproducible RNGs in a way that is invariant to the number of workers (amount of chunking), I think the only solution is to pregenerate RNG seeds (using parallel::nextRNGStream()) for each individual iteration (element). In other words, if a worker will process K elements, then the main R process needs to generate K RNG seeds and pass those along to the worker. I use this approach for future.apply::future_lapply(..., future.seed = TRUE/<initial_seed>), which then produces identical RNG results regardless of backend and amount of chunking. In the past, I think I've seen Martin suggesting something similar as a manual approach to some users.
  2. The above approach is obviously expensive, especially when there are a large number of elements to iterate over. Because of this, I'm thinking of providing an option to use only one RNG seed per worker, which is the common approach used elsewhere. This won't be invariant to the number of workers, but it "should" still be statistically sound. This approach will give reproducible RNG results given the same initial seed and the same amount of chunking.
  3. For algorithms which do not rely on RNG, we can ignore both of the above. The problem is that it's not always known to the user/developer which methods depend on RNG or not. The above 'RNG tracker' helps to identify some, but things might also change over time. I believe there's room for automating this in one way or the other. For instance, having a way to declare a function being dependent on RNG or not could help. Static code inspection could also do it, e.g. when an R package is built, and it could be part of the R CMD checks to validate.
  4. Are there other approaches?

I don't suppose it's possible to quickly determine via static analysis whether a piece of code uses the RNG?

I still see this as a non-critical enhancement for the upcoming version 1 release.
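For reference, point 1 of the quoted email (one pre-generated seed per element) could be sketched with base R roughly as follows. The helpers `make_seeds`, `draw_one`, and `run_with` are hypothetical names used only for illustration; this is not how future.apply or this package implements it, just the same idea done by hand with `parallel::nextRNGStream()`.

```r
library(parallel)

# Hypothetical helper: pre-generate one L'Ecuyer-CMRG stream per element.
# Side effect: switches the calling session's RNG kind to L'Ecuyer-CMRG.
make_seeds <- function(n, seed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  seeds <- vector("list", n)
  s <- .Random.seed
  for (i in seq_len(n)) {
    seeds[[i]] <- s
    s <- nextRNGStream(s)  # jump ahead to the next independent stream
  }
  seeds
}

# Hypothetical per-element task: install the element's own stream on
# whichever worker received it, then draw from it.
draw_one <- function(seed, n = 3L) {
  assign(".Random.seed", seed, envir = globalenv())
  rnorm(n)
}

# Run the same task list on a cluster of a given size.
run_with <- function(n_workers, seeds) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  parLapply(cl, seeds, draw_one)
}

seeds <- make_seeds(6L, seed = 42)
res2 <- run_with(2L, seeds)
res3 <- run_with(3L, seeds)

# Each element carries its own stream, so the number of workers
# (and hence the chunking) should not change the results.
print(identical(res2, res3))
```

The cost is the one the email mentions: the main process must generate and ship one seed per element, which can add up when there are many elements.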

@lxw391
Collaborator

lxw391 commented Mar 13, 2019

Another email thread about set.seed in the parallel setting:
https://stat.ethz.ch/pipermail/bioc-devel/2019-March/014757.html
