Allow variable mcmc length #365
base: master
Conversation
Pull Request Test Coverage Report for Build 2105900692
- Coveralls
Thanks for the PR @ColmTalbot. I can imagine the new behaviour being better, but for this to be merged I'd like to see at least one example so we can assess it.
I've been thinking about a good method to quantify this. In the end, I've been looking at the Euclidean distance travelled by the MCMC at each stage (in the unit hypercube space). For a given likelihood threshold, there is a "correct" distribution of these distances. For simple cases, this can be probed by direct sampling from the constrained prior. I think this distribution is

$$p(\Delta \|u\|) = \int_{\mathcal{L} > \lambda} du_{\rm init}\, du_{\rm final}\, \delta(\Delta \|u\| - \|u_{\rm init} - u_{\rm final}\|)$$

This distribution will likely change slowly and smoothly as the threshold increases. If we average over many iterations, we can therefore attempt to measure the average expected distance. If the MCMC chains are too short, increasing the number of steps will increase the average distance. After a certain point, increasing the chain length will no longer increase the average distance. Note that the distribution of distances won't necessarily converge to the "correct" distribution for some proposals.

Here are a few examples with the current MCMC rwalk method (volumetric proposal). Here is a figure for a case where changing the MCMC length seems to outperform varying the scale. The legend entries are:
With the scale not changing there are many more large distances. These correspond to iterations where the MCMC jumps between the modes. The MCMC length increases a lot here in order to keep accepting points, but it manages to mix the different modes much more efficiently. This is the model:

```python
import numpy as np


def bimodal_normal(p):
    """
    Bimodal normal distribution

    sigma = 1
    mu = +/-5
    """
    return np.logaddexp(-np.sum((p - 5)**2) / 2, -np.sum((p + 5)**2) / 2)
```

Here's the same plot for the Rosenbrock model. The variable MCMC length isn't doing better here than the longest fixed MCMC length, but it is much more efficient (fewer calls per iteration). In this case, none of these versions are getting close to the expected distribution, but we have other indications that this isn't a great proposal for this model.
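As a rough illustration of the diagnostic described above, the "correct" distance distribution can be estimated by rejection sampling independent pairs of points from the constrained prior and recording their separations. This is a minimal sketch, assuming a toy 2-d unimodal Gaussian likelihood in the unit cube as a stand-in for the actual models tested (the function names and the threshold value are illustrative, not from the PR):

```python
import numpy as np

rng = np.random.default_rng(42)
ndim = 2


def loglike(u):
    # toy Gaussian in the unit cube, width 0.1, centred at 0.5
    # (an illustrative stand-in for the bimodal/Rosenbrock models)
    return -np.sum((u - 0.5) ** 2) / (2 * 0.1 ** 2)


def reference_distances(threshold, n_pairs=2000):
    """Rejection-sample pairs of independent points from the
    constrained prior {L > threshold} and return their Euclidean
    separations -- the 'correct' distribution of step distances."""
    pts = []
    while len(pts) < 2 * n_pairs:
        u = rng.uniform(size=ndim)
        if loglike(u) > threshold:
            pts.append(u)
    pts = np.array(pts)
    return np.linalg.norm(pts[:n_pairs] - pts[n_pairs:], axis=1)


ref = reference_distances(threshold=-2.0)
```

A chain's endpoint distances at a given likelihood threshold can then be histogrammed against `ref` to see whether the walk has decorrelated.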
Thank you for your examples, but I don't think I agree with your interpretations.

In fact, thinking more about it, I am not sure that your proposal is a good idea in its current form (or maybe I don't understand it). Right now we scale the proposal ellipsoid so that typically (averaged across the whole posterior) the proposal ellipsoid overlaps with 50% of the L>L* area. This takes care of the situation where, say, the L>L* area is very curved and a single ellipsoid is a bad approximation to it. Imagine a thin spherical Gaussian shell of radius 1 with one single ellipsoidal bound of radius 1. In this case the proposal based on that ellipsoid will have a horrid acceptance ratio, so the existing code will scale it down until the ellipsoid is about as thick as the shell, and then the sampling will proceed (yes, because the proposal will be small, you'll need a lot of walks to have proper samples in this case). But in your proposal, when you only adjust the walks parameter without adjusting the scale, that's going to be very inefficient, because you'll be sampling from the large ellipsoid and therefore the number of 'walks' will grow to a very large number.

I think what one would want to do in practice is to keep adjusting the scale in the same way we do it now, but adjust the walks parameter if the samples are too correlated. I.e. if the typical correlation between proposed points and existing points is too large, then we slowly adjust the walks parameter. But that needs to be decided based on correlation, I think, not the acceptance fraction that your current code uses. Maybe the first step should be to actually compute the auto-correlation time like in #163; then we can actually characterize how correlated our samples are, and potentially introduce the adjustments of walks.
To clarify, the first example above is a 15-dimensional bimodal Gaussian (I forgot to say that above).
I see, that makes more sense. Thanks. Was it with a single ellipsoidal bound?
To address the first point: if we don't match the correct distribution, we don't have an independent sample; that's the definition of the "correct" distribution. I think the question we normally ask is "how correlated can the new sample be with the old sample and not be visibly biased", which can be answered by running a bunch of simulations for the problem at hand, but I don't know of a general rule.

I agree that calculating the ACL would be ideal, but in practice calculating the ACL requires walking for many ACLs, which in turn makes the exploration very inefficient if we only want to advance by a single ACL^. In the absence of that, we can use this method to assess whether the number of steps is producing an independent sample.

Here's a figure demonstrating the complementarity of the two approaches. Here I'm just MCMC sampling from a flat disc with a Gaussian proposal. I then plot the distribution of distances for a range of lags (a proxy for running many iterations of the rwalk). I also calculated the ACL of the chain using the method you mentioned. The linestyle indicates whether the lag is more or less than the ACL. Pretty much perfectly, for lags > ACL the distribution matches the expected one. For lags < ACL the distances are too short, but on average they are larger when the lag increases.

^ One alternative is to run the MCMC for many autocorrelation times to get a stable estimate and then return many points from the chain to get multiple updates (as is currently done when using multi-processing). This leads to pretty wildly varying autocorrelation lengths, which is less than ideal, but re-running on the Rosenbrock example it looks much better. The version where the scale and the MCMC length are both varying is super inefficient; here's the same thing without that configuration.
The purple trace (fixed scale, variable MCMC length) uses a comparable number of likelihood calls per sample (and similar total run time) as the green one (fixed chain length, variable scale), which does a worse job. Personally, I'd like to at least be able to disable the scale adaptation, as the non-varying scale performs better for the Rosenbrock function and volumetric proposal. Implementing the on-the-fly autocorrelation-time estimation is only really viable if we are happy to accept multiple points from each chain. I still think that the number of accepted points is a decent proxy for the number of autocorrelation lengths traversed, but I don't have a great argument for that at the moment.
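The flat-disc experiment above (Metropolis sampling with a Gaussian proposal, comparing lag distances against the autocorrelation time) can be sketched as follows. This is a minimal reconstruction under assumptions: the step size, chain length, and the crude ACT estimator (summing the autocorrelation function until its first negative value) are choices for illustration, not necessarily those used for the figure:

```python
import numpy as np

rng = np.random.default_rng(0)


def mcmc_disc(n_steps, step=0.3):
    """Metropolis chain targeting the uniform distribution on the
    unit disc, with an isotropic Gaussian proposal."""
    chain = np.empty((n_steps, 2))
    x = np.zeros(2)
    for i in range(n_steps):
        prop = x + step * rng.normal(size=2)
        if np.sum(prop ** 2) <= 1.0:  # uniform target: accept iff inside
            x = prop
        chain[i] = x
    return chain


def lag_distances(chain, lag):
    """Euclidean distances between points `lag` steps apart."""
    return np.linalg.norm(chain[lag:] - chain[:-lag], axis=1)


def integrated_act(x, max_lag=200):
    """Crude integrated autocorrelation time of a 1-d series."""
    x = x - x.mean()
    acf = np.array([np.mean(x[: len(x) - k] * x[k:]) for k in range(max_lag)])
    acf /= acf[0]
    # truncate the sum at the first negative value (a common rule)
    neg = np.where(acf < 0)[0]
    stop = neg[0] if len(neg) else max_lag
    return 1 + 2 * acf[1:stop].sum()


chain = mcmc_disc(20000)
tau = integrated_act(chain[:, 0])
d_short = lag_distances(chain, 2).mean()       # lag << ACL: too short
d_long = lag_distances(chain, int(5 * tau)).mean()  # lag >> ACL
```

For lags well above `tau`, the distances should approach the independent-pair distribution on the disc (mean separation about 0.9 for the unit disc), while short lags give systematically smaller distances.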
Just a quick answer on that.
This PR adds a new option that allows the length of the MCMC exploration for the `rwalk` method to vary during the analysis such that a target number of proposed steps is accepted. There is also an option to disable the scale adaptation of the proposal.

cc @joshspeagle @segasai
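The core idea of the PR (walk until a target number of proposals have been accepted, rather than for a fixed number of steps) can be sketched roughly as below. This is not the actual dynesty implementation; the function name, the isotropic Gaussian proposal, the toy likelihood, and the default parameters are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)


def adaptive_rwalk(u0, loglstar, loglike, scale=0.1, target_accept=25,
                   max_walks=10_000):
    """Sketch of a variable-length random walk (illustrative, not the
    PR's actual code): keep proposing until `target_accept` proposals
    have been accepted, instead of taking a fixed number of steps."""
    u = np.array(u0, dtype=float)
    n_accept = n_walks = 0
    while n_accept < target_accept and n_walks < max_walks:
        prop = u + scale * rng.normal(size=u.shape)
        n_walks += 1
        # accept only points inside the unit cube and above the
        # current likelihood threshold (the hard nested-sampling bound)
        if np.all((0 < prop) & (prop < 1)) and loglike(prop) > loglstar:
            u = prop
            n_accept += 1
    return u, n_walks


# usage with a toy Gaussian likelihood in the unit cube
loglike = lambda u: -np.sum((u - 0.5) ** 2) / (2 * 0.1 ** 2)
u, n_walks = adaptive_rwalk(np.full(2, 0.5), loglstar=-2.0, loglike=loglike)
```

The chain length `n_walks` then grows automatically in hard regions (low acceptance) and shrinks in easy ones, which is the behaviour discussed in the thread above.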