Hyper-parameter Tuning as Global Optimization Problem
Objective: define a metric to optimize, e.g., validation error.
We usually do random search or grid search.
They are easy to parallelize but computationally expensive, since the search space grows exponentially with the number of hyper-parameters.
Can we do better than random search and grid search for searching hyper-parameters?
Bayesian Optimization
A probabilistic surrogate model of the objective.
This surrogate model captures the uncertainty or confidence of the estimates.
Performs an efficient, guided search by balancing exploration and exploitation. (Sounds like RL!)
More illustration: assume we have a fixed initial budget that we randomly assign across 4 slot machines; the returns of these 4 machines are shown in the figure above. If we only look at the mean return, we might exploit the 3rd machine. However, if we also look at the uncertainty σ, we might want to explore the 4th machine.
Can be formalized in different ways, e.g., with Gaussian Processes as the surrogate model.
Function f is the objective (e.g., validation metric)
x is our hyper-parameter candidate.
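(Canonical) Algorithm of Bayesian Optimization
A minimal sketch of the canonical loop, where `objective` (the expensive validation metric), `fit_surrogate` (e.g., a GP), and `maximize_acquisition` are hypothetical placeholders, not names from any particular library:

```python
# Canonical Bayesian optimization loop (sketch). `objective`,
# `fit_surrogate`, and `maximize_acquisition` are hypothetical placeholders.

def bayesian_optimization(objective, fit_surrogate, maximize_acquisition,
                          x_init, n_iter=30):
    # D = {(x_i, y_i)}: observations of the expensive objective so far.
    D = [(x, objective(x)) for x in x_init]
    for _ in range(n_iter):
        surrogate = fit_surrogate(D)             # posterior over f given D
        x_new = maximize_acquisition(surrogate)  # x_new = argmax_x a(x|D)
        y_new = objective(x_new)                 # one expensive evaluation
        D.append((x_new, y_new))
    # Best hyper-parameter found (minimization convention, e.g. val. error).
    return min(D, key=lambda xy: xy[1])
```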
Bayesian Optimization with Gaussian Processes (GP) Surrogate Model
Ingredient 1: GP
Primer: a Multivariate Gaussian(mean_vector m, covariance_matrix K) can be used for sampling a finite set of random variables. A GP can be thought of as the generalization of the Multivariate Gaussian.
The (prior) GP(mean_function m(x), covariance_function k(x, ·)) can be used for sampling random functions (i.e., infinitely many random variables).
The posterior GP(μ(x), σ(x)^2) can be computed in closed form after receiving new observations (x_new, y_new).
Once the posterior GP is computed, we can draw functions from it.
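A minimal numpy sketch of this closed-form update and of drawing functions from the posterior, assuming a zero mean function and an RBF kernel (kernel choice, length-scale, and the toy data are illustrative assumptions):

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    # Covariance function k(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2)).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-8):
    # Closed-form posterior GP(mu(x), sigma(x)^2) given observations
    # (X_obs, y_obs), with prior mean m(x) = 0 and covariance k = rbf.
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = rbf(X_obs, X_query)
    mu = K_s.T @ np.linalg.solve(K, y_obs)                         # posterior mean
    cov = rbf(X_query, X_query) - K_s.T @ np.linalg.solve(K, K_s)  # posterior cov
    return mu, cov

# Drawing functions from the posterior on a grid (toy 1-D example):
X_obs = np.array([[0.0], [1.0], [2.5]])
y_obs = np.array([1.0, 0.2, 0.7])
X_q = np.linspace(-1.0, 4.0, 50)[:, None]
mu, cov = gp_posterior(X_obs, y_obs, X_q)
samples = np.random.multivariate_normal(mu, cov + 1e-9 * np.eye(len(X_q)), size=3)
```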
Ingredient 2: Acquisition Function
The acquisition function selects the next hyper-parameter candidate x_new for us by scoring the value (like a value function in RL) of each candidate.
The value of x, a(x|D) (D = observations), is the expectation of a utility function over evaluations, a(x|D) = E[A(f, x)] (recall that f is sampled from the posterior GP, so it is stochastic).
We can use off-the-shelf solvers to compute the expectation.
The acquisition function makes the exploration and exploitation trade-off.
Two families of acquisition functions: improvement-based, e.g., Probability of Improvement (PI), Expected Improvement (EI), and Upper Confidence Bound (UCB); and entropy-based, e.g., Entropy Search (ES), Predictive Entropy Search (PES), and Max-value Entropy Search (MES). EI is sketched below.
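EI has a well-known closed form under a GP posterior; a minimal sketch, assuming a minimization convention and the per-point posterior mean/std from the GP code above:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # EI(x) = E[max(y_best - f(x), 0)] for f(x) ~ N(mu, sigma^2)
    # (minimization: "improvement" = going below the best observed y_best).
    # A promising mean OR a large sigma both raise EI: this is where the
    # exploration/exploitation trade-off happens.
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive std
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```

The candidate maximizing EI is the next x to evaluate; PI and UCB plug different trade-offs between mu and sigma into the same loop.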
Other Surrogate Models instead of GP?
SMAC: Random Forests.
TPE: Tree-structured Parzen Estimators.
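A sketch of TPE's core idea for a single 1-D hyper-parameter, using scipy's Gaussian KDE to stand in for the Parzen estimators (the quantile gamma=0.25 and the KDE choice are illustrative assumptions; each group needs at least a couple of observations):

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_score(x_candidates, x_obs, y_obs, gamma=0.25):
    # Split observed configs into "good" (losses in the lowest gamma
    # quantile) and "bad", fit a density to each group -- l(x) and g(x) --
    # and rank candidates by the ratio l(x) / g(x); TPE proposes the
    # candidate maximizing this ratio.
    cut = np.quantile(y_obs, gamma)
    good, bad = x_obs[y_obs <= cut], x_obs[y_obs > cut]
    l, g = gaussian_kde(good), gaussian_kde(bad)
    return l(x_candidates) / (g(x_candidates) + 1e-12)
```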
Multi-fidelity Optimization
Bayesian optimization is (surrogate) model-based and sequential, whereas multi-fidelity optimization is more random-based and parallel.
Can we use cheap approximations of the objective f(x)?
Successive halving: a fixed schedule for early stopping (stop the bottom-50% performing models and let the top-50% continue); see the sketch below.
Hyperband (HB): a variant of successive halving where we run different early-stopping schedules (each group shares the same computational resource level, or fidelity r).
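A sketch of successive halving with halving factor eta=2, where `partial_train(config, budget)` is a hypothetical placeholder that trains a candidate at a given fidelity and returns its validation loss; Hyperband would run several such brackets with different starting budgets:

```python
def successive_halving(configs, partial_train, min_budget=1, eta=2):
    # configs: list of hyper-parameter candidates (e.g., sampled at random).
    # partial_train(config, budget): hypothetical hook -- train `config`
    # with `budget` resources (epochs, subset size, ...) and return a
    # validation loss at that fidelity.
    budget = min_budget
    while len(configs) > 1:
        losses = [partial_train(c, budget) for c in configs]
        ranked = sorted(range(len(configs)), key=lambda i: losses[i])
        configs = [configs[i] for i in ranked[: max(1, len(configs) // eta)]]
        budget *= eta  # survivors (top 1/eta) continue with more budget
    return configs[0]
```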
Model-based Multi-fidelity Optimization
Hyperband still samples hyper-parameter candidates uniformly; can we combine it with model-based (Bayesian) optimization? Yes.
BOHB: TPE-based Hyperband for each fidelity r.
Asynchronous Extensions
The above methods have asynchronous extensions, e.g., A-BO, A-RS, A-BOHB, A-HB, ...
Metadata
Hyper-parameter Tuning as Global Optimization Problem
Bayesian Optimization
(Canonical) Algorithm of Bayesian Optimization
Bayesian Optimization with Gaussian Processes (GP) Surrogate Model
Ingredient 1: GP
Ingredient 2: Acquisition Function
Other Surrogate Models instead of GP?
Multi-fidelity Optimization
Model-based Multi-fidelity Optimization
Asynchronous Extensions
NAS
More on GPs and Bayesian Optimization on Distill Pub