CVPR 2020 Tutorial Talk: Automated Hyperparameter and Architecture Tuning #71

howardyclo commented Jul 31, 2020

Metadata

Hyper-parameter Tuning as Global Optimization Problem

  • Objective: define a metric to optimize, e.g., validation error.
  • We usually do random search or grid search (a minimal random-search sketch follows this list).
  • They are easy to parallelize but computationally expensive, since the search space grows exponentially with the number of hyper-parameters.
  • Can we do better than random search and grid search for searching hyper-parameters?
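Below is a minimal random-search sketch over a toy search space; the space, the budget of 20 trials, and the `validation_error` stand-in are illustrative assumptions, not taken from the talk.

```python
import random

# Toy search space (illustrative): log-uniform learning rate, categorical batch size.
SPACE = {
    "lr": lambda: 10 ** random.uniform(-4, -1),
    "batch_size": lambda: random.choice([32, 64, 128]),
}

def validation_error(config):
    # Stand-in for training a model and measuring its validation error.
    return (config["lr"] - 0.01) ** 2 + 0.0001 * config["batch_size"]

best_config, best_err = None, float("inf")
for _ in range(20):  # fixed budget of 20 independent trials
    config = {name: sample() for name, sample in SPACE.items()}
    err = validation_error(config)
    if err < best_err:
        best_config, best_err = config, err

print("best:", best_config, "error:", best_err)
```

Grid search would replace the random draws with a Cartesian product over fixed per-dimension values, which is why its cost blows up with the number of hyper-parameters.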

Bayesian Optimization

  • A probabilistic surrogate model of the objective.
  • This surrogate model captures the uncertainty or confidence of the estimates.
  • Performs an efficient search by balancing exploration and exploitation. (Sounds like RL!)
  • More illustration: assume we have a fixed initial budget that we randomly assign to 4 different slot machines, whose returns are shown in the talk's figure. If we only look at the mean return, we might exploit the 3rd machine; however, if we also look at the uncertainty σ, we might want to explore the 4th machine. (A tiny UCB-style sketch of this trade-off follows this list.)
  • Can be formulated in different ways, e.g., with Gaussian Processes as the surrogate model.
  • See more on A Tutorial on Bayesian Optimization.
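To make the slot-machine illustration concrete, here is a tiny UCB-style sketch: each machine is scored by its mean return plus κ times its uncertainty, so a machine with a worse mean but a large σ can still win. All numbers are made up for illustration.

```python
# Made-up (mean, sigma) return estimates for 4 slot machines.
machines = [(0.2, 0.05), (0.5, 0.10), (0.8, 0.05), (0.6, 0.40)]

kappa = 2.0  # exploration weight: larger kappa favors uncertain machines
scores = [mean + kappa * sigma for mean, sigma in machines]

# Pure exploitation would pick machine 3 (highest mean 0.8);
# the UCB score picks machine 4, whose upper bound 0.6 + 2 * 0.4 is higher.
best = max(range(len(machines)), key=lambda i: scores[i])
print("pick machine", best + 1, "with score", scores[best])
```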

The idea of a surrogate model is also used in Efficient Progressive Neural Architecture Search.

(Canonical) Algorithm of Bayesian Optimization

  • Function f is the objective (e.g., validation metric)
  • x is our hyper-parameter estimate (a runnable sketch of the loop follows this list).
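A runnable sketch of this loop, using scikit-learn's GaussianProcessRegressor as the surrogate and Expected Improvement maximized over a random candidate pool; the 1-D objective f and all constants are illustrative assumptions, not the tutorial's setup.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):
    # Illustrative 1-D objective standing in for a validation metric (lower is better).
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X = list(rng.uniform(-3, 3, size=3))        # a few initial random hyper-parameter guesses
y = [f(x) for x in X]

for _ in range(15):                          # sequential BO iterations
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
    gp.fit(np.array(X).reshape(-1, 1), np.array(y))

    # Expected Improvement, maximized over a random candidate pool
    # (standing in for a proper inner optimizer of the acquisition function).
    cand = rng.uniform(-3, 3, size=(200, 1))
    mu, sigma = gp.predict(cand, return_std=True)
    best = min(y)
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    x_next = float(cand[np.argmax(ei), 0])   # candidate with the largest expected improvement
    X.append(x_next)
    y.append(f(x_next))

print("best x:", X[int(np.argmin(y))], "best f:", min(y))
```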

Bayesian Optimization with Gaussian Processes (GP) Surrogate Model

Ingredient 1: GP

  • Primer: a Multivariate Gaussian(mean_vector m, covariance_matrix K) can be used to sample a finite set of random variables. A GP can be thought of as the generalization of the Multivariate Gaussian.
  • The (prior) GP(mean_function m(x), covariance_function k(x, .)) can be used to sample random functions (i.e., infinitely many random variables).
  • The posterior GP(μ(x), σ(x)^2) can be computed in closed form after receiving new observations (x, y).
  • Once the posterior GP is computed, we can draw functions from it (see the numpy sketch after this list).
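A small numpy sketch of that closed-form posterior for a zero-mean GP with an RBF kernel and (almost) noise-free observations; the kernel, length scale, and toy data are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2))
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

# Observed hyper-parameter values x and objective values y (toy 1-D data).
X_obs = np.array([-2.0, 0.0, 1.5])
y_obs = np.array([0.5, -0.3, 0.2])
X_new = np.linspace(-3, 3, 7)                        # locations where we query the posterior

K = rbf(X_obs, X_obs) + 1e-8 * np.eye(len(X_obs))    # jitter for numerical stability
K_s = rbf(X_new, X_obs)
K_ss = rbf(X_new, X_new)

K_inv = np.linalg.inv(K)
mu = K_s @ K_inv @ y_obs                             # posterior mean mu(x)
cov = K_ss - K_s @ K_inv @ K_s.T                     # posterior covariance
sigma = np.sqrt(np.clip(np.diag(cov), 0.0, None))    # posterior std sigma(x)

# Drawing functions from the posterior GP = sampling from this multivariate Gaussian.
samples = np.random.default_rng(0).multivariate_normal(
    mu, cov + 1e-8 * np.eye(len(X_new)), size=3)
print(np.round(mu, 3), np.round(sigma, 3))
```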

Ingredient 2: Acquisition Function

  • The acquisition function selects the next hyper-parameter candidate x_new for us: what is the value (like a value function in RL) of a given candidate?
  • The value of x, a(x|D) (D = observations), is the expectation of the utility function of evaluations A(f, x): a(x|D) = E_{f ~ posterior}[A(f, x)]. (Recall that f is sampled from the posterior GP, so it is stochastic; a Monte Carlo sketch of this expectation follows this list.)
  • We can use off-the-shelf solvers to compute the expectation.
  • The acquisition function makes the exploration and exploitation trade-off.
  • Two families of acquisition functions: improvement-based (e.g., PI, EI, UCB) and entropy-based (e.g., ES, PES, MES).
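A Monte Carlo sketch of a(x|D) for the improvement utility A(f, x) = max(y_best − f(x), 0) (the utility behind EI, assuming we minimize); the posterior at each candidate is faked with a fixed mean and std, purely to show the exploration/exploitation effect.

```python
import numpy as np

rng = np.random.default_rng(0)

def acquisition(mu_x, sigma_x, y_best, n_samples=10_000):
    # a(x|D) = E_{f ~ posterior}[ A(f, x) ],  with A(f, x) = max(y_best - f(x), 0).
    f_samples = rng.normal(mu_x, sigma_x, size=n_samples)  # draws of f(x) from the posterior at x
    improvement = np.maximum(y_best - f_samples, 0.0)
    return improvement.mean()                               # Monte Carlo estimate of the expectation

# Two made-up candidates, with the current best observed value y_best = 0.25.
print(acquisition(mu_x=0.20, sigma_x=0.02, y_best=0.25))  # confident, slightly better mean (exploit)
print(acquisition(mu_x=0.30, sigma_x=0.30, y_best=0.25))  # worse mean but large sigma (explore)
```

For a Gaussian posterior this particular expectation also has a closed form (the standard EI formula).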

Other Surrogate Models instead of GP?

  • SMAC: random forests (a random-forest surrogate sketch follows this list).
  • TPE: Tree-structured Parzen Estimators.
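A hedged sketch of the SMAC idea: use a random forest as the surrogate and read a rough mean/uncertainty off the per-tree predictions. This illustrates the concept only; it is not the actual SMAC implementation, and the toy data are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy observed 1-D configurations and their validation errors (made up).
X_obs = np.array([[0.1], [0.3], [0.5], [0.7], [0.9]])
y_obs = np.array([0.80, 0.45, 0.30, 0.35, 0.60])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_obs, y_obs)

X_cand = np.linspace(0, 1, 11).reshape(-1, 1)
per_tree = np.stack([tree.predict(X_cand) for tree in forest.estimators_])  # (n_trees, n_cand)
mu = per_tree.mean(axis=0)     # surrogate mean, analogous to the GP's mu(x)
sigma = per_tree.std(axis=0)   # spread across trees as a rough stand-in for sigma(x)

print(np.round(mu, 2))
print(np.round(sigma, 2))
```

An acquisition function can then be evaluated on (mu, sigma) exactly as in the GP case; random forests also handle categorical and conditional hyper-parameters more naturally than a standard GP.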

Multi-fidelity Optimization

  • Bayesian optimization is (surrogate) model-based and sequential, whereas multi-fidelity optimization is more random-sampling-based and parallel.
  • Can we use a cheap approximation of the objective f(x)?
  • Successive halving: a fixed schedule for early stopping (stop the bottom-50% performing models and let the top-50% performing models continue; see the sketch after this list).
  • Hyperband (HB): a variant of successive halving where we run several early-stopping schedules (each group shares the same computational resource level, or fidelity r).
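Below is a minimal successive-halving sketch: sample a pool of configurations, train each for a small budget, keep the top half, double the per-model budget, and repeat. The quadratic `loss` stand-in and the concrete budgets are illustrative assumptions.

```python
import random

def loss(config, budget):
    # Stand-in for "train `config` for `budget` epochs and return its validation loss";
    # a larger budget gives a less noisy estimate of the true loss.
    return (config - 0.7) ** 2 + random.gauss(0, 0.1 / budget)

random.seed(0)
configs = [random.random() for _ in range(16)]  # 16 uniformly sampled configurations
budget = 1                                      # initial fidelity r (e.g., epochs)

while len(configs) > 1:
    ranked = sorted(configs, key=lambda c: loss(c, budget))
    configs = ranked[: len(configs) // 2]       # stop the bottom 50%
    budget *= 2                                 # survivors continue with twice the budget

print("surviving config:", configs[0])
```

Hyperband runs several such brackets with different starting pool sizes and starting budgets, hedging against the risk that very aggressive early stopping kills slow-starting configurations.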

Model-based Multi-fidelity Optimization

  • Hyperband still samples the hyper-parameter candidates uniformly. Can we combine it with model-based (Bayesian) optimization? Yes.
  • BO-HB: TPE-based Hyperband for each fidelity r (a rough structural sketch follows this list).
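A very rough structural sketch of the BO-HB idea: keep observations per fidelity, fall back to uniform random sampling until some fidelity has enough data, then propose new candidates from a model fit to the good configurations at the highest such fidelity. Here the "model" is just a Gaussian around a good configuration, standing in for BO-HB's TPE-style density estimators; all names and thresholds are illustrative.

```python
import random
from collections import defaultdict

observations = defaultdict(list)   # fidelity r -> list of (config, loss) results

def sample_config(min_points=8):
    # Use the highest fidelity that already has enough observations to fit a model.
    ready = [r for r, obs in observations.items() if len(obs) >= min_points]
    if not ready:
        return random.random()                               # fall back to uniform random sampling
    obs = observations[max(ready)]
    good = sorted(obs, key=lambda t: t[1])[: max(1, len(obs) // 4)]  # best 25% of configs
    centre = random.choice(good)[0]
    return min(1.0, max(0.0, random.gauss(centre, 0.1)))     # propose near a good config

def report(config, fidelity, loss):
    observations[fidelity].append((config, loss))

# Usage inside a Hyperband-style bracket: draw configs with sample_config(),
# then call report(config, r, loss) after each partial training run at fidelity r.
```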

Asynchronous Extensions

  • The methods above have asynchronous extensions, e.g., A-BO, A-RS, A-BOHB, A-HB, ...

NAS

More on GPs and Bayesian Optimization on Distill Pub
