Hyper-parameter Tuning as Global Optimization Problem
Objective: define a metric to optimize, e.g., validation error.
We usually do random search or grid search.
They are easy to parallelize but computationally expensive, since the search space grows exponentially with the number of hyper-parameters.
Can we do better than random search and grid search for searching hyper-parameters?
Bayesian Optimization
A probabilistic surrogate model of the objective.
This surrogate model captures the uncertainty or confidence of the estimates.
Performs an efficient, guided search by balancing exploration and exploitation. (Sounds like RL!)
More illustration: assume we have a fixed initial budget that we randomly assign across 4 slot machines; the returns of these 4 machines are shown in the figure above. If we only look at the mean return, we might exploit the 3rd machine. However, if we also look at the uncertainty σ, we might want to explore the 4th machine.
Can be formalized in different ways, e.g., with Gaussian Processes as the surrogate model.
Function f is the objective (e.g., validation metric)
x is our hyper-parameter candidate.
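(Canonical) Algorithm of Bayesian Optimization
A minimal sketch of the canonical loop, where `objective` (the expensive validation metric), `fit_surrogate` (e.g., a GP), and `maximize_acquisition` are hypothetical placeholders, not names from any particular library:

```python
# Canonical Bayesian optimization loop (sketch). `objective`,
# `fit_surrogate`, and `maximize_acquisition` are hypothetical placeholders.

def bayesian_optimization(objective, fit_surrogate, maximize_acquisition,
                          x_init, n_iter=30):
    # D = {(x_i, y_i)}: observations of the expensive objective so far.
    D = [(x, objective(x)) for x in x_init]
    for _ in range(n_iter):
        surrogate = fit_surrogate(D)             # posterior over f given D
        x_new = maximize_acquisition(surrogate)  # x_new = argmax_x a(x|D)
        y_new = objective(x_new)                 # one expensive evaluation
        D.append((x_new, y_new))
    # Best hyper-parameter found (minimization convention, e.g. val. error).
    return min(D, key=lambda xy: xy[1])
```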
Bayesian Optimization with Gaussian Processes (GP) Surrogate Model
Ingredient 1: GP
Primer: a Multivariate Gaussian(mean_vector m, covariance_matrix K) can be used for sampling a finite set of random variables. A GP can be thought of as the generalization of the Multivariate Gaussian.
The (prior) GP(mean_function m(x), covariance_function k(x, ·)) can be used for sampling random functions (i.e., infinitely many random variables).
The posterior GP(μ(x), σ(x)^2) can be computed in closed form after receiving new observations (x_new, y_new).
Once the posterior GP is computed, we can draw functions from it.
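A minimal numpy sketch of this closed-form update and of drawing functions from the posterior, assuming a zero mean function and an RBF kernel (kernel choice, length-scale, and the toy data are illustrative assumptions):

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    # Covariance function k(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2)).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-8):
    # Closed-form posterior GP(mu(x), sigma(x)^2) given observations
    # (X_obs, y_obs), with prior mean m(x) = 0 and covariance k = rbf.
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = rbf(X_obs, X_query)
    mu = K_s.T @ np.linalg.solve(K, y_obs)                         # posterior mean
    cov = rbf(X_query, X_query) - K_s.T @ np.linalg.solve(K, K_s)  # posterior cov
    return mu, cov

# Drawing functions from the posterior on a grid (toy 1-D example):
X_obs = np.array([[0.0], [1.0], [2.5]])
y_obs = np.array([1.0, 0.2, 0.7])
X_q = np.linspace(-1.0, 4.0, 50)[:, None]
mu, cov = gp_posterior(X_obs, y_obs, X_q)
samples = np.random.multivariate_normal(mu, cov + 1e-9 * np.eye(len(X_q)), size=3)
```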
Ingredient 2: Acquisition Function
The acquisition function selects the next hyper-parameter candidate x_new for us by scoring the value (like a value function in RL) of each candidate.
The value of x, a(x|D) (D = observations), is the expectation of a utility function over evaluations, a(x|D) = E[A(f, x)] (recall that f is sampled from the posterior GP, so it is stochastic).
We can use off-the-shelf solvers to compute the expectation.
The acquisition function makes the exploration and exploitation trade-off.
Two families of acquisition functions: improvement-based, e.g., Probability of Improvement (PI), Expected Improvement (EI), and Upper Confidence Bound (UCB); and entropy-based, e.g., Entropy Search (ES), Predictive Entropy Search (PES), and Max-value Entropy Search (MES). EI is sketched below.
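EI has a well-known closed form under a GP posterior; a minimal sketch, assuming a minimization convention and the per-point posterior mean/std from the GP code above:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # EI(x) = E[max(y_best - f(x), 0)] for f(x) ~ N(mu, sigma^2)
    # (minimization: "improvement" = going below the best observed y_best).
    # A promising mean OR a large sigma both raise EI: this is where the
    # exploration/exploitation trade-off happens.
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive std
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```

The candidate maximizing EI is the next x to evaluate; PI and UCB plug different trade-offs between mu and sigma into the same loop.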
Other Surrogate Models instead of GP?
SMAC: Random Forests.
TPE: Tree-structured Parzen Estimators.
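A sketch of TPE's core idea for a single 1-D hyper-parameter, using scipy's Gaussian KDE to stand in for the Parzen estimators (the quantile gamma=0.25 and the KDE choice are illustrative assumptions; each group needs at least a couple of observations):

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_score(x_candidates, x_obs, y_obs, gamma=0.25):
    # Split observed configs into "good" (losses in the lowest gamma
    # quantile) and "bad", fit a density to each group -- l(x) and g(x) --
    # and rank candidates by the ratio l(x) / g(x); TPE proposes the
    # candidate maximizing this ratio.
    cut = np.quantile(y_obs, gamma)
    good, bad = x_obs[y_obs <= cut], x_obs[y_obs > cut]
    l, g = gaussian_kde(good), gaussian_kde(bad)
    return l(x_candidates) / (g(x_candidates) + 1e-12)
```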
Multi-fidelity Optimization
Bayesian optimization is (surrogate) model-based and sequential, whereas multi-fidelity optimization is more random-based and parallel.
Can we use cheap approximations of the objective f(x)?
Successive halving: a fixed schedule for early stopping (stop the bottom-50% performing models and let the top-50% continue); see the sketch below.
Hyperband (HB): a variant of successive halving where we run different early-stopping schedules (each group shares the same computational resource level, or fidelity r).
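A sketch of successive halving with halving factor eta=2, where `partial_train(config, budget)` is a hypothetical placeholder that trains a candidate at a given fidelity and returns its validation loss; Hyperband would run several such brackets with different starting budgets:

```python
def successive_halving(configs, partial_train, min_budget=1, eta=2):
    # configs: list of hyper-parameter candidates (e.g., sampled at random).
    # partial_train(config, budget): hypothetical hook -- train `config`
    # with `budget` resources (epochs, subset size, ...) and return a
    # validation loss at that fidelity.
    budget = min_budget
    while len(configs) > 1:
        losses = [partial_train(c, budget) for c in configs]
        ranked = sorted(range(len(configs)), key=lambda i: losses[i])
        configs = [configs[i] for i in ranked[: max(1, len(configs) // eta)]]
        budget *= eta  # survivors (top 1/eta) continue with more budget
    return configs[0]
```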
Model-based Multi-fidelity Optimization
Hyperband still samples hyper-parameter candidates uniformly; can we combine it with model-based (Bayesian) optimization? Yes.
BOHB: TPE-based Hyperband for each fidelity r.
Asynchronous Extensions
The above methods have asynchronous extensions, e.g., A-BO, A-RS, A-BOHB, A-HB, ...
Metadata
Hyper-parameter Tuning as Global Optimization Problem
Bayesian Optimization
(Canonical) Algorithm of Bayesian Optimization
Bayesian Optimization with Gaussian Processes (GP) Surrogate Model
Ingredient 1: GP
Ingredient 2: Acquisition Function
Other Surrogate Models instead of GP?
Multi-fidelity Optimization
Model-based Multi-fidelity Optimization
Asynchronous Extensions
NAS
More on GPs and Bayesian Optimization on Distill Pub