Zhang et al.; 2022
@misc{https://doi.org/10.48550/arxiv.2208.09632,
doi = {10.48550/ARXIV.2208.09632},
url = {https://arxiv.org/abs/2208.09632},
author = {Zhang, Yushun and Chen, Congliang and Shi, Naichen and Sun, Ruoyu and Luo, Zhi-Quan},
title = {Adam Can Converge Without Any Modification On Update Rules},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
Reddi et al. (2018) state that:
For any $β_1, β_2$ s.t. $0 ≤ β_1 < √β_2 < 1$, there exists a problem such that Adam diverges.
- Even though many variants of Adam have been proposed since then, Adam itself remains exceptionally popular.
- Without any modification to its update rules, Adam works well in practice.
- To construct the divergence example, Reddi et al. change the sample size $n$ for different $(β_1, β_2)$, whereas in practical settings $(β_1, β_2)$ is tuned after the sample size $n$ is fixed.
This paper suggests that:
- When $β_2$ is large enough and $β_1 < √β_2$, Adam converges to a neighborhood of critical points.
- For any fixed $n$, there exists a function such that Adam diverges to infinity when $(β_1, β_2)$ is picked in the red region.
- There is a phase transition from divergence to convergence as $β_2$ increases.
- The convergence and divergence regions of $(β_1, β_2)$ are problem-dependent.
- The "divergence region" of $(β_1, β_2)$ expands as $n$ increases and converges to the whole region $[0,1)^2$ as $n$ goes to infinity.
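
To make the roles of $n$, $β_1$ and $β_2$ concrete, here is a minimal sketch of unmodified Adam on a finite-sum objective with $n$ components. The function names (`beta_condition_holds`, `adam_finite_sum`), the constant step size, the bias correction and the fixed pass order are all assumptions of this sketch and are not taken from the notes above. It is not meant to reproduce the paper's exact setting or its divergence examples.

```python
import math


def beta_condition_holds(beta1, beta2):
    """Check the condition 0 <= beta1 < sqrt(beta2) < 1 from the notes."""
    return 0.0 <= beta1 < math.sqrt(beta2) < 1.0


def adam_finite_sum(grads, x0, beta1=0.9, beta2=0.999, lr=0.01,
                    eps=1e-8, epochs=2000):
    """Run plain Adam on f(x) = sum_i f_i(x) for a scalar x.

    `grads` is a list of n callables; grads[i](x) returns f_i'(x).
    One component gradient is used per step, in a fixed order
    (an assumption of this sketch).
    """
    x, m, v, t = x0, 0.0, 0.0, 0
    for _ in range(epochs):
        for grad_i in grads:
            t += 1
            g = grad_i(x)
            m = beta1 * m + (1 - beta1) * g        # first-moment estimate
            v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
            m_hat = m / (1 - beta1 ** t)           # bias correction
            v_hat = v / (1 - beta2 ** t)
            x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Hypothetical example with n = 2: f(x) = (x - 1)^2 + (x - 3)^2, minimized at x = 2.
component_grads = [lambda x: 2 * (x - 1), lambda x: 2 * (x - 3)]
x_final = adam_finite_sum(component_grads, x0=0.0)
```

Here $n$ is `len(grads)` and is fixed before $(β_1, β_2)$ are chosen, which matches the practical order of tuning described above. `beta_condition_holds` only checks the inequality. How large $β_2$ must be for convergence is problem-dependent and is not computed by this sketch.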