flat-sophia

The Sophia optimizer, with its updates further projected towards flat areas of the loss landscape.

Ideas come mainly from this paper by Wang et al.

They projected Adam towards a flatter area using the Hessian-vector product (Hvp). Since Sophia already uses the Hvp, we keep a cheap int8 mask here and use it to further project Sophia's update towards flatter areas.
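For readers unfamiliar with the term, here is a minimal sketch of what an Hvp is. It approximates the product with central finite differences of the gradient, which is a stand-in chosen to keep the example dependency-free; it is not necessarily how flat_sophia.py computes it, and a real implementation would typically get the Hvp from automatic differentiation. The names `grad_quadratic` and `hvp`, and the toy loss, are hypothetical.

```python
def grad_quadratic(x):
    # Hypothetical toy loss f(x) = 0.5 * (1 * x0^2 + 100 * x1^2).
    # Its Hessian is diag(1, 100): x0 is a flat direction, x1 a sharp one.
    return [1.0 * x[0], 100.0 * x[1]]


def hvp(grad_fn, x, v, eps=1e-3):
    # Central finite difference of the gradient along v:
    #   H v ~= (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus = grad_fn(x_plus)
    g_minus = grad_fn(x_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]


if __name__ == "__main__":
    # The second entry is about 100 times larger than the first,
    # flagging x1 as the sharp coordinate.
    print(hvp(grad_quadratic, [0.5, -0.2], [1.0, 1.0]))
```

Large Hvp entries mark coordinates where the loss curves steeply, which is the signal the sharpness mask relies on.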

A small experiment

run_experiment.py is a sort of worst-case scenario experiment in which a ViT is too wide and too shallow, making it prone to overfitting.

The baseline is the orange line and flat-sophia is the green line. Projecting updates towards flatter areas helped prevent overfitting and the accompanying rise in loss.

How it works

There are two pertinent values, sharp_fraction and dampening_factor. sharp_fraction is the fraction of sharpest updates that will be dampened, and dampening_factor is the factor by which they'll be scaled down. The example uses sharp_fraction=0.2 and dampening_factor=10.

Whenever the preconditioner is updated, we also update the sharpness mask: entries corresponding to the largest sharp_fraction of Hvp values are set to dampening_factor, and all other entries are set to 1. The final update is then divided elementwise by this mask.
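The mask construction described above can be sketched as follows. This is an illustration on plain Python lists, not the code in flat_sophia.py: the function names `sharpness_mask` and `apply_mask` are hypothetical, ranking by raw Hvp value (rather than magnitude) and rounding the count of sharp entries are assumptions, and a real implementation would operate on parameter arrays.

```python
def sharpness_mask(hvp_values, sharp_fraction=0.2, dampening_factor=10):
    # Number of entries treated as "sharp" (rounding is an assumption).
    n = len(hvp_values)
    k = round(sharp_fraction * n)
    # Indices sorted from largest to smallest Hvp value
    # (assumption: raw values are ranked, not absolute values).
    order = sorted(range(n), key=lambda i: hvp_values[i], reverse=True)
    # Mask holds only 1 or dampening_factor, so small integers suffice,
    # which is why an int8 mask is cheap to store.
    mask = [1] * n
    for i in order[:k]:
        mask[i] = dampening_factor
    return mask


def apply_mask(update, mask):
    # Divide the update elementwise by the mask: sharp entries shrink
    # by dampening_factor, the rest pass through unchanged.
    return [u / m for u, m in zip(update, mask)]


if __name__ == "__main__":
    hvp_values = [0.1, 5.0, 0.3, 0.2, 9.0, 0.4, 0.05, 0.6, 0.7, 0.8]
    update = [1.0] * len(hvp_values)
    mask = sharpness_mask(hvp_values)  # sharp_fraction=0.2, dampening_factor=10
    print(mask)
    print(apply_mask(update, mask))
```

With ten entries and sharp_fraction=0.2, the two entries with the largest Hvp values are divided by 10 and the other eight are left as they are.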
