Replies: 6 comments 2 replies
-
So, we have a lot of experience with Gin and tensor2tensor (and I personally worked on Trax as well). Though Gin can be a lifesaver when trying to bring sanity to an existing codebase that needs configuration management, we find that Gin tends to infect the entire codebase with its dependency-injection approach and introduces a number of issues, e.g. adding a lot of noise to stack traces, which makes debugging much less pleasant. If our users wish to use Gin, that's great: it can be a great way of organizing hparam settings, and we really don't want to dictate how users control their training loops. We just want to avoid adding that additional complexity to our examples for now. That said, we would like to settle on a better 'canonical' strategy for hparam management in Flax. The one demonstrated in the examples is a kick-the-can-down-the-road non-answer. A subset of design questions on the table:
So: we're still trying to find the right spot in the design space that will maximize utility for our users without getting in their way or precluding their own solutions. Further comments on the issue are most welcome!
-
Ack on timing, the importance of clear docs and clear debugging, not imposing a training strategy, and built-in support vs. canonical recommendations / anti-framework. A partial metaphor from another domain is OpenWC, which offers an opinionated docs site plus a generator. On your point above about unconstrained functional sub-classing: I know exactly what you mean. The reference to t2t was just noting its feature of range or set hparams. On hierarchy, it's worth acknowledging there are two kinds to consider, i.e. hierarchical configuration of a single model vs. a hierarchical lineage of configurations (perhaps the latter was the source of complexity in the t2t configs). Your reference to yaml vs. python config in the devops world is great food for thought (linked examples); in our case it's local + same language vs. remote + potentially a different one.

Free thinking... it may be useful to shift perspective from managing a library of independent hparam configs to specifying individual experiments, an experiment being the conjunction of settings, model, and training code (perhaps plus annotations for what is tunable), e.g. with each experiment specified by its own python module (a rough sketch follows below). In this latter pattern, arbitrary hierarchy, composition, etc. are supported in the natural way of writing a program, instead of looking for a non-program DSL that can configure arbitrary programs. The in-program specification of hparams won't break these out as cleanly as e.g. the Gin approach, but you could easily write an editor plugin that extracts and displays them.

... good to better understand the problem, and clearly the design context is complex. Thank you for the interesting discussion.
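To make the "experiment as a python module" idea concrete, here is a rough, hypothetical sketch. None of these names (`OptimizerConfig`, `ExperimentConfig`, the `tunable` annotation) come from Flax or t2t; they are purely illustrative.

```python
# experiments/baseline.py -- a hypothetical experiment module: the experiment
# *is* a program, so hierarchy and composition come for free from plain Python.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OptimizerConfig:
    learning_rate: float = 1e-3
    weight_decay: float = 0.0


@dataclass(frozen=True)
class ExperimentConfig:
    optimizer: OptimizerConfig = OptimizerConfig()
    batch_size: int = 128
    # Illustrative annotation of which settings an external tuner may vary.
    tunable: tuple = ("optimizer.learning_rate",)


# A derived experiment composes the baseline instead of copying it.
baseline = ExperimentConfig()
large_batch = replace(baseline, batch_size=1024)
```

The point is that "lineage" between experiments is expressed with ordinary Python (imports and `replace`) rather than a separate configuration DSL.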
-
I don't see why Flax has anything to do with hyper-parameter configuration. Flax is a library, and hparam configuration is something that should be chosen at the application level, not imposed by a library. One might wish to use Flax in applications that already have their own way of handling hyper-parameters, for example. Now if you mean "[...] in the Flax examples", then yeah, but it shouldn't "infect" the Flax library in any way imo.
-
We certainly aren't going to infect the core system with hparam-specific concerns; we just want to make sure we support underlying mechanisms that allow people to do what they want. Since the above was written, a pattern that seems to work well is to have a model-specific dataclass holding all the hparams, which the user feeds through the layers to minimize pointless "kwarg plumbing" (a sketch follows below). The one improvement the new api revision will make, without any special hparam logic, is allowing fine-grained hparams (e.g. quantization hparams for -every- parameter) via the same mechanism that feeds parameters and stateful variables into the module tree.
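A minimal sketch of that dataclass pattern, assuming the `flax.linen` API; `TransformerConfig` and its fields are illustrative, not part of Flax itself:

```python
from dataclasses import dataclass

import flax.linen as nn
import jax
import jax.numpy as jnp


@dataclass(frozen=True)
class TransformerConfig:
    # Hypothetical hyper-parameters; names are illustrative only.
    vocab_size: int = 1000
    hidden_dim: int = 128
    num_layers: int = 2
    dropout_rate: float = 0.1


class Block(nn.Module):
    # The whole config is threaded through as a single module attribute,
    # so adding a new hparam does not require touching every call site.
    config: TransformerConfig

    @nn.compact
    def __call__(self, x, *, train: bool = False):
        cfg = self.config
        x = nn.Dense(cfg.hidden_dim)(x)
        x = nn.relu(x)
        x = nn.Dropout(rate=cfg.dropout_rate, deterministic=not train)(x)
        return x


class Model(nn.Module):
    config: TransformerConfig

    @nn.compact
    def __call__(self, x, *, train: bool = False):
        for _ in range(self.config.num_layers):
            x = Block(self.config)(x, train=train)
        return nn.Dense(self.config.vocab_size)(x)


# Usage: the dataclass is the single source of truth for the model's hparams.
cfg = TransformerConfig(hidden_dim=64)
model = Model(cfg)
params = model.init(jax.random.PRNGKey(0), jnp.ones((1, 16, 32)))["params"]
```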
-
It's also important that these parameters can change during training without recompilation, i.e. that they are treated as inputs to the graph rather than as static values baked into the compiled computation.
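A minimal JAX sketch of that distinction (function and variable names are illustrative): a traced input can change every step without recompiling, while a static argument triggers a fresh compilation per value.

```python
from functools import partial

import jax
import jax.numpy as jnp


@jax.jit
def loss_traced(params, dropout_scale):
    # dropout_scale is an ordinary (traced) input: changing its value each
    # step reuses the same compiled graph.
    return jnp.sum(params * dropout_scale)


@partial(jax.jit, static_argnums=1)
def loss_static(params, num_layers):
    # num_layers is static: every new value triggers a recompilation, but it
    # can control Python-level structure such as loop length.
    out = params
    for _ in range(num_layers):
        out = out * 2.0
    return jnp.sum(out)


params = jnp.ones((4,))
loss_traced(params, 0.1)   # compiles once
loss_traced(params, 0.2)   # no recompilation
loss_static(params, 2)     # compiles
loss_static(params, 3)     # recompiles for the new static value
```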
-
Note that in the meantime ml_collections has been open sourced (https://pypi.org/project/ml-collections/) and we have started updating the examples to use it.
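A minimal sketch of the `ml_collections.ConfigDict` pattern, assuming the `ml_collections` package; the field names are illustrative, not taken from any particular example:

```python
import ml_collections


def get_config() -> ml_collections.ConfigDict:
    config = ml_collections.ConfigDict()
    config.learning_rate = 0.1
    config.batch_size = 128
    config.num_epochs = 10
    return config


config = get_config()
config.learning_rate = 0.01   # existing fields can be overridden
config.lock()                 # once locked, setting an unknown field raises an error
```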
-
Specifying hyper-parameters in a flat/non-hierarchical form is the most natural first approach, but as models grow in complexity such flat schemes can become unwieldy: both from the human perspectives of (1) understanding / intuiting how to improve them and (2) maintaining them, and from (3) the optimization perspective of a software tuner seeking to infer what a previous success indicates about what to try next.
Gin is an example of how to implement this (as is done in practice in the Trax library). This seems like a great approach, with only minor drawbacks from my perspective, e.g. that it feels a little unnatural when developing in a notebook to specify hparams as a block of strings instead of as python objects (see the sketch below).
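A minimal sketch of that string-based style, assuming the `gin-config` package; the `train` function and its parameters are illustrative:

```python
import gin


@gin.configurable
def train(learning_rate, num_layers):
    return learning_rate, num_layers


# Hparams are specified as a block of strings, which gin binds to the
# decorated function's parameters.
gin.parse_config("""
train.learning_rate = 1e-3
train.num_layers = 4
""")

print(train())  # (0.001, 4)
```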
Another requirement beyond hierarchy to consider is a means to specify allowable hyper-parameter ranges, as was done in the tensor2tensor library, which could then be used to configure tuning on CloudML; a generic sketch of the idea follows below.
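A generic, hypothetical sketch of the range idea (this is not the tensor2tensor API): tunable ranges are declared alongside the hparams so an external tuner can consume them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HParamSpec:
    default: float
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    scale: str = "linear"  # e.g. "linear" or "log"


# Illustrative search space; a tuner would read the ranges, while training
# code would read only the defaults (or tuner-supplied overrides).
search_space = {
    "learning_rate": HParamSpec(1e-3, 1e-5, 1e-1, scale="log"),
    "dropout_rate": HParamSpec(0.1, 0.0, 0.5),
}

defaults = {name: spec.default for name, spec in search_space.items()}
```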
A counterpoint to this feature would be the perspective that Flax is meant to be lower-level than such concerns, so this kind of configuration support need not be included in the core library.