Write some rough notes on scale propagation directions #13

DouglasOrr · 2023-11-10T13:37:07Z

Reviewing #11 sucked me into a thought vortex.

Better viewed as rendered markdown (e.g. here).

No firm conclusions, just a lot of words & thoughts; comments very welcome!

balancap · 2023-11-10T14:26:47Z

Thanks @DouglasOrr. Personally, I feel that if we can avoid the worst-case scenario, we really should try to. It would add a lot of complexity on top of the average case.

I think before we get there, there are a lot of directions to explore to keep things simple with the "average" case: trying different distribution modelling, different metrics for how well the output approximates a Gaussian, ... I believe that is where the real value of the project lies: managing to get that working. If we need to introduce a lot of complexity (multiple scale tracking, ...), it starts to feel like we are not improving the situation much compared to e.g. TransformerEngine (and we are moving towards a complex black-box issue).

DouglasOrr · 2023-11-10T14:55:33Z

Thanks, yes I agree w.r.t. going simple first. Certainly tracking both is more complex.

But I can't really see where only-worst-case is more complex than only-average-case. Do you have an example?

(In either case, div() etc cause problems that will require something special.)

balancap · 2023-11-10T15:13:36Z

Agreed, only worst case is not more complex. I think we should implement the different variations and test in practice what works best (it would be easy to add a strategy parameter to the @autoscale decorator for choosing the rule).
I think it is just my preference for probabilistic modelling steering me towards average-case.
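
To make the "strategy parameter" idea concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and for illustration only: `ScaleStrategy`, `add_scale` and `autoscale_sketch` are made-up names, not the library's `autoscale` API, and the two rules shown for `add` (root-sum-of-squares for average case, plain sum for worst case) are assumed rather than taken from the notes.

```python
import functools
import math
from enum import Enum


class ScaleStrategy(Enum):
    """Hypothetical choice of scale propagation rule."""

    AVERAGE_CASE = "average"
    WORST_CASE = "worst"


def add_scale(a_scale: float, b_scale: float, strategy: ScaleStrategy) -> float:
    """Output scale of `a + b` under each (assumed) rule."""
    if strategy is ScaleStrategy.AVERAGE_CASE:
        # Assumes independent inputs, so variances add.
        return math.sqrt(a_scale**2 + b_scale**2)
    # Worst case: magnitudes can line up, so the scales add.
    return a_scale + b_scale


def autoscale_sketch(strategy: ScaleStrategy = ScaleStrategy.AVERAGE_CASE):
    """Hypothetical decorator that passes the chosen rule to the wrapped function."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            return fn(*args, strategy=strategy, **kwargs)

        return wrapped

    return decorator


@autoscale_sketch(strategy=ScaleStrategy.WORST_CASE)
def scaled_add(a_scale: float, b_scale: float, strategy: ScaleStrategy) -> float:
    return add_scale(a_scale, b_scale, strategy)


worst = scaled_add(1.0, 1.0)
average = add_scale(1.0, 1.0, ScaleStrategy.AVERAGE_CASE)
```

Swapping the rule then becomes a one-argument change at the decorator, which would make it cheap to compare the variations in practice.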

thecharlieblake · 2023-11-10T15:57:14Z

Thanks for this Doug! Proposal 1 seems like it could be a viable scheme. I think this is pretty close to the vision of autoscale I started with, but with some of my mental holes filled in. I share some of Paul's concerns about the complexity of 2.

I hadn't thought about the layernorm/div stuff and I must admit this all scares me a bit. I'm a bit worried for the poor user trying to debug their scales and scratching their heads as to why this strange stuff is happening with pow and div, or why A+A != 2*A.
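
One plausible reading of the `A+A != 2*A` gotcha, sketched under assumptions: if the average-case rule for `add` treats its operands as independent, it predicts a scale of `sqrt(2) * s` for `A + A`, while multiplying by the constant 2 predicts `2 * s`. The function names below are hypothetical and the rules are my guess at the kind of rule meant, not copied from the notes.

```python
import math
import random
import statistics


def add_scale_average_case(a_scale: float, b_scale: float) -> float:
    # Assumed rule: operands are independent, so variances add.
    return math.sqrt(a_scale**2 + b_scale**2)


def mul_const_scale(a_scale: float, c: float) -> float:
    # Multiplying by a constant rescales the data exactly.
    return abs(c) * a_scale


random.seed(0)
a = [random.gauss(0.0, 1.0) for _ in range(10_000)]
a_scale = 1.0  # nominal scale of the unit Gaussian samples above

predicted_add = add_scale_average_case(a_scale, a_scale)  # sqrt(2) * a_scale
predicted_mul = mul_const_scale(a_scale, 2.0)  # 2 * a_scale
measured = statistics.pstdev([x + x for x in a])  # empirical std of A + A

print(predicted_add, predicted_mul, measured)
```

The two predictions disagree because `A` is perfectly correlated with itself, which breaks the independence assumption behind the `add` rule; the measured standard deviation sits close to the `2 * A` prediction.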

Apologies if this is a little severe(!), but the approaches we've been considering feel a bit mk2-ipu to me - designed to work well more-or-less out-the-box for some mainstream cases, but with a warning not to go off-piste or things might not behave as expected. And if things do break, going under the hood is going to be hard for users who aren't familiar with the intricacies of our distributional assumptions and some of the gotchas you outline.

Write some rough notes on scale propagation directions

c323ff1

DouglasOrr requested review from balancap and samhosegood November 10, 2023 13:37

DouglasOrr mentioned this pull request Nov 10, 2023

Gaussian scaled ops #11

Closed

thecharlieblake and others added 3 commits November 13, 2023 15:27

Create scaleprop_proposal_2_20231110.md

3aa81aa

Rename scaleprop_proposal_2_20231110.md to scaleprop_proposal_b_20231…

c1c3805

…110.md

Add a pytorch prototype of dtype-like scale propagation

913c238

balancap closed this Jun 12, 2024

balancap deleted the scaleprop-notes branch June 17, 2024 13:47
