Write some rough notes on scale propagation directions #13

Closed
wants to merge 4 commits into from

DouglasOrr

Reviewing #11 sucked me into a thought vortex.

Better viewed as rendered markdown (e.g. here).

No firm conclusions, just a lot of words & thoughts; comments very welcome!

@balancap
Contributor

Thanks @DouglasOrr. Personally, I feel that if we can avoid the worst-case scenario, we should really try. It would add a lot of complexity on top of the average-case one.

I think before we get there, there are a lot of directions to explore to keep things simple with the "average" case: trying different distribution modelling, different metrics for approximating the output as Gaussian, ... I believe that is where the real value of the project lies: getting that to work. If we need to introduce a lot of complexity (multiple scale tracking, ...), it starts to feel like we are not improving the situation much compared to e.g. TransformerEngine (and we are moving towards a complex black-box problem).

@DouglasOrr
Author

Thanks, yes I agree w.r.t. going simple first. Certainly tracking both is more complex.

But I can't really see where only-worst-case is more complex than only-average-case. Do you have an example?

(In either case, div() etc. cause problems that will require something special.)

@balancap
Contributor

Agree, only-worst-case is not more complex. I think we should implement the different variations and test in practice what works best (it would be easy to add a strategy parameter to the @autoscale decorator for choosing the rule).
I think it is just my preference for probabilistic modelling steering me towards average-case.

@thecharlieblake

Thanks for this Doug! Proposal 1 seems like it could be a viable scheme. I think this is pretty close to the vision of autoscale I started with, but with some of my mental holes filled in. I share some of Paul's concerns about the complexity of 2.

I hadn't thought about the layernorm/div stuff and I must admit this all scares me a bit. I'm a bit worried for the poor user trying to debug their scales and scratching their heads as to why this strange stuff is happening with pow and div, or why A+A != 2*A.
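
One plausible way the `A+A != 2*A` surprise could arise, sketched under assumptions: the rules below are illustrative guesses (an average-case add rule that treats its operands as independent, and a constant-multiply rule that scales by `|c|`), not rules confirmed by this thread. The function names are hypothetical.

```python
import math


def add_scale_average(sa: float, sb: float) -> float:
    """Assumed average-case rule: operands treated as independent, zero-mean."""
    return math.sqrt(sa**2 + sb**2)


def add_scale_worst(sa: float, sb: float) -> float:
    """Assumed worst-case rule: bound on the magnitude of the sum."""
    return sa + sb


def mul_const_scale(c: float, s: float) -> float:
    """Assumed rule for multiplying by a constant c."""
    return abs(c) * s


s = 1.0
avg_sum_scale = add_scale_average(s, s)   # scale assigned to A + A (average-case)
worst_sum_scale = add_scale_worst(s, s)   # scale assigned to A + A (worst-case)
doubled_scale = mul_const_scale(2.0, s)   # scale assigned to 2 * A

# Average-case gives sqrt(2) * s for A + A but 2 * s for 2 * A, because the
# rule cannot see that both operands are the same tensor.
print(avg_sum_scale, worst_sum_scale, doubled_scale)
```

Under these assumed rules the represented values of `A+A` and `2*A` still agree; it is the propagated scale that differs in the average case, while the worst-case rule gives the same scale for both.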

Apologies if this is a little severe(!), but the approaches we've been considering feel a bit mk2-ipu to me: designed to work well more-or-less out of the box for some mainstream cases, but with a warning not to go off-piste or things might not behave as expected. And if things do break, going under the hood is going to be hard for users who aren't familiar with the intricacies of our distributional assumptions and some of the gotchas you outline.

@balancap balancap closed this Jun 12, 2024
@balancap balancap deleted the scaleprop-notes branch June 17, 2024 13:47
3 participants