-
-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Elide stack generation outside of looping control flow #1195
base: master
Are you sure you want to change the base?
Conversation
Wow |
Awesome, really nice work @ToucheSir. If this is based on @Keno's original code it probably makes sense to add a co-author to the commit? (Alternatively you could treat this as an update to his branch, but that might be a hassle.) I may be able to help with review if I get some time (but please don't wait up if someone else gets there first). |
c00c2b1
to
cf7acfb
Compare
Thanks @MikeInnes! Treating this as a branch update is a little beyond my ability since the original PR was filed before the IRTools transition, but I've now tagged the commit with co-authorship info. |
Trying to track references in issues, the guess is that this is the solution to TuringLang/Turing.jl#1754 or am I missing something? If so, is this PR sufficiently solid that it can be checked (on julia 1.7) or should I wait until it is merged? |
Please do check this. It may not make too much difference in the compilation but it should help with control flow heavy code. Besides it's a good idea to test against Turing in general. We should add that to our downstream tests if we can get a subsection of the testset that sufficiently checks for Zygote correctness. |
Friendly bump on this :) |
I just came across this, and I'll that this is huge for anything that uses DIstributions.jl (which we do in Turing.jl) due to the amount of if-statements in StatsFuns.jl/Distributions.jl. I've literally shaved off days of runtime for certain large models with Zygote by spending a grueling amount of effort tracking down if-statements in StatsFuns.jl and removing them. I'm currently trying to do some benchmarks to see exactly what sort of effect it has on both runtime and compile time for our use-cases. |
@ToucheSir would you rebase? |
36453ca
to
143f929
Compare
Co-authored-by: Keno Fischer <keno@juliacomputing.com>
143f929
to
7bdfe94
Compare
So it unfortuantely seems to significantly increase compilation time (and memory usage) in the example in TuringLang/Turing.jl#1754. For 15 tilde-statements, it blows out my 32GB mem laptop using this PR while the memory overhead for the current release (I haven't tested against |
Regarding the increase in compile-time, you can also observe this for the currently running tets, e.g. |
2/4 failures on nightly and all failures on stable+LTS should be squashed now. The remaining 2 nightly ones are because of a missing rule and have been reported at JuliaDiff/ChainRules.jl#684.
This one has been mysteriously timing out before this PR as well. I'll have another look at TuringLang/Turing.jl#1754 though. Last I checked (around the time of #1195 (comment)) the changes here didn't make a difference to latency, so perhaps the compiler has become smarter since... |
This PR ports @Keno's work on #78 to 2022 Zygote.
Because IRTools and base Julia have slightly different IR representations, some tweaks were necessary for the core algorithm:
IRTools.dominators
I'm not sure.forward_stacks!
now theoretically runs inO(blocks * alphas)
instead ofO(max(blocks, alphas))
now, in practice the vast majority of alphas will be eliminated very quickly (if not in the first loop iteration).Performance Comparison
The
afoldl
example is particularly interesting because of how that function is defined. Despite the presence of a loop at the end, not requiring stacks for the block of conditionals is significantly faster. This could have immediate downstream impact for code like FluxML/Flux.jl#1809 (comment).Next Steps
The Zygote test suite passes locally for me, so if CI + downstream is green then I think this should be a drop-in replacement for the current compiler code path. Per the comments, more optimizations may be possible for aspects such as calculating self-reachability. After looking through a bunch of IRTools code, there's probably a lot of low hanging fruit to optimize there as well.