Elide stack generation outside of looping control flow #1195

ToucheSir · 2022-04-05T06:25:34Z

This PR ports @Keno's work on #78 to 2022 Zygote.

Because IRTools and base Julia have slightly different IR representations, some tweaks were necessary for the core algorithm:

Instead of inserting phi nodes, we need to add block arguments. This is a bit more tedious because it requires updating multiple blocks.
On the bright side, we don't need to calculate an iterated dominance frontier for each block. I'm not sure whether any savings from that are wiped out by calling IRTools.dominators.
Blocks are iterated over in reverse order. This allows us to iteratively narrow down the number of unaccounted-for alpha vars. Although forward_stacks! now theoretically runs in O(blocks * alphas) instead of O(max(blocks, alphas)), in practice the vast majority of alphas will be eliminated very quickly (if not in the first loop iteration).

Performance Comparison

using Zygote, BenchmarkTools

function qux(a, b, x) # Simple control flow
   aa = a ? sin(x) : cos(x)
   bb = b ? sech(aa) : tanh(aa)
   return bb
end

foldminus(xs) = Base.afoldl(-, xs...) # afoldl is very branch-heavy

xs = ntuple(identity, 16)

julia> @time gradient(qux, true, false, 1.0);
  0.146199 seconds (60.84 k allocations: 3.519 MiB, 99.73% compilation time) # 0.6.37
  0.135723 seconds (52.94 k allocations: 3.086 MiB, 99.86% compilation time) # This PR

julia> @btime gradient(qux, true, false, 1.0);
  3.378 μs (46 allocations: 1.31 KiB) # 0.6.37
  3.044 μs (35 allocations: 720 bytes) # This PR

julia> @time gradient(foldminus, xs);
  4.785566 seconds (11.53 M allocations: 616.818 MiB, 2.59% gc time, 99.97% compilation time) # 0.6.37
  4.428252 seconds (11.97 M allocations: 660.290 MiB, 3.03% gc time, 99.97% compilation time) # This PR

julia> @btime gradient(foldminus, $xs);
  111.256 μs (506 allocations: 20.30 KiB) # 0.6.37
  151.316 ns (8 allocations: 848 bytes) # This PR

The afoldl example is particularly interesting because of how that function is defined. Despite the presence of a loop at the end, not requiring stacks for the block of conditionals is significantly faster. This could have immediate downstream impact for code like FluxML/Flux.jl#1809 (comment).
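To make the stack-versus-no-stack distinction concrete, here is a hand-written sketch in plain Julia. It is not the code Zygote generates: the function names `qux_pullback` and `iterated_sin_pullback` are hypothetical and the derivative rules are written out manually. It only illustrates why a branch-only function can keep its intermediates in a closure, while a loop has to record one value per iteration.

```julia
# Hand-written illustration only; this is not the code Zygote generates.

# Branch-only function (same shape as `qux` above): every intermediate value is
# computed at most once, so the pullback can simply close over it.
function qux_pullback(a::Bool, b::Bool, x::Float64)
    aa = a ? sin(x) : cos(x)
    bb = b ? sech(aa) : tanh(aa)
    function back(dbb)
        # d/daa sech(aa) = -sech(aa) * tanh(aa); d/daa tanh(aa) = 1 - tanh(aa)^2
        daa = b ? -sech(aa) * tanh(aa) * dbb : (1 - tanh(aa)^2) * dbb
        dx = a ? cos(x) * daa : -sin(x) * daa
        return dx
    end
    return bb, back
end

# Looping function: the body runs n times, so each iteration's input is pushed
# onto a stack and popped in reverse order during the backward pass.
# Because `back` pops the stack, it can only be called once.
function iterated_sin_pullback(x::Float64, n::Int)
    stack = Float64[]
    y = x
    for _ in 1:n
        push!(stack, y)   # record the input of this iteration
        y = sin(y)
    end
    function back(dy)
        for _ in 1:n
            dy *= cos(pop!(stack))   # d/dy sin(y) = cos(y)
        end
        return dy
    end
    return y, back
end

y1, back1 = qux_pullback(true, false, 1.0)
y2, back2 = iterated_sin_pullback(1.0, 3)
```

The first pullback allocates no stack at all, which is the kind of saving this PR aims for in non-looping control flow; the second shows the bookkeeping that remains necessary inside loops.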

Next Steps

The Zygote test suite passes locally for me, so if CI + downstream is green then I think this should be a drop-in replacement for the current compiler code path. Per the comments, more optimizations may be possible for aspects such as calculating self-reachability. After looking through a bunch of IRTools code, there's probably a lot of low-hanging fruit to optimize there as well.
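For readers unfamiliar with the term, self-reachability can be sketched on a toy control-flow graph. The graph, the block numbering and the helper names `reaches` and `self_reachable` below are all made up for illustration; this is neither the algorithm used in this PR nor the IRTools API, just a naive worklist search under the assumption that a block belongs to a loop exactly when it can reach itself.

```julia
# Toy CFG as successor lists; the block ids and edges are invented for this example.
# 1 -> 2 or 3 (a conditional), 2 -> 4, 3 -> 4, 4 -> 5, 5 -> 5 or 6 (a loop), 6 exits.
succs = Dict(1 => [2, 3], 2 => [4], 3 => [4], 4 => [5], 5 => [5, 6], 6 => Int[])

# Return true if `target` can be reached from `start` by following at least one edge.
function reaches(succs, start, target)
    seen = Set{Int}()
    worklist = copy(succs[start])
    while !isempty(worklist)
        b = pop!(worklist)
        b == target && return true
        b in seen && continue
        push!(seen, b)
        append!(worklist, succs[b])
    end
    return false
end

# A block is self-reachable (part of a loop) if it can reach itself.
self_reachable(succs) = sort([b for b in keys(succs) if reaches(succs, b, b)])

loop_blocks = self_reachable(succs)
```

In this toy graph only block 5 is self-reachable, so under the assumption above only values defined there would need a stack. Running one search per block is quadratic in the worst case, which is one reason a smarter computation could be worthwhile.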

CarloLucibello · 2022-04-05T07:17:46Z

Wow

MikeInnes · 2022-04-22T13:40:53Z

Awesome, really nice work @ToucheSir. If this is based on @Keno's original code it probably makes sense to add a co-author to the commit? (Alternatively you could treat this as an update to his branch, but that might be a hassle.)

I may be able to help with review if I get some time (but please don't wait up if someone else gets there first).

ToucheSir · 2022-04-23T22:52:44Z

Thanks @MikeInnes! Treating this as a branch update is a little beyond my ability since the original PR was filed before the IRTools transition, but I've now tagged the commit with co-authorship info.

jlperla · 2022-05-03T20:17:42Z

I'm trying to track references across issues: my guess is that this is the solution to TuringLang/Turing.jl#1754, or am I missing something?

If so, is this PR sufficiently solid that it can be checked (on Julia 1.7), or should I wait until it is merged?

DhairyaLGandhi · 2022-05-03T20:22:04Z

Please do check this. It may not make too much difference to compilation, but it should help with control-flow-heavy code. Besides, it's a good idea to test against Turing in general. We should add that to our downstream tests if we can get a subsection of the testset that sufficiently checks for Zygote correctness.

ToucheSir · 2022-08-10T05:16:22Z

Friendly bump on this :)

torfjelde · 2022-11-10T18:48:05Z

I just came across this, and I'll say that this is huge for anything that uses Distributions.jl (which we do in Turing.jl) due to the number of if-statements in StatsFuns.jl/Distributions.jl. I've literally shaved off days of runtime for certain large models with Zygote by spending a grueling amount of effort tracking down if-statements in StatsFuns.jl and removing them.

I'm currently trying to do some benchmarks to see exactly what sort of effect it has on both runtime and compile time for our use-cases.

CarloLucibello · 2022-11-10T18:58:52Z

@ToucheSir would you rebase?

Co-authored-by: Keno Fischer <keno@juliacomputing.com>

torfjelde · 2022-11-11T13:10:20Z

So it unfortunately seems to significantly increase compilation time (and memory usage) in the example in TuringLang/Turing.jl#1754. For 15 tilde-statements, it blows out my 32GB mem laptop using this PR, while the current release (I haven't tested against master) has minimal memory overhead (it still takes ages to compile).

torfjelde · 2022-11-11T13:17:00Z

Regarding the increase in compile time, you can also observe this in the currently running tests, e.g. DiffEqFlux.jl/Layers. Atm it has been running for ~6hrs, while in the previously merged PR it seems to have only taken ~20mins: https://github.com/FluxML/Zygote.jl/actions/runs/3260268471/jobs/5353708714

ToucheSir · 2022-11-11T14:38:31Z

2/4 failures on nightly and all failures on stable+LTS should be squashed now. The remaining 2 nightly ones are because of a missing rule and have been reported at JuliaDiff/ChainRules.jl#684.

> e.g. DiffEqFlux.jl/Layers. Atm it has been running for ~6hrs, while in the previously merged PR it seems to have only taken ~20mins:

This one has been mysteriously timing out before this PR as well. I'll have another look at TuringLang/Turing.jl#1754 though. Last I checked (around the time of #1195 (comment)) the changes here didn't make a difference to latency, so perhaps the compiler has become smarter since...

ToucheSir changed the title ~~Elide stack generation outside of non-looping control flow~~ Elide stack generation outside of looping control flow Apr 5, 2022

DhairyaLGandhi requested a review from Keno April 5, 2022 07:36

ToucheSir mentioned this pull request Apr 22, 2022

[WIP] Make FastDEQs fast again SciML/DeepEquilibriumNetworks.jl#45

Merged

ToucheSir force-pushed the bc/stack-elision branch from c00c2b1 to cf7acfb Compare April 23, 2022 22:50

darsnack mentioned this pull request Apr 27, 2022

Improved time to first gradient FluxML/Metalhead.jl#151

Merged

mcabbott added the performance label Jul 4, 2022

ToucheSir mentioned this pull request Aug 19, 2022

Attach rule to mapfoldl_impl not foldl JuliaDiff/ChainRules.jl#569

Open

DhairyaLGandhi approved these changes Nov 10, 2022

ToucheSir force-pushed the bc/stack-elision branch 3 times, most recently from 36453ca to 143f929 Compare November 11, 2022 07:01

Elide stack generation outside of non-looping control flow

7bdfe94

Co-authored-by: Keno Fischer <keno@juliacomputing.com>

ToucheSir force-pushed the bc/stack-elision branch from 143f929 to 7bdfe94 Compare November 11, 2022 07:30

torfjelde mentioned this pull request Nov 11, 2022

Zygote's compilation scales badly with the number of ~ statements TuringLang/Turing.jl#1754

Closed

ToucheSir mentioned this pull request Feb 9, 2023

Type instability with conditional return #371

Open
