Simple benchmarking for Tracker, ReverseDiff, and Zygote #1140
cc: @phipsgabler
There is a more comprehensive one available here: https://github.com/TuringLang/TuringExamples/blob/kx/improve_functionality/benchmarks/auto_diff/loop-vs-unrolled.ipynb
Numbers using PyTorch (via ThArrays):
julia> using ThArrays
julia> @btime ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)
This example might be over-favourable to […]
@KDr2 can you check whether the type instability shown below is an issue?
julia> @code_warntype ThArrays.gradient(mysum, x);
Variables
  #self#::Core.Compiler.Const(ThArrays.gradient, false)
  f::Core.Compiler.Const(mysum, false)
  data::Tuple{Array{Float64,1}}
Body::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
1 ─ %1 = Core.tuple(ThArrays.C_NULL, #self#, f)::Core.Compiler.Const((Ptr{Nothing} @0x0000000000000000, ThArrays.gradient, mysum), false)
│   %2 = Core._apply(ThArrays.:(#gradient#30), %1, data)::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
└──      return %2
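For readers less familiar with @code_warntype: the Union in the Body line above is what signals the instability. A minimal, self-contained toy illustration of the same kind of flag (these functions are made up, not ThArrays code):
julia> unstable(flag) = flag ? 1 : 1.0   # return type depends on a runtime value
julia> @code_warntype unstable(true)     # Body::Union{Float64, Int64} is flagged, as above
julia> stable(flag) = flag ? 1.0 : 2.0   # always Float64
julia> @code_warntype stable(true)       # Body::Float64, nothing flagged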
@yebai Adding
+tensor_from_ptr(p::Ptr{Tensor{T, N}}) where {T, N} = Tensor{T, N}(p, nothing)
and changing
-function grad(self::Tensor)
+function grad(self::Tensor{T, N}) where {T, N}
     outputs__ = Int[0]
     __cret = ccall((:atg_grad, :libtorch_capi),
                    Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}),
                    outputs__, self.pointer)
-    __o_1 = tensor_from_ptr(Ptr{Cvoid}(outputs__[1]))
+    __o_1 = tensor_from_ptr(Ptr{Tensor{T, N}}(outputs__[1]))
     return __o_1
 end
can erase the type warnings, but I think (and from my experiments with this example) it has little influence on the benchmark results. Maybe do a comparison between ThArrays and PyTorch to see whether the result is sensible?
Can you code this example in C++ and report the run time?
#include <torch/torch.h>
#include <torch/csrc/autograd/variable.h>
#include <torch/csrc/autograd/function.h>
#include <typeinfo>
#include <iostream>

int main() {
  // 10000-element tensor with gradients enabled, matching the size used in the Julia benchmark
  torch::Tensor t = torch::rand({10000}, torch::requires_grad(1));
  torch::Tensor r(t[0].clone());
  r.set_requires_grad(1);
  for (int i = 0; i < 10000; i++) {
    r += t[i];
  }
  r.backward();
  // std::cout << t.grad() << std::endl;
  t.grad();
  return 0;
}

You can replace […]. On my machine, it takes 0.5s while the Julia version takes 1.25s. I think in the Julia version most of the time is spent on the indexing ops. In C++, these ops are a single function call (and maybe inlined), while in the Julia version each op involves many function calls. If you use […]. I don't know if we can compile our function to TorchScript; if we can, these time-consuming ops would disappear at runtime.
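To make the "most of the time goes into per-element tracked ops" point concrete, here is a small sketch using only calls that already appear in this thread (x and mysum are redefined locally; exact numbers will vary by machine):
using BenchmarkTools, Tracker
x = rand(10_000)
mysum(x) = begin z = 0; for i = 1:length(x); z += x[i]; end; return z; end
@btime Tracker.gradient($mysum, $x);   # one tracked getindex plus one tracked + per element
@btime Tracker.gradient(sum, $x);      # a single tracked reduction, almost no per-element work
The gap between the two calls is essentially the per-element tracing overhead described above.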
Are you running this on Windows?
This result (0.5s) seems consistent with Julia when using […]
No, on Linux; that was a typo, the binary is an ELF, sorry for the confusion.
On my (old) machine, it takes 1.25s in Julia and 0.5s in C++.
Thanks, @KDr2.
Based on the C++ runtime for PyTorch, maybe the superiority of […]
This sounds plausible for a subset of Julia, given that Julia's syntax is so close to Python's (TorchScript is a subset of Python).
I'm wondering if adding […]
In real-world practice, we may not do indexing at such a frequency, e.g. we should use […]. About TorchScript, I didn't mean a source-to-source translation; I think we can construct the computational graph using Julia, then save the graph as TorchScript or some internal format, and then call it in Turing.
That works too, but at the cost of losing control flow: loops are unrolled, and branches are recorded only as they were taken during tracing.
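For a concrete Julia analogue of this caveat, ReverseDiff's compiled tapes behave the same way; the function f below is made up purely for illustration:
using ReverseDiff
f(x) = x[1] > 0 ? sum(x .* x) : sum(x)   # the branch depends on the input value
tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, [1.0, 2.0]))  # records the x[1] > 0 branch
g = zeros(2)
ReverseDiff.gradient!(g, tape, [-1.0, 2.0])
# g == [-2.0, 4.0]: the tape replays the sum(x .* x) branch recorded earlier,
# whereas f itself would take the sum(x) branch here and give [1.0, 1.0].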
The numbers are slightly better with @inbounds:
julia> mysum_inbounds(x) = begin z = 0; @inbounds for i = 1:length(x); z += x[i]; end; return z; end
mysum_inbounds (generic function with 1 method)
julia> @btime ReverseDiff.gradient(mysum_inbounds, x);
  2.436 ms (70022 allocations: 2.92 MiB)
julia> @btime Tracker.gradient(mysum_inbounds, x);
  439.518 ms (150016 allocations: 767.59 MiB)
julia> @btime Zygote.gradient(mysum_inbounds, x);
  856.985 ms (280102 allocations: 1.50 GiB)
julia> @btime ThArrays.gradient(mysum_inbounds, x);
  465.540 ms (160017 allocations: 7.32 MiB)
I guess you should also use […]
Thanks @devmotion, I reran the benchmarks with […]. Interestingly, I did another test to separate the effects of indexing (mysum2 hoists the single getindex out of the loop):
julia> mysum2(x) = begin z = 0; _x = x[1]; for i = 1:10000; z += _x; end; return z; end
mysum2 (generic function with 1 method)
julia> @btime ReverseDiff.gradient(mysum2, [1.0]);
  2.396 ms (60021 allocations: 2.23 MiB)
julia> @btime Tracker.gradient(mysum2, [1.0]);
  945.712 μs (60022 allocations: 1.83 MiB)
julia> @btime Zygote.gradient(mysum2, [1.0]);
  7.260 ms (100108 allocations: 3.54 MiB)
julia> @btime ThArrays.gradient(mysum2, [1.0]);
  69.599 ms (40026 allocations: 1.68 MiB)
These results, together with the ones above, suggest: […]
PS: […]
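One further baseline that is not in the thread but helps put these numbers in perspective: timing the primal mysum2 with no AD at all, so the plain loop cost can be separated from the tracing overhead.
using BenchmarkTools
x = [1.0]
mysum2(x) = begin z = 0; _x = x[1]; for i = 1:10000; z += _x; end; return z; end
@btime mysum2($x);   # primal only, no tape: compare against the AD timings above to see how much is tracing overhead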
Tracker is actually the fastest in the loop benchmark, isn't it?
You're right! I didn't notice the unit!
Interesting
On the flip side, ReverseDiff and Zygote are both slower than Tracker in code with broadcasting. |
But I am trying to make RD faster.
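The broadcasting benchmark referred to here is not shown in the thread; a minimal sketch of what such a comparison could look like (bsum and the array size are made up for illustration):
using BenchmarkTools, ReverseDiff, Tracker, Zygote
x = rand(10_000)
bsum(x) = sum(exp.(x) .+ x .* x)   # purely broadcasted, no scalar loop
@btime ReverseDiff.gradient($bsum, $x);
@btime Tracker.gradient($bsum, $x);
@btime Zygote.gradient($bsum, $x);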
I did some optimizations on ThArrays (compintell/THArrays.jl@c021717); indexing is a little faster (10%?) now.
I just re-ran the same benchmarks. Here is an update:
julia> @btime ReverseDiff.gradient(mysum, x);
  1.194 ms (70023 allocations: 3.07 MiB)
julia> @btime Tracker.gradient(mysum, x);
  86.576 ms (190012 allocations: 769.27 MiB)
julia> @btime Zygote.gradient(mysum, x);
  90.346 ms (180107 allocations: 769.40 MiB)
The following is a simple benchmark of loops with three different reverse-mode AD implementations in Julia. At the moment, in terms of efficiency: ReverseDiff > Tracker > Zygote.
Benchmark setup:
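The setup code did not survive in this copy of the issue. A plausible reconstruction, inferred from mysum_inbounds above and the 10000-element C++ version (the exact original may differ):
using BenchmarkTools, ReverseDiff, Tracker, Zygote
mysum(x) = begin z = 0; for i = 1:length(x); z += x[i]; end; return z; end
x = rand(10_000)
@btime ReverseDiff.gradient(mysum, x);
@btime Tracker.gradient(mysum, x);
@btime Zygote.gradient(mysum, x);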