Monte Carlo simulation (Single Tape optimisation) + GPU applications #91

stebos100 · 2023-11-01T10:04:56Z

Hello everyone!

I have a general question regarding the use of XAD in a Monte Carlo setting. In a simplified, general MC setting, a tape is recorded per path, and the adjoints of that path are then computed to obtain the first-order derivatives. This is repeated for each of the n paths and the results are averaged. Each reverse sweep yields a linear combination of the rows of that path's Jacobian, weighted by the seeded output adjoints.
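In symbols, the averaging described above is the usual pathwise estimator. The notation is assumed for illustration and is not from the thread: θ denotes the model inputs, Z_i the random draws of path i, and f the per-path payoff.

```latex
\frac{\partial}{\partial \theta}\, \mathbb{E}\big[f(\theta, Z)\big]
\;\approx\;
\frac{1}{n} \sum_{i=1}^{n} \frac{\partial f(\theta, Z_i)}{\partial \theta}
```

Each term in the sum is one reverse sweep over the tape of path i with the output adjoint seeded to 1. This assumes the derivative and the expectation can be interchanged, which is not guaranteed for discontinuous payoffs.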

In essence, however, this regenerates a tape for every path, even though the structure of the tape generally does not change; only the values fed into the model do. Has this ever been considered as an MC optimisation, i.e. a code-generation/JIT approach in which a single tape is recorded once and then executed in parallel?

Secondly, has anybody been able to use XAD in a GPU setting? I would be very curious to see whether this is possible.

Looking forward to hearing from everyone.

auto-differentiation-dev · 2023-11-04T14:40:21Z

Maintainer

Hi Stephan,

Regarding Monte Carlo/JIT, please refer to issue #70 for details. It is important to note that recording the tape only once is valid only if you can guarantee that the code path taken does not depend on the inputs in any way. If there are branches, data-dependent iterations, different polymorphic calls, etc. (the general case), reusing the tape is unsafe and wrong derivatives would be calculated. With that in mind, users can use the JIT approach mentioned in the issue, at their own risk, once it is implemented.
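To make the branch hazard concrete, here is a minimal sketch of "record once, replay with new inputs". It uses a toy tape written for this illustration only: `MiniTape`, `Op`, `Node` and `recordF` are hypothetical names and are not part of XAD or of the approach discussed in issue #70.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy tape for illustration only; this is NOT XAD's tape or API.
enum class Op { Add, Mul };

struct Node {
    Op op;
    std::size_t lhs, rhs;  // slot indices of the two operands
};

struct MiniTape {
    std::size_t numInputs = 0;
    std::vector<Node> nodes;  // node k writes to slot numInputs + k

    std::size_t record(Op op, std::size_t lhs, std::size_t rhs) {
        nodes.push_back({op, lhs, rhs});
        return numInputs + nodes.size() - 1;
    }

    // Re-evaluate the recorded operations for new inputs, then run the
    // reverse sweep. Returns d(last slot)/d(inputs).
    std::vector<double> replayGradient(const std::vector<double>& inputs) const {
        assert(inputs.size() == numInputs && numInputs > 0);
        std::vector<double> val(inputs);
        val.resize(numInputs + nodes.size());
        for (std::size_t k = 0; k < nodes.size(); ++k) {
            const Node& n = nodes[k];
            val[numInputs + k] = (n.op == Op::Add) ? val[n.lhs] + val[n.rhs]
                                                   : val[n.lhs] * val[n.rhs];
        }
        std::vector<double> adj(val.size(), 0.0);
        adj.back() = 1.0;  // seed the output adjoint
        for (std::size_t k = nodes.size(); k-- > 0;) {
            const Node& n = nodes[k];
            const double a = adj[numInputs + k];
            if (n.op == Op::Add) {
                adj[n.lhs] += a;
                adj[n.rhs] += a;
            } else {
                adj[n.lhs] += a * val[n.rhs];
                adj[n.rhs] += a * val[n.lhs];
            }
        }
        adj.resize(numInputs);
        return adj;
    }
};

// f(x) = x * x if x > 1, otherwise x + x.
// The recording only captures the branch taken for this particular x.
MiniTape recordF(double x) {
    MiniTape t;
    t.numInputs = 1;
    if (x > 1.0)
        t.record(Op::Mul, 0, 0);
    else
        t.record(Op::Add, 0, 0);
    return t;
}
```

Recording at x = 2 and replaying at x = 3 stays on the same branch and gives the right derivative. Replaying the same tape at x = 0.5 silently differentiates x * x instead of x + x, so the result is wrong without any error being raised.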

Regarding GPUs: a dynamically growing tape for each GPU thread is not practical and would be very inefficient. For most code, even a fixed-size tape would not fit into local or shared memory, while global memory access in every GPU thread is highly inefficient. We therefore recommend including GPU functions in a CPU-managed tape using the external functions interface, that is, implementing the adjoint (reverse) pass manually as another GPU kernel. This gives the highest performance, and since GPU kernels are typically small, implementing the corresponding adjoint kernel by hand is usually feasible.

5 replies

stebos100 Nov 7, 2023
Author

Hi All,

Thanks so much for getting back to me. The point regarding issue #70 is clear.

Regarding the use of an external function for GPU calculations: am I correct in assuming that one would generate the tape object on the CPU and perform the Monte Carlo calculation on the GPU as an external function, which in turn would calculate its adjoints using a checkpointing strategy, i.e. chunking the tape into checkpoints and computing the adjoints on the fly by running each tape section forward and propagating backwards until the external function's adjoints have been calculated? These results would then be transferred back to the CPU, where the back-propagation continues?

And a final question: do you know whether this has been done using CUDA, i.e. is XAD's object-oriented approach compatible with CUDA?

Thanks again for the response, looking forward to hearing from you!

auto-differentiation-dev Nov 14, 2023
Maintainer

Hi @stebos100,

Yes, you are right. The tape would stay on the CPU, and you would need two GPU kernels for each function accelerated on the GPU: one for the forward evaluation and one for back-propagating the adjoints (given the output adjoints). The external function interface documentation shows how to do that:

- the CPU-based tape is cut,
- the inputs are converted back to plain double and processed in the external function (GPU forward kernel),
- the reverse kernel is registered on the tape as a callback,
- the output values are converted back to the active type on the CPU and the calculation continues.

During tape.computeAdjoints(), XAD triggers the callback for the reverse section at the right point. There you need to pass the output adjoints to the reverse GPU kernel, which back-propagates them to the kernel inputs. The adjoints of these inputs then need to be incremented by the resulting values (on the CPU) before the sweep continues.
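The flow described above can be sketched in plain C++, with ordinary CPU functions standing in for the two GPU kernels. Everything here is a hypothetical stand-in written for illustration: `HostTape`, `forwardKernel`, `reverseKernel` and `sumExpWithGradient` are not XAD names, and the real external function interface looks different.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// CPU stand-in for the forward GPU kernel: y[i] = exp(x[i]) on plain doubles.
std::vector<double> forwardKernel(const std::vector<double>& x) {
    std::vector<double> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = std::exp(x[i]);
    return y;
}

// CPU stand-in for the reverse GPU kernel: maps output adjoints ybar to the
// input adjoint contributions xbar[i] = ybar[i] * exp(x[i]).
std::vector<double> reverseKernel(const std::vector<double>& x,
                                  const std::vector<double>& ybar) {
    assert(x.size() == ybar.size());
    std::vector<double> xbar(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) xbar[i] = ybar[i] * std::exp(x[i]);
    return xbar;
}

// Toy stand-in for a CPU-managed tape: adjoint slots plus callbacks that are
// run in reverse order of registration. NOT XAD's tape.
struct HostTape {
    std::vector<double> adjoint;
    std::vector<std::function<void(HostTape&)>> callbacks;
    void computeAdjoints() {
        for (auto it = callbacks.rbegin(); it != callbacks.rend(); ++it) (*it)(*this);
    }
};

struct ValueAndGradient {
    double value;
    std::vector<double> gradient;
};

// z = sum_i exp(x[i]); the exp part plays the role of the GPU section.
ValueAndGradient sumExpWithGradient(const std::vector<double>& x) {
    const std::size_t n = x.size();
    HostTape tape;
    tape.adjoint.assign(2 * n, 0.0);  // slots [0, n): inputs, [n, 2n): kernel outputs

    // Cut the tape and run the forward kernel on plain doubles.
    const std::vector<double> y = forwardKernel(x);

    // Register the reverse kernel as a callback for the reverse sweep.
    tape.callbacks.push_back([x, n](HostTape& t) {
        std::vector<double> ybar(t.adjoint.begin() + n, t.adjoint.end());
        const std::vector<double> xbar = reverseKernel(x, ybar);
        for (std::size_t i = 0; i < n; ++i) t.adjoint[i] += xbar[i];  // increment, not overwrite
    });

    // Continue on the CPU: z = sum(y); its reverse step sets ybar[i] += zbar = 1.
    double z = 0.0;
    for (double yi : y) z += yi;
    tape.callbacks.push_back([n](HostTape& t) {
        for (std::size_t i = 0; i < n; ++i) t.adjoint[n + i] += 1.0;
    });

    tape.computeAdjoints();
    return {z, std::vector<double>(tape.adjoint.begin(), tape.adjoint.begin() + n)};
}
```

The point of the sketch is the ordering: the callback registered for the GPU section runs after the later CPU operations have filled in the output adjoints, and it adds to the input adjoints rather than overwriting them, since the inputs may also be used elsewhere on the tape.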

This has been done with CUDA before with significant performance benefits.

stebos100 Nov 20, 2023
Author

Hi all,

Thanks so much for getting back to me. My only remaining question concerns implementing the manual computeAdjoint() for the reverse kernel. In the examples the function is a simple sum, so the adjoint is easy to code by hand. But if the algorithm is inherently more complex, one would still have to hand-code the adjoint function, which is far more strenuous. Am I mistaken about this? Or is there a way to do the compute-adjoints callback without having to hand-code the adjoint function?

Thanks so much for all the help, it's really appreciated.

auto-differentiation-dev Nov 22, 2023
Maintainer

Hi @stebos100,

You're right: you would need to hand-code the adjoint function in this case. Since most CUDA kernels are not too complex, that is usually feasible, but it depends on the code, of course.
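As a rough idea of what hand-coding involves for something slightly richer than a sum, here is a sketch of a per-path payoff and its hand-written adjoint, plus a finite-difference check as a safety net. It is plain C++ rather than CUDA so it can be run anywhere; the payoff (a discounted call on a geometric Brownian motion terminal value) and all names are assumptions chosen for illustration, not code from XAD.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Adjoints {
    double s0;
    double sigma;
};

// Forward pass for one path: discounted call payoff on the terminal value
// sT = s0 * exp((r - sigma^2 / 2) * T + sigma * sqrt(T) * z).
double pathPayoff(double s0, double sigma, double r, double T, double K, double z) {
    const double sT = s0 * std::exp((r - 0.5 * sigma * sigma) * T + sigma * std::sqrt(T) * z);
    const double payoff = sT > K ? sT - K : 0.0;
    return std::exp(-r * T) * payoff;
}

// Hand-coded reverse pass: given the adjoint of the output (outBar), return
// the adjoints of s0 and sigma. Statements mirror the forward pass in reverse.
Adjoints pathPayoffAdjoint(double s0, double sigma, double r, double T, double K,
                           double z, double outBar) {
    assert(T > 0.0);
    // Recompute the forward intermediates needed by the reverse sweep.
    const double sqrtT = std::sqrt(T);
    const double e = std::exp((r - 0.5 * sigma * sigma) * T + sigma * sqrtT * z);
    const double sT = s0 * e;

    // Reverse sweep.
    const double payoffBar = std::exp(-r * T) * outBar;
    const double sTBar = sT > K ? payoffBar : 0.0;  // derivative of max(sT - K, 0)
    const double s0Bar = e * sTBar;                 // sT = s0 * e
    const double exponentBar = sT * sTBar;          // d(s0 * exp(u)) / du = sT
    const double sigmaBar = (-sigma * T + sqrtT * z) * exponentBar;
    return {s0Bar, sigmaBar};
}

// Largest absolute gap between the hand-coded adjoints and central
// finite differences, as a sanity check on the hand-written code.
double maxFiniteDifferenceError(double s0, double sigma, double r, double T,
                                double K, double z) {
    const double h = 1e-5;
    const Adjoints a = pathPayoffAdjoint(s0, sigma, r, T, K, z, 1.0);
    const double fdS0 =
        (pathPayoff(s0 + h, sigma, r, T, K, z) - pathPayoff(s0 - h, sigma, r, T, K, z)) / (2 * h);
    const double fdSigma =
        (pathPayoff(s0, sigma + h, r, T, K, z) - pathPayoff(s0, sigma - h, r, T, K, z)) / (2 * h);
    return std::max(std::fabs(a.s0 - fdS0), std::fabs(a.sigma - fdSigma));
}
```

A check like `maxFiniteDifferenceError(100.0, 0.2, 0.05, 1.0, 100.0, 0.3)` should come out very small when the adjoint is coded correctly. The comparison is only meaningful away from the kink at sT = K, where the payoff is not differentiable.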

If external help is needed, some companies might be able to help, e.g. Xcelerit. You can reach out to them via their website: https://www.xcelerit.com/adjoint-algorithmic-differentiation/

stebos100 Nov 28, 2023
Author

Alright, great, that is clear. Thank you so much!

I have one final question. If we generate the tape for, let's say, a function f(x1, x2, x3, ..., xn), is there a way to keep the generated tape, change the input variables, and roll back the tape for the new inputs (given that the tape is generic for the function and the function is branchless)? Essentially, the structure of the tape would remain the same for the function, so we could keep it; all that changes is the initial values fed into the tape.

Once again, thanks for all the help.
