Slow jitting in module #5822

tetterl · 2021-02-19T18:38:18Z

tetterl
Feb 19, 2021

(Similar problem as in: #5693)

I'm trying to write some modules that apply transformations. These transformations depend on something config-like and take quite a while to compute, so I initialize them in the module's constructor. The following is a dummy example.

import time
from functools import partial

from jax import jit, random

key = random.PRNGKey(0)
transformation = random.uniform(key, (10000000,))  # expensive computation
x = random.uniform(key, (10000000,))


class Module:
    def __init__(self, transformation):
        # expensive computation that depends on some config passed to the module
        self.transformation = transformation

    @partial(jit, static_argnums=(0,))
    def f(self, x):
        return x * self.transformation


m = Module(transformation)
start = time.time()
m.f(x)
print('jit time: ' + str(time.time() - start))

resulting in:

WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
jit time: 0.3965442180633545

Using a pure function:

@jit
def f(x, transformation):
    return x * transformation


start = time.time()
f(x, transformation)
print('jit time: ' + str(time.time() - start))

results in jit time: 0.02122330665588379.

Does capturing the self.transformation variable simply take that long? What is the recommended way to write such modules? (Note that in my case there are multiple nested modules, and passing around the data/transformation would be cumbersome and impractical.)

Answered by froystig

froystig · 2021-02-20T05:20:39Z

froystig
Feb 20, 2021
Maintainer

The difference between m.f and f in your example is that m.f embeds a 10000000-entry f32 vector as a constant in the compiled computation (due to the static_argnums flag), whereas f takes it as a parameter. Timing the f call doesn't account for transferring that large vector to device memory. I suspect that timing the m.f one does, hence the longer wait.

One option is to write something along the lines of:

@jit
def _f(x, t): return x * t

class Module:
  # ...
  def f(self, x): return _f(x, self.t)

where t takes the role of what you call transformation in your example.
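
A fuller, runnable version of this pattern might look like the sketch below. The constructor, the small array size, and the final check are assumptions added for illustration; the snippet above elides the class body.

```python
import jax.numpy as jnp
from jax import jit, random


@jit
def _f(x, t):
    # t is an ordinary argument of the jitted function, so it is passed
    # as a parameter instead of being captured from the enclosing object.
    return x * t


class Module:
    def __init__(self, t):
        # Stand-in for the expensive, config-dependent setup (hypothetical).
        self.t = t

    def f(self, x):
        # Un-jitted wrapper that forwards the stored array to the jitted function.
        return _f(x, self.t)


key_x, key_t = random.split(random.PRNGKey(0))
x = random.uniform(key_x, (1000,))
m = Module(random.uniform(key_t, (1000,)))

# JAX dispatches asynchronously; block before timing or reading the result.
y = m.f(x).block_until_ready()
print(jnp.allclose(y, x * m.t))
```

Callers still see only `m.f(x)`; the array stays an attribute of the module and is handed to the compiled function on each call.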

tetterl · 2021-02-21T21:16:29Z

tetterl
Feb 21, 2021
Author

Thanks for the explanation. That's quite surprising that the embedding takes such a long time. Unfortunately, I have multiple nested modules and can only jit the outermost functions.
In my use case I only want to expose the Module.f(x) function to a solver and all constants should be hidden from the solver. Furthermore, the solver is in charge of the jitting. I guess I have to resort to letting the solver also take charge of the constants.
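
One pattern that is not proposed in this thread, but that the JAX documentation describes for jitting methods, is to register the module as a pytree so that its arrays reach the compiled function as parameters rather than as captured values. The sketch below is only an illustration under assumptions: it assumes that whoever applies `jit` can pass the module itself as an argument, and the class layout and the `apply` helper are hypothetical.

```python
import jax.numpy as jnp
from jax import jit, random
from jax.tree_util import register_pytree_node_class


@register_pytree_node_class
class Module:
    def __init__(self, transformation):
        self.transformation = transformation

    def f(self, x):
        return x * self.transformation

    def tree_flatten(self):
        # children: arrays (or nested pytrees) passed as parameters;
        # aux_data: static, hashable metadata (none in this sketch).
        return (self.transformation,), None

    @classmethod
    def tree_unflatten(cls, aux_data, children):
        return cls(*children)


# The module is an argument of the jitted function, so its array is
# flattened into the function's inputs instead of being closed over.
apply = jit(lambda module, x: module.f(x))

key_x, key_t = random.split(random.PRNGKey(0))
x = random.uniform(key_x, (1000,))
m = Module(random.uniform(key_t, (1000,)))
y = apply(m, x)
print(jnp.allclose(y, x * m.transformation))
```

Because pytrees nest, a module whose children are themselves registered modules would be flattened the same way. Note that this does not help if the solver jits a bound method such as `m.f` directly, since `m` is then captured again.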

froystig · 2021-02-22T16:39:47Z

froystig
Feb 22, 2021
Maintainer

> That's quite surprising that the embedding takes such a long time.

Embedding isn't what takes time per se, so much as transferring a large array from/to the device.

Before it can be run, a program and its data must be sent to device memory. If an array is an embedded constant (via static_argnums), it is transferred along with the program. If an array is taken as a parameter, then it was either transferred directly to the device before the call, or it is the outcome of a previous on-device computation.

Your array is the latter. It is a large random array computed on device from a much smaller input (the RNG key). This line in your example only transfers the key, and then holds a pointer to the resulting array:

transformation = random.uniform(key, (10000000,))  # expensive computation

This avoids a large transfer. When later captured as an embedded constant, 10000000 floating point numbers are pulled from the device, embedded into the compiled program as a constant, and then sent back with the program. Using a parameter lets you accept the pointer directly instead.
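
The two ways a parameter can already be on the device before the call can be sketched as follows. This is an illustrative example rather than code from the thread: the host array `host_t` and the small sizes are assumptions, and `block_until_ready` is used so that each timing waits for the asynchronous computation to finish.

```python
import time

import jax
import jax.numpy as jnp
import numpy as np
from jax import jit, random


@jit
def f(x, t):
    return x * t


n = 1000
x = random.uniform(random.PRNGKey(0), (n,))

# Case 1: the array is the outcome of an on-device computation.
t_computed = random.uniform(random.PRNGKey(1), (n,))

# Case 2: the array starts on the host and is transferred explicitly,
# once, before the call.
host_t = np.linspace(0.0, 1.0, n, dtype=np.float32)
t_transferred = jax.device_put(host_t)

# The first iteration also includes compilation; the second reuses the
# compiled program because shapes and dtypes match.
for name, t in [("computed", t_computed), ("transferred", t_transferred)]:
    start = time.time()
    f(x, t).block_until_ready()
    print(name, time.time() - start)
```

In both cases `f` receives a reference to data that is already on the device, so the call itself does not have to move the array.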

2 replies

harsh306 Jun 13, 2021

@froystig Is there a best practice for jitting functions that use random vectors in high dimensions? Or should I just generate all the random vectors first and then pass them to the jitted function?
I ask because this might be slowing down my compilation too (https://github.com/harsh306/continuation-jax/blob/main/cjax/continuation/methods/corrector/perturb_parc_evolve.py#L114).

harsh306 Jun 15, 2021

new post at #6972
