
37) Differentiation#


Last time#

  • Gradient descent

  • Nonlinear models

Today#

  1. Computing derivatives

    1.1 Numeric

    1.2 Analytic by hand

    1.3 Algorithmic (automatic) differentiation

  2. Ill-conditioned optimization

1. Computing derivatives#


We know the definition of the difference quotient from Calculus:

\[\lim_{h\to 0} \frac{f(x+h) - f(x)}{h}\]

  • How should we choose \(h\)?

Taylor series#


Classical accuracy analysis assumes that functions are sufficiently smooth, meaning that derivatives exist and Taylor expansions are valid within a neighborhood. In particular,

\[ f(x+h) = f(x) + f'(x) h + f''(x) \frac{h^2}{2!} + \underbrace{f'''(x) \frac{h^3}{3!} + \dotsb}_{O(h^3)} . \]

The big-\(O\) notation is meant in the limit \(h\to 0\). Specifically, a function \(g(h) \in O(h^p)\) (sometimes written \(g(h) = O(h^p)\)) when there exists a constant \(C\) such that

\[ |g(h)| \le C h^p \]

for all sufficiently small \(h\).

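To make the \(O(h)\) claim concrete, here is a quick numeric check (my own illustration, not part of the original notebook; the helper fd_error is a made-up name): the leading error of the one-sided difference is \(f''(x) h / 2\), so halving \(h\) should roughly halve the error.

fd_error(f, df, x, h) = (f(x + h) - f(x)) / h - df(x)   # forward-difference error against the exact derivative

for h in (1e-2, 5e-3, 2.5e-3)
    println("h = ", h, "   error = ", fd_error(sin, cos, 1.0, h))
end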

Rounding error#


We have an additional source of error, rounding error, which comes from not being able to compute \(f(x)\) or \(f(x+h)\) exactly, nor subtract them exactly. Suppose that we can, however, compute these functions with a relative error on the order of \(\epsilon_{\text{machine}}\). This leads to

\[\begin{split}
\tilde f(x) &= f(x)(1 + \epsilon_1) \\
\tilde f(x \oplus h) &= \tilde f\big((x+h)(1 + \epsilon_2)\big) \\
&= f\big((x + h)(1 + \epsilon_2)\big)(1 + \epsilon_3) \\
&= \big[f(x+h) + f'(x+h)(x+h)\epsilon_2 + O(\epsilon_2^2)\big](1 + \epsilon_3) \\
&= f(x+h)(1 + \epsilon_3) + f'(x+h)x\epsilon_2 + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h)
\end{split}\]


Tedious error propagation#

\[\begin{split}
\left\lvert \frac{\tilde f(x+h) \ominus \tilde f(x)}{h} - \frac{f(x+h) - f(x)}{h} \right\rvert &=
  \left\lvert \frac{f(x+h)(1 + \epsilon_3) + f'(x+h)x\epsilon_2 + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h) - f(x)(1 + \epsilon_1) - f(x+h) + f(x)}{h} \right\rvert \\
  &\le \frac{|f(x+h)\epsilon_3| + |f'(x+h)x\epsilon_2| + |f(x)\epsilon_1| + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}}h)}{h} \\
  &\le \frac{\big(2 \max_{[x,x+h]} |f| + \max_{[x,x+h]} |f' x|\big) \epsilon_{\text{machine}} + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h)}{h} \\
  &= \big(2\max|f| + \max|f'x|\big) \frac{\epsilon_{\text{machine}}}{h} + O(\epsilon_{\text{machine}})
\end{split}\]

where we have assumed that \(h \ge \epsilon_{\text{machine}}\). This error becomes large (relative to \(f'\) – we are concerned with relative error after all).

What can be problematic:

  • \(f\) is large compared to \(f'\)

  • \(x\) is large

  • \(h\) is too small (see the sweep below)
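The tradeoff above can be seen directly by sweeping \(h\) over several decades (an added illustration; the exact digits depend on the machine): the rounding contribution grows like \(\epsilon_{\text{machine}}/h\) while the truncation error shrinks like \(O(h)\), so the total error is smallest near \(h \approx \sqrt{\epsilon_{\text{machine}}} \approx 10^{-8}\).

for h in 10.0 .^ (-14:2:-2)          # h from 1e-14 up to 1e-2
    err = (sin(1.0 + h) - sin(1.0)) / h - cos(1.0)
    println("h = ", h, "   error = ", err)
end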

Automatic step size selection#


Reference: Numerical Optimization.

  • Walker and Pernice

  • Dennis and Schnabel
diff(f, x; h=1e-8) = (f(x+h) - f(x)) / h

function diff_wp(f, x; h=1e-8)
    """Diff using Walker and Pernice (1998) choice of step"""
    h *= (1 + abs(x))
    (f(x+h) - f(x)) / h
end

x = 1000
diff_wp(sin, x) - cos(x)
-4.139506408429305e-6
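This is consistent with the Taylor analysis above: diff_wp uses \(h = 10^{-8}(1 + |x|) \approx 10^{-5}\), and the leading truncation error of the one-sided difference is \(f''(x)\, h/2 = -\sin(1000) \cdot h/2 \approx -4 \cdot 10^{-6}\).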

1.1 Symbolic differentiation#

using Pkg
+Pkg.add("Symbolics")
+
+using Symbolics
+
+@variables x
+Dx = Differential(x)
+
+y = sin(x)
+Dx(y)
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
\[ \frac{\mathrm{d} \sin\left( x \right)}{\mathrm{d}x} \]
expand_derivatives(Dx(y))
\[ \cos\left( x \right) \]

Awesome, what about products?

y = x
for _ in 1:2
    y = cos(y^pi) * log(y)
end
expand_derivatives(Dx(y))
\[ \frac{\left( \frac{\cos\left( x^{\pi} \right)}{x} - 3.1416 x^{2.1416} \log\left( x \right) \sin\left( x^{\pi} \right) \right) \cos\left( \cos^{3.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{3.1416} \right)}{\log\left( x \right) \cos\left( x^{\pi} \right)} - \left( \frac{3.1416 \cos^{3.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{2.1416}}{x} - 9.8696 \cos^{2.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{3.1416} x^{2.1416} \sin\left( x^{\pi} \right) \right) \log\left( \log\left( x \right) \cos\left( x^{\pi} \right) \right) \sin\left( \cos^{3.1416}\left( x^{\pi} \right) \left( \log\left( x \right) \right)^{3.1416} \right) \]
  • The size of these expressions can grow exponentially (a rough measurement follows below)
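A rough way to see the swell, added here as a sketch (it reuses the Symbolics setup x and Dx from above; the printed length of the expression is only a crude proxy for its size, and the y^2 nesting is my own choice):

y = x
for depth in 1:4
    global y = cos(y^2) * log(y)                       # one more level of nesting
    println(depth, " => ", length(string(expand_derivatives(Dx(y)))))
end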

1.2 Hand-coding (analytic) derivatives#

\[ df = f'(x) dx \]
function f(x)
    y = x
    for _ in 1:2
        a = y^pi
        b = cos(a)
        c = log(y)
        y = b * c
    end
    y
end

f(1.9), diff_wp(f, 1.9)
(-1.5346823414986814, -34.032439961925064)
function df(x, dx)
    # forward mode: propagate the derivative dy alongside each intermediate value y
    y = x
    dy = dx
    for _ in 1:2
        a = y ^ pi
        da = pi * y^(pi - 1) * dy
        b = cos(a)
        db = -sin(a) * da
        c = log(y)
        dc = dy / y
        y = b * c
        dy = db * c + b * dc
    end
    y, dy
end

df(1.9, 1)
(-1.5346823414986814, -34.032419599140475)

We can go the other way#


We can differentiate a composition \(h(g(f(x)))\) as

\[\begin{align}
  \operatorname{d} h &= h' \operatorname{d} g \\
  \operatorname{d} g &= g' \operatorname{d} f \\
  \operatorname{d} f &= f' \operatorname{d} x.
\end{align}\]

What we’ve done above is called “forward mode”, and amounts to placing the parentheses in the chain rule like

\[ \operatorname d h = \frac{dh}{dg} \left(\frac{dg}{df} \left(\frac{df}{dx} \operatorname d x \right) \right) .\]

The expression means the same thing if we rearrange the parentheses,

\[ \operatorname d h = \left( \left( \left( \frac{dh}{dg} \right) \frac{dg}{df} \right) \frac{df}{dx} \right) \operatorname d x \]

which we can compute in reverse order via

\[ \underbrace{\bar x}_{\frac{dh}{dx}} = \underbrace{\bar g \frac{dg}{df}}_{\bar f} \frac{df}{dx} .\]

A reverse mode example#

\[ \underbrace{\bar x}_{\frac{dh}{dx}} = \underbrace{\bar g \frac{dg}{df}}_{\bar f} \frac{df}{dx} .\]
function g(x)
    a = x^pi
    b = cos(a)
    c = log(x)
    y = b * c
    y
end

(g(1.4), diff_wp(g, 1.4))
(-0.32484122107701546, -1.2559760384500684)
function gback(x, y_)
    a = x^pi
    b = cos(a)
    c = log(x)
    y = b * c
    # backward pass: accumulate the adjoints (bar values) in reverse order
    c_ = y_ * b
    b_ = c * y_
    a_ = -sin(a) * b_
    x_ = 1/x * c_ + pi * x^(pi-1) * a_
    x_
end

gback(1.4, 1)
-1.2559761698835525
1.3 Automatic differentiation: Zygote.jl#

using Pkg
+Pkg.add("Zygote")
+
+import Zygote
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
Zygote.gradient(f, 1.9)
(-34.03241959914049,)
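Note that this agrees with the hand-coded forward-mode result from df(1.9, 1) to nearly all digits, while the finite-difference value from diff_wp matched only about six digits.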
g(x) = exp(x) + x^2
@code_llvm Zygote.gradient(g, 1.9)
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:146 within `gradient`
define [1 x double] @julia_gradient_4658(double %0) #0 {
top:
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:147 within `gradient`
; ┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:88 within `pullback` @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:90
; │┌ @ In[11]:1 within `g`
; ││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface2.jl:87 within `_pullback`
; │││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface2.jl within `macro expansion`
; ││││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/chainrules.jl:224 within `chain_rrule`
; │││││┌ @ /home/valeria/.julia/packages/ChainRulesCore/6Pucz/src/rules.jl:138 within `rrule` @ /home/valeria/.julia/packages/ChainRules/DbWAz/src/rulesets/Base/fastmath_able.jl:56
        %1 = call double @j_exp_4660(double %0)
; └└└└└└
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:148 within `gradient`
; ┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:91 within `#78`
; │┌ @ In[11]:1 within `g`
; ││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/chainrules.jl:212 within `ZBack`
; │││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/lib/number.jl:12 within `literal_pow_pullback`
; ││││┌ @ promotion.jl:423 within `*` @ float.jl:411
       %2 = fmul double %0, 2.000000e+00
; │└└└└
; │┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/lib/lib.jl:17 within `accum`
; ││┌ @ float.jl:409 within `+`
     %3 = fadd double %2, %1
; └└└
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:149 within `gradient`
  %unbox3.fca.0.insert = insertvalue [1 x double] zeroinitializer, double %3, 0
  ret [1 x double] %unbox3.fca.0.insert
}

How does Zygote work?#

square(x) = x^2
@code_llvm square(1.5)
;  @ In[12]:1 within `square`
define double @julia_square_4682(double %0) #0 {
top:
; ┌ @ intfuncs.jl:332 within `literal_pow`
; │┌ @ float.jl:411 within `*`
    %1 = fmul double %0, %0
    ret double %1
; └└
}
@code_llvm Zygote.gradient(square, 1.5)
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:146 within `gradient`
define [1 x double] @julia_gradient_4686(double %0) #0 {
top:
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:148 within `gradient`
; ┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:91 within `#78`
; │┌ @ In[12]:1 within `square`
; ││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/chainrules.jl:212 within `ZBack`
; │││┌ @ /home/valeria/.julia/packages/Zygote/nyzjS/src/lib/number.jl:12 within `literal_pow_pullback`
; ││││┌ @ promotion.jl:423 within `*` @ float.jl:411
       %1 = fmul double %0, 2.000000e+00
; └└└└└
;  @ /home/valeria/.julia/packages/Zygote/nyzjS/src/compiler/interface.jl:149 within `gradient`
  %unbox2.fca.0.insert = insertvalue [1 x double] zeroinitializer, double %1, 0
  ret [1 x double] %unbox2.fca.0.insert
}
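The compiled gradient of square is a single fmul by 2.0: the pullback machinery has been optimized away entirely, so the generated code is the same as what you would write by hand for the derivative \(2x\).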

Kinds of algorithmic differentiation#

  • Source transformation: Fortran code in, Fortran code out

    • Duplicates compiler features, usually incomplete language coverage

    • Produces efficient code

  • Operator overloading: C++ types

    • Hard to vectorize

    • Loops are effectively unrolled/inefficient

  • Just-in-time compilation: tightly coupled with compiler

    • JIT lag

    • Needs dynamic language features (JAX) or tight integration with compiler (Zygote, Enzyme)

    • Some sharp bits in Python’s JAX

Forward or reverse?#


It all depends on the shape.

  • If you have one input, many outputs: use forward mode

    • “One input” can be looking in one direction

  • If you have many inputs, one output: use reverse mode

    • Will need to traverse execution backwards (“tape”)

    • Hierarchical checkpointing

  • What about square cases (same number of input/output)? Forward is usually a bit more efficient. (A small sketch contrasting the two modes follows this list.)
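A small sketch contrasting the two modes (an added illustration; it assumes the ForwardDiff.jl package, which is not used elsewhere in these notes, alongside the Zygote import above):

using Pkg
Pkg.add("ForwardDiff")          # extra package assumed for this sketch
import ForwardDiff

# Many inputs, one output: reverse mode returns the whole gradient in one backward pass.
loss(c) = sum(abs2, c) / 2
Zygote.gradient(loss, [1.0, 2.0, 3.0])      # expect ([1.0, 2.0, 3.0],)

# One input, many outputs: forward mode carries a single directional derivative forward.
curve(t) = [cos(t), sin(t), t^2]
ForwardDiff.derivative(curve, 0.5)          # tangent vector at t = 0.5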

2. Ill-conditioned optimization#

\[ L(c; x, y) = \frac 1 2 \lVert \underbrace{f(x, c) - y}_{r(c)} \rVert^2 \]

Recall that computing the gradient of the loss function \(L\) requires the Jacobian \(J\) of the model \(f\) with respect to the constants \(c\).

\[ g(c) = \nabla_c L = r^T \underbrace{\nabla_c f}_{J} \]

We can find the constants \(c\) for which \(g(c) = 0\) using a Newton method

\[ g(c + \delta c) = g(c) + \underbrace{\nabla_c g}_H\delta c + O((\delta c)^2) \]

The Hessian requires the second derivative of \(f\), which can cause problems.

\[ H = J^T J + r^T (\nabla_c J) \]
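Spelling this out (the same formula as above, obtained by applying the product rule to \(g = r^T J\) and using \(\nabla_c r = J\)):

\[ H = \nabla_c g = \nabla_c \left( r^T J \right) = \underbrace{(\nabla_c r)^T}_{J^T} J + r^T (\nabla_c J) = J^T J + r^T (\nabla_c J) . \]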

Newton-like methods for optimization#


We want to solve

\[ H \delta c = -g(c) \]

Update

\[c \gets c + \gamma \delta c\]

using a line search or trust region.
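A minimal Gauss-Newton sketch, added as an illustration (not the lecture's code): drop the second-derivative term so \(H \approx J^T J\), solve for the step, and damp with \(\gamma\). The function name gauss_newton, the synthetic exponential-fit data, and the use of ForwardDiff for the Jacobian are all assumptions made for this example.

using LinearAlgebra
import ForwardDiff               # assumed here; any way of computing the Jacobian works

function gauss_newton(r, c; gamma=1.0, maxit=20, tol=1e-10)
    for _ in 1:maxit
        J = ForwardDiff.jacobian(r, c)   # Jacobian of the residual
        g = J' * r(c)                    # gradient of (1/2)‖r‖²
        norm(g) < tol && break
        dc = -(J' * J) \ g               # Gauss-Newton step: (JᵀJ) δc = -g
        c = c + gamma * dc
    end
    c
end

# fit y ≈ c₁ exp(c₂ x) to synthetic (noise-free) data
xdata = range(0, 1, length=20)
ydata = 2 .* exp.(0.5 .* xdata)
r(c) = c[1] .* exp.(c[2] .* xdata) .- ydata
gauss_newton(r, [1.0, 0.0])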

Outlook on optimization#

  • The optimization problem can be solved using a Newton method. It can be onerous to implement the needed derivatives.

  • The Gauss-Newton method is often more practical than Newton while being faster than gradient descent, though it lacks robustness.

  • The Levenberg-Marquardt method provides a sort of middle ground between Gauss-Newton and gradient descent.

  • Many globalization techniques are used for models that possess many local minima.

  • One pervasive approach is stochastic gradient descent, where small batches (e.g., 1, 10 or 20) are selected randomly from the corpus of observations (500 in the example we’ve seen with many realizations), and a step of gradient descent is applied to that reduced set of observations. This helps to escape saddle points and weak local minima. (A minimal sketch follows this list.)

  • Among expressive models \(f(x,c)\), some may converge much more easily than others. Having a good optimization algorithm is essential for nonlinear regression with complicated models, especially those with many parameters \(c\).

  • Classification is a very similar problem to regression, but the observations \(y\) are discrete; in this case

    • models \(f(x,c)\) must have discrete output

    • the least squares loss function is not appropriate.

  • Reading: Why momentum really works
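As a companion to the stochastic gradient descent bullet above, here is a minimal sketch (entirely an added illustration: the quadratic model, the 500-point synthetic data set, the batch size, and the learning rate are all made up, and it assumes Zygote is imported as above):

xs = rand(500)
ys = sin.(2pi .* xs) .+ 0.1 .* randn(500)          # 500 synthetic observations
model(t, c) = c[1] .+ c[2] .* t .+ c[3] .* t.^2    # a small quadratic model (my choice)

function sgd(c; lr=0.1, batch=20, epochs=200)
    for _ in 1:epochs
        i = rand(1:length(xs), batch)              # random mini-batch of observations
        g = Zygote.gradient(c -> sum(abs2, model(xs[i], c) .- ys[i]) / (2 * batch), c)[1]
        c = c - lr .* g                            # one gradient-descent step on the batch
    end
    c
end

sgd(zeros(3))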