Termination at saddle point #705

Open
andreasnoack opened this issue Aug 10, 2023 · 6 comments

@andreasnoack
Member

andreasnoack commented Aug 10, 2023

I'm facing an issue where a linear mixed model terminates at a saddle point for some datasets. I searched for prior discussion of BOBYQA and saddle points but wasn't able to find anything obvious. The short version is that BOBYQA terminates at a saddle point where Optim's BFGS does not, so I'm wondering if it might be worthwhile to switch to a derivative-based optimization algorithm.

Now to the details. I'm working on a bioequivalence testing problem, and I don't have the luxury of choosing the model. The model is stated in SAS form in Appendix E of https://www.fda.gov/media/70958/download and for most datasets I'm matching the SAS results exactly with the MixedModels formulation below. However, for some datasets (it appears to be associated with missing values), the MixedModels fits end up at saddle points. This dataset is used for the reproducer:

saddlepointdata.csv

In the code below, I make tiny perturbations of the observations. Even though they are tiny, they are sufficient to make the difference between reaching an extremum and a saddle point (as I'll show further down).

using MixedModels, CSV, DataFrames, StatsBase, Random
using SparseArrays: nzrange

saddlepointdata = CSV.read("saddlepointdata.csv", DataFrame)
transform!(saddlepointdata, "AUCT" => eachindex => "row")

N = 500
_rng = Random.seed!(Random.default_rng(), 124)
fts = [
    fit(
        LinearMixedModel,
        @formula(log(AUCT) ~
            trt +
            seq +
            per +
            (trt + 0 | sub) +
            zerocorr(trt + 0 | row)
        ),
        transform(
            saddlepointdata,
            "AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT"
        );
        contrasts = Dict(
            [
                :trt,
                :per,
                :sub,
                :row,
            ] .=> Ref(DummyCoding())
        ),
        REML = true
    ) for _ in 1:N
];

countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=4))

I'm focusing on the second coefficient because it's the one of interest in the equivalence testing, and restricting attention to a single coefficient keeps the output brief. The result of the code above is

Dict{Float64, Int64} with 2 entries:
  0.9329 => 483
  0.9301 => 17

The first of these solutions is an extremum and the second is a saddle point. To see this, we can run

using CairoMakie, SixelTerm, UnicodePlots  # UnicodePlots provides the lineplot call below

saddlepointfit = fts[findfirst(t -> exp(coef(t)[2]) < 0.931, fts)];

plotx = range(-12, stop=12, length=101)
_i = 5
_θ = copy(saddlepointfit.θ)
ploty = [
    MixedModels.deviance(
        MixedModels.updateL!(
            MixedModels.setθ!(
                deepcopy(saddlepointfit), # to avoid mutating the original version
                _θ + [zeros(_i - 1); _x; zeros(5 - _i)]
            )
        )
    )
    for _x in plotx
];
lineplot(plotx, ploty, xlim = extrema(plotx), width = 80)

which gives
[screenshot: deviance profile along the perturbed θ component, showing the termination point is not at the minimum]
so we are clearly not at a minimum.
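
As an extra sanity check, we can compare the REML criterion of the suspect fit with that of one of the majority fits from the same fts vector (a small sketch; the 0.931 cutoff just separates the two clusters of estimates, and the saddle point should give a strictly larger objective than the minimum):

goodfit = fts[findfirst(t -> exp(coef(t)[2]) > 0.931, fts)];
(objective(goodfit), objective(saddlepointfit))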

To dig a little deeper, I'm using a version of the objective function that is compatible with ForwardDiff. The code is in the folded section below.

ForwardDiff-compatible MixedModels code
using Optim, LinearAlgebra
# Map unconstrained optimization variables to the model scale: the positive
# parameters (including σ as the last element) go through exp, while the one
# entry that may be negative is passed through unchanged.
mybound = x -> [exp.(x[1:3]); x[4]; exp.(x[5:6])]

function my_fit(
  ft::MixedModels.LinearMixedModel;
  show_trace = false,
  extended_trace = false
  )
  opt_res = optimize(
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      BFGS(initial_stepnorm = 1.0),
      Optim.Options(;show_trace, extended_trace),
      autodiff=:forward
  ) do θ
      my_deviance(ft, mybound(θ))
  end
  new_ft = deepcopy(ft)
  MixedModels.setθ!(new_ft, mybound(opt_res.minimizer)[1:end - 1])
  MixedModels.updateL!(new_ft)
  return new_ft
end

function my_deviance(model::LinearMixedModel, θ::AbstractVector{T}) where {T}

  σ² = θ[end]^2
  _θ = θ[1:(end-1)]
  dof = MixedModels.ssqdenom(model)

  # Extract and promote
  A, L, reterms = model.A, model.L, model.reterms
  AA = [LinearAlgebra.copy_oftype(Ai, T) for Ai in A]
  LL = [LinearAlgebra.copy_oftype(Li, T) for Li in L]
  RR = [LinearAlgebra.copy_oftype(Ri, T) for Ri in reterms]

  # Update state with new θ
  _setθ!(RR, model.parmap, _θ)
  myupdateL!(AA, LL, RR)

  r² = _pwrss(LL)
  ld = _logdet(LL, RR, model.optsum.REML)

  return dof * log(2 * π * σ²) + ld + r² / σ²
end

function _setθ!(
  reterms::Vector{<:MixedModels.AbstractReMat},
  parmap::Vector{<:NTuple},
  θ::AbstractVector,
)
  length(θ) == length(parmap) || throw(DimensionMismatch())
  reind = 1
  λ = first(reterms).λ
  for (tv, tr) in zip(θ, parmap)
      tr1 = first(tr)
      if reind ≠ tr1
          reind = tr1
          λ = reterms[tr1].λ
      end
      λ[tr[2], tr[3]] = tv
  end
  return reterms
end

function myupdateL!(A::Vector, L::Vector, reterms::Vector)

  k = length(reterms)
  copyto!(last(L), last(A))  # ensure the fixed-effects:response block is copied
  for j in eachindex(reterms) # pre- and post-multiply by Λ, add I to diagonal
      cj = reterms[j]
      diagind = MixedModels.kp1choose2(j)
      MixedModels.copyscaleinflate!(L[diagind], A[diagind], cj)
      for i = (j+1):(k+1)     # postmultiply column by Λ
          bij = MixedModels.block(i, j)
          MixedModels.rmulΛ!(copyto!(L[bij], A[bij]), cj)
      end
      for jj = 1:(j-1)        # premultiply row by Λ'
          MixedModels.lmulΛ!(cj', L[MixedModels.block(j, jj)])
      end
  end
  for j = 1:(k+1)             # blocked Cholesky
      Ljj = L[MixedModels.kp1choose2(j)]
      for jj = 1:(j-1)
          _rankUpdate!(
              Hermitian(Ljj, :L),
              L[MixedModels.block(j, jj)],
              -one(eltype(Ljj)),
              one(eltype(Ljj)),
          )
      end
      _cholUnblocked!(Ljj, Val{:L})
      LjjT = isa(Ljj, Diagonal) ? Ljj : LowerTriangular(Ljj)
      for i = (j+1):(k+1)
          Lij = L[MixedModels.block(i, j)]
          for jj = 1:(j-1)
              mul!(
                  Lij,
                  L[MixedModels.block(i, jj)],
                  L[MixedModels.block(j, jj)]',
                  -one(eltype(Lij)),
                  one(eltype(Lij)),
              )
          end
          rdiv!(Lij, LjjT')
      end
  end
  return nothing
end

_pwrss(L::Vector) = abs2(last(last(L)))

function _logdet(L::Vector, reterms::Vector{<:MixedModels.AbstractReMat}, REML::Bool)
  @inbounds s = sum(j -> MixedModels.LD(L[MixedModels.kp1choose2(j)]), axes(reterms, 1))
  if REML
      lastL = last(L)
      s += MixedModels.LD(lastL)        # this includes the log of sqrtpwrss
      s -= log(last(lastL)) # so we need to subtract it from the sum
  end
  return s + s  # multiply by 2 b/c the desired det is of the symmetric mat, not the factor
end

LinearAlgebra.copy_oftype(A::MixedModels.UniformBlockDiagonal, ::Type{T}) where {T} =
  MixedModels.UniformBlockDiagonal(LinearAlgebra.copy_oftype(A.data, T))

function LinearAlgebra.copy_oftype(
  A::MixedModels.BlockedSparse{<:Any,S,P},
  ::Type{T},
) where {T,S,P}
  cscmat = LinearAlgebra.copy_oftype(A.cscmat, T)
  nzsasmat = LinearAlgebra.copy_oftype(A.nzsasmat, T)
  MixedModels.BlockedSparse{T,S,P}(cscmat, nzsasmat, A.colblkptr)
end

function LinearAlgebra.copy_oftype(A::MixedModels.ReMat{<:Any,S}, ::Type{T}) where {T,S}
  MixedModels.ReMat{T,S}(
      A.trm,
      A.refs,
      A.levels,
      A.cnames,
      LinearAlgebra.copy_oftype(A.z, T),
      LinearAlgebra.copy_oftype(A.wtz, T),
      LinearAlgebra.copy_oftype(A.λ, T),
      A.inds,
      LinearAlgebra.copy_oftype(A.adjA, T),
      LinearAlgebra.copy_oftype(A.scratch, T),
  )
end

function _cholUnblocked!(A::StridedMatrix, ::Type{Val{:L}})
  cholesky!(Hermitian(A, :L))
  return A
end
function _cholUnblocked!(D::MixedModels.UniformBlockDiagonal, ::Type{T}) where {T}
  Ddat = D.data
  for k in axes(Ddat, 3)
      _cholUnblocked!(view(Ddat, :, :, k), T)
  end
  return D
end
function _cholUnblocked!(A::Diagonal, ::Type{T}) where {T}
  A.diag .= sqrt.(A.diag)
  return A
end
_cholUnblocked!(A::AbstractMatrix, ::Type{T}) where {T} = MixedModels.cholUnblocked!(A, T)

function _rankUpdate!(
  C::LinearAlgebra.HermOrSym{T,MixedModels.UniformBlockDiagonal{T}},
  A::StridedMatrix{T},
  α,
  β,
) where {T}
  Cdat = C.data.data
  LinearAlgebra.require_one_based_indexing(Cdat, A)
  isone(β) || rmul!(Cdat, β)
  blksize = size(Cdat, 1)

  for k in axes(Cdat, 3)
      ioffset = (k - 1) * blksize
      joffset = (k - 1) * blksize
      for i in axes(Cdat, 1), j = 1:i
          iind = ioffset + i
          jind = joffset + j
          AtAij = 0
          for idx in axes(A, 2)
              # because the second multiplicant is from A', swap index order
              AtAij += A[iind, idx] * A[jind, idx]
          end
          Cdat[i, j, k] += α * AtAij
      end
  end

  return C
end

function _rankUpdate!(
  C::LinearAlgebra.HermOrSym{T,MixedModels.UniformBlockDiagonal{T}},
  A::MixedModels.BlockedSparse{T,S},
  α,
  β,
) where {T,S}
  Ac = A.cscmat
  cp = Ac.colptr
  all(==(S), diff(cp)) ||
      throw(ArgumentError("Columns of A must have exactly $S nonzeros"))
  Cdat = C.data.data
  LinearAlgebra.require_one_based_indexing(Ac, Cdat)

  j, k, l = size(Cdat)
  S == j == k && div(Ac.m, S) == l ||
      throw(DimensionMismatch("div(A.cscmat.m, S) ≠ size(C.data.data, 3)"))
  nz = Ac.nzval
  rv = Ac.rowval

  @inbounds for j in axes(Ac, 2)
      nzr = nzrange(Ac, j)
      # BLAS.syr!('L', α, view(nz, nzr), view(Cdat, :, :, div(rv[last(nzr)], S)))
      _x = view(nz, nzr)
      view(Cdat, :, :, div(rv[last(nzr)], S)) .+= α .* _x .* _x'
  end

  return C
end

function _rankUpdate!(
  C::LinearAlgebra.HermOrSym{T,S},
  A::StridedMatrix{T},
  α,
  β,
) where {T,S}
  # BLAS.syrk!(C.uplo, 'N', T(α), A, T(β), C.data)
  C.data .*= β
  C.data .+= α .* A * A'
  return C
end

_rankUpdate!(C::AbstractMatrix, A::AbstractMatrix, α, β) =
  MixedModels.rankUpdate!(C, A, α, β)

Notice that this version of the objective function hasn't concentrated out σ, unlike the profiled objective used by MixedModels. With this code, I can compute the Hessian, which clearly shows the indefiniteness:

julia> using ForwardDiff

julia> (objective(saddlepointfit), my_deviance(saddlepointfit, [_θ; saddlepointfit.sigma]))
(-284.3311306944413, -284.33113069939395)

julia> ForwardDiff.hessian(θσ -> my_deviance(saddlepointfit, θσ), [_θ; saddlepointfit.sigma])
6×6 Matrix{Float64}:
    1.04012         9.41825e-6      0.00227411   0.000198533   1.81058e-8   1469.2
    9.41825e-6      0.58946        -1.33622e-5  -0.0003711     0.000425082  1660.82
    0.00227411     -1.33622e-5      0.860294     0.0357195    -2.86937e-7   1510.92
    0.000198533    -0.0003711       0.0357195    0.288273     -2.35253e-6     -1.50027
    1.81058e-8      0.000425082    -2.86937e-7  -2.35253e-6   -0.051874        1.04004
 1469.2          1660.82         1510.92        -1.50027       1.04004         9.40305e6
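
The indefiniteness is also immediate from the eigenvalues (a quick check reusing the pieces above; the negative diagonal entry at position (5, 5) already rules out positive semidefiniteness):

using LinearAlgebra

H = ForwardDiff.hessian(θσ -> my_deviance(saddlepointfit, θσ), [_θ; saddlepointfit.sigma])
eigvals(Symmetric(H, :L))  # a local minimum requires all eigenvalues to be ≥ 0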

Finally, we can use the ForwardDiff-compatible code to fit the model with BFGS using ForwardDiff-based derivatives:

julia> my_fts = [my_fit(ft) for ft in fts];

julia> countmap(round.(exp.(getindex.(coef.(my_fts), 2)), digits=4))
Dict{Float64, Int64} with 1 entry:
  0.9329 => 500

so with BFGS, we never end up at the saddle point (for this dataset). Of course, we don't know whether that holds for other datasets, but I wanted to share this observation to start the conversation. Maybe a gradient-based optimizer should be considered. It would of course require some work, since the version I threw together here allocates heavily and can't be used as is, but it might still be useful as a starting point.
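
For completeness, here is a rough sketch of how the same ForwardDiff-compatible objective could be driven by a gradient-based NLopt algorithm instead of Optim. The helper my_fit_nlopt and the choice of LD_LBFGS are only illustrations under the same assumptions as my_fit above (unconstrained parameterization via mybound, σ not concentrated out), not a proposal for the package's implementation:

using NLopt, ForwardDiff

function my_fit_nlopt(ft::MixedModels.LinearMixedModel)
    f = x -> my_deviance(ft, mybound(x))   # objective on the unconstrained scale
    opt = NLopt.Opt(:LD_LBFGS, 6)
    opt.xtol_rel = 1e-10                   # stopping criterion (NLopt expects at least one to be set)
    opt.min_objective = (x, grad) -> begin
        isempty(grad) || ForwardDiff.gradient!(grad, f, x)  # fill the gradient in place when requested
        return f(x)
    end
    _, minx, _ = NLopt.optimize(opt, zeros(6))
    new_ft = deepcopy(ft)
    MixedModels.setθ!(new_ft, mybound(minx)[1:end - 1])     # drop σ, as in my_fit
    MixedModels.updateL!(new_ft)
    return new_ft
end

Usage would mirror my_fit, e.g. [my_fit_nlopt(ft) for ft in fts].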

@palday
Member

palday commented Aug 10, 2023

@andreasnoack Can you show us versioninfo()?

@dmbates has actually done a fair number of more involved experiments with using the gradient and it never really paid off. That said, we've recently discussed revisiting this and evaluating the Hessian efficiently has several uses.

In my own experience, BOBYQA rarely fails on well-posed problems, but we do support using other derivative-free algorithms implemented in NLopt. For example, if we change to COBYLA, things work:

using ProgressMeter  # provides the @showprogress macro used below

N = 500
_rng = Random.seed!(Random.default_rng(), 124)
formula = @formula(log(AUCT) ~ trt + seq + per +
                   (trt + 0 | sub) +
                   zerocorr(trt + 0 | row))
contrasts = Dict(:trt => DummyCoding(), :per => DummyCoding(),
                 :sub => Grouping(), :row => Grouping())
fts = @showprogress map(1:N) do _
    data = transform(saddlepointdata,
                     "AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT")
    model = LinearMixedModel(formula, data; contrasts)
    model.optsum.optimizer = :LN_COBYLA
    return fit!(model; REML=true)
end

countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))

yields

Dict{Float64, Int64} with 1 entry:
  0.933 => 500
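
The same pattern makes it easy to screen several of NLopt's derivative-free algorithms on a single perturbed copy of the data; the list below is just an illustration, not a recommendation:

data = transform(saddlepointdata,
                 "AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT")
map([:LN_BOBYQA, :LN_COBYLA, :LN_NELDERMEAD, :LN_SBPLX]) do optname
    model = LinearMixedModel(formula, data; contrasts)
    model.optsum.optimizer = optname
    fit!(model; REML=true)
    optname => round(exp(coef(model)[2]), digits=3)
end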

@andreasnoack
Member Author

We have five test datasets where we could reproduce the saddle point. When running the test script above with those datasets, COBYLA always avoids the saddle-point solutions that BOBYQA sometimes terminates at.

It would be interesting to know whether this is an issue with the implementation of BOBYQA or with the algorithm itself. cc @stevengj.

@dmbates has actually done a fair number of more involved experiments with using the gradient and it never really paid off.

I'd be curious to read more about this. My expectation would be that it would be beneficial to use the gradient.

That said, we've recently discussed revisiting this and evaluating the Hessian efficiently has several uses.

The reason I had the ForwardDiff-compatible deviance code handy is that I'm already using it for the Satterthwaite approximation of the degrees of freedom. I'd be happy to contribute that here if there is interest.

@palday
Member

palday commented Aug 18, 2023

The reason I had the ForwardDiff-compatible deviance code handy is that I'm already using it for the Satterthwaite approximation of the degrees of freedom. I'd be happy to contribute that here if there is interest.

@andreasnoack definitely! That's been on my secret todo list for a very long time, but other things were always higher priority.

@palday
Member

palday commented Aug 18, 2023

@andreasnoack it's stale relative to main, but I think this was the most recent effort: https://github.com/JuliaStats/MixedModels.jl/tree/forwarddiff

@palday
Member

palday commented Aug 22, 2023

@andreasnoack see also #312

@dmbates
Collaborator

dmbates commented Feb 7, 2024

As I mention in #742, I have been looking at switching MixedModels.jl to the PRIMA.jl version of bobyqa for the constrained, derivative-free optimization. @zaikunzhang, the developer of libprima, spent a considerable amount of time translating Powell's original Fortran 77 code into modularized modern Fortran. Along the way he discovered and fixed a number of bugs.

I will let you know if I am able to apply PRIMA.bobyqa to this problem.
