Mlj table #262

Merged: 11 commits merged into main from mlj-table on Oct 27, 2023

Conversation

jeremiedb (Member) commented Oct 17, 2023

Fix #260
@ablaom I moved the MLJ wrapper from the original Matrix API to the new Tables one, which provides native support for richer inputs, incl. Ordered and Unordered (MultiClass) CategoricalValues.
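
For illustration, a minimal sketch of the kind of mixed-type input this enables (the data and column names here are made up, not taken from the PR):

using MLJBase, EvoTrees
using CategoricalArrays

# a NamedTuple of vectors is a valid Tables.jl table
X = (
    x_num = rand(100),
    x_ord = categorical(rand(["low", "mid", "high"], 100); ordered=true),
    x_cat = categorical(rand(["red", "green", "blue"], 100)),  # MultiClass
)
y = rand(100)

mach = machine(EvoTreeRegressor(), X, y) |> fit!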

codecov bot commented Oct 17, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (535b0c4) 51.05% compared to head (a4c9dcb) 51.90%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #262      +/-   ##
==========================================
+ Coverage   51.05%   51.90%   +0.84%     
==========================================
  Files          22       22              
  Lines        1986     1944      -42     
==========================================
- Hits         1014     1009       -5     
+ Misses        972      935      -37     
Files                              Coverage Δ
src/fit-utils.jl                   90.36% <100.00%> (-1.36%) ⬇️
src/fit.jl                         96.39% <100.00%> (+0.03%) ⬆️
src/predict.jl                     92.24% <100.00%> (+0.06%) ⬆️
src/MLJ.jl                         88.88% <92.30%> (+2.88%) ⬆️
ext/EvoTreesCUDAExt/predict.jl      0.00% <0.00%> (ø)
ext/EvoTreesCUDAExt/fit-utils.jl    0.00% <0.00%> (ø)


ablaom (Contributor) commented Oct 17, 2023

Happy to review this shortly.

# selectrows subsets observations by taking views of A.matrix (no copy):
MMI.selectrows(::EvoTypes, I, A, y) =
    ((matrix=view(A.matrix, I, :), names=A.names), view(y, I))
MMI.selectrows(::EvoTypes, I, A) = ((matrix=view(A.matrix, I, :), names=A.names),)

ablaom (Contributor) commented on this hunk, Oct 19, 2023

My understanding is this: previously table -> matrix conversion happened in reformat and not in EvoTree's internal training, which means MLJ's IteratedModel wrapper avoided repeating these conversions over and over in controlled iteration.

The change here suggests that table -> matrix conversion is now "internalised", which means the performance gain is lost, right? What am I missing here?
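
(A rough sketch of the pattern in question, assuming MLJModelInterface's reformat contract; this is not the removed code verbatim:)

import MLJModelInterface as MMI
using Tables

# table -> matrix conversion happens once, up front; controlled iteration then
# reuses the converted data via selectrows/update instead of reconverting
MMI.reformat(::EvoTypes, X, y) =
    ((matrix=MMI.matrix(X), names=Tables.schema(X).names), y)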

jeremiedb (Member, Author) replied:

I'm afraid this brings back to the forefront the issue regarding MLJ's handling of iterated processes for models that rely on a caching/initialization mechanism.

Once an EvoTree model is initialized with a training dataset, that dataset should be considered fixed: no subsampling should be applied to it over the iterations. The algorithm is responsible for performing the subsampling operations throughout the iterations, using tricks that are algorithm-specific. AFAIK this is true for all "high performance" GBT implementations (incl. XGBoost, LightGBM, CatBoost).
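
(Concretely, the subsampling lives in the model's hyper-parameters and is applied internally at each iteration; the values below are just illustrative:)

# the caller never re-slices the training data between iterations;
# row/column subsampling is drawn internally, once per tree
model = EvoTreeRegressor(nrounds=100, rowsample=0.5, colsample=0.8)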

Looking at the error that results when trying an IteratedModel under this PR: why does MLJ's fit_only! need to perform a data resampling, fit(model, verbosity, _resampled_data(mach, model, rows)...)?

The logic used with EvoTrees so far to handle MMI's update method has been to completely ignore the data args A and y:

grow_evotree!(fitresult, cache, model)
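
(Spelled out, the update pattern described here looks roughly like this; a sketch under the MMI contract, not the exact source:)

import MLJModelInterface as MMI

function MMI.update(model::EvoTypes, verbosity, fitresult, cache, A, y)
    # the data args A and y are deliberately ignored: iterating mutates the
    # fitresult and cache built when the model was first initialized
    grow_evotree!(fitresult, cache, model)
    return fitresult, cache, (;)
end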

I'm still unclear why an issue occurred with IteratedModel in the first place, since in the update path only the model and cache are actually used when iterating. And EvoTrees' MLJ tests pass, so the basic operations seem to work fine, including iterative calls to fit!.

This is still very much WIP, but I'm trying to lay down what I have experienced as shortcomings of ML toolkits, incl. scikit-learn, which assume that a model can be instantiated with a MyModelRegressor rather than having distinct management of a model configuration, its initialization, and the fitting of that initialized model: https://github.com/Evovest/EvoTrees.jl/blob/learnAPI/experiments/learnAPI.jl. This is the kind of approach we need to rely on internally to handle our ML pipelines, but I'd definitely like to see such support in an open framework.
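
(The shape argued for is roughly the following; purely illustrative, with hypothetical init/fit! functions, not the linked file verbatim:)

# 1. configuration: pure hyper-parameters, no data involved
config = EvoTreeRegressor(max_depth=5, eta=0.05)

# 2. initialization: bind the (fixed) training data and build caches once
m = init(config, x_train, y_train)

# 3. fitting: iterate the initialized model; data is never re-passed
for _ in 1:100
    fit!(m)
end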

jeremiedb (Member, Author) added:

@ablaom The issue I encountered when testing IteratedModel appears to have been a red herring: the failure was caused by the log_loss measure not accepting the Deterministic output of EvoTreeRegressor (isn't that unnecessarily restrictive? I actually don't see a logistic regression as less deterministic than "mse", but that's another topic!).
It seems that MLJ's fit! call to reformat effectively results in a no-op, as expected (each iteration following the first one directly mutates the instantiated fitted_model and the cache). Training speed is aligned with the internal fit_evotree when using a very large evaluation Step (but it quickly blows up, going from a 12 sec reference time to ~68 sec with Step(5)).

So, back to the specific topic of moving to EvoTrees' Tables API: AFAICT the MLJ integration works fine. Or at least as well as the existing one, but with added benefits in terms of richer input feature types.

ablaom (Contributor) commented Oct 24, 2023

Okay, I think I was mistaken to think dumping the model-specific data pre-processing (reformat) and model-specific resampling (selectrows) would slow down IteratedModel. As you point out, we are only calling update after the first train, so the conversion table -> matrix only happens once. I have also confirmed this with a benchmark, where this PR actually runs slightly faster for IteratedModel, for Step sizes of 1, 5, 30 and a total of 300 iterations (details at the end).

On the other hand, I do expect some impact on TunedModel wrapping, for here changing the parameter to be tuned means we call fit, not update, for each new parameter value. A benchmark suggests this is indeed the case, but the slowdown is small (assuming it's even statistically significant):

# on EvoTrees master branch:

@btime fit!($mach, verbosity=0)
#  15.437 μs (100 allocations: 4.55 KiB)

# on this PR:

@btime fit!($mach, verbosity=0)
#  17.439 μs (100 allocations: 4.55 KiB)

Benchmark details are attached below.

So, I won't push back against your reluctance to update the reformat/selectrows implementation, and I assume you have good reasons for not doing so.

I'm not sure I fully understand all the objections to the current API, but I suggest we have this discussion in a different forum. (Still happy to take a call to discuss this.)

API changes to MLJModelInterface, if they are breaking, will not happen in any hurry.

I will proceed with the rest of my review shortly.

using MLJModels
using MLJIteration
using MLJBase
using BenchmarkTools
using StatisticalMeasures
using MLJTuning

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

X, y = make_regression(1000, 5)


# # CONTROLLED ITERATION

# number of iterations:
N = 300

steps = [1, 5, 30]

iterated_boosters = [
    IteratedModel(
        model=booster,
        resampling=Holdout(fraction_train=0.8),
        controls=[Step(s), NumberLimit(round(Int, N/s))],
        measure=l1,
    )
    for s in steps
        ]

for ib in iterated_boosters
    mach = machine(ib, X, y)
    print("step=", ib.controls[1].n)
    @btime fit!($mach, verbosity=0)
    @assert report(mach).n_iterations == N
end

# on EvoTrees master branch:

# step=1  14.351 μs (101 allocations: 4.61 KiB)
# step=5  14.301 μs (101 allocations: 4.61 KiB)
# step=30  14.342 μs (101 allocations: 4.61 KiB)

# on this PR:

# step=1  13.701 μs (101 allocations: 4.61 KiB)
# step=5  13.726 μs (101 allocations: 4.61 KiB)
# step=30  13.776 μs (101 allocations: 4.61 KiB)


# # TUNING

r = range(booster, :lambda, lower = 0, upper = 0.5)
tuned_booster = TunedModel(
    booster,
    tuning=RandomSearch(),
    range=r,
    measure = l2,
    n = 500,
)

mach = machine(tuned_booster, X, y)

# on EvoTrees master branch:

@btime fit!($mach, verbosity=0)
#  15.437 μs (100 allocations: 4.55 KiB)

# on this PR:

@btime fit!($mach, verbosity=0)
#  17.439 μs (100 allocations: 4.55 KiB)

ablaom (Contributor) commented Oct 24, 2023

@jeremiedb There seems to be a problem with handling tables that are row-based. Perhaps you are calling Tables.columnnames(X) directly on a table X (a common gotcha)? You can only call this on tables implementing the AbstractColumns interface, or on rows of a table implementing the AbstractRow interface.

Pkg.activate("explore002", shared=true)

using MLJBase
using MLJModels
using Tables

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

X, y = make_regression(1000, 5)

# this works:
mach = machine(booster, X, y) |> fit!

# this doesn't
X = Tables.rowtable(X)
mach = machine(booster, X, y) |> fit!

# ERROR: BoundsError: attempt to access 1000-element Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}} at index [1] 
# Stacktrace:
#  [1] getcolumn(x::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, i::Int64)                                                                                      
#    @ Tables ~/.julia/packages/Tables/NSGZI/src/Tables.jl:101
#  [2] fit(model::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, verbosity::Int64, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, y::Vector{Float64}, w::Nothing)
#    @ EvoTrees ~/.julia/packages/EvoTrees/jzXEo/src/MLJ.jl:3
#  [3] fit(model::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, verbosity::Int64, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, y::Vector{Float64})            
#    @ EvoTrees ~/.julia/packages/EvoTrees/jzXEo/src/MLJ.jl:3
#  [4] fit_only!(mach::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
#    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:680
#  [5] fit_only!
#    @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:606 [inlined]
#  [6] #fit!#63
#    @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:777 [inlined]
#  [7] fit!
#    @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:774 [inlined]
#  [8] |>(x::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}, f::typeof(fit!))
#    @ Base ./operators.jl:907
#  [9] top-level scope
#    @ REPL[12]:1

jeremiedb (Member, Author) commented Oct 24, 2023

Perhaps you are calling Tables.columnnames(X) directly on a table X

Guilty :) The Tables implementation effectively assumed it was working with column tables. Since there were several internal dependencies on getcolumn, I've added a conversion to Tables.columntable. It doesn't add any overhead when tables are already in that format, which I suppose is the most common case (e.g. DataFrame).
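
(A minimal sketch of the fix pattern; the helper name here is made up:)

using Tables

function _table_cols(X)
    # Tables.columntable is a no-op for column-based tables (e.g. DataFrame)
    # and materializes columns for row-based ones, so that getcolumn and
    # columnnames can be used safely downstream
    cols = Tables.columntable(X)
    return cols, Tables.columnnames(cols)
end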

Regarding the impact on TunedModel, I'd expect it to be very small: when a new hyper-parameter configuration is tested, a full model initialization must be performed, and that initialization time is essentially the same for both the original Matrix and the new Tables-based internal APIs. The bulk of computing time during an EvoTree training is normally spent iterating over the n trees, steps which are identical in EvoTrees regardless of whether the input was originally a Matrix or a table.

Thanks a lot for taking the time to look at the PR. And sorry about the digression on the iterated API. I'm trying to gather a coherent proposal so as to have a constructive take; I will ping back once done!

ablaom (Contributor) commented Oct 25, 2023

I think that row-table problem persists with predict:

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

X, y = make_regression(1000, 5)
X = Tables.rowtable(X);

# smoke tests:
mach = machine(booster, X, y) |> fit!    # this now works, thanks
predict(mach, X) # fails

# ERROR: BoundsError: attempt to access 1000-element Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}} at index [1]    
Stacktrace:
 [1] getcolumn(x::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, i::Int64)                                                                                      
   @ Tables ~/.julia/packages/Tables/NSGZI/src/Tables.jl:101
 [2] binarize(df::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}; fnames::NTuple{5, Symbol}, edges::Vector{Any})                                                    
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/fit-utils.jl:74
 [3] predict(m::EvoTrees.EvoTree{EvoTrees.MSE, 1}, data::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, ::Type{EvoTrees.CPU}; ntree_limit::Int64)
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/predict.jl:91
 [4] predict(m::EvoTrees.EvoTree{EvoTrees.MSE, 1}, data::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, ::Type{EvoTrees.CPU})          
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/predict.jl:83
 [5] predict(m::EvoTrees.EvoTree{EvoTrees.MSE, 1}, data::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}})                                
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/predict.jl:83
 [6] predict(#unused#::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, fitresult::EvoTrees.EvoTree{EvoTrees.MSE, 1}, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}})    
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/MLJ.jl:45
 [7] predict(mach::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}, Xraw::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}})                                
   @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/operations.jl:133
 [8] top-level scope
   @ REPL[42]:1

jeremiedb (Member, Author) commented Oct 26, 2023

I think that row-table problem persists with predict:

Thanks for the catch. I had neglected to add a test for row tables; one has now been added to the MLJ test suite (covering both fit! and predict). Tests are now passing following the patch to convert to Tables.columntable in EvoTrees.predict.

Regarding @spawn vs @threads, I've followed the discussions around it. The post you referred to prompted the migration to :static that I did in v0.15, which turned out to be ill-advised, as it resulted in issues for a user running EvoTrees inside an already-threaded routine. My understanding is that problems arise when one relies on threadid() as a reference to the thread doing the work, which, to the best of my knowledge, is not a pattern used in this codebase. I simply wanted to move all threaded operations to a common approach, as the histogram-building routines already use @threads, which provides some performance improvement over @spawn.
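
(For reference, a self-contained sketch of the threadid() gotcha being alluded to; not EvoTrees code:)

using Base.Threads

# unsafe: a task may yield and migrate between threads, so threadid()
# is not a stable index into per-thread state
acc = zeros(nthreads())
@threads for i in 1:1_000
    acc[threadid()] += i
end

# safer: give each task its own state, independent of thread ids
chunks = collect(Iterators.partition(1:1_000, 250))
partials = fetch.([Threads.@spawn sum(c) for c in chunks])
total = sum(partials)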

ablaom (Contributor) left a review:

I have finished reviewing those parts of the PR that seemed relevant to the MLJ interface. It looks good to go.

jeremiedb merged commit dbee39d into main on Oct 27, 2023 (8 checks passed).
jeremiedb deleted the mlj-table branch on October 27, 2023, 04:37.
Linked issue closed by this PR: Categorical Variables Not Working (#260)