Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document how missing values should be handled by user #247

Closed
bkamins opened this issue Aug 11, 2023 · 7 comments
Closed

Document how missing values should be handled by user #247

bkamins opened this issue Aug 11, 2023 · 7 comments

Comments

@bkamins
Copy link

bkamins commented Aug 11, 2023

I checked this part of your tutorial:

https://github.com/Evovest/EvoTrees.jl/blob/main/docs/src/tutorials/logistic-regression-titanic.md?plain=1#L34

and

https://github.com/Evovest/EvoTrees.jl/blob/main/docs/src/tutorials/logistic-regression-titanic.md?plain=1#L35

And it was not fully clear for me what is the recommended practice for both cases from the package maintainers.
I.e. what should be the canonical way to preprocess string variables and the canonical way to handle missing.
(for example in case of missing probably, if such a replacement as suggested in the docs is done another 0-1 feature indicating where a missing value was would be added to avoid loosing information).

Also, thank you for using DataFrames.jl :). From this perspective you could write (this is a mild suggestion):

transform!(df, :Sex => ByRow(!=("male")) => :Sex) # gives Bool column, but I guess it is OK for your package
transform!(df, :Age => (x -> coalesce.(x, mean(skipmissing(df.Age)))) => :Age) # this will be more efficient as it avoids computing the mean repeatedly

or maybe just simply:

df.Sex = df.Sex .!== "male"
df.Age = coalesce.(df.Age, mean(skipmissing(df.Age)))

See for the second performance point:

julia> x = rand([1.0, 2.0, missing], 10^4);

julia> f1(v) = ismissing(v) ? mean(skipmissing(x)) : v
f1 (generic function with 1 method)

julia> f2(x) = coalesce.(x, mean(skipmissing(x)))
f2 (generic function with 1 method)

julia> @time f1.(x);
  0.107405 seconds (20.01 k allocations: 390.969 KiB)

julia> @time f1.(x);
  0.110699 seconds (20.01 k allocations: 390.969 KiB)

julia> @time f2(x);
  0.000142 seconds (2 allocations: 78.172 KiB)

julia> @time f2(x);
  0.000124 seconds (2 allocations: 78.172 KiB)
@jeremiedb
Copy link
Member

Should be adressed by improved preprocessing documentation in: https://evovest.github.io/EvoTrees.jl/dev/tutorials/logistic-regression-titanic/#Preprocessing

@bkamins
Copy link
Author

bkamins commented Aug 19, 2023

Thank you.

I would consider adding the following details to the documentation (or changing the behavior of the package):

  • for features: currently if eltype of feature allows Missing then the feature is dropped; I recommend that you either document this (i.e. that the user needs to call e.g. disallowmissing function on the input) or that you actually do the check if there are any missing; this is a more general issue, e.g. if column eltype is Any (which sometimes happens in practice) users should be informed what to do (simplest is to do identity.(col) to force establishment of a more precise eltype)
  • you say in the docs "Tables API supports input that are either Real, Bool or Categorical" but Bool <: Real so it would be good to explain why Bool is explicitly mentioned (I assume it is handled differently internally?)
  • having said that if we have missing value for unordered categorical variable then I think you could allow missing in such a variable (and just treat it as a separate level); currently you drop it - this is a soft recommendation; also I would make a separate recommendation for case of missing in Categorical column (as then probably it is better to change missing to a separate level given your current implementation of the package) - the recommendation with a dummy variable is good for Real variables. (I hope it is clear what I want to say here 😄)
  • the requirement you specified applies to FEATURES. For target actually eltype is not checked, but values are checked (still missing is disallowed). I would make it more precise what input is assumed for target input.

@bkamins
Copy link
Author

bkamins commented Oct 1, 2023

@jeremiedb - any thoughts on this? I would make a blogpost about updates of EvoTrees after you decide what to do with this issue and tag a release. Hopefully it could help promote this excellent package.

@jeremiedb
Copy link
Member

For now my take would be to add a "Missing data" section in the docs (along the Reproducibility one) that clarifies the behavior of the algo. I'm for now reluctant to perform further transformations to the input data or make any assumption of what the intent of the user would have been. My perspetive is for ML algos to be limited to the algo part, while the handling of missings and the likes to be handled by the preprocessing part, which I conceive as a topic of its own within a modeling pipeline. So I'd prefer to direct users to MLJ, TableTransforms or of self-defined preprocessing.
I agree though on importance add clarification on the supported feature and column eltypes. I'll look to have those docs updated within 2-3 days.

@bkamins
Copy link
Author

bkamins commented Oct 2, 2023

Sure - if docs are precise what is done in algo and what has to be done in pre-processing this is also OK.

@jeremiedb jeremiedb mentioned this issue Oct 5, 2023
@jeremiedb
Copy link
Member

jeremiedb commented Oct 5, 2023

Let me know if you think the above PR provides satisfying clarificationson the handling of missings: https://evovest.github.io/EvoTrees.jl/dev/#Missing-values

@bkamins
Copy link
Author

bkamins commented Oct 5, 2023

Looks good. Thank you!

@bkamins bkamins closed this as completed Oct 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants