Document how missing values should be handled by user #247
Comments
Should be addressed by the improved preprocessing documentation in: https://evovest.github.io/EvoTrees.jl/dev/tutorials/logistic-regression-titanic/#Preprocessing |
Thank you. I would consider adding the following details to the documentation (or changing the behavior of the package):
|
@jeremiedb - any thoughts on this? I would make a blogpost about updates of EvoTrees after you decide what to do with this issue and tag a release. Hopefully it could help promote this excellent package. |
For now my take would be to add a "Missing data" section in the docs (alongside the Reproducibility one) that clarifies the behavior of the algo. For now I'm reluctant to perform further transformations on the input data or to make any assumption about what the user's intent would have been. My perspective is that ML algos should be limited to the algo part, while the handling of missings and the like belongs to the preprocessing part, which I see as a topic of its own within a modeling pipeline. So I'd prefer to direct users to MLJ, TableTransforms or self-defined preprocessing. |
Sure - if the docs are precise about what is done in the algo and what has to be done in preprocessing, this is also OK. |
Let me know if you think the above PR provides satisfying clarifications on the handling of missings: https://evovest.github.io/EvoTrees.jl/dev/#Missing-values |
Looks good. Thank you! |
I checked this part of your tutorial:
https://github.com/Evovest/EvoTrees.jl/blob/main/docs/src/tutorials/logistic-regression-titanic.md?plain=1#L34
and
https://github.com/Evovest/EvoTrees.jl/blob/main/docs/src/tutorials/logistic-regression-titanic.md?plain=1#L35
It was not fully clear to me what the package maintainers recommend in either case, i.e. what the canonical way to preprocess string variables is and what the canonical way to handle `missing` is. (For example, for `missing`: if a replacement such as the one suggested in the docs is done, a further 0-1 feature indicating where a missing value was should probably be added, to avoid losing information.)
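To make the two cases concrete, here is a minimal sketch of the kind of preprocessing discussed above. It is not taken from the tutorial or from EvoTrees.jl; the variable names (`sex`, `age`, `sex_female`, `age_ismissing`, `age_imputed`) and the toy values are assumptions made for illustration, and median imputation is just one possible choice. It uses plain vectors so that it only depends on the `Statistics` standard library; the same broadcast expressions can be applied to DataFrame columns.

```julia
using Statistics

# Toy columns standing in for the tutorial's data (values are made up)
sex = ["male", "female", "female", "male"]
age = [22.0, missing, 35.0, missing]

# Encode the string variable as a 0-1 feature
# (assumption: only "male" and "female" occur in the column)
sex_female = Int.(sex .== "female")

# Record where age was missing BEFORE imputing, so this information is kept
age_ismissing = Int.(ismissing.(age))

# Replace each missing entry with the median of the observed values;
# broadcasting coalesce also narrows the element type to Float64
age_imputed = coalesce.(age, median(skipmissing(age)))
```

The indicator has to be computed before the imputation step, since afterwards the positions of the missing values can no longer be recovered from the imputed column.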
Also, thank you for using DataFrames.jl :). From this perspective you could write (this is a mild suggestion):
or maybe just simply:
See for the second performance point: