cross-learn is an ensemble of scikit-learn wrappers aiming to simplify the validation of statistical learning models.
In particular, these libraries address how the `groups` parameter is handled by scikit-learn, which has been bugging me for a while.
The main features I focused on are:
- Cleanliness of code.
- Flexibility.
- Automation and completeness of model scoring.
- Simplification of nested cross-validation procedures.
The code is functionally split into three separate modules: crossvalidators, evaluation and transformers.
The crossvalidators module contains the crossvalidate_classification and crossvalidate_regression methods: all-in-one wrappers to obtain cross-validation and nested cross-validation scores with any sklearn-like model or pipeline. Most importantly, they allow for intra-fold dependencies during cross-validation (i.e. nested cross-validation with GroupKFold or similar).
Functionally, these methods act as simple scoring tracers to ease readability of evaluation metrics.
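For context, this is the vanilla scikit-learn pattern that the wrappers streamline (cross-learn's own signatures are not shown here; the group labels below are hypothetical toy data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Toy data: 100 samples belonging to 10 hypothetical groups
X, y = make_classification(n_samples=100, random_state=0)
groups = np.repeat(np.arange(10), 10)

# Grouped cross-validation in plain scikit-learn: `groups` must be
# threaded through by hand, and repeating this inside a nested
# model-selection loop is where it gets awkward.
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X, y,
    cv=GroupKFold(n_splits=5),
    groups=groups,
)
print(len(scores))  # one score per outer fold
```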
The transformers module revises some vanilla sklearn transformers with new functionality:
- DropColin: Unsupervised filtering of linearly correlated features.
- DropColinCV: Cross-validated extension of DropColin.
- DropByMissingRate: Filters out features missing more than a predefined threshold.
- DropByMissingRateCV: Cross-validated extension of DropByMissingRate.
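To give an idea of the DropColin behavior, here is a minimal sketch of an unsupervised correlation filter written against the standard scikit-learn estimator API. This is an illustrative re-implementation of the concept, not cross-learn's actual code, and the class and parameter names are made up:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Drops features whose absolute Pearson correlation with an
    already-kept feature exceeds `threshold`.

    Illustrative sketch only; DropColin's real interface may differ.
    """

    def __init__(self, threshold=0.95):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        corr = np.abs(np.corrcoef(X, rowvar=False))
        keep = []
        for j in range(X.shape[1]):
            # Keep column j only if it is not too correlated
            # with any column we have already decided to keep.
            if all(corr[j, k] <= self.threshold for k in keep):
                keep.append(j)
        self.keep_ = np.array(keep)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.keep_]

# Usage: the duplicated column gets filtered out
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X_dup = np.hstack([X, X[:, :1]])  # 4th column duplicates the 1st
X_filtered = CorrelationFilter(threshold=0.95).fit_transform(X_dup)
```

Being a standard `TransformerMixin`, a filter like this composes directly with sklearn pipelines, which is presumably what makes the CV extensions (DropColinCV, DropByMissingRateCV) convenient to tune.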
Run:
pip install "git+https://github.com/jhn-nt/cross-learn.git"
These are libraries I have been developing over the years on personal projects.
After noticing I was rewriting the same routines for the same problems time after time, I decided to write them one last time, for good.
Hopefully they will be of good use to others as well.
The code is fully scikit-learn compatible and will likely see major revisions as I come up with new ideas. I have mostly been focusing on polish and ease of use, with a great focus on typing.
Most of all, writing these libraries has been a fantastic exercise in learning to build cleaner and more reusable code.
Very open to any feedback
Cheers!