Title | Text |
---|---|
Intro | In 2021, a Nigerian car insurance company ran a three-month competition on Zindi, an African data science competition platform. The organizer wanted to predict whether or not a client would submit a vehicle insurance claim in the next 3 months. More than 600 competitors took part. |
Data | The dataset consisted of Train (12000 rows), Test (1200 rows), Sample_Submition, and Nigerian_State_LGA_Name. |
Metrics | Submissions were evaluated with F1_score. |
ML Task | Binary Classification task. |
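
Because the metric is F1 on an unbalanced binary target, a model that predicts "no claim" for everyone scores zero even though its accuracy looks fine. The snippet below is a minimal pure-Python sketch of the metric; the function name `f1_score_binary` and the toy labels are made up for illustration and do not come from the competition.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up): 8 non-claims, 2 claims.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
all_negative = [0] * 10
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(f1_score_binary(y_true, all_negative))  # 0.0, although accuracy is 0.8
print(f1_score_binary(y_true, one_hit))       # 0.5
```
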
- The dataset was unbalanced.
- It had missing values in some columns.
- The `Age` column had outliers.
- Despite distinct IDs, duplicated rows existed.
- The names in the State and LGA columns were incorrect.
- Some duplicated rows had different targets.
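
The problems listed above can be checked with a few pandas calls. This is a sketch on a toy frame, not the competition data: the column names `ID`, `Age`, `State` and `target` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame; the column names and values are assumptions.
df = pd.DataFrame({
    "ID":     ["a1", "a2", "a3", "a4", "a5"],
    "Age":    [34, 34, 51, 51, -27],
    "State":  ["Lagos", "Lagos", "Kano", "Kano", "Abia"],
    "target": [0, 0, 1, 0, 0],
})

feature_cols = [c for c in df.columns if c not in ("ID", "target")]

# Rows that are identical once the unique ID is ignored.
exact_dupes = df[df.duplicated(subset=feature_cols + ["target"], keep=False)]

# Groups with identical features but more than one distinct target.
# Note: groupby skips rows whose key columns contain NaN by default.
n_targets = df.groupby(feature_cols)["target"].transform("nunique")
conflicting = df[n_targets > 1]

# Class balance, missing values per column, and impossible ages.
class_share = df["target"].value_counts(normalize=True)
missing_per_column = df.isna().sum()
negative_ages = df[df["Age"] < 0]

print(exact_dupes["ID"].tolist())   # IDs of exact duplicates
print(conflicting["ID"].tolist())   # IDs of rows with conflicting targets
```
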
- I used the RandomOverSampler algorithm to oversample the minority class.
- I tried imputing NaNs with Iterative-Imputer and KNN-Imputer.
- I took the absolute value of Age to fix negative values.
- Deleting the duplicated rows lowered my F1_score on the public LB, so I kept them. The private LB later showed that I should have deleted them.
- Interestingly, I used the Nigerian_State_LGA_Name dataset to correct the names in the LGA and State columns.
- Again, I did not fix the duplicated rows with different targets.
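
The cleaning steps above might look roughly like the following. It is a sketch under stated assumptions: the column names and toy values are invented, duplicates are dropped (the choice the private LB favoured), and the oversampling is a plain pandas stand-in for `RandomOverSampler` from imbalanced-learn, which offers the same idea through `fit_resample`. Splitting before oversampling is my own choice here, made so that copies of one row cannot land in both the training and validation parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; the column names are assumptions, not the competition schema.
df = pd.DataFrame({
    "ID": [f"id{i}" for i in range(12)],
    "Age": [25, 25, -40, 31, 52, 47, 38, 29, 60, 33, 45, 36],
    "target": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
})

# 1. Negative ages: take the absolute value.
df["Age"] = df["Age"].abs()

# 2. Drop rows that are identical apart from the ID.
df = df.drop_duplicates(subset=[c for c in df.columns if c != "ID"])

# 3. Split first, then oversample the training part only.
train, valid = train_test_split(
    df, test_size=0.25, stratify=df["target"], random_state=0
)

majority_n = train["target"].value_counts().max()
parts = []
for _, grp in train.groupby("target"):
    # Keep every original row and add random copies up to the majority size.
    extra = grp.sample(n=majority_n - len(grp), replace=True, random_state=0)
    parts.append(pd.concat([grp, extra]))

# Shuffle so the classes are not stored in blocks.
train_balanced = pd.concat(parts).sample(frac=1.0, random_state=0)
print(train_balanced["target"].value_counts())
```
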
- I did not pay attention to scaling, transforming, or feature selection, which led to overfitting.
- Rather than following ML rules, I followed what the public LB told me about the duplicated rows.
- I did not use stacking or boosting ensembles efficiently.
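
One way to rely less on the public LB is to score each change with stratified cross-validation on the training data, using the competition metric. The sketch below assumes synthetic data from `make_classification` and a scaled logistic regression as a stand-in model; neither is what was used in the competition.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data: roughly 10% positives.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
# and never sees the held-out fold.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```

Comparing the mean and spread of `scores` with and without a preprocessing step (for example, with and without the duplicated rows) gives a second opinion that does not depend on the leaderboard.
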
- CatBoost for binary classification.
- Iterative-Imputer with ExtraTrees for imputing missing values, after label-encoding the categorical columns.
- RandomOverSampler for oversampling the minority class.
- Others.
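
The imputation step in the list above can be sketched with scikit-learn's `IterativeImputer` wrapped around an `ExtraTreesRegressor`. The toy columns `Gender` and `Car_Category` are assumptions, the label encoding is done with pandas category codes, and rounding the imputed codes back to categories is a simplification rather than the exact procedure used in the competition.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import IterativeImputer

# Toy data with gaps; column names and values are assumptions.
df = pd.DataFrame({
    "Age": [25, 40, np.nan, 31, 52, 47, np.nan, 29],
    "Gender": ["Male", "Female", "Male", np.nan,
               "Female", "Male", "Female", np.nan],
    "Car_Category": ["Saloon", "JEEP", "Saloon", "Saloon",
                     np.nan, "JEEP", "JEEP", "Saloon"],
})
cat_cols = ["Gender", "Car_Category"]

# Label-encode categoricals as float codes, keeping NaN where data is missing.
categories = {}
encoded = df.copy()
for col in cat_cols:
    as_cat = df[col].astype("category")
    categories[col] = as_cat.cat.categories
    codes = as_cat.cat.codes.astype(float)  # missing values are coded as -1
    encoded[col] = codes.where(codes >= 0)  # turn -1 back into NaN

# Each column with gaps is modelled from the other columns in turn.
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
imputed = pd.DataFrame(
    imputer.fit_transform(encoded),
    columns=encoded.columns,
    index=encoded.index,
)

# Map imputed codes back to labels (round, then clip to the valid range).
for col in cat_cols:
    codes = imputed[col].round().clip(0, len(categories[col]) - 1).astype(int)
    imputed[col] = categories[col].take(codes.to_numpy()).to_numpy()

print(imputed)
```
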