-
I agree, the time complexity of oversampling techniques is somewhat unexplored, although some runtime measurements are incorporated. There was an extensive evaluation (shared in the corresponding papers), and based on the average runtimes over 104 datasets a ranking of the oversampling techniques is available. For example, if one is interested in the 10 quickest techniques overall, they can be queried as

```python
import smote_variants as sv

# get the 10 quickest oversamplers
oversamplers = sv.get_all_oversamplers(n_quickest=10)
```

Although this is not a true time complexity analysis, it can still be used to query computationally efficient techniques for further research or application purposes. Nevertheless, a proper time complexity analysis varying the number of majority and minority samples, the number of features, imbalance ratios, class overlap, etc. would be very useful. Regarding the noise filters, they are not intended to be primary or necessary steps in oversampling pipelines, but a similar analysis of them could still be useful.
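For illustration, here is a minimal sketch of how the queried techniques might be used downstream, assuming the classes returned by `get_all_oversamplers` expose the usual `sample(X, y)` interface of smote_variants; the dataset is a synthetic placeholder:

```python
import smote_variants as sv
from sklearn.datasets import make_classification

# synthetic imbalanced dataset as a placeholder
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

# query the 10 quickest oversamplers and apply the first one
oversamplers = sv.get_all_oversamplers(n_quickest=10)
oversampler = oversamplers[0]()          # instantiate with default parameters
X_samp, y_samp = oversampler.sample(X, y)

print(oversampler.__class__.__name__, X_samp.shape, y_samp.shape)
```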
-
For scikit-learn, some have created tools for profiling latency (model fitting time) against error.
The Scitime estimator is useful for estimating the training time of some scikit-learn algorithms, but not all of them.
It would be useful to benchmark and measure the time complexity of the oversamplers and see which ones are fast (or not) as a function of the dataset size and the log-odds of the majority class proportion.
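A rough sketch of such a benchmark loop, assuming smote_variants' `get_all_oversamplers(n_quickest=...)` and `sample(X, y)` interface; the dataset sizes and minority fractions below are arbitrary illustrative choices:

```python
import time
import smote_variants as sv
from sklearn.datasets import make_classification

# grid of dataset sizes and minority-class proportions (illustrative values only)
sizes = [1_000, 5_000, 20_000]
minority_fracs = [0.3, 0.1, 0.02]

for oversampler_cls in sv.get_all_oversamplers(n_quickest=5):
    for n in sizes:
        for frac in minority_fracs:
            # generate a synthetic imbalanced dataset of the given size and ratio
            X, y = make_classification(n_samples=n, n_features=20,
                                       weights=[1 - frac, frac],
                                       random_state=42)
            start = time.perf_counter()
            oversampler_cls().sample(X, y)
            elapsed = time.perf_counter() - start
            print(f"{oversampler_cls.__name__:30s} n={n:6d} "
                  f"minority={frac:.2f} time={elapsed:.3f}s")
```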