scikit-learn_bench

scikit-learn_bench benchmarks various implementations of machine learning algorithms across data analytics frameworks. It can be extended with new frameworks and algorithms, and it currently supports the scikit-learn, daal4py, cuML, and XGBoost frameworks for commonly used machine learning algorithms.

See benchmark results here.

Prerequisites

  • python and scikit-learn to run the Python versions of the benchmarks
  • pandas when using its DataFrame as the input data format
  • icc, ifort, mkl, and daal to compile and run the native benchmarks
  • the machine learning frameworks you want to test (see the next section for how to set up an environment for each)

How to create conda environment for benchmarking

Create a suitable conda environment for each framework you want to test. The repository provides per-framework instructions for creating an appropriate conda environment.
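
A minimal sketch of what such an environment might look like; the channel, package set, and Python version here are illustrative assumptions, so follow the per-framework instructions for the exact setup:

# illustrative only: channel and package names are assumptions,
# not the repository's authoritative instructions
conda create -n bench -c intel python=3.7 scikit-learn daal4py pandas
conda activate bench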

Running Python benchmarks with runner script

Launch benchmarks with the runner script:

python runner.py --config configs/config_example.json [--output-format json --verbose]

Runner options (example invocations follow this list):

  • config: path to the configuration file
  • dummy-run: run the configuration parser and dataset generation without running the benchmarks
  • verbose: print additional information while the benchmarks run
  • output-format: json or csv; the output format for benchmark results
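
For example, a dry run that only parses the config and generates the datasets, followed by a full run writing CSV output, might look like this (assuming dummy-run takes the same double-dash form as the other options):

# dry run: parse the config and generate datasets only (assumes the --dummy-run spelling)
python runner.py --config configs/config_example.json --dummy-run
# full run, writing CSV results with verbose logging
python runner.py --config configs/config_example.json --output-format csv --verbose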

Benchmarks currently support the following frameworks:

  • scikit-learn
  • daal4py
  • cuml
  • xgboost

The benchmark configuration lets you select the frameworks to run, the datasets to measure, and the parameters of the algorithms.

You can configure benchmarks by editing a config file. Check config.json schema for more details.
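
As a rough sketch of what a config file might contain; the field names below are illustrative assumptions, not the authoritative schema, so check the config.json schema for the actual structure:

# illustrative sketch only -- the JSON keys are assumptions; see the config.json schema
cat > configs/my_config.json <<'EOF'
{
  "common": {
    "lib": ["sklearn"],
    "data-format": ["pandas"],
    "dtype": ["float64"]
  },
  "cases": [
    {
      "algorithm": "kmeans",
      "dataset": [{"source": "synthetic", "type": "blobs"}],
      "n-clusters": [10]
    }
  ]
}
EOF
python runner.py --config configs/my_config.json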

Supported algorithms

algorithm                    benchmark name
DBSCAN                       dbscan
RandomForestClassifier       df_clfs
RandomForestRegressor        df_regr
pairwise_distances           distances
KMeans                       kmeans
KNeighborsClassifier         knn_clsf
LinearRegression             linear
LogisticRegression           log_reg
PCA                          pca
Ridge                        ridge
SVM                          svm
train_test_split             train_test_split
GradientBoostingClassifier   gbt
GradientBoostingRegressor    gbt

Algorithm parameters

You can also launch the benchmark for each algorithm separately. To do this, go to the framework directory:

cd <framework>

Run the following command:

python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
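
For example, running the KMeans benchmark in the scikit-learn implementation might look like this (the directory name, benchmark file name, and the --n-clusters flag are assumptions for illustration; only --dataset-name is documented above):

# illustrative: directory, file name, and --n-clusters are assumptions
cd sklearn
python kmeans.py --dataset-name <path to the dataset> --n-clusters 10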

The supported parameters for each algorithm are listed in the repository documentation.

Legacy automatic building and running

  • Run make. This will generate data, compile benchmarks, and run them.
    • To run only scikit-learn benchmarks, use make sklearn.
    • To run only native benchmarks, use make native.
    • To run only daal4py benchmarks, use make daal4py.
    • To run a specific implementation of a specific benchmark, directly request the corresponding file: make output/<impl>/<bench>.out.
    • If you have activated a conda environment, the build will use daal from the conda environment, if available.
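
A typical legacy session might combine the targets above like this (the specific <impl>/<bench> pair is illustrative; the targets themselves are from the list above):

# build and run only the scikit-learn benchmarks
make sklearn
# run one specific implementation of one benchmark (impl/bench names are illustrative)
make output/sklearn/kmeans.out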
