Implement Logistic Regression algorithm #2530

Merged
71 commits merged into oneapi-src:master on Nov 8, 2023

Conversation

@avolkov-intel (Contributor) commented Sep 27, 2023

Add LogisticRegression algorithm to oneDAL

Changes proposed in this pull request:

  • Implement LogisticRegression algorithm interface
  • Implement dense_batch method for solving binary classification problem
  • Add new convergence tests for newton_cg primitive
  • Add convergence tests for LogisticRegression algorithm
  • Add dpc example for LogisticRegression algorithm

- Add control over the number of iterations in while loops
- Use the l2-norm for convergence checks in the cg-solver (see the sketch below)
- Move QuadraticFunction to the primitives section
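
As a rough illustration of the convergence criterion mentioned above, here is a minimal sketch of an l2-norm based stopping check for an iterative solver. The function and parameter names (has_converged, residual, tol) are placeholders chosen for this sketch and are not part of the oneDAL newton_cg interface.

#include <cmath>
#include <cstddef>
#include <vector>

// Stop once the Euclidean (l2) norm of the residual drops below the tolerance.
bool has_converged(const std::vector<double>& residual, double tol) {
    double sq_sum = 0.0;
    for (std::size_t i = 0; i < residual.size(); ++i) {
        sq_sum += residual[i] * residual[i];
    }
    return std::sqrt(sq_sum) < tol;
}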
@avolkov-intel added the enhancement, dpc++, new algorithm, and testing labels on Sep 27, 2023
@avolkov-intel (Contributor Author)
/intelci: run

@avolkov-intel (Contributor Author)
/intelci: run

}

template <typename Float, typename Task>
static infer_result<Task> call_dal_kernel(const context_gpu& ctx,
Contributor: I suggest avoiding the call_dal_kernel name.

Contributor Author: We have it in many algorithms (in linear_regression, for example), so I do not see what the problem with the naming is.

Contributor: Because this name is not representative of what the kernel does.

namespace pr = be::primitives;

template <typename Float, typename Task>
static train_result<Task> call_dal_kernel(const context_gpu& ctx,
Contributor: The same comment applies here.

// TODO: add check if the dataset can be moved to gpu
// Move data to gpu
pr::ndarray<Float, 2> data_nd = pr::table2ndarray<Float>(queue, data, sycl::usm::alloc::device);
table data_gpu = homogen_table::wrap(data_nd.flatten(queue, {}), sample_count, feature_count);
Contributor: What is the reason for converting the GPU ndarray to a table? Is it possible to provide data_nd directly?

Resolved (outdated) review comments: cpp/oneapi/dal/algo/logistic_regression/test/fixture.hpp, cpp/oneapi/dal/algo/logistic_regression/train_types.hpp
#include "oneapi/dal/exceptions.hpp"
#include "example_util/utils.hpp"
#include <chrono>
#include <iostream>
Contributor: Is it already included in utils.hpp?

@avolkov-intel (Contributor Author)
/intelci: run

Comment on lines +46 to +50
template <typename Float>
std::int64_t propose_block_size(const sycl::queue& q, std::int64_t f, std::int64_t r) {
constexpr std::int64_t fsize = sizeof(Float);
return 0x10000l * (8 / fsize);
}
Contributor: Please change this in the future.

Contributor: It should be changed in this PR.
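
For context on the snippet above, a short reading of what the hard-coded expression evaluates to (this only restates the quoted code, it is not a proposed fix): the returned element count scales inversely with sizeof(Float), so the block covers the same 512 KiB for both float and double, and the q, f and r parameters are currently unused.

#include <cstdint>

// Mirror of the quoted heuristic, to show the constant byte footprint.
template <typename Float>
constexpr std::int64_t proposed_block_elems() {
    constexpr std::int64_t fsize = sizeof(Float);
    return 0x10000l * (8 / fsize); // 0x20000 elements for float, 0x10000 for double
}

static_assert(proposed_block_elems<float>() * sizeof(float) == 512 * 1024, "constant byte footprint");
static_assert(proposed_block_elems<double>() * sizeof(double) == 512 * 1024, "constant byte footprint");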

Comment on lines +72 to +73
pr::ndarray<Float, 1> params = pr::table2ndarray_1d<Float>(queue, betas, alloc);
pr::ndview<Float, 1> params_suf = fit_intercept ? params : params.slice(1, feature_count);
Contributor: I would recommend just using auto here; the type is already defined explicitly on the right-hand side.

Contributor Author: I'm not sure it is explicitly defined on the second line, since params_suf = params may result in the ndarray type. Maybe that is fine, but to avoid confusion I would leave it as is.
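
For readers outside this thread, here is a minimal sketch of what the slicing in the quoted lines achieves, with std::span standing in for pr::ndview; the names and layout below are assumptions made for illustration, not the oneDAL API: index 0 holds the intercept slot and the remaining feature_count entries hold the coefficients, so when the intercept is not fitted the leading slot is skipped.

#include <cstddef>
#include <span>
#include <vector>

// Hypothetical coefficient buffer: index 0 is the intercept slot,
// indices 1..feature_count are the feature coefficients.
std::span<const double> effective_params(const std::vector<double>& params,
                                         bool fit_intercept,
                                         std::size_t feature_count) {
    std::span<const double> all(params);
    // Skip the intercept slot when the model is trained without an intercept term.
    return fit_intercept ? all : all.subspan(1, feature_count);
}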

Resolved (outdated) review comments: cpp/oneapi/dal/algo/logistic_regression/common.hpp, cpp/oneapi/dal/algo/logistic_regression/test/fixture.hpp (3 threads)
Comment on lines +145 to +146
table X_train = homogen_table::wrap<float_t>(X_host_.get_mutable_data(), train_size, p_);
table X_test = homogen_table::wrap<float_t>(X_host_.get_mutable_data() + train_size * p_,
Contributor: Please just use two separate tables/ndarrays all the way through instead.

Contributor Author: I think that would lead to code duplication during generation and prediction; I see no problem in having the data stored in one array.
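
To make the layout being discussed concrete, here is a minimal sketch of the single-buffer approach, with raw pointers standing in for the wrapped tables; the variable names are hypothetical and the sizes are arbitrary: the training view starts at the beginning of the row-major buffer and the test view starts train_size * p elements later, so nothing is copied.

#include <cstddef>
#include <vector>

int main() {
    const std::size_t train_size = 100, test_size = 50, p = 4;
    // One row-major buffer holding the training rows first, then the test rows.
    std::vector<float> x_host((train_size + test_size) * p, 0.0f);

    const float* train_ptr = x_host.data();                 // rows [0, train_size)
    const float* test_ptr = x_host.data() + train_size * p; // rows [train_size, train_size + test_size)

    // Both views reference the same allocation: generation fills x_host once,
    // and training/prediction read through the two pointers.
    (void)train_ptr;
    (void)test_ptr;
    return 0;
}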

Resolved (outdated) review comment: cpp/oneapi/dal/algo/logistic_regression/test/fixture.hpp
Comment on lines +194 to +195
float_t min_train_acc = 0.95;
float_t min_test_acc = n_ < 500 ? 0.7 : 0.85;
Contributor: What does this mean?

Contributor Author: For small datasets it is harder to fit the model, so the accuracy threshold should be lower.

@avolkov-intel (Contributor Author)
/intelci: run

// TODO: add check if the dataset can be moved to gpu
// Move data to gpu
pr::ndarray<Float, 2> data_nd = pr::table2ndarray<Float>(queue, data, sycl::usm::alloc::device);
table data_gpu = homogen_table::wrap(data_nd.flatten(queue, {}), sample_count, feature_count);
Contributor: Define alldeps at once.

@avolkov-intel (Contributor Author)
/intelci: run

@avolkov-intel (Contributor Author)
/intelci: rerun

@ethanglaser (Contributor) left a comment
Please address the Bazel CPU test failure before merge.

@ethanglaser (Contributor)
Would it make more sense to put this in algo or in primitives?

Contributor: Generally this is probably the only BUILD file needed within logistic_regression if there are no tests outside of logistic_regression/tests.

@ethanglaser (Contributor) left a comment
Looks good, left a few comments. CI is in good shape.

@avolkov-intel merged commit 286f2f7 into oneapi-src:master on Nov 8, 2023
13 checks passed
Labels: dpc++, enhancement, new algorithm, testing
9 participants