-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spectral embedding algorithm #2875
base: main
Are you sure you want to change the base?
Add spectral embedding algorithm #2875
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not bad for the initial version.
Please decompose the compute function into smaller kernels. I'd move binary search part and Laplassian part into separate functions.
My other comments are below.
cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_fpt_cpu.cpp
Outdated
Show resolved
Hide resolved
template <typename algorithmFPType, CpuType cpu> | ||
services::Status computeEigenvectorsInplace(size_t nFeatures, algorithmFPType * eigenvectors, algorithmFPType * eigenvalues) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add arguments description.
cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i
Outdated
Show resolved
Hide resolved
cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i
Outdated
Show resolved
Hide resolved
cpp/daal/src/algorithms/spectral_embedding/spectral_embedding_default_dense_impl.i
Outdated
Show resolved
Hide resolved
size_t numEmb = 1; | ||
size_t numNeighbors = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In DAAL we try to use full names where possible.
size_t numEmb = 1; | |
size_t numNeighbors = 1; | |
size_t numberOfEmbeddings = 1; | |
size_t numberOfNeighbors = 1; |
cpp/oneapi/dal/algo/spectral_embedding/backend/cpu/compute_kernel.cpp
Outdated
Show resolved
Hide resolved
std::int64_t embedding_dim = 0; | ||
std::int64_t num_neighbors = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's align with scikit and use components
for the dimension of projected space. Also, in oneDAL we usually use count
, not dim
or num
in naming.
In DAAL it would be numberOfComponents
, etc.
std::int64_t embedding_dim = 0; | |
std::int64_t num_neighbors = -1; | |
std::int64_t component_count = 0; | |
std::int64_t neighbor_count = -1; |
void check_compute_result(const spectral_embedding::compute_result<>& result) { | ||
array<Float> data_arr = row_accessor<const Float>(data_).pull({ 0, -1 }); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comparison with 'golden' data computed with sklearn.
auto desc = | ||
get_descriptor(sp_emb::result_options::embedding); | ||
//desc.set_embedding_dim(5); | ||
//desc.set_num_neighbors(4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's compute the number of neighbors by the formula provided in the original request.
/intelci: run |
2 similar comments
/intelci: run |
/intelci: run |
b0c3428
to
c3318e9
Compare
/intelci: run |
/intelci: run |
/intelci: run |
1 similar comment
/intelci: run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of the blue questions about the algo implementation. Nothing to be forced, I just want to start a conversation with you and the other reviewers.
// Use binary search to find such d that the number of verticies having distance <= d is filtNum | ||
const size_t binarySearchIterNum = 20; | ||
// TODO: add parallel_for | ||
for (size_t i = 0; i < n; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would split off the BinarySearch into a separate function which could be inlined. (i.e. everything in this for loop). Then applying the daal::threader_for would be easier (something like this as an example https://github.com/oneapi-src/oneDAL/blob/main/cpp/daal/src/algorithms/dtrees/forest/df_train_dense_default_impl.i#L434)
x[i * n + i] = 0; | ||
} | ||
|
||
// Create Laplassian matrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Create Laplassian matrix | |
// Create Laplacian matrix |
// std::cout << std::endl; | ||
// } | ||
|
||
// Fill the output matrix with eigen vectors corresponding to the smallest eigen values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Fill the output matrix with eigen vectors corresponding to the smallest eigen values | |
// Fill the output matrix with eigenvectors corresponding to the smallest eigenvalues |
DAAL_CHECK_BLOCK_STATUS(embedMatrix); | ||
algorithmFPType * embed = embedMatrix.get(); | ||
|
||
for (int i = 0; i < k; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if this a dumb request, could you leave a comment in the code above this double for loop as to the matrix operation is doing? I know its related to the eigenvectors out of X, but why into the columns of embed? May save some time in the future for someone unfamiliar when they try to get up to speed. I can see its the transpose copy of part of a row of x into a column of embed, is that right?
{ | ||
M = (L + R) / 2; | ||
cnt = 0; | ||
// Calculate the number of elements in the row with value <= M |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just from a high level perspective, it looks like this code will run O(binarySearchIterNum*n) to try and find the proper M which yields the right number of neighbors. binarySearchIterNum will be log(1/eps) (from here https://github.com/oneapi-src/oneDAL/pull/2875/files#r1720389061) Would a binary search tree be faster? At some point also wouldn't just sorting the array may also be faster (when log(n) < log(1/eps))?
} | ||
|
||
// Create Laplassian matrix | ||
for (size_t i = 0; i < n; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dumb question, I sort assumed the nearest cosine neighbors from above created almost an adjacency matrix which could be used here to create the laplacian via L=D-A. I guess I am trying to understand the creation math here, could you send me a link what definition you used here?
DAAL_CHECK_BLOCK_STATUS(embedMatrix); | ||
algorithmFPType * embed = embedMatrix.get(); | ||
|
||
for (int i = 0; i < k; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also a todo for a parallel for here?
|
||
size_t lcnt, rcnt, cnt; | ||
algorithmFPType L, R, M; | ||
// Use binary search to find such d that the number of verticies having distance <= d is filtNum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Use binary search to find such d that the number of verticies having distance <= d is filtNum | |
// Use binary search to find such d that the number of vertices having distance <= d is filtNum |
break; | ||
} | ||
} | ||
// create edges for the closest neighbors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess now that I think about it this is getting in the direction of this sklearn function: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.kneighbors_graph.html Thoughts on using kneighbors from daal for this, or am I missing something big?
status |= computeEigenvectorsInplace<algorithmFPType, cpu>(n, x, eigenValuesPtr); | ||
DAAL_CHECK_STATUS_VAR(status); | ||
|
||
// std::cout << "Eigen vectors: " << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a reminder to delete unused code
{ | ||
algorithmFPType val = (x[i * n + j] + x[j * n + i]) / 2; | ||
x[i * n + j] = -val; | ||
x[j * n + i] = -val; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little bitconfused with inconsistency - in this case you are assigning values to the elements of matrix, but for the other part you are incrementing the symmetric ones. Does this matrix represent two triangular ones or should operations be actually symmetric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am/was too. Hopefully a bit of clarity to help in a code comment.
DAAL_CHECK_BLOCK_STATUS(embedMatrix); | ||
algorithmFPType * embed = embedMatrix.get(); | ||
|
||
for (int i = 0; i < k; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be size_t
instead?
constexpr std::int64_t neighbor_count = 5; | ||
constexpr std::int64_t component_count = 4; | ||
|
||
constexpr Float data[n * p] = { 0.49671415, -0.1382643, 0.64768854, 1.52302986, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this data is auto-generated. Please consider using rng to generate it.
{ | ||
if (x[i * n + j] <= R) | ||
{ | ||
x[i * n + j] = 1.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x[i * n + j] = 1.0; | |
x[i * n + j] = algorithmFPType(1); |
} | ||
else | ||
{ | ||
x[i * n + j] = 0.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x[i * n + j] = 0.0; | |
x[i * n + j] = algorithmFPType(0); |
WriteRows<algorithmFPType, cpu> xMatrix(covOutput, 0, n); | ||
DAAL_CHECK_BLOCK_STATUS(xMatrix); | ||
algorithmFPType * x = xMatrix.get(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that according to SDL somewhere around here should be several DAAL_ASSERT_...
statements to check if your access to the memory is contained.
size_t lcnt, rcnt, cnt; | ||
algorithmFPType L, R, M; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend to initialize these variables. It's not a C code =D
L = 0; // min possible cos distance | ||
R = 2; // max possible cos distance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I remember at least some of these variables are actually float
s. Should we initialize them accordingly?
|
||
struct KernelParameter : daal::algorithms::Parameter | ||
{ | ||
size_t numberOfEmbeddings = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be DAAL_INT
? Personally I have nothing against using size_t
just caring for the consistency.
#include "oneapi/dal/algo/spectral_embedding.hpp" | ||
#include "oneapi/dal/io/csv.hpp" | ||
#include <algorithm> | ||
#include <math.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include <math.h> | |
#include <cmath> |
Please use a regular C++ headers
Description
Add implementation for spectral embedding algorithm in DAAL and oneDAL interfaces for this algorithm