-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[enh] Refactor finiteness_checker #2669
Conversation
/intelci: run |
/intelci: run |
/intelci: run |
test fail related to rbf kernel, which doesn't use this code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general it looks good to me, but, it's better to change the order of the new functions. Like firstly avx2 then avx512. Also, please use onedal-benchmarks with the kernel profiler as well as intelex benchmarks to check what kernels have been improved. I would like to say that you will get similar performance benefits in both.
This will likely pass CI, but performance benchmarks are necessary due to the underlying changes in the CPU function dispatching. |
/intelci: run |
/intelci: run |
Things required before re-review: a privateCI run for checking avx512, and oneDAL performance benchmarks of changes to function dispatching. |
Run with an avx512 build: http://intel-ci.intel.com/ef007c41-cb1f-f115-9514-a4bf010d0e2e failures due to un-related GPU issues. |
private CI failures due to unrelated GPU/dpc timeouts |
private CI run with last sklearnex master (includes _assert_all_finite tests coming from intel/scikit-learn-intelex#1759) http://intel-ci.intel.com/ef012d7e-a408-f166-adc8-a4bf010d0e2e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the restructuring effort.
The code became much more understandable.
My final comment is that 'AVX' needs to be replaced with something else like 'SIMD'. Otherwise the code is good to go.
constexpr size_t numberOfBitsInByte = 8; | ||
constexpr size_t nPerInstr = avx2RegisterLength / (numberOfBitsInByte * sizeof(float)); | ||
services::internal::TArray<bool, avx2> notFiniteArr(nTotalBlocks); | ||
bool * notFinitePtr = notFiniteArr.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a FYI. No action is needed.
It might be better (in terms of performance) to have those kind of arrays as a 'thread local storage' (see daal::tls).
Having them as a regular arrays might lead to cache coherency issues when i-th thread updates i-th element of notFinitePtr array, and (i+1)-th thread updates (i+1)-th element of the same array. In this case both threads might need to synchronize the information in their caches as both i-th and (i+1)-th elements of the array might reside in both caches of i-th and (i+1)-th threads and can become unsynchronized after the update.
bool * localNotFinite = tlsNotFinite.local(); | ||
DAAL_CHECK_MALLOC_THR(localNotFinite); | ||
|
||
switch ((*tableFeaturesDict)[i].getIndexType()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need something that checks for integer overflow for integer features here? How is it done in array API?
ReadRows<DataType, cpu> dataBlock(table, 0, nRows); | ||
DAAL_CHECK_BLOCK_STATUS(dataBlock); | ||
const DataType * dataPtr = dataBlock.get(); | ||
|
||
sum = computeSum<DataType, cpu>(1, nElements, &dataPtr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another FYI. No action is required.
It is better in terms of both performance and memory consumption to move ReadRows inside the parallel for that happens in computeSum.
Because ReadRows in some cases might copy the whole table and convert it -> +1 pass through the memory.
When ReadRows is made by blocks and is combined with the computational part it is usually more efficient.
/intelci: run |
Rerun due to CI timeouts: http://intel-ci.intel.com/ef01f546-5586-f1d1-863c-a4bf010d0e2e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
Introduce avx2 ISA intrinsics for finiteness_checker. Duplicate AVX512 functions for AVX2 by switching certain numbers to half size and changing instructions from 512 to 256 bit width. Due to the hardcoded nature of the functions, it is not easily templated out without performance loss. This implementation should improve sklearnex performance on standard benchmarks.
Changes proposed in this pull request:
__DAAL_KERNEL_MIN
to daal_kernel_defines.hDAAL_DISPATCH_FUNCTION_BY_CPU
andDAAL_DISPATCH_FUNCTION_BY_CPU_SAFE
to properly select other ISAsTasks