Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[enh] Refactor finiteness_checker #2669

Merged
merged 58 commits into from
Apr 24, 2024
Merged

Conversation

icfaust
Copy link
Contributor

@icfaust icfaust commented Feb 20, 2024

Description

Introduce avx2 ISA intrinsics for finiteness_checker. Duplicate AVX512 functions for AVX2 by switching certain numbers to half size and changing instructions from 512 to 256 bit width. Due to the hardcoded nature of the functions, it is not easily templated out without performance loss. This implementation should improve sklearnex performance on standard benchmarks.

Changes proposed in this pull request:

  • Add avx2 finite sum check
  • Add avx2 finiteness per element check
  • Add avx2 SOA supports
  • Move final comparison of finalMask out of for loop to reduce branching in AVX512 inf/nan check.
  • AVX2 to finiteness_checker_avx2_impl.i
  • AVX512 to finiteness_chcker_avx512_impli.
  • Common functions to finitness_checker_cpu.cpp, swap to templating
  • Expand definitions in finiteness_checker.h
  • Move __DAAL_KERNEL_MIN to daal_kernel_defines.h
  • fix bug in DAAL_DISPATCH_FUNCTION_BY_CPU and DAAL_DISPATCH_FUNCTION_BY_CPU_SAFE to properly select other ISAs

Tasks

  • Implement AVX2
  • Get it to compile
  • Green CI
  • Run sklearnex Benchmarks

@icfaust
Copy link
Contributor Author

icfaust commented Feb 20, 2024

/intelci: run

@icfaust
Copy link
Contributor Author

icfaust commented Feb 20, 2024

/intelci: run

@icfaust
Copy link
Contributor Author

icfaust commented Feb 20, 2024

/intelci: run

@icfaust icfaust marked this pull request as ready for review February 20, 2024 15:56
@icfaust
Copy link
Contributor Author

icfaust commented Feb 21, 2024

test fail related to rbf kernel, which doesn't use this code

Copy link
Contributor

@Alexandr-Solovev Alexandr-Solovev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it looks good to me, but, it's better to change the order of the new functions. Like firstly avx2 then avx512. Also, please use onedal-benchmarks with the kernel profiler as well as intelex benchmarks to check what kernels have been improved. I would like to say that you will get similar performance benefits in both.

@icfaust
Copy link
Contributor Author

icfaust commented Apr 19, 2024

This will likely pass CI, but performance benchmarks are necessary due to the underlying changes in the CPU function dispatching.

@icfaust
Copy link
Contributor Author

icfaust commented Apr 19, 2024

/intelci: run

@icfaust
Copy link
Contributor Author

icfaust commented Apr 22, 2024

/intelci: run

@icfaust
Copy link
Contributor Author

icfaust commented Apr 22, 2024

Things required before re-review: a privateCI run for checking avx512, and oneDAL performance benchmarks of changes to function dispatching.

@icfaust icfaust changed the title [enh] Add avx2 support in finiteness_checker [enh] Refactor finiteness_checker Apr 22, 2024
@icfaust
Copy link
Contributor Author

icfaust commented Apr 22, 2024

Run with an avx512 build: http://intel-ci.intel.com/ef007c41-cb1f-f115-9514-a4bf010d0e2e failures due to un-related GPU issues.

@icfaust icfaust requested a review from Vika-F April 22, 2024 08:52
@icfaust
Copy link
Contributor Author

icfaust commented Apr 23, 2024

private CI failures due to unrelated GPU/dpc timeouts

@icfaust
Copy link
Contributor Author

icfaust commented Apr 23, 2024

private CI run with last sklearnex master (includes _assert_all_finite tests coming from intel/scikit-learn-intelex#1759) http://intel-ci.intel.com/ef012d7e-a408-f166-adc8-a4bf010d0e2e

Copy link
Contributor

@Vika-F Vika-F left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the restructuring effort.
The code became much more understandable.

My final comment is that 'AVX' needs to be replaced with something else like 'SIMD'. Otherwise the code is good to go.

constexpr size_t numberOfBitsInByte = 8;
constexpr size_t nPerInstr = avx2RegisterLength / (numberOfBitsInByte * sizeof(float));
services::internal::TArray<bool, avx2> notFiniteArr(nTotalBlocks);
bool * notFinitePtr = notFiniteArr.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a FYI. No action is needed.

It might be better (in terms of performance) to have those kind of arrays as a 'thread local storage' (see daal::tls).

Having them as a regular arrays might lead to cache coherency issues when i-th thread updates i-th element of notFinitePtr array, and (i+1)-th thread updates (i+1)-th element of the same array. In this case both threads might need to synchronize the information in their caches as both i-th and (i+1)-th elements of the array might reside in both caches of i-th and (i+1)-th threads and can become unsynchronized after the update.

bool * localNotFinite = tlsNotFinite.local();
DAAL_CHECK_MALLOC_THR(localNotFinite);

switch ((*tableFeaturesDict)[i].getIndexType())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need something that checks for integer overflow for integer features here? How is it done in array API?

Comment on lines +279 to +283
ReadRows<DataType, cpu> dataBlock(table, 0, nRows);
DAAL_CHECK_BLOCK_STATUS(dataBlock);
const DataType * dataPtr = dataBlock.get();

sum = computeSum<DataType, cpu>(1, nElements, &dataPtr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another FYI. No action is required.

It is better in terms of both performance and memory consumption to move ReadRows inside the parallel for that happens in computeSum.
Because ReadRows in some cases might copy the whole table and convert it -> +1 pass through the memory.
When ReadRows is made by blocks and is combined with the computational part it is usually more efficient.

@icfaust icfaust requested a review from Vika-F April 23, 2024 14:14
@icfaust
Copy link
Contributor Author

icfaust commented Apr 23, 2024

/intelci: run

@icfaust
Copy link
Contributor Author

icfaust commented Apr 24, 2024

Rerun due to CI timeouts: http://intel-ci.intel.com/ef01f546-5586-f1d1-863c-a4bf010d0e2e

Copy link
Contributor

@Vika-F Vika-F left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@icfaust icfaust merged commit 34c4b78 into oneapi-src:main Apr 24, 2024
15 of 16 checks passed
@icfaust icfaust deleted the dev/avx2_finite branch April 24, 2024 12:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants