Releases: morousg/cvGPUSpeedup
Alpha-0.0.16 (Parallel Christmas)
This release brings significant syntactic-sugar enhancements:
- Operation Builders:
- All Operations now have a static build() function that returns an instance of the proper InstantiableOperation type.
- Default build() functions are provided, and overloads can be added to ease the usage of each Operation.
- All Read and ReadBack Operations now have a default build_batch() function to help with the construction of batch InstantiableOperations.
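The builder pattern described above can be sketched as follows. This is an illustrative, simplified sketch: the type names, `Params` member, and `exec()` signature are assumptions for the example, not the library's actual definitions.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the library's InstantiableOperation wrapper:
// it captures the parameters chosen when build() is called.
template <typename Op>
struct InstantiableOperation {
    typename Op::Params params;
};

// Example Operation struct with a static build() factory, in the spirit
// of the release notes (names are illustrative).
struct MulFloat {
    using Params = float;
    // Default builder: returns an instantiable instance holding the factor.
    static InstantiableOperation<MulFloat> build(float factor) {
        return {factor};
    }
    static float exec(float input, float factor) { return input * factor; }
};

// A build_batch-style helper: replicate one configured operation per batch plane.
template <typename Op, std::size_t BATCH>
std::array<InstantiableOperation<Op>, BATCH> build_batch(typename Op::Params p) {
    std::array<InstantiableOperation<Op>, BATCH> ops{};
    for (auto& op : ops) op.params = p;
    return ops;
}
```

The point of the pattern is that user code only names Operation types and calls `build()`; the wrapper type is deduced, so overloads of `build()` can offer progressively simpler ways to configure the same Operation.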
- DeviceFunctions renamed to InstantiableOperations
- Since we wanted to shift the user's attention towards the Operation structs, we think the InstantiableOperation name is more adequate and avoids introducing more conceptual overhead. In the end, DF's are just the way to make Operations instantiable, using static polymorphism, hence the name InstantiableOperation.
- InstantiableOperation then() function:
- Continuing with the syntactic-sugar features, you can now create a transform kernel by just using Operation types and their new build() and then() functions. Here is an example:
```cpp
const auto myOperationChain = PerThreadRead<_2D, uchar3>::build(params...)
                                  .then(Cast<uchar3, float3>::build())
                                  .then(Mul<float>::build(3.3f));
```
- All Operations are now compatible with both CPU and GPU:
- Saturate has been refactored to use GPU or CPU code, using some macros to give the compiler the proper code.
- Many of the implementations are now also constexpr.
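The macro technique mentioned above can be sketched like this. The macro name and the example operation are assumptions for illustration, not the library's actual identifiers: the idea is that one macro expands to `__host__ __device__` under nvcc and to nothing in a plain CPU build, so a single constexpr implementation serves both targets.

```cpp
// Hypothetical macro (illustrative name): qualifies functions for both the
// host and the device when compiled with nvcc, and compiles as plain C++
// otherwise.
#ifdef __CUDACC__
#define FK_HOST_DEVICE __host__ __device__
#else
#define FK_HOST_DEVICE
#endif

// Example saturating cast in the style of the refactored Saturate operation
// (simplified): constexpr, so it is usable at compile time on the host and
// at run time on either CPU or GPU.
struct SaturateCastUcharExample {
    FK_HOST_DEVICE static constexpr unsigned char exec(int input) {
        return static_cast<unsigned char>(input < 0 ? 0 : (input > 255 ? 255 : input));
    }
};

// Because exec() is constexpr, saturation can even be checked at compile time.
static_assert(SaturateCastUcharExample::exec(300) == 255, "saturates above 255");
static_assert(SaturateCastUcharExample::exec(-5) == 0, "saturates below 0");
```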
In future releases we plan to work on:
- Adding a CPU Transform Data Parallel Pattern that uses both CPU threads and AVX instructions.
- Making all operations constexpr, using constexpr open-source math libraries. We think this can be much better for performance than having assembler code snippets, which might be the fastest for a single operation but break constexpr, which in some cases can be far worse for performance.
NOTES:
Starting with this release, some tests can only be compiled with CUDA 12.1 to 12.3. This is due to the 4KB limit that earlier CUDA versions place on the number of bytes passed as kernel parameters; CUDA 12 increased that limit to 32KB. CUDA versions 12.4 to 12.6 have an nvcc compiler bug that affects some unit tests. This bug will be solved in a future nvcc release.
Alpha-0.0.15
Micro release to include make_tuple functionality
Alpha-0.0.14
The main contribution of this release is the formal definition of an API to perform Backwards Generic Vertical Fusion.
The result is the possibility to create more complex sequences of DeviceFunctions that start to look like compile-time graphs, though not quite yet. There is more to come in this regard.
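The general idea behind vertical fusion can be sketched as below. This is a minimal, hypothetical illustration of the concept only (the function names are invented, and the real API fuses operations at compile time into a single CUDA kernel): instead of running each operation as a separate pass with intermediate buffers, a fused chain applies every operation to an element while it is still in registers.

```cpp
#include <vector>

// Base case: an empty chain returns the value unchanged.
template <typename T>
T apply_chain(T v) { return v; }

// Recursive case: apply the first operation, then feed its result to the rest.
template <typename T, typename Op, typename... Ops>
T apply_chain(T v, Op op, Ops... rest) {
    return apply_chain(op(v), rest...);
}

// One pass over the data executes the whole chain per element, avoiding
// one intermediate buffer (and one memory round trip) per operation.
template <typename T, typename... Ops>
void fused_transform(std::vector<T>& data, Ops... ops) {
    for (auto& v : data) v = apply_chain(v, ops...);
}
```

For example, `fused_transform(data, mul2, add1)` multiplies and adds in a single sweep; a compile-time graph generalizes this so chains can also branch and read backwards through previously defined operations.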
Alpha-0.0.13
Merge pull request #99 from morousg/95-change-sum-operation-name-to-a…
Alpha-0.0.12
Added support for CUDA 12.3
Tested with OpenCV 4.9
OpenCV is not mandatory anymore when using cmake files
Alpha-0.0.11
Merge pull request #94 from morousg/93-adapt-code-for-easier-integrat…
Alpha-0.0.10
Fix: Fixed a bug in resize and a compilation issue affecting only VS2017
Alpha-0.0.9
Fixing a bug
Alpha-0.0.8
Added configurable CircularTensorOrder
Alpha-0.0.7
New TensorT (Transposed) available also in CircularTensor