Releases: morousg/cvGPUSpeedup

Alpha-0.0.16 (Parallel Christmas)

26 Dec 11:36
253ccc5

This release brings significant syntactic sugar enhancements:

  1. Operation Builders:
    • All Operations now have a static build() function that returns an instance of the proper InstantiableOperation type.
    • Some default build() functions have been defined, and you can add overloads of the function to make an Operation easier to use.
    • All Read and ReadBack Operations now have a default build_batch() function, which helps construct Batch InstantiableOperations.
  2. DeviceFunctions renamed to InstantiableOperations
    • Since we wanted to shift the user's attention towards the Operation structs, we think the InstantiableOperation name is more adequate and avoids introducing more "concept" overhead. In the end, DeviceFunctions (DFs) are just the way to make Operations instantiable, using static polymorphism, hence the name InstantiableOperation.
  3. InstantiableOperation then() function:
    • Continuing with the syntactic sugar features, you can now create a transform kernel using just Operation types, their new static build() function, and the then() function of the resulting InstantiableOperation. Here is an example:
const auto myOperationChain = PerThreadRead<_2D,uchar3>::build(params...).then(Cast<uchar3,float3>::build()).then(Mul<float>::build(3.3f));
  4. All Operations are now compatible with both CPU and GPU:
    • Saturate has been refactored to use GPU or CPU code, using some macros to give the compiler the proper code.
    • Many of the implementations are now also constexpr.
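
The build() and then() style described above can be illustrated with a minimal, self-contained C++17 sketch. It is not the library's implementation: Instantiable, Chain, CastToFloat, MulFloat, and runExample are hypothetical stand-ins, and the chain runs on the CPU over a single value instead of inside a CUDA kernel. It only shows how static polymorphism lets each then() call return a new type that records the whole sequence of operations at compile time.

```cpp
#include <cstddef>
#include <tuple>

// Hypothetical chain of instantiable operations (not the library's type).
template <typename... Ops>
struct Chain {
    std::tuple<Ops...> ops;

    constexpr explicit Chain(const std::tuple<Ops...>& o) : ops(o) {}

    // Append one more operation; the result is a new, longer Chain type.
    template <typename Next>
    constexpr auto then(const Next& next) const {
        return Chain<Ops..., Next>(std::tuple_cat(ops, std::make_tuple(next)));
    }

    // Apply operation I to the value, then recurse into operation I + 1.
    template <std::size_t I, typename T>
    constexpr auto applyFrom(const T& value) const {
        if constexpr (I == sizeof...(Ops)) {
            return value;
        } else {
            return applyFrom<I + 1>(std::get<I>(ops).exec(value));
        }
    }

    // Run every operation in order on the input.
    template <typename T>
    constexpr auto operator()(const T& input) const {
        return applyFrom<0>(input);
    }
};

// Hypothetical wrapper that pairs an Operation type with its parameters.
template <typename Op>
struct Instantiable {
    typename Op::Params params;

    template <typename T>
    constexpr auto exec(const T& input) const {
        return Op::exec(input, params);
    }

    template <typename Next>
    constexpr auto then(const Next& next) const {
        return Chain<Instantiable<Op>, Next>(std::make_tuple(*this, next));
    }
};

// Two toy Operations, each exposing a static build() function.
struct CastToFloat {
    struct Params {};
    static constexpr float exec(int input, const Params&) {
        return static_cast<float>(input);
    }
    static constexpr auto build() {
        return Instantiable<CastToFloat>{Params{}};
    }
};

struct MulFloat {
    struct Params { float factor; };
    static constexpr float exec(float input, const Params& p) {
        return input * p.factor;
    }
    static constexpr auto build(float factor) {
        return Instantiable<MulFloat>{Params{factor}};
    }
};

// The whole chain is a compile-time constant.
constexpr auto kChain =
    CastToFloat::build().then(MulFloat::build(2.0f)).then(MulFloat::build(1.5f));
static_assert(kChain(4) == 12.0f);

// Runtime entry point for the same chain.
inline float runExample(int value) { return kChain(value); }
```

Because every then() call changes the type of the result, the compiler sees the full operation sequence and can inline it, which is the property a fused transform kernel relies on.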

In future releases we plan to work on:

  1. Adding a CPU Transform Data Parallel Pattern that uses both CPU threads and AVX instructions.
  2. Making all operations constexpr, using constexpr open-source math libraries. We think this can be much better for performance than assembly code snippets, which might be the fastest option for a single operation but break constexpr, which in some cases can be far worse for overall performance.
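
To illustrate why constexpr matters here, the following sketch shows a saturating float-to-byte conversion that compiles for both CPU and GPU and can be folded at compile time. It is a hedged example, not the library's Saturate code: the HOST_DEVICE macro and the saturateToU8 function are hypothetical names, and rounding is done by adding 0.5 before truncation, which is an assumption of this sketch.

```cpp
#include <cstdint>

// Hypothetical macro: expands to CUDA qualifiers only when compiled by nvcc.
#if defined(__CUDACC__)
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Clamp a float to [0, 255] and round to the nearest integer.
// NaN inputs are not handled in this sketch.
HOST_DEVICE constexpr std::uint8_t saturateToU8(float v) {
    const float clamped = v < 0.0f ? 0.0f : (v > 255.0f ? 255.0f : v);
    return static_cast<std::uint8_t>(clamped + 0.5f);
}

// With constant inputs the compiler evaluates the call at compile time.
static_assert(saturateToU8(-3.0f) == 0);
static_assert(saturateToU8(300.0f) == 255);
static_assert(saturateToU8(127.4f) == 127);
```

An inline assembly version of the same conversion could not appear in a static_assert or any other constant expression, so the compiler would lose the chance to fold it.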

NOTES:

Starting with this release, some tests can only be compiled with CUDA 12.1 to 12.3. Earlier CUDA versions limit the total size of the kernel parameters to 4KB, and CUDA 12.1 increased that limit to 32KB. CUDA versions 12.4 to 12.6 have an nvcc compiler bug that affects some unit tests. This bug will be solved in a future nvcc release.
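
As a rough illustration of the kernel parameter limit, the sketch below checks at compile time whether a set of parameter types fits under each limit. All names (PlaneDescriptor, BatchParams, totalParamBytes, and the two constants) are hypothetical, the 32764-byte value is our reading of NVIDIA's documentation for CUDA 12.1 on Volta and newer GPUs, and the sum ignores alignment padding between parameters.

```cpp
#include <cstddef>

// Assumed limits, in bytes, for the total size of kernel parameters.
constexpr std::size_t kLegacyParamLimitBytes = 4096;    // before CUDA 12.1
constexpr std::size_t kExtendedParamLimitBytes = 32764; // CUDA 12.1 and later (assumption)

// Hypothetical batch parameters: 100 planes of 48 bytes each, 4800 bytes in total.
struct PlaneDescriptor { unsigned char bytes[48]; };
struct BatchParams { PlaneDescriptor planes[100]; };

// Sum the sizes of all parameter types (padding between them is ignored).
template <typename... Params>
constexpr std::size_t totalParamBytes() {
    return (sizeof(Params) + ... + std::size_t{0});
}

template <typename... Params>
constexpr bool fitsLegacyLimit() {
    return totalParamBytes<Params...>() <= kLegacyParamLimitBytes;
}

template <typename... Params>
constexpr bool fitsExtendedLimit() {
    return totalParamBytes<Params...>() <= kExtendedParamLimitBytes;
}

// A batch like this one is too large for the old limit but fits the new one.
static_assert(!fitsLegacyLimit<BatchParams>());
static_assert(fitsExtendedLimit<BatchParams>());
```

A check of this kind shows why batch-heavy tests, which pass many operation parameters by value, need the larger limit.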

Alpha-0.0.15

17 Jul 10:42

Micro release to include make_tuple functionality

Alpha-0.0.14

20 Jun 09:33
5690cfa

The main contribution of this release is the formal definition of an API to perform Backwards Generic Vertical Fusion.

The result is the possibility to create more complex sequences of DeviceFunctions that start to look like compile-time graphs, though they are not there yet. There is more to come in this regard.

Alpha-0.0.13

15 Apr 10:23
53cf00d
Merge pull request #99 from morousg/95-change-sum-operation-name-to-a…

Alpha-0.0.12

24 Mar 15:01

Added support for CUDA 12.3
Tested with OpenCV 4.9
OpenCV is not mandatory anymore when using cmake files

Alpha-0.0.11

23 Feb 12:35
9a73a1d
Merge pull request #94 from morousg/93-adapt-code-for-easier-integrat…

Alpha-0.0.10

23 Oct 16:49
Fix: resolved a bug in resize and a compilation issue that only affected VS2017

Alpha-0.0.9

18 Sep 10:23

Fixing a bug

Alpha-0.0.8

14 Sep 16:28

Added configurable CircularTensorOrder

Alpha-0.0.7

13 Sep 18:07
b985cb1

New TensorT (Transposed) available, also in CircularTensor