All notable changes to this project will be documented in this file. This project adheres to Semantic Versioning.
- HIP backend to support tuning HIP kernels on AMD GPUs
- Experimental features for mixed-precision and accuracy tuning
- Experimental features for OpenACC tuning
- Major speedup in searchspace building due to a new parser and a revamped python-constraint
- Implemented ability to use `PySMT` and `ATF` for searchspace building
- Added Poetry for dependency and build management
- Switched from `setup.py` and `setup.cfg` to `pyproject.toml` for centralized metadata, added relevant tests
- Updated GitHub Action workflows to use Poetry
- Updated dependencies, most notably NumPy is no longer version-locked as scikit-opt is no longer a dependency
- Documentation now uses `pyproject.toml` metadata, minor fixes and changes to be compatible with updated dependencies
- Set up Nox for testing on all supported Python versions in isolated environments
- Added linting information, VS Code settings and recommendations
- Discontinued use of `OrderedDict`, as all dictionaries in the Python versions used are already ordered
- Dropped Python 3.7 support
- PMTObserver to measure power and energy on various platforms
- Improved functionality for storing output and metadata files
- Updated PowerSensorObserver to support PowerSensor3
- Refactored internal interfaces of runners and backends
- Bugfix in interface to set objective and optimization direction
- Support for using time_limit in simulation mode
- Helper functions for energy tuning
- Example to show ridge frequency and power-frequency model
- Functions to store tuning output and metadata
- Changed what timings are stored in cache files
- No longer inserting partial loop unrolling factor of 0 in CUDA
- A new backend that uses Nvidia cuda-python
- Support for locked clocks in NVMLObserver
- Support for measuring core voltages using NVML
- Support for custom preprocessor definitions
- Support for boolean scalar arguments in PyCUDA backend
- Migrated from github.com/benvanwerkhoven to github.com/KernelTuner
- Significant update to the documentation pages
- Unified benchmarking loops across backends
- Backends are no longer context managers
- Replaced the method for measuring power consumption using NVML
- Improved NVML measurements of temperature and clock frequencies
- bugfix in parse_restrictions when using and/or in expressions
- bugfix in GreedyILS when using neighbor method "adjacent"
- bugfix in Bayesian Optimization for small problems
- new optimization strategies: dual annealing, greedy ILS, ordered greedy MLS, greedy MLS
- support for constant memory in cupy backend
- constraint solver to cut down time spent in creating search spaces
- support for custom tuning objectives
- support for max_fevals and time_limit in strategy_options of all strategies
- alternative Bayesian Optimization strategies that could not be used directly
- C++ wrapper module that was too specific and hardly used
- string-based restrictions are compiled into functions for improved performance
- genetic algorithm, MLS, ILS, random, and simulated annealing use new search space object
- diff evo, firefly, PSO are initialized using population of all valid configurations
- all strategies except brute_force strictly adhere to max_fevals and time_limit
- simulated annealing adapts annealing schedule to max_fevals if supplied
- minimize, basinhopping, and dual annealing start from a random valid config
- support for PyTorch Tensors as input data type for kernels
- support for smem_args in run_kernel
- support for (lambda) function and string for dynamic shared memory size
- a new Bayesian Optimization strategy
- optionally store the kernel_string with store_results
- improved reporting of skipped configurations
- support for (lambda) function instead of list of strings for restrictions
- support for (lambda) function instead of list for specifying grid divisors
- support for (lambda) function instead of tuple for specifying problem_size
- function to store the top tuning results
- function to create header file with device targets from stored results
- support for using tuning results in PythonKernel
- option to control measurements using observers
- support for NVML tunable parameters
- option to simulate auto-tuning searches from existing cache files
- Cupy backend to support C++ templated CUDA kernels
- support for templated CUDA kernels using PyCUDA backend
- documentation on tunable parameter vocabulary
- support loop unrolling using params that start with loop_unroll_factor
- always insert "define kernel_tuner 1" to allow preprocessor ifdef kernel_tuner
- support for user-defined metrics
- support for choosing the optimization starting point x0 for most strategies
- more compact output is printed to the terminal
- sequential runner runs first kernel in the parameter space to warm up device
- updated tutorials to demonstrate use of user-defined metrics
- kernelbuilder functionality for including kernels in Python applications
- smem_args option for dynamically allocated shared memory in CUDA kernels
- bugfix for Nvidia devices without internal current sensor
- fix for output checking, custom verify functions are called just once
- benchmarking now returns multiple results, not only time
- more sophisticated implementation of genetic algorithm strategy
- how the "method" option is passed; now use strategy_options
- Bayesian Optimization strategy, use strategy="bayes_opt"
- support for kernels that use texture memory in CUDA
- support for measuring energy consumption of CUDA kernels
- option to set strategy_options to pass strategy specific options
- option to cache tuned kernel configurations and restart from a cachefile
- Python 2 support; it may still work, but we no longer test for Python 2
- Noodles parallel runner
- no longer replacing kernel names with instance strings during tuning
- bugfix in tempfile creation that led to too many open files error
- A minimal Fortran example and basic Fortran support
- Particle Swarm Optimization strategy, use strategy="pso"
- Simulated Annealing strategy, use strategy="simulated_annealing"
- Firefly Algorithm strategy, use strategy="firefly_algorithm"
- Genetic Algorithm strategy, use strategy="genetic_algorithm"
- bugfix for C backend for byte array arguments
- argument type mismatches throw warning instead of exception
- wrapper functionality to wrap C++ functions
- citation file and zenodo doi generation for releases
- bugfix for when using iterations smaller than 3
- the install procedure now uses extras, e.g. [cuda,opencl]
- option quiet makes tune_kernel completely quiet
- extensive updates to documentation
- type checking for kernel arguments and answers lists
- checks for reserved keywords in tunable parameters
- checks for whether thread block dimensions are specified
- printing units for measured time with CUDA and OpenCL
- option to print all measured execution times
- bugfix for install when scipy is not present
- bugfix for GPU cleanup when using Noodles runner
- reworked the way strings are handled internally
- option to set compiler name, when using C backend
- actively freeing GPU memory after tuning
- bugfix for 3D grids when using OpenCL
- support for dynamic parallelism when using PyCUDA
- option to use differential evolution optimization
- global optimization strategies basinhopping, minimize
- option to pass a fraction to the sample runner
- fixed a bug in memset for OpenCL backend
- parallel tuning on single node using Noodles runner
- option to pass new defaults for block dimensions
- option to pass a Python function as code generator
- option to pass custom function for output verification
- device and kernel name are printed by runner
- tune_kernel also returns a dict with environment info
- using different timer in C vector add example
- changed how scalar arguments are handled internally
- separate install and contribution guides
- allow non-tuple problem_size for 1D grids
- changed default for grid_div_y from None to block_size_y
- converted the tutorial to a Jupyter Notebook
- CUDA backend prints device in use, similar to OpenCL backend
- migrating from nosetests to pytest
- rewrote many of the examples to save results to json files
- full support for 3D grids, including option for grid_div_z
- separable convolution example
- changed the output format to list of dictionaries
- option to set compiler options
- verbose now also prints debug output when correctness check fails
- restructured the utility functions into util and core
- restructured the code to prepare for different strategies
- shortened the output printed by tune_kernel
- allowing numpy integers for specifying problem size
- a public roadmap
- requirements.txt
- example showing GPU code unit testing with the Kernel Tuner
- support for passing a (list of) filenames instead of kernel string
- runner that takes a random sample of 10 percent
- support for OpenCL platform selection
- support for using tuning parameter names in the problem size
- A function to type check the arguments to the kernel
- Example (convolution) that tunes the number of streams
- Device interface to C functions, for tuning host code
- Correctness checks for kernels during tuning
- Function for running a single kernel instance
- CHANGELOG file
- Compute Cartesian product and process restrictions before main loop
- Python 3.5 compatible code, thanks to Berend
- Support for constant memory arguments to CUDA kernels
- Use of mocking in unittests
- Reporting coverage to codacy
- OpenCL support
- Documentation pages with Convolution and Matrix Multiply examples
- Inspecting device properties at runtime
- Basic Kernel Tuning functionality