TODO.txt

* Implement the contribution of a parallel implementation that speeds up 1.5 to 3x in pure python (but requires a rewriting of the module to be self-contained, no globals, to be compatible with the multiprocessing module)
	* Maybe test and/or make a version compatible with the Mojo JIT compiler (a superset of Python), but it's only partially opensource so this version cannot replace the main one due to the constraint of being future-proof. https://www.macg.co/ailleurs/2024/04/le-code-source-de-mojo-bien-ete-rendu-public-comme-promis-143289 -- the huge speed-up is obtainable through parallelization, so it would be a step after the parallel implementation in pure python https://www.modular.com/blog/mojo-a-journey-to-68-000x-speedup-over-python-part-3 and https://github.com/modularml/mojo/discussions/843
	* Test Photon JIT compiler, but not a drop-in replacement for CPython but tries to be as close as possible. Also requires to be parallelism-ready. https://github.com/exaloop/codon
        * For reference, we can compare llama.mojo which is a port of llama.cpp by the same author as llama.py which is compatible with codon https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov and https://github.com/tairov/llama2.py and https://github.com/tairov/llama2.mojo
* Test the new [galois](https://github.com/mhostetter/galois) jit-optimized extension module for numpy, which implements a reed-solomon codec and NTT transforms!
* Implement my own code for probabilistic prime polynomials generator, because for higher fields it is necessary: https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders/Additional_information#Probabilistic_prime_polynomial_generator
* Cythonize more, read recent tutorials:
    * https://www.peterbaumgartner.com/blog/intro-to-just-enough-cython-to-be-useful/
    * https://members.loria.fr/LMendoza/link/Cython_speedup_notes.html
    * chunking should be moved to functions outside of RSCodec python object, so that we can make a cpdef and use memoryviews. Essentially, all looping functions should be done in cythonized functions, outside of Python objects (we can't cythonize Python objects methods).
    * remove all PyObject calls, especially in bytearrays
        * maybe replace all bytearrays by array 'i' and check if no more python calls
        * or maybe try to create an array with nogil, no PyObject call
        * or another way to make a pure C array? Or any kind of chained list? Something to store an array of items? We just need a basic array, nothing fancy, but with no Python call.
        * cpython array is much more efficient than cython view array! https://groups.google.com/g/cython-users/c/CwtU_jYADgM
            * then use nogil in cdef to ensure there is no python object (try with gf_poly_add() ).
        * use vector as a variable https://stackoverflow.com/questions/54759483/how-to-return-two-values-in-cython-cdef-without-gil-nogil
            * great to specify inside ctuples in cdef function definition return value
        * once return type is changed, try to change the exception type, maybe if pointer we set exception to a int instead
    * return types for each functions, especially those returning an array, maybe I can retry returning memoryviews now?
        * especially for rs_encode_msg() and decoding functions (may require a change in return values, to drop the tuples, or use vectors instead to return inside a ctuple)
    * replace all len() with array.shape[0]:
    regex search: len\(([^\)]+)\)
    replace: \1.shape[0]
    * add noexcept to cdef functions definitions once we finished unpythonizing all functions, because it will make debugging harder (exceptions will be printed but not propagated) https://cython.readthedocs.io/en/latest/src/userguide/migrating_to_cy30.html#exception-values-and-noexcept and https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#error-return-values
    * cpdef is slightly faster than cdef + def in classes https://stackoverflow.com/questions/48864631/what-are-the-differences-between-a-cpdef-and-a-cdef-wrapped-in-a-def
    * define return ctype value for each function, otherwise "You don’t need to (and shouldn’t) declare exception values for functions which return Python objects. Remember that a function with no declared return type implicitly returns a Python object. (Exceptions on such functions are implicitly propagated by returning NULL.)".
    * add nogil (and noexcept) to most cpdef functions https://stackoverflow.com/questions/54759483/how-to-return-two-values-in-cython-cdef-without-gil-nogil
    * https://stackoverflow.com/questions/41764244/obtaining-pointer-to-python-memoryview-on-bytes-object
    * Memoryviews:
        * Const memoryview to accept immutable bytes https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#read-only-views
        * Memoryviews do not require the GIL except for copying: https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#memoryviews-and-the-gil
        * Memoryviews and Cython Arrays https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#memoryview-objects-and-cython-arrays
        * CPython Array module: https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#cpython-array-module
        * to access parent array: a.base https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#memoryview-objects-and-cython-arrays
        * can coerce memoryviews without copying content? Does it work for other types of arrays than numpy arrays? https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#coercion-to-numpy
    * cdivision = False has a 35% speed penalty! But we are reliant on Python division style... We would have to rewrite a LOT of the maths! https://cython.readthedocs.io/en/latest/src/userguide/source_files_and_compilation.html#compiler-directives and https://github.com/cython/cython/wiki/enhancements-division
    * with cython.nogil, parallel(): https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html#cython.parallel.parallel
    * Recode the Cython extension into Rust instead ? Like cryptography did? https://lwn.net/Articles/845535/ and https://pythonspeed.com/articles/rust-cython-python-extensions/
        * see also: https://pythonspeed.com/articles/python-extension-performance/
        * integrate with https://github.com/PyO3/setuptools-rust and meson supports rust too

* Best resources on memoryviews and arrays speed:
    * https://stackoverflow.com/questions/18462785/what-is-the-recommended-way-of-allocating-memory-for-a-typed-memory-view
    * https://groups.google.com/g/cython-users/c/CwtU_jYADgM
    * https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#cpython-array-module

------

Other TODO apart from Cython:
* Optimize pure python implementation by using _bytearray() instead of [] whenever possible (compare with changes done on the cython implementation)
* Pre-allocate bytearrays whenever possible in the pure-python implementation (like the cython implementation)
* Test and fix issues with galois fields smaller than 2^8.
* Try to implement NTT, especially using https://github.com/raeudigerRaeffi/generalizedReedSolomon which uses fft/ifft:
	* BEST NTT tuto for RS with examples! https://tmo.jpl.nasa.gov/progress_report2/42-35/35K.PDF - The fast decoding of reed-solomon codes with number theoretic transform by Reed et al
	* Implement a fast algorithm for large data encoding (and decoding), see [https://github.com/catid/leopard leopard-RS] for a FFT approach and compare with https://github.com/raeudigerRaeffi/generalizedReedSolomon or [https://github.com/Bulat-Ziganshin/FastECC FastECC] for a NTT approach and try to translate https://github.com/raeudigerRaeffi/generalizedReedSolomon to use NTT instead.

* Try to reproduce encoding and decoding using an external library, potentially faster and supporting both polynomials and matrices modulo n, such as flint: https://fredrikj.net/python-flint/nmod_poly.html and https://pypi.org/project/flint-py/#description

* move creedsolo to its own repo so that cythonization and c build is not optional and we can use cibuildwheel to build several different wheels for various platforms

* [x] Fix optional cythonization post pep517 and with isolated builds:
    * Main issue is that although pep517 standardizes pure python packages building, extensions building is not standardized at all and mainly locked in setuptools, although a few other tools with their own custom process exist such as hatch, meson, enscons https://hackmd.io/@gaborbernat/py-packaging-summit-2022#Building-C-Extensions-without-Setuptools---Henry-Schreiner-III-and-Ofek-Lev
    * setup.py is not deprecated, only calling it as a cli tool is deprecated: https://discuss.python.org/t/custom-build-steps-moving-bokeh-off-setup-py/16128/11 + see also conditional hooks in hatch https://hatch.pypa.io/dev/config/build/#conditional-execution
    * can't access extras from setup.py, so that's not a solution: https://stackoverflow.com/questions/66083971/setup-py-how-to-find-user-specified-bracketed-extras
    * poetry supports optional cythonization, readily usable examples: https://github.com/python-poetry/poetry/issues/2330
    * the old setup.py and setuptools system is usually ancient on a lot of linux distribs, so we end up with an ancient python ecosystem, pyproject.toml fixes that https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html
    * update conda building process? "Conda 3.22 was just released, adding load_file_data function that can read toml files, so we can template the conda recipe dependencies directly from pyproject.toml . The conda recipe also now just builds from the wheel, which is simpler." https://discuss.python.org/t/custom-build-steps-moving-bokeh-off-setup-py/16128/17 - but beware, conda-build globally sets --no-isolation for pip https://discuss.python.org/t/how-to-use-pip-install-to-build-some-scientific-packages-from-sources-with-custom-build-arguments/24717/6
    * --install-option is deprecated: https://discuss.python.org/t/passing-command-line-arguments-to-pip-install-after-install-options-deprecation/22981

* [x] Rollback to v1.7.0 on PyPi and change all releases in v2.x branch as pre-releases.
* [x] Avoid packaging of tests/* folder inside the sdist
    * [x] maybe impossible with single-module layout, may need to switch to src-layout (use __init__ to make reedsolo and creedsolo importable directly with same API, and place creedsolo in its own package)
* [] Overhaul again the build process, maybe change backend to switch to Meson: https://discuss.python.org/t/explicit-optional-cythonization-via-config-setting-post-pep517-and-install-option-deprecation/25379
* [] cibuildwheel github actions only on release
    * [x] set pypi secret for auto upload
    * [x] configure unit testing of builds
    * [x] use test pypi
    * [x] fix RST issues
    * [x] upload new release to test
    * [x] add macos and windows oses
    * [x] switch to real pypi
* [] make tests against unireedsolomon and pyfilefixity to future proof (max errors, max erasures, max mix, non max scenarios) - but only for one single python version in the continuous integration, otherwise will be too time consuming
    * [] Repackage unireedsolomon to PEP517
    * [] Repackage pyFileFixity to PEP517 with a pyproject.toml and setup.cfg (for Py2 compatibility) so it can be installed in --editable mode with pip. Then just add to ci (not local test).
* [] Optimized maths build for Cython https://github.com/Technologicat/setup-template-cython/blob/master/setup.py
* [] finish array optimizations (see jupyter notebook)

To support bit arrays:
* Surprisingly, there is no bitarray in cython nor standard Python library. We need to use 3rd party modules, but they don't seem to have a cython/C interface, so they are likely slow. Still, it could be nice to support them for those who work on bit arrays.
* BEST: likely the fastest and most up-to-date implementation of bitarray, done in C but with a Python interface: https://github.com/ilanschnell/bitarray
* BEST: another efficient implementation: https://pypi.org/project/BitVector/
* https://pypi.org/project/bitarray/
* https://bitstring.readthedocs.io/en/stable/bitarray.html
* alternative using bitshift operators: https://stackoverflow.com/a/53419563/1121352
preallocation: x = 1<<10 # or x = 0 # int in Python 3 or long in Python 2 can be arbitrarily large https://stackoverflow.com/a/43458926/1121352
assigning 1 at bit 19: x = x | 1<<19 # or x |= 1<<19
accessing bit 19: x = x & 1<<19 # or x &= 1<<19
More operators: https://stackoverflow.com/a/43458926/1121352
x &= ~(1<<19) # clear bit 19
x ^= 1<<19 # toggle bit 19
x = ~x # invert *all* bits, all the way to infinity
mask = ((1<<20)-1) # define a 20 bit wide mask
x &= mask # ensure bits 20 and higher are 0
x ^= mask # invert only bits 0 through 19

(x >> 19) & 1 # test bit 19
(x >> 16) & 0xf # get bits 16 through 20.
* bitshifting: https://stackoverflow.com/questions/12461361/bits-list-to-integer-in-python?rq=3
def shifting(bitlist):
     out = 0
     for bit in bitlist:
         out = (out << 1) | bit
     return out
* ctypes https://stackoverflow.com/a/40364970/1121352
* to convert user input (strings) into binary data: https://pymotw.com/3/struct/

Linux packaging:
* Checklist: https://bugzilla.redhat.com/show_bug.cgi?id=1925761