Optimize Vulkan REPEAT performance #2

SkutteOleg · 2024-08-27T17:56:25Z

See leejet/stable-diffusion.cpp#291 (comment)

This commit moves the comment for the c parameter from ggml_rope to ggml_rope_ext. The comment is currently incorrect as ggml_rope does not have a c parameter (freq_factors tensor). Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Fix Vulkan repeat op
* Implement Vulkan concat op
* Delete old Vulkan shader generator
* Implement Vulkan im2col op
* Implement Vulkan unary gelu_quick op
* Implement Vulkan group_norm op
* Implement Vulkan timestep_embedding op
* Implement Vulkan upscale op
* Fix Vulkan vk_context tensor extra index issue
* Fix Vulkan matmul shader parameter bug
* Properly fix Vulkan matmul shader parameter bug
* Add Vulkan ADD f16 + f32 -> f16 operator support
* Implement Vulkan tanh op
* Fix Vulkan group count too large Validation error on non-Nvidia GPUs
* Throw error when too much memory is requested
* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs
* Fix matmul MMQ condition
* Implement Vulkan pad op
* Fix Vulkan crash when tensor is used multiple times in a compute graph
* Add Vulkan CONCAT f16 + f16 -> f16 op
* Add Vulkan LEAKY_RELU op

ggml-ci

* Update doc for MUSA
  Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Add GGML_MUSA in Makefile
  Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Add GGML_MUSA in CMake
  Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* CUDA => MUSA
  Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* MUSA adds support for __vsubss4
  Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Fix CI build failure
  Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

…/8746) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

… (llama/8748) In these codes, we want to retain the value that they previously held when mask[i] is false. So we should use undisturbed. With the default agnostic policy of rvv intrinsic, these values can be held or be written with 1s. Co-authored-by: carter.li <carter.li@starfivetech.com>
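The difference between the two mask policies can be modelled in plain scalar C. This is only an illustrative sketch, not RVV intrinsic code: the function `masked_add_i8`, the `mask_policy` enum, and the choice to model "agnostic" as always writing all-ones are assumptions made for the example (real hardware may either keep the old value or write 1s).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a masked vector add (not an RVV intrinsic). */
typedef enum {
    MASK_UNDISTURBED,       /* inactive lanes keep their previous value        */
    MASK_AGNOSTIC_ALL_ONES  /* one legal agnostic outcome: inactive lanes = ~0 */
} mask_policy;

static void masked_add_i8(int8_t *dst, const int8_t *a, const int8_t *b,
                          const uint8_t *mask, size_t n, mask_policy policy) {
    for (size_t i = 0; i < n; i++) {
        if (mask[i]) {
            /* Active lane: perform the operation. */
            dst[i] = (int8_t)(a[i] + b[i]);
        } else if (policy == MASK_AGNOSTIC_ALL_ONES) {
            /* Inactive lane under the modelled agnostic policy: all bits set. */
            dst[i] = (int8_t)-1;
        }
        /* Inactive lane under MASK_UNDISTURBED: dst[i] is left untouched. */
    }
}
```

With the undisturbed policy, code that relies on the old contents of the destination stays correct. With the agnostic policy it can only be correct by accident.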

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

…751)
* added android implementation of ggml_print_backtrace_symbols
* Update ggml/src/ggml.c
  Co-authored-by: slaren <slarengh@gmail.com>
* Update ggml/src/ggml.c
  Co-authored-by: slaren <slarengh@gmail.com>
* Update ggml/src/ggml.c
  Co-authored-by: slaren <slarengh@gmail.com>
* Update ggml/src/ggml.c
  Co-authored-by: slaren <slarengh@gmail.com>
* Update ggml/src/ggml.c
  Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>

* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X
* update asserts
* only use dmmv for supported types
* add test

…a/8783)
* Only enable backtrace on GLIBC linux systems
* fix missing file from copy
* use glibc macro instead of defining a custom one

* Adding support for unified memory
* adding again the documentation about unified memory
* refactoring: Moved the unified memory code in the correct location.
* Fixed compilation error when using hipblas
* cleaning up the documentation
* Updating the documentation
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* adding one more case where the PR should not be enabled
---------
Co-authored-by: matteo serva <matteo.serva@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix ggml_cann_im2col for 1D im2col
* fix build warning

* add truncate_bf16
* truncate intermediate fp32 if converting bf16 to bf16
* fix masking in __compute_fp32_to_bf16
* np.int16 no longer used
* missing cast and additional numpy 2.x fix
* ggml-impl : do not flush bf16 subnormals to zero
* ggml : add reference fp32 to bf16 conversion
  The fast version is no longer equivalent for all platforms because of the handling of subnormal values.
* gguf-py : remove flush to zero for bf16 subnormals
* gguf-py : remove float32 truncation to bf16
  Rounding achieves the same thing in the cases where this was used.
* missed prototype update in merge
* merge cleanup
---------
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
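As an illustration of what a reference fp32 to bf16 conversion that rounds instead of truncating, and that does not flush subnormals, can look like, here is a minimal sketch. The names `fp32_to_bf16_rne` and `bf16_to_fp32` are hypothetical and the code is written for this example. It is not copied from ggml and may differ from the actual implementation in details.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reference conversion: fp32 value -> bf16 bit pattern,
 * rounding to nearest with ties to even. Subnormals are not flushed. */
static uint16_t fp32_to_bf16_rne(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* reinterpret without aliasing issues */

    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        /* NaN: keep the upper half and force the quiet bit so the
         * result stays a NaN after the low 16 bits are dropped. */
        return (uint16_t)((bits >> 16) | 0x0040u);
    }

    /* Add 0x7fff plus the lowest kept bit: exact ties round to even. */
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}

/* Widening bf16 -> fp32 is exact: the bf16 bits become the upper half. */
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Because the conversion works purely on the bit pattern, subnormal inputs go through the same rounding path as normal values, so a bf16 subnormal round-trips unchanged.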

* ggml : reading the runtime sve config of the cpu
* change to one time init to prevent performance drop
* prefix variable to avoid possible conflicts
* revert xxhash fix and add brackets
---------
Co-authored-by: domke <673751-domke@users.noreply.gitlab.com>

…855)
* Fix Vulkan mul mat vec invalid results when ncols < warp size
* Only run backend ops mul mat vec block size test if block size not already covered

It's helpful to use expm1f(x), because expf(x)-1 will result in overflow for 25% of single-precision floating point numbers.

* cann: fix ggml_backend_cann_buffer_get_tensor
  1. fix data ptr offset
  2. enable the acquisition of incomplete tensors
* fix backend cann set_tensor

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme

* ggml-backend : fix async copy from CPU
* cuda : more reliable async copy, fix stream used when the devices are the same

ggml-ci

* ggml: use vulkan as gpu backend when available
  Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
* whisper: enable using vk as default buffer type
  Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
---------
Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>

Co-authored-by: slaren <slarengh@gmail.com>

* ggml: support forward pass broadcasting in ggml_sub
  Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
* Use assert instead of GGML_ASSERT in ggml_compute_forward_sub_f32
  The check is already performed in ggml_sub_impl
  Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
---------
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
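For readers unfamiliar with the term, forward-pass broadcasting in a subtraction means the second operand is conceptually repeated to match the shape of the first. The sketch below shows the idea for contiguous row-major 2D arrays. The function `sub_broadcast_2d` and its parameters are hypothetical, and the real ggml kernel works on strided tensors of up to four dimensions, so treat this only as a simplified model.

```c
#include <stddef.h>

/* Hypothetical 2D model of a broadcasting subtraction:
 *   dst[r][c] = a[r][c] - b[r % b_rows][c % b_cols]
 * All arrays are contiguous and row-major. The caller is assumed to
 * guarantee that rows is a multiple of b_rows and cols of b_cols. */
static void sub_broadcast_2d(float *dst,
                             const float *a, size_t rows, size_t cols,
                             const float *b, size_t b_rows, size_t b_cols) {
    for (size_t r = 0; r < rows; r++) {
        const size_t br = r % b_rows;      /* row of b that is reused */
        for (size_t c = 0; c < cols; c++) {
            const size_t bc = c % b_cols;  /* column of b that is reused */
            dst[r * cols + c] = a[r * cols + c] - b[br * b_cols + bc];
        }
    }
}
```

Subtracting a single row from every row of a matrix is then a call with `b_rows = 1` and `b_cols = cols`, with no need to materialize a repeated copy of `b` first.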

* ggml : add sin/cos operators
* ggml-cuda : add sin/cos operators
* ggml : add corresponding tests for sin/cos
* ggml : add backward computation for sin/cos operators
* ggml-vulkan : add sin/cos operators
* ggml-vulkan : add sin/cos shader source
* metal : add sin, cos
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Co-Authored-By: 0cc4m <11707594+0cc4m@users.noreply.github.com>

SkutteOleg · 2024-08-27T18:02:11Z

Never mind, I should just open an issue in the ggml repo upstream.

JohannesGaessler and others added 30 commits July 29, 2024 15:03

examples: add TensorFlow to requirements.txt (ggerganov#902)

49164e6

metal : add abort callback (ggerganov#905)

1f2b80a

metal : fix struct name (ggerganov#912)

444e896

ggml-ci

ggml : ignore more msvc warnings (ggerganov#906)

6c71d5a

add conv support (llama/8688)

dcb2400

cuda : organize vendor-specific headers into vendors directory (llama…

116362c

…/8746) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

Add TIMESTEP_EMBEDDING OP (llama/8707)

5efe2dc

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

cann: update cmake (llama/8765)

607299d

cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (llama/8800)

58e50d2

* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X
* update asserts
* only use dmmv for supported types
* add test

Build: Only include execinfo.h on linux systems that support it (llam…

77e79cd

…a/8783)
* Only enable backtrace on GLIBC linux systems
* fix missing file from copy
* use glibc macro instead of defining a custom one

Fixing wrong VDR iq4nl value (llama/8812)

eaeff32

cann: Fix ggml_cann_im2col for 1D im2col (llama/8819)

0a19c02

* fix ggml_cann_im2col for 1D im2col
* fix build warning

cann: support q4_0 model (llama/8822)

1fb1c9d

vulkan : fix Qantized Mat-Vec Mul on AMD GPUs for ncols < 64 (llama/8…

a3b2059

…855)
* Fix Vulkan mul mat vec invalid results when ncols < warp size
* Only run backend ops mul mat vec block size test if block size not already covered

ggml : fix overflows in elu function (llama/8866)

15eac32

It's helpful to use expm1f(x), because expf(x)-1 will result in overflow for 25% of single-precision floating point numbers.

cann: fix buffer_num and runtime speed slowly error (llama/8865)

70f29c7

Fix ggml_backend_cann_buffer_get_tensor (llama/8871)

dc3dba3

* cann: fix ggml_backend_cann_buffer_get_tensor
  1. fix data ptr offset
  2. enable the acquisition of incomplete tensors
* fix backend cann set_tensor

ggml : add epsilon as a parameter for group_norm (llama/8818)

fc31d40

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

CUDA: fix padding logic for FP16/FP32 (llama/8884)

9510e3c

CUDA/HIP: fix tests/test-backend-ops (llama/8896)

63f2251

Updated SYCL device filtering (llama/8901)

02a0b27

* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme

ggml-backend : fix async copy from CPU (llama/8897)

67c3e78

* ggml-backend : fix async copy from CPU
* cuda : more reliable async copy, fix stream used when the devices are the same

ggerganov and others added 15 commits August 8, 2024 13:45

sync : llama.cpp

9793ab7

ggml-ci

scripts : update sync scripts (#0)

3058ec3

scripts : remove obsolete header (#0)

3266c07

scripts : sync sycl (#0)

723445e

sync : vulkan (llama/0)

bc97237

ggml : add CANN backend (llama/0)

a06c683

ggml-ci

sync : whisper.cpp

797faa2

rpc : sanitize tensor data + warnings (llama/0)

483ccfb

Co-authored-by: slaren <slarengh@gmail.com>

sync : llama.cpp

4bf4a25

metal : fix uninitialized abort_callback (llama/8968)

9309817

sync : llama.cpp

681247d

Optimize Vulkan REPEAT performance

681d4f3

Co-Authored-By: 0cc4m <11707594+0cc4m@users.noreply.github.com>

SkutteOleg closed this Aug 27, 2024
