Bug: [SYCL] Inference not working correctly on multiple GPUs #8294

Closed
ch1y0q opened this issue Jul 4, 2024 · 6 comments
Labels
bug-unconfirmed; high severity (used to report high severity bugs in llama.cpp: malfunctioning hinders an important workflow); stale; SYCL (https://en.wikipedia.org/wiki/SYCL, a GPU programming language)

Comments


ch1y0q commented Jul 4, 2024

What happened?

I am using llama.cpp + SYCL to run inference on a multi-GPU server. However, I get a segmentation fault when using multiple GPUs; the same model produces inference output correctly in single-GPU mode.

git clone https://github.com/ggerganov/llama.cpp.git
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j -v

cd ~/llama.cpp/
./build/bin/llama-ls-sycl-device

## single gpu, ok
./build/bin/llama-cli -m ~/mistral-7b-v0.1.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm none -mg 0

## multiple gpus, Segmentation Fault, core dumped
./build/bin/llama-cli -m ~/mistral-7b-v0.1.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer
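
Not part of the original report, but one way to narrow this down is to restrict which SYCL devices are visible, so the layer split only spans the two Arc A770s and never touches the iGPU, CPU, or FPGA emulator devices listed below. A minimal sketch using the standard oneAPI device-selector environment variable (the level_zero indices are assumed to match the llama-ls-sycl-device output):

## hypothetical workaround sketch: expose only the two A770s (level_zero devices 0 and 1)
export ONEAPI_DEVICE_SELECTOR="level_zero:0,1"
./build/bin/llama-cli -m ~/mistral-7b-v0.1.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer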

(Screenshots of the single-GPU and multi-GPU runs were attached to the original issue.)

Output of ./build/bin/llama-ls-sycl-device:

found 8 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26241|
| 1| [level_zero:gpu:1]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26241|
| 2| [level_zero:gpu:2]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 53751M|            1.3.26241|
| 3|     [opencl:gpu:0]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.17.26241.33|
| 4|     [opencl:gpu:1]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.17.26241.33|
| 5|     [opencl:gpu:2]|                 Intel UHD Graphics 770|    3.0|     32|     512|   32| 53751M|       23.17.26241.33|
| 6|     [opencl:cpu:0]|                   Intel Core i9-14900K|    3.0|     32|    8192|   64| 67189M|2023.16.11.0.22_160000|
| 7|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     32|67108864|   64| 67189M|2023.16.11.0.22_160000|

Name and Version

./llama-cli --version
version: 3292 (20fc3804)
built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.1 (2024.0.1.20231122) for x86_64-unknown-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

Log start
main: build = 3292 (20fc3804)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.1 (2024.0.1.20231122) for x86_64-unknown-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/arda/qiyue/mistral-7b-v0.1.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 8 SYCL devices:
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    1.23 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =   234.06 MiB
llm_load_tensors:      SYCL1 buffer size =   234.06 MiB
llm_load_tensors:      SYCL2 buffer size =   702.19 MiB
llm_load_tensors:      SYCL3 buffer size =   234.06 MiB
llm_load_tensors:      SYCL4 buffer size =   117.03 MiB
llm_load_tensors:      SYCL5 buffer size =   702.19 MiB
llm_load_tensors:      SYCL6 buffer size =   819.22 MiB
llm_load_tensors:      SYCL7 buffer size =   804.74 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 8 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26241|
| 1| [level_zero:gpu:1]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26241|
| 2| [level_zero:gpu:2]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 53751M|            1.3.26241|
| 3|     [opencl:gpu:0]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.17.26241.33|
| 4|     [opencl:gpu:1]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.17.26241.33|
| 5|     [opencl:gpu:2]|                 Intel UHD Graphics 770|    3.0|     32|     512|   32| 53751M|       23.17.26241.33|
| 6|     [opencl:cpu:0]|                   Intel Core i9-14900K|    3.0|     32|    8192|   64| 67189M|2023.16.11.0.22_160000|
| 7|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     32|67108864|   64| 67189M|2023.16.11.0.22_160000|
llama_kv_cache_init:      SYCL0 KV buffer size =   256.00 MiB
llama_kv_cache_init:      SYCL1 KV buffer size =   256.00 MiB
llama_kv_cache_init:      SYCL2 KV buffer size =   768.00 MiB
llama_kv_cache_init:      SYCL3 KV buffer size =   256.00 MiB
llama_kv_cache_init:      SYCL4 KV buffer size =   128.00 MiB
llama_kv_cache_init:      SYCL5 KV buffer size =   768.00 MiB
llama_kv_cache_init:      SYCL6 KV buffer size =   896.00 MiB
llama_kv_cache_init:      SYCL7 KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL2 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL3 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL4 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL5 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL6 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL7 compute buffer size =  2144.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    72.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 9
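
The repeated get_memory_info warnings in the log suggest setting ZES_ENABLE_SYSMAN=1 so the SYCL backend can query the real free GPU memory instead of falling back to total memory. Not part of the original report, a minimal sketch:

## suggested by the get_memory_info warnings above; enables Level Zero Sysman memory queries
export ZES_ENABLE_SYSMAN=1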
ch1y0q added the bug-unconfirmed and high severity labels on Jul 4, 2024
NeoZhangJianyu (Collaborator) commented

@ch1y0q
PR #8014 fixes this issue, but it is not approved yet.
You could use an older release (fb76ec3) or merge that PR into ggerganov/llama.cpp.
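
A sketch of the suggested workaround, assuming the same checkout and oneAPI environment as the reproduction steps above:

## pin the tree to the commit suggested above, then rebuild the SYCL backend
cd ~/llama.cpp
git checkout fb76ec3
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j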

airMeng added the SYCL label on Jul 18, 2024
airMeng (Collaborator) commented Jul 18, 2024

@ClarkChin08 please take a look

ClarkChin08 (Contributor) commented

@ch1y0q please refer to PR #8554 for the fix of the multi-device crash.

ch1y0q (Author) commented Jul 24, 2024

> @ch1y0q please refer to PR #8554 for the fix of the multi-device crash.

It seems that with the latest commit on the main branch (de28008), the bug is fixed without needing to integrate the patch.

ClarkChin08 (Contributor) commented


> @ch1y0q please refer to PR #8554 for the fix of the multi-device crash.
>
> It seems that with the latest commit on the main branch (de28008), the bug is fixed without needing to integrate the patch.

If you use different SYCL backend devices at the same time, it will still crash. Did you test the split_layer mode?

github-actions bot added the stale label on Aug 24, 2024

github-actions bot commented Sep 7, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Sep 7, 2024