Bug: [SYCL] Inference not working correctly on multiple GPUs #8294

Closed
ch1y0q opened this issue Jul 4, 2024 · 6 comments
Labels
bug-unconfirmed; high severity (used to report high severity bugs in llama.cpp: malfunctioning hinders an important workflow); stale; SYCL (https://en.wikipedia.org/wiki/SYCL, a GPU programming language)

Comments


ch1y0q commented Jul 4, 2024

What happened?

I am using llama.cpp + SYCL to run inference on a multi-GPU server. However, I get a segmentation fault when using multiple GPUs; the same model produces inference output correctly in single-GPU mode.

git clone https://github.com/ggerganov/llama.cpp.git
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j -v

cd ~/llama.cpp/
./build/bin/llama-ls-sycl-device

## single gpu, ok
./build/bin/llama-cli -m ~/mistral-7b-v0.1.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm none -mg 0

## multiple gpus, Segmentation Fault, core dumped
./build/bin/llama-cli -m ~/mistral-7b-v0.1.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer
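
Not part of the original report, but one way to narrow this down is to restrict which SYCL devices are visible, so the layer split only spans the two Arc A770s and never touches the iGPU, CPU, or FPGA emulator devices listed below. A minimal sketch using the standard oneAPI device-selector environment variable (the level_zero indices are assumed to match the llama-ls-sycl-device output):

## hypothetical workaround sketch: expose only the two A770s (level_zero devices 0 and 1)
export ONEAPI_DEVICE_SELECTOR="level_zero:0,1"
./build/bin/llama-cli -m ~/mistral-7b-v0.1.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer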

(Screenshots of the single-GPU and multi-GPU runs were attached to the original issue.)

Output of ./build/bin/llama-ls-sycl-device:

found 8 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26241|
| 1| [level_zero:gpu:1]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26241|
| 2| [level_zero:gpu:2]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 53751M|            1.3.26241|
| 3|     [opencl:gpu:0]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.17.26241.33|
| 4|     [opencl:gpu:1]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.17.26241.33|
| 5|     [opencl:gpu:2]|                 Intel UHD Graphics 770|    3.0|     32|     512|   32| 53751M|       23.17.26241.33|
| 6|     [opencl:cpu:0]|                   Intel Core i9-14900K|    3.0|     32|    8192|   64| 67189M|2023.16.11.0.22_160000|
| 7|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     32|67108864|   64| 67189M|2023.16.11.0.22_160000|

Name and Version

./llama-cli --version
version: 3292 (20fc3804)
built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.1 (2024.0.1.20231122) for x86_64-unknown-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

Log start
main: build = 3292 (20fc3804)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.1 (2024.0.1.20231122) for x86_64-unknown-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/arda/qiyue/mistral-7b-v0.1.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 8 SYCL devices:
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    1.23 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =   234.06 MiB
llm_load_tensors:      SYCL1 buffer size =   234.06 MiB
llm_load_tensors:      SYCL2 buffer size =   702.19 MiB
llm_load_tensors:      SYCL3 buffer size =   234.06 MiB
llm_load_tensors:      SYCL4 buffer size =   117.03 MiB
llm_load_tensors:      SYCL5 buffer size =   702.19 MiB
llm_load_tensors:      SYCL6 buffer size =   819.22 MiB
llm_load_tensors:      SYCL7 buffer size =   804.74 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 8 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26241|
| 1| [level_zero:gpu:1]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26241|
| 2| [level_zero:gpu:2]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 53751M|            1.3.26241|
| 3|     [opencl:gpu:0]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.17.26241.33|
| 4|     [opencl:gpu:1]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.17.26241.33|
| 5|     [opencl:gpu:2]|                 Intel UHD Graphics 770|    3.0|     32|     512|   32| 53751M|       23.17.26241.33|
| 6|     [opencl:cpu:0]|                   Intel Core i9-14900K|    3.0|     32|    8192|   64| 67189M|2023.16.11.0.22_160000|
| 7|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     32|67108864|   64| 67189M|2023.16.11.0.22_160000|
llama_kv_cache_init:      SYCL0 KV buffer size =   256.00 MiB
llama_kv_cache_init:      SYCL1 KV buffer size =   256.00 MiB
llama_kv_cache_init:      SYCL2 KV buffer size =   768.00 MiB
llama_kv_cache_init:      SYCL3 KV buffer size =   256.00 MiB
llama_kv_cache_init:      SYCL4 KV buffer size =   128.00 MiB
llama_kv_cache_init:      SYCL5 KV buffer size =   768.00 MiB
llama_kv_cache_init:      SYCL6 KV buffer size =   896.00 MiB
llama_kv_cache_init:      SYCL7 KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL2 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL3 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL4 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL5 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL6 compute buffer size =  2144.00 MiB
llama_new_context_with_model:      SYCL7 compute buffer size =  2144.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    72.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 9
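
The repeated get_memory_info warnings in the log suggest setting ZES_ENABLE_SYSMAN=1 so the SYCL backend can query the real free GPU memory instead of falling back to total memory. Not part of the original report, a minimal sketch:

## suggested by the get_memory_info warnings above; enables Level Zero Sysman memory queries
export ZES_ENABLE_SYSMAN=1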
ch1y0q added the bug-unconfirmed and high severity labels on Jul 4, 2024
NeoZhangJianyu (Collaborator) commented

@ch1y0q
PR #8014 fixes this issue, but it is not approved yet.
You could use an older release (fb76ec3) or merge that PR into ggerganov/llama.cpp.
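
A sketch of the suggested workaround, assuming the same checkout and oneAPI environment as the reproduction steps above:

## pin the tree to the commit suggested above, then rebuild the SYCL backend
cd ~/llama.cpp
git checkout fb76ec3
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j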

airMeng added the SYCL label on Jul 18, 2024
airMeng (Collaborator) commented Jul 18, 2024

@ClarkChin08 please take a look

ClarkChin08 (Contributor) commented

@ch1y0q please refer to PR #8554 for the fix of the multi-device crash.

ch1y0q (Author) commented Jul 24, 2024

> @ch1y0q please refer to PR #8554 for the fix of the multi-device crash.

It seems that with the latest commit on the main branch (de28008), the bug is fixed without needing to integrate the patch.

ClarkChin08 (Contributor) commented


> @ch1y0q please refer to PR #8554 for the fix of the multi-device crash.
>
> It seems that with the latest commit on the main branch (de28008), the bug is fixed without needing to integrate the patch.

If you use different SYCL backend devices at the same time, it will still crash. Did you test the split_layer mode?

github-actions bot added the stale label on Aug 24, 2024

github-actions bot commented Sep 7, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Sep 7, 2024