[Bug] Llama-3-8B-Instruct-q4f16_1-MLC does not run on Windows #2899

Open
BlindDeveloper opened this issue Sep 12, 2024 · 1 comment
Labels: bug (Confirmed bugs)


@BlindDeveloper (Contributor)

🐛 Bug

Llama-3-8B-Instruct-q4f16_1-MLC does not run on Windows.

To Reproduce

Steps to reproduce the behavior:

  1. conda create --name mlc-prebuilt python=3.11
  2. conda activate mlc-prebuilt
  3. conda install -c conda-forge clang libvulkan-loader git-lfs git
  4. python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
  5. mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

[2024-09-12 15:20:08] INFO auto_device.py:88: Not found device: cuda:0
[2024-09-12 15:20:10] INFO auto_device.py:88: Not found device: rocm:0
[2024-09-12 15:20:12] INFO auto_device.py:88: Not found device: metal:0
[2024-09-12 15:20:13] INFO auto_device.py:79: Found device: vulkan:0
[2024-09-12 15:20:15] INFO auto_device.py:88: Not found device: opencl:0
[2024-09-12 15:20:15] INFO auto_device.py:35: Using device: vulkan:0
[2024-09-12 15:20:15] INFO download_cache.py:227: Downloading model from HuggingFace: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-09-12 15:20:15] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-12 15:20:15] INFO download_cache.py:166: Weights already downloaded: C:\Users\username\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC
[2024-09-12 15:20:15] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-12 15:20:15] INFO jit.py:118: Compiling using commands below:
[2024-09-12 15:20:15] INFO jit.py:119: 'C:\ProgramData\miniconda3\envs\mlc-prebuilt\python.exe' -m mlc_llm compile 'C:\Users\username\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides '' --device vulkan:0 --output 'C:\Users\username\AppData\Local\Temp\tmpr7njn151\lib.dll'
[2024-09-12 15:20:18] INFO auto_config.py:70: Found model configuration: C:\Users\username\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC\mlc-chat-config.json
[2024-09-12 15:20:18] INFO auto_target.py:91: Detecting target device: vulkan:0
[2024-09-12 15:20:18] INFO auto_target.py:93: Found target: {"thread_warp_size": runtime.BoxInt(1), "supports_float32": runtime.BoxBool(true), "supports_int16": runtime.BoxBool(true), "max_threads_per_block": runtime.BoxInt(1024), "supports_storage_buffer_storage_class": runtime.BoxBool(true), "supports_int8": runtime.BoxBool(true), "supports_8bit_buffer": runtime.BoxBool(true), "supports_int64": runtime.BoxBool(true), "max_num_threads": runtime.BoxInt(256), "kind": "vulkan", "tag": "", "max_shared_memory_per_block": runtime.BoxInt(32768), "supports_16bit_buffer": runtime.BoxBool(true), "supports_int32": runtime.BoxBool(true), "keys": ["vulkan", "gpu"], "supports_float16": runtime.BoxBool(true)}
[2024-09-12 15:20:18] INFO auto_target.py:110: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-09-12 15:20:18] INFO auto_target.py:111: Found host LLVM CPU: alderlake
[2024-09-12 15:20:18] INFO auto_config.py:154: Found model type: llama. Use --model-type to override.
Compiling with arguments:
--config LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=2048, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=80, kwargs={})
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
--model-type llama
--target {"thread_warp_size": runtime.BoxInt(1), "host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "alderlake", "keys": ["cpu"]}, "supports_float32": runtime.BoxBool(true), "supports_int16": runtime.BoxBool(true), "max_threads_per_block": runtime.BoxInt(1024), "supports_storage_buffer_storage_class": runtime.BoxBool(true), "supports_int8": runtime.BoxBool(true), "supports_8bit_buffer": runtime.BoxBool(true), "supports_int64": runtime.BoxBool(true), "max_num_threads": runtime.BoxInt(256), "kind": "vulkan", "tag": "", "max_shared_memory_per_block": runtime.BoxInt(32768), "supports_16bit_buffer": runtime.BoxBool(true), "supports_int32": runtime.BoxBool(true), "keys": ["vulkan", "gpu"], "supports_float16": runtime.BoxBool(true)}
--opt flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
--system-lib-prefix ""
--output C:\Users\username\AppData\Local\Temp\tmpr7njn151\lib.dll
--overrides context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None
[2024-09-12 15:20:18] INFO compile.py:140: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=2048, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=80, kwargs={})
[2024-09-12 15:20:18] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-09-12 15:20:22] INFO compile.py:164: Running optimizations using TVM Unity
[2024-09-12 15:20:22] INFO compile.py:185: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 80}
[2024-09-12 15:20:24] INFO pipeline.py:54: Running TVM Relax graph-level optimizations
[2024-09-12 15:20:31] INFO pipeline.py:54: Lowering to TVM TIR kernels
[2024-09-12 15:20:40] INFO pipeline.py:54: Running TVM TIR-level optimizations
[2024-09-12 15:21:03] INFO pipeline.py:54: Running TVM Dlight low-level optimizations
[2024-09-12 15:21:05] INFO pipeline.py:54: Lowering to VM bytecode
[2024-09-12 15:21:10] INFO estimate_memory_usage.py:58: [Memory usage] Function alloc_embedding_tensor: 16.00 MB
[2024-09-12 15:21:10] INFO estimate_memory_usage.py:58: [Memory usage] Function argsort_probs: 0.00 MB
[2024-09-12 15:21:10] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_decode: 11.56 MB
[2024-09-12 15:21:10] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_decode_to_last_hidden_states: 12.19 MB
[2024-09-12 15:21:10] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_prefill: 296.62 MB
[2024-09-12 15:21:10] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_prefill_to_last_hidden_states: 312.00 MB
[2024-09-12 15:21:10] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_select_last_hidden_states: 0.62 MB
[2024-09-12 15:21:10] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_verify: 296.00 MB
[2024-09-12 15:21:10] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_verify_to_last_hidden_states: 312.00 MB
[2024-09-12 15:21:10] INFO estimate_memory_usage.py:58: [Memory usage] Function create_tir_paged_kv_cache: 0.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function decode: 0.14 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function decode_to_last_hidden_states: 0.15 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function embed: 16.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function gather_hidden_states: 0.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function get_logits: 0.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function multinomial_from_uniform: 0.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function prefill: 296.01 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function prefill_to_last_hidden_states: 312.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function renormalize_by_top_p: 0.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function sample_with_top_p: 0.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function sampler_take_probs: 0.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function sampler_verify_draft_tokens: 0.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function scatter_hidden_states: 0.00 MB
[2024-09-12 15:21:11] INFO estimate_memory_usage.py:58: [Memory usage] Function softmax_with_temperature: 0.00 MB
[2024-09-12 15:21:13] INFO pipeline.py:54: Compiling external modules
[2024-09-12 15:21:13] INFO pipeline.py:54: Compilation complete! Exporting to disk
Traceback (most recent call last):
File "", line 198, in run_module_as_main
File "", line 88, in run_code
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm_main
.py", line 64, in
main()
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm_main
.py", line 33, in main
cli.main(sys.argv[2:])
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\cli\compile.py", line 129, in main
compile(
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\interface\compile.py", line 243, in compile
_compile(args, model_config)
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\interface\compile.py", line 188, in _compile
args.build_func(
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\support\auto_target.py", line 316, in build
).export_library(
^^^^^^^^^^^^^^^
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\tvm\relax\vm_build.py", line 146, in export_library
return self.mod.export_library(
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\tvm\runtime\module.py", line 624, in export_library
return fcompile(file_name, files, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\tvm\contrib\cc.py", line 96, in create_shared
_windows_compile(output, objects, options, cwd, ccache_env)
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\tvm\contrib\cc.py", line 418, in _windows_compile
raise RuntimeError(msg)
RuntimeError: Compilation error:
clang -O2 --target=x86_64 -shared -o C:\Users\username\AppData\Local\Temp\tmpr7njn151\lib.dll C:\Users\username\AppData\Local\Temp\tmp8zcmbzym\lib0.o C:\Users\username\AppData\Local\Temp\tmp8zcmbzym\devc.o
clang: error: unable to execute command: program not executable
clang: error: linker (via gcc) command failed with exit code 1 (use -v to see invocation)
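The clang error above is the root cause: the object files compiled fine, but clang fell back to invoking gcc as its linker driver and no working gcc (or other linker) exists in the environment, so no lib.dll was ever written. The traceback below is just the JIT wrapper noticing the missing output. A minimal smoke test for the toolchain, mirroring the flags of the failing command (test.c is a hypothetical scratch file):

REM Can clang link a trivial DLL at all? -v prints the linker invocation it attempts.
echo int f(void) { return 42; } > test.c
clang -O2 --target=x86_64 -shared -v -o test.dll test.c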

Traceback (most recent call last):
File "", line 198, in run_module_as_main
File "", line 88, in run_code
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Scripts\mlc_llm.exe_main
.py", line 7, in
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm_main
.py", line 45, in main
cli.main(sys.argv[2:])
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\cli\chat.py", line 36, in main
chat(
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\interface\chat.py", line 285, in chat
JSONFFIEngine(
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\json_ffi\engine.py", line 232, in init
model_args = _process_model_args(models, device, engine_config)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\serve\engine_base.py", line 171, in _process_model_args
model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\serve\engine_base.py", line 171, in
model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\serve\engine_base.py", line 164, in _convert_model_info
model_lib = jit.jit(
^^^^^^^^
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\interface\jit.py", line 164, in jit
_run_jit(
File "C:\ProgramData\miniconda3\envs\mlc-prebuilt\Lib\site-packages\mlc_llm\interface\jit.py", line 124, in _run_jit
raise RuntimeError("Cannot find compilation output, compilation failed")
RuntimeError: Cannot find compilation output, compilation failed

Expected behavior

The chat session starts and runs.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): vulkan
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Windows
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): Intel integrated GPU
  • How you installed MLC-LLM (conda, source): conda
  • How you installed TVM-Unity (pip, source):
  • Python version (e.g. 3.10):
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
  • Any other relevant information:

Additional context

BlindDeveloper added the bug (Confirmed bugs) label on Sep 12, 2024
@0xcrypto

Install gcc; clang is trying to use it for linking:

conda install gcc
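
With a linker available, rerunning the original chat command should let the JIT compilation finish. Per the MLC_JIT_POLICY note in the log above (jit.py:43), a fresh compile can also be forced; a sketch for Windows cmd:

set MLC_JIT_POLICY=REDO
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC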
