"ze_peak" freezes on DG1 with latest drm-tip kernel + drivers #20

Open
eero-t opened this issue Oct 7, 2022 · 3 comments

Comments

@eero-t

eero-t commented Oct 7, 2022

Setup:

  • HW: CML-S / DG1 (0x4905)
  • OS: Ubuntu 22.04
  • Kernel: "drm-tip" head from yesterday
  • UMD: Latest releases of compute stack components, built with LLVM 12
  • App: "ze_peak" from level-zero-tests head

Bug:

./ze_peak freezes with 99% CPU usage after showing:
Single Precision Compute (GFLOPS)

(I.e. the global bandwidth and half precision tests before it worked fine.)

It can be quit with ^C, so it's not stuck in an uninterruptible 100% CPU loop.

GDB shows:

warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable.  Connect to gdbserver inside the container.
0x00007f6fbca28cab in sched_yield () from target:/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f6fbca28cab in sched_yield () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f6fbc27cd63 in ?? () from target:/usr/local/lib/libze_intel_gpu.so.1
#2  0x00007f6fbc0572c2 in ?? () from target:/usr/local/lib/libze_intel_gpu.so.1
#3  0x0000564de2c87d3f in ?? ()
#4  0x0000564de2c88653 in ?? ()
#5  0x0000564de2c94ba1 in ?? ()
#6  0x0000564de2c86104 in ?? ()
#7  0x00007f6fbc949d90 in ?? () from target:/lib/x86_64-linux-gnu/libc.so.6
#8  0x00007f6fbc949e40 in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#9  0x0000564de2c862e5 in ?? ()

perf showed most of the time being spent inside libze_intel_gpu.so.1, so it could be a driver issue, but I thought it better to start with the app.

ze_image_copy, ze_nano and ze_pingpong work fine. ze_bandwidth gets slower and slower, and I did not wait for it to complete.

@jandres742

@eero-t: could you check whether it is just taking a long time? Please run it with a reduced number of iterations:

-i 5

@eero-t

eero-t commented Oct 10, 2022

With -i 2, "Global memory bandwidth" numbers were output at 1s intervals, "Half Precision Compute" numbers at 2s intervals, "Single Precision Compute" numbers at 5-10s intervals, and "Integer Compute" numbers at 10-20s intervals.

In total it took 3.5 mins with -i 2, and 4.4 mins with -i 5.
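
As a rough sanity check on these two timings, one can fit a straight line T = a + b * i through them. This is only a sketch: it assumes run time grows linearly with the -i value, which two data points cannot confirm, and the function names are made up for illustration.

```python
# Two measurements reported above: (value passed to -i, total wall time in minutes).
runs = [(2, 3.5), (5, 4.4)]


def fit_linear(p1, p2):
    """Return (fixed_overhead, per_iteration_cost) for the model T = a + b * i."""
    (i1, t1), (i2, t2) = p1, p2
    b = (t2 - t1) / (i2 - i1)  # minutes added per extra iteration
    a = t1 - b * i1            # minutes that do not depend on the iteration count
    return a, b


def predicted_minutes(iterations, a, b):
    """Predicted total run time under the (assumed) linear model."""
    return a + b * iterations


overhead, per_iter = fit_linear(*runs)
print(f"fixed overhead ~{overhead:.1f} min, per-iteration cost ~{per_iter:.1f} min")
```

Under that assumption, about 2.9 minutes are fixed cost and each extra iteration adds about 0.3 minutes, so a healthy run lasting 40 minutes would need well over a hundred iterations.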

What's the default iteration count? With the default count, I see this in dmesg:

[271298.886789] Fence expiration time out i915-0000:03:00.0:ze_peak[226719]:788!
[271298.887157] Fence expiration time out i915-0000:03:00.0:ze_peak[226719]:786!

Which may explain why it freezes.
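
To check whether these warnings line up with a given run, the dmesg lines can be picked apart with a small script. The sketch below is based only on the two lines quoted above; the pattern is not taken from the kernel source, and the name `fence_id` for the trailing number is an assumption about what that field means.

```python
import re

# Pattern derived only from the dmesg lines quoted in this issue.
# "fence_id" is a hypothetical name for the trailing number.
FENCE_TIMEOUT_RE = re.compile(
    r"\[\s*(?P<timestamp>\d+\.\d+)\]\s+Fence expiration time out\s+"
    r"(?P<driver>\w+)-(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"
    r"(?P<comm>[^\[\]:]+)\[(?P<pid>\d+)\]:(?P<fence_id>\w+)!"
)

SAMPLE = """\
[271298.886789] Fence expiration time out i915-0000:03:00.0:ze_peak[226719]:788!
[271298.887157] Fence expiration time out i915-0000:03:00.0:ze_peak[226719]:786!
"""


def find_fence_timeouts(log_text):
    """Yield one dict per fence timeout warning found in log_text."""
    for match in FENCE_TIMEOUT_RE.finditer(log_text):
        record = match.groupdict()
        record["timestamp"] = float(record["timestamp"])
        record["pid"] = int(record["pid"])
        yield record


for hit in find_fence_timeouts(SAMPLE):
    print(hit["driver"], hit["device"], hit["comm"], hit["pid"], hit["fence_id"])
```

Comparing the extracted PID with `pidof ze_peak` shows whether a warning belongs to the run that is currently stuck.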

With the default iteration count, no numbers are shown for "Single Precision Compute" even after 40 mins, so I think that test is really frozen, especially as the numbers for the two earlier categories came with delays of only a few seconds.

The benchmark code may be missing some error checks, and warnings for those errors (e.g. saying when a given test is skipped).
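
The backtrace earlier in this thread shows the process sitting in sched_yield(), which would fit a wait loop that has no deadline. A minimal sketch of the kind of check meant here, in Python rather than the benchmark's own code: `wait_until` and its parameters are hypothetical names, and this is not how ze_peak or the driver is actually written.

```python
import os
import time


def wait_until(is_done, timeout_s, yield_cpu=os.sched_yield):
    """Poll is_done() until it returns True or timeout_s seconds have passed.

    Returns True if the condition was met, False if the deadline expired.
    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout_s
    while not is_done():
        if time.monotonic() >= deadline:
            return False  # caller can warn and skip the test instead of spinning
        yield_cpu()       # give up the CPU between polls, as the trace suggests
    return True


if not wait_until(lambda: False, timeout_s=0.05):
    print("warning: work did not complete in time, skipping this test")
```

Level Zero synchronization calls such as zeCommandQueueSynchronize take a timeout argument, so an equivalent check looks possible in the app, though whether ze_peak passes a finite timeout is not established here.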


As to ze_bandwidth, that finished in a bit over 4 mins with its default options, so it is fine.


PS. Why do both of these GPU benchmarking programs constantly take 100% CPU and need to allocate 32TB of virtual memory?

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                    
 226471 root      20   0   32,0t  55040  40752 R 100,3   0,3   1:56.89 ze_bandwidth
...
     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                    
 226719 root      20   0   32,0t   2,0g  71788 R 100,0  12,7   0:13.69 ze_peak 

@eero-t

eero-t commented Nov 25, 2022

The latest ze_peak still freezes in the "Single Precision" test with the following stack:

  • kernel: drm-tip 6.1.0-rc5
  • GuC FW: 70.5.1
  • GMMlib: intel-gmmlib-22.3.1
  • SPIRV-SDK: sdk-1.3.231.1/sdk-1.3.231.1 (headers/tools)
  • SPIRV-LLVM: libllvmspirvlib-12-dev:amd64:12.0.0-3 (Ubuntu package)
  • OpenCL-Clang: libopencl-clang-12-dev:amd64:12.0.0-3 (Ubuntu package)
  • VC-intrinsics: v0.9.0
  • Graphics Compiler: igc-1.0.12662.1 (IGC)
  • Level-Zero API: v1.8.8
  • compute-runtime: 22.43.24558

The same kernel driver warnings appear again:

[859809.534534] Fence expiration time out i915-0000:03:00.0:ze_peak[438677]:788!
[859809.534952] Fence expiration time out i915-0000:03:00.0:ze_peak[438677]:786!

strace -f -p $(pidof ze_peak) shows it doing nothing but sched_yield() system calls.

perf shows its 100% CPU usage going to:

Overhead  Command          Shared Object                     Symbol
   7,68%  ze_peak          libze_intel_gpu.so.1.3.0          [.] 0x000000000023d714
   6,44%  swapper          [kernel.kallsyms]                 [k] mwait_idle_with_hints.constprop.0
   3,17%  ze_peak          [kernel.kallsyms]                 [k] check_preemption_disabled
   3,12%  ze_peak          [kernel.kallsyms]                 [k] preempt_count_add
   2,82%  ze_peak          [kernel.kallsyms]                 [k] __schedule
   2,66%  ze_peak          libc.so.6                         [.] _help
   2,56%  ze_peak          [vdso]                            [.] __vdso_clock_gettime
   2,17%  ze_peak          [kernel.kallsyms]                 [k] _raw_spin_lock
   2,04%  ze_peak          [kernel.kallsyms]                 [k] entry_SYSRETQ_unsafe_stack
   1,91%  ze_peak          [kernel.kallsyms]                 [k] update_curr
   1,83%  ze_peak          [kernel.kallsyms]                 [k] preempt_count_sub
   1,83%  ze_peak          [kernel.kallsyms]                 [k] pick_next_task_fair
   1,46%  ze_peak          [kernel.kallsyms]                 [k] sched_clock
   1,44%  ze_peak          [kernel.kallsyms]                 [k] rcu_note_context_switch
   1,39%  ze_peak          [kernel.kallsyms]                 [k] __entry_text_start
