-
Thanks @giuseros for sharing this. @krzysz00, purely looking at https://github.com/ROCmSoftwarePlatform/llvm-project-private/issues/678, what is left after #826? IIUC, using rocprof is covered in that PR. So the question remains: what more do we need to make the investment in PAPI, or even lower-level profiling, worthwhile?
I guess I'm still trying to understand the problem statement. My 2 cents: if I stretch my imagination, the only gap I can see is that we lack a finer-grained profiling breakdown within a kernel, i.e. how much address calculation / coordinate transforms cost versus the actual feature data crunching. Therefore, I think it would be great to identify what sort of data points we are after before selecting a solution.
-
It would also be interesting if we could get counters for each rocmlir op. Roughly: propagate profiling info through the lowering passes, and have the profiling tools collect counters per group.
-
@giuseros, can I steal your librocprofiler code? I looked briefly into librocprofiler because I can load the .so into Python and thus use it with torch-mlir, but it required a configuration setup that I wasn't prepared to mess with yet. The reason I considered librocprofiler is that I suspect rocprof doesn't necessarily follow down into subprocesses. I could be wrong, but it's one more variable.
-
Introduction
https://github.com/ROCmSoftwarePlatform/llvm-project-private/issues/678 suggests integrating profiler information into our rocmlir-gen. I think there are different ways to approach this, and this page is to discuss which approach we want to follow.
Profiling in rocmlir-gen
In this section I will list three solutions I think are viable to add profiling information to rocmlir-gen.
Using rocprof (simple)
rocprof is the profiling tool distributed with ROCm. The high-level idea is to enhance MIOpenDriver.py and add the counters we are interested in. Please note that we already use rocprof in MIOpenDriver.py, so we can simply specify an input file with all the metrics we are interested in.
Pros/Cons
Pros:
rocprof is a customer application and the interface is quite stable
Cons:
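
For the rocprof approach above, here is a minimal sketch of how a driver script might pass a counter input file to rocprof. It is not the code in MIOpenDriver.py: the function names, the `./my_kernel_runner` command and the counter list are all hypothetical, and the counters actually available depend on the GPU. The only rocprof features assumed are the `-i` (input file) and `-o` (output CSV) options and the `pmc:` line format of the input file.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def write_counter_file(path: Path, counters: Sequence[str]) -> None:
    """Write a rocprof input file requesting the given hardware counters.

    rocprof reads the counters to collect from lines starting with "pmc:".
    """
    path.write_text("pmc: " + " ".join(counters) + "\n")


def build_rocprof_command(
    input_file: Path, output_csv: Path, app_cmd: Sequence[str]
) -> List[str]:
    """Build the rocprof command line wrapping the application command."""
    return ["rocprof", "-i", str(input_file), "-o", str(output_csv), *app_cmd]


def profile(app_cmd: Sequence[str], counters: Sequence[str], workdir: Path) -> Path:
    """Run app_cmd under rocprof and return the path of the CSV it writes.

    Hypothetical helper: requires rocprof on PATH and a ROCm-capable GPU.
    """
    input_file = workdir / "counters.txt"
    output_csv = workdir / "results.csv"
    write_counter_file(input_file, counters)
    subprocess.run(
        build_rocprof_command(input_file, output_csv, app_cmd), check=True
    )
    return output_csv


if __name__ == "__main__":
    # Placeholder counters and runner; adjust both to the actual setup.
    csv_path = profile(["./my_kernel_runner"], ["SQ_WAVES", "GRBM_GUI_ACTIVE"], Path("."))
    print(f"rocprof results written to {csv_path}")
```

Keeping the command construction separate from the `subprocess.run` call makes it possible to check the generated command line on a machine without a GPU.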
Using PAPI (medium)
I previously ran an investigation into PAPI. PAPI is a high-level API that relies on a low-level API to gather performance metrics (like counters). In particular, for AMD GPUs, it relies on the rocprofiler low-level library. The high-level idea is to instrument the runtime with two functions, one to start and one to end profiling.
The only issue (which we can discuss later) is how to handle the EventSet, which needs to be shared between the start/end functions.
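
One possible way to share that state between the two functions is a single module-level context that the start function fills and the end function consumes. The sketch below only illustrates this pattern in Python and does not call PAPI: `EventSetContext`, `profiler_start`, `profiler_end` and the `read_counters` callback are hypothetical names, and the callback stands in for whatever the real backend would use to read counters.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

CounterReader = Callable[[List[str]], Dict[str, int]]


@dataclass
class EventSetContext:
    """Hypothetical stand-in for the state a PAPI EventSet would carry."""
    events: List[str]
    read_counters: CounterReader
    baseline: Dict[str, int] = field(default_factory=dict)


# Single shared slot: profiler_start fills it, profiler_end empties it.
_active: Optional[EventSetContext] = None


def profiler_start(events: List[str], read_counters: CounterReader) -> None:
    """Record baseline counter values and remember the context."""
    global _active
    if _active is not None:
        raise RuntimeError("profiler_start called while profiling is active")
    ctx = EventSetContext(list(events), read_counters)
    ctx.baseline = read_counters(ctx.events)
    _active = ctx


def profiler_end() -> Dict[str, int]:
    """Return per-event deltas since profiler_start and clear the context."""
    global _active
    if _active is None:
        raise RuntimeError("profiler_end called without profiler_start")
    ctx, _active = _active, None
    final = ctx.read_counters(ctx.events)
    return {name: final[name] - ctx.baseline[name] for name in ctx.events}


if __name__ == "__main__":
    # Fake counter source used only for demonstration.
    ticks = {"value": 0}

    def fake_reader(events: List[str]) -> Dict[str, int]:
        ticks["value"] += 10
        return {name: ticks["value"] for name in events}

    profiler_start(["counter_a", "counter_b"], fake_reader)
    # ... the kernel launch would go here ...
    print(profiler_end())
```

A module-level slot is the simplest option when only one profiling region is active at a time; nested or concurrent regions would need the start function to return a handle that is passed to the end function instead.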
Pros/Cons
Pros:
kernel-repeat launch)
Cons:
Using rocprofiler library (hard)
This last approach makes use of the low-level profiler library rocprofiler, which is the core of rocprof and PAPI.
The approach would be similar to PAPI, i.e., creating two functions to start/stop the profiler and dumping the metrics at the end (and handling some profiling context between start/end).
The main difference is that rocprofiler is a much lower-level API, and as such is more complicated to use. I eventually made it work, but I had to struggle a lot to understand how it could be used from within MLIR.
Pros/Cons
Pros:
rocprof
Cons:
rocprofiler a low level library, the interface tends to change more often