Adding profile information to `rocmlir-gen` #840

giuseros · 2022-11-02T17:24:29Z

giuseros
Nov 2, 2022
Collaborator

Introduction

https://github.com/ROCmSoftwarePlatform/llvm-project-private/issues/678 suggests integrating profiler information into rocmlir-gen. I think there are different ways to approach this, and this page is meant to discuss which approach we want to follow.

Profiling in `rocmlir-gen`

In this section I will list three solutions that I think are viable for adding profiling information to rocmlir-gen.

Using `rocprof` (simple)

rocprof is the profiling tool distributed with ROCm. The high-level idea is to enhance MIOpenDriver.py and add the counters we are interested in. Please note that we already use rocprof in MIOpenDriver.py, so we can simply specify an input file listing all the metrics we are interested in.
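To make the "input file" idea concrete, here is a minimal sketch of generating such a file and the matching command line. It is written in C++ only to match the snippets further down (the driver itself is Python). It assumes the `pmc :` line format and the `-i`/`-o` options of the rocprof command-line tool; the helper names `makeRocprofInput` and `makeRocprofCommand`, the file names, and the application path are hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Build the text of a rocprof input file: a single "pmc :" line
// listing the hardware counters to collect.
std::string makeRocprofInput(const std::vector<std::string> &counters) {
  std::ostringstream os;
  os << "pmc :";
  for (const std::string &counter : counters)
    os << ' ' << counter;
  os << '\n';
  return os.str();
}

// Build the command line that runs `app` under rocprof, reading the
// counter list from `inputFile` and writing results to `outputCsv`.
std::string makeRocprofCommand(const std::string &inputFile,
                               const std::string &outputCsv,
                               const std::string &app) {
  return "rocprof -i " + inputFile + " -o " + outputCsv + " " + app;
}
```

For example, `makeRocprofInput({"SQ_INSTS_VALU"})` would request the same counter that the PAPI snippet below monitors.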

Pros/Cons

Pros:

Very simple approach
Easy to maintain: rocprof is a customer-facing application and its interface is quite stable

Cons:

Coarse-grained: we can only profile the whole application.
Possibly noisier: since we are profiling the whole application, counters might be affected by factors other than the kernel we run

Using PAPI (medium)

I previously ran an investigation into PAPI. PAPI is a high-level API that relies on a low-level API to gather performance metrics (like counters). In particular, for AMD GPUs, it relies on the low-level rocprofiler library. The high-level idea is to instrument the runtime with two functions:

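#include <cstdio>
#include <papi.h>

// Event set used by both functions. It is kept at file scope here only so
// that the snippet compiles; how to share it is still an open question.
static int EventSet = PAPI_NULL;
// Number of events added to EventSet.
constexpr int N = 1;

extern "C" void startProfiler() {
  // Initialize the library (returns PAPI_VER_CURRENT on success)
  int retval = PAPI_library_init(PAPI_VER_CURRENT);
  if (retval != PAPI_VER_CURRENT)
    return;
  // Create the event set and add the events we want to monitor
  retval = PAPI_create_eventset(&EventSet);
  if (retval != PAPI_OK)
    return;
  retval = PAPI_add_named_event(EventSet, "rocm:::SQ_INSTS_VALU:device=0");
  // add other events (and increase N accordingly)
  PAPI_reset(EventSet);
  retval = PAPI_start(EventSet);
  if (retval != PAPI_OK)
    fprintf(stderr, "PAPI_start failed: %d\n", retval);
}

extern "C" void stopProfiler() {
  long long values[N];

  // Stop monitoring and read the counters
  int retval = PAPI_stop(EventSet, values);
  if (retval != PAPI_OK)
    return;

  // Print the values of the events
  printf("rocm:::SQ_INSTS_VALU:device=0 %lli\n", values[0]);
}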

The only issue (that we can discuss later) is how to handle the EventSet (which needs to be shared between start/end functions).
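One possible way to handle that shared state, sketched here purely as an illustration, is to keep it inside an object whose lifetime brackets the kernel launch. The sketch below does not call PAPI at all: `CounterBackend`, `ScopedProfiler` and `profileRepeats` are hypothetical names, and the two callbacks stand in for whatever would wrap `PAPI_start`/`PAPI_stop` on the event set.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical backend: in a real build, `start` and `stop` would wrap
// PAPI_start/PAPI_stop on an event set owned by the backend.
struct CounterBackend {
  std::function<void()> start;
  std::function<std::vector<long long>()> stop;
};

// RAII guard: starts counting on construction; on destruction it stops
// counting and hands the counter values to `sink`.
class ScopedProfiler {
public:
  using Sink = std::function<void(const std::vector<long long> &)>;

  ScopedProfiler(CounterBackend &backend, Sink sink)
      : backend_(backend), sink_(std::move(sink)) {
    backend_.start();
  }
  ~ScopedProfiler() { sink_(backend_.stop()); }

  ScopedProfiler(const ScopedProfiler &) = delete;
  ScopedProfiler &operator=(const ScopedProfiler &) = delete;

private:
  CounterBackend &backend_;
  Sink sink_;
};

// Profile each iteration of a kernel-repeat loop separately and return
// one vector of counter values per iteration.
std::vector<std::vector<long long>>
profileRepeats(CounterBackend &backend, int repeats,
               const std::function<void()> &launchKernel) {
  std::vector<std::vector<long long>> perIteration;
  for (int i = 0; i < repeats; ++i) {
    ScopedProfiler guard(backend, [&](const std::vector<long long> &values) {
      perIteration.push_back(values);
    });
    launchKernel();
  } // guard is destroyed at the end of each iteration, stopping the counters
  return perIteration;
}
```

With this shape the event set never has to be a global: it lives in whatever object provides the two callbacks. Whether that fits the runner's C interface is a separate question.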

Pros/Cons

Pros:

The approach is still quite simple to implement
Finer-grained profiling (for instance, we can profile each iteration of a kernel-repeat launch loop)
Resetting the metrics before starting the kernel means that the results should be less noisy

Cons:

Additional dependency on an external library
Slightly harder to maintain (PAPI is a high-level library, not an application, and the interface changes more often)

Using `rocprofiler` library (hard)

This last approach makes use of rocprofiler, the low-level profiler library at the core of both rocprof and PAPI.

The approach would be similar to PAPI, i.e., creating two functions to start/stop the profiler and dumping the metrics at the end (and handling some profiling context between start/end).

The main difference is that rocprofiler is a much lower-level API, and as such is more complicated to use. I eventually made it work, but I had to struggle a lot to understand how it could be used from within MLIR.

Pros/Cons

Pros:

We don't depend on any external library
Like PAPI, we don't profile the entire application, just the kernel we dispatch
Like PAPI (probably even more?) results would be less noisy than using rocprof

Cons:

Hard to implement (at least for me)
Hard to maintain: since rocprofiler is a low-level library, its interface tends to change more often

giuseros · 2022-11-02T17:27:32Z

giuseros
Nov 2, 2022
Collaborator Author

cc: @sjw36 @krzysz00

0 replies

manupak · 2022-11-03T09:32:45Z

manupak
Nov 3, 2022
Collaborator

Thanks @giuseros for sharing this.

@krzysz00, purely looking at https://github.com/ROCmSoftwarePlatform/llvm-project-private/issues/678, what is left after #826?

IIUC, using rocprof is covered in that PR.

So the question remains: what more do we need that would make the investment in PAPI, or even lower-level profiling, worthwhile?
Since we only profile the GPU side of things, and given that we can easily create carve-outs in the IR in MLIR, are the following cons applicable?

Coarse grained : we can only profile the whole application.
Possibly more noisy : since we are profiling the whole application, counters might be affected by other factors other than the kernel we run

I guess I'm still trying to understand the problem statement.

My 2 cents here:

If I had to stretch my imagination, the only thing I can see is that we lack a finer, kernel-breakdown-based profiling, i.e. how much the address calculation/coordinate transforms cost versus the actual feature data crunching. Therefore, I think it would be great to identify what sort of data points we are after before selecting a solution.

6 replies

giuseros Nov 3, 2022
Collaborator Author

P.S. another reason I assumed API profiling was preferred to rocprof is that in rocmlir-gen there is a TBD task about plumbing the profiler through the runner.

jungpark-mlir Nov 3, 2022
Collaborator

Some comments:

rocprof also gives per-kernel execution time in the result file. For example, some MIOpen convolutions consist of an im2col kernel followed by a separate gemm kernel.
rocprof is also expected to provide host-side API profiling, but I've never succeeded in getting numbers. I tried several different builds of rocprof, but no luck yet. It'd be helpful if we could easily get that information from PAPI.
Another PAPI feature that might be helpful is its sampling mode. During the PAPI presentation, I heard it can gather counters in a more fine-grained fashion, so that we can inspect counters within a kernel. I suppose the low-level feature for this is provided by the rocprofiler lib, but I haven't seen anything like it in the rocprof tool. I suppose people rely on ThreadTrace for fine-grained investigation; it gathers some information at cycle level, but we can only get a trace from a limited number of CUs, and it is painful to investigate because it contains too much information.

giuseros Nov 3, 2022
Collaborator Author

Hi Jungwook, thanks for the comment!

I am not sure I follow entirely though: I have been able to collect metrics from PAPI and from librocprofiler, so implementation-wise we should (hopefully) be fine. My question (and Manupa's question, I think) is: what do PAPI/librocprofiler give us that rocprof doesn't?

Basically, what do you mean by:

it can gather counters more fine-grained fashion that we can inspect counters within a kernel

With rocprof I can already extract those counters (I think). Is there any counter I can access with PAPI/librocprofiler that I cannot access with rocprof?

jungpark-mlir Nov 3, 2022
Collaborator

Sorry, my comment was not clear.
PAPI's sampling mode collects counters at a finer sampling period, so the user can observe how the counters change within a kernel.
For example, a kernel starts launching waves at the beginning and waves retire at the end, so we could observe how GPU utilization changes over the kernel execution.
I don't have a good idea of how to relate this feature to the MLIR integration; I just wish there were a GUI tool to easily visualize this information.
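To illustrate what "observing the counter changes within a kernel" could look like once samples are available, here is a minimal sketch that turns a series of counter readings into a per-interval rate. It assumes each sample carries a timestamp and a cumulative counter value; that layout, the `Sample` struct and `ratesPerInterval` are hypothetical and not taken from PAPI.

```cpp
#include <cstddef>
#include <vector>

// One reading of a cumulative hardware counter (hypothetical layout).
struct Sample {
  double timeUs;   // time since kernel start, in microseconds
  long long value; // cumulative counter value at that time
};

// Convert cumulative samples into one rate (events per microsecond) per
// sampling interval. Intervals with a non-positive duration are skipped.
std::vector<double> ratesPerInterval(const std::vector<Sample> &samples) {
  std::vector<double> rates;
  for (std::size_t i = 1; i < samples.size(); ++i) {
    const double dt = samples[i].timeUs - samples[i - 1].timeUs;
    const long long dv = samples[i].value - samples[i - 1].value;
    if (dt <= 0.0)
      continue; // avoid dividing by zero on degenerate intervals
    rates.push_back(static_cast<double>(dv) / dt);
  }
  return rates;
}
```

Plotting the returned rates against time would give the kind of ramp-up and ramp-down timeline described above.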

manupak Nov 4, 2022
Collaborator

OK,

I think what we are converging on is that intra-kernel measurements are what we could get from PAPI (or lower-level solutions).

I think it would be great to do a case study, picking example kernel configurations from here: http://rocmhead.amd.com:8080/job/MLIR/job/mlir-nightly-all/Performance_20report_20for_20gfx908/ that use the same type of solver (i.e. where MIOpen uses IGEMM) but result in different performance.

My hunch is that we trade off LDS for more address generation instructions (i.e. coordinate transforms). Though that could already be observable via rocprof counters, we could check whether this is true by looking at a timeline, as @jungpark-mlir alludes to here, and see if there is more going on.

Then I think we can have a separate ticket to add intra-kernel profiling, but that should not be mutually exclusive with rocprof-based profiling, IMO.

What do others think?

jungpark-mlir · 2022-11-03T12:51:53Z

jungpark-mlir
Nov 3, 2022
Collaborator

It'd also be interesting if we could get counters for each rocmlir op. Roughly: propagate profiling info through the lowering passes and have the profiling tools collect counters per group.
Sorry, I don't have enough knowledge on this and I'm not even sure it's possible.

2 replies

manupak Nov 3, 2022
Collaborator

For op-based profiling, this might be interesting to look at: https://www.tensorflow.org/mlir/xla_gpu_codegen#step_2_optional_profiling_support

I think we can still have this functionality built into the rocmlir-gen profiling flow: slice on a per-op basis and wrap each slice with profiling calls, rather than doing a monolithic program construction with profiling hooks on it, if the above (XLA) reasoning is true.

jungpark-mlir Nov 10, 2022
Collaborator

Actually I was thinking of something lower level, within a kernel. What XLA is doing looks like a higher-level or CPU-only feature.
For example,
On CPU, I suppose we can get per-basic-block hit counts as profile information, which gives us an extremely low-level breakdown of execution time. The compiler also knows the instruction count of each basic block.
Something like that might be helpful for investigating the overhead observed in the convolution kernel beyond the gemm computation itself.

I think this doesn't need to be included as part of automation or tuning; it is just another profiling feature.
Such a feature might already be supported in the AMD GPU compiler. Not sure, but we can ask if there is any.
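As a rough sketch of the breakdown this would enable, the code below combines per-basic-block hit counts with static instruction counts to estimate what share of the executed instructions falls into each category. The `BlockProfile` struct, the category labels and `instructionShare` are hypothetical, and executed-instruction count is only a proxy for execution time.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Profile data for one basic block (hypothetical layout).
struct BlockProfile {
  std::string category;     // e.g. "address-calc" or "gemm"
  std::uint64_t hitCount;   // how many times the block executed
  std::uint64_t instrCount; // static instruction count of the block
};

// Fraction of dynamically executed instructions per category.
// Returns an empty map if nothing was executed.
std::map<std::string, double>
instructionShare(const std::vector<BlockProfile> &blocks) {
  std::map<std::string, std::uint64_t> executed;
  std::uint64_t total = 0;
  for (const BlockProfile &block : blocks) {
    const std::uint64_t n = block.hitCount * block.instrCount;
    executed[block.category] += n;
    total += n;
  }
  std::map<std::string, double> share;
  if (total == 0)
    return share;
  for (const auto &entry : executed)
    share[entry.first] =
        static_cast<double>(entry.second) / static_cast<double>(total);
  return share;
}
```

A large share for the address-calculation blocks relative to the gemm blocks would point at coordinate-transform overhead as something worth investigating.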

pcf000 · 2022-11-04T00:27:30Z

pcf000
Nov 4, 2022

@giuseros, can I steal your librocprofiler code? I looked briefly into librocprofiler because I can load the .so into Python and thus use it with torch-mlir, but it required a configuration setup that I wasn't prepared to mess with yet.

The reason I considered librocprofiler is that I have a suspicion that rocprof doesn't necessarily follow down into subprocesses. I could be wrong, but it's one more variable.

2 replies

giuseros Nov 4, 2022
Collaborator Author

Hi Paul, I sent you a PM about this!

manupak Nov 4, 2022
Collaborator

Don't we already use rocprof with subprocesses here:

https://github.com/ROCmSoftwarePlatform/rocMLIR/blob/19a24cb047d5f8a3b3eb032694fe0a28bbc95b69/mlir/utils/performance/perfRunner.py#L527-L552
