Forked from https://github.com/microsoft/onnxruntime
This comparison shows only my changes.
ONNX Runtime is a framework for running ML models. It can run any model in the popular ONNX format.
perf is a Linux utility for measuring software and hardware event counters. It is especially useful for measuring CPU events.
- `onnxruntime/core/common/perf_profiler.h` and `onnxruntime/core/common/perf_profiler.cc`
  - Work with the `perf_event_open` API.
- `onnxruntime/core/session/inference_session.cc`
  - Loads the perf config json from the filename in `config_options`
  - Initializes the perf config object and saves it in the Profiler
- `onnxruntime/core/common/profiler.h` and `onnxruntime/core/common/profiler.cc`
  - Stores the perf configuration object here.
  - Modified `EndTimeAndRecordEvent` to take in a list of (`str`, `str`) pairs to append to the json.
- `onnxruntime/core/framework/sequential_executor.cc`
  - This is where onnxruntime records profiler info per-layer
  - The perf profiler is therefore called here
- You need `perf` installed on your Linux kernel
- You also need to install `libpfm4`.
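A quick way to sanity-check both prerequisites from Python. This is a hedged sketch: it assumes `perf` is on your `PATH` and that `libpfm4` installs a shared library discoverable under the name `pfm`, which is typical but not guaranteed for every distro:

```python
import shutil
from ctypes.util import find_library

def check_prerequisites():
    """Report whether the perf CLI and the libpfm4 shared library look installed."""
    return {
        # `perf` usually ships in a linux-tools / perf package
        "perf_cli": shutil.which("perf") is not None,
        # libpfm4 usually installs libpfm.so, discoverable as "pfm"
        "libpfm4": find_library("pfm") is not None,
    }

print(check_prerequisites())
```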
```shell
./build.sh --config RelWithDebInfo --build_wheel --parallel
```
You can add `--skip_tests` if you fail those (I did).
With the path to the `.whl` file from your build:
```shell
pip3 install ../onnxruntime/build/Linux/RelWithDebInfo/dist/onnxruntime-1.12.0-cp310-cp310-linux_x86_64.whl
```
Adding `--force-reinstall` will force a reinstall, which is useful for testing if you have the official or a previous version of onnxruntime installed. If you are in a Python virtual environment, make sure to re-activate it to pick up the new package.
```python
import os
import onnxruntime

sess_options = onnxruntime.SessionOptions()
# enable builtin profiler
sess_options.enable_profiling = True
# specify path to perf configuration json
sess_options.add_session_config_entry("session.profiler.perf_config_file_name", os.path.abspath("perf_config.json"))
sess = onnxruntime.InferenceSession(model_filename, sess_options=sess_options)
# then run sess.run() on your model...
```
Then run `sess.run()` with your model. A json should appear in your working directory. You can open it directly, or load it in chrome://tracing.
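Instead of chrome://tracing, you can also post-process the trace in Python. A minimal sketch, assuming the appended (key, value) pairs land in each event's `args` dict of the chrome-trace output, keyed by the labels from `perf_config.json`; the event contents below are made up for illustration:

```python
import json
from collections import defaultdict

def sum_counter_per_op(events, counter_name):
    """Sum one perf counter across profiler events, grouped by event name.

    Assumes chrome-trace style dicts whose "args" carry the appended
    (name, value) pairs as strings.
    """
    totals = defaultdict(int)
    for ev in events:
        value = ev.get("args", {}).get(counter_name)
        if value is not None:
            totals[ev["name"]] += int(value)
    return dict(totals)

# Made-up events shaped like a chrome trace (normally: json.load(open(path))):
events = json.loads("""[
  {"name": "Conv_kernel_time", "args": {"cycles": "1200", "instructions": "3000"}},
  {"name": "Conv_kernel_time", "args": {"cycles": "800"}},
  {"name": "Relu_kernel_time", "args": {"cycles": "100"}}
]""")
print(sum_counter_per_op(events, "cycles"))
# {'Conv_kernel_time': 2000, 'Relu_kernel_time': 100}
```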
For a tutorial on running a simple ONNX model with profiling, see: https://onnxruntime.ai/docs/api/python/auto_examples/plot_profiling.html
However, that model is too simple to reach the Sequential Executor (which is what my profiler hooks into). So, use a more complex model such as `sigmoid.onnx`
from here: https://onnxruntime.ai/docs/api/python/auto_examples/plot_load_and_predict.html#sphx-glr-auto-examples-plot-load-and-predict-py
For a full example, see `onnx_profiling_example.py`.
An example `perf_config.json` could be:
```json
{
  "perf::PERF_COUNT_HW_CPU_CYCLES": "cycles",
  "perf::PERF_COUNT_HW_INSTRUCTIONS": "instructions",
  "perf::PERF_COUNT_HW_CACHE_DTLB:READ:ACCESS": "L1-dcache-loads"
}
```
Each key is the name of a perf event which `libpfm4` can look up and translate to a `perf_event_attr` for use by `perf_event_open`. To find valid perf events for your CPU, use `check_events` and `showevtinfo` in the `examples` folder of your `libpfm4` install. The value is whatever you want to name your event; here, I am using the corresponding event names that my perf user program uses.
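Since the configuration is plain json, it can also be generated right before creating the session. A small sketch; the two event strings are the same libpfm4-style names used in the example above:

```python
import json
import os

# libpfm4 event string -> the label you want attached to each layer's record
perf_config = {
    "perf::PERF_COUNT_HW_CPU_CYCLES": "cycles",
    "perf::PERF_COUNT_HW_INSTRUCTIONS": "instructions",
}

config_path = os.path.abspath("perf_config.json")
with open(config_path, "w") as f:
    json.dump(perf_config, f, indent=2)
# config_path can now be passed to:
# sess_options.add_session_config_entry("session.profiler.perf_config_file_name", config_path)
```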
- "Bad file descriptor": this is a problem when calling the `perf_event_open` API. Could be one of two things:
  - perf does not have permissions: set `/proc/sys/kernel/perf_event_paranoid` to 2 or lower (lower values are more permissive)
  - The perf configuration json has an invalid event string. Make sure the perf events in the json are valid and available on your computer. You can use tools like the event checker in the `libpfm4` library (which this uses) to verify.
- All counters are 0, and they shouldn't be
  - This happens when you try to pass in too many perf hardware event counters. The CPU has a special Performance Monitoring Unit (PMU) which only has enough registers to record a few hardware counters at once. On my CPU this limit is 4 (3 for cache counters).
  - Solution: remove some hardware events
  - The perf user program (e.g. `perf stat`) performs multiplexing to support more event counters: it quickly cycles through which counters it records and reports the percentage of time each counter was actually being measured. For more information, this is a good read.
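When perf multiplexes, it extrapolates each raw count by the fraction of time the event actually occupied a PMU counter (`count * time_enabled / time_running`, from the values `perf_event_open` can report). A small sketch of that estimate; the numbers are made up:

```python
def scale_multiplexed(raw_count, time_enabled_ns, time_running_ns):
    """Estimate the true count for a multiplexed perf event.

    perf reports time_enabled (how long the event was requested) and
    time_running (how long it actually occupied a PMU counter); the
    standard estimate extrapolates the raw count linearly.
    """
    if time_running_ns == 0:
        return 0  # counter was never scheduled; nothing to extrapolate from
    return raw_count * time_enabled_ns // time_running_ns

# e.g. counted 1,000,000 cycles while scheduled only half the time:
print(scale_multiplexed(1_000_000, 2_000_000, 1_000_000))  # 2000000
```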
- There are no perf counters at all
  - Check the exact spelling of the configuration keys/values
  - Make sure your model is complex enough to reach the Sequential Executor, because that's where I put the per-layer profiling. In the onnxruntime python examples, `sigmoid.onnx` was complex enough, but `mul_1.onnx` wasn't.
- ONNX Runtime already has a built-in profiler which records how much time each layer takes to execute. If you just want timing info, use that.
  - It also has a memory profiler for how much memory each layer uses, which can be optionally enabled with a compiler flag.
- ONNX Runtime can be compiled to support NVTX, an Nvidia tool for monitoring GPU performance counters, including hardware performance counters on the GPU. It seems to work by adding events to ONNX Runtime that NVTX can listen for.
- The Linux `perf record` tool can record at high granularity (sampling at 1000+ Hz), which is enough to capture performance counter info for functions that run for more than a few milliseconds (the ones we care about)
  - perf can also be configured to start/stop recording at particular code breakpoints. This is extremely useful for profiling individual functions or segments of code in long-running programs*.
  - However, in both of these approaches it can be difficult to differentiate between the different layers of the ML model. Convolutions and matrix multiplications from different layers can be fused into the same function for efficiency reasons, so it can be tricky to figure out which layer belongs to which function.
  - I did once try lining up the `perf record` timestamps with timestamps from the ONNX Runtime builtin profiler. However, they used different system clocks, and I found I had to modify the ONNX Runtime profiler anyway, so I might as well add a function to record the perf counters per-layer.
*In hindsight, this is the approach I should have used for this program: adding specific events to ONNX Runtime that the regular perf user program could simply listen for, which is what I think ONNX Runtime does with its NVTX integration.
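On the clock-alignment problem mentioned above: a rough workaround (not what this fork does) is to estimate the offset between the two clocks once and shift one timeline onto the other. A sketch, assuming one trace uses the wall clock and the other a monotonic clock:

```python
import time

def monotonic_to_wall_offset():
    """Estimate (wall clock - monotonic clock), so monotonic timestamps
    can be shifted onto the wall-clock timeline.

    Only a rough estimate: it breaks if the wall clock is stepped
    (e.g. by NTP) during the run, which is exactly why aligning two
    profilers' clocks after the fact is fragile.
    """
    # Sample both clocks back-to-back to keep the skew small.
    mono = time.clock_gettime(time.CLOCK_MONOTONIC)
    wall = time.time()
    return wall - mono

offset = monotonic_to_wall_offset()
# Shift a monotonic timestamp onto the wall-clock timeline:
aligned = time.clock_gettime(time.CLOCK_MONOTONIC) + offset
print(abs(aligned - time.time()) < 0.05)  # True
```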
`perf_event_open` is called here on the current process pid; I call it in onnxruntime's Sequential Executor. If it spawns new processes, `perf` should track those (and even kernel processes if `perf_event_paranoid` is permissive enough). However, I don't think it can keep track of the counters from pre-existing processes, e.g. services. I haven't checked whether onnxruntime uses those.