From 0b60dcc78952c1ae5ccdfdc6368fbdf51c857d2a Mon Sep 17 00:00:00 2001 From: Jason Andrews Date: Fri, 13 Dec 2024 21:40:50 +0000 Subject: [PATCH] Complete review of WindowsPerf with SPE --- .../_index.md | 4 +- .../_review.md | 8 +- .../windowsperf_sampling_cpython_spe.md | 7 +- ...dowsperf_sampling_cpython_spe_example_1.md | 108 +++++++++++------- ...dowsperf_sampling_cpython_spe_example_2.md | 23 ++-- 5 files changed, 89 insertions(+), 61 deletions(-) diff --git a/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/_index.md b/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/_index.md index 15f237674..d3dfb5837 100644 --- a/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/_index.md +++ b/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/_index.md @@ -1,5 +1,5 @@ --- -title: Sampling CPython with Arm SPE with WindowsPerf +title: Sampling CPython with WindowsPerf and Arm SPE draft: true cascade: draft: true @@ -16,7 +16,7 @@ learning_objectives: prerequisites: - Windows on Arm desktop or development machine with [WindowsPerf](/install-guides/wperf), [Visual Studio](/install-guides/vs-woa/), and [Git](/install-guides/git-woa/) installed. - - The system must also have an Arm CPU with SPE support. + - The Windows on Arm system must have an Arm CPU with SPE support. author_primary: Przemyslaw Wirkus diff --git a/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/_review.md b/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/_review.md index 29beb2948..23b93cc09 100644 --- a/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/_review.md +++ b/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/_review.md @@ -32,7 +32,7 @@ review: - questions: question: > - WindowsPerf can be used and executed only on native ARM64 WOA hardware, and not in a virtual environment.
+ WindowsPerf can be used and executed only on native Windows on Arm hardware, and not in a virtual environment. answers: - "True" - "False" @@ -62,7 +62,7 @@ review: - questions: question: > - Is load_filter is one of SPE filters supported by WindowsPerf? + load_filter is one of the SPE filters supported by WindowsPerf. answers: - "True" - "False" @@ -72,7 +72,7 @@ review: - questions: question: > - Is store_filter is one of SPE filters supported by WindowsPerf? + store_filter is one of the SPE filters supported by WindowsPerf. answers: - "True" - "False" @@ -82,7 +82,7 @@ review: - questions: question: > - Is branch_filter is one of SPE filters supported by WindowsPerf? + branch_filter is one of the SPE filters supported by WindowsPerf. answers: - "True" - "False" diff --git a/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe.md b/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe.md index 29cb3bd64..025d2a325 100644 --- a/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe.md +++ b/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe.md @@ -4,7 +4,7 @@ title: An overview of CPython sampling with SPE weight: 2 --- -In this example, you will build a debug build of CPython from sources and execute simple instructions in the Python interactive mode to obtain WindowsPerf sampling results from the CPython runtime image. +In this example, you will build a debug build of CPython from source and execute simple instructions in the Python interactive mode to obtain WindowsPerf sampling results from the CPython runtime image.
## Introduction to the Arm Statistical Profiling Extension (SPE) @@ -21,7 +21,8 @@ WindowsPerf includes `record` support for the Arm Statistical Profiling Extensio SPE is an optional feature in ARMv8.2 hardware that allows CPU instructions to be sampled and associated with the source code location where that instruction occurred. {{% notice Note %}} -Currently SPE is available on Windows On Arm in Test Mode only! +SPE is only available on Windows on Arm in Test Mode. +Windows Test Mode is a feature that allows you to install and test drivers that have not been digitally signed by Microsoft. {{% /notice %}} ## Before you begin @@ -31,7 +32,7 @@ For this Learning Path you will need: * A Windows on Arm (ARM64) native machine with pre-installed WindowsPerf (both driver and `wperf` CLI tool). Refer to the [WindowsPerf Install Guide](/install-guides/wperf/) for more details. * Note: The [WindowsPerf release 3.8.0](https://github.com/arm-developer-tools/windowsperf/releases/tag/3.8.0) includes a separate build with Arm SPE (Statistical Profiling Extension) support enabled. To install this version download release asset and you will find WindowsPerf SPE build in the `SPE/` subdirectory. * [Visual Studio](/install-guides/vs-woa/) and [Git](/install-guides/git-woa/) installed. -* The CPU must support the Arm SPE extension, an optional feature in ARMv8.2 hardware. You can check your CPU compatibility using the WindowsPerf command-line tool (explained below). +* The CPU must support the Arm SPE extension, an optional feature in ARMv8.2 hardware. You can check your CPU compatibility using the WindowsPerf command-line tool as explained below. ### How do I check if my Arm CPU supports the Arm SPE extension? 
diff --git a/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe_example_1.md b/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe_example_1.md index 3a4762acd..390593d7b 100644 --- a/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe_example_1.md +++ b/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe_example_1.md @@ -4,33 +4,39 @@ title: WindowsPerf sample using SPE example weight: 3 --- -## Example 1: Sampling of CPython calculating Googolplex using SPE +## Example 1: Sample CPython using SPE -{{% notice Note %}} -All the steps in these following sections are done on a native ARM64 Windows on Arm machine. -{{% /notice %}} +You can use the [CPython](https://github.com/python/cpython) binary you built from source in debug mode to compute a large integer number called a [Googolplex](https://en.wikipedia.org/wiki/Googolplex). This is a good way to stress CPython to demonstrate profiling. + +The steps are: +- Pin the `python_d.exe` interactive console to an arbitrary CPU core and calculate `10^10^100`. +- Run counting and sampling to obtain event information. -You will use the pre-built [CPython](https://github.com/python/cpython) binaries targeting ARM64 from sources in the debug mode from the previous step and then complete the following: -- Pin `python_d.exe` interactive console to an arbitrary CPU core, calculate `10^10^100` expression, a large integer number [Googolplex](https://en.wikipedia.org/wiki/Googolplex) to stress the CPython application and get a simple workload. -- Run counting and sampling to obtain some simple event information. +### Pin CPython to CPU core 1 -### Pin the new CPython process to a CPU core 1 +You can use the Windows `start` command to execute and pin `python_d.exe` process to CPU core 1. 
-Use the Windows `start` command to execute and pin `python_d.exe` process to CPU core number 1. Below command is executing computation intensive calculations of `10^10^100`, a [Googolplex](https://en.wikipedia.org/wiki/Googolplex) number, with CPython. +Run the command below at a Windows Command Prompt to execute the computation intensive calculation: ```command start /affinity 2 cpython\PCbuild\arm64\python_d.exe -c 10**10**100 ``` {{% notice Note %}} -The [start](https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/start) command line switch `/affinity ` applies the specified processor affinity mask (expressed as a hexadecimal number) to the new application. In our example decimal `2` is `0x02` or `0b0010`. This value denotes core no. `1` as `1` is a first bit in the mask, where the mask is indexed from `0` (zero). +The [start](https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/start) command line `/affinity ` applies the specified processor affinity mask (expressed as a hexadecimal number). In this example decimal `2` is `0x02` or `0b0010`. This value denotes core number `1` as `1` is a first bit in the mask, where the mask is indexed from `0`. {{% /notice %}} -You can use the Windows Task Manager to confirm that `python_d.exe` is running on CPU core no. 1. +You can use the Windows Task Manager to confirm that `python_d.exe` is running on CPU core 1. + +### WindowsPerf introduces SPE filters + +You can specify SPE filters using the `-e` command line option with `arm_spe_0//`. + +The `arm_spe_0/*/` notation is available for the `sample` and `record` commands, where `*` represents a comma-separated list of supported filters. -### SPE introduces new option for command line switch -e arm_spe_0// +Currently, filters such as `store_filter=`, `load_filter=`, and `branch_filter=`, or their short equivalents like `st=`, `ld=`, and `b=`. Use `0` or `1` to disable or enable a given filter. 
-Users can specify SPE filters using the `-e` command line option with `arm_spe_0//`. We've introduced the `arm_spe_0/*/` notation for the `sample` and `record` command, where `*` represents a comma-separated list of supported filters. Currently, we support filters such as `store_filter=`, `load_filter=`, and `branch_filter=`, or their short equivalents like `st=`, `ld=`, and `b=`. Use `0` or `1` to disable or enable a given filter. For example: +Here are some filter examples: ```output arm_spe_0/branch_filter=1/ @@ -41,24 +47,31 @@ arm_spe_0/st=0,ld=0,b=1/ #### Filtering sample records -SPE register `PMSFCR_EL1.FT` enables filtering by operation type. When enabled `PMSFCR_EL1.{ST, LD, B}` define the collected types: +The SPE register `PMSFCR_EL1.FT` enables filtering by operation type. + +When enabled `PMSFCR_EL1.{ST, LD, B}` defines the collected types: + - `ST` enables collection of store sampled operations, including all atomic operations. - `LD` enables collection of load sampled operations, including atomic operations that return a value to a register. - `B` enables collection of branch sampled operations, including direct and indirect branches and exception returns. -### Sampling using SPE the CPython application running the Googolplex calculation on CPU core 1 +### Sample CPython using SPE + +The command below samples the running `python_d.exe` process. -Below command will sample already running process `python_d.exe` (denoted with `--image_name python_d.exe`) on CPU core no. 1. SPE filter `ld=1` enables collection of load sampled operations, including atomic operations that return a value to a register. +The SPE filter `ld=1` enables collection of load sampled operations, including atomic operations that return a value to a register. 
```command wperf sample -e arm_spe_0/ld=1/ --pe_file cpython\PCbuild\arm64\python_d.exe --image_name python_d.exe -c 1 ``` {{% notice Note%}} -You can use the same sampling `--annotate` and `--disassemble` command line interface of WindowsPerf with SPE extension. See example outputs below. +You can use the same sampling `--annotate` and `--disassemble` command line interface of WindowsPerf with SPE extension. This is shown in the example output below. {{% /notice %}} -Please wait a few seconds for the samples to arrive from the Kernel driver and then press `Ctrl+C` to stop sampling. You should see: +Please wait a few seconds for the samples to arrive from the Kernel driver and then press `Ctrl+C` to stop sampling. + +You see output similar to: ```output base address of 'python_d.exe': 0x7ff765fe1288, runtime delta: 0x7ff625fe0000 @@ -84,36 +97,44 @@ note: 'e' - normal event, 'gN' - grouped event with group number N, metric name 9.853 seconds time elapsed ``` -{{% notice Note%}} -You can close the command line window with `python_d.exe` running when you have finished sampling. Sampling will also automatically end when the sample process has finished. -{{% /notice %}} +You can close the command line window running `python_d.exe` when you have finished sampling. +Sampling will also automatically end when the sampled process exits. #### SPE sampling output -- In the above example, you can see that the majority of "overhead" is generated by `python_d.exe` executable resides inside the `python312_d.dll` DLL, in `x_mul` symbol. -- SPE sampling output contains also PMU events for SPE registered during sampling: - - `sample_pop` - Statistical Profiling sample population. Counts statistical profiling sample population, the count of all operations that could be sampled but may or may not be chosen for sampling. - - `sample_feed` - Statistical Profiling sample taken. Counts statistical profiling samples taken for sampling. 
- - `sample_filtrate` - Statistical Profiling sample taken and not removed by filtering. Counts statistical profiling samples taken which are not removed by filtering. - - `sample_collision` - Statistical Profiling sample collided with previous sample. Counts statistical profiling samples that have collided with a previous sample and so therefore not taken. -- Note that in sampling `....eee....e` is a progressing printout where: - - character `.` represents a SPE sample payload received from the WindowsPerf Kernel driver and - - character `e` represents an unsuccessful attempt (empty SPE fill buffer) to fetch the whole sample payload. +In the output above, you see that the majority of "overhead" generated by `python_d.exe` resides in the `python312_d.dll` DLL, in the `x_mul` symbol. + +SPE sampling output also contains PMU events for the SPE registered events. + +Here are some helpful definitions: + + - `sample_pop` - Counts statistical profiling sample population, the count of all operations that could be sampled but may or may not be chosen for sampling. + - `sample_feed` - Counts statistical profiling samples taken. + - `sample_filtrate` - Counts statistical profiling samples taken which are not removed by filtering. + - `sample_collision` - Counts statistical profiling samples that have collided with a previous sample and therefore not taken. + +During sampling the `....eee....e` output is a progressing printout where: + - each `.` character represents an SPE sample payload received from the WindowsPerf Kernel driver + - each `e` character represents an unsuccessful attempt (empty SPE fill buffer) to fetch the whole sample payload {{% notice Note%}} -You can also output `wperf sample` command in JSON format. Use the `--json` command line option to enable the JSON output. +You can also generate `wperf sample` output in JSON format. Use the `--json` command line option to enable the JSON output. 
Use the `-v` command line option `verbose` to add more information about sampling. {{% /notice %}} #### Example output with annotate enabled -Command line option `--annotate` enables translating addresses taken from samples in sample/record mode into source code line numbers. +The `--annotate` command line option enables translating addresses taken from samples in sample/record mode into source code line numbers. + +For example: ```console wperf sample -e arm_spe_0/ld=1/ --annotate --pe_file cpython\PCbuild\arm64\python_d.exe --image_name python_d.exe -c 1 ``` +The output is similar to: + ```output base address of 'python_d.exe': 0x7ff765fe1288, runtime delta: 0x7ff625fe0000 sampling ....ee.Ctrl-C received, quit counting...e done! @@ -142,18 +163,21 @@ x_mul:python312_d.dll 5.199 seconds time elapsed ``` -Note: Above SPE sampling pass recorded: -- function `x_mul:python312_d.dll`: - - in source file `C:\path\to\cpython\Objects\longobject.c`, line `3590` as a hot-spot for `load_filter` enabled. +The above SPE sampling pass records that the function `x_mul:python312_d.dll` +in source file `C:\path\to\cpython\Objects\longobject.c`, line `3590` is a hot-spot when `load_filter` is enabled. #### Example output with disassemble enabled -Command line option `--disassemble` enables disassemble output on sampling mode. Implies `--annotate`. +The `--disassemble` command line option enables disassembly output, and also implies `--annotate`. + +For example: ```console wperf sample -e arm_spe_0/ld=1/ --disassemble --pe_file cpython\PCbuild\arm64\python_d.exe --image_name python_d.exe -c 1 ``` +The output is similar to: + ```output base address of 'python_d.exe': 0x7ff765fe1288, runtime delta: 0x7ff625fe0000 sampling ......eCtrl-C received, quit counting... done!
@@ -207,9 +231,9 @@ v_isub:python312_d.dll 4.422 seconds time elapsed ``` -Note: Above SPE sampling pass recorded: -- function `x_mul:python312_d.dll`: - - in source file `C:\path\to\cpython\Objects\longobject.c`, line `3591`, instruction `ldr x9, [sp, #0x20]` at address `0x4043b4` as potential hot-spot. - - in source file `C:\path\to\cpython\Objects\longobject.c`, line `3589`, instruction `ldr x8, [sp, #0x58]` at address `0x404360` as potential hot-spot. -- Function `v_isub:python312_d.dll`: - - in source file `C:\path\to\cpython\Objects\longobject.c`, line `1603`, instruction `ldr w8, [sp, #0x10]` at address `0x402a60` as potential hot-spot. +The output above shows that the function `x_mul:python312_d.dll` is a hot spot which comes from the following source code lines: + - File `C:\path\to\cpython\Objects\longobject.c`, line `3591`, instruction `ldr x9, [sp, #0x20]` at address `0x4043b4` as potential hot-spot. + - File `C:\path\to\cpython\Objects\longobject.c`, line `3589`, instruction `ldr x8, [sp, #0x58]` at address `0x404360` as potential hot-spot. + +Another potential hot spot is in the function `v_isub:python312_d.dll` in the source file `C:\path\to\cpython\Objects\longobject.c`, line `1603`, instruction `ldr w8, [sp, #0x10]` at address `0x402a60`. 
+ \ No newline at end of file diff --git a/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe_example_2.md b/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe_example_2.md index 76c907585..2c27f10c5 100644 --- a/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe_example_2.md +++ b/content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe_example_2.md @@ -4,22 +4,23 @@ title: WindowsPerf record using SPE example weight: 4 --- -## Example 2: Using the `record` command to simplify things +## Example 2: Record CPython using SPE -- The `record` command spawns the process and pins it to the core specified by the `-c` option. -- A double-dash (`--`) is a syntax used in shell commands to signify end of command options and beginning of positional arguments. In other words, it separates `wperf` CLI options from arguments that the command operates on. Use `--` to separate `wperf.exe` command line options from the process you want to spawn followed by its verbatim arguments. +You can use the `record` command to spawn the Python process and pin it to the core specified by the `-c` option. + +A double-dash (`--`) syntax in shell commands signifies the end of command options and beginning of positional arguments. In other words, it separates the `wperf` CLI options from the arguments passed to the profiled program, `python_d.exe`. + +Run the `record` command with SPE to collect load events from SPE: ```console wperf record -e arm_spe_0/ld=1/ -c 1 --timeout 5 -- cpython\PCbuild\arm64\python_d.exe -c 10**10**100 ``` -{{% notice Note%}} -You can use the same sampling `--annotate` and `--disassemble` command line interface of WindowsPerf with SPE extension. -{{% /notice %}} +You can use the same `--annotate` and `--disassemble` command line arguments with the SPE extension.
The WindowsPerf `record` command is versatile, allowing you to start and stop the sampling process easily. It also simplifies the command line syntax, making it user-friendly and efficient. -Example 2 can be replaced by these two commands: +The example above can be replaced by these two commands: ```console start /affinity 2 cpython\PCbuild\arm64\python_d.exe -c 10**10**100 @@ -28,8 +29,10 @@ wperf sample -e arm_spe_0/ld=1/ --pe_file cpython\PCbuild\arm64\python_d.exe --i ## Summary -WindowsPerf is a versatile performance analysis tool that can support both software (with CPU PMU events) and hardware sampling (with SPE extension). The type of sampling it can perform depends on the availability of the Arm Statistical Profiling Extension (SPE) in the ARM64 CPU. If the Arm SPE extension is present, WindowsPerf can leverage hardware sampling to provide detailed performance insights. Otherwise, it will rely on software sampling to gather performance data. This flexibility ensures that WindowsPerf can adapt to different hardware configurations and still deliver valuable performance metrics. +WindowsPerf is a versatile performance analysis tool supporting both software (with CPU PMU events) and hardware sampling (with the SPE extension). + +The type of sampling it can perform depends on the availability of the Arm Statistical Profiling Extension (SPE) in the CPU. If the Arm SPE extension is present, WindowsPerf can leverage hardware sampling to provide detailed performance insights. Otherwise, it will rely on software sampling to gather performance data. This flexibility ensures that WindowsPerf can adapt to different hardware configurations and still deliver valuable performance metrics. -Use `wperf sample`, a sampling mode, for determining the frequencies of event occurrences produced by program locations at the function, basic block, and/or instruction levels. 
+Use `wperf sample`, the sampling mode, to determine the frequencies of event occurrences produced by program locations at the function, basic block, and/or instruction levels. -Use `wperf record`, same as sample but also automatically spawns the process and pins it to the core specified by `-c`. Process name is defined by COMMAND. User can pass verbatim arguments to the process. +Use `wperf record`, which works like `sample`, to also automatically spawn the process and pin it to the core specified by `-c`. You can use `record` to pass verbatim arguments to the process.
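
The affinity-mask arithmetic from the `start /affinity` note and the `arm_spe_0/*/` filter notation in `windowsperf_sampling_cpython_spe_example_1.md` can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of WindowsPerf or of the files this patch touches: the helper names `affinity_mask` and `spe_event` are hypothetical, and the sketch assumes a single target core and the short filter names `st=`, `ld=`, and `b=` described in the patch.

```python
def affinity_mask(core: int) -> str:
    """Return the hexadecimal mask string that selects exactly one core.

    Hypothetical helper: bit N of the mask selects core N (indexed from 0),
    so core 1 gives 0b0010, which is hexadecimal 2.
    """
    if core < 0:
        raise ValueError("core index must be non-negative")
    return format(1 << core, "X")


def spe_event(store: bool = False, load: bool = False, branch: bool = False) -> str:
    """Build an arm_spe_0 event string from the short filter names.

    Hypothetical helper: each filter is written as 0 (disabled) or 1 (enabled).
    """
    filters = f"st={int(store)},ld={int(load)},b={int(branch)}"
    return f"arm_spe_0/{filters}/"


if __name__ == "__main__":
    core = 1
    # Mask to pass to the Windows `start /affinity` switch for this core.
    print(f"start /affinity {affinity_mask(core)} ...")
    # Event string to pass to the `-e` option of `wperf sample` or `wperf record`.
    print(f"wperf sample -e {spe_event(load=True)} -c {core} ...")
```

Because `/affinity` expects a hexadecimal number, the helper formats the mask in base 16. For example, core 4 corresponds to the mask `10`, not the decimal value `16`.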