Merge pull request #1441 from jasonrandrews/review
Complete review of WindowsPerf with SPE
jasonrandrews authored Dec 13, 2024
2 parents f6f5330 + 0b60dcc commit db0c5a0
Showing 5 changed files with 89 additions and 61 deletions.
@@ -1,5 +1,5 @@
---
title: Sampling CPython with Arm SPE with WindowsPerf
title: Sampling CPython with WindowsPerf and Arm SPE
draft: true
cascade:
draft: true
@@ -16,7 +16,7 @@ learning_objectives:

prerequisites:
- Windows on Arm desktop or development machine with [WindowsPerf](/install-guides/wperf), [Visual Studio](/install-guides/vs-woa/), and [Git](/install-guides/git-woa/) installed.
- The system must also have an Arm CPU with SPE support.
- The Windows on Arm system must have an Arm CPU with SPE support.

author_primary: Przemyslaw Wirkus

@@ -32,7 +32,7 @@ review:
- questions:
question: >
WindowsPerf can be used and executed only on native ARM64 WOA hardware, and not in a virtual environment.
WindowsPerf can be used and executed only on native Windows on Arm hardware, and not in a virtual environment.
answers:
- "True"
- "False"
@@ -62,7 +62,7 @@ review:
- questions:
question: >
Is load_filter is one of SPE filters supported by WindowsPerf?
load_filter is one of the SPE filters supported by WindowsPerf.
answers:
- "True"
- "False"
@@ -72,7 +72,7 @@ review:
- questions:
question: >
Is store_filter is one of SPE filters supported by WindowsPerf?
store_filter is one of the SPE filters supported by WindowsPerf.
answers:
- "True"
- "False"
@@ -82,7 +82,7 @@ review:
- questions:
question: >
Is branch_filter is one of SPE filters supported by WindowsPerf?
branch_filter is one of the SPE filters supported by WindowsPerf.
answers:
- "True"
- "False"
@@ -4,7 +4,7 @@ title: An overview of CPython sampling with SPE
weight: 2
---

In this example, you will build a debug build of CPython from sources and execute simple instructions in the Python interactive mode to obtain WindowsPerf sampling results from the CPython runtime image.
In this example, you will build a debug build of CPython from source and execute simple instructions in the Python interactive mode to obtain WindowsPerf sampling results from the CPython runtime image.

## Introduction to the Arm Statistical Profiling Extension (SPE)

@@ -21,7 +21,8 @@ WindowsPerf includes `record` support for the Arm Statistical Profiling Extensio
SPE is an optional feature in ARMv8.2 hardware that allows CPU instructions to be sampled and associated with the source code location where that instruction occurred.

{{% notice Note %}}
Currently SPE is available on Windows On Arm in Test Mode only!
SPE is only available on Windows on Arm in Test Mode.
Windows Test Mode is a feature that allows you to install and test drivers that have not been digitally signed by Microsoft.
{{% /notice %}}

## Before you begin
@@ -31,7 +32,7 @@ For this Learning Path you will need:
* A Windows on Arm (ARM64) native machine with pre-installed WindowsPerf (both driver and `wperf` CLI tool). Refer to the [WindowsPerf Install Guide](/install-guides/wperf/) for more details.
* Note: The [WindowsPerf release 3.8.0](https://github.com/arm-developer-tools/windowsperf/releases/tag/3.8.0) includes a separate build with Arm SPE (Statistical Profiling Extension) support enabled. To install this version, download the release asset. The WindowsPerf SPE build is in the `SPE/` subdirectory.
* [Visual Studio](/install-guides/vs-woa/) and [Git](/install-guides/git-woa/) installed.
* The CPU must support the Arm SPE extension, an optional feature in ARMv8.2 hardware. You can check your CPU compatibility using the WindowsPerf command-line tool (explained below).
* The CPU must support the Arm SPE extension, an optional feature in ARMv8.2 hardware. You can check your CPU compatibility using the WindowsPerf command-line tool as explained below.

### How do I check if my Arm CPU supports the Arm SPE extension?

@@ -4,33 +4,39 @@ title: WindowsPerf sample using SPE example
weight: 3
---

## Example 1: Sampling of CPython calculating Googolplex using SPE
## Example 1: Sample CPython using SPE

{{% notice Note %}}
All the steps in these following sections are done on a native ARM64 Windows on Arm machine.
{{% /notice %}}
You can use the [CPython](https://github.com/python/cpython) binary you built from source in debug mode to compute a large integer number called a [Googolplex](https://en.wikipedia.org/wiki/Googolplex). This is a good way to stress CPython to demonstrate profiling.

The steps are:
- Pin the `python_d.exe` interactive console to an arbitrary CPU core and calculate `10^10^100`.
- Run counting and sampling to obtain event information.

You will use the pre-built [CPython](https://github.com/python/cpython) binaries targeting ARM64 from sources in the debug mode from the previous step and then complete the following:
- Pin `python_d.exe` interactive console to an arbitrary CPU core, calculate `10^10^100` expression, a large integer number [Googolplex](https://en.wikipedia.org/wiki/Googolplex) to stress the CPython application and get a simple workload.
- Run counting and sampling to obtain some simple event information.
### Pin CPython to CPU core 1

### Pin the new CPython process to a CPU core 1
You can use the Windows `start` command to launch the `python_d.exe` process and pin it to CPU core 1.

Use the Windows `start` command to execute and pin `python_d.exe` process to CPU core number 1. Below command is executing computation intensive calculations of `10^10^100`, a [Googolplex](https://en.wikipedia.org/wiki/Googolplex) number, with CPython.
Run the command below at a Windows Command Prompt to execute the computation-intensive calculation:

```command
start /affinity 2 cpython\PCbuild\arm64\python_d.exe -c 10**10**100
```
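The following is a minimal Python sketch of why this expression keeps CPython busy. It only assumes standard Python operator rules: `**` groups from right to left, so `10**10**100` means `10**(10**100)`. The variable names are illustrative, and the sketch deliberately uses small numbers rather than evaluating the Googolplex itself.

```python
# Exponentiation in Python groups from right to left,
# so a ** b ** c is evaluated as a ** (b ** c).
small_tower = 2 ** 3 ** 2
left_grouped = (2 ** 3) ** 2

print(small_tower)   # 2 ** (3 ** 2) = 2 ** 9
print(left_grouped)  # (2 ** 3) ** 2 = 8 ** 2

# 10 ** 100 is a googol. It has 101 decimal digits and is cheap to compute.
googol = 10 ** 100
print(len(str(googol)))

# 10 ** googol is the Googolplex. It is not evaluated here: the result
# would need about a googol digits, so the calculation does not complete
# on real hardware. That is what makes it a long-running workload.
```

Because the calculation never finishes in practice, the process stays busy until you close it, which leaves plenty of time to attach the profiler.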

{{% notice Note %}}
The [start](https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/start) command line switch `/affinity <hexaffinity>` applies the specified processor affinity mask (expressed as a hexadecimal number) to the new application. In our example decimal `2` is `0x02` or `0b0010`. This value denotes core no. `1` as `1` is a first bit in the mask, where the mask is indexed from `0` (zero).
The [start](https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/start) command line switch `/affinity <hexaffinity>` applies the specified processor affinity mask (expressed as a hexadecimal number) to the new process. In this example, `2` is `0x02` or `0b0010`. This value selects core `1` because bit `1` is the only bit set in the mask, where bits are indexed from `0`.
{{% /notice %}}
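As an illustration of the mask arithmetic described in the note, here is a small Python sketch. The helper name `affinity_mask` is hypothetical and is not part of WindowsPerf or Windows. It sets one bit per zero-indexed core and prints the value in hexadecimal, which is the form `/affinity` expects.

```python
def affinity_mask(cores):
    """Return the hexadecimal affinity mask string for zero-indexed core numbers."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers are zero-indexed and non-negative")
        mask |= 1 << core  # set the bit that corresponds to this core
    return format(mask, "X")


print(affinity_mask([1]))     # core 1 only
print(affinity_mask([0, 1]))  # cores 0 and 1
print(affinity_mask([4]))     # core 4 only; note the hexadecimal result
```

For core 1 the result is `2`, which matches the `start /affinity 2` command used in this example.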

You can use the Windows Task Manager to confirm that `python_d.exe` is running on CPU core no. 1.
You can use the Windows Task Manager to confirm that `python_d.exe` is running on CPU core 1.

### WindowsPerf introduces SPE filters

You can specify SPE filters using the `-e` command line option with `arm_spe_0//`.

The `arm_spe_0/*/` notation is available for the `sample` and `record` commands, where `*` represents a comma-separated list of supported filters.

### SPE introduces new option for command line switch -e arm_spe_0//
Currently, the supported filters are `store_filter=`, `load_filter=`, and `branch_filter=`, along with their short equivalents `st=`, `ld=`, and `b=`. Use `0` or `1` to disable or enable a given filter.

Users can specify SPE filters using the `-e` command line option with `arm_spe_0//`. We've introduced the `arm_spe_0/*/` notation for the `sample` and `record` command, where `*` represents a comma-separated list of supported filters. Currently, we support filters such as `store_filter=`, `load_filter=`, and `branch_filter=`, or their short equivalents like `st=`, `ld=`, and `b=`. Use `0` or `1` to disable or enable a given filter. For example:
Here are some filter examples:

```output
arm_spe_0/branch_filter=1/
@@ -41,24 +47,31 @@ arm_spe_0/st=0,ld=0,b=1/

#### Filtering sample records

SPE register `PMSFCR_EL1.FT` enables filtering by operation type. When enabled `PMSFCR_EL1.{ST, LD, B}` define the collected types:
The SPE register `PMSFCR_EL1.FT` enables filtering by operation type.

When enabled, `PMSFCR_EL1.{ST, LD, B}` define the collected types:

- `ST` enables collection of store sampled operations, including all atomic operations.
- `LD` enables collection of load sampled operations, including atomic operations that return a value to a register.
- `B` enables collection of branch sampled operations, including direct and indirect branches and exception returns.
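To make the `arm_spe_0/*/` notation concrete, the sketch below assembles an event string from filter settings. The function `spe_event` and the `SHORT_NAMES` table are hypothetical helpers written for this illustration. Only the filter names and the `0`/`1` values come from the text above.

```python
# Long filter names and their short equivalents, as listed in the text.
SHORT_NAMES = {"store_filter": "st", "load_filter": "ld", "branch_filter": "b"}


def spe_event(filters, short=False):
    """Build an arm_spe_0/.../ event string from {filter_name: enabled} pairs."""
    parts = []
    for name, enabled in filters.items():
        if name not in SHORT_NAMES:
            raise ValueError(f"unsupported SPE filter: {name}")
        key = SHORT_NAMES[name] if short else name
        parts.append(f"{key}={1 if enabled else 0}")
    return "arm_spe_0/" + ",".join(parts) + "/"


print(spe_event({"branch_filter": True}))
print(spe_event({"load_filter": True}, short=True))
print(spe_event(
    {"store_filter": False, "load_filter": False, "branch_filter": True},
    short=True,
))
```

The resulting string is what you pass to the `-e` option of `wperf sample` or `wperf record`.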

### Sampling using SPE the CPython application running the Googolplex calculation on CPU core 1
### Sample CPython using SPE

The command below samples the running `python_d.exe` process.

Below command will sample already running process `python_d.exe` (denoted with `--image_name python_d.exe`) on CPU core no. 1. SPE filter `ld=1` enables collection of load sampled operations, including atomic operations that return a value to a register.
The SPE filter `ld=1` enables collection of load sampled operations, including atomic operations that return a value to a register.

```command
wperf sample -e arm_spe_0/ld=1/ --pe_file cpython\PCbuild\arm64\python_d.exe --image_name python_d.exe -c 1
```

{{% notice Note%}}
You can use the same sampling `--annotate` and `--disassemble` command line interface of WindowsPerf with SPE extension. See example outputs below.
You can use the same `--annotate` and `--disassemble` command line options with SPE sampling as with regular WindowsPerf sampling. Example outputs appear later in this section.
{{% /notice %}}

Please wait a few seconds for the samples to arrive from the Kernel driver and then press `Ctrl+C` to stop sampling. You should see:
Wait a few seconds for the samples to arrive from the kernel driver and then press `Ctrl+C` to stop sampling.

The output is similar to:

```output
base address of 'python_d.exe': 0x7ff765fe1288, runtime delta: 0x7ff625fe0000
@@ -84,36 +97,44 @@ note: 'e' - normal event, 'gN' - grouped event with group number N, metric name
9.853 seconds time elapsed
```

{{% notice Note%}}
You can close the command line window with `python_d.exe` running when you have finished sampling. Sampling will also automatically end when the sample process has finished.
{{% /notice %}}
You can close the command line window running `python_d.exe` when you have finished sampling.

Sampling will also automatically end when the sampled process exits.

#### SPE sampling output

- In the above example, you can see that the majority of "overhead" is generated by `python_d.exe` executable resides inside the `python312_d.dll` DLL, in `x_mul` symbol.
- SPE sampling output contains also PMU events for SPE registered during sampling:
- `sample_pop` - Statistical Profiling sample population. Counts statistical profiling sample population, the count of all operations that could be sampled but may or may not be chosen for sampling.
- `sample_feed` - Statistical Profiling sample taken. Counts statistical profiling samples taken for sampling.
- `sample_filtrate` - Statistical Profiling sample taken and not removed by filtering. Counts statistical profiling samples taken which are not removed by filtering.
- `sample_collision` - Statistical Profiling sample collided with previous sample. Counts statistical profiling samples that have collided with a previous sample and so therefore not taken.
- Note that in sampling `....eee....e` is a progressing printout where:
- character `.` represents a SPE sample payload received from the WindowsPerf Kernel driver and
- character `e` represents an unsuccessful attempt (empty SPE fill buffer) to fetch the whole sample payload.
In the output above, you see that the majority of "overhead" generated by `python_d.exe` resides in the `python312_d.dll` DLL, in the `x_mul` symbol.

The SPE sampling output also contains the SPE-related PMU events counted during sampling.

Here are some helpful definitions:

- `sample_pop` - Counts the statistical profiling sample population: all operations that could be sampled, whether or not they are chosen for sampling.
- `sample_feed` - Counts statistical profiling samples taken.
- `sample_filtrate` - Counts statistical profiling samples taken which are not removed by filtering.
- `sample_collision` - Counts statistical profiling samples that collided with a previous sample and were therefore not taken.
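One way to read these counters together is as a funnel from population to retained samples. The sketch below is an illustration only: the function name, the ratio names, and the counter values are hypothetical, and the ratios are simple derived figures rather than metrics reported by WindowsPerf.

```python
def spe_funnel(sample_pop, sample_feed, sample_filtrate):
    """Derive simple ratios from SPE PMU event counts (illustrative only)."""

    def ratio(numerator, denominator):
        # Avoid division by zero when a counter did not advance.
        return numerator / denominator if denominator else 0.0

    return {
        # Share of sampleable operations that were actually sampled.
        "taken_per_population": ratio(sample_feed, sample_pop),
        # Share of taken samples that survived the SPE filters.
        "kept_after_filtering": ratio(sample_filtrate, sample_feed),
    }


# Hypothetical counter values, not taken from a real run.
ratios = spe_funnel(sample_pop=1_000_000, sample_feed=1_000, sample_filtrate=250)
print(ratios)
```

A low `kept_after_filtering` value under these assumptions would suggest that the enabled filter, such as `ld=1`, matches only a small share of the sampled operations.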

During sampling, the `....eee....e` output is a progress indicator where:
- each `.` character represents an SPE sample payload received from the WindowsPerf kernel driver
- each `e` character represents an unsuccessful attempt (empty SPE fill buffer) to fetch the whole sample payload
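The progress characters can be tallied with a few lines of Python. This is a sketch with a hypothetical helper name, and it assumes you pass in only the run of `.` and `e` characters, not the surrounding words of the `sampling ... done!` line.

```python
def summarize_progress(progress):
    """Count received payloads ('.') and empty fetch attempts ('e')."""
    unexpected = set(progress) - {".", "e"}
    if unexpected:
        # Guard against passing the whole console line, which contains
        # ordinary words that also include the letter 'e'.
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return {
        "payloads_received": progress.count("."),
        "empty_fetches": progress.count("e"),
    }


summary = summarize_progress("....eee....e")
print(summary)
```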

{{% notice Note%}}
You can also output `wperf sample` command in JSON format. Use the `--json` command line option to enable the JSON output.
You can also generate `wperf sample` output in JSON format. Use the `--json` command line option to enable the JSON output.
Use the `-v` (verbose) command line option to add more information about sampling.
{{% /notice %}}

#### Example output with annotate enabled

Command line option `--annotate` enables translating addresses taken from samples in sample/record mode into source code line numbers.
The `--annotate` command line option enables translating addresses taken from samples in sample/record mode into source code line numbers.

For example:

```console
wperf sample -e arm_spe_0/ld=1/ --annotate --pe_file cpython\PCbuild\arm64\python_d.exe --image_name python_d.exe -c 1
```

The output is similar to:

```output
base address of 'python_d.exe': 0x7ff765fe1288, runtime delta: 0x7ff625fe0000
sampling ....ee.Ctrl-C received, quit counting...e done!
@@ -142,18 +163,21 @@ x_mul:python312_d.dll
5.199 seconds time elapsed
```

Note: Above SPE sampling pass recorded:
- function `x_mul:python312_d.dll`:
- in source file `C:\path\to\cpython\Objects\longobject.c`, line `3590` as a hot-spot for `load_filter` enabled.
The SPE sampling pass above shows that the function `x_mul:python312_d.dll`, in source file `C:\path\to\cpython\Objects\longobject.c` at line `3590`, is a hot spot for load operations with `load_filter` enabled.

#### Example output with disassemble enabled

Command line option `--disassemble` enables disassemble output on sampling mode. Implies `--annotate`.
The `--disassemble` command line option enables disassembly output, and also implies `--annotate`.

For example:

```console
wperf sample -e arm_spe_0/ld=1/ --disassemble --pe_file cpython\PCbuild\arm64\python_d.exe --image_name python_d.exe -c 1
```

The output is similar to:

```output
base address of 'python_d.exe': 0x7ff765fe1288, runtime delta: 0x7ff625fe0000
sampling ......eCtrl-C received, quit counting... done!
@@ -207,9 +231,9 @@ v_isub:python312_d.dll
4.422 seconds time elapsed
```

Note: Above SPE sampling pass recorded:
- function `x_mul:python312_d.dll`:
- in source file `C:\path\to\cpython\Objects\longobject.c`, line `3591`, instruction `ldr x9, [sp, #0x20]` at address `0x4043b4` as potential hot-spot.
- in source file `C:\path\to\cpython\Objects\longobject.c`, line `3589`, instruction `ldr x8, [sp, #0x58]` at address `0x404360` as potential hot-spot.
- Function `v_isub:python312_d.dll`:
- in source file `C:\path\to\cpython\Objects\longobject.c`, line `1603`, instruction `ldr w8, [sp, #0x10]` at address `0x402a60` as potential hot-spot.
The output above shows that the function `x_mul:python312_d.dll` contains potential hot spots at the following locations:
- File `C:\path\to\cpython\Objects\longobject.c`, line `3591`, instruction `ldr x9, [sp, #0x20]` at address `0x4043b4`.
- File `C:\path\to\cpython\Objects\longobject.c`, line `3589`, instruction `ldr x8, [sp, #0x58]` at address `0x404360`.

Another potential hot spot is in the function `v_isub:python312_d.dll` in the source file `C:\path\to\cpython\Objects\longobject.c`, line `1603`, instruction `ldr w8, [sp, #0x10]` at address `0x402a60`.

@@ -4,22 +4,23 @@ title: WindowsPerf record using SPE example
weight: 4
---

## Example 2: Using the `record` command to simplify things
## Example 2: Record CPython using SPE

- The `record` command spawns the process and pins it to the core specified by the `-c` option.
- A double-dash (`--`) is a syntax used in shell commands to signify end of command options and beginning of positional arguments. In other words, it separates `wperf` CLI options from arguments that the command operates on. Use `--` to separate `wperf.exe` command line options from the process you want to spawn followed by its verbatim arguments.
You can use the `record` command to spawn the Python process and pin it to the core specified by the `-c` option.

In shell commands, a double dash (`--`) signifies the end of command options and the beginning of positional arguments. Here, it separates the `wperf` CLI options from the program to spawn, `python_d.exe`, and the arguments passed verbatim to that program.

Run the `record` command with SPE to collect load events from SPE:

```console
wperf record -e arm_spe_0/ld=1/ -c 1 --timeout 5 -- cpython\PCbuild\arm64\python_d.exe -c 10**10**100
```
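The role of `--` in the command above can be sketched in Python. This is an illustration of the convention, not how `wperf` parses its command line, and `split_at_double_dash` is a hypothetical helper. Note that `-c` appears twice with different meanings: before `--` it selects the CPU core for `wperf`, and after `--` it is the Python option that runs an expression.

```python
def split_at_double_dash(argv):
    """Split an argument list into tool options and the spawned command."""
    if "--" not in argv:
        return argv, []
    index = argv.index("--")  # the first "--" ends the tool's own options
    return argv[:index], argv[index + 1:]


argv = [
    "wperf", "record", "-e", "arm_spe_0/ld=1/", "-c", "1", "--timeout", "5",
    "--",
    r"cpython\PCbuild\arm64\python_d.exe", "-c", "10**10**100",
]

tool_options, command = split_at_double_dash(argv)
print(tool_options)  # everything wperf interprets itself
print(command)       # the process to spawn and its verbatim arguments
```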

{{% notice Note%}}
You can use the same sampling `--annotate` and `--disassemble` command line interface of WindowsPerf with SPE extension.
{{% /notice %}}
You can use the same `--annotate` and `--disassemble` command line options with the SPE extension.

The WindowsPerf `record` command is versatile, allowing you to start and stop the sampling process easily. It also simplifies the command line syntax, making it user-friendly and efficient.

Example 2 can be replaced by these two commands:
The example above can be replaced by these two commands:

```console
start /affinity 2 cpython\PCbuild\arm64\python_d.exe -c 10**10**100
@@ -28,8 +29,10 @@ wperf sample -e arm_spe_0/ld=1/ --pe_file cpython\PCbuild\arm64\python_d.exe --i

## Summary

WindowsPerf is a versatile performance analysis tool that can support both software (with CPU PMU events) and hardware sampling (with SPE extension). The type of sampling it can perform depends on the availability of the Arm Statistical Profiling Extension (SPE) in the ARM64 CPU. If the Arm SPE extension is present, WindowsPerf can leverage hardware sampling to provide detailed performance insights. Otherwise, it will rely on software sampling to gather performance data. This flexibility ensures that WindowsPerf can adapt to different hardware configurations and still deliver valuable performance metrics.
WindowsPerf is a versatile performance analysis tool supporting both software (with CPU PMU events) and hardware sampling (with the SPE extension).

The type of sampling it can perform depends on the availability of the Arm Statistical Profiling Extension (SPE) in the CPU. If the Arm SPE extension is present, WindowsPerf can leverage hardware sampling to provide detailed performance insights. Otherwise, it will rely on software sampling to gather performance data. This flexibility ensures that WindowsPerf can adapt to different hardware configurations and still deliver valuable performance metrics.

Use `wperf sample`, a sampling mode, for determining the frequencies of event occurrences produced by program locations at the function, basic block, and/or instruction levels.
Use `wperf sample`, the sampling mode, to determine the frequencies of event occurrences produced by program locations at the function, basic block, and/or instruction levels.

Use `wperf record`, same as sample but also automatically spawns the process and pins it to the core specified by `-c`. Process name is defined by COMMAND. User can pass verbatim arguments to the process.
Use `wperf record`, which works like `sample` but also automatically spawns the process and pins it to the core specified by `-c`. You can use `record` to pass verbatim arguments to the process.
