Commit

Merge pull request #704 from ArmDeveloperEcosystem/main
Merge to production
pareenaverma authored Jan 29, 2024
2 parents ff136e1 + 4720657 commit 78462e9
Showing 30 changed files with 815 additions and 83 deletions.
28 changes: 27 additions & 1 deletion .wordlist.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2006,4 +2006,30 @@ FFTs
fft
Linaro's
bytecode
MSC
vectorizable
autovectorizable
autovectorized
autovectorize
vectorizer
Reflowing
callee
inlining
Autovec
GGC
Schmitz
Vectorizers
accg
hpac
se
sem
ssa
umu
wchars
Kitwares
arduino
uv
uvmpw
libhugtlbfs
mcpu
NoLSE
17 changes: 9 additions & 8 deletions content/install-guides/browsers/_index.md
Expand Up @@ -36,14 +36,15 @@ The information below helps you:

Here is a quick summary to get you started:

| Browser | Windows on Arm | Arm Linux support |
| ----------- | -------------- | --------- |
| Firefox | native | yes |
| Chromium | native | yes |
| Brave | native | yes |
| Edge | native | no |
| Chrome Canary | native | no |
| Chrome Stable | emulation | no |
| Vivaldi | emulation | yes |

Windows on Arm runs native ARM64 applications, but can also emulate 32-bit x86 and 64-bit x64 applications. Emulation is slower than native and shortens battery life, but may provide functionality you need.

Expand Down
23 changes: 18 additions & 5 deletions content/install-guides/browsers/chrome.md
Expand Up @@ -25,19 +25,33 @@ layout: installtoolsall # DO NOT MODIFY. Always true for tool install ar

## Installing Chrome

The Chrome browser runs natively on Windows on Arm on the Canary release channel, and under emulation on the Stable release channel. Chrome is not available for Arm Linux.

### Linux

Chrome is not available for Arm Linux.

### Windows

#### Native

To install Chrome on Windows on Arm:

1. Go to the [download page](https://www.google.com/chrome/canary/?platform=win_arm64) and click the Download Chrome Canary button.

2. Run the downloaded `ChromeSetup.exe` file.

3. Find and start Chrome from the applications menu.

{{% notice Note %}}
The native Windows on Arm version of Chrome is currently on the Canary channel. This is an experimental version that is updated daily, but it is faster than emulation.
{{% /notice %}}

#### Emulation

If you prefer to use the Stable version, you can run Chrome using emulation.

Emulation is slower than native and shortens battery life.

1. Download the Windows installer from [Google Chrome](https://www.google.com/chrome/)

Expand All @@ -49,4 +63,3 @@ Chrome supports Google account sign-in, bookmark synchronization, and password m
The Chrome setup program installs the 32-bit x86 version of Chrome.
{{% /notice %}}


10 changes: 10 additions & 0 deletions content/install-guides/pdh/browser.md
Expand Up @@ -31,6 +31,16 @@ The portal uses [IBM Aspera](https://www.ibm.com/products/aspera) to enable high

It is also possible to download without the use of Aspera if you prefer, but this will be slower, especially for large downloads.

## Firewall issues

If you are unable to download your products, it may be due to firewall issues.

If the EULA fails to appear, check with your internal IT teams that you can access:
```url
na3.docusign.net
```
It may also be necessary to delete cookies from your browser.

## Updates

You will automatically be notified by the system when updates become available for any products that you have downloaded.
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
title: Learn about Autovectorization

draft: true

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ further_reading:
link: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/update-on-gnu-performance
type: blog
- resource:
title: Auto-Vectorization in LLVM
link: https://llvm.org/docs/Vectorizers.html
type: website
- resource:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -103,8 +103,8 @@ The reason for this is related to how each compiler decides whether to use autov

For each candidate loop the compiler will estimate the possible performance gains against a cost model, which is affected by many parameters and of course the optimization level in the compilation flags.

The cost model estimates whether the autovectorized code grows in size and whether the performance gains are enough to outweigh the increase in code size. Based on this estimation, the compiler will decide to use vectorized code or fall back to a 'safer' scalar implementation. This decision, however, is fluid and is constantly reevaluated during compiler development.

Compiler cost model analysis is beyond the scope of this Learning Path, but the example above demonstrates how autovectorization can be triggered by a flag.

You will see some more advanced examples in the next sections.
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ In the previous section, you learned that compilers cannot autovectorize loops w

In this section, you will see more examples of loops with branches.

You will learn when it is possible to enable the vectorizer in the compiler by adapting the loop, and when you are required to modify the algorithm or write manually optimized code.

### Loops with if/else/switch statements

Expand Down Expand Up @@ -48,9 +48,9 @@ void addvecweight(float *restrict C, float *A, float *B, size_t N) {

These are two different loops that the compiler can vectorize.

Both GCC and Clang can autovectorize this loop, but the output is slightly different, and performance may vary depending on the flags used and the exact nature of the loop.

However, the loop below is autovectorized by Clang but not by GCC.

```C
void addvecweight2(float *restrict C, float *A, float *B,
Expand Down Expand Up @@ -111,4 +111,4 @@ void addvecweight(float *restrict C, float *A, float *B,
The cases you have seen so far are generic; they work the same for any architecture.
In the next section, you will see Arm-specific cases for autovectorization.
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ for (size_t i=0; i < N; i++) {
}
```

This loop, however, is not countable and cannot be vectorized:

```C
i = 0;
Expand All @@ -46,7 +46,7 @@ while(1) {
}
```

This loop, however, is not vectorizable:

```C
i = 0;
Expand All @@ -59,17 +59,17 @@ while(1) {

#### No function calls inside the loop

If `f()` and `g()` are functions that take `float` arguments, this loop cannot be autovectorized:

```C
for (size_t i=0; i < N; i++) {
C[i] = f(A[i]) + g(B[i]);
}
```

There is a special case for the math library's trigonometry and transcendental functions (such as `sin`, `cos`, and `exp`). There is work underway to enable these functions to be autovectorized, as the compiler will use their vectorized counterparts in the `mathvec` library (`libmvec`).

The loop below is *already autovectorized* in current gcc trunk for Arm (note that you must add `-Ofast` to the compilation flags to enable autovectorization):

```C
void addfunc(float *restrict C, float *A, float *B, size_t N) {
Expand All @@ -79,7 +79,7 @@ void addfunc(float *restrict C, float *A, float *B, size_t N) {
}
```
This feature will be in gcc 14 and requires a new glibc version 2.39 as well. Until then, if you are using a released compiler as part of a Linux distribution (such as gcc 13.2), you will need to manually vectorize such code for performance.
There is more about autovectorization of conditionals in the next section.
Expand All @@ -105,11 +105,11 @@ for (size_t i=0; i < N; i++) {

In this case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).

There are some cases where outer loop types are autovectorized, but these are not covered in this Learning Path.
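For illustration, a typical nested loop of this kind might be a matrix addition like the following sketch (the function name and layout are assumptions, not code from the Learning Path); only the inner loop over `j` is a candidate for vectorization:

```C
#include <stddef.h>

/* Illustrative sketch: add two N x M matrices stored row-major.
   The inner loop over j is countable, branch-free, and has no
   inter-iteration dependency, so it can be autovectorized. */
void addmat(float *restrict C, const float *A, const float *B,
            size_t N, size_t M) {
    for (size_t i = 0; i < N; i++) {       /* outer loop: not vectorized */
        for (size_t j = 0; j < M; j++) {   /* inner loop: vectorizable */
            C[i * M + j] = A[i * M + j] + B[i * M + j];
        }
    }
}
```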

#### No data inter-dependency between iterations

This means that each iteration depends on the result of the previous iteration. This example is difficult, but not impossible, to autovectorize.
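A classic example of such a loop-carried dependency is a prefix sum (an illustrative sketch; the function name is an assumption):

```C
#include <stddef.h>

/* Illustrative sketch of a loop-carried dependency: a prefix sum.
   Each iteration reads the accumulator written by the previous one,
   so the compiler cannot naively process several iterations at once. */
void prefix_sum(int *restrict out, const int *in, size_t N) {
    int acc = 0;
    for (size_t i = 0; i < N; i++) {
        acc += in[i];   /* depends on the previous iteration's acc */
        out[i] = acc;
    }
}
```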

The loop below cannot be autovectorized as it is.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ dotprod:
ret
```

You can see that it's a fairly standard implementation, processing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dotprod()` into `main()` for performance reasons. In general, this is a good thing, but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.

Next, increase the optimization level to `-O3`, recompile, and observe the assembly output again:

Expand Down Expand Up @@ -135,7 +135,7 @@ dotprod:
b .L3
```

The code is larger, but you can see that some autovectorization has taken place.

The label `.L4` includes the main loop and you can see that the `mla` instruction is used to multiply and accumulate the dot products, 4 elements at a time.

Expand All @@ -145,7 +145,7 @@ With the new code, you can expect a performance gain of about 4x.

You might be wondering if there is a way to hint to the compiler that the sizes are always going to be multiples of 4 and avoid the last part of the code.

The answer is *yes*, but it depends on the compiler. In the case of gcc, it is enough to add a statement that ensures the sizes are multiples of 4.

Modify the `dotprod()` function to add the multiples of 4 hint as shown below:

Expand All @@ -160,7 +160,7 @@ int32_t dotprod(int32_t *A, int32_t *B, size_t N) {
}
```
Compile again with `-O3`:
```bash
gcc -O3 -fno-inline dotprod.c -o dotprod
Expand Down Expand Up @@ -195,7 +195,7 @@ Is there anything else the compiler can do?

Modern compilers are very proficient at generating code that utilizes all available instructions, provided they have the right information.

For example, the `dotprod()` function operates on `int32_t` elements. What if you could limit the range to 8-bit?

There is an Armv8 ISA extension that [provides signed and unsigned dot product instructions](https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-) to perform a dot product across 8-bit elements of 2 vectors and store the results in the 32-bit elements of the resulting vector.
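As a sketch, an 8-bit variant of the dot product might look like this (the function name and the unsigned element type are assumptions; with `-march=armv8-a+dotprod` the compiler can map the widening multiply-accumulate onto the dot product instructions):

```C
#include <stddef.h>
#include <stdint.h>

/* Illustrative 8-bit dot product: the elements are 8-bit but the
   accumulator is 32-bit, matching the widening behavior of the
   SDOT/UDOT instructions described above. */
int32_t dotprod_u8(const uint8_t *A, const uint8_t *B, size_t N) {
    int32_t result = 0;
    for (size_t i = 0; i < N; i++) {
        result += (int32_t)A[i] * (int32_t)B[i];
    }
    return result;
}
```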

Expand Down Expand Up @@ -237,7 +237,7 @@ gcc -O3 -fno-inline -march=armv8-a+dotprod dotprod.c -o dotprod

You need to compile with the architecture flag to use the dot product instructions.

The assembly output will be quite a bit larger, as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.

You can eliminate the extra tail instructions by converting `N -= N % 4` to 8 or even 16 as shown below:
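The widened hint might look like the following sketch (this mirrors the earlier `dotprod()` listing rather than reproducing it exactly):

```C
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: widening the multiple-of hint from 4 to 16 tells
   the compiler it never needs a scalar tail for the last few elements. */
int32_t dotprod(int32_t *A, int32_t *B, size_t N) {
    int32_t result = 0;
    N -= N % 16;   /* hint: treat N as a multiple of 16 */
    for (size_t i = 0; i < N; i++) {
        result += A[i] * B[i];
    }
    return result;
}
```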

Expand Down
Expand Up @@ -8,7 +8,7 @@ layout: learningpathall

The previous example using the `SDOT`/`UDOT` instructions is only one of the Arm-specific optimizations possible.

While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example.

Below is a very simple loop calculating what is known as a Sum of Absolute Differences (SAD). Such code is very common in video codecs, and is used to calculate differences between video frames.

Expand Down Expand Up @@ -37,9 +37,9 @@ int main() {
}
```
A hint to the compiler was added that the size is a multiple of 16, to avoid generating cases for smaller lengths. *This is for demonstration purposes only*.
Save the code above to a file named `sadtest.c` and compile it:
```bash
gcc -O3 -fno-inline sadtest.c -o sadtest
Expand Down Expand Up @@ -71,11 +71,11 @@ sad8:
ret
```

You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).

The accumulator variable is not 8-bit but 32-bit, so the typical SIMD implementation, involving 16 x 8-bit subtractions, then 16 absolute values and 16 additions, would not do; a widening conversion to 32-bit has to take place before the accumulation.

This would mean accumulating 4 items at a time, but with the use of these instructions the performance gain can be up to 16x over the original scalar code, or about 4x over the typical SIMD implementation.
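As a reference point, the scalar SAD computation discussed above can be sketched as follows (the function name and the widening `int32_t` accumulator mirror the description; the exact listing in the Learning Path may differ):

```C
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative scalar Sum of Absolute Differences over 8-bit elements,
   accumulated into a 32-bit result as described above. */
int32_t sad8_scalar(const uint8_t *A, const uint8_t *B, size_t N) {
    int32_t sum = 0;
    for (size_t i = 0; i < N; i++) {
        sum += abs((int)A[i] - (int)B[i]);   /* widen before accumulating */
    }
    return sum;
}
```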

For completeness, the SVE2 version is provided below; it does not depend on the size being a multiple of 16.

Expand Down Expand Up @@ -126,7 +126,7 @@ You might ask why you should learn about autovectorization if you need to have s

Autovectorization is a tool. The goal is to minimize the effort required by developers and to maximize performance, while keeping maintenance low in terms of code size.

It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization for all platforms than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.

As with most tools, the better you know how to use it, the better the results will be.

