Commit

Merge pull request #704 from ArmDeveloperEcosystem/main
Merge to production
pareenaverma authored Jan 29, 2024
2 parents ff136e1 + 4720657 commit 78462e9
Showing 30 changed files with 815 additions and 83 deletions.
28 changes: 27 additions & 1 deletion .wordlist.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2006,4 +2006,30 @@ FFTs
fft
Linaro's
bytecode
MSC
vectorizable
autovectorizable
autovectorized
autovectorize
vectorizer
Reflowing
callee
inlining
Autovec
GGC
Schmitz
Vectorizers
accg
hpac
se
sem
ssa
umu
wchars
Kitwares
arduino
uv
uvmpw
libhugtlbfs
mcpu
NoLSE
17 changes: 9 additions & 8 deletions content/install-guides/browsers/_index.md
Expand Up @@ -36,14 +36,15 @@ The information below helps you:

Here is a quick summary to get you started:

| Browser | Windows on Arm | Arm Linux support |
| ----------- | -------------- | --------- |
| Firefox | native | yes |
| Chromium | native | yes |
| Brave | native | yes |
| Edge | native | no |
| Chrome Canary | native | no |
| Chrome Stable | emulation | no |
| Vivaldi | emulation | yes |

Windows on Arm runs native ARM64 applications, but can also emulate 32-bit x86 and 64-bit x64 applications. Emulation is slower than native and shortens battery life, but may provide functionality you need.

Expand Down
23 changes: 18 additions & 5 deletions content/install-guides/browsers/chrome.md
Expand Up @@ -25,19 +25,33 @@ layout: installtoolsall # DO NOT MODIFY. Always true for tool install ar

## Installing Chrome

The Chrome browser runs natively on Windows on Arm on the Canary release channel, and under emulation on the Stable release channel. Chrome is not available for Arm Linux.

### Linux

Chrome is not available for Arm Linux.

### Windows

#### Native

To install Chrome on Windows on Arm:

1. Go to the [download page](https://www.google.com/chrome/canary/?platform=win_arm64) and click the Download Chrome Canary button.

2. Run the downloaded `ChromeSetup.exe` file.

3. Find and start Chrome from the applications menu.

{{% notice Note %}}
The native Windows on Arm version of Chrome is currently on the Canary channel. This is an experimental version that is updated daily, but it is faster than emulation.
{{% /notice %}}

#### Emulation

If you prefer to use the Stable version, you can run Chrome using emulation.

Emulation is slower than native and shortens battery life.

1. Download the Windows installer from [Google Chrome](https://www.google.com/chrome/)

Expand All @@ -49,4 +63,3 @@ Chrome supports Google account sign-in, bookmark synchronization, and password m
The Chrome setup program installs the 32-bit x86 version of Chrome.
{{% /notice %}}


10 changes: 10 additions & 0 deletions content/install-guides/pdh/browser.md
Expand Up @@ -31,6 +31,16 @@ The portal uses [IBM Aspera](https://www.ibm.com/products/aspera) to enable high

It is also possible to download without the use of Aspera if you prefer, but this will be slower, especially for large downloads.

## Firewall issues

If you are unable to download your products, it may be due to firewall issues.

If the EULA fails to appear, check with your internal IT teams that you can access:
```url
na3.docusign.net
```
It may also be necessary to delete cookies from your browser.

## Updates

You will automatically be notified by the system when updates become available for any products that you have downloaded.
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
title: Learn about Autovectorization

draft: true

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ further_reading:
link: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/update-on-gnu-performance
type: blog
- resource:
title: Auto-Vectorization in LLVM
link: https://llvm.org/docs/Vectorizers.html
type: website
- resource:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -103,8 +103,8 @@ The reason for this is related to how each compiler decides whether to use autov

For each candidate loop the compiler will estimate the possible performance gains against a cost model, which is affected by many parameters and of course the optimization level in the compilation flags.

The cost model estimates whether the autovectorized code grows in size and whether the performance gains are enough to outweigh the increase in code size. Based on this estimation, the compiler will decide to use vectorized code or fall back to a 'safer' scalar implementation. This decision, however, is fluid and is constantly reevaluated during compiler development.

Compiler cost model analysis is beyond the scope of this Learning Path, but the example above demonstrates how autovectorization can be triggered by a flag.

You will see some more advanced examples in the next sections.
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ In the previous section, you learned that compilers cannot autovectorize loops w

In this section, you will see more examples of loops with branches.

You will learn when it is possible to enable the vectorizer in the compiler by adapting the loop, and when you are required to modify the algorithm or write manually optimized code.

### Loops with if/else/switch statements

Expand Down Expand Up @@ -48,9 +48,9 @@ void addvecweight(float *restrict C, float *A, float *B, size_t N) {

These are two different loops that the compiler can vectorize.

Both GCC and Clang can autovectorize this loop, but the output is slightly different, and performance may vary depending on the flags used and the exact nature of the loop.

However, the loop below is autovectorized by Clang but not by GCC.

```C
void addvecweight2(float *restrict C, float *A, float *B,
Expand Down Expand Up @@ -111,4 +111,4 @@ void addvecweight(float *restrict C, float *A, float *B,
The cases you have seen so far are generic; they work the same for any architecture.
In the next section, you will see Arm-specific cases for autovectorization.
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ for (size_t i=0; i < N; i++) {
}
```

This loop, however, is not countable and cannot be vectorized:

```C
i = 0;
Expand All @@ -46,7 +46,7 @@ while(1) {
}
```

This loop, however, is not vectorizable:

```C
i = 0;
Expand All @@ -59,17 +59,17 @@ while(1) {

#### No function calls inside the loop

If `f()` and `g()` are functions that take `float` arguments, this loop cannot be autovectorized:

```C
for (size_t i=0; i < N; i++) {
C[i] = f(A[i]) + g(B[i]);
}
```

There is a special case for the math library's trigonometry and transcendental functions (such as `sin`, `cos`, and `exp`). There is work underway to enable these functions to be autovectorized, as the compiler will use their vectorized counterparts in the `mathvec` library (`libmvec`).

The loop below is *already autovectorized* in current gcc trunk for Arm (note that you must add `-Ofast` to the compilation flags to enable autovectorization):

```C
void addfunc(float *restrict C, float *A, float *B, size_t N) {
Expand All @@ -79,7 +79,7 @@ void addfunc(float *restrict C, float *A, float *B, size_t N) {
}
```
This feature will be in gcc 14 and requires a new glibc version 2.39 as well. Until then, if you are using a released compiler as part of a Linux distribution (such as gcc 13.2), you will need to manually vectorize such code for performance.
There is more about autovectorization of conditionals in the next section.
Expand All @@ -105,11 +105,11 @@ for (size_t i=0; i < N; i++) {

In this case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable).

There are some cases where outer loop types are autovectorized, but these are not covered in this Learning Path.
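For illustration, a typical nested loop of this kind might be a matrix addition like the following sketch (the function name and layout are assumptions, not code from the Learning Path); only the inner loop over `j` is a candidate for vectorization:

```C
#include <stddef.h>

/* Illustrative sketch: add two N x M matrices stored row-major.
   The inner loop over j is countable, branch-free, and has no
   inter-iteration dependency, so it can be autovectorized. */
void addmat(float *restrict C, const float *A, const float *B,
            size_t N, size_t M) {
    for (size_t i = 0; i < N; i++) {       /* outer loop: not vectorized */
        for (size_t j = 0; j < M; j++) {   /* inner loop: vectorizable */
            C[i * M + j] = A[i * M + j] + B[i * M + j];
        }
    }
}
```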

#### No data inter-dependency between iterations

This means that each iteration depends on the result of the previous iteration. This example is difficult, but not impossible, to autovectorize.
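A classic example of such a loop-carried dependency is a prefix sum (an illustrative sketch; the function name is an assumption):

```C
#include <stddef.h>

/* Illustrative sketch of a loop-carried dependency: a prefix sum.
   Each iteration reads the accumulator written by the previous one,
   so the compiler cannot naively process several iterations at once. */
void prefix_sum(int *restrict out, const int *in, size_t N) {
    int acc = 0;
    for (size_t i = 0; i < N; i++) {
        acc += in[i];   /* depends on the previous iteration's acc */
        out[i] = acc;
    }
}
```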

The loop below cannot be autovectorized as it is.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ dotprod:
ret
```

You can see that it's a fairly standard implementation, processing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dotprod()` into `main()` for performance reasons. In general, this is a good thing, but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.

Next, increase the optimization level to `-O3`, recompile, and observe the assembly output again:

Expand Down Expand Up @@ -135,7 +135,7 @@ dotprod:
b .L3
```

The code is larger, but you can see that some autovectorization has taken place.

The label `.L4` includes the main loop and you can see that the `mla` instruction is used to multiply and accumulate the dot products, 4 elements at a time.

Expand All @@ -145,7 +145,7 @@ With the new code, you can expect a performance gain of about 4x.

You might be wondering if there is a way to hint to the compiler that the sizes are always going to be multiples of 4 and avoid the last part of the code.

The answer is *yes*, but it depends on the compiler. In the case of gcc, it is enough to add a statement that ensures the sizes are multiples of 4.

Modify the `dotprod()` function to add the multiples of 4 hint as shown below:

Expand All @@ -160,7 +160,7 @@ int32_t dotprod(int32_t *A, int32_t *B, size_t N) {
}
```
Compile again with `-O3`:
```bash
gcc -O3 -fno-inline dotprod.c -o dotprod
Expand Down Expand Up @@ -195,7 +195,7 @@ Is there anything else the compiler can do?

Modern compilers are very proficient at generating code that utilizes all available instructions, provided they have the right information.

For example, the `dotprod()` function operates on `int32_t` elements. What if you could limit the range to 8-bit?

There is an Armv8 ISA extension that [provides signed and unsigned dot product instructions](https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-) to perform a dot product across 8-bit elements of 2 vectors and store the results in the 32-bit elements of the resulting vector.
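As a sketch, an 8-bit variant of the dot product might look like this (the function name and the unsigned element type are assumptions; with `-march=armv8-a+dotprod` the compiler can map the widening multiply-accumulate onto the dot product instructions):

```C
#include <stddef.h>
#include <stdint.h>

/* Illustrative 8-bit dot product: the elements are 8-bit but the
   accumulator is 32-bit, matching the widening behavior of the
   SDOT/UDOT instructions described above. */
int32_t dotprod_u8(const uint8_t *A, const uint8_t *B, size_t N) {
    int32_t result = 0;
    for (size_t i = 0; i < N; i++) {
        result += (int32_t)A[i] * (int32_t)B[i];
    }
    return result;
}
```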

Expand Down Expand Up @@ -237,7 +237,7 @@ gcc -O3 -fno-inline -march=armv8-a+dotprod dotprod.c -o dotprod

You need to compile with the architecture flag to use the dot product instructions.

The assembly output will be quite a bit larger, as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.

You can eliminate the extra tail instructions by converting `N -= N % 4` to 8 or even 16 as shown below:
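The widened hint might look like the following sketch (this mirrors the earlier `dotprod()` listing rather than reproducing it exactly):

```C
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: widening the multiple-of hint from 4 to 16 tells
   the compiler it never needs a scalar tail for the last few elements. */
int32_t dotprod(int32_t *A, int32_t *B, size_t N) {
    int32_t result = 0;
    N -= N % 16;   /* hint: treat N as a multiple of 16 */
    for (size_t i = 0; i < N; i++) {
        result += A[i] * B[i];
    }
    return result;
}
```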

Expand Down
Expand Up @@ -8,7 +8,7 @@ layout: learningpathall

The previous example using the `SDOT`/`UDOT` instructions is only one of the Arm-specific optimizations possible.

While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example.

Below is a very simple loop calculating what is known as a Sum of Absolute Differences (SAD). Such code is very common in video codecs, and is used to calculate differences between video frames.

Expand Down Expand Up @@ -37,9 +37,9 @@ int main() {
}
```
A hint to the compiler was added that the size is a multiple of 16, to avoid generating cases for smaller lengths. *This is for demonstration purposes only*.
Save the code above to a file named `sadtest.c` and compile it:
```bash
gcc -O3 -fno-inline sadtest.c -o sadtest
Expand Down Expand Up @@ -71,11 +71,11 @@ sad8:
ret
```

You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en).

The accumulator variable is not 8-bit but 32-bit, so the typical SIMD implementation, involving 16 x 8-bit subtractions, then 16 absolute values and 16 additions, would not do; a widening conversion to 32-bit has to take place before the accumulation.

This would mean accumulating 4 items at a time, but with the use of these instructions the performance gain can be up to 16x over the original scalar code, or about 4x over the typical SIMD implementation.
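As a reference point, the scalar SAD computation discussed above can be sketched as follows (the function name and the widening `int32_t` accumulator mirror the description; the exact listing in the Learning Path may differ):

```C
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative scalar Sum of Absolute Differences over 8-bit elements,
   accumulated into a 32-bit result as described above. */
int32_t sad8_scalar(const uint8_t *A, const uint8_t *B, size_t N) {
    int32_t sum = 0;
    for (size_t i = 0; i < N; i++) {
        sum += abs((int)A[i] - (int)B[i]);   /* widen before accumulating */
    }
    return sum;
}
```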

For completeness, the SVE2 version is provided below; it does not depend on the size being a multiple of 16.

Expand Down Expand Up @@ -126,7 +126,7 @@ You might ask why you should learn about autovectorization if you need to have s

Autovectorization is a tool. The goal is to minimize the effort required by developers and to maximize performance, while keeping maintenance low in terms of code size.

It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization for all platforms than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.

As with most tools, the better you know how to use it, the better the results will be.

