From 313a5a79ee8ea86f2b1cdaac99b1c41fd579d285 Mon Sep 17 00:00:00 2001 From: Jason Andrews Date: Mon, 15 Jan 2024 13:13:17 -0600 Subject: [PATCH 1/4] review autovectorization learning path --- .../cross-platform/loop-reflowing/_index.md | 15 ++- .../cross-platform/loop-reflowing/_review.md | 2 +- .../autovectorization-and-restrict.md | 53 +++++--- .../autovectorization-conditionals.md | 29 +++-- .../autovectorization-limits.md | 119 ++++++++++-------- .../autovectorization-on-arm-1.md | 94 ++++++++++---- .../autovectorization-on-arm-2.md | 32 +++-- .../introduction-to-autovectorization.md | 32 +++-- 8 files changed, 240 insertions(+), 136 deletions(-) diff --git a/content/learning-paths/cross-platform/loop-reflowing/_index.md b/content/learning-paths/cross-platform/loop-reflowing/_index.md index efc01b133..24372050b 100644 --- a/content/learning-paths/cross-platform/loop-reflowing/_index.md +++ b/content/learning-paths/cross-platform/loop-reflowing/_index.md @@ -3,23 +3,22 @@ title: Loop Reflowing/Autovectorization minutes_to_complete: 45 -who_is_this_for: This is an advanced topic for C/C++ developers who are interested in taking advantage of autovectorization in compilers +who_is_this_for: This is an advanced topic for C/C++ developers who are interested in taking advantage of autovectorization in compilers. learning_objectives: - - Learn how to modify loops in order to take advantage of autovectorization in compilers + - Modify loops to take advantage of autovectorization in compilers prerequisites: - - An Arm computer running Linux OS and a recent version of compiler (Clang or GCC) installed + - An Arm computer running Linux and a recent version of Clang or the GNU compiler (gcc) installed. 
author_primary: Konstantinos Margaritis ### Tags skilllevels: Advanced -subjects: Programming +subjects: Performance and Architecture armips: - - Aarch64 - - Armv8-a - - Armv9-a + - Neoverse + - Cortex-A tools_software_languages: - GCC - Clang @@ -28,8 +27,8 @@ operatingsystems: - Linux shared_path: true shared_between: - - laptops-and-desktops - servers-and-cloud-computing + - laptops-and-desktops - smartphones-and-mobile diff --git a/content/learning-paths/cross-platform/loop-reflowing/_review.md b/content/learning-paths/cross-platform/loop-reflowing/_review.md index 55bec67d5..8fc1eda02 100644 --- a/content/learning-paths/cross-platform/loop-reflowing/_review.md +++ b/content/learning-paths/cross-platform/loop-reflowing/_review.md @@ -4,7 +4,7 @@ review: question: > Autovectorization is: answers: - - The automatic generation of 3D vectors so that 3D applications/games run faster. + - The automatic generation of 3D vectors so that 3D games run faster. - Converting an array of numbers in C to an STL C++ vector object. - The process where an algorithm is automatically vectorized by the compiler to use SIMD instructions. correct_answer: 3 diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md index 442582d27..f4e5c8afe 100644 --- a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md +++ b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-and-restrict.md @@ -1,26 +1,31 @@ --- -title: Autovectorization and restrict +title: Autovectorization using the restrict keyword weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Autovectorization and restrict keyword +You may have already experienced some form of autovectorization by reading [Understand the restrict keyword in C99](/learning-paths/cross-platform/restrict-keyword-c99/). 
-You have already experienced some form of autovectorization by learning about the [`restrict` keyword in a previous Learning Path](https://learn.arm.com/learning-paths/cross-platform/restrict-keyword-c99/). -Our example is a classic textbook example that the compiler will autovectorize simply by using `restrict`: +The example in the previous section is a classic textbook example that the compiler will autovectorize by using `restrict`. -Try the previously saved files, compile them both and compare the assembly output: +Compile the previously saved files: ```bash gcc -O2 addvec.c -o addvec gcc -O2 addvec_neon.c -o addvec_neon ``` -Let's look at the assembly output of `addvec`: +Generate the assembly output using: -```as +```bash +objdump -D addvec +``` + +The assembly output of the `addvec()` function is shown below: + +```output addvec: mov x3, 0 .L2: @@ -34,9 +39,15 @@ addvec: ret ``` -Similarly, for the `addvec_neon` executable: +Generate the assembly output for `addvec_neon` using: + +```bash +objdump -D addvec_neon +``` + +The assembly output for the `addvec()` function from the `addvec_neon` executable is shown below: -```as +```output addvec: mov x3, 0 .L6: @@ -50,9 +61,9 @@ addvec: ret ``` -The latter uses Advanced SIMD/Neon instructions `fadd` with operands `v0.4s`, `v1.4s` to perform calculations in 4 x 32-bit floating-point elements. +The second example uses the Advanced SIMD/Neon instruction `fadd` with operands `v0.4s`, `v1.4s` to perform calculations in 4 x 32-bit floating-point elements. 
-Let's try to add `restrict` to the output argument `C` in the first `addvec` function: +Add the `restrict` keyword to the output argument `C` in the `addvec()` function in `addvec.c`: ```C void addvec(float *restrict C, float *A, float *B) { @@ -63,8 +74,14 @@ void addvec(float *restrict C, float *A, float *B) { ``` Recompile and check the assembly output again: +```bash +gcc -O2 addvec.c -o addvec +objdump -D addvec +``` + +The assembly output for the `addvec` function is now: -```as +```output addvec: mov x3, 0 .L2: @@ -78,10 +95,16 @@ addvec: ret ``` -As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function! Strictly speaking, you don't even need `restrict` in such a trivial loop as it will be autovectorized anyway when certain optimization levels are added to the compilation flags (`-O2` for clang, `-O3` for gcc). However, the use of restrict simplifies the code and generates SIMD code similar to the hand written version in `addvec_neon.c`. +As you can see, the compiler has enabled autovectorization for this algorithm and the output is identical to the hand-written function. + +Strictly speaking, you don't even need `restrict` in such a trivial loop as it will be autovectorized anyway when certain optimization levels are added to the compilation flags (`-O2` for clang, `-O3` for gcc). However, the use of restrict simplifies the code and generates SIMD code similar to the hand written version in `addvec_neon.c`. + +The reason for this is related to how each compiler decides whether to use autovectorization or not. + +For each candidate loop the compiler will estimate the possible performance gains against a cost model, which is affected by many parameters and of course the optimization level in the compilation flags. -The reason for this is because of the way each compiler decides whether to use autovectorization or not. 
For each candidate loop the compiler will estimate the possible performance gains against a cost model, which is affected by many parameters and of course the optimization level in the compilation flags. This cost model will estimate whether the autovectorized code grows in size and if the performance gains are enough to outweigh this increase in code size. Based on this estimation, the compiler will decide to use this vectorized code or fall back to a more 'safe' scalar implementation. This decision however is something that is not set in stone and is constantly reevaluated during compiler development. +The cost model estimates whether the autovectorized code grows in size and if the performance gains are enough to outweigh the increase in code size. Based on this estimation, the compiler will decide to use vectorized code or fall back to a more 'safe' scalar implementation. This decision however is fluid and is constantly reevaluated during compiler development. -This analysis goes beyond the scope of this LP, this was just one trivial example to demonstrate how the autovectorization can be triggered by a flag. +Compiler cost model analysis is beyond the scope of this Learning Path, but the example demonstrates how autovectorization can be triggered by a flag. You will see some more advanced examples in the next sections. \ No newline at end of file diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md index 7b6571b21..e2357d8ea 100644 --- a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md +++ b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-conditionals.md @@ -6,11 +6,13 @@ weight: 5 layout: learningpathall --- -## Autovectorization and conditionals +In the previous section, you learned that compilers cannot autovectorize loops with branches. 
-Previously we mentioned that compilers cannot autovectorize loops with branches. In this section, you will see that in more detail, when it is possible to enable the vectorizer in the compiler by adapting the loop and when it is required to modify the algorithm or write manually optimized code. +In this section, you will see more examples of loops with branches. -### If/else/switch in loops +You will learn when it is possible to enable the vectorizer in the compiler by adapting the loop, and when you are required to modify the algorithm or write manually optimized code. + +### Loops with if/else/switch statements Consider the following function, a modified form of the previous function that uses weighted coefficients for `A[i]`. @@ -26,7 +28,9 @@ void addvecweight(float *restrict C, float *A, float *B, } ``` -You might be tempted to think that this loop cannot be vectorized. Such loops are not that uncommon and compilers have a difficult time understanding the pattern and transforming them to vectorizeable forms, when it is possible. However, this is actually a vectorizable loop, as the conditional can actually be moved out of the loop, as this is a loop-invariant conditional. Essentially the compiler would transform -internally- the loop in something like the following: +You might think that this loop cannot be vectorized. Such loops are not uncommon and compilers have a difficult time understanding the pattern and transforming them to vectorizable forms. However, this is actually a vectorizable loop, as the conditional can be moved out of the loop, as this is a loop-invariant conditional. + +The compiler will internally transform the loop into something similar to the code below: ```C void addvecweight(float *restrict C, float *A, float *B, size_t N) { @@ -42,9 +46,11 @@ void addvecweight(float *restrict C, float *A, float *B, size_t N) { } ``` -which is in essence, two different loops and we know that the compiler can vectorize them. 
Both gcc and llvm can actually autovectorize this loop, but the output is slightly different, performance may actually vary depending on the flags used and the exact nature of the loop.
+These are two different loops that the compiler can vectorize.
+
+Both GCC and Clang can autovectorize this loop, but the output is slightly different, and performance may vary depending on the flags used and the exact nature of the loop.
 
-However, the following loop is not yet autovectorized by all compilers (llvm/clang autovectorizes this loop, but not gcc):
+However, the loop below is autovectorized by Clang but it is not autovectorized by GCC.
 
 ```C
 void addvecweight2(float *restrict C, float *A, float *B,
@@ -58,8 +64,9 @@ void addvecweight2(float *restrict C, float *A, float *B,
 }
 ```
 
-Similarly with `switch` statements, if the condition expression in loop-invariant, that is if it does not depend on the loop variable or the elements involved in each iteration.
-For this reason we know that this loop is actually autovectorized:
+The situation is similar with `switch` statements. If the condition expression is loop-invariant, that is if it does not depend on the loop variable or the elements involved in each iteration, it can be autovectorized.
+
+This example is autovectorized:
 
 ```C
 void addvecweight(float *restrict C, float *A, float *B,
@@ -79,7 +86,7 @@ void addvecweight(float *restrict C, float *A, float *B,
 }
 ```
 
-But this one is not:
+This example is not autovectorized:
 
 ```C
 #define sign(x) (x > 0) ? 1 : ((x < 0) ? -1 : 0)
@@ -102,4 +109,6 @@ void addvecweight(float *restrict C, float *A, float *B,
 }
 ```
 
-The cases you have seen so far are generic, they will work in other architectures besides Arm. In the next section, you will see Arm-specific usecases for autovectorization. \ No newline at end of file
+The cases you have seen so far are generic; they work the same for any architecture.
+
+In the next section, you will see Arm-specific cases for autovectorization. 
\ No newline at end of file diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md index ded597806..ae0e96bf7 100644 --- a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md +++ b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-limits.md @@ -6,65 +6,70 @@ weight: 4 layout: learningpathall --- -## Autovectorization limits +Autovectorization is not as easy as adding a flag like `restrict` in the arguments list. -Autovectorization is not as easy as adding a flag like `restrict` in the arguments list. There are some requirements for autovectorization to be enabled, namely: +There are some requirements for autovectorization to be enabled. Some of the requirements with examples are shown below. -* The loops have to be countable +#### Countable loops -This means that the following can be vectorized: +A countable loop is a loop where the number of iterations is known before the loop begins executing. + +Countable loops means the following can be vectorized: ```C - for (size_t i=0; i < N; i++) { - C[i] = A[i] + B[i]; - } +for (size_t i=0; i < N; i++) { + C[i] = A[i] + B[i]; +} ``` -but this one cannot be vectorized: +This loop is not countable and cannot be vectorized: ```C - i = 0; - while(true) { - C[i] = A[i] + B[i]; - i++; - if (condition) break; - } +i = 0; +while(1) { + C[i] = A[i] + B[i]; + i++; + if (condition) break; +} ``` -Having said that, if condition is such that the `while` loop is actually a countable loop in disguise, then the loop might be vectorizable. For example, this loop will *actually be vectorized*: +If the `while` loop is actually a countable loop in disguise, then the loop might be vectorizable. 
+ +For example, this loop is vectorizable: ```C - i = 0; - while(1) { - C[i] = A[i] + B[i]; - i++; - if (i >= N) break; - } +i = 0; +while(1) { + C[i] = A[i] + B[i]; + i++; + if (i >= N) break; +} ``` -but this one will not be vectorizable: + +This loop is not vectorizable: ```C - i = 0; - while(1) { - C[i] = A[i] + B[i]; - i++; - if (C[i] > 0) break; - } +i = 0; +while(1) { + C[i] = A[i] + B[i]; + i++; + if (C[i] > 0) break; +} ``` -* No function calls inside the loop +#### No function calls inside the loop -For example if, `f()`, `g()` are functions that take `float` arguments, this loop cannot be autovectorized: +If `f()` and `g()` are functions that take `float` arguments this loop cannot be autovectorized: ```C - for (size_t i=0; i < N; i++) { - C[i] = f(A[i]) + g(B[i]); - } +for (size_t i=0; i < N; i++) { + C[i] = f(A[i]) + g(B[i]); +} ``` -There is a special case of the math library trigonometry and transcendental functions (like `sin`, `cos`, `exp`, etc). There is progress underway to enable these functions to be autovectorized, as the compiler will be able to use their vectorized counterparts in `mathvec` library (`libmvec`). +There is a special case of the math library trigonometry and transcendental functions (like `sin`, `cos`, `exp`, etc). There is work underway to enable these functions to be autovectorized, as the compiler will use their vectorized counterparts in the `mathvec` library (`libmvec`). 
-So for example, something like the following is actually *already autovectorized* in current gcc trunk for Arm (note you have to add `-Ofast` to compilation flags to enable such autovectorization): +The loop below is *already autovectorized* in current gcc trunk for Arm (note you have to add `-Ofast` to the compilation flags to enable autovectorization): ```C void addfunc(float *restrict C, float *A, float *B, size_t N) { @@ -74,38 +79,42 @@ void addfunc(float *restrict C, float *A, float *B, size_t N) { } ``` -This will be in gcc 14 and require a new glibc as well (2.39). Until these are released, if you are using a released compiler as part of a distribution (gcc 13.2 at the time of writing), you will have to manually vectorize such code for performance. +This feature will be in gcc 14 and require a new glibc version 2.39 as well. Until then, if you are using a released compiler as part of a Linux distribution (such as gcc 13.2), you will need to manually vectorize such code for performance. -We will expand on autovectorization of conditionals in the next section. +There is more about autovectorization of conditionals in the next section. -* In general, no branches in the loop, no if/else/switch +#### No branches in the loop and no if/else/switch statements -This is not universally true, there are cases where branches can actually be vectorized, we will expand this in the next section. -And in the case of SVE/SVE2 on Arm, predicates will actually make this easier and remove or minimize these limitations at least in some cases. There is currently work in progress on the compiler front to enable the use of predicates in such loops. We will probably return with a new LP to explain SVE/SVE2 autovectorization and predicates in more depth. +This is not universally true, there are cases where branches can actually be vectorized. -* Only inner-most loops will be vectorized. 
+In the case of SVE/SVE2 on Arm, predicates will actually make this easier and remove or minimize these limitations at least in some cases. There is currently work in progress to enable the use of predicates in such loops. SVE/SVE2 autovectorization and predicates is a good topic for a future Learning Path. -To clarify, consider the following nested loop: +There is more information on this in the next section. + +#### Only inner-most loops will be vectorized. + +Consider the following nested loop: ```C - for (size_t i=0; i < N; i++) { - for (size_t j=0; j < M; j++) { - C[i][j] = A[i][j] + B[i][j]; - } +for (size_t i=0; i < N; i++) { + for (size_t j=0; j < M; j++) { + C[i][j] = A[i][j] + B[i][j]; } +} ``` -In such a case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable). -In fact, there are some cases where outer loop types are also autovectorized, but these are outside the scope of this LP. +In this case, only the inner loop will be vectorized, again provided all the other conditions also apply (no branches and the inner loop is countable). + +There are some cases where outer loop types are autovectorized, but these are not covered in this Learning Path. + +#### No data inter-dependency between iterations -* No data inter-dependency between iterations +This means that each iteration depends on the result of the previous iteration. This example is difficult, but not impossible to autovectorize. -This means that each iteration depends on the result of the previous iteration. Such a problem is difficult -but not impossible- to autovectorize. Consider the following example: +The loop below cannot be autovectorized as it is. ```C - for (size_t i=1; i < N; i++) { - C[i] = A[i] + B[i] + C[i-1]; - } +for (size_t i=1; i < N; i++) { + C[i] = A[i] + B[i] + C[i-1]; +} ``` - -This example cannot be autovectorized as it is. 
\ No newline at end of file diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md index 36e10b548..6dc45ed09 100644 --- a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md +++ b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-1.md @@ -1,15 +1,16 @@ --- -title: Autovectorization +title: Autovectorization on Arm weight: 6 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Autovectorization on Arm +In this section you will learn how to take advantage of specific Arm instructions. -This time you will see how you can take advantage of specific Arm instructions. -The following code will calculate the dot product of two integer arrays. +The following code calculates the dot product of two integer arrays. + +Copy the code and save it to a file named `dotprod.c`. ```C #include @@ -34,9 +35,23 @@ int main() { } ``` -Such code is quite common in audio/video codecs where integer arithmetic is used instead of floating-point. +Such code is common in audio and video codecs where integer arithmetic is used instead of floating-point. + +Compile the code: + +```bash +gcc -O2 -fno-inline dotprod.c -o dotprod +``` + +Look at the assembly code: + +```bash +objdump -D dotprod +``` -As it is, if you would compile it with `gcc -O2 -fno-inline` the assembly output for `dotprod` is the following: +The `objdump` instructions are omitted from the remainder of the examples, but you can use `objdump` every time you recompile to see the assembly output. + +The assembly output for the `dotprod()` function is: ```as dotprod: @@ -57,9 +72,15 @@ dotprod: ret ``` -You can see that it's a pretty standard implementation, doing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dot-prod()` into `main()` for performance reasons. 
In general this is a good thing, but in this case we want to demonstrate the autovectorization process and it will be harder if there is no easy way to distinguish the caller from the callee.
+You can see that it's a pretty standard implementation, doing one element at a time. The option `-fno-inline` is necessary to avoid inlining any code from the function `dotprod()` into `main()` for performance reasons. In general, this is a good thing, but demonstrating the autovectorization process is more difficult if there is no easy way to distinguish the caller from the callee.
+
+Next, increase the optimization level to `-O3`, recompile, and observe the assembly output again:
+
+```bash
+gcc -O3 -fno-inline dotprod.c -o dotprod
+```
-Now, increase the optimization level to `-O3`, recompile and observe the assembly output again:
+The assembly for the `dotprod()` function is now:
 
 ```as
 dotprod:
@@ -114,13 +135,19 @@ dotprod:
 	b	.L3
 ```
 
-Quite larger in quantity but you can confirm that some sort of autovectorization has taken place.
+The code is larger, but you can see that some autovectorization has taken place.
+
+The label `.L4` includes the main loop and you can see that the `mla` instruction is used to multiply and accumulate the dot products, 4 elements at a time.
+
+At the end of this loop, the `addv` instruction does a horizontal addition of the 4 elements in the final vector and returns the final sum. The main loop is executed while the number of the remaining elements is a multiple of 4. The rest of the elements are processed one at a time in the `.L3` section of code.
+
+With the new code, you can expect a performance gain of about 4x.
-The label `.L4` includes the main loop and you can see that the instruction `MLA` is used to multiply and accumulate the dot products, 4 elements at a time. At the end of this loop, the instruction `ADDV` does a horizontal addition of the 4 elements in the final vector and returns the final sum. 
The main loop is executed while the number of the remaining elements is a multiple of 4. The rest of the elements are processed one at a time in the `.L3` part of the code.
 
-This is quite nice and you can expect a performance gain of about 4x faster in general.
+You might be wondering if there is a way to hint to the compiler that the sizes are always going to be multiples of 4 and avoid the last part of the code.
 
-You might be wondering if there is a way to hint to the compiler that the sizes are always going to be multiples of 4 and avoid the last part of the code and the answer is *yes*, but it depends on the compiler. In the case of gcc, it is enough to add an instruction that ensures the size is only multiples of 4:
+The answer is *yes*, but it depends on the compiler. In the case of gcc, it is enough to add an instruction that ensures the sizes are multiples of 4.
+
+Modify the `dotprod()` function to add the multiples of 4 hint as shown below:
 
 ```C
 int32_t dotprod(int32_t *A, int32_t *B, size_t N) {
@@ -133,7 +160,13 @@ int32_t dotprod(int32_t *A, int32_t *B, size_t N) {
 }
 ```
 
-And the assembly output with `-O3` will much more compact, now that it does not have to handle the left over bytes:
+Compile again with `-O3`:
+
+```bash
+gcc -O3 -fno-inline dotprod.c -o dotprod
+```
+
+The assembly output with `-O3` is much more compact because it does not need to handle the leftover bytes:
 
 ```as
 dotprod:
@@ -158,12 +191,25 @@ dotprod:
 	ret
 ```
 
-But is that all that the compiler can do? Thankfully not. Modern compilers are very proficient in generating code that utilizes all available instructions, provided they have the right information.
-For example, the `dotprod()` function that you wrote operates on `int32_t` elements, what if you could limit the range to 8-bit? 
Something like that is not untypical, and we know there is an Armv8 ISA extension that [provides a pair of new `SDOT`/`UDOT` instructions to perform a dotprot across 8-bit elements of 2 vectors and store the results in the 32-bit elements of the resulting vector](https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-). Could the compiler make use of that automatically? Or should you resort to a hand-written version that uses the respective intrinsics?
+Is there anything else the compiler can do?
+
+Modern compilers are very proficient at generating code that utilizes all available instructions, provided they have the right information.
+
+For example, the `dotprod()` function operates on `int32_t` elements. What if you could limit the range to 8-bit?
+
+There is an Armv8 ISA extension that [provides signed and unsigned dot product instructions](https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-) to perform a dot product across 8-bit elements of 2 vectors and store the results in the 32-bit elements of the resulting vector.
+
+Could the compiler make use of the instructions automatically or does the code need to be hand-written using intrinsics?
 
-It turns out that some compilers can, and will detect that the number or the iterations is a multiple of the number of elements in a SIMD vector. Convert the code to use `int8_t` types for `A` and `B` arrays:
+It turns out that some compilers will detect that the number of iterations is a multiple of the number of elements in a SIMD vector. 
Modify the `dotprod.c` code to use `int8_t` types for `A` and `B` arrays as shown below:
 
 ```C
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+
 int32_t dotprod(int8_t *A, int8_t *B, size_t N) {
     int32_t result = 0;
     N -= N % 4;
@@ -183,14 +229,17 @@ int main() {
 }
 ```
 
-You need to add `-march=armv8-a+dotprod` to the compilation flags in order to hint to the compiler that it has the new instructions at its disposal, that is:
+Compile the code:
 
 ```bash
-gcc -O3 -Wall -g -fno-inline -march=armv8-a+dotprod
+gcc -O3 -fno-inline -march=armv8-a+dotprod dotprod.c -o dotprod
 ```
 
-The assembly output will be quite larger as the use of `SDOT` can only work in the main loop where the size is a multiple of 16. Then the compiler will unroll the loop to use ASIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.
-You could eliminate those extra tail instructions by converting `N -= N % 4` to 8 or even 16:
+You need to compile with the architecture flag to use the dot product instructions.
+
+The assembly output is quite a bit larger because the use of `SDOT` can only work in the main loop where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.
+
+You can eliminate the extra tail instructions by converting `N -= N % 4` to 8 or even 16 as shown below:
 
 ```C
 int32_t dotprod(int8_t *A, int8_t *B, size_t N) {
@@ -227,5 +276,6 @@ dotprod:
 	ret
 ```
 
-As before, at the end of the loop `ADDV` is used to perform a horizontal addition of the 32-bit integer elements and produce the final dot product sum.
-This particular implementation can be up to 4x faster than the previous version using `MLA`.
+As before, at the end of the loop the `addv` instruction is used to perform a horizontal addition of the 32-bit integer elements and produce the final dot product sum. 
+ +This particular implementation will be up to 4x faster than the previous version using `mla`. diff --git a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-2.md b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-2.md index 463186f61..1e22fc5aa 100644 --- a/content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-2.md +++ b/content/learning-paths/cross-platform/loop-reflowing/autovectorization-on-arm-2.md @@ -1,15 +1,16 @@ --- -title: Autovectorization on Arm, continued +title: More autovectorization on Arm weight: 7 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## More autovectorization on Arm +The previous example using the `SDOT`/`UDOT` instructions is only one of the Arm-specific optimizations possible. -The previous example using the `SDOT`/`UDOT` instructions was only one of the Arm-specific optimizations possible. -While it is not possible to exhaust all the specialized instructions offered on Arm CPUs in a single LP, it's worth looking at another example: +While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it's worth looking at another example: + +Below is a very simple loop, calculating what is known as a Sum of Absolute Differences (SAD). Such code is very common in video codecs and used in calculating differences between video frames. ```C #include @@ -36,10 +37,9 @@ int main() { } ``` -This is a very simple loop, calculating what is known as a Sum of Absolute Differences (SAD). Again, such code is very common in video codecs and used in calculating differences between video frames, etc. A hint to the compiler was added that the size is a multiple of 16 to avoid generating cases for smaller lengths. *This is only for demonstration purposes*. 
-Save the code as `sadtest.c` and compile it like this: +Save the code above to a file named `sadtest.c` and compile it: ```bash gcc -O3 -fno-inline sadtest.c -o sadtest @@ -73,22 +73,23 @@ sad8: You can see that the compiler generates code that uses 3 specialized instructions that exist only on Arm: [`SABDL2`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABDL--SABDL2--Signed-Absolute-Difference-Long-?lang=en), [`SABAL`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SABAL--SABAL2--Signed-Absolute-difference-and-Accumulate-Long-?lang=en) and [`SADALP`](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SADALP--Signed-Add-and-Accumulate-Long-Pairwise-?lang=en). - The accumulator variable is not 8-bit but 32-bit, so the typical SIMD implementation that would involve 16 x 8-bit subtractions, then 16 x absolute values and 16 x additions would not do, and a widening conversion to 32-bit would have to take place before the accumulation. +The accumulator variable is not 8-bit but 32-bit, so the typical SIMD implementation that would involve 16 x 8-bit subtractions, then 16 x absolute values and 16 x additions would not do, and a widening conversion to 32-bit would have to take place before the accumulation. - This would mean that 4x items at a time would be accumulated, but with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or ~4x faster than the typical SIMD implementation. +This would mean that 4x items at a time would be accumulated, but with the use of these instructions, the performance gain can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation. For completeness the SVE2 version will be provided, which does not depend on size being a multiple of 16. -This is the output without the `N -= N % 16` before the loop. 
-You could compile it on any Arm system -even without support for SVE2- just by adding the appropriate `-march` flag:
+This version is without the `N -= N % 16` before the loop.
+
+You can compile it on any Arm system (even one without support for SVE2) just by adding the appropriate `-march` flag:
 
 ```bash
 gcc -O3 -fno-inline -march=armv9-a sadtest.c -o sadtest
 ```
 
-(depending on the compiler version tested `-march=armv9-a` might not be available, in that case you could use `-march=march8-a+sve2`)
+Depending on your compiler version `-march=armv9-a` might not be available. If this is the case, you can use `-march=armv8-a+sve2` instead.
 
-The SVE2 assembly output for `sad8()` in this case will be:
+The SVE2 assembly output for `sad8()` is:
 
 ```as
 sad8:
@@ -121,6 +122,11 @@ sad8:
 
 ## Conclusion
 
-You might wonder if there is a point in autovectorization, if you have to have such specialized knowledge of instructions like `SDOT`/`SADAL` etc in order to use it. The answer is that autovectorization is a tool, the goal is to minimize the effort taken by the developers and maximize the performance, while at the same time requiring low maintainance in terms of the code size. It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization, for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine. As with most tools, the better you know how to use it the better results you can expect.
+You might ask why you should learn about autovectorization if you need to have specialized knowledge of instructions like `SDOT`/`SADAL` in order to benefit.
+
+Autovectorization is a tool. The goal is to minimize the effort required by developers and maximize the performance, while at the same time requiring low maintenance in terms of code size.
+ +It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization, for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine. +As with most tools, the better you know how to use it, the better the results will be. diff --git a/content/learning-paths/cross-platform/loop-reflowing/introduction-to-autovectorization.md b/content/learning-paths/cross-platform/loop-reflowing/introduction-to-autovectorization.md index 6e339970d..8ae3a0c43 100644 --- a/content/learning-paths/cross-platform/loop-reflowing/introduction-to-autovectorization.md +++ b/content/learning-paths/cross-platform/loop-reflowing/introduction-to-autovectorization.md @@ -6,15 +6,19 @@ weight: 2 layout: learningpathall --- -## Vectorization and autovectorization +## Before you begin -CPU time is mostly spent executing code inside loops. Almost all software, especially software that performs time-consuming calculations, be it image/video processing, games, scientific software or even AI, revolves around a few loops that do most of the calculations and the majority of the code is executed only from within those loops. +You should have an Arm Linux system with gcc installed. Refer to the [GNU compiler](/install-guides/gcc/native/) install guide for instructions. The examples use gcc as the compiler, but you can also use Clang. -With the advent of SIMD and Vector engines in modern CPUs (like Neon and SVE), specialized instructions became available to developers to improve performance and efficiency of those loops. However the loops themselves had to be adapted to allow the use of those instructions. The process of this adaptation is called *Vectorization* and it is a synonym with SIMD optimization. 
+## Introduction to autovectorization -Depending on the actual loop and the operations involved, vectorization can be possible or impossible and respectively the loop can be identified as vectorizable or non-vectorizable. +CPU time is often spent executing code inside loops. Software that performs time-consuming calculations in image/video processing, games, scientific software, and AI, often revolves around a few loops doing most of the calculations. -Consider the following simple loop: +With the advent of single instruction, multiple data (SIMD) processing and vector engines in modern CPUs (like Neon and SVE), specialized instructions are available to improve the performance and efficiency of loops. However, the loops themselves need to be adapted to use SIMD instructions. The adaptation process is called *__vectorization__* and is synonymous with SIMD optimization. + +Depending on the actual loop and the operations involved, vectorization is possible or impossible and the loop is labeled as vectorizable or non-vectorizable. + +Consider the following simple loop which adds 2 vectors: ```C #include @@ -35,9 +39,11 @@ int main() { } ``` -Save this file as `addvec.c`. +Use a text editor to copy the code above and save it as `addvec.c`. -This is practically the most referred-to example with regards to vectorization, because it is easy to explain. For Advanced SIMD/Neon the vectorized form is the following: +This is the most referred-to example with regards to vectorization, because it is easy to explain. + +For Advanced SIMD/Neon, the vectorized form is the following: ```C #include @@ -62,15 +68,17 @@ int main() { } ``` -Save this file as `addvec_neon.c`. +Save the second example as `addvec_neon.c`. + +As you can see, vectorizing a loop can be a difficult task that takes time and very specialized knowledge. The knowledge is specific to the architecture, the SIMD engine, and sometimes the revision of the SIMD engine. 
-As you see, vectorizing a loop can be quite a difficult task that takes time and very specialized knowledge, not only particular to a specific architecture but to the specific SIMD engine and revision. For many developers it is such a daunting task that automating this process became one of the biggest milestones in compiler advancement for years. Enabling the compiler to perform automatic adaptation of the loop in order to be vectorizable and use SIMD instructions is called *Autovectorization*. +For many developers, vectorizing is a daunting task. Automating the process is one of the biggest milestones in compiler advancement in many years. Enabling the compiler to perform automatic adaptation of the loop in order to be vectorizable and use SIMD instructions is called *__autovectorization__*. -Autovectorization in compilers is being developed for the past 20 years, however recent advances in both major compilers (LLVM and gcc) have started to render autovectorization a viable alternative to hand-written SIMD code for more than just the basic loops. Some loop types are still not detected as autovectorizable and it is not directly obvious which kinds of loops are autovectorizable and which are not. +Autovectorization in compilers has been in development for the past 20 years. However, recent advances in both major compilers (Clang and GCC) have started to render autovectorization a viable alternative to hand-written SIMD code for more than just the basic loops. Some loop types are still not detected as autovectorizable, and it is not directly obvious which kinds of loops are autovectorizable and which are not. -As it is a constantly advancing field, it is not easy to keep track of what the current compiler supports with regards to autovectorization. 
It is a very highly advanced Computer Science topic that involves subjects such as Graph Theories, Compilers and deep understanding of each architecture and the respective SIMD engines and the number of people that are experts is extremely small. +As a constantly advancing field, it is not easy to keep track of compiler support for autovectorization. It is an advanced Computer Science topic that involves the subjects of graph theory, compilers, and deep understanding of each architecture and the respective SIMD engines. The number of experts in the field is extremely small. -In this Learning Path, you will test autovectorization through examples and identify how to adapt some loops to enable autovectorization in the compiler, on both Advanced SIMD and SVE/SVE2 systems. +In this Learning Path, you will learn about autovectorization through examples and identify how to adapt some loops to enable autovectorization. From e2e5daf4092ffbade54e59a9d5922abbcf4b43c6 Mon Sep 17 00:00:00 2001 From: Jason Andrews Date: Tue, 16 Jan 2024 12:46:22 -0600 Subject: [PATCH 2/4] update install guides for automated testing --- content/install-guides/ansible.md | 10 ++-- content/install-guides/armclang.md | 4 +- content/install-guides/aws-cli.md | 4 +- content/install-guides/azure-cli.md | 9 ++-- content/install-guides/eksctl.md | 9 ++-- content/install-guides/forge.md | 71 +++++++++++++-------------- content/install-guides/gcc/arm-gnu.md | 2 +- content/install-guides/gcc/native.md | 2 +- content/install-guides/gfortran.md | 3 +- content/install-guides/go.md | 2 +- content/install-guides/kubectl.md | 2 +- content/install-guides/nomachine.md | 4 +- content/install-guides/oci-cli.md | 1 - content/install-guides/terraform.md | 2 +- 14 files changed, 56 insertions(+), 69 deletions(-) diff --git a/content/install-guides/ansible.md b/content/install-guides/ansible.md index 08e84511c..192b5be46 100644 --- a/content/install-guides/ansible.md +++ b/content/install-guides/ansible.md @@ -2,19 
+2,17 @@ additional_search_terms: - linux - deploy - - +author_primary: Jason Andrews layout: installtoolsall minutes_to_complete: 10 -author_primary: Jason Andrews multi_install: false multitool_install_part: false official_docs: https://docs.ansible.com/ansible/latest/index.html test_images: - ubuntu:latest +test_link: null test_maintenance: false title: Ansible -author_primary: Jason Andrews tool_install: true weight: 1 --- @@ -49,7 +47,7 @@ The easiest way to install the latest version of Ansible for Ubuntu on Arm is to To enable the PPA and install Ansible run the commands: -```bash { target="ubuntu:latest" } +```bash sudo apt update sudo apt install software-properties-common -y sudo add-apt-repository --yes --update ppa:ansible/ansible @@ -58,7 +56,7 @@ sudo apt install ansible -y Confirm the Ansible command line tools are installed by running: -```bash { target="ubuntu:latest" } +```bash ansible-playbook --version ``` diff --git a/content/install-guides/armclang.md b/content/install-guides/armclang.md index 6502023f4..6105dac04 100644 --- a/content/install-guides/armclang.md +++ b/content/install-guides/armclang.md @@ -14,8 +14,8 @@ official_docs: https://developer.arm.com/documentation/100748 test_images: - ubuntu:latest - fedora:latest -test_link: https://github.com/armflorentlebeau/arm-learning-paths/actions/runs/4312122327 -test_maintenance: true +test_link: null +test_maintenance: false test_status: - passed - passed diff --git a/content/install-guides/aws-cli.md b/content/install-guides/aws-cli.md index da6922618..73fbf78fa 100644 --- a/content/install-guides/aws-cli.md +++ b/content/install-guides/aws-cli.md @@ -12,8 +12,8 @@ multitool_install_part: false official_docs: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html test_images: - ubuntu:latest -test_link: https://github.com/armflorentlebeau/arm-learning-paths/actions/runs/4312122327 -test_maintenance: true +test_link: null +test_maintenance: false test_status: - 
passed title: AWS CLI diff --git a/content/install-guides/azure-cli.md b/content/install-guides/azure-cli.md index e6607b85e..851f2a03b 100644 --- a/content/install-guides/azure-cli.md +++ b/content/install-guides/azure-cli.md @@ -2,16 +2,15 @@ additional_search_terms: - cloud - azure -- +author_primary: Jason Andrews layout: installtoolsall minutes_to_complete: 15 -author_primary: Jason Andrews multi_install: false multitool_install_part: false official_docs: https://learn.microsoft.com/en-us/cli/azure test_images: - ubuntu:latest -test_link: https://github.com/armflorentlebeau/arm-learning-paths/actions/runs/4312122327 +test_link: null test_maintenance: true test_status: - passed @@ -34,7 +33,7 @@ This article provides a quick solution to install Azure CLI for Ubuntu on Arm. Confirm you are using an Arm machine by running: -```bash +```bash { target="ubuntu:latest" } uname -m ``` @@ -69,4 +68,4 @@ source $HOME/.profile az version ``` -After a successful log in, you can use the [Azure CLI](../azure-cli) and automation tools like [Terraform](../terraform) from the terminal. \ No newline at end of file +After a successful log in, you can use the [Azure CLI](../azure-cli) and automation tools like [Terraform](../terraform) from the terminal. 
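A note for reviewers of these install-guide hunks: several of the files standardize on a `uname -m` check before installing Arm packages. That recurring pattern can be sketched as a small script (variable names here are illustrative, not part of any guide):

```shell
#!/bin/sh
# Sketch of the architecture check these install guides standardize on.
# The guides run `uname -m` and expect "aarch64" on Arm Linux; this wraps
# that check so a script can fail fast on the wrong architecture.
arch=$(uname -m)
case "$arch" in
  aarch64|arm64)
    msg="Arm machine detected: $arch"
    ;;
  *)
    msg="Not an Arm machine ($arch); Arm-only packages will not run."
    ;;
esac
echo "$msg"
```

The `target="ubuntu:latest"` annotations being added and removed in these hunks control which of these command blocks the automated tests execute in a container.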
diff --git a/content/install-guides/eksctl.md b/content/install-guides/eksctl.md index 812829ae3..f5f6cd367 100644 --- a/content/install-guides/eksctl.md +++ b/content/install-guides/eksctl.md @@ -2,18 +2,17 @@ additional_search_terms: - kubernetes - EKS -- AWS +- AWS - infrastructure - +author_primary: Jason Andrews layout: installtoolsall minutes_to_complete: 5 -author_primary: Jason Andrews multi_install: false multitool_install_part: false official_docs: https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html test_images: - ubuntu:latest - +test_link: null test_maintenance: true test_status: - passed @@ -34,7 +33,7 @@ This install guide provides a quick solution to install `eksctl` on Arm Linux an For Linux, confirm you are using an Arm machine by running: -```bash +```bash { target="ubuntu:latest" } uname -m ``` diff --git a/content/install-guides/forge.md b/content/install-guides/forge.md index e5a7c49d1..295bdab6d 100644 --- a/content/install-guides/forge.md +++ b/content/install-guides/forge.md @@ -1,39 +1,25 @@ --- -### Title the install tools article with the name of the tool to be installed -### Include vendor name where appropriate -title: Linaro Forge - -### Optional additional search terms (one per line) to assist in finding the article additional_search_terms: - - forge - - ddt - - map - - performance reports - - allinea - -### Estimated completion time in minutes (please use integer multiple of 5) -minutes_to_complete: 15 - +- forge +- ddt +- map +- performance reports +- allinea author_primary: Florent Lebeau - -### Link to official documentation +layout: installtoolsall +minutes_to_complete: 15 +multi_install: false +multitool_install_part: false official_docs: https://www.linaroforge.com/documentation/ - -### test_automation test_images: - ubuntu:latest test_link: null test_maintenance: true test_status: - passed - - -### PAGE SETUP -weight: 1 # Defines page ordering. Must be 1 for first (or only) page. 
-tool_install: true # Set to true to be listed in main selection page, else false -multi_install: false # Set to true if first page of multi-page article, else false -multitool_install_part: false # Set to true if a sub-page of a multi-page article, else false -layout: installtoolsall # DO NOT MODIFY. Always true for tool install articles +title: Linaro Forge +tool_install: true +weight: 1 --- [Linaro Forge](https://www.linaroforge.com/) is a server and HPC development tool suite for C, C++, Fortran, and Python high performance code on Linux. @@ -57,7 +43,6 @@ Download and extract the appropriate installation package from [Linaro Forge Dow sudo apt install wget wget https://downloads.linaroforge.com/23.0/linaro-forge-23.0-linux-aarch64.tar tar -xf linaro-forge-23.0-linux-aarch64.tar -cd linaro-forge-23.0-linux-aarch64 ``` ## Installation @@ -65,14 +50,17 @@ cd linaro-forge-23.0-linux-aarch64 ### Linux host Run the installer from the command line with: -``` + +```console ./textinstall.sh [--accept-license] [install_dir] ``` + If no install directory is specified, you will be prompted to specify this while the installer runs. To install to the default directory, non-interactively: + ```bash { target="ubuntu:latest" } -./textinstall.sh --accept-license /home/ubuntu/linaro/forge/23.0 +linaro-forge-23.0-linux-aarch64/textinstall.sh --accept-license /home/ubuntu/linaro/forge/23.0 ``` ### Install on MacOS (remote client only) @@ -103,25 +91,27 @@ You should turn off compiler optimizations as they can produce unexpected result Linaro Forge's debugging tool, Linaro DDT, can be launched with the `ddt` command. For MPI applications, you can prefix the mpirun/mpiexec command normally used to run in parallel: -```bash +```console ddt mpirun -n 128 myapp ``` This startup method is called *Express Launch* and is the simplest way to get started. 
If your MPI is not supported by *Express Launch*, you can run the following instead: -```bash +```console ddt -n 128 myapp ``` These commands will launch Linaro DDT GUI. When running on a HPC cluster, you may need to debug on compute nodes where this may not be possible. In this case, you can start the GUI on the frontend node with the `ddt` command and when running or submitting a job to the compute nodes use `ddt --connect` : With *Express Launch*: -```bash + +```console ddt --connect mpirun -n 128 myapp ``` Without *Express Launch*: -```bash + +```console ddt --connect -n 128 myapp ``` @@ -136,17 +126,20 @@ Typically you should keep optimization flags enabled when profiling (rather than Linaro Forge's profiling tool, Linaro MAP, can be launched with the `map` command to launch the GUI. When running on a HPC cluster with MPI, you should use `map --profile` when running or submitting a job to the compute nodes: With *Express Launch*: -```bash + +```console map --profile mpirun -n 128 myapp ``` Without *Express Launch*: -```bash + +```console map --profile -n 128 myapp ``` A *.map file will be created in the current directory with profiling results when the application terminates. This file can be then open from the GUI launched on the frontend node or with the following command: -```bash + +```console map myapp_128p_.map ``` @@ -157,12 +150,14 @@ Linaro Forge's reporting tool Linaro Performance Reports is designed to run on u Linaro Performance Reports does not use a GUI. Instead, it produces HTML and TXT files when the application terminates to summarize the application behavior. Here is how to use the tool on MPI applications With *Express Launch*: -```bash + +```console perf-report mpirun -n 128 myapp ``` Without *Express Launch*: -```bash + +```console perf-report -n 128 myapp ``` Two files `myapp_128p_.html` and `myapp_128p_.txt` will be created in the current directory. 
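The two launch forms described in the Forge hunks above (with and without *Express Launch*) can be sketched as a wrapper. This helper is hypothetical, not part of Linaro Forge; it only prints the command it would run, so neither `ddt` nor MPI needs to be installed:

```shell
#!/bin/sh
# Hypothetical wrapper illustrating the two launch forms described above.
# TOOL, NP, and APP are placeholders mirroring the guide's examples.
TOOL=${TOOL:-ddt}
NP=${NP:-128}
APP=${APP:-./myapp}

if command -v mpirun >/dev/null 2>&1; then
  # Express Launch: prefix the usual MPI command line with the Forge tool.
  cmd="$TOOL mpirun -n $NP $APP"
else
  # Fallback form for MPIs not supported by Express Launch.
  cmd="$TOOL -n $NP $APP"
fi
echo "$cmd"
```

The same shape applies to `map --profile` and `perf-report`, which the later hunks document.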
diff --git a/content/install-guides/gcc/arm-gnu.md b/content/install-guides/gcc/arm-gnu.md index 1d2190f43..fb5ffdcfc 100644 --- a/content/install-guides/gcc/arm-gnu.md +++ b/content/install-guides/gcc/arm-gnu.md @@ -10,7 +10,7 @@ official_docs: https://gcc.gnu.org/onlinedocs/ test_images: - ubuntu:latest - fedora:latest -test_link: https://github.com/armflorentlebeau/arm-learning-paths/actions/runs/4312122327 +test_link: null test_maintenance: true test_status: - passed diff --git a/content/install-guides/gcc/native.md b/content/install-guides/gcc/native.md index 738721e2a..b6f28520b 100644 --- a/content/install-guides/gcc/native.md +++ b/content/install-guides/gcc/native.md @@ -10,7 +10,7 @@ official_docs: https://gcc.gnu.org/onlinedocs/ test_images: - ubuntu:latest - fedora:latest -test_link: https://github.com/armflorentlebeau/arm-learning-paths/actions/runs/4312122327 +test_link: null test_maintenance: true test_status: - passed diff --git a/content/install-guides/gfortran.md b/content/install-guides/gfortran.md index 273b01ced..5f664e3c4 100644 --- a/content/install-guides/gfortran.md +++ b/content/install-guides/gfortran.md @@ -16,11 +16,10 @@ official_docs: https://gcc.gnu.org/onlinedocs/gfortran/ test_images: - ubuntu:latest - fedora:latest -test_link: https://github.com/armflorentlebeau/arm-learning-paths/actions/runs/4312122327 +test_link: null test_maintenance: true test_status: - passed -- passed title: GFortran tool_install: true weight: 1 diff --git a/content/install-guides/go.md b/content/install-guides/go.md index 0f481946f..6e18606ca 100644 --- a/content/install-guides/go.md +++ b/content/install-guides/go.md @@ -12,7 +12,7 @@ multitool_install_part: false official_docs: https://go.dev/doc/ test_images: - ubuntu:latest -test_maintenance: false +test_maintenance: true title: Go tool_install: true weight: 1 diff --git a/content/install-guides/kubectl.md b/content/install-guides/kubectl.md index 590d3c417..c1e5da852 100644 --- 
a/content/install-guides/kubectl.md
+++ b/content/install-guides/kubectl.md
@@ -15,7 +15,7 @@ multitool_install_part: false
 official_docs: https://kubernetes.io/docs/reference/kubectl
 test_images:
 - ubuntu:latest
-test_link: https://github.com/armflorentlebeau/arm-learning-paths/actions/runs/4312122327
+test_link: null
 test_maintenance: true
 test_status:
 - passed
diff --git a/content/install-guides/nomachine.md b/content/install-guides/nomachine.md
index aff618d6d..1887967e6 100644
--- a/content/install-guides/nomachine.md
+++ b/content/install-guides/nomachine.md
@@ -13,9 +13,7 @@ official_docs: https://www.nomachine.com/all-documents
 test_images:
 - ubuntu:latest
 test_link: null
-test_maintenance: true
-test_status:
-- passed
+test_maintenance: false
 title: NoMachine
 tool_install: true
 weight: 1
diff --git a/content/install-guides/oci-cli.md b/content/install-guides/oci-cli.md
index 27570e6dd..b2a992263 100644
--- a/content/install-guides/oci-cli.md
+++ b/content/install-guides/oci-cli.md
@@ -13,7 +13,6 @@ multitool_install_part: false
 official_docs: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm
 test_images:
 - ubuntu:latest
-
 test_maintenance: true
 test_status:
 - passed
diff --git a/content/install-guides/terraform.md b/content/install-guides/terraform.md
index d72d7eee9..429327f59 100644
--- a/content/install-guides/terraform.md
+++ b/content/install-guides/terraform.md
@@ -14,7 +14,7 @@ multitool_install_part: false
 official_docs: https://developer.hashicorp.com/terraform/docs
 test_images:
 - ubuntu:latest
-test_link: https://github.com/armflorentlebeau/arm-learning-paths/actions/runs/4312122327
+test_link: null
 test_maintenance: true
 test_status:
 - passed

From 11612728f9d4abf804bf03d80bee30d83d4a7e4b Mon Sep 17 00:00:00 2001
From: pareenaverma
Date: Wed, 17 Jan 2024 15:37:06 +0000
Subject: [PATCH 3/4] Fixed issue 669

---
 .../laptops-and-desktops/win_asp_net8/how-to-2.md | 1 +
 .../microcontrollers/vcpkg-tool-installation/usage.md | 6
------ 2 files changed, 1 insertion(+), 6 deletions(-) diff --git a/content/learning-paths/laptops-and-desktops/win_asp_net8/how-to-2.md b/content/learning-paths/laptops-and-desktops/win_asp_net8/how-to-2.md index 646373955..f68b6a69d 100644 --- a/content/learning-paths/laptops-and-desktops/win_asp_net8/how-to-2.md +++ b/content/learning-paths/laptops-and-desktops/win_asp_net8/how-to-2.md @@ -44,6 +44,7 @@ Here the architecture is **ARM64**. ```console dotnet run -a x64 ``` +{{% /notice %}} ## Test the GET request diff --git a/content/learning-paths/microcontrollers/vcpkg-tool-installation/usage.md b/content/learning-paths/microcontrollers/vcpkg-tool-installation/usage.md index 691b42a53..eafd230b0 100644 --- a/content/learning-paths/microcontrollers/vcpkg-tool-installation/usage.md +++ b/content/learning-paths/microcontrollers/vcpkg-tool-installation/usage.md @@ -25,12 +25,6 @@ microsoft:tools/kitware/cmake 3.25.2 installed Kitware's c microsoft:tools/ninja-build/ninja 1.10.2 installed Ninja is a small build system with a focus on speed. 
``` -#### Activate tools using a vcpkg-configuration.json file elsewhere - -```bash - vcpkg activate --json=../my-config.json -``` - #### Deactivate artifacts specified by vcpkg-configuration.json ```bash { output_lines = "2-4" } From 4f01a459604ffe2cfcddb10de9961912ff2b0f58 Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Wed, 17 Jan 2024 16:05:19 +0000 Subject: [PATCH 4/4] Fixed tab formatting in MongoDB LP --- .../servers-and-cloud-computing/mongodb/benchmark_mongodb.md | 4 ++-- .../servers-and-cloud-computing/mongodb/perf_mongodb.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/mongodb/benchmark_mongodb.md b/content/learning-paths/servers-and-cloud-computing/mongodb/benchmark_mongodb.md index 608f01171..1aa3f607c 100644 --- a/content/learning-paths/servers-and-cloud-computing/mongodb/benchmark_mongodb.md +++ b/content/learning-paths/servers-and-cloud-computing/mongodb/benchmark_mongodb.md @@ -29,7 +29,7 @@ Installing Apache Maven: ``` Installing Python 2.7: -``` + {{< tabpane code=true >}} {{< tab header="Ubuntu" >}} sudo apt-get update @@ -41,7 +41,7 @@ sudo yum install python2 {{< /tab >}} {{< /tabpane >}} {{% notice Python Note%}} -``` + For Ubuntu 22.04 the `python` package may not be found. You can install Python 2.7 using: ```console sudo apt install python2 -y diff --git a/content/learning-paths/servers-and-cloud-computing/mongodb/perf_mongodb.md b/content/learning-paths/servers-and-cloud-computing/mongodb/perf_mongodb.md index 5c5f824c8..1e861267f 100644 --- a/content/learning-paths/servers-and-cloud-computing/mongodb/perf_mongodb.md +++ b/content/learning-paths/servers-and-cloud-computing/mongodb/perf_mongodb.md @@ -13,7 +13,7 @@ This is an open source Java application that tests the MongoDB performance, such ## Install OpenJDK packages Install the appropriate run-time environment to be able to use the performance test tool. 
-``` + {{< tabpane code=true >}} {{< tab header="Ubuntu" >}} sudo apt-get install -y openjdk-18-jre @@ -22,7 +22,7 @@ sudo apt-get install -y openjdk-18-jre sudo yum install java-17-openjdk {{< /tab >}} {{< /tabpane >}} -``` + For more information see the [OpenJDK](https://openjdk.org/install/) website. ## Setup the MongoDB performance test tool
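For reference, the two MongoDB hunks above make the same fix: stray ``` fences wrapping Hugo shortcodes are removed. Reconstructed from the context and added lines, the corrected OpenJDK install section reads:

```md
## Install OpenJDK packages

Install the appropriate run-time environment to be able to use the performance test tool.

{{< tabpane code=true >}}
{{< tab header="Ubuntu" >}}
sudo apt-get install -y openjdk-18-jre
{{< /tab >}}
{{< tab header="Fedora" >}}
sudo yum install java-17-openjdk
{{< /tab >}}
{{< /tabpane >}}

For more information see the [OpenJDK](https://openjdk.org/install/) website.
```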