Commit

Merge pull request #557 from lizwar/main
Editorial amends - additional content required for this LP
pareenaverma authored Nov 2, 2023
2 parents 6811b5a + 5db2e0b commit 41afe01
Showing 6 changed files with 49 additions and 52 deletions.
Original file line number Diff line number Diff line change
@@ -1,15 +1,15 @@
---
title: restrict keyword in C99
title: Understand the `restrict` keyword in C99

minutes_to_complete: 30

who_is_this_for: C developers who are interested in software optimization

learning_objectives:
- Learn the importance of using 'restrict' keyword in C correctly
- Learn the importance of using the `restrict` keyword in C correctly

prerequisites:
- An Arm based system with Linux OS and recent compiler (clang or gcc)
- An Arm based system with Linux OS and recent compiler (Clang or GCC)

author_primary: Konstantinos Margaritis, VectorCamp

@@ -1,15 +1,16 @@
---
next_step_guidance: You should now be able to test the `restrict` keyword on your own or other open-source code and discover potential optimizations!
next_step_guidance: You should now be able to test the `restrict` keyword in your own code. Why not explore these other embedded software learning paths?

recommended_path: /learning-paths/embedded-systems/

further_reading:
- resource:
title: Wikipedia restrict entry
link: https://en.wikipedia.org/wiki/Restrict
type: documentation
title: How to use the restrict qualifier in C
link: https://www.oracle.com/solaris/technologies/solaris10-cc-restrict.html
type: blog

- resource:
title: Godbolt restrict tests
title: Explore the usage of restrict with Godbolt
link: https://godbolt.org/z/PxWxjc1oh
type: website

@@ -1,6 +1,19 @@
---
review:
- questions:
- questions:
question: >
What does `restrict` do?
answers:
- It increases the frequency of the CPU cores, making your program run faster
- It issues a command to clear the cache, leaving more space for your program
- It restricts the standard of the C library used to C99
- It hints to the compiler that the memory pointed to by the parameter cannot be accessed by any other means inside a particular function except using this pointer
correct_answer: 4
explanation: >
In order for the compiler to better schedule the instructions of a function, it needs to know if there are any
dependencies between the parameter variables. If there is no dependency, usually the compiler can group together instructions
increasing performance and efficiency.
- questions:
question: >
Where is `restrict` placed in the code?
answers:
@@ -10,22 +23,8 @@ review:
correct_answer: 3
explanation: >
`restrict` is placed in the arguments list of a function, between the * and the parameter name, like this:
`int func(char *restrict arg)`
- questions:
question: >
What does `restrict` do?
answers:
- It increases the frequency of the CPU cores, making your program run faster
- It issues a command to clear the cache, leaving more room for your program
- It restricts the standard of the C library used to C99
- It hints to the compiler that the memory pointed to by the parameter, cannot be accessed through any other means inside the particular function except, using this pointer
correct_answer: 4
explanation: >
In order for the compiler to better schedule the instructions of a function, it needs to know if there is any
dependency between the parameter variables. If there is no dependency, usually the compiler can group together instructions
increasing performance and efficiency.
- questions:
`int func(char *restrict arg)`
- questions:
question: >
Which language supports `restrict`
answers:
@@ -35,7 +34,7 @@ review:
- Rust
correct_answer: 3
explanation: >
`restrict` is a C-only keyword, it does not exist on C++ (`__restrict__` does, but it is not exactly the same)
`restrict` is a C-only keyword; it does not exist in C++ (GCC and Clang accept `__restrict__` in C++ as an extension, but it is not exactly the same)
@@ -18,7 +18,7 @@ void process_data (const char *in, char *out, size_t size)
}
```
This example will be easier to demonstrate with SVE2, and we found gcc 13 to have a better result than clang, this is the output of `gcc-13 -O3 -march=armv9-a`:
This example will be easier to demonstrate with SVE2. We found gcc 13 to have a better result than clang; this is the output of `gcc-13 -O3 -march=armv9-a`:
```
process_data:
@@ -55,7 +55,7 @@ process_data:
ret
```
Do not worry about each instruction in the assembly here, but notice that gcc has added 2 loops, one that uses the SVE2 `while*` instructions to the processing (.L4) and one scalar loop (.L3). The latter is executed in case theis any pointer aliasing -if there is any overlap between the memory pointers basically. Let's try adding `restrict` to pointer `in`:
Do not worry about each instruction in the assembly here, but notice that gcc has added 2 loops: one that uses the SVE2 `while*` instructions to do the processing (.L4) and one scalar loop (.L3). The latter is executed in case there is any pointer aliasing (that is, if there is any overlap between the memory the pointers refer to). Let's try adding `restrict` to pointer `in`:
```C
void process_data (const char *restrict in, char *out, size_t size)
@@ -85,11 +85,10 @@ process_data:
ret
```

This is a huge improvement! Code size reduction is down from 30 lines to 14, less than half the original size. In both cases, you will note that the main loop (`.L4` in the former case, `.L3` in the latter) is exactly the same, but the entry and exit code of the function are very much simplified. The compiler was able to distinguish that the memory pointed by `in` does not overlap with memory pointed by `out`, it was able to simplify the code by eliminating the scalar loop and remove the associated code that checked if it needed to enter it.
This is a huge improvement! The code size is down from 30 lines to 14, less than half of the original size. In both cases, note that the main loop (`.L4` in the former case, `.L3` in the latter) is exactly the same, but the entry and exit code of the function is much simpler. Because the compiler could assume that the memory pointed to by `in` does not overlap with the memory pointed to by `out`, it was able to eliminate the scalar loop and remove the associated code that checked whether it needed to enter it.

But I can almost hear the question: "Why is that important if the main loop is still the same?"
And it is a right question. The answer is this:
Why is this important if the main loop is still the same?

If your function is going to be called once and run over tens of billions of elements, then saving a few instructions before and after the main loop does not really matter.

But if your function is called on smaller sizes millions or even *billions* of times, then saving a few instructions in this function means we are saving a few *billions* of instructions total, which means less time to spend running on the CPU and less energy wasted.
But if your function is going to be called on smaller sizes millions or even *billions* of times, then saving a few instructions in this function means we are saving a few *billions* of instructions in total, which means less time spent running on the CPU and less energy wasted.
@@ -6,9 +6,9 @@ weight: 2
layout: learningpathall
---

## The problem: Overlapping memory regions as pointer arguments
## The problem: overlapping memory regions as pointer arguments

Before we go into detail of the `restrict` keyword, let's first demonstrate the problem.
Before we go into the detail of the `restrict` keyword, let's first demonstrate the problem.

Let's consider this C code:
```C
@@ -44,7 +44,7 @@ int main() {
}
```
So, there are 2 points to make here:
There are 2 points to make here:
1. `scaleVectors()` is the important function here, it scales two vectors by the same scalefactor `*C`
2. vector `a` overlaps with vector `b`. (`b = &a[2]`).
@@ -56,7 +56,7 @@ a(after) : 2 4 12 16
b(after) : 12 16 10 12
```
Notice that after the scaling the contents of `a` are also affected by the scaling of `b` as their elements overlap in memory.
Notice that after the scaling, the contents of `a` are also affected by the scaling of `b` as their elements overlap in memory.
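As a rough illustration of the aliasing effect described above, here is a minimal sketch. The function name `scale_both` and its signature are hypothetical stand-ins, not the learning path's own `scaleVectors` source, and the input values mentioned below are an assumption chosen to be consistent with the printed output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the scaling function discussed above:
 * multiplies the first n elements of both vectors by the factor *c.
 * No `restrict` is used, so a and b are allowed to overlap. */
void scale_both(int64_t *a, int64_t *b, const int64_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] *= *c;   /* may modify memory that b also points into */
        b[i] *= *c;   /* may modify memory that a also points into */
    }
}
```

Under these assumptions, calling `scale_both(buf, &buf[2], &two, 4)` on a six-element buffer holding 1 to 6 with `two = 2` leaves the buffer as `2 4 12 16 10 12`: elements 2 and 3 are scaled twice because both pointers reach them.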
We will include the assembly output of `scaleVectors` as produced by `clang-17 -O3`:
@@ -97,18 +97,18 @@ scaleVectors: // @scaleVectors
ret
```
This doesn't look optimal. `scaleVectors` seems to be doing each load, multiplication, store in sequence, surely it can be further optimized? This is because the memory pointers are overlapping, let's try different assignments of `a` and `b` in `main()` to make them explicitly independent, perhaps the compiler can detect that and generate faster instructions to do the same thing.
This doesn't look optimal. `scaleVectors` seems to be doing each load, multiplication, and store in sequence. Surely it can be better optimized? Because the memory pointers are overlapping, let's try different assignments of `a` and `b` in `main()` to make them explicitly independent. Perhaps the compiler will detect that and generate faster instructions to do the same thing.
```
int64_t a[] = { 1, 2, 3, 4 };
int64_t b[] = { 5, 6, 7, 8 };
```
Unsurprisingly, the disassembled output of `scaleVectors` is the same. The reason for this is that the compiler has no hint of the dependency between the two pointers used in the function so it has no choice than to assume that it has to process one element at a time. The function has no way of knowing with what arguments it is to be called. We see 8 instances of `mul`, which is correct but the number of loads and stores inbetween indicates that the CPU spends its time waiting for data to arrive from/to the cache. We need a way to be able to hint the compiler that it can assume the buffers passed are independent.
Unsurprisingly, the disassembled output of `scaleVectors` is the same. The reason is that the compiler has no hint about the dependency between the two pointers used in the function, so it has no choice but to assume that it has to process one element at a time. The function is compiled without knowing what arguments it will be called with. We see 8 instances of `mul`, which is correct, but the number of loads and stores in between indicates that the CPU spends its time waiting for data to travel to and from the cache. We need a way to tell the compiler that it can assume the buffers passed are independent.
## The Solution: restrict
This is what the C99 `restrict` keyword has come to solve. It instructs the compiler that the passed arguments are in no way dependant on each other and access to the memory of each happens only through the respective pointer. This way the compiler can schedule the instructions in a much better way. In essence it can group and schedule the loads and stores. As a note, `restrict` only works in C, not in C++.
This is what the C99 `restrict` keyword resolves. It tells the compiler that the passed arguments are not dependent on each other and that access to the memory of each happens only through the respective pointer. This way the compiler can schedule the instructions much more efficiently; essentially, it can group and schedule the loads and stores. **Note:** `restrict` only works in C, not in C++.
Let's add `restrict` to `A` in the parameter list:
```C
@@ -149,7 +149,7 @@ scaleVectors: // @scaleVectors
ret
```

We see an obvious reduction in the number of instructions, from 32 instructions down to 22! That's 68% of the original count, which is impressive on its own. One can easily see that the loads are grouped, as well as the multiplications. Of course, still 8 multiplications, that cannot change, but far fewer loads and stores as the compiler found the opportunity to use `LDP`/`STP` which load/store in pairs for the pointer `A`.
We see an obvious reduction in the number of instructions, from 32 instructions down to 22! That's about 69% of the original count, which is impressive. One can easily see that the loads are grouped, as well as the multiplications. Of course, there are still 8 multiplications as that cannot change, but there are far fewer loads and stores as the compiler found the opportunity to use `LDP`/`STP`, which load/store in pairs, for the pointer `A`.

Let's try adding `restrict` to `B` as well:
```C
@@ -185,14 +185,14 @@ scaleVectors: // @scaleVectors
ret
```
Another reduction in the number of instructions, down to 17, for a total reduction to 53% the original count. This time, only 5 loads and 4 stores. And as before, all the loads/stores are paired (because the `LDP`/`STP` instructions are used).
There is another reduction in the number of instructions, this time down to 17 from the original 32. There are only 5 loads and 4 stores and, as before, all the loads/stores are paired (because the `LDP`/`STP` instructions are used).
It is interesting to see that in such an example, adding just the `restrict` keyword reduced our code size to almost half. This will have an obvious impact in performance and efficiency.
It is interesting to see that, in such an example, adding the `restrict` keyword reduced our code size to almost half. This will have an obvious impact on both performance and efficiency.
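To experiment with this yourself, the sketch below shows one way the qualifier might be written on both vector parameters. The name `scale_both_restrict` and the exact signature are assumptions made for illustration, not the code from this learning path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example. `restrict` on a and b is a promise from the caller:
 * during the call, the elements written through a are accessed only through
 * a, and likewise for b. Breaking that promise is undefined behavior. */
void scale_both_restrict(int64_t *restrict a, int64_t *restrict b,
                         const int64_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] *= *c;   /* the compiler may assume this never touches b or *c */
        b[i] *= *c;   /* the compiler may assume this never touches a or *c */
    }
}
```

It should only be called with buffers that do not overlap, for example two separate arrays; passing `a` and `&a[2]` as in the earlier overlapping case would break the promise.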
## What about SVE2?
We have shown the obvious benefit of `restrict` in this function, on an armv8-a CPU, but we have new armv9-a CPUs out there with SVE2 as well as Neon/ASIMD.
Could the compiler generate better code in that case using `restrict`? To save time, the output without `restrict` is almost the same, however with `restrict` used, this is the result (we used `clang-17 -O3 -march=armv9-a`):
Could the compiler generate better code in that case using `restrict`? The output without `restrict` is almost the same, but with `restrict` used, this is the result (we used `clang-17 -O3 -march=armv9-a`):
```
scaleVectors: // @scaleVectors
@@ -208,6 +208,6 @@ scaleVectors: // @scaleVectors
ret
```
This is just 10 instructions, only 31% of the original code size! The compiler made a great use of SVE2 features, combining the multiplications and reducing them to 4, at the same time grouping loads and stores down to 2 each. We have optimized our code more than 3x by only adding a C99 keyword!
There are just 10 instructions, 31% of the original code size! The compiler has made great use of the SVE2 features, combining the multiplications and reducing them to 4 and, at the same time, grouping loads and stores down to 2 each. We have shrunk our code by more than 3x just by adding a C99 keyword.
We are going to look at another example next.
We are now going to look at another example.
@@ -6,13 +6,11 @@ weight: 4
layout: learningpathall
---

## So, when can we use restrict?
When can we use `restrict` or, put differently, how do we recognize that we need `restrict` in our code?

This is all very good, but when can we use it? Or put differently, how to recognize we need `restrict` in our code?
`restrict` as a pointer attribute is rather easy to test. As a rule of thumb, if the function includes one or more pointers to memory objects as arguments, we can use `restrict` if we are certain that the memory pointed to by these pointer arguments does not overlap and there is no other way to access them in the body of the function, except by the use of those pointers, i.e., there is no other global pointer or some other indirect way to access these elements.

`restrict` as a pointer attribute is rather easy to test. As a rule of thumb, if our function includes one or more pointers to memory objects as arguments, we can use `restrict` if we are certain that the memory pointed by those pointer arguments does not overlap and there is no other way to access it in the body of the function, except by the use of those pointers -eg. there is no other global pointer, or some other indirect way to access these elements.

Let's show a coutner-example:
Let's show a counterexample:

```
int A[10];
@@ -30,6 +28,6 @@ int main() {
}
```

This example does not not benefit from `restrict` at all in both gcc and clang.
This example does not benefit from `restrict` in either gcc or clang.

However, there are plenty of cases that are candidates for the `restrict` optimization. And it's safe and easy to try. Nevertheless, even if it looks like a good candidate, it is still possible that the compiler will not detect a pattern that is suited for optimization and we might not see any reduction in the code or speed gain. It is up to the compiler, some cases clang handles better or differently than gcc, and vice versa, and that even depends on the version. If you have a particular piece of code that falls in the above criteria that you would care to optimize, before you attempt to refactor it completely, or rewrite it in assembly or use any SIMD instructions, it might be worth a shot to try `restrict`. Even saving a couple of instructions in a critical loop function is worth having to add just one keyword!
However, there are plenty of cases that are candidates for the `restrict` optimization. It's easy to try, and safe as long as the no-overlap condition above holds, but even if the code looks like a good candidate, it is still possible that the compiler will not detect a pattern that is suited for optimization, and we might not see any reduction in the code or speed gain. It is up to the compiler; in some cases clang handles this better or differently from gcc, and vice versa, and this will also depend on the version. If you have a particular piece of code that you would like to optimize, before you attempt to refactor it completely, rewrite it in assembly, or use any SIMD instructions, it might be worth trying `restrict`. Saving even a couple of instructions in a critical loop function is worth it when all it takes is adding one keyword.
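One cautious way to apply the rule of thumb above is to check the no-overlap condition in debug builds before calling a `restrict`-qualified function. The following is a minimal sketch; `ranges_overlap`, `add_arrays` and `add_arrays_checked` are hypothetical names, and the address comparison assumes a flat address space where converting pointers to `uintptr_t` preserves their ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the byte ranges [p, p + len_p) and
 * [q, q + len_q) share at least one byte. Assumes a flat address space. */
static bool ranges_overlap(const void *p, size_t len_p,
                           const void *q, size_t len_q)
{
    uintptr_t ps = (uintptr_t)p;
    uintptr_t qs = (uintptr_t)q;
    return ps < qs + len_q && qs < ps + len_p;
}

/* Hypothetical restrict-qualified worker: adds src[i] into dst[i]. */
void add_arrays(int *restrict dst, const int *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Wrapper that verifies the caller's promise before relying on it.
 * The assert is compiled out when NDEBUG is defined. */
void add_arrays_checked(int *dst, const int *src, size_t n)
{
    assert(!ranges_overlap(dst, n * sizeof *dst, src, n * sizeof *src));
    add_arrays(dst, src, n);
}
```

The check costs a few instructions per call, so a sketch like this is mainly useful while testing; release builds would typically define `NDEBUG` and call the `restrict` version directly.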
