Commit

fixed explanations according to comments
markos committed Oct 25, 2023
1 parent b991ad1 commit cc6c2b4
Showing 2 changed files with 5 additions and 5 deletions.
@@ -55,7 +55,7 @@ process_data:
ret
```
-Do not worry about each instruction in the assembly here, but notice that gcc correctly uses the SVE2 `while*` instructions to do the loops, resulting in far smaller code than with Neon. But in order to illustrate our point, let's try adding `restrict` to pointer `in`:
+Do not worry about each instruction in the assembly here, but notice that gcc has added 2 loops, one that uses the SVE2 `while*` instructions to do the processing (`.L4`) and one scalar loop (`.L3`). The latter is executed in case there is any pointer aliasing, that is, if there is any overlap between the memory pointers. Let's try adding `restrict` to pointer `in`:
```C
void process_data (const char *restrict in, char *out, size_t size)
@@ -85,7 +85,7 @@ process_data:
ret
```

-This is a huge improvement! Code size reduction is down from 30 lines to 14, less than half the original size, and faster too. In both cases, you will note that the main loop `.L3` is exactly the same, but the entry and exit code of the function are very much simplified, because the compiler was able to distinguish that the memory pointed by `in` does not overlap with memory pointed by `out`, it was able to simplify the conditions for entering and exiting the main loop.
+This is a huge improvement! Code size is down from 30 lines to 14, less than half the original size. In both cases, you will note that the main loop (`.L4` in the former case, `.L3` in the latter) is exactly the same, but the entry and exit code of the function are very much simplified. Because the compiler could tell that the memory pointed to by `in` does not overlap with the memory pointed to by `out`, it was able to simplify the code by eliminating the scalar loop.
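
To see why the compiler cannot simply drop the scalar loop on its own, here is a minimal sketch. The body of `process_data` is not shown in this diff, so `add_one`, `add_one_restrict` and the two helper functions are hypothetical stand-ins chosen for illustration. When `out` points one byte ahead of `in` in the same buffer, every store feeds the next load, so a loop that loaded a whole vector of inputs before storing would likely produce a different result from the element-by-element loop.

```c
#include <stddef.h>

/* Hypothetical stand-in for process_data (the real body is not shown in
   this diff): write each input byte plus one to the output. */
void add_one(const char *in, char *out, size_t size)
{
    for (size_t i = 0; i < size; i++)
        out[i] = (char)(in[i] + 1);
}

/* Same loop, but the caller promises that nothing read through `in` is
   modified through another pointer during the call. Calling this with
   overlapping buffers is undefined behaviour. */
void add_one_restrict(const char *restrict in, char *out, size_t size)
{
    for (size_t i = 0; i < size; i++)
        out[i] = (char)(in[i] + 1);
}

/* Illustrative helper: `out` starts one byte after `in` in the same
   buffer, so each store is read back by the next iteration.
   buf goes from {1,0,0,0,0} to {1,2,3,4,5}; the last byte is returned. */
int overlapping_last_byte(void)
{
    char buf[5] = { 1, 0, 0, 0, 0 };
    add_one(buf, buf + 1, 4);
    return buf[4];
}

/* Illustrative helper: separate buffers, so the restrict promise holds.
   out becomes {2,1,1,1}; the last byte is returned. */
int separate_last_byte(void)
{
    const char in[4] = { 1, 0, 0, 0 };
    char out[4] = { 0, 0, 0, 0 };
    add_one_restrict(in, out, 4);
    return out[3];
}
```

In the overlapping case the sequential semantics must be preserved, which is what the scalar fallback is for. With `restrict` on `in`, that case is ruled out by contract, so only the vector loop is needed.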

But I can almost hear the question: "Why is that important if the main loop is still the same?"
And it is the right question. The answer is this:
@@ -97,7 +97,7 @@ scaleVectors: // @scaleVectors
ret
```
-This doesn't look optimal. `scaleVectors` seems to be doing each load, multiplication, store in sequence, surely it can be further optimized? This is because the memory pointers are overlapping, let's try different assignments of `a` and `b` in `main()` to make them explicitly independent, perhaps the compiler can detect that and better schedule the instructions.
+This doesn't look optimal. `scaleVectors` seems to be doing each load, multiplication, and store in sequence; surely it can be further optimized? This is because the memory pointers are overlapping. Let's try different assignments of `a` and `b` in `main()` to make them explicitly independent; perhaps the compiler can detect that and generate faster instructions to do the same thing.
```
int64_t a[] = { 1, 2, 3, 4 };
@@ -120,7 +120,7 @@ void scaleVectors(int64_t *restrict A, int64_t *B, int64_t *C) {
}
```

-This is the assembly output with `clang-17` (gcc has a similar output):
+This is the assembly output with `clang-17 -O3` (gcc has a similar output):

```assembly
scaleVectors: // @scaleVectors
@@ -161,7 +161,7 @@ void scaleVectors(int64_t *restrict A, int64_t *restrict B, int64_t *C) {
}
```
-And the assembly output with `clang-17`:
+And the assembly output with `clang-17 -O3`:
```
scaleVectors: // @scaleVectors