This program sorts arrays of unsigned long ints using a radix sort algorithm written in x64 MASM assembly. The bucket count / radix of the sort is 2. For calculating the prefix sum I used 8 AVX 256 bit registers, which can process 8 unsigned long ints at once. Because I am using a radix of 2, I make use of the popcnt instruction, which counts the amounts of 1 bits in a register.
Here are some timings of the algorithm done on an AMD Ryzen 7 2700X - note that the time it takes for the helper array to be allocated is not considered:
Array length | Time in ms |
---|---|
1000000000 | 79100 |
750000000 | 59979 |
500000000 | 39483 |
250000000 | 19688 |
---- | ---- |
100000000 | 7940 |
75000000 | 6019 |
50000000 | 4055 |
25000000 | 2031 |
---- | ---- |
10000000 | 813 |
7500000 | 599 |
5000000 | 419 |
2500000 | 201 |
---- | ---- |
1000000 | 84 |
750000 | 62 |
500000 | 45 |
250000 | 23 |
---- | ---- |
100000 | 11 |
75000 | 8 |
50000 | 6 |
25000 | 3 |
As can be seen, the timings scale linearly with the array size.
Unfortunately this program performs quite badly when compared to a radix sort with a bucket count of 256. I'd assume, that the thing, which holds this algorithm down, is that the entire array needs to be reordered 32 times. A 256 bucket radix sort only needs to reorder the array 4 times.