Language C++----Questions and Answers---Page 4
Question - Why is processing a sorted array faster than an unsorted array?
Here is a piece of Visual C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.
The reason why the performance improves drastically when the data is sorted is that the branch prediction penalty is removed.
Now, if we look at the code
In
On an Intel Core i7-2600K @ 3.4GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):
x86
Now let's look more closely by investigating at the
So why can a conditional move perform better?
In a typical
In a branch case, the following instruction is determined by the preceding one, so we can not do pipelining. We have to either wait or predict.
In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like
Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between branch and conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.
Here is a piece of Visual C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! With this, the next loop runs faster
std::sort(data, data + arraySize);
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
if (data[c] >= 128)
sum += data[c];
}
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << std::endl;
std::cout << "sum = " << sum << std::endl;
}
- Without
std::sort(data, data + arraySize);
, the code runs in 11.54 seconds. - With the sorted data, the code runs in 1.93 seconds.
The reason why the performance improves drastically when the data is sorted is that the branch prediction penalty is removed.
Now, if we look at the code
if (data[c] >= 128)
sum += data[c];
we can find that the meaning of this particular if... else...
branch is to add something when a condition is satisfied. This type of branch can be easily transformed into a conditional move statement, which
would be compiled into a conditional move instruction: cmovl
, in an x86
system. The branch and thus the potential branch prediction penalty is removed.In
C
, thus C++
, the statement, which would compile directly (without any optimization) into the conditional move instruction in x86
, is the ternary operator ... ? ... : ...
.
So we rewrite the above statement into an equivalent one:sum += data[c] >=128 ? data[c] : 0;
While maintaining readability, we can check the speedup factor.On an Intel Core i7-2600K @ 3.4GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):
x86
// Branch - Random
seconds = 8.885
// Branch - Sorted
seconds = 1.528
// Branchless - Random
seconds = 3.716
// Branchless - Sorted
seconds = 3.71
x64// Branch - Random
seconds = 11.302
// Branch - Sorted
seconds = 1.830
// Branchless - Random
seconds = 2.736
// Branchless - Sorted
seconds = 2.737
The result is robust in multiple tests. We get great speedup when the
branch result is unpredictable, but we suffer a little bit when it is
predictable. In fact, when using a conditional move, the performance is
the same regardless of the data pattern.Now let's look more closely by investigating at the
x86
assembly they generate. For simplicity, we use two functions max1
and max2
.max1
uses the conditional branch if... else ...
:int max1(int a, int b) {
if (a > b)
return a;
else
return b;
}
max2
uses the ternary operator ... ? ... : ...
:int max2(int a, int b) {
return a > b ? a : b;
}
On a x86-64 machine, GCC -S
generates the assembly below.:max1
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
cmpl -8(%rbp), %eax
jle .L2
movl -4(%rbp), %eax
movl %eax, -12(%rbp)
jmp .L4
.L2:
movl -8(%rbp), %eax
movl %eax, -12(%rbp)
.L4:
movl -12(%rbp), %eax
leave
ret
:max2
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
cmpl %eax, -8(%rbp)
cmovge -8(%rbp), %eax
leave
ret
max2
uses much less code due to the usage of instruction cmovge
. But the real gain is that max2
does not involve branch jumps, jmp
, which would have a significant performance penalty
if the predicted result is not right.So why can a conditional move perform better?
In a typical
x86
processor, the execution of an
instruction is divided to several stages. Roughly, we have different
hardware to deal with different stages. So we do not have to wait for
one instruction to finish to start a new one.
This is called pipelining.In a branch case, the following instruction is determined by the preceding one, so we can not do pipelining. We have to either wait or predict.
In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like
Fetch
, Decode
,
does not depend on the result of previous instruction, only latter
stages
need the result. So we wait a fraction of one instruction's execution
time. This is why the conditional move version is slower than the branch
when prediction is easy.Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between branch and conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.
Comments
Post a Comment