For my birthday my sister got me a copy of *From Mathematics to Generic Programming* by Alexander Stepanov and Daniel Rose. I had noticed it on Amazon.com and flagged it on my wish list, then forgotten about it, so it was a nice surprise and yet something that she knew I would be interested in.

The bulk of it is about pure math. It covers the history of math with a small number of continued examples showing how progress was made when domains were generalized. Euclid described an algorithm for Greatest Common Measure which was geometric in nature. Much later *zero* was invented and a couple hundred years later it was realized that a line segment could have zero length, and the Greatest Common Divisor is the same algorithm using division and remainders on whole numbers. Then came algebraic structures and the same algorithm can be applied to Gaussian integers, polynomials, and beyond. Noether then studied the question of what, in the most abstract sense, can the algorithm work on, and described the Euclidean domain (or ring).

Returning to the conventional approach of using integers, the function they present is:

```
//gcd from page 59
template <typename N> // N is some kind of integer
static N gcd(N a, N b)
{
using std::swap;
while (b != N(0)) {
a= a % b;
swap (a,b);
}
return a;
}
```

Pretty mundane in this day and age. But, earlier microcomputers did not have an instruction for division. I recall figuring out how to implement multiplication on an 8-bit computer in the early 1980’s. Even now, I can call up a modulo operation with a single symbol and know it maps to a single machine opcode… but that instruction is not very efficient. I was curious as to just how heavy it is, and found a report giving instruction timing details measured for various architectures. On the Intel Haswell, I saw it was indeed something of an anomaly in terms of how many micro-ops, how many compute ports it sends them to, and how long the latency is. Apparently it is a microcoded routine that works out the division just like you would code in the old days; just built-in and already perfectly optimized.

## Stein’s Algorithm

Later, the book goes on to describe Stein’s algorithm. Written for a computer in 1961, it uses bit-shift and subtract and other very primitive operations which are especially simple on binary encoded integers, with no need for division.

```
template <typename N>
static N gcd(N m, N n)
{
// if (m < N(0)) m= -m;
// if (n < N(0)) n= -n;
if (m == N(0)) return n;
if (n == N(0)) return m;
// m>0 && n>0
int d_m= 0;
while (even(m)) { m>>=1; ++d_m; }
int d_n= 0;
while (even(n)) { n >>=1; ++d_n; }
// both are now odd
while (m != n) {
if (n>m) std::swap (n, m);
m -= n;
do m >>= 1; while (even(m));
}
// m == n
return m << std::min (d_m, d_n);
}
```

How does Stein’s algorithm fare *today*? Sure it avoids the big fat modulo operation, but it has a lot of steps.

Here is a summary of my results. Values are in nanoseconds for one call of the function, having run a trial of 16 million calls. I’m using Microsoft Visual C++ 2015, and running on an Intel Xeon E3-1276 v3 @ 3.60GHz.

The first four columns is the code from the book, compiled as 32-bit and 64-bit. In 32-bit builds, the modulo operation of 64-bit values is not a single instruction and the built-in microcode is not accessible. Stein’s algorithm is a big win in this case.

For 64-bit code, the modulo operation is significantly slower for signed values than for unsigned (though all test values were unsigned). Stein’s algorithm’s time is between those two, so it’s faster when using signed 64-bit integers.

Because the `%`

operation is so much slower on signed values than unsigned, if there was a need to handle signed arguments, a wrapper function could take the absolute value of each argument and then call a version that used unsigned values. Remember that any time you are doing division or modulo! Unsigned types are faster.

I then called the 64-bit integer form with values that would all fit in 32 bits. Unlike multiplication, division can’t easily quit early when the numbers are smaller. But, fewer iterations are needed through the loop, so the function is faster. Likewise, the Stein algorithm will loop less in every stage. The built-in division regains its lead, as it speeds up more. That is, the Stein algorithm’s complexity is a steeper curve considering the length of the input values in bits.

```
template <typename N>
bool odd (N n) { return bool (n&1); }
template <typename N>
bool even (N n) { return !odd(n); }
⋮
int d_m= 0;
while (even(m)) { m>>=1; ++d_m; }
int d_n= 0;
while (even(n)) { n >>=1; ++d_n; }
```

The Stein’s algorithm code is written to avoid division, but what it *is* using is a lot of test-and-branch instructions. The above fragment removes all the trailing zero bits and remembers how many there were.

## Today’s Bottleneck Issues

Each operation is very simple: test the least-significant bit, shift right one bit, increment an int. ** But the branch is a killer!**

With a modern superscalar highly-pipelined processing core, a conditional branch breaks the pipeline. The x86 processors use branch prediction and speculative execution to mitigate this, but here the branch decision is essentially random on every iteration. It guesses wrong half the time.

Meanwhile, the instruction set does feature primitive opcodes to determine how many consecutive zero bits are at one end or the other of a register. I can access that via a compiler intrinsic, and eliminate the problematic loops.

```
template <typename N>
static N gcd(N m, N n)
{
if (m == N(0)) return n;
if (n == N(0)) return m;
unsigned long index;
_BitScanForward64(&index, m);
int d_m= index;
m >>= d_m;
_BitScanForward64(&index, n);
int d_n= index;
n >>= d_n;
// both are now odd
while (m != n) {
smaller_first (n, m);
m -= n;
_BitScanForward64(&index,m);
m >>= index;
}
// m == n
return m << std::min (d_m, d_n);
}
```

There is still one loop remaining. This main `while`

loop ought to be predicted to loop and only fail the prediction when it’s ready to exit.

The column in the results table labeled “BitScan” gives the results after making this change. All of a sudden, **Stein’s algorithm beats the built-in div code in every case!** In the case of small values to 64-bit signed integers, it is 3× the speed! Interestingly, the running time seems to be determined by the size of the values, without regard to the register size used: 32-bit values run at the same speed whether the code is compiled to use 32-bit integers or 64-bit integers. That makes sense, since the various primitive operations (unlike multiplication and division) are constant speed and don’t have any advantage to using a smaller word size.

But I still have one more trick left. `if (n>m) std::swap (n, m);`

is a branch point, and it will take one way or the other many times as it loops. That is, this is another “bad” branch.

A long time ago, branching was far worse than it is on today’s CPUs. I recall writing high-performance graphics code when *any* branch was a penalty. I learned to absorb the conditional logic into the math, and also learned how to generate and manipulate masks in various clever ways. Here is my quick attempt at a purely computational compare-and-swap:

```
template <typename N>
void smaller_first (N& n, N& m, std::false_type)
{
using std::swap;
if (n>m) swap (n, m);
}
template <typename N>
void smaller_first (N& n, N& m, std::true_type)
{
// cout << "smaller_first input n=" << n << ", m=" << m;
typedef typename std::make_signed<N>::type sN;
const auto bitcount= 8*sizeof(N);
const N mask= (sN(n)-sN(m))>>(bitcount-1); // all 1's if m is larger, all 0's if m is smaller
// cout << ", mask=" << mask;
const N smaller= (m&~mask)|(n&mask);
// cout << ", smaller=" << smaller;
const N larger= (m&mask)|(n&~mask);
// cout << ", larger=" << larger << endl;
n= smaller;
m= larger;
}
template <typename N>
void smaller_first (N& n, N& m)
{
typedef typename std::is_signed<N>::type tag;
return smaller_first (n, m, tag());
}
```

It only works on signed types, so I made the template choose the regular way for unsigned types and the non-branching trick for signed types. (Actually, as written here it would work for unsigned *types* if the *values* were constrained more.)

It works by first doing a subtraction: if `n-m`

produces a negative result, that means the highest bit is set in 2’s complement representation. So if `m`

is larger, that high bit will be set, and if `m`

is smaller it will be clear.

Then the arithmetic right shift replicates this one bit to fill the entire word size: now I have all `1`

s if `m`

is larger and all `0`

s if `m`

is smaller. In either case, the ones-compliment (`~`

) operator will reverse that.

Each input value is then ANDed with a mask: one mask will always produce zero regardless of the input, and the other mask will have no effect. Then both of them are ORed together, which results in the one that wasn’t all zero.

This is done twice, once for the larger and once for the smaller, by reversing the mask choice.

This would appear to use a lot of registers: the two input values, the output values (which can’t destroy the input values before it is done), the 1 mask and the 0 mask. But after computing the masks, the rest of the steps are not dependent on each previous step so they can all be done at the same time in various execution units, or at least scheduled optimally.

The result indicates that all this math and register usage is cheaper than branching: 44 becomes 39 or 37, 84 becomes 75. This is about an 11% speed up in the overall algorithm.

Additional experiments indicate that the final test (doing the min at the very end) doesn’t benefit from using this method. It is only performed once, at the end of the function. Likewise the tests at the beginning for an argument of zero don’t have a noticeable effect (they are almost always “not taken” and when they are the routine is fast because it doesn’t have to do the main part at all).

## What about `cmov` ?

This post on Stackoverflow reminded me that the x86/x64 has long since addressed this with the `cmov`

, or conditional move, instruction. It makes sense that an if statement containing a simple comparison controlling only simple assignment statements could cause `cmov`

instructions to be generated. However, my attempt to write it that way showed that no `cmov`

was used and conditional branching was still being done. Yet further on in the function, a `cmov`

is seen, and it comes from a use of `std::min`

. Looking at the headers, I see `std::min`

is written as a ternary expression in the assignment, not an `if`

` statement.`

Presumably the std templates are carefully crafted to trigger the kind of code that results in optimization, or the optimizer is written to handle the code generated by the standard headers, or cycles of feedback between the two.

So writing:

```
template <typename N>
void smaller_first (N& n, N& m)
{
const N old_n {n};
const N old_m {m};
n= std::min(old_n,old_m);
m= std::max(old_n,old_m);
}
```

does, on this compiler version, cause `cmov`

instructions and no branching; although it does unnecessarily do the comparison twice. Other than the lack of branching, the machine code appears rather poor to me. But here’s the real performance:

The min/max code wins hands down on 16-bit values. It does poorer than the jumping code in all other cases, so the funny arithmetic is still the winner in the signed values where it is supported.

Here is the 16-bit case, which is fast:

```
cmp ax,r8w
movzx ecx,r8w
cmovl r8w,ax
cmp cx,ax
cmovl cx,ax
movzx eax,cx
sub ax,r8w
```

The values for `m`

and `n`

start out in `AX`

and `R8W`

, and are reversed if necessary so the subsequent use (the subtraction) uses those same registers.

The generated code uses one additional register, `ECX`

, to hold the original value of `R8W`

(zero-extended to 32 bits) in case the subsequent line copies `AX`

into `R8W`

. Then the test is performed again between `AX`

(still unchanged) and `CX`

(holding the original `R8W`

), and based on that result may copy `AX`

into `CX`

. Then it moves `CX`

(which may be the value from the first copy or the conditional second copy) into `EAX`

, again zero-filling it to 32 bits.

I suppose there is a performance penalty to move into 16-bit register fields, so the (non-conditional) moves use zero extension moves to 32-bit fields. Yet it uses the 16-bit portion in other places, and never reads larger portions even if they are known to be zero or simply ignored later.

This code could be simplified by removing the second `CMP`

and then reversing the condition of the second `CMOV`

. That second `CMOV`

may be free though, since it can be done simultaneously with an adjacent instruction, and removing it would not free up resources because no other instructions are not dependent on the result.

You would think this identical code would work for 32 bits or 64 bits, just by using the full width of the same registers. But no, here’s what it generates for 32-bit values:

```
mov ecx,r8d
mov dword ptr [rsp+0F0h],ecx
mov dword ptr [rsp+0E8h],edx
lea rax,[rsp+0E8h]
cmp edx,r8d
lea r8,[rsp+0F0h]
cmovge rax,r8
mov r8d,dword ptr [rax]
lea rax,[rsp+0E8h]
lea r13,[rsp+0F0h]
cmp ecx,edx
cmovge rax,r13
mov edx,dword ptr [rax]
sub edx,r8d
```

Besides using `EDX`

and `R8D`

for the two values, it uses `ECX`

for a copy as with the first case, but also uses `RAX`

, `R8`

, and `R13`

as 64-bit pointers, and stores the values in stack-based memory locations as well, `[RSP+0F0h]`

and `[RSP+0E8h]`

!

The values are stored in memory, with pointers to those addresses in two new registers. The comparison is done with the values (still) in registers, but the pointers are what are conditionally copied, and then the resulting pointer is de-referenced to get the matching value. Besides chewing up 3 more registers it requires 3 memory accesses!

The 64-bit case works the same way, with the value-carrying registers and memory allocations widened.