Monthly Archives: August 2017

On MIPS and speed

Last time, I found that a Haswell core was executing 8.57 instructions per clock cycle, which is odd because there is a hard limit of 4 instructions per clock.

If the benchmark is calibrated by setting the performance of the VAX 11/780 and the IBM System/370 model 158-3 equal to 1 MIPS, I can think of a few reasons for the discrepancy:

First, those machines were marketed as 1 MIPS computers, but their real performance on the Dhrystone benchmark is considerably less than that nominal value.

Second, each “instruction” in a high-level language procedure might translate to more than one machine opcode.  Modern compilers are more efficient and generate far better code.  In fact, it’s gotten to the point where it is hard to compile such a benchmark program as intended, because the compiler will eliminate work it can pre-compute or prove useless!

Finally, maybe the modern CPU takes fewer machine instructions to do the same amount of work.  That’s certainly true, but does the benchmark code use the larger registers?  Looking at an implementation, I notice a couple of things along these lines.  Control statements can produce fused macro-ops combining the test and branch, which picks up another simultaneous instruction.  Procedure calls might be “free”, since the call simply starts prefetching from the new location and the unconditional jump itself is not seen as an instruction at all.  Even if the compiler is not optimizing out work, the CPU itself might be!  Assignment statements are also part of the mix, and if those variables are in registers then a MOV can be free, taken care of in the register mappings in the pipeline without actually moving anything.  Finally, string manipulation might take place in bigger gulps, not one byte at a time.

The OCD Method

I brought up the fully-optimized version of the program in the debugger, and counted the instructions.  On the “short” path where it does the early-out, it does as few as 85 instructions and sometimes over a hundred.  To err on the side of caution (fewer actual instructions done in the given time) let’s use a weighted average of 90.  BTW, the early-out part at the top of the loop is only 21 instructions; the bulk of the expense of one iteration is permuting the array.

On the “long” path, the full code to decode the three strings with the current values is 110 instructions, plus 3 for the final comparison.  Adding that to the roughly 90 instructions of the common part, call it 200 instructions for this case.

Without counting the rare times where it does a skip by reversing a few elements, this is about 135,693 full iterations at 200 and 219,360 short iterations at 90, or a total of 46,881,000 instructions.

Doing 47 million instructions in 3.4 milliseconds is about 13.8 billion instructions per second.  At 3.6 GHz this is about 3.8 instructions per clock, where now “instruction” means an x64 opcode.

How fast was Knuth’s computer?

In my previous post, I wondered what the power of Knuth’s computers was, at the time TAOCP was being written.  Someone suggested the IBM S/360 series as an exemplar.  That turned out to be a good idea, since I’ve written programs for the S/370 in assembly language, so I’m familiar with the architecture.  Models 30, 40, and 50 were released in April 1965; on the pricier side were models 65 and 75.  Here is a scanned “System Summary” describing the various models in detail.  So, I suppose somewhere between 100 and 900 kilo-instructions per second.  A larger machine would probably be servicing multiple users.  Fifty years later, my Xeon E3-1276 is supposedly around 133 billion instructions per second.

Interestingly, the S/360 takes many (like 13 on average) clock cycles to perform one instruction.  Meanwhile each core of the Xeon performs 8½ instructions in one clock cycle.  I suppose the clock speed of the S/360 is the cycle time for the internal microcode.

But what’s an instruction?  On the S/360, I would not need the decode function at all, but would just sum the digits directly using unpacked decimal.

int decode (std::initializer_list<cellT> lst)
{
	int total= 0;
	for (auto digit : lst)
		total= total*10+digit;
	return total;
}

The modern CPU knows only binary arithmetic on various word sizes, so converting from a decimal digit-per-byte representation requires 4 iterations on each of two operands, doing a multiply and an add: at least 16 distinct instructions (if the loops are unrolled), plus the actual addition once that’s all done.

Interestingly, the x64 code generated by the compiler doesn’t issue a multiply instruction in this loop.  In fact, the entire expression doesn’t use the regular ALU at all: there is neither a MUL nor an ADD instruction there.  Instead, it exploits the address generator to do work that has nothing to do with actual pointer addresses.  The complicated addressing modes of a CISC processor mean that the separate address-generation unit can compute a variety of things, yet it is far more limited than a completely general ALU, so it is much simpler and thus faster.

In particular, Scaled Index mode looks like this: [ebx + ecx*S + constant].  Register ebx is the base, and ecx is used as an index here; the index can be used directly, or scaled by 2, 4, or 8.  If the same register is used in both positions, you can multiply it by five!  The LEA instruction, Load Effective Address, yields the computed address without fetching what it resolves to, as a MOV would.  So, if we have total in EAX and the digit in EBX,

LEA EDX, [EAX+EAX*4]
LEA EAX, [EBX+EDX*2]

The first instruction multiplies the total by five.  The second not only multiplies that by two, but also adds in the digit, as the base of the addressing mode.

I also found it interesting how the S/360 line anticipated what we have today:  one compatible instruction set, but pricey implementations have more pipelining and faster clocks; also they keep adding more layers of cache memory.  The “processor storage” housed with the CPU is analogous to the L2 cache.  Adding external memory modules gives more storage but slower: 8 microsecond access time.  If you add pairs of modules you can go dual-channel and double the throughput.  Finally, later high-end models added extra-high-speed memory to keep up with the CPU, and that is analogous to our L1 cache.

Back to the quantitative comparisons: the modern machine has 4 independent cores, but my program only used one.  If a brute-force problem required a significant amount of time, it could be split up into 4 tasks.  At full utilization it’s 133 billion vs. 133 thousand instructions per second, more or less: a factor of about one million.  With a single thread, a quarter of that.  So 30 ms on one core works out to roughly two hours on a S/360-50 using it exclusively for this job.

Knuth’s suggestion of 10! can be scaled up by a million.  That lands between 15! and 16!.  Now in terms of exponential growth, realize that an institutional computer like that cost about 1000 times more than a personal desktop computer today, so the growth in computing power per constant dollar (not even adjusting for inflation) is indeed about one billion.

For floating-point calculations, the difference in power over 50 years is a few orders of magnitude higher.  A $300 GPU card can do 4 teraflops?  That means it would have been a world-class supercomputer as recently as the early 2000s!

 

Permutation Puzzle: “send+more=money”

I saw this puzzle on Bartosz Milewski’s blog, in an entry on using monads in C++.  I’d like to hold it up as an example of a completely different lesson for C++ programmers.  A central idea I want to teach is: know your libraries.

I recall it was in the early ’90s, when the STL was submitted as a proposal for inclusion in the upcoming C++ standard.  I noticed there were algorithms called next_permutation and prev_permutation, and wondered how they worked: how do you order permutations, and re-arrange your collection into the next one, without keeping any auxiliary state?  Then I wondered what I would ever use such a thing for.  Well, nearly 25 years later, I found out!

You should look through the list of algorithms every once in a while just to see what’s there.  Otherwise you might only know about the common ones that you use.  Consider the analogy with a tool (say, a special bent screwdriver only used to attach doorknobs) that you know is in the very back of the drawer, though you may need to rummage around to locate it.  Remembering you have that tool makes for a quick job.  Having to rig up something from common parts instead (to continue the analogy, use locking pliers to grab a short screwdriver bit from the side) is not as good, and certainly more work.

So… 8 nested for loops followed by 7 if statements containing 28 conditions?  Get real!  If you have a line that reads });});});});});});});}); then the code review will be a sight indeed.

Solution in C++ w/standard library

Here’s the meat of my solution:

using cellT = int8_t;

cellT A[10] {0,1,2,3,4,5,6,7,8,9};

void solve1()
{
    do {
	++iteration_count;
	auto [ig1,ig2, s,e,n,d,m,o,r,y] {A};
	int send= decode({s,e,n,d});
	int more= decode ({m,o,r,e});
	int money= decode ({m,o,n,e,y});
	if(send+more==money) {
	    ++solution_count;
	    solutions.push_back({send,more,money});
	}
    } while (std::next_permutation(std::begin(A), std::end(A)));
}

You’ll notice that besides the uniform initialization syntax introduced in C++11, this uses something you might not have seen before (if you’re reading this in 2017).  Hot off the press in C++17 are structured bindings.

	auto [ig1,ig2, s,e,n,d,m,o,r,y] {A};

This defines 10 variables and binds them to the elements of (a copy of) the array.  The first two are ignored, so I used scratch names; the others are simply the names of the letters in the puzzle.

One thing I have to point out from Milewski’s listing is the call to turn a word into a numeric value.  He writes:

int send = asNumber(vector{s, e, n, d});

This creates a std::vector on every use.  Let me elaborate: it allocates a block of memory from the heap (vectors can’t use a small-buffer optimization).  Then, after the call returns, it is deallocated.  Then the next two lines do the same thing.  And that happens on every iteration.

The constructor for std::vector takes this handy literal list.  Now, in C++ these containers are not special language features, but ordinary libraries.  It should be clear that anything they do (cool syntax or unusual abilities) you can do yourself in your own code!  My version of the same construct does not create a vector, doesn’t require more words to make the call, and most importantly does not have any overhead.

int send = decode({s,e,n,d});

And here is the function that takes such an initializer list:

int decode (std::initializer_list<cellT> lst)
{
	int total= 0;
	for (auto digit : lst)
		total= total*10+digit;
	return total;
}

The solving function doesn’t print the results because I want to time just the solving logic.  So the solutions are pushed onto a vector, and the caller prints them after stopping the clock.  In a real program, this might be an object (not globals) and the results available in the object afterwards, or as the return value from the solving function.  In another post I’ll make it lazy.

Make it faster

This simple function found 50 solutions, one of which doesn’t have a leading zero.  It ran in 39.6 milliseconds, trying all 3,628,800 permutations.  That’s 92 million iterations per second.

The value of 10 factorial is an interesting number here.  Donald Knuth, in The Art of Computer Programming, wrote that this is about the size that separates things that may be done by brute force from things that are impractical to simply try all possibilities.  Volume 3 was published in 1973.  I hazard to guess that computers now are about 2^30 (or about a billion) times the power of the machines that were around when he wrote that.  A billion times 30 milliseconds is about a year.  So, I revise that to more like ten million times the speed, if I postulate that he could have run this program to completion in a matter of days.

Anyway, to make it faster, I need to skip over permutations that are “the same” as ones I already rejected.  The order of the two ignored digits doesn’t change the solution, so if I decide that one order is canonical and skip over the whole lot whenever the other is detected, that cuts the number of iterations in half.

So how do you skip over states in the next_permutation algorithm?  I looked it up — a StackOverflow post described it well and also pointed to a Wikipedia page on it.  The states are generated in lexicographical order, so when a given digit changes everything to the right is in reverse sorted order, and it “rolls over” to advance that digit by one and everything to the right is now in forward sorted order — the lowest value of that substring.

So, when I identify a digit value that I know will not be a solution no matter what the other digits are, I can skip to when that digit is advanced again by reversing everything to the right of it.

void skipfrom (int pos)
{
    std::reverse (std::begin(A)+pos, std::end(A));
}

    ⋮
    do {
	++iteration_count;
	auto [ig1,ig2, s,m,e,d,y,n,o,r] {A};
	if(ig1 > ig2) {
	    skipfrom(2);
	    continue;
	}

Indeed, it still found 50 solutions but the iteration_count showed half the number: only 1.8 million times through the loop.  However, the time only went down to 26ms — about two thirds the time, not half.

We also don’t want solutions with a leading zero, so filter those out too.  Notice in the listing above I changed the order of declaring the digit variables.  It doesn’t matter to the solution algorithm, but putting these farther to the left means I can skip over more.

        ⋮
	if(s==0) {
	    skipfrom(3);
	    continue;
	}
	if(m==0) {
	    skipfrom(4);
	    continue;
	}
        ⋮

That didn’t save much though: 1.45 million iterations in 22 ms.

Another big constraint can be found on the right side of the puzzle.  I would think that parsing the arrangement of digits into ints would be slow, seeing as that involves multiplying by 10 and adding each digit.  Looking at the rightmost (units) column only, the puzzle requires d+e=y, modulo 10 to allow for a carry out.  Test that before parsing the int values, and furthermore skip ahead until one of those three digits changes again.  To that end, the declaration order has d, e, and y next after the previous items we wanted on the left.  This leaves only 3 letters to the right, so each time the units digits don’t work it can skip 3! = 6 iterations.

I added a counter to that, and see that it happened 219,360 times.  The loop only executed 355,053 times now, taking a mere 4.7 ms.

Faster yet?!

Notice in the listing that I declared a type cellT for the array of digits and anything that holds a digit.  My thought was that keeping the array small would save in parameter passing to decode.  Keeping the size of the data small is generally good for memory caching, but it probably doesn’t matter here because I don’t have enough data to exceed the L1 cache anyway.

But, I can change the type in one place in the code and it affects everything in the program.  I changed it from int8_t to plain int (int is 32 bits here), and…  the time was down to 3.4 ms!

64-bit was the same.  16-bit values were the slowest, at 5.2 ms.  So, the CPU is inefficient at loading 8-bit values and worse with 16.  I don’t know whether the actual load is slower, or whether the compiler generated somewhat different code rather than just using MOVSX instead of MOV.  This was a difference of a quarter of the execution time over the entire algorithm, and loading values is only a part of what it does, with values kept in registers once loaded.