On MIPS and speed

Last time, I found that Haswell core executes 8.57 instructions per clock cycle, which is odd because there is a hard limit of 4 instructions per clock.

If the benchmark is calibrated by setting the performance of the VAX 11/780 and IBM System/370 model 158-3 as equal to 1 MIPS, I can think of a couple reasons:

First, those machines were marketed and claimed to be 1 MIPS, and the real performance of the Dhrystone benchmark is considerably less than that theoretical value.

Second, the number of “instructions” in a high level language procedure might translate to more than one machine opcode each.  Modern compilers are more efficient and generate far better code.  In fact, it’s gotten to the point of being hard to compile such a benchmark program as-intended as the compiler will eliminate it as being pre-computed or useless!

Finally, maybe the modern CPU takes fewer machine instructions to do the same amount of work.  That’s certainly true, but does the benchmark code use the larger registers?  Looking at an implementation, I notice a couple things along these lines.  Control statements can produce fused macro-ops combining the test and branch, which picks up another simultaneous instruction.  Procedure calls might be “free” as that simply starts prefetching from the new location and the unconditional jump itself is not seen as an instruction at all.  Even if the compiler is not optimizing out work, the CPU itself might be!  Assignment statements are also part of the mix, and if these variables are in registers than a MOV can be free, taken care of in the register mappings in the pipeline and never actually moving anything.  Finally, string manipulation might take place in bigger gulps, not one byte at a time.

The OCD Method

I brought up the fully-optimized version of the program in the debugger, and counted the instructions.  On the “short” path where it does the early-out, it does as few as 85 instructions and sometimes over a hundred.  To err on the side of caution (fewer actual instructions done in the given time) let’s use a weighted average of 90.  BTW, the early-out part at the top of the loop is only 21 instructions; the bulk of the expense of one iteration is permuting the array.

On the “long” path, the full code to decode the three strings with the current values is 110 instructions, and then it takes 3 to do the final comparison.  So let’s say 200 instructions in this case.

Without counting the rare times where it does a skip by reversing a few elements, this is about 135,693 full iterations at 200 and 219,360 short iterations at 90, or a total of 46,881,000 instructions.

Doing 47 million instructions in 3.4 milliseconds is about 11.8 billion instructions per second.  At 3.6 GHz this is 3.3 instructions per clock, where now “instruction” means an X64 opcode.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>