In my previous post, I wondered how powerful Knuth’s computers were at the time TAOCP was being written. Someone suggested the IBM S/360 series as an exemplar. That turned out to be an especially good idea, since I’ve written programs for the S/370 in assembly language and am familiar with the architecture. Models 30, 40, and 50 were released in April 1965. On the pricier side were models 65 and 75. Here is a scanned “System Summary” describing the various models in detail. So, I suppose somewhere between 100 and 900 *kilo* instructions per second. A larger machine would probably be servicing multiple users. Fifty years later, my Xeon E3-1276 is supposedly good for around 133 billion instructions per second.

Interestingly, the S/360 takes many (like 13 on average) clock cycles to perform one instruction. Meanwhile each core of the Xeon performs 8½ instructions in one clock cycle. I suppose the clock speed of the S/360 is the cycle time for the internal microcode.

But what’s an instruction? On the S/360, I would not need the `decode` function at all, but would just sum the digits directly using unpacked decimal.

    int decode (std::initializer_list<cellT> lst)
    {
        int total= 0;
        for (auto digit : lst)  total= total*10+digit;
        return total;
    }

The modern CPU knows only binary arithmetic on various word sizes. So converting from a digit-per-byte decimal representation requires 4 iterations on each of two operands, each iteration doing a multiply and an add: at least 16 distinct instructions (if the loop is unrolled), plus the actual add once that’s all done.
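To make that count concrete, here is a minimal sketch of the two conversions followed by the single real addition. The alias for `cellT` and the helper `addDecoded` are assumptions made for illustration; the digit values are arbitrary sample data.

```cpp
#include <initializer_list>

// Assumption: cellT is some small integer type holding one decimal digit.
using cellT = int;

// Horner-style conversion: one multiply-by-ten and one add per digit.
constexpr int decode(std::initializer_list<cellT> lst)
{
    int total = 0;
    for (auto digit : lst)
        total = total * 10 + digit;
    return total;
}

// Hypothetical helper: convert both operands to binary, then do the one
// addition that the decimal hardware would have done directly.
constexpr int addDecoded(std::initializer_list<cellT> a,
                         std::initializer_list<cellT> b)
{
    return decode(a) + decode(b);
}

// Compile-time checks on sample digits.
static_assert(decode({1, 2, 3, 4}) == 1234, "four digits, four multiply-adds");
static_assert(addDecoded({9, 5, 6, 7}, {1, 0, 8, 5}) == 10652, "two decodes, one add");
```

Each four-digit operand costs four multiply-add pairs, so the two conversions account for the 16 operations before the final add.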

Interestingly, the x64 code generated by the compiler doesn’t actually issue a multiply instruction in this loop. In fact, the entire expression does not use the regular ALU! There is neither a `MUL` nor an `ADD` instruction there. Instead, it exploits the address generator to do arithmetic that has nothing to do with actual pointer addresses. The complicated addressing modes of the CISC processor mean that a separate address generator unit has a variety of things it can compute, yet it is far more limited than a completely general ALU. So, it is much simpler and thus faster.

In particular, Scaled Index mode looks like this: `[ebx + ecx*S + constant]`. Register `ebx` is the base, and `ecx` is used as an index here. The index can be used directly, or scaled by 2, 4, or 8. If the same register is used in both positions, you can multiply it by five! The `LEA` instruction is Load Effective Address, and gives the computed address without fetching what it resolves to, as a `MOV` would. So, if we have `total` in `EAX` and the digit in `EBX`,

    LEA EDX, [EAX+EAX*4]    ; EDX = total*5
    LEA EAX, [EBX+EDX*2]    ; EAX = digit + (total*5)*2 = total*10 + digit

The first instruction multiplies by five. The second instruction not only multiplies by two, but also adds in the digit as the base of the addressing mode.
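The arithmetic behind those two steps can be checked with a small model of the address computation. This is only a sketch: `lea` and `appendDigit` are hypothetical helpers that mimic the `base + index*scale + constant` formula, not anything the compiler or CPU exposes.

```cpp
#include <cstdint>

// Model of what the address generator computes for [base + index*scale + disp].
// Hypothetical helper; real hardware only accepts a scale of 1, 2, 4, or 8.
constexpr std::uint32_t lea(std::uint32_t base, std::uint32_t index,
                            std::uint32_t scale, std::uint32_t disp = 0)
{
    return base + index * scale + disp;
}

// total*10 + digit expressed as two address computations.
constexpr std::uint32_t appendDigit(std::uint32_t total, std::uint32_t digit)
{
    // Same value as base and index, scale 4: total + total*4 = total*5.
    const std::uint32_t times5 = lea(total, total, 4);
    // Digit as base, scale 2: digit + (total*5)*2 = total*10 + digit.
    return lea(digit, times5, 2);
}

static_assert(lea(7, 7, 4) == 35, "same register twice multiplies by five");
static_assert(appendDigit(123, 4) == 1234, "two steps give total*10 + digit");
```

The scale factor is restricted to powers of two, which is why the multiply by ten has to be split into a times-five step and a times-two step.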

I also found it interesting how the S/360 line anticipated what we have today: one compatible instruction set, but pricey implementations have more pipelining and faster clocks; also they keep adding more layers of cache memory. The “processor storage” housed with the CPU is analogous to the L2 cache. Adding external memory modules gives more storage, but it is slower: 8 microsecond access time. If you add pairs of modules you can go dual-channel and double the throughput. Finally, later high-end models added extra-high-speed memory to keep up with the CPU, and that is analogous to our L1 cache.

Back to the quantitative comparisons: the modern machine has 4 independent cores, but my program only used one. If a brute force problem required a significant amount of time, it could be split up into 4 tasks. At full utilization, that is 133 billion vs 133 thousand instructions per second, more or less: a factor of about one million. With a single thread, it is a quarter of that. At the full factor of a million, 30 ms of work would take about 8 hours on an S/360-50 used exclusively for this job; at the single-thread factor of 250,000, it would take about 2 hours.
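The ratios are easy to check. The sketch below uses the rough figures quoted in this post (133 billion instructions per second across four cores, 133 thousand for the S/360); the function names are made up for illustration, and it evaluates the 30 ms example at both the four-core and the single-core ratio.

```cpp
// Rough figures quoted in this post, not measurements.
constexpr double xeonIps = 133e9;  // all four cores together
constexpr double s360Ips = 133e3;  // S/360, "more or less"
constexpr int    cores   = 4;

// Speed ratio when the modern program uses coresUsed of the four cores.
constexpr double speedup(int coresUsed)
{
    return (xeonIps * coresUsed / cores) / s360Ips;
}

// Hours the S/360 would need for work that takes modernSeconds today.
constexpr double s360Hours(double modernSeconds, int coresUsed)
{
    return modernSeconds * speedup(coresUsed) / 3600.0;
}

static_assert(speedup(4) == 1.0e6, "all cores: a factor of a million");
static_assert(speedup(1) == 2.5e5, "one core: a quarter of that");
// 30 ms at the full ratio is roughly 8.3 hours; at the one-core ratio, roughly 2.1.
static_assert(s360Hours(0.030, 4) > 8.3 && s360Hours(0.030, 4) < 8.4, "");
static_assert(s360Hours(0.030, 1) > 2.0 && s360Hours(0.030, 1) < 2.1, "");
```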

Knuth’s suggestion of 10! can be scaled up by a million. Since 10! is about 3.6 million, that comes to roughly 3.6 × 10^12, which falls between 15! (about 1.3 × 10^12) and 16! (about 2.1 × 10^13). Now in terms of exponential growth, realize that an institutional computer like that cost about 1000 times more than a personal desktop computer today. So the gain in computing power per dollar (not adjusting for inflation) is indeed about one billion.

For floating-point calculations, the difference in power over 50 years is a few orders of magnitude higher. A $300 GPU card can do 4 teraflops? That means it would have been a world-class supercomputer as recently as 2005!