On MIPS and speed

Last time, I found that a Haswell core seemingly executes 8.57 instructions per clock cycle, which is odd because there is a hard limit of 4 instructions per clock.

Since the benchmark is calibrated by setting the performance of the VAX 11/780 and IBM System/370 model 158-3 equal to 1 MIPS, I can think of a couple of reasons for the discrepancy:

First, those machines were marketed as 1 MIPS machines, but their real performance on the Dhrystone benchmark is considerably less than that theoretical value.

Second, each “instruction” in a high-level language procedure might translate to more than one machine opcode.  Modern compilers are more efficient and generate far better code.  In fact, it’s gotten to the point where it is hard to compile such a benchmark program as intended, because the compiler will eliminate the work as pre-computed or useless!

Finally, maybe the modern CPU takes fewer machine instructions to do the same amount of work.  That’s certainly true, but does the benchmark code use the larger registers?  Looking at an implementation, I notice a couple of things along these lines.  Control statements can produce fused macro-ops combining the test and branch, which picks up another simultaneous instruction.  Procedure calls might be “free”, since a call simply starts prefetching from the new location and the unconditional jump itself is not counted as an instruction at all.  Even if the compiler is not optimizing out work, the CPU itself might be!  Assignment statements are also part of the mix, and if the variables are in registers then a MOV can be free, handled in the register mappings in the pipeline without ever actually moving anything.  Finally, string manipulation might take place in bigger gulps, not one byte at a time.

The OCD Method

I brought up the fully-optimized version of the program in the debugger, and counted the instructions.  On the “short” path where it does the early-out, it does as few as 85 instructions and sometimes over a hundred.  To err on the side of caution (fewer actual instructions done in the given time) let’s use a weighted average of 90.  BTW, the early-out part at the top of the loop is only 21 instructions; the bulk of the expense of one iteration is permuting the array.

On the “long” path, the full code to decode the three strings with the current values is 110 instructions, and then it takes 3 more to do the final comparison.  Adding the roughly 90 instructions that every iteration spends on the loop test and permutation, let’s say 200 instructions in this case.

Without counting the rare times where it does a skip by reversing a few elements, this is about 135,693 full iterations at 200 and 219,360 short iterations at 90, or a total of 46,881,000 instructions.

Doing 47 million instructions in 3.4 milliseconds is about 13.8 billion instructions per second.  At 3.6 GHz this is about 3.8 instructions per clock, where now “instruction” means an x64 opcode.

How fast was Knuth’s computer?

In my previous post, I wondered how powerful Knuth’s computers were at the time TAOCP was being written.  Someone suggested the IBM S/360 series as an exemplar.  That turned out to be a good choice, since I’ve written programs for the S/370 in assembly language, so I’m familiar with it.  Models 30, 40, and 50 were released in April 1965.  On the pricier side were models 65 and 75.  Here is a scanned “System Summary” describing the various models in detail.  So, I suppose somewhere between 100 and 900 kilo instructions per second.  A larger machine would probably be servicing multiple users.  Fifty years later, my Xeon E3-1276 is supposedly around 133 billion instructions per second.

Interestingly, the S/360 takes many (like 13 on average) clock cycles to perform one instruction.  Meanwhile each core of the Xeon performs 8½ instructions in one clock cycle.  I suppose the clock speed of the S/360 is the cycle time for the internal microcode.

But what’s an instruction?  On the S/360, I would not need the decode function at all, but would just sum the digits directly using unpacked decimal.

int decode (std::initializer_list<cellT> lst)
{
	int total= 0;
	for (auto digit : lst)
		total= total*10+digit;
	return total;
}

The modern CPU knows only binary arithmetic on various word sizes.  So converting from a digit-per-byte decimal representation requires 4 iterations over each of the two operands, doing a multiply and an add every time: at least 16 distinct instructions (if the loop is unrolled), plus the actual addition once that’s all done.
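In other words, the unrolled computation for one four-digit operand (say, the digits s, e, n, d of “send”) is effectively this — a sketch of what the loop reduces to, not the compiler’s literal output:

int send = ((s*10 + e)*10 + n)*10 + d;   // the first iteration's total*10 is just 0, leaving three x10 steps and three adds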

Interestingly, the x64 code generated by the compiler doesn’t actually issue a multiply instruction in this loop. In fact, the entire expression does not use the regular ALU! There is neither a MUL nor an ADD instruction there. Instead, it exploits the address generator to do work that has nothing to do with actual pointer addresses. The complicated addressing modes of this CISC processor mean that a separate address-generator unit has a variety of things it can compute, yet it is far more limited than a completely general ALU. So, it is much simpler and thus faster.

In particular, Scaled Index mode looks like this: [ebx + ecx*S + constant].  Register ebx is the base, and ecx is used as the index here.  The index can be used directly, or scaled by 2, 4, or 8.  If the same register is used in both positions, you can multiply it by five!  The LEA instruction is Load Effective Address; it delivers the computed address without fetching what it resolves to, as a MOV would.  So, if we have total in EAX and the digit in EBX,

LEA EDX, [EAX+EAX*4]   ; EDX = total*5
LEA EAX, [EBX+EDX*2]   ; EAX = digit + total*10, the new total

The first instruction multiplies the total by five. The second not only multiplies that by two (making ×10), but also adds in the digit as the base of the addressing mode, leaving the new total in EAX with no MUL or ADD in sight.

I also found it interesting how the S/360 line anticipated what we have today:  one compatible instruction set, but pricey implementations have more pipelining and faster clocks; also they keep adding more layers of cache memory.  The “processor storage” housed with the CPU is analogous to the L2 cache.  Adding external memory modules gives more storage but slower: 8 microsecond access time.  If you add pairs of modules you can go dual-channel and double the throughput.  Finally, later high-end models added extra-high-speed memory to keep up with the CPU, and that is analogous to our L1 cache.

Back to the quantitative comparisons:  The modern machine has 4 independent cores, but my program only used one.  If a brute-force problem required a significant amount of time, it could be split up into 4 tasks.  At full utilization, 133 billion vs 133 thousand, more or less.  That’s a factor of about one million.  With a single thread, a quarter of that.  Thirty milliseconds of work at full utilization would be about 8½ hours on a S/360-50 using it exclusively for this job.

Knuth’s suggestion of 10! can be scaled up by a million.  That lands between 15! and 16! (15! is about 360,000 times 10!, and 16! about 5.8 million times).  Now, in terms of exponential growth, realize that an institutional computer like that cost about 1000 times more than a personal desktop computer today.  In computing power per constant dollar (not even adjusting for inflation), the improvement is indeed about one billion.

For floating-point calculations, the difference in power over 50 years is a few orders of magnitude higher.  A $300 GPU card can do 4 teraflops?  That means it would have been a world-class supercomputer as recently as 2005!

 

Permutation Puzzle: “send+more=money”

I saw this puzzle on Bartosz Milewski’s blog, in an entry on using monads in C++.  I’d like to hold it up as an example of a completely different lesson for C++ programmers.  The central idea I want to teach is: know your libraries.

I recall it was in the early ’90s, when the STL for C++ was submitted as a proposal for inclusion in the upcoming standard.  I noticed there were algorithms called next_permutation and prev_permutation, and wondered how they work—how do you define an ordering and re-arrange your collection into the next permutation without keeping any auxiliary state?  Then I wondered what I would ever use such a thing for.  Well, nearly 25 years later, I found out!

You should look through the list of algorithms every once in a while just to see what’s there.  Otherwise you might only know about the common ones that you use.  Consider the analogy with a tool (say, a special bent screwdriver only used to attach doorknobs) that you know is in the very back of the drawer, though you may need to rummage around to locate it.  Remembering you have that tool makes for a quick job.  Having to rig up something from common parts instead (to continue the analogy, use locking pliers to grab a short screwdriver bit from the side) is not as good, and certainly more work.

So… 8 nested for loops followed by 7 if statements containing 28 conditions?  Get real!  If you have a line that reads });});});});});});});}); then the code review will be a sight indeed.

Solution in C++ w/standard library

Here’s the meat of my solution:

#include <algorithm>         // std::next_permutation
#include <array>
#include <cstdint>
#include <initializer_list>
#include <vector>

using cellT = int8_t;

cellT A[10] {0,1,2,3,4,5,6,7,8,9};

int decode (std::initializer_list<cellT> lst);   // defined below

long iteration_count= 0, solution_count= 0;      // bookkeeping
std::vector<std::array<int,3>> solutions;        // collected {send, more, money} triples (any small record would do)

void solve1()
{
    do {
	++iteration_count;
	auto [ig1,ig2, s,e,n,d,m,o,r,y] {A};
	int send= decode({s,e,n,d});
	int more= decode ({m,o,r,e});
	int money= decode ({m,o,n,e,y});
	if(send+more==money) {
	    ++solution_count;
	    solutions.push_back({send,more,money});
	}
    } while (std::next_permutation(std::begin(A), std::end(A)));
}

You’ll notice that besides the uniform initialization syntax introduced in C++11, this uses something you might not have seen before (if you’re reading this in 2017).  Hot off the press in C++17 is structured bindings.

	auto [ig1,ig2, s,e,n,d,m,o,r,y] {A};

This defines 10 variables and assigns all the elements of the array to them.  The first two are ignored so I used scratch names, and the others are simply the names of the letters in the puzzle.

One thing I have to point out from Milewski’s listing is the call to turn a word into a numeric value.  He writes:

int send = asNumber(vector{s, e, n, d});

This creates a std::vector on every use.  Let me elaborate: it allocates a block of memory from the heap (vectors can’t use a small-vector optimization).  Then after the call returns it is deallocated.  Then the next two lines do the same thing.  And that happens on every iteration.

The constructor for std::vector takes this handy literal list.  Now in C++ these containers are not special language features, but are ordinary libraries.  It should be clear that anything they do — cool syntax or unusual ability — you can do yourself in your own code!  My version of the same construct does not create a vector, doesn’t require more words to make the call, and most importantly does not have any overhead.

int send = decode({s,e,n,d});

And here is the function that takes such an initializer list:

int decode (std::initializer_list<cellT> lst)
{
	int total= 0;
	for (auto digit : lst)
		total= total*10+digit;
	return total;
}

The solving function doesn’t print the results because I want to time just the solving logic.  So the solutions are pushed onto a vector, and the caller prints them after stopping the clock.  In a real program, this might be an object (not globals) and the results available in the object afterwards, or as the return value from the solving function.  In another post I’ll make it lazy.

Make it faster

This simple function found 50 solutions, one of which doesn’t have a leading zero.  It ran in 39.6 milliseconds, trying all 3,628,800 permutations.  That’s 92 million iterations per second.

The value of 10 factorial is an interesting number here.  Donald Knuth, in The Art of Computer Programming, wrote that this is about the size that separates things that may be done by brute force from things that are impractical to simply try all possibilities.  Volume 3 was published in 1973.  I hazard a guess that computers now are about 2³⁰ (or about a billion) times the power of the machines that were around when he wrote that.  A billion times 30 milliseconds is nearly a year.  So, I revise that to more like ten million times the speed, if I postulate that he could have run this program to completion in a matter of days.

Anyway, to make it faster, I need to skip over permutations that are “the same” as one I already rejected.  The order of the two ignored digits doesn’t change the solution, so if I decide that one order is canonical and skip over the whole block whenever the other order is detected, that cuts the number of iterations in half.

So how do you skip over states in the next_permutation algorithm?  I looked it up — a StackOverflow post described it well and also pointed to a Wikipedia page on it.  The states are generated in lexicographical order: just before a given position advances, everything to its right is in reverse sorted order, and when it “rolls over” that position advances by one and everything to its right is left in forward sorted order — the lowest value of that substring.

So, when I identify a digit value that I know will not be a solution no matter what the other digits are, I can skip to when that digit is advanced again by reversing everything to the right of it.

void skipfrom (int pos)
{
    std::reverse (std::begin(A)+pos, std::end(A));
}

    ⋮
    do {
	++iteration_count;
	auto [ig1,ig2, s,m,e,d,y,n,o,r] {A};
	if(ig1 > ig2) {
	    skipfrom(2);
	    continue;
	}

Indeed, it still found 50 solutions but the iteration_count showed half the number: only 1.8 million times through the loop.  However, the time only went down to 26ms — about two thirds the time, not half.

We also don’t want solutions with a leading zero, so filter those out too.  Notice in the listing above I changed the order of declaring the digit variables.  It doesn’t matter to the solution algorithm, but putting these farther to the left means I can skip over more.

        ⋮
	if(s==0) {
	    skipfrom(3);
	    continue;
	    }
	if(m==0) {
	    skipfrom(4);
	    continue;
	}
        ⋮

That didn’t save much though: 1.45 million iterations in 22 ms.

Another big constraint can be found on the right side of the puzzle.  I would think that parsing the arrangement of digits into ints would be slow, seeing as that involves multiplying by 10 and adding each digit.  Looking at the rightmost (units) digit only, the puzzle has d+e=y with a possible carry.  Test that before parsing the int values, and furthermore skip ahead until one of those three digits changes again.  To that end, the declaration order has d, e, and y next after the previous items we wanted on the left.  This leaves only 3 letters to the right, so each time the units digits don’t work it can skip 6 iterations.
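With that binding order (e, d, and y in positions 4 through 6, and n, o, r filling the last three places), the early test is a sketch along these lines:

	if((d+e)%10 != y) {   // units digits can't work, with or without a carry into the next column
	    skipfrom(7);      // reverse n, o, r: skips the 3! = 6 arrangements of the last three places
	    continue;
	}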

I added a counter to that, and see that it happened 219,360 times.  The loop only executed 355,053 times now, taking a mere 4.7 ms.

Faster yet?!

Notice in the listing that I declared a type cellT for the array of digits and anything that holds a digit.  My thought was that keeping the array small would save on parameter passing to decode.  Keeping the size of the data small is generally good for memory caching, but it probably doesn’t matter here because I don’t have enough data to exceed the L1 cache anyway.

But, I can change the type in one place in the code and it affects everything in the program.  I changed it from int8_t to plain int (int is 32 bits here), and…  the time was down to 3.4 ms!
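The change is literally the one alias at the top of the listing:

using cellT = int;      // was int8_t; int is 32 bits with this compiler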

64-bit was the same.  16-bit values were the slowest, at 5.2 ms.  So, the CPU is inefficient at loading 8-bit values and worse with 16-bit.  I don’t know if the actual load is slower, or if the compiler generated somewhat different code rather than just using MOVSX instead of MOV.  This was a difference of a quarter of the execution time over the entire algorithm, and loading values is only a part of what it does, with values kept in registers once loaded.

 

Classic “Aggravation” board game

The Memoir

When I was a kid, my family of 4 played a game called Aggravation.  According to the history recounted on Wikipedia, this was the original version and is no longer made.  I recall it was a piece of thick material similar to what I’d buy today as high-density fiberboard, with holes in it.  This was placed on top of the bottom of the box which was printed with colored patches that would show through the holes.  The box had short sides so the marbles could be placed within and the lid put on, for storage.  (Here is the same board from 1962, but the marbles were different and I think the lid was slightly different too.)  I think it was obtained in 1964 by my parents.

Photos from Julie of the LeolasAttic store on Etsy.

Vintage 1962 Aggravation board courtesy of LeolasAttic

Later, I got a 6-player version as a birthday present.  This had a folding board like many board games use, but it was just the single board—there was no backing, so the holes went all the way through and you could see the table through them.  Instead, colors were drawn around the holes to indicate the different zones, and this gave it a completely different character.

Now I see the 6-player set for sale as “Rare! Original vintage deluxe party edition antique” circa 1972 for $130.  I believe I’m missing one of the colors of marbles and have substituted ordinary cats-eyes for that color.  Even so, I hope it turns up in my parents’ garage in the same good condition!

I must say, it lived up to its name.  Even the die roll (by a child barely old enough to play) might send the die flying across the table to knock into marbles in play, drop it off the table, or (most irritating to some) place it so gently that it was not sufficiently randomized.

I don’t remember personally, but I’m told that my maternal grandfather called the center the “knockout hole” and also enjoyed the game with family, when it was first released.

This is another family game that I’d like to revive, particularly with my mother-in-law visiting.

It would be easy enough to make a copy of the classic version, with a couple pieces of hardboard.  But not easy enough, right this moment.  It would be even simpler to just print out the board on paper.  After all, the historical predecessor that this is a variation of was normally made out of cloth, and many other modern versions are flat boards that don’t need holes.  It’s just a matter of using more typical game tokens rather than marbles.  These can be borrowed from another game, or, to preserve more of the original flavor, I can use glass stones from a Pente set.  (I see my “Vintage Pente Game (1977)” being sold for $125 now!)

The Project

My idea is to make it modular.  Not only does this make each module fit easily on a standard-sized sheet of printer paper, but it can then be customized for the specific situation.  Any number of players can be accommodated by using 3, 4, 5, 6 (or more?) modules, and there will be no dead zones or asymmetry from using a board that accommodates more players than are in play.  It also allows arranging the different colors in any order, to suit seating preference.

So, each module is the section of track that forms a “W” shape, with the center row being one player’s home.  The modules will overlap at the first/last spots, which form the inside corners.  A spot for the knockout hole needs to be positioned in the center, separately.

The track module is 6 spots tall and 5 wide.  So, it should be easy to lay out on a grid using Adobe Illustrator.

New Variations

Now, Aggravation is a specific variation owned by Hasbro.  The larger family of Pachisi (पचीसी) games is ancient, though, and there are many westernized commercial versions of the idea.  So, if I come up with my own unique variation, I can publish it as my own game that needs a suitable name.

I think the modular nature is key to making a new variation.  Adding and removing modules can be a dynamic effect of game play, not just a configuration made before starting the game.  I found a similar game called “marbles (or pegs) and Jokers” that has a modular track.  But it doesn’t have the knock-out hole or the shortcut track of the 6-player board.  And that’s the best part!  So my variation will feature more self-intersecting tracks and shortcuts.

I have a general idea of adding/removing sections that provides for a loop at each corner, and then a flying bridge shortcut between these loops.  A player can spend chips (use poker chips) to perform the actions of adding, removing, or moving track modules.

Now here comes the clever part: whenever starting a new token out of the base, the player gets a chip.  This means that attacking a player — knocking out his tokens repeatedly so he has to start over — also makes him stronger in terms of these board-changing powers.

Another variation would be to use different types of dice, with 4, 8, 12, or 20 sides, as used in role-playing games.  Simply using a different die, perhaps scaling the board to match the range of motion from one throw, isn’t much of a change.  It would be more interesting to use different dice throughout the game, giving a speed boost for example by rolling a d12, or making it easier to snuggle up the tokens in the “home” by using a d4.  I don’t have any good ideas yet on how to decide when to allow these choices.

…work in progress…

Files

I’ll post PDF and Illustrator files when ready.

 

Yahtzee in Chinese — scorecard adaptor

My mother-in-law is visiting from China, and one of the games my family has always played while I was growing up is Yahtzee.  Although now I have official published score cards, I recall as a small child that we originally had home-made sheets produced by my Great Aunt Harriett.  They probably date from about the time the game was first introduced and popularized: I read in Wikipedia that innovations made by Yahtzee® over the traditional forms include the upper-section bonus.  The sheets I remember did have an upper-section bonus, but had the Big/Little straight idea as shown for Yacht, and I recall it had a Pair.  So, it had to have been influenced by the E.S.Lowe product some time after 1956, and I know my parents were playing it in the 1960’s.

I’ve played Yahtzee since I was old enough to understand it, sometimes in large family gatherings with parents and grandparents.  It was always a favorite of my Mom’s.

So naturally I thought it would be great to play during the holiday season with my mother-in-law’s visit.  The catch is that she doesn’t speak English.

Yahtzee-adaptor

I had an idea to make, not a translated score sheet to use in place of our English sheets, but an adaptor.  Originally, I thought to make a stiff card, printed on letter-size paper, that the sheet would attach to using paperclips.  So, it would contain translations of the score information (the first two columns of the printed sheet) that exactly line up with the rows of the score sheet, placed to the left of the sheet; and general notes and instructions that could point exactly to the rows they referred to.

So, the main design work involved exactly duplicating the grid spacing and position on the sheet.  That did not seem as simple as it should be in Illustrator, so I posted a question on StackExchange.  I quickly got comments that Illustrator doesn’t do that easily but InDesign has a table tool.

While playing (initially using translations on a separate sheet), I noticed that it was proving difficult to use columns beyond the first two games.  So I modified the design to take care of this also:  I reversed the attachment idea, and now have the card attached on top of the score sheet.  The new legend appears on the right, and can be shifted over for each new game.  This turned out to be a good design in general as now the explanation can continue to the right of the name, as wide as desired.

The question then became how to attach the papers when using the rightmost columns on the score sheet?  There is no paper beyond that to clip under the card.  I solved this by having the card both in front and behind the score sheet at the same time: a careful cutout and fold, and the letter-size page can provide a “tail” that supports the entire width of the score sheet in any position, from behind.

As planned, the adaptor is lined up such that the papers can be aligned on their vertical edges, and then two paperclips will hold them perfectly.  I do suggest using smaller clips than I have in the photos: less than one inch long, and they’ll naturally cover only the title area above the part you write on.  The photo above shows the adaptor card positioned to the right of the Game 5 column (nearly at the right border), and the score sheet is clipped to the folded-back strip along the entire top edge and holds securely.  I printed on 32# HP Premium printer paper.

A final improvement concerned the scoring.  I noticed some confusion in remembering that some rows used “total of all dice” and others used just the matching dice.  So I color-coded the different scoring schemes in the “score description” column, as well as color matching the row where the upper total is copied down below.  And as long as I was shading things, I shaded the “name” column to indicate which rows represented turns to be scored, as opposed to totals and bonuses.

Here is the  PDF File of the Simplified Chinese Yahtzee Scorecard Adaptor.  Be sure to print as “actual size” and simply allow the margins to be cut off as needed.  This is free to use with attribution, under the CC-BY 4.0 license, so here is the InDesign file.  If you make one for a different language (just change the text in my layout) or other variations, let me know and I’ll link to it or include it here.

Updating Urth

Isaac Asimov wrote several stories starring Wendell Urth, a leading extra-terrologist with a fear of all mechanical forms of transport.  In February 2011 I thought I’d make my own character modeled after him, with some explicit homage to the Asimov stories.

The character I invented is Professor Charles Xavier Urth, set in present day.  In his office he has a signed magazine cover framed on the wall, from 1955, inscribed “to the real Doctor Urth ⊕”.  The implication is that Dr. Robert Urth, Charles’s grandfather, was someone Asimov knew from his circles of interest in mysteries, Sherlock Holmes, and Nero Wolfe.

I pulled out a plot idea I’ve had in the back of my mind for an even longer time, where a glitch or paradox or some kind of problem with quantum mechanics and/or time-travel experiments causes the universe to simply seal up the offending region behind an impenetrable wall: like an oyster secreting a pearl around an irritant.

I purposefully begin the story in a manner typical of Asimov’s stories, opening with the main character’s name.  Furthermore it’s a non-action type of activity.

Although the story is told in the third person, I want the way in which events are perceived and related to follow the perceptions of the character currently in focus.  For example, when the efficient secretary is interacting with the agents they are referred to by name, but when the professor is dealing with them they are simply “the first agent” etc., since he didn’t notice them as individuals or remember their names.

With these things in mind, I pointed the character at the situation to see what happened.  Of course, nobody knows the details of the anomaly at the beginning, so I don’t have to either!

Of course, my Professor immediately sets off to do something uncharacteristic, flying off to visit the anomaly.  I wanted a more explicit link with the mysteries of Nero Wolfe, where Professor Urth doesn’t travel.  Revisiting the story, I think he would do more telecommuting with a multi-window screen of video conferences to various grad students and in-field assistants.  After all, my own work meetings can have 16 windows on my 70″ living room screen with participants all over the world.

Another note about the character’s name: Professor X (Charles Xavier) has a logo that is a circled cross, like the mathematical symbol ⊗. Urth’s logo looks like a circled plus: ⊕, the astronomical symbol for Earth (this symbol is used in Asimov’s story The Key).

The idea of a mindquake comes from GEB (Gödel, Escher, Bach).

Short Story beginning/draft — “The Pearl”

Copyright 2011 by John M. Długosz

Professor Charles Xavier Urth was doing what he did best. His first task that morning upon arriving at his campus office was to read some scientific papers, magazine articles, and other clippings that were prepared for his attention. About two thirds of the way through the first paper, he had an idea. Reflecting in thought, one thing suggested another, and so on for a chain of half a dozen topics. Finally, the sixth thing—which was totally unrelated to the subject of the paper—apparently bore a relationship to some other topic that he had not previously realized.

A “mind quake” ensued, with symbols in his brain rearranging themselves in light of this new revelation. Normal people do such reorganization during a certain phase of sleep, but never on such a scale or for such an extended period of time. His brain activity was very much like that of a normal person in deep sleep in some ways, but also like a normal person having a simple partial seizure.

Professor Urth sat at his desk, seemingly relaxed, his eyes half closed, the print-out in front of him forgotten, for two hours. He still had not moved when the visitors arrived.

Nancy checked on Professor Urth every fifteen minutes when he was in such a state. When she came back into the outer office, she found the two gentlemen standing by her desk.

By way of introduction, one of the visitors took out a wallet-sized case and unfolded it to produce a very official-looking badge and a laminated photo ID. He spoke his name as she read it: “I am Special Agent Sebastian Floyd, from the F.B.I.” He pronounced it so you could hear the periods. “My partner is Special Agent Christopher Friday.” Friday nodded in her direction at the mention of his name.

“We are hoping The Professor,” (you could hear the capital letters) “can help us with an investigation. It is a matter of some importance.”

“Of course,” replied Nancy. “Professor Urth will be available within the hour, and if you care to wait here, I’ll make sure he sees you immediately.”

Floyd was not used to waiting. “It is most urgent. Can we see him right away?” He was not sure but was under the impression that Urth was in his office. In his work, busy was not the same as absent, and he was used to interrupting people and making his Official Business take precedence.

“The Professor Must Not Be Disturbed at the moment.” Nancy made sure Floyd could hear her capitals. Her face wore the expression of a stern school mom. Then she softened her expression and explained, “If you want his help, you most certainly want his willing cooperation. And undoubtedly you want to make use of his talents. So, you need to accommodate his ways. His talent has needs of its own.”

She sat down at her desk, and motioned to the row of chairs along the wall by the door. There were seven guest chairs in the outer office: how many students and suppliants of various importance had waited here for an audience, and for how long? The agents had been briefed, and were prepared to wait however long it took. But it didn’t hurt to check first, if only out of habit.

The agents sat down, and Friday, the junior agent, took out a tablet and started reading some documents. Research materials and ideas for this case, presumed Floyd. Floyd was more pure detective than technician, and preferred to wait with his own thoughts for a while.

Professor Urth’s fits never lasted more than three hours. Through long practice, he had trained himself to limit them and apply them to useful purposes. If he showed any unusual signs, or failed to come around in time, Nancy would take action. So, he continued, in comfort and safety, to juggle symbols in his mind. Later, when asked about a particular problem in group theory, or certain practical applications of it, which include self-assembly of nanostructures and the unification of Dark Energy with the rest of known physics, Urth would realize that a particular long-standing mathematical problem could in fact be solved, and give instructions on doing so. If anyone ever asked.

While the visitors sat, Nancy checked on Urth twice more. The second time, something about his demeanor prompted her. She tapped his hand and called softly, “Professor…” Not expecting an immediate reaction, she went to the small refrigerator tucked under a table along one wall. She took out a juice pouch, shook it for good measure, and, ignoring the thin straw and all instructions printed on the pouch, used scissors to clip a corner and poured most of it into an elegant blue Chinese teacup. She parked the filled cup out of reach and tapped Urth’s hand again. She finished the remainder of the juice in the pouch herself while she waited.

As the Professor emerged from his fit, Nancy placed the drink in his hand, purposefully folding her hands over his larger one to mold his fingers to the cup. When his eyes showed that he was lucid, once again sharing this mundane world with other people, she briefed him. “You have visitors. Two F.B.I. agents wish to see you. Shall I send them in now?”

Special Agent Floyd repeated his show with his credentials, this time in the inner office. Urth showed no interest in looking at them, or even in remembering the man’s name. Furthermore, Urth waved off the agent’s introductory spiel about how he came to be referred to Urth and (presumably) that there was some great problem that needed to be solved. The Professor trusted that if the agent got this far and got in to see him, it must be an issue for him to at least hear out. He didn’t care who the agents knew that got them referred here; they were here now.

“Tell me about the problem.”

“Well,” said the first agent, “there is a ‘physical anomaly’ at the Organon Research facility. We, by which I mean everybody, are at a complete loss.”

The second agent handed Urth his pad. The display showed a room containing most of a large sphere. It appeared to be about three meters in radius; if it was a complete sphere, the lower half must be mostly in the room below, with portions in the room above and adjacent. A small amount of space was visible through the door along the curving wall of a giant pearl.

“This just appeared out of nowhere,” the second agent explained. “It is a perfect sphere and cuts through the walls, floor, and ceiling, but is mostly in this lab. A witness in the next room reports that the curved surface simply appeared along his lab’s wall, with no sound or disturbance. It was, as far as anyone can tell, just there.”

Professor Charles Xavier Urth was shaken. More than anyone, he knew that the universe works according to a set of rules. Not only did he know those rules to a great level of detail, but he intuitively grasped interrelationships and deep underlying principles at work. A macroscopic object appearing out of nowhere violated conservation laws that were a direct result of the deepest principles underpinning the universe. If it was merely hoaxed, it would be a serious puzzle to solve. But if it was genuine, it would be a prize indeed.

The realization that it meant travel to an unfamiliar place only slightly dented his excitement. “OK, let’s go.”

The photographs did not do justice to the anomaly. Seen in person, the surface of the sphere was, to a first approximation, a metallic gray that was diffuse rather than mirror-like. However, it had a luster and finish that looked simply strange. It was not like anything else, but not in any obvious way. It was not iridescent because it didn’t show color fringes, and yet it seemed to shimmer. It had enough shading to make it appear properly round, but less so than any real object, like it was drinking up some of the light rather than scattering it. Sighting along the edge, it seemed to bend light slightly. When not looking at an edge, it was hard to tell exactly where the surface was.

“Is it safe to touch?”

Illustrating his answer, Davis Holden placed his hand against the weird surface. “Everyone who first saw it touched it before realizing how remarkable it truly was. Nobody has had any complaints.”

Professor Urth gingerly touched it with his fingertips. “It’s cold,” he observed.

Holden explained, “It’s the same temperature as the air, but it conducts heat very well, so it feels cold to the touch. Our experiments indicate that it might be a superconductor of heat, in fact. The research into the anomaly is next door.”

Where the wall intersected the sphere it formed a circle about two meters in diameter. It bulged into the room, but left the bulk of the lab space available for normal use. The original contents of the lab had been removed to make room for the instrumentation and people tending the instruments.

“Just about everyone dropped what they were working on and came to study this, and brought whatever kind of tools they had.”

Someone gave Urth a fresh printout detailing the physical properties that had been determined. Optically, it scattered all light with 100% reflection, but not in the normal manner that forms specular highlights; rather, it seemed exactly backwards, preventing light from reflecting near the angle of incidence, so it looked gray and strangely flat. Light grazing the limb was bent sharply away from the expected reflection angle. In the infrared, it radiated as expected for its temperature. That it was the same temperature as the surroundings was verified with contact thermocouples. That experiment produced a secondary result: tape would not stick to it.

It was a possible superconductor of heat, but a very good electrical insulator. Magnetic fields did not pass through it. Neither did X-rays, but it had not yet been determined whether they reflected in the same manner as the laser light. An alpha particle experiment was in progress; the immediate observation was that at least some particles bounced off.

The few chemicals available showed no reaction. Nothing managed to chip or even mar the perfect surface.

Urth added to the list mentally from his own observations. It was moving with the Earth, being held by the cutouts it produced in the walls and floor. If it was too massive, it would have crashed through the building. “Is anyone trying to determine its mass?” The researchers all shook their heads. “I think it might be important.” This he said to Holden. He would make sure someone worked on it.

He continued with his next subject. “I want to know more about Dr. Saunders’ work.” Urth knew from the briefing he read on the plane that the lab space at the center of the anomaly was formerly occupied by the project of Doctor Jarod Saunders. It seemed to have been occupied by Dr. Saunders himself as well, since he could not be located, his car was parked in the lot, and there was every reason to suppose that he was working in the lab at the time of the incident.

Holden explained, “It was some kind of quantum computer.”

Urth stopped him. “I read the report that the F.B.I. people gave me. I need to know details! What was unique about it? What equipment was in the room? What was the precise nature of the experiment he was conducting? Where are his notes?”

Dr. Saunders, as it turned out, kept meticulous notes. They were kept in the computer that was used to interface with the experimental system, which was currently located well inside the anomaly.

GCD from “From Mathematics to Generic Programming”

For my birthday my sister got me a copy of From Mathematics to Generic Programming by Alexander Stepanov and Daniel Rose.  I had noticed it on Amazon.com and flagged it on my wish list, then forgotten about it, so it was a nice surprise and yet something that she knew I would be interested in.

The bulk of it is about pure math.  It covers the history of math with a small number of continued examples showing how progress was made when domains were generalized.  Euclid described an algorithm for the Greatest Common Measure, which was geometric in nature.  Much later, zero was invented, and a couple hundred years after that it was realized that a line segment could have zero length; the Greatest Common Divisor is the same algorithm using division and remainders on whole numbers.  Then came algebraic structures, and the same algorithm can be applied to Gaussian integers, polynomials, and beyond.  Noether then studied the question of what, in the most abstract sense, the algorithm can work on, and described the Euclidean domain (or ring).

Returning to the conventional approach of using integers, the function they present is:

//gcd from page 59
template <typename N>  // N is some kind of integer
static N gcd(N a, N b)
{
    using std::swap;
    while (b != N(0)) {
        a= a % b;
        swap (a,b);
    }
    return a;
}

Pretty mundane in this day and age.  But earlier microcomputers did not have an instruction for division; I recall figuring out how to implement multiplication on an 8-bit computer in the early 1980s.  Even now, I can call up a modulo operation with a single symbol and know it maps to a single machine opcode… but that instruction is not very efficient.  I was curious as to just how heavy it is, and found a report giving instruction timing details measured for various architectures.  On the Intel Haswell, I saw it was indeed something of an anomaly in terms of how many micro-ops it issues, how many compute ports it sends them to, and how long the latency is.  Apparently it is a microcoded routine that works out the division just like you would code it in the old days; just built-in and already perfectly optimized.

Stein’s Algorithm

Later, the book goes on to describe Stein’s algorithm.  Written for a computer in 1961, it uses bit-shift and subtract and other very primitive operations which are especially simple on binary encoded integers, with no need for division.

template <typename N>
static N gcd(N m, N n)
{
 
//	if (m < N(0))  m= -m;
//	if (n < N(0))  n= -n;
    if (m == N(0))  return n;
    if (n == N(0))  return m;
 
    // m>0 && n>0
 
    int d_m= 0;
    while (even(m)) { m>>=1; ++d_m; }
 
    int d_n= 0;
    while (even(n)) { n >>=1;  ++d_n; }
 
    // both are now odd
    while (m != n) {
        if (n>m) std::swap (n, m);
        m -= n;
        do m >>= 1; while (even(m));
    }
 
    // m == n
 
    return m << std::min (d_m, d_n);
}

How does Stein’s algorithm fare today?  Sure it avoids the big fat modulo operation, but it has a lot of steps.

Here is a summary of my results.  Values are in nanoseconds for one call of the function, having run a trial of 16 million calls.  I’m using Microsoft Visual C++ 2015, and running on an Intel Xeon E3-1276 v3 @ 3.60GHz.
[Results table image: nanoseconds per call for each gcd variant]

The first four columns show the code from the book, compiled as 32-bit and 64-bit.  In 32-bit builds, the modulo operation on 64-bit values is not a single instruction, and the built-in microcode is not accessible.  Stein’s algorithm is a big win in this case.

For 64-bit code, the modulo operation is significantly slower for signed values than for unsigned (though all test values were unsigned).  Stein’s algorithm’s time is between those two, so it’s faster when using signed 64-bit integers.

Because the % operation is so much slower on signed values than unsigned, if there were a need to handle signed arguments, a wrapper function could take the absolute value of each argument and then call a version that uses unsigned values.  Remember that any time you are doing division or modulo: unsigned types are faster.
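Here is a minimal sketch of such a wrapper (the name and details are just for illustration): convert each argument to the matching unsigned type, negating through the unsigned type so even the most negative value is handled, and then call the gcd above instantiated for unsigned values.

template <typename N>   // N is a signed integer type; needs <type_traits>
static N gcd_of_signed (N a, N b)   // illustrative wrapper, not code from the book
{
    typedef typename std::make_unsigned<N>::type U;
    const U ua= a < N(0) ? U(0)-U(a) : U(a);    // |a| without overflowing on the minimum value
    const U ub= b < N(0) ? U(0)-U(b) : U(b);
    return N(gcd (ua, ub));                     // the %-based gcd above, instantiated for U
}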

I then called the 64-bit integer form with values that would all fit in 32 bits.  Unlike multiplication, division can’t easily quit early when the numbers are smaller.  But, fewer iterations are needed through the loop, so the function is faster.  Likewise, the Stein algorithm will loop less in every stage.  The built-in division regains its lead, as it speeds up more.  That is, the Stein algorithm’s complexity is a steeper curve considering the length of the input values in bits.

template <typename N>
bool odd (N n)  { return bool (n&1); }
 
template <typename N>
bool even (N n)  { return !odd(n); }
 
⋮
 
int d_m= 0;
while (even(m)) { m>>=1; ++d_m; }
 
int d_n= 0;
while (even(n)) { n >>=1;  ++d_n; }

Stein’s algorithm is written to avoid division, but what it uses instead is a lot of test-and-branch instructions.  The fragment above removes all the trailing zero bits and remembers how many there were.

Today’s Bottleneck Issues

Each operation is very simple: test the least-significant bit, shift right one bit, increment an int.  But the branch is a killer!

With a modern superscalar highly-pipelined processing core, a conditional branch breaks the pipeline.  The x86 processors use branch prediction and speculative execution to mitigate this, but here the branch decision is essentially random on every iteration.  It guesses wrong half the time.

Meanwhile, the instruction set does feature primitive opcodes to determine how many consecutive zero bits are at one end or the other of a register.  I can access that via a compiler intrinsic, and eliminate the problematic loops.

#include <intrin.h>    // for the _BitScanForward64 intrinsic (MSVC)
 
template <typename N>
static N gcd(N m, N n)
{
    if (m == N(0))  return n;
    if (n == N(0))  return m;
 
    unsigned long index;
    _BitScanForward64(&index, m);
    int d_m= index;
    m >>= d_m;
 
    _BitScanForward64(&index, n);
    int d_n= index;
    n >>= d_n;
 
    // both are now odd
    while (m != n) {
        smaller_first (n, m);
        m -= n;
        _BitScanForward64(&index,m);
        m >>= index;
    }
 
    // m == n
 
    return m << std::min (d_m, d_n);
}

There is still one loop remaining.  This main while loop ought to be predicted to loop and only fail the prediction when it’s ready to exit.

The column in the results table labeled “BitScan” gives the results after making this change.  All of a sudden, Stein’s algorithm beats the built-in div code in every case!  In the case of small values in 64-bit signed integers, it is 3× the speed!  Interestingly, the running time seems to be determined by the size of the values, without regard to the register size used: 32-bit values run at the same speed whether the code is compiled to use 32-bit integers or 64-bit integers.  That makes sense, since the various primitive operations (unlike multiplication and division) are constant speed and gain no advantage from using a smaller word size.

But I still have one more trick left. if (n>m) std::swap (n, m); is a branch point, and it will go one way or the other many times as it loops.  That is, this is another “bad” branch.

A long time ago, branching was far worse than it is on today’s CPUs.  I recall writing high-performance graphics code when any branch was a penalty.  I learned to absorb the conditional logic into the math, and also learned how to generate and manipulate masks in various clever ways.  Here is my quick attempt at a purely computational compare-and-swap:

template <typename N>
void smaller_first (N& n, N& m, std::false_type)
{
    using std::swap;
    if (n>m) swap (n, m);
}
 
template <typename N>
void smaller_first (N& n, N& m, std::true_type)
{
//        cout << "smaller_first input n=" << n << ", m=" << m;
    typedef typename std::make_signed<N>::type sN;
    const auto bitcount= 8*sizeof(N);
    const N mask= (sN(n)-sN(m))>>(bitcount-1);  // all 1's if m is larger, all 0's if m is smaller
//        cout << ",  mask=" << mask;
    const N smaller= (m&~mask)|(n&mask);
//        cout << ",  smaller=" << smaller;
    const N larger= (m&mask)|(n&~mask);
//        cout << ", larger=" << larger << endl;
    n= smaller;
    m= larger;
}
 
template <typename N>
void smaller_first (N& n, N& m)
{
    typedef typename std::is_signed<N>::type tag;
    return smaller_first (n, m, tag());
}

It only works on signed types, so I made the template choose the regular way for unsigned types and the non-branching trick for signed types.  (Actually, as written here it would work for unsigned types if the values were constrained more.)

It works by first doing a subtraction: if n-m produces a negative result, that means the highest bit is set in 2’s complement representation.  So if m is larger, that high bit will be set, and if m is smaller it will be clear.

Then the arithmetic right shift replicates this one bit to fill the entire word size: now I have all 1s if m is larger and all 0s if m is smaller.  In either case, the ones-complement (~) operator will reverse that.

Each input value is then ANDed with a mask: one of the masks will always produce zero regardless of the input, and the other will pass the value through unchanged.  The two results are then ORed together, which yields whichever value wasn’t zeroed out.

This is done twice, once for the larger and once for the smaller, by reversing the mask choice.
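To make that concrete, here is the trick traced with small 8-bit values of my choosing, where n = 3 and m = 7 (so m is larger), relying on the arithmetic right shift just as the code above does:

    int8_t n= 3, m= 7;
    int8_t mask= int8_t(n - m) >> 7;            // 3 - 7 = -4 = 11111100; shifting in the sign bit gives 11111111
    int8_t smaller= (m & ~mask) | (n & mask);   // (7 & 00000000) | (3 & 11111111) = 3
    int8_t larger=  (m &  mask) | (n & ~mask);  // (7 & 11111111) | (3 & 00000000) = 7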

This would appear to use a lot of registers:  the two input values, the output values (which can’t destroy the input values before it is done), the 1 mask and the 0 mask.  But after computing the masks, the rest of the steps are not dependent on each previous step so they can all be done at the same time in various execution units, or at least scheduled optimally.

The result indicates that all this math and register usage is cheaper than branching: 44 becomes 39 or 37, 84 becomes 75.  This is about an 11% speed up in the overall algorithm.

Additional experiments indicate that the final test (doing the min at the very end) doesn’t benefit from using this method.  It is only performed once, at the end of the function.  Likewise the tests at the beginning for an argument of zero don’t have a noticeable effect (they are almost always “not taken” and when they are the routine is fast because it doesn’t have to do the main part at all).

What about cmov?

This post on Stackoverflow reminded me that the x86/x64 has long since addressed this with the cmov, or conditional move, instruction.  It makes sense that an if statement containing a simple comparison controlling only simple assignment statements could cause cmov instructions to be generated.  However, my attempt to write it that way showed that no cmov was used and conditional branching was still being done.  Yet further on in the function, a cmov is seen, and it comes from a use of std::min.  Looking at the headers, I see std::min is written as a ternary expression in the assignment, not an if statement.

Presumably the std templates are carefully crafted to trigger the kind of code that results in optimization, or the optimizer is written to handle the code generated by the standard headers, or cycles of feedback between the two.

So writing:

template <typename N>
void smaller_first (N& n, N& m)
{
    const N old_n {n};
    const N old_m {m};
    n= std::min(old_n,old_m);
    m= std::max(old_n,old_m);
}

does, on this compiler version, cause cmov instructions and no branching; although it does unnecessarily do the comparison twice.  Other than the lack of branching, the machine code appears rather poor to me.  But here’s the real performance:

[Results table image: nanoseconds per call for the smaller_first variants at each word size]

The min/max code wins hands down on 16-bit values.  It does worse than the jumping code in all other cases, so the funny arithmetic is still the winner for the signed types where it is supported.

Here is the 16-bit case, which is fast:

cmp         ax,r8w  
movzx       ecx,r8w  
cmovl       r8w,ax  
cmp         cx,ax  
cmovl       cx,ax  
movzx       eax,cx  
 
sub         ax,r8w

The values for m and n start out in AX and R8W, and are reversed if necessary so the subsequent use (the subtraction) uses those same registers.

The generated code uses one additional register, ECX, to hold the original value of R8W (zero-extended to 32 bits) in case the subsequent line copies AX into R8W.  Then the test is performed again between AX (still unchanged) and CX (holding the original R8W), and based on that result may copy AX into CX.  Then it moves CX (which may be the value from the first copy or the conditional second copy) into EAX, again zero-filling it to 32 bits.

I suppose there is a performance penalty for moving into 16-bit register fields, so the (non-conditional) moves use zero-extending moves to 32-bit fields.  Yet it uses the 16-bit portions in other places, and never reads the larger portions even though they are known to be zero or are simply ignored later.

This code could be simplified by removing the second CMP and then reversing the condition of the second CMOV.  That second CMOV may be free though, since it can be done simultaneously with an adjacent instruction, and removing it would not free up resources that any other instruction is waiting on.

You would think this identical code would work for 32 bits or 64 bits, just by using the full width of the same registers.  But no, here’s what it generates for 32-bit values:

mov         ecx,r8d  
mov         dword ptr [rsp+0F0h],ecx  
mov         dword ptr [rsp+0E8h],edx  
lea         rax,[rsp+0E8h]  
cmp         edx,r8d  
lea         r8,[rsp+0F0h]  
cmovge      rax,r8  
mov         r8d,dword ptr [rax]  
lea         rax,[rsp+0E8h]  
lea         r13,[rsp+0F0h]  
cmp         ecx,edx  
cmovge      rax,r13  
mov         edx,dword ptr [rax]  
 
sub         edx,r8d

Besides using EDX and R8D for the two values, it uses ECX for a copy as with the first case, but also uses RAX, R8, and R13 as 64-bit pointers, and stores the values in stack-based memory locations as well, [RSP+0F0h] and [RSP+0E8h]!

The values are stored in memory, with pointers to those addresses in two new registers.  The comparison is done with the values (still) in registers, but the pointers are what are conditionally copied, and then the resulting pointer is de-referenced to get the matching value.  Besides chewing up 3 more registers it requires 3 memory accesses!

The 64-bit case works the same way, with the value-carrying registers and memory allocations widened.

The “Blood Moon” is a dull ember only just visible in a hazy sky

My photos of the lunar eclipse did not turn out well when the moon was reaching totality: basically, underexposed because the moon was (nearly) gone! I recall from earlier shots that 1/125 second was about as slow as would work, due to motion blur from atmospheric effects and the moon’s motion. So I left it at 1/125 with maximum aperture (f/5.6), and increased the ISO as the moon disappeared.

However, I integrated 12 exposures taken as a burst, giving essentially 12/125 or about 0.1 second. Even though the exposures were made within the space of 2 seconds, each one showed the image in a different position, which illustrates why a longer exposure is blurry. By chopping it into separate short exposures I was able to manually align the separate images.

Lunar Eclipse “Lantern”

This simply adds the pixel sample values together. Dedicated software, such as used with astronomical instruments, would do better at removing the random noise as part of the process. I did noise reduction on the combined exposure.

Yes, the sky really is purple.  There was a visible haze here, and later clouds were visible over the moon.  I calibrated the white balance on an exposure of the normal full moon taken just after the eclipse ended, setting that to neutral grey.  The same profile was applied here, so the red tone is visible and accurate.

The last bit of direct light was just touching the limb, and that is pure white and overexposed in this image.  By eye, the area between the white tip and the red blush did appear more turquoise (blue/green), but that’s a perceptual illusion due to the fact that it’s simply less red than the neighboring region.  These colors did not show up in the true-color photo.  I suspect that dark colors next to a full-sunlight bright spot affect the eye differently than they do the camera sensor.

Also notice how the upper-right blends into the same shade as the surrounding sky.  That’s how dark it appears: only just an ember separating itself from the empty haze.

The picture loses something in the JPEG compression, and the sRGB color space is disappointing after viewing it in Photoshop in the full gamut available to my monitor.  But you get the general idea.