Category Archives: Teaching Computing

The real answer to the FizzBuzz interview question

A popular interview question is coding FizzBuzz.  I’ve had experiences similar to what is shown at Coding Horror — I had a description of a trivial piece of code that, with careful posing of the requirements, would offer insight into the depth of an applicant’s engineering skill.


I had a sheet with this written down, so I would always say it in the same carefully-crafted way.  I lost that some years ago, but here is the problem from memory:

Write a class whose constructor takes two integral arguments, and has members that return their sum and their difference.

Trivial enough?  Not even a “problem” with logic to be considered; just a routine microtask that anyone with any fluency can express without having to think about it.

What I expected was to see how well the applicant processed the statement of requirements: making assumptions, or asking for clarification.  I figure an experienced engineer would ask what was meant by integral, or use a typedef to abstract the decision of which integer type to use and specify it at only one point.  More fluent coders would write it as a template.

I was very surprised to find that most applicants who got past the screening process thus far had trouble writing a C++ class at all!  The best insight I had was from someone who explained that he usually uses the IDE and doesn’t know the details of syntax on how to write a class definition.  Because of the reliance on what Visual Studio calls wizards, we dubbed these people Wizards’ Apprentices with a reference to the segment from Disney’s Fantasia.  The story is simple: the young apprentice tries to use his master’s power but cannot control it; what we would recognize as a buggy program.

What are you showing them?

Now suppose that you are fluent in C++ (or whichever language you are using) and would have no trouble jotting down a piece of code that does what you want.

I see many solutions to FizzBuzz, as well as to many other problems, that are written in a way completely unlike what we see in “real” code.  Someone who can effortlessly produce this might be in “scripting mode”, which is used for single-use small programs — written quickly, sloppily, and highly specialized.  It does not illustrate good software engineering practices, and the interviewer can’t tell whether the applicant always (only) writes like that, or is capable of working on projects like theirs: large bodies of code with a need for maintainability and testability, following good practices, and so on.

So, I thought about FizzBuzz as if it were a small utility feature that was ordered for use in a large project.


Certainly, a short piece of code can be written using up-to-date style and best practices as applicable for software development at the scale that is involved in real projects at work.

But I also thought about how much “good engineering” I can illustrate without making the result not-so-trivial.  Here is what I came up with:

  1. Separate the logic from I/O.
    Most solutions to simple problems like this mix console I/O with the logic, which is completely unlike real programs in multiple ways.  It should not directly solicit input from the user as if he was a text file, or print output in the middle of the work; the “solution” should feature a function that takes parameters and produces results to return.
    As with real projects, this can be called by a unit-testing main program as well as find a home in the application that needs that feature.
  2. One thing that bugs me about the typical implementation of FizzBuzz is the explicit repetition of code, checking each factor and often the combination as well.  Don’t duplicate code that varies only by a data value!  Instead, store the values and loop over them.
  3. Be mindful of future enhancements.
    Code always gets more complex over time.  It is an art forged by long experience to balance future maintenance with extra work now.  Don’t program in features that are not needed — rather, anticipate what will be changed or added later and consider that in the overall design, and allow for such things when it does not add complexity to the task at hand.  (I can probably write a whole essay on this topic.  Message me if you have examples or general thoughts.)


I was most successful with point 2, which makes the code simpler and shorter.  Point 3 takes the form of putting the configuration at the very top of the file and showing how it can be extended; and the extensibility is free thanks to the architecture noted in point 2.

There are the needed #includes at the top of the file, and then the interesting part starts off with:

constexpr std::pair<int, const char*> fbchart[] {
    { 3, "Fizz" }, { 5, "Buzz" },  // standard FizzBuzz
    { 7, "Boom" }  // easily generalized and extended by adding to this chart!
};

When I wrote this, Visual Studio’s C++17 compiler was still in pre-release, and part of the exercise for me was to use new features and see where style needed to be updated.

“constexpr is the new static const.”  Naturally, this tabular data will be declared as constexpr.

Now, why did I use const char* rather than an object type?  This was a deliberate choice.  First of all, the use of this table (described later) is flexible in what it can take, so I don’t require a std::string here even though that’s what will be used in the constructed object.  There is no need to store a std::string object here, which would make another copy of the lexical string literals, and string does not have a constexpr constructor.

The usual worries about primitive C-style strings do not apply since this is constant data.  After all, do you worry about writing cout<<"hello"; and demand that you store that in a string object before passing it to the function?

A hot-off-the-press alternative would be std::string_view.  But there is simply no need here.  I chose not to use gsl::zstring, since it would be the only use of the Guidelines Support Library in the program, and there is no question that C-strings are being used since they are right there, and only right there.  This is not a function parameter that needs documenting.

Likewise for the use of the plain array, rather than std::array.  Arrays are not as poor as they used to be: with the preference for free functions begin, end, size, etc., what does the object wrapper do that the plain array doesn’t?  Only the regular value semantics of passing and assigning — and I’m not doing that.

string fbencode (int n)
{
    string retval;
    for (auto& [div, codeword] : fbchart) {
        if ((n%div)==0)  retval += codeword;
    }
    if (retval.empty())  retval= std::to_string (n);
    return retval;
}

Here is the meat:  a simple stateless function that accepts the number as a parameter and produces the results to return as a string.

The loop has a one-line body, and it is automatically iterated over the data table shown earlier.  That is why adding another item just works.  This code is smaller and simpler than the cascade of tests you normally see.

// and to drive it:
int main() {
    using boost::irange;
    for (auto n : irange(1,101)) {
        cout << n << ": " << fbencode(n) << '\n';
    }
}

In main, I avoided a legacy for loop by using Boost.Range 2.0, which is tried and true for production work.  I’ve not tried to get the latest Visual Studio compiler drop to swallow Range-v3 yet.


Not only does this show fluency in C++ with up-to-date skills, it is well crafted and provides a number of discussion points.  As summarized along with the code samples, I can explain why I did things a certain way, alternatives considered, what I might do under different circumstances, etc.

It is good to understand the constructs used and choose them purposefully, not apply things in a cargo-cult or voodoo programming situation.  On either end of the interview process, understand that simple coding problems can give far more insight than simply finding out whether the applicant can “write code” at all.

Error Handling Concepts (again)

In Ancient Times, I was privileged to be among those who deliberated upon a fundamental question in C++ programming philosophy: “Whither exception handling?”  Exception handling was still in the conceptual stages then, and the air was charged with excitement.

Twenty-two years ago, I wrote an article for Windows Tech Journal recounting my experience a few years earlier in comparing the error handling mechanisms of exceptions vs. error code returns.  In particular, I discovered that using both was an impedance mismatch, with code always wrapping one kind to convert it to the other:  catch the exception and return an error code, or test for an error code and throw an exception.  That experience set me on a road leading toward a pure exception handling approach.

I’ve scanned in the magazine pages, and they can be seen on my web server.

Now, the C++ community is coming full circle it seems.  Everything old is new again.

In the 2017 standard we have std::optional which is the most basic “sum type”. In a talk Andrei Alexandrescu started with the example of the variant and asked the audience “Why is it called a ‘sum type’?” My own answer: because you don’t know its type but it returns something!

Seriously, in this presentation he also said, “Remember when I told you what the worst thing that happened to humankind was?  I give you std::optional.”  He introduces the expected class (still being argued about in committee as I write this) as the right solution, giving the best of both worlds:  it can return a result or an error code, and the caller can take the step of checking the error code, or just access the result without checking, in which case it throws an exception if it’s actually in an error state.

Meanwhile, Niall Douglas rips expected.  Based on the peer review results of Boost.Outcome, he notes that many of the findings apply to the current expected proposal.

There’s no substitute for experience

This is part of a series of posts I’m writing about using toy projects and other exploration to get a hands-on feel for the new C++17 features, library, and style, as well as the behavior and real limitations of the compiler.

I coded up a very simple recursive descent parser, since it’s been noted that the new std::string_view is especially suited for parsers.  (There is a lot of sentiment about string_view being problematic and evil, but that’s another story.)

Now a recursive descent parser is one of the few places where, it is generally acknowledged, throwing an exception as a form of control flow is genuinely a good decision.  But this simple grammar doesn’t have enough depth to need anything like that.

The interpreter (it does the work as it parses, as opposed to building a tree representation to be evaluated later) throws exceptions in case of errors.  The user of the class will know this as the way it gives errors on bad input or run-time conditions.

The parsing step itself, though, makes heavy use of std::optional.  As is the nature of such a parser, a production (grammar rule) might be called where this thing may or may not exist: optional parts in the syntax and alternatives and lists all lead to logic that says, “read one of these; did you get one?  No? OK, skip it (or try reading one of those instead).”

Other callers need to read something, period.  In that case, the caller needs to check for an error (just as it did when it was optional) and throw an exception.  This code is what gave me déjà vu:

skip_ws (input);
auto LHS= read_identifier(input);
if (!LHS)  raise_error(input, 4);
read_required ('=', input);
auto terms_result= read_terms(input);
if (!terms_result)  raise_error (input, 5);
if (!input.empty())  raise_error (input, 3);
set_value (*LHS, *terms_result);
return *LHS;

The function read_required throws an exception itself if it can’t do what was asked.  read_terms and read_identifier, like most functions modeling grammar productions, return an optional-wrapped value.

Call a function, then check the return value and throw.  This is done repeatedly in these functions.  That is exactly the kind of mismatch I saw all those years ago.

From the nature of the optional returns, it is the caller who decides on the error code.  In a more complex grammar, I can easily see wanting to propagate that or modify it as it passes back up the call tree.  But in the case of optional, there is no error code coming up — just the lack of a result.

In testing this with different syntax error test cases, I found places where I was not checking the return code.  This can coredump because dereferencing the optional does not do any checking on whether it contains a value.  (On the other hand, there is a value() member function that does check.)  I guess I’m so used to writing such that functions do what they are asked (or don’t return at all) that writing in a style where every call is followed up by an explicit test is challenging as well as ugly and obfuscating.

It’s back to the assessment I made when promoting exceptions in the first place:  look at this block of code — what does it do?  Its main purpose is a bunch of if statements and errors.  Where is the real logic I came here for?  The testing is drowning out the real work.

Would using expected/outcome be better for this?

If the called function loaded the result with an error code rather than just the lack of result, and the attempted dereference would throw an exception, then I would not have to check after every call but still could when I wanted to.

But… it is the caller that knows what error code to assign.  Furthermore, what about the specifics of the exception and its payload?  If productions returned an error code (just an enumeration), the exception still needs to be the parser_error type and include the cursor position.

It’s like the (deeply nested) called function needs to interact with the caller to formulate the proper error.  Catching and modifying and re-throwing is something that will not go over well with modern implementations, as throwing is very slow.

I’m reminded of the exception class I wrote and used long before C++ was a standard: I included a feature to simulate dynamic scoped values-of-interest.

Parsing is one case where we really care about details across several levels of function call, in order to get meaningful feedback on the error.  In more normal circumstances, you tend to rely on logging.  Even so, having a speculative log — include this line if an error is logged; flush it upon success — is very helpful and cuts back on the spew of irrelevant information logged.

But the same technique could be used to generate a more meaningful exception, adding information as the stack is unwound.

Besides catch to deal with a problem after it occurs, and destructors to automatically clean up resources, we need some way to prepare for a possible error.  This would be primarily for reporting, not for recovery actions.  That’s probably why it has been neglected as a topic — destructors are perfectly good at cleanup and using them means we don’t have to explicitly prepare.


How fast was Knuth’s computer?

In my previous post, I wondered what the power of Knuth’s computers was at the time TAOCP was being written.  Someone suggested the IBM S/360 series as an exemplar.  That turned out to be a particularly good idea, since I’ve written programs for the S/370 in assembly language, so I’m familiar with it.  Models 30, 40, and 50 were released in April 1965.  On the pricier side were models 65 and 75.  Here is a scanned “System Summary” describing the various models in detail.  So, I suppose somewhere between 100 and 900 kilo instructions per second.  A larger machine would probably be servicing multiple users.  Fifty years later, my Xeon E3-1276 is supposedly around 133 billion instructions per second.

Interestingly, the S/360 takes many (like 13 on average) clock cycles to perform one instruction.  Meanwhile each core of the Xeon performs 8½ instructions in one clock cycle.  I suppose the clock speed of the S/360 is the cycle time for the internal microcode.

But what’s an instruction?  On the S/360, I would not need the decode function at all, but would just sum the digits directly using unpacked decimal.

int decode (std::initializer_list<cellT> lst)
{
	int total= 0;
	for (auto digit : lst)
		total= total*10+digit;
	return total;
}

The modern CPU knows only binary arithmetic on various word sizes. So converting from a decimal digit-per-byte requires 4 iterations on two operands doing a multiply and add: at least 16 distinct instructions (if the loop is unrolled), plus the actual add once that’s all done.

Interestingly, the x64 code generated by the compiler doesn’t actually issue a multiply instruction in this loop. In fact, the entire expression does not use the regular ALU! There is neither a MUL nor an ADD instruction there. Instead, it exploits the address generator to do stuff that has nothing to do with actual pointer addresses. The complicated addressing modes of the CISC processor mean that a separate address generator unit has a variety of things it can compute, yet it is far more limited than a completely general ALU. So, it is much simpler and thus faster.

In particular, Scaled Index mode looks like this: [ebx + ecx*S + constant]. Register ebx is the base, and ecx is used as an index here. The index can be used directly, or scaled by 2, 4, or 8. If the same register is used in both positions, you can multiply it by five! The LEA instruction is Load Effective Address, and gives the processed address without fetching what it resolves to like a MOV would. So, if we have total in EAX and the digit in EBX, the generated code is essentially:

lea eax, [eax + eax*4]
lea eax, [ebx + eax*2]

The first instruction multiplies by five. The second instruction not only multiplies by two, but also adds in the digit as the base of the addressing mode.

I also found it interesting how the S/360 line anticipated what we have today:  one compatible instruction set, but pricey implementations have more pipelining and faster clocks; also they keep adding more layers of cache memory.  The “processor storage” housed with the CPU is analogous to the L2 cache.  Adding external memory modules gives more storage but slower: 8 microsecond access time.  If you add pairs of modules you can go dual-channel and double the throughput.  Finally, later high-end models added extra-high-speed memory to keep up with the CPU, and that is analogous to our L1 cache.

Back to the quantitative comparisons:  The modern machine has 4 independent cores, but my program only used one.  If a brute force problem required a significant amount of time, it could be split up into 4 tasks.  At full utilization, 133 billion vs 133 thousand, more or less.  That’s a factor of about one million.  With the single thread, a quarter of that.  30 ms on one core would be about 8½ hours on a S/360-50 using it exclusively for this job.

Knuth’s suggestion of 10! can be scaled up by a million.  That puts the new threshold between 15! and 16!.  Now in terms of exponential growth, realize that an institutional computer like that cost about 1000 times more than a personal desktop computer today.  So computing power per constant dollar (not adjusting for inflation) has indeed grown by a factor of about one billion.

For floating-point calculations, the difference in power over 50 years is a few orders of magnitude higher.  A $300 GPU card can do 4 teraflops?  That means it would have been a world-class supercomputer as recently as 2005!


Permutation Puzzle: “send+more=money”


I saw this puzzle on Bartosz Milewski’s blog, with an entry on using monads in C++.  I’d like to hold it up as an example of a completely different lesson for C++ programmers:  a central idea I want to teach is know your libraries.

I recall it was in the early ’90s, when the STL for C++ was submitted as a proposal for inclusion in the upcoming standard.  I noticed there were algorithms called next_permutation and prev_permutation, and wondered how they work — how do you order permutations, and re-arrange your collection into the next one, without keeping any auxiliary state?  Then I wondered what I would ever use such a thing for.  Well, nearly 25 years later, I found out!

You should look through the list of algorithms every once in a while just to see what’s there.  Otherwise you might only know about the common ones that you use.  Consider the analogy with a tool (say, a special bent screwdriver only used to attach doorknobs) that you know is in the very back of the drawer, though you may need to rummage around to locate it.  Remembering you have that tool makes for a quick job.  Having to rig up something from common parts instead (to continue the analogy, use locking pliers to grab a short screwdriver bit from the side) is not as good, and certainly more work.

So… 8 nested for loops followed by 7 if statements containing 28 conditions?  Get real!  If you have a line that reads });});});});});});});}); then the code review will be a sight indeed.

Solution in C++ w/standard library

Here’s the meat of my solution:

using cellT = int8_t;

cellT A[10] {0,1,2,3,4,5,6,7,8,9};

void solve1()
{
    do {
	auto [ig1,ig2, s,e,n,d,m,o,r,y] {A};
	int send= decode({s,e,n,d});
	int more= decode ({m,o,r,e});
	int money= decode ({m,o,n,e,y});
	if(send+more==money) {
	    // record the solution (pushed onto a vector, as described below)
	}
    } while (std::next_permutation(std::begin(A), std::end(A)));
}

You’ll notice that besides the uniform initialization syntax introduced in C++11, this uses something you might not have seen before (if you’re reading this in 2017).  Hot off the press in C++17 is structured bindings.

	auto [ig1,ig2, s,e,n,d,m,o,r,y] {A};

This defines 10 variables and assigns all the elements of the array to them.  The first two are ignored so I used scratch names, and the others are simply the names of the letters in the puzzle.

One thing I have to point out from Milewski’s listing is the call to turn a word into a numeric value.  He writes:

int send = asNumber(vector{s, e, n, d});

This creates a std::vector on every use.  Let me elaborate: it allocates a block of memory from the heap (vectors can’t use a small-vector optimization).  Then after the call returns it is deallocated.  Then the next two lines do the same thing.  And that happens on every iteration.

The constructor for std::vector takes this handy literal list.  Now in C++ these containers are not special language features, but are ordinary libraries.  It should be clear that anything they do — cool syntax or unusual ability — you can do yourself in your own code!  My version of the same construct does not create a vector, doesn’t require more words to make the call, and most importantly does not have any overhead.

int send = decode({s,e,n,d});

And here is the function that takes such an initializer list:

int decode (std::initializer_list<cellT> lst)
{
	int total= 0;
	for (auto digit : lst)
		total= total*10+digit;
	return total;
}

The solving function doesn’t print the results because I want to time just the solving logic.  So the solutions are pushed onto a vector, and the caller prints them after stopping the clock.  In a real program, this might be an object (not globals) and the results available in the object afterwards, or as the return value from the solving function.  In another post I’ll make it lazy.

Make it faster

This simple function found 50 solutions, one of which doesn’t have a leading zero.  It ran in 39.6 milliseconds, trying all 3,628,800 permutations.  That’s 92 million iterations per second.

The value of 10 factorial is an interesting number here.  Donald Knuth, in The Art of Computer Programming, wrote that this is about the size that separates things that may be done by brute force from things that are impractical to simply try all possibilities.  Volume 3 was published in 1973.  I hazard to guess that computers now are about 2³⁰ (or about a billion) times the power of the machines that were around when he wrote that.  A billion times 30 milliseconds is about a year.  So, I revise that to more like ten million times the speed, if I postulate that he could have run this program to completion in a matter of days.

Anyway, to make it faster, I need to skip over permutations that are “the same” as one I already rejected.  The order of the two ignored digits doesn’t change the solution, so if I decide that one order is canonical and skip over the whole lot when the other is detected, that cuts the number of iterations in half.

So how do you skip over states in the next_permutation algorithm?  I looked it up — a StackOverflow post described it well and also pointed to a Wikipedia page on it.  The states are generated in lexicographical order, so when a given digit changes everything to the right is in reverse sorted order, and it “rolls over” to advance that digit by one and everything to the right is now in forward sorted order — the lowest value of that substring.

So, when I identify a digit value that I know will not be a solution no matter what the other digits are, I can skip to when that digit is advanced again by reversing everything to the right of it.

void skipfrom (int pos)
{
    std::reverse (std::begin(A)+pos, std::end(A));
}

    do {
	auto [ig1,ig2, s,m,e,d,y,n,o,r] {A};
	if(ig1 > ig2) {
	    skipfrom(2);  // non-canonical order: skip this whole batch
	    continue;
	}

Indeed, it still found 50 solutions but the iteration_count showed half the number: only 1.8 million times through the loop.  However, the time only went down to 26ms — about two thirds the time, not half.

We also don’t want solutions with a leading zero, so filter those out too.  Notice in the listing above I changed the order of declaring the digit variables.  It doesn’t matter to the solution algorithm, but putting these farther to the left means I can skip over more.

	if(s==0) { skipfrom(3); continue; }
	if(m==0) { skipfrom(4); continue; }

That didn’t save much though: 1.45 million iterations in 22 ms.

Another big constraint can be found on the right side of the puzzle.  I would think that parsing the arrangement of digits into ints would be slow, seeing as that involves multiplying by 10 and adding each digit.  Looking at the rightmost (units) digit only, the puzzle has d+e=y with a possible carry.  Test that before parsing the int values, and furthermore skip ahead until one of those three digits changes again.  To that end, the declaration order has d, e, and y next after the previous items we wanted on the left.  This leaves only 3 letters to the right, so each time the units digits don’t work it can skip 6 iterations.

I added a counter to that, and see that it happened 219,360 times.  The loop only executed 355,053 times now, taking a mere 4.7 ms.

Faster yet?!

Notice in the listing that I declared a type cellT for the array of digits and anything that holds a digit.  My thought was that keeping the array elements small would save in parameter passing to decode.  Keeping the size of the data small is generally good for memory caching, but it probably doesn’t matter here because I don’t have enough data to exceed the L1 cache anyway.

But, I can change the type in one place in the code and it affects everything in the program.  I changed it from int8_t to plain int (int is 32 bits here), and…  the time was down to 3.4 ms!

64-bit was the same.  16-bit values were the slowest, at 5.2 ms.  So, the CPU is inefficient at loading 8-bit values and worse with 16.  I don’t know if the actual load is slower, or the compiler generated somewhat different code rather than just using MOVSX instead of MOV.  This was a difference of a quarter of the execution time over the entire algorithm, and loading values is only a part of what it does, with values kept in registers once loaded.

GCD from “From Mathematics to Generic Programming”

For my birthday my sister got me a copy of From Mathematics to Generic Programming by Alexander Stepanov and Daniel Rose.  I had noticed it and flagged it on my wish list, then forgotten about it, so it was a nice surprise and yet something that she knew I would be interested in.

The bulk of it is about pure math.  It covers the history of math with a small number of continuing examples showing how progress was made when domains were generalized.  Euclid described an algorithm for the Greatest Common Measure, which was geometric in nature.  Much later, zero was invented, and a couple hundred years after that it was realized that a line segment could have zero length; the Greatest Common Divisor is the same algorithm using division and remainders on whole numbers.  Then came algebraic structures, and the same algorithm can be applied to Gaussian integers, polynomials, and beyond.  Noether then studied the question of what, in the most abstract sense, the algorithm can work on, and described the Euclidean domain (or ring).

Returning to the conventional approach of using integers, the function they present is:

//gcd from page 59
template <typename N>  // N is some kind of integer
static N gcd(N a, N b)
{
    using std::swap;
    while (b != N(0)) {
        a= a % b;
        swap (a,b);
    }
    return a;
}

Pretty mundane in this day and age.  But, earlier microcomputers did not have an instruction for division.  I recall figuring out how to implement multiplication on an 8-bit computer in the early 1980’s.  Even now, I can call up a modulo operation with a single symbol and know it maps to a single machine opcode… but that instruction is not very efficient.  I was curious as to just how heavy it is, and found a report giving instruction timing details measured for various architectures.  On the Intel Haswell, I saw it was indeed something of an anomaly in terms of how many micro-ops, how many compute ports it sends them to, and how long the latency is.  Apparently it is a microcoded routine that works out the division just like you would code in the old days; just built-in and already perfectly optimized.

Stein’s Algorithm

Later, the book goes on to describe Stein’s algorithm.  Written for a computer in 1961, it uses bit-shift and subtract and other very primitive operations which are especially simple on binary encoded integers, with no need for division.

template <typename N>
static N gcd(N m, N n)
{
//	if (m < N(0))  m= -m;
//	if (n < N(0))  n= -n;
    if (m == N(0))  return n;
    if (n == N(0))  return m;
    // m>0 && n>0
    int d_m= 0;
    while (even(m)) { m>>=1; ++d_m; }
    int d_n= 0;
    while (even(n)) { n >>=1;  ++d_n; }
    // both are now odd
    while (m != n) {
        if (n>m) std::swap (n, m);
        m -= n;
        do m >>= 1; while (even(m));
    }
    // m == n
    return m << std::min (d_m, d_n);
}

How does Stein’s algorithm fare today?  Sure it avoids the big fat modulo operation, but it has a lot of steps.

Here is a summary of my results.  Values are in nanoseconds for one call of the function, having run a trial of 16 million calls.  I’m using Microsoft Visual C++ 2015, and running on an Intel Xeon E3-1276 v3 @ 3.60GHz.

The first four columns are the code from the book, compiled as 32-bit and 64-bit.  In 32-bit builds, the modulo operation on 64-bit values is not a single instruction, and the built-in microcode is not accessible.  Stein’s algorithm is a big win in this case.

For 64-bit code, the modulo operation is significantly slower for signed values than for unsigned (though all test values were unsigned).  Stein’s algorithm’s time is between those two, so it’s faster when using signed 64-bit integers.

Because the % operation is so much slower on signed values than unsigned, if there were a need to handle signed arguments, a wrapper function could take the absolute value of each argument and then call a version that uses unsigned values.  Remember that any time you are doing division or modulo:  unsigned types are faster.

I then called the 64-bit integer form with values that would all fit in 32 bits.  Unlike multiplication, division can’t easily quit early when the numbers are smaller.  But, fewer iterations are needed through the loop, so the function is faster.  Likewise, the Stein algorithm will loop less in every stage.  The built-in division regains its lead, as it speeds up more.  That is, the Stein algorithm’s complexity is a steeper curve considering the length of the input values in bits.

template <typename N>
bool odd (N n)  { return bool (n&1); }
template <typename N>
bool even (N n)  { return !odd(n); }
int d_m= 0;
while (even(m)) { m>>=1; ++d_m; }
int d_n= 0;
while (even(n)) { n >>=1;  ++d_n; }

The Stein’s algorithm code is written to avoid division, but what it is using is a lot of test-and-branch instructions.  The above fragment removes all the trailing zero bits and remembers how many there were.

Today’s Bottleneck Issues

Each operation is very simple: test the least-significant bit, shift right one bit, increment an int.  But the branch is a killer!

With a modern superscalar highly-pipelined processing core, a conditional branch breaks the pipeline.  The x86 processors use branch prediction and speculative execution to mitigate this, but here the branch decision is essentially random on every iteration.  It guesses wrong half the time.

Meanwhile, the instruction set does feature primitive opcodes to determine how many consecutive zero bits are at one end or the other of a register.  I can access that via a compiler intrinsic, and eliminate the problematic loops.

template <typename N>
static N gcd (N m, N n)
{
    if (m == N(0))  return n;
    if (n == N(0))  return m;
    unsigned long index;
    _BitScanForward64 (&index, m);
    int d_m= index;
    m >>= d_m;
    _BitScanForward64 (&index, n);
    int d_n= index;
    n >>= d_n;
    // both are now odd
    while (m != n) {
        smaller_first (n, m);
        m -= n;
        _BitScanForward64 (&index, m);
        m >>= index;
    }
    // m == n
    return m << std::min (d_m, d_n);
}

There is still one loop remaining.  This main while loop ought to be predicted to loop and only fail the prediction when it’s ready to exit.

The column in the results table labeled “BitScan” gives the results after making this change.  All of a sudden, Stein’s algorithm beats the built-in div code in every case!  In the case of small values to 64-bit signed integers, it is 3× the speed!  Interestingly, the running time seems to be determined by the size of the values, without regard to the register size used: 32-bit values run at the same speed whether the code is compiled to use 32-bit integers or 64-bit integers.  That makes sense, since the various primitive operations (unlike multiplication and division) are constant speed and don’t have any advantage to using a smaller word size.
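As an aside, the zero-stripping idiom isn’t tied to MSVC; here is a sketch of a portable equivalent (the function name is mine) using the GCC/Clang builtin:

```cpp
#include <cstdint>

// Count and strip the trailing zero bits in one step, with no data-dependent loop.
// __builtin_ctzll is the GCC/Clang counterpart of _BitScanForward64; C++20 adds
// std::countr_zero as a standard spelling.  The argument must be nonzero.
int strip_trailing_zeros (uint64_t& v)
{
    int count = __builtin_ctzll (v);
    v >>= count;
    return count;
}
```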

But I still have one more trick left.  The statement if (n>m) std::swap (n, m); is a branch point, and it will go one way or the other essentially at random as the loop runs.  That is, this is another “bad” branch.

A long time ago, branching was far worse than it is on today’s CPUs.  I recall writing high-performance graphics code when any branch was a penalty.  I learned to absorb the conditional logic into the math, and also learned how to generate and manipulate masks in various clever ways.  Here is my quick attempt at a purely computational compare-and-swap:

template <typename N>
void smaller_first (N& n, N& m, std::false_type)
{
    using std::swap;
    if (n>m) swap (n, m);
}

template <typename N>
void smaller_first (N& n, N& m, std::true_type)
{
//  cout << "smaller_first input n=" << n << ", m=" << m;
    typedef typename std::make_signed<N>::type sN;
    const auto bitcount= 8*sizeof(N);
    const N mask= (sN(n)-sN(m))>>(bitcount-1);  // all 1's if m is larger, all 0's if m is smaller
//  cout << ",  mask=" << mask;
    const N smaller= (m&~mask)|(n&mask);
//  cout << ",  smaller=" << smaller;
    const N larger= (m&mask)|(n&~mask);
//  cout << ", larger=" << larger << endl;
    n= smaller;
    m= larger;
}

template <typename N>
void smaller_first (N& n, N& m)
{
    typedef typename std::is_signed<N>::type tag;
    return smaller_first (n, m, tag());
}

It only works on signed types, so I made the template choose the regular way for unsigned types and the non-branching trick for signed types.  (Actually, as written here it would work for unsigned types if the values were constrained more.)

It works by first doing a subtraction: if n-m produces a negative result, that means the highest bit is set in 2’s complement representation.  So if m is larger, that high bit will be set, and if m is smaller it will be clear.

Then the arithmetic right shift replicates this one bit to fill the entire word size: now I have all 1s if m is larger and all 0s if m is smaller.  In either case, the one’s-complement (~) operator will reverse that.

Each input value is then ANDed with a mask: one mask will always produce zero regardless of the input, and the other mask will have no effect.  Then both of them are ORed together, which results in the one that wasn’t all zero.

This is done twice, once for the larger and once for the smaller, by reversing the mask choice.
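The trick can be demonstrated standalone; this is a sketch with a function name of my own, using 32-bit values for brevity and assuming the usual arithmetic right shift on negative signed values:

```cpp
#include <cstdint>

// Branch-free ordering of two values, per the mask trick described above.
// Assumes n-m does not overflow, as with the signed version in the post.
void branchless_order (int32_t n, int32_t m, int32_t& smaller, int32_t& larger)
{
    uint32_t mask = uint32_t ((n - m) >> 31);      // all 1's if m is larger, all 0's otherwise
    smaller = int32_t ((uint32_t(m) & ~mask) | (uint32_t(n) & mask));
    larger  = int32_t ((uint32_t(m) &  mask) | (uint32_t(n) & ~mask));
}
```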

This would appear to use a lot of registers:  the two input values, the output values (which can’t destroy the input values before it is done), the 1 mask and the 0 mask.  But after computing the masks, the rest of the steps are not dependent on each previous step so they can all be done at the same time in various execution units, or at least scheduled optimally.

The result indicates that all this math and register usage is cheaper than branching: 44 becomes 39 or 37, 84 becomes 75.  This is about an 11% speed up in the overall algorithm.

Additional experiments indicate that the final test (doing the min at the very end) doesn’t benefit from using this method.  It is only performed once, at the end of the function.  Likewise the tests at the beginning for an argument of zero don’t have a noticeable effect (they are almost always “not taken” and when they are the routine is fast because it doesn’t have to do the main part at all).

What about cmov?

A post on Stack Overflow reminded me that the x86/x64 has long since addressed this with the cmov, or conditional move, instruction.  It makes sense that an if statement containing a simple comparison controlling only simple assignment statements could cause cmov instructions to be generated.  However, my attempt to write it that way showed that no cmov was used and conditional branching was still being done.  Yet further on in the function, a cmov is seen, and it comes from a use of std::min.  Looking at the headers, I see std::min is written as a ternary expression in the assignment, not an if statement.

Presumably the std templates are carefully crafted to trigger the kind of code that results in optimization, or the optimizer is written to handle the code generated by the standard headers, or cycles of feedback between the two.

So writing:

template <typename N>
void smaller_first (N& n, N& m)
{
    const N old_n {n};
    const N old_m {m};
    n= std::min(old_n,old_m);
    m= std::max(old_n,old_m);
}

does, on this compiler version, cause cmov instructions and no branching; although it does unnecessarily do the comparison twice.  Other than the lack of branching, the machine code appears rather poor to me.  But here’s the real performance:


The min/max code wins hands down on 16-bit values.  It does worse than the jumping code in all other cases, so the funny arithmetic is still the winner for the signed types where it is supported.

Here is the 16-bit case, which is fast:

cmp         ax,r8w  
movzx       ecx,r8w  
cmovl       r8w,ax  
cmp         cx,ax  
cmovl       cx,ax  
movzx       eax,cx  
sub         ax,r8w

The values for m and n start out in AX and R8W, and are reversed if necessary so the subsequent use (the subtraction) uses those same registers.

The generated code uses one additional register, ECX, to hold the original value of R8W (zero-extended to 32 bits) in case the subsequent line copies AX into R8W.  Then the test is performed again between AX (still unchanged) and CX (holding the original R8W), and based on that result may copy AX into CX.  Then it moves CX (which may be the value from the first copy or the conditional second copy) into EAX, again zero-filling it to 32 bits.

I suppose there is a performance penalty to move into 16-bit register fields, so the (non-conditional) moves use zero extension moves to 32-bit fields.  Yet it uses the 16-bit portion in other places, and never reads larger portions even if they are known to be zero or simply ignored later.

This code could be simplified by removing the second CMP and then reversing the condition of the second CMOV.  That second CMOV may be free though, since it can execute simultaneously with an adjacent instruction, and removing it would not free up resources, because no other instructions are dependent on its result.

You would think this identical code would work for 32 bits or 64 bits, just by using the full width of the same registers.  But no, here’s what it generates for 32-bit values:

mov         ecx,r8d  
mov         dword ptr [rsp+0F0h],ecx  
mov         dword ptr [rsp+0E8h],edx  
lea         rax,[rsp+0E8h]  
cmp         edx,r8d  
lea         r8,[rsp+0F0h]  
cmovge      rax,r8  
mov         r8d,dword ptr [rax]  
lea         rax,[rsp+0E8h]  
lea         r13,[rsp+0F0h]  
cmp         ecx,edx  
cmovge      rax,r13  
mov         edx,dword ptr [rax]  
sub         edx,r8d

Besides using EDX and R8D for the two values, it uses ECX for a copy as with the first case, but also uses RAX, R8, and R13 as 64-bit pointers, and stores the values in stack-based memory locations as well, [RSP+0F0h] and [RSP+0E8h]!

The values are stored in memory, with pointers to those addresses in two new registers.  The comparison is done with the values (still) in registers, but the pointers are what are conditionally copied, and then the resulting pointer is de-referenced to get the matching value.  Besides chewing up 3 more registers it requires 3 memory accesses!

The 64-bit case works the same way, with the value-carrying registers and memory allocations widened.

HOTEL alfa sierra kilo echo lima×2

From earlier posts, you know that I’ve been learning Haskell. Now it’s one thing to memorize some syntax rules, and be able to discuss intelligently the semantics and ramifications of some particular language feature, even typing a line or two at an interactive prompt. It’s another to face an empty page in a text editor with the intent of writing a program.

For whatever reason, it occurred to me to do something with the spoken (phonetic) alphabet.  It’s good to pick something you already know how to do, so as to concentrate on the mechanical details of the tools instead of the problem.

The Main Loop

To that end, I started with looking up how to fetch command-line arguments (simple output was already covered from hello world) and spent a long time pondering just how to structure the loop over the arguments.

Pretty basic, right?  With pure functions, given something to do on one input value (say, foo) and an ordinary list of input values, you can process all of them using any of several ways of saying “map”.

map foo values
foo <$> values
fmap foo values
mapM foo values
forM values foo

The problem is that both the thing I want to do on each value and the function that obtains the list of values are “in” the IO monad. So instead of one token to put these two things together (whether a word like map or an infix operator like <$>), I need two separate tokens for monad composition and the mapping, and had further trouble combining the two things.

Separate steps for the monad and the map isn’t too unreasonable, but is just a little bit. There are different levels of wrapping, like having different levels of indirection. f $ g is one way to write simple function application, and the entire thing g is the argument of one call to f; f <$> g uses a different symbol to mean apply f to what’s inside g treated as a container, so f is called multiple times with values taken from inside g.  There are any number of different wrapping layers and different things can be wrapped in different numbers of layers, so having unique marks for each possibility is prohibitive.  There are ways to bump things up in general, so f g is plain function application and liftM f g tells f to look inside its argument.

So, given that the values I have are double wrapped and I’m targeting the middle layer (neither the innermost nor the outermost) it seems reasonable that an extra mark of some kind is needed to specify, in addition to a mark that says “put these two things together”, making two in all.

The other combining problem is harder to explain:  Given one function (foo) that produced something that you wanted to feed to another function (bar), simply combining them is a no-brainer, you would think.  In a C-like language, bar(foo()) is pretty obvious.  In Haskell you don’t even use the parentheses, so it’s just bar foo.

You could use an intermediate variable name, like

s= foo();
bar (s);

but you don’t have to.  In fact, one of the things wrong with using so-called OUT parameters is that they ruin the expressiveness: you can’t simply chain calls together.  And this kind of chaining is very much embraced by functional style and Haskell in particular.

If foo is actually getArgs, and bar is a function that takes a list of strings (don’t even worry about the mapping at this point—just write a function that explicitly takes a list), you can’t do it!

main = do
    s <- getArgs
    mapM_ body s

badmain = mapM_ body getArgs

Writing that without the named intermediate value and do-block, the “obvious” main = mapM_ body getArgs doesn’t work.

When I was pondering this as I was working on it, I concluded that the only thing that can be done to getArgs is the bind operation (>>=).  Now, maybe that’s not quite true if common helpers exist that themselves use bind, such as the liftM mentioned earlier.  Ideally I’d somehow mark the argument that is wrapped too much for the function, as the first argument (the body) is fine and there may be variations like liftM2 but not one that skips lifting some arguments and then lifts others.  Would ap or <*> mixed with $ work here?  That’s something to try next time.

Meanwhile, my first concern was with writing the for-each construct in a clear manner, without separating body into its own named function since it isn’t that complicated.  I wondered what the common idiom might be.

No matter how you arrange the pieces though, I needed to name the parameter for the individual value/iteration being processed.  In common imperative “structured” languages, naming the loop control variable is a basic part of the syntax of a for-each looping construct.  Using a higher-order function instead, writing a lambda expression for the body was a bit clunky.  In any case, the layout started giving me trouble, so I gave up on that and just made body a separate named function.

What I ended up with is (starts with, anyway)

main =
    getArgs >>= mapM_ body

and with the body being a single word, the arrangement of the components (looping construct, list of inputs, body) is not nearly so important.

Look-up Table

Haskell’s go-to data structure is a list.  There is no simple way to make an O(1) random-access array.  I mean, there are in fact such arrays as libraries, but to populate one you need a list anyway.  Such an array would help random-access speed, but only makes creating the table more complicated.

For key-value lookup, the basic feature is a function called lookup.  It operates on a list of (key,value) pairs, and I really didn’t want to write out the keys a through z and pairs, but rather just list the values (the alfa bravo keywords) in order.  The zip function took care of that.

Since it is most definitely not taking any advantage of a contiguous range of keys, I decided to use that as a feature and add keywords for some odd punctuation marks and such.

phon_table= nato_table ++ digit_table ++ 
    [('.',"Period"), ('-',"Dash"), ('#',"TicTacToe"), ('_',"Underscore"), (' ',"Space"), ('@',"Snail") ]

nato_table= zip ['a'..'z'] ["alfa", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel",
                "india", "juliett", "kilo", "lima", "mike", "november", "oscar", "papa", "quebec",
                "romeo", "sierra", "tango", "uniform", "victor", "whiskey", "xray", "yankee", "zulu" ]

digit_table=  -- FAA style 
    zip ['0'..'9'] ["Zero","One","Two","Three","Four","Five","Six","Seven","Eight","Niner"]

Thinking ahead, variations can be accommodated by building the master list out of different pieces.  For example, use ITU instead of FAA code words for the digits, and add application-specific words.

Function Decomposition

Functions need to be written in small pieces.  In real-world code (using common imperative languages) I’ve seen people write what I call meandering functions, which is a lack of decomposition: one defined unit (procedure or function in the high-level language) does stuff, then does more stuff, then goes on to work on something else, etc.  Often they are much, much too long, running hundreds of lines of source code.

In Haskell, pure functions don’t have multiple statements.  That makes it hard to meander, or, without being adept at functional programming yet, even to write more than one line!

I also see lots of code that is not meandering in that the function does “one thing”, but yet is not decomposed well in that the thing it’s doing is obscured (even overwhelmed) with lower-level details that should be unimportant to “the thing” being expressed in this function.  Many programmers have a hard time with that, apparently not recognizing or even understanding what that is.

Those same programmers would more naturally break things up more in Haskell, since it is awkward to stuff more detail inside the argument of a function.  When writing a list of statements, it’s all too easy to stop what you’re really doing and issue lower-level detail work, and then put in more higher-level work.  When everything is an argument to a single function, it’s more in-your-face that you are interrupting your expression with details that should be elsewhere.

Haskell facilitates writing very small functions with features such as writing multiple branches as separate top-level functions, and nested functions which close over the variables in their parent scope.

So is it “smaller is better”, period?  How much logic should go into one function as subexpressions rather than separate named functions?  Since the stuff in a where block has the same meaning as being in-line, thanks to the parent variables still being available, naming many individual pieces of a function seems to be the way to go.  What not to break up would be logical constructs that are read as a unit, and idiomatic uses that are recognized as one phrase but would be disguised if broken up.

When I first started to formulate some code, I ran into trouble with multiple nestings of local definitions, or trying to combine them in ways that either were not supported or couldn’t work in the layout rules.

A harder thing to know how to get just right is when to make local nested functions vs top-level functions.  My instinct is to nest everything and not expose internal details, since there isn’t any level of organization and name hiding smaller than the module.

Once I got past the initial main-loop panic, most of my time was spent fiddling with just how to decompose the functions.  As an anchor point, it made sense that the principal function solving this problem would be a named function and the driver (getting the args and displaying the results) would be a sample usage of that function.  In a real project, that is what I’d be interested in, and what I have here as main would just be test code, as the function would be called by other parts of the program.

One interesting detail is that my function phon processes one character, producing one code word as a result.  It doesn’t provide a function for processing an entire string, which is something you’d probably add in typical programming languages.

Given the function phon, you can call it on a character using ordinary function application, but it’s just as easy to process an entire string so there is no need to provide another function for that!

codeword = phon ch -- mundane call
codeword = phon $ ch -- means the same thing
codewords = phon <$> s -- process whole string of input
codewords = map phon s -- same thing

It’s also especially important in Haskell to keep the physical realization of the output separate from the computation. In most common programming languages, it would be just as easy to have the principal function print the result, and multiple calls would print successive items. In Haskell it is not just as easy: the function phon in this case cannot call putStr or anything like that.

In this toy program, where phon is called right from main and nothing else is happening, writing phon as an IO function would not seem awkward.  But it’s still a deliberate act, and you’d certainly notice when making a recursive call (or trying to).

Of course, that’s good in general.  In my toy program printing the results was part of the specification, but in real projects such a function will be consumed by other code, so the result needs to be in a form that’s still internalized and can be handled however the caller wants.  Unfortunately, it also makes it difficult to add debugging trace statements to existing code, but that’s another story.

Here is the complete program, as I left it.

module Main (main) where

import System.Environment (getArgs)
import Control.Applicative
import Data.Char

main =
    getArgs >>= mapM_ body
    where
        body s = do
            putStrLn $ announce s
            putStr $ format $ phon <$> s
        announce s = "string \"" ++ s ++ "\" spoken as:"

format =
    concatMap decorate
    where decorate w =  "\t" ++ w ++ "\n"

phon :: Char -> String
phon c | isUpper c = toUpper <$> (phon $ toLower c)
phon c = 
    maybe (quoted c) (id) result
    where result= lookup c phon_table
          quoted c= ['"',c,'"']
phon_table= nato_table ++ digit_table ++ 
    [('.',"Period"), ('-',"Dash"), ('#',"TicTacToe"), ('_',"Underscore"), (' ',"Space"), ('@',"Snail") ]

nato_table= zip ['a'..'z'] ["alfa", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel",
                "india", "juliett", "kilo", "lima", "mike", "november", "oscar", "papa", "quebec",
                "romeo", "sierra", "tango", "uniform", "victor", "whiskey", "xray", "yankee", "zulu" ]
digit_table=  -- FAA style 
    zip ['0'..'9'] ["Zero","One","Two","Three","Four","Five","Six","Seven","Eight","Niner"]

I welcome comments from experienced Haskell programmers on these issues, and overall style and idioms in general.

A Spiral Approach

I’ve heard of spiral approaches to learning, meaning that material is covered without much detail and making successive passes with more detail.

My recent experience makes me think of a slightly different spiral:

You understand, then you learn more, and then you are more confused than ever.


Maybe that’s why I like this particular representation of the Riemann ζ function?

−7/4 and Interactive Play

I saw a video concerning iterative function sequences, and thought that would be a good example to play around with in Haskell.  Haskell has arbitrary precision integers, and also has rational numbers, standard.  (I hesitate to say “built in” because, like many languages, the true built-in core is minimal with layers of library code on top of that.  But it was available out of the box.)

It also has an iterate function. What’s interesting is that the result is naturally presented as an infinite list. Other languages with abstract iteration or generators will provide some kind of interface to make your custom thing handleable in a common way. Perl even has a tie feature to make your custom thing seem exactly like the built-in collections, which pushes the interface contract down to include native datatypes, rather than being layered on top of them.

But in Haskell, the generator is the native list. Rather than make some custom generating code support the iteration mechanism, the same old list holds what appears to be the complete result, as if done as an earlier step.

In some purely imperative language, you might ordinarily do things like:

  1. Select the common, standard, container type with the appropriate properties (e.g. linear, sequential).
  2. Run a computation loop and fill the container with the results.
  3. Pass the container to another procedure, which can use the contents for other work (e.g. print them to a file).

Suppose that was unsuitable for some reason and you wanted to delay the computations of step 2, running the producer in parallel with the consumer, or not bothering until/unless a value was actually used by step 3, or not knowing how many items will really be needed.  You would have to explicitly program in a more complex sequence of commands to achieve that.

Instead of a plain common collection (e.g. a C++ vector) you’d need something else.  You need to arrange producer/consumer threads with a work queue between them, or callbacks, or something.  Step 2 would be preparing the whatsit, and step 3 would still be easy enough using standard iterators to abstract the actual details of the collection/stream/whatever.

Now look at it in Haskell.  Step 1 is not much to think about, as the list is king.  Step 2 is:

lz = iterate (\x -> 2*x + 1) 0

Later, other code can get values from this list, e.g.

take 10 lz

produces a result of

[0,1,3,7,15,31,63,127,255,511]
The value lz is just another list of Integer, as far as the language is concerned. But clearly it could not have been filled in step 2 because it is as long as I need it to be, and never specified a size when filling it. It is actually an infinite list!

This certainly looks like the simple approach of: create the values, then use them. However, it actually doesn’t do any work when lz was defined, and only computes values when subsequent code reads them. Here I asked for the first 10 elements, and the list then went ahead and realized those values. If I ask for the value at index 30 (lz!!30) it needs to run the computation all the way to that element before telling me it’s 1,073,741,823.
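Since each element is one more than double the previous, the value at index n is 2^n − 1; an imperative restatement (in C++, with a function name of my own) confirms the number quoted above:

```cpp
#include <cstdint>

// iterate (\x -> 2*x + 1) 0, restated as a loop: returns the element at index i.
uint64_t element_at (int i)
{
    uint64_t x = 0;        // the seed, index 0
    while (i-- > 0)
        x = 2*x + 1;
    return x;
}
```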

This illustrates that the program here is not imperative. I didn’t program in “do this step, do the next step, etc.” Rather, I just specified that I wanted the 30th item of the list, and it did the work of figuring out just how to do that based on the definitions. It’s more like a spreadsheet: cells are not computed in order, but based on what they depend on. Haskell uses a top-down approach: when a specific value is really needed, it looks at the formula for it. Gathering the values needed for that function will set off other chains of downward dependent calculations.

Interactive Play

Using the GHCi interactive shell, it’s easy to play with little programs like this.  A more difficult problem featured in the video is

zz = abs <$> numerator <$> iterate (\x -> x^2 - 7%4) 0

Trying a line of code, then recalling it to adjust and submit again, reminded me of a time long ago when microcomputers ran BASIC (or assembly language) and were attached to the household television.

Interactive play is the way to really learn something, and with a more complex build environment or larger programs you still try and do that.  It’s the combination of a direct Read-eval-print-loop and the ability to write interesting things in a single line of text that makes this experience especially efficient.

Want to try it yourself without installing anything?  Try Haskell! is a browser-based experience that invites you to “type these out and see what happens”  (psst: copy/paste is faster if you really can’t wait to find out).

Back to the Sequence

The actual sequence generated by iteration is rational numbers, but we want to ignore the denominator.  Then, I saw that the sign of the number was preserved with numerator, so the abs stripped that off.

The result starts off as [0,7,21,7,114639] and then the numbers get very large. Index 7 is 595096023596888261906480183623645954687, and it quickly gets overwhelming and impractical to display.

  • n=4 has 6 digits (is exactly 114639)
  • n=7 has 39 digits
  • n=28 has 80,807,125 digits
  • n=29 has 161,614,249 digits
  • n=30 has 323,228,496 digits
  • n=31 has 646,456,993 digits

That’s the largest I could compute with this program.  The Windows version compiles to a 32-bit program, but I used editbin to set the 3G flag, so it had 3G of user address space instead of 2.  That allowed it to compute n=30.  Running on a Mac, which is 64-bit, only increased that by one more before getting Segmentation fault.  I suspect there are things in the implementation that top out at 4G, even when more memory is available.

Foods for Thought

Here is a little problem that is isomorphic to something in real life that I was discussing with a co-worker.  Given a list of items (of a simple enumeration type), and knowing implicitly that each item has an associated priority, return the best item of the list.  Further refinements:  an empty list produces a default 0-priority item, as would a list that contains only negative-priority items.  Oh, and there might be unexpected items present whose priority is not known to the code.

To give some concrete examples, let me define the item type Snack as the set { Nothing, Gorp, Doughnut, Broccoli, Cheese_puffs, Apple, Cookie }, along with some definition of the priorities.  Then you can give the sequence (Gorp, Cheese_puffs, Apple) and the function returns Cheese_puffs.  Given the empty sequence () it returns Nothing.  Given the sequence containing only (Broccoli), it still returns Nothing.
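In C++, the priority function might be sketched like this; the numeric priorities are invented for illustration (all that matters is that Cheese_puffs outranks Apple and Gorp, Broccoli is negative, and unknown items default to 0):

```cpp
#include <map>
#include <vector>

enum Snack { Nothing, Gorp, Doughnut, Broccoli, Cheese_puffs, Apple, Cookie };

// Hypothetical priority table: sparse, so items not listed (including Nothing) get 0.
int g (Snack s)
{
    static const std::map<Snack,int> table {
        { Gorp, 3 }, { Doughnut, 4 }, { Broccoli, -1 },
        { Cheese_puffs, 7 }, { Apple, 5 }, { Cookie, 6 },
    };
    auto it = table.find (s);
    return it == table.end() ? 0 : it->second;
}

// Best item per the rules above: empty input, or input holding only
// negative-priority items, yields Nothing.
Snack best (const std::vector<Snack>& input)
{
    Snack x = Nothing;
    for (Snack item : input)
        if (g(item) >= g(x))  x = item;
    return x;
}
```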

Since I’ve been reading about and watching videos on the subject of functional programming, I recognize this an example of a foldl operation.  In particular, see C9 Lectures: Dr. Erik Meijer – Functional Programming Fundamentals Chapter 7 of 13.

Even if we’re not using higher-order functions, it still qualifies as a design pattern, even when the language doesn’t do all the work for us.  In C++ pseudo-code, the solution ought to look like this:

result= initialvalue;
for (item : collection) result= f(result,item);

The logic of interest ought to be a succinct expression, something like result= std::max(result,item, comp); where comp is a comparison function that knows to compare the corresponding priorities rather than the numbers themselves.

There is a max_element algorithm in std (and in boost::range) that will go through a sequence and apply the max comparison to each element, as in the previous listing.  But, it doesn’t like empty lists.  It’s hard to beat a foreach loop for ease and clarity!  In C++11, the range-based for loop is the easy way to go through a collection.  You put other statements before, after, and inside it to control the details.  It’s not a function as a first-class object, but that’s still an easy way to specify such things.
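To see the empty-list problem concretely, here is a small wrapper (names mine) showing the extra case the caller has to supply:

```cpp
#include <algorithm>
#include <vector>

// max_element returns the end iterator for an empty range, so the
// "default result" case has to be handled explicitly by the caller.
int best_or_default (const std::vector<int>& v, int dflt)
{
    auto it = std::max_element (v.begin(), v.end());
    return it == v.end() ? dflt : *it;
}
```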

I’ve been a fan of the “range” idea for a long time, and think it’s a nice way to make STL more friendly.  Boost has range (whole container) versions of std algorithms, but where is the humble map algorithm?  In Boost.Range, you don’t use a range algorithm for this at all; rather, the concept of map is filled via range adaptors, with a generic map higher-order function called transformed.

Applying ideas from functional programming, the function (let’s call it g) that takes a Snack value and returns the corresponding priority can return a pair, containing the original argument as well as the result.  That is, it decorates the value but does not destroy it.  Something like this:

pair<Snack,int> x = g(Nothing);
for (auto item : input | boost::adaptors::transformed(g))  x= max(x, item, compare_second);
result= x.first;

compare_second should do the normal operator< on the .second, but it doesn’t appear as a predefined standard function.  It could be done using a lambda expression, but (in C++11) you would have to declare all the types and parameter arguments.
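For the record, here is roughly what that C++11 lambda looks like, with the pair type spelled out in full (int stands in for Snack, and the function name is mine; C++14 generic lambdas would later remove the verbosity):

```cpp
#include <algorithm>
#include <utility>

typedef std::pair<int,int> P;   // stand-in for pair<Snack,int>

// Pick the pair with the larger .second, comparing via an explicit C++11 lambda:
// every parameter type must be written out, which is the fuss the text mentions.
P max_by_second (const P& a, const P& b)
{
    return std::max (a, b,
        [](const P& l, const P& r) { return l.second < r.second; });
}
```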

But remember the earlier musing on the utility of for loops — there’s nothing wrong with using them.  Instead of putting details inside a lambda expression, don’t bother.  max and compare_second are so simple to express naturally that it’s not worth any fuss to compose canned implementations of them.  (Now if we were working with parts that had more complexity, it would be another story.)

pair<Snack,int> x = g(Nothing);
for (auto item : input | boost::adaptors::transformed(g))   {
    if (item.second >= x.second)  x= item;
}
result= x.first;

Note that this naturally handles empty lists and lists containing only undesired items, and if g returns a default priority of 0 for items it doesn’t recognize, this handles all the desired cases without any special-case code.

Once we give up on using a single function expression as the body of the loop, there is no need to have g return pairs and then extract first and second.  But the pair still has an advantage.  If you kept two distinct variables instead:

Snack x = Nothing;
int pri= 0;
for (auto item : input)   {
    int newpri= g(item);
    if (newpri >= pri)  {
        x= item;
        pri= newpri;
    }
}
result= x;

It turns out longer!  The two variables x and pri, kept separately, need to be updated as a unit.  In more complex code that invites mistakes.  We still want both the item from the input list and its associated priority, so yet another variable is created.

It is more natural in C++ to have g just return the result, though.  So let’s take the best features of both and split the difference:

pair<Snack,int> x  (Nothing, g(Nothing));
for (auto item : input)   {
    pair<Snack,int> temp (item, g(item));
    if (temp.second >= x.second)  x= temp;
}
result= x.first;

Now, remembering the associated priority along with the best-so-far assumes that g has some overhead.  E.g. the priorities come from a configuration file read at run time and stored in a fancy data structure, where items with a defined priority are sparsely represented.

But if that’s not the case, and there’s no good reason not to make redundant calls to g, you don’t have to remember it.

Snack x = Nothing;
for (auto item : input)   {
    if (g(item) >= g(x))  x= item;
}
result= x;

It’s just so annoying to me that g is called again for the same value it was just used with on the previous iteration!  If the priorities are known at compile time, you might expect the compiler to inline and optimize away the calls to g.

I will continue this later with an exploration of how it works in a functional language.

Haskell Exploration 2, or “who needs loops?”

The previous post used a one-liner to solve the problem, using a list comprehension. Trying different limits meant editing the number in the middle of the line and entering it again. That’s really the only thing you can easily do in the GHCi interactive environment—any functions or more complex code need to be in an actual source file. At least GHCi can load your source file and then let you interact with the functions defined in it.

So, here’s a function that takes the max from the problem statement as a parameter. To generalize completely, it also takes the list of factors as a parameter. After all, the original code was starting to look repetitive in its handling of the two hard-coded factors.


divisors :: [Int] -> Int -> Int
divisors factors max =
    sum [x | x <- [1..max], factor factors x]

factor factors n =
    any (\x -> n `mod` x == 0) factors
-- swap out these alternatives by leaving exactly one un-commented
-- or [b | x <- factors, let b = n `mod` x == 0]
-- or [True | x <- factors, n `mod` x == 0]


As well as recursion, the list comprehension syntax offers a way to write what would be a loop in imperative languages.  The generator portion of the construct, x <- [1..max], is like a foreach loop, though it is not limited to using one generator.  The things after the vertical bar can include generators, constraint tests, and named values to use in other things.  However, it doesn’t seem directly suited to using the factors list as a generator with other tests — the built-in meaning is an AND between constraints, not an OR.  So, push that down to another function, top-down style.  That is, sum all the numbers up to max where the number passes the test of the function factor.

I knew that there are logical functions that apply across all elements in a list, but didn’t recall how they were named.  Looking through the library, I found any first, which is even more suitable.  It reads more clearly to indicate the intent, rather than having to know that a multi-input OR can be used in that manner.  But, it needs the test packaged as a separate function.  Here I did that in-line as a lambda.

Since the point of this exercise is to work with list comprehensions, I tried it using the or function, too.  Mapping the list of factors to a list of True/False is done within the list comprehension syntax; it doesn’t need extra parens, but isn’t really clearer: it trades the lambda for a let.

But, there is no reason to populate all the False entries.  The natural use of the filtering clause is to reject elements that don’t meet the criteria.  So, accept the ones that would give True, rather than mapping to True.  A form similar to this would be useful if I wanted to list the ones that fit, rather than just finding out if any fit.

Interestingly, the first form can be written without using a lambda expression.  Plain function composition doesn’t work out, because the argument winds up deep in the middle rather than last, but section syntax lets you bind either the left or the right argument of an operator.  Sections use parentheses, and you end up needing parens around the whole composition because the infix composition operator has a lower precedence than function application.

factor factors n = any ((0 ==) . (n `mod`)) factors