Category Archives: Teaching Computing

How fast was Knuth’s computer?

In my previous post, I wondered what the power of Knuth’s computers were, at the time TAOPC was being written.  Someone suggested the IBM S/360 series as an exemplar.  That turned out to be a good idea specifically, since I’ve written programs for the S/370 in assembly language, so I’m familiar with it.  Models 30, 40, and 50 were released in April 1965.  On the pricier side were models 65 and 75.  Here is a scanned “System Summary” describing the various models in detail.  So, I suppose somewhere between 100 and 900 kilo instructions per second.  A larger machine would probably be servicing multiple users.  Fifty years later, my Xeon E3-1276 is supposedly around 133 billion instructions per second.

Interestingly, the S/360 takes many (like 13 on average) clock cycles to perform one instruction.  Meanwhile each core of the Xeon performs 8½ instructions in one clock cycle.  I suppose the clock speed of the S/360 is the cycle time for the internal microcode.

But what’s an instruction?  On the S/360, I would not need the decode function at all, but would just sum the digits directly using unpacked decimal.

int decode (std::initializer_list<cellT> lst)
{
	int total= 0;
	for (auto digit : lst)
		total= total*10+digit;
	return total;
}

The modern CPU knows only binary arithmetic on various word sizes. So converting from a decimal digit-per-byte requires 4 iterations on two operands doing a multiply and add: at least 16 distinct instructions (if the loop is unrolled), plus the actual add once that’s all done.

Interestingly, the x64 code generated by the compiler doesn’t actually issue a multiply instruction in this loop. In fact, the entire expression does not use the regular ALU! There is neither a MUL or ADD instruction there. Instead, it exploits the address generator to do stuff that has nothing to do with actual pointer addresses. The complicated addressing modes of the CISC processor means that a separate address generator unit has a variety of things it can compute, yet it is far more limited than a completely general ALU. So, it is much simpler and thus faster.

In particular, Scaled Index mode looks like this: [ebx + ecx*S + constant] Register ebx is the base, and ecx is used as an index here. The index can be used directly, or scaled by 2, 4, or 8. If the same register is used in both positions, you can multiply it by five! The LEA instruction is Load Effective Address, and gives the processed address without fetching what it resolves to like a MOV would. So, if we have total in EAX and the digit in EBX,

LEA EDX, [EBX+EBX*4]
LEA EDX, [EAX+EDX*2]
ADD EAX, EDX

The first instruction multiplies by five. The second instruction not only multiplies by two, but also adds in the digit as the base of the addressing mode.

I also found it interesting how the S/360 line anticipated what we have today:  one compatible instruction set, but pricey implementations have more pipelining and faster clocks; also they keep adding more layers of cache memory.  The “processor storage” housed with the CPU is analogous to the L2 cache.  Adding external memory modules gives more storage but slower: 8 microsecond access time.  If you add pairs of modules you can go dual-channel and double the throughput.  Finally, later high-end models added extra-high-speed memory to keep up with the CPU, and that is analogous to our L1 cache.

Back to the quantitative comparisons:  The modern machine has 4 independent cores, but my program only used one.  If a brute force problem required a significant amount of time, it could be split up into 4 tasks.  At full utilization, 133 billion vs 133 thousand, more or less.  That’s a factor of about one million.  With the single thread, a quarter of that.  30 ms on one core would be about 8½ hours on a S/360-50 using it exclusively for this job.

Knuth’s suggestion of 10! can be scaled up by a million.  That’s midway between 12! and 13!.  Now in terms of exponential growth, realize that an institutional computer like that cost about 1000 times more than a personal desktop computer today.  At computing power per constant dollars (not adjusting for inflation) is indeed about one billion.

For floating-point calculations, the difference in power over 50 years is a few orders of magnitude higher.  A $300 GPU card can do 4 teraflops?  That means it would be a world-class supercomputer as recently as 2005!

 

Permutation Puzzle: “send+more=money”

I saw this puzzle on Bartosz Milewski’s blog, with an entry on using monads in C++.  I’d like to hold it up as an example of a completely different lesson to C++ programmers:  A central idea I want to teach is know your libraries.

I recall it was in the early ’90’s, when the STL for C++ was submitted as a proposal to include in the upcoming standard.  I noticed there were algorithms called next_permutation and prev_permutation, and wondered how they work—how do you order them and re-arrange your collection to the next such, without keeping an auxiliary state?  Then I wondered what I would ever use such a thing for.  Well, nearly 25 years later, I found out!

You should look through the list of algorithms every once in a while just to see what’s there.  Otherwise you might only know about the common ones that you use.  Consider the analogy with a tool (say, a special bent screwdriver only used to attach doorknobs) that you know is in the very back of the drawer, though you may need to rummage around to locate it.  Remembering you have that tool makes for a quick job.  Having to rig up something from common parts instead (to continue the analogy, use locking pliers to grab a short screwdriver bit from the side) is not as good, and certainly more work.

So… 8 nested for loops followed by 7 if statements containing 28 conditions?  Get real!  If you have a line that reads });});});});});});});}); then the code review will be a sight indeed.

Solution in C++ w/standard library

Here’s the meat of my solution:

using cellT = int8_t;

cellT A[10] {0,1,2,3,4,5,6,7,8,9};

void solve1()
{
    do {
	++iteration_count;
	auto [ig1,ig2, s,e,n,d,m,o,r,y] {A};
	int send= decode({s,e,n,d});
	int more= decode ({m,o,r,e});
	int money= decode ({m,o,n,e,y});
	if(send+more==money) {
	    ++solution_count;
	    solutions.push_back({send,more,money});
	}
    } while (std::next_permutation(std::begin(A), std::end(A)));
}

You’ll notice that besides the uniform initialization syntax introduced in C++11, this uses something you might not have seen before (if you’re reading this in 2017).  Hot off the press in C++17 is structured bindings.

	auto [ig1,ig2, s,e,n,d,m,o,r,y] {A};

This defines 10 variables and assigns all the elements of the array to them.  The first two are ignored so I used scratch names, and the others are simply the names of the letters in the puzzle.

One thing I have to point out from Milewski’s listing is the call to turn a word into a numeric value.  He writes:

int send = asNumber(vector{s, e, n, d});

This creates a std::vector on every use.  Let me elaborate: it allocates a block of memory from the heap (vectors can’t use a small-vector optimization).  Then after the call returns it is deallocated.  Then the next two lines to the same thing.  And that happens on every iteration.

The constructor for std::vector takes this handy literal list.  Now in C++ these containers are not special language features, but are ordinary libraries.  It should be clear that anything they do — cool syntax or unusual ability — you can do yourself on your own code!  My version of the same construct does not create a vector, doesn’t require more words to make the call, and most importantly does not have any overhead.

int send = decode({s,e,n,d});

And here is the function that takes such an initializer list:

int decode (std::initializer_list<cellT> lst)
{
	int total= 0;
	for (auto digit : lst)
		total= total*10+digit;
	return total;
}

The solving function doesn’t print the results because I want to time just the solving logic.  So the solutions are pushed onto a vector, and the caller prints them after stopping the clock.  In a real program, this might be an object (not globals) and the results available in the object afterwards, or as the return value from the solving function.  In another post I’ll make it lazy.

 Make it faster

This simple function found 50 solutions, one of which doesn’t have a leading zero.  It ran in 39.6 milliseconds, trying all 3,628,800 permutations.  That’s 92 million iterations per second.

The value of 10 factorial is an interesting number here.  Donald Knuth, in the The Art of Computer Programming, wrote that this is about the size that separates things that may be done by brute force from things that are impractical to simply try all possibilities.  Volume 3 was published in 1973.  I hazard to guess that computers now are about 230 (or about a billion) times the power of the machines that were around when he wrote that.  A billion times 30 milliseconds is 460 years.  So, I revise that to more like ten million times the speed, if I postulate that he could have run this program to completion in a matter of days.

Anyway, to make it faster, I need to skip over permutations that are “the same” as one I already rejected.  The order of the two ignored digits don’t change the solution, so if I decide that one order is canonical and when the other is detected I skip over the whole lot, that would cut the number of iterations in half.

So how do you skip over states in the next_permutation algorithm?  I looked it up — a StackOverflow post described it well and also pointed to a Wikipedia page on it.  The states are generated in lexicographical order, so when a given digit changes everything to the right is in reverse sorted order, and it “rolls over” to advance that digit by one and everything to the right is now in forward sorted order — the lowest value of that substring.

So, when I identify a digit value that I know will not be a solution no matter what the other digits are, I can skip to when that digit is advanced again by reversing everything to the right of it.

void skipfrom (int pos)
{
    std::reverse (std::begin(A)+pos, std::end(A));
}

    ⋮
    do {
	++iteration_count;
	auto [ig1,ig2, s,m,e,d,y,n,o,r] {A};
	if(ig1 > ig2) {
	    skipfrom(2);
	    continue;
	}

Indeed, it still found 50 solutions but the iteration_count showed half the number: only 1.8 million times through the loop.  However, the time only went down to 26ms — about two thirds the time, not half.

We also don’t want solutions with a leading zero, so filter those out too.  Notice in the listing above I changed the order of declaring the digit variables.  It doesn’t matter to the solution algorithm, but putting these farther to the left means I can skip over more.

        ⋮
	if(s==0) {
	    skipfrom(3);
	    continue;
	    }
	if(m==0) {
	    skipfrom(4);
	    continue;
	}
        ⋮

That didn’t save much though: 1.45 million iterations in 22 ms.

Another big constraint can be found on the right side of the puzzle.  I would think that parsing the arrangement of digits into ints would be slow, seeing as that involves multiplying by 10 and adding each digit.  Looking at the rightmost (units) digit only, the puzzle has d+e=y with a possible carry.  Test that before parsing the int values, and furthermore skip ahead until one of those three digits changes again.  To that end, the declaration order has d, e, and y next after the previous items we wanted on the left.  This leaves only 3 letters to the right, so each time the units digits don’t work it can skip 6 iterations.

I added a counter to that, and see that it happened 219,360 times.  The loop only executed 355,053 times now, taking a mere 4.7 ms.

Faster yet?!

Notice in the listing that I declared a type CellT for the array of digits and anything that holds a digit.  My thought was that keeping the array small would save in parameter passing to decode.  Keeping the size of the data small is generally good for memory caching, but it probably doesn’t matter here because I don’t have enough data to exceed the L1 cache anyway.

But, I can change the type in one place in the code and it affects everything in the program.  I changed it from int8_t to plain int (int is 32 bits here), and…  the time was down to 3.4 ms!

64-bit was the same.  16-bit values was the slowest at 5.2 ms.  So, the CPU is inefficient at loading 8-bit values and worse with 16.  I don’t know if the actual load is slower, or the compiler generated somewhat different code rather than just using MOVXS instead of MOV.  This was a difference of a quarter of the execution time over the entire algorithm, and loading values is only a part of what it does, with values kept in registers once loaded.

 

GCD from “From Mathematics to Generic Programming”

For my birthday my sister got me a copy of From Mathematics to Generic Programming by Alexander Stepanov and Daniel Rose.  I had noticed it on Amazon.com and flagged it on my wish list, then forgotten about it, so it was a nice surprise and yet something that she knew I would be interested in.

The bulk of it is about pure math.  It covers the history of math with a small number of continued examples showing how progress was made when domains were generalized.  Euclid described an algorithm for Greatest Common Measure which was geometric in nature.  Much later zero was invented and a couple hundred years later it was realized that a line segment could have zero length, and the Greatest Common Divisor is the same algorithm using division and remainders on whole numbers.  Then came algebraic structures and the same algorithm can be applied to Gaussian integers, polynomials, and beyond.  Noether then studied the question of what, in the most abstract sense, can the algorithm work on, and described the Euclidean domain (or ring).

Returning to the conventional approach of using integers, the function they present is:

//gcd from page 59
template <typename N>  // N is some kind of integer
static N gcd(N a, N b)
{
    using std::swap;
    while (b != N(0)) {
        a= a % b;
        swap (a,b);
    }
    return a;
}

Pretty mundane in this day and age.  But, earlier microcomputers did not have an instruction for division.  I recall figuring out how to implement multiplication on an 8-bit computer in the early 1980’s.  Even now, I can call up a modulo operation with a single symbol and know it maps to a single machine opcode… but that instruction is not very efficient.  I was curious as to just how heavy it is, and found a report giving instruction timing details measured for various architectures.  On the Intel Haswell, I saw it was indeed something of an anomaly in terms of how many micro-ops, how many compute ports it sends them to, and how long the latency is.  Apparently it is a microcoded routine that works out the division just like you would code in the old days; just built-in and already perfectly optimized.

Stein’s Algorithm

Later, the book goes on to describe Stein’s algorithm.  Written for a computer in 1961, it uses bit-shift and subtract and other very primitive operations which are especially simple on binary encoded integers, with no need for division.

template <typename N>
static N gcd(N m, N n)
{
 
//	if (m < N(0))  m= -m;
//	if (n < N(0))  n= -n;
    if (m == N(0))  return n;
    if (n == N(0))  return m;
 
    // m>0 && n>0
 
    int d_m= 0;
    while (even(m)) { m>>=1; ++d_m; }
 
    int d_n= 0;
    while (even(n)) { n >>=1;  ++d_n; }
 
    // both are now odd
    while (m != n) {
        if (n>m) std::swap (n, m);
        m -= n;
        do m >>= 1; while (even(m));
    }
 
    // m == n
 
    return m << std::min (d_m, d_n);
}

How does Stein’s algorithm fare today?  Sure it avoids the big fat modulo operation, but it has a lot of steps.

Here is a summary of my results.  Values are in nanoseconds for one call of the function, having run a trial of 16 million calls.  I’m using Microsoft Visual C++ 2015, and running on an Intel Xeon E3-1276 v3 @ 3.60GHz.
000011

The first four columns is the code from the book, compiled as 32-bit and 64-bit.  In 32-bit builds, the modulo operation of 64-bit values is not a single instruction and the built-in microcode is not accessible.  Stein’s algorithm is a big win in this case.

For 64-bit code, the modulo operation is significantly slower for signed values than for unsigned (though all test values were unsigned).  Stein’s algorithm’s time is between those two, so it’s faster when using signed 64-bit integers.

Because the % operation is so much slower on signed values than unsigned, if there was a need to handle signed arguments, a wrapper function could take the absolute value of each argument and then call a version that used unsigned values.  Remember that any time you are doing division or modulo!  Unsigned types are faster.

I then called the 64-bit integer form with values that would all fit in 32 bits.  Unlike multiplication, division can’t easily quit early when the numbers are smaller.  But, fewer iterations are needed through the loop, so the function is faster.  Likewise, the Stein algorithm will loop less in every stage.  The built-in division regains its lead, as it speeds up more.  That is, the Stein algorithm’s complexity is a steeper curve considering the length of the input values in bits.

template <typename N>
bool odd (N n)  { return bool (n&1); }
 
template <typename N>
bool even (N n)  { return !odd(n); }
 
⋮
 
int d_m= 0;
while (even(m)) { m>>=1; ++d_m; }
 
int d_n= 0;
while (even(n)) { n >>=1;  ++d_n; }

The Stein’s algorithm code is written to avoid division, but what it is using is a lot of test-and-branch instructions.  The above fragment removes all the trailing zero bits and remembers how many there were.

Today’s Bottleneck Issues

Each operation is very simple: test the least-significant bit, shift right one bit, increment an int.  But the branch is a killer!

With a modern superscalar highly-pipelined processing core, a conditional branch breaks the pipeline.  The x86 processors use branch prediction and speculative execution to mitigate this, but here the branch decision is essentially random on every iteration.  It guesses wrong half the time.

Meanwhile, the instruction set does feature primitive opcodes to determine how many consecutive zero bits are at one end or the other of a register.  I can access that via a compiler intrinsic, and eliminate the problematic loops.

template <typename N>
static N gcd(N m, N n)
{
    if (m == N(0))  return n;
    if (n == N(0))  return m;
 
    unsigned long index;
    _BitScanForward64(&index, m);
    int d_m= index;
    m >>= d_m;
 
    _BitScanForward64(&index, n);
    int d_n= index;
    n >>= d_n;
 
    // both are now odd
    while (m != n) {
        smaller_first (n, m);
        m -= n;
        _BitScanForward64(&index,m);
        m >>= index;
    }
 
    // m == n
 
    return m << std::min (d_m, d_n);
}

There is still one loop remaining.  This main while loop ought to be predicted to loop and only fail the prediction when it’s ready to exit.

The column in the results table labeled “BitScan” gives the results after making this change.  All of a sudden, Stein’s algorithm beats the built-in div code in every case!  In the case of small values to 64-bit signed integers, it is 3× the speed!  Interestingly, the running time seems to be determined by the size of the values, without regard to the register size used: 32-bit values run at the same speed whether the code is compiled to use 32-bit integers or 64-bit integers.  That makes sense, since the various primitive operations (unlike multiplication and division) are constant speed and don’t have any advantage to using a smaller word size.

But I still have one more trick left. if (n>m) std::swap (n, m); is a branch point, and it will take one way or the other many times as it loops.  That is, this is another “bad” branch.

A long time ago, branching was far worse than it is on today’s CPUs.  I recall writing high-performance graphics code when any branch was a penalty.  I learned to absorb the conditional logic into the math, and also learned how to generate and manipulate masks in various clever ways.  Here is my quick attempt at a purely computational compare-and-swap:

template <typename N>
void smaller_first (N& n, N& m, std::false_type)
{
    using std::swap;
    if (n>m) swap (n, m);
}
 
template <typename N>
void smaller_first (N& n, N& m, std::true_type)
{
//        cout << "smaller_first input n=" << n << ", m=" << m;
    typedef typename std::make_signed<N>::type sN;
    const auto bitcount= 8*sizeof(N);
    const N mask= (sN(n)-sN(m))>>(bitcount-1);  // all 1's if m is larger, all 0's if m is smaller
//        cout << ",  mask=" << mask;
    const N smaller= (m&~mask)|(n&mask);
//        cout << ",  smaller=" << smaller;
    const N larger= (m&mask)|(n&~mask);
//        cout << ", larger=" << larger << endl;
    n= smaller;
    m= larger;
}
 
template <typename N>
void smaller_first (N& n, N& m)
{
    typedef typename std::is_signed<N>::type tag;
    return smaller_first (n, m, tag());
}

It only works on signed types, so I made the template choose the regular way for unsigned types and the non-branching trick for signed types.  (Actually, as written here it would work for unsigned types if the values were constrained more.)

It works by first doing a subtraction: if n-m produces a negative result, that means the highest bit is set in 2’s complement representation.  So if m is larger, that high bit will be set, and if m is smaller it will be clear.

Then the arithmetic right shift replicates this one bit to fill the entire word size: now I have all 1s if m is larger and all 0s if m is smaller.  In either case, the ones-compliment (~) operator will reverse that.

Each input value is then ANDed with a mask: one mask will always produce zero regardless of the input, and the other mask will have no effect.  Then both of them are ORed together, which results in the one that wasn’t all zero.

This is done twice, once for the larger and once for the smaller, by reversing the mask choice.

This would appear to use a lot of registers:  the two input values, the output values (which can’t destroy the input values before it is done), the 1 mask and the 0 mask.  But after computing the masks, the rest of the steps are not dependent on each previous step so they can all be done at the same time in various execution units, or at least scheduled optimally.

The result indicates that all this math and register usage is cheaper than branching: 44 becomes 39 or 37, 84 becomes 75.  This is about an 11% speed up in the overall algorithm.

Additional experiments indicate that the final test (doing the min at the very end) doesn’t benefit from using this method.  It is only performed once, at the end of the function.  Likewise the tests at the beginning for an argument of zero don’t have a noticeable effect (they are almost always “not taken” and when they are the routine is fast because it doesn’t have to do the main part at all).

 What about cmov ?

This post on Stackoverflow reminded me that the x86/x64 has long since addressed this with the cmov, or conditional move, instruction.  It makes sense that an if statement containing a simple comparison controlling only simple assignment statements could cause cmov instructions to be generated.  However, my attempt to write it that way showed that no cmov was used and conditional branching was still being done.  Yet further on in the function, a cmov is seen, and it comes from a use of std::min.  Looking at the headers, I see std::min is written as a ternary expression in the assignment, not an if statement.

Presumably the std templates are carefully crafted to trigger the kind of code that results in optimization, or the optimizer is written to handle the code generated by the standard headers, or cycles of feedback between the two.

So writing:

template <typename N>
void smaller_first (N& n, N& m)
{
    const N old_n {n};
    const N old_m {m};
    n= std::min(old_n,old_m);
    m= std::max(old_n,old_m);
}

does, on this compiler version, cause cmov instructions and no branching; although it does unnecessarily do the comparison twice.  Other than the lack of branching, the machine code appears rather poor to me.  But here’s the real performance:

000015

The min/max code wins hands down on 16-bit values. It does poorer than the jumping code in all other cases, so the funny arithmetic is still the winner in the signed values where it is supported.

Here is the 16-bit case, which is fast:

cmp         ax,r8w  
movzx       ecx,r8w  
cmovl       r8w,ax  
cmp         cx,ax  
cmovl       cx,ax  
movzx       eax,cx  
 
sub         ax,r8w

The values for m and n start out in AX and R8W, and are reversed if necessary so the subsequent use (the subtraction) uses those same registers.

The generated code uses one additional register, ECX, to hold the original value of R8W (zero-extended to 32 bits) in case the subsequent line copies AX into R8W.  Then the test is performed again between AX (still unchanged) and CX (holding the original R8W), and based on that result may copy AX into CX.  Then it moves CX (which may be the value from the first copy or the conditional second copy) into EAX, again zero-filling it to 32 bits.

I suppose there is a performance penalty to move into 16-bit register fields, so the (non-conditional) moves use zero extension moves to 32-bit fields.  Yet it uses the 16-bit portion in other places, and never reads larger portions even if they are known to be zero or simply ignored later.

This code could be simplified by removing the second CMP and then reversing the condition of the second CMOV.  That second CMOV may be free though, since it can be done simultaneously with an adjacent instruction, and removing it would not free up resources because no other instructions are not dependent on the result.

You would think this identical code would work for 32 bits or 64 bits, just by using the full width of the same registers.  But no, here’s what it generates for 32-bit values:

mov         ecx,r8d  
mov         dword ptr [rsp+0F0h],ecx  
mov         dword ptr [rsp+0E8h],edx  
lea         rax,[rsp+0E8h]  
cmp         edx,r8d  
lea         r8,[rsp+0F0h]  
cmovge      rax,r8  
mov         r8d,dword ptr [rax]  
lea         rax,[rsp+0E8h]  
lea         r13,[rsp+0F0h]  
cmp         ecx,edx  
cmovge      rax,r13  
mov         edx,dword ptr [rax]  
 
sub         edx,r8d

Besides using EDX and R8D for the two values, it uses ECX for a copy as with the first case, but also uses RAX, R8, and R13 as 64-bit pointers, and stores the values in stack-based memory locations as well, [RSP+0F0h] and [RSP+0E8h]!

The values are stored in memory, with pointers to those addresses in two new registers.  The comparison is done with the values (still) in registers, but the pointers are what are conditionally copied, and then the resulting pointer is de-referenced to get the matching value.  Besides chewing up 3 more registers it requires 3 memory accesses!

The 64-bit case works the same way, with the value-carrying registers and memory allocations widened.

HOTEL alfa sierra kilo echo lima×2

From earlier posts, you know that I’ve been learning Haskell. Now it’s one thing to memorize some syntax rules, and be able to discuss intelligently the semantics and ramifications of some particular language feature, even typing a line or two at an interactive prompt. It’s another to face an empty page in a text editor with the intent of writing a program.

For whatever reason, it occurred to me to do something with the spoken (phonetic) alphabet.  It’s good to pick something you easily know how to do, as to concentrate on the mechanical details of the tools instead of the problem.

The Main Loop

To that end, I started with looking up how to fetch command-line arguments (simple output was already covered from hello world) and spent a long time pondering just how to structure the loop over the arguments.

Pretty basic, right?  With pure functions, given something to do on one input value (say, foo) and an ordinary list of input values, you can process all of them using any of several ways of saying “map”.

map foo values
foo <$> values
mapF foo values
mapM foo values
forM values foo
ff

The problem is that both the thing I want to do on each value and the function that obtains the list of values is “in” the IO monad. So instead of one token to put these two things together (whether a word like map or a infix operator like <$>), I need two separate tokens for monad composition and the mapping, and had further trouble combining the two things.

Separate steps for the monad and the map isn’t too unreasonable, but is just a little bit. There are different levels of wrapping, like having different levels of indirection. f $ g is one way to write simple function application, and the entire thing g is the argument of one call to f; f <$> g uses a different symbol to mean apply f to what’s inside g treated as a container, so f is called multiple times with values taken from inside g.  There are any number of different wrapping layers and different things can be wrapped in different numbers of layers, so having unique marks for each possibility is prohibitive.  There are ways to bump things up in general, so f g is plain function application and liftM f g tells f to look inside its argument.

So, given that the values I have are double wrapped and I’m targeting the middle layer (neither the innermost nor the outermost) it seems reasonable that an extra mark of some kind is needed to specify, in addition to a mark that says “put these two things together”, making two in all.

The other combining problem is harder to explain:  Given one function (foo) that produced something that you wanted to feed to another function (bar), simply combining them is a no-brainer, you would think.  In a C-like language, bar(foo()) is pretty obvious.  In Haskell you don’t even use the parenthesis, so it’s just bar foo.

You could use an intermediate variable name, like

s= foo();
bar (s);

but you don’t have to.  In fact, that’s one of the things wrong with using so-called OUT parameters, is that it ruins the expressiveness and you can’t simply chain them together.  And this kind of chaining is very much embraced by functional style and Haskell in particular.

If foo is actually getArgs, and bar is a function that takes a list of strings (don’t even worry about the mapping at this point—just write a function that explicitly takes a list), you can’t do it!

main = do
    s <- getArgs
    mapM_ body s

badmain = mapM_ body getArgs

Writing that without the named intermediate value and do-block, the “obvious” main = mapM_ body getArgs doesn’t work.

When I was pondering this as I was working on it, I concluded that the only thing that can be done to getArgs is the bind operation (>>=).  Now, maybe that’s not quite true if common helpers exist that themselves use bind, such as the liftM mentioned earlier.  Ideally I’d somehow mark the argument that is wrapped too much for the function, as the first argument (the body) is fine and there may be variations like liftM2 but not one that skips lifting some arguments and then lifts others.  Would ap or <*> mixed with $ work here?  That’s something to try next time.

Meanwhile, my first concern was with writing the for-each construct in a clear manner, without separating body into its own named function since it isn’t that complicated.  I wondered what the usual common idiom might be?

No matter how you arrange the pieces though, I needed to name the parameter for the individual value/iteration being processed.  In common imperative “structured” languages, naming the loop control variable is a basic part of the syntax of a for-each looping construct.  Using a higher-order function instead, writing a lambda expression for the body was a bit clunky.  In any case, the layout started giving me trouble, so I gave up on that and just made body a separate named function.

What I ended up with is (starts with, anyway)

main =
    getArgs >>= mapM_ body

and with the body being a single word, the arrangement of the components (looping construct, list of inputs, body) is not nearly so important.

 Look-up Table

Haskell’s go-to data structure is a list.  There is no simple way to make a O(1) random-access array.  I mean, there are in fact such arrays as libraries, but to populate it you need a list anyway.  Such an array would help random-access speed, but only makes creating the table more complicated.

For key-value lookup, the basic feature is a function called lookup.  It operates on a list of (key,value) pairs, and I really didn’t want to write out the keys a through z and pairs, but rather just list the values (the alfa bravo keywords) in order.  The zip function took care of that.

Since it is most definitely not taking any advantage of contiguous range of keys, I decided to use that as a feature and add keywords for some odd punctuation marks and such.

phon_table= nato_table ++ digit_table ++ 
    [('.',"Period"), ('-',"Dash"), ('#',"TicTacToe"), ('_',"Underscore"), (' ',"Space"), ('@',"Snail") ]

nato_table= zip ['a'..'z'] ["alfa", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel",
                "india", "juliett", "kilo", "lima", "mike", "november", "oscar", "papa", "quebec",
                "romeo", "sierra", "tango", "uniform", "victor", "whiskey", "xray", "yankee", "zulu" ]

digit_table=  -- FAA style 
    zip ['0'..'9'] ["Zero","One","Two","Three","Four","Five","Six","Seven","Eight","Niner"]

Thinking ahead, variations can be accommodated by building the master list out of different pieces.  For example, use ITU instead of FAA code words for the digits, and add application-specific words.

Function Decomposition

Functions need to be written in small pieces.  In real-world code (using common imperitive languages) I’ve seen people write what I call meandering functions, which is a lack of decomposition: one defined unit (procedure or function in the high-level language) does stuff, then does more stuff, then goes on to work on something else, etc.  Often they are much much too long, running hundreds of lines of source code.

In Haskell, pure functions don’t have multiple statements.  That makes it hard to meander, or, without being adept at functional programming yet, even writing more than one line!

I also see lots of code that is not meandering in that the function does “one thing”, but yet is not decomposed well in that the thing it’s doing is obscured (even overwhelmed) with lower-level details that should be unimportant to “the thing” being expressed in this function.  Many programmers have a hard time with that, apparently not recognizing or even understanding what that is.

Those same programmers would more naturally break things up more in Haskell, since it is awkward to stuff more detail inside the argument of a function.  Writing a list of statements, it’s all to easy to stop what you’re really doing and issue lower-level detail work, and then put in more higher level work.  When everything is an argument to a single function, it’s more in-your-face that you are interrupting your expression with details that should be elsewhere.

Haskell facilitates writing very small functions with features such as writing multiple branches as separate top-level functions, and nested functions which close over the variables in their parent scope.

So is it “smaller is better”, period?  How much logic should go into one function as subexpressions rather than separate named functions?  Since the stuff in a where block have the same meaning as being in-line, thanks to the parent variables still being available, naming many individual pieces of a function seems to be the way to go.  What to not break up would be logical constructs that are read as a unit, and idiomatic uses that are recognized as one phrase but would be disguised if broken up.

When I first started to formulate some code, I ran into trouble with multiple nestings of local definitions, or trying to combine them in ways that either were not supported or couldn’t work in the layout rules.

A harder thing to know how to get just right is when to make local nested functions vs top-level functions.  My instinct is to nest everything and not expose internal details, since there isn’t any level of organization and name hiding smaller than the module.

Once I got past the initial main-loop panic, most of my time was spent fiddling with just how to decompose the functions.  As an anchor point, it made sense that the principle function solving this problem would be a named function and the driver (getting the args and displaying the results) would be a sample usage of that function.  In a real project, that is what I’d be interested in, and what I have here as main would just be test code, as the function would be called by other parts of the program.

One interesting detail is that my function phon processes one character, producing one code word as a result.  It doesn’t provide a function for processing an entire string, which is something you’d probably add in typical programming languages.

Given the function phon, you can call it on a character using ordinary function application, but it’s just as easy to process an entire string so there is no need to provide another function for that!

codeword = phon ch -- mundane call
codeword = phon $ ch -- means the same thing
codewords = phon <$> s -- process whole string of input
codewords = map phon s -- same thing

It’s also especially important in Haskell to keep the physical realization of the output separate from the computation. In most common programming languages, it would be just as easy to have the principle function print the result, and multiple calls would print successive items. In Haskell it is not just as easy — The function phon in this case cannot call putStr or anything like that.

In this toy program, where phon is called right from main and nothing else is happening, writing phon as an IO function would not seem awkward.  But it’s still a deliberate act, and you’d certainly notice when making a recursive call (or trying to).

Of course, that’s good in general.  In my toy program printing the results was part of the specification, but in real projects such a function will be consumed by other code, so the result needs to be in a form that’s still internalized and can be handled however the caller wants.  Unfortunately, it also makes it difficult to add debugging trace statements to existing code, but that’s another story.

Here is the complete program, as I left it.

module Main (main) where

import System.Environment (getArgs)
import Control.Applicative
import Data.Char


main =
    getArgs >>= mapM_ body
    where
        body s = do 
            putStrLn $ announce s 
            putStr $ format $ phon <$> s
        announce s = "string \"" ++ s ++ "\" spoken as:"

format =
    concatMap decorate
    where decorate w =  "\t" ++ w ++ "\n"


phon :: Char -> String
phon c | isUpper c = toUpper <$> (phon $ toLower c)
phon c = 
    maybe (quoted c) (id) result
    where result= lookup c phon_table
          quoted c= ['"',c,'"']
    
    
phon_table= nato_table ++ digit_table ++ 
    [('.',"Period"), ('-',"Dash"), ('#',"TicTacToe"), ('_',"Underscore"), (' ',"Space"), ('@',"Snail") ]

nato_table= zip ['a'..'z'] ["alfa", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel",
                "india", "juliett", "kilo", "lima", "mike", "november", "oscar", "papa", "quebec",
                "romeo", "sierra", "tango", "uniform", "victor", "whiskey", "xray", "yankee", "zulu" ]
                
digit_table=  -- FAA style 
    zip ['0'..'9'] ["Zero","One","Two","Three","Four","Five","Six","Seven","Eight","Niner"]

I welcome comments from experienced Haskell programmers on these issues, and overall style and idioms in general.

A Spriral Approach

I’ve heard of spiral approaches to learning, meaning that material is covered without much detail and making successive passes with more detail.

My recent experience makes me think of a slightly different spiral:

You understand, then you learn more, and then you are more confused than ever.

zeta

Maybe that’s why I like this particular representation of the Riemann ζ function?

−7/4 and Interactive Play

I saw this video concerning iterative function sequences

and thought that would be a good example to play around with in Haskell.  Haskell has arbitrary precision integers, and also has rational numbers, standard.  (I hesitate to say “built in” because, like many languages, the true built-in core is minimal with layers of library code on top of that.  But it was available out of the box.)

It also has an iterate function. What’s interesting is that the result is naturally presented as an infinite list. Other languages with abstract iteration or generators will provide some kind of interface to make your custom thing handleable in a common way. Perl even has a tie feature to make your custom thing seem exactly like the built-in collections, which pushes the interface contract down to include native datatypes, rather than being layered on top of them.

But in Haskell, the generator is the native list. Rather than make some custom generating code support the iteration mechanism, the same old list holds what appears to be the complete result, as if done as an earlier step.

In some purely imperative language, you might ordinarily do things like:

  1. Select the common, standard, container type with the appropriate properties (e.g. linear, sequential).
  2. Run a computation loop and fill the container with the results.
  3. Pass the container to another procedure, which can use the contents for other work (e.g. print them to a file).

Suppose that was unsuitable for some reason and you wanted to delay the computations of step 2, running the producer in parallel with the consumer, or not bothering until/unless a value was actually used by step 3, or not knowing how many items will really be needed.  You would have to explicitly program in a more complex sequence of commands to achieve that.

Instead of a plain common collection (e.g. a C++ vector) you’d need something else.  You need to arrange producer/consumer threads with a work queue between them, or callbacks, or something.  Step 2 would be preparing the whatsit, and step 3 would still be easy enough using standard iterators to abstract the actual details of the collection/stream/whatever.

Now look at it in Haskell.  Steps 1 is not much to think about as the list is king.  Step 2 is:

lz = iterate (λx→2*x+1) 0

Later, other code can get values from this list, e.g.

take 10 lz

produces a result of

[0,1,3,7,15,31,63,127,255,511]

The value lz is just another list of Integer, as far as the language is concerned. But clearly it could not have been filled in step 2 because it is as long as I need it to be, and never specified a size when filling it. It is actually an infinite list!

This certainly looks like the simple approach of: create the values, then use them. However, it actually doesn’t do any work when lz was defined, and only computes values when subsequent code reads them. Here I asked for the first 10 elements, and the list then went ahead and realized those values. If I ask for the value at index 30 (lz!!30) it needs to run the computation all the way to that element before telling me it’s 1,073,741,823.

This illustrates that the program here is not imperative. I didn’t program in “do this step, do the next step, etc.” Rather, I just specified that I wanted the 30th item of the list, and it did the work of figuring out just how to do that based on the definitions. It’s more like a spreadsheet: cells are not computed in order, but based on what they depend on. Haskell uses a top-down approach: when a specific value is really needed, it looks at the formula for it. Gathering the values needed for that function will set off other chains of downward dependent calculations.

Interactive Play

Using the GHCi interactive shell, it’s easy to play with little programs like this.  A more difficult problem featured in the video is

zz = abs <$> numerator <$> iterate (λx→x^2−7%4) 0

Trying a line of code, then recalling it to adjust and submit again, reminded me of a time long ago when microcomputers ran BASIC (or assembly language) and were attached to the household television.

Interactive play is the way to really learn something, and with a more complex build environment or larger programs you still try and do that.  It’s the combination of a direct Read-eval-print-loop and the ability to write interesting things in a single line of text that makes this experience especially efficient.

Want to try it yourself without installing anything?  Try Haskell! is a browser-based experience that invites you to “type these out and see what happens”  (psst: copy/paste is faster if you really can’t wait to find out).

Back to the Sequence

The actual sequence generated by iteration is rational numbers, but we want to ignore the denominator.  Then, I saw that the sign of the number was preserved with numerator, so the abs stripped that off.

The result starts off as [0,7,21,7,114639] and then the numbers get very large. Index 7 is 595096023596888261906480183623645954687, and it quickly gets overwhelming and impractical to display.

  • n=4 has 6 digits (is exactly 114639)
  • n=7 has 39 digits
  • n=28 has 80,807,125 digits
  • n=29 has 161,614,249 digits
  • n=30 has 323,228,496 digits
  • n=31 has 646,456,993 digits

That’s the largest I could compute with this program.  The Windows version compiles to a 32-bit program, but I used editbin to set the 3G flag, so it had 3G of user address space instead of 2.  That allowed it to computer n=30.  Running on a Mac, which is 64-bit, only increased that by one more before getting Segmentation fault.  I suspect there are things in the implementation that top out at 4G, even when more memory is available.

Foods for Thought

Here is a little problem that is isomorphic to something in real life that I was discussing with a co-worker.  Given a list of items (of a simple enumeration type), and knowing implicitly that each item has an associated priority, return the best item of the list.  Further refinements:  an empty list produces a default 0-priority item, as would a list that contains only negative-priority items.  Oh, and there might be unexpected items present whose priority is not known to the code.

To give some concrete examples, let me define the item type Snack as the set { Nothing, Gorp, Doughnut, Broccoli, Cheese_puffs, Apple, Cookie }, and a definition of the priorities, you can give the sequence (Gorp, Cheese_puffs, Apple) and the function returns Cheese_puffs.  Given the empty sequence () it returns Nothing.  Given the sequence containing only (Broccoli), it still returns Nothing.

Since I’ve been reading about and watching videos on the subject of functional programming, I recognize this an example of a foldl operation.  In particular, see C9 Lectures: Dr. Erik Meijer – Functional Programming Fundamentals Chapter 7 of 13.

Even if we’re not using higher-order functions, it still qualifies as a design pattern even if the language doesn’t do all the work for us.  In C++ pseudo-code, the solution ought to look like this:

result= initialvalue;
for (item : collection) result= f(result,item);

The logic of interest ought to be a succinct expression, something like result= std::max(result,item, comp); where comp is a comparison function that knows to compare the corresponding priorities rather than the numbers themselves.

There is a max_element algorithm in std (and in boost::range) that will go through a sequence and apply the max comparison to each element, as in the previous listing.  But, it doesn’t like empty lists.  It’s hard to beat a foreach loop for ease and clarity!  In C++11, the range-based for loop is the easy way to go through a collection.  You put other statements before, after, and inside it to control the details.  It’s not a function as a first-class object, but that’s still an easy way to specify such things.

I’ve been a fan of the “range” idea for a long time, and think it’s a nice way to make STL more friendly.  Boost has range (whole container) versions of std algorithms, but where is the humble map algorithm?  In Boost::range, you don’t use a range algorithm for this at all, but rather the concept of map is filled via range adaptors, with a generic map higher-order function called transformed.

Applying ideas from functional programming, the function (lets call it g) that takes a Snack value and returns the corresponding priority can return a pair, containing the original argument as well as the result.  That is, it decorates the value but does not destroy it.  Something like this:

pair<Snack,int> x = g(Nothing);
for (item : input | boost::transformed(g))  x= max(x,item,compare_second);
result= x.first;

compare_second should do the normal operator< on the .second, but it doesn’t appear as a predefined standard function.  It could be done using a lambda expression, but (in C++11) you would have to declare all the types and parameter arguments.

But remember the earlier musing on the utility of for loops — there’s nothing wrong with using them.  Instead of putting details inside a lambda expression, don’t bother.  max and compare_second are so simple to express naturally that it’s not worth any fuss to compose canned implementations of them.  (Now if we were working with parts that had more complexity, it would be another story)

pair<Snack,int> x = g(Nothing);
for (auto item : input | boost::transformed(g))   {
    if (item.second >= x.second)  x= item;
}
result= x.first;

Note that this naturally handles empty lists and lists containing only undesired items, and if g defaults to 0 for items not known at compile time, this handles all the desired cases without any special-case code.

Giving up on using a single function expression as the body of the loop, there is no need to have g return pairs and then extract first and second.  But, it still has an advantage.  If you kept two distinct variables instead:

Snack x = Nothing;
int pri= 0;
for (auto item : input)   {
    int newpri= g(item);
    if (newpri >= pri)  {
        x= item;
        pri= newpri;
    }
}
result= x;

It turns out longer!  The two variables x and pri, kept separately, need to be updated as a unit.  In more complex code, that is not nice as mistakes can be made.  We still want both the item from the input list and the associated priority, so yet another variable is created.

It is more natural in C++ to have g just return the result, though.  So let’s take the best features of both and split the difference:

pair<Snack,int> x  (Nothing, g(Nothing));
for (auto item : input)   {
    pair<Snack,int> temp (item, g(item));
    if (temp.second >= x.second)  x= temp;
}
result= x.first;

Now remembering the associated priority with the best-so-far assumes that g has some overhead.  E.g. it’s a configuration file read a run-time and stored in a fancy data structure where items with priority defined are sparsely represented.

But if that’s not the case, and there’s no good reason not to make redundant calls to g, you don’t have to remember it.

Snack x = Nothing;
for (auto item : input)   {
    if (g(item) >= g(x))  x= item;
}
result= x;

It’s just so annoying to me that g is called for again for the same value it just used on the previous iteration!  If the priorities are known at compile time, you might suppose the compiler to inline and optimize out the calls to g.

I will continue this later with an exploration of how it works in a functional language.

Haskell Exploration 2, or “who needs loops?”

The previous post used a one-liner to solve the problem, using a list comprehension. Trying different limits meant editing the number in the middle of the line and entering again. That’s really the only thing you can easily do in the GHCi interactive environment—any functions or more complex code needs to be in an actual source file. At least GHCi can load your source file and then let you interact with the functions defined in it.

So, here’s a function which takes the max in the problem statement. To completely generalize, it also takes the list of factors as a parameter. After all, the original code was starting to look repetitive on the handling of the two hard-coded factors.

 

divisors ∷ [Int]IntInt
divisors factors max =
sum [x | x ← [1‥max], factor factors x]
 
factor factors n =
any (λx → n `mod` x ≡ 0) factors
-- swap out these alternatives by leaving exactly one un-commented
-- or [b | x ← factors, let b= n `mod` x ≡ 0]
-- or [True | x ← factors, n `mod` x ≡ 0]

 

As well as using recursion, the list comprehension syntax offers a way to write what would be a loop in imperative languages.  The generator portion of the construct, x ← [1‥max], is like a foreach loop, though it is not limited to using one generator.  The things after the vertical bar can include generators, constraint tests, and named values to use in other things.  However, it doesn’t seem directly suited to using the factors list directly as a generator with other tests — the built-in meaning is an AND between constraints, not an OR.  So, push that down to another function, top-down style.  That is, sum all the numbers up to max where the number passes the test of function factor.

I knew that there are logical functions that apply across all elements in a list, but didn’t recall how they were named.  Looking through the library, I found any first, which is even more suitable.  It reads more clearly to indicate the intent, rather than having to know that a multi-input OR can be used in that manner.  But, it needs the test packaged as a separate function.  Here I did that in-line as a lambda.

Since the point of this exercise is to work with list comprehensions, I tried it using the or function, too.  Mapping the list of factors to a list of True/False is done within the list comprehension syntax, doesn’t need extra parens, but isn’t really clearer: it trades the lambda for a let.

But, there is no reason to populate all the False entries.  The natural use of the filtering clause is to reject elements that don’t meet the criteria.  So, accept the ones that would give True, rather than mapping to True.  A form similar to this would be useful if I wanted to list the ones that fit, rather than just finding out if any fit.

Interestingly, the first form can be written without using a lambda expression.  Nice normal function composition doesn’t work out because the argument is deep in the center rather than last, but section syntax lets you use either the left or right argument in an operator.  The section syntax uses parenthesis, and you end up needing parens around the whole thing because the infix ∘ has a lower precedence than function application.

factor factors n = any ((0≡)∘(n `mod`)) factors

First attempt at Haskell

Haskell is what’s called a functional programming language.  Compared to BASIC, Pascal, batch-file, C, C++, Java, JavaScript, C♯, Ada, FORTRAN, assembly language, and any any other procedural (whether structured or OO) language, Haskell is very different.

And that’s really the point:  I don’t need to pick up on a slightly different syntax and somewhat different environment for representing the same kinds of constructs I’m used to.  There might be a small number of new and interesting concepts to inspire you, but for the most part they are “me too” on the current engineering paradigms and if anything leaning back to simplicity and leaving out features that could separate the top coders from the mundane (since the latter will need to maintain the code later).

But brushing up on techniques and ways of thinking that are truly different helps improve your engineering skills in general.

For example, a number of years ago I re-implemented a piece of code for laying out images on a page.  The old code had been extended, fixed, patched up, and “maintained” to the point where it was impossible to understand what was going on.  Cleaning it off would be a temporary measure if new requirements and changes continued as they always have.  So, I approached it using concepts from functional programming, even though written in (an early version) of C♯.  Defining the constraints and properties of the desired layout, rather than a spaghetti flowchart of patched procedures that manipulate all the state variables, means at the very least that it could be extended without breaking it.  The logic was unfathomable because of state dependance—what happened before affects what some block of code will do to it next.  So I had no state variables at all.  The layout rectangle of an item was expressed as a pure function, and the various bottom level functions could easily be reviewed for correctness.

Functional programming has been making inroads into mainstream programming, with otherwise procedural and OO languages acquiring features friendly to techniques gained from FP.

In particular, template metaprogramming in C++ (which I’ve seen described as “a slow descent into madness”) exists because the template system is a Turing-complete, albeit crudely primitive, functional programming language.

Meanwhile, I’ve read that monads have a strange property: anyone who comes to understand it loses any ability to explain it to others.  That reminds me of the science fiction novel Babel-17.  In fact as in the story, language influences thought and perception, which is what I was getting at earlier in this essay.  Being a writer on programming topics, I thought I’d take that as a challenge.  Maybe I’ll write a truly good explanation of monads; or maybe it will end up joining the hundreds of others that are are either indecipherable or lack proper deep meaning.  (See also monad tutorial fallacy)

chart

A lot of what I see of beginners’ Haskell examples remind me of Prolog.

Anyway, I just tried crafting my first Haskell snippet, other than things like 2+2 or copying lines from the book.  Project Euler, problem 1,

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

Here is the session from the interactive environment:

Prelude> [x | x <- [1..10], x `mod` 3 == 0]
[3,6,9]
Prelude> [x | x <- [1..10], x `mod` 3 == 0 || x `mod` 5 == 0]
[3,5,6,9,10]
Prelude> sum [x | x <- [1..10], x `mod` 3 == 0 || x `mod` 5 == 0]
33
Prelude> sum [x | x <- [1..9], x `mod` 3 == 0 || x `mod` 5 == 0]
23
Prelude> sum [x | x <- [1..999], x `mod` 3 == 0 || x `mod` 5 == 0]
233168
Prelude>  [x | x <- [1..999], x `mod` 3 == 0 || x `mod` 5 == 0]
[3,5,6,9,10,12,15,18,20,21,24,25,27,30,33,35,36,39,40,42,45,48,50,51,54,55,57,60,63,65,66,69,70,72,75,78,80,81,84,85,87,90,93,95,96,99,100,102,105,108,110,111,114,115,117,120,123,125,126,129,130,132,135,138,140,141,144,145,147,150,153,155,156,159,160,162,165,168,170,171,174,175,177,180,183,185,186,189,190,192,195,198,200,201,204,205,207,210,213,215,216,219,220,222,225,228,230,231,234,235,237,240,243,245,246,249,250,252,255,258,260,261,264,265,267,270,273,275,276,279,280,282,285,288,290,291,294,295,297,300,303,305,306,309,310,312,315,318,320,321,324,325,327,330,333,335,336,339,340,342,345,348,350,351,354,355,357,360,363,365,366,369,370,372,375,378,380,381,384,385,387,390,393,395,396,399,400,402,405,408,410,411,414,415,417,420,423,425,426,429,430,432,435,438,440,441,444,445,447,450,453,455,456,459,460,462,465,468,470,471,474,475,477,480,483,485,486,489,490,492,495,498,500,501,504,505,507,510,513,515,516,519,520,522,525,528,530,531,534,535,537,540,543,545,546,549,550,552,555,558,560,561,564,565,567,570,573,575,576,579,580,582,585,588,590,591,594,595,597,600,603,605,606,609,610,612,615,618,620,621,624,625,627,630,633,635,636,639,640,642,645,648,650,651,654,655,657,660,663,665,666,669,670,672,675,678,680,681,684,685,687,690,693,695,696,699,700,702,705,708,710,711,714,715,717,720,723,725,726,729,730,732,735,738,740,741,744,745,747,750,753,755,756,759,760,762,765,768,770,771,774,775,777,780,783,785,786,789,790,792,795,798,800,801,804,805,807,810,813,815,816,819,820,822,825,828,830,831,834,835,837,840,843,845,846,849,850,852,855,858,860,861,864,865,867,870,873,875,876,879,880,882,885,888,890,891,894,895,897,900,903,905,906,909,910,912,915,918,920,921,924,925,927,930,933,935,936,939,940,942,945,948,950,951,954,955,957,960,963,965,966,969,970,972,975,978,980,981,984,985,987,990,993,995,996,999]
Prelude>

After expressing the shorter example, first to get the list of terms to see if that part is right thus far, and then summing them, I changed the bound to the larger value and got an answer of 233168. Just to see what it was, I then backed off the final sum to get the list of terms.

So far, so good.

Happiness is having a good backup

I’ve my share of hard drive failures and software accidents, and more often than not I’ve been able to recover.  Here are my current back-up provisions:

  • Daily backups using Acronis
  • Windows 7 “Previous Versions” feature
  • Backup copies of important files on different drives
  • Backup copies of important files on different machines at home
  • Annual off-site backup of entire machine
  • Important files stored using RAID-5

The technology of back-up media has changed over the years.  Once upon a time I used a stack of 50–100 5¼″ floppy disks!  I also remember when I went to DAT tape, which could hold a full backup and incremental backups every day for a month on one 400 MB cartridge.  Eventually came recordable CDs, and later DVD-RAM.  Along the way were 20MB “floptical” disks, Jaz drives, and Zip discs.

Today, the best backup medium is another hard drive.  A cheap on-sale hard drive has a better price per gigabyte than optical media of reputable quality, and nothing else is even close.  Desktop HDDs are also quite robust—I’ve heard of data recovery companies reading a hard drive after a house fire had destroyed the computer.  Plastic optical media would be toast!

Partition vs File

There are two fundamentally different kinds of backup.  For typical data applications, “documents” are files and can be easily copied elsewhere for safe keeping.  Any manner of copying a file (and copying it back where it came from) will serve to back up a word processing or any kind of office document, photo, video, etc.

But the “system” is different.  The operating system and the arrangement of installed programs has files you don’t understand, and even special things in special places on the hard drive.  The way to back that up is to make an exact sector-by-sector image of the partition.  This requires specialized software both to make and restore.

That is also one reason why I still keep my data separate from the system.  My C: drive is for the operating system and installed programs, and my files are on a different partition (say, E: and G:).  On Windows this means ignoring the prepared My Documents locations or taking steps to point that to another partition.

It has definite advantages, and I’ve made good use of it recently.  When updating some program caused problems, I simply restored to the previous day’s system backup of the entire C: drive.  My work, which was on E:, was not affected.  Had my work been on C: also, this step would have erased my efforts that were performed since that backup point.

Multiple Methods

Besides using different tools for the System backup and your day-to-day work files, you can use a variety of different overlapping techniques all at the same time.  You don’t have to use one tool or another.  You can use a 3rd party backup suite and casually replicate your work to your spouse’s computer.  Even with a single too, you can have automated daily incremental backups to another drive and make monthly full backups to Blu-ray to store off-site.

Automatically and Frequent

I used to boot the computer specially in order to do a complete partition back up of the normal C: drive.  I would do so before making significant changes, and was supposed to do so once a month regardless.  But it was a chore and a bother.

Now, Windows can reliably back up the running C: drive using an operating system called Volume Shadowing.  Being able to perform the backup while running the regular system is liberating, because it can be done automatically on a timer, and it can be done in the background.  So I have Acronis True Image perform daily backups of the C: drive.

Likewise, the same technology applies to data files.  Even if I happened to be still working at the odd hour at which I scheduled the daily file backup, using the files would not conflict with the backing up.

Windows Previous Versions feature

Windows 7 has a feature called Previous Versions that can be handy.  You can turn on System Protection and also enable it for your data drive.  Use the System control panel applet, and there is a tab for System Protection.

Windows 8 File History

This is deprecated but still available on Windows 8.  Windows 8 revamps the general idea with something that’s said to be more like Mac’s Time Machine.  It backs up to an external drive or network location, and it is hourly (or customizable interval).

Search for file history on the Windows 8 Start screen to get to the applet.  However, there seems no way to specify which files are being backed up!  It only and always applies to places that are part of a Library.  So I worked-around it by adding the directories of interest to a Library in File Explorer.

Windows Restore

I tried (on Windows 7) using its supplied System Backup feature, and was less than trilled with it.  It backs up to a hidden directory on the same drive, I don’t know what it does about having multiple versions stored there.  And I can’t simply copy the backup file elsewhere.  It’s actually the same feature that the Previous Versions uses, so I imagine it’s also better on Windows 8.

Drill!  Be confidant

Make sure you know how to restore files, and that it actually works.  When an urgent deadline coincides with a messed up file, that is not the time to be figuring out an unfamiliar system.

So, after you initiate your automated backup system for work files, also create one or more scratch files of the same kind you normally work with.  A silly word processor document containing a stupid joke, perhaps.  After a couple days, when the automated system has had time to do its thing, delete the file.

Now, get it back.

Make notes, and keep them on actual paper, to refer to when this is not a drill.

Then, you can be happy.  Be smug even, especially when someone else has “an incident”.