−7/4 and Interactive Play

I saw this video concerning iterated function sequences, and thought it would be a good example to play around with in Haskell.  Haskell has arbitrary-precision integers, and also rational numbers, as standard.  (I hesitate to say “built in” because, like many languages, the true built-in core is minimal, with layers of library code on top of that.  But it was available out of the box.)

It also has an iterate function. What’s interesting is that the result is naturally presented as an infinite list. Other languages with abstract iteration or generators will provide some kind of interface to make your custom thing handleable in a common way. Perl even has a tie feature to make your custom thing seem exactly like the built-in collections, which pushes the interface contract down to include native datatypes, rather than being layered on top of them.

But in Haskell, the generator is the native list. Rather than make some custom generating code support the iteration mechanism, the same old list holds what appears to be the complete result, as if done as an earlier step.

In some purely imperative language, you might ordinarily do things like:

  1. Select the common, standard, container type with the appropriate properties (e.g. linear, sequential).
  2. Run a computation loop and fill the container with the results.
  3. Pass the container to another procedure, which can use the contents for other work (e.g. print them to a file).

Suppose that was unsuitable for some reason and you wanted to delay the computations of step 2, running the producer in parallel with the consumer, or not bothering until/unless a value was actually used by step 3, or not knowing how many items will really be needed.  You would have to explicitly program in a more complex sequence of commands to achieve that.

Instead of a plain common collection (e.g. a C++ vector) you’d need something else.  You need to arrange producer/consumer threads with a work queue between them, or callbacks, or something.  Step 2 would be preparing the whatsit, and step 3 would still be easy enough using standard iterators to abstract the actual details of the collection/stream/whatever.

Now look at it in Haskell.  Step 1 is not much to think about, as the list is king.  Step 2 is:

lz = iterate (λx→2*x+1) 0

Later, other code can get values from this list, e.g.

take 10 lz

produces a result of

[0,1,3,7,15,31,63,127,255,511]

The value lz is just another list of Integer, as far as the language is concerned.  But clearly it could not have been filled in during step 2, because it is as long as I need it to be, and I never specified a size when filling it.  It is actually an infinite list!

This certainly looks like the simple approach of: create the values, then use them.  However, no work was actually done when lz was defined; values are only computed when subsequent code reads them.  Here I asked for the first 10 elements, and the list then went ahead and realized those values.  If I ask for the value at index 30 (lz!!30), it needs to run the computation all the way to that element before telling me it’s 1,073,741,823.
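
Here is roughly how that index-30 lookup looks at the GHCi prompt (the definition needs a let when typed interactively):

Prelude> let lz = iterate (\x -> 2*x+1) 0
Prelude> lz !! 30
1073741823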

This illustrates that the program here is not imperative.  I didn’t program in “do this step, do the next step, etc.”  Rather, I just specified that I wanted the item at index 30 of the list, and it did the work of figuring out just how to do that based on the definitions.  It’s more like a spreadsheet: cells are not computed in order, but based on what they depend on.  Haskell uses a top-down approach: when a specific value is really needed, it looks at the formula for it.  Gathering the values needed for that formula sets off other chains of dependent calculations further down.

Interactive Play

Using the GHCi interactive shell, it’s easy to play with little programs like this.  A more difficult problem featured in the video is

zz = abs <$> numerator <$> iterate (λx→x^2−7%4) 0
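
For reference, here is the same definition in plain ASCII, as you would put it in a source file; the % operator and numerator come from Data.Ratio:

import Data.Ratio (numerator, (%))

zz :: [Integer]
zz = abs <$> numerator <$> iterate (\x -> x^2 - 7%4) 0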

Trying a line of code, then recalling it to adjust and submit again, reminded me of a time long ago when microcomputers ran BASIC (or assembly language) and were attached to the household television.

Interactive play is the way to really learn something, and with a more complex build environment or larger programs you still try to do that.  It’s the combination of a direct read-eval-print loop and the ability to write interesting things in a single line of text that makes this experience especially efficient.

Want to try it yourself without installing anything?  Try Haskell! is a browser-based experience that invites you to “type these out and see what happens”  (psst: copy/paste is faster if you really can’t wait to find out).

Back to the Sequence

The actual sequence generated by the iteration consists of rational numbers, but we want to ignore the denominator.  Then I saw that numerator preserves the sign of the number, so the abs strips that off.
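
A quick GHCi check of that behaviour (the sign of a Ratio lives in its numerator):

Prelude> import Data.Ratio
Prelude Data.Ratio> numerator (-7 % 4)
-7
Prelude Data.Ratio> abs (numerator (-7 % 4))
7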

The result starts off as [0,7,21,7,114639] and then the numbers get very large. Index 7 is 595096023596888261906480183623645954687, and it quickly gets overwhelming and impractical to display.

  • n=4 has 6 digits (is exactly 114639)
  • n=7 has 39 digits
  • n=28 has 80,807,125 digits
  • n=29 has 161,614,249 digits
  • n=30 has 323,228,496 digits
  • n=31 has 646,456,993 digits

That’s the largest I could compute with this program.  The Windows version compiles to a 32-bit program, but I used editbin to set the 3G flag, so it had 3G of user address space instead of 2.  That allowed it to compute n=30.  Running on a Mac, which is 64-bit, only increased that by one more before getting Segmentation fault.  I suspect there are things in the implementation that top out at 4G, even when more memory is available.

Foods for Thought

Here is a little problem that is isomorphic to something in real life that I was discussing with a co-worker.  Given a list of items (of a simple enumeration type), and knowing implicitly that each item has an associated priority, return the best item of the list.  Further refinements:  an empty list produces a default 0-priority item, as would a list that contains only negative-priority items.  Oh, and there might be unexpected items present whose priority is not known to the code.

To give some concrete examples, let me define the item type Snack as the set { Nothing, Gorp, Doughnut, Broccoli, Cheese_puffs, Apple, Cookie }.  Given a definition of the priorities, you can give the function the sequence (Gorp, Cheese_puffs, Apple) and it returns Cheese_puffs.  Given the empty sequence () it returns Nothing.  Given the sequence containing only (Broccoli), it still returns Nothing.

Since I’ve been reading about and watching videos on the subject of functional programming, I recognize this as an example of a foldl operation.  In particular, see C9 Lectures: Dr. Erik Meijer – Functional Programming Fundamentals, Chapter 7 of 13.
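
As a preview of the functional version, here is a minimal Haskell sketch.  The priority numbers are made up for illustration, and the Snack called Nothing is renamed None here so it doesn’t clash with the Prelude’s Nothing:

data Snack = None | Gorp | Doughnut | Broccoli | Cheese_puffs | Apple | Cookie
    deriving (Show, Eq)

-- hypothetical priorities; unknown or uninteresting items default to 0
g :: Snack -> Int
g Gorp         = 3
g Apple        = 5
g Cheese_puffs = 7
g Broccoli     = -1
g _            = 0

-- fold over the list, carrying the best item and its priority as a pair
bestSnack :: [Snack] -> Snack
bestSnack = fst . foldl step (None, g None)
  where
    step best@(_, bestPri) item
        | g item >= bestPri = (item, g item)
        | otherwise         = best

With that, bestSnack [Gorp, Cheese_puffs, Apple] is Cheese_puffs, while bestSnack [] and bestSnack [Broccoli] are both None, matching the cases above.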

Even if we’re not using higher-order functions, it still qualifies as a design pattern; the language just doesn’t do all the work for us.  In C++ pseudo-code, the solution ought to look like this:

result= initialvalue;
for (item : collection) result= f(result,item);

The logic of interest ought to be a succinct expression, something like result= std::max(result,item, comp); where comp is a comparison function that knows to compare the corresponding priorities rather than the numbers themselves.

There is a max_element algorithm in std (and in boost::range) that will go through a sequence and apply the max comparison to each element, as in the previous listing.  But, it doesn’t like empty lists.  It’s hard to beat a foreach loop for ease and clarity!  In C++11, the range-based for loop is the easy way to go through a collection.  You put other statements before, after, and inside it to control the details.  It’s not a function as a first-class object, but that’s still an easy way to specify such things.

I’ve been a fan of the “range” idea for a long time, and think it’s a nice way to make STL more friendly.  Boost has range (whole container) versions of std algorithms, but where is the humble map algorithm?  In Boost::range, you don’t use a range algorithm for this at all, but rather the concept of map is filled via range adaptors, with a generic map higher-order function called transformed.

Applying ideas from functional programming, the function (let’s call it g) that takes a Snack value and returns the corresponding priority can return a pair, containing the original argument as well as the result.  That is, it decorates the value but does not destroy it.  Something like this:

pair<Snack,int> x = g(Nothing);
for (auto item : input | boost::transformed(g))  x= max(x,item,compare_second);
result= x.first;

compare_second should do the normal operator< on the .second members, but it doesn’t appear as a predefined standard function.  It could be done using a lambda expression, but (in C++11) you would have to declare all the parameter types.

But remember the earlier musing on the utility of for loops — there’s nothing wrong with using them.  Instead of putting details inside a lambda expression, don’t bother.  max and compare_second are so simple to express naturally that it’s not worth any fuss to compose canned implementations of them.  (Now if we were working with parts that had more complexity, it would be another story.)

pair<Snack,int> x = g(Nothing);
for (auto item : input | boost::transformed(g))   {
    if (item.second >= x.second)  x= item;
}
result= x.first;

Note that this naturally handles empty lists and lists containing only undesired items, and if g defaults to 0 for items not known at compile time, this handles all the desired cases without any special-case code.

Giving up on using a single function expression as the body of the loop, there is no need to have g return pairs and then extract first and second.  But, it still has an advantage.  If you kept two distinct variables instead:

Snack x = Nothing;
int pri= 0;
for (auto item : input)   {
    int newpri= g(item);
    if (newpri >= pri)  {
        x= item;
        pri= newpri;
    }
}
result= x;

It turns out longer!  The two variables x and pri, kept separately, need to be updated as a unit.  In more complex code, that is not nice as mistakes can be made.  We still want both the item from the input list and the associated priority, so yet another variable is created.

It is more natural in C++ to have g just return the result, though.  So let’s take the best features of both and split the difference:

pair<Snack,int> x  (Nothing, g(Nothing));
for (auto item : input)   {
    pair<Snack,int> temp (item, g(item));
    if (temp.second >= x.second)  x= temp;
}
result= x.first;

Now, remembering the associated priority along with the best-so-far assumes that g has some overhead.  E.g. it’s a configuration file read at run time and stored in a fancy data structure where items with a defined priority are sparsely represented.

But if that’s not the case, and there’s no good reason not to make redundant calls to g, you don’t have to remember it.

Snack x = Nothing;
for (auto item : input)   {
    if (g(item) >= g(x))  x= item;
}
result= x;

It’s just so annoying to me that g is called again for the same value it just used on the previous iteration!  If the priorities are known at compile time, you might expect the compiler to inline and optimize out the extra calls to g.

I will continue this later with an exploration of how it works in a functional language.

Haskell Exploration 2, or “who needs loops?”

The previous post used a one-liner to solve the problem, using a list comprehension. Trying different limits meant editing the number in the middle of the line and entering again. That’s really the only thing you can easily do in the GHCi interactive environment—any functions or more complex code needs to be in an actual source file. At least GHCi can load your source file and then let you interact with the functions defined in it.

So, here’s a function which takes the max from the problem statement as a parameter.  To completely generalize, it also takes the list of factors as a parameter.  After all, the original code was starting to look repetitive in the handling of the two hard-coded factors.

 

divisors ∷ [Int] → Int → Int
divisors factors max =
    sum [x | x ← [1‥max], factor factors x]

factor factors n =
    any (λx → n `mod` x ≡ 0) factors
-- swap out these alternatives by leaving exactly one un-commented
-- or [b | x ← factors, let b = n `mod` x ≡ 0]
-- or [True | x ← factors, n `mod` x ≡ 0]

 

As well as using recursion, the list comprehension syntax offers a way to write what would be a loop in imperative languages.  The generator portion of the construct, x ← [1‥max], is like a foreach loop, though it is not limited to using one generator.  The things after the vertical bar can include generators, constraint tests, and named values to use in other clauses.  However, it doesn’t seem directly suited to using the factors list as a second generator with other tests — the built-in meaning is an AND between constraints, not an OR.  So, push that down to another function, top-down style.  That is, sum all the numbers up to max where the number passes the test of the function factor.
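
With these definitions in a source file (typed in plain ASCII), loading it into GHCi reproduces the answer from the earlier one-liner; the file name here is mine:

Prelude> :load divisors.hs
*Main> divisors [3,5] 999
233168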

I knew that there are logical functions that apply across all elements in a list, but didn’t recall how they were named.  Looking through the library, I found any first, which is even more suitable.  It reads more clearly to indicate the intent, rather than having to know that a multi-input OR can be used in that manner.  But, it needs the test packaged as a separate function.  Here I did that in-line as a lambda.

Since the point of this exercise is to work with list comprehensions, I tried it using the or function, too.  Mapping the list of factors to a list of True/False is done within the list comprehension syntax, doesn’t need extra parens, but isn’t really clearer: it trades the lambda for a let.

But, there is no reason to populate all the False entries.  The natural use of the filtering clause is to reject elements that don’t meet the criteria.  So, accept the ones that would give True, rather than mapping to True.  A form similar to this would be useful if I wanted to list the ones that fit, rather than just finding out if any fit.
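
For instance, a variant along those lines (the name is mine) that lists which of the given factors divide n, rather than just testing whether any do:

whichFactors factors n = [x | x <- factors, n `mod` x == 0]
-- whichFactors [3,5] 15 gives [3,5]; whichFactors [3,5] 7 gives []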

Interestingly, the first form can be written without using a lambda expression.  Nice normal function composition doesn’t work out, because the argument is deep in the center rather than last, but section syntax lets you use either the left or right argument of an operator.  The section syntax uses parentheses, and you end up needing parens around the whole thing because the infix ∘ has a lower precedence than function application.

factor factors n = any ((0≡)∘(n `mod`)) factors
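
In plain ASCII, that last form is:

factor factors n = any ((0 ==) . (n `mod`)) factors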

First attempt at Haskell

Haskell is what’s called a functional programming language.  Compared to BASIC, Pascal, batch files, C, C++, Java, JavaScript, C♯, Ada, FORTRAN, assembly language, and any other procedural (whether structured or OO) language, Haskell is very different.

And that’s really the point:  I don’t need to pick up yet another slightly different syntax and somewhat different environment for representing the same kinds of constructs I’m used to.  A new language in that mold might have a small number of new and interesting concepts to inspire you, but for the most part they are “me too” on the current engineering paradigms, if anything leaning back toward simplicity and leaving out features that could separate the top coders from the mundane (since the latter will need to maintain the code later).

But brushing up on techniques and ways of thinking that are truly different helps improve your engineering skills in general.

For example, a number of years ago I re-implemented a piece of code for laying out images on a page.  The old code had been extended, fixed, patched up, and “maintained” to the point where it was impossible to understand what was going on.  Cleaning it up would be only a temporary measure if new requirements and changes continued as they always have.  So, I approached it using concepts from functional programming, even though it was written in (an early version of) C♯.  Defining the constraints and properties of the desired layout, rather than a spaghetti flowchart of patched procedures that manipulate all the state variables, means at the very least that it could be extended without breaking it.  The old logic was unfathomable because of state dependence—what happened before affects what some block of code will do to it next.  So I had no state variables at all.  The layout rectangle of an item was expressed as a pure function, and the various bottom-level functions could easily be reviewed for correctness.

Functional programming has been making inroads into mainstream programming, with otherwise procedural and OO languages acquiring features friendly to techniques gained from FP.

In particular, template metaprogramming in C++ (which I’ve seen described as “a slow descent into madness”) exists because the template system is a Turing-complete, albeit crudely primitive, functional programming language.

Meanwhile, I’ve read that monads have a strange property: anyone who comes to understand them loses the ability to explain them to others.  That reminds me of the science fiction novel Babel-17.  In fact, as in the story, language influences thought and perception, which is what I was getting at earlier in this essay.  Being a writer on programming topics, I thought I’d take that as a challenge.  Maybe I’ll write a truly good explanation of monads; or maybe it will end up joining the hundreds of others that are either indecipherable or lack proper deep meaning.  (See also the monad tutorial fallacy.)


A lot of what I see of beginners’ Haskell examples remind me of Prolog.

Anyway, I just tried crafting my first Haskell snippet, other than things like 2+2 or copying lines from the book.  Project Euler, problem 1,

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

Here is the session from the interactive environment:

Prelude> [x | x <- [1..10], x `mod` 3 == 0]
[3,6,9]
Prelude> [x | x <- [1..10], x `mod` 3 == 0 || x `mod` 5 == 0]
[3,5,6,9,10]
Prelude> sum [x | x <- [1..10], x `mod` 3 == 0 || x `mod` 5 == 0]
33
Prelude> sum [x | x <- [1..9], x `mod` 3 == 0 || x `mod` 5 == 0]
23
Prelude> sum [x | x <- [1..999], x `mod` 3 == 0 || x `mod` 5 == 0]
233168
Prelude>  [x | x <- [1..999], x `mod` 3 == 0 || x `mod` 5 == 0]
[3,5,6,9,10,12,15,18,20,21,24,25,27,30,33,35,36,39,40,42,45,48,50,51,54,55,57,60,63,65,66,69,70,72,75,78,80,81,84,85,87,90,93,95,96,99,100,102,105,108,110,111,114,115,117,120,123,125,126,129,130,132,135,138,140,141,144,145,147,150,153,155,156,159,160,162,165,168,170,171,174,175,177,180,183,185,186,189,190,192,195,198,200,201,204,205,207,210,213,215,216,219,220,222,225,228,230,231,234,235,237,240,243,245,246,249,250,252,255,258,260,261,264,265,267,270,273,275,276,279,280,282,285,288,290,291,294,295,297,300,303,305,306,309,310,312,315,318,320,321,324,325,327,330,333,335,336,339,340,342,345,348,350,351,354,355,357,360,363,365,366,369,370,372,375,378,380,381,384,385,387,390,393,395,396,399,400,402,405,408,410,411,414,415,417,420,423,425,426,429,430,432,435,438,440,441,444,445,447,450,453,455,456,459,460,462,465,468,470,471,474,475,477,480,483,485,486,489,490,492,495,498,500,501,504,505,507,510,513,515,516,519,520,522,525,528,530,531,534,535,537,540,543,545,546,549,550,552,555,558,560,561,564,565,567,570,573,575,576,579,580,582,585,588,590,591,594,595,597,600,603,605,606,609,610,612,615,618,620,621,624,625,627,630,633,635,636,639,640,642,645,648,650,651,654,655,657,660,663,665,666,669,670,672,675,678,680,681,684,685,687,690,693,695,696,699,700,702,705,708,710,711,714,715,717,720,723,725,726,729,730,732,735,738,740,741,744,745,747,750,753,755,756,759,760,762,765,768,770,771,774,775,777,780,783,785,786,789,790,792,795,798,800,801,804,805,807,810,813,815,816,819,820,822,825,828,830,831,834,835,837,840,843,845,846,849,850,852,855,858,860,861,864,865,867,870,873,875,876,879,880,882,885,888,890,891,894,895,897,900,903,905,906,909,910,912,915,918,920,921,924,925,927,930,933,935,936,939,940,942,945,948,950,951,954,955,957,960,963,965,966,969,970,972,975,978,980,981,984,985,987,990,993,995,996,999]
Prelude>

After expressing the shorter example, first to get the list of terms to see if that part is right thus far, and then summing them, I changed the bound to the larger value and got an answer of 233168. Just to see what it was, I then backed off the final sum to get the list of terms.

So far, so good.

Desires of a Backup System by Bob Hyatt

You invited me to write more on this topic. I believe I will do so, but in parts, rather than in a 20 page university paper. I will start with an overview of the ‘desired’ home server backup, vs what seems to be available.

For most homes and small businesses, the primary desires / goals are (in my view):

A. Secure backups.  In this case I add: secure from failure for at least 5 years!

B. Easy-to-do backups.  I do NOT mean automatically scheduled backups.  They automatically back up things you are NOT interested in, or that will be useless to you if there is a problem.  For example, do you really want to back up all the operating system files, settings, etc.?  Things that change a lot, and that will occupy a lot of space on your backup drives?  I do NOT.

C. Allow you to determine what is important to You!
Not every photo or video is important to you.  Not every PDF is valuable to you.  Not every MP3 or FLAC is valuable to you.  So you need a way to differentiate between the valued and the chaff, and for the ‘tool’ to remember.  OR for you to keep the data you want to keep separate from the chaff, and then have the ‘tool’ look only at the desired data.

D. Allow you to add storage, as needed, easily. Adding a new hard drive should not be ‘a hope and a prayer’ kind of task.

E. If I should desire to change from one backup product or tools to another, the existing should NOT act as a barrier or gatekeeper. (as in keeping your data as prisoner).

F. The running of the backup device or server should NOT wear out the disks, ever!  The issue with the video I described in the earlier ‘letter’ here was that the unit that was always busy was attempting to make the ‘distribution’ of data on the drives “perfect”.

If you are a corporation or a government agency, such wearing out of the disks would be considered ‘cost of doing business’. And they have the budget for this task. A home user or a small office does NOT have a budget for this willful destruction. Further, “Perfect distribution” is NOT what a home user or small office is looking for. They are looking for a safe place to put their data, and the exact location is not important.

Home or small business users will seldom read the backups.  Perhaps not ‘seldom’, but the reality is that you will do plenty of writes to update the backup.  In some cases you might do more reads (if you share the data with media players, other computers on your network, etc.).

One more difference between the products that are most sold in this arena?  Large organizations are also looking for quick access to the data.  That is, for the home and small business the usage of this kind of thing is for ‘backup or archival’ purposes, whereas the large organizations are using the data put onto such a device as production data, and they expect lots of access and lots of updates.

SAN and / or NAS systems in the large organizations are for processing production data, not for backups.  So you begin with a fundamental difference in the use of these devices.  This is partly why the components tend to be expensive.

The good news for the home or small business is that the time-tested components used by the large organizations, while expensive, are likely to last longer than your need.

If I get some feedback on this introduction, I will introduce ‘task vs tool’ in the next ‘letter’ I write.

Designing the home NAS

In an earlier installment, I pointed out that popular branded solutions are surprisingly expensive for what reviews indicate is rather low-performing hardware.  So for my comparison, I’ll use the Synology DiskStation DS1513+, which reportedly has good performance, more than what a single-link gigabit Ethernet connection can handle.

It has quite a few things in common with the home-made solution: multiple Ethernet ports that can be used for redundancy or increased throughput, the ability to host related servers as “apps”, and not being user-friendly enough for novices to do the initial setup.

While I was doing this, the Synology DiskStation could be found for $830.  It contains a dual-core Atom D2700 running at 2.13 GHz and 2GB of DDR3 RAM.

Now, there are two ways to approach this.  Clearly a competent file server can run on low-end x86_64 processors with a small (by today’s desktop standards) amount of RAM.  The original FreeNAS was commonly used with hand-me-down hardware.

But, times have changed.  The new FreeNAS, a rewrite by iXsystems, was designed with more modern concerns in mind:  RAM is much cheaper now, and the system can be more capable and easier to write if it doesn’t have to cope with low-RAM installations.  In addition, the safety of ZFS against mysterious data corruption relies on the RAM not having mysterious corruption too, so it should be used with ECC RAM.  Then come dire warnings about Windows file shares (CIFS) being single-threaded and thus needing a fast CPU (as opposed to multiple slower cores), and features such as encryption demanding ever more CPU performance.  Oh, and the Realtek NIC used on many consumer motherboards is not good for FreeNAS; it needs an Intel NIC.

In short, I’m looking at a server-grade system, not a typical desktop or “gamer” enthusiast system.  What you don’t need is fancy overclock support, sound, lots of slots, multi-video-card support, etc., so a low-end server board is actually about the same price as a “fancy” desktop motherboard.

In particular, the Supermicro brand comes highly recommended.  I could have gotten an X9-series server motherboard and put a Xeon E3 v2 CPU on it.  But why stop there?  I spent more to go with the newer X10-series board and a Xeon E3 v3 “Haswell” CPU.  The X10SL7-F in fact contains an 8-channel SAS controller as well as the usual 6 SATA channels, sprouting a whopping 14 SATA connectors on the motherboard.  It also features IPMI 2.0 on its own dedicated network port, which is a wonderful feature and I’ll have more to say about it later.

So without further ado, here is the breakdown of my build:

Parts List

  • ICY DOCK MB153SP-B 3 in 2 SATA Internal Backplane Raid Cage Module: $63.99
  • Intel Xeon E3-1245V3 Haswell 3.4GHz LGA 1150 84W Quad-Core Server Processor: $289.99
  • SUPERMICRO MBD-X10SL7-F-O uATX Server Motherboard: $239.99
  • SeaSonic SSR-360GP 360W ATX12V v2.31 80 PLUS GOLD Certified Active PFC Power Supply, New 4th Gen CPU Certified Haswell Ready: $59.99
  • Fractal Design Define R4 Black Pearl w/ USB 3.0 ATX Mid Tower Silent PC Computer Case: $99.99
  • ZALMAN CNPS5X Performa 92mm FSB (Fluid Shield Bearing) Powerful Cooling Performance CPU Cooler: $19.99
  • 2 × 8GB PC3-12800 DDR3-1600MHz ECC Unbuffered CL11 HYNIX Memory: $178.92

Total without drives: $952.86

  • WD Red WD40EFRX 4TB IntelliPower 64MB Cache SATA 6.0Gb/s 3.5″ NAS Internal Hard Drive (bulk): 3 × $189.99 = $569.97
  • Seagate ST4000DM000 Desktop 4TB 64MB Cache: 2 × $155.49 = $310.98

Total for build: $1833.81

The raw power seriously outclasses the DiskStation, and is only $120 more.  With the X9/v2 option, it would have actually been less.

Oort build - inside

Above is the result, Oort, with the side open.  You can see the stack of 8 drive trays, and the large heat sink over the CPU.

[Photo: Oort, front view]

Here is a front view.  The grille along the edges allows air intake from the front.   The blank front face is imposing and mysterious… I wonder if I can get some artwork over it?

[Photo: Oort, front panel open]

And finally, with the front panel open.  There is foam sound-dampening on all the case surfaces including the inside of this door.  The ICY-Dock hot-swap bays are now accessible.  I plan to use these for backing up and mounting off-site volumes while they are resident.  The main drives require side access, which is simply a matter of removing two thumb screws.

Now back to the details.  The X10 (rather than X9) series mainboard allows the use of the newer Haswell processors, which run cooler and save power.  The onboard SAS saves what would be a hundred dollar PCI card, and is much easier as well since it provides common SATA-compatible connectors.  And finally, this motherboard has the wonderful IPMI 2.0 with full KVM-over-LAN.

For the CPU, I looked at the chart in Wikipedia, along with the prices and availability at NewEgg.  I chose the lowest (cheapest) Xeon E3 that had onboard graphics and hyperthreading.  Why do I need onboard graphics if the system doesn’t have a monitor?  I think that the monitor-over-LAN feature still requires an actual VGA; it doesn’t emulate one, but just captures the output.  There is a more primitive remote management feature that allows for a TTY-style console (also over LAN), but I don’t think that helps with the initial BIOS screens.  Also, with the standard built-in GPU I can use it for computation other than drawing graphics.  Maybe it will accelerate other software I run on the box at some point.

I’m keeping the box in a closet which, besides building up heat from the machines, gets afternoon sun on the outside wall.  The closet is warm in the summer.  My experience with the stock cooler that comes with the CPU is that it’s loud or even inadequate.  Browsing NewEgg, I looked for this style of cooler with low noise and a good price.  I normally like this style in part because it takes a standard square fan which can be updated and replaced, but the Zalman is known for quiet fans too.  I mounted it, not with the thermal grease that it came with, but with Phobya HeGrease, carefully applied and spread.

The RAM was not available at NewEgg.  Apparently ECC-but-not-Buffered/Registered memory is also uncommon.  Buffering is used to facilitate having many more memory sticks on a board, which is not the case for this server-grade but desktop-sized board.  I found it at a specialty RAM site, www.memoryamerica.com, which has a wide selection.  To be on the safe side, I looked at the brands that Supermicro had tested on this board, and took the cheaper of the two.  16GiB uses two of the four memory slots, so it can be doubled in the future.

I use Seasonic power supplies, and that’s another story.  I looked for “Haswell support”, which enables a new improved stand-by mode.

Now for the case:  some mentions on the FreeNAS web forum led me to Fractal Design.  I followed up by reading reviews and the manufacturer’s web site.  There are a couple of models that are so similar that I wonder what the difference is!  Since there is no direct explanation, it takes reading the specs very carefully and comparing the dimensions to spot the real differences.  This R4 features an internal stack of 8 HDD trays (with anti-vibration mounting) plus two half-height 5¼″ external bays.  If you include two SSDs stuck elsewhere, that is 13 drives total, which is nicely close to the motherboard’s support of 14.

I chose an option with two external bays so I could fit a 3-disk hot-swap backplane.  Here I went with the name-brand ICY DOCK and a with-tray design, because I had trouble with a trayless one on Mercury.  Using the front-loading drive bays therefore requires two mounting screws per tray, which is not very handy as it turns out.

Worse, the “2 half height bays” is a little exaggerated.  It’s more like 1.95 half height bays, as a large bulge protrudes into the area where the bottom bay should be.  I had to remove the bottom piece of sheet metal from the ICY-Dock in order to squeeze it in; this also got rid of the normal mounting ears.  I’ll make a bracket some day (a perfect job for a 3D printer), but it fits tightly and is not heavily used, so I left it without screws for the time being.

Other than that, assembling was easy and straightforward.  Testing proved interesting and adventuresome, and I’ll tell you about that later.

Network File Sharing

What are the ramifications of using NAS (Network Attached Storage) instead of DAS (Direct-Attached Storage)?  Will I want to work directly off of it, or keep work locally and only use it for backing up?  The answer will vary with the nature of the work, I think.

File Copy

Let’s start with simple file copying.  This was performed using the command line, not the File Explorer GUI.

Large files

17 files totaling 273.6 GiBytes

  • Windows (CIFS) share: 49 min 41 sec (94 MiBytes/second)
  • NFS share:
  • Different drive on same machine: 48 min 2 sec (97 MiBytes/second)

Small files

14,403 files in 1,582 directories totaling 2.53 GiBytes

  • CIFS share: 64 min 53 sec (683 KiBytes/second)
  • Same drive: 3 h 25 min (216 KiBytes/second)
  • Different drive on same machine: 56 min 50 sec (780 KiBytes/second)

For large files, the transfer rate of 94 MiBytes/second (or 98.6 MBytes/second) is respectable.  Everything I could find on real-world speed of Gigabit Ethernet is outdated, with home PCs being limited by the hard drive speed.  Note that I’m going through two cheap switches between the test machines.

The small-file case is two orders of magnitude slower!  This bears out the common wisdom that it’s faster to zip up a large collection of files first, and then transfer the zip file over the network, and unzip on the receiving side.

I think that the speed is limited by the individual remote file open/close operations, which are slow on Windows, and the network adds latency if these are synchronous operations.  The DAS (different drive on the same computer) is only 14% faster than the NAS in this case.  The data transfer time of the file content is only 0.7% of the time involved.  The real limiting factor seems to be the ~4 files or directories processed per second (14,403 files plus 1,582 directories in roughly 57 to 65 minutes works out to about 4 per second).  That does not sound at all realistic, as I’ve seen programs process many more files than that.  There must be some quantity after which it slows down to the observed rate.  Since it is similar for DAS and NAS, it must be a Windows problem.  I’ll have to arrange some tests using other operating systems, later.

Working with Files

Compiler

This is what I do all day.  What are the ramifications of keeping my work on the NAS, as compared with other options?

Compile Build Job

  • Project located on local drive (a regular directory or a VHD makes no difference): 4 min 9 sec
  • Project located on NAS, accessed via CIFS (normal Windows share): 10 min 29 sec
  • Using NFS share instead: N/A

The Microsoft Visual Studio project reads a few thousand files, writes around 1500, reads those again, and writes some more.  When the project is located on a local drive, the CPU usage reads 100% most of the time, indicating that the job is CPU bound.

When the project is located on the NAS, the situation is quite different.  Given that the actual work performed is the same, the difference in time is due to file I/O.  And the extra I/O takes more time than the job did originally; that is, the time more than doubled.  It was observed that the CPU utilization was not maxed out in this case.  The file I/O dominated.

The same job was performed again immediately afterwards, giving the computer a chance to use cached file data it had already read recently.  That made no difference to the time.  It appears that with Windows (CIFS) shares, even on Windows 7 (the sharing protocol was significantly reworked as of Vista), file data is not cached in memory but is re-read each time it is needed.  That, or the “lots of small files speed limit”, or both, kills performance.

I tried to repeat that using a NFS share instead of the CIFS share.  However, I could not get it to work at all.  The Windows machine could see the file names and navigate the directories, but could not read any file.

Video Encoding

Encoding a video entailed reading one very large file and writing one smaller file.  The process’s performance metrics indicate reading only 2.5MB/s and writing merely 80KB/s.  I would not expect it to matter if the input file, output file, or both were on the NAS.

Likewise, video editing and programs like Photoshop will read in the files and maintain the contents in memory, or manage their own overflow swap space (which you put on a local drive).  It’s harder to do actual timing here, but the impression is that various programs are perfectly responsive when the files are directly attached.  If that changes when using the NAS instead, I’ll note the circumstances.

Caveat

All of the performance characteristics above are made with the assumption that the storage unit and the network links are all mine for the duration of the test.  If multiple people and pets in the household are using the NAS, you have the added issue of having to divide up the performance among the simultaneous users.

Note that FreeNAS does support link aggregation, so I could plug in two gigabit Ethernet cables if I replaced the switch with one that also understood aggregation.

I need a (home made) NAS!

I ran out of space on my RAID-5 drive, which I built several years ago.  At the time, 1TB drives were as large as you could get before the price increased disproportionately to the capacity.  I bought 2 “enterprise” grade drives for around $250 each, and a consumer drive for half that.  The usable capacity is 2TB because of the redundancy.

I decided I was not going to lose data ever again.  Having redundancy against drive failure is one component of this.  So, all my photos and system backups are stored there, along with anything else I want to keep safe, including Virtual Machine images.

It turns out that a lot of the space is taken by the daily backups.  Even with a plan that uses occasional full backups and mostly incremental backups, they just keep on coming.  I also need to better tune the quota for each backup task, but with multiple tasks on multiple machines there is no feature to coordinate the quotas across everything.

Meanwhile, a new drive I installed had a capacity of 3TB by itself.  Lots of room for intermediate files and things I don’t need to keep safe.  But that’s more to be backed up!

Now I could simply replace the drives with larger ones, using the same directly attached controller and chassis space.  But there are reasons for looking at Network Attached Storage.  They even have Drobo NAS units at WalMart now, so it must be quite the mainstream thing.

Besides future upgradability to more drives (rather than just replace the drives with larger ones again), and better compatibility with different operating systems used in the home, and specialized media services for “smart” TVs and tablets, a compelling reason for me is the reason I’m using a RAID-5 in the first place:  to preserve my data.  As I’ve noted elsewhere, silent data loss is a bigger problem than generally realized, and it is actually growing.  Individual files somehow go bad, and are never noticed until long after you have reused backups from that period.

Direct-Attached storage — simply having the drive on my main PC — limits the choice of file systems.  In particular, Windows 7 doesn’t have full support for anything other than NTFS and various varieties of FAT, and new more advanced file systems are only available on a few operating systems as they are not widely used yet.

A file system that specifically keeps data integrity and guards against silent errors is ZFS.  When I first learned about it, it was only available for BSD-family operating systems.  A NAS appliance installation (using FreeBSD) called FreeNAS started up in 2005.  More generally, someone could run a FreeBSD or Linux system that has drives attached that are using ZFS or btrfs or whatever special thing is needed, and put that box on the network.

As I write this, a Drobo 5N (without disks) sells for $550.  It reportedly uses a multi-core ARM CPU, and is still underpowered according to reviews.  Most 2-disk systems seem to use one of two ARM-based SoCs that cost about $30.  Now you could put something like the Addonics RAID-5 SATA port multiplier ($62) on that to control more disks at a low price.  Most 5-disk home/SOHO NAS systems seem to be based on x86 Atom boards.

Anyway, if you used hand-me-down hardware, such as the previous desktop PC you just replaced with a newer model, you’d have a much more powerful platform for free.   Buying a modest PC motherboard, CPU, and RAM for the purpose (supposing you had a case and power supply laying around) could be found for … (perusing the NewEgg website for current prices) … $225.

So basically, if you know what you’re doing (or can hire someone to do it for you for a hundred dollars) you can get hardware substantially more powerful for a fraction of the price.

Being an enthusiast who’s never bought a pre-made desktop PC, it’s a no-brainer for me to put something together from parts, even if it only had the features of these home/SOHO appliances that have become so common, and even if I don’t re-use anything I have on hand.

But, none of the NAS boxes I see advertised discuss anything like the silent data corruption problem.  They don’t say what kind of file system is being used, or how the drives might be mounted on a different system in the event of a board (not drive) failure if a replacement for the exact model is no longer available.  I would think that if a NAS had advanced data integrity features then it would feature prominently in the advertising.  So, build I must, to meet the requirements.

In future posts I’ll discuss the silent corruption problem at more length, and of course show what I actually built.  (I’ve named the NAS server OORT, by the way.)


Archival File Storage

More people these days have important computer data that they want to keep.  More (most?) household business is done with computer files now.  Hobbies and interests are likely to involve computer files.  And, the all important family photos and videos are now digital.

Around the year 2000 I stopped using 35mm film, as digital cameras were getting good enough to be interesting.  I also realized that the capacity of hard disk space was growing faster than the resolution of the cameras, so I should be able to keep all my photos immediately available on the computer’s drive, as opposed to storing them in drawers full of discs and having to put in the right disc to get the desired files.

We want to keep these files safe and sound.

So what exactly does “safe” mean?

It is easy to understand the threat of a drive dying, media being damaged or unreadable after being stored for a long time, and accidentally deleting or saving over the wrong file.

Having multiple copies protects against these things.  Automated replication/backing up won’t necessarily protect you against destroying files by mistake, but there are specific ways of guarding against that which I’ll discuss in another post.  It’s replication and redundancy that I want to discuss here.

Let me relate an anecdote.  One time I wanted to do something with an old project I had worked on previously, but when I opened the file I found it was filled with gibberish!  I looked at my backup, and it was the same.  Whatever had happened to the file had occurred some time ago, without being noticed at the time.  I had continued backing up the now-corrupted file, and a good copy was by then beyond my retention period for the backups.

Fortunately, I found a copy on another disc where I had made an ad-hoc backup of the work and put it in a drawer to be forgotten.  Many years later, large data centers report that “silent corruption” affects more stored data than previously realized.  Just as with my backups, a RAID-5 verification won’t detect anything amiss.

So, I worry about the integrity of all my saved photos, old projects, and whatever else I’ve saved for a long time.

I’ve thought about ways to perform a cryptographic hash of each file’s contents, to be used as a checksum.  Repeating this will verify that the files are still readable and unchanged.  For example, when putting backup files on a disc I’ve used a command-line script to generate a text file of filenames and hashes, and include that on the disc too.

With files being stored on always-available hard drives, it is possible to automatically and periodically check these.  For example, do it just before performing a backup so you don’t back up a corrupted file, and you are alerted to restore it instead!  This is complicated by the fact that some files are changed on purpose, and how can an automated tool know which are not expected to be changed and which are really in current use?  Also, with arbitrary files — large numbers of files in a deep directory tree — it is more difficult to store the hash of each file.

So, I’ve never gotten around to making an automated system that does this.

My plans and dabbling had been to use XML to store information for each file, including the date/time stamp, hash, and where it was last backed up to and when, and any necessary overrides to the backup policy.  That would then be used to perform incremental backups, and re-checking would be part of the back-up process.

This problem can be simplified in light of current habits, and to just address one single issue.  It won’t track backup generations and such, but will only give the hash of each file.  The directory will be purely for archival files, so any change at all other than adding files will be something to yell about.  And finally, don’t worry about bloat in the hash file due to repeating the directory name over and over — it will be insignificant compared to the files anyway, or can be handled by running a standard compression tool on the file.

Originally, my hash generation program for media was done using the 4NT command shell.  However, the tool has evolved in ways that I don’t care for, so I stopped buying upgrades.  Now it (the plain command-line version) is known as TCC/LE, and this free version blocks the MD5 hashing function.  So, I’ve lost features I’ve previously paid for because I don’t want to buy features I don’t want.  I suppose I could dig out and preserve an old copy, but then that wouldn’t be something useful to you unless you also bought the tool, and only on Windows.

Writing such a file is/was trivial.  Just write the formatted directory listing to a file, along with a note of exactly the command arguments used.  (Include the hash and the file name as a relative (not fully-qualified) name, and skip the hash file itself).  Now recall that this was to be placed on saved media, which would then not change again.  So, repeating the same command on the top directory of the media would generate exactly the same file.  If a dumb byte-by-byte compare turned up anything, then any common DIFF tool would be used to turn up details.

An example sha256.txt file from 2007. This is placed on the backup media along with the files. Note that the file begins with instructions on how it was created, so it may be re-run to check the files for corruption even if you don’t remember how you did it.

Result of running:
     pdir /(f z @sha256[*]) *.tib |tee sha256.txt


Pluto-1.tib 2186522624 B5D19101E8821CE886E56738A9207E00E6C324DB4EC9A01111E9469A6FA2C233
Pluto-2.tib 2301180416 8DF3BDD0F4A1390A56537BE0D1BD93628BD1EB52D15B49BCFC272C7869C2CC53
Pluto-3.tib 2959220224 0E1FA34E0DCB9D1EE480EE75F1394DE9C95347D247A84EE52C746F175DC579D9
Pluto-41.tib 4660039168 9813846E427B4AFF8996A6AE275E4F9DB7C4897A46362A7B2FCA849A7E948E8F
Pluto-42.tib   11684864 2ADD81B304A19EE753EA8E868562B95A959214897F4846905CD7CD65F23EB817
Pluto-51.tib 4660039168 224B85F067026F0360E4ECEEB03E1EE49CE751BF37180453C6732935032BE0C9
Pluto-52.tib 1935769600 F10764774D18A1469CD48B4A605575D21DE558DE3129201F0DE1E82CCD6D6D1B

As a check of a live replication destination, it needs to be fully automated and handle added files, and allow me to confirm any deletion was done on purpose but then remove the matching hash data.

So, the most difficult part is reading in the hash data.  Exactly how that’s approached would depend on the programming language and/or framework used.  It doesn’t have to be XML, but can be minimally simple.

As I’ve implied, I want to produce such a tool (at long last!) for my own use and as something I can give away and encourage everyone else to use, too.  It should be available on all platforms, not just Windows.  So, what should I write it in?

Using Perl would be very easy, as it has common library code available for performing the hash and for traversing a directory structure, and also makes it dead simple to read and parse the text file containing the results to compare against.  The only real drawback would be the difficulty for non-technical Windows users to install, since Perl is not included on Windows by default.

Writing it in standard C++ using common portable library code means that it could be compiled for any platform to produce an executable that can be fully stand-alone.  Assuming that the SHA-256 code is obtainable easily enough, it would still need code to traverse the directory and to suck back in the results; the very things that are trivial in Perl.
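
For what it’s worth, the traversal-and-hash part is also compact in Haskell.  This is a rough sketch only, assuming the SHA, directory, and filepath packages; the function names and output details are mine, not a finished tool:

import qualified Data.ByteString.Lazy as BL
import Data.Digest.Pure.SHA (sha256, showDigest)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))
import Control.Monad (forM_)

-- Recursively collect the files under a path.
walk :: FilePath -> IO [FilePath]
walk path = do
    isDir <- doesDirectoryExist path
    if isDir
        then do
            names <- getDirectoryContents path
            let real = filter (`notElem` [".", ".."]) names
            concat <$> mapM (walk . (path </>)) real
        else return [path]

-- One "name size hash" line, similar to the sha256.txt example above.
hashLine :: FilePath -> IO String
hashLine f = do
    contents <- BL.readFile f
    return (f ++ " " ++ show (BL.length contents) ++ " " ++ showDigest (sha256 contents))

main :: IO ()
main = do
    files <- walk "."
    forM_ (filter (/= "./sha256.txt") files) (\f -> hashLine f >>= putStrLn)

Comparing a fresh run of something like this against the stored listing is then just a text diff, as described earlier.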

To be continued…

[Citation Needed] — the intelligent heckler

The xkcd comics can be funny normally, but this one strikes me as particularly humorous because of other things I’ve seen recently.   In particular, the YouTube contributor potholer54 is a former science journalist who not only gives his comments on various bunk but explains why you don’t have to take his word for it, and how to spot possible bunk through journalistic techniques.  Most importantly, follow up on the sources.  See if the proponent is just repeating (and further distorting) a mis-reported story, and find the original that started it.

In this Internet age, it is easier than ever.  Just click on the link, or use Google.  Long before I saw these techniques spelled out, I recall reading something that seemed fishy.  In a minute or two I figured out that all reports were just repeating the company’s own white paper (and each other).

In order for “news” outlets to return to some standard of accuracy and integrity, their readers need to care.  There are resources like Snopes that people can easily check, and you can well imagine browser extensions that automatically indicate the credibility of an article.  But using those would require that more people care, and that in turn should push back on the providers.

So really, if someone is telling you something (or posting, or publishing), you shouldn’t necessarily believe him.  Is he just making things up?  Did he get the facts wrong?  Is he deliberately distorting the picture?  For important issues like climate change and GMO foods, you can and should find out for yourself who to trust on the subject.