
Archival File Storage

More people these days have important computer data that they want to keep.  More (most?) household business is done with computer files now.  Hobbies and interests are likely to involve computer files.  And the all-important family photos and videos are now digital.

Around the year 2000 I stopped using 35mm film, as digital cameras were getting good enough to be interesting.  I also realized that the capacity of hard drives was growing faster than the resolution of the cameras, so I should be able to keep all my photos immediately available on the computer’s drive, as opposed to storing them in drawers full of discs and having to put in the right disc to get the desired files.

We want to keep these files safe and sound.

So what exactly does “safe” mean?

It is easy to understand the threat of a drive dying, media being damaged or unreadable after being stored for a long time, and accidentally deleting or saving-over the wrong file.

Having multiple copies protects against these things.  Automated replication/backing up won’t necessarily protect you against destroying files by mistake, but there are specific ways of guarding against that which I’ll discuss in another post.  It’s replication and redundancy that I want to discuss here.

Let me relate an anecdote.  One time I wanted to do something with an old project I had worked on previously, but when I opened the file I found it was filled with gibberish!  I looked at my backup, and it was the same.  Whatever had happened to the file had occurred some time ago, without being noticed at the time.  I had kept backing up the now-corrupted file, and a good copy was by then beyond the retention period of my backups.

Fortunately, I found a copy on another disc where I had made an ad-hoc backup of the work and put it in a drawer to be forgotten.

Many years later, large data centers report that “silent corruption” affects more stored data than previously realized.  Just as with my backups, a RAID-5 verification won’t detect anything amiss.

So, I worry about the integrity of all my saved photos, old projects, and whatever else I’ve saved for a long time.

I’ve thought about ways to perform a cryptographic hash of each file’s contents, to be used as a checksum.  Repeating this will verify that the files are still readable and unchanged.  For example, when putting backup files on a disc I’ve used a command-line script to generate a text file of filenames and hashes, and include that on the disc too.
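
For illustration, here is roughly what such a script could look like as a small Perl program (my sketch, not the original command-line script; it assumes the standard Digest::SHA module and prints one “name size hash” line per file):

    #!/usr/bin/perl
    # Sketch: print "name size sha256" for each file named on the command line.
    # Redirect the output to a text file and include that file on the disc too.
    use strict;
    use warnings;
    use Digest::SHA;

    for my $name (@ARGV) {
        next unless -f $name;              # skip anything that isn't a plain file
        my $size = -s $name;               # size in bytes
        my $sha  = Digest::SHA->new(256);  # SHA-256 context
        $sha->addfile($name, "b");         # hash the file contents in binary mode
        printf "%s %s %s\n", $name, $size, uc $sha->hexdigest;
    }

Invoked as, say, “perl hashlist.pl *.tib > sha256.txt” (hashlist.pl being whatever you choose to name the script, with the shell expanding the wildcard), it produces a listing much like the example shown further below.  The exact format doesn’t matter much, as long as re-running the same command later produces the same output.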

With files being stored on always-available hard drives, it is possible to check these automatically and periodically.  For example, do it just before performing a backup, so you don’t back up a corrupted file and are alerted to restore it instead!  This is complicated by the fact that some files are changed on purpose: how can an automated tool know which files are expected to stay fixed and which are really in current use?  Also, with arbitrary files (large numbers of files in a deep directory tree) it is more difficult to store the hash of each file.

So, I’ve never gotten around to making an automated system that does this.

My plans and dabbling had been to use XML to store information for each file, including the date/time stamp, hash, and where it was last backed up to and when, and any necessary overrides to the backup policy.  That would then be used to perform incremental backups, and re-checking would be part of the back-up process.
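
For what it’s worth, a record in that scheme might have looked something like this (the element and attribute names, and the date and backup details, are made up just to give the flavor):

    <file name="Pluto-1.tib" size="2186522624" modified="2007-08-15T21:04:11">
      <hash algorithm="SHA-256">B5D19101E8821CE886E56738A9207E00E6C324DB4EC9A01111E9469A6FA2C233</hash>
      <backup destination="offsite-disc-12" when="2007-08-20" policy="default"/>
    </file>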

This problem can be simplified in light of current habits, by addressing just one single issue.  It won’t track backup generations and such, but will only give the hash of each file.  The directory will be purely for archival files, so any change at all other than adding files will be something to yell about.  And finally, don’t worry about bloat in the hash file due to repeating the directory name over and over: it will be insignificant compared to the files anyway, or it can be handled by running a standard compression tool on the file.

Originally, my hash-generation program for media was done using the 4NT command shell.  However, that tool has evolved in ways that I don’t care for, so I stopped buying upgrades.  Now it (the plain command-line version) is known as TCC/LE, and this free version blocks the MD5 hashing function.  So, I’ve lost features I had previously paid for because I don’t want to buy features I don’t want.  I suppose I could dig out and preserve an old copy, but that wouldn’t be useful to you unless you also bought the tool, and it is only available on Windows anyway.

Writing such a file is (or was) trivial.  Just write the formatted directory listing to a file, along with a note of exactly the command arguments used.  (Include the hash and the file name as a relative, not fully-qualified, name, and skip the hash file itself.)  Now recall that this was to be placed on saved media, which would then never change again.  So, repeating the same command on the top directory of the media would generate exactly the same file.  If a dumb byte-by-byte compare showed any difference, any common DIFF tool could then be used to turn up the details.

Here is an example sha256.txt file from 2007.  This is placed on the backup media along with the files.  Note that the file begins with a record of exactly how it was created, so the same command can be re-run to check the files for corruption even if you don’t remember how you did it.

Result of running:
     pdir /(f z @sha256[*]) *.tib |tee sha256.txt


Pluto-1.tib 2186522624 B5D19101E8821CE886E56738A9207E00E6C324DB4EC9A01111E9469A6FA2C233
Pluto-2.tib 2301180416 8DF3BDD0F4A1390A56537BE0D1BD93628BD1EB52D15B49BCFC272C7869C2CC53
Pluto-3.tib 2959220224 0E1FA34E0DCB9D1EE480EE75F1394DE9C95347D247A84EE52C746F175DC579D9
Pluto-41.tib 4660039168 9813846E427B4AFF8996A6AE275E4F9DB7C4897A46362A7B2FCA849A7E948E8F
Pluto-42.tib   11684864 2ADD81B304A19EE753EA8E868562B95A959214897F4846905CD7CD65F23EB817
Pluto-51.tib 4660039168 224B85F067026F0360E4ECEEB03E1EE49CE751BF37180453C6732935032BE0C9
Pluto-52.tib 1935769600 F10764774D18A1469CD48B4A605575D21DE558DE3129201F0DE1E82CCD6D6D1B
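
Checking the media later is then just the re-run-and-compare step described above.  A minimal sketch, assuming the fresh listing has already been regenerated into a second file:

    #!/usr/bin/perl
    # Sketch: byte-by-byte compare of the stored listing against a fresh one.
    # Any difference at all on read-only media is something to investigate with diff.
    use strict;
    use warnings;
    use File::Compare;

    my ($stored, $fresh) = ('sha256.txt', 'sha256-check.txt');   # hypothetical names
    if (compare($stored, $fresh) == 0) {
        print "OK: listings are identical.\n";
    } else {
        print "MISMATCH: diff $stored against $fresh for details.\n";
    }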

To check a live replication destination, though, the tool needs to be fully automated: it must handle added files, and it must let me confirm that any deletion was done on purpose and then remove the matching hash data.

So, the most difficult part is reading in the hash data.  Exactly how that’s approached would depend on the programming language and/or framework used.  It doesn’t have to be XML, but can be minimally simple.
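
For example, if the listing stays in the simple “name size hash” form shown above, reading it back into a lookup table is only a few lines in a scripting language (a sketch; it assumes file names without embedded spaces, as in the example):

    #!/usr/bin/perl
    # Sketch: read "name size hash" lines into a table keyed by file name.
    use strict;
    use warnings;

    my %expected;    # file name => [ size, sha256 ]
    open my $fh, '<', 'sha256.txt' or die "cannot open sha256.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # skip the header lines; keep only "name size 64-hex-digit-hash" records
        next unless $line =~ /^(\S+)\s+(\d+)\s+([0-9A-Fa-f]{64})$/;
        $expected{$1} = [ $2, $3 ];
    }
    close $fh;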

As I’ve implied, I want to produce such a tool (at long last!) for my own use and as something I can give away and encourage everyone else to use, too.  It should be available on all platforms, not just Windows.  So, what should I write it in?

Using Perl would be very easy, as it has common library code available for performing the hash and for traversing a directory structure, and also makes it dead simple to read and parse the text file containing the results to compare against.  The only real drawback would be the difficulty for non-technical Windows users to install, since Perl is not included on Windows by default.
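
To make that concrete, here is a rough sketch of the traversal and checking in Perl, using the standard File::Find and Digest::SHA modules (the archive path is a placeholder, and %expected is the table read in as sketched above):

    #!/usr/bin/perl
    # Sketch: walk the archive tree, hash every file, and report anything that
    # was added, changed, or is missing relative to the stored listing.
    use strict;
    use warnings;
    use File::Find;
    use File::Spec;
    use Digest::SHA;

    my $top = '/archive';    # placeholder for the archive root
    my %expected;            # name => [ size, hash ], filled from sha256.txt as above
    my %seen;

    find({ no_chdir => 1, wanted => sub {
        my $path = $File::Find::name;
        return unless -f $path;
        my $rel = File::Spec->abs2rel($path, $top);       # store relative names
        $seen{$rel} = 1;
        my $sha = Digest::SHA->new(256);
        $sha->addfile($path, "b");
        my $hash = uc $sha->hexdigest;
        if (!exists $expected{$rel}) {
            print "ADDED: $rel\n";                        # new file, record its hash
        } elsif ($hash ne $expected{$rel}[1]) {
            print "CHANGED or CORRUPT: $rel\n";           # something to yell about
        }
    } }, $top);

    for my $name (sort keys %expected) {
        print "MISSING: $name\n" unless $seen{$name};     # confirm deletions were on purpose
    }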

Writing it in standard C++ using common portable library code means that it could be compiled for any platform to produce an executable that can be fully stand-alone.  Assuming that the SHA-256 code is obtainable easily enough, it would still need code to traverse the directory and to suck the results back in: the very things that are trivial in Perl.

To be continued…