More people these days have important computer data that they want to keep. Much (most?) household business is done with computer files now. Hobbies and interests are likely to involve computer files. And the all-important family photos and videos are now digital.
Around the year 2000 I stopped using 35mm film, as digital cameras were getting good enough to be interesting. I also realized that hard disk capacity was growing faster than camera resolution, so I should be able to keep all my photos immediately available on the computer’s drive, rather than storing them in drawers full of discs and having to put in the right disc to get the desired files.
We want to keep these files safe and sound.
So what exactly does “safe” mean?
It is easy to understand the threats: a drive dying, media being damaged or unreadable after long storage, and accidentally deleting or saving over the wrong file.
Having multiple copies protects against these things. Automated replication/backing up won’t necessarily protect you against destroying files by mistake, but there are specific ways of guarding against that which I’ll discuss in another post. It’s replication and redundancy that I want to discuss here.
Let me relate an anecdote. One time I wanted to do something with an old project I had worked on previously, but when I opened the file I found it was filled with gibberish! I looked at my backup, and it was the same. Whatever had happened to the file had occurred some time ago, without being noticed at the time. I had kept backing up the now-corrupted file, and every good copy was by then beyond my backups’ retention period.
Fortunately, I found a copy on another disc where I had made an ad-hoc backup of the work and put it in a drawer to be forgotten. Many years later, large data centers report that “silent corruption” affects more stored data than previously realized. Just as with my backups, a RAID-5 verification won’t detect anything amiss.
So, I worry about the integrity of all my saved photos, old projects, and whatever else I’ve saved for a long time.
I’ve thought about ways to perform a cryptographic hash of each file’s contents, to be used as a checksum. Repeating this will verify that the files are still readable and unchanged. For example, when putting backup files on a disc I’ve used a command-line script to generate a text file of filenames and hashes, and include that on the disc too.
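A minimal sketch of such a manifest generator in Python (the file layout, the `hashes.txt` name, and the function names are my own illustration, not the original 4NT script; SHA-256 stands in for whatever hash you prefer):

```python
import hashlib
import os

def hash_file(path, algo="sha256", chunk_size=65536):
    """Hash a file's contents in chunks so large files don't fill memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(top_dir, manifest_name="hashes.txt"):
    """Walk top_dir and write 'hexhash  relative/path' lines,
    skipping the manifest file itself."""
    lines = []
    for root, _dirs, files in os.walk(top_dir):
        for name in sorted(files):
            full = os.path.join(root, name)
            rel = os.path.relpath(full, top_dir)
            if rel == manifest_name:
                continue
            lines.append(f"{hash_file(full)}  {rel}")
    with open(os.path.join(top_dir, manifest_name), "w") as out:
        out.write("\n".join(lines) + "\n")
```

Sorting the file names keeps the output stable, so re-running the command on unchanged media reproduces the manifest byte-for-byte.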
With files being stored on always-available hard drives, it is possible to check these automatically and periodically. For example, do it just before performing a backup, so you don’t back up a corrupted file and are alerted to restore it instead! This is complicated by the fact that some files are changed on purpose: how can an automated tool know which files are really in current use and which are not expected to change? Also, with large numbers of arbitrary files in a deep directory tree, it is more difficult to store the hash of each file.
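The verification pass could be sketched like this, assuming the simple “hash, two spaces, relative path” text format for the stored hashes (again, `read_manifest` and `verify` are illustrative names of my own, not part of any existing tool):

```python
import hashlib
import os

def read_manifest(path):
    """Parse 'hexhash  relative/path' lines back into a dict."""
    entries = {}
    with open(path) as f:
        for line in f:
            digest, _, rel = line.rstrip("\n").partition("  ")
            if rel:
                entries[rel] = digest
    return entries

def verify(top_dir, manifest_name="hashes.txt"):
    """Re-hash each recorded file and return the paths whose contents changed."""
    changed = []
    manifest = read_manifest(os.path.join(top_dir, manifest_name))
    for rel, recorded in manifest.items():
        h = hashlib.sha256()
        with open(os.path.join(top_dir, rel), "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        if h.hexdigest() != recorded:
            changed.append(rel)
    return changed
```

Run before each backup: an empty result means everything still matches; anything else is a file to restore rather than back up.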
So, I’ve never gotten around to making an automated system that does this.
My plans and dabbling had been to use XML to store information for each file, including the date/time stamp, hash, and where it was last backed up to and when, and any necessary overrides to the backup policy. That would then be used to perform incremental backups, and re-checking would be part of the back-up process.
This problem can be simplified in light of current habits, and reduced to just one issue. It won’t track backup generations and such, but will only give the hash of each file. The directory will be purely for archival files, so any change at all other than adding files will be something to yell about. And finally, don’t worry about bloat in the hash file due to repeating the directory name over and over: it will be insignificant compared to the files anyway, or can be handled by running a standard compression tool on the file.
Originally, my hash generation program for media was done using the 4NT command shell. However, the tool has evolved in ways that I don’t care for, so I stopped buying upgrades. Now it (the plain command-line version) is known as TCC/LE, and this free version blocks the MD5 hashing function. So, I’ve lost features I’ve previously paid for because I don’t want to buy features I don’t want. I suppose I could dig out and preserve an old copy, but then that wouldn’t be something useful to you unless you also bought the tool, and only on Windows.
Writing such a file is (or was) trivial: just write the formatted directory listing to a file, along with a note of exactly which command arguments were used. (Include the hash and the file name as a relative, not fully-qualified, path, and skip the hash file itself.) Now recall that this was to be placed on saved media, which would then never change again, so repeating the same command on the top directory of the media would generate exactly the same file. If a dumb byte-by-byte compare turned up anything, any common DIFF tool could be used to find the details.
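That check-by-regeneration step is small enough to sketch directly: regenerate the listing to a scratch file, byte-compare it against the one stored on the media, and only fall back to a diff when they differ (a sketch using Python’s standard `filecmp` and `difflib`; `check_media` is an illustrative name):

```python
import difflib
import filecmp

def check_media(old_manifest, new_manifest):
    """Byte-compare the stored manifest against a freshly regenerated one.
    Returns [] if identical, otherwise a unified diff of the differences."""
    if filecmp.cmp(old_manifest, new_manifest, shallow=False):
        return []
    with open(old_manifest) as a, open(new_manifest) as b:
        return list(difflib.unified_diff(a.readlines(), b.readlines(),
                                         fromfile=old_manifest,
                                         tofile=new_manifest))
```

`shallow=False` forces an actual content comparison rather than trusting file sizes and timestamps.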
As a check of a live replication destination, the tool needs to be fully automated and handle added files, and it should let me confirm that any deletion was done on purpose before removing the matching hash data.
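The added-and-deleted bookkeeping is just a set comparison between the recorded paths and what is actually on disk. A sketch, again assuming the hash data has been read into a dict keyed by relative path (`compare_tree` is my own illustrative name):

```python
import os

def compare_tree(top_dir, recorded, manifest_name="hashes.txt"):
    """Compare recorded manifest entries (dict of rel-path -> hash)
    against the files actually on disk.
    Returns (added, missing): new files to hash and record, and
    recorded files that have disappeared and need confirming."""
    on_disk = set()
    for root, _dirs, files in os.walk(top_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), top_dir)
            if rel != manifest_name:
                on_disk.add(rel)
    recorded_set = set(recorded)
    return sorted(on_disk - recorded_set), sorted(recorded_set - on_disk)
```

Added files get hashed and appended to the manifest automatically; missing files are the case to stop and ask about before dropping their entries.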
So, the most difficult part is reading in the hash data. Exactly how that’s approached would depend on the programming language and/or framework used. It doesn’t have to be XML, but can be minimally simple.
As I’ve implied, I want to produce such a tool (at long last!) for my own use and as something I can give away and encourage everyone else to use, too. It should be available on all platforms, not just Windows. So, what should I write it in?
Using Perl would be very easy, as it has common library code available for performing the hash and for traversing a directory structure, and also makes it dead simple to read and parse the text file containing the results to compare against. The only real drawback would be the difficulty for non-technical Windows users to install, since Perl is not included on Windows by default.
Writing it in standard C++ using common portable library code means that it could be compiled for any platform to produce an executable that can be fully stand-alone. Assuming that the SHA-256 code is obtainable easily enough, it would still need code to traverse the directory and to read the results back in: the very things that are trivial in Perl.
To be continued…