As defined in the Chunk Specification, every chunk in a ZIP2 archive includes a checksum byte. Its purpose is to verify that the data has not been corrupted, and that the boundaries of the data structures have been correctly noted.

The PKZIP file format uses a CRC-32 calculation for this purpose. The PNG file format uses Adler-32, which is weaker but designed to be faster. The ZIP2 file format uses a calculation that is faster and smaller still.

It is acceptable to have a relatively weak (1-byte) checksum because this is not a noisy medium. Only elementary checking is needed: to verify that a file has not been damaged, to facilitate reading good chunks out of a file where some chunks are damaged, and to verify the basic sanity of the data (that the length matches the content).

More advanced error checking is *optional* and can be added (via additional chunks) if more significant error detection is required. In cases where minimal size is favored over error detection and partial recovery, this one-byte-per-chunk code is the *minimum* mechanism that cannot be left out.

The checksum algorithm was chosen for several useful properties that serve our purpose.

First, changing any one bit will affect, on average, half the bits of the output. This holds even for the last input byte. Some simple checksums, such as adding each byte into the accumulator while shifting the whole thing, have a very small cascade factor: they must consume much more input before a bit's effect spreads over the entire result. Others, like simply xor'ing each byte with no shifting involved, have no propagation at all, and one bit in the input will only affect one bit of the output.
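This avalanche behaviour is easy to observe directly. Here is a minimal sketch in Python (the article's own code is Perl; the function name is my own): flip one bit of the last input byte and count how many of the 16 accumulator bits change.

```python
def accumulate(data, acc=1):
    """One ZIP2 step per byte: add the byte, multiply by 40503, keep 16 bits."""
    for b in data:
        acc = ((acc + b) * 40503) & 0xFFFF
    return acc

x = accumulate(b"Hello world!")   # last byte is 0x21
y = accumulate(b"Hello world ")   # last byte is 0x20: exactly one bit flipped
changed = bin(x ^ y).count("1")
print(changed, "of 16 bits changed")   # roughly half, on average
```

Even though only the final byte differs, the multiply spreads that single-bit difference across the whole accumulator.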

Second, two single-bit changes are independent and will not cancel each other out; instead, the second change further randomizes the result. It is important that errors cannot easily cancel, and that a consistently-applied error (such as losing the high bit) does not leave the result unchanged!

Third, it is short. To get maximum effect out of a one-byte checksum, the output should be uniformly distributed over all possible codes.

That describes what the checksum must do to perform its job. This particular algorithm also has other properties that make it nice to have.

It is fast. Each calculation does minimal processing on each byte of input, basically as little as possible while still meeting the essential properties. It also requires no set-up time, because it has no large internal state.

It uses no memory. It requires no large tables to produce an efficient implementation, and it uses no storage beyond a single integer while working.

It’s easy to specify. It does not require lists of constants, just two 16-bit numbers, one of which is the number 1.

It can be implemented efficiently on low-end microcontrollers. It uses multiplication but not division. What multiplication it needs can be done with a 16-by-16-bit multiply (keeping only the low 16 bits of the result), or three 8-by-8-bit multiplies (which are easily done with a table). It uses only one 16-bit storage location, plus the pointer for scanning the input.
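The three-multiply decomposition can be sketched as follows (Python used for illustration; the constant split and helper name are my own, not from the spec). Because 40503 = 0x9E37, the low 16 bits of the product need only three 8-by-8 partial products; the high-byte-times-high-byte term lands entirely above bit 15 and is never computed.

```python
K_HI, K_LO = 40503 >> 8, 40503 & 0xFF   # 0x9E and 0x37

def mul40503_lo16(a):
    """Low 16 bits of a * 40503, using only 8-by-8 multiplies."""
    a_hi, a_lo = a >> 8, a & 0xFF
    low  = a_lo * K_LO        # 8x8 -> up to 16 bits
    mid1 = a_lo * K_HI        # 8x8; contributes from bit 8 upward
    mid2 = a_hi * K_LO        # 8x8; contributes from bit 8 upward
    # a_hi * K_HI would shift entirely past bit 15, so it is skipped.
    return (low + ((mid1 + mid2) << 8)) & 0xFFFF

for a in (0, 1, 1234, 40503, 65535):
    assert mul40503_lo16(a) == (a * 40503) & 0xFFFF
print("8x8 decomposition matches the full multiply")
```

On an 8-bit CPU each partial product is one table lookup or hardware multiply, which is what makes the algorithm cheap there.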

But it does not *demand* 16-bit two’s-complement arithmetic either. It can use any larger word size without relying on implicit truncation of the higher bits, and without taking explicit steps to mask them off at every step. So it’s friendly to mainstream and advanced CPUs, too.

It can be implemented in high-level scripting languages that don’t provide machine-language-like access to the CPU. No rotate instruction is needed. It uses only addition and multiplication, and can be performed with any available number domain, including floating point! It does not demand access to two’s-complement arithmetic overflow behavior.

The essential mapping of an integer to another “shuffled” integer is given in section 6.4 of Knuth’s *The Art of Computer Programming*. That is, multiplying by (sqrt(5)−1)/2, scaled by the power of two that matches the integer size, will (once you ignore overflow) map all the representable integer values 1-to-1, in an apparently random order. For 16-bit words that constant is floor(65536 × (sqrt(5)−1)/2) = 40503.
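At the 16-bit size used here, the 1-to-1 property is small enough to check exhaustively. A quick sketch in Python: because 40503 is odd, multiplication modulo 2^16 permutes all 65536 values.

```python
# Multiplying by an odd constant modulo a power of two is a bijection:
# if the image set has no collisions, every 16-bit value is hit exactly once.
images = {(v * 40503) & 0xFFFF for v in range(65536)}
assert len(images) == 65536
print("v -> v * 40503 mod 2^16 is a permutation of the 16-bit integers")
```

An even constant would discard low-order information and could not be 1-to-1, which is one reason the scaled golden-ratio constant is taken as an odd integer.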

Mapping each input byte this way means that two input bytes that originally differ in one bit will be totally different and unrelated thereafter. That gives us part of our requirements.

Each byte is added to the accumulator, which is then transformed. This means that every result depends on all the input before it, yet even the last input byte can affect all the bits of the final result.

For each byte of input *b _{i}*, calculate

*a _{i}* = (*a _{i-1}* + *b _{i}*) × 40503 mod 2^{16}, with *a _{0}* = 1, and take the high byte of the final accumulator as the checksum.

In actual code,

    # written in Perl 5.6
    use strict;
    use warnings;

    sub hash_accumulate {
        my ($string, $accumulator) = @_;
        $accumulator = 1 unless defined $accumulator;
        foreach my $byte (unpack "C*", $string) {
            use integer;
            $accumulator = ($accumulator + $byte) * 40503;
            # $accumulator &= 0xffff;  # keep the low 16 bits only
            # Can skip that if numbers keep the least-significant bits and
            # discard overflow bits. So, skip it for integers on normal
            # architectures and perform it for floating point.
        }
        return $accumulator & 0xffff;
    }

    sub hash_result {
        my ($accumulator) = @_;
        return ($accumulator >> 8) & 0xff;
    }

    sub hash {
        my $accumulator;
        # illustrate that I can feed input to the algorithm
        # a little at a time, and accumulate a result
        foreach my $gulp (@_) {
            $accumulator = hash_accumulate($gulp, $accumulator);
        }
        return hash_result($accumulator);
    }

    ### main program - test data ###
    my $result1 = hash("Hello ", "world", "!");
    my $result2 = hash("Hello world!");
    print "$result1 should equal $result2\n";

    my @testdata = ("Hello World!", "XXX", "XXY", "XYX", 'A', 'a', 'b');
    foreach my $input (@testdata) {
        my $output = hash($input);
        printf("$input --> %02x\n", $output);
    }

Note that the main calculation is done by performing this calculation on each byte of input:

    $accumulator = ($accumulator + $byte) * 40503;

Note that the addition must be done with at least a 16-bit unsigned integer, but any larger size may be used. In a typical implementation, integers will be 32 bits and signed, but this does not matter: the extra bits can simply be ignored. You must have at least 16 bits, but need not do extra work to make it exactly 16.

If you have to use floating-point numbers, though, you should discard the high bits, because the result will not overflow as with integers but will “float” up and lose significance! That is, floating point drops the *lowest* bits when the numbers get too large, and this will not work. You can throw in a modulo 65536 to keep them from getting too large.

Likewise, the multiplication by the highly magic number 40503 can be done with an instruction that multiplies two 16-bit (unsigned) values and keeps only the least-significant 16 bits of the result. Or you can use a 32-bit multiply, and you don’t have to go out of your way to get rid of the extra bits of the result.

Actually, there is one pattern that is apparent if you analyse it, though it doesn’t spoil the results of randomness testing: a change to the high byte only of the value being multiplied will not change the low byte of the product, because carries in multiplication propagate only upward. This checksum avoids that problem by using only the high half of the accumulator, which contains fully-mixed values. In general, use Knuth’s function on a word size that’s twice what you need as a result, and use only the high half.
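That pattern, and why reporting the high half sidesteps it, can be demonstrated in a few lines (a Python sketch with values of my own choosing): the low byte of the product depends only on the low byte of the multiplicand, while the high byte mixes in everything.

```python
# Three values sharing the low byte 0x12 but with different high bytes:
# the product's low byte never changes, so a checksum built on it would
# miss any difference confined to the high byte.
values = (0x0012, 0x3412, 0xFF12)
low_bytes = {(v * 40503) & 0xFF for v in values}
print(len(low_bytes))   # one identical low byte for all three

# The high byte of the 16-bit product *does* respond to those changes.
high_bytes = {((v * 40503) & 0xFFFF) >> 8 for v in values}
print(len(high_bytes))  # several distinct high bytes
```

This is the concrete reason `hash_result` in the listing above returns `($accumulator >> 8) & 0xff` rather than the low byte.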

Page content copyright 2003 by John M. Dlugosz. Home: http://www.dlugosz.com, email: john@dlugosz.com