An aid for creating hash digests of large files
Digestif lets you create fast checksums of large files by
skipping sections of the file. It was created with compressed media
files in mind, which generally have such a high information density
that we can get away with a checksum that doesn’t actually consider all
the bits. Someday I’d like to understand the likelihood-of-collision
implications for specific compression algorithms (mp3, h.264, xvid, et al.),
but right now I’m going to settle for guessing at where “good enough for me”
might lie.
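The idea, roughly, looks like the sketch below. This is not digestif’s actual implementation; the digest algorithm, block size, and stride are made-up illustrative choices.

    require 'digest'

    # Sampled hashing sketch: read a small block every STRIDE bytes
    # instead of streaming the whole file off the disk.
    BLOCK  = 64 * 1024         # bytes read at each sample point (illustrative)
    STRIDE = 64 * 1024 * 1024  # distance between sample points (illustrative)

    def sampled_digest(path)
      digest = Digest::MD5.new
      size   = File.size(path)
      digest << [size].pack('Q>')   # mix in the length so truncated copies differ
      File.open(path, 'rb') do |f|
        offset = 0
        while offset < size
          f.seek(offset)
          digest << f.read(BLOCK).to_s
          offset += STRIDE
        end
      end
      digest.hexdigest
    end

With those illustrative numbers, only about 1/1024 of the file is ever read, which is why the I/O bottleneck all but disappears.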
One side effect of this approach is that the corruption-detecting nature of
digests is, of course, lost. This is really an inescapable artifact of the
problem we’re trying to solve: when hashing a really large file, the biggest
bottleneck on modern computers is streaming 5-10 gigs off of the disk; the
actual checksumming is not hard. By looking at less data, we speed up the
hash process immensely, but we become blind to corruption in the bytes we
skip. Because the purpose I have in mind for this tool is identity checking,
not corruption detection, that trade-off is fine by me.
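As a rough back-of-the-envelope illustration (the ~35 MiB/s rate is just what the benchmark at the bottom of this README implies, and the 1/1024 sampling ratio comes from the made-up numbers in the sketch above; neither is a measurement of digestif itself):

    file_size  = 5 * 1024**3            # a 5 GiB file
    rate       = 35 * 1024**2           # ~35 MiB/s, implied by "5 gigs in ~2.4 minutes"
    full_read  = file_size / rate.to_f  # ~146 s to stream every byte
    sampled    = full_read / 1024       # ~0.14 s of reading, ignoring seek latency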
gem install digestif
Just like md5 on the command line, but it only works on files, not on
streaming data (can’t seek a stream).
digestif some_large_file
Since this program is designed specifically around seeking within large files,
it didn’t make sense for me to invest in making streams work.
For a detailed look at the options, see
digestif --help
I wrote digestif to solve a problem for a media catalogue I was working on.
I wanted a filename-independent way to tell whether a file was already in
the catalogue, but the files were so large that streaming the whole file
off of the hard drive was too slow for the response time I was hoping for.
(For the curious: I was getting 5 gigs hashed with md5 in about 2.4
minutes.)
Copyright 2011 Andrew Roberts