Checksums on Transfers

howto Add comments

I used to send people checksums for downloads; recently, we seem to need to do this as a company.

Whenever a file is uploaded at the EU FTP server that I manage, the server sends me a sanity-check email something like:

size: 2628153754
user: abc_cheese
md5sum:      23fe22d1afad721740c0178b6ab842b0  /home/abccheese/
BSD sum -r:  38354 2566557
SYSV sum -s: 43793 5133113 /home/abccheese/

Processing archive: /home/abccheese/

Error: Can not open file as archive

When I was having to double-check uploads, I found that it was easier if the server itself told me when the upload was finished, and better, told me also if the file is OK. It doesn’t necessarily say the file is complete, simply that if it’s certain files, “does it look sane?” Zip files and 7-zips should pass a zip-defined sanity check, for example.

On an old FTP server, I have enabled the “SITE” command to allow checksums — which I later had to optimize to return a pre-calculated checksum (using “make” logic to update on changed files) in order to avoid a DoS on the server when the checksum took too long to generate. The intent was to allow a random user to calculate the checksum of a file to ensure that transfer was successful to reduce the possible errors when “something didn’t work out”… like an XML schema, it confirms that “the FTP server made specific delivery: the obligation of providing a file accurately was performed” just as a schema splits the “where did the error occur?” question in half.

In the “SantaSack” project, every file was a checksum. Yes, I had to do collision-avoidance in an MD5 signature storage. I joined the “mysqlfs”project as a replacement of SantaSack — with the intent of developing a layer that pre-calculates MD5s and SHAs asynchronously on change, storing them as file attributes for later query. I’m still considering that for MDS on OSX.

My company is looking into checksums on transferred files, now; it seems self-gratifying in an arrogant sense to see them crossing ground I’ve been over, but I regret that I’m not better-prepared.

MD5 checksum is everywhere: UNIX (md5sum {file}), BSD derivative of UNIX (md5 {file}), Windows, and cross-platform in Java: (martin: 2010-07-28_13:49:49):

        static String getMd5Digest(String pInput)
                MessageDigest lDigest = MessageDigest.getInstance("MD5");
	        BigInteger lHashInt = new BigInteger(1, lDigest.digest());
                return String.format("%1$032X", lHashInt);
            catch (NoSuchAlgorithmException lException)
                throw new RuntimeException(lException);

All of MD5 routines (both opensource and the RSA version) are a case of starting with a basic signature, updating it with a variable-length buffer of entropy, and reporting the resulting value. This can be done on a buffer as it’s used for other things, which I’ve done: the ftpkgadd tool, which was a pkgadd that worked from FTP URLs rather than filenames (connecting the socket for the GET to the inbound stream of the pkgadd decompressor) — this could be done similarly in a layer such as a compression that also MD5Update()s a buffer when compressed, or when written to the output stream. In this way, the checksum is ready when the archive is, at little additional cost.

MD5 is fairly ubiquitous, but sadly I don’t have much of this implemented anymore, save the upload-sanity-check on the FTP server.

Leave a Reply

WP Theme & Icons by N.Design Studio
Entries RSS Comments RSS Log in