Bit Rot: A Reminder To Check Your Files...

Up until last week I thought I had my files well in hand. All my work stuff is managed with a version control system, with decentralised repositories mirrored automatically onto four backup disks (one internal, two external and one NAS). All my important non-work files (mostly family pictures and videos) are kept on one archive disk, which I keep synced with an external backup disk and two separate NAS drives. My not so important files are kept on two mirrored external drives. I log the S.M.A.R.T. status of all my disks, regularly check filesystems for errors, and keep everything spread around so that no single hardware failure can wipe out all copies of my data. I used to sleep soundly, knowing that my rsync backup scripts were keeping me safe…

And then I had to dig out some old files from the archive. I tried to extract them, but they just spat CRC errors in my face. No bother, I thought – I’ll grab one of the backup copies…

They were all the same. Every copy was broken. In a sick kind of panic I started going through everything. Out of ~3TB of data, almost 500GB was knackered. I curled up on the floor under a table.

Once the sobbing had subsided, I realised what had happened.

Bit rot.

I always knew that such a thing was possible, but had considered it a theoretical threat. The data on a hard drive are just microscopic patches of magnetisation on a film of ferromagnetic grains; it stands to reason that magnetic fields will degrade over time, and that’s before we consider the impact of cosmic rays flipping bits in memory, or the manufacturers’ own quoted ‘raw read error rates’ of ~1 in every 10^16 bits… but surely this could have no significant effect in practice? I mean, modern hard drives have all kinds of in-built error handling and correction – if something goes wrong during a read, the disk should be able to rebuild data using error correction codes, and rewrite it to another sector, and give us a warning via S.M.A.R.T.
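
(Incidentally, if you want to see what your drives are reporting, the smartmontools package will show you; ‘/dev/sda’ below is just an example device:)

# Overall health verdict, then the full S.M.A.R.T. attribute table
$ sudo smartctl -H /dev/sda
$ sudo smartctl -A /dev/sda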

Well now I know different. Bit rot is very real indeed, and hard drives cannot be trusted. No matter what precautions you take, your data will at some point experience silent corruption. I found a paper which references a study performed at CERN: apparently they wrote 9.7 x 10^16 bytes (97 petabytes) to state-of-the-art storage and after 6 months found that 1.28 x 10^8 bytes were permanently corrupted, with no apparent explanation or reported errors. That gives a figure of one bad byte for every 758MB written. Perhaps that doesn’t sound like much, but consider: if you fill a 3TB drive, within half a year you’ll probably see ~28,700 individual knackered bits. And it only takes one flipped bit to ruin a JPEG image, or a Zip file, or any one of a hundred other types of compressed data.
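
(If you want to check that arithmetic, bc will oblige; the only inputs here are the two figures from the study:)

# One corrupt byte for every N bytes written: 9.7 x 10^16 divided by 1.28 x 10^8
$ echo "(9.7 * 10^16) / (1.28 * 10^8)" | bc
757812500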

And this is what happened to me. A few bad bits, and – POOF – that’s half my RAR files gone. A few more bits, and – POOF – that’s all my tarballs gone. And best of all: I had been backing up corrupt data. I had multiple nicely maintained copies of utter junk.

I was not best pleased.

Fortunately, my work was okay (any respectable version control system detects corruption automatically) and I managed to piece together one good set of files from the various damaged copies of the family album. But everything else was a crapshoot.
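
(Git, for example, lets you force an explicit integrity check whenever you like:)

# Verify every object in the repository
$ cd /path/to/work/repo
$ git fsck --full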

In hindsight, I realise it was my own fault. I broke the cardinal rule: Verify your backups. But who honestly does that? It takes so long. It’s so tedious. I bet 99% of people never give their backups a second thought… until the worst finally and inevitably happens…

Oh well. Such is life. At least I learned a lesson I won’t soon forget, and maybe my disaster will encourage someone else to take precautions with their storage. To that end, here are some tips and things that I picked up during the ordeal:

(NB: this is all from a Linux perspective; if you use Windows or some such, you’ll have to look elsewhere)

1. Checking what you can when you don’t have a checksum…

So say you’re like me: you have a disk full of files, you have a backup, and that’s as far as it goes. You assume that everything is good. But how do you know? Had you generated checksums, or used a “next-generation filesystem”, you could tell straight away. But you didn’t. And now you’ve found a picture that doesn’t look right, or an archive that won’t open. Perhaps bit rot is destroying all the things even as we speak…

Before you do anything, you need to verify what you have – and in truth, you’re probably stuffed. You have thousands, maybe tens of thousands of files, each of which you’ll have to examine with the Mark I Eyeball… but there are a handful of file types that you can test automatically without any prior planning: compressed archives, pictures, videos, MP3s, etc. Here is a list of commands you can enter in a terminal to check these common formats:

# ZIP archives (*.zip)
$ unzip -t <file_name>

# RAR archives (*.rar)
$ unrar t <file_name>

# 7-Zip archives (*.7z)
$ 7za t <file_name>

# Tarballs (*.tar.gz, *.tgz)
$ tar -tzf <file_name>

# JPEG images (*.jpeg, *.jpg)
$ jpeginfo -c <file_name>

# PNG images (*.png)
$ pngcheck -q <file_name>

# TIFF images (*.tif, *.tiff)
$ tiffinfo -D <file_name> 2>&1 | grep -i -e error -e warning -e bad

# PDF documents (*.pdf)
$ pdftotext <file_name> /dev/null 2>&1 | grep -i -e error -e warning

# Video files (*.mkv, *.mp4, *.avi, etc.)
$ ffmpeg -v error -i <file_name> -f null /dev/null 2>&1 | grep -i -e error -e warn -e invalid

# MP3 files (*.mp3)
$ mp3val -si <file_name> 2>&1 | grep -i -e error -e warning

Most of these commands should be installed by default or be available from the standard repositories of your Linux distribution. I think the only ones you might not find immediately are the more specialised checkers (jpeginfo, pngcheck and mp3val), but they should only be a quick package-manager search away.
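
For example, on a Debian-flavoured system the missing tools can be pulled in with something like the following (package names may differ slightly on other distributions):

$ sudo apt-get install jpeginfo pngcheck mp3val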

To run these tests on a whole disk’s worth of files I cobbled together a little script, which you can get here: NoChecksumScan.bash

Just download the script and make it executable:

$ chmod a+x NoChecksumScan.bash

Then run it by entering:

$ ./path/to/NoChecksumScan.bash <directory_path>

It will scan recursively through everything in ‘<directory_path>’, logging any corrupt files to screen and disk (log files are recorded as ‘<directory_path>/scan_log_<timestamp>.log’). If it doesn’t know how to handle a particular file, it will log it as ‘skipped’. (Feel free to add checks for additional file types…)
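
If the download link ever dies, here is the general shape of the thing; this is a stripped-down sketch rather than the script itself, so treat the file types covered and the exact log format as illustrative only:

#!/bin/bash
# Sketch of a 'no checksum' scan: test the file types we know how to test,
# and log anything that fails or can't be tested.

if [ -z "$1" ] || [ ! -d "$1" ]
then
    echo "Usage: $0 <directory_path>"
    exit 1
fi

DIR="${1%/}"
LOG="${DIR}/scan_log_$(date +%Y%m%d_%H%M%S).log"

check_file() {
    # Returns 0 = passed, 1 = failed its test, 2 = no test available
    local f="$1"
    case "${f,,}" in
        *.zip)          unzip -t "$f"    > /dev/null 2>&1 || return 1 ;;
        *.rar)          unrar t "$f"     > /dev/null 2>&1 || return 1 ;;
        *.7z)           7za t "$f"       > /dev/null 2>&1 || return 1 ;;
        *.tar.gz|*.tgz) tar -tzf "$f"    > /dev/null 2>&1 || return 1 ;;
        *.png)          pngcheck -q "$f" > /dev/null 2>&1 || return 1 ;;
        *.jpg|*.jpeg)   jpeginfo -c "$f" 2>&1 | grep -q -i -e error -e warning && return 1 ;;
        *)              return 2 ;;
    esac
    return 0
}

find "$DIR" -type f -print0 | while IFS= read -r -d '' FILE
do
    check_file "$FILE"
    case $? in
        0) ;;                                        # passed its test
        2) echo "SKIPPED: $FILE" | tee -a "$LOG" ;;  # type we can't test
        *) echo "CORRUPT: $FILE" | tee -a "$LOG" ;;  # test reported a problem
    esac
done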

Note that some tests can produce false positives – or more correctly, a file that is nominally corrupt may still be usable. For example, a few bad bytes in a video might do no more harm than mess up a single frame; a few bad bytes in an MP3 file might just add one tiny click to part of the audio… But hopefully this script will pare down your data to a manageable list of ‘skipped’ and ‘corrupt’ files which you can then check by hand. If you do find damaged files in your main archive, then I suggest you run the script again on your backup and cross-reference the logs. Perhaps you’ll get lucky, and the rot will have affected only one of your copies, and you can restore the good data; if not, then you’re stuffed… but at least you’ll know what you’ve lost…

2. Monitoring file integrity

So let’s assume that you’ve got your files in order. You’ve laboriously checked your archives, and now you have a disk full of stuff that you know is good. But it could go bad at any second! You need to be able to test easily and automatically whether this has happened, so you can quickly restore things from your backup. (And likewise, you need to be able to easily and automatically verify your backup…)

You could use a “next-generation filesystem”, such as ZFS. This is an excellent solution for many people: the filesystem itself will generate checksums for everything written, detect any unwanted changes and even provide automatic ‘healing’. I would probably have chosen this route myself, except that it would have meant reformatting my drives and shuffling terabytes of existing data around just to change how they are stored.
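
(For reference, asking ZFS to verify and heal a pool is a one-liner; the pool name ‘tank’ below is just the traditional placeholder:)

# Start a full checksum verification ('scrub') of the pool, then inspect the result
$ sudo zpool scrub tank
$ sudo zpool status -v tank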

To keep things simple I decided not to mess about with how my files are stored, but to generate my own checksums and use these to verify integrity. This can be done very easily from the command line. I started hacking up a script to build a database for all my stuff… but then I found that someone else had done it already, and far better.
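
(For the record, the bare-bones approach really is just a couple of commands; the paths below are placeholders:)

# Build a checksum database for everything in the archive...
$ cd /path/to/main/archive
$ find . -type f -print0 | xargs -0 sha256sum > ~/archive.sha256

# ...and later, check that nothing has silently changed
$ cd /path/to/main/archive
$ sha256sum --quiet -c ~/archive.sha256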

So to keep tabs on your data I recommend a little tool called ‘bitrot’, available on GitHub. The home page gives no installation instructions, but all you have to do is type the following in a terminal:

$ git clone https://github.com/ambv/bitrot.git
$ cd bitrot
$ sudo python setup.py install

(This requires that you have both git and Python installed, but these should be readily available via your package manager)

Now you can just navigate to the top-level directory containing your files, and run the command ‘bitrot’:

$ cd /path/to/main/archive
$ bitrot

This may take a long time, as it has to recursively read every byte stored in the directory… but when it’s done, all files will be indexed in the database ‘.bitrot.db’ (located within the top-level directory). To check that your data are intact, just run the command again; it will scan the index, refreshing the database with any items that have been edited, added or removed intentionally, but warning of any files that have been corrupted by bit rot. (It tells the difference by checking modification times: a file whose contents have changed but whose timestamp hasn’t is assumed to have rotted.)

Thus you can always tell at a glance whether your files are how you left them. This only helps, of course, if you have a good backup from which to restore broken things, but running ‘bitrot’ before and after each backup takes care of that too. To perform a standard backup (e.g. using rsync, which everyone should be using anyway), you would merely type something like the following:

$ cd /path/to/main/archive
$ bitrot

[Pause to check that bitrot gives no errors; if all is good, then proceed...]

$ rsync --progress -a -v -H --delete-after --no-inc-recursive /path/to/main/archive/ /path/to/backup/archive/
$ cd /path/to/backup/archive
$ bitrot

[Pause to check that bitrot gives no errors; if all is good, then backup is a bit-for-bit clone of the original]

In between backups, just run ‘bitrot’ periodically on both the original and the backup copies (I’d suggest once a week, or as often as you can bear). All doubt is then gone: you’ll know whether your files are wholesome, and you won’t overwrite your backups with corrupted data. Lovely.
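
(If you’d rather not rely on remembering, a weekly cron job can run the checks for you; add something like this via ‘crontab -e’, with the paths adjusted to taste and the full path to ‘bitrot’ if cron can’t find it:)

# Scan the main archive and the backup every Sunday morning
# (any output, including bitrot's warnings, is mailed to you by cron if mail is set up)
0  3 * * 0   cd /path/to/main/archive   && bitrot
30 4 * * 0   cd /path/to/backup/archive && bitrot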

3. Preparing for the worst: adding redundant data to your files

If you have known good copies of your data on at least two disks, there is a negligible risk that bit rot will destroy every instance of a particular file. There is, however, a non-negligible risk that one of your drives will suffer hardware failure, or be wiped by a software bug or human error. If you find a bad file in your main archive, it is sod’s law that your backup disk will die before you can restore the unafflicted version…

Okay, I realise this is an unlikely occurrence… but if you have files that are especially precious, then why take chances? You should store them with additional parity data, which allows corrupt or missing information to be regenerated (up to a point).

One of the most common and best methods for adding redundant data to your archives is to use parchive: you just tell it what files to protect, specify the amount of damage you want to be able to absorb, and it’ll vomit out a whole bunch of parity files. There’s a nice write-up about it in this blog post, if you want to know more.
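
(The basic usage is pleasantly simple; a quick sketch with made-up file names:)

# Create parity files with 5% redundancy for 'my_file'
$ par2 create -r5 my_file.par2 my_file

# Later: verify the file, and repair it if necessary
$ par2 verify my_file.par2
$ par2 repair my_file.par2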

My problem with parchive is that it makes such an ungodly mess. You end up with hundreds of .par2 files everywhere, cluttering up everything. It just doesn’t seem manageable.

I much prefer to keep things clean, and the simplest way that I have found to bundle files and parity data together is to use the Linux command-line version of RAR (from the makers of WinRAR). Yes, I know that ‘rar’ itself is shareware, and not ‘free’… but it’s widely available in the standard repositories for most distributions (in openSUSE you can install it from Packman), millions of people use it every day, and no one seems to care about the ‘limited trial period’ (which in practice lasts forever). So that’s good enough for me.

To protect your files in a RAR archive, just use the ‘-rr’ (recovery record) option. An example: to store ‘my_file’, you would type the following in a terminal:

$ rar a -rr5p "my_file.rar" "my_file"

The ‘5p’ after the ‘-rr’ option means a recovery record of 5%. Thus you could damage up to 5% of the output ‘my_file.rar’ archive, and still extract the original ‘my_file’ successfully. To store a directory, don’t forget the ‘-r’ recursive option…

$ rar a -r -rr5p "my_directory.rar" "my_directory"

I have used a hex editor to twiddle bits and simulate bit rot on a large number of archives generated in this fashion, and all have been recoverable. (I wish I had known about this before my data loss…)

As noted above, to test a RAR file (e.g. ‘my_archive.rar’) for corruption you would enter the following:

$ unrar t "my_archive.rar"

UNRAR 5.00 freeware      Copyright (c) 1993-2013 Alexander Roshal


Testing archive my_archive.rar

Testing     pic/PICT4514.JPG                                           7%
pic/PICT4514.JPG     - checksum error
Testing     pic/PICT4525.JPG                                          23%
pic/PICT4525.JPG     - checksum error
Testing     pic/PICT4522.JPG                                          OK 
Testing     pic/PICT4510.JPG                                          OK 
Testing     pic/PICT4517.JPG                                          OK 
Testing     pic/PICT4515.JPG                                          OK 
Testing     pic/PICT4539.JPG                                          OK 
Testing     pic/PICT4511.JPG                                          OK 
Testing     pic/PICT4523.JPG                                          OK 
Testing     pic/PICT4513.JPG                                          OK 
Total errors: 2

Oh noes! It’s broken! But since we have a recovery record, we can repair it:

$ rar r "my_archive.rar"

RAR 5.30   Copyright (c) 1993-2015 Alexander Roshal   18 Nov 2015
Trial version             Type RAR -? for help

Building fixed.my_archive.rar
Scanning...
Data recovery record found
Repairing 100%
Sector 200 (offsets 19000...19200) damaged - data recovered
Sector 302 (offsets 25C00...25E00) damaged - data recovered
Sector 898 (offsets 70400...70600) damaged - data recovered
Done

This will generate a copy of the archive with an added ‘fixed.’ prefix, which can then be extracted normally:

$ unrar x fixed.my_archive.rar

UNRAR 5.00 freeware      Copyright (c) 1993-2013 Alexander Roshal


Extracting from fixed.my_archive.rar

Creating    pic                                                       OK
Extracting  pic/PICT4514.JPG                                          OK
Extracting  pic/PICT4525.JPG                                          OK
Extracting  pic/PICT4522.JPG                                          OK
Extracting  pic/PICT4510.JPG                                          OK
Extracting  pic/PICT4517.JPG                                          OK
Extracting  pic/PICT4515.JPG                                          OK
Extracting  pic/PICT4539.JPG                                          OK
Extracting  pic/PICT4511.JPG                                          OK
Extracting  pic/PICT4523.JPG                                          OK
Extracting  pic/PICT4513.JPG                                          OK
All OK

Phew!

I believe that a 5% recovery record is more than ample defence against bit rot, given that only a few bytes are likely to be affected in a bit rotted file (assuming time-scales of a few years, and otherwise healthy storage media). I know it’s not practical to archive everything; if you have a music or video collection that you play on a regular basis, then keeping it in a RAR file will cause nothing but frustration… but for important documents or irreplaceable stuff I think it’s worthwhile.

During the process of sorting and protecting my own documents, I grew bored of typing the whole ‘rar’ command and waiting for it to crunch data. So I wrote a tiny ‘quick rar’ function that you can paste in your ‘~/.bashrc’ file; pass it the name of one file or directory, and it will make an appropriately named RAR archive with 5% recovery and minimal (i.e. fastest) compression:

# Function to quickly RAR one file or directory (5% recovery record)
qrar() {
    if [ -z "$1" ]
    then
        echo "Usage: qrar <file or directory name>"
        return 1
    else
        local RAR_FILE=""
        if [[ -f "$1" ]]
        then
            # Input is a file -> remove extension
            RAR_FILE="${1%.*}.rar"
        elif [[ -d "$1" ]]
        then
            # Input is a directory -> remove trailing slash
            RAR_FILE="${1%/}.rar"
        else
            echo "Error: \"$1\" does not exist - exiting..."
            return 2
        fi
        rar a -r -rr5p -m1 "$RAR_FILE" "$1"
    fi
}

With this in place, you can just type:

$ qrar "my_directory/"

…and the archive ‘my_directory.rar’ will be produced.

Oh, and one last thing: during my scavenger hunt through the ruins of my bit rotted backups, I learned that the venerable old tarball archive format (so loved by all us Linux users) is not at all resilient to damage. If you mangle a few bits in the middle of a tarball then everything past that point is probably gone for good. Heck, I even tried flipping one random bit in a 1.8GB tarball of my email folder, and it blew up: I couldn’t extract a single file. This is disappointing, as tar is often essential for preserving permissions and links when making backups…

But all is not lost! You can still use tar to make an archive, but then compress and protect it with ‘rar’. Here’s a simple one-liner that’ll do the job:

$ tar -cf - <file_list> | rar a -rr5p -si <archive_name>.tar.rar

…and to extract an archive produced in this manner, you can type:

$ unrar p -inul <archive_name>.tar.rar | tar -xf -

So there you have it: with backups, integrity checks and redundancy, you should be able to avoid the kind of crippling data loss that I suffered. Keep safe, and don’t let the bit rot bite.