Btrfs has long been seen as a modern and powerful filesystem choice, especially for those who value data integrity, snapshot capabilities, and efficient storage management. That power, however, comes with the complexity of handling the errors it surfaces when data integrity is genuinely challenged. Recently, I encountered an issue where Btrfs reported a checksum mismatch, rendering some of my important files inaccessible. This led to a deep dive into snapshot verification, comparison, and eventual restoration, all without needing to reformat the drive or lose data unnecessarily.
TL;DR
I encountered a Btrfs checksum mismatch error that made certain files unreadable. Using pre-configured read-only snapshots, I was able to identify the last intact state of the files and verify their integrity. The recovery was done using a simple file copy from a known good snapshot. The key lesson: automated, regular snapshots combined with awareness of Btrfs’s tooling can be a lifesaver.
What Happened: The Btrfs Checksum Mismatch
It all started when I tried to access a project archive stored on a Btrfs-formatted volume. A seemingly normal copy operation failed with an Input/Output error. Running btrfs check did not report any structural filesystem issues, which left me puzzled. But further inspection with dmesg revealed checksum errors associated with specific file blocks:
btrfs_dev_stat_print_on_error: 97 callbacks suppressed
BTRFS warning (device sda1): checksum error at logical 95874560 on dev /dev/sda1, sector 187180, root 256, inode 5829, offset 0, length 4096, links 1 (path: /data/archive.tar.gz)
This checksum error indicated that the filesystem recognized inconsistency between stored data and its checksum metadata. Since Btrfs uses checksums specifically to detect silent corruption, it was now actively protecting me — but also preventing access to corrupted files.
Initial Analysis
Before panicking, I reviewed the following:
- No recent hardware issues were reported
- SMART data for the drive was clean
- Only a few files generated checksum errors — isolated, not widespread
This suggested the corruption could have come from a brief memory error (pre-checksum calculation), a write interruption, or an overlooked storage quirk. Regardless of cause, I needed a reliable recovery approach.
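For reference, the hardware and counter checks were nothing exotic; a rough sketch (device names are from my setup):

```bash
# SMART health summary for the underlying drive (requires smartmontools).
smartctl -a /dev/sda

# Btrfs's own per-device error counters: read/write/flush I/O errors plus
# corruption_errs, which increments whenever a checksum verification fails.
btrfs device stats /dev/sda1
```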
Step 1: Identifying the Corrupted Files
By tailing dmesg and cross-checking with my metadata database, I built a list of file paths referenced in Btrfs error messages. These were flagged for recovery investigation. Notably, the files weren’t missing — they were simply inaccessible due to failing integrity checks.
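Most of that list came from a rough one-liner along these lines, assuming GNU grep with -P support and kernel messages that include a path field as in the warning above:

```bash
# Pull the unique file paths out of the Btrfs checksum warnings.
dmesg | grep -i 'BTRFS warning' | grep -oP '\(path: \K[^)]+' | sort -u
```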
Step 2: Evaluating Snapshot Availability
Luckily, I had a scheduled process in place to create hourly, daily, and weekly snapshots using btrfs subvolume snapshot. This low-overhead feature of Btrfs had taken consistent, read-only snapshots of the /data subvolume.
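The scheduled job itself was nothing more elaborate than a timestamped, read-only snapshot; a minimal sketch of the hourly variant, with names and paths mirroring my layout:

```bash
# Create a read-only (-r) snapshot of the /data subvolume, named after
# the current hour, e.g. hourly.2024-03-10-09:00.
btrfs subvolume snapshot -r /data "/data/.snapshots/hourly.$(date +%Y-%m-%d-%H:%M)"
```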
In my case, the snapshot hierarchy looked like this:
/data/.snapshots
├── hourly.2024-03-10-09:00
├── hourly.2024-03-10-10:00
├── daily.2024-03-09
└── weekly.2024-03-03
The goal now was to track down the last intact snapshot of the corrupted files, compare them, and restore if necessary.
Step 3: Verifying Snapshotted Files
I began with the most recent hourly snapshot and attempted to access archive.tar.gz. To my relief, the file opened successfully and passed checksum verification using sha256sum. That specific snapshot preserved the pre-corruption state.
To ensure consistency, I went further and checked the file’s content in a couple of older snapshots as well. They too were intact, confirming the corruption occurred very recently, possibly caused by a system freeze or power issue during the most recent writes.
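The verification itself was just hashing the copy inside each snapshot; a quick sketch using the snapshot names from my layout:

```bash
# Hash the file inside each snapshot. Reading it forces Btrfs to verify
# its internal checksums, and matching sha256 sums across snapshots
# confirm the copies are identical and intact.
for snap in hourly.2024-03-10-10:00 hourly.2024-03-10-09:00 daily.2024-03-09; do
    sha256sum "/data/.snapshots/$snap/archive.tar.gz"
done
```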
How I Restored the File
Here’s the careful, verified process I used to restore the file — avoiding any command that might trigger another checksum error or write to a potentially compromised drive region:
- Mounted the snapshot subvolume read-only (an optional safety step)
- Copied the file using cp --reflink=never to prevent deduplication or CoW side effects
- Manually re-checked the file checksum post-copy (see the check after the copy command below)
The command I used looked like:
cp --reflink=never /data/.snapshots/hourly.2024-03-10-09:00/archive.tar.gz /data/archive_restored.tar.gz
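The post-copy re-check was a straightforward hash comparison between the snapshot source and the restored copy:

```bash
# Both sums should be identical, confirming the restored file is
# byte-for-byte the same as the snapshot copy it came from.
sha256sum /data/.snapshots/hourly.2024-03-10-09:00/archive.tar.gz \
          /data/archive_restored.tar.gz
```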
I also appended a note in the file metadata describing the origin snapshot and recovery time using setfattr:
setfattr -n user.recovery_source -v "snapshot_hourly.2024-03-10-09:00" /data/archive_restored.tar.gz
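The note can be read back later with getfattr:

```bash
# Print the custom attribute recorded on the restored file.
getfattr -n user.recovery_source /data/archive_restored.tar.gz
```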
Lessons Learned
This situation, while tense, offered several valuable reminders:
- Checksums do their job — Btrfs was not the problem; it was merely the messenger.
- Snapshots are vital — without them, I might have needed to restore from an older backup or accept some data loss.
- Prevention matters — having ECC RAM, a stable power supply, and good-quality SSD/hard disk matters immensely where data integrity is concerned.
Btrfs Tools That Helped
Throughout this investigation and recovery, a few Btrfs utilities proved especially useful:
- btrfs check --readonly /dev/sda1 — Verifies structure without modifying the filesystem
- btrfs subvolume list /data — Helped enumerate snapshots
- btrfs subvolume find-new — Pointed me to recent changes since the last snapshot
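Roughly how the subvolume commands fit together (the generation number below is a placeholder; find-new expects the generation recorded when the previous snapshot was taken):

```bash
# Enumerate subvolumes and snapshots on the filesystem containing /data.
btrfs subvolume list /data

# List files changed in /data since generation 12345 (placeholder value).
btrfs subvolume find-new /data 12345
```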
Additional Verification Techniques
Even after recovery, I didn’t stop there. I performed extra verification by mounting the snapshots with bind mounts and comparing them against the live data with rsync in dry-run mode, adding --checksum so that file contents are actually read (which gives Btrfs the chance to flag any further checksum errors). This helped detect whether any other files were silently corrupted:
rsync -avnc /data/.snapshots/hourly.2024-03-10-09:00/ /data/
Luckily, no additional mismatches were found, but this gave further peace of mind. If more widespread issues had been uncovered, I would have initiated a more complex recovery plan from earlier full-disk backups.
Final Thoughts
Experiencing a Btrfs checksum mismatch that locks down access to files can be unsettling — but it’s better than suffering from silent, undetected corruption. Filesystems like ext4 or XFS might have simply served damaged data without a warning. With Btrfs, the cost is stricter read access, but the benefit is trustworthy detection.
I see this event not as a failure of Btrfs, but as a success story in a well-engineered system doing exactly what it promised. Data transparency, snapshots, and clear tooling allowed safe, direct file restoration without downtime or fear. I strongly encourage anyone using Btrfs to schedule subvolume snapshots and test their recovery flow before incidents occur.
In summary, preparedness and the right tools make integrity-focused filesystems not only usable — but empowering.

