Keeping Business Data Alive When Sync Tools Flake Out
Cloud sync isn’t a backup, and here’s how that nearly wrecked us
There’s this dangerous assumption I see a lot: “We use Google Drive and Dropbox Business, so our stuff is safe.” No, it isn’t. It’s synchronized, not backed up. Sync just replicates your data (and your deletions!) across devices. If someone renames the wrong client folder to “zzz_old_final_but_use”, it propagates everywhere instantly, and now your accounts team is panicking because the latest ledger is MIA.
I had a co-founder once who thought dragging a folder into Box handled everything. He didn’t realize that when a rogue Windows update scrambled file permissions and half the files synced up as 0KB shadows, Box happily reflected that too. Poof.
“Sync is reactive. Backup is historical.”
Most services blur those lines because saying “but this doesn’t protect you from a file someone accidentally mangled six weeks ago” is bad marketing. The real fix is layering proper backup versioning on top of sync, or decoupling them entirely with tools like Arq, Veeam, or even less shiny ones like duplicity. At the very least, point a nightly rsync job at your sync folder, copy it to a local storage mount, and rotate it weekly. Ugly, but it works.
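If you want the ugly-but-works version spelled out, here’s a minimal sketch of that nightly job, assuming rsync is installed and that the sync root and backup mount below (both placeholders) are swapped for your own:

```python
import datetime
import os
import subprocess

SYNC_DIR = "/home/office/sync-drive/"   # placeholder: wherever Drive/Dropbox lands locally
BACKUP_ROOT = "/mnt/local-backup"       # placeholder: local storage mount

def nightly_copy() -> None:
    # One destination folder per weekday = a cheap seven-slot weekly rotation.
    slot = datetime.date.today().strftime("%a").lower()   # "mon", "tue", ...
    dest = os.path.join(BACKUP_ROOT, slot)
    os.makedirs(dest, exist_ok=True)
    # -a preserves metadata, --delete makes the slot an exact mirror of tonight's state.
    subprocess.run(["rsync", "-a", "--delete", SYNC_DIR, dest + "/"], check=True)

if __name__ == "__main__":
    nightly_copy()
```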
Undocumented Google Drive API quota traps
So you plug into the Google Drive API to run automated exports? Cool. Wait until your jobs hit a ghost limit the daily quota docs never mention.
We were pulling invoices from a shared folder across several G Suite accounts. Suddenly, jobs stalled. We weren’t hitting the documented 10,000 requests/day, but we were still being throttled. Turns out download bandwidth has its own internal ceiling, not listed anywhere, and it adds up fast when you’re yanking PDFs.
What fixed it (sort of)? Using a service account for the one-off exports, but spreading the queries across 5 user tokens. I wish I were joking. Also, avoid exporting Google Docs to PDF via the API more than a few hundred times a day; the conversion time spikes hard past a certain threshold.
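The token-spreading part is just round-robin; a sketch assuming you already hold a list of authorized google-auth Credentials objects and have google-api-python-client installed:

```python
from itertools import cycle

from googleapiclient.discovery import build  # pip install google-api-python-client

def build_drive_pool(credentials_list):
    """Build one Drive client per token and rotate through them so no single
    account soaks up all the export/download volume."""
    services = [build("drive", "v3", credentials=creds) for creds in credentials_list]
    return cycle(services)

# usage sketch: grab the next client before each export request
# drive = next(pool)
# request = drive.files().export_media(fileId=doc_id, mimeType="application/pdf")
```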
I eventually found a Stack Overflow post where a guy parsed the X-Goog-Resource-State headers to pre-disqualify docs for export unless they’d been modified post-sync. Saved us maybe 60 requests per batch. Hacky, but it shaved off just enough to keep us below the mystery ceiling.
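If you don’t want to lean on notification headers, the same pre-disqualification can be done with a plain modifiedTime query; a sketch (not the exact hack from that post), assuming google-api-python-client, valid credentials, and a last-sync timestamp you track yourself:

```python
from googleapiclient.discovery import build

def changed_since_last_sync(creds, folder_id: str, last_sync_rfc3339: str):
    """Return only the files in a folder modified after the last sync,
    so unchanged docs never burn an export request."""
    drive = build("drive", "v3", credentials=creds)
    query = (
        f"'{folder_id}' in parents and trashed = false "
        f"and modifiedTime > '{last_sync_rfc3339}'"   # e.g. "2024-01-31T00:00:00Z"
    )
    changed, page_token = [], None
    while True:
        resp = drive.files().list(
            q=query,
            fields="nextPageToken, files(id, name, modifiedTime)",
            pageToken=page_token,
        ).execute()
        changed.extend(resp.get("files", []))
        page_token = resp.get("nextPageToken")
        if not page_token:
            return changed
```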
Versioning paranoia: When the backup *is* the problem
I thought Backblaze B2 versioning was going to be our safety net for a dev media library. Uploaded nightly builds of a simulation rig, stored like 250 GB/month, all good. Except we didn’t realize how aggressive the lifecycle rules get if you toggle the “keep only last version” setting mid-cycle. It nuked six weeks of diff history overnight.
Even better? Their UI doesn’t show deleted version entries unless you filter manually, and the CLI doesn’t return them unless you pass a --showAllVersions flag I completely missed.
Triage tips if your backup versioning turns against you:
- Check your lifecycle JSON config directly—not through their web dashboard.
- Never switch to “last version only” without exporting a manifest of current versions (see the sketch after this list).
- Double-check how your tool defines “unmodified”: some count rename/move events, some don’t.
- If you’re using any rsync-based backup tool, validate that your delete flags aren’t implicitly cascading old versions.
- Push backup logs into a separate logging pipeline or at least email yourself digests—restoring blind is chaos.
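On the manifest point: a minimal sketch using the b2sdk Python package, with the key ID, application key, and bucket name as placeholders, that dumps every stored version to CSV before you go anywhere near the lifecycle settings:

```python
import csv

from b2sdk.v2 import B2Api, InMemoryAccountInfo  # pip install b2sdk

def export_version_manifest(key_id: str, app_key: str, bucket_name: str, out_path: str) -> None:
    """Write file name, version id, upload timestamp, and size for every stored version."""
    api = B2Api(InMemoryAccountInfo())
    api.authorize_account("production", key_id, app_key)
    bucket = api.get_bucket_by_name(bucket_name)

    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file_name", "file_id", "upload_timestamp_ms", "size_bytes"])
        # latest_only=False walks every stored version, not just the visible head.
        for version, _folder in bucket.ls(recursive=True, latest_only=False):
            writer.writerow([version.file_name, version.id_, version.upload_timestamp, version.size])
```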
My current setup: Backblaze B2 hooked via rclone, versioned per-week folders using timestamps, and lifecycle pruning capped at 90 days. Ugly blob mountain, but survivable.
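The per-week folders are nothing fancier than an ISO week stamp in the remote path; a sketch, assuming an rclone remote already configured for B2 (the names here are placeholders):

```python
import datetime
import subprocess

SOURCE = "/data/media-library"          # placeholder: what gets backed up
REMOTE = "b2-backup:company-bucket"     # placeholder: rclone remote pointing at B2

def weekly_sync() -> None:
    week = datetime.date.today().strftime("%G-W%V")   # ISO week, e.g. "2025-W07"
    # Each week lands in its own prefix; lifecycle rules on the bucket handle the 90-day pruning.
    subprocess.run(["rclone", "sync", SOURCE, f"{REMOTE}/{week}", "--fast-list"], check=True)

if __name__ == "__main__":
    weekly_sync()
```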
Offsite matters—but so does bandwidth weirdness
One of our clients used Carbonite Pro. Great interface, nice alerts, but their upstream crawler is weird with network-mounted drives. Any latency spike over 100ms and you’d get “incomplete file” flags that wouldn’t retry.
We sniffed it with Wireshark and it does something odd: it opens an exclusive handle, pauses five seconds, retries a checksum, and if latency crosses a threshold (we saw this on office routers running certain Netgear firmware) it bails instead of queuing. Not documented anywhere, by the way.
Once we switched them to CrashPlan and jammed a Raspberry Pi at the router level as a local cache/proxy, things stabilized. Point is: bandwidth is not just about capacity. Tool behavior around latency and partial locks is make-or-break. Especially with Pro/SMB tiers of these tools—the consumer ones often retry better.
File-level sync failures that never throw errors (OneDrive is the worst)
I maintain exactly two machines for testing one thing: how badly different sync tools handle long filename paths, symlinks, and permissions. And OneDrive wins for being the silent killer.
Several enterprise clients sync proposal folders over OneDrive/SharePoint with deeply nested directories. If the full path crosses the roughly 400-character limit that SharePoint and OneDrive enforce (on top of Windows’ own path-length quirks), OneDrive just… silently skips the folder. It doesn’t even warn in the UI unless you check the hidden Error Log view buried in its Settings panel under Account → Choose folders → Hidden → More info.
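A dumb pre-flight check catches most of this before OneDrive gets a chance to skip anything; a sketch, with the sync root as a placeholder and 400 as the documented service limit:

```python
import os

LIMIT = 400  # documented OneDrive/SharePoint full-path ceiling; adjust if your tenant differs

def find_too_long(root: str):
    """Return every file whose full path length exceeds LIMIT characters."""
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if len(full) > LIMIT:
                offenders.append(full)
    return offenders

if __name__ == "__main__":
    for path in find_too_long(r"C:\Users\me\OneDrive - Contoso"):  # hypothetical sync root
        print(len(path), path)
```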
Bonus: syncing junction points and trying to keep project template structure consistent across multiple teams? OneDrive sometimes syncs the link, sometimes the contents of the pointed directory, depending on whether the source has ever been flagged as “Known Folder Move” inside their internal configuration GUID registry. Don’t ask. Just… don’t try this in production.
iCloud Drive: beautiful on Mac, existential on Windows
So Apple’s iCloud Drive on Mac actually handles cached storage elegantly. You browse to a file, the system shows the metadata it has from the cloud, and only fetches the full bits when they’re actually needed. Great. Now open the same share on a Windows client. Oh boy.
File placeholders look synced but aren’t downloaded until double-clicked. You can’t touch them with scripts reliably because the shell sees a valid local path, but the actual read fails unless Explorer gets involved. Even PowerShell’s Test-Path returns True… until you try a Get-Content and it dies with EOF or pipe errors.
I tried a Python script to diff project folder states across teams. Half the team uses macOS, half uses Windows. We got false positives every time. Final fix? Replacing all access calls with a wrapper that queried file size and existence, then forced a dummy read of the first 512 bytes to ‘claim’ the file from the cloud before diffing. Painful, but you sort of have to do it if you want any automation on iCloud Drive across OSes.
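Roughly what that wrapper looks like (a sketch of the approach, not the exact script; whether the plain read reliably triggers the download depends on the iCloud client cooperating):

```python
import os

def claim_and_stat(path: str):
    """Return (exists, size); reading the first 512 bytes is what nudges a
    placeholder into materialising before the real diff runs."""
    if not os.path.exists(path):
        return False, None
    try:
        with open(path, "rb") as fh:
            fh.read(512)
    except OSError:
        # Placeholder the OS refused to materialise: treat as missing for the diff.
        return False, None
    return True, os.path.getsize(path)
```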
NAS snapshotting only helps if you can actually pull from it under load
So this was dumb: we had a QNAP with RAID5 storing live Adobe project files. Snapshots every 4 hours. Someone corrupted the main working folder during file merges, totally blew out the timeline assets. No biggie, we thought—revert via snapshot.
Except, loading a snapshot mount when the CPU is slammed (QNAPs love to overcommit CPU to background sync) led to hangs. Our snapshot just sat there, nearly accessible, spinning. We SSH’d in and tried to mount it manually from the CLI, but QNAP’s snapshot system is locked behind their proprietary LVM layer, so we couldn’t mount it outside their UI even though we had root access.
Eventually got it to mount 40 minutes later after forcibly killing their thumbnail generator daemon. That’s the kind of thing that never shows up in marketing material—”Requires thumbnailing to be idle for snapshots to load under load.”
Restic deduplication is great until you try to browse trees
I love Restic. It’s efficient, encrypted, and deduplicates like a champ. But if you’ve ever needed to restore a single file from a snapshot tree with 800k files? You’re in for a hell of a time.
The first time I hit this, I was auditing a backup from a dev laptop that had symlinked vendor directories. Restic stored the data okay, but running restic ls snapshotID would take literal minutes, even with the local cache enabled. Turns out traversal is globally sorted before display: no pagination, no indexing. When I finally broke it up by date folders and paged manually via --path credit/tmp, it worked, but boy was that fun at 2am during an incident restore.
Still use it. But now I tag key snapshots with --tag fast-restore and only dump core configs and db exports there. The rest can wait.
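The tagging itself is just restic flags; a sketch where the repo path and source paths are placeholders and RESTIC_PASSWORD is assumed to be set in the environment:

```python
import subprocess

REPO = "/mnt/backup/restic-repo"          # placeholder
CRITICAL = ["/etc", "/srv/db-exports"]    # placeholder: the stuff you want back fast

def backup_critical() -> None:
    # Small, tagged snapshots so a 2am single-file restore never walks an 800k-file tree.
    subprocess.run(
        ["restic", "-r", REPO, "backup", "--tag", "fast-restore", *CRITICAL],
        check=True,
    )

def restore_one(path_to_restore: str, target_dir: str) -> None:
    # Pick the newest fast-restore snapshot and pull just one path out of it.
    subprocess.run(
        ["restic", "-r", REPO, "restore", "latest",
         "--tag", "fast-restore", "--target", target_dir, "--include", path_to_restore],
        check=True,
    )
```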
Duplicati says it’s done backing up—but test restores tell a different story
This is one of those tools where the UI lies by omission. Duplicati has a gorgeous web interface, shows you status like “Backup successful” with nice green bars—but under certain circumstances (read: partial file locks, permission issues, or interrupted mono-runtime errors), it will skip files without failing the job.
There’s a setting called “Skip files with unknown read errors” that defaults to enabled. So if your external disk hiccups, the backup skips the whole subtree, but still reports success. Found this only after I ran a test restore and noticed our ‘HR_Important’ directory came back with folders but no actual PDFs.
Corollary tip: schedule test restores biweekly and don’t trust green checkmarks. And always wrap mono-based backup tools with a watchdog that scans stderr for unhandled exceptions; they don’t always bubble up to the UI.
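A watchdog here doesn’t need to be clever; a sketch that wraps the job, greps stderr for typical mono/.NET failure markers (the command line and markers below are placeholders), and refuses to call the run green on its own:

```python
import subprocess
import sys

# Placeholder command line; point it at however you actually invoke your backup job.
BACKUP_CMD = ["mono", "/opt/duplicati/Duplicati.CommandLine.exe", "backup", "<target-url>", "<source>"]

SUSPICIOUS_MARKERS = ("Unhandled exception", "AccessDenied", "IOException")  # example markers

def run_with_watchdog() -> int:
    proc = subprocess.run(BACKUP_CMD, capture_output=True, text=True)
    flagged = any(marker in proc.stderr for marker in SUSPICIOUS_MARKERS)
    if proc.returncode != 0 or flagged:
        # Swap this print for your email/paging hook; the UI's green bar is not the signal.
        print("BACKUP SUSPECT:\n" + proc.stderr[-2000:], file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(run_with_watchdog())
```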