Chasing Ghosts in App Crash Reporting and Monitoring
Firebase Crashlytics not syncing traces in cold starts
Welcome to the war zone of early app launches. I had a cold-start crash drought for weeks in Firebase Crashlytics. Everything looked healthy — until someone on mobile support sent me a video of the app flash-crashing before the splash screen even loaded. No entry in Crashlytics, no breadcrumbs, not even a signal bounce. It’s like the crash happened in a different dimension.
The issue turned out to be a combo of two things:
- Crashlytics relies on a native crash handler that doesn’t always wake up fast enough during raw cold starts, especially on older Androids.
- If you’re depending on delayed or asynchronous initialization, like lazy loading your Firebase init call or using background executors, the crash can beat Crashlytics to the punch.
Solution was gross but effective: force-init Crashlytics in a base Application class, right before anything else — even before hooking into any UI thread logic. I stuffed this into an ugly static context block, didn’t feel great about it, but sure enough: boom, cold crash reports started flowing back in.
import android.app.Application
import com.google.firebase.FirebaseApp
import com.google.firebase.crashlytics.FirebaseCrashlytics

class MyApp : Application() {
    override fun onCreate() {
        // Initialize Firebase + Crashlytics before anything that might throw
        FirebaseApp.initializeApp(this)
        FirebaseCrashlytics.getInstance().setCrashlyticsCollectionEnabled(true)
        super.onCreate()
    }
}
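As a sanity check that reports are actually making it out, Crashlytics exposes didCrashOnPreviousExecution(), which you can poll on the next launch. A minimal sketch (the helper name and log tag are mine, the API itself is real):
import android.util.Log
import com.google.firebase.crashlytics.FirebaseCrashlytics

// On the launch after a crash, confirm Crashlytics actually captured it.
// Call this early, e.g. from MyApp.onCreate() right after the init above.
fun checkPreviousCrash() {
    if (FirebaseCrashlytics.getInstance().didCrashOnPreviousExecution()) {
        Log.w("CrashCheck", "Previous run ended in a crash; a report should be queued for upload")
    }
}
It won't tell you anything about crashes Crashlytics never saw, but it does tell you the handler was alive when the last one happened.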
Delayed ANR traces in Play Console vs reality
Here’s a fun one: users tell you “your app’s freezing every time I log in.” But Play Console flags nothing for days. Then suddenly, boop — five ANRs appear, retroactively timestamped to three days ago. No pattern, no alert, no logs matching the timeline.
This happens because ANR traces aren’t flushed until the system timeout blows. If the user force-stops the app, switches context, or their Android kernel is a spicy aftermarket build (Xiaomi users, looking at you), you might not get the trace at all. Or you do get it, but with two major issues:
- Play Console pages update in batches — daily or worse depending on usage tier.
- Reported timestamps lose fidelity — so what you see in the timeline might not be when the actual freeze occurred.
I started adding my own debug-signal ANR logging with a watchdog thread:
import android.os.Handler
import android.os.HandlerThread

val watchdogThread = HandlerThread("watchdog")
watchdogThread.start()
Handler(watchdogThread.looper).postDelayed({
    // mainThreadIsBlockedLongEnough() and logCustomEvent() are app-specific helpers,
    // e.g. checking a main-thread heartbeat and writing a custom analytics event
    if (mainThreadIsBlockedLongEnough()) {
        logCustomEvent("possible_anr", System.currentTimeMillis())
    }
}, 5000)
Sounds hacky, but it gave me actual timestamps to correlate with user complaints instead of retroactive mysteries. Not perfect — this won’t tell you why the block happened — but it’s a breadcrumb that actually lands close to the truth.
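If you’re wondering what sits behind a check like mainThreadIsBlockedLongEnough(), here’s one way I’d sketch it with a main-thread heartbeat; the object and method names are mine, not from any SDK:
import android.os.Handler
import android.os.Looper

// Heartbeat sketch: the main thread bumps a timestamp every second; the watchdog
// treats a stale timestamp as "the main thread has been blocked at least that long".
object MainThreadHeartbeat {
    @Volatile private var lastBeat = System.currentTimeMillis()
    private val mainHandler = Handler(Looper.getMainLooper())

    fun start(intervalMs: Long = 1000L) {
        mainHandler.post(object : Runnable {
            override fun run() {
                lastBeat = System.currentTimeMillis()
                mainHandler.postDelayed(this, intervalMs)
            }
        })
    }

    fun isBlockedLongerThan(thresholdMs: Long): Boolean =
        System.currentTimeMillis() - lastBeat > thresholdMs
}
Call MainThreadHeartbeat.start() once at launch, and the watchdog check becomes MainThreadHeartbeat.isBlockedLongerThan(5000) in place of the placeholder above.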
Sentry and breadcrumbs: invisible until you dig in config
I plumbed Sentry into a Swift app expecting beautiful crash + network + breadcrumb linkage. Instead: empty trails. Crashes were coming in, but the breadcrumb log was blank like a wiped drive. Turned out iOS Sentry needs you to nudge it to activate some built-in integrations.
“enableAutoBreadcrumbs = true” is not on by default in some SDK versions. And yes, that should be illegal.
This was the moment I found a GitHub issue thread referencing an undocumented combo of config toggles:
import Sentry

SentrySDK.start { options in
    options.dsn = "your-dsn-here"
    options.enableNetworkTracking = true
    options.enableAutoBreadcrumbs = true
    options.enableAutoSessionTracking = true
}
After that, breadcrumbs started showing up for taps, route transitions, even low-level NSURLSession activity. Undocumented edge case? Turns out if your iOS app uses a custom URL loading stack or wraps session delegates heavily, Sentry can’t detect network breadcrumbs unless you directly hook them using SentryNetworkTracker.track(). That bit took two coffees and an accidental console log to uncover.
Crash-free rate is a lie if you use JavaScript bridges
If you build with any form of hybrid layer — React Native, Ionic, even custom JS-core bridges — your crash reporting might look amazing while your users are suffering. JavaScript-side crashes often don’t kill the native app process. That means Crashlytics, Sentry, or Bugsnag might not record them unless their platform-specific plugin is injected properly and error surfaces are piped in manually.
React Native’s default global handler catches the JS crash and often logs it only to the console or devtools if not explicitly sent out:
const defaultHandler = global.ErrorUtils.getGlobalHandler();
global.ErrorUtils.setGlobalHandler((error, isFatal) => {
  Sentry.captureException(error); // Not sent anywhere by default; you have to wire this up yourself
  defaultHandler(error, isFatal); // keep React Native's default fatal handling
});
Moral of the story: crash-free rate is only crash-free for the layers you’re measuring. If your JS thread locks up or infinite-loops (looking at you, while(true) spinners), users may relaunch 5 times before you see any hint something’s broken.
Datadog RUM shows performance issues Google misses entirely
Datadog won me over after exactly one incident. I had this Android app where half the France-based users were seeing 10+ second boot times. Firebase showed nothing. Logcat traces were useless. Stack traces were clean.
But once I installed Datadog’s Android Real User Monitoring (RUM) SDK with network monitoring, it spotted the issue within hours: cold boots were blocked on a giant config JSON pull from a Paris-based server zone. One CDN node was bad, but only from specific IP blocks. Aadaf438.fr.cdn was resolving 2 seconds slower than the rest.
Google Play’s diagnostics — even using Vitals — didn’t flag network conditions because everything technically “worked.” RUM metrics told the truth:
- Giant payloads pulled on cold start block visual rendering
- No fallback timeout = UI thread idle
- 30% longer load time for Orange ISP users
I ended up pushing a patch that kicks off config pulls post-render instead of blocking on them. Datadog showed the impact drop within three hours of the rollout. Vitals caught up… maybe five days later.
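The patch itself wasn’t clever. Here’s a minimal sketch of the post-render deferral, assuming a hypothetical fetchRemoteConfig() helper and your own activity layout:
import android.os.Bundle
import androidx.appcompat.app.AppCompatActivity
import kotlin.concurrent.thread

class MainActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)

        // Render first, fetch after: kick off the config pull once the decor view is up
        // instead of blocking launch on it
        window.decorView.post {
            thread {
                fetchRemoteConfig() // run off the main thread; apply the result when it lands
            }
        }
    }

    // Hypothetical stand-in for whatever blocking config call you were making at launch
    private fun fetchRemoteConfig() {
        // ... HTTP call, parse, cache ...
    }
}
The point is the ordering: render first, fetch after, and put a timeout on the fetch so one bad CDN node can’t hold the launch hostage.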
Fastlane crash symbol upload fails silently on CI
If you’re using Fastlane with Crashlytics (or Bugsnag, same problem), and your builds look like they shipped symbols but you’re only getting cryptic memory addresses in stack traces… check the logs. I wasted more than a day on this once because the Fastlane task printed Uploading... then nothing. Turns out:
If you don’t set $GOOGLE_SERVICE_INFO_PLIST or your app doesn’t have dSYM mapping enabled, Crashlytics CLI completes with exit 0 — but doesn’t upload squat.
Add verbose output to your Fastlane step or force a failure on missing files. This snippet saved my sanity:
sh("./gradlew uploadCrashlyticsSymbolFileRelease") do |stdout, _|
unless stdout.include?("Symbol upload complete")
UI.user_error!("Crashlytics symbols upload failed silently.")
end
end
Wish I had known that a week earlier. The call stack traces from that broken release were utterly incomprehensible.
Out-of-memory (OOM) kills don’t generate crash reports
This one surprised me more than it should’ve. Some of our iOS users were reporting app restarts mid-video, randomly and unreproducibly. No crash. No low-memory warnings. Nothing in Crashlytics. No breadcrumbs. It felt like a ghost hitting the kill-switch.
Turns out, iOS will silently OOM-kill your app without triggering crash handlers or any exit observer. This is especially common with AVFoundation + large buffer collections or heavy WKWebView usage. You’ll never know unless you look into OS-level crash logs (Jetsam reports).
There’s no reliable way to detect that you were force-killed — but one legit hack is comparing timestamps on app resume vs prior session termination logs. I use this basic check:
// lastCloseWasGraceful and lastActive are flags the app persists on background/terminate
let now = Date()
if lastCloseWasGraceful == false && now.timeIntervalSince(lastActive) < 5 {
    logOOMEventEstimate()
}
Some monitoring tools like Instabug or Embrace attempt to flag possible OOM events by correlating the last known session state. Just don’t trust automated crash metrics alone to expose these — they’re invisible unless you’re actively digging for them.
Console.log bombs with giant JSON objects are a silent killer
This one’s for anyone on React Native, Cordova, or Electron. I had this nasty performance tailspin on a pre-rendering task, and all it came down to was a single line:
console.log(JSON.stringify(obj, null, 2))
The object in question? A nested user state tree that spanned about 1.6MB serialized. Chromedriver paused. The mobile app went non-responsive. And Sentry? It got the report 40 seconds later, but the full payload exceeded the default max event size, so it dropped the event silently.
Enabling structured console filtering + safe stringifying fixed it, but still — watch your logs. Verbose logging can not only tank performance but also cause crashes that never report the root cause, because the logging itself becomes the failure.
Misconfigured Proguard removes crash classnames in release build
If you’re using Proguard (or R8 now), you might see crashes in your logs but the actual class names are gone — just obfuscated placeholders. Did you forget to preserve your crash reporting classes?
You must explicitly keep your crash handler annotations and Serializable/Parcelable classes if they’re involved in crash paths. The default Proguard template won’t do this. Here’s a minimal but critical setup:
-keep class com.yourapp.crash.** { *; }
-keepattributes *Annotation*
-keepattributes SourceFile,LineNumberTable
-keep public class * extends java.lang.Exception
I once had half a release cycle go by before I saw that every crash listed in Play Console was attributed to com.a.a.a(Unknown Source). Very helpful. Love the mystery.
Also: some dependencies (especially OkHttp variants or custom Retrofit converters) get mangled unless explicitly kept. If your crashes involve HTTP-level stack activity, add them too.
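For reference, these are the kinds of rules I end up adding for the networking layer. Newer OkHttp and Retrofit releases ship their own consumer rules, so only keep what your build actually strips:
-dontwarn okhttp3.**
-dontwarn okio.**
-keepattributes Signature
-keepattributes Exceptions
The Signature attribute matters if your converters rely on generic type info; without it, perfectly normal Retrofit calls show up in crash logs as unrecognizable soup.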