Inside TrackCrumb's predictive churn model: AUC, cold starts, drift, and survival curves
How TrackCrumb scores user churn nightly with XGBoost — expected accuracy by data depth, cold-start behavior, weekly drift retraining, shadow-model promotion gates, and per-user 30-day survival curves.
TrackCrumb scores every user's churn risk every night with an XGBoost classifier. On a synthetic development dataset, the model reaches a holdout AUC of 0.70–0.82 for a typical early-stage SaaS tenant (500–5,000 training examples) and 0.85–0.93 for a mature tenant with rich event history (more than 50,000 examples). Below 100 examples it does not guess: it returns a neutral churn probability of 0.5 for every user. Drift is checked weekly, and a retrain triggers when PSI > 0.25 or KS p < 0.01 on any feature. A retrained model ships only when it beats the incumbent by at least 0.01 AUC.
That is the whole contract: predict when there is enough signal, abstain when there isn't, retrain when the data shifts, and never promote a model that isn't measurably better. The rest of this post unpacks each part.
Where churn scoring runs
TrackCrumb is self-hostable product analytics, and churn scoring runs as a dedicated Python (FastAPI) ML service alongside the Go ingest path, ClickHouse event store, and Postgres metadata store. Scores are produced by a nightly batch job, not on the request path, so prediction load never competes with event ingest or dashboard queries.
Accuracy depends on how much history you have
Churn AUC is not a single headline number — it scales with the depth of a tenant's event history. On a synthetic development dataset, TrackCrumb's XGBoost classifier shows:
| Tenant data depth | Expected holdout AUC |
|---|---|
| < 100 training examples | No prediction — returns 0.5 for all users |
| 500–5,000 examples (typical early-stage SaaS) | 0.70–0.82 |
| > 50,000 examples (mature, rich history) | 0.85–0.93 |
These ranges were measured on a synthetic development dataset, so treat them as a guide rather than a guarantee; production AUC depends on how much real event history a tenant has accumulated. The practical read: an early-stage product should expect a useful but imperfect classifier, with AUC tending to improve as the event log grows.
Cold start: no prediction beats a bad prediction
For a brand-new workspace with fewer than 100 training examples, TrackCrumb's churn model returns a neutral probability of 0.5 for every user instead of emitting an unreliable score. This is deliberate. A model trained on almost no churn events would produce noise dressed up as a number, and acting on that noise is worse than admitting there isn't enough data yet. The score stays neutral until the workspace has accumulated enough history to train on.
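A minimal sketch of this abstain-below-threshold rule, assuming a hypothetical `predict` callable that stands in for the trained classifier. The function name, its signature, and the choice to start predicting at exactly 100 examples are illustrative assumptions, not TrackCrumb's actual code.

```python
from typing import Callable, Dict, Iterable

MIN_TRAINING_EXAMPLES = 100   # threshold stated in the post
NEUTRAL_PROBABILITY = 0.5     # neutral score returned during cold start


def score_users(
    user_ids: Iterable[str],
    n_training_examples: int,
    predict: Callable[[str], float],  # hypothetical stand-in for the trained model
) -> Dict[str, float]:
    """Return a churn probability per user, abstaining when data is too thin."""
    if n_training_examples < MIN_TRAINING_EXAMPLES:
        # Cold start: every user gets the same neutral score.
        return {uid: NEUTRAL_PROBABILITY for uid in user_ids}
    return {uid: predict(uid) for uid in user_ids}
```

Keeping the check in front of the model call means downstream consumers always receive a number, and a run of identical 0.5 scores is itself a visible signal that the workspace is still in cold start.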
Keeping the model honest: drift detection and promotion gates
Models decay as user behavior shifts. TrackCrumb guards against this in two places.
Drift detection. Every week, TrackCrumb checks each input feature for distribution drift and triggers a retrain when the Population Stability Index (PSI) exceeds 0.25 or a Kolmogorov–Smirnov (KS) test returns p < 0.01 on any feature. Drift only triggers retraining; the retrained model is not deployed automatically and still has to pass the promotion gate.
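To make the PSI half of that rule concrete, here is a hedged sketch in plain Python. The equal-width binning, the bin count of 10, and the epsilon floor are assumptions for illustration; the post does not say how TrackCrumb bins features. The KS p-value is taken as an input (in practice it could come from a two-sample test such as `scipy.stats.ks_2samp`).

```python
import math
from typing import List, Sequence

PSI_THRESHOLD = 0.25   # from the post
KS_P_THRESHOLD = 0.01  # from the post


def psi(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,      # assumed bin count
    eps: float = 1e-6,   # assumed floor to avoid log(0) on empty bins
) -> float:
    """Population Stability Index using equal-width bins from the reference sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def should_retrain(psi_value: float, ks_p_value: float) -> bool:
    """Apply the stated thresholds to one feature's drift statistics."""
    return psi_value > PSI_THRESHOLD or ks_p_value < KS_P_THRESHOLD
```

A weekly job would call `should_retrain` once per feature and kick off a retrain if any feature returns `True`.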
Promotion gate. A freshly retrained "shadow" model is promoted to production only when its holdout AUC beats the current production model by at least 0.01 (one percentage point of AUC). If the new model isn't measurably better, the incumbent stays. This prevents churn-from-retraining: swapping models on noise rather than genuine improvement.
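The gate itself is a one-line comparison. The sketch below is illustrative only: the function name and the small floating-point tolerance are assumptions, added so that a gain of exactly 0.01 is not lost to rounding.

```python
MIN_AUC_GAIN = 0.01  # promotion margin stated in the post


def should_promote(shadow_auc: float, production_auc: float) -> bool:
    """Promote the shadow model only if it beats production by at least 0.01 AUC."""
    # 1e-9 tolerance (an assumption) absorbs float rounding at the boundary.
    return (shadow_auc - production_auc) >= MIN_AUC_GAIN - 1e-9
```

Both AUC values should be computed on the same holdout set; otherwise the 0.01 margin compares two different measurements rather than two models.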
Beyond a single score: 30-day survival curves
A churn probability answers "how likely," not "when." TrackCrumb also fits a per-user 30-day survival curve using a Cox proportional-hazards model (via the lifelines library), with a piecewise-exponential fallback when the proportional-hazards assumption is violated.
The survival curve turns a single risk score into a time profile — useful when the question is not just who is at risk but how soon an intervention needs to land.
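For intuition, a fitted Cox model produces a per-user curve by raising a shared baseline survival curve to the power of that user's hazard ratio: S(t | x) = S0(t) ^ exp(β · x). The sketch below shows only that final step in plain Python. The baseline, coefficients, and feature values are made up for illustration; in TrackCrumb's pipeline the fitting is done by lifelines, and features are assumed here to be expressed relative to the reference point the baseline describes.

```python
import math
from typing import List, Sequence


def cox_survival_curve(
    baseline_survival: Sequence[float],
    coefficients: Sequence[float],
    features: Sequence[float],
) -> List[float]:
    """Per-user survival curve: S(t | x) = S0(t) ** exp(beta . x)."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, features))
    hazard_ratio = math.exp(linear_predictor)
    return [s0 ** hazard_ratio for s0 in baseline_survival]


# Hypothetical baseline: a constant 1% daily hazard over days 0..30.
baseline = [math.exp(-0.01 * day) for day in range(31)]
# Hypothetical coefficients for two made-up features.
beta = [0.8, -0.5]

low_risk = cox_survival_curve(baseline, beta, [0.0, 1.0])
high_risk = cox_survival_curve(baseline, beta, [1.0, 0.0])
```

Two users with different feature values get curves of the same shape scaled by their hazard ratios, which is exactly the proportional-hazards assumption; when that assumption fails, a shared baseline no longer fits, hence the fallback described above.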
Why early churn signal is worth the machinery
Two benchmarks explain why predicting churn early matters, both from Amplitude's 2025 Product Benchmark Report (2,600+ companies, data from Sept 2023–Sept 2024).
First, early engagement strongly tracks long-term retention: 69% of products that were top performers on day-7 activation were also top performers on three-month retention. This is a correlation across products, not causation, and "activation" here is Amplitude's day-7 return metric — but the direction is clear: what happens early predicts what happens later.
Second, product-usage retention is brutally skewed. In Amplitude's data, B2B-technology three-month retention was 15.6% at the 90th percentile versus 2.5% at the median — a more than 6× gap. (These are active-user retention curves, not subscription or revenue retention.) When the median product keeps only a small fraction of users, identifying the at-risk fraction early — while there's still time to act — is where a nightly churn score earns its keep.
FAQ
How accurate is TrackCrumb's churn model?
On a synthetic development dataset, expected holdout AUC is 0.70–0.82 for an early-stage tenant (500–5,000 examples) and 0.85–0.93 for a mature tenant (>50,000 examples). Production accuracy depends on a tenant's event-history depth.
What happens for a brand-new workspace with little data?
With fewer than 100 training examples, the model returns a neutral churn probability of 0.5 for all users rather than an unreliable prediction.
How often does the model retrain?
Feature drift is checked weekly; a retrain triggers when PSI > 0.25 or KS p < 0.01 on any feature. A retrained model is promoted only if it beats the incumbent by ≥ 0.01 AUC.
Does it predict when a user will churn, not just if?
Yes. Alongside the churn probability, TrackCrumb fits a per-user 30-day survival curve using a Cox proportional-hazards model (lifelines), with a piecewise-exponential fallback when the proportional-hazards assumption is violated.
