We've shipped three end-to-end ML systems. These are the pieces that look obvious in slides and turn out to be the actual work.
We've shipped three end-to-end ML systems in the last two years: a churn predictor, a recommendation engine, and a customer-facing classifier. Each is "build a model, deploy it, monitor it." Each took 3-6 months of real work. Most of that work wasn't training the model. This post covers the things that took the actual time, and what we'd plan for differently next time.
The slide-deck version:
The reality has more boxes. For each project, the actual stages took roughly:
"Train the model" is 10% of the effort. The other 90% is the system around it.
Most of an ML project is data work. Specific things:
Data discovery. "Where is the customer-event data we need? What's the schema? Is there a dictionary somewhere?" Answering these questions for one project took 3 weeks. Most companies' data catalogs are partial; we spent meaningful time talking to the teams that owned upstream systems.
Data quality issues. Real data has nulls, duplicates, outliers, encoding issues, timezone confusion, schema drift, and inconsistencies. Cleaning these isn't glamorous; it's the work.
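A minimal sketch of the kind of audit that surfaces these problems early. The field names (`user_id`, `event_id`, `ts`) and the `audit_events` helper are hypothetical, chosen only for illustration; it covers three of the issues listed above (nulls, duplicates, timezone-naive timestamps) and nothing more.

```python
from collections import Counter
from datetime import datetime, timezone


def audit_events(rows, key_fields=("user_id", "event_id")):
    """Count common data-quality problems in a list of event dicts."""
    report = Counter()
    seen = set()
    for row in rows:
        # Nulls: any field with a missing value.
        if any(value is None for value in row.values()):
            report["rows_with_nulls"] += 1
        # Duplicates: the same key fields seen on an earlier row.
        key = tuple(row.get(field) for field in key_fields)
        if key in seen:
            report["duplicate_keys"] += 1
        seen.add(key)
        # Timezone confusion: naive timestamps cannot be compared safely.
        ts = row.get("ts")
        if isinstance(ts, datetime) and ts.tzinfo is None:
            report["naive_timestamps"] += 1
    return dict(report)


rows = [
    {"user_id": 1, "event_id": "a", "ts": datetime(2024, 1, 1, tzinfo=timezone.utc), "amount": 10.0},
    {"user_id": 1, "event_id": "a", "ts": datetime(2024, 1, 1, tzinfo=timezone.utc), "amount": 10.0},
    {"user_id": 2, "event_id": "b", "ts": datetime(2024, 1, 2), "amount": None},
]
report = audit_events(rows)
```

Running a report like this per source table, before any feature work starts, is one way to turn "the data is messy" into a list of specific fixes.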
Feature pipeline. Once you have the data, the features computed at training time and at inference time have to match. Training on aggregated daily features but serving on real-time features causes train/serve skew. The pipeline that computes features identically in both contexts is its own engineering project.
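One common way to limit train/serve skew is to route both paths through a single feature function that takes an explicit "as of" timestamp. The sketch below assumes events are dicts with `ts` and `amount` fields; the function names and the 7-day window are illustrative, not a description of our actual pipeline.

```python
from datetime import datetime, timedelta, timezone


def compute_features(events, as_of):
    """Single source of truth for feature logic, shared by training and serving."""
    window_start = as_of - timedelta(days=7)
    # Only events inside [as_of - 7d, as_of) count, so nothing from the future leaks in.
    recent = [e for e in events if window_start <= e["ts"] < as_of]
    total = sum(e["amount"] for e in recent)
    return {
        "events_7d": len(recent),
        "spend_7d": total,
        "avg_spend_7d": total / len(recent) if recent else 0.0,
    }


def training_row(events, label_time):
    # Offline path: features as of the historical label time.
    return compute_features(events, as_of=label_time)


def serving_row(events, now):
    # Online path: the same function, as of request time.
    return compute_features(events, as_of=now)


t = datetime(2024, 3, 15, tzinfo=timezone.utc)
events = [
    {"ts": t - timedelta(days=1), "amount": 10.0},
    {"ts": t - timedelta(days=3), "amount": 20.0},
    {"ts": t - timedelta(days=10), "amount": 5.0},   # outside the window
    {"ts": t + timedelta(days=1), "amount": 100.0},  # in the future, ignored
]
offline = training_row(events, t)
online = serving_row(events, t)
```

The design choice here is that neither path is allowed its own aggregation logic; any difference between training and serving has to come from the data, not from the code.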
Backfill. When you change a feature, you need it computed historically for training. For our recommendation engine, a backfill computing 2 years of features over 50M users took 18 hours on a Spark cluster.
Refresh cadence. How often does the training data refresh? How often does the feature pipeline run? How fresh do features need to be at inference time? Each answer drives infrastructure cost.
We use:
The feature service is the piece that took the most engineering. Feature stores like Feast or Tecton exist; we built our own because our scale is manageable and the off-the-shelf options didn't quite fit. Whether to build or buy here is project-specific.
Training the actual model is usually the easiest stage. We use:
A typical training pipeline runs in <2 hours. Hyperparameter search and model selection, when needed, add a few hours more.
The mistake we've made twice: spending too long tuning a model when the data was the bottleneck. Better data + simple model > worse data + complex model.
Once you have a trained model, how do you serve it?
For our use cases:
Batch scoring (the churn predictor): runs nightly, scores all users, writes results to a database. Other services query the database. Simple. Cheapest.
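The batch pattern can be sketched in a few lines. This example uses an in-memory SQLite database and a toy scoring function as stand-ins; the table names, columns, and `score_all_users` helper are assumptions for illustration, not our production schema.

```python
import sqlite3


def score_all_users(conn, model, run_date):
    """Score every user and write one row per (user, run date)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS churn_scores ("
        "user_id INTEGER, run_date TEXT, score REAL, "
        "PRIMARY KEY (user_id, run_date))"
    )
    users = conn.execute("SELECT user_id, days_since_login FROM users").fetchall()
    rows = [(user_id, run_date, model(days)) for user_id, days in users]
    # INSERT OR REPLACE keeps a re-run of the same night idempotent.
    conn.executemany("INSERT OR REPLACE INTO churn_scores VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def toy_model(days_since_login):
    # Stand-in for a trained model: more idle days means higher churn risk.
    return min(1.0, days_since_login / 30.0)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, days_since_login INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 3), (2, 15), (3, 60)])

scored = score_all_users(conn, toy_model, "2024-03-15")
rescored = score_all_users(conn, toy_model, "2024-03-15")  # safe to re-run
row_count = conn.execute("SELECT COUNT(*) FROM churn_scores").fetchone()[0]
score_user_3 = conn.execute(
    "SELECT score FROM churn_scores WHERE user_id = 3"
).fetchone()[0]
```

Downstream services then read `churn_scores` like any other table, which is most of why this pattern is cheap to operate.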
Online serving (the classifier): a Python service hosting the model, called over HTTP. Same as any other service: containerized, deployed via our standard pipeline. We use ONNX Runtime or Triton for inference (depending on model type) instead of native PyTorch/TF for ~3-5x throughput at the same accuracy.
Embedding generation (the recommendation engine): a batch service generates embeddings for new items; another service serves the embeddings via a vector store (we use pgvector).
The serving layer has the same operational requirements as any web service: deploys, monitoring, autoscaling, latency SLOs. We use the same Kubernetes infrastructure we use for non-ML services.
A specific gotcha: model artifact loading time. Some models take 30+ seconds to load into memory. Pod startup with cold model load is slow. We bake the model into the container image (rather than downloading at startup) and use readiness probes that wait for the model to load before accepting traffic.
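A sketch of the readiness gating, independent of any web framework. The `ModelHolder` class and its method names are hypothetical; the idea is only that the readiness endpoint reports 503 until the model is in memory, so the orchestrator does not route traffic to a pod that cannot answer yet.

```python
import threading
import time


class ModelHolder:
    """Loads a model in the background and exposes a readiness status."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._ready = threading.Event()

    def load_in_background(self):
        thread = threading.Thread(target=self._load, daemon=True)
        thread.start()
        return thread

    def _load(self):
        self._model = self._loader()
        self._ready.set()

    def readiness_status(self):
        # The HTTP status a readiness endpoint would return.
        return 200 if self._ready.is_set() else 503

    def predict(self, x):
        if not self._ready.is_set():
            raise RuntimeError("model not loaded yet")
        return self._model(x)


def slow_loader():
    time.sleep(0.05)  # stand-in for a 30+ second artifact load
    return lambda x: x * 2


holder = ModelHolder(slow_loader)
status_before = holder.readiness_status()  # checked before loading starts
holder.load_in_background().join()
status_after = holder.readiness_status()
prediction = holder.predict(21)
```

In a real service, `readiness_status` would sit behind whatever path the readiness probe is configured to call.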
Standard service monitoring (CPU, memory, error rate, latency) is necessary but insufficient. ML-specific monitoring:
Input drift. The distribution of incoming features changes over time. Features that were 1-100 in training start arriving as 10-1000 in production (because something upstream changed). Models often fail silently when this happens.
We compute statistics (mean, stddev, min, max, % nulls) of input features per day, and alert when they deviate > 3σ from training-time baselines.
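A minimal sketch of that check for a single numeric feature. It makes one specific assumption about what "deviate > 3σ" means: the daily mean is compared with the training mean, measured in units of the training standard deviation. The null-rate threshold and the function names are also illustrative.

```python
import statistics


def summarize(values):
    """Daily summary for one feature; assumes at least one non-null value."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.fmean(present),
        "stddev": statistics.pstdev(present),
        "min": min(present),
        "max": max(present),
        "null_pct": 100.0 * (len(values) - len(present)) / len(values),
    }


def drift_alerts(baseline, today, sigmas=3.0, max_null_pct_increase=5.0):
    """Compare today's summary with the training-time baseline."""
    alerts = []
    spread = baseline["stddev"] or 1e-9  # avoid dividing by zero on constant features
    z = abs(today["mean"] - baseline["mean"]) / spread
    if z > sigmas:
        alerts.append(f"mean shifted {z:.1f} sigma")
    if today["null_pct"] - baseline["null_pct"] > max_null_pct_increase:
        alerts.append("null rate increased")
    return alerts


training_values = list(range(1, 101))               # feature was 1-100 in training
baseline = summarize(training_values)
same_day = summarize(training_values)
shifted_day = summarize([v * 10 for v in range(1, 101)])  # now arriving as 10-1000
sparse_day = summarize(training_values + [None] * 20)

no_alerts = drift_alerts(baseline, same_day)
shift_alerts = drift_alerts(baseline, shifted_day)
null_alerts = drift_alerts(baseline, sparse_day)
```

Min and max are computed but not alerted on here; in practice they are useful for spotting unit changes upstream that a mean check alone can miss.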
Prediction drift. The distribution of outputs shifts. The classifier that used to predict 30% positive class now predicts 50%. Could be input drift; could be a real change in the world.
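For a binary classifier, the simplest version of this check compares the daily positive rate with the rate seen at training time. The sketch below assumes predictions are 0/1 labels; the 10-point tolerance is an arbitrary illustrative threshold, not a recommendation.

```python
def positive_rate(predictions):
    """Fraction of predictions that are the positive class (label 1)."""
    return sum(1 for p in predictions if p == 1) / len(predictions)


def prediction_drift(baseline_rate, todays_predictions, max_abs_shift=0.10):
    """Flag a day whose positive rate moved too far from the baseline."""
    rate = positive_rate(todays_predictions)
    shift = rate - baseline_rate
    return {"rate": rate, "shift": shift, "alert": abs(shift) > max_abs_shift}


# The 30% -> 50% example from the text.
drifted = prediction_drift(0.30, [1] * 50 + [0] * 50)
stable = prediction_drift(0.30, [1] * 32 + [0] * 68)
```

An alert here says only that the output distribution moved; deciding whether the cause is input drift or a real change in the world still takes a person.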
Performance drift. When ground truth becomes available (purchases happen, churn is observed), how is the model doing now vs at training time? This is the gold-standard metric but has long lag.
For one of our projects, performance drift was visible in the data weeks before we noticed it. We added per-day rolling-window evaluation (using the previous week's now-ground-truthed predictions) to catch this faster.
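A sketch of a rolling-window evaluation of that kind. It assumes each record is a `(prediction_date, predicted_label, actual_label)` tuple where `actual_label` is `None` until ground truth arrives, and it uses accuracy as the metric purely for illustration.

```python
from datetime import date, timedelta


def rolling_accuracy(records, as_of, window_days=7):
    """Accuracy over predictions from the trailing window that now have ground truth."""
    start = as_of - timedelta(days=window_days)
    scored = [
        (predicted, actual)
        for day, predicted, actual in records
        if start <= day < as_of and actual is not None
    ]
    if not scored:
        return None  # nothing ground-truthed yet in this window
    return sum(1 for predicted, actual in scored if predicted == actual) / len(scored)


records = [
    (date(2024, 3, 1), 0, 0),      # outside the 7-day window
    (date(2024, 3, 10), 1, 1),     # correct
    (date(2024, 3, 11), 0, 1),     # wrong
    (date(2024, 3, 12), 1, None),  # ground truth not available yet
]
accuracy = rolling_accuracy(records, as_of=date(2024, 3, 15))
empty = rolling_accuracy(records, as_of=date(2024, 4, 15))
```

Computing this once per day and plotting it next to the training-time metric makes a slow decline visible without waiting for someone to go looking.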
The hard part of production ML isn't shipping v1. It's iterating efficiently.
What helps:
Reproducible training. Every model artifact has a known training dataset (versioned), training code (Git SHA), and hyperparameters. Re-training with the same inputs produces the same model.
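One way to make that traceability concrete is a small manifest stored next to each artifact. The sketch below is an assumption about shape, not our actual format; in particular the `seed` field and the `manifest_id` fingerprint are additions for illustration.

```python
import hashlib
import json


def training_manifest(dataset_version, git_sha, hyperparameters, seed):
    """Describe exactly what went into a training run, with a stable fingerprint."""
    manifest = {
        "dataset_version": dataset_version,
        "git_sha": git_sha,
        "hyperparameters": hyperparameters,
        "seed": seed,
    }
    # Canonical JSON (sorted keys, fixed separators) so equal inputs hash equally.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return {**manifest, "manifest_id": manifest_id}


run_a = training_manifest("2024-03-01", "abc1234", {"max_depth": 6, "eta": 0.1}, seed=42)
run_b = training_manifest("2024-03-01", "abc1234", {"eta": 0.1, "max_depth": 6}, seed=42)
run_c = training_manifest("2024-03-01", "abc1234", {"max_depth": 8, "eta": 0.1}, seed=42)
```

Two runs with the same `manifest_id` should be expected to produce the same model; if they don't, some input is not being captured.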
Champion-challenger. When a new model candidate exists, we run it side-by-side with the current production model on a slice of traffic. Compare metrics. Promote if better.
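The comparison step can be sketched as follows. Both models are plain callables here, accuracy stands in for whatever metric the project actually uses, and the `min_examples` and `min_gain` thresholds are hypothetical guard rails rather than values we recommend.

```python
def compare_models(examples, champion, challenger, min_examples=100, min_gain=0.01):
    """examples: list of (features, actual_label) from the shared traffic slice."""
    n = len(examples)
    if n == 0:
        return {"n": 0, "champion": None, "challenger": None, "promote": False}
    champion_acc = sum(champion(x) == y for x, y in examples) / n
    challenger_acc = sum(challenger(x) == y for x, y in examples) / n
    # Promote only with enough evidence and a gain that clears the bar.
    promote = n >= min_examples and (challenger_acc - champion_acc) >= min_gain
    return {"n": n, "champion": champion_acc, "challenger": challenger_acc, "promote": promote}


examples = [(i, i % 2) for i in range(200)]


def always_zero(x):
    return 0  # toy champion


def parity(x):
    return x % 2  # toy challenger that happens to be right


result = compare_models(examples, always_zero, parity)
too_small = compare_models(examples[:50], always_zero, parity)
```

The minimum sample size matters as much as the gain: a challenger that looks better on 50 examples has not shown much.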
Easy retrain. "Re-train with the latest data" is one button (or one command). For one project, retraining was a half-day setup; we cut it to ~20 minutes by automating the data pipeline + training pipeline + deployment as a single workflow.
Feature flags for model versions. New models go behind a flag, ramp from 1% → 100%. Same as code deploys.
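A percentage ramp is usually implemented with stable hashing, so a given user stays on the same model version as the ramp widens. This is a generic sketch of that technique; the salt, version names, and function names are placeholders.

```python
import hashlib


def bucket(user_id, salt="model-v2-ramp"):
    """Map a user to a stable bucket in 0..9999."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def model_version_for(user_id, ramp_percent, new="v2", current="v1"):
    """Users whose bucket falls under the ramp get the new model."""
    return new if bucket(user_id) < ramp_percent * 100 else current


users = range(10_000)
on_v2_at_1 = {u for u in users if model_version_for(u, 1) == "v2"}
on_v2_at_10 = {u for u in users if model_version_for(u, 10) == "v2"}
on_v2_at_50 = {u for u in users if model_version_for(u, 50) == "v2"}
```

Because buckets are fixed per user, moving from 1% to 50% only adds users to the new model; nobody flips back and forth between versions during the ramp.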
Specific mistakes we made on each project:
Underestimated data quality work. First project, planned 1 month for data engineering. Spent 3. Same on the second project; we did better on the third.
Over-engineered the first model. Started with a deep neural network where logistic regression would have been fine. The DNN took longer to train, was harder to interpret, and didn't outperform LR on the eval. We retrained with LR; quality improved (because we had time to iterate on features).
Skipped offline evaluation infrastructure. First model went straight to online evaluation. When we wanted to compare alternatives, we had no offline benchmark — every comparison required A/B testing in production. Slow. Now we have an offline eval framework as a prerequisite for online.
Didn't plan for the unhappy paths. Models occasionally produce bad outputs. When they do, what happens? "The downstream system uses the bad output and does the wrong thing" is the default if you didn't design otherwise. We added confidence-thresholding (low confidence → fallback path) on each project after seeing this.
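Confidence thresholding can be as small as the sketch below. It assumes the model returns a dict of class probabilities; the 0.7 threshold and the `needs_review` fallback label are illustrative choices that would need tuning per project.

```python
def decide(probabilities, threshold=0.7, fallback="needs_review"):
    """Return the top class, or a fallback when the model is not confident enough."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < threshold:
        # Low confidence: hand off to the fallback path instead of acting on a guess.
        return {"label": fallback, "confidence": confidence, "used_fallback": True}
    return {"label": label, "confidence": confidence, "used_fallback": False}


confident = decide({"spam": 0.92, "ham": 0.08})
unsure = decide({"spam": 0.55, "ham": 0.45})
```

The important part is that the downstream system has an explicit branch for `used_fallback`, so a weak prediction is handled by design rather than by accident.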
Forgot about cold start. New users / new items have no data. The model treats them as average — which is sometimes wrong. Cold-start handling is a separate sub-problem we underestimated.
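One simple cold-start strategy, sketched under the assumption of a recommender: fall back to globally popular items until a user has enough history. The `min_events` cutoff and the function names are hypothetical, and this is only one of several ways to handle the problem.

```python
def recommend(user_history, personalized_model, popular_items, min_events=5, k=3):
    """Use the personalized model only when there is enough history to trust it."""
    if len(user_history) < min_events:
        # Cold start: too little data, so serve a popularity-based default.
        return popular_items[:k], "popular_fallback"
    return personalized_model(user_history)[:k], "personalized"


def toy_model(history):
    # Stand-in for a real model: recommend the most recently seen items first.
    return list(reversed(history))


popular = ["p1", "p2", "p3", "p4"]
new_user = recommend(["a"], toy_model, popular)
known_user = recommend(["a", "b", "c", "d", "e", "f"], toy_model, popular)
```

Returning the strategy name alongside the items makes it possible to monitor how much traffic is being served by the fallback.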
A working ML system needs:
Sometimes one person plays multiple roles (typical for early ML projects). For our projects, the data engineer + ML engineer were often the same person; the platform engineer was usually distinct.
The domain expert is the most underestimated role. Without them, you build something technically correct but practically useless. They're the person who notices that "the model predicts churn risk 0.92 for everyone in cohort X" is wrong: that cohort just signed annual contracts and is not churning.
For our projects, monthly costs:
Total ongoing: ~$1,400/month operational + engineering time. Plus the upfront 3-6 months to ship.
The ROI varies by project. The recommendation engine paid back its development cost in months via lift in user engagement. The churn predictor was harder to attribute concrete dollars to (it informed retention work that may or may not have happened differently).
Plan 90% of the time for non-model work. Data, infrastructure, monitoring. The model is the small part.
Build the offline eval framework before training the first model. Without it, every iteration is slow and uncertain.
Start simple. Logistic regression / boosted trees / sentence-transformers as defaults. Reach for complexity only when you've shown simple isn't enough.
Have a domain expert in the loop. The non-obvious problems with ML output are usually visible to someone who knows the business domain.
Plan for drift, low confidence, cold start. These are not edge cases; they are the dominant cases over time.
Match infrastructure to project size. A churn predictor doesn't need Kubeflow. Don't build a feature store before you have a feature.
ML in production is an engineering discipline. The model is the most photogenic part; the rest is what determines whether it works in real life. The teams that ship reliable ML treat it as a system, not a model.