We've shipped three end-to-end ML systems. The pieces that look obvious in slides turn out to be the actual work.

Production AI Pipelines: What Actually Goes Into Shipping

We've shipped three end-to-end ML systems in the last two years: a churn predictor, a recommendation engine, and a customer-facing classifier. On paper, each is "build a model, deploy it, monitor it." Each took 3-6 months of real work, and most of that work wasn't training the model. This post covers what actually took the time, and what we'd plan for differently next time.

The shape of an ML system

The slide-deck version:

Get data
Train model
Deploy
Monitor

The reality has more boxes. For each project, the actual stages took roughly:

30% data engineering (collect, clean, validate, store, refresh)
15% feature engineering
10% model training and selection
20% serving infrastructure
15% monitoring and drift detection
10% iteration loop tooling

"Train the model" is 10% of the effort. The other 90% is the system around it.

Stage 1: Data is the actual work

Most of an ML project is data work. Specific things:

Data discovery. "Where is the customer-event data we need? What's the schema? Is there a dictionary somewhere?" Answering these questions for one project took 3 weeks. Most companies' data catalogs are partial; we spent meaningful time talking to the teams that owned upstream systems.

Data quality issues. Real data has nulls, duplicates, outliers, encoding issues, timezone confusion, schema drift, and inconsistencies. Cleaning these isn't glamorous; it's the work.
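To make a few of these concrete, here is a minimal sketch of a profiling pass that counts nulls, drops duplicate events, and normalizes timestamps to UTC. The record shape (`user_id`, `event_ts`, `amount`) and the function name `profile_events` are hypothetical, chosen for illustration rather than taken from our schema, and treating naive timestamps as UTC is an assumption you would need to verify per source.

```python
from collections import Counter
from datetime import datetime, timezone


def profile_events(rows):
    """Count nulls and duplicate keys, and normalize timestamps to UTC.

    `rows` is a list of dicts with hypothetical keys: user_id, event_ts, amount.
    """
    null_counts = Counter()
    seen = set()
    duplicates = 0
    cleaned = []
    for row in rows:
        for key, value in row.items():
            if value is None:
                null_counts[key] += 1
        ts = row.get("event_ts")
        if ts is not None:
            parsed = datetime.fromisoformat(ts)
            if parsed.tzinfo is None:
                # Assumption for this sketch: naive timestamps are already UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
            ts = parsed.astimezone(timezone.utc).isoformat()
        dedup_key = (row.get("user_id"), ts)
        if dedup_key in seen:
            duplicates += 1
            continue
        seen.add(dedup_key)
        cleaned.append({**row, "event_ts": ts})
    return {"nulls": dict(null_counts), "duplicates": duplicates, "rows": cleaned}


sample = [
    {"user_id": 1, "event_ts": "2024-01-01T10:00:00+02:00", "amount": 5.0},
    {"user_id": 1, "event_ts": "2024-01-01T08:00:00", "amount": 5.0},
    {"user_id": 2, "event_ts": "2024-01-01T09:00:00+00:00", "amount": None},
]
report = profile_events(sample)
```

The first two sample rows describe the same instant in different timezone notations, which is why they only show up as duplicates after normalization.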

Feature pipeline. Once you have the data, the features computed at training time and at inference time have to match. Training on aggregated daily features but serving on real-time features causes train/serve skew. The pipeline that computes features identically in both contexts is its own engineering project.
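One way to reduce skew is to keep a single definition of each feature and call it from both paths. The sketch below assumes a hypothetical feature, `days_since_last_event`, and hypothetical helper names; it illustrates the idea and is not a description of our feature service.

```python
from datetime import datetime


def days_since_last_event(event_times, as_of):
    """Single definition of the feature, used by both batch and online paths."""
    past = [t for t in event_times if t <= as_of]
    if not past:
        return None  # no history yet; callers decide how to handle it
    return (as_of - max(past)).days


def build_training_row(user_events, label_time):
    # Offline: compute the feature as of the label time so later events don't leak in.
    return {"days_since_last_event": days_since_last_event(user_events, label_time)}


def build_serving_row(user_events, now):
    # Online: the same function, evaluated at request time.
    return {"days_since_last_event": days_since_last_event(user_events, now)}


events = [datetime(2024, 1, 1), datetime(2024, 1, 10), datetime(2024, 2, 1)]
cutoff = datetime(2024, 1, 15)
train_row = build_training_row(events, cutoff)
serve_row = build_serving_row(events, cutoff)
```

Passing the reference time explicitly (`as_of`) is the design choice that matters here: the same code can then answer "what would this feature have been at time T" for training and "what is it now" for serving.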

Backfill. When you change a feature, you need it computed historically for training. For our recommendation engine, a backfill computing 2 years of features over 50M users took 18 hours on a Spark cluster.

Refresh cadence. How often does the training data refresh? How often does the feature pipeline run? How fresh do features need to be at inference time? Each answer drives infrastructure cost.

We use:

Snowflake / BigQuery for feature warehousing
dbt for feature transformation logic
A custom feature service for low-latency serving
Airflow for orchestration

The feature service is the piece that took the most engineering. Feature stores like Feast or Tecton exist; we built our own because our scale is manageable and the off-the-shelf options didn't quite fit. Whether to build vs buy here is project-specific.

Stage 2: Training is rarely the bottleneck

Training the actual model is usually the easiest stage. We use:

Notebooks for exploration, then convert to scripts for production runs.
MLflow for experiment tracking — every training run logs hyperparameters, metrics, model artifact, and dataset version.
Off-the-shelf models (XGBoost, LightGBM for tabular; sentence-transformers for embeddings; pre-trained LLMs we fine-tune) where possible. Custom architectures only when there's a reason.
Cloud GPUs when we need them (mostly for fine-tuning), but most of our training is CPU on tabular features.

A typical training pipeline runs in under 2 hours. Hyperparameter search and model selection, when needed, add a few more hours.

The mistake we've made twice: spending too long tuning a model when the data was the bottleneck. Better data + simple model > worse data + complex model.

Stage 3: Serving is its own infrastructure problem

Once you have a trained model, how do you serve it?

For our use cases:

Batch scoring (the churn predictor): runs nightly, scores all users, writes results to a database. Other services query the database. Simple. Cheapest.
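A minimal sketch of the batch pattern, using an in-memory SQLite database as a stand-in for the results store. The table name `churn_scores`, the `score_user` rule, and the feature `days_inactive` are all hypothetical placeholders; a real job would call the trained model and write to whatever database downstream services query.

```python
import sqlite3


def score_user(features):
    # Stand-in for a trained model's predict call (hypothetical scoring rule).
    return min(1.0, features["days_inactive"] / 30.0)


def run_nightly_scoring(conn, users, run_date):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS churn_scores ("
        "user_id INTEGER, run_date TEXT, score REAL, "
        "PRIMARY KEY (user_id, run_date))"
    )
    rows = [(uid, run_date, score_user(feats)) for uid, feats in users.items()]
    # INSERT OR REPLACE keeps the job idempotent if the same night is re-run.
    conn.executemany("INSERT OR REPLACE INTO churn_scores VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
users = {1: {"days_inactive": 3}, 2: {"days_inactive": 45}}
run_nightly_scoring(conn, users, "2024-01-01")
run_nightly_scoring(conn, users, "2024-01-01")  # re-run does not duplicate rows
scores = dict(conn.execute("SELECT user_id, score FROM churn_scores"))
```

Keying the table on user and run date means a failed or repeated nightly run can simply be started again without cleanup.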

Online serving (the classifier): a Python service hosting the model, called over HTTP. Same as any other service: containerized, deployed via our standard pipeline. We use ONNX Runtime or Triton for inference (depending on model type) instead of native PyTorch/TF for ~3-5x throughput at the same accuracy.

Embedding generation (the recommendation engine): a batch service generates embeddings for new items; another service serves the embeddings via a vector store (we use pgvector).

The serving layer has the same operational requirements as any web service: deploys, monitoring, autoscaling, latency SLOs. We use the same Kubernetes infrastructure we use for non-ML services.

A specific gotcha: model artifact loading time. Some models take 30+ seconds to load into memory. Pod startup with cold model load is slow. We bake the model into the container image (rather than downloading at startup) and use readiness probes that wait for the model to load before accepting traffic.
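The readiness side of this can be sketched as a small holder object that loads the model in the background and reports 503 until it is in memory. The class name `ModelHolder`, the `slow_loader` function, and the status-code convention are assumptions for illustration; in a real service, `readiness_status` would back whatever endpoint the readiness probe calls.

```python
import threading
import time


class ModelHolder:
    """Loads a model in the background and reports readiness."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._ready = threading.Event()

    def start_loading(self):
        thread = threading.Thread(target=self._load, daemon=True)
        thread.start()
        return thread

    def _load(self):
        self._model = self._loader()
        self._ready.set()

    def readiness_status(self):
        # What a readiness handler would return: 503 until the model is loaded.
        return 200 if self._ready.is_set() else 503

    def predict(self, x):
        if not self._ready.is_set():
            raise RuntimeError("model not loaded yet")
        return self._model(x)


def slow_loader():
    time.sleep(0.2)  # stand-in for a 30+ second artifact load
    return lambda x: x * 2


holder = ModelHolder(slow_loader)
status_before = holder.readiness_status()
loader_thread = holder.start_loading()
loader_thread.join()
status_after = holder.readiness_status()
```

Loading in a background thread lets the process answer liveness checks immediately while the readiness check keeps traffic away until the model is usable.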

Stage 4: Monitoring drift, not just metrics

Standard service monitoring (CPU, memory, error rate, latency) is necessary but insufficient. ML-specific monitoring:

Input drift. The distribution of incoming features changes over time. Features that were 1-100 in training start arriving as 10-1000 in production (because something upstream changed). Models often fail silently when this happens.

We compute statistics (mean, stddev, min, max, % nulls) of input features per day, and alert when they deviate > 3σ from training-time baselines.
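A minimal sketch of that check, under one specific reading of the rule: a feature is flagged when its daily mean sits more than 3 training-time standard deviations from the training-time mean. The function names are hypothetical, and other readings (for example, comparing against the standard error of the daily mean) are equally defensible.

```python
import statistics


def feature_stats(values):
    """Summary statistics for one feature; None counts as a null."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.fmean(present),
        "stddev": statistics.pstdev(present),
        "min": min(present),
        "max": max(present),
        "null_pct": 100.0 * (len(values) - len(present)) / len(values),
    }


def drifted(baseline, today, n_sigma=3.0):
    """True if today's mean is more than n_sigma baseline stddevs from the baseline mean."""
    if baseline["stddev"] == 0:
        return today["mean"] != baseline["mean"]
    z = abs(today["mean"] - baseline["mean"]) / baseline["stddev"]
    return z > n_sigma


baseline = feature_stats(list(range(1, 101)))        # feature was 1-100 in training
normal_day = feature_stats([40, 55, 60, None])
shifted_day = feature_stats([400, 550, 600, 700])    # upstream change: values now ~10x larger
```

With these inputs, `drifted(baseline, normal_day)` is false and `drifted(baseline, shifted_day)` is true. The same comparison can be applied to the null percentage, min, and max, which tend to catch different kinds of upstream breakage.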

Prediction drift. The distribution of outputs shifts. The classifier that used to predict 30% positive class now predicts 50%. Could be input drift; could be a real change in the world.
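For a binary classifier, the simplest version of this check compares today's positive rate to the rate seen at training time. The sketch below assumes a hypothetical tolerance of 10 percentage points; the right threshold depends on how noisy daily volumes are.

```python
def positive_rate(predictions):
    """Share of predictions equal to the positive class (1)."""
    return sum(1 for p in predictions if p == 1) / len(predictions)


def prediction_drift(baseline_rate, todays_predictions, max_abs_shift=0.10):
    """Return (today's positive rate, whether it moved more than max_abs_shift)."""
    rate = positive_rate(todays_predictions)
    return rate, abs(rate - baseline_rate) > max_abs_shift


steady_rate, steady_alert = prediction_drift(0.30, [1] * 32 + [0] * 68)
shifted_rate, shifted_alert = prediction_drift(0.30, [1] * 50 + [0] * 50)
```

An alert here says only that something changed; deciding between input drift and a real change in the world still takes a person looking at the inputs.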

Performance drift. When ground truth becomes available (purchases happen, churn is observed), how is the model doing now vs at training time? This is the gold-standard metric but has long lag.

For one of our projects, performance drift was visible in the data weeks before we noticed it. We added per-day rolling-window evaluation (using the previous week's now-ground-truthed predictions) to catch this faster.
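A rolling-window evaluation of this kind can be sketched as follows. The record layout (prediction date, predicted label, actual label) and the choice of accuracy as the metric are assumptions for illustration; predictions whose ground truth has not arrived yet are skipped.

```python
from datetime import date, timedelta


def rolling_accuracy(records, as_of, window_days=7):
    """Accuracy over predictions made in the window that now have ground truth.

    `records` is a list of (prediction_date, predicted_label, actual_label);
    actual_label is None while ground truth is still pending.
    """
    start = as_of - timedelta(days=window_days)
    scored = [
        (pred, actual)
        for day, pred, actual in records
        if start <= day < as_of and actual is not None
    ]
    if not scored:
        return None
    return sum(1 for pred, actual in scored if pred == actual) / len(scored)


records = [
    (date(2024, 3, 1), 1, 1),
    (date(2024, 3, 2), 0, 1),
    (date(2024, 3, 3), 1, 1),
    (date(2024, 3, 4), 0, 0),
    (date(2024, 3, 6), 1, None),  # ground truth not yet observed
    (date(2024, 2, 1), 1, 0),     # outside the window
]
weekly = rolling_accuracy(records, as_of=date(2024, 3, 8))
```

Running this once per day and plotting the result next to the training-time metric turns a slow decline into something visible on a dashboard.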

Stage 5: The iteration loop

The hard part of production ML isn't shipping v1. It's iterating efficiently.

What helps:

Reproducible training. Every model artifact has a known training dataset (versioned), training code (Git SHA), and hyperparameters. Re-training with the same inputs (and fixed random seeds) produces the same model.
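One lightweight way to make "known inputs" checkable is to record them in a manifest and derive a fingerprint from it. The `TrainingManifest` class, its field names, and the example values below are hypothetical; the `random_seed` field is our addition for the sketch, since seeds usually need pinning for runs to be repeatable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class TrainingManifest:
    dataset_version: str
    git_sha: str
    hyperparameters: dict = field(default_factory=dict)
    random_seed: int = 0

    def fingerprint(self):
        # sort_keys makes the hash independent of dict insertion order.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]


run_a = TrainingManifest("2024-03-01", "abc1234", {"max_depth": 6, "eta": 0.1}, 42)
run_b = TrainingManifest("2024-03-01", "abc1234", {"eta": 0.1, "max_depth": 6}, 42)
run_c = TrainingManifest("2024-03-08", "abc1234", {"max_depth": 6, "eta": 0.1}, 42)
```

Two runs with the same fingerprint should produce the same model; if they don't, something outside the manifest (library versions, nondeterministic ops) is influencing training.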

Champion-challenger. When a new model candidate exists, we run it side-by-side with the current production model on a slice of traffic. Compare metrics. Promote if better.
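The comparison step can be sketched as scoring both models on the same slice and promoting only when the challenger wins by a margin. The toy models, the accuracy metric, and the `min_gain` threshold are assumptions for illustration; a production version would also want enough traffic for the difference to be statistically meaningful.

```python
def accuracy(model, examples):
    return sum(1 for x, y in examples if model(x) == y) / len(examples)


def pick_winner(champion, challenger, traffic_slice, min_gain=0.02):
    """Score both models on the same slice; promote only on a clear gain."""
    champ_acc = accuracy(champion, traffic_slice)
    chall_acc = accuracy(challenger, traffic_slice)
    winner = "challenger" if chall_acc - champ_acc >= min_gain else "champion"
    return winner, {"champion": champ_acc, "challenger": chall_acc}


def champion(x):
    return int(x > 5)  # toy stand-in for the production model


def challenger(x):
    return int(x > 3)  # toy stand-in for the candidate model


# (input, observed label) pairs from the traffic slice.
traffic_slice = [(1, 0), (2, 0), (4, 1), (5, 1), (6, 1), (8, 1)]
winner, metrics = pick_winner(champion, challenger, traffic_slice)
```

Requiring a minimum gain keeps noise from triggering a promotion when the two models are effectively tied.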

Easy retrain. "Re-train with the latest data" is one button (or one command). For one project, retraining took half a day of manual setup; we cut it to ~20 minutes by automating the data pipeline, training pipeline, and deployment as a single workflow.

Feature flags for model versions. New models go behind a flag, ramp from 1% → 100%. Same as code deploys.
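A common way to implement a percentage ramp is to hash each user into a stable bucket and compare it to the current rollout percentage. This is a generic sketch, not a description of any particular flag system; the salt string and function names are hypothetical.

```python
import hashlib


def bucket(user_id, salt="model-v2-rollout"):
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def use_new_model(user_id, rollout_percent):
    return bucket(user_id) < rollout_percent


users = range(10_000)  # synthetic user ids for the example
at_1 = {u for u in users if use_new_model(u, 1)}
at_25 = {u for u in users if use_new_model(u, 25)}
at_100 = {u for u in users if use_new_model(u, 100)}
```

Because buckets are stable, every user who saw the new model at 1% still sees it at 25%, so nobody flips back and forth between versions as the ramp progresses.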

What we got wrong

Specific mistakes we made across these projects:

Underestimated data quality work. First project, planned 1 month for data engineering. Spent 3. Same on the second project; we did better on the third.

Over-engineered the first model. Started with a deep neural network where logistic regression would have been fine. The DNN took longer to train, was harder to interpret, and didn't outperform LR on the eval. We retrained with LR; quality improved (because we had time to iterate on features).

Skipped offline evaluation infrastructure. First model went straight to online evaluation. When we wanted to compare alternatives, we had no offline benchmark — every comparison required A/B testing in production. Slow. Now we have an offline eval framework as a prerequisite for online.
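At its core, an offline benchmark is a frozen labeled set plus a function that scores every candidate against it the same way. The sketch below uses toy threshold models and precision/recall as the metrics; both are illustrative assumptions, and the names are hypothetical.

```python
def precision_recall(model, holdout):
    """Precision and recall for a binary model on (input, label) pairs."""
    tp = sum(1 for x, y in holdout if model(x) == 1 and y == 1)
    fp = sum(1 for x, y in holdout if model(x) == 1 and y == 0)
    fn = sum(1 for x, y in holdout if model(x) == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


def benchmark(candidates, holdout):
    """Score every candidate on the same frozen holdout set."""
    return {name: precision_recall(model, holdout) for name, model in candidates.items()}


def make_threshold_model(threshold):
    return lambda x: int(x >= threshold)


holdout = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.3, 1)]
candidates = {
    "threshold_0.5": make_threshold_model(0.5),
    "threshold_0.2": make_threshold_model(0.2),
}
results = benchmark(candidates, holdout)
```

Freezing the holdout set is what makes results comparable across weeks; if it changes between runs, differences in score can no longer be attributed to the model.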

Didn't plan for the unhappy paths. Models occasionally produce bad outputs. When they do, what happens? "The downstream system uses the bad output and does the wrong thing" is the default if you didn't design otherwise. We added confidence-thresholding (low confidence → fallback path) on each project after seeing this.
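Confidence thresholding can be as small as the sketch below. The label names, the 0.7 threshold, and the `needs_review` fallback are hypothetical; what the fallback path actually does (a default, a rules-based answer, a human queue) is a product decision.

```python
def decide(probabilities, threshold=0.7, fallback="needs_review"):
    """Return the top label if the model is confident enough, else the fallback."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < threshold:
        return fallback, confidence
    return label, confidence


confident = decide({"spam": 0.93, "not_spam": 0.07})
uncertain = decide({"spam": 0.55, "not_spam": 0.45})
```

Returning the confidence alongside the decision makes it easy to log how often the fallback path is taken, which is itself a useful drift signal.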

Forgot about cold start. New users / new items have no data. The model treats them as average — which is sometimes wrong. Cold-start handling is a separate sub-problem we underestimated.

The roles involved

A working ML system needs:

A data engineer (build the pipeline)
An ML engineer (train and tune the model)
A platform engineer (serve and monitor)
A domain expert (does the model do the right thing?)

Sometimes one person plays multiple roles (typical for early ML projects). For our projects, the data engineer + ML engineer were often the same person; the platform engineer was usually distinct.

The domain expert is the most underestimated. Without them, you build something technically correct but practically useless. They're the person who notices that "the model predicts churn risk 0.92 for everyone in cohort X" is wrong, because that cohort just signed annual contracts and isn't about to churn.

Cost reality

For our projects, monthly costs:

Feature pipeline (Snowflake + Airflow): ~$800
Training compute (GPU when needed): ~$200 amortized
Serving infrastructure (a few small services): ~$300
Monitoring (custom dashboards, drift detection): ~$100
Model development time (an engineer): ~$15k

Total ongoing: ~$1,400/month operational + engineering time. Plus the upfront 3-6 months to ship.

The ROI varies by project. The recommendation engine paid back its development cost in months via lift in user engagement. The churn predictor was harder to attribute concrete dollars to (it informed retention work that may or may not have happened differently).

What I'd tell a team starting

Plan 90% of the time for non-model work. Data, infrastructure, monitoring. The model is the small part.

Build the offline eval framework before training the first model. Without it, every iteration is slow and uncertain.

Start simple. Logistic regression / boosted trees / sentence-transformers as defaults. Reach for complexity only when you've shown simple isn't enough.

Have a domain expert in the loop. The non-obvious problems with ML output are usually visible to someone who knows the business domain.

Plan for drift, low confidence, cold start. These are not edge cases; they are the dominant cases over time.

Match infrastructure to project size. A churn predictor doesn't need Kubeflow. Don't build a feature store before you have a feature.

ML in production is an engineering discipline. The model is the most photogenic part; the rest is what determines whether it works in real life. The teams that ship reliable ML treat it as a system, not a model.

About Kiril Urbonas