Tracking experiments and shipping models are different problems. MLOps tooling tends to treat them as one; production splits them. Here are the patterns we use.
The MLOps tooling space conflates two distinct workflows: tracking experiments (which model gave the best results in research) and managing production models (which model is live, how do we promote a new one, what was the previous one). MLflow does both reasonably well, which is why it's popular, but the workflows have different requirements and most teams need different patterns for each.
After running ML in production for ~18 months, this is the split we've landed on.
Experiment tracking. A data scientist trains 50 variations of a model with different hyperparameters, data splits, and feature sets. They want to compare which performed best, see how training went, and reproduce a specific run. The artifact at the end isn't necessarily going to production; it's research output.
Model registry. A specific trained model goes from "this is the best one" to "this is in production." It needs versioning, a promotion workflow (staging → prod), rollback, and a clear answer to "what's deployed right now?"
MLflow has "Tracking" (the experiment side) and "Model Registry" (the production side). They share the underlying model storage but are conceptually different.
For the experiment side:
mlflow.start_run() wraps a training run; logs parameters, metrics, artifacts (the model file, evaluation outputs).
mlflow.log_artifact for non-model files (evaluation plots, sample predictions).
For research, this is great. Data scientists can churn through experiments and the results are systematically captured.
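As a rough illustration of the Tracking calls above, here is a minimal sketch of one logged run. The function name `log_training_run` and its arguments are placeholders, not something from our codebase. The `tracker` argument is assumed to be the `mlflow` module itself, passed in so the function can be exercised without a tracking server.

```python
from typing import Any, Mapping, Sequence


def log_training_run(
    tracker: Any,
    run_name: str,
    params: Mapping[str, Any],
    metrics: Mapping[str, float],
    artifact_paths: Sequence[str] = (),
) -> str:
    """Log one training run and return its run ID.

    `tracker` is assumed to be the `mlflow` module (or any object exposing
    the same start_run / log_params / log_metrics / log_artifact calls).
    """
    with tracker.start_run(run_name=run_name) as run:
        tracker.log_params(dict(params))    # hyperparameters, data split, feature set
        tracker.log_metrics(dict(metrics))  # evaluation metrics for this run
        for path in artifact_paths:         # non-model files: plots, sample predictions
            tracker.log_artifact(path)
        return run.info.run_id
```

In practice the call would look something like `log_training_run(mlflow, "depth-6", {"max_depth": 6}, {"auc": auc}, ["eval/roc.png"])`, where the run name, parameter, metric, and path are illustrative only.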
What it's not designed for:
The Registry is a layer on top: take a specific Tracking run's model, register it as a versioned artifact, manage lifecycle (Staging → Production → Archived).
Concretely:
import mlflow

# run_id is the ID of the finished Tracking run whose model you want to register.
# Register a model from a tracking run
mlflow.register_model(
    f"runs:/{run_id}/model",
    "fraud-classifier",
)

# Promote to production
client = mlflow.MlflowClient()
client.transition_model_version_stage(
    name="fraud-classifier",
    version=3,
    stage="Production",
)
Now mlflow.pyfunc.load_model("models:/fraud-classifier/Production") loads whichever version is in the Production stage. Promotion is a single API call.
This is genuinely useful for the production side. But it's also limited in ways that matter for serious production use.
A few patterns we found awkward:
Stage transitions are global. "Production" is one named slot. Multi-region deployments where different regions might be on different versions are awkward.
No traffic split / canary built-in. "Move from v3 to v4" is atomic — you can't easily say "10% to v4, 90% to v3." We layered our own routing in front for canary.
Limited audit trail by default. Stage transitions are logged but the UI doesn't make it easy to ask "who promoted what when." We layered our own approval workflow on top.
Tied to MLflow Tracking's lineage. A model that was trained outside MLflow (a SageMaker job, an external pipeline) requires manual registration. Doable but adds friction.
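To make the canary point concrete, here is a minimal sketch of the kind of routing one could layer in front of the Registry. It is not our actual router. The function name `pick_version` and the choice of hashing a request key into 100 buckets are assumptions made for illustration.

```python
import hashlib


def pick_version(request_key: str, stable: str, canary: str, canary_percent: int) -> str:
    """Deterministically route a request to the stable or canary model version.

    The same request_key always lands in the same bucket, so a given caller
    does not flip between versions while the split is unchanged.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_percent else stable
```

With `stable="models:/fraud-classifier/3"`, `canary="models:/fraud-classifier/4"`, and `canary_percent=10`, roughly one key in ten resolves to v4. Raising the percentage only moves additional keys onto the canary, because each key's bucket never changes.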
For teams with simple needs (one model, one production environment, one stage), MLflow Registry is enough. For more complex setups, the Registry is a starting point and you add discipline around it.
We use MLflow for what it's good at and supplement for the rest:
MLflow Tracking — every training run logged. Used as the experiment notebook. Long retention isn't a goal; we keep ~6 months.
MLflow Registry — staging area. Trained models that look promising get registered. Comparison and selection happen here.
Production gates — a separate, simpler system (a GitOps repo with a YAML file per model) controls what's actually deployed. PRs to that repo are the production deploy mechanism. Approvals, audit, traffic split rules all live there.
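The per-model YAML file is not reproduced here, so the following is only a sketch of what a CI check on that repo might do once a manifest has been parsed into a dict. The schema (a `model` name plus per-region version weights), the region names, and the `validate_manifest` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RegionRollout:
    """Traffic split for one region: model version -> percent of traffic."""
    weights: Dict[str, int]


def validate_manifest(manifest: dict) -> Dict[str, RegionRollout]:
    """Check a parsed per-model manifest and return its per-region rollouts.

    Hypothetical schema: {"model": str, "regions": {region: {version: percent}}}.
    """
    if not manifest.get("model"):
        raise ValueError("manifest must name a model")
    regions = manifest.get("regions") or {}
    if not regions:
        raise ValueError("manifest must list at least one region")
    rollouts = {}
    for region, weights in regions.items():
        if any(w < 0 for w in weights.values()):
            raise ValueError(f"{region}: negative traffic weight")
        if sum(weights.values()) != 100:
            raise ValueError(f"{region}: weights must sum to 100")
        rollouts[region] = RegionRollout(weights=dict(weights))
    return rollouts


# What a parsed manifest might look like (values are illustrative only).
example = {
    "model": "fraud-classifier",
    "regions": {
        "us-east": {"3": 90, "4": 10},  # canary: 10% of traffic on v4
        "eu-west": {"3": 100},          # still fully on v3
    },
}
rollouts = validate_manifest(example)
```

Keeping the split per region in the file is one way to express both the multi-region case and the canary case in the same reviewed change.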
The flow:
MLflow handles steps 1-2 well. The repo handles 3-4 with the same discipline we apply to other production changes.
For every training run:
Last one is the most underrated. If we change preprocessing and re-train, the data hash changes; we can prove which preprocessing version a particular model used.
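One simple way to produce such a data hash, assuming the preprocessed training data sits in a single file, is a streamed SHA-256. The helper name `file_sha256` is a placeholder, not part of MLflow.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Inside an active run the result could be attached with `mlflow.set_tag("data_sha256", file_sha256("train.parquet"))`; the tag name and file path here are illustrative. Any change to the preprocessing output changes the digest, which is what makes the comparison between two models' training data possible.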
We're selective about what gets registered:
Random ad-hoc experiments stay in Tracking only. The Registry stays clean.
A model going to production must pass:
MLflow handles step 1 well via the Tracking artifacts. Steps 2-4 are the production-side discipline that MLflow Registry alone doesn't cover.
Weights & Biases. Better experiment tracking UI than MLflow. More polished. Costs more. We use it for some research teams; production registry workflows are still MLflow.
Vertex AI Model Registry / SageMaker Model Registry. Cloud-vendor specific. Tighter integration with their training and serving. Lock-in concerns.
Neptune.ai. Similar to W&B. We didn't evaluate deeply.
Custom in Postgres + S3. A few teams roll their own. Tempting at small scale; pays back if your needs diverge from off-the-shelf tools. We didn't go this route.
For our shape (mixed-cloud, want portability, ~10 models in production), MLflow + GitOps for production gates is the right balance. A pure-AWS shop might prefer SageMaker Registry; a pure-GCP shop, Vertex.
MLflow does a lot, and the conflation of tracking and registry is a feature for simple cases. For production-grade ML workflows, separating the experiment side from the production-gate side gives you discipline where you need it (production) without slowing down where you don't (research). The split is the working pattern.