Tracking experiments and shipping models are different problems. MLOps tooling tends to treat them as one; production forces you to split them. These are the patterns we use.

MLOps — Model Registry vs MLflow Tracking, And When You Need Both

The MLOps tooling space conflates two distinct workflows: tracking experiments (which model gave the best results in research) and managing production models (which model is live, how do we promote a new one, what was the previous one). MLflow does both reasonably well, which is why it's popular, but the workflows have different requirements and most teams need different patterns for each.

After running ML in production for ~18 months, this is the split we've landed on.

The two workflows #

Experiment tracking. A data scientist trains 50 variations of a model with different hyperparameters, data splits, and feature sets. They want to compare which performed best, see how training went, and reproduce a specific run. The artifact at the end isn't necessarily going to production; it's research output.

Model registry. A specific trained model goes from "this is the best one" to "this is in production." It needs versioning, a promotion workflow (staging → prod), rollback, and a clear answer to "what's deployed right now?"

MLflow has "Tracking" (the experiment side) and "Model Registry" (the production side). They share the underlying model storage but are conceptually different.

What MLflow Tracking is good at #

For the experiment side:

mlflow.start_run() wraps a training run; inside it you log parameters, metrics, and artifacts (the model file, evaluation outputs).
UI showing every run, sortable by metric, filterable by parameter. Compare runs side-by-side.
mlflow.log_artifact for non-model files (evaluation plots, sample predictions).
Search API for programmatic queries ("show me runs where val_accuracy > 0.9 and feature_set = 'v3'").

For research, this is great. Data scientists can churn through experiments and the results are systematically captured.
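
The search query quoted in the list above can be sketched in code. This is a minimal example, assuming `val_accuracy` was logged as a metric and `feature_set` as a parameter; the helper names and the experiment name passed in are hypothetical. The filter string is built by a pure function so it can be checked without a tracking server, and `mlflow` is imported only when the query actually runs.

```python
def build_run_filter(min_val_accuracy: float, feature_set: str) -> str:
    """Build an MLflow search filter string.

    Assumes val_accuracy was logged as a metric and feature_set as a param.
    """
    return (
        f"metrics.val_accuracy > {min_val_accuracy} "
        f"and params.feature_set = '{feature_set}'"
    )


def find_best_runs(experiment_name: str, min_val_accuracy: float, feature_set: str):
    """Return matching runs, best validation accuracy first, as a DataFrame."""
    import mlflow  # imported lazily; needs a configured tracking URI

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=build_run_filter(min_val_accuracy, feature_set),
        order_by=["metrics.val_accuracy DESC"],
    )
```

Calling `find_best_runs("fraud-classifier-experiments", 0.9, "v3")` would return one row per matching run, which is usually enough to pick a candidate for registration.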

What it's not designed for:

Long-term archival of every experiment forever.
Compliance/auditing of which model went to production when.
Rollback workflows.

What the Model Registry adds #

The Registry is a layer on top: take a specific Tracking run's model, register it as a versioned artifact, manage lifecycle (Staging → Production → Archived).

Concretely:

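```python
import mlflow
from mlflow.tracking import MlflowClient

# ID of the tracking run whose model should be registered.
# Placeholder: substitute a real run ID from your tracking server.
run_id = "<run-id>"

# Register the model logged under the run's "model" artifact path.
# This creates a new version of the registered model "fraud-classifier".
mlflow.register_model(
    f"runs:/{run_id}/model",
    "fraud-classifier",
)

# Promote version 3 to the Production stage.
# Note: stage transitions are deprecated in newer MLflow releases in favor
# of model version aliases; this call matches the stage-based workflow here.
client = MlflowClient()
client.transition_model_version_stage(
    name="fraud-classifier",
    version=3,
    stage="Production",
)
```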

Now mlflow.pyfunc.load_model("models:/fraud-classifier/Production") loads whichever version is in the Production stage. Promotion is a single API call.

This is genuinely useful for the production side. But it's also limited in ways that matter for serious production use.

Where MLflow Registry falls short for production #

A few patterns we found awkward:

Stage transitions are global. "Production" is one named slot. Multi-region deployments where different regions might be on different versions are awkward.

No traffic split / canary built-in. "Move from v3 to v4" is atomic — you can't easily say "10% to v4, 90% to v3." We layered our own routing in front for canary.

Limited audit trail by default. Stage transitions are logged but the UI doesn't make it easy to ask "who promoted what when." We layered our own approval workflow on top.

Tied to MLflow Tracking's lineage. A model that was trained outside MLflow (a SageMaker job, an external pipeline) requires manual registration. Doable but adds friction.

For teams with simple needs (one model, one production environment, one stage), MLflow Registry is enough. For more complex setups, the Registry is a starting point and you add discipline around it.
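
The canary routing mentioned above is not shown in this article, so here is a minimal sketch of one common approach: hash a stable request key into one of 100 buckets and send the lowest buckets to the canary version. The function names and the choice of request key are assumptions for illustration, not a description of our actual router.

```python
import hashlib


def canary_bucket(request_key: str) -> int:
    """Map a request key to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_version(request_key: str, stable: str, canary: str, canary_percent: int) -> str:
    """Return the model version that should serve this request."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # The same key always lands in the same bucket, so a given user or
    # request ID sees a consistent model version during the rollout.
    return canary if canary_bucket(request_key) < canary_percent else stable
```

The returned version can then be loaded explicitly, for example with a `models:/fraud-classifier/<version>` URI, instead of relying on the single global Production slot.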

The split that worked for us #

We use MLflow for what it's good at and supplement for the rest:

MLflow Tracking — every training run logged. Used as the experiment notebook. Long retention isn't a goal; we keep ~6 months.

MLflow Registry — staging area. Trained models that look promising get registered. Comparison and selection happen here.

Production gates — a separate, simpler system (a GitOps repo with a YAML file per model) controls what's actually deployed. PRs to that repo are the production deploy mechanism. Approvals, audit, traffic split rules all live there.

The flow:

1. Data scientist trains model in MLflow Tracking.
2. Best variant gets registered in MLflow Registry.
3. SRE/ML engineer opens a PR against the model-config repo updating the version pointer.
4. PR review = production gate. On merge, deployment automation rolls out the new model with canary.

MLflow handles steps 1-2 well. The repo handles 3-4 with the same discipline we apply to other production changes.
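
The per-model file in the config repo is not reproduced here, so the following is only a sketch of the kind of check CI could run on a PR once the YAML has been parsed into a dictionary. The field names (`model_name`, `registry_version`, `canary_percent`) are hypothetical.

```python
def validate_deploy_config(cfg: dict) -> list[str]:
    """Return a list of problems with a parsed model config; empty means valid."""
    errors = []

    name = cfg.get("model_name")
    if not isinstance(name, str) or not name:
        errors.append("model_name must be a non-empty string")

    # The version pointer refers to a version in the MLflow Registry.
    version = cfg.get("registry_version")
    if isinstance(version, bool) or not isinstance(version, int) or version < 1:
        errors.append("registry_version must be a positive integer")

    # Share of traffic the rollout starts with.
    canary = cfg.get("canary_percent")
    if isinstance(canary, bool) or not isinstance(canary, int) or not 0 <= canary <= 100:
        errors.append("canary_percent must be an integer between 0 and 100")

    return errors


example = {"model_name": "fraud-classifier", "registry_version": 4, "canary_percent": 5}
problems = validate_deploy_config(example)
```

Failing the PR check when the list is non-empty keeps malformed version pointers from ever reaching the deployment automation.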

What we log in Tracking #

For every training run:

Parameters. Hyperparameters, dataset version, feature set version, random seeds.
Metrics. Training metrics (loss curves), validation metrics, test metrics on held-out sets.
The trained model. Pickle/joblib/whatever the framework uses; MLflow standardizes the wrapper.
The evaluation outputs. Confusion matrices, sample predictions on canary inputs, calibration plots. As MLflow artifacts.
The data hash. A hash of the training data so we know which version was used.

The last one is the most underrated. If we change preprocessing and retrain, the data hash changes, so we can prove which preprocessing version a particular model used.
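
One way to compute such a data hash is sketched below. It assumes the training data is a directory of files and hashes relative paths plus contents in sorted order, so the result is stable across machines; the function name and directory layout are assumptions.

```python
import hashlib
from pathlib import Path


def hash_training_data(data_dir: str) -> str:
    """SHA-256 over relative file paths and file contents, in sorted order."""
    root = Path(data_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames change the hash too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as handle:
            # Read in 1 MiB chunks to avoid loading large files into memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return digest.hexdigest()
```

Inside a training run, the result can be recorded with `mlflow.log_param("data_hash", hash_training_data("data/train"))`, where the path is a placeholder for wherever the preprocessed data lives.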

What goes in the Registry #

We're selective about what gets registered:

Models that pass evaluation thresholds (not every experiment).
One per "candidate for production" — clear lineage.
With explicit metadata: training data version, evaluation results summary, who initiated, when.

Random ad-hoc experiments stay in Tracking only. The Registry stays clean.
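
The explicit metadata can be attached as tags at registration time. The sketch below assumes a recent MLflow 2.x release, where `mlflow.register_model` accepts a `tags` argument; the tag keys and helper names are hypothetical.

```python
from datetime import datetime, timezone


def build_registry_tags(data_version, eval_summary, initiated_by, registered_at=None):
    """Flatten registration metadata into string tags."""
    when = registered_at or datetime.now(timezone.utc)
    tags = {
        "training_data_version": data_version,
        "initiated_by": initiated_by,
        "registered_at": when.isoformat(),
    }
    # One tag per summary metric, e.g. eval.accuracy = "0.9300".
    for metric, value in sorted(eval_summary.items()):
        tags[f"eval.{metric}"] = f"{value:.4f}"
    return tags


def register_candidate(run_id, model_name, tags):
    """Register the run's model as a new version carrying the given tags."""
    import mlflow  # imported lazily; needs a configured tracking URI

    return mlflow.register_model(f"runs:/{run_id}/model", model_name, tags=tags)
```

Keeping the tag construction separate from the registration call makes it easy to enforce that no candidate is registered without its metadata.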

The eval workflow #

A model going to production must pass:

1. Offline eval suite. A set of held-out test cases the model is evaluated against. Pass/fail per metric (accuracy, calibration, fairness measures we track).
2. Regression eval. Compare new model against current production on a labeled "regression set" of historical failures. New must not regress on these.
3. Shadow traffic. Run the model on production traffic, log predictions, but don't act on them. Compare against current production for some time.
4. Canary deploy. Roll to 5%, observe, expand.

MLflow handles step 1 well via the Tracking artifacts. Steps 2-4 are the production-side discipline that MLflow Registry alone doesn't cover.
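
The first two gates reduce to simple checks. This is a sketch under two assumptions: every tracked metric is "higher is better", and a regression means a case that production gets right and the candidate gets wrong. The function names are hypothetical.

```python
def failed_thresholds(metrics: dict, minimums: dict) -> list[str]:
    """Names of metrics that are missing or below their required minimum."""
    return [
        name
        for name, minimum in sorted(minimums.items())
        if metrics.get(name, float("-inf")) < minimum
    ]


def find_regressions(labels, prod_preds, new_preds) -> list[int]:
    """Indices where production is correct but the candidate is wrong."""
    if not (len(labels) == len(prod_preds) == len(new_preds)):
        raise ValueError("inputs must have the same length")
    return [
        i
        for i, (y, old, new) in enumerate(zip(labels, prod_preds, new_preds))
        if old == y and new != y
    ]
```

A candidate moves on to shadow traffic only when both functions return empty lists.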

Alternatives we evaluated #

Weights & Biases. Better experiment tracking UI than MLflow. More polished. Costs more. We use it for some research teams; production registry workflows are still MLflow.

Vertex AI Model Registry / SageMaker Model Registry. Cloud-vendor specific. Tighter integration with their training and serving. Lock-in concerns.

Neptune.ai. Similar to W&B. We didn't evaluate deeply.

Custom in Postgres + S3. A few teams roll their own. It's tempting at small scale and pays back if your needs diverge from off-the-shelf tools. We didn't go this route.

For our shape (mixed-cloud, want portability, ~10 models in production), MLflow + GitOps for production gates is the right balance. A pure-AWS shop might prefer SageMaker Model Registry; a pure-GCP shop, Vertex AI Model Registry.

What we monitor #

Production model freshness. When was the current production version registered? Old models often perform worse on new data, so we flag any model that is more than 90 days old and has not been re-evaluated.
Eval suite pass rate. Trends in the eval results across registered candidates. Catches data quality issues.
Promotion frequency. Healthy teams promote when there's a real improvement. Be suspicious if it's been weeks since the last promotion (no improvements?) or if promotions happen every few days (chasing noise?).
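
The freshness and promotion-frequency checks can be sketched as two small functions. The 90-day limit comes from the text above; the 3-day and 21-day cadence thresholds and the function names are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_stale(registered_at, last_evaluated_at, now, max_age_days=90):
    """True if the model was neither registered nor re-evaluated recently."""
    last_checked = max(registered_at, last_evaluated_at or registered_at)
    return now - last_checked > timedelta(days=max_age_days)


def promotion_signal(promotion_times, now, too_fast_days=3, too_slow_days=21):
    """Classify promotion cadence from a list of promotion timestamps."""
    if not promotion_times:
        return "no_promotions"
    ordered = sorted(promotion_times)
    # Weeks since the last promotion: maybe no real improvements.
    if now - ordered[-1] > timedelta(days=too_slow_days):
        return "stalled"
    # Average gap of only a few days: maybe chasing noise.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if gaps and sum(gaps, timedelta()) / len(gaps) < timedelta(days=too_fast_days):
        return "possibly_chasing_noise"
    return "ok"
```

Both functions take `now` as an argument rather than reading the clock, which keeps them easy to test and to run over historical data.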

What to read next #

LLM evals that actually predict production quality — the eval discipline for LLM-specific cases
AI model deployment strategies — from development to production — the deployment side this connects to
Production AI pipelines — building end-to-end ML systems — the broader picture
AI observability — monitoring LLM performance in production — what happens after the model is live

MLflow does a lot, and the conflation of tracking and registry is a feature for simple cases. For production-grade ML workflows, separating the experiment side from the production-gate side gives you discipline where you need it (production) without slowing down where you don't (research). The split is the working pattern.

About Kiril Urbonas
