Week 7 · Part IV · MLOps

Experiment Tracking, Model Registry & Serving

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Track experiments so any result is reproducible.
Version models in a registry with governed stage transitions.
Serve a model behind a REST endpoint and roll it out safely.

Tools this week

MLflow tracking · MLflow registry · BentoML / FastAPI serving · model cards

🎓Lecture · 2 hours

0:00-0:10	10 min	Recap & objectives	Retrieval: train/serve skew; when streaming wins. Today: the model lifecycle, from training run to live endpoint.
0:10-0:25	15 min	Motivation: which run made this model?	Story: the production model whose training data, code version, and hyperparameters nobody could name during an audit. Training is the easy part; versioning, serving, and safely replacing models are where production ML lives. Everything today reuses weeks 3 to 5: the API contract, the canary, the data version.
0:25-0:50	25 min	Experiment tracking, properly	What to log: parameters, metrics over time (curves, not just finals), artifacts, environment. The reproducibility triple: git SHA + data version (week 5's DVC snapshot) + environment, pinned to every run. Comparing runs in the MLflow UI; reading a learning curve versus a final number. Tracking applies to LLM work too: prompt versions and eval scores are runs (weeks 9 to 10 will reuse this).
0:50-1:10	20 min	The model registry & model cards	A model version is an artifact plus its lineage, not a file on a laptop; the registry doubles as the model catalogue: discovery, ownership, and governance in one place. Registry stages: staging, production, archived; promotion as a governed, auditable transition. Rollback as a first-class operation: the previous version is one transition away. Model cards: intended use, training data, metrics, limitations; documentation that ships with the artifact.
1:10-1:20	10 min	Break
1:20-1:40	20 min	Serving patterns	Packaging: the model plus pinned dependencies, containerised like any service. Online (request/response), batch (precomputed), and streaming scoring; choosing by freshness need and cost. The serving endpoint is a REST API: week 3's contracts, validation, and health checks apply unchanged. Latency engineering: model warm-up, request batching, the p95 budget.
1:40-1:55	15 min	Safe rollout for models (predict, then run)	Shadow: real traffic, responses unused; the free lunch of rollout safety. Canary and A/B for models; what online metric decides promotion. Predict: offline AUC up 2 points; what can still go wrong online? Then the offline-online gap, demonstrated with a latency regression.
1:55-2:00	5 min	Wrap-up & practice preview	Practice takes the project model from tracked run to canaried endpoint.

Common misconception to confront.

Students often think: A good offline metric guarantees a good online result.
Set it straight: Offline metrics use historical data and proxy objectives. Feedback loops, latency, and distribution shift mean online behaviour can differ, so validate with shadow, canary, or A/B.

Check for understanding (pose during the concept blocks; let students answer before revealing).

What three things must a tracked run pin to be reproducible?

The code version (git SHA), the data version, and the environment. Hyperparameters are logged as run parameters on top of this triple.
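Demo sketch for the reproducibility triple, standard library only. It gathers the three pins as flat string tags. The helper names (`current_git_sha`, `file_fingerprint`, `reproducibility_tags`) are hypothetical, and the file hash is only a stand-in for the DVC data version the project actually uses.

```python
import hashlib
import platform
import subprocess
from pathlib import Path
from typing import Dict, Optional


def current_git_sha() -> str:
    """Return the checked-out commit SHA, or 'unknown' outside a git repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def file_fingerprint(path: Path) -> str:
    """SHA-256 of a file's bytes; a stand-in for a DVC data version."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reproducibility_tags(data_path: Path, git_sha: Optional[str] = None) -> Dict[str, str]:
    """Collect code, data, and environment pins as flat string tags."""
    return {
        "git_sha": git_sha if git_sha is not None else current_git_sha(),
        "data_version": file_fingerprint(data_path),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
```

With MLflow, a dictionary like this could be attached to the active run via `mlflow.set_tags(...)`; a fuller environment pin would also record the locked dependency list, which this sketch leaves out.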

What is a shadow deployment?

Sending real traffic to the new model in parallel without using its responses, to compare it against production safely.
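Demo sketch of the shadow pattern, assuming both models are plain Python callables that map a feature dictionary to a score. `make_shadow_handler` and the `comparisons` list are hypothetical names for illustration; a real deployment would more likely mirror traffic at the gateway or score the shadow asynchronously.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("shadow")

Predictor = Callable[[Dict[str, Any]], float]


def make_shadow_handler(
    production: Predictor,
    shadow: Predictor,
    comparisons: List[Dict[str, Any]],
) -> Predictor:
    """Build a handler that serves production and records the shadow's output."""

    def handle(features: Dict[str, Any]) -> float:
        served = production(features)
        try:
            candidate = shadow(features)
            comparisons.append(
                {"features": features, "production": served, "shadow": candidate}
            )
        except Exception:
            # A failing shadow must never affect the user-facing response.
            logger.exception("shadow model failed")
        return served  # the caller only ever sees the production result

    return handle
```

The shadow call here runs inline for brevity, so it adds its latency to the request; moving it off the request path is the usual next step.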

Key takeaways.

A metric is reproducible only when tied to code, data, and environment.
A registry gives governed, auditable model promotion and rollback.
Offline metrics are a proxy; confirm online with shadow or canary.
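To make "governed, auditable promotion and rollback" concrete, a toy in-memory registry can be shown on the board. This is not the MLflow API; `ToyRegistry` and its methods are hypothetical, and it assumes a single production slot with the three stages from the lecture.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

STAGES = ("staging", "production", "archived")


@dataclass
class ToyRegistry:
    """In-memory registry: one production slot plus an append-only audit log."""

    stages: Dict[int, str] = field(default_factory=dict)
    # Each audit entry: (version, from_stage, to_stage, actor).
    audit_log: List[Tuple[int, str, str, str]] = field(default_factory=list)
    _previous_production: Optional[int] = None

    def register(self, version: int, actor: str) -> None:
        self._transition(version, "staging", actor)

    def promote(self, version: int, actor: str) -> None:
        if self.stages.get(version) != "staging":
            raise ValueError(f"v{version} must be in staging before promotion")
        current = self.production_version()
        if current is not None:
            self._transition(current, "archived", actor)
        self._previous_production = current
        self._transition(version, "production", actor)

    def rollback(self, actor: str) -> None:
        """Restore the previous production version in one operation."""
        if self._previous_production is None:
            raise ValueError("no previous production version to restore")
        failed = self.production_version()
        if failed is not None:
            self._transition(failed, "archived", actor)
        self._transition(self._previous_production, "production", actor)
        self._previous_production = None

    def production_version(self) -> Optional[int]:
        for version, stage in self.stages.items():
            if stage == "production":
                return version
        return None

    def _transition(self, version: int, to_stage: str, actor: str) -> None:
        if to_stage not in STAGES:
            raise ValueError(f"unknown stage: {to_stage}")
        from_stage = self.stages.get(version, "none")
        self.stages[version] = to_stage
        self.audit_log.append((version, from_stage, to_stage, actor))
```

Every stage change goes through `_transition`, so the audit log answers "who moved which version where" without extra bookkeeping.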

📚Reading & resources

Designing Machine Learning Systems, ch. 6 to 7 (Huyen): model development, offline evaluation, deployment and prediction services.
MLflow documentation: tracking and the model registry; used end to end in the practice session.
Hidden Technical Debt in Machine Learning Systems (Sculley et al., NeurIPS 2015): why the system around the model dominates.
Machine Learning Engineering, the deployment and serving chapters (Burkov): compact treatment of packaging, serving modes, and rollout.
BentoML documentation: a serving runtime in practice; an alternative to hand-rolled FastAPI serving.

💻Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10	10 min	Setup & recap	Open the project's training script and the MLflow UI. Recap the reproducibility triple.
0:10-0:35	25 min	Instrument and track	Instrument the project's training: params, metric curves, artifacts. Pin the git SHA and the DVC data version to the run; show the lineage end to end. Train twice with different hyperparameters; compare runs in the UI.
0:35-1:00	25 min	Register, promote, roll back	Register the better run as model v1; write its model card from the run metadata. Promote staging to production; then practice the rollback transition. Chatbot and document teams: the 'model' may be a prompt + model-tier configuration; track it the same way.
1:00-1:10	10 min	Break
1:10-1:35	25 min	Serve it	Wrap the registered model in a serving runtime behind the project's REST API. Health probes, input validation, and warm-up; deploy to the cluster from week 4. Load-test; read p95 on the RED dashboard; compare to the recorded baseline.
1:35-1:50	15 min	Canary the model	Ship model v2 as a canary next to v1, exactly like week 4's service canary. Compare per-version predictions and latency; promote or roll back on evidence.
1:50-2:00	10 min	Project-integration brief	The 'Project integration' card: tracked, registered, served, canaried; the model lifecycle is now demonstrable for Presentation 2.
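For the "promote or roll back on evidence" step, a small latency gate can be sketched as follows. It assumes latencies have been exported from the load test as lists of milliseconds; the function names and the 10% regression allowance are illustrative assumptions, not a course requirement. Prediction quality would need its own gate alongside this one.

```python
from typing import Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty latency sample."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n gives the 1-based nearest rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def canary_decision(
    baseline_ms: Sequence[float],
    canary_ms: Sequence[float],
    max_regression: float = 0.10,
) -> str:
    """Return 'promote' if canary p95 stays within the allowed regression."""
    budget = p95(baseline_ms) * (1.0 + max_regression)
    return "promote" if p95(canary_ms) <= budget else "rollback"
```

A canary whose slowest tenth of requests doubles in latency fails this gate even if its median is unchanged, which is the reason to gate on p95 and not on the mean.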

Common pitfalls to pre-empt.

An unpinned data version makes the run impossible to reproduce.
Serving a pickled model without pinned library versions can break silently when the serving environment drifts from the training environment.
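One way to make the second pitfall fail loudly is to save a version manifest next to the pickle at training time and check it at load time. The sketch below uses only the standard library; `build_manifest` and `check_manifest` are hypothetical helpers, and the `get_version` parameter exists so the check can be demonstrated without the packages installed.

```python
import json
from importlib import metadata
from typing import Callable, Dict, Iterable, Tuple


def build_manifest(
    packages: Iterable[str],
    get_version: Callable[[str], str] = metadata.version,
) -> str:
    """Record the installed version of each package as a JSON manifest."""
    return json.dumps({name: get_version(name) for name in packages}, sort_keys=True)


def check_manifest(
    manifest_json: str,
    get_version: Callable[[str], str] = metadata.version,
) -> None:
    """Raise if any installed version differs from the training-time manifest."""
    expected: Dict[str, str] = json.loads(manifest_json)
    mismatches: Dict[str, Tuple[str, str]] = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = "not installed"
        if found != wanted:
            mismatches[name] = (wanted, found)
    if mismatches:
        raise RuntimeError(f"library versions differ from training: {mismatches}")
```

Calling `check_manifest` before unpickling turns a silent behaviour change into a startup error that a health probe can catch.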

Project integration (this week)

Project training instrumented with MLflow; reproducibility triple pinned on every run.
Model v1 registered with a model card; promotion and rollback demonstrated.
Model served behind the project's REST API on the cluster, with p95 recorded; v2 canaried next to v1.
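Teams that want a starting point for the model card can render one from run metadata. The sketch below assumes the four fields named in the lecture (intended use, training data, metrics, limitations); the `ModelCard` class and its plain-text layout are illustrative, not a prescribed template.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModelCard:
    """Minimal model card holding the fields covered in the lecture."""

    name: str
    version: int
    intended_use: str
    training_data: str
    metrics: Dict[str, float]
    limitations: List[str]

    def render(self) -> str:
        """Render the card as plain text to store next to the model artifact."""
        lines = [
            f"Model card: {self.name} v{self.version}",
            f"Intended use: {self.intended_use}",
            f"Training data: {self.training_data}",
            "Metrics:",
        ]
        lines += [f"  {key}: {value:.3f}" for key, value in sorted(self.metrics.items())]
        lines.append("Limitations:")
        lines += [f"  - {item}" for item in self.limitations]
        return "\n".join(lines)
```

Filling `training_data` with the pinned data version and `metrics` with the tracked run's final values keeps the card consistent with the run it documents.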
