Engineering of AI Systems · HIT

Week 7 · Part IV · MLOps

Experiment Tracking, Model Registry & Serving

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Tools this week

MLflow tracking · MLflow registry · BentoML / FastAPI serving · model cards

🎓 Lecture · 2 hours

0:00-0:10 · 10 min · Recap & objectives
  • Retrieval: train/serve skew; when streaming wins.
  • Today: the model lifecycle, from training run to live endpoint.
0:10-0:25 · 15 min · Motivation: which run made this model?
  • Story: the production model whose training data, code version, and hyperparameters nobody could name during an audit.
  • Training is the easy part; versioning, serving, and safely replacing models is where production ML lives.
  • Everything today reuses weeks 3 to 5: the API contract, the canary, the data version.
0:25-0:50 · 25 min · Experiment tracking, properly
  • What to log: parameters, metrics over time (curves, not just finals), artifacts, environment.
  • The reproducibility triple: git SHA + data version (week 5's DVC snapshot) + environment, pinned to every run.
  • Comparing runs in the MLflow UI; reading a learning curve versus a final number.
  • Tracking applies to LLM work too: prompt versions and eval scores are runs (weeks 9 to 10 will reuse this).
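A minimal sketch of what "pinned to every run" can look like, using only the Python standard library. The RunRecord class, its field names, and the placeholder SHA are assumptions made for illustration, not MLflow's API; in MLflow the same information would typically be stored as run parameters, stepped metrics, and tags.

```python
import hashlib
import json
import platform
import sys
from dataclasses import asdict, dataclass, field


def data_fingerprint(data: bytes) -> str:
    """Stand-in for a data version: short SHA-256 of the dataset bytes."""
    return hashlib.sha256(data).hexdigest()[:12]


def environment_snapshot() -> dict:
    """Minimal environment record; a real one would also pin package versions."""
    return {"python": platform.python_version(), "platform": sys.platform}


@dataclass
class RunRecord:
    # The reproducibility triple.
    git_sha: str
    data_version: str
    environment: dict
    # What was tried and what happened.
    params: dict
    metrics: dict = field(default_factory=dict)  # name -> list of (step, value)

    def log_metric(self, name: str, value: float, step: int) -> None:
        """Append one point to a metric curve instead of overwriting a final."""
        self.metrics.setdefault(name, []).append((step, value))

    def final(self, name: str) -> float:
        """Last logged value of a metric."""
        return self.metrics[name][-1][1]

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


run = RunRecord(
    git_sha="0000000",  # placeholder; in practice the output of `git rev-parse HEAD`
    data_version=data_fingerprint(b"feature,label\n1.0,0\n2.0,1\n"),
    environment=environment_snapshot(),
    params={"learning_rate": 0.1, "epochs": 3},
)
for step, loss in enumerate([0.9, 0.6, 0.5]):  # illustrative loss values
    run.log_metric("loss", loss, step=step)

print(run.to_json())
```

Keeping the whole curve rather than only run.final("loss") is what lets students later tell a run that is still improving from one that has plateaued.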
0:50-1:10 · 20 min · The model registry & model cards
  • A model version is an artifact plus its lineage, not a file on a laptop; the registry doubles as the model catalogue: discovery, ownership, and governance in one place.
  • Registry stages: staging, production, archived; promotion as a governed, auditable transition.
  • Rollback as a first-class operation: the previous version is one transition away.
  • Model cards: intended use, training data, metrics, limitations; documentation that ships with the artifact.
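The stage transitions above can be illustrated with a small in-memory registry. This is a teaching sketch under stated assumptions: the Registry and ModelVersion classes, the "none" starting stage, and the audit-log tuple layout are invented for the example and do not mirror any real registry's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelVersion:
    version: int
    run_id: str          # lineage: the tracked run that produced the artifact
    stage: str = "none"  # none -> staging -> production -> archived


@dataclass
class Registry:
    versions: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)  # (actor, version, old, new)
    previous_production: Optional[int] = None

    def register(self, run_id: str) -> ModelVersion:
        mv = ModelVersion(version=len(self.versions) + 1, run_id=run_id)
        self.versions[mv.version] = mv
        return mv

    def _set_stage(self, version: int, stage: str, actor: str) -> None:
        """Every transition is recorded, which is what makes it auditable."""
        mv = self.versions[version]
        self.audit_log.append((actor, version, mv.stage, stage))
        mv.stage = stage

    def production(self) -> Optional[ModelVersion]:
        return next(
            (m for m in self.versions.values() if m.stage == "production"), None
        )

    def to_staging(self, version: int, actor: str) -> None:
        self._set_stage(version, "staging", actor)

    def promote(self, version: int, actor: str) -> None:
        """Move a staging version to production, archiving the current one."""
        if self.versions[version].stage != "staging":
            raise ValueError("only staging versions can be promoted")
        current = self.production()
        if current is not None:
            self._set_stage(current.version, "archived", actor)
            self.previous_production = current.version
        self._set_stage(version, "production", actor)

    def rollback(self, actor: str) -> ModelVersion:
        """Swap production back to the version it replaced."""
        current = self.production()
        if current is None or self.previous_production is None:
            raise ValueError("nothing to roll back to")
        target = self.previous_production
        self._set_stage(current.version, "archived", actor)
        self._set_stage(target, "production", actor)
        self.previous_production = current.version
        return self.versions[target]


registry = Registry()
for run_id in ("run-a", "run-b"):  # hypothetical run identifiers
    mv = registry.register(run_id)
    registry.to_staging(mv.version, actor="alice")
    registry.promote(mv.version, actor="alice")

restored = registry.rollback(actor="alice")
print(restored.version, [m.stage for m in registry.versions.values()])
```

The design point is that rollback needs no retraining and no redeploy of code: the previous version is still registered, so restoring it is a single recorded transition.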
1:10-1:20 · 10 min · Break
1:20-1:40 · 20 min · Serving patterns
  • Packaging: the model plus pinned dependencies, containerised like any service.
  • Online (request/response), batch (precomputed), and streaming scoring; choosing by freshness need and cost.
  • The serving endpoint is a REST API: week 3's contracts, validation, and health checks apply unchanged.
  • Latency engineering: model warm-up, request batching, the p95 budget.
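To make the p95 budget concrete, here is a small sketch of a nearest-rank percentile and a budget check. The latency samples and the 100 ms budget are invented for illustration; monitoring stacks often use interpolated or histogram-based percentiles instead, so treat this as one reasonable definition rather than the only one.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100.

    Returns the smallest sample such that at least pct% of samples are <= it.
    Integer arithmetic avoids floating-point surprises at exact boundaries.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def within_budget(latencies_ms, budget_ms, pct=95):
    """True when the chosen percentile of latency meets the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 14, 15, 15, 16, 17, 18, 19, 21, 22,
                23, 24, 25, 26, 28, 30, 33, 41, 58, 240]

print("p50:", percentile(latencies_ms, 50))   # 22
print("p95:", percentile(latencies_ms, 95))   # 58
print("p99:", percentile(latencies_ms, 99))   # 240
print("meets 100 ms p95 budget:", within_budget(latencies_ms, 100))  # True
```

With these samples the mean is 34.85 ms, which sits above the median and below the tail; that gap is the reason the budget is stated as a percentile rather than an average.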
1:40-1:55 · 15 min · Safe rollout for models (predict, then run)
  • Shadow: real traffic, responses unused; the free lunch of rollout safety.
  • Canary and A/B for models; what online metric decides promotion.
  • Predict: offline AUC is up 2 points; what can still go wrong online? Then run: the offline-online gap, demonstrated with a latency regression.
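The shadow pattern can be sketched in a few lines. Everything named here (ShadowRouter, the two threshold "models", the amount field) is hypothetical. For simplicity the candidate is called inline; a real deployment would usually call it asynchronously so that it cannot add latency to the caller.

```python
import time


class ShadowRouter:
    """Serve production's answer; run the candidate on the same input and log it."""

    def __init__(self, production, candidate):
        self.production = production
        self.candidate = candidate
        self.comparisons = []

    def predict(self, features):
        response = self.production(features)
        try:
            start = time.perf_counter()
            shadow = self.candidate(features)
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.comparisons.append(
                {"agree": shadow == response, "shadow_ms": elapsed_ms, "error": None}
            )
        except Exception as exc:  # a shadow failure must never reach the caller
            self.comparisons.append(
                {"agree": False, "shadow_ms": None, "error": repr(exc)}
            )
        return response  # the candidate's output is recorded, never returned

    def agreement_rate(self):
        if not self.comparisons:
            raise ValueError("no shadow traffic recorded yet")
        return sum(c["agree"] for c in self.comparisons) / len(self.comparisons)


def production_model(features):
    """Hypothetical current model: flag amounts above 100."""
    return features["amount"] > 100


def candidate_model(features):
    """Hypothetical new model: flag amounts above 80."""
    return features["amount"] > 80


router = ShadowRouter(production_model, candidate_model)
outputs = [router.predict({"amount": a}) for a in (50, 90, 150)]
print(outputs, router.agreement_rate())
```

The caller only ever sees production's answers, while the recorded disagreements and shadow latencies are the evidence used to decide whether the candidate deserves a canary.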
1:55-2:00 · 5 min · Wrap-up & practice preview
  • Practice takes the project model from tracked run to canaried endpoint.
Common misconception to confront.

Students often think: A good offline metric guarantees a good online result.
Set it straight: Offline metrics use historical data and proxy objectives. Feedback loops, latency, and distribution shift mean online behaviour can differ, so validate with shadow, canary, or A/B.

Check for understanding (pose during the concept blocks; let students answer before revealing).
What three things must a tracked run pin to be reproducible?
The code version (git SHA), the data version, and the environment: the reproducibility triple. Hyperparameters are logged alongside as run parameters.
What is a shadow deployment?
Sending real traffic to the new model in parallel without using its responses, to compare it against production safely.
Key takeaways.

📚Reading & resources

💻 Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10 · 10 min · Setup & recap
  • Open the project's training script and the MLflow UI.
  • Recap the reproducibility triple.
0:10-0:35 · 25 min · Instrument and track
  • Instrument the project's training: params, metric curves, artifacts.
  • Pin the git SHA and the DVC data version to the run; show the lineage end to end.
  • Train twice with different hyperparameters; compare runs in the UI.
0:35-1:00 · 25 min · Register, promote, roll back
  • Register the better run as model v1; write its model card from the run metadata.
  • Promote v1 from staging to production; then practice the rollback transition.
  • Chatbot and document teams: the 'model' may be a prompt + model-tier configuration; track it the same way.
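Writing the model card from run metadata can be partly automated. The sketch below assumes a run is available as a plain dictionary with git_sha, data_version, params, and metrics keys; that layout, the model name, and all values shown are made up for illustration. Intended use and limitations cannot be derived from a run, so the function refuses to render a card without them.

```python
def render_model_card(name, version, run, intended_use, limitations):
    """Render a plain-text model card from run metadata plus human-written fields."""
    if not intended_use.strip() or not limitations:
        raise ValueError("intended use and limitations must be written by a person")
    lines = [
        f"Model card: {name} v{version}",
        f"Intended use: {intended_use}",
        f"Training data: version {run['data_version']}",
        f"Code: git {run['git_sha']}",
        "Parameters:",
    ]
    lines += [f"  {key} = {value}" for key, value in sorted(run["params"].items())]
    lines.append("Metrics:")
    lines += [f"  {key} = {value}" for key, value in sorted(run["metrics"].items())]
    lines.append("Limitations:")
    lines += [f"  - {item}" for item in limitations]
    return "\n".join(lines)


# Hypothetical run metadata; for a chatbot team, params could instead hold
# a prompt version and a model tier, and the card is produced the same way.
run = {
    "git_sha": "0000000",
    "data_version": "snapshot-placeholder",
    "params": {"learning_rate": 0.1, "epochs": 3},
    "metrics": {"val_auc": 0.5},  # illustrative number, not a real result
}

card = render_model_card(
    name="churn-classifier",
    version=1,
    run=run,
    intended_use="Rank accounts for retention outreach; not for pricing decisions.",
    limitations=["Trained on one region only.", "Not evaluated on new accounts."],
)
print(card)
```

Generating the card from the run keeps the documented data version and git SHA from drifting away from what was actually trained.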
1:00-1:10 · 10 min · Break
1:10-1:35 · 25 min · Serve it
  • Wrap the registered model in a serving runtime behind the project's REST API.
  • Health probes, input validation, and warm-up; deploy to the cluster from week 4.
  • Load-test; read p95 on the RED dashboard; compare to the recorded baseline.
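Before reaching for a serving framework, the three serving checks (health probes, input validation, warm-up) can be shown framework-free. The ModelService class, the field names, the status codes chosen, and the toy scoring rule are all assumptions for the demo; a FastAPI or BentoML service would express the same ideas through its own request models and probes.

```python
class ModelService:
    """Framework-free sketch of readiness, validation, and warm-up."""

    # Hypothetical input contract: field name -> accepted types.
    REQUIRED = {"tenure_months": (int, float), "monthly_spend": (int, float)}

    def __init__(self, model):
        self.model = model
        self.ready = False

    def warm_up(self, sample):
        """Run one prediction so the first real request does not pay start-up cost."""
        self.model(sample)
        self.ready = True

    def health(self):
        """Liveness: the process is up. Readiness: it can serve predictions."""
        return {"live": True, "ready": self.ready}

    def validate(self, payload):
        errors = []
        for name, types in self.REQUIRED.items():
            if name not in payload:
                errors.append(f"missing field: {name}")
            elif isinstance(payload[name], bool) or not isinstance(payload[name], types):
                errors.append(f"wrong type for {name}")
        return errors

    def predict(self, payload):
        """Return (status_code, body), mimicking an HTTP handler."""
        if not self.ready:
            return 503, {"error": "model not warmed up"}
        errors = self.validate(payload)
        if errors:
            return 422, {"errors": errors}
        return 200, {"score": self.model(payload)}


def toy_model(features):
    """Hypothetical scoring rule standing in for the registered model."""
    return 1.0 if features["monthly_spend"] > 50 else 0.0


service = ModelService(toy_model)
sample = {"tenure_months": 12, "monthly_spend": 80.0}

print(service.predict(sample))                    # rejected: not ready yet
service.warm_up(sample)
print(service.health())
print(service.predict(sample))                    # accepted
print(service.predict({"tenure_months": "12"}))   # rejected: invalid input
```

Separating readiness from liveness matters on a cluster: the orchestrator should keep traffic away from a pod until warm-up has finished, without restarting it.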
1:35-1:50 · 15 min · Canary the model
  • Ship model v2 as a canary next to v1, exactly like week 4's service canary.
  • Compare per-version predictions and latency; promote or roll back on evidence.
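"Promote or roll back on evidence" can be made explicit as a rule the team writes down before the canary starts. The sketch below assumes hash-based traffic assignment and three hypothetical thresholds (error-rate increase, p95 ratio, minimum sample size); the numbers are placeholders each team would replace with its own.

```python
import hashlib


def assign_version(request_id: str, canary_percent: int) -> str:
    """Deterministically route a request: the same id always gets the same version."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "v2" if bucket < canary_percent else "v1"


def decide(baseline, canary, max_error_increase=0.01, max_p95_ratio=1.2,
           min_requests=100):
    """Return 'wait', 'rollback', or 'promote' from per-version statistics.

    All three thresholds are illustrative defaults, not recommendations.
    """
    if canary["requests"] < min_requests:
        return "wait"  # not enough evidence yet
    if canary["error_rate"] > baseline["error_rate"] + max_error_increase:
        return "rollback"
    if canary["p95_ms"] > baseline["p95_ms"] * max_p95_ratio:
        return "rollback"  # a latency regression alone is enough
    return "promote"


# Hypothetical per-version statistics read from the dashboard.
baseline = {"requests": 5000, "error_rate": 0.010, "p95_ms": 80}
slow_canary = {"requests": 500, "error_rate": 0.012, "p95_ms": 140}
good_canary = {"requests": 500, "error_rate": 0.011, "p95_ms": 85}

print(assign_version("req-42", canary_percent=10))
print(decide(baseline, slow_canary))  # rollback
print(decide(baseline, good_canary))  # promote
```

Hashing the request id instead of drawing a random number keeps assignment stable across retries, which makes per-version comparisons easier to trust.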
1:50-2:00 · 10 min · Project-integration brief
  • The 'Project integration' card: tracked, registered, served, canaried; the model lifecycle is now demonstrable for Presentation 2.
Common pitfalls to pre-empt.

Project integration (this week)

