Engineering of AI Systems · HIT

Week 4 · Part II · DevOps

Orchestration, Deployment Patterns & Observability

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Tools this week

Kubernetes (kind), kubectl, Prometheus, Grafana, Argo CD (demo)

🎓 Lecture · 2 hours

0:00-0:10 · 10 min · Recap & objectives
  • Retrieval: the testing pyramid; why version an API.
  • Today: from one container to a managed fleet, and seeing what it does.
0:10-0:25 · 15 min · Motivation: 3 a.m. and the pod is gone
  • One container is easy; forty containers with deploys, crashes, and traffic spikes are why orchestration exists.
  • Story: the deploy that took down checkout because all instances restarted at once; the rollout pattern that would have saved it.
  • You cannot operate what you cannot see: the observability half of today.
0:25-0:50 · 25 min · Kubernetes: the mental model
  • Desired state and reconciliation: you declare, the control loop converges.
  • Pods, Deployments, Services; labels and selectors as the glue.
  • Liveness and readiness probes: the /health endpoint from week 3 finds its purpose.
  • Resource requests and limits; horizontal autoscaling.
  • Board work: trace a request from DNS through the Service to a pod, and what happens when that pod dies.
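Optional demo aid for this block: desired state and reconciliation can be shown in a few lines of Python before any real cluster appears. This is a toy sketch, not the Kubernetes API; the class `ToyCluster` and the pod names `web-N` are invented for illustration.

```python
from dataclasses import dataclass, field
import itertools


@dataclass
class ToyCluster:
    """A toy stand-in for a cluster (hypothetical, not the Kubernetes API)."""
    desired_replicas: int
    pods: set = field(default_factory=set)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def reconcile(self) -> list:
        """One pass of the control loop: compare observed with desired, then act."""
        actions = []
        while len(self.pods) < self.desired_replicas:
            name = f"web-{next(self._ids)}"
            self.pods.add(name)
            actions.append(f"create {name}")
        while len(self.pods) > self.desired_replicas:
            name = sorted(self.pods)[-1]
            self.pods.remove(name)
            actions.append(f"delete {name}")
        return actions


cluster = ToyCluster(desired_replicas=3)
print(cluster.reconcile())      # ['create web-1', 'create web-2', 'create web-3']
cluster.pods.discard("web-2")   # a pod dies
print(cluster.reconcile())      # ['create web-4']
print(cluster.reconcile())      # [] (already converged, nothing to do)
```

The point to draw out: nobody issues a "restart web-2" command. The declared replica count stays at three, and the loop closes the gap with a fresh pod.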
0:50-1:10 · 20 min · Deployment patterns & GitOps
  • Rolling update: the default, and its window of mixed versions.
  • Blue-green: two full environments, one switch, instant rollback, double cost.
  • Canary: a small slice of real traffic watched closely, then ramp; the pattern the project will use for models too.
  • GitOps in one picture: the cluster state lives in Git, an agent (Argo CD) reconciles it; the audit log is the repo history.
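Optional demo aid for this block: the difference between blue-green and canary comes down to what fraction of requests reaches the new version at a given moment. The sketch below simulates that with a weighted coin flip. The function names, the seed, and the request count are illustrative assumptions; a real cluster does the split in a load balancer, ingress, or service mesh, not in application code.

```python
import random
from collections import Counter


def route(canary_weight: float, rng: random.Random) -> str:
    """Send one request to 'v2' with probability canary_weight, else to 'v1'."""
    return "v2" if rng.random() < canary_weight else "v1"


def simulate(canary_weight: float, requests: int = 10_000, seed: int = 0) -> Counter:
    """Count how many simulated requests land on each version."""
    rng = random.Random(seed)
    return Counter(route(canary_weight, rng) for _ in range(requests))


for label, weight in [
    ("blue-green, before the switch", 0.0),
    ("canary at 10%", 0.10),
    ("blue-green, after the switch", 1.0),
]:
    print(label, dict(simulate(weight)))
```

Blue-green only ever sits at weight 0 or 1; canary spends time at the values in between, which is where the per-version metrics are watched.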
1:10-1:20 · 10 min · Break
1:20-1:40 · 20 min · Observability: logs, metrics, traces
  • The three pillars and the question each answers: what happened, how much, and where in the chain.
  • Monitoring answers known questions; observability lets you ask new ones without shipping code.
  • Structured logs (JSON) versus grep archaeology; high-cardinality labels.
  • The RED method (rate, errors, duration) per service; the USE method (utilization, saturation, errors) per resource.
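Optional demo aid for this block: structured logs pay off because a program can aggregate them without regular expressions. The sketch below parses a few JSON log lines and derives rate and error ratio per service, plus a crude duration figure. The log lines and the field names (`service`, `status`, `duration_ms`) are made up for illustration and are not a standard schema.

```python
import json
from collections import defaultdict

# Hypothetical log lines; the field names are illustrative, not a standard.
LOG_LINES = [
    '{"service": "api", "status": 200, "duration_ms": 42}',
    '{"service": "api", "status": 500, "duration_ms": 310}',
    '{"service": "api", "status": 200, "duration_ms": 55}',
    '{"service": "auth", "status": 200, "duration_ms": 12}',
]


def red_summary(lines, window_seconds):
    """Rate, error ratio, and worst duration per service over one time window."""
    by_service = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        by_service[event["service"]].append(event)

    summary = {}
    for service, events in by_service.items():
        errors = sum(1 for e in events if e["status"] >= 500)
        summary[service] = {
            "rate_per_s": len(events) / window_seconds,
            "error_ratio": errors / len(events),
            # Crude stand-in for duration; dashboards normally plot percentiles.
            "max_duration_ms": max(e["duration_ms"] for e in events),
        }
    return summary


print(json.dumps(red_summary(LOG_LINES, window_seconds=60), indent=2))
```

Adding a new grouping key, say per endpoint or per customer, is a one-line change here, which is the practical meaning of asking new questions without shipping code.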
1:40-1:55 · 15 min · Tail latency (predict, then run)
  • Predict: average latency is 80 ms; what is p99, and who experiences it?
  • Run a live load test; watch p50 stay flat while p99 explodes under saturation.
  • Why SLOs are written on percentiles, never averages.
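Optional demo aid for the prediction step: a synthetic sample whose mean is exactly 80 ms but whose tail is far slower. The numbers are invented to make the arithmetic easy to check on the board, the percentile uses the simple nearest-rank definition, and the figure of 20 backend calls per page is an assumption for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms: 90 fast requests, 8 slow ones, 2 very slow ones.
latencies = [40] * 90 + [200] * 8 + [1400] * 2

mean = sum(latencies) / len(latencies)
print(f"mean = {mean:.0f} ms")
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies, p)} ms")

# Who experiences p99? Assume a page load fans out into 20 independent requests.
calls_per_page = 20
hit_tail = 1 - 0.99 ** calls_per_page
print(f"pages touching the slowest 1%: {hit_tail:.0%}")
```

With this sample the mean is 80 ms, p50 is 40 ms, p95 is 200 ms, and p99 is 1400 ms. Under the fan-out assumption, roughly 18% of page loads include at least one request from the slowest 1%, so the tail is far from a rare experience.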
1:55-2:00 · 5 min · Wrap-up & practice preview
  • Practice deploys the project to a cluster, runs a canary, and builds the dashboard the rest of the course reads.

Common misconception to confront.

Students often think: Observability just means more dashboards.
Set it straight: Observability is the ability to ask new questions of a running system without shipping new code. Well-structured, high-cardinality telemetry, not the number of dashboards, is what enables it.

Check for understanding (pose during the concept blocks; let students answer before revealing).

Q: What is the difference between blue-green and canary deployment?
A: Blue-green swaps all traffic between two full environments at once; canary shifts a small percentage gradually and watches metrics before ramping up.

Q: When do traces help where aggregate metrics do not?
A: Traces follow a single request across services and show where its latency is spent, which an aggregate metric cannot localise.
Key takeaways.

📚 Reading & resources

💻 Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10 · 10 min · Setup & recap
  • Start a local cluster (kind or minikube) on every team machine.
  • Recap: desired state, probes, RED.
0:10-0:35 · 25 min · Deploy the project service
  • Write the Deployment and Service manifests for the week-3 service.
  • Wire the liveness and readiness probes to /health.
  • Scale to three replicas; kill a pod; watch reconciliation bring it back.
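A possible starting point for the manifests, written as Python dictionaries and printed as JSON, which `kubectl apply` accepts alongside YAML. The service name, image tag, container port, and probe periods are placeholders that each team must replace with its own values; only the `/health` path and the three replicas come from the session plan.

```python
import json

APP = "project-service"          # hypothetical name
IMAGE = "project-service:week3"  # hypothetical image tag
PORT = 8000                      # assumed container port

# Both probes hit the /health endpoint built in week 3.
probe = {"httpGet": {"path": "/health", "port": PORT}}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": APP},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": APP}},
        "template": {
            "metadata": {"labels": {"app": APP}},
            "spec": {
                "containers": [{
                    "name": APP,
                    "image": IMAGE,
                    "ports": [{"containerPort": PORT}],
                    "livenessProbe": {**probe, "periodSeconds": 10},
                    "readinessProbe": {**probe, "periodSeconds": 5},
                }],
            },
        },
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": APP},
    "spec": {
        # The Service finds its pods through the same label the Deployment sets.
        "selector": {"app": APP},
        "ports": [{"port": 80, "targetPort": PORT}],
    },
}

manifest = {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}
print(json.dumps(manifest, indent=2))
```

Saved as, for example, `manifests.py`, the output can be piped straight into the cluster with `python manifests.py | kubectl apply -f -`. Afterwards `kubectl get pods -w` in one terminal and `kubectl delete pod <name>` in another make the reconciliation visible.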
0:35-1:00 · 25 min · Canary rollout, for real
  • Ship v2 of the service next to v1; shift 10% of traffic.
  • Watch the error rate per version; promote, then practice the rollback path.
  • This exact pattern returns in week 7 for models.
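To make the promote-or-rollback call less of a gut feeling, teams can write the rule down before the canary starts. The sketch below is one possible rule, not a recommended policy: the thresholds `min_requests` and `max_extra_error` are arbitrary illustrative values, and the counts would come from whatever per-version metrics the team collects.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; zero traffic counts as zero errors."""
    return errors / total if total else 0.0


def canary_verdict(v1, v2, min_requests=200, max_extra_error=0.01):
    """Decide the canary's fate from (errors, total) counts per version.

    Returns 'wait' until v2 has seen enough traffic, 'rollback' if v2's
    error rate exceeds v1's by more than max_extra_error, else 'promote'.
    """
    v2_errors, v2_total = v2
    if v2_total < min_requests:
        return "wait"
    if error_rate(v2_errors, v2_total) > error_rate(*v1) + max_extra_error:
        return "rollback"
    return "promote"


print(canary_verdict(v1=(9, 900), v2=(1, 100)))    # wait
print(canary_verdict(v1=(9, 900), v2=(5, 300)))    # promote
print(canary_verdict(v1=(9, 900), v2=(15, 300)))   # rollback
```

The "wait" branch matters at 10% traffic: the canary sees few requests, so an early error rate is mostly noise.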
1:00-1:10 · 10 min · Break
1:10-1:35 · 25 min · The project dashboard
  • Expose Prometheus metrics from the service; scrape them.
  • Build the RED dashboard in Grafana: rate, errors, p50/p95/p99 duration.
  • Run the load test; read the tail; record the baseline p95 for the project log.
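When students read p95 off the dashboard, it helps to know that a percentile from a Prometheus histogram is an estimate interpolated inside a bucket, not an exact measurement. The sketch below is a simplified re-implementation of that idea in plain Python, in the spirit of PromQL's `histogram_quantile`; it is not the Prometheus source code, and the bucket boundaries and counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound,
    with a final +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Nothing to interpolate towards; report the last finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented request-duration buckets (seconds) for 1000 requests.
BUCKETS = [(0.05, 600), (0.1, 900), (0.25, 960), (0.5, 990), (1.0, 999), (math.inf, 1000)]

for q in (0.50, 0.95, 0.99):
    print(f"p{round(q * 100)} ~ {histogram_quantile(q, BUCKETS):.3f} s")
```

For these buckets the estimates come out near 0.042 s, 0.225 s, and 0.5 s. The practical consequence for the baseline: the recorded p95 is only as precise as the bucket it falls in, so bucket boundaries should be chosen around the latencies the team cares about.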
1:35-1:50 · 15 min · Students drive
  • Each team gets its service deployed, canaried, and on the dashboard.
  • Instructor circulates on probe and scrape misconfigurations.
1:50-2:00 · 10 min · Project-integration brief
  • The 'Project integration' card: deployed service + canary path + RED dashboard; the baseline p95 goes into the Presentation-1 spec.

Common pitfalls to pre-empt.

Project integration (this week)

