Week 4 Part II · DevOps

Orchestration, Deployment Patterns & Observability

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Run the project service under an orchestrator with health checks and scaling.
Choose and execute a rollout pattern (blue-green, canary).
Instrument the three pillars of observability and read a RED dashboard.

Tools this week

Kubernetes (kind) · kubectl · Prometheus · Grafana · Argo CD (demo)

🎓 Lecture · 2 hours

0:00-0:10	10 min	Recap & objectives	Retrieval: the testing pyramid; why version an API. Today: from one container to a managed fleet, and seeing what it does.
0:10-0:25	15 min	Motivation: 3 a.m. and the pod is gone	One container is easy; forty containers with deploys, crashes, and traffic spikes are why orchestration exists. Story: the deploy that took down checkout because all instances restarted at once; the rollout pattern that would have saved it. You cannot operate what you cannot see: the observability half of today.
0:25-0:50	25 min	Kubernetes: the mental model	Desired state and reconciliation: you declare, the control loop converges. Pods, Deployments, Services; labels and selectors as the glue. Liveness and readiness probes: the /health endpoint from week 3 finds its purpose. Resource requests and limits; horizontal autoscaling. Board work: trace a request from DNS through the Service to a pod, and what happens when that pod dies.
0:50-1:10	20 min	Deployment patterns & GitOps	Rolling update: the default, and its window of mixed versions. Blue-green: two full environments, one switch, instant rollback, double cost. Canary: a small slice of real traffic watched closely, then ramp; the pattern the project will use for models too. GitOps in one picture: the cluster state lives in Git, an agent (Argo CD) reconciles it; the audit log is the repo history.
1:10-1:20	10 min	Break
1:20-1:40	20 min	Observability: logs, metrics, traces	The three pillars and the question each answers: what happened, how much, and where in the chain. Monitoring answers known questions; observability lets you ask new ones without shipping code. Structured logs (JSON) versus grep archaeology; high-cardinality labels. The RED method: rate, errors, duration, per service; USE for resources.
1:40-1:55	15 min	Tail latency (predict, then run)	Predict: average latency is 80 ms; what is p99, and who experiences it? Run a live load test; watch p50 stay flat while p99 explodes under saturation. Why SLOs are written on percentiles, never averages.
1:55-2:00	5 min	Wrap-up & practice preview	Practice deploys the project to a cluster, runs a canary, and builds the dashboard the rest of the course reads.
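
Instructor aid for the "Kubernetes: the mental model" block. The desired-state idea can be shown in a few lines before any real manifest appears. What follows is a toy sketch, not Kubernetes code: the Cluster class, the pod names, and the reconcile function are hypothetical stand-ins for the observed state and the controller.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for observed state: the set of running pod names."""
    pods: set = field(default_factory=set)
    _ids: itertools.count = field(default_factory=itertools.count, repr=False)

    def start_pod(self) -> str:
        name = f"web-{next(self._ids)}"
        self.pods.add(name)
        return name

    def stop_pod(self, name: str) -> None:
        self.pods.discard(name)


def reconcile(cluster: Cluster, desired_replicas: int) -> list[str]:
    """One pass of the control loop: compare observed with desired, act on the gap."""
    actions = []
    while len(cluster.pods) < desired_replicas:
        actions.append(f"start {cluster.start_pod()}")
    while len(cluster.pods) > desired_replicas:
        victim = sorted(cluster.pods)[-1]
        cluster.stop_pod(victim)
        actions.append(f"stop {victim}")
    return actions


cluster = Cluster()
print(reconcile(cluster, 3))  # nothing is running yet, so three pods are started
cluster.stop_pod("web-1")     # simulate a crash
print(reconcile(cluster, 3))  # the loop notices the gap and starts one replacement
print(reconcile(cluster, 3))  # already converged, so no actions
```

The point to draw out: nobody tells the loop "restart web-1". It only ever compares a count with a declaration, which is why killing a pod in practice brings a new one back.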

Common misconception to confront.

Students often think: Observability just means more dashboards.
Set it straight: Observability is the ability to ask new questions of a running system without shipping new code. Well-structured, high-cardinality telemetry, not the number of dashboards, is what enables it.

Check for understanding (pose during the concept blocks; let students answer before revealing).

What is the difference between blue-green and canary deployment?

Blue-green swaps all traffic between two full environments at once; canary shifts a small percentage gradually and watches metrics before ramping up.
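
If students struggle with the distinction, a small simulation may help. This is a sketch under simple assumptions: routing is modelled as a per-request random choice, and the function names and version labels (v1, v2, blue, green) are illustrative, not any real load balancer's API.

```python
import random
from collections import Counter


def blue_green_route(active: str) -> str:
    """Blue-green: every request goes to whichever environment is active."""
    return active


def canary_route(canary_weight: float, rng: random.Random) -> str:
    """Canary: a request goes to v2 with probability canary_weight, else to v1."""
    return "v2" if rng.random() < canary_weight else "v1"


def simulate(canary_weight: float, requests: int = 10_000, seed: int = 0) -> Counter:
    """Count how many simulated requests each version receives."""
    rng = random.Random(seed)
    return Counter(canary_route(canary_weight, rng) for _ in range(requests))


# Ramp the canary: 10%, then 50%, then full promotion.
for weight in (0.10, 0.50, 1.00):
    print(weight, dict(simulate(weight)))

# Blue-green has no ramp: the switch is a single change of the active name.
print(blue_green_route("green"))
```

Rollback in both models is a change to one value: the weight goes back to 0, or the active environment goes back to blue.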

When do traces help where aggregate metrics do not?

Traces follow a single request across services and show where its latency is spent, which an aggregate metric cannot localise.

Key takeaways.

Orchestration reconciles desired state across a fleet.
Canary shifts traffic gradually and watches metrics; blue-green flips all at once.
Tail latency (p95/p99) and traces reveal what averages hide.
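
A worked example for the last takeaway, usable as the reveal after the "predict" prompt. The latency samples are made up for illustration (95 fast requests and 5 slow ones, chosen so the mean comes out at 80 ms), and the percentile helper uses the nearest-rank definition, which is one of several in common use.

```python
import math
import statistics


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 95 requests at 50 ms, 5 requests at 650 ms.
latencies_ms = [50.0] * 95 + [650.0] * 5

print("mean:", statistics.mean(latencies_ms))   # 80.0
print("p50: ", percentile(latencies_ms, 50))    # 50.0
print("p95: ", percentile(latencies_ms, 95))    # 50.0
print("p99: ", percentile(latencies_ms, 99))    # 650.0
```

No request in this sample actually took 80 ms. The mean describes nobody, p50 and p95 look healthy, and only p99 shows the one-in-twenty experience.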

📚 Reading & resources

Kubernetes: Up and Running, the Pods, Deployments and Services chapters. Burns, Beda, Hightower and Evenson; the primitives used in practice.
Kubernetes Basics tutorial. The official interactive walkthrough; do it before the practice session.
Observability Engineering, ch. 1 to 2. Majors, Fong-Jones and Miranda; what observability is and is not.
Prometheus: getting started. Scrape, query, alert; the practice wiring in document form.
Grafana fundamentals tutorial. Dashboards over Prometheus data; the RED dashboard recipe.
Argo CD: getting started. The GitOps loop in fifteen minutes; demo-level familiarity is enough.

💻 Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10	10 min	Setup & recap	Start a local cluster (kind or minikube) on every team machine. Recap: desired state, probes, RED.
0:10-0:35	25 min	Deploy the project service	Write the Deployment and Service manifests for the week-3 service. Wire the liveness and readiness probes to /health. Scale to three replicas; kill a pod; watch reconciliation bring it back.
0:35-1:00	25 min	Canary rollout, for real	Ship v2 of the service next to v1; shift 10% of traffic. Watch the error rate per version; promote, then practice the rollback path. This exact pattern returns in week 7 for models.
1:00-1:10	10 min	Break
1:10-1:35	25 min	The project dashboard	Expose Prometheus metrics from the service; scrape them. Build the RED dashboard in Grafana: rate, errors, p50/p95/p99 duration. Run the load test; read the tail; record the baseline p95 for the project log.
1:35-1:50	15 min	Students drive	Each team gets its service deployed, canaried, and on the dashboard. Instructor circulates on probe and scrape misconfigurations.
1:50-2:00	10 min	Project-integration brief	The 'Project integration' card: deployed service + canary path + RED dashboard; the baseline p95 goes into the Presentation-1 spec.
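
Instructor aid for "The project dashboard" block. Before students build panels in Grafana, it can help to show what the RED numbers are computed from. This is a plain-Python sketch, not Prometheus or Grafana code: the Request record, the sample data, and the choice to count only 5xx responses as errors are assumptions made for illustration.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    timestamp_s: float  # seconds since the start of the window
    status: int         # HTTP status code
    duration_ms: float


def red_summary(requests: list[Request], window_s: float) -> dict:
    """Rate, Errors, Duration for one service over one time window."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: only 5xx count
    cuts = statistics.quantiles(durations, n=100)  # 99 cut points; cuts[k - 1] estimates p_k
    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }


# Made-up traffic: 120 requests over 60 s, every 20th one fails.
requests = [
    Request(timestamp_s=i * 0.5, status=500 if i % 20 == 0 else 200, duration_ms=40.0 + (i % 10))
    for i in range(120)
]
summary = red_summary(requests, window_s=60.0)
print(summary)
```

Each key maps to one dashboard panel. The baseline p95 that goes into the project log is the p95_ms value measured under the load test, not this toy number.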

Common pitfalls to pre-empt.

p50 latency hides the tail; always watch p95/p99.
Unstructured logs are hard to query; emit structured (JSON) logs.
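
To make the second pitfall concrete, here is a minimal sketch of structured logging with the Python standard library. The logger name, the field names (route, status, duration_ms), and the convention of passing them under a "fields" key are assumptions for illustration; the project service may use a different language or logging library.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes record.fields.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("project-service")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.info("request served", extra={"fields": {"route": "/health", "status": 200, "duration_ms": 12}})

# A structured line can be parsed and filtered by field instead of grepped.
entry = json.loads(buffer.getvalue())
print(entry["route"], entry["status"], entry["duration_ms"])
```

Because every line is a JSON object, a question such as "which routes returned 500 in the last hour" becomes a filter on the status field rather than a regular expression over free text.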

Project integration (this week)

Deploy the project service to a cluster with health probes and three replicas.
Demonstrate a canary rollout and a rollback on the project service.
Stand up the project RED dashboard; record baseline p95 latency for the spec.
