Week 1 · Part I · Foundations & the Cloud

Production Engineering & the Ops Landscape

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Explain why production systems fail in operations more than in modelling.
Define SLIs, SLOs, and error budgets and write one of each.
Map the five operational layers (DevOps through AgentOps) and how they stack.
Name the course use-case domains and shortlist two for the team project.

Tools this week

Git, GitHub, Docker

🎓 Lecture · 2 hours

0:00-0:10	10 min	Welcome & objectives Course mechanics: 2 h lecture + 2 h practice, one running project, three presentations (weeks 5, 8, 13), no written exams. What this course is: the engineering around the model, not the model itself. This week's objectives on the board.
0:10-0:25	15 min	Motivation: the prototype-to-production gap A walk through a real incident: an accurate model, a stale feature pipeline, a silent 40% error rate for three weeks. The 90/10 inversion: the model is a small box in a large systems diagram. Discussion prompt: what could have caught this, and at which layer?
0:25-0:50	25 min	Reliability as a contract: SLIs, SLOs, error budgets SLI: a measured indicator (request latency, error rate, freshness). SLO: the target on that indicator; SLA: the contract with consequences. Error budgets: a quantified allowance of unreliability to spend on change. Board work: turn 'the chatbot should be fast and reliable' into two SLOs with numbers. The toil concept and why automation is an SRE obligation, not a luxury.
0:50-1:10	20 min	What 'production-ready' means Availability, latency (p95, not average), cost, recoverability, observability. Day-one versus day-two: shipping is the start, operating is the job. The on-call loop: detect, triage, mitigate, learn (blameless postmortems). Quick poll: which property does your favourite app most visibly fail on?
1:10-1:20	10 min	Break
1:20-1:40	20 min	The five operational layers DevOps: code to running service (CI/CD, containers, observability). DataOps: trustworthy data (pipelines, quality, versioning). MLOps: the model lifecycle (tracking, registry, serving, drift). LLMOps: operating models you did not train (APIs, RAG, evaluation, cost). AgentOps: operating loops that act (tools, tracing, bounds). The stacking argument: each layer inherits the guarantees, and the debts, of those below.
1:40-1:55	15 min	The course use cases & the running project IoT telemetry: sensor streams, anomaly alerts, predictive maintenance. Document-QA chatbot: RAG over a real corpus, grounded answers with citations. Document processing: structured extraction from invoices and forms at scale. How one system will thread all five layers across thirteen weeks; what each Student Project Presentation must show.
1:55-2:00	5 min	Wrap-up & practice preview Revisit the checks below; practice sets up Git, Docker, and the team repositories.
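
The "p95, not average" point from the production-ready block can be made concrete with a small sketch. The latency samples are invented for illustration, and `percentile` is a hypothetical helper that uses the nearest-rank method, which is one of several percentile definitions rather than a prescribed one.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Invented data: 90 fast requests and 10 very slow ones (milliseconds).
latencies_ms = [100] * 90 + [3000] * 10

mean_ms = statistics.mean(latencies_ms)  # pulled up only partly by the slow tail
p50_ms = percentile(latencies_ms, 50)    # what the typical request sees
p95_ms = percentile(latencies_ms, 95)    # what the slowest requests see

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p95={p95_ms} ms")
```

On this data the mean is 390 ms while p95 is 3000 ms: one request in ten waits three seconds, and the average hides it. A latency SLO stated on p95 would expose the tail that an SLO on the mean would miss.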

Common misconception to confront.

Students often think: If the model is accurate, the system is done.
Set it straight: Accuracy is one property. Availability, latency, cost, data freshness, and recoverability are separate properties that usually dominate production outcomes.

Check for understanding (pose during the concept blocks; let students answer before revealing).

Give one failure a 99%-accurate model can still cause in production.

It can serve at 5 s latency, crash under load, or be fed stale or garbage inputs. Accuracy says nothing about any of these.

What does an error budget let a team do?

Spend a quantified amount of unreliability on shipping features, and freeze risky changes once it is exhausted. It makes reliability a negotiable, measurable resource.
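
The error-budget answer above can be sketched numerically for a request-based availability SLO. The function name, the 99.9% target, and the request counts are illustrative assumptions, not values from the course.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarise the error budget for a request-based availability SLO.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "budget_consumed": failed_requests / allowed_failures,
        # Once the budget is spent, risky changes wait until it recovers.
        "freeze_risky_changes": remaining <= 0,
    }


# Hypothetical 30-day window: 1,000,000 requests, 250 of them failed.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(report)
```

With a 99.9% target over 1,000,000 requests the budget is about 1,000 failed requests, so 250 failures consume roughly a quarter of it and the team can keep shipping. At 1,200 failures the budget is exhausted and the freeze flag turns on.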

Key takeaways.

Production failures are mostly operational, not algorithmic.
SLOs and error budgets turn 'reliability' into a measurable contract.
The five layers compose; each later layer inherits the earlier ones.

📚 Reading & resources

Site Reliability Engineering, ch. 1 and ch. 4. Introduction and Service Level Objectives; the SLO and error-budget vocabulary used throughout the course. Free online.
The Phoenix Project. Kim, Behr and Spafford; narrative motivation for why operations dominates outcomes. Read any third of it this week.
Accelerate, Part I. Forsgren, Humble and Kim; the evidence that delivery practice predicts performance.
SRE fundamentals: SLIs, SLAs and SLOs. Short Google Cloud explainer; a good first pass before the book chapter.

💻 Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10	10 min	Setup & recap Verify Git, Docker, and an editor on every machine; fix stragglers now. Recap: SLOs, error budgets, the five layers, the three use cases.
0:10-0:35	25 min	Git for teams, done properly Branch, commit, push, pull request, review, merge: the full loop live. Branch protection and required reviews on the shared repository. Repository layout for a service: src, tests, infra, docs; .gitignore and secrets hygiene from day one.
0:35-1:00	25 min	Containers from zero Write a Dockerfile for a minimal 'hello-service' line by line. Build, run, stop, and inspect; read the logs; map the port. Pin the base image and dependency versions; rebuild and show identical behaviour.
1:00-1:10	10 min	Break
1:10-1:35	25 min	Team formation & project kickoff Form teams of three or four; create each team's repository from the course template. Walk the template: service skeleton, infra folder, README contract. Each team opens its first pull request (the README) and merges it through review.
1:35-1:50	15 min	Students drive Each team containerises the template service and runs it locally. Instructor circulates; common failures (port clashes, missing pins) fixed live.
1:50-2:00	10 min	Project-integration brief Walk the 'Project integration' card below: what each team adds to its system before next week and how it builds toward Presentation 1.

Common pitfalls to pre-empt.

Never commit secrets or large data to Git; use .gitignore and environment variables from day one.
A container that runs only on your laptop usually pins no versions. Pin them.
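
The pinning pitfall can be checked mechanically. The sketch below assumes the service is written in Python and lists its dependencies in a pip-style requirements file, neither of which this plan specifies; `unpinned_requirements` is a hypothetical helper, and it deliberately ignores edge cases such as URL requirements.

```python
def unpinned_requirements(lines):
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as '-r base.txt'
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Hypothetical file contents for illustration.
example = [
    "flask==3.0.0",      # pinned: rebuilds get the same version
    "requests>=2.0",     # a range: rebuilds may drift
    "numpy",             # no version at all
    "# a comment line",
]
print(unpinned_requirements(example))  # ['requests>=2.0', 'numpy']
```

A check like this could run before each image build so that an unpinned dependency fails fast instead of surfacing later as "works on my laptop".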

Project integration (this week)

Create the team repository from the template, with branch protection and a clean layout.
Containerise the hello-service and run it locally; pin all versions.
Shortlist two use-case domains (IoT, chatbot, document processing, or a proposed alternative); decide by week 3.
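
For teams that want a reference point before opening the template, a hello-service small enough to containerise might look like the sketch below. It is an assumption, not the course template: the plan does not name a language or framework, and `hello_body`, `make_server`, `main`, the JSON payload, and the `PORT` variable are all hypothetical. It uses only the Python standard library, so there are no third-party versions to pin.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def hello_body():
    """Return the response payload as JSON-encoded bytes."""
    return json.dumps({"message": "hello"}).encode("utf-8")


class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET with the same small JSON document."""

    def do_GET(self):
        body = hello_body()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
        # BaseHTTPRequestHandler logs each request to stderr,
        # which is what shows up in the container logs.


def make_server(port):
    """Bind on all interfaces so a mapped container port can reach the service."""
    return HTTPServer(("0.0.0.0", port), HelloHandler)


def main():
    # Read the port from the environment rather than hard-coding it.
    port = int(os.environ.get("PORT", "8000"))
    make_server(port).serve_forever()
```

Calling `main()` starts the service on the port given by `PORT` (8000 if unset). Binding to `0.0.0.0` rather than `127.0.0.1` matters inside a container: a service bound only to loopback is unreachable through a mapped port, which is an easy failure to mistake for a port clash.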
