Engineering of AI Systems · HIT

Week 1 · Part I · Foundations & the Cloud

Production Engineering & the Ops Landscape

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Tools this week

Git · GitHub · Docker

🎓 Lecture · 2 hours

0:00-0:10 · 10 min · Welcome & objectives
  • Course mechanics: 2 h lecture + 2 h practice, one running project, three presentations (weeks 5, 8, 13), no written exams.
  • What this course is: the engineering around the model, not the model itself.
  • This week's objectives on the board.
0:10-0:25 · 15 min · Motivation: the prototype-to-production gap
  • A walk through a real incident: an accurate model, a stale feature pipeline, a silent 40% error rate for three weeks.
  • The 90/10 inversion: the model is a small box in a large systems diagram.
  • Discussion prompt: what could have caught this, and at which layer?
0:25-0:50 · 25 min · Reliability as a contract: SLIs, SLOs, error budgets
  • SLI: a measured indicator (request latency, error rate, freshness).
  • SLO: the target on that indicator; SLA: the contract with consequences.
  • Error budgets: a quantified allowance of unreliability to spend on change.
  • Board work: turn 'the chatbot should be fast and reliable' into two SLOs with numbers.
  • The toil concept and why automation is an SRE obligation, not a luxury.
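A possible worked example for the board work, as a minimal sketch: it assumes a request-based availability SLO of 99.9% over one window and made-up traffic numbers (neither comes from the course material). The function name `error_budget_report` and the result keys are hypothetical.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarise a request-based error budget for one SLO window.

    slo_target is the SLO as a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if failed_requests < 0:
        raise ValueError("failed_requests must not be negative")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Fraction of that allowance already spent in this window.
    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        # Illustrative policy: once the budget is spent, stop risky changes.
        "freeze_risky_changes": consumed >= 1.0,
    }


# Illustrative numbers only: 1,000,000 requests in the window, 250 failures.
report = error_budget_report(0.999, 1_000_000, 250)
print(f"budget consumed: {report['budget_consumed']:.0%}")
```

With these assumed numbers the SLO allows about 1,000 failed requests, so 250 failures spend roughly a quarter of the budget and leave room to ship changes.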
0:50-1:10 · 20 min · What 'production-ready' means
  • Availability, latency (p95, not average), cost, recoverability, observability.
  • Day-one versus day-two: shipping is the start, operating is the job.
  • The on-call loop: detect, triage, mitigate, learn (blameless postmortems).
  • Quick poll: which property does your favourite app most visibly fail on?
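To make 'p95, not average' concrete, a small sketch under stated assumptions: the latency samples are invented (90 fast requests, 10 slow ones) and the percentile uses the nearest-rank definition, which is one of several in common use. The helper name `percentile` is hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Invented sample: 90 requests at 100 ms, 10 requests at 3000 ms.
latencies_ms = [100] * 90 + [3000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

print(mean_ms, p50_ms, p95_ms)  # 390.0 100 3000
```

The mean (390 ms) describes no request that actually happened, while p95 shows that at least one user in twenty waits three seconds, which is the figure an SLO on latency would normally target.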
1:10-1:20 · 10 min · Break
1:20-1:40 · 20 min · The five operational layers
  • DevOps: code to running service (CI/CD, containers, observability).
  • DataOps: trustworthy data (pipelines, quality, versioning).
  • MLOps: the model lifecycle (tracking, registry, serving, drift).
  • LLMOps: operating models you did not train (APIs, RAG, evaluation, cost).
  • AgentOps: operating loops that act (tools, tracing, bounds).
  • The stacking argument: each layer inherits the guarantees, and the debts, of those below.
1:40-1:55 · 15 min · The course use cases & the running project
  • IoT telemetry: sensor streams, anomaly alerts, predictive maintenance.
  • Document-QA chatbot: RAG over a real corpus, grounded answers with citations.
  • Document processing: structured extraction from invoices and forms at scale.
  • How one system will thread all five layers across thirteen weeks; what each Student Project Presentation must show.
1:55-2:00 · 5 min · Wrap-up & practice preview
  • Revisit the checks below; practice sets up Git, Docker, and the team repositories.
Common misconception to confront.

Students often think: If the model is accurate, the system is done.
Set it straight: Accuracy is one property. Availability, latency, cost, data freshness, and recoverability are separate properties that usually dominate production outcomes.

Check for understanding (pose during the concept blocks; let students answer before revealing).
Give one failure a 99%-accurate model can still cause in production.
It can serve at 5 s latency, crash under load, or be fed stale or garbage inputs. Accuracy says nothing about any of these.
What does an error budget let a team do?
Spend a quantified amount of unreliability on shipping features, and freeze risky changes once it is exhausted. It makes reliability a negotiable, measurable resource.
Key takeaways.

📚 Reading & resources

💻 Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10 · 10 min · Setup & recap
  • Verify Git, Docker, and an editor on every machine; fix stragglers now.
  • Recap: SLOs, error budgets, the five layers, the three use cases.
0:10-0:35 · 25 min · Git for teams, done properly
  • Branch, commit, push, pull request, review, merge: the full loop live.
  • Branch protection and required reviews on the shared repository.
  • Repository layout for a service: src, tests, infra, docs; .gitignore and secrets hygiene from day one.
0:35-1:00 · 25 min · Containers from zero
  • Write a Dockerfile for a minimal 'hello-service' line by line.
  • Build, run, stop, and inspect; read the logs; map the port.
  • Pin the base image and dependency versions; rebuild and show identical behaviour.
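One way to show what 'pinned' means before the rebuild demo is a small check script. This is a sketch only, not part of the course template: the function names are hypothetical, the sample lines are illustrative, and the rules are deliberately simple (an exact `==` pin for dependencies; an explicit tag other than `latest`, or a digest, for the base image).

```python
def unpinned_requirements(lines):
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


def base_image_is_pinned(from_line):
    """True if a Dockerfile FROM line names a digest or a tag other than 'latest'."""
    tokens = from_line.split()
    if len(tokens) < 2 or tokens[0].upper() != "FROM":
        raise ValueError("expected a Dockerfile FROM line")
    # Skip flags such as --platform=... that may precede the image name.
    images = [t for t in tokens[1:] if not t.startswith("--")]
    if not images:
        raise ValueError("no image name found on the FROM line")
    image = images[0]
    if "@" in image:  # image@sha256:... is pinned by digest
        return True
    name = image.rsplit("/", 1)[-1]  # ignore a registry host:port prefix
    if ":" not in name:
        return False  # no tag means the default tag, 'latest'
    tag = name.split(":", 1)[1]
    return tag not in ("", "latest")


# Illustrative inputs, not taken from the course template.
print(unpinned_requirements(["flask==3.0.3", "requests", "gunicorn>=21"]))
print(base_image_is_pinned("FROM python:3.12-slim"))
print(base_image_is_pinned("FROM python"))
```

A tag can be moved to a different image by its publisher, so a tag pin is weaker than a digest pin; the check above accepts both to keep the demo short.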
1:00-1:10 · 10 min · Break
1:10-1:35 · 25 min · Team formation & project kickoff
  • Form teams of three or four; create each team's repository from the course template.
  • Walk the template: service skeleton, infra folder, README contract.
  • Each team opens its first pull request (the README) and merges it through review.
1:35-1:50 · 15 min · Students drive
  • Each team containerises the template service and runs it locally.
  • Instructor circulates; common failures (port clashes, missing pins) fixed live.
1:50-2:00 · 10 min · Project-integration brief
  • Walk the 'Project integration' card below: what each team adds to its system before next week and how it builds toward Presentation 1.
Common pitfalls to pre-empt.

Project integration (this week)

Curated references · Project brief

Next · Week 2: Cloud Computing Fundamentals