References
Curated reading across the five layers
Books, canonical papers, and authoritative docs for DevOps, DataOps, MLOps, LLMOps, and AgentOps.
The list below groups recommendations by layer. The reading-by-week table at the foot maps each lecture to its primary sources, and the per-week links on the lesson plans point to that table. Free online resources are linked directly; books are listed with their ISBNs.
The backbone systems text: storage, replication, partitioning, consistency, and batch/stream processing. Assign the 1st edition; the 2nd is forthcoming (2026).
The canonical SRE text: SLOs, error budgets, toil, on-call. Read free online.
Hands-on companion with worked SLO and error-budget examples. Read free online.
The evidence base for the four DORA delivery metrics.
Architecture characteristics and trade-off analysis for production systems.
The cloud-fundamentals reading for week 2: pillars, deployment trade-offs, and who secures what.
The closest thing DevOps/SRE has to a body of knowledge: the SRE books plus DORA's delivery-performance research; awesome-sre is the curated index.
The practitioner reference for flow, feedback, and continual learning.
A narrative on-ramp to DevOps thinking; good week-1 motivation.
The standard hands-on Kubernetes primer from its creators.
Bridges Kubernetes to CI/CD, GitOps, and operations.
The go-to practical infrastructure-as-code text.
Tool-agnostic IaC principles and patterns.
Modern logs, metrics, traces, and the structured-events view of observability.
The best survey of the data lifecycle (ingest, store, transform, serve) with a DataOps chapter.
The definitive hands-on orchestration book; maps to the pipeline weeks.
Compact, practical ELT/ETL patterns in Python and SQL.
Transformation, testing, and versioned models: the modern data-quality workflow.
Orchestration, transformation/testing, data validation, and streaming, respectively.
The bronze/silver/gold staged-refinement pattern taught in week 5; short and canonical.
The curated index for the data layer. DAMA-DMBOK is the formal data-management body of knowledge; enterprise-flavoured but citable.
The leading end-to-end ML system-design text: data, features, deployment, monitoring, drift.
Compact, opinionated coverage of the full ML project lifecycle.
Accessible framing of model lifecycle, governance, and monitoring.
A pattern catalog for data prep, model building, and MLOps; a useful lookup.
Applies SRE principles to ML in production: the bridge between DevOps and MLOps.
Why ML systems accrue operational debt; the founding MLOps motivation.
MLOps has a genuine body of knowledge: the ml-ops.org principles plus Google's 43 Rules of ML; the two lists are the freshest curated indexes.
The anchor text for building production LLM apps: evaluation, prompting, RAG, fine-tuning, agents, deployment.
End-to-end LLMOps pipeline: data, fine-tuning, RAG, deployment, monitoring.
Shipping LLMs: serving, cost, LoRA/RLHF, benchmarking.
Operational LLM concerns: evaluation, governance, GenAI security, cost.
The original RAG paper.
Prompting guidance; programmable guardrails for LLM apps.
The week 9-10 operating stack: a managed AI platform (catalog, guardrails, knowledge bases), the open-source gateway, and LLM tracing/evaluation.
LLMOps' closest things to a body of knowledge are security-flavoured: the OWASP list and the NIST framework; Awesome-LLMOps is the curated tool index.
Hands-on multi-agent systems: personas, tool use, memory, collaboration.
Production agent design: orchestration, evaluation, monitoring, human-in-the-loop.
The reason-act-observe loop most agents are built on.
Foundational treatment of tool use / function calling.
When to use workflows versus agents; practical patterns and their failure modes.
Graph-based agent orchestration; a standard tool/context interface.
AgentOps has no body of knowledge yet; OWASP's agentic-security work is in progress. The MCP server index is the best live map of the agent-tool ecosystem.
| Wk | Topic | Primary sources |
|---|---|---|
| 1 | Ops landscape & SRE | SRE book (SLOs, error budgets), The Phoenix Project, Accelerate. |
| 2 | Cloud computing fundamentals | AWS Well-Architected Framework, the Shared Responsibility Model, the provider's free-tier and pricing docs. |
| 3 | CI/CD, testing & REST services | DevOps Handbook, Accelerate (DORA), Terraform: Up & Running (IaC chapter), Infrastructure as Code. |
| 4 | Orchestration, rollouts & observability | Kubernetes: Up and Running, Cloud Native DevOps, Observability Engineering, OpenTelemetry. |
| 5 | Data lakes, pipelines & versioning | Fundamentals of Data Engineering, the medallion architecture, Data Pipelines with Airflow, DDIA (batch/stream). |
| 6 | Quality, contracts, streaming & features | Great Expectations, Apache Kafka docs, Analytics Engineering with dbt, Fundamentals of Data Engineering (quality, streaming). |
| 7 | Tracking, registry & serving | Designing ML Systems (training, deployment), MLflow docs, ML Engineering (Burkov), Hidden Technical Debt. |
| 8 | Monitoring, drift & governance | Designing ML Systems (monitoring), Reliable ML, Introducing MLOps (governance). |
| 9 | LLM foundations, AI APIs & the token economy | AI Engineering (foundations, prompting), the OpenAI/Anthropic API docs, Amazon Bedrock docs, OpenAI prompt guide. |
| 10 | RAG & serving LLMs: vector databases & gateways | AI Engineering (RAG), RAG paper (Lewis), LiteLLM docs, LLM Engineer's Handbook, LLMs in Production (serving). |
| 11 | LLM evaluation, guardrails & observability | AI Engineering (evaluation), Langfuse docs, Ragas, NeMo Guardrails, OWASP Top 10 for LLM Apps (injection), OpenAI prompt guide. |
| 12 | Agents, MCP & AgentOps | AI Engineering (agents), Building Effective Agents, MCP docs, ReAct, Toolformer, AI Agents in Action, LangGraph. |
| 13 | Security, governance & synthesis | OWASP Top 10 for LLM Apps, NIST AI RMF, Reliable ML, SRE Workbook. |