Engineering of AI Systems · HIT

References · Curated reading across the five layers

References & Reading List

Books, canonical papers, and authoritative docs for DevOps, DataOps, MLOps, LLMOps, and AgentOps.

The list below groups recommendations by layer. The reading-by-week table at the end of this page maps each lecture to its primary sources; the per-week links on the lesson plans point there. Free online resources are linked directly; books are listed with an ISBN where one is available.

Foundations & General

Book
Designing Data-Intensive Applications · Martin Kleppmann · O'Reilly, 2017 · ISBN 9781449373320

The backbone systems text: storage, replication, partitioning, consistency, and batch/stream processing. Assign the 1st edition; the 2nd is forthcoming (2026).

Free
Site Reliability Engineering · Beyer, Jones, Petoff & Murphy (eds.) · Google / O'Reilly, 2016

The canonical SRE text: SLOs, error budgets, toil, on-call. Read free online.

Free
The Site Reliability Workbook · Beyer, Murphy, Rensin, Kawahara & Thorne (eds.) · Google / O'Reilly, 2018

Hands-on companion with worked SLO and error-budget examples. Read free online.

Book
Accelerate: The Science of Lean Software and DevOps · Forsgren, Humble & Kim · IT Revolution, 2018 · ISBN 9781942788331

The evidence base for the four DORA delivery metrics.

Book
Fundamentals of Software Architecture · Mark Richards & Neal Ford · O'Reilly, 2020 · ISBN 9781492043454

Architecture characteristics and trade-off analysis for production systems.

Docs
AWS Well-Architected Framework · Shared Responsibility Model

The cloud-fundamentals reading for week 2: pillars, deployment trade-offs, and who secures what.

BoK & lists
DORA research · awesome-sre

The closest thing DevOps/SRE has to a body of knowledge: the SRE books plus DORA's delivery-performance research; awesome-sre is the curated index.

DevOps

Book
The DevOps Handbook (2nd ed.) · Kim, Humble, Debois, Willis & Forsgren · IT Revolution, 2021 · ISBN 9781950508402

The practitioner reference for flow, feedback, and continual learning.

Book
The Phoenix Project · Kim, Behr & Spafford · IT Revolution, 2013 (rev. 2018) · ISBN 9781942788294

A narrative on-ramp to DevOps thinking; good week-1 motivation.

Book
Kubernetes: Up & Running (3rd ed.) · Burns, Beda, Hightower & Evenson · O'Reilly, 2022 · ISBN 9781098110208

The standard hands-on Kubernetes primer from its creators.

Book
Cloud Native DevOps with Kubernetes (2nd ed.) · Arundel & Domingus · O'Reilly, 2022 · ISBN 9781098116828

Bridges Kubernetes to CI/CD, GitOps, and operations.

Book
Terraform: Up & Running (3rd ed.) · Yevgeniy Brikman · O'Reilly, 2022 · ISBN 9781098116743

The go-to practical infrastructure-as-code text.

Book
Infrastructure as Code (2nd ed.) · Kief Morris · O'Reilly, 2020 · ISBN 9781098114671

Tool-agnostic IaC principles and patterns.

Book
Observability Engineering · Majors, Fong-Jones & Miranda · O'Reilly, 2022 · ISBN 9781492076445

Modern logs, metrics, traces, and the structured-events view of observability.

Docs
OpenTelemetry documentation · CNCF

The vendor-neutral standard for traces, metrics, and logs.

DataOps

Book
Fundamentals of Data Engineering · Joe Reis & Matt Housley · O'Reilly, 2022 · ISBN 9781098108304

The best survey of the data lifecycle (ingest, store, transform, serve) with a DataOps chapter.

Book
Data Pipelines with Apache Airflow · Harenslak & de Ruiter · Manning, 2021 · ISBN 9781617296901

The definitive hands-on orchestration book; maps to the pipeline weeks.

Book
Data Pipelines Pocket Reference · James Densmore · O'Reilly, 2021 · ISBN 9781492087830

Compact, practical ELT/ETL patterns in Python and SQL.

Book
Analytics Engineering with SQL and dbt · Machado & Russa · O'Reilly, 2024 · ISBN 9781098142384

Transformation, testing, and versioned models: the modern data-quality workflow.

Docs
Apache Airflow · dbt · Great Expectations · Apache Kafka

Orchestration, transformation/testing, data validation, and streaming, respectively.

Docs
The medallion architecture · Databricks glossary

The bronze/silver/gold staged-refinement pattern taught in week 5; short and canonical.

BoK & lists
awesome-data-engineering

The curated index for the data layer. DAMA-DMBOK is the formal data-management body of knowledge; enterprise-flavoured but citable.

MLOps

Book
Designing Machine Learning Systems · Chip Huyen · O'Reilly, 2022 · ISBN 9781098107963

The leading end-to-end ML system-design text: data, features, deployment, monitoring, drift.

Book
Machine Learning Engineering · Andriy Burkov · True Positive Inc., 2020 · ISBN 9781999579579

Compact, opinionated coverage of the full ML project lifecycle.

Book
Introducing MLOps · Mark Treveil & the Dataiku team · O'Reilly, 2020 · ISBN 9781492083290

Accessible framing of model lifecycle, governance, and monitoring.

Book
Machine Learning Design Patterns · Lakshmanan, Robinson & Munn · O'Reilly, 2020 · ISBN 9781098115784

A pattern catalog for data prep, model building, and MLOps; a useful lookup.

Book
Reliable Machine Learning · Chen, Murphy, Parisa, Sculley & Underwood · O'Reilly, 2022 · ISBN 9781098106225

Applies SRE principles to ML in production: the bridge between DevOps and MLOps.

Paper
Hidden Technical Debt in Machine Learning Systems · Sculley et al. · NeurIPS 2015

Why ML systems accrue operational debt; the founding MLOps motivation.

Docs
MLflow · Huyen's MLOps guide

Experiment tracking and registry; a free curated MLOps overview.

BoK & lists
ml-ops.org · Rules of ML · awesome-production-ML · awesome-mlops (tools)

MLOps has a genuine body of knowledge: the ml-ops.org principles plus Google's 43 Rules of ML. The two awesome lists (awesome-production-ML and awesome-mlops) are the freshest curated indexes.

LLMOps

Book
AI Engineering: Building Applications with Foundation Models · Chip Huyen · O'Reilly, 2025 · ISBN 9781098166304

The anchor text for building production LLM apps: evaluation, prompting, RAG, fine-tuning, agents, deployment.

Book
LLM Engineer's Handbook · Iusztin & Labonne · Packt, 2024 · ISBN 9781836200079

End-to-end LLMOps pipeline: data, fine-tuning, RAG, deployment, monitoring.

Book
LLMs in Production · Brousseau & Sharp · Manning, 2024 · ISBN 9781633437203

Shipping LLMs: serving, cost, LoRA/RLHF, benchmarking.

Book
LLMOps: Managing Large Language Models in Production · Abi Aryan · O'Reilly, 2025

Operational LLM concerns: evaluation, governance, GenAI security, cost.

Paper
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks · Lewis et al. · 2020

The original RAG paper.

Docs
OpenAI Prompt Engineering · NeMo Guardrails

Prompting guidance; programmable guardrails for LLM apps.

Docs
AWS Bedrock · LiteLLM · Langfuse

The operating stack for weeks 9-11: a managed AI platform (catalog, guardrails, knowledge bases), an open-source gateway, and LLM tracing/evaluation.

BoK & lists
OWASP Top 10 for LLM Apps (2025) · NIST AI RMF · Awesome-LLMOps

LLMOps' closest things to a body of knowledge are security-flavoured: the OWASP list and the NIST framework; Awesome-LLMOps is the curated tool index.

AgentOps

Book
AI Agents in Action · Micheal Lanham · Manning, 2025 · ISBN 9781633436343

Hands-on multi-agent systems: personas, tool use, memory, collaboration.

Book
Building Applications with AI Agents · Michael Albada · O'Reilly, 2025

Production agent design: orchestration, evaluation, monitoring, human-in-the-loop.

Paper
ReAct: Synergizing Reasoning and Acting in Language Models · Yao et al. · ICLR 2023

The reason-act-observe loop most agents are built on.

Paper
Toolformer: Language Models Can Teach Themselves to Use Tools · Schick et al. · 2023

Foundational treatment of tool use / function calling.

Article
Building Effective Agents · Anthropic, 2024

When to use workflows versus agents; practical patterns and their failure modes.

Docs
LangGraph · Model Context Protocol

Graph-based agent orchestration; a standard tool/context interface.

BoK & lists
awesome-mcp-servers · awesome-llm-agents

AgentOps has no body of knowledge yet; OWASP's agentic-security work is in progress. The MCP server index is the best live map of the agent-tool ecosystem.

Reading by week

Wk · Topic · Primary sources
1 · Ops landscape & SRE · SRE book (SLOs, error budgets), The Phoenix Project, Accelerate.
2 · Cloud computing fundamentals · AWS Well-Architected Framework, the Shared Responsibility Model, the provider's free-tier and pricing docs.
3 · CI/CD, testing & REST services · DevOps Handbook, Accelerate (DORA), Terraform Up & Running (IaC chapter), Infrastructure as Code.
4 · Orchestration, rollouts & observability · Kubernetes Up & Running, Cloud Native DevOps, Observability Engineering, OpenTelemetry.
5 · Data lakes, pipelines & versioning · Fundamentals of Data Engineering, the medallion architecture, Data Pipelines with Airflow, DDIA (batch/stream).
6 · Quality, contracts, streaming & features · Great Expectations, Apache Kafka docs, Analytics Engineering with dbt, Fundamentals of Data Engineering (quality, streaming).
7 · Tracking, registry & serving · Designing ML Systems (training, deployment), MLflow docs, ML Engineering (Burkov), Hidden Technical Debt.
8 · Monitoring, drift & governance · Designing ML Systems (monitoring), Reliable ML, Introducing MLOps (governance).
9 · LLM foundations, AI APIs & the token economy · AI Engineering (foundations, prompting), the OpenAI/Anthropic API docs, AWS Bedrock docs, OpenAI prompt guide.
10 · RAG & serving LLMs: vector databases & gateways · AI Engineering (RAG), RAG paper (Lewis), LiteLLM docs, LLM Engineer's Handbook, LLMs in Production (serving).
11 · LLM evaluation, guardrails & observability · AI Engineering (evaluation), Langfuse docs, Ragas, NeMo Guardrails, OWASP Top 10 for LLM Apps (injection), OpenAI prompt guide.
12 · Agents, MCP & AgentOps · AI Engineering (agents), Building Effective Agents, MCP docs, ReAct, Toolformer, AI Agents in Action, LangGraph.
13 · Security, governance & synthesis · OWASP Top 10 for LLM Apps, NIST AI RMF, Reliable ML, SRE Workbook.