Week 10 · Part V · LLMOps & AgentOps

RAG & Serving LLMs: Vector Databases & Gateways

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Build a retrieval-augmented generation service over a real corpus.
Manage prompts as versioned program logic.
Choose between hosted APIs and self-hosted serving (vLLM).
Route all LLM traffic through a gateway with fallbacks and budgets.

Tools this week

FAISS / Qdrant · embedding models · LiteLLM gateway · vLLM (discussed) · prompt versioning

🎓 Lecture · 2 hours

0:00-0:10	10 min	Recap & objectives Retrieval: two cost levers; why pin model versions. Today: grounding the model in your data, and putting an operations layer in front of every call.
0:10-0:25	15 min	Motivation: the model has never read your documents A model answers from training data; your corpus, your sensors, and your invoices are not in it. Fine-tuning versus retrieval: why retrieval wins for fresh, factual, citable knowledge. And once LLM calls are load-bearing: who handles a provider outage, a leaked key, a runaway bill? The gateway half of today.
0:25-0:50	25 min	RAG, end to end Embeddings: text to vectors; semantic similarity as geometry. The vector database (FAISS, Qdrant): index, search, metadata filters. Chunking: size, overlap, structure; where most RAG quality is won or lost. Assembling the grounded prompt: retrieved context, citations, instruction discipline. The chatbot use case on the board: silver-layer corpus to index to retrieval to grounded answer.
0:50-1:10	20 min	Prompts as code Prompts are program logic: a prompt change alters behaviour like a code change. Versioning, review, and rollback for prompts; tracked in MLflow like week 7's models. Prompt templates and variables; separating instructions from data (the injection defense begins here). Regression discipline previewed: next week every prompt change runs the evals.
1:10-1:20	10 min	Break
1:20-1:40	20 min	The serving choice: hosted versus self-hosted Hosted APIs: zero infrastructure, per-token cost, provider control of versions and limits. Self-hosted open weights with vLLM: control and unit economics at scale, in exchange for GPUs and ops burden. Throughput engineering in one slide: batching, KV cache, time-to-first-token. The decision rubric: volume, privacy, latency, and team capacity.
1:40-1:55	15 min	The gateway pattern (predict, then run) One OpenAI-compatible endpoint, many providers behind it (LiteLLM): routing, fallbacks, key custody, budgets, logging. Predict: the primary provider goes down mid-request; what should the user see? Then we kill it live and watch the fallback. Where the gateway sits in every project architecture from today on.
1:55-2:00	5 min	Wrap-up & practice preview Practice builds each project's RAG core and puts the gateway in front of it.

Common misconception to confront.

Students often think: RAG means the model can no longer hallucinate.
Set it straight: Retrieval grounds the answer, but the model can still ignore or misread the context. Faithfulness depends on retrieval quality, chunking, and citations, and must be measured rather than assumed.
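One small way to make "measured rather than assumed" concrete is a citation check on each generated answer. The sketch below assumes answers cite chunk ids in square brackets (a convention chosen here for illustration) and only verifies that cited ids were actually retrieved; it is a crude proxy, not a full faithfulness metric, and the function name is hypothetical.

```python
import re

def citation_report(answer: str, retrieved_ids: list[str]) -> dict:
    """Compare ids cited in the answer against the ids that were retrieved."""
    cited = set(re.findall(r"\[([^\]]+)\]", answer))
    retrieved = set(retrieved_ids)
    return {
        "has_citation": bool(cited),              # uncited answers deserve a second look
        "cited": sorted(cited),
        "unknown": sorted(cited - retrieved),     # cited but never retrieved: a red flag
    }

good = citation_report("Invoice 1043 is overdue [inv-002].", ["inv-002", "inv-001"])
bad = citation_report("Invoice 1050 is overdue [inv-999].", ["inv-002", "inv-001"])
print(good)
print(bad)
```

A valid citation does not prove the answer is faithful to the cited chunk; it only rules out the cheapest failure, which is why this belongs alongside reading the answers critically.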

Check for understanding (pose during the concept blocks; let students answer before revealing).

Why version prompts like code?

Prompts are program logic: a prompt change alters behaviour, so it must be tracked, reviewed, and rollback-able like any other code.

Give one reason RAG retrieval fails even with a good model.

Poor chunking or an embedding mismatch means the relevant passage is never retrieved: a recall failure the generator cannot recover from.
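The recall failure in the answer above can be put in numbers with recall@k over a small labelled question set. The sketch below assumes each evaluation question comes with the ids of the chunks that should be retrieved; the ids and the `recall_at_k` helper are illustrative, not part of any library.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)

# Hypothetical evaluation rows: (ranked retrieval result, labelled relevant ids).
eval_rows = [
    (["inv-002", "inv-001", "sen-007"], ["inv-002"]),   # relevant chunk ranked first
    (["sen-007", "inv-001", "inv-002"], ["inv-002"]),   # relevant chunk ranked third
]

for k in (1, 3):
    mean_recall = sum(recall_at_k(r, rel, k) for r, rel in eval_rows) / len(eval_rows)
    print(f"recall@{k} = {mean_recall:.2f}")
```

Re-running the same numbers after a chunking change is one way to see retrieval quality move without involving the generator at all.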

Key takeaways.

Retrieval quality bounds RAG quality; chunking is where it is won or lost.
Prompts are versioned program logic, not free text.
A gateway centralises routing, keys, fallbacks, and budgets for every LLM call.
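The gateway takeaway can be made concrete with a toy model of what happens inside one. The sketch below is not LiteLLM and does not use its API: the `Gateway` class, the provider callables, the exception names, and the flat per-call cost in cents are all assumptions made for illustration. It shows ordered fallback, a budget cap, and a log of every attempt.

```python
class ProviderDown(RuntimeError):
    """Raised by a provider callable when the upstream service is unavailable."""

class BudgetExceeded(RuntimeError):
    """Raised when the budget cap has already been reached."""

class Gateway:
    def __init__(self, providers, budget_cents: int):
        # providers: ordered list of (name, callable, cost_cents); first is primary.
        self.providers = providers
        self.budget_cents = budget_cents
        self.spent_cents = 0
        self.log = []  # (provider name, outcome) for every attempt

    def complete(self, prompt: str) -> str:
        # Refuse new requests once the cap is reached.
        if self.spent_cents >= self.budget_cents:
            raise BudgetExceeded(f"spent {self.spent_cents} of {self.budget_cents} cents")
        for name, call, cost_cents in self.providers:
            try:
                answer = call(prompt)
            except ProviderDown:
                self.log.append((name, "failed"))
                continue  # fall through to the next provider
            self.spent_cents += cost_cents  # charge only successful calls
            self.log.append((name, "ok"))
            return answer
        raise ProviderDown("all providers failed")

# Hypothetical providers: the primary is "killed", the fallback answers.
def primary(prompt: str) -> str:
    raise ProviderDown("primary unavailable")

def fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"

gw = Gateway([("primary", primary, 2), ("fallback", fallback, 1)], budget_cents=2)
answer = gw.complete("hello")
print(answer)
print(gw.log)
```

In this sketch the caller sees an ordinary answer while the log records the failed primary attempt, which is the behaviour the predict-then-run exercise asks students to anticipate.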

📚 Reading & resources

AI Engineering, ch. 6 and ch. 9. Huyen; RAG and agents, and inference optimization (the serving-choice economics).
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Lewis et al., 2020; the original RAG paper.
LiteLLM documentation. The gateway used from this week on: routing, fallbacks, budgets, keys.
Qdrant quickstart. A vector database hands-on; FAISS docs are the in-process alternative.
vLLM documentation. Self-hosted serving: batching, KV cache, throughput; read the architecture overview.

💻 Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10	10 min	Setup & recap Open the project corpus (silver layer) and the vector database. Recap: chunking, grounded prompts, the gateway.
0:10-0:35	25 min	Build the RAG core Chunk and embed the corpus; index it with metadata. Retrieve for real questions; inspect what actually came back before blaming the model. Tune the chunking once and re-inspect: see retrieval quality move.
0:35-1:00	25 min	Ground and generate Assemble the grounded prompt with citations; generate and read the answers critically. Store the prompt as a versioned file; register the version like a model artifact. Swap prompt versions live and diff the behaviour.
1:00-1:10	10 min	Break
1:10-1:35	25 min	The gateway in front Stand up LiteLLM with two providers; the project now calls one endpoint. Demonstrate fallback by killing the primary provider mid-demo. Set a per-team budget cap and a logging hook; the bill is now observable.
1:35-1:50	15 min	Students drive Each team gets retrieval running on its own corpus behind the gateway. Instructor circulates on chunking and embedding mismatches.
1:50-2:00	10 min	Project-integration brief The 'Project integration' card: the project's RAG or extraction pipeline runs behind the gateway from this week on.
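For the "tune the chunking once and re-inspect" step in the schedule above, a fixed-size chunker with overlap is enough to show the two knobs. This is a minimal sketch that splits on whitespace and counts words; real pipelines often count tokens or split on document structure instead, and the function name and parameters are chosen here for illustration.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

text = "a b c d e f g"
print(chunk_words(text, size=4, overlap=2))
print(chunk_words(text, size=4, overlap=0))
```

Changing `size` or `overlap`, re-indexing, and re-running the same questions is the whole tuning loop; the retrieved chunks, not the final answers, are what to compare.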

Common pitfalls to pre-empt.

Embedding and querying with different models silently destroys retrieval.
A gateway nobody routes through protects nothing; move every call behind it.
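The first pitfall is silent because two embedding models can produce vectors of the same dimension. One way to make it loud is to record the model name with the index and check it on every write and query. The sketch below is a hypothetical wrapper, not a FAISS or Qdrant feature; the class, exception, and model names are illustrative.

```python
from dataclasses import dataclass, field

class EmbeddingModelMismatch(ValueError):
    """Raised when a vector was produced by a different model than the index."""

@dataclass
class GuardedIndex:
    embedding_model: str              # name of the model the index was built with
    dimension: int
    vectors: list = field(default_factory=list)

    def _check(self, vector: list[float], model: str) -> None:
        if model != self.embedding_model:
            raise EmbeddingModelMismatch(
                f"index built with {self.embedding_model!r}, got {model!r}"
            )
        if len(vector) != self.dimension:
            raise EmbeddingModelMismatch(
                f"expected dimension {self.dimension}, got {len(vector)}"
            )

    def add(self, vector: list[float], model: str) -> None:
        self._check(vector, model)
        self.vectors.append(vector)

    def check_query(self, vector: list[float], model: str) -> None:
        """Call before searching; raises instead of returning meaningless neighbours."""
        self._check(vector, model)

index = GuardedIndex(embedding_model="model-a", dimension=3)
index.add([0.1, 0.2, 0.3], model="model-a")

try:
    index.check_query([0.3, 0.2, 0.1], model="model-b")  # same dimension, wrong model
except EmbeddingModelMismatch as exc:
    print(f"rejected: {exc}")
```

The dimension check alone would have let the `model-b` query through, which is why the model name is stored as index metadata.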

Project integration (this week)

The project's RAG or extraction pipeline runs behind the LiteLLM gateway with fallback and a budget cap.
Prompt versions tracked like model artifacts; one version swap demonstrated.
Retrieval quality inspected and chunking tuned on the project's own corpus.
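For the prompt-version item in the list above, one lightweight approach is to derive the version id from the template text itself, so any wording change produces a new id that can be logged, reviewed, and rolled back. This is a sketch under assumptions: the `PromptVersion` class and template wording are hypothetical, and registering the id in MLflow or another tracker is left out.

```python
import hashlib
from dataclasses import dataclass
from string import Template

@dataclass(frozen=True)
class PromptVersion:
    name: str
    template: str  # instructions with $placeholders; data is filled in at render time

    @property
    def version(self) -> str:
        """Content hash of the template: same text, same id; any edit, new id."""
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def render(self, **variables: str) -> str:
        # substitute() raises KeyError if a placeholder is missing.
        return Template(self.template).substitute(**variables)

v1 = PromptVersion(
    "qa",
    "Answer from the context.\n<context>\n$context\n</context>\nQuestion: $question",
)
v2 = PromptVersion(
    "qa",
    "Answer from the context and cite chunk ids.\n"
    "<context>\n$context\n</context>\nQuestion: $question",
)

# Swapping versions is a lookup; rolling back means pointing "active" at v1 again.
registry = {v.version: v for v in (v1, v2)}
active = registry[v2.version]
rendered = active.render(context="[inv-002] Invoice 1043 is overdue.",
                         question="What is overdue?")
print(active.version)
print(rendered)
```

Keeping retrieved text inside explicit delimiters and out of the instruction text is the separation of instructions from data that the lecture names as the start of injection defense; it reduces the risk but does not eliminate it.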
