| 0:00-0:10 | 10 min | Recap & objectives
- Retrieval: two cost levers; why pin model versions.
- Today: grounding the model in your data, and putting an operations layer in front of every call.
|
| 0:10-0:25 | 15 min | Motivation: the model has never read your documents
- A model answers from training data; your corpus, your sensors, and your invoices are not in it.
- Fine-tuning versus retrieval: why retrieval wins for fresh, factual, citable knowledge.
- Once LLM calls are load-bearing, who handles a provider outage, a leaked key, or a runaway bill? That is the gateway half of today.
|
| 0:25-0:50 | 25 min | RAG, end to end
- Embeddings: text to vectors; semantic similarity as geometry.
- The vector database (FAISS, Qdrant): index, search, metadata filters.
- Chunking: size, overlap, structure; where most RAG quality is won or lost.
- Assembling the grounded prompt: retrieved context, citations, instruction discipline.
- The chatbot use case on the board: silver-layer corpus to index to retrieval to grounded answer.
|
| 0:50-1:10 | 20 min | Prompts as code
- Prompts are program logic: a prompt change alters behaviour like a code change.
- Versioning, review, and rollback for prompts; tracked in MLflow like week 7's models.
- Prompt templates and variables; separating instructions from data (the injection defense begins here).
- Regression discipline previewed: next week every prompt change runs the evals.
|
| 1:10-1:20 | 10 min | Break |
| 1:20-1:40 | 20 min | The serving choice: hosted versus self-hosted
- Hosted APIs: zero infrastructure, per-token cost, provider control of versions and limits.
- Self-hosted open weights with vLLM: control and unit economics at scale, in exchange for GPUs and ops burden.
- Throughput engineering in one slide: batching, KV cache, time-to-first-token.
- The decision rubric: volume, privacy, latency, and team capacity.
|
| 1:40-1:55 | 15 min | The gateway pattern (predict, then run)
- One OpenAI-compatible endpoint, many providers behind it (LiteLLM): routing, fallbacks, key custody, budgets, logging.
- Predict: the primary provider goes down mid-request; what should the user see? Then we kill it live and watch the fallback.
- Where the gateway sits in every project architecture from today on.
|
| 1:55-2:00 | 5 min | Wrap-up & practice preview
- Practice builds each project's RAG core and puts the gateway in front of it.
|
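
The three RAG steps named in the 0:25-0:50 block (chunking with overlap, similarity as geometry, and a grounded prompt with citations) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the course's implementation: the function names `chunk_words`, `cosine`, and `build_grounded_prompt` are hypothetical, chunking by whitespace-separated words is a simplification of token- or structure-aware chunking, and the vectors in the demo are hand-written toy numbers rather than real embeddings.

```python
import math


def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity: the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_grounded_prompt(question, retrieved):
    """Assemble a prompt from (source_id, text) pairs, keeping instructions apart from data."""
    context = "\n".join(f"[{source_id}] {text}" for source_id, text in retrieved)
    return (
        "Answer using only the context below. Cite sources by their bracketed id. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # Ten words, windows of four, one word shared between neighbours.
    print(chunk_words("a b c d e f g h i j", size=4, overlap=1))
    # Toy vectors: identical directions score 1.0, orthogonal ones 0.0.
    print(cosine([1.0, 0.0], [1.0, 0.0]), cosine([1.0, 0.0], [0.0, 1.0]))
    print(build_grounded_prompt(
        "When was pump 3 serviced?",
        [("doc-1", "Pump 3 was serviced in March.")],
    ))
```

Printing the retrieved chunks before generating anything is the cheapest way to see whether a bad answer comes from retrieval or from the model.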
Common misconception to confront.
Students often think: RAG means the model can no longer hallucinate.
Set it straight: Retrieval grounds the answer, but the model can still ignore or misread the context. Faithfulness depends on retrieval quality, chunking, and citations, and must be measured rather than assumed.
In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.
| 0:00-0:10 | 10 min | Setup & recap
- Open the project corpus (silver layer) and the vector database.
- Recap: chunking, grounded prompts, the gateway.
|
| 0:10-0:35 | 25 min | Build the RAG core
- Chunk and embed the corpus; index it with metadata.
- Retrieve for real questions; inspect what actually came back before blaming the model.
- Tune the chunking once and re-inspect: see retrieval quality move.
|
| 0:35-1:00 | 25 min | Ground and generate
- Assemble the grounded prompt with citations; generate and read the answers critically.
- Store the prompt as a versioned file; register the version like a model artifact.
- Swap prompt versions live and diff the behaviour.
|
| 1:00-1:10 | 10 min | Break |
| 1:10-1:35 | 25 min | The gateway in front
- Stand up LiteLLM with two providers; the project now calls one endpoint.
- Demonstrate fallback by killing the primary provider mid-demo.
- Set a per-team budget cap and a logging hook; the bill is now observable.
|
| 1:35-1:50 | 15 min | Students drive
- Each team gets retrieval running on its own corpus behind the gateway.
- Instructor circulates on chunking and embedding mismatches.
|
| 1:50-2:00 | 10 min | Project-integration brief
- The 'Project integration' card: the project's RAG or extraction pipeline runs behind the gateway from this week on.
|
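
The fallback behaviour demonstrated in the gateway block can be sketched without any gateway software at all. The code below is a minimal illustration of the pattern only and does not use LiteLLM's API: `ProviderError`, `call_with_fallback`, `primary_down`, and `secondary` are hypothetical names, and the two "providers" are stub functions standing in for real model endpoints.

```python
class ProviderError(Exception):
    """Raised by a provider stub when it cannot serve the request (hypothetical)."""


def call_with_fallback(prompt, providers, log=None):
    """Try each (name, callable) provider in order; return the first successful answer.

    If `log` is a list, append one record naming the provider that answered
    and the providers that failed before it.
    """
    errors = []
    for name, call in providers:
        try:
            answer = call(prompt)
        except ProviderError as exc:
            errors.append((name, str(exc)))
            continue  # fall through to the next provider
        if log is not None:
            log.append({"provider": name, "failed_before": [n for n, _ in errors]})
        return answer
    raise ProviderError(f"all providers failed: {errors}")


def primary_down(prompt):
    # Stub simulating the primary provider being killed mid-demo.
    raise ProviderError("primary is unreachable")


def secondary(prompt):
    # Stub standing in for a healthy second provider.
    return f"echo: {prompt}"


if __name__ == "__main__":
    log = []
    answer = call_with_fallback(
        "hello",
        [("primary", primary_down), ("secondary", secondary)],
        log,
    )
    print(answer)
    print(log)
```

The caller sees one function and one answer; which provider served it is visible only in the log, which is the property the single-endpoint gateway gives every project.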