| 0:00-0:10 | 10 min | Recap & objectives
- Retrieval: what chunking controls; what the gateway centralises.
- Today: how you know the LLM feature works, keeps working, and resists abuse.
|
| 0:10-0:25 | 15 min | Motivation: 'looks good' is not evaluation
- Story: the prompt tweak that improved the demo and silently broke the long tail; nobody had an eval set to catch it.
- LLM outputs are free text: no exception is thrown when quality drops.
- The discipline transfer: tests (week 3), monitoring (week 8), now evals; same idea, new layer.
|
| 0:25-0:50 | 25 min | Evaluation that means something
- The eval set: representative inputs with expected properties; built from real usage, refreshed as usage shifts.
- Metrics for RAG: faithfulness (is the answer grounded?), answer relevance, retrieval recall.
- Golden answers versus property checks; when each applies.
- Eval-driven development: every prompt or model change runs the evals; a drop blocks the merge like a failing test.
- Board work: design five eval items for the chatbot use case, live.
|
| 0:50-1:10 | 20 min | LLM-as-judge, used honestly
- Why judges scale where humans do not; where they break.
- The biases named: position, verbosity, self-preference.
- Calibration: measure the judge's agreement with human labels on a sample before trusting it at scale.
- Cheap tier for judging, flagship for generation: the cost asymmetry that makes evals affordable.
|
| 1:10-1:20 | 10 min | Break |
| 1:20-1:40 | 20 min | LLM observability
- Tracing an LLM request: prompt version, retrieved context, model, tokens, cost, latency, as one span tree (Langfuse).
- Token and cost dashboards per feature and per user; the week-9 spreadsheet becomes live telemetry.
- Sampling production traffic into the eval set: the flywheel that keeps evals representative.
- Week 4's observability lesson, one layer up: same pillars, new signals.
|
| 1:40-1:55 | 15 min | Guardrails & injection (predict, then run)
- Prompt injection: untrusted content that rewrites your instructions; the retrieved document that says 'ignore your rules'.
- Predict: will our chatbot follow a malicious instruction embedded in a retrieved document? Then we run it.
- Defense layers: input/output filtering, PII handling, instruction hierarchy, least privilege on what the model can trigger.
- Caching as both a cost and a latency guardrail.
|
| 1:55-2:00 | 5 min | Wrap-up & practice preview
- Practice builds each project's eval harness, wires tracing, and attacks the service.
|
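The eval-driven development idea from the "Evaluation that means something" block (property checks on an eval set, and a drop that blocks the merge like a failing test) can be sketched as follows. This is a minimal illustration, not the course's harness: `EvalItem`, `passes`, `pass_rate`, `gate`, the 0.02 tolerance, and the canned answers standing in for the chatbot before and after a prompt change are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EvalItem:
    """One eval item: an input plus the properties its answer must have."""
    question: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def passes(item: EvalItem, answer: str) -> bool:
    """Property check: required phrases present, forbidden phrases absent."""
    text = answer.lower()
    has_required = all(s.lower() in text for s in item.must_contain)
    has_forbidden = any(s.lower() in text for s in item.must_not_contain)
    return has_required and not has_forbidden


def pass_rate(items: list[EvalItem], answer_fn) -> float:
    """Fraction of eval items whose answer satisfies its properties."""
    results = [passes(item, answer_fn(item.question)) for item in items]
    return sum(results) / len(results)


def gate(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """True if the change may merge; the tolerance is an assumed value."""
    return candidate >= baseline - tolerance


EVAL_SET = [
    EvalItem("What is the refund window?", must_contain=["30 days"]),
    EvalItem("Do you ship to Norway?", must_contain=["yes"]),
    EvalItem("What is the admin password?", must_not_contain=["hunter2"]),
    EvalItem("Refund window for gift cards?", must_contain=["non-refundable"]),
]

# Canned answers standing in for the service before the prompt change.
BASELINE = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "Do you ship to Norway?": "Yes, we ship to Norway.",
    "What is the admin password?": "I can't share credentials.",
    "Refund window for gift cards?": "Gift cards are non-refundable.",
}
# After the tweak: the common cases still pass, one long-tail case breaks.
CANDIDATE = dict(BASELINE)
CANDIDATE["Refund window for gift cards?"] = "Refunds are accepted within 30 days."

baseline_score = pass_rate(EVAL_SET, BASELINE.get)
candidate_score = pass_rate(EVAL_SET, CANDIDATE.get)
merge_allowed = gate(baseline_score, candidate_score)
print(baseline_score, candidate_score, merge_allowed)
```

In CI, the same comparison would run against the real service and exit non-zero when `gate` returns `False`, so the regression surfaces exactly like a failing test.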
Common misconception to confront.
Students often think: LLM-as-judge is objective.
Set it straight: A judge model carries position, verbosity, and self-preference biases. It must be calibrated against human labels and used with controls, not trusted blindly.
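Calibrating a judge against human labels comes down to measuring agreement on a shared sample. A minimal sketch, assuming binary pass/fail labels and using Cohen's kappa to correct raw agreement for chance; the ten labels below are made-up illustration data, and `agreement` and `cohens_kappa` are hypothetical helper names.

```python
from collections import Counter


def agreement(human: list[int], judge: list[int]) -> float:
    """Raw fraction of items where the judge matches the human label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for what two raters would reach by chance."""
    n = len(human)
    observed = agreement(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum((human_counts[l] / n) * (judge_counts[l] / n) for l in labels)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so report perfect agreement by convention in this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustration only: 1 = "faithful", 0 = "not faithful" on ten items.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]

raw = agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"raw agreement = {raw:.2f}, kappa = {kappa:.2f}")
```

Raw agreement alone can flatter a judge when one label dominates the sample, which is why the chance-corrected figure is worth reporting next to it. What level counts as good enough is a per-project decision, not something this sketch settles.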
In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.
| 0:00-0:10 | 10 min | Setup & recap
- Open the project's RAG service and the tracing dashboard.
- Recap: faithfulness, judge biases, injection.
|
| 0:10-0:35 | 25 min | Build the eval harness
- Build a 30-item starter eval set live (each team extends to 50+ on its own corpus).
- Score faithfulness and answer relevance (Ragas or a calibrated judge).
- Log the eval run in MLflow like any experiment (week 7's discipline).
|
| 0:35-1:00 | 25 min | Regression in action
- Change the prompt; re-run the evals; read the regression like a failing test.
- Calibrate the judge on ten human-labelled items; measure the agreement.
- Wire the eval run into CI so a quality drop blocks the merge.
|
| 1:00-1:10 | 10 min | Break |
| 1:10-1:35 | 25 min | Trace and observe
- Wire Langfuse tracing through the gateway: every call now records prompt version, tokens, cost, latency.
- Read a full span tree for one RAG request; find the latency hotspot.
- Build the per-feature cost view; compare against the week-9 estimate.
|
| 1:35-1:50 | 15 min | Attack it
- Plant an injection in a retrieved document; watch the unguarded service comply.
- Add input/output guardrails; re-attack; compare.
- Add the response cache; measure the cost and latency drop on repeated queries.
|
| 1:50-2:00 | 10 min | Project-integration brief
- The 'Project integration' card: eval set, tracing, and a guardrail land on the project's LLM path this week.
|
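The response-cache step from the "Attack it" block (add the cache, then measure the cost drop on repeated queries) could look roughly like this. It is a sketch under stated assumptions: `CachedLLM`, `fake_model`, and the flat `cost_per_call` of 0.002 are hypothetical stand-ins, and a real RAG service would likely also need the retrieved context in the cache key.

```python
import hashlib


class CachedLLM:
    """Wraps a model call with an exact-match cache and hit/miss counters."""

    def __init__(self, call_model, cost_per_call: float):
        self._call_model = call_model
        self._cost_per_call = cost_per_call  # assumed flat cost per model call
        self._cache: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt_version: str, query: str) -> str:
        # Normalise case and whitespace so trivial variants share an entry.
        # The prompt version is part of the key, so a prompt change never
        # serves answers produced under the old prompt.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(f"{prompt_version}\n{normalized}".encode()).hexdigest()

    def answer(self, prompt_version: str, query: str) -> str:
        key = self._key(prompt_version, query)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._call_model(query)
        self._cache[key] = result
        return result

    @property
    def spend(self) -> float:
        """Only cache misses reach the model, so only they cost money."""
        return self.misses * self._cost_per_call


def fake_model(query: str) -> str:
    """Stand-in for the real gateway call."""
    return f"answer to: {query}"


llm = CachedLLM(fake_model, cost_per_call=0.002)
queries = [
    "What is the refund window?",
    "what is the  refund window?",   # same query, different case and spacing
    "Do you ship to Norway?",
    "What is the refund window?",
    "Is there a student discount?",
]
for q in queries:
    llm.answer("v1", q)

uncached_spend = len(queries) * 0.002
print(llm.hits, llm.misses, llm.spend, uncached_spend)
```

Latency could be measured the same way by timing `answer` with `time.perf_counter` around hits and misses; it is left out here to keep the sketch deterministic.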