Week 11 Part V · LLMOps & AgentOps

LLM Evaluation, Guardrails & Observability

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Build a representative eval set and run it as a regression suite.
Use LLM-as-judge with its biases named and calibrated.
Trace and observe LLM applications (Langfuse): spans, tokens, cost.
Apply guardrails against unsafe and injected behaviour.

Tools this week

RagasLangfuseLLM-as-judgeNeMo Guardrailsresponse cache

🎓Lecture · 2 hours

0:00-0:10	10 min	Recap & objectives Retrieval: what chunking controls; what the gateway centralises. Today: how you know the LLM feature works, keeps working, and resists abuse.
0:10-0:25	15 min	Motivation: 'looks good' is not evaluation Story: the prompt tweak that improved the demo and silently broke the long tail; nobody had an eval set to catch it. LLM outputs are free text: no exception is thrown when quality drops. The discipline transfer: tests (week 3), monitoring (week 8), now evals; same idea, new layer.
0:25-0:50	25 min	Evaluation that means something The eval set: representative inputs with expected properties; built from real usage, refreshed as usage shifts. Metrics for RAG: faithfulness (is the answer grounded?), answer relevance, retrieval recall. Golden answers versus property checks; when each applies. Eval-driven development: every prompt or model change runs the evals; a drop blocks the merge like a failing test. Board work: design five eval items for the chatbot use case, live.
0:50-1:10	20 min	LLM-as-judge, used honestly Why judges scale where humans do not; where they break. The biases named: position, verbosity, self-preference. Calibration: agree the judge with human labels on a sample before trusting it at scale. Cheap tier for judging, flagship for generation: the cost asymmetry that makes evals affordable.
1:10-1:20	10 min	Break
1:20-1:40	20 min	LLM observability Tracing an LLM request: prompt version, retrieved context, model, tokens, cost, latency, as one span tree (Langfuse). Token and cost dashboards per feature and per user; the week-9 spreadsheet becomes live telemetry. Sampling production traffic into the eval set: the flywheel that keeps evals representative. Week 4's observability lesson, one layer up: same pillars, new signals.
1:40-1:55	15 min	Guardrails & injection (predict, then run) Prompt injection: untrusted content that rewrites your instructions; the retrieved document that says 'ignore your rules'. Predict: will our chatbot follow a malicious instruction embedded in a retrieved document? Then we run it. Defense layers: input/output filtering, PII handling, instruction hierarchy, least privilege on what the model can trigger. Caching as both a cost and a latency guardrail.
1:55-2:00	5 min	Wrap-up & practice previewPractice builds each project's eval harness, wires tracing, and attacks the service.

Common misconception to confront.

Students often think: LLM-as-judge is objective.
Set it straight: A judge model carries position, verbosity, and self-preference biases. It must be calibrated against human labels and used with controls, not trusted blindly.

Check for understanding (pose during the concept blocks; let students answer before revealing).

Name two biases of an LLM judge.

Position bias (favouring the first answer), verbosity bias (favouring longer answers), and self-preference (favouring its own style); calibrate against human labels before trusting it.

What is prompt injection and one defense?

Untrusted input that overrides the system instructions. Defenses include input/output filtering, instruction hierarchy, and not acting on instructions embedded in untrusted content.

Key takeaways.

The eval set is the regression suite for prompts; run it on every change.
Calibrate the judge against humans; never trust it blind.
Trace every LLM call: prompt version, context, tokens, cost, latency.

📚Reading & resources

AI Engineering, ch. 3 to 4 Huyen; evaluation methodology and evaluating AI systems, the week's backbone.
Ragas documentation Faithfulness, answer relevance, and context metrics for RAG evals.
Langfuse documentation Tracing, token/cost analytics, and eval workflows; the observability stack of the practice.
OWASP Top 10 for LLM Applications Read LLM01 (prompt injection) and LLM02 (insecure output handling) this week.
NeMo Guardrails Programmable guardrails; the README and examples are enough for the practice.

💻Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10	10 min	Setup & recap Open the project's RAG service and the tracing dashboard. Recap: faithfulness, judge biases, injection.
0:10-0:35	25 min	Build the eval harness Build a 30-item starter eval set live (each team extends to 50+ on its own corpus). Score faithfulness and answer relevance (Ragas or a calibrated judge). Log the eval run in MLflow like any experiment (week 7's discipline).
0:35-1:00	25 min	Regression in action Change the prompt; re-run the evals; read the regression like a failing test. Calibrate the judge on ten human-labelled items; measure the agreement. Wire the eval run into CI so a quality drop blocks the merge.
1:00-1:10	10 min	Break
1:10-1:35	25 min	Trace and observe Wire Langfuse tracing through the gateway: every call now records prompt version, tokens, cost, latency. Read a full span tree for one RAG request; find the latency hotspot. Build the per-feature cost view; compare against the week-9 estimate.
1:35-1:50	15 min	Attack it Plant an injection in a retrieved document; watch the unguarded service comply. Add input/output guardrails; re-attack; compare. Add the response cache; measure the cost and latency drop on repeated queries.
1:50-2:00	10 min	Project-integration briefThe 'Project integration' card: eval set, tracing, and a guardrail land on the project's LLM path this week.

Common pitfalls to pre-empt.

A 10-item eval set measures noise; build a representative one.
Never put secrets or PII in prompts or traces; scrub at the boundary.

Project integration (this week)

An eval set of at least 50 items exists; the first eval run is logged and wired into CI.
Langfuse tracing live on the project's LLM path: prompt version, tokens, cost, latency per request.
One guardrail in place and demonstrated against a real injection attempt; response cache measured.

Curated references Project brief