Engineering of AI Systems · HIT

Week 11   Part V · LLMOps & AgentOps

LLM Evaluation, Guardrails & Observability

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Tools this week

RagasLangfuseLLM-as-judgeNeMo Guardrailsresponse cache

🎓Lecture · 2 hours

0:00-0:1010 minRecap & objectives
  • Retrieval: what chunking controls; what the gateway centralises.
  • Today: how you know the LLM feature works, keeps working, and resists abuse.
0:10-0:2515 minMotivation: 'looks good' is not evaluation
  • Story: the prompt tweak that improved the demo and silently broke the long tail; nobody had an eval set to catch it.
  • LLM outputs are free text: no exception is thrown when quality drops.
  • The discipline transfer: tests (week 3), monitoring (week 8), now evals; same idea, new layer.
0:25-0:5025 minEvaluation that means something
  • The eval set: representative inputs with expected properties; built from real usage, refreshed as usage shifts.
  • Metrics for RAG: faithfulness (is the answer grounded?), answer relevance, retrieval recall.
  • Golden answers versus property checks; when each applies.
  • Eval-driven development: every prompt or model change runs the evals; a drop blocks the merge like a failing test.
  • Board work: design five eval items for the chatbot use case, live.
0:50-1:1020 minLLM-as-judge, used honestly
  • Why judges scale where humans do not; where they break.
  • The biases named: position, verbosity, self-preference.
  • Calibration: agree the judge with human labels on a sample before trusting it at scale.
  • Cheap tier for judging, flagship for generation: the cost asymmetry that makes evals affordable.
1:10-1:2010 minBreak
1:20-1:4020 minLLM observability
  • Tracing an LLM request: prompt version, retrieved context, model, tokens, cost, latency, as one span tree (Langfuse).
  • Token and cost dashboards per feature and per user; the week-9 spreadsheet becomes live telemetry.
  • Sampling production traffic into the eval set: the flywheel that keeps evals representative.
  • Week 4's observability lesson, one layer up: same pillars, new signals.
1:40-1:5515 minGuardrails & injection (predict, then run)
  • Prompt injection: untrusted content that rewrites your instructions; the retrieved document that says 'ignore your rules'.
  • Predict: will our chatbot follow a malicious instruction embedded in a retrieved document? Then we run it.
  • Defense layers: input/output filtering, PII handling, instruction hierarchy, least privilege on what the model can trigger.
  • Caching as both a cost and a latency guardrail.
1:55-2:005 minWrap-up & practice previewPractice builds each project's eval harness, wires tracing, and attacks the service.
Common misconception to confront.

Students often think: LLM-as-judge is objective.
Set it straight: A judge model carries position, verbosity, and self-preference biases. It must be calibrated against human labels and used with controls, not trusted blindly.

Check for understanding (pose during the concept blocks; let students answer before revealing).
Name two biases of an LLM judge.
Position bias (favouring the first answer), verbosity bias (favouring longer answers), and self-preference (favouring its own style); calibrate against human labels before trusting it.
What is prompt injection and one defense?
Untrusted input that overrides the system instructions. Defenses include input/output filtering, instruction hierarchy, and not acting on instructions embedded in untrusted content.
Key takeaways.

📚Reading & resources

💻Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:1010 minSetup & recap
  • Open the project's RAG service and the tracing dashboard.
  • Recap: faithfulness, judge biases, injection.
0:10-0:3525 minBuild the eval harness
  • Build a 30-item starter eval set live (each team extends to 50+ on its own corpus).
  • Score faithfulness and answer relevance (Ragas or a calibrated judge).
  • Log the eval run in MLflow like any experiment (week 7's discipline).
0:35-1:0025 minRegression in action
  • Change the prompt; re-run the evals; read the regression like a failing test.
  • Calibrate the judge on ten human-labelled items; measure the agreement.
  • Wire the eval run into CI so a quality drop blocks the merge.
1:00-1:1010 minBreak
1:10-1:3525 minTrace and observe
  • Wire Langfuse tracing through the gateway: every call now records prompt version, tokens, cost, latency.
  • Read a full span tree for one RAG request; find the latency hotspot.
  • Build the per-feature cost view; compare against the week-9 estimate.
1:35-1:5015 minAttack it
  • Plant an injection in a retrieved document; watch the unguarded service comply.
  • Add input/output guardrails; re-attack; compare.
  • Add the response cache; measure the cost and latency drop on repeated queries.
1:50-2:0010 minProject-integration briefThe 'Project integration' card: eval set, tracing, and a guardrail land on the project's LLM path this week.
Common pitfalls to pre-empt.

Project integration (this week)

Curated references Project brief

PreviousWeek 10: RAG & Serving LLMs: Vector Databases & GatewaysNextWeek 12: Agents & AgentOps: Tools, MCP & Managed Agents