Engineering of AI Systems · HIT

Week 9 · Part V · LLMOps & AgentOps

LLM Foundations: AI APIs, Tokens & the Token Economy

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Tools this week

OpenAI SDK · Anthropic SDK · tokenizer · JSON Schema / structured outputs · AWS Bedrock console

🎓 Lecture · 2 hours

0:00-0:10 · 10 min · Recap & objectives
  • Retrieval practice from week 8: data drift versus concept drift.
  • Today: language models enter the system. No prior LLM experience assumed.
0:10-0:25 · 15 min · Motivation: the model you did not train
  • Most AI features today are built on a large model behind an API: you control neither weights nor updates.
  • The engineering questions shift: not 'how was it trained' but cost, latency, reliability, and correctness of a dependency.
  • Live teaser: the same prompt, run twice, gives two different answers; today explains why and what to do about it.
0:25-0:50 · 25 min · LLMs for engineers
  • Next-token prediction over a context window: the one sentence that explains most behaviour.
  • Tokens and tokenization: the unit of cost, latency, and length limits; live tokenizer demo on Hebrew and English text.
  • The API call anatomy: system prompt, user messages, temperature, stop conditions, max tokens.
  • Sampling and nondeterminism: why temperature 0 reduces but does not eliminate variance.
  • Context windows: what fits, what falls out, and what 'lost in the middle' means.
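Optional instructor sketch for the sampling bullet: a toy next-token sampler showing how temperature reshapes the distribution the model samples from. The token scores are made up for illustration and no real model or SDK is involved. Hosted endpoints can still vary at temperature 0 (for example through batching and floating-point effects), which this toy does not model.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick one token from a {token: logit} dict at the given temperature."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(logits, key=logits.get)
    # Divide logits by the temperature, then apply a softmax.
    # Subtracting the maximum keeps exp() numerically stable.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Hypothetical next-token scores after the prompt "The invoice total is".
logits = {"42": 2.0, "forty-two": 1.5, "unknown": 0.5}
rng = random.Random(0)

greedy = {sample_token(logits, 0, rng) for _ in range(20)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(200)}

print("temperature 0:", greedy)            # only the top-scoring token
print("temperature 1:", sorted(sampled))   # several distinct tokens appear
```

Lower temperatures sharpen the distribution toward the top token and higher ones flatten it, which is the variance students will measure in the practice session.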
0:50-1:10 · 20 min · From free text to engineering-grade outputs
  • Why parsing free text is a trap: format drift breaks parsers silently.
  • Structured outputs: JSON schemas and function signatures; the model fills a contract (week 3's lesson, again).
  • Prompt patterns that matter in production: role, constraints, examples, output schema.
  • Failure modes catalogued: hallucination, prompt sensitivity, instruction conflicts; live demo on the invoice use case.
1:10-1:20 · 10 min · Break
1:20-1:40 · 20 min · The token economy
  • Pricing per input and output token; input is usually the volume driver in RAG-shaped systems.
  • Estimating a feature's cost: tokens per request times requests per day; the spreadsheet every team fills today.
  • The cost levers in order: shorter context, prompt caching, batching, cheaper tier, then and only then a different model.
  • Rate limits and quotas; retries with backoff, and why naive retries double the bill.
  • Latency anatomy: time-to-first-token versus tokens-per-second; what streaming changes for UX.
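Optional instructor sketch of the spreadsheet arithmetic above: tokens per request times requests per day, priced separately for input and output. All prices and token counts are assumed placeholder values, not any provider's actual rates; substitute the numbers from the current price list.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical prices in USD per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_request(input_tokens, output_tokens, pricing):
    """Cost of one call: input and output tokens are billed at different rates."""
    return (input_tokens * pricing.input_per_mtok
            + output_tokens * pricing.output_per_mtok) / 1_000_000


def daily_cost(input_tokens, output_tokens, requests_per_day, pricing):
    """Tokens per request times requests per day."""
    return cost_per_request(input_tokens, output_tokens, pricing) * requests_per_day


# Assumed values for illustration only.
pricing = Pricing(input_per_mtok=1.00, output_per_mtok=4.00)
base = daily_cost(3000, 200, 10_000, pricing)      # long retrieved context
shorter = daily_cost(1500, 200, 10_000, pricing)   # same feature, half the context

print(f"baseline:        ${base:.2f}/day")      # $38.00/day
print(f"shorter context: ${shorter:.2f}/day")   # $23.00/day
```

With these assumed numbers the input side accounts for most of the bill, which is why shortening the context is the first lever on the list.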
1:40-1:55 · 15 min · Managed AI services (Bedrock-class)
  • What the platforms bundle: model catalog, guardrails, knowledge bases, agents, evals, provisioned throughput.
  • AWS Bedrock as the archetype; Microsoft Foundry and Vertex AI as the siblings; naming churn as a cautionary tale.
  • Platform versus direct API: governance and one console versus control and portability.
  • Where the gateway pattern (next week) sits between the two.
1:55-2:00 · 5 min · Wrap-up & practice preview
  • Practice makes first API calls, counts real tokens, and prices each team's feature.
Common misconception to confront.

Students often think: An LLM API call behaves like a normal deterministic function.
Set it straight: The same input can produce different outputs (sampling), the contract is tokens rather than characters, and the model behind the endpoint can change. Engineering around it needs structured outputs, retries with backoff, pinned model versions, and evaluation.
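Optional instructor sketch of "retries with backoff": a bounded retry loop with exponential delay and jitter. `RateLimitError` and `flaky_call` are hypothetical stand-ins, not any SDK's real classes; in practice the wrapper would catch whichever rate-limit exception the chosen client raises. Capping the number of attempts also caps the worst-case spend, since every retried request may be billed again.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the rate-limit error a real SDK would raise."""


def call_with_backoff(call, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Run call(); on a rate-limit error wait, then retry up to max_attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff with jitter: 0.5 s, 1 s, 2 s, ... capped at
            # max_delay, each scaled by a random factor so clients do not
            # all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())


# Hypothetical endpoint that is rate-limited twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


waits = []  # record the waits instead of sleeping, to keep the demo instant
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, attempts["count"], len(waits))  # ok 3 2
```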

Check for understanding (pose during the concept blocks; let students answer before revealing).
Q: Why does a longer context cost more and respond slower?
A: Billing is per token on both input and output, and attention computation grows with sequence length, so every extra token costs money and latency.
Q: Name two levers that cut an LLM feature's cost before switching models.
A: Shorten the prompt and context, and cache repeated calls; batching requests and routing easy queries to a cheaper tier come next.
Key takeaways.

📚 Reading & resources

💻 Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10 · 10 min · Setup & recap
  • API keys issued through the course proxy; never raw keys in code.
  • Recap: tokens, temperature, structured outputs.
0:10-0:35 · 25 min · First calls, measured
  • First API calls (OpenAI or Anthropic SDK): system prompt, user prompt, temperature.
  • Count tokens with the tokenizer; log latency and computed cost per call.
  • Same prompt at temperature 0 and 1, five runs each; see the variance with your own eyes.
0:35-1:00 · 25 min · Structured extraction (document-processing use case)
  • Define a JSON schema for invoice fields; the model fills the contract.
  • Feed a malformed document; handle the refusal/failure path explicitly.
  • Validate the output against the schema at the boundary, like any API response (week 3).
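Optional instructor sketch of validating at the boundary, using only the standard library. The invoice fields, `parse_invoice`, and `ExtractionError` are hypothetical names chosen for illustration; a team would more likely hand the same contract to a schema library such as jsonschema or Pydantic. The point is that a refusal or malformed reply becomes an explicit, typed failure rather than a silent parser break.

```python
import json

# Hypothetical invoice contract: field name -> accepted Python type(s).
INVOICE_FIELDS = {
    "invoice_number": str,
    "vendor": str,
    "total": (int, float),
    "currency": str,
}


class ExtractionError(ValueError):
    """Raised when the model's reply does not satisfy the contract."""


def parse_invoice(raw_reply):
    """Parse and validate a model reply; raise ExtractionError on any violation."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        # Covers refusals and any other free-text reply.
        raise ExtractionError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ExtractionError("expected a JSON object")
    for field, expected in INVOICE_FIELDS.items():
        if field not in data:
            raise ExtractionError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ExtractionError(f"wrong type for field: {field}")
    extra = set(data) - set(INVOICE_FIELDS)
    if extra:
        raise ExtractionError(f"unexpected fields: {sorted(extra)}")
    return data


good_reply = ('{"invoice_number": "INV-001", "vendor": "Acme Ltd", '
              '"total": 1250.5, "currency": "ILS"}')
refusal_reply = "I'm sorry, I cannot read this document."

for reply in (good_reply, refusal_reply):
    try:
        print("accepted:", parse_invoice(reply))
    except ExtractionError as err:
        print("rejected:", err)
```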
1:00-1:10 · 10 min · Break
1:10-1:35 · 25 min · Tiers and the bill
  • Run the same extraction on a flagship and a mini model; compare quality, latency, and cost in a table.
  • Fill the token-economy spreadsheet for the project's LLM feature at 10k requests/day.
  • Find each team's biggest cost lever; usually it is context length.
1:35-1:50 · 15 min · Managed-platform tour
  • Instructor-led tour of a Bedrock-class console: catalog, guardrails, knowledge bases, evals.
  • Map each console feature to the open-source counterpart the course teaches.
1:50-2:00 · 10 min · Project-integration brief
  • The 'Project integration' card: the project's LLM feature wired with structured outputs and a costed token-economy model.
Common pitfalls to pre-empt.

Project integration (this week)

