Week 9 Part V · LLMOps & AgentOps

LLM Foundations: AI APIs, Tokens & the Token Economy

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Explain what an LLM does, from an engineering standpoint.
Call an AI API well: prompts, parameters, structured outputs.
Reason about tokens, context windows, pricing, and model tiers.
Survey the managed AI services (Bedrock-class platforms).

Tools this week

OpenAI SDKAnthropic SDKtokenizerJSON Schema / structured outputsAWS Bedrock console

🎓Lecture · 2 hours

0:00-0:10	10 min	Recap & objectives Retrieval: data versus concept drift. Today: language models enter the system. No prior LLM experience assumed.
0:10-0:25	15 min	Motivation: the model you did not train Most AI features today are built on a large model behind an API: you control neither weights nor updates. The engineering questions shift: not 'how was it trained' but cost, latency, reliability, and correctness of a dependency. Live teaser: the same prompt, run twice, gives two different answers; today explains why and what to do about it.
0:25-0:50	25 min	LLMs for engineers Next-token prediction over a context window: the one sentence that explains most behaviour. Tokens and tokenization: the unit of cost, latency, and length limits; live tokenizer demo on Hebrew and English text. The API call anatomy: system prompt, user messages, temperature, stop conditions, max tokens. Sampling and nondeterminism: why temperature 0 reduces but does not eliminate variance. Context windows: what fits, what falls out, and what 'lost in the middle' means.
0:50-1:10	20 min	From free text to engineering-grade outputs Why parsing free text is a trap: format drift breaks parsers silently. Structured outputs: JSON schemas and function signatures; the model fills a contract (week 3's lesson, again). Prompt patterns that matter in production: role, constraints, examples, output schema. Failure modes catalogued: hallucination, prompt sensitivity, instruction conflicts; live demo on the invoice use case.
1:10-1:20	10 min	Break
1:20-1:40	20 min	The token economy Pricing per input and output token; input is usually the volume driver in RAG-shaped systems. Estimating a feature's cost: tokens per request times requests per day; the spreadsheet every team fills today. The cost levers in order: shorter context, prompt caching, batching, cheaper tier, then and only then a different model. Rate limits and quotas; retries with backoff, and why naive retries double the bill. Latency anatomy: time-to-first-token versus tokens-per-second; what streaming changes for UX.
1:40-1:55	15 min	Managed AI services (Bedrock-class) What the platforms bundle: model catalog, guardrails, knowledge bases, agents, evals, provisioned throughput. AWS Bedrock as the archetype; Microsoft Foundry and Vertex AI as the siblings; naming churn as a cautionary tale. Platform versus direct API: governance and one console versus control and portability. Where the gateway pattern (next week) sits between the two.
1:55-2:00	5 min	Wrap-up & practice previewPractice makes first API calls, counts real tokens, and prices each team's feature.

Common misconception to confront.

Students often think: An LLM API call behaves like a normal deterministic function.
Set it straight: The same input can produce different outputs (sampling), the contract is tokens rather than characters, and the model behind the endpoint can change. Engineering around it needs structured outputs, retries with backoff, pinned model versions, and evaluation.

Check for understanding (pose during the concept blocks; let students answer before revealing).

Why does a longer context cost more and respond slower?

Billing is per token on both input and output, and attention computation grows with sequence length, so every extra token costs money and latency.

Name two levers that cut an LLM feature's cost before switching models.

Shorten the prompt and context, cache repeated calls; batching requests and routing easy queries to a cheaper tier are next.

Key takeaways.

The model API is the new runtime; cost and latency are token-driven.
Structured outputs turn LLM calls into engineering-grade components.
Managed platforms bundle catalog, guardrails, and RAG; direct APIs trade that for control.

📚Reading & resources

AI Engineering, ch. 2 and ch. 5 Huyen; understanding foundation models, and prompt engineering as an engineering practice.
OpenAI: structured outputs guide JSON-schema outputs; the week's core engineering pattern.
Anthropic: prompt engineering overview The vendor-neutral lessons transfer; read the structure, not just the tips.
OpenAI tokenizer Paste your own text; build the token intuition the cost model needs.
AWS Bedrock The managed-platform archetype toured in practice: catalog, guardrails, knowledge bases, agents.

💻Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10	10 min	Setup & recap API keys issued through the course proxy; never raw keys in code. Recap: tokens, temperature, structured outputs.
0:10-0:35	25 min	First calls, measured First API calls (OpenAI or Anthropic SDK): system prompt, user prompt, temperature. Count tokens with the tokenizer; log latency and computed cost per call. Same prompt at temperature 0 and 1, five runs each; see the variance with your own eyes.
0:35-1:00	25 min	Structured extraction (document-processing use case) Define a JSON schema for invoice fields; the model fills the contract. Feed a malformed document; handle the refusal/failure path explicitly. Validate the output against the schema at the boundary, like any API response (week 3).
1:00-1:10	10 min	Break
1:10-1:35	25 min	Tiers and the bill Run the same extraction on a flagship and a mini model; compare quality, latency, and cost in a table. Fill the token-economy spreadsheet for the project's LLM feature at 10k requests/day. Find each team's biggest cost lever; usually it is context length.
1:35-1:50	15 min	Managed-platform tour Instructor-led tour of a Bedrock-class console: catalog, guardrails, knowledge bases, evals. Map each console feature to the open-source counterpart the course teaches.
1:50-2:00	10 min	Project-integration briefThe 'Project integration' card: the project's LLM feature wired with structured outputs and a costed token-economy model.

Common pitfalls to pre-empt.

Never hardcode API keys; use environment configuration and the course proxy.
Pin the model version: silent model updates change behaviour under you.

Project integration (this week)

Wire the project's LLM feature through the course proxy with structured outputs and schema validation.
Produce the feature's token-economy model: cost per request and per month at target load, with the chosen tier justified.
Pin the model version in configuration; add the retry-with-backoff wrapper.

Curated references Project brief