Engineering of AI Systems · HIT

Week 6 · Part III · DataOps

Data Quality, Contracts, Streaming & Feature Stores

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Tools this week

Great Expectations · Kafka · Feast (concept) · quarantine tables

🎓 Lecture · 2 hours

0:00-0:10 · 10 min · Recap & objectives
  • Retrieval: the three medallion layers (bronze, silver, gold); why idempotency matters.
  • Today: keeping the data trustworthy, and keeping it fresh.
0:10-0:25 · 15 min · Motivation: the silent schema change
  • Story: an upstream team renames a column on a Tuesday; the model degrades for three weeks with zero errors thrown.
  • Quality failures are silent by default; loud is something you must engineer.
  • Some use cases also cannot wait for tonight's batch: the freshness half of today.
0:25-0:50 · 25 min · Data quality, engineered
  • Validation as code (Great Expectations): schema, ranges, nullity, uniqueness, referential checks.
  • Where to validate: at the bronze-to-silver boundary, the gate you control.
  • Fail loudly versus quarantine: when each is right; the quarantine table pattern.
  • Quality SLAs: freshness, completeness, validity, with numbers and owners.
  • Profiling first: you cannot write expectations for data you have not looked at.
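The gate-and-quarantine pattern above can be sketched in plain Python. This is a minimal illustration and not the Great Expectations API: the `Expectation` class, the `gate` function, the column names, and the 5% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Expectation:
    name: str
    check: Callable[[dict], bool]  # returns True when the row passes


# Hypothetical schema and row-level expectations for a sensor reading.
REQUIRED_COLUMNS = {"device_id", "temperature_c", "ts"}
EXPECTATIONS = [
    Expectation("device_id_not_null", lambda r: r["device_id"] is not None),
    Expectation(
        "temperature_in_range",
        lambda r: isinstance(r["temperature_c"], (int, float))
        and -40.0 <= r["temperature_c"] <= 85.0,
    ),
]


class GateFailure(Exception):
    """Raised when the batch as a whole cannot be trusted."""


def gate(rows, max_bad_fraction=0.05):
    """Split a bronze batch into (rows for silver, rows for quarantine)."""
    # A schema problem affects the whole batch, so fail loudly.
    for row in rows:
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            raise GateFailure(f"missing columns: {sorted(missing)}")

    good, quarantine = [], []
    for row in rows:
        failed = [e.name for e in EXPECTATIONS if not e.check(row)]
        if failed:
            # Keep the row and record why it was rejected.
            quarantine.append({**row, "_failed": failed})
        else:
            good.append(row)

    # A few bad rows are quarantined; too many means something broke upstream.
    if rows and len(quarantine) / len(rows) > max_bad_fraction:
        raise GateFailure(
            f"{len(quarantine)} of {len(rows)} rows failed, above the threshold"
        )
    return good, quarantine
```

The design choice to notice is the split: a missing column or an unusually high failure rate stops the run, while isolated bad rows are set aside with a reason attached so the rest of the batch can proceed.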
0:50-1:10 · 20 min · Data contracts
  • The contract: schema, semantics, freshness, quality, and a change process the producer guarantees.
  • Enforcement at the boundary, automatically; a contract nobody checks is a wish.
  • Consumer-driven contracts: the consumer's tests run against the producer's changes.
  • Board work: draft the contract for the project's main source, live.
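A contract that is checked automatically can be written as data plus one function. The sketch below is a hedged illustration only: the `Contract` fields, the `violations` function, and the sensor column names are hypothetical, and a real project would pick its own fields during the board work.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Contract:
    source: str
    owner: str
    columns: dict            # column name -> expected Python type
    max_staleness: timedelta
    max_null_fraction: float


def violations(contract, rows, newest_event, now):
    """Return a list of human-readable contract violations (empty means OK)."""
    problems = []
    seen = set().union(*(r.keys() for r in rows))
    missing = set(contract.columns) - seen
    extra = seen - set(contract.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")

    for col, typ in contract.columns.items():
        if col in missing:
            continue
        values = [r.get(col) for r in rows]
        nulls = sum(v is None for v in values)
        if rows and nulls / len(rows) > contract.max_null_fraction:
            problems.append(f"{col}: too many nulls")
        if any(v is not None and not isinstance(v, typ) for v in values):
            problems.append(f"{col}: expected type {typ.__name__}")

    if now - newest_event > contract.max_staleness:
        problems.append("data is staler than the contract allows")
    return problems


# Hypothetical contract for a sensor source.
contract = Contract(
    source="sensor_readings",
    owner="device-team",
    columns={"device_id": str, "temperature_c": float},
    max_staleness=timedelta(hours=1),
    max_null_fraction=0.0,
)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
renamed = [{"device_id": "a1", "temp_c": 21.5}]  # upstream renamed a column
print(violations(contract, renamed, now - timedelta(minutes=5), now))
```

With the renamed column, the check reports both the missing `temperature_c` and the unexpected `temp_c`, which is the kind of signal the silent schema change in the motivating story never produced.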
1:10-1:20 · 10 min · Break
1:20-1:40 · 20 min · Streaming: Kafka in one lecture
  • Topics, partitions, producers, consumers, consumer groups; the log as the core abstraction.
  • Delivery semantics: at-least-once as the default reality; idempotent consumers (week 5's lesson returns).
  • Streaming beats batch when the value of the data decays faster than the batch period (sensor alarms, fraud); batch wins on simplicity and cost.
  • The IoT use case end to end: device, topic, bronze landing, silver conformance.
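At-least-once delivery and the idempotent consumer can be shown without a broker. The sketch below simulates the log in memory; it is not Kafka client code, and `partition_for`, `produce`, `consume`, and `IdempotentSink` are hypothetical names. The CRC32 hash is only a stand-in for whatever partitioner a real client uses.

```python
import zlib


def partition_for(key, num_partitions):
    # A stable hash sends every record with the same key to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(log, key, value, num_partitions):
    """Append a record to the partition chosen by its key."""
    log.setdefault(partition_for(key, num_partitions), []).append(value)


class IdempotentSink:
    """Keyed by (partition, offset), so a redelivered record overwrites itself."""

    def __init__(self):
        self.rows = {}

    def write(self, partition, offset, value):
        self.rows[(partition, offset)] = value  # upsert, not append


def consume(log, sink, committed, crash_before_commit=False):
    """Process each partition from its committed offset, then commit.

    `committed` maps partition -> next offset to read. A crash after
    processing but before committing causes redelivery on the next run.
    """
    for partition, records in log.items():
        start = committed.get(partition, 0)
        for offset in range(start, len(records)):
            sink.write(partition, offset, records[offset])
        if crash_before_commit:
            return  # work was done, but the offset was never committed
        committed[partition] = len(records)


log, sink, committed = {}, IdempotentSink(), {}
for i in range(6):
    produce(log, f"device-{i % 3}", {"reading": i}, num_partitions=2)

consume(log, sink, committed, crash_before_commit=True)  # first attempt dies
consume(log, sink, committed)                            # retry redelivers
print(len(sink.rows))  # number of stored records after the retry
```

Because the sink is keyed by partition and offset, the records processed twice are stored once. An append-only sink in the same scenario would hold duplicates, which is why at-least-once delivery pushes the idempotency requirement onto the consumer.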
1:40-1:55 · 15 min · Feature stores & train/serve skew
  • Offline features for training, online features for serving; one definition, two materialisations.
  • Point-in-time correctness: the subtle leak that inflates offline metrics.
  • Train/serve skew: a transformation that exists only in the training notebook; why it is silent and how the feature store removes the duplication.
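Point-in-time correctness is easiest to see on a tiny example. The sketch below is an assumption-laden illustration, not the Feast API: `point_in_time_value`, the `avg_spend` feature, and the integer timestamps are all made up for the demonstration.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest feature value known at time `as_of`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Values with a timestamp later than `as_of` are invisible.
    """
    timestamps = [t for t, _ in history]
    i = bisect_right(timestamps, as_of)
    return history[i - 1][1] if i else None


def naive_latest_value(history):
    """The leaky version: always takes the newest value, ignoring event time."""
    return history[-1][1] if history else None


# Hypothetical feature history for one customer: (timestamp, avg_spend).
avg_spend = [(1, 10.0), (5, 50.0), (9, 400.0)]
label_time = 6  # the labelled event happened at t=6

correct = point_in_time_value(avg_spend, label_time)
leaky = naive_latest_value(avg_spend)
print(correct, leaky)  # prints: 50.0 400.0
```

The leaky lookup hands the training row a value computed at t=9, after the event it is supposed to predict. At serving time that future value does not exist yet, so the offline metric looks better than the model really is. Using the same lookup function for training rows and for serving (with `as_of` set to the current time) is one way to keep a single definition behind both materialisations.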
1:55-2:00 · 5 min · Wrap-up & practice preview
  • Practice gates the project's real data and lands a live stream in the lake.
Common misconception to confront.

Students often think: Data validation is a one-time cleaning step.
Set it straight: Quality is a continuous contract checked on every run. Upstream producers keep changing, so validation must guard the boundary forever, not once.

Check for understanding (pose during the concept blocks; let students answer before revealing).
What is train/serve skew and one cause?
The features seen at serving differ from those at training, e.g. a transformation applied only in the training notebook, or time leakage in offline features.
When does streaming genuinely beat batch?
When the value of the data decays faster than the batch period: fraud signals, sensor alarms, live personalisation. If a nightly aggregate is fine, batch is simpler and cheaper.
Key takeaways.

📚 Reading & resources

💻 Practice · 2 hours

In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.

0:00-0:10 · 10 min · Setup & recap
  • Open the project's bronze data from week 5.
  • Recap: boundaries, contracts, the log.
0:10-0:35 · 25 min · Profile, then gate
  • Profile the project's bronze data live; find the real dirt (nulls, duplicates, ranges).
  • Write Great Expectations suites at the bronze-to-silver boundary.
  • Break the data on purpose; watch the gate fail loudly and route rows to quarantine.
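Before any expectations are written, a quick profile shows what the data actually looks like. The sketch below is one possible minimal profiler in plain Python; the `profile` function, its report layout, and the sample rows are assumptions for illustration, and teams may prefer a dataframe library or the profiling built into their validation tool.

```python
from collections import Counter


def profile(rows, key):
    """Summarise nulls, duplicate keys, and numeric ranges for a list of dicts."""
    key_counts = Counter(r.get(key) for r in rows)
    report = {
        "row_count": len(rows),
        # Rows beyond the first for each repeated key value.
        "duplicate_keys": sum(c - 1 for c in key_counts.values() if c > 1),
        "columns": {},
    }
    columns = sorted(set().union(*(r.keys() for r in rows)))
    for col in columns:
        values = [r.get(col) for r in rows]
        present = [v for v in values if v is not None]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        report["columns"][col] = {
            "null_fraction": (len(values) - len(present)) / len(values),
            "distinct": len(set(present)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return report


# Hypothetical bronze sample with the usual dirt.
sample = [
    {"reading_id": 1, "device_id": "a", "temperature_c": 21.5},
    {"reading_id": 2, "device_id": None, "temperature_c": 22.0},
    {"reading_id": 2, "device_id": "b", "temperature_c": 999.0},
    {"reading_id": 3, "device_id": "b", "temperature_c": None},
]
report = profile(sample, key="reading_id")
print(report["duplicate_keys"], report["columns"]["temperature_c"])
```

Each finding maps to an expectation: the duplicate `reading_id` suggests a uniqueness check, the null `device_id` a nullity check, and the 999.0 reading a range check.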
0:35-1:00 · 25 min · The contract, in force
  • Write the one-page data contract for the project's main source.
  • Wire the contract checks into the pipeline so a violation fails the run.
  • Simulate the upstream rename from the lecture story; watch it get caught in seconds, not weeks.
1:00-1:10 · 10 min · Break
1:10-1:35 · 25 min · A stream into the lake
  • Stand up single-broker Kafka; create the topic.
  • Produce a simulated sensor stream (IoT teams: your real feed); consume and land it in bronze.
  • Run the silver transform on top; the medallion now has a live inlet.
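The landing and transform steps can be sketched with the standard library alone. This is a hedged outline, not the session's actual code: the directory layout, the `land_batch` and `silver_transform` names, and the record fields are assumptions, and the consumed batch is passed in as a plain list rather than read from a Kafka client.

```python
import json
from pathlib import Path


def land_batch(bronze_dir, topic, partition, first_offset, records):
    """Write one consumed batch to bronze as JSON Lines.

    The file name is derived from partition and first offset, so landing
    the same batch again overwrites the file instead of duplicating rows.
    """
    path = (
        Path(bronze_dir) / topic / f"partition={partition}"
        / f"offset={first_offset:012d}.jsonl"
    )
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def silver_transform(bronze_dir, topic):
    """Read every bronze file for the topic and conform the rows."""
    rows = []
    for path in sorted((Path(bronze_dir) / topic).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                raw = json.loads(line)
                # Skip rows that cannot be conformed; a real pipeline
                # would route them to quarantine instead.
                if raw.get("device_id") is None or raw.get("temperature_c") is None:
                    continue
                rows.append({
                    "device_id": str(raw["device_id"]),
                    "temperature_c": float(raw["temperature_c"]),
                })
    return rows
```

Bronze keeps the records exactly as they arrived, including the bad ones. Typing and filtering happen only in the silver step, so a transform bug can be fixed and rerun without consuming the stream again.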
1:35-1:50 · 15 min · Students drive
  • Each team gets its gate, contract, and (where applicable) stream running on its own data.
  • Instructor circulates to help teams tune their expectations.
1:50-2:00 · 10 min · Project-integration brief
  • The 'Project integration' card: gates, contract, and the corpus or stream landed; the data side is now Presentation-2 ready.
Common pitfalls to pre-empt.

Project integration (this week)

