| 0:00-0:10 | 10 min | Recap & objectives
- Retrieval: the three medallion layers; why idempotency.
- Today: keeping the data trustworthy, and keeping it fresh.
|
| 0:10-0:25 | 15 min | Motivation: the silent schema change
- Story: an upstream team renames a column on a Tuesday; the model degrades for three weeks with zero errors thrown.
- Quality failures are silent by default; loud is something you must engineer.
- Some use cases also cannot wait for tonight's batch: the freshness half of today.
|
| 0:25-0:50 | 25 min | Data quality, engineered
- Validation as code (Great Expectations): schema, ranges, nullity, uniqueness, referential checks.
- Where to validate: at the bronze-to-silver boundary, the gate you control.
- Fail loudly versus quarantine: when each is right; the quarantine table pattern.
- Quality SLAs: freshness, completeness, validity, with numbers and owners.
- Profiling first: you cannot write expectations for data you have not looked at.
|
| 0:50-1:10 | 20 min | Data contracts
- The contract: schema, semantics, freshness, quality, and a change process the producer guarantees.
- Enforcement at the boundary, automatically; a contract nobody checks is a wish.
- Consumer-driven contracts: the consumer's tests run against the producer's changes.
- Board work: draft the contract for the project's main source, live.
|
| 1:10-1:20 | 10 min | Break |
| 1:20-1:40 | 20 min | Streaming: Kafka in one lecture
- Topics, partitions, producers, consumers, consumer groups; the log as the core abstraction.
- Delivery semantics: at-least-once as the default reality; idempotent consumers (week 5's lesson returns).
- When streaming beats batch: value decays faster than the batch period (sensor alarms, fraud); when batch wins: simplicity and cost.
- The IoT use case end to end: device, topic, bronze landing, silver conformance.
|
| 1:40-1:55 | 15 min | Feature stores & train/serve skew
- Offline features for training, online features for serving; one definition, two materialisations.
- Point-in-time correctness: the subtle leak that inflates offline metrics.
- Train/serve skew: a transformation that exists only in the training notebook; why it is silent and how the feature store removes the duplication.
|
| 1:55-2:00 | 5 min | Wrap-up & practice preview: the practice session gates the project's real data and lands a live stream in the lake. |
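The delivery-semantics point in the streaming block (at-least-once delivery, idempotent consumers) can be made concrete with a small sketch. This is not Kafka client code: `Message`, `BronzeTable`, and `land` are hypothetical stand-ins, and the only assumption carried over from Kafka is that a partition and offset together identify a record within a topic. Keying the write on that pair makes a redelivered batch harmless.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    # Hypothetical message shape; a real Kafka record also carries topic, key, timestamp.
    partition: int
    offset: int
    payload: dict


@dataclass
class BronzeTable:
    """In-memory stand-in for a bronze landing table keyed by (partition, offset)."""
    rows: dict = field(default_factory=dict)

    def upsert(self, msg: Message) -> bool:
        """Write the message; return True only if the key was not already present."""
        key = (msg.partition, msg.offset)
        is_new = key not in self.rows
        self.rows[key] = msg.payload  # rewriting the same key leaves the table unchanged
        return is_new


def land(messages, table):
    """Land a batch and return how many rows were genuinely new."""
    return sum(table.upsert(m) for m in messages)


batch = [
    Message(0, 0, {"device": "a", "temp": 21.5}),
    Message(0, 1, {"device": "a", "temp": 21.7}),
]
table = BronzeTable()
first = land(batch, table)
# Simulate at-least-once redelivery: the same batch arrives again after a restart.
second = land(batch, table)
print(first, second, len(table.rows))
```

The second call adds nothing, which is the property week 5 asked of batch jobs, now applied to a consumer.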
Common misconception to confront.
Students often think: Data validation is a one-time cleaning step.
Set it straight: Quality is a continuous contract checked on every run. Upstream producers keep changing, so validation must guard the boundary forever, not once.
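A minimal sketch of what "checked on every run" can look like, written in plain Python as a stand-in and not using the Great Expectations API. The column names (`device_id`, `temp_c`), the temperature range, and the `max_bad_fraction` threshold are all assumptions chosen for illustration. Invalid rows are routed to quarantine, and the run fails loudly once too large a share of the batch is bad.

```python
def check_row(row):
    """Return the names of the expectations this row violates (empty means valid)."""
    failures = []
    if row.get("device_id") is None:
        failures.append("device_id_not_null")
    temp = row.get("temp_c")
    # Assumed plausible sensor range; a real suite would take this from profiling.
    if not isinstance(temp, (int, float)) or not -40 <= temp <= 85:
        failures.append("temp_c_in_range")
    return failures


def gate(rows, max_bad_fraction=0.5):
    """Split rows into (good, quarantined); raise if the bad share exceeds the limit."""
    good, quarantined = [], []
    for row in rows:
        failures = check_row(row)
        if failures:
            # Keep the row and record why it failed, so it can be inspected later.
            quarantined.append({**row, "_failures": failures})
        else:
            good.append(row)
    bad_fraction = len(quarantined) / len(rows) if rows else 0.0
    if bad_fraction > max_bad_fraction:
        raise ValueError(f"gate failed: {bad_fraction:.0%} of rows invalid")
    return good, quarantined


rows = [
    {"device_id": "a1", "temp_c": 21.5},
    {"device_id": None, "temp_c": 22.0},  # violates the not-null expectation
    {"device_id": "a2", "temp_c": 19.0},
    {"device_id": "a3", "temp_c": 20.5},
]
good, quarantined = gate(rows)
print(len(good), len(quarantined))

# Break the data on purpose: every row is bad, so the gate should fail the run.
broken = [{"device_id": None, "temp_c": 999.0} for _ in range(3)]
try:
    gate(broken)
    gate_failed = False
except ValueError as exc:
    gate_failed = True
    print(exc)
```

Because the gate runs at the boundary on every batch, an upstream change shows up on the first run after it ships, not weeks later.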
In the practice session the instructor demonstrates the tooling live and teaches the hands-on topics that belong at the keyboard. There are no separate weekly labs: each session closes with the project-integration brief, the increment every team adds to its end-to-end system before next week.
| 0:00-0:10 | 10 min | Setup & recap
- Open the project's bronze data from week 5.
- Recap: boundaries, contracts, the log.
|
| 0:10-0:35 | 25 min | Profile, then gate
- Profile the project's bronze data live; find the real dirt (nulls, duplicates, ranges).
- Write Great Expectations suites at the bronze-to-silver boundary.
- Break the data on purpose; watch the gate fail loudly and route rows to quarantine.
|
| 0:35-1:00 | 25 min | The contract, in force
- Write the one-page data contract for the project's main source.
- Wire the contract checks into the pipeline so a violation fails the run.
- Simulate the upstream rename from the lecture story; watch it get caught in seconds, not weeks.
|
| 1:00-1:10 | 10 min | Break |
| 1:10-1:35 | 25 min | A stream into the lake
- Stand up single-broker Kafka; create the topic.
- Produce a simulated sensor stream (IoT teams: your real feed); consume and land it in bronze.
- Run the silver transform on top; the medallion now has a live inlet.
|
| 1:35-1:50 | 15 min | Students drive
- Each team gets its gate, contract, and (where applicable) stream running on its own data.
- Instructor circulates on expectation tuning.
|
| 1:50-2:00 | 10 min | Project-integration brief. The 'Project integration' card: gates, contract, and the corpus or stream landed; the data side is now Presentation-2 ready. |
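For the "contract, in force" block, one way to wire a contract check into a pipeline so that a violation fails the run is sketched here. The contract is reduced to its schema part only, and everything in it is hypothetical: the `CONTRACT` mapping, the column names, and the `ContractViolation` exception are illustrative, not taken from any project or library. The example replays the upstream rename from the lecture story.

```python
# Hypothetical schema section of a one-page contract: column name -> expected type.
CONTRACT = {"device_id": str, "temp_c": float, "recorded_at": str}


class ContractViolation(Exception):
    """Raised when incoming data does not match the agreed contract."""


def check_contract(rows, contract=CONTRACT):
    """Fail on the first row whose columns or types differ from the contract."""
    for i, row in enumerate(rows):
        missing = sorted(set(contract) - set(row))
        unexpected = sorted(set(row) - set(contract))
        if missing or unexpected:
            raise ContractViolation(
                f"row {i}: missing={missing}, unexpected={unexpected}"
            )
        for column, expected_type in contract.items():
            if not isinstance(row[column], expected_type):
                raise ContractViolation(
                    f"row {i}: {column} is not {expected_type.__name__}"
                )


ok_batch = [
    {"device_id": "a1", "temp_c": 21.5, "recorded_at": "2024-01-01T00:00:00Z"},
]
check_contract(ok_batch)  # passes silently

# The upstream team renames temp_c to temperature without telling anyone.
renamed_batch = [
    {"device_id": "a1", "temperature": 21.5, "recorded_at": "2024-01-01T00:00:00Z"},
]
try:
    check_contract(renamed_batch)
    caught = None
except ContractViolation as exc:
    caught = str(exc)
print(caught)
```

Calling `check_contract` as the first step of the bronze-to-silver job means the exception stops the run, which is what turns the rename from a three-week silent degradation into a failure on the first batch.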