Engineering of AI Systems · HIT

Week 5   Part III · DataOps   🎤 Student Project Presentation 1 · Specification

Data Lakes, Pipelines & Versioning

Instructor lesson plan: lecture (2 h) and practice (2 h).

Learning objectives

Tools this week

Airflow / Dagster · DVC · Parquet / Delta · object-storage lake

🎓 Lecture · 2 hours

0:00-0:10 · 10 min · Recap & objectives
  • Retrieval: canary versus blue-green; what RED measures.
  • Today: the data side begins; presentations this afternoon.
0:10-0:25 · 15 min · Motivation: garbage in, production out
  • Most ML incidents trace to data, not the model: a schema change upstream, a silent duplication, a timezone shift.
  • Data needs an architecture, a refinery, and version control, exactly like code.
  • The project bucket from week 2 becomes a real data lake today.
0:25-0:50 · 25 min · From warehouse to lake to lakehouse
  • The warehouse: structured, governed, expensive; the lake: cheap object storage, schema-on-read, easily a swamp.
  • The lakehouse: lake storage plus table semantics (Delta, Iceberg): ACID writes, schema enforcement, time travel.
  • Why object storage from week 2 is the substrate for all of it.
  • One-slide tour of the engines above the formats: Spark for heavy transforms, dbt for SQL-shaped ones.
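
The three table semantics named above (all-or-nothing writes, schema enforcement, time travel) can be illustrated with a toy in-memory table. This is only a sketch of the ideas: `ToyTable` and its methods are hypothetical names invented here, not the Delta or Iceberg API, and a real table format persists its commit log in object storage rather than in a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a lakehouse table: a schema plus an append-only commit log."""
    schema: dict                                   # column name -> required Python type
    commits: list = field(default_factory=list)    # each commit is a list of rows

    def append(self, rows):
        """Validate every row first, then commit all rows or none."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"schema mismatch: {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise TypeError(f"{col} must be {typ.__name__}")
        self.commits.append(list(rows))
        return len(self.commits)                   # version number after this commit

    def read(self, version=None):
        """Read the table as of a version (time travel); default is the latest."""
        upto = len(self.commits) if version is None else version
        return [row for commit in self.commits[:upto] for row in commit]


table = ToyTable(schema={"device_id": str, "temp_c": float})
v1 = table.append([{"device_id": "a", "temp_c": 21.5}])
v2 = table.append([{"device_id": "b", "temp_c": 19.0}])

rejected = None
try:
    # One bad row rejects the whole batch, so no partial write lands.
    table.append([{"device_id": "c", "temp_c": 20.0},
                  {"device_id": "d", "temp_c": "hot"}])
except TypeError as err:
    rejected = str(err)

print(rejected)
print(len(table.read()))             # rows at the latest version
print(len(table.read(version=v1)))   # rows as of the first commit
```

A plain lake has none of these guarantees by default: a file with the wrong columns lands next to the good ones, which is one route to the swamp.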
0:50-1:10 · 20 min · The medallion architecture
  • Bronze: raw, immutable, as-ingested; the audit trail of what actually arrived.
  • Silver: cleaned, de-duplicated, conformed; the trustworthy middle.
  • Gold: aggregated, business- and feature-ready; what models and dashboards consume.
  • Ownership and quality expectations per layer; why 'clean it later' becomes 'trust it never'.
  • Board work: the IoT use case as bronze sensor events, silver de-duplicated readings, gold hourly features per device.
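
The board-work example can be sketched in a few lines of plain Python. The field names (`event_id`, `device`, `ts`, `temp_c`), the sample events, and the choice to drop unparseable readings in silver are all assumptions made for illustration; a real pipeline would read and write Parquet or Delta tables in the lake rather than Python lists.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Bronze: raw events exactly as they arrived, including a duplicate and a bad reading.
bronze = [
    {"event_id": "e1", "device": "dev-1", "ts": "2024-03-01T10:05:00+00:00", "temp_c": "21.0"},
    {"event_id": "e1", "device": "dev-1", "ts": "2024-03-01T10:05:00+00:00", "temp_c": "21.0"},
    {"event_id": "e2", "device": "dev-1", "ts": "2024-03-01T10:40:00+00:00", "temp_c": "23.0"},
    {"event_id": "e3", "device": "dev-2", "ts": "2024-03-01T10:15:00+00:00", "temp_c": "n/a"},
    {"event_id": "e4", "device": "dev-2", "ts": "2024-03-01T11:20:00+00:00", "temp_c": "18.5"},
]


def to_silver(events):
    """Silver: de-duplicate on event_id, parse types, drop unparseable readings."""
    seen, readings = set(), []
    for event in events:
        if event["event_id"] in seen:
            continue
        try:
            temp = float(event["temp_c"])
        except ValueError:
            continue  # assumption: bad readings are dropped, not repaired
        seen.add(event["event_id"])
        ts = datetime.fromisoformat(event["ts"]).astimezone(timezone.utc)
        readings.append({"event_id": event["event_id"], "device": event["device"],
                         "ts": ts, "temp_c": temp})
    return readings


def to_gold(readings):
    """Gold: hourly features per device (reading count and mean temperature)."""
    buckets = defaultdict(list)
    for r in readings:
        hour = r["ts"].replace(minute=0, second=0, microsecond=0)
        buckets[(r["device"], hour)].append(r["temp_c"])
    return {key: {"n": len(vals), "mean_temp_c": sum(vals) / len(vals)}
            for key, vals in sorted(buckets.items())}


silver = to_silver(bronze)
gold = to_gold(silver)
for (device, hour), features in gold.items():
    print(device, hour.isoformat(), features)
```

Bronze is never modified: both functions return new collections, so the audit trail of what actually arrived stays intact and silver or gold can be rebuilt from it at any time.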
1:10-1:20 · 10 min · Break
1:20-1:40 · 20 min · Pipelines: orchestration done right
  • DAGs, scheduling, and dependencies in Airflow or Dagster.
  • Idempotency: re-running a task must converge, not double-write.
  • Retries, backfills, and parameterised run dates; never hardcode 'today'.
  • Batch versus streaming, previewed; week 6 goes deeper.
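
One common way to get idempotency is to make each run own exactly one partition, keyed by its run date, and overwrite that partition on every execution. The sketch below is orchestrator-agnostic and uses a temporary local directory in place of the lake; `load_partition`, the `run_date=` directory layout, and the JSON file format are illustrative assumptions, not Airflow or Dagster APIs.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def load_partition(lake_root: Path, run_date: date, rows: list) -> Path:
    """Write the rows for one run date; a re-run replaces the partition instead of appending."""
    part_dir = lake_root / f"run_date={run_date.isoformat()}"
    part_dir.mkdir(parents=True, exist_ok=True)
    target = part_dir / "data.json"
    tmp = part_dir / "data.json.tmp"
    tmp.write_text(json.dumps(rows))
    # Rename over the old file so a reader sees the old or the new version,
    # not a half-written one (on a local filesystem; object stores differ).
    tmp.replace(target)
    return target


lake = Path(tempfile.mkdtemp())
rows = [{"device": "dev-1", "temp_c": 21.0}]
run_date = date(2024, 3, 1)   # supplied by the scheduler, never date.today()

first = load_partition(lake, run_date, rows)
second = load_partition(lake, run_date, rows)   # simulated retry or backfill

print(json.loads(second.read_text()))
print(sorted(p.name for p in lake.iterdir()))
```

Because the run date is a parameter, a backfill for last Tuesday writes last Tuesday's partition, and a retry of the same date converges on the same single file. An append-based version of this task would hold two copies of every row after the second call.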
1:40-1:55 · 15 min · Data versioning & lineage
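  • DVC and lakeFS: immutable snapshots. DVC keeps small pointer files in Git; lakeFS keeps Git-like commits and branches over the object store. Either way, the data stays in the lake.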
  • Reproducing a result means pinning code SHA and data version together; week 7 builds on exactly this.
  • Lineage: tracing a gold number back through silver and bronze to the source.
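
The idea of pinning code and data together can be shown without any tool at all. In this sketch a data version is simply the SHA-256 hash of the snapshot bytes, and a run manifest records it next to the code SHA. The function names, the manifest layout, and the placeholder commit `abc1234` are assumptions for illustration; DVC and lakeFS use their own formats and commands.

```python
import hashlib
import json


def data_version(payload: bytes) -> str:
    """Content hash: the same bytes always yield the same version id."""
    return hashlib.sha256(payload).hexdigest()


def run_manifest(code_sha: str, payload: bytes) -> dict:
    """Record which code and which data produced a result."""
    return {"code_sha": code_sha, "data_version": data_version(payload)}


snapshot_v1 = b"device,temp_c\ndev-1,21.0\n"
snapshot_v2 = b"device,temp_c\ndev-1,21.0\ndev-2,18.5\n"

manifest = run_manifest("abc1234", snapshot_v1)   # placeholder commit SHA
print(json.dumps(manifest, indent=2))

# Any change to the data yields a different version id.
print(data_version(snapshot_v1) == data_version(snapshot_v2))
```

Only the small manifest needs to live in Git; the snapshot itself stays in object storage, addressed by its hash. A result is reproducible when both fields of its manifest can be checked out again.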
1:55-2:00 · 5 min · Wrap-up & Student Project Presentation logistics
  • Final reminders for Student Project Presentation 1: running order, timing, and what the rubric rewards.
Common misconception to confront.

Students often think: A data lake is a dump; structure can be added later for free.
Set it straight: Without explicit zones and contracts the lake becomes a swamp nobody trusts. The medallion architecture makes refinement stages, quality expectations, and ownership explicit from the first byte.

Check for understanding (pose during the concept blocks; let students answer before revealing).
What lives in each medallion layer?
Bronze: raw, immutable, as-ingested events. Silver: cleaned, de-duplicated, conformed records. Gold: aggregated, business- or feature-ready tables that models and dashboards consume.
Why must an orchestrated task be idempotent?
Retries and backfills re-execute tasks. A non-idempotent task double-writes or corrupts state when it runs twice.
Key takeaways.
Common pitfalls to pre-empt.

📚 Reading & resources

🎤 Student Project Presentation · 2 hours

The full two-hour practice slot is given over to student project presentations (Student Project Presentation 1 · Specification). There is no instructor-prepared material: teams present and defend their work to the class, with peer and instructor questions after each talk. Each team has 12 to 15 minutes plus questions, and submits a short written report and a tagged release of the repository.

What each team presents.

See the running-project brief for the full milestone description and the grading weight.

Project integration (this week)

Curated references · Project brief
