AI 320 · HIT · Advanced Course · Year 3 · 13 Weeks
AI Systems Engineering
DevOps, DataOps, MLOps, LLMOps, AgentOps: one running project, end to end
Department submission documents in the HIT form, on the official letterhead (Word, downloadable):
Syllabus (English) · Syllabus (Hebrew) · Rationale · Catalogue summary
This is the advanced course on building and running AI systems in production, taken after a first machine-learning course. Modern AI systems fail far more often in operations than in modelling: a model that scores well offline still has to be packaged, served, version-controlled, fed with trustworthy data, monitored for drift, secured, costed, and governed once real users depend on it. The course teaches the engineering discipline that surrounds the model: the practices, tooling, and architectures that take a prototype into a reliable, observable, continuously improving production service. The emphasis is on building and operating, not on watching.
No prior cloud or LLM experience is assumed. Students arrive with basic machine learning and some software engineering; the course builds the cloud foundations (compute, storage, networking, deployment models) in week 2 and the LLM and AI-API foundations (tokens, prompts, structured outputs, the token economy) in week 9, before operating on either.
Rationale. Every AI specialization (vision, language, agents) eventually meets the same operational stack. This course provides that shared base for deploying and running AI in the real world. It is project-based and designed for the way students will actually work, with an AI coding assistant at hand, while keeping the learning genuine through a Build, Operate, and Review model.
Each week is 4 contact hours: a 2-hour lecture (concepts and architecture) and a 2-hour practice session that carries its own material. In the practice session the instructor demonstrates the tooling live and teaches hands-on topics that belong at the keyboard rather than on slides (cloud consoles, Kafka, gateways, managed AI platforms, MCP). There are no separate weekly labs: each practice session closes with a project-integration brief that defines the increment every team adds to its end-to-end system that week. The thirteen weeks form six parts: Foundations and the Cloud (weeks 1 to 2), DevOps (3 to 4), DataOps (5 to 6), MLOps (7 to 8), LLMOps and AgentOps (9 to 12), and Security and Governance (13). The running project threads through all of it and is presented to the class three times, in the Student Project Presentations: a specification (week 5), an interim review (week 8), and a final production demo (week 13).
By the end of the course, students will be able to:
The course assumes a prior machine-learning course and the background below. Each row links a short refresher and a set of self-check questions, so readiness can be confirmed before Week 1.
| Subject | Background topics | Material |
|---|---|---|
| 🔧 Programming & version control | Python, Git and pull requests, unit testing, code review, reading a stack trace | Review · Self-check |
| ⚙️ Operating systems & networking | Processes, the Linux shell, filesystems, HTTP and REST, TCP/IP, an intuition for containers | Review · Self-check |
| 🧮 Machine learning | Training and evaluation, train/test split, loss versus metric, overfitting, a trained model as an artifact | Review · Self-check |
Each week lists two sets of topics: what the lecture covers on the board, and what the practice session teaches at the keyboard. In the highlighted weeks the practice slot is a Student Project Presentation: fully student-run, no instructor material.
| Wk | Topic | Materials |
|---|---|---|
| Part I · Foundations & the Cloud | | |
| 1 | Production Engineering & the Ops Landscape. Lecture: Why systems fail in operations; SLOs and error budgets; the SRE mindset; the five layers; the course use cases (IoT, chatbot, document processing). Practice: Git workflow and repo hygiene; containerise a first service; form project teams. Tools: Git · GitHub · Docker | Lesson plan · Practice |
| 2 | Cloud Computing Fundamentals. Lecture: Compute, storage, and networking primitives; IaaS/PaaS/SaaS and serverless; regions and availability zones; shared responsibility; the cost model. Practice: Budget alerts first; launch a VM and the project bucket; the same app in three deployment models; price your use case. Tools: cloud free tier · VM · object storage (S3) · serverless · cost calculator | Lesson plan · Practice |
| Part II · DevOps | | |
| 3 | CI/CD, Testing & REST Services. Lecture: CI/CD pipelines; the testing pyramid; trunk-based development; infrastructure as code; REST API design and versioning. Practice: The project's FastAPI skeleton with tests; a GitHub Actions pipeline (lint, test, build, push); required checks gate the merge. Tools: FastAPI · pytest · GitHub Actions · container registry · Terraform (intro) | Lesson plan · Practice |
| 4 | Orchestration, Deployment Patterns & Observability. Lecture: Kubernetes primitives; blue-green and canary rollouts; GitOps; logs, metrics, traces; RED and tail latency. Practice: Deploy the project to a local cluster; self-healing and scaling; a canary rollout; the project RED dashboard under load. Tools: Kubernetes (kind) · kubectl · Prometheus · Grafana · Argo CD (demo) | Lesson plan · Practice |
| Part III · DataOps | | |
| 5 | Data Lakes, Pipelines & Versioning. Lecture: Data lake, lakehouse, and the medallion architecture (bronze, silver, gold); DAG orchestration, idempotency, backfills; data versioning and lineage. 🎤 Student Project Presentation 1 · Specification. Tools: Airflow / Dagster · DVC · Parquet / Delta · the object-storage lake | Lesson plan · Presentation brief |
| 6 | Data Quality, Contracts, Streaming & Feature Stores. Lecture: Validation and quality SLAs; data contracts; Kafka and streaming versus batch; feature stores and train/serve skew. Practice: Profile and gate the project's real data; write and enforce its data contract; stand up Kafka and land a stream in the lake. Tools: Great Expectations · Kafka · Feast (concept) | Lesson plan · Practice |
| Part IV · MLOps | | |
| 7 | Experiment Tracking, Model Registry & Serving. Lecture: Experiment tracking and reproducibility; the model registry as catalogue; model cards; serving patterns and safe rollout (shadow, canary, A/B). Practice: Instrument the project's training with MLflow; register, promote, roll back; serve behind the project API; canary model v2 next to v1. Tools: MLflow (tracking + registry) · BentoML / FastAPI · model cards | Lesson plan · Practice |
| 8 | Monitoring, Model Drift & Governance. Lecture: Data drift versus concept drift; detectors and proxy monitoring; retraining triggers; audit trails and model governance. 🎤 Student Project Presentation 2 · Interim. Tools: Evidently · PSI / KS tests · registry stage gates | Lesson plan · Presentation brief |
| Part V · LLMOps & AgentOps | | |
| 9 | LLM Foundations: AI APIs, Tokens & the Token Economy. Lecture: LLMs from an engineering standpoint; tokens and context windows; prompts and structured outputs; pricing, cost levers, rate limits; managed AI platforms (Bedrock-class). Practice: First API calls; count tokens, measure latency and cost; structured extraction from invoices; flagship versus mini tiers; a managed-platform console tour. Tools: OpenAI & Anthropic SDKs · tokenizer · JSON Schema · Bedrock console | Lesson plan · Practice |
| 10 | RAG & Serving LLMs: Vector Databases & Gateways. Lecture: Embeddings, vector databases, chunking, grounded prompts; prompts as versioned code; hosted versus self-hosted serving (vLLM); the gateway pattern. Practice: Build the project's RAG core on its own corpus; tune chunking; LiteLLM gateway with fallback and budgets; prompt version swaps. Tools: FAISS / Qdrant · embedding models · LiteLLM · vLLM (discussed) | Lesson plan · Practice |
| 11 | LLM Evaluation, Guardrails & Observability. Lecture: Eval sets as regression suites; faithfulness and relevance; LLM-as-judge biases and calibration; LLM tracing (Langfuse); guardrails and prompt injection. Practice: Build the project's eval harness and wire it into CI; trace every call through the gateway; attack with injection, add the guardrail, measure the cache. Tools: Ragas · Langfuse · NeMo Guardrails · response cache | Lesson plan · Practice |
| 12 | Agents & AgentOps: Tools, MCP & Managed Agents. Lecture: The agent loop and function calling; memory and termination; MCP as the tool standard; step-level tracing and evaluation; bounds; managed agent services. Practice: Build a two-tool agent and trace it; localise a failure in the spans; add caps, budgets, and an error classifier; wrap a project tool as an MCP server. Tools: function calling · MCP SDK · step tracing · managed-agent consoles | Lesson plan · Practice |
| Part VI · Security & Governance | | |
| 13 | Security, Governance & Synthesis. Lecture: Supply-chain security and SBOMs; the OWASP Top 10 for LLM applications; privacy and audit trails; synthesis of the five layers. 🎤 Student Project Presentation 3 · Final (with oral defense). Tools: secrets manager · pip-audit / SBOM · OWASP LLM checklist | Lesson plan · Presentation brief |
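Week 8 lists PSI among the drift detectors. As a rough illustration of what that check computes, the sketch below implements the Population Stability Index for one numeric feature using only the standard library. The function name `psi`, the choice of ten equal-width bins, and the `eps` floor for empty bins are assumptions made for this example, not course-provided code; the course tooling (Evidently) may bin and smooth differently.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to 1.0 if it is constant.
    width = (hi - lo) / bins or 1.0

    def fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so live values outside the reference range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A commonly quoted rule of thumb reads a PSI below 0.1 as stable and above 0.25 as a significant shift, but any threshold used as a retraining trigger should be calibrated on the project's own data.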
Using an AI assistant is highly encouraged in this course; it reflects how production engineering is really done. Two conditions keep the learning genuine: students keep full ownership of, and responsibility for, everything they submit, and must be able to explain and defend any part of it. The Operate, Review, and oral-defense steps verify understanding rather than authorship; where an assistant was used, it should be disclosed.
Every weekly project increment follows a three-part model:
Build. Stand up the week's increment (a pipeline stage, a service, a monitor, an eval); an AI assistant may be used freely.
Operate. Deploy it, instrument it, then break it on purpose. Predict what the telemetry will show, run the experiment, and compare.
Review. Explain the design and its trade-offs, where it would fail, and what you would change; be ready to defend any line at the presentations.
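One way to make the predict-and-compare part of the Operate step concrete is to write the prediction down as a number and check it against the SLO's error budget after the experiment. The sketch below assumes a simple request-based availability SLO; the function names and the report fields are hypothetical, chosen for illustration only.

```python
def error_budget_report(slo_target: float, total_requests: int,
                        failed_requests: int) -> dict:
    """Compare observed failures with what an availability SLO allows."""
    # A 99% target over 10,000 requests allows roughly 100 failed requests.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        consumed = float("inf")  # a 100% target leaves no budget at all
    return {
        "allowed_failures": allowed_failures,
        "observed_error_rate": failed_requests / total_requests,
        "budget_consumed": consumed,  # 1.0 means the budget is exactly spent
        "within_budget": failed_requests <= allowed_failures,
    }


def prediction_gap(predicted_error_rate: float,
                   observed_error_rate: float) -> float:
    """Signed gap between the observation and the prediction made beforehand."""
    return observed_error_rate - predicted_error_rate
```

For example, a team that predicts a 1% error rate while one replica is killed, and then observes 250 failures in 10,000 requests against a 99% SLO, would see the budget overspent and a positive prediction gap, both of which are worth explaining in the Review step.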
Grading is project-based, with weight on the parts an AI assistant cannot do for the student: operating a system under load, interpreting telemetry, and defending design decisions. The running project is the single deliverable, built up in weekly increments and graded at the three Student Project Presentations. There are no written exams and no separate lab submissions.
| Component | What it covers | Weight |
|---|---|---|
| Project · Specification | Student Project Presentation 1: problem, SLOs, architecture, data and DevOps plan (week 5). | 20% |
| Project · Interim | Student Project Presentation 2: working pipeline, registry, deployment, live CI/CD (week 8). | 30% |
| Project · Final | Student Project Presentation 3: end-to-end production demo with a short oral defense (week 13). | 50% |
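The weights in the table combine into a final grade as a plain weighted sum. The sketch below assumes each presentation is scored on a 0 to 100 scale, which the table does not state; the dictionary keys and the function name are illustrative.

```python
# Weights taken from the grading table above.
WEIGHTS = {"specification": 0.20, "interim": 0.30, "final": 0.50}


def course_grade(scores: dict) -> float:
    """Weighted sum of the three presentation scores (assumed 0-100 each)."""
    # Sanity check: the three components must account for the whole grade.
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Because half of the weight sits on the final presentation, a weak specification can still be recovered later in the semester.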
Teams carry a single AI-enabled service through the entire operational stack across the semester, integrating most of the covered material into one end-to-end system: cloud footprint, CI/CD, medallion data lake, model registry and serving, gateway-fronted LLM feature, evals and guardrails, an agentic capability, and the audit trail through all of it. Each week's practice session ends with a project-integration brief that specifies the increment due before the next week; the three Student Project Presentations are the graded checkpoints. Teams choose one of the course use cases: IoT telemetry (sensor streams, anomaly alerts, predictive maintenance), a document-QA chatbot (RAG over a real corpus), or document processing (structured extraction at scale). Alternatively, they may propose another domain in the same spirit for approval. Whatever the domain, teams must demonstrate operational maturity, not a working model alone. Each presentation is 12 to 15 minutes plus questions, with a short written report and a tagged release of the repository.
Each idea below exercises every layer of the stack; teams may take one as-is or propose a variant of comparable scope. The three primary use cases (IoT, chatbot, document processing) are the safest choices; the rest are approved variants.
A curated reading list spanning the five layers. Individual lesson plans link the chapters and resources for that week; the full list is on the references page.
The canonical text on running production systems: SLOs, error budgets, toil, on-call, and incident response.
The foundation for DataOps: storage, replication, batch and stream processing, and reliability of data systems.
End-to-end MLOps and LLMOps: data, training, deployment, monitoring, RAG, evaluation, and inference at scale.
The toolchain mirrors the weekly schedule; everything runs on laptops and cloud free tiers, no GPU required.