Generative AI: From Variational Autoencoders to World Models

Level: Graduate  ·  Duration: 13 weeks, one 3-hour session per week  ·  Credits: 3

Instructor: Dr. Alexander (Sasha) Apartsin  ·  Course text: Building Vision AI: From Pixels to Generative Models [1] (open access online; also available as an ebook on Amazon)

Course Description
This course is an advanced, research-oriented treatment of deep generative modeling, organized around four pillars: variational autoencoders, energy-based and score-based models, diffusion models, and world models. Building directly on a prior graduate course in deep learning, it develops the probabilistic foundations (latent-variable models, the evidence lower bound, score matching) and follows them into the modern generative stack: denoising diffusion and its variational and score-based derivations, the score-SDE framework and sampler design, classifier-free guidance, latent diffusion and diffusion transformers, flow matching, and few-step generation via distillation. The final third of the course turns to world models: latent dynamics models that learn behavior in imagination, large-scale generative world simulators for video, driving, and interactive environments, and predictive-embedding architectures such as JEPA. The centerpiece of the course is a semester-long research project: each team formulates an original research question, designs and runs experiments, and reports results in three milestone presentations (proposal, interim, final) and a documented GitHub repository. Each student leaves the course with a demonstrable, novel, technically deep, research-oriented project added to their portfolio.

Prerequisites

Students are assumed to have completed an advanced graduate course in deep learning. In particular, the following are treated as known and are not retaught: backpropagation and optimization (SGD, Adam, learning-rate schedules); regularization; CNNs, RNNs, and Transformers; training and debugging deep networks at scale; and fluency in Python and PyTorch. Students missing the deep-learning prerequisite should first work independently through Part III of the course text [1] (Chapters 18 to 22, Neural Networks through Vision Transformers) and Appendix A.

Learning Outcomes

On completing the course, students will be able to:

Course Format

The course consists of lectures and student presentations. Ten sessions are lectures on the week's topic, with the listed readings to be read before class. The remaining three sessions (Weeks 5, 8, and 13) are dedicated entirely to student presentations of the research projects: proposal, interim, and final. In all three, teams present and receive in-class feedback from the instructor and peers.

Research Project

The project is the core deliverable of the course and is explicitly research-oriented: the goal is a novel, defensible empirical or methodological contribution. Students work in teams of two. Each team is required to: (i) formulate a novel research problem connected to the course material, positioned against related work and not already answered in the literature; (ii) source suitable datasets or generate synthetic data where no adequate dataset exists, with the data collection or generation methodology documented and justified; and (iii) design and run controlled experiments that answer the problem with evidence. Significant novelty is a hard requirement and is assessed at the proposal stage; projects re-implementing an existing system or reproducing a published result without a new question will not be approved.

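Milestones, all presented in class: the project proposal (Week 5), the interim progress report (Week 8), and the final presentation (Week 13).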

Grading

Component | Weight | Due
Proposal presentation | 10% | Week 5
Interim presentation | 20% | Week 8
Final presentation | 20% | Week 13
Project repository: code, text, and documentation (GitHub) | 50% | One week after Week 13
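
The weights above sum to 100%. As a minimal sketch of how they could combine into a single course grade, the snippet below assumes each component is scored on a 0 to 100 scale and that the final grade is a plain weighted sum. Both are assumptions; the syllabus states only the weights. The names WEIGHTS and course_grade are hypothetical.

```python
# Component weights taken from the grading table.
WEIGHTS = {
    "proposal": 0.10,    # Week 5
    "interim": 0.20,     # Week 8
    "final": 0.20,       # Week 13
    "repository": 0.50,  # one week after Week 13
}


def course_grade(scores):
    """Weighted sum of component scores (assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Illustrative scores only, not real data.
example = {"proposal": 90, "interim": 80, "final": 85, "repository": 88}
grade = course_grade(example)
print(f"course grade: {grade:.1f}")
```

Because the repository carries half the weight, a team's documented code and write-up moves the result as much as all three presentations combined.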

Weekly Schedule

Each week lists the specific chapter sections of the course text [1] to read before class. Weeks 5, 8, and 13 are presentation weeks and have no assigned readings.

Week | Topic and readings
1 Foundations of Deep Generative Modeling
What it means to model a data distribution: the generative-model landscape (autoregressive models, normalizing flows, GANs, VAEs, energy-based, score-based, and diffusion models, world models) and the trade-offs among likelihood, sample quality, and inference speed; maximum likelihood and divergences; why evaluating generative models is hard (likelihood, FID, precision-recall). Course overview and project kickoff.
Textbook [1]: Ch. 30, Foundations of Generative Modeling. Sections: 30.1 Generative vs Discriminative: What Does It Mean to Model p(x)?; 30.2 A Map of Generative Families: VAE, GAN, Flow, Autoregressive & Diffusion; 30.3 Latent Variables & the Idea of a Latent Space; 30.5 Sampling, Likelihood & the Quality-Diversity-Speed Trilemma; 30.6 Evaluating Generators: A First Look.
2 Variational Autoencoders I: Latent-Variable Models & the ELBO
Latent-variable models and intractable posteriors; variational inference and the evidence lower bound, derived two ways; the reparameterization trick and amortized inference; Gaussian encoders and decoders; implementation patterns and the rate-distortion view of the ELBO.
Textbook [1]: Ch. 31, Autoencoders & Variational Autoencoders. Sections: 31.1 Autoencoders: Compression as Representation; 31.2 Denoising & Sparse Autoencoders; 31.3 The VAE: ELBO, Reparameterization & Amortized Inference.
3 Variational Autoencoders II: Hierarchical, Discrete & Disentangled Latents
Posterior collapse and its mitigations; β-VAE and disentangled representations; hierarchical VAEs (NVAE, very deep VAEs); VQ-VAE and discrete codebook latents; VAEs as infrastructure: the autoencoders inside latent diffusion and the tokenizers inside world models.
Textbook [1]: Ch. 31, Autoencoders & Variational Autoencoders. Sections: 31.4 Disentanglement, beta-VAE & Posterior Collapse; 31.5 Hierarchical VAEs: From Ladder Networks to NVAE; 31.6 Discrete Latents: VQ-VAE & Learned Codebooks. See also Ch. 33, §33.7 (Latent Diffusion) and Ch. 36, §36.5 (World Models) for where these latents are reused.
4 Energy-Based Models & Score Matching
Energy-based models and the partition-function problem; the score function as the object to learn; score matching and denoising score matching; sampling with Langevin dynamics; noise-conditional score networks and annealed Langevin sampling: generative modeling without normalized likelihoods. Proposal clinic: experimental-design checklist for the project proposals.
Textbook [1]: Ch. 30, Foundations of Generative Modeling, Section 30.4 Energy-Based Models, Score Matching & Langevin Dynamics; Ch. 33, Diffusion Models, Section 33.3 The Score-Based View: VE/VP SDEs & the Probability-Flow ODE.
5 Student Presentations I: Project Proposals
Each team presents its research question, related work, method, and experimental design, and receives in-class feedback from the instructor and peers.
6 Denoising Diffusion Probabilistic Models
The forward noising process and the learned reverse process; DDPM as variational inference and its ELBO; the equivalence with denoising score matching; noise schedules and parameterizations (ε-, x₀-, and v-prediction); DDIM and deterministic sampling; U-Net denoiser architecture; diffusion models as deep hierarchical VAEs.
Textbook [1]: Ch. 33, Diffusion Models. Sections: 33.1 Destroying & Rebuilding: The Forward & Reverse Processes; 33.2 DDPM: Noise Schedules, Parameterizations & the Variational View; 33.4 Fast Sampling: DDIM, Solvers & Step Distillation.
7 Score-Based SDEs, Samplers & Guidance
The continuous-time view that unifies score-based and diffusion models: variance-exploding and variance-preserving SDEs, the reverse-time SDE, and the probability-flow ODE with exact likelihoods; sampler design (predictor-corrector, higher-order ODE solvers, the EDM design space); conditional generation via classifier guidance and classifier-free guidance; diffusion priors for inverse problems.
Textbook [1]: Ch. 33, Diffusion Models. Sections: 33.3 The Score-Based View: VE/VP SDEs & the Probability-Flow ODE; 33.4 Fast Sampling: DDIM, Solvers & Step Distillation; 33.6 Guidance: Classifier & Classifier-Free.
8 Student Presentations II: Interim Progress
Each team presents first experimental results, diagnosis of what is and is not working, deviations from the proposal, and the plan for the final stretch, and receives in-class feedback from the instructor and peers.
9 Diffusion at Scale: Latent Diffusion, Transformers & Flow Matching
Latent diffusion and text-to-image systems; conditioning via cross-attention; diffusion transformers (DiT) and what scales; flow matching and rectified flow as the modern training recipe; few-step generation via distillation and consistency models; evaluating large-scale image and video generators.
Textbook [1]: Ch. 33, Diffusion Models, Sections 33.5 Flow Matching, Rectified Flow & Consistency Models and 33.7 Latent Diffusion: Compress First, Then Diffuse; Ch. 34, Text-to-Image Systems, Sections 34.1 Connecting Text & Pixels: CLIP & Text Encoders, 34.2 Inside Stable Diffusion: VAE, U-Net, DiT & Conditioning, 34.3 The Model Landscape: DALL-E, Imagen, Midjourney & FLUX, 34.4 Autoregressive & Token-Based Image Generation; Ch. 35, Controllable Generation & Image Editing, Sections 35.1 Spatial Control: ControlNet & Conditioning Adapters and 35.2 Personalization: LoRA, DreamBooth & Textual Inversion.
10 World Models I: Latent Dynamics & Learning in Imagination
What a world model is and why agents need one; the VAE+RNN world model of Ha and Schmidhuber; latent dynamics models and the recurrent state-space model; planning with learned models (PlaNet); learning behaviors inside the dream: the Dreamer line through DreamerV3.
Textbook [1]: Ch. 36, Video, 3D Generation & World Models, Section 36.5 World Models: Latent Dynamics, RSSM & Learning in Imagination.
11 World Models II: Generative World Simulators
Scaling world models with generative video: action-conditioned video prediction; driving world models (GAIA-1); foundation world models and interactive environments (Genie, Genie 2, UniSim, Cosmos); diffusion world models for agents (DIAMOND); the "video generation as world simulation" question (Sora) and what generative simulators can and cannot do.
Textbook [1]: Ch. 36, Video, 3D Generation & World Models, Sections 36.1 Video Diffusion: Architectures & Temporal Consistency, 36.2 Text-to-Video Systems: Sora-Class Models & the Open Ecosystem, 36.6 Generative World Simulators: From GAIA-1 to Interactive Environments; Ch. 26, Video Understanding, Sections 26.1 From Frames to Clips: The Temporal Dimension and 26.3 Video Transformers.
12 World Models III: Predictive Representations, JEPA & Research Frontiers
Generative versus predictive-embedding world models; joint-embedding predictive architectures: I-JEPA, V-JEPA, and V-JEPA 2 for understanding, prediction, and planning; decoder-free latent world models for control (TD-MPC2); evaluating world models: physical consistency, controllability, long-horizon coherence; open problems: memory, 3D and physics grounding, and world models for embodied agents and reasoning. Course synthesis and final-presentation clinic.
Textbook [1]: Ch. 36, Video, 3D Generation & World Models, Sections 36.7 Predictive World Models: JEPA & Decoder-Free Latents and 36.8 Evaluating World Models: Physical Consistency, Controllability & Coherence; Ch. 25, Self-Supervised Learning & Vision Foundation Models, Sections 25.1 Pretext Tasks: Learning Without Labels and 25.3 Self-Distillation & Masked Image Modeling: DINO & MAE; Ch. 27, Depth, 3D Vision & Neural Scene Representations, Sections 27.4 NeRF: Neural Radiance Fields and 27.5 3D Gaussian Splatting; Ch. 37, Evaluation, Safety & Generative Data Engines, Sections 37.1 Measuring Image Quality: FID, KID, Precision-Recall & CLIPScore and 37.2 Human Evaluation & Preference Studies.
13 Student Presentations III: Final Project Presentations
Conference-style final talks with in-class feedback: contribution, method, experiments, results, and limitations. Project repositories (code, text, and documentation) due one week later.
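
As a preview of the Week 6 material, the forward noising process has a closed form: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps drawn from a standard normal and abar_t the running product of (1 - beta_s). The sketch below is a minimal pure-Python illustration, not course-provided code. The linear schedule from 1e-4 to 0.02 over 1000 steps is an assumed, commonly used default, and the function names (linear_beta_schedule, cumulative_alphas, q_sample) are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed default values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """abar_t = prod_{s<=t} (1 - beta_s): the signal fraction remaining at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 0-based step index.

    Returns the noised vector and the noise eps that produced it, since eps
    is the regression target under epsilon-prediction.
    """
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)  # fixed seed so the run is repeatable

x0 = [1.0, -0.5, 0.25]  # toy "data point", illustrative only
xt, eps = q_sample(x0, t=499, alpha_bars=alpha_bars, rng=rng)
print("signal fraction at the last step:", alpha_bars[-1])
```

The closed form matters for training because any step t can be sampled directly from x_0 without simulating the chain. By the last step almost no signal remains, so x_T is close to pure Gaussian noise.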

Policies

Use of AI tools

Using LLMs and coding agents in your project work is encouraged and is itself a skill the course develops. Two rules apply. First, significant novelty: whatever tools are used, the submitted work must constitute a significant novel contribution by the team, in the problem formulation, the experimental design, and the findings; work whose substance could be produced by a single prompt to an off-the-shelf model does not meet the bar. Second, accountability: you are fully responsible for the correctness of every claim, number, and citation you submit, regardless of which tool produced it. Hallucinated references or unverified AI-generated results are treated as academic integrity violations.

Collaboration and integrity

Discussion across teams is encouraged; code, experiments, and writing must be the team's own. All experimental results reported in milestones and the repository documentation must be backed by runnable artifacts in the team's repository.


References

[1] A. Apartsin and Y. Aperstein, Building Vision AI: From Pixels to Generative Models, 2nd ed., 2026. Open access online at https://visionbook.apartsin.com; also available as an ebook at Amazon: https://www.amazon.com/dp/B0H5BT8Y75. Part IV (Generative Vision Models, Chapters 30 to 38) is the primary text for the course; Chapters 25, 26, and 27 support the self-supervised and world-models material.

Supplementary books and resources

[2] S. J. D. Prince, Understanding Deep Learning. MIT Press, 2023. Open access at https://udlbook.github.io/udlbook/. Chapters 17 (VAEs) and 18 (diffusion models).

[3] K. P. Murphy, Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023. Open access at https://probml.github.io/pml-book/book2.html. Part IV (Generation) covers VAEs, energy-based models, and diffusion.

[4] J. M. Tomczak, Deep Generative Modeling, 2nd ed. Springer, 2024. Author page: https://jmtomczak.github.io.

[5] Stanford CS236, Deep Generative Models: lecture notes and slides at https://deepgenerativemodels.github.io/.

[6] Y. Song, Generative Modeling by Estimating Gradients of the Data Distribution (tutorial): https://yang-song.net/blog/2021/score/.

[7] L. Weng, What Are Diffusion Models? (tutorial): https://lilianweng.github.io/posts/2021-07-11-diffusion-models/.

[8] Google DeepMind, Genie 2: A Large-Scale Foundation World Model (technical blog): https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/.