Energy-Based World Models vs. Transformer-Based Generation: Two Competing Visions for Machine Intelligence

artifocial · March 25, 2026 · 18 min read

Trends of Energy-based World Models

W13 Trend Tutorial · Advanced (ML Practitioner) · March 2026

Research Area: World Models

Companion Notebooks — released progressively, links activated as each goes live:

| # | Notebook | Focus | Compute |
| --- | --- | --- | --- |
| 00 | 00_lewm_toy_world_model.ipynb | Toy JEPA world model from scratch: encoder, predictor, SIGReg regularization | CPU only |
| 01 | 01_jepa_latent_dynamics_planning.ipynb | Latent dynamics and planning: CEM, MPC, speed benchmarks | CPU only |

1. Why This Matters Now

Two days ago (March 23, 2026), a team led by Yann LeCun released LeWorldModel (LeWM) — the first JEPA that trains stably end-to-end from raw pixels, using only two loss terms and ~15M parameters on a single GPU. It plans up to 48× faster than foundation-model-based world models while staying competitive on diverse control tasks.

The day after (March 24), OpenAI killed Sora — shutting down its generative video platform entirely and unwinding a $1 billion deal with Disney, citing unsustainable compute costs relative to other business priorities.

These two events, 24 hours apart, crystallize the deepest architectural split in AI today. One camp — led by LeCun and backed by AMI Labs' $1.03 billion — argues that predicting abstract representations is the right way to model the physical world. The other — led by Fei-Fei Li's World Labs and backed by $1 billion including $200M from Autodesk — generates visual worlds directly.

With Sora dead and LeWM alive, the evidence is shifting. This tutorial unpacks both approaches, introduces LeWorldModel as a breakthrough proof point, and analyzes where the field is heading.


2. The Core Debate: Prediction vs. Generation

The fundamental question: how should an AI system model the physical world?

| Dimension | Energy-Based (JEPA / AMI) | Generative (Transformers / Diffusion) |
| --- | --- | --- |
| Core operation | Predict abstract representations | Generate raw outputs (pixels, tokens) |
| What gets predicted | Latent embeddings (high-level structure) | Every detail of the output (pixel-level) |
| Handling uncertainty | Energy landscape over compatible states | Probability distribution over outputs |
| Training signal | Non-contrastive self-supervised (VICReg, Barlow Twins, SIGReg) | Next-token prediction / denoising |
| Irrelevant details | Discarded by encoder; only structure matters | Must be predicted; every pixel counts |
| Compute efficiency | LeWM: ~15M params, single GPU, hours | Sora: massive compute, so costly OpenAI shut it down |
| Primary output | Representations for reasoning and planning | Rendered content (images, video, text) |

LeCun's core argument is simple: predicting every pixel of a future video frame is both wasteful and brittle. Most pixel-level details are irrelevant to understanding what's happening in a scene. A system that reasons about the world should predict at the level of meaning, not pixels. Sora's shutdown lends this argument real weight — even OpenAI couldn't justify the compute cost of pixel-level world simulation.


3. The Energy-Based Approach: JEPA and AMI

3.1 LeCun's 2022 Blueprint

The intellectual foundation is LeCun's 2022 position paper, A Path Towards Autonomous Machine Intelligence. It proposes a cognitive architecture with six modules: perception, world model, cost, actor, short-term memory, and configurator. The world model — the centerpiece — uses JEPA to learn predictive representations of the environment.

3.2 How JEPA Works

JEPA operates through joint embeddings rather than reconstruction:

  1. Two encoding branches: An encoder maps input $x$ to representation $s_x$; a separate encoder maps target $y$ to $s_y$. The encoders need not be identical.
  2. Prediction in latent space: A predictor module estimates $\hat{s}_y$ from $s_x$, optionally conditioned on a latent variable $z$.
  3. Energy as prediction error: The energy function $E(x, y) = \|s_y - \hat{s}_y\|^2$ measures compatibility. Low energy means $x$ and $y$ are consistent; high energy means they're not.
  4. No pixel-level reconstruction: The system never tries to generate raw pixels. It learns that "a ball thrown upward will come back down" without needing to render every frame.

The latent variable $z$ is critical: it captures the information we cannot predict (stochastic elements, unobserved factors). By minimizing the information content of $z$ during training, the model learns to encode only what's predictable, discarding irrelevant noise.
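The energy computation described above can be sketched in a few lines of NumPy. This is a toy illustration, not code from any JEPA release: the linear encoders, the `tanh` nonlinearity, the additive latent, and all sizes are assumptions made only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_LATENT = 8, 4  # toy sizes, chosen only for illustration

# Hypothetical linear encoders and predictor (random, untrained weights).
W_x = rng.normal(size=(D_LATENT, D_IN))      # context encoder: x -> s_x
W_y = rng.normal(size=(D_LATENT, D_IN))      # target encoder:  y -> s_y
W_p = rng.normal(size=(D_LATENT, D_LATENT))  # predictor: s_x -> s_y_hat


def encode_x(x):
    return np.tanh(W_x @ x)


def encode_y(y):
    return np.tanh(W_y @ y)


def predict(s_x, z=None):
    """Predict the target embedding, optionally shifted by a latent z."""
    s_y_hat = W_p @ s_x
    return s_y_hat if z is None else s_y_hat + z


def energy(x, y, z=None):
    """E(x, y) = ||s_y - s_y_hat||^2, computed entirely in latent space."""
    s_y = encode_y(y)
    s_y_hat = predict(encode_x(x), z)
    return float(np.sum((s_y - s_y_hat) ** 2))


x = rng.normal(size=D_IN)
y = rng.normal(size=D_IN)
e_no_latent = energy(x, y)

# A latent z that exactly explains the residual drives the energy to zero,
# which is why the information content of z has to be limited in training.
z_star = encode_y(y) - predict(encode_x(x))
e_with_latent = energy(x, y, z_star)
```

With an unconstrained latent, any pair `(x, y)` can be made to look compatible, so the energy stops carrying information. That is the practical reason the latent's capacity is restricted.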

3.3 The Collapse Problem (and Why JEPA Has Been Fragile)

The central challenge with JEPA has been representation collapse: the encoder learns to map everything to the same constant vector, making prediction trivially perfect but useless. Previous JEPA implementations avoided this through fragile engineering hacks — stop-gradients, exponential moving averages (EMA), multi-term losses with 6+ hyperparameters, or pre-trained frozen encoders (e.g., DINO features). These hacks worked but made JEPA impractical and difficult to reproduce.

This is what makes LeWorldModel so significant — it eliminates all of these crutches.
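To make the collapse failure concrete, here is a minimal sketch with a deliberately degenerate encoder. The function names and array sizes are hypothetical; the point is only that the prediction loss alone cannot tell a collapsed model from a good one.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 16))       # 64 toy "frames", 16 features each
next_frames = rng.normal(size=(64, 16))  # their (unrelated) successors


def collapsed_encoder(batch):
    """Degenerate encoder: every input maps to the same constant vector."""
    return np.ones((batch.shape[0], 4))


def identity_predictor(latents):
    return latents


pred = identity_predictor(collapsed_encoder(frames))
target = collapsed_encoder(next_frames)

prediction_loss = float(np.mean((pred - target) ** 2))  # zero: prediction is "perfect"
latent_variance = float(target.var(axis=0).sum())       # zero: latents carry no information
```

Both quantities are zero even though the successor frames are pure noise. Any anti-collapse mechanism has to penalize the second quantity, or something like it, in addition to the first.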

3.4 LeWorldModel: The End-to-End JEPA Breakthrough

LeWorldModel (LeWM), released March 23, 2026, by Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero (Mila, NYU, Samsung SAIL, Brown University), is the first JEPA that trains stably end-to-end from raw pixels with no heuristics.

Architecture:

  • Encoder: ViT-Tiny (~5M parameters) — maps each frame observation into a compact, low-dimensional latent representation
  • Predictor: Transformer (~10M parameters) — models environment dynamics in latent space by predicting the next frame's embedding given the current embedding and an action
  • Total: ~15M parameters — trainable on a single GPU in a few hours

The Two-Term Loss:

LeWM uses only two loss terms — down from the six required by the only existing end-to-end alternative:

  1. MSE prediction loss: Standard mean-squared error between predicted and actual next-frame embeddings. This drives the model to learn accurate dynamics.
  2. SIGReg (Sketched Isotropic Gaussian Regularizer): The anti-collapse mechanism. SIGReg enforces that latent embeddings follow an isotropic Gaussian distribution.
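Written out, the two terms combine roughly as follows. The notation is assumed for this sketch rather than taken from the paper: $s_t$ is the encoder output for frame $t$, $\hat{s}_{t+1}$ is the predictor output given $s_t$ and action $a_t$, $Z$ is the batch of embeddings, and $\lambda$ is the regularization weight.

```latex
\mathcal{L}
  = \underbrace{\frac{1}{T} \sum_{t=1}^{T} \left\| \hat{s}_{t+1} - s_{t+1} \right\|^2}_{\text{MSE prediction loss}}
  \; + \; \lambda \, \underbrace{\mathrm{SIGReg}(Z)}_{\text{anti-collapse regularizer}}
```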

How SIGReg prevents collapse:

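SIGReg leverages the Cramér-Wold theorem: two multivariate distributions are equal if and only if all of their one-dimensional projections are equal. Matching every 1D projection of the embeddings to a 1D Gaussian is therefore enough to match the full isotropic Gaussian target. In practice: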

  1. Project the batch of latent embeddings onto $M$ random 1D directions
  2. Apply the Epps-Pulley test statistic to each 1D projection — measuring how far it deviates from a Gaussian
  3. Average the test statistics as the regularization loss

This is elegant because it's scalable (no pairwise covariance matrix needed), stable (statistically grounded via the Cramér-Wold theorem), and simple (one hyperparameter: the regularization weight).
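The three steps can be sketched in NumPy as below. This is an illustrative approximation, not the LeWM implementation: it assumes a closed-form version of the Epps-Pulley distance (empirical characteristic function versus that of a standard normal, with a standard normal weight) and compares projections to N(0, 1) directly without re-standardizing them. The function names and the default of 32 directions are hypothetical.

```python
import numpy as np


def epps_pulley_1d(y):
    """Closed-form Epps-Pulley distance between the empirical characteristic
    function of y and that of N(0, 1), weighted by a standard normal density.
    No re-standardisation is applied, so wrong mean or variance is penalised."""
    n = y.shape[0]
    diffs = y[:, None] - y[None, :]
    pair_term = np.exp(-0.5 * diffs ** 2).sum() / n ** 2
    cross_term = np.sqrt(2.0) * np.exp(-0.25 * y ** 2).sum() / n
    return pair_term - cross_term + 1.0 / np.sqrt(3.0)


def sigreg_sketch(latents, num_directions=32, rng=None):
    """Average the 1D statistic over random unit directions (Cramér-Wold)."""
    if rng is None:
        rng = np.random.default_rng(0)
    dim = latents.shape[1]
    directions = rng.normal(size=(dim, num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    projections = latents @ directions  # shape: (batch, num_directions)
    stats = [epps_pulley_1d(projections[:, m]) for m in range(num_directions)]
    return float(np.mean(stats))


rng = np.random.default_rng(1)
gaussian_latents = rng.normal(size=(512, 16))  # healthy: isotropic Gaussian
collapsed_latents = np.zeros((512, 16))        # collapsed: constant vector

loss_gaussian = sigreg_sketch(gaussian_latents)
loss_collapsed = sigreg_sketch(collapsed_latents)
```

On this toy data the collapsed batch scores far higher than the Gaussian batch, which is the signal the regularizer needs. Note that each 1D statistic here costs O(n²) in the batch size because of the pairwise term; a production version would likely use a cheaper numerical approximation.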

Results:

| Environment | LeWM Performance | Key Comparison |
| --- | --- | --- |
| Push-T (block manipulation) | 96% success rate | Beats DINO-WM (which uses pretrained features + proprioceptive inputs) |
| Reacher (2-joint arm) | Outperforms DINO-WM | Competitive with models using extra input modalities |
| Two-Room (2D navigation) | Strong baseline | Matches specialized methods |
| OGBench-Cube (3D pick-and-place) | Competitive | Strong on 3D control tasks |

Planning speed: <1 second, up to 48× faster than foundation-model-based world models.
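Planning is cheap because every rollout happens in a small latent space rather than in pixel space. A minimal sketch of cross-entropy method (CEM) planning over latent rollouts is shown below. The dynamics function is a hypothetical stand-in for a learned predictor, and the population size, elite count, and horizon are arbitrary choices for illustration, not LeWM's settings.

```python
import numpy as np


def latent_step(s, a):
    """Hypothetical stand-in for a learned latent predictor: s' = s + 0.1 * a."""
    return s + 0.1 * a


def rollout_cost(s0, actions, goal):
    """Roll the predictor forward; cost is squared distance to the goal latent."""
    s = s0
    for a in actions:
        s = latent_step(s, a)
    return float(np.sum((s - goal) ** 2))


def cem_plan(s0, goal, horizon=10, pop=200, n_elite=20, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    dim = s0.shape[0]
    mean = np.zeros((horizon, dim))
    std = np.ones((horizon, dim))
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        candidates = mean + std * rng.normal(size=(pop, horizon, dim))
        costs = np.array([rollout_cost(s0, c, goal) for c in candidates])
        elites = candidates[np.argsort(costs)[:n_elite]]
        # Refit the sampling distribution to the lowest-cost sequences.
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean


start = np.zeros(2)
goal = np.array([0.5, -0.3])
plan = cem_plan(start, goal)
initial_cost = rollout_cost(start, np.zeros((10, 2)), goal)
final_cost = rollout_cost(start, plan, goal)
```

In a model-predictive control (MPC) loop, only the first action of `plan` would be executed before replanning from the next observed frame's embedding.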

Why this matters for the field:

LeWM proves that JEPA can work from raw pixels without crutches. This removes the biggest technical objection to the energy-based approach: that it required pre-trained features or fragile training procedures. With LeWM, you can train a world model from scratch, from pixels, on a single GPU, in hours, and get competitive results. The code is open-source at github.com/lucas-maes/le-wm.

3.5 From I-JEPA to V-JEPA 2: The Scaling Story

Before LeWM simplified the training, Meta's team scaled JEPA through several iterations:

I-JEPA (2023, Meta AI) applied the architecture to images. Given a partially masked image, I-JEPA predicts the representations of missing regions — not the pixels themselves.

V-JEPA (2024, Meta AI) extended this to video. By masking space-time patches and predicting their representations, V-JEPA learns temporal dynamics without ever reconstructing video frames.

V-JEPA 2 (2025, arXiv:2506.09985) scaled to 1.2B parameters trained on 1M+ hours of video with progressive resolution training:

  • 77.3 top-1 accuracy on Something-Something v2 (motion understanding)
  • State-of-the-art 39.7 recall@5 on Epic-Kitchens-100 (action anticipation)
  • 84.0 on PerceptionTest when aligned with an LLM
  • Zero-shot robotic planning: V-JEPA 2-AC deploys on Franka arms for pick-and-place using only 62 hours of unlabeled robot video — no task-specific training or reward

VL-JEPA (arXiv:2512.10942) extends the framework to vision-language, achieving stronger performance than standard VLM training with 50% fewer trainable parameters.

3.6 What AMI Labs Is Building

AMI Labs (March 2026, $1.03B seed at $3.5B pre-money valuation) is led by LeCun (co-founder), with Saining Xie as Chief Science Officer, Pascale Fung as Chief Research & Innovation Officer, and Michael Rabbat as VP of World Models. Xie co-created the Diffusion Transformer architecture behind Sora, an ironic pedigree given Sora's demise.

AMI's systems will train on video, audio, and sensor data — not just text. The goal: world models that understand physical causality (object permanence, gravity, collisions, material properties). LeCun has described this as a long-term scientific project.

The technical bet: JEPA-style architectures, scaled with the engineering insights from LeWM's stable training, will produce representations that enable planning and reasoning in ways generative models cannot.


4. The Generative Approach: World Labs and the Post-Sora Landscape

4.1 The State of Generative World Models After Sora

OpenAI's decision to shut down Sora on March 24 was not a failure of the technology per se: Sora 2 (September 2025) could generate 60-second, physically consistent video. The failure was economic. Running the video model consumed so much compute that it starved other teams, and the revenue opportunity (creative tools) couldn't justify the cost relative to coding tools and enterprise customers.

This confirms a version of LeCun's critique: pixel-level world simulation is computationally expensive in a way that abstract representation prediction is not. LeWM does its planning in <1 second on a single GPU; Sora required massive GPU clusters to generate a single minute of video.

But generative world models aren't dead — they're evolving.

4.2 World Labs and Marble: The Generative Counterargument

World Labs, founded by Fei-Fei Li, represents the strongest remaining case for generative world models.

Marble (generally available since November 2025) generates persistent, editable 3D worlds from multimodal inputs — text, images, video, or coarse 3D layouts:

  • 3D Gaussian splat generation: Scenes are represented as millions of semitransparent 3D Gaussians, enabling photorealistic rendering from any viewpoint
  • Chisel editor: Users draw rough spatial layouts, Marble fills in visual detail — human intent at the structural level, AI generation at the detail level
  • Multi-format export: Gaussian splats, triangle meshes (physics simulation), or video
  • NVIDIA Isaac integration: Generated worlds can be imported into robotics simulation for agent training

4.3 AMI vs. World Labs: The $2 Billion Divergence

These two companies, each backed by ~$1 billion, represent fundamentally different bets on how AI should understand the physical world. We first covered this rivalry in our W11 blog/video post; here we deepen the comparison:

| Dimension | AMI Labs (LeCun) | World Labs (Fei-Fei Li) |
| --- | --- | --- |
| Founded by | Yann LeCun (Turing Award, Meta FAIR) | Fei-Fei Li (ImageNet, Stanford HAI) |
| Funding | $1.03B seed (Mar 2026), $3.5B valuation | $1B total incl. $200M Autodesk (Feb 2026), $5B valuation |
| Core architecture | JEPA: joint embedding predictive architecture | Generative: diffusion + 3D Gaussian splatting |
| What it produces | Abstract representations for planning/reasoning | Visual 3D worlds you can see and navigate |
| Training data | Video, audio, sensor streams, lidar | Text, images, video, 3D layouts |
| Go-to-market | Research-first: build the right architecture, then find applications | Product-first: shipped Marble (Nov 2025), Autodesk integration |
| Target applications | Industrial, robotics, healthcare (where hallucinating physics kills) | Creative tools, gaming, film, architecture, robotics sim |
| Key advantage | Compute efficiency, data efficiency, planning speed | Visual quality, immediate commercial utility |
| Key weakness | No visual output: can reason but can't render | No physical reasoning: can render but may not understand |
| Compute profile | LeWM: 15M params, 1 GPU, hours | Marble: large-scale generation infrastructure |
| Open research | V-JEPA 2, VL-JEPA, LeWM (all open-source) | No full architectural paper; commercial product |

The deepest difference: AMI is building understanding (can this robot plan a safe path?), while World Labs is building synthesis (can this tool generate the room the robot will practice in?). These may not be competing — they may be complementary layers of the same stack.

4.4 Video Foundation Models: Implicit World Models Under Pressure

With Sora dead, the remaining video world model contenders are Google's Veo 3.1 (January 2026, 4K, reference-image conditioning) and open-source projects like HunyuanVideo WorldPlay (Tencent, with RL post-training code on GitHub).

The CVPR 2025 tutorial From Video Generation to World Model mapped the frontier from passive generation toward interactive simulation. With Sora gone, the question of whether video generation can become true world modeling has lost its most prominent champion.


5. The Broader Landscape: Related Breakthroughs

5.1 R2-Dreamer: Decoder-Free World Models

Released five days before LeWM, R2-Dreamer (March 18, 2026) proposes a decoder-free model-based reinforcement learning (MBRL) framework that uses a Barlow Twins-inspired redundancy-reduction objective to prevent collapse without data augmentation. On DeepMind Control Suite and Meta-World, R2-Dreamer matches DreamerV3 and TD-MPC2 while training 1.59× faster than DreamerV3.

The convergence is striking: both LeWM and R2-Dreamer independently arrived at the conclusion that world models don't need decoders or pretrained features — just the right regularization objective.
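For readers unfamiliar with redundancy reduction, the sketch below shows the original Barlow Twins objective on two batches of embeddings. It is not R2-Dreamer's exact loss, which is only described here as "inspired by" Barlow Twins; the function name, the toy data, and the off-diagonal weight of 5e-3 are assumptions for illustration.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, off_diag_weight=5e-3, eps=1e-8):
    """Original Barlow Twins objective on two (batch, dim) embedding views."""
    n = z_a.shape[0]
    # Standardise each feature across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / n  # cross-correlation matrix, shape (dim, dim)
    on_diag = np.sum((1.0 - np.diag(c)) ** 2)            # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # redundancy term
    return float(on_diag + off_diag_weight * off_diag)


rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 8))
noisy_copy = healthy + 0.01 * rng.normal(size=(256, 8))
loss_healthy = barlow_twins_loss(healthy, noisy_copy)

collapsed = np.ones((256, 8))
loss_collapsed = barlow_twins_loss(collapsed, collapsed)
```

A collapsed encoder produces zero-variance features, so the cross-correlation diagonal cannot reach 1 and the invariance term stays large. Like SIGReg, the objective makes the trivial solution expensive instead of relying on stop-gradients or EMA targets.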

5.2 Causal-JEPA: Object-Level World Models

Causal-JEPA (February 2026) extends the JEPA framework from image patches to object-centric representations using object-level masking that induces causal inductive biases via latent interventions. Key result: ~20% absolute improvement in counterfactual reasoning, and planning with only 1% of the latent features required by patch-based world models.

5.3 Self-Improving World Models (ASIM)

From the ICLR 2026 RSI Workshop (our W12 coverage): ASIM uses cycle-consistency between forward and inverse models for architecture-agnostic self-improvement with 50%+ less data.
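The general idea of forward-inverse cycle-consistency can be illustrated with a toy linear system. This is a conceptual sketch only: the models, the matrix `B`, and the loss below are hypothetical and are not taken from ASIM.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical, well-conditioned action-to-state-change matrix.
B = np.eye(3) + 0.1 * rng.normal(size=(3, 3))


def forward_model(s, a):
    """Hypothetical forward model: next state from state and action."""
    return s + a @ B.T


def inverse_model(s, s_next, B_est):
    """Hypothetical inverse model: recover the action from a transition."""
    return np.linalg.solve(B_est, (s_next - s).T).T


def cycle_loss(s, a, B_est):
    """Mean squared error between the action and its reconstruction."""
    a_rec = inverse_model(s, forward_model(s, a), B_est)
    return float(np.mean((a - a_rec) ** 2))


states = rng.normal(size=(32, 3))
actions = rng.normal(size=(32, 3))

loss_consistent = cycle_loss(states, actions, B)
loss_mismatched = cycle_loss(states, actions, B + 0.5 * rng.normal(size=(3, 3)))
```

When the two models agree, the action survives the round trip and the loss is near zero. Any disagreement shows up as a nonzero loss without needing labels or rewards, which is what makes the signal usable for self-improvement.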

5.4 The RSI Connection: From Self-Improving Models to Self-Improving World Understanding

If you've been following our W10–W12 coverage, the world models story should feel structurally familiar — because the core failure modes are the same.

In our RSI weeks, we showed how self-play and self-training loops (STaR, ReST, Contextual Drag) can degrade when the model finds trivial shortcuts: a self-play proposer that generates only easy problems, a self-trainer that reward-hacks its own verifier, or a self-refining agent whose corrections compound errors rather than fixing them. The central RSI challenge is keeping the self-improvement loop honest — ensuring the model actually learns rather than exploiting the training signal.

JEPA world models face the exact same challenge, wearing different clothes. Representation collapse — where the encoder maps every input to a constant vector — is the world model equivalent of reward hacking. The model "solves" the prediction objective by making every prediction trivially correct, learning nothing useful in the process. Previous JEPA implementations required fragile hacks (stop-gradients, EMA, multi-term losses) to prevent this, just as previous self-play systems required careful curriculum design and rejection sampling to stay productive.

LeWM's SIGReg is to world model collapse what verification-based filtering is to self-training collapse: a principled regularizer that keeps the learning loop honest without human intervention. The parallel isn't just conceptual — it's mathematical. SIGReg enforces distributional structure on the latent space (Gaussianity via Cramér-Wold). Verification-based self-training enforces correctness structure on generated solutions. Both prevent degenerate convergence by constraining what the model is allowed to learn.

ASIM closes the circle: it applies self-improvement principles directly to world models via forward-inverse cycle-consistency. This is the intersection point of our entire W10–W15 arc — the question isn't just "can AI improve itself?" (RSI) or "can AI understand the world?" (world models), but can AI improve its own understanding of the world, autonomously?

For our notebooks this week, this connection matters practically: NB 00 implements SIGReg regularization in pure NumPy and shows what happens when you remove it (collapse), drawing a direct parallel to how verification filtering prevents reward hacking in W11's STaR notebook — same principle, different domain.


6. Where Each Approach Excels (and Fails)

| Capability | Energy-Based (JEPA) | Generative (Transformer/Diffusion) |
| --- | --- | --- |
| Physical reasoning | Strong: learns causal structure in latent space | Weak: approximates appearance of physics |
| Planning speed | LeWM: <1 sec, 48× faster than foundation models | Slow: generation is the bottleneck |
| Visual generation quality | Not designed for generation; produces representations, not images | Excellent: state-of-the-art photorealism (but expensive) |
| Data efficiency | High: V-JEPA 2 does zero-shot robotics from 62 hr unlabeled video | Low: requires massive datasets for each domain |
| Compute efficiency | LeWM: 15M params, 1 GPU, hours; V-JEPA 2: 1.2B, still tractable | Sora: so costly OpenAI shut it down |
| Robotics applications | Direct: latent predictions feed into controllers | Indirect: Marble generates sim environments; no direct control |
| Creative applications | Limited: cannot render visual content | Excellent: gaming, film, design, architecture |
| Training stability | Solved: LeWM trains end-to-end with 1 hyperparameter | Stable but requires massive scale |
| Scalability evidence | LeWM (15M) → V-JEPA 2 (1.2B): proven at both scales | GPT-4V, Veo at hundreds of billions: proven at scale |

7. The Hybrid Thesis and the Engineering Angle

For Practitioners: What Can You Build Today?

The LeWM release is particularly exciting for researchers and engineers with limited compute — exactly our situation. With 15M parameters and single-GPU training, this is within reach for anyone with a consumer-grade GPU. The code depends on stable-worldmodel for environment management, planning, and evaluation, and stable-pretraining for training infrastructure.

The key insight for small teams: you don't need foundation-scale compute to do meaningful world model research. LeWM beats DINO-WM (which uses pretrained DINOv2 features) on Push-T with only raw pixel input and a fraction of the parameters. This validates our approach from previous weeks — finding the engineering sweet spots where small-scale work can produce frontier-competitive results.

The Convergence Signal

Several developments suggest convergence rather than winner-take-all:

  • LeWM + R2-Dreamer independently show that decoder-free, regularization-based training is the path forward for efficient world models
  • V-JEPA 2's LLM alignment shows JEPA representations can connect to language models for multimodal reasoning
  • Marble's simulation integration (NVIDIA Isaac) means even generative world models serve planning use cases
  • Causal-JEPA adds object-level reasoning to the JEPA framework, bridging toward richer world understanding

The most likely trajectory: energy-based encoders for perception and planning, generative decoders for rendering and content creation, connected through a shared latent space. AMI and World Labs may end up as complementary layers rather than competitors.


8. What to Watch

  1. LeWM scaling experiments. At 15M parameters, LeWM is competitive. What happens at 100M? 1B? The scaling behavior of end-to-end JEPA world models is uncharted territory — and with stable training now solved, the experiments are feasible.

  2. AMI's first technical results. With $1B, Saining Xie, and the LeWM training recipe, expect AMI's first publications within 6–12 months. Will they build on LeWM's SIGReg approach or develop new stabilization methods?

  3. Post-Sora generative landscape. With OpenAI out, Google's Veo and open-source projects (HunyuanVideo, LTX, Helios) become the generative world model frontier. Will any of them cross the interactive threshold?

  4. World Labs' robotics pipeline. Marble + Isaac Sim is the generative camp's strongest argument for practical world models. Sim-to-real transfer results will be definitive.

  5. Notebook 00 this week. We reproduce LeWM's core training loop at minimal scale — a pure NumPy MLP encoder + predictor with SIGReg regularization, trained from raw pixels on BallWorld. No PyTorch, no autograd — every gradient is hand-derived. Trains on any laptop CPU in under 2 minutes.

