
Spatial Architectures: From Capsule Networks to Equivariant Neural Networks

artifocial · April 12, 2026 · 11 min read

A tour of the architectures that bake 3D geometry into the network itself—from Hinton's capsules and the geometric deep learning framework to modern E(n)-equivariant graph networks powering physical AI.


W15 Basic Tutorial 2 · Intermediate · April 2026

Research Area: Geometric Deep Learning, Physical AI

Companion Notebook

| # | Notebook | Focus | Compute |
| --- | --- | --- | --- |
| 01 | 01_equivariant_vs_standard.ipynb | Equivariant vs. standard features for 3D — rotation generalization on point clouds | CPU only |

The Gap: Why Standard Architectures Struggle with Geometry

We've built powerful tools over the past decade: MLPs that approximate any function, CNNs that exploit translation symmetry, and Transformers that scale to billions of parameters. Yet all three fail at a task that animals solve effortlessly — understanding how objects behave under rotation.

Consider a simple example: train an MLP on photographs of upright cats. Show it a cat rotated 45 degrees. It fails. The learned representations have no notion of geometry. A CNN does better — it bakes in translation equivariance through convolution — but it still treats a rotated cat as a fundamentally different object from an upright one. A Vision Transformer? It inherits this weakness entirely. These architectures have no built-in understanding that rotating the input should rotate the features predictably.

This matters because physical AI — robotics, world models, autonomous systems — lives in a geometric world. A grasping policy that works for object A should work for object A rotated. A world model that predicts physics must predict the same physics in any reference frame. Scaling these systems to handle arbitrary orientations through data alone requires exponential amounts of training. We need something better.

This tutorial surveys architectures that natively understand spatial structure. We'll move from Geoffrey Hinton's vision of capsules, through the mathematical framework of geometric deep learning, to modern equivariant networks now quietly powering production systems.


Part 1: Capsule Networks — Hinton's Spatial Intuition

The Core Idea

In a traditional neural network, a neuron is a scalar. It fires or doesn't. In Hinton's 2017 paper Dynamic Routing Between Capsules, a capsule is a group of neurons whose activity vector encodes both what an entity is and how it's oriented.

Formally: a capsule's magnitude represents the probability that an entity exists (e.g., a nose), while its direction encodes the pose — position, rotation, scale, thickness, et cetera. A network of capsules can then negotiate which higher-level capsule should receive their output, encoding part-whole relationships naturally.
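In the original paper, the magnitude-as-probability reading is enforced by a "squashing" nonlinearity that compresses any activity vector's length into [0, 1) while preserving its direction. A minimal NumPy sketch (the function name `squash` is ours):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Sabour et al.'s squashing nonlinearity: scale a capsule's activity
    vector so its length lands in [0, 1) while its direction is preserved."""
    norm = np.linalg.norm(s)
    return (norm**2 / (1.0 + norm**2)) * s / (norm + eps)

v = squash(np.array([3.0, 4.0]))   # input length 5
print(np.linalg.norm(v))           # ≈ 0.96: a confident "entity present"
print(v / np.linalg.norm(v))       # direction [0.6, 0.8] is unchanged
```

Long vectors saturate near length 1 ("entity almost certainly present"), short ones shrink toward 0, and the pose information in the direction is untouched.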

The innovation was a dynamic routing algorithm: instead of fixed weights between layers, capsules use an iterative routing-by-agreement mechanism. A lower-level capsule (say, a "nose capsule") looks at all possible higher-level capsules (eye, mouth, full face) and sends its output preferentially to capsules whose predictions it agrees with. Routing runs inside the forward pass (the transformation weights are still learned by backpropagation), but it replaces fixed feedforward connectivity with a message-passing negotiation.

Why It Promised So Much

The theory is beautiful. On MNIST, a capsule network recognized overlapping digits far better than a standard CNN, suggesting it was learning genuine spatial structure. The pose vector, in principle, should generalize across rotations and scales. A robot could extract the pose of a grasped object directly from the capsule network's latent space.

Why It Didn't Take Over

Three problems:

  1. Training instability: Routing algorithms are finicky. Getting convergence required careful tuning. Plain feedforward layers trained end-to-end with backprop, by contrast, "just work" on almost everything.

  2. Scaling challenges: Capsule networks proved hard to scale beyond toy datasets. The routing overhead grows quickly with network size, and the promised benefits never materialized at scale.

  3. Lack of theoretical grounding: There was no principled framework explaining when and why capsules should work. Were they just a clever regularizer? A different parameterization? The uncertainty limited adoption.

The 2025 Revival: EquiCaps

A decade later, researchers at Aberdeen revived capsules by marrying them with equivariance theory. EquiCaps learns pose-aware self-supervised representations without explicit predictors, baking equivariance directly into the capsule mechanism. Results on rotation prediction (R² = 0.78 on 3DIEBench) and combined transformations suggest capsules + equivariance may finally unlock the promise Hinton saw.

The lesson: Hinton's intuition about spatial structure was sound. The missing ingredient was a rigorous mathematical language.


Part 2: Geometric Deep Learning — The Mathematical Blueprint

In 2021, Michael Bronstein and colleagues published Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, a manifesto that changed how we think about architecture design.

The Core Theorem (Informal)

Every successful deep learning architecture is an instance of equivariance under some symmetry group.

  • CNNs: respect the translation group. Shift the input → shift the output.
  • Graph Neural Networks: respect permutation of nodes. Reorder vertices → reorder output.
  • RNNs: respect translation in time. Shift a sequence → the hidden-state trajectory shifts accordingly.
  • Capsule networks: should respect pose transformations (rotation, translation, scaling).
  • Transformers: the attention core is permutation-equivariant, but it ignores geometry entirely — positional encodings (sinusoidal, RoPE, 3D PE) are the "duct tape" we bolt on to recover spatial awareness, and that is learned geometry, not built-in geometry.

The framework suggests that the reason Transformers work well on images (ViTs) is that they learn geometric invariance from massive amounts of data, not because they have it built in. This is why ViTs need substantially more training data to match a CNN's sample efficiency.

Groups and Equivariance

A symmetry group G acts on inputs and outputs. A function f is:

  • Invariant under G if f(T(x)) = f(x) for all T in G. Example: the distance between two points doesn't change under rotation.

  • Equivariant under G if f(T(x)) = T′(f(x)), where T′ is the corresponding transformation of the output space. Example: rotating an input point cloud rotates the predicted per-point feature vectors by the same rotation.

The distinction is critical. Invariance is useful for classification (we want to classify a rotated cat as "cat"). Equivariance is useful for representation learning (we want a rotated object's features to transform predictably, so we can compose layers).
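Both properties are easy to check numerically. A NumPy sketch (the helpers `f_inv` and `f_eq` are our own illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 3D rotation via QR decomposition (orthogonal, det forced to +1).
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))

x = rng.normal(size=(5, 3))   # a small point cloud
x_rot = x @ R.T               # T(x): rotate every point

# Invariant: the distance between two points survives rotation unchanged.
f_inv = lambda p: np.linalg.norm(p[0] - p[1])
assert np.isclose(f_inv(x), f_inv(x_rot))        # f(T(x)) == f(x)

# Equivariant: the centroid is itself a point, so it rotates with the cloud.
f_eq = lambda p: p.mean(axis=0)
assert np.allclose(f_eq(x_rot), f_eq(x) @ R.T)   # f(T(x)) == T(f(x))
```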

Which Groups Matter?

For physical AI:

  • E(3): The Euclidean group. Rotations, translations, and reflections in 3D. Governs classical mechanics.
  • SE(3): Special Euclidean group. Rotations and translations only (no reflections). More natural for robotics — and for drug discovery, where chirality matters: a "left-handed" molecule can be a medicine while its mirror-image reflection is a toxin.
  • SO(3): Rotation group only. Relevant when translation is handled separately.

A network that respects SE(3)-equivariance automatically works regardless of the object's orientation or position in space — no retraining needed.


Part 3: Equivariant Neural Networks — Implementation

E(n)-Equivariant GNNs (EGNN)

The first practical equivariant architecture for arbitrary dimensions came from Satorras, Hoogeboom, and Welling (2021). Their E(n)-Equivariant Graph Neural Network (EGNN) processes graphs (point clouds, molecules) while respecting Euclidean symmetry.

The trick: make messages depend only on invariant quantities (relative distances, angles) while allowing node features to be equivariant (vectors that rotate with the object).

Pseudocode:

For each edge (i, j):
  d_ij = ||x_i - x_j||  (invariant scalar)
  m_ij = MLP_edge(h_i, h_j, d_ij)  (message built only from invariants)

For each node i:
  direction_ij = (x_i - x_j) / d_ij  (equivariant: a unit vector)
  x_i += sum_j direction_ij * MLP_coord(m_ij)  (coordinates rotate with the input)
  h_i = MLP_node(h_i, sum_j m_ij)  (invariant hidden-state update)

The beauty: no explicit handling of rotation matrices or group representation theory. By building from invariants and multiplying by equivariant directions, equivariance emerges automatically.
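A runnable toy version of such a layer, in plain NumPy with randomly initialized MLPs — a sketch of the EGNN update, not the authors' implementation (all function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_mlp(dims):
    """A small random tanh MLP (illustrative stand-in for learned weights)."""
    Ws = [rng.normal(scale=0.5, size=(a, b)) for a, b in zip(dims, dims[1:])]
    def forward(z):
        for W in Ws[:-1]:
            z = np.tanh(z @ W)
        return z @ Ws[-1]
    return forward

D = 4                                     # hidden feature size
edge_mlp  = make_mlp([2 * D + 1, 16, D])  # phi_e(h_i, h_j, d_ij)
coord_mlp = make_mlp([D, 16, 1])          # phi_x(m_ij): scalar edge weight
node_mlp  = make_mlp([2 * D, 16, D])      # phi_h(h_i, sum_j m_ij)

def egnn_layer(x, h):
    """One E(n)-equivariant message-passing layer (fully connected graph)."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]                 # (n, n, 3) equivariant
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)  # (n, n, 1) invariant
    dist[np.arange(n), np.arange(n)] = 1.0               # avoid /0 on diagonal
    hi = np.broadcast_to(h[:, None, :], (n, n, D))
    hj = np.broadcast_to(h[None, :, :], (n, n, D))
    m = edge_mlp(np.concatenate([hi, hj, dist], axis=-1))  # invariant messages
    mask = 1.0 - np.eye(n)[:, :, None]                     # drop self-edges
    x_new = x + np.sum(mask * (diff / dist) * coord_mlp(m), axis=1)
    h_new = node_mlp(np.concatenate([h, np.sum(mask * m, axis=1)], axis=-1))
    return x_new, h_new

# Check: rotating + translating the input transforms the output identically.
x, h = rng.normal(size=(6, 3)), rng.normal(size=(6, D))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))
t = rng.normal(size=3)

x1, h1 = egnn_layer(x, h)
x2, h2 = egnn_layer(x @ R.T + t, h)
assert np.allclose(x2, x1 @ R.T + t)  # coordinates: equivariant
assert np.allclose(h2, h1)            # hidden features: invariant
```

The two asserts confirm the defining property: applying a rotation R and translation t to the input coordinates moves the output coordinates by exactly the same R and t, while the invariant hidden features do not change at all.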

Applications:

  • Molecular dynamics: predicting atomic forces without retraining for rotated molecules.
  • Protein structure prediction: AlphaFold uses SE(3)-equivariance ideas in its structure module.
  • 3D scene understanding: cameras can be in any orientation; the model adapts.

Vector Neurons

Deng et al. (2021) extended the idea to dense 3D point clouds. Vector Neurons replace scalar activations with 3D vectors, automatically encoding rotation as vector rotation.

A standard neuron computes a scalar, y = σ(w · x). A vector neuron's feature is instead a stack of C vectors in 3D, V ∈ ℝ^{C×3}, and its linear layer mixes channels: V′ = WV with W ∈ ℝ^{C′×C}. Because a rotation acts on the other side of the product, (WV)R = W(VR) for any R ∈ SO(3), linear layers are equivariant for free. The nonlinearity needs more care: applying σ per component would break the symmetry, so VN-ReLU instead clips each vector against a learned direction, an operation that commutes with rotation. The result is true SO(3)-equivariance without special-casing.

This unified framework subsumes specialized methods and allows building equivariant networks from standard building blocks: linear layers, ReLU, pooling, batch norm — all automatically equivariant when built from vectors.
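In the paper's formulation a vector-neuron feature is a matrix V ∈ ℝ^{C×3} (C channels, each a 3D vector) and the linear layer mixes channels; equivariance of that linear map, and the failure of a naive per-component nonlinearity, are a few-line NumPy check:

```python
import numpy as np

rng = np.random.default_rng(2)

C_in, C_out = 8, 5
W = rng.normal(size=(C_out, C_in))   # channel-mixing weights
V = rng.normal(size=(C_in, 3))       # one feature: C_in vectors in 3D

# A random rotation via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))

# Linear layers commute with rotation: rotate-then-mix == mix-then-rotate.
assert np.allclose(W @ (V @ R.T), (W @ V) @ R.T)

# A per-component ReLU does not commute — this is why VN-ReLU exists.
relu = lambda z: np.maximum(z, 0.0)
assert not np.allclose(relu(V @ R.T), relu(V) @ R.T)
```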

SE(3)-Transformers

Why not add attention to equivariance? Fuchs et al. (2020) introduced SE(3)-Transformers, which couple invariant attention weights with equivariant value embeddings:

  • Attention logits depend only on relative distances and angles (invariant).
  • Values are equivariant vectors.
  • The final output respects SE(3)-equivariance while leveraging attention's expressiveness.

This design has become a building block in 3D perception systems and molecular modeling.
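The recipe — invariant attention logits, equivariant values — can be reduced to a toy NumPy head (our own minimal sketch, not the SE(3)-Transformer's actual kernel):

```python
import numpy as np

rng = np.random.default_rng(3)

def invariant_attention(x):
    """Toy attention head: logits come from pairwise distances (invariant),
    values are relative displacement vectors (equivariant)."""
    diff = x[:, None, :] - x[None, :, :]      # (n, n, 3), rotates with input
    dist = np.linalg.norm(diff, axis=-1)      # (n, n), rotation-invariant
    w = np.exp(-dist)                         # closer points attend more
    w /= w.sum(axis=1, keepdims=True)         # softmax over neighbors
    return np.einsum('ij,ijk->ik', w, diff)   # weighted sum of vectors

x = rng.normal(size=(7, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))

# Rotating the inputs rotates the outputs; translations cancel inside diff.
assert np.allclose(invariant_attention(x @ R.T), invariant_attention(x) @ R.T)
assert np.allclose(invariant_attention(x + 1.0), invariant_attention(x))
```

Because the attention weights are scalars built from distances, they are untouched by any rigid motion; all the geometry flows through the equivariant values.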


Part 4: Why This Matters for Physical AI

We've built three major systems in W14–W15 that need spatial understanding:

World Models (LeWorldModel, W13)

LeWorldModel predicts world evolution in a learned latent space. But the encoder converting pixels to latents is a standard CNN — translation equivariant but rotationally blind. If the camera rotates, the latent representation churns, breaking prediction. An equivariant encoder would let the latent space itself be geometric, where trajectories remain valid under rotation.

3D Worlds (Marble, W14)

Marble generates coherent 3D scenes. The generative model needs to understand that a rotated object is still the same object — just reoriented. A diffusion model trained with equivariance priors would generate physically consistent scenes with far fewer samples.

Robotic Manipulation

A grasping policy trained on objects in a canonical orientation should work on arbitrarily oriented objects. An equivariant network lets the policy learn once and generalize infinitely. Non-equivariant networks require retraining or massive data augmentation.

The core insight: equivariance is not a luxury; it's a sample efficiency multiplier. In a data-hungry era, geometric priors matter.


Part 5: The Landscape Today

The field is fragmenting into specialized systems:

| Architecture | Domain | Basis |
| --- | --- | --- |
| EGNN | Molecular graphs | E(n)-equivariance |
| Vector Neurons | 3D point clouds | SO(3)-equivariance |
| SE(3)-Transformers | Protein folding, 3D detection | SE(3)-equivariance + attention |
| EquiCaps | Pose estimation, visual reasoning | Capsules + equivariance |
| Equivariant Diffusion Models | 3D generation (molecules, shapes) | Score functions respect symmetry |

The trend is clear: production systems increasingly incorporate geometric priors. Not because equivariance is theoretically pure, but because it cuts data requirements and improves robustness.

Recent work explores equivariant diffusion models for 3D molecule generation, equivariant normalizing flows for sampling, and equivariant graph isomorphism networks (EGI) for higher-order graph problems.


Part 6: Building Intuition with Notebooks

Notebook 01 demonstrates the core principle: train a standard MLP on upright 3D point clouds, then test on rotations. Accuracy tanks. Train the same architecture with invariant features (pairwise distances only) and watch generalization jump to near-perfect. It's a pure-NumPy, CPU-only walkthrough — no PyTorch, no GPU, just the math in action.
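The invariant-feature trick can be sketched in a few lines (our own minimal version; the notebook's exact feature set may differ):

```python
import numpy as np

rng = np.random.default_rng(4)

def invariant_features(points, n_keep=10):
    """Sorted pairwise distances: a rotation- and translation-invariant
    fixed-size descriptor of a point cloud (keep the n_keep largest)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(points), k=1)   # each unordered pair once
    return np.sort(dist[iu])[-n_keep:]

cloud = rng.normal(size=(20, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))
t = rng.normal(size=3)

# The descriptor is identical for any rotated + translated copy of the cloud:
assert np.allclose(invariant_features(cloud),
                   invariant_features(cloud @ R.T + t))
```

A classifier fed these features never sees orientation at all, which is exactly why its test accuracy survives rotations that sink the raw-coordinate MLP.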


Key Takeaways

  1. Standard architectures lack geometric priors. Transformers, CNNs, and MLPs all learn geometry from data, wasting capacity.

  2. Equivariance is a sample efficiency multiplier. An equivariant network generalizes across rotations with no retraining. Non-equivariant networks require exponential data.

  3. The math is elegant, the code is simple. You don't need group representation theory to use equivariant networks — invariant inputs and equivariant outputs compose naturally.

  4. Capsules were ahead of their time. Hinton's intuition about pose was correct; EquiCaps shows the path forward.

  5. Physical AI demands geometry. World models, robotics, and 3D generation all benefit from respecting spatial structure. This is not a nice-to-have; it's essential.

