Latent Geometry, Blind Spots: Stress-Test JEPA World Models

Background

JEPA (Joint-Embedding Predictive Architecture) is a way for an AI to learn how the world works by predicting compressed internal representations of what happens next, rather than raw pixels. Meta's V-JEPA 2 already uses this for zero-shot robot manipulation, and it's quickly becoming a serious alternative to world models that reconstruct pixels. I built on top of this with two recent ideas. The first is value-guided latent geometry, which reshapes the model's internal "distance" so it reflects the real cost to reach a goal instead of just visual similarity. The second is a density-matrix latent layer inspired by UWM-JEPA, designed to preserve uncertainty when the model has to predict without seeing (for example, when the camera is blocked).
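To make "distance that reflects cost-to-reach" concrete, here is a minimal sketch of one plausible form of a value-shaped loss. It is an assumption for illustration, not necessarily the loss used in this project: it takes the final latent of a trajectory as the goal and penalizes the gap between each latent's Euclidean distance to that goal and the number of steps remaining. The function name `value_shaped_loss` and the `step_cost` parameter are hypothetical.

```python
import numpy as np


def value_shaped_loss(latents, step_cost=1.0):
    """Mean squared gap between latent distance-to-goal and steps-to-goal.

    latents: array of shape (T, D), one encoded trajectory.
    Assumption: the last latent is treated as the goal, and every
    environment step costs `step_cost`.
    """
    latents = np.asarray(latents, dtype=float)
    num_steps = latents.shape[0]
    goal = latents[-1]
    # Euclidean distance from every latent to the goal latent.
    latent_dist = np.linalg.norm(latents - goal, axis=1)
    # Cost-to-go target: steps remaining until the goal is reached.
    steps_to_goal = step_cost * (num_steps - 1 - np.arange(num_steps))
    return float(np.mean((latent_dist - steps_to_goal) ** 2))


if __name__ == "__main__":
    straight = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
    curled = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
    print("straight path loss:", value_shaped_loss(straight))
    print("curled path loss:", value_shaped_loss(curled))
```

In this toy case the straight path already has latent distances equal to steps remaining, while the curled path starts visually close to the goal even though it is three steps away, which is the mismatch a value-shaped objective is meant to penalize.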

What I Built

I built a baseline planner and both extensions on the same open-source PLDM backbone, then evaluated every version identically: under normal conditions and under goal-occluded conditions, with n=40 trials each.
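With n=40 trials per condition, each success is worth 2.5 percentage points, so it helps to read the success rates alongside an interval. The sketch below computes a Wilson score interval for a binomial success rate using only the standard library. The helper name `wilson_interval` is hypothetical, and the 95% level (z = 1.96) is my choice for illustration, not something the evaluation above specifies.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to roughly 95% coverage (an assumed choice).
    Returns (lower, upper) as fractions in [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Example counts out of 40 trials; 14/40 is 35%, 10/40 is 25%, 3/40 is 7.5%.
    for successes in (14, 10, 3):
        lower, upper = wilson_interval(successes, 40)
        print(f"{successes}/40 = {successes / 40:.1%}, interval [{lower:.1%}, {upper:.1%}]")
```

Intervals that overlap heavily suggest the comparison would benefit from more trials before drawing firm conclusions.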

Part 1: Value-guided geometry

My first version underperformed the baseline: 7.5% success (3/40) against the baseline's 35% (14/40) under normal conditions. My diagnosis was that training with the value-shaped loss from scratch breaks the geometric agreement between the encoder and predictor that planning depends on, so I built two fixes to test that diagnosis directly. Joint training with a warmup schedule brought normal-condition success up to 20% (8/40), and fine-tuning from a converged baseline checkpoint pushed it further to 25% (10/40). That is more than triple my first result with the identical loss and hyperparameters; only the starting point changed. Under occlusion, the baseline dropped to 7.5% (3/40), while every value-guided version plateaued around 5% (2/40) regardless of which fix I applied. Each version I built moved the normal-condition number in the right direction, which I read as a strong signal that this approach is a real contender given more research and compute. What needs tuning next: prediction loss kept drifting upward the longer the value loss stayed active, so the two objectives still need a better-balanced joint schedule, which is a tuning and compute problem.
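The warmup idea can be sketched as a weight on the value loss that stays at zero while the encoder and predictor settle, then ramps up linearly. This is a minimal illustration under assumptions: the linear shape, the function names, and the default values of `start_step`, `warmup_steps`, and `max_weight` are all hypothetical and are not the settings used in these runs.

```python
def value_loss_weight(step, start_step=1000, warmup_steps=4000, max_weight=0.1):
    """Linear warmup for the value-loss coefficient.

    Returns 0 before `start_step`, then ramps linearly to `max_weight`
    over `warmup_steps`, and stays at `max_weight` afterwards.
    All defaults are illustrative assumptions.
    """
    if warmup_steps <= 0:
        raise ValueError("warmup_steps must be positive")
    progress = (step - start_step) / warmup_steps
    return max_weight * min(1.0, max(0.0, progress))


def joint_loss(prediction_loss, value_loss, step, **schedule):
    """Prediction loss plus the warmed-up value loss."""
    return prediction_loss + value_loss_weight(step, **schedule) * value_loss


if __name__ == "__main__":
    for step in (0, 1000, 3000, 5000, 20000):
        print(step, value_loss_weight(step))
```

If prediction loss drifts upward once the value term is fully active, lowering `max_weight` or decaying the weight after the ramp are two knobs such a schedule exposes; which one helps here is an open tuning question.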

Part 2: Density-matrix uncertainty layer

I also built a density-matrix latent layer on top of the backbone, aimed specifically at occlusion robustness, since none of my value-guided fixes moved that number. Given the remaining time, I built it as a projection on a frozen backbone rather than a fully joint architecture, and it landed at 2.5% (1/40) under normal conditions and 0% (0/40) under occlusion. Training itself converged cleanly and stayed healthy (loss dropped from 0.675 to 0.445, with no collapse). What's missing is joint retraining of the predictor and planner around the new latent structure, the same kind of fix that worked in Part 1. I see this as the next clear build step.
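For readers unfamiliar with density matrices, the sketch below shows one generic way to project a frozen backbone latent into a valid density matrix (symmetric, positive semidefinite, trace one) and to read off an uncertainty score from its von Neumann entropy. It is an assumption for illustration: the projection matrix `W`, the `dim` and `rank` parameters, and both function names are hypothetical, and the construction used in this project or in UWM-JEPA may differ.

```python
import numpy as np


def density_matrix(z, W, dim, rank, eps=1e-12):
    """Map a latent vector z to a (dim x dim) density matrix.

    W is a hypothetical projection of shape (dim * rank, len(z)); with a
    frozen backbone it would be the only trained part. The result A @ A.T,
    normalized by its trace, is symmetric PSD with unit trace by construction.
    """
    A = (W @ z).reshape(dim, rank)
    rho = A @ A.T
    trace = np.trace(rho)
    if trace <= eps:
        raise ValueError("projection collapsed to zero; cannot normalize")
    return rho / trace


def von_neumann_entropy(rho, eps=1e-12):
    """Entropy -sum(p * log p) over eigenvalues; 0 for a pure (certain) state."""
    evals = np.clip(np.linalg.eigvalsh(rho), 0.0, None)
    evals = evals[evals > eps]
    return float(-(evals * np.log(evals)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=8)                 # stand-in for a frozen backbone latent
    W = rng.normal(size=(4 * 3, 8))        # hypothetical projection, dim=4, rank=3
    rho = density_matrix(z, W, dim=4, rank=3)
    print("trace:", np.trace(rho))
    print("entropy:", von_neumann_entropy(rho))
```

A rank-1 projection always yields a pure state with zero entropy, so the `rank` parameter bounds how much uncertainty such a layer can express.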

Why this is a strong contender for robotics

I built a real, working pipeline with two novel ideas layered on a planning backbone, found a specific and fixable bottleneck, and showed across two independent builds that addressing it produces consistent gains under normal conditions. With more compute and tuning time (longer joint training, a better-balanced value-loss schedule, and a fully joint build of the density-matrix layer), I believe this can close the remaining gap to baseline and become a genuinely deployable approach for planning under uncertainty, which is exactly the kind of robustness real robotics teams need before trusting a model in the field.

What's Next

The next steps are longer joint training with a tuned value-loss schedule to close the remaining performance gap; a fully joint (not frozen-backbone) build of the density-matrix layer; more MPPI planning samples to match full-scale settings; and pixel-space goal-image augmentation to directly target occlusion robustness.
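To show where the MPPI sample count enters, here is a minimal sketch of MPPI planning in latent space. Everything in it is an assumption for illustration: `toy_predictor` is a hypothetical stand-in for the learned predictor (it simply adds the action to the latent, so action and latent dimensions match), the cost is the terminal distance to the goal latent, and the default values of `num_samples`, `noise_std`, and `temperature` are not the settings used in this project.

```python
import numpy as np


def toy_predictor(z, a):
    """Hypothetical stand-in for a learned latent predictor: z_next = z + a."""
    return z + a


def mppi_update(z0, z_goal, predictor, mean_actions, num_samples=256,
                noise_std=0.3, temperature=0.1, rng=None):
    """One MPPI iteration: sample, roll out in latent space, reweight, average."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, noise_std, size=(num_samples,) + mean_actions.shape)
    actions = mean_actions[None] + noise              # (N, H, A)
    z = np.repeat(z0[None], num_samples, axis=0)      # (N, D)
    for t in range(mean_actions.shape[0]):
        z = predictor(z, actions[:, t])
    costs = np.linalg.norm(z - z_goal, axis=1)        # terminal distance to goal
    # Softmax over negative cost; subtracting the minimum keeps exp() stable.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return np.einsum("n,nha->ha", weights, actions)   # weighted mean plan (H, A)


def rollout_cost(z0, z_goal, predictor, actions):
    """Terminal distance to the goal latent after applying one action sequence."""
    z = z0
    for a in actions:
        z = predictor(z, a)
    return float(np.linalg.norm(z - z_goal))


def plan(z0, z_goal, predictor, horizon=4, iterations=20, seed=0, **kwargs):
    """Refine a zero-initialized action sequence with repeated MPPI updates."""
    rng = np.random.default_rng(seed)
    mean_actions = np.zeros((horizon, z0.shape[0]))
    for _ in range(iterations):
        mean_actions = mppi_update(z0, z_goal, predictor, mean_actions,
                                   rng=rng, **kwargs)
    return mean_actions


if __name__ == "__main__":
    start, goal = np.zeros(2), np.array([2.0, -1.0])
    for n in (16, 256):
        actions = plan(start, goal, toy_predictor, num_samples=n)
        print(n, "samples, final cost:", rollout_cost(start, goal, toy_predictor, actions))
```

More samples give the weighted average a better chance of including low-cost rollouts at each iteration, at a compute cost that scales linearly with `num_samples` times the horizon.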