Inspiration
Plato warned us about mistaking shadows for reality. Today, we align AI on shadows of shadows — compressing bodies into pixels, pixels into words. RLHF and Constitutional AI operate entirely in text, the lossiest representational layer. We wanted to make the information loss visible.

The question driving Daylight: does the medium through which an agent learns determine the quality of what it learns? We hypothesized that an agent trained on direct physical experience would dramatically outperform one trained on video, which would outperform one trained on language — even when all three receive "equivalent" information about the same task.

What it does

Daylight puts the same humanoid walking task through three representational pipelines and races them side by side in a single MuJoCo physics scene:

  • Embodied (Daylight): proprioceptive state $\rightarrow$ policy $\rightarrow$ action. The agent directly feels its joint angles, velocities, contact forces — a 362-dimensional observation at every timestep.
  • Video (Dusk): camera frame $\rightarrow$ CNN state estimator $\rightarrow$ predicted state $\rightarrow$ same policy $\rightarrow$ action. Same policy, but now the state is reconstructed from pixels. Information is lost in the render-then-estimate loop.
  • Text (Shadow): camera frame $\rightarrow$ BLIP caption $\rightarrow$ sentence embedding $\rightarrow$ MLP $\rightarrow$ action. The agent only knows what language can describe. Captions like "right knee flexed ~40°, hip extending" sound precise but lack the continuous, high-frequency signal needed for real-time balance.

The result is a live stampede: 15 humanoids (5 per type) race from a dark cave zone into warm daylight. The dark embodied agents stride confidently to the finish. The medium-toned video agents stumble a few meters before collapsing. The pale text agents barely leave the starting line. The visual is immediate and visceral — you see the degradation.
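The structural difference between the first two pipelines, one shared policy fed two different state sources, can be sketched as follows. Everything here is an illustrative stand-in: the real components are the trained SAC policy and the CNN estimator, and the 17-dim action size is an assumption, not a figure from the project.

```python
import numpy as np

# Illustrative stand-ins for the trained components. Shapes follow the
# write-up (362-dim state, 4x64x64 frame stack), but the 17-dim action
# and all internals here are placeholder assumptions, not the real models.
def policy(state):
    return np.tanh(state[:17])            # SAC policy stand-in: state -> torques

def cnn_state_estimator(frames):
    return frames.reshape(-1)[:362]       # CNN stand-in: pixels -> predicted state

def act_embodied(true_state):
    # Embodied: the policy reads ground-truth proprioception directly.
    return policy(true_state)

def act_video(frames):
    # Video: the *same* policy, fed a reconstruction of the state.
    # Whatever the estimator gets wrong is invisible to the policy.
    return policy(cnn_state_estimator(frames))

state = np.linspace(-1.0, 1.0, 362)
frames = np.zeros((4, 64, 64))
a_emb, a_vid = act_embodied(state), act_video(frames)
```

The text agent has no such shared policy at all; only what survives the caption-and-embedding bottleneck ever reaches its MLP.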

How we built it

Physics & RL. We trained humanoid locomotion policies using Soft Actor-Critic (SAC) on Modal A100 GPUs. Getting a humanoid to walk at all required substantial sim engineering: the default Humanoid-v5 has sphere feet (condim=1) that slide, so we built a custom XML with flat box feet (condim=3, 20x12 cm), higher joint damping, and a 362-dim observation space including center-of-mass inertia, body velocities, actuator forces, and foot contact sensors. The gait reward balances forward velocity, survival, and foot-ground contact:

$$r = w_{\text{healthy}} \cdot \mathbb{1}[\text{alive}] + w_{\text{vel}} \cdot v_x - w_{\text{ctrl}} \cdot |a|^2 + w_{\text{contact}} \cdot (\mathbb{1}[f_L > 0] + \mathbb{1}[f_R > 0])$$
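A direct transcription of the equation above as code. The weight values shown are placeholders for illustration, not the tuned coefficients used in training:

```python
import numpy as np

def gait_reward(alive, v_x, action, f_left, f_right,
                w_healthy=5.0, w_vel=1.25, w_ctrl=0.1, w_contact=1.0):
    """Gait reward matching the equation term by term.
    Weights are illustrative placeholders, not the trained values."""
    r = w_healthy * float(alive)                       # survival bonus
    r += w_vel * v_x                                   # forward velocity
    r -= w_ctrl * float(np.sum(np.square(action)))     # control cost |a|^2
    r += w_contact * (float(f_left > 0) + float(f_right > 0))  # foot contact
    return r
```

The contact term is what discourages the degenerate "dive forward" solution: velocity reward is only cheap while at least one foot stays planted.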

Video pipeline. We collected 50K (frame, action) pairs from the trained embodied policy, then trained a CNN (4-frame stack, 64x64 input, 512-dim hidden layer) via behavioral cloning. The CNN sees only pixels and must reconstruct enough state to produce useful actions. Its validation MSE of 0.075 sounds low, but that ~7% error compounds over hundreds of timesteps into catastrophic gait failure.
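A toy closed-loop simulation shows why a small per-step estimation error is so damaging: the policy acts on the estimate, the plant integrates the resulting mistake, and the error feeds back. The 1-D dynamics, gain, and noise scale below are illustrative inventions, not the humanoid's; the 0.27 noise level is simply the RMS of a 0.075-MSE estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(noise_std, steps=300, dt=0.05):
    """Unstable 1-D plant x' = x + u, stabilized by u = -2 * x_hat,
    where x_hat is the state estimate the controller actually sees."""
    x = 0.1
    for _ in range(steps):
        x_hat = x + noise_std * rng.normal()   # imperfect state estimate
        u = -2.0 * x_hat                       # control acts on the estimate
        x = x + dt * (x + u)                   # closed-loop update
    return abs(x)

clean = rollout(0.0)    # perfect state: error decays toward zero
noisy = rollout(0.27)   # ~sqrt(0.075) RMS estimate error: error persists
```

With a perfect estimate the controller contracts the state every step; with noisy estimates the same controller continually injects the estimator's error back into the plant, which is the compounding failure the video agent exhibits.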

Text pipeline. We captioned the same frames using BLIP-2, embedded them with all-MiniLM-L6-v2 (384-dim), and trained an MLP behavioral clone. The captions are impressively detailed — angles, phases, force directions — but language fundamentally cannot encode the continuous, high-bandwidth proprioceptive stream at the rate needed for balance. The text agent produces semi-plausible joint activations that immediately diverge.
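The caption-to-action path can be sketched end to end. The hashing "embedding" and untrained weights below are crude stand-ins for BLIP-2, all-MiniLM-L6-v2, and the behaviorally cloned MLP; only the interface (a sentence in, a bounded action vector out) and the 384-dim embedding size come from the write-up, and the 17-dim action is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(caption, dim=384):
    """Bag-of-words hashing stand-in for the all-MiniLM-L6-v2 embedding.
    (Python's str hash is salted per process, fine for a sketch.)"""
    vec = np.zeros(dim)
    for token in caption.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Untrained placeholder weights; the real MLP is a behavioral clone.
W1 = rng.normal(scale=0.1, size=(384, 128))
W2 = rng.normal(scale=0.1, size=(128, 17))

def text_policy(caption):
    h = np.maximum(embed(caption) @ W1, 0.0)   # ReLU hidden layer
    return np.tanh(h @ W2)                     # bounded joint commands

action = text_policy("right knee flexed ~40 degrees, hip extending")
```

The bottleneck is visible in the types alone: a whole control timestep is squeezed into one sentence, and two frames with very different dynamics can produce nearly identical captions.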

Multi-agent race scene. We built a programmatic XML generator that stamps out N humanoids per type with unique prefixes (e00_, v00_, t00_), arranges them in lanes, and bakes in a cave-to-daylight lighting gradient (cool dim spotlights at the start, warm bright lights at the finish). The race script uses name-based actuator lookup and batched policy inference.
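The stamping idea can be sketched with string templates. This is a minimal illustration: the real generator emits the full humanoid body tree plus contact pairs, tendons, sensors, and the lighting gradient, whereas the two-body stub and element layout here are simplified placeholders.

```python
def make_scene(n_per_type=5, types=("e", "v", "t"), lane_gap=2.0):
    """Stamp out n_per_type humanoids per pipeline type, each with a
    unique name prefix (e00_, v00_, t00_, ...) and its own lane."""
    body_tmpl = (
        '<body name="{p}torso" pos="0 {y:.1f} 1.3">'
        '<freejoint name="{p}root"/>'
        '<geom type="capsule" size="0.07 0.3"/>'
        '<body name="{p}thigh" pos="0 0 -0.4">'
        '<joint name="{p}hip_y" type="hinge" axis="0 1 0"/>'
        '<geom type="capsule" size="0.06 0.2"/>'
        "</body></body>"
    )
    motor_tmpl = '<motor name="{p}hip_y" joint="{p}hip_y"/>'
    bodies, motors = [], []
    for t_i, t in enumerate(types):
        for k in range(n_per_type):
            p = f"{t}{k:02d}_"                     # e00_, v03_, t04_, ...
            y = (t_i * n_per_type + k) * lane_gap  # one lane per agent
            bodies.append(body_tmpl.format(p=p, y=y))
            motors.append(motor_tmpl.format(p=p))
    return (
        "<mujoco><worldbody>" + "".join(bodies) + "</worldbody>"
        "<actuator>" + "".join(motors) + "</actuator></mujoco>"
    )

xml = make_scene()
```

Because every joint, geom, and motor carries the agent's prefix, the race script can batch observations per type and route each policy's output back to the right actuators by name.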

Challenges we ran into

  • Getting a humanoid to walk at all. PPO failed completely (episode length 164 after 6.8M steps), while SAC reached a reward of 5,500 after 580K steps. But even SAC needed the right observation space: the 51-dim (qpos + qvel) observation couldn't balance, and the full 362-dim observation with body inertia and contact forces was critical.
  • Video agent domain shift. Training the CNN on single-humanoid renders, then deploying in a multi-agent race scene, caused immediate failure. We had to maintain a separate single-humanoid MuJoCo scene just to render "clean" frames for the video agent at inference time.
  • Text agent quality. Early versions fell instantly. We tried both zero-shot (Claude Vision caption → Claude text → action mapping) and trained (BLIP → embedding → MLP) variants. Both fail, but the trained version at least lurches forward ~1m before collapsing, which makes the comparison more legible.
  • Multi-agent physics. 15 humanoids in one MuJoCo scene with separate actuator namespaces, contact pairs, tendons, and sensors required careful XML generation. A single typo in a prefix means silent wrong-joint actuation.
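One way to guard against the silent wrong-joint failure mode is to resolve actuator indices by full prefixed name and fail loudly on any miss. The helper below is a hypothetical sketch, and `actuator_names` stands in for the compiled model's actuator name list, not the project's actual race script.

```python
def actuator_indices(actuator_names, prefix, joints=("hip_y",)):
    """Resolve each of an agent's actuators by its full prefixed name.
    Raising on a missing name turns a prefix typo into a crash instead
    of silently actuating another agent's joint."""
    index = {name: i for i, name in enumerate(actuator_names)}
    ids = []
    for j in joints:
        full = prefix + j
        if full not in index:
            raise KeyError(f"missing actuator {full!r}; check XML prefixes")
        ids.append(index[full])
    return ids

names = ["e00_hip_y", "v00_hip_y", "t00_hip_y"]
assert actuator_indices(names, "v00_") == [1]
```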

What we learned

The core insight held up: representational fidelity directly determines behavioral competence. The embodied agent walks 10+ meters. The video agent, using the exact same policy but with CNN-estimated state, manages ~4m before the compounding pixel-to-state error topples it. The text agent, with its language bottleneck, barely moves.

This isn't a failure of the text or video models — it's a property of the representations themselves. Language is a lossy compression of vision, which is a lossy compression of physics. Each layer discards information that seemed redundant but was load-bearing. When we align AI systems using only text (RLHF, Constitutional AI), we are optimizing in the space of shadows. Daylight makes that visible.

