Inspiration
Every UX team in the world is testing the wrong thing. Heatmaps tell you where users hovered. Surveys tell you what they say. Session replays tell you what they clicked. None of them tell you how their brain reacted, the same brain you're shipping for.
In March 2026, two things became possible at once:
- Meta open-sourced TRIBE v2: the first foundation model that predicts whole-brain fMRI responses to naturalistic video. 451 hours of fMRI from 720+ humans. Won Algonauts 2025. ~70k cortical vertices on the fsaverage5 mesh. A 70× resolution jump over prior work. Until now, "AI predicts the brain" was a research toy. Now it's a deployable model.
- Vision-language models can narrate UX. Gemini, Claude with vision, GPT-4V — they can watch a screen recording and reason about what a user is seeing and likely feeling.
Stack the two and you get something that didn't exist 12 months ago: an automatable, second-by-second neural read of any product experience. We thought: what if you could ship a UX decision the same way you ship a perf budget, by actually measuring the thing you care about?
What it does
Upload a demo and Aesthesis tells you, second by second, how a human brain would respond to it. The output is a fully synced workspace:
- An interactive brain timeline: click any point on the chart and the video freezes on that exact frame so you can see the UI moment that triggered the response.
- A 3D cortical mesh that paints the live neural state in real time as you scrub.
- Timestamped insights flagging the most decisive moments (e.g. "friction spike at 12.5s during checkout summary"); the payload shape is sketched after this list.
- Personalized AI fix suggestions for each insight, powered by a Backboard agent that has access to the user's full run history, so the longer you use it, the more it surfaces cross-run patterns (recurring weaknesses across designs) that no single analysis could catch.
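Concretely, the insights and fix suggestions come back as structured data. Here's a minimal Pydantic sketch of the payload shape; field names are illustrative rather than our exact schema:

```python
from pydantic import BaseModel

class Insight(BaseModel):
    """One timestamped insight (field names are illustrative, not the exact schema)."""
    timestamp_s: float        # moment in the video the insight anchors to
    roi: str                  # which neural signal drove it
    direction: str            # "spike" or "dip"
    summary: str              # e.g. "friction spike at 12.5s during checkout summary"
    suggested_fix: str | None = None  # filled in on demand by the Backboard agent

class AnalysisResult(BaseModel):
    duration_s: float                        # actual video duration, not the TR-padded estimate
    roi_timeseries: dict[str, list[float]]   # signal name -> one value per 1.5s TR
    insights: list[Insight]
    overall_assessment: str
```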
How we built it
Three services connected by HTTP: a Next.js 16 frontend (Framer Motion, Recharts, react-three-fiber), a FastAPI orchestrator, and a TRIBE GPU service running on Modal A100s.
A user uploads an MP4 → the backend validates it with ffprobe and strips the audio track → the TRIBE service runs Meta's V-JEPA 2 ViT-g encoder + TRIBE v2 task heads, producing 400 cortical-parcel activations per TR (every 1.5s), collapsed into eight UX-tuned ROI signals. The orchestrator then pipes the ROI timeseries into Gemini 2.0 Flash to synthesize timestamped insights and an overall assessment.
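A condensed sketch of that orchestrator path, under the caveat that the helper names, endpoint URL, and prompt are illustrative rather than lifted from our codebase:

```python
import os
import subprocess
import httpx
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
TRIBE_SERVICE_URL = "https://<our-modal-app>.modal.run/predict"  # placeholder

def validate_and_strip_audio(src: str, dst: str) -> float:
    """ffprobe confirms the upload is a decodable video and reads its true duration,
    then ffmpeg re-muxes it without the audio track so the GPU service never transcribes."""
    probe = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", src],
        capture_output=True, text=True, check=True,
    )
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", dst], check=True)
    return float(probe.stdout.strip())

def analyze(video_path: str) -> dict:
    duration = validate_and_strip_audio(video_path, "no_audio.mp4")
    # TRIBE service: 400 parcel activations per 1.5s TR, collapsed to the ROI signals
    with open("no_audio.mp4", "rb") as f:
        roi_timeseries = httpx.post(TRIBE_SERVICE_URL, files={"video": f}, timeout=600).json()
    # Gemini turns the numeric timeseries into timestamped, human-readable insights
    model = genai.GenerativeModel("gemini-2.0-flash")
    insights = model.generate_content(
        f"Video duration: {duration:.1f}s. ROI timeseries (one value per 1.5s TR): {roi_timeseries}. "
        "Flag the most decisive moments with timestamps and one-line explanations."
    )
    return {"duration_s": duration, "roi_timeseries": roi_timeseries, "insights": insights.text}
```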
The synced UI ties it together: clicking any point on the brain chart simultaneously seeks-and-freezes the video, advances the 3D cortical mesh (a real fsaverage5 anatomical surface, colored per-vertex via a custom react-three-fiber shader) to that TR, and highlights the matching insight.
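The per-vertex coloring starts server-side: each TR's 400 parcel values get expanded to one value per fsaverage5 vertex, and the shader only has to map that scalar to a color. A rough sketch with nibabel; the .annot path, label-0-as-medial-wall convention, and parcel ordering are assumptions for illustration:

```python
import numpy as np
import nibabel as nib

# One parcel id per fsaverage5 vertex (left hemisphere shown; right is handled the same way).
labels, _, _ = nib.freesurfer.read_annot("lh.parcellation400.annot")

def vertex_values_for_tr(parcel_activations: np.ndarray) -> np.ndarray:
    """parcel_activations: shape (400,) for a single TR -> one scalar per mesh vertex."""
    values = np.zeros(labels.shape[0], dtype=np.float32)
    cortex = labels > 0                                       # assume label 0 = medial wall / unknown
    values[cortex] = parcel_activations[labels[cortex] - 1]   # assume labels 1..400 follow parcel order
    return values
```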
The AI agent layer (the "Get personalized fix" button on each insight, plus the chat panel) runs on Backboard with four custom tools: list_past_runs, compare_runs, get_run_insights, and get_run_trends. Threads are persisted per run in Postgres so chat history survives restarts.
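The tools themselves are thin queries over Postgres. A sketch of what they do, with illustrative table names and access layer, and with the Backboard registration of these functions omitted:

```python
import os
import psycopg
from psycopg.rows import dict_row

def _query(sql: str, params: tuple) -> list[dict]:
    # Thin helper over the Neon Postgres instance; DATABASE_URL is assumed to be set.
    with psycopg.connect(os.environ["DATABASE_URL"], row_factory=dict_row) as conn:
        return conn.execute(sql, params).fetchall()

def list_past_runs(user_id: str, limit: int = 20) -> list[dict]:
    """Recent analyses for a user, so the agent can decide which runs to dig into."""
    return _query(
        "SELECT id, video_name, created_at FROM runs "
        "WHERE user_id = %s ORDER BY created_at DESC LIMIT %s",
        (user_id, limit),
    )

def get_run_insights(run_id: str) -> list[dict]:
    """Timestamped insights for one run, fetched only when a query actually needs them."""
    return _query(
        "SELECT timestamp_s, roi, summary FROM insights WHERE run_id = %s ORDER BY timestamp_s",
        (run_id,),
    )

def compare_runs(run_id_a: str, run_id_b: str) -> dict:
    """Two runs' insights side by side; the agent reasons over the differences in-context."""
    return {"run_a": get_run_insights(run_id_a), "run_b": get_run_insights(run_id_b)}

def get_run_trends(user_id: str) -> list[dict]:
    """Which signals flag issues most often across a user's history (the cross-run patterns)."""
    return _query(
        "SELECT i.roi, COUNT(*) AS occurrences FROM insights i "
        "JOIN runs r ON r.id = i.run_id WHERE r.user_id = %s "
        "GROUP BY i.roi ORDER BY occurrences DESC",
        (user_id,),
    )
```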
Stack: Next.js 16, FastAPI, Modal, Backboard, Auth0, Postgres on Neon, Prisma.
Challenges we ran into
Making a research model fast enough to demo. Our first end-to-end run took 148 seconds to analyze a 14-second clip: a state-of-the-art GPU running roughly 11× slower than the source video. Profiling traced 70% of the wall time to a single sequential loop deep inside the upstream model code, which we couldn't safely patch without taking on hidden coupling to a fast-moving research package. So we won back time at the boundaries first, pre-baking the 5 GB model checkpoint into our GPU image and stripping audio at the API edge to skip a 2-minute transcription path entirely. Then we carefully relaxed the no-patching rule for inference-only optimizations: batched GPU forwards, TF32 precision (~3× throughput for ~0.1% drift), and a single sequential video decode replacing ~1,700 random seeks. Each is one env var away from revert. Final: ~42s warm, down from 148.
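A sketch of the revert-friendly part; the env var and helper names are ours for illustration, not anything in the upstream package:

```python
import os
import torch

# TF32 matmuls on the A100: in our runs, roughly 3x throughput for ~0.1% drift.
if os.getenv("AESTHESIS_TF32", "1") == "1":
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

# Batched GPU forwards instead of one call per TR; set AESTHESIS_BATCH_TRS=1 to revert.
BATCH_TRS = int(os.getenv("AESTHESIS_BATCH_TRS", "8"))

@torch.inference_mode()
def predict_batched(model: torch.nn.Module, clips: torch.Tensor) -> torch.Tensor:
    """clips: (num_TRs, C, T, H, W) video windows; returns stacked parcel activations."""
    outs = [model(clips[i : i + BATCH_TRS]) for i in range(0, clips.shape[0], BATCH_TRS)]
    return torch.cat(outs, dim=0)

# The third switch (a single sequential decode instead of ~1,700 random seeks) lives in the
# video-loading path and is gated the same way; decoding specifics are omitted here.
```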
Accomplishments that we're proud of
- Took state-of-the-art research and turned it into a real, usable product
- Built a real anatomical brain, not a stock 3D blob. The cortical mesh in Aesthesis is the actual fsaverage5 pial surface (left + right hemispheres), colored per-vertex from TRIBE's 400-parcel output via a custom react-three-fiber shader.
- Honest neural-data UX. Small touches we obsessed over: chart x-axis bounded by the actual duration (not the backend's TR-padded estimate), insights filtered to drop padding-artifact moments, click-to-seek that pauses on the exact frame so you can see what triggered the response.
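The padding filter behind that last point is a few lines on the orchestrator side. A sketch, using the model's 1.5s TR and illustrative names:

```python
TR_SECONDS = 1.5

def padded_duration(num_trs: int) -> float:
    # What the backend would naively report: whole TRs, overshooting the real clip by up to one TR.
    return num_trs * TR_SECONDS

def trim_insights(insights: list[dict], actual_duration_s: float) -> list[dict]:
    """Drop padding-artifact insights anchored past the end of the real video."""
    return [i for i in insights if i["timestamp_s"] <= actual_duration_s]
```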
What we learned
- Honest UX matters more than impressive numbers. When you're presenting brain data, accuracy of presentation matters as much as accuracy of the model. Bounding the chart x-axis at the actual duration (not the backend's TR-padded estimate), filtering out insights anchored past the video end, freezing the video on the exact frame the user clicked: small touches like these are what make the product trustworthy.
- The value of an AI agent is operational, not capability-based. A single LLM call could compare past runs if you stuff them all into the prompt, but you'd hit context limits, burn tokens, and not scale past 50 runs. Selective tool calling via Backboard solved this elegantly. We exposed 4 tools; the agent decides what to fetch per query.
What's next for Aesthesis
- Capture & Assess (autonomous demos). Create an agent that takes a URL, navigates the experience for you, and records the session, so users don't have to record their own MP4.
- Persona-conditioned simulation. TRIBE v2 was trained on 720+ humans. We want to fine-tune the readout heads on demographic subsets (gamers, designers, older users) so a designer can ask: "how does THIS persona react to my UI?"
- Long-term vision: every UI decision, from a CTA copy tweak to a full redesign, should be testable for neural impact before it ships, the same way we currently test for performance and accessibility. Heatmaps and surveys are lossy proxies for what the brain actually does. Aesthesis is the first version of replacing those proxies with the signal.
Built With
- anthropic-claude-api
- auth0
- backboard
- fastapi
- ffmpeg
- framer-motion
- google-gemini-api
- modal
- next.js
- nilearn
- postgresql
- prisma
- python
- pytorch
- react
- react-three-fiber
- recharts
- redis
- tailwind-css
- three.js
- tribe-v2
- typescript
- vercel
- vercel-ai-sdk
- whisperx