# Metis

## Inspiration
We've all felt it: thirty minutes vanish into a Reels rabbit hole and you can't recall a single thing you watched. The feed isn't entertainment anymore — it's a stimulus pattern engineered against your reward circuitry, and you have no instrument to measure what it's doing to you. Step counts changed how people move. Sleep scores changed how people rest. There's no equivalent for the most cognitively expensive thing most of us do every day, which is scroll.
So we asked a stranger question: what if we could see your brain on the feed? Not metaphorically — actually estimate the cortical response, region by region, second by second. Metis is that instrument.
## What it does
Metis watches your short-form feed and predicts the neural response it's eliciting in you, then turns that into session-level feedback you can act on.
- Upload mode. Drop an Instagram or TikTok clip onto the lander. Metis runs it through a brain-encoding model, parses the predicted cortical activity into seven functional regions (reward, salience, self-reflection, visceral, face, social, control), and renders a 3D brain visualization plus a "time reclaimed" dashboard.
- Live mode (Chrome extension). A Manifest V3 extension captures the active Reels tab in 10-second rolling windows via chrome.tabCapture + an offscreen MediaRecorder, ships each window to the backend, and surfaces results in a side panel. After 20 minutes of scrolling you might see: "14 of the last 20 minutes triggered high-activation reward circuitry with control regions disengaged — the dopamine-bait signature."
- Pattern matching. On top of raw scores, Metis flags named neural patterns the user can recognize: Dopamine bait, Comparison spiral, Clickbait pattern, Autopilot scroll, Anxiety lean. Each comes with a plausibility cap so we don't over-claim — neuroscience reverse-inference is fuzzy and we're honest about it.
## How we built it
Machine learning core. The brain we're predicting onto belongs to TribeV2 (Meta FAIR's open brain-encoding model), which maps video frames to predicted BOLD signal across 20,484 cortical vertices on the fsaverage5 mesh — one prediction per second of video. TribeV2 internally uses a vision-language stack with a gated Llama-3.2 backbone, so the cold path includes a HuggingFace-gated checkpoint download cached on a persistent volume. Downstream of TribeV2 we use the Destrieux surface atlas (via Nilearn) to mask the 20k vertices into seven functional ROIs, then compute a composite addictiveness score: (reward + 0.5 × salience) / (control + ε).
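To make the scoring math concrete, here is a minimal TypeScript sketch, assuming per-ROI mean activations have already been extracted from the 20,484 vertices. The RoiMeans shape, field names, and threshold handling are illustrative rather than our production code; the formula itself is the one above.

```ts
// Mean predicted activation per functional ROI for one second of video.
// Only the three ROIs the composite uses are shown; the real parser
// tracks all seven regions. Field names here are illustrative.
interface RoiMeans {
  reward: number;
  salience: number;
  control: number;
}

const EPS = 1e-6; // the epsilon guard against a near-zero control denominator

// Composite addictiveness for one timestep:
// (reward + 0.5 * salience) / (control + epsilon)
function addictiveness(t: RoiMeans): number {
  return (t.reward + 0.5 * t.salience) / (t.control + EPS);
}

// Per-timestep view: score every second, then count the time spent above
// the "high" cut. This is the recomputation described under Challenges,
// which replaced the session-mean-only score.
function secondsInHighState(perSecond: RoiMeans[], highCut: number): number {
  return perSecond.filter((t) => addictiveness(t) >= highCut).length;
}
```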
Vultr-powered backend. The whole inference path runs on Vultr.
- TribeV2 runs on a Vultr Cloud GPU instance — the heavy A-class GPU is what makes per-second cortical prediction tractable. The model checkpoint lives on attached block storage so cold starts don't re-pull the 1 GB weights from HuggingFace.
- The FastAPI bridge runs on a Vultr compute instance, accepting multipart video uploads from both the React webapp and the Chrome extension, dispatching to the GPU node, then running the parsing pipeline locally for fast iteration.
- Session persistence is backed by Vultr-hosted storage, holding parsed results, brain color buffers, and per-session metadata so the dashboard can rehydrate prior sessions.
Putting GPU, API, and persistence under one provider meant intra-region latency between the bridge and the model node stayed in single-digit milliseconds, which mattered when we were debugging the rolling-buffer pipeline live.
Frontend. React + Vite for the upload demo and dashboard, with a Three.js / react-three-fiber brain renderer that consumes a per-vertex color buffer the backend writes after each inference. The Chrome extension is built with @crxjs/vite-plugin, runs MV3 with a side panel UI, and reuses the same /process API contract — no special endpoints for the live path.
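A sketch of what "one contract, two surfaces" looks like from the client side, in TypeScript. The form field name, the response shape, and the API base are assumptions for illustration; the three.js calls show how a backend-written per-vertex color buffer can re-color the cortex mesh.

```ts
import * as THREE from 'three';

// Upload one clip (or one 10-second capture window) to the shared endpoint.
// The "file" field name and the response shape are assumed, not confirmed.
async function process(blob: Blob, apiBase: string) {
  const form = new FormData();
  form.append('file', blob, 'clip.mp4');
  const res = await fetch(`${apiBase}/process`, { method: 'POST', body: form });
  if (!res.ok) throw new Error(`process failed: ${res.status}`);
  // Assumed response: flat per-vertex RGB values (20484 * 3) plus a score.
  return (await res.json()) as { colors: number[]; score: number };
}

// Apply the per-vertex color buffer to the brain mesh.
function applyColors(mesh: THREE.Mesh, colors: number[]) {
  const geometry = mesh.geometry as THREE.BufferGeometry;
  geometry.setAttribute(
    'color',
    new THREE.BufferAttribute(Float32Array.from(colors), 3),
  );
  const material = mesh.material as THREE.MeshStandardMaterial;
  material.vertexColors = true;
  material.needsUpdate = true;
}
```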
## Challenges we ran into
- MediaRecorder is not what you think. Calling start(timeslice) on a MediaRecorder does not produce independently playable mp4 fragments — only the full stop-emitted blob is a valid container. We had to stop and restart the recorder every 10 seconds and accept a single window in flight at a time, trading throughput for in-order, valid mp4s the GPU node could decode (see the capture sketch after this list).
- Tab capture in MV3. Service workers can't host MediaRecorder. We landed on a three-process split — service worker for orchestration, an offscreen document for capture and upload, side panel for UI — wired through chrome.runtime.sendMessage. Getting the getMediaStreamId handoff between worker and offscreen right took a full afternoon.
- Atlas labels lying to you. The original parsing spec referenced ROI names (G_fusiform, G_temporal_sup) that don't exist verbatim in Destrieux — they're sub-divided or differently spelled (G_oc-temp_lat-fusifor, G_temp_sup-*). Silent empty masks meant our first scores were nonsense before we caught it.
- Per-timestep vs session-mean. Our first dashboard literally displayed "NaNm of undefinedm sampled" because the parser only returned a session-level score, not per-second classification. We added a per-timestep recomputation of the same composite score, counted the seconds scoring above the "high" cut, and surfaced the result as time-in-state.
- GPU cold starts. First boot of the TribeV2 image on Vultr took 10–20 minutes (heavy ML deps, gated weight pull). We solved it by baking dependencies into a custom image and parking the checkpoint on persistent block storage, dropping warm starts to seconds.
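To make the three-process split concrete, here is a condensed TypeScript sketch of the capture path we landed on. The message shape, file name, and constant names are our own simplifications, the session stop condition is omitted, and the container type depends on what MediaRecorder negotiates; treat this as the shape of the solution, not the production extension code.

```ts
// --- service worker (orchestration) ---
// Mint a stream ID for the active tab and hand it to the offscreen
// document, the only context that can host MediaRecorder in MV3.
async function beginSession(tabId: number) {
  await chrome.offscreen.createDocument({
    url: 'offscreen.html',
    reasons: [chrome.offscreen.Reason.USER_MEDIA],
    justification: 'Tab capture for rolling 10-second analysis windows',
  });
  const streamId = await chrome.tabCapture.getMediaStreamId({ targetTabId: tabId });
  chrome.runtime.sendMessage({ type: 'start-capture', streamId }); // message shape is ours
}

// --- offscreen document (capture + upload) ---
chrome.runtime.onMessage.addListener(async (msg) => {
  if (msg.type !== 'start-capture') return;
  // Resolve the worker-issued stream ID into a real MediaStream.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: {
      mandatory: { chromeMediaSource: 'tab', chromeMediaSourceId: msg.streamId },
    },
  } as any);
  recordWindow(stream);
});

// One recorder per 10-second window: only the stop-emitted blob is a
// valid container, so we stop and restart instead of using start(timeslice).
// (Session stop condition omitted for brevity.)
function recordWindow(stream: MediaStream) {
  const recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (e) => void uploadWindow(e.data); // one window in flight
  recorder.onstop = () => recordWindow(stream); // roll into the next window
  recorder.start();
  setTimeout(() => recorder.stop(), 10_000);
}

// Ship a finished window to the same /process contract the webapp uses.
async function uploadWindow(blob: Blob) {
  const form = new FormData();
  form.append('file', blob, 'window.mp4'); // field name assumed, as above
  await fetch(`${API_BASE}/process`, { method: 'POST', body: form });
}
declare const API_BASE: string; // injected at build time in this sketch
```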
## Accomplishments that we're proud of
- End-to-end neural pipeline in a hackathon window. Video bytes go in, predicted cortical activity comes out, and the user sees their brain light up — on real Instagram clips, in a real browser extension.
- Honest neuroscience. We didn't fake confidence. Every pattern label carries a plausibility cap so users can tell Dopamine bait (medium-high) from Anxiety lean (low). Reverse inference is the original sin of neuroimaging UX, and we tried not to commit it.
- One backend, two surfaces. The webapp upload demo and the live Chrome extension hit the same /process endpoint with the same contract. Adding live capture didn't require rewriting the pipeline.
- A live brain. The 3D cortex view re-colors itself per session. Watching it shift between videos is the first time most testers actually believed the prediction was about them.
## What we learned
- Tab capture is a maze, but a navigable one. MV3 forced a clean component split we wouldn't have chosen otherwise, and we ended up with cleaner state boundaries because of it.
- Co-locating GPU + API matters more than picking the fastest GPU. Once we moved both onto Vultr in the same region, our end-to-end latency dropped more than upgrading the GPU tier did.
- Functional-region storytelling beats raw scores. "Reward up, control down" is a story. "Score: 2.3" is a number. The pattern layer was the difference between a demo and a product.
- Brain-encoding models are usable today. TribeV2 and its peers have crossed a threshold where they're not just research artifacts — they're a primitive you can wire into a consumer app and get something meaningful out.
## What's next for Metis
- Baseline normalization. Run TribeV2 on neutral nature footage to anchor the addictiveness score against a true zero, so labels become absolute rather than per-session percentile.
- TikTok adapter + multi-tab. The extension currently targets Instagram Reels in a single tab; TikTok and multi-tab session aggregation are next.
- On-device or edge inference. Even with a warm Vultr GPU, the round-trip is the throughput ceiling. We're investigating distilled student models that could run closer to the user, with the Vultr GPU reserved for high-fidelity reprocessing.
- Family / parental mode. Aggregated, privacy-preserving session summaries for parents — not raw watch logs. The same pattern detector, a different recipient.
- Open the dataset. Long-term, an opt-in corpus of (short-form clip → predicted cortical response) pairs is genuinely useful for researchers studying attention capture. We'd like to be the people who release it responsibly.
Metis started as a question — what is the feed actually doing to me? — and became an instrument that begins to answer it. We think every scroll deserves a number.