Cleo (CLear Environment Overview)

Real-time environmental intelligence for autistic individuals and their caregivers.

Inspiration

In the United States, about 1 in 36 children is identified as autistic, according to CDC estimates. Most autistic people navigate daily life without specialist support, alongside caregivers who make critical decisions about environments based entirely on experience and gut instinct.

When it goes wrong, the consequences go far beyond a difficult afternoon. Repeated sensory overload events erode confidence, shrink the person's world, and fuel the caregiver burnout that is endemic in this community. Autistic individuals are disproportionately excluded from public life — supermarkets, schools, medical appointments, social events, workplaces — not because they cannot participate, but because nobody can tell them in advance whether a specific environment will overwhelm their nervous system.

This is not a niche problem. It is a daily reality for millions of families, and we could not find a tool that helps them navigate it.

We came across TRIBE v2, a tri-modal foundation model published by Meta FAIR in March 2026, which predicts whole-brain fMRI responses from audiovisual input, validated across 720 subjects and over 1,000 hours of brain imaging data. We asked one question:

What if we used the most sophisticated brain response model ever built to solve one of the most overlooked accessibility problems in the world?

Never done before. Until today.

What it does

Cleo is a real-time environmental intelligence platform for caregivers of autistic individuals.

You point your phone camera at any environment. Cleo analyses a short audiovisual clip of the scene and predicts the sensory load that environment will place on the brain, before you walk in.

The output is simple and immediate: a score from 1 to 10 with colour coding, the specific sensory factors driving that score, personalised recommendations for that individual, and a projection of whether the environment will get harder or easier the longer you stay.

The neuroscience behind the score focuses on three brain regions central to autistic sensory experience: the amygdala, hippocampus, and thalamus. These are the regions most often linked to sensory overload in autistic youth.

How we built it

Cleo is split across three boxes that talk over HTTP.

The Swift iOS app is what the user actually touches — point the phone, record ~30 seconds of audio and video, hit upload. That clip POSTs to a small FastAPI service (src/cleo/api.py) which queues the job, holds it in memory, and exposes /jobs/{id} for the app to poll.
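The queue-and-poll contract described above can be sketched without any web framework. This is a minimal illustration, not the code in src/cleo/api.py: the `Job` and `JobStore` names, the status strings, and the response shape are all assumptions, and a real FastAPI route would simply wrap `submit` and `poll`.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Job:
    """One uploaded clip and its processing state (all fields are illustrative)."""
    job_id: str
    status: str = "queued"            # queued -> running -> done | failed
    result: Optional[dict] = None     # filled in when the pipeline finishes


@dataclass
class JobStore:
    """In-memory job registry; contents are lost when the process restarts."""
    jobs: Dict[str, Job] = field(default_factory=dict)

    def submit(self) -> Job:
        # Called by the upload handler: register the clip and return an id to poll.
        job = Job(job_id=uuid.uuid4().hex)
        self.jobs[job.job_id] = job
        return job

    def complete(self, job_id: str, result: dict) -> None:
        # Called by the pipeline once a forecast is ready.
        job = self.jobs[job_id]
        job.status = "done"
        job.result = result

    def poll(self, job_id: str) -> dict:
        # Shape of the body a GET /jobs/{id} handler could return.
        job = self.jobs.get(job_id)
        if job is None:
            return {"status": "not_found"}
        return {"status": job.status, "result": job.result}


store = JobStore()
job = store.submit()
print(store.poll(job.job_id))    # status is "queued", result is None
store.complete(job.job_id, {"load_score": 7})
print(store.poll(job.job_id))    # status is "done" with the result attached
```

Holding jobs in a plain dictionary keeps the service simple, at the cost of losing every pending job if the process restarts.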

The FastAPI layer is the orchestrator. It does not do any ML itself — it just brokers between the phone and the GPU cluster (Cornell's ellis-compute-02, ~24 GB VRAM), where Meta's TRIBE v2 actually runs. tribe_client.py is the thin HTTP wrapper that uploads media and text, polls the remote /v1/runs/{id} endpoint, and pulls back the predicted fMRI tensors. If the GPU is unreachable mid-demo, the API silently falls back to a mock forecast so we don't die on stage.
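The poll-then-fall-back behaviour can be sketched as below. This is an illustration under assumptions, not the contents of tribe_client.py: `GpuUnavailable`, `mock_forecast`, `forecast_with_fallback`, the status strings, and the `source` field are hypothetical names, and `fetch_run` stands in for whatever call reads the remote /v1/runs/{id} endpoint.

```python
import time
from typing import Callable


class GpuUnavailable(Exception):
    """Raised by the (hypothetical) client when the GPU service cannot be reached."""


def mock_forecast() -> dict:
    # Placeholder payload; the "source" flag marks it as not a real prediction.
    return {"load_score": 5, "source": "mock"}


def forecast_with_fallback(
    fetch_run: Callable[[], dict],
    max_polls: int = 60,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_run until the run is done; return a mock forecast on failure or timeout."""
    for _ in range(max_polls):
        try:
            run = fetch_run()
        except GpuUnavailable:
            return mock_forecast()
        if run.get("status") == "done":
            return {**run["result"], "source": "tribe"}
        if run.get("status") == "failed":
            return mock_forecast()
        sleep(interval_s)
    return mock_forecast()   # gave up waiting


# Usage with a fake remote that is still running on the first poll.
states = iter([{"status": "running"}, {"status": "done", "result": {"load_score": 8}}])
print(forecast_with_fallback(lambda: next(states), sleep=lambda s: None))
```

Tagging each forecast with its source is a design suggestion rather than something the writeup describes: the fallback can stay invisible to the audience while remaining visible in logs.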

The pipeline (pipeline.py) is a straight chain:

1. ffmpeg demuxes the clip into audio and video frames
2. Claude Vision writes a one-paragraph objective caption of the scene (captioning.py)
3. That caption plus the raw media go to TRIBE v2 on the GPU, which returns predicted BOLD activation on the fsaverage5 cortical surface (20,484 vertices) plus 8,802 subcortical voxels
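The demux step at the start of the chain could look roughly like this. The sketch only builds the ffmpeg argument lists and does not run them; the function name, output file names, frame rate, and sample rate are assumptions chosen for illustration, not values taken from pipeline.py.

```python
from pathlib import Path
from typing import List


def demux_commands(clip: Path, out_dir: Path, fps: int = 2, sample_rate: int = 16000) -> List[List[str]]:
    """Build ffmpeg argument lists that split a clip into a mono WAV and JPEG frames."""
    audio_cmd = [
        "ffmpeg", "-y", "-i", str(clip),
        "-vn",                       # drop the video stream
        "-ac", "1",                  # downmix to mono
        "-ar", str(sample_rate),     # resample the audio
        str(out_dir / "audio.wav"),
    ]
    frames_cmd = [
        "ffmpeg", "-y", "-i", str(clip),
        "-an",                       # drop the audio stream
        "-vf", f"fps={fps}",         # sample frames at a fixed rate
        str(out_dir / "frame_%04d.jpg"),
    ]
    return [audio_cmd, frames_cmd]


for cmd in demux_commands(Path("clip.mov"), Path("work")):
    print(" ".join(cmd))
    # To execute for real: subprocess.run(cmd, check=True)
```

Passing arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.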

We then aggregate two ways. A Destrieux ROI grouping (auditory / visual / limbic-adjacent), combined with the Harvard-Oxford subcortical regions, produces the caregiver-facing sensory load score:

$$\text{Load Score} = f(\alpha \cdot A + \beta \cdot H + \gamma \cdot T)$$

where $A$ is predicted amygdala activation, $H$ is predicted hippocampal activation, $T$ is predicted thalamic activation, and $f$ maps the weighted sum onto the 1 to 10 score. The personalised weights satisfy:

$$\alpha + \beta + \gamma = 1, \quad \alpha, \beta, \gamma \geq 0$$
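As a worked illustration of the formula, the sketch below checks the weight constraints and then applies one possible $f$. The writeup does not specify $f$, so the logistic squash and the rescaling to 1 to 10 are assumptions, as is the function name `load_score`; inputs are taken to be z-scored activations.

```python
import math
from typing import Tuple


def load_score(a: float, h: float, t: float, weights: Tuple[float, float, float]) -> float:
    """Weighted sum of z-scored ROI activations, squashed onto a 1-10 scale."""
    alpha, beta, gamma = weights
    if min(weights) < 0 or not math.isclose(alpha + beta + gamma, 1.0, abs_tol=1e-9):
        raise ValueError("weights must be non-negative and sum to 1")
    s = alpha * a + beta * h + gamma * t
    s = max(-50.0, min(50.0, s))               # clamp so exp() cannot overflow
    squashed = 1.0 / (1.0 + math.exp(-s))      # assumed f: logistic, maps to (0, 1)
    return round(1.0 + 9.0 * squashed, 1)      # rescale to the 1..10 range


# Same scene, two hypothetical profiles: one weights the amygdala, one the hippocampus.
print(load_score(2.0, 0.0, 0.0, (1.0, 0.0, 0.0)))
print(load_score(2.0, 0.0, 0.0, (0.0, 1.0, 0.0)))
```

With all three activations at baseline (z = 0) this choice of $f$ lands on the midpoint of the scale, and shifting weight toward the most activated region raises the score.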

A Glasser 360-parcel plus 8 subcortical breakdown produces the fuller neuroscience report.

Claude then writes the final caregiver brief (score, dominant region, and before/during/distress tips), grounded in the z-scored TRIBE numbers so that the text stays tied to the predicted signal rather than inventing one.
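One way to produce the z-scored numbers that ground the brief is sketched below. The writeup does not say what the activations are z-scored against, so the per-region baseline samples are an assumption, and `z_scores` and `brief_facts` are hypothetical helper names.

```python
from statistics import mean, pstdev
from typing import Dict, List


def z_scores(current: Dict[str, float], baseline: Dict[str, List[float]]) -> Dict[str, float]:
    """Z-score each region's activation against that region's baseline samples."""
    out = {}
    for region, value in current.items():
        samples = baseline[region]
        sd = pstdev(samples)
        # A flat baseline carries no scale information, so report 0 instead of dividing by 0.
        out[region] = 0.0 if sd == 0 else round((value - mean(samples)) / sd, 2)
    return out


def brief_facts(z: Dict[str, float]) -> Dict[str, object]:
    """Numbers handed to the language model; the prompt would ask it to cite only these."""
    dominant = max(z, key=z.get)
    return {"dominant_region": dominant, "z_scores": z}


z = z_scores(
    {"amygdala": 3.0, "thalamus": 1.0},
    {"amygdala": [0.0, 2.0], "thalamus": [1.0, 1.0]},
)
print(brief_facts(z))
```

Passing the model a small dictionary of computed facts, rather than raw tensors, makes it easy to check afterwards that every number in the brief appears in the input.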

Total round-trip: ~5 minutes per clip, dominated by GPU inference.

Stack: Swift / SwiftUI · FastAPI · TRIBE v2 (Meta FAIR) · Claude Vision · Claude · ffmpeg · Destrieux ROI · Glasser 360 · Harvard-Oxford subcortical atlas · Cornell GPU cluster

Challenges we ran into

Finding the right problem. We spent around 40 minutes in ideation before landing on Cleo. Several ideas came up more than once and we kept circling back without a clear direction. The breakthrough came when we stopped asking "what can this technology do" and started asking "who has no tool right now and desperately needs one." That reframe unlocked everything.

Understanding the neurobiology of ASD. TRIBE v2 was trained and validated on neurotypical adult brains. Applying it meaningfully to non-verbal autistic children meant we could not simply take the model's outputs at face value. We had to research the specific neurobiology of ASD in young people — how thalamic sensory gating is reduced in autistic brains, why the amygdala activation threshold is lower, and how hippocampal context processing differs. This shaped which brain regions we focus on, how we set threshold values, and why unpredictability is weighted so heavily in the load score. It also shaped our honesty about the validation gap, which we address directly in the product.

Compute constraints. We were working within a 6,000-credit NVIDIA GPU allocation. Running TRIBE v2's full tri-modal inference pipeline in anything approaching real time required careful decisions about batch size, window length, and which model layers to extract features from. We made deliberate trade-offs between inference quality and latency to keep the product usable within a constrained budget.

Mapping raw tensor output to brain regions. TRIBE v2's raw output is a tensor representing predicted BOLD activation across 20,484 cortical vertices and 8,802 subcortical voxels. To extract named, meaningful brain region activity values we had to apply brain atlases — the Destrieux parcellation and Harvard-Oxford subcortical atlas — to map the tensor onto the anatomical regions that drive Cleo's load score. Getting that mapping pipeline working correctly and efficiently was one of the most technically demanding parts of the build.
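The core of that mapping is a label lookup followed by a per-region average. The sketch below shows the idea on a toy five-element array in pure Python; the function name `roi_means` and the label numbering are assumptions, and the same logic would run over the full vertex and voxel arrays, most likely vectorised with NumPy.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def roi_means(activation: Sequence[float], labels: Sequence[int], names: Dict[int, str]) -> Dict[str, float]:
    """Average per-vertex (or per-voxel) activation within each labelled atlas region."""
    if len(activation) != len(labels):
        raise ValueError("activation and labels must have the same length")
    buckets: Dict[str, List[float]] = defaultdict(list)
    for value, label in zip(activation, labels):
        if label in names:            # skip unlabelled or unused regions
            buckets[names[label]].append(value)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


# Toy example: label 0 is background, labels 1 and 2 are two named regions.
activation = [0.5, 1.5, 1.0, 3.0, 9.0]
labels = [1, 1, 2, 2, 0]
print(roi_means(activation, labels, {1: "thalamus", 2: "amygdala"}))
```

The length check matters in practice: an atlas resampled to a different surface resolution than the model output would otherwise be averaged silently against the wrong vertices.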

Connecting the backend to the iOS frontend. Bridging the Python inference pipeline to a live iPhone camera feed introduced integration challenges around real-time data streaming, latency management, and keeping the mobile UI responsive while inference runs server-side. AVFoundation session configuration and device provisioning added further friction that we worked through under time pressure.

Accomplishments that we're proud of

We built a working end-to-end inference pipeline, from iPhone camera and microphone input through TRIBE v2 to a meaningful, personalised brain activation prediction, in a single hackathon weekend.

To our knowledge, Cleo is the first application of TRIBE v2 to real-world accessibility and mental health. The model was published six weeks ago. We saw what it could do and pointed it at a problem that affects millions of non-verbal autistic children and their caregivers every single day.

We are proud of the output language. Every prompt in Cleo was written with intention. Non-verbal autistic children cannot tell their caregivers when an environment is becoming too much. Cleo does it for them — and it does it before the moment of crisis, not after. When the score reaches 10, Cleo tells the caregiver: "This is not a failure. This is information." That line took longer to write than most of the code. The environment is always the problem. Never the child.

We are proud that we treated ethics and child data privacy as design constraints from hour one, not as boxes to tick at the end. Cleo scans environments. It never records, stores, or processes images of children.

What we learned

The distance between a model output and a genuinely useful product is enormous, and closing it is the real design challenge. TRIBE v2 produces numbers. A caregiver standing outside a venue with a non-verbal child who is already becoming distressed needs one clear sentence. That translation layer is where most of the product actually lives.

We learned that unpredictability drives autistic sensory overload more consistently than raw intensity. A familiar loud environment is often tolerable. An unfamiliar quiet one with a single unexpected sound can tip a child into overload. This finding from the neuroscience literature shaped the whole architecture — it is why the hippocampus sits alongside the amygdala and thalamus in Cleo's model, and why environmental predictability carries such heavy weight in the load score.

We also learned that the communication barrier is the core problem, not a secondary one. The reason caregivers of non-verbal autistic children are making environment decisions on pure instinct is not a lack of research or resources. It is that the person most affected cannot tell them what they need to know. Every design decision in Cleo flows from that insight. The product exists to give caregivers the information their child cannot give them.

Treating ethical constraints as design constraints produced better decisions rather than harder ones. Every time we asked whether something could cause harm, we found a cleaner solution. The no-data-storage decision came from a child privacy concern and simplified the architecture. The dignity-first language principle came from an ethical commitment and produced more useful outputs. Responsible design and good design pointed in the same direction every time.

What's next for Cleo

The immediate priority is clinical validation. TRIBE v2 was validated on neurotypical adults. We need a research partnership with an occupational therapy or neuroscience department to validate Cleo's predictions specifically in autistic child populations. We are clear that Cleo is a navigation tool and not a clinical instrument, and building the evidence base to close that gap responsibly is the work directly ahead of us.

In the near term, the Sensory Profile learning loop becomes the core product differentiator. With outcome feedback from real outings, the profile becomes genuinely personal — getting measurably more accurate over the first 10 to 15 uses as it learns which environments work for this specific child and which do not. No other tool in this space learns from the individual over time in this way.

In the medium term, Cleo's environmental scoring integrates into the platforms caregivers already use: Google Maps, booking platforms, school management systems. Every venue gets a Cleo score, and caregivers see it when they search. Venues gain a direct commercial incentive to improve their sensory environments to attract a large and underserved customer base. Accessibility improves because the market rewards it, not because it is mandated.

The longer-term goal is a standard. No credible, data-grounded sensory environment standard exists anywhere in the world today. As Cleo's anonymised, consented dataset grows, it becomes the foundation for the first evidence-based sensory accessibility certification, a verified signal that a space genuinely works for people with different sensory needs.

On scalability: the current pipeline processes one clip per job. Moving to a job queue with horizontal GPU worker scaling means throughput grows roughly linearly with the number of workers. In production, inference moves to a managed GPU cluster with autoscaling tied to job queue depth. The same pipeline that serves autistic children would, in principle, need only a recalibrated Sensory Profile and adjusted atlas weightings to serve people with early-stage dementia or other conditions involving altered sensory processing.
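Autoscaling tied to queue depth can be as simple as the rule sketched here. The function name, the jobs-per-worker ratio, and the worker limits are assumptions for illustration; a managed cluster would normally express the same rule in its own scaling policy.

```python
import math


def desired_workers(queue_depth: int, jobs_per_worker: int = 2,
                    min_workers: int = 1, max_workers: int = 8) -> int:
    """Target GPU worker count for a given backlog, clamped to a fixed range."""
    if queue_depth < 0 or jobs_per_worker < 1:
        raise ValueError("queue_depth must be >= 0 and jobs_per_worker >= 1")
    wanted = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))


for depth in (0, 5, 100):
    print(depth, "queued ->", desired_workers(depth), "workers")
```

The upper clamp is what keeps a burst of uploads from exhausting a fixed GPU budget, and the lower clamp keeps one worker warm so the first job does not wait for a cold start.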