Inspiration
We created Clarity out of frustration with current accessibility tools. In high school, one of us was introduced to Read&Write after being tested for dyslexia. Although it was meant to help, it worked on only half the websites, read irrelevant text aloud, and was so cumbersome that it was more convenient to give up and read everything unaided, even though listening was a more effective way to learn.
Both of us later struggled with office hours at CMU. As neurodivergent students, we had more questions than our classmates, yet limited time and social pressure kept us from getting all the help we needed. Far too often, we left without being fully heard.
Clarity was made to change that. We wanted a judgment-free, always-available, voice-based learning environment that learns alongside the user, promotes accountability, and enables neurodivergent students to learn in ways existing tools simply can't.
What It Does
Clarity is like having a personal tutor who can look at what's on your screen, understand what you're doing, and answer your questions in real time:
- Contextual Understanding: Takes screenshots of your desktop, dissects them, and explains on-screen content in plain language.
- Interactive Learning: Lets you stop, ask follow-up questions, and direct the discussion as if you had a live tutor.
- Accountability Tools: Sends you reminders, takes review notes, and indexes them for you.
- Adaptive Experience: Uses reinforcement learning to modify text-to-speech speed, difficulty, and explanation tone based on how "lost" the user seems.
- Low Latency: Delivers responses, screen analysis, and voice output in 3–5 seconds.
It's 24/7 office hours, but without fear or judgment.
How We Built It
- Desktop Overlay: Built in Electron with a lightweight, persistent overlay UI.
- Voice Agent: Run through LiveKit Cloud, powered by GPT real-time for natural, conversational responses.
- Speech Pipeline: GPT-4.0 Transcribe for speech-to-text input and ElevenLabs for natural, high-speed TTS output.
- Screen Understanding: A dedicated desktop capture feature in Electron delivers screenshots to GPT-4.0, which vectorizes and comprehends on-screen content.
- System Control: Added an MCP server that generates AppleScript dynamically so that Clarity can open apps, interact with files, and control the Mac system.
- Reinforcement Learning Agent: Implemented in PyTorch + NumPy. Learns session parameters (e.g., reading speed, verbosity) using an n-step actor-critic algorithm. Data signals include text length, TTS pacing, and user "lostness" scores.
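To illustrate the n-step actor-critic idea mentioned for the reinforcement learning agent, here is a minimal sketch of how n-step bootstrapped returns and advantages can be computed. It is not our training code: it uses plain Python instead of PyTorch, and the function names, the discount factor, and the idea of using a per-step reward (for example, a negated "lostness" score) are assumptions made for illustration.

```python
from typing import List


def n_step_returns(rewards: List[float], values: List[float],
                   gamma: float = 0.99, n: int = 3) -> List[float]:
    """Compute n-step bootstrapped returns for one episode.

    rewards[t] is the reward received after step t, and values[t] is the
    critic's estimate for the state at step t. Steps past the end of the
    episode are treated as terminal, so no bootstrap value is added there.
    """
    T = len(rewards)
    returns = []
    for t in range(T):
        end = min(t + n, T)
        g = 0.0
        # Discounted sum of up to n real rewards.
        for k in range(t, end):
            g += (gamma ** (k - t)) * rewards[k]
        # Bootstrap from the critic only if the episode has not ended.
        if end < T:
            g += (gamma ** (end - t)) * values[end]
        returns.append(g)
    return returns


def advantages(rewards: List[float], values: List[float],
               gamma: float = 0.99, n: int = 3) -> List[float]:
    """Advantage = n-step return minus the critic's estimate."""
    returns = n_step_returns(rewards, values, gamma, n)
    return [g - v for g, v in zip(returns, values)]
```

In an actor-critic setup, the advantages would typically weight the actor's policy-gradient update while the returns serve as regression targets for the critic. With short episodes, most steps fall back to the terminal case, which is one way to see why narrow episode statistics give the agent little to learn from.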
Challenges We Faced
- RL Training: Our episode statistics were too narrow to sensibly adjust parameters in real time. The approach was powerful on paper but limited in practice by the rhythm of screenshots and user interactions.
- Image Capture Bottlenecks: We initially relied on GPT-4.0 directly, but this made the pipeline lag. We worked around it by setting up a custom Electron capture service and chaining GPT-4.0 with GPT real-time.
- Latency Management: Ensuring sub-5-second turnaround for screen analysis + conversational response required close coordination between multiple services.
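One common way to hold a multi-service turn to a latency budget is to run independent stages concurrently and enforce a single timeout over the whole turn. The sketch below shows that pattern with Python's asyncio. The two stage functions are hypothetical placeholders that simulate work with short sleeps; they do not call any real service, and the 5-second default only mirrors the target described above.

```python
import asyncio


async def analyze_screen() -> str:
    # Hypothetical stand-in for the screenshot analysis call.
    await asyncio.sleep(0.05)
    return "screen summary"


async def transcribe_audio() -> str:
    # Hypothetical stand-in for the speech-to-text call.
    await asyncio.sleep(0.03)
    return "user question"


async def run_turn(budget_s: float = 5.0):
    """Run both stages concurrently; return None if the budget is exceeded."""
    try:
        return await asyncio.wait_for(
            asyncio.gather(analyze_screen(), transcribe_audio()),
            timeout=budget_s,
        )
    except asyncio.TimeoutError:
        # Both stages are cancelled when the timeout fires.
        return None
```

Calling `asyncio.run(run_turn())` returns the two stage results as a list in the order they were passed to `gather`. Because the stages overlap, the turn takes roughly as long as the slowest stage rather than the sum of both.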
Accomplishments We're Proud Of
- Dynamic AppleScript Control: Seamlessly executing user commands by dynamically creating scripts.
- Low-Latency Voice Loop: Designing a voice agent that is interruptible, reactive, and conversational while continuing to run background tasks.
- Multimodal Orchestration: Successfully chaining GPT real-time with GPT-4.0 to analyze screenshots and offer human-like explanations in 3–5 seconds.
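As a rough illustration of the dynamic AppleScript control listed above, the sketch below builds a one-line AppleScript from a user-supplied app name and runs it through the macOS `osascript` command-line tool. The helper names are hypothetical and this is not our MCP server code; it only shows the general shape of generating a script string safely and executing it.

```python
import subprocess


def escape_applescript(text: str) -> str:
    """Escape backslashes and double quotes for an AppleScript string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_open_app_script(app_name: str) -> str:
    """Return a one-line AppleScript that brings the named app to the front."""
    return f'tell application "{escape_applescript(app_name)}" to activate'


def run_applescript(script: str) -> str:
    """Run a script with the osascript CLI and return its stdout (macOS only)."""
    result = subprocess.run(
        ["osascript", "-e", script],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

On a Mac, `run_applescript(build_open_app_script("Safari"))` would launch or focus Safari. Escaping the app name before interpolating it matters here because the name ultimately comes from spoken user input.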
What We Learned
- How to combine Electron overlays + MCP servers to control a computer.
- How to make voice-optimized AI agents that sound natural and support real-time interruption and resumption.
- How to combine reinforcement learning with LLM + TTS pipelines, even when limited data prevents full deployment.
What's Next for Clarity
- Campus Disability Resource Offices: Deploy Clarity as a tool for neurodivergent students, reducing reliance on clunky, outdated software.
- Workplace Onboarding: Expand to employee training, where new hires often need a private, safe space to ask "obvious" questions.
- Behavior-Adaptive Tutoring: Expand RL models to adapt not just TTS speed, but also explanation tone and accountability nudges.
- Seamless Multi-Platform Support: Expand from Mac to Windows/Linux, and even to AR glasses and mobile overlays.
- Our dream: A world in which no student feels stupid for asking too many questions, because help is always available, flexible, and shame-free.
Built With
- applescript
- electron
- elevenlabs
- javascript
- livekit
- node.js
- numpy
- openai-gpt-4/whisper/vision
- python
- pytorch
- silero
- websockets
