context/history/why we built this

imagine not being able to use your hands. don't laugh... over 13 million Americans face this problem due to conditions such as cerebral palsy and ALS. how are you going to text your friends, buy stuff on Amazon, or search up a MrBeast video to watch?

what do they use? eye-gaze cameras that cost $15,000, or that awful head-tracking technology built into macOS. or, asking someone else to do it for them.

Voice agents today are really dumb and slow. Alexa, Google Assistant, and Siri all wait for the user to stop talking, then process the request for 2-3 seconds just to do one thing. They're also limited to an incredibly small set of tasks. We asked Siri to tell us a joke related to programming; instead of answering, it simply redirected the question to ChatGPT: "Would you like to use ChatGPT for this?". We asked Alexa what our latest email was about; it couldn't.

Meanwhile, the new wave of ai browser agents (Claude for Chrome, Perplexity Comet, OpenAI Atlas) use super inefficient and costly techniques to navigate the web:

  • Taking screenshots and outputting specific (x, y) coordinates to click on
  • Needing to re-learn websites' structure
  • And much more, all contributing to a terrible user experience

Props to Anthropic and OpenAI for trying to allow an AI agent to browse the web; at least they gave it a shot... just kidding.

Anyways. We asked: what if your browser just listened and acted, hands-free, instantly, before you even finish your sentence?

What it does

Runic is a voice-powered browser agent that controls your browser without the need for screenshots or mouse simulation. No, this isn't:

  • "Open New Tab. Activate Keyboard. H-T-T-P-S-COLON-SLASH-SLASH-G-O-O-G-L-E-DOT-COM". Not hardcoded commands to memorize.
  • Reverse engineering specific websites such as Google Calendar and linking it to an agent via tool calls.

What we built this weekend is a browser that lets you literally say what YOU want: "go to amazon, search for noise cancelling headphones, sort by reviews, open the top one...actually, can you go back and click the second one? great, now buy it for me."

Runic reads the actual live page and reasons about how to do what you asked. Ask it to do something nobody ever anticipated? It still works.

Our favorite part of Runic

Picture this: you say to Runic, "Can you open Google and search up cute dogs, and go to the first video?"

The moment the words "Can you open Google" exit your mouth, Runic navigates to https://google.com in a new tab. The moment you finish saying "and search up cute dogs", Runic has found the search input box and is typing "cute dogs" in it.

Runic doesn't wait for you to finish talking. It acts live on your browser, literally while you talk. Imagine that.
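That mid-sentence behavior can be sketched roughly like this. This is a toy Python sketch under our own assumptions, not Runic's actual code: `complete_clauses` and `StreamingDispatcher` are hypothetical stand-ins, and appending to a list stands in for firing a real browser action.

```python
# Hypothetical sketch: act on clauses as partial transcripts stream in,
# instead of waiting for the full utterance.
import re

def complete_clauses(partial):
    """Split a partial transcript into clauses. The last fragment may
    still be growing, so only clauses before a connective count as done."""
    parts = re.split(r",| and then | and ", partial)
    return [p.strip() for p in parts[:-1] if p.strip()]

class StreamingDispatcher:
    def __init__(self):
        self.dispatched = 0   # how many clauses have already been acted on
        self.log = []

    def on_partial(self, partial):
        # Fire each newly completed clause exactly once, mid-utterance.
        # (A real system would also flush the final clause on end-of-speech.)
        clauses = complete_clauses(partial)
        for clause in clauses[self.dispatched:]:
            self.log.append(clause)   # stand-in for a real browser action
        self.dispatched = max(self.dispatched, len(clauses))

d = StreamingDispatcher()
# Partial transcripts arrive while the user is still talking:
d.on_partial("open google")
d.on_partial("open google and search up cute dogs")
d.on_partial("open google and search up cute dogs and go to the first video")
print(d.log)   # first two clauses dispatched before speech even ends
```

The key trick is that each partial transcript is a superset of the last, so the dispatcher only needs a counter of clauses already handled to avoid double-firing.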

How we built it

Challenges we ran into

Saving context across speech-to-text actions was a difficult first hurdle. Since we don't wait for the user to finish speaking before starting actions, the model lost context after each action. For example, we would ask: "Open Facebook Marketplace and find me a used car." The model would open Facebook Marketplace after the first part of the command, then switch to Google to search for used cars. We used Palantir AIP as the solution for context:

  1. The agent has no memory. It's born, it acts, it dies. Every sentence = fresh agent.
  2. AIP is the only brain. Everything the agent knows comes from one query to AIP at startup. After that, zero lookups allowed: speed is a rule, not an optimization.
  3. Learning happens between your sentences. After the agent acts, background jobs extract info (people, documents, companies) and write them to AIP. By the time you speak again, the next agent gets a richer package.
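The three rules above can be sketched as a small loop. This is a hypothetical sketch: Palantir AIP is stubbed out as an in-memory `ContextStore`, and the entity extraction is a toy heuristic standing in for the real background jobs.

```python
# Hypothetical sketch of the stateless-agent memory loop.
class ContextStore:
    """Stand-in for AIP: one read at agent startup, writes between turns."""
    def __init__(self):
        self.facts = []

    def fetch_package(self):
        # Rule 2: the only lookup the agent ever gets.
        return list(self.facts)

    def write(self, fact):
        self.facts.append(fact)

class Agent:
    """Rule 1: born with a context package, acts once, then dies."""
    def __init__(self, context):
        self.context = context

    def act(self, utterance):
        # No store lookups here: speed is a rule, not an optimization.
        return f"acting on {utterance!r} with {len(self.context)} known facts"

def extract_entities(utterance):
    # Rule 3: background job pulls out people/documents/companies.
    # (Toy heuristic: capitalized words stand in for real entity extraction.)
    return [w for w in utterance.split() if w[:1].isupper()]

def handle_sentence(store, utterance):
    agent = Agent(store.fetch_package())        # fresh agent per sentence
    result = agent.act(utterance)
    for fact in extract_entities(utterance):    # learning between sentences
        store.write(fact)
    return result

store = ContextStore()
handle_sentence(store, "Open Facebook Marketplace and find used cars")
print(handle_sentence(store, "Message the seller"))
```

Because writes happen after the agent acts, the extraction work never sits on the latency-critical path; the next sentence simply starts from a richer package.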

What we learned

Separating planning (LLM decomposes commands into tool calls) from execution (deterministic action registry) keeps latency low and behavior predictable. You cannot get sub-sentence action dispatch by optimizing a full-transcript pipeline. It requires a fundamentally different architecture with a separate intent classifier running on streaming partial transcripts.
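The planner/executor split can be illustrated with a short sketch. Everything here is hypothetical: the LLM planner is replaced by a canned plan so the deterministic executor side is runnable, and the action names and signatures are our own inventions, not Runic's real tool set.

```python
# Hypothetical sketch of planning vs. deterministic execution.
from typing import Callable

# Deterministic action registry: each tool is a plain function.
REGISTRY: dict[str, Callable[..., str]] = {}

def action(name):
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register

@action("navigate")
def navigate(url: str) -> str:
    return f"navigated to {url}"

@action("type_text")
def type_text(selector: str, text: str) -> str:
    return f"typed {text!r} into {selector}"

def plan(command: str):
    # Stand-in for the LLM planner that decomposes a command into tool
    # calls. A real planner would produce this from a model response.
    return [
        ("navigate", {"url": "https://google.com"}),
        ("type_text", {"selector": "input[name=q]", "text": "cute dogs"}),
    ]

def execute(steps):
    # Execution is pure registry lookups: no model in the loop, so latency
    # and behavior stay predictable.
    return [REGISTRY[name](**args) for name, args in steps]

results = execute(plan("open google and search up cute dogs"))
print(results)
```

The payoff of this shape is that only `plan` ever waits on a model; `execute` is deterministic and cheap, so replanning and retries never multiply inference cost.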

What's next for Runic

  • Finishing touches on web and the Apple environment
  • Using a better speech-to-text model for more accurate transcriptions
  • Optimizing speech-to-action latency
  • Improving agent decision logic through deeper model configuration
  • Releasing a public beta
