Aggie

A voice companion in your browser and on your phone. You talk to it, and it runs your tasks as agents you can watch and steer.

Inspiration

Every voice assistant makes you wait your turn. You ask, it answers, you ask again. I wanted the other regime: you talk while the agent works, and you run several at once. Thinking Machines Lab and Hey Clicky point at it.

If you take it further, the model stops waiting for turns. It runs a query every 200 milliseconds, summarizes what has happened so far, and keeps a connection open for each session. I suspect that with a good enough interface, a single user can issue enough queries that serving them becomes chiefly compute-bound. That crossover is the ridge point on the roofline. A single decode stream reads every weight once per token and barely uses the compute, so it stays memory-bound. Hold many of that user's queries in flight and each loaded weight gets reused across them, which raises the arithmetic intensity and pushes the work past the ridge to the compute-bound side. A continuous interface gets from one person who never stops the batch that normally takes a crowd of separate users. That is where I am headed. It is not what Aggie does yet.
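To put rough numbers on the roofline argument, here is a minimal sketch, not a measurement. It assumes weight reads dominate memory traffic (KV-cache and activation traffic are ignored), about 2 FLOPs per parameter per token, and a hypothetical accelerator whose peak figures are placeholders. All names are illustrative.

```typescript
// Hypothetical hardware figures; placeholders for illustration only.
interface Accelerator {
  peakFlopsPerSec: number; // peak compute throughput
  peakBytesPerSec: number; // peak memory bandwidth
}

// Ridge point: the arithmetic intensity (FLOPs per byte) at which the
// memory roof meets the compute roof.
function ridgePoint(hw: Accelerator): number {
  return hw.peakFlopsPerSec / hw.peakBytesPerSec;
}

// Arithmetic intensity of one decode step over `batch` in-flight queries.
// Assumes ~2 FLOPs per parameter per token and that the weights are read
// once per step no matter how many queries share them.
function decodeIntensity(batch: number, bytesPerParam: number): number {
  return (2 * batch) / bytesPerParam;
}

// Smallest batch whose intensity reaches the ridge under the same assumptions.
function smallestComputeBoundBatch(hw: Accelerator, bytesPerParam: number): number {
  return Math.ceil((ridgePoint(hw) * bytesPerParam) / 2);
}

const hw: Accelerator = { peakFlopsPerSec: 400e12, peakBytesPerSec: 2e12 };
const ridge = ridgePoint(hw);
const singleStream = decodeIntensity(1, 2); // 16-bit weights, one query
const crossover = smallestComputeBoundBatch(hw, 2);

console.log({ ridge, singleStream, crossover });
```

With these placeholder figures a lone decode stream sits far below the ridge, and it takes a couple of hundred concurrent queries to reach it. Real crossover points depend on the hardware, the quantization, and the KV-cache traffic this sketch leaves out.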

What it does today

There is a single orb on the phone and in the browser. You hold it to talk and release to send. Each request becomes a separate thread that runs as an agent in its own background tab, so you can start one and move on to the next. The threads stay listed so you can keep track of them, and if you change your mind, the most recent instruction is the one that counts.
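One way to picture the thread list is a small in-memory registry in which a later instruction overwrites the earlier one. This is a sketch under assumptions: the `ThreadList` class, its fields, and the idea that "most recent wins" applies per thread are all hypothetical, not Aggie's actual data model.

```typescript
// Hypothetical thread record; field names are illustrative.
interface AgentThread {
  id: number;
  request: string;     // what the user originally asked for
  instruction: string; // the instruction the agent currently follows
}

class ThreadList {
  private threads = new Map<number, AgentThread>();
  private nextId = 1;

  // Each spoken request starts its own thread.
  start(request: string): number {
    const id = this.nextId++;
    this.threads.set(id, { id, request, instruction: request });
    return id;
  }

  // A later instruction replaces the earlier one: last write wins.
  steer(id: number, instruction: string): void {
    const thread = this.threads.get(id);
    if (!thread) throw new Error(`unknown thread ${id}`);
    thread.instruction = instruction;
  }

  // Threads stay listed in the order they were started
  // (Map preserves insertion order).
  list(): AgentThread[] {
    return Array.from(this.threads.values());
  }
}

const threads = new ThreadList();
const flights = threads.start("find flights to Lisbon");
threads.start("summarize my inbox");
threads.steer(flights, "find flights to Porto instead");
```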

You can switch the model an agent is using partway through a task without losing its work. Everything runs on your own machine. The phone or the browser reviews any action the model proposes before it runs, and the API keys stay on the gateway. The voice loop is still turn-based: you speak, it replies, you speak again.

How I built it

Three surfaces, one gateway. The Android app and the Chrome extension are thin clients: voice, screen context, approvals, local actions. Neither holds a provider key. The gateway is a Node service that routes turns and stores every session, transcript, and run in Postgres. Voice streams as PCM16 over a WebSocket through Gemini Live, with Chirp 3 as a second transcription path.
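For readers unfamiliar with PCM16, the sketch below shows the usual conversion from floating-point samples in the range -1 to 1 into signed 16-bit integers before they go onto a socket. The function name is hypothetical and this is not necessarily how the Aggie clients do it; sample rate and chunk size are left out because the source does not state them.

```typescript
// Convert float samples in [-1, 1] to signed 16-bit PCM.
// Out-of-range input is clamped so it cannot wrap around.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    // Negative values scale to -32768, positive values to 32767.
    out[i] = clamped < 0 ? Math.round(clamped * 32768) : Math.round(clamped * 32767);
  });
  return out;
}

const pcm = floatToPcm16(new Float32Array([0, 1, -1, 2, -0.5]));
```

The resulting `pcm.buffer` is an `ArrayBuffer`, which a WebSocket can send as a binary frame.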

What's next

The continuous loop. Running a query every 200 milliseconds and keeping a connection open per session fills the inference capacity within seconds, so it is only worth doing if the sessions can share their KV cache. That means running my own model, speech-to-text, and text-to-speech instead of renting them. Once that works, serving a single user becomes chiefly compute-bound.
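A back-of-the-envelope way to see how fast the load builds is Little's law: the number of queries in flight equals the arrival rate times the time each query takes. The sketch below assumes a fixed query latency, which is a placeholder figure, and the function name is hypothetical.

```typescript
// Little's law: in-flight queries = arrival rate x time in system.
// One query every `intervalMs` per session, each taking `latencyMs`.
function inFlightQueries(intervalMs: number, latencyMs: number, sessions = 1): number {
  return sessions * (latencyMs / intervalMs);
}

// Assumed 2-second latency; the 200 ms interval comes from the text above.
const perSession = inFlightQueries(200, 2000);
const threeSessions = inFlightQueries(200, 2000, 3);

console.log({ perSession, threeSessions });
```

Under that assumed latency, each session keeps about ten queries in flight, and every added session multiplies that figure. If those queries recompute the same context instead of sharing a KV cache, they do mostly duplicate work.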
