Inspiration

In this fast-moving AI space, engineers leverage agentic tools like Gemini CLI or Claude Code in their terminals every day. The terminal is powerful because it is text-based, can execute tools, and has access to local files. The IDE complements it as the visual interface to read and verify code.

Terminal → control the work. IDE → visualize the work.

But what about everyone else? For non-engineers — marketers, analysts, managers, recruiters — the "output" is visual: spreadsheets, documents, slides, emails, web pages. They have no equivalent of the terminal + IDE combo.

AGENTERM changes that. AGENTERM is the terminal. The browser is the IDE. Together, they give every knowledge worker the same 10x productivity boost that engineers get from agentic tools.


What it does

AGENTERM is a desktop AI agent that turns your entire computer into an intelligent, voice-controlled workspace. Instead of manually switching between apps, tabs, and tools, you tell AGENTERM what to do across multiple parallel conversations, and it executes.

All chats run as independent parallel agent sessions, each with its own dedicated browser tab, enabling true multitasking.


How I built it

Architecture

AGENTERM is built as an Electron desktop application with a React frontend. The AI backbone is Google ADK running an LlmAgent powered by Gemini 3.1 Flash.

The agent orchestrates 26 tools across three domains:

Browser Control — 12 tools

Via the agent-browser package + Chrome DevTools Protocol (CDP). The agent can navigate URLs, snapshot interactive elements (assigned refs like @e1, @e2 for precise targeting), click, fill forms, select dropdowns, run JavaScript, scroll, press keys, take screenshots, and wait for conditions. Each chat tab gets a dedicated browser tab for session isolation via AsyncLocalStorage.
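The ref-based targeting idea can be sketched in a few lines: number the interactive elements from a snapshot and let the agent address them by ref instead of coordinates. The element shape and function names below are illustrative, not agent-browser's actual internals.

```typescript
// Sketch: assign sequential refs ("@e1", "@e2", ...) to interactive
// elements found in a page snapshot, so the agent can say click("@e2")
// instead of guessing pixel coordinates.
interface SnapshotElement {
  role: string; // e.g. "button", "textbox"
  name: string; // accessible name shown to the agent
}

function assignRefs(elements: SnapshotElement[]): Map<string, SnapshotElement> {
  const refs = new Map<string, SnapshotElement>();
  elements.forEach((el, i) => refs.set(`@e${i + 1}`, el));
  return refs;
}

const refs = assignRefs([
  { role: "textbox", name: "Search" },
  { role: "button", name: "Submit" },
]);
console.log(refs.get("@e2")?.name); // "Submit"
```

Because refs are stable within one snapshot, the model's tool calls stay precise even on pages where element positions shift between renders.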

Computer Control — 12 tools

Via AppleScript and Python + Quartz framework. The agent can open applications, detect the frontmost app, list running apps, click at screen coordinates, type text, press keyboard shortcuts, take screenshots with Gemini 2.5 Flash vision analysis (returns descriptions + coordinates of UI elements), inspect accessibility elements, run arbitrary AppleScript, and control system volume.
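As a minimal sketch of the AppleScript path, an "open application" tool can build a one-line script and hand it to the macOS `osascript` CLI. The helper names here are my own, not AGENTERM's actual tool API, and the `runAppleScript` call only works on macOS.

```typescript
import { execFileSync } from "node:child_process";

// Sketch: build an AppleScript snippet that activates an app.
// Double quotes are escaped so app names can't break the literal.
function openAppScript(appName: string): string {
  const safe = appName.replace(/"/g, '\\"');
  return `tell application "${safe}" to activate`;
}

// Run any AppleScript via the system osascript binary (macOS only).
function runAppleScript(script: string): string {
  return execFileSync("osascript", ["-e", script], { encoding: "utf8" }).trim();
}

// Usage on macOS: runAppleScript(openAppScript("Safari"));
console.log(openAppScript("Safari")); // tell application "Safari" to activate
```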

Google Workspace — 2 meta-tools

Via @googleworkspace/cli with OAuth2. One tool executes any Google API operation across Gmail, Calendar, Drive, Sheets, Docs, Slides, Tasks, Forms, Meet, and Contacts. The other provides API schema lookups so the agent knows all available methods. Write operations auto-open results in the browser.
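The meta-tool pattern is what keeps the surface small: one function signature covers every Workspace call by treating service, method, and parameters as data. The argv layout below is an illustrative assumption, not the real @googleworkspace/cli interface.

```typescript
// Sketch: one generic tool shape for any Google Workspace API call.
interface WorkspaceCall {
  service: string;                  // e.g. "gmail", "sheets"
  method: string;                   // e.g. "users.messages.list"
  params: Record<string, string>;   // API parameters as key/value pairs
}

// Turn a call description into CLI-style arguments (hypothetical shape).
function toCliArgs(call: WorkspaceCall): string[] {
  const args = [call.service, call.method];
  for (const [k, v] of Object.entries(call.params)) {
    args.push(`--${k}=${v}`);
  }
  return args;
}

console.log(toCliArgs({
  service: "gmail",
  method: "users.messages.list",
  params: { userId: "me", maxResults: "5" },
}));
```

Paired with the schema-lookup tool, the agent can discover a method first and then issue it through this single entry point.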

Voice Input

Push-to-talk via Ctrl+[1-9] — hold to record, release to transcribe via Gemini 2.5 Flash and auto-send to the target chat tab. A fullscreen overlay with animated visual feedback shows recording state.
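Routing a transcript to the right chat is mostly a matter of parsing the accelerator. A minimal sketch, with a hypothetical helper name:

```typescript
// Sketch: map a push-to-talk accelerator like "Ctrl+3" to the chat tab
// index that should receive the transcribed text.
function targetTabFromAccelerator(accel: string): number | null {
  const m = /^Ctrl\+([1-9])$/.exec(accel);
  return m ? Number(m[1]) : null;
}

console.log(targetTabFromAccelerator("Ctrl+3")); // 3
console.log(targetTabFromAccelerator("Ctrl+0")); // null (only 1-9 are valid)
```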

Persistence

All conversations are persisted to Google Cloud Firestore using the REST API with the user's OAuth token — zero additional dependencies. Chat history is restored on app startup.
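The zero-dependency approach works because Firestore's public REST API accepts plain JSON documents in a typed "fields" format, so the only code needed is a small encoder plus `fetch`. The endpoint shape follows the public REST API; the helper name and field set are illustrative.

```typescript
// Sketch: encode a flat JS object into the Firestore REST "fields" map.
// Note Firestore's REST API represents integers as strings in JSON.
type FirestoreValue =
  | { stringValue: string }
  | { integerValue: string }
  | { booleanValue: boolean };

function toFields(doc: Record<string, string | number | boolean>) {
  const fields: Record<string, FirestoreValue> = {};
  for (const [k, v] of Object.entries(doc)) {
    if (typeof v === "string") fields[k] = { stringValue: v };
    else if (typeof v === "number") fields[k] = { integerValue: String(v) };
    else fields[k] = { booleanValue: v };
  }
  return { fields };
}

console.log(JSON.stringify(toFields({ title: "Chat", messageCount: 2 })));

// A write is then one fetch with the user's OAuth token, e.g.:
// await fetch(
//   `https://firestore.googleapis.com/v1/projects/${project}/databases/(default)/documents/chats/${id}`,
//   { method: "PATCH",
//     headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
//     body: JSON.stringify(toFields({ title: "Chat", messageCount: 2 })) },
// );
```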


Technologies Used

  • Google ADK 0.5.0 — agent framework with tool orchestration
  • Gemini 3.1 Flash (latest) — primary LLM for reasoning and tool selection
  • Gemini 2.5 Flash (latest) — vision (screenshot analysis) and audio transcription
  • Google Cloud Firestore — conversation persistence via the REST API
  • Google OAuth2 — authentication and encrypted token storage
  • Electron 38.8.0 — desktop runtime
  • React 18.2 + TailwindCSS 4.2.1 — frontend UI
  • agent-browser 0.20.7 — browser automation via CDP
  • @googleworkspace/cli 0.16.0 — Google Workspace API interface
  • Vite + vite-plugin-electron 5.1.6 — build system

Data Sources

  • User's Google Account — Gmail, Calendar, Drive, Sheets, Docs, Slides, Tasks, Forms, Meet, Contacts (via OAuth2)
  • User's Chromium browser — Any page content, interactive elements, DOM (via CDP)
  • User's macOS desktop — Screen pixels, running applications, UI elements, system state (via AppleScript + Quartz)
  • Google Cloud Firestore — Conversation storage and retrieval

Challenges and Learnings

  1. AsyncLocalStorage for multi-tab context routing — The biggest architectural challenge was ensuring tool calls route to the correct browser tab when multiple chat sessions run in parallel. Node.js AsyncLocalStorage propagates the chat tab ID through the entire ADK agent execution stack, so concurrent sessions cannot clobber each other's context through shared mutable globals.
  2. Zero-dependency Firestore client — Instead of importing the 500KB+ Firebase SDK, using raw fetch() against the Firestore REST API with the existing OAuth token kept the bundle lean and avoided Electron compatibility issues.
  3. GWS CLI as a universal meta-tool — Rather than implementing individual wrappers for each Google API, delegating to @googleworkspace/cli gives the agent access to all Google Workspace APIs through a single tool, and stays extensible without code changes as new endpoints become available.
  4. Snapshot refs vs coordinate clicking — For web automation, the agent-browser snapshot approach (assigning semantic refs to interactive elements) proved far more reliable than coordinate-based clicking. For desktop UI, however, screenshot + Gemini Vision analysis with coordinate clicking was the better approach since there's no DOM equivalent.
  5. Hybrid model architecture — Using Gemini 3.1 Flash (lightweight, fast) for the main agent reasoning loop and Gemini 2.5 Flash (vision-capable) for specialized tasks (screenshots, voice) balances speed against capability.
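The routing trick from challenge 1 can be sketched in a few lines of Node: each chat session runs inside `als.run({ tabId })`, and any tool called anywhere in that async call stack can recover the right tab without passing it as an argument. A minimal sketch; the function names are illustrative, not AGENTERM's actual code.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Per-session context: which browser tab this chat owns.
const als = new AsyncLocalStorage<{ tabId: string }>();

// Deep inside the agent's tool layer — no tabId parameter needed.
async function browserTool(): Promise<string> {
  return als.getStore()?.tabId ?? "no-tab";
}

// Each chat session wraps its whole agent run in als.run(...).
async function runChatSession(tabId: string): Promise<string> {
  return als.run({ tabId }, async () => {
    // Even after awaits interleave with other sessions,
    // this stack still sees its own tabId.
    await new Promise((r) => setTimeout(r, Math.random() * 10));
    return browserTool();
  });
}

// Two parallel sessions never leak into each other's context.
Promise.all([runChatSession("tab-1"), runChatSession("tab-2")])
  .then((ids) => console.log(ids)); // [ 'tab-1', 'tab-2' ]
```

The design choice here is that context travels with the async continuation rather than living in a module-level variable, which is exactly what breaks when sessions interleave.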
