Inspiration
In this fast-moving AI space, engineers leverage agentic tools like Gemini CLI or Claude Code in their terminals every day. The terminal is powerful because it is text-based, can execute tools, and has access to local files. The IDE complements it as the visual interface to read and verify code.
Terminal → control the work. IDE → visualize the work.
But what about everyone else? For non-engineers — marketers, analysts, managers, recruiters — the "output" is visual: spreadsheets, documents, slides, emails, web pages. They have no equivalent of the terminal + IDE combo.
AGENTERM changes that. AGENTERM is the terminal. The browser is the IDE. Together, they give every knowledge worker the same 10x productivity boost that engineers get from agentic tools.
What it does
AGENTERM is a desktop AI agent that turns your entire computer into an intelligent, voice-controlled workspace. Instead of manually switching between apps, tabs, and tools, you tell AGENTERM what to do across multiple parallel conversations, and it executes.
All chats run as independent, parallel agent sessions, each with a dedicated browser tab — true multitasking.
How I built it
Architecture
AGENTERM is built as an Electron desktop application with a React frontend. The AI backbone is Google ADK running an LlmAgent powered by Gemini 3.1 Flash.
The agent orchestrates 26 tools across three domains:
Browser Control — 12 tools
Via the agent-browser package + Chrome DevTools Protocol (CDP). The agent can navigate URLs, snapshot interactive elements (each assigned a ref like @e1, @e2 for precise targeting), click, fill forms, select dropdowns, run JavaScript, scroll, press keys, take screenshots, and wait for conditions. Each chat tab gets a dedicated browser tab, with session isolation via AsyncLocalStorage.
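The snapshot-ref idea can be illustrated in a few lines of TypeScript: each interactive element gets a stable @eN handle the model can target by name instead of by pixel coordinates. The types and function names below are hypothetical, not the actual agent-browser API:

```typescript
// Sketch of snapshot refs: assign stable @eN handles to interactive
// elements so the model can say `click @e2` instead of guessing pixels.
// (Hypothetical shapes, not the real agent-browser API.)
interface InteractiveElement {
  role: string; // e.g. "button", "textbox"
  name: string; // accessible label
}

function snapshot(elements: InteractiveElement[]): Map<string, InteractiveElement> {
  const refs = new Map<string, InteractiveElement>();
  elements.forEach((el, i) => refs.set(`@e${i + 1}`, el));
  return refs;
}

// Render the ref table the way it would be shown to the LLM.
function describe(refs: Map<string, InteractiveElement>): string {
  return Array.from(refs.entries())
    .map(([ref, el]) => `${ref} ${el.role} "${el.name}"`)
    .join("\n");
}
```

Because refs are re-assigned on every snapshot, the model always works against the current state of the page rather than stale coordinates.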
Computer Control — 12 tools
Via AppleScript and Python + Quartz framework. The agent can open applications, detect the frontmost app, list running apps, click at screen coordinates, type text, press keyboard shortcuts, take screenshots with Gemini 2.5 Flash vision analysis (returns descriptions + coordinates of UI elements), inspect accessibility elements, run arbitrary AppleScript, and control system volume.
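As an illustration, running AppleScript from Node typically shells out to the macOS osascript binary. A minimal sketch under that assumption (helper names are hypothetical, not the actual tool layer):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Build the osascript invocation for a one-line AppleScript.
// (Hypothetical helper; the real tool layer is more elaborate.)
function osascriptArgs(script: string): [string, string[]] {
  return ["osascript", ["-e", script]];
}

// Execute the script and return its trimmed stdout (macOS only).
async function runAppleScript(script: string): Promise<string> {
  const [cmd, args] = osascriptArgs(script);
  const { stdout } = await execFileAsync(cmd, args);
  return stdout.trim();
}

// e.g. runAppleScript('tell application "Safari" to activate')
```

Passing the script via `-e` and `execFile` (rather than a shell string) avoids quoting and injection issues when the agent generates scripts dynamically.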
Google Workspace — 2 meta-tools
Via @googleworkspace/cli with OAuth2. One tool executes any Google API operation across Gmail, Calendar, Drive, Sheets, Docs, Slides, Tasks, Forms, Meet, and Contacts. The other provides API schema lookups so the agent knows all available methods. Write operations auto-open results in the browser.
Voice Input
Push-to-talk via Ctrl+[1-9] — hold to record, release to transcribe via Gemini 2.5 Flash and auto-send to the target chat tab. A fullscreen overlay with animated visual feedback shows recording state.
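The hold-to-record flow can be modeled as a small state machine: keydown on Ctrl+[1-9] starts recording for that tab, keyup stops, transcribes, and routes the text. A sketch with hypothetical names (the real app wires this to global key hooks and Gemini 2.5 Flash):

```typescript
// Hold-to-record push-to-talk modeled as a tiny state machine.
// (Hypothetical names; transcription is injected as a callback.)
type Transcribe = (audio: Uint8Array) => Promise<string>;

class PushToTalk {
  private recordingTab: number | null = null;

  constructor(private transcribe: Transcribe) {}

  // Ctrl + digit pressed: start recording for chat tab 1-9.
  keyDown(digit: number): void {
    if (digit >= 1 && digit <= 9 && this.recordingTab === null) {
      this.recordingTab = digit;
    }
  }

  // Key released: stop, transcribe, and report the target tab.
  async keyUp(audio: Uint8Array): Promise<{ tab: number; text: string } | null> {
    if (this.recordingTab === null) return null;
    const tab = this.recordingTab;
    this.recordingTab = null;
    return { tab, text: await this.transcribe(audio) };
  }
}
```

Keeping the state machine pure like this makes the routing logic testable without a microphone or an Electron window.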
Persistence
All conversations are persisted to Google Cloud Firestore using the REST API with the user's OAuth token — zero additional dependencies. Chat history is restored on app startup.
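Firestore's REST API expects document fields in a typed wire format ({ stringValue: ... }, { integerValue: ... }, ...), so a small encoder plus a raw fetch() is all a zero-dependency client needs. A sketch under that assumption (helper names hypothetical; numbers treated as integers):

```typescript
// Subset of Firestore's REST wire format for document fields.
type FirestoreValue =
  | { stringValue: string }
  | { integerValue: string } // integers are strings on the wire
  | { booleanValue: boolean }
  | { arrayValue: { values: FirestoreValue[] } };

// Encode a plain JS value into the Firestore wire format.
// (Numbers assumed integral here; real data needs doubleValue too.)
function encode(v: string | number | boolean | unknown[]): FirestoreValue {
  if (typeof v === "string") return { stringValue: v };
  if (typeof v === "number") return { integerValue: String(v) };
  if (typeof v === "boolean") return { booleanValue: v };
  return { arrayValue: { values: (v as any[]).map((x) => encode(x)) } };
}

// Hypothetical save helper: PATCH creates or updates a document,
// authenticated with the user's existing OAuth access token.
async function saveChat(project: string, token: string, chatId: string,
                        fields: Record<string, FirestoreValue>) {
  const url = `https://firestore.googleapis.com/v1/projects/${project}` +
              `/databases/(default)/documents/chats/${chatId}`;
  return fetch(url, {
    method: "PATCH",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
    body: JSON.stringify({ fields }),
  });
}
```

Since Node 18 ships a global fetch(), nothing beyond the platform is required, which is what keeps the Electron bundle lean.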
Technologies Used
| Technology | Role | Version |
|---|---|---|
| Google ADK | Agent framework with tool orchestration | 0.5.0 |
| Gemini 3.1 Flash | Primary LLM — reasoning and tool selection | latest |
| Gemini 2.5 Flash | Vision (screenshot analysis) + audio transcription | latest |
| Google Cloud Firestore | Conversation persistence (REST API) | — |
| Google OAuth2 | Authentication + encrypted token storage | — |
| Electron | Desktop runtime | 38.8.0 |
| React + TailwindCSS v4 | Frontend UI | 18.2 / 4.2.1 |
| agent-browser | Browser automation via CDP | 0.20.7 |
| @googleworkspace/cli | Google Workspace API interface | 0.16.0 |
| Vite + vite-plugin-electron | Build system | 5.1.6 |
Data Sources
- User's Google Account — Gmail, Calendar, Drive, Sheets, Docs, Slides, Tasks, Forms, Meet, Contacts (via OAuth2)
- User's Chromium browser — Any page content, interactive elements, DOM (via CDP)
- User's macOS desktop — Screen pixels, running applications, UI elements, system state (via AppleScript + Quartz)
- Google Cloud Firestore — Conversation storage and retrieval
Challenges and Learnings
- AsyncLocalStorage for multi-tab context routing: the biggest architectural challenge was ensuring tool calls route to the correct browser tab when multiple chat sessions run in parallel. Node.js AsyncLocalStorage propagates the chat tab ID through the entire ADK agent execution stack without thread-unsafe globals.
- Zero-dependency Firestore client: instead of importing the 500KB+ Firebase SDK, using raw fetch() against the Firestore REST API with the existing OAuth token kept the bundle lean and avoided Electron compatibility issues.
- GWS CLI as a universal meta-tool: rather than implementing individual wrappers for each Google API, delegating to @googleworkspace/cli gives the agent access to every Google Workspace API through a single tool, making it extensible without code changes.
- Snapshot refs vs. coordinate clicking: for web automation, the agent-browser snapshot approach (assigning semantic refs to interactive elements) proved far more reliable than coordinate-based clicking. For desktop UI, however, screenshots plus Gemini Vision analysis with coordinate clicking worked better, since there is no DOM equivalent.
- Hybrid model architecture: using Gemini 3.1 Flash (lightweight, fast) for the main agent reasoning loop and Gemini 2.5 Flash (vision-capable) for specialized tasks (screenshots, voice) optimizes for both speed and capability.
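The multi-tab routing described above can be sketched in plain Node.js: each chat session runs inside an AsyncLocalStorage context, so a tool can look up "its" tab without the ID being threaded through every call. Tool and helper names here are hypothetical:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// One context per chat session; the store carries the tab ID.
const tabContext = new AsyncLocalStorage<{ tabId: string }>();

function currentTabId(): string {
  const store = tabContext.getStore();
  if (!store) throw new Error("tool call outside a chat session");
  return store.tabId;
}

// Hypothetical tool: reads the tab ID implicitly from the context
// instead of receiving it as an argument.
async function navigate(url: string): Promise<string> {
  return `tab ${currentTabId()} -> ${url}`;
}

// Run some agent work bound to a specific chat tab.
function runSession(tabId: string, work: () => Promise<string>): Promise<string> {
  return tabContext.run({ tabId }, work);
}
```

Because the context follows the async call chain, two sessions can run concurrently and their tool calls never cross into each other's browser tab.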
Built With
- adk
- agent-browser
- electron
- firestore
- googleworkspace/cli
- react
- tailwind
- typescript
- vite