Voice command starts autonomous execution
Gemini extended thinking plans 8 milestones
Imagen 3 generates custom hero images in real time
Trust Controls pause for approval on sensitive actions
Real-time thought signatures stream as Wispy works
Complete website built from voice command - deployed to Vercel
CLI showing Gemini integration, 27+ tools, and Marathon Mode

Wispy - Autonomous AI Agent Platform

Inspiration

We kept hitting the same wall. Every time we needed an AI to do something real (not just answer a question, but build something, research something, or execute a multi-step task), we had to babysit it the entire time. Copy-paste outputs. Re-prompt when it lost context. Start over when it failed halfway through.

We thought: what if an AI agent could take a goal like "build a REST API with auth and tests" and just go do it? Plan the steps, execute them one by one, recover when something fails, and keep going for hours or days until it's done. No hand-holding.

That's what we built with Wispy.

What it does

Wispy is an autonomous AI agent platform built entirely on the Gemini API. You give it a task, it plans the whole thing into milestones, then executes each milestone autonomously using 90+ built-in tools. We call this Marathon Mode, because the agent doesn't stop until the job is done.

Here's what actually happens when you run a marathon:

Gemini 2.5 Pro receives the goal and uses extended thinking (up to 24,576 tokens of reasoning budget) to decompose it into a detailed milestone plan
The agent executes each milestone by chaining tool calls via Gemini's native function calling: file creation, shell commands, web requests, browser automation, image generation via Imagen 3, whatever the task needs
After each milestone, Gemini verifies its own work against the requirements and decides if the milestone is actually complete
If something fails, the agent analyzes the error using Gemini's reasoning and retries with a different approach, up to 3 recovery attempts per milestone
Progress is checkpointed to disk, so you can pause and resume at any point
Real-time updates are pushed to Telegram or WhatsApp so you can monitor from your phone

It's not a chatbot. It's an agent that does the work.

Architecture

Wispy is a full platform, not a script. Here's how it's structured:

Wispy Platform Architecture

Channels (left) handle user interaction: CLI, Telegram, WhatsApp, REST API, and WebSocket all feed into the same core engine. The Core Engine (center) is where Gemini lives: the agent loop, Marathon executor, and all 90+ tools. Services (right) provide the supporting infrastructure: vector memory, scheduled tasks, browser automation, voice TTS, and the skills engine.

Security runs through every layer. Device authentication uses Ed25519 key pairs. The Action Guard enforces permission levels before any destructive tool executes. The Trust Controller manages session isolation so multi-user environments stay clean.
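
The permission check described above can be sketched as follows. This is a minimal illustration, not Wispy's actual Action Guard: the level names (`read`, `write`, `destructive`), the `GuardedTool` shape, and the `checkAction` function are all assumptions made for the example.

```typescript
// Hypothetical permission levels; the write-up does not name the real ones.
type PermissionLevel = "read" | "write" | "destructive";

const LEVEL_RANK: Record<PermissionLevel, number> = {
  read: 0,
  write: 1,
  destructive: 2,
};

interface GuardedTool {
  name: string;
  requires: PermissionLevel; // minimum level a session needs to run this tool
}

type GuardDecision = "allow" | "needs_approval";

// Allow the call only if the session's granted level covers the tool's level;
// otherwise pause and ask the user for approval.
function checkAction(tool: GuardedTool, granted: PermissionLevel): GuardDecision {
  return LEVEL_RANK[granted] >= LEVEL_RANK[tool.requires] ? "allow" : "needs_approval";
}

// Illustrative tools (names are made up for this sketch).
const deleteFiles: GuardedTool = { name: "delete_files", requires: "destructive" };
const readFile: GuardedTool = { name: "read_file", requires: "read" };
```

Ranking the levels numerically keeps the check to a single comparison, so a guard like this can run before every tool call without adding noticeable overhead.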

The public REST API exposes 16+ endpoints with API key management, rate limiting, and SSE streaming, so you can integrate Wispy into any application.

How Gemini Powers Everything

Wispy is not a thin wrapper around Gemini. Every part of the system is built directly on the Gemini API through the @google/genai SDK. Here is exactly how each Gemini capability is used:

How Gemini Powers Wispy

Gemini API Feature	How Wispy Uses It
Gemini 2.5 Pro	The primary model powering agent reasoning, planning, and execution. Conversations and marathon tasks run through Gemini 2.5 Pro, with lightweight operations routed to Gemini 2.5 Flash.
Extended Thinking (Thinking Budgets)	Marathon planning uses configurable thinking budgets: `low` (128 tokens), `medium` (1,024), `high` (8,192), and `ultra` (24,576). The agent reasons deeply before acting on complex goals.
Native Function Calling	90+ tools registered as Gemini function declarations with full JSON schemas. The model decides which tools to call, in what order, with what parameters. Multi-tool chaining per turn. This is the core execution loop.
Streaming (generateContentStream)	Real-time token streaming for interactive chat across CLI, Telegram, WhatsApp, and SSE endpoints. Users see the agent thinking and responding in real time.
text-embedding-004	Vector memory system. Every conversation turn is embedded using `text-embedding-004` and stored in SQLite for semantic search and long-term recall across sessions.
Imagen 3 (imagen-3.0-generate-002)	Image generation as a native tool. The agent can create images as part of any task: diagrams, mockups, visual assets, all generated through the Gemini API.
Gemini 2.5 Flash	Fast model for lightweight operations: token estimation, routing decisions, content classification, and quick tasks where Pro would be overkill. Used for intelligent model routing.
Structured Output (JSON Mode)	Tool result parsing and structured data extraction use Gemini's JSON output mode to ensure clean, parseable responses for the agent loop.
System Instructions	Every agent session uses Gemini system instructions to define the agent's personality, capabilities, tool usage rules, and safety boundaries. The skills engine dynamically injects skill-specific system instructions.
Multi-turn Conversations	Full conversation history management with Gemini's multi-turn chat API. Context compaction summarizes older turns while preserving recent context for long-running marathons.
Safety Settings	Configurable Gemini safety settings per session. The agent respects content safety while still being able to execute code and system commands within the trust boundary.
Thought Signatures	Reasoning continuity across sessions. When the agent resumes a multi-day marathon, it picks up its chain of thought from where it left off.
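
The four thinking budget levels from the table can be expressed as a simple lookup. The token values come from the table; the `ThinkingLevel` type and `thinkingConfigFor` helper are hypothetical names, and the returned object is assumed to match the `thinkingConfig.thinkingBudget` option that the @google/genai SDK accepts in a request config.

```typescript
type ThinkingLevel = "low" | "medium" | "high" | "ultra";

// Token budgets as listed in the table above.
const THINKING_BUDGETS: Record<ThinkingLevel, number> = {
  low: 128,
  medium: 1024,
  high: 8192,
  ultra: 24576,
};

// Builds a config fragment that could be merged into a generateContent request.
function thinkingConfigFor(level: ThinkingLevel): { thinkingConfig: { thinkingBudget: number } } {
  return { thinkingConfig: { thinkingBudget: THINKING_BUDGETS[level] } };
}
```

A marathon planner would presumably request `thinkingConfigFor("ultra")`, while quick chat turns could stay at `low` or `medium` to save latency and tokens.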

Every API call goes through the @google/genai SDK. We use the GoogleGenAI client for standard generation, generateContentStream for streaming, the embeddings API for memory, and the image generation API for Imagen 3. There is no middleware model. Gemini is the brain.

Marathon Mode - Deep Dive

This is the core innovation. Marathon Mode turns Gemini from a conversational model into an autonomous executor.

Marathon Mode Flow

Step 1 - Goal Submission: The user provides a natural language goal. Could be simple ("set up a Node.js project") or complex ("build a full-stack dashboard with auth, database, API, and deploy to Vercel").

Step 2 - Deep Planning: This is where the thinking budget matters. Gemini 2.5 Pro receives the goal with an extended thinking budget of up to 24,576 tokens. It doesn't just list steps; it reasons through dependencies, potential failure points, and verification criteria for each milestone. The quality of this plan directly determines whether the marathon succeeds or fails.

Step 3 - Milestone Execution: The agent chains tool calls to complete each milestone. A single milestone might involve creating files, running shell commands, making HTTP requests, automating a browser, and generating images with Imagen 3, all in sequence, all decided by Gemini's function calling.

Step 4 - Self-Verification: After execution, Gemini evaluates its own work against the milestone requirements. This isn't a rubber stamp. The model is prompted to find gaps, missing edge cases, and incomplete implementations.

Step 5 - Pass or Recovery: If verification passes, the agent checkpoints and moves to the next milestone. If it fails, the agent enters self-recovery: it analyzes the error, reasons about alternative approaches, and retries up to 3 times with different strategies.

Step 6 - Checkpoint: Progress is saved to disk after every milestone. The marathon can be paused, resumed, or inspected at any point. If the process crashes, it picks up from the last checkpoint.

Step 7 - Completion: When all milestones are done, the agent notifies you via Telegram or WhatsApp with a summary of what was accomplished.
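
Steps 3 through 6 can be sketched as one loop. This is a simplified illustration under stated assumptions, not Wispy's executor: `MarathonHooks`, `Checkpoint`, and `runMarathon` are hypothetical names, and the hooks stand in for the Gemini calls and disk writes. The sketch interprets "up to 3 recovery attempts" as one initial attempt plus three retries.

```typescript
interface Milestone { id: number; description: string; }
interface Checkpoint { goal: string; completed: number[]; }

// Hypothetical hooks: in a real system these would wrap Gemini calls and disk I/O.
interface MarathonHooks {
  execute(m: Milestone, attempt: number): Promise<string>;
  verify(m: Milestone, output: string): Promise<boolean>;
  saveCheckpoint(cp: Checkpoint): Promise<void>;
}

const MAX_RECOVERY_ATTEMPTS = 3;

async function runMarathon(
  goal: string,
  plan: Milestone[],
  hooks: MarathonHooks,
  resumeFrom?: Checkpoint,
): Promise<Checkpoint> {
  // Copy the completed list so the caller's checkpoint object is not mutated.
  const cp: Checkpoint = { goal, completed: [...(resumeFrom?.completed ?? [])] };
  for (const m of plan) {
    if (cp.completed.indexOf(m.id) !== -1) continue; // finished before a pause or crash
    let passed = false;
    // One initial attempt plus up to MAX_RECOVERY_ATTEMPTS retries.
    for (let attempt = 0; attempt <= MAX_RECOVERY_ATTEMPTS && !passed; attempt++) {
      try {
        const output = await hooks.execute(m, attempt);
        passed = await hooks.verify(m, output);
      } catch {
        passed = false; // an execution error counts as a failed attempt
      }
    }
    if (!passed) throw new Error(`Milestone ${m.id} failed after recovery attempts`);
    cp.completed.push(m.id);
    await hooks.saveCheckpoint(cp); // checkpoint after every milestone
  }
  return cp;
}
```

Passing the attempt number into `execute` gives the hook a place to switch strategy on retries, and passing a saved `Checkpoint` as `resumeFrom` skips milestones that already completed.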

90+ Built-in Tools

Every tool is registered as a Gemini function declaration with a full JSON schema. The model sees all tools and decides which ones to use based on the task.

Tools by Category

The tools span 12 categories:

File & Code (14 tools): Read, write, edit, search, glob, create directories, manage projects
Shell & System (8 tools): Execute commands, manage processes, environment variables, system info
Web & HTTP (10 tools): Fetch URLs, parse HTML, make API calls, download files, web scraping
Browser Automation (9 tools): Navigate, click, type, screenshot, extract data via Playwright CDP
Memory & Search (6 tools): Store, recall, search conversations with semantic similarity via text-embedding-004
Documents (8 tools): Generate PDFs, DOCX, parse documents, create charts and reports
Image Generation (4 tools): Create images via Imagen 3, process and manipulate images
Voice & TTS (5 tools): Text-to-speech via Google Cloud, voice transcription, audio processing
Scheduling & Cron (6 tools): Set reminders, schedule recurring tasks, manage cron jobs
Communication (7 tools): Send messages via Telegram, WhatsApp, manage notifications
Wallet & Payments (5 tools): Blockchain interactions, wallet management
MCP Extensions (12 tools): Model Context Protocol servers for extending capabilities

All 90+ tools support multi-tool chaining. Gemini can call multiple tools per turn, passing outputs between them. The tool executor handles parameter validation, error capture, and result formatting before sending back to Gemini.
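
The executor's three responsibilities (parameter validation, error capture, result formatting) might look roughly like this. The `ToolDefinition` and `ToolResult` shapes are assumptions for illustration, validation is reduced to a required-keys check, and tools run synchronously here for brevity.

```typescript
type ToolArgs = Record<string, unknown>;

interface ToolDefinition {
  name: string;
  required: string[];           // parameter names that must be present
  run(args: ToolArgs): unknown; // the tool's actual work (synchronous in this sketch)
}

interface ToolResult {
  name: string;
  ok: boolean;
  content: string; // serialized result or error message returned to the model
}

function executeTool(tools: Map<string, ToolDefinition>, name: string, args: ToolArgs): ToolResult {
  const tool = tools.get(name);
  if (!tool) return { name, ok: false, content: `Unknown tool: ${name}` };

  // Parameter validation: reject the call before running anything.
  const missing = tool.required.filter((key) => !(key in args));
  if (missing.length > 0) {
    return { name, ok: false, content: `Missing parameters: ${missing.join(", ")}` };
  }

  // Error capture: a thrown error becomes a result the model can reason about.
  try {
    return { name, ok: true, content: JSON.stringify(tool.run(args) ?? null) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { name, ok: false, content: `Tool failed: ${message}` };
  }
}
```

Returning failures as ordinary results rather than throwing keeps the conversation going, which gives the model a chance to correct its arguments or pick a different tool.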

How we built it

Development environment: We used Google Antigravity as our primary AI-assisted coding environment for building and iterating on the platform. Antigravity's tight integration with Gemini made it natural to develop an agent that runs on Gemini. We also prototyped and tested our Gemini API calls directly in Google AI Studio before integrating them into the codebase.

TypeScript end to end. The entire platform is ~15,000 lines of TypeScript with ES modules, running on Node.js 20+.

The core agent loop:

Gemini receives a message plus the 90+ tool declarations as function schemas
The model responds with text or tool calls (via native function calling)
If tool calls: execute, capture results, send back to Gemini
Loop until the model decides it's done
For Marathon Mode: this loop runs per-milestone, with verification and recovery wrapping each cycle
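
The loop above can be sketched with the model and tool executor injected as plain functions. `Model`, `RunTool`, and `agentLoop` are hypothetical names, not Wispy's API, and the `maxTurns` cap is an added safety assumption that the write-up does not mention.

```typescript
interface ToolCall { name: string; args: Record<string, unknown>; }
type ModelReply = { text: string } | { toolCalls: ToolCall[] };
type HistoryEntry =
  | { role: "user" | "model"; text: string }
  | { role: "tool"; name: string; result: string };

// Hypothetical stand-ins for the Gemini call and the tool executor.
type Model = (history: HistoryEntry[]) => Promise<ModelReply>;
type RunTool = (call: ToolCall) => Promise<string>;

async function agentLoop(
  message: string,
  model: Model,
  runTool: RunTool,
  maxTurns = 20, // assumed safety cap so a confused model cannot loop forever
): Promise<string> {
  const history: HistoryEntry[] = [{ role: "user", text: message }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await model(history);
    if ("text" in reply) return reply.text; // the model decided it is done
    for (const call of reply.toolCalls) {
      // Execute, capture the result, and send it back on the next turn.
      history.push({ role: "tool", name: call.name, result: await runTool(call) });
    }
  }
  throw new Error(`No final answer after ${maxTurns} turns`);
}
```

Because the model is just a function of the history, the same loop can be exercised in tests with a scripted fake before wiring in real API calls.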

Key technical decisions:

@google/genai SDK for all Gemini API calls: chat, streaming, function calling, embeddings, image generation
Google AI Studio for API key management, prompt prototyping, and testing Gemini configurations before deploying to production
SQLite + text-embedding-004 for the memory system: every conversation turn is embedded and stored for semantic recall
Grammy for Telegram: full inline keyboards, marathon controls, voice transcription
Baileys for WhatsApp: QR code auth, message handling, progress notifications
Express for the public REST API: 16+ endpoints with API key auth, rate limiting, SSE streaming
Playwright for browser automation: CDP connection, full page interaction
Commander.js for the CLI: 19 commands, interactive REPL with live markdown rendering
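
For the SQLite + text-embedding-004 memory system listed above, semantic recall comes down to ranking stored vectors against a query vector. The sketch below assumes embeddings have already been loaded from the database as number arrays and uses cosine similarity, which is a common choice but not confirmed by the write-up. `MemoryRow` and `recall` are illustrative names.

```typescript
interface MemoryRow { text: string; embedding: number[]; }

// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns the topK rows most similar to the query embedding, best match first.
function recall(query: number[], rows: MemoryRow[], topK: number): MemoryRow[] {
  return rows
    .map((row) => ({ row, score: cosineSimilarity(query, row.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((scored) => scored.row);
}
```

A linear scan like this is fine for a personal agent's history; a much larger store would likely need an index instead.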

Context window management was our hardest engineering problem. Long-running marathons generate massive amounts of tool output. We built a compaction system that summarizes older turns (using Gemini 2.5 Flash for fast summarization) while preserving recent context, so the agent doesn't lose track of what it's doing mid-marathon.
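
A compaction pass of this kind can be sketched as "summarize everything except the last N turns". The `Turn` shape, the `keepRecent` parameter, and the summary prefix are assumptions; the `Summarizer` is a hypothetical stand-in for the Gemini 2.5 Flash call.

```typescript
interface Turn { role: "user" | "model" | "tool"; text: string; }

// Hypothetical summarizer; the write-up uses Gemini 2.5 Flash for this role.
type Summarizer = (turns: Turn[]) => Promise<string>;

async function compactHistory(
  turns: Turn[],
  keepRecent: number,
  summarize: Summarizer,
): Promise<Turn[]> {
  if (turns.length <= keepRecent) return turns; // nothing old enough to compact
  const cut = turns.length - keepRecent;
  const older = turns.slice(0, cut);
  const recent = turns.slice(cut);
  const summary = await summarize(older);
  // Replace all older turns with a single summary turn, keep recent ones verbatim.
  return [{ role: "model", text: `Summary of earlier work: ${summary}` }, ...recent];
}
```

When to trigger compaction is a separate decision; one plausible policy is to run it whenever the estimated token count crosses a fraction of the context window.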

Schema cleaning was another challenge. Gemini's function calling is strict about JSON schemas. We built a cleaning layer that strips unsupported fields (such as additionalProperties and nested $ref) and normalizes all 90+ tool declarations so Gemini processes them correctly every time.
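
A recursive pass that drops unsupported keys is one way such a cleaning layer could work. This is a sketch, not Wispy's code: the key list contains only the two fields named above, and `cleanSchema` is a hypothetical name.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Keys assumed to be rejected; the write-up names additionalProperties and $ref.
const UNSUPPORTED_KEYS = ["additionalProperties", "$ref"];

// Returns a deep copy of the schema with unsupported keys removed at every level.
function cleanSchema(node: Json): Json {
  if (Array.isArray(node)) return node.map((item) => cleanSchema(item));
  if (node === null || typeof node !== "object") return node;
  const cleaned: { [key: string]: Json } = {};
  for (const key of Object.keys(node)) {
    if (UNSUPPORTED_KEYS.indexOf(key) !== -1) continue;
    const value = node[key];
    if (value !== undefined) cleaned[key] = cleanSchema(value);
  }
  return cleaned;
}
```

Note that this simple version would also strip a tool parameter that happens to be named `additionalProperties`, so a production cleaner would need to distinguish schema keywords from property names.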

Challenges we ran into

Self-verification honesty. Getting the agent to honestly assess whether a milestone is complete, rather than just saying "done" and moving on, took significant iteration. We use Gemini's system instructions to prompt the model specifically to check output against requirements, surface gaps, and flag incomplete work. The verification prompt is as important as the execution prompt.

Rate limit recovery. Gemini API rate limits hit hard during marathon execution when the agent is making rapid tool calls. We implemented exponential backoff with jitter that retries transparently, so the agent doesn't lose progress when throttled.
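
Exponential backoff with jitter can be sketched as below. The option names, the "full jitter" variant (a random wait between 0 and the capped exponential delay), and the `isRetryable` predicate are assumptions; the write-up does not say which variant or limits Wispy uses.

```typescript
interface BackoffOptions {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number;                 // injectable for tests
}

// Full jitter: wait a random time between 0 and the capped exponential delay.
function backoffDelay(attempt: number, opts: BackoffOptions): number {
  const capped = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** attempt);
  return Math.floor((opts.random ?? Math.random)() * capped);
}

async function withBackoff<T>(
  call: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  opts: BackoffOptions,
): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Give up on non-retryable errors or once the retry budget is spent.
      if (attempt >= opts.maxRetries || !isRetryable(err)) throw err;
      await sleep(backoffDelay(attempt, opts));
    }
  }
}
```

The jitter matters when several milestones or sessions are throttled at once: randomized waits keep them from retrying in lockstep and hitting the limit again together.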

Multi-channel session isolation. The same agent instance serves CLI, Telegram, WhatsApp, and API users simultaneously. Session isolation ensures one user's context doesn't leak into another's, while the underlying memory and tool infrastructure is shared.
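
One simple way to get this kind of isolation is to key conversation state by channel and user. The sketch below assumes that keying scheme and an in-memory store; `SessionStore` and its key format are illustrative, and shared state such as marathon progress would live outside these per-session objects.

```typescript
type Channel = "cli" | "telegram" | "whatsapp" | "api";

interface Session { key: string; history: string[]; }

class SessionStore {
  private sessions = new Map<string, Session>();

  // One session per (channel, user) pair; the key format is an assumption.
  get(channel: Channel, userId: string): Session {
    const key = `${channel}:${userId}`;
    let session = this.sessions.get(key);
    if (!session) {
      session = { key, history: [] };
      this.sessions.set(key, session);
    }
    return session;
  }
}
```

Each request looks up its own session before touching history, so two users on the same channel never see each other's context even though they share one agent instance.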

Token estimation accuracy. Predicting token usage before sending requests to Gemini is critical for staying within context limits. We use Gemini 2.5 Flash for fast token estimation, but the gap between estimated and actual tokens required calibration.

Accomplishments we're proud of

Marathon Mode actually works. Not as a demo. It completes real multi-step projects autonomously. We've run marathons that span 20+ milestones over hours of execution.
90+ tools with native function calling. Every tool is a first-class Gemini function declaration. The model understands them and chains them correctly.
Multi-channel from day one. The same agent works identically across CLI, Telegram, WhatsApp, and REST API. You can start a marathon from Telegram and check progress from the CLI.
Published and installable. `npm install -g wispy-ai` works right now. Setup takes under 2 minutes with a free Gemini API key from Google AI Studio.

What we learned

The biggest insight: the thinking budget matters more than the model size. Giving Gemini 2.5 Pro a 24,576-token thinking budget for planning produces dramatically better milestone decompositions than a small budget. The quality of the plan directly determines whether the marathon succeeds or fails.

We also learned that verification is more important than execution. An agent that executes fast but doesn't check its work produces garbage. An agent that verifies every step, even if slower, actually finishes the job.

And: function calling schemas need to be clean. Gemini is strict about what it accepts. One malformed schema among the 90+ tools breaks the entire tool set. Our schema cleaning layer was unglamorous but essential.

What's next for Wispy

Parallel milestone execution for independent task branches
Team collaboration so multiple users can monitor and steer the same marathon
Cost analytics dashboard to track token usage and API costs per marathon
Plugin system for custom tool sets beyond the built-in 90+
Agent-to-Agent (A2A) protocol for Wispy agents delegating tasks to each other
Gemini context caching to reduce costs on long-running marathons with repeated tool schemas

Built With

Google Gemini 2.5 Pro, Google Gemini 2.5 Flash, Imagen 3, text-embedding-004, @google/genai SDK, Google AI Studio, Google Antigravity, TypeScript, Node.js, Express, SQLite, Grammy, Baileys, Playwright, Commander.js, WebSocket, SSE, Ed25519, MCP