Inspiration

We kept watching browser automation agents make the same mistakes over and over. Click the wrong button. Get stuck in a loop. Burn through 40 LLM calls to do what a human does in 5 clicks. Every single run started from scratch, learning nothing from the last.

That frustrated us. Humans don't work like that. When you check out on Amazon, you don't re-learn the flow every time. You remember: login, cart, checkout, pay. Your second time is faster than your first.

We asked: what if a browser agent could do the same? What if it could remember what worked, avoid what didn't, and actually get better over time — with proof?

That's LoopLess. A browser agent that learns from every run, caches successful action sequences, and uses Gemini as both the brain and the judge to continuously self-improve.

What it does

LoopLess is a self-improving browser automation agent with a full-stack demo UI. It runs multi-step web tasks (e-commerce checkout, calendar management, email composition, hotel booking) and gets measurably better with each attempt.

The core loop:

  1. Cold Run — The agent tackles a task with zero prior knowledge. Gemini 3.0 Flash Preview plans every step from scratch. Every action, observation, and decision is logged.

  2. Learn — Successful action sequences are cached as "macros" in Redis, keyed by domain + intent + URL path + action history hash. The agent remembers what worked and where.

  3. Warm Run — On the same task, the agent checks Redis first. If a cached macro exists for the current context, it skips the LLM entirely and executes the proven action. Gemini validates each cached macro before use to prevent stale actions.

  4. Self-Improve — After every run, an LLM-as-a-Judge system (powered by Gemini) evaluates performance across 5 dimensions: task success, efficiency, loop avoidance, cache utilization, and action correctness. Failure patterns are analyzed, and the agent's system prompt is dynamically rewritten with learned rules for the next attempt.

  5. Auto-Improve Mode — A fully autonomous loop: run → evaluate → learn from failure → retry with improved prompts → repeat until success or max attempts. The agent literally rewrites its own instructions based on what went wrong.
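The cold/warm decision in steps 1-3 can be sketched as follows. This is a minimal illustration with hypothetical names: an in-memory `Map` stands in for Redis, and `planWithGemini` stands in for the real planning call (the actual system also validates cached macros before reuse).

```typescript
type Action = { kind: string; target: string };

const macroCache = new Map<string, Action>(); // stand-in for Redis

function cacheKey(domain: string, intent: string, urlPath: string, historyHash: string): string {
  return `${domain}:${intent}:${urlPath}:${historyHash}`;
}

function planWithGemini(pageState: string): Action {
  // Placeholder for the real LLM planning call.
  return { kind: "click", target: "inferred-from-" + pageState };
}

let llmCalls = 0;

function nextAction(key: string, pageState: string): Action {
  const cached = macroCache.get(key);
  if (cached) return cached;                 // warm run: skip the LLM entirely
  llmCalls++;
  const action = planWithGemini(pageState);  // cold run: plan from scratch
  macroCache.set(key, action);               // learn: cache the action that worked
  return action;
}
```

On the second run with the same key, `nextAction` returns the cached macro without touching the LLM, which is where the call-count savings come from.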

The result: warm runs consistently use 30-100% fewer LLM calls (a fully cached run needs zero), complete in fewer steps, and avoid the loops that plagued cold runs. Every improvement is traceable through W&B Weave.

The benchmark suite covers 18 tasks across 5 domains: SauceDemo (e-commerce), GoCalendar, GoMail, MarriSuite (hotel booking), and NetworkIn (professional networking).

How we built it

Gemini 3.0 Flash Preview — Google's newest model released just weeks ago — is the backbone of the entire system. It powers every intelligent decision:

  • Action Planning: Gemini receives the current page state (URL, actionable elements, form labels) and action history, then outputs the single best next action. We use the Google Generative AI SDK with systemInstruction for structured prompting.

  • LLM-as-a-Judge: After every run, Gemini evaluates the agent's performance with structured verdicts (PASS/FAIL, 0-1 scores, confidence levels). It reviews action sequences for correctness, detects wrong ordering, and generates specific improvement suggestions.

  • Real-time Action Validation: Before using a cached macro, Gemini validates it against the current page context. This prevents stale or incorrect cached actions from derailing a run.

  • Self-Improving Prompts: Gemini analyzes failure patterns and rewrites the agent's system prompt, injecting learned rules like "After adding items, MUST click cart icon before checkout" based on actual past failures.

  • Stagehand Integration: Gemini 3.0 Flash Preview powers the Stagehand browser automation framework for DOM observation and action execution through BrowserBase cloud browsers.
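To make the planning step concrete, here is a sketch of how a planning request might be assembled. The field layout and instruction text are illustrative, not the real prompts; the comment shows where the `@google/generative-ai` SDK's `systemInstruction` option would receive the result.

```typescript
interface PageState {
  url: string;
  elements: string[]; // actionable element labels from DOM observation
}

const SYSTEM_INSTRUCTION =
  "You are a browser automation planner. Output exactly one next action.";

function planRequest(state: PageState, history: string[]) {
  return {
    // Passed as systemInstruction to getGenerativeModel({ model, systemInstruction })
    systemInstruction: SYSTEM_INSTRUCTION,
    // Passed to model.generateContent(prompt)
    prompt: [
      `URL: ${state.url}`,
      `Actionable elements: ${state.elements.join(", ")}`,
      `Actions so far: ${history.join(" -> ") || "(none)"}`,
      "Respond with the single best next action.",
    ].join("\n"),
  };
}
```

Keeping the system instruction separate from the per-step prompt is what lets the same model run in distinct "planner" and "judge" modes later on.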

Full tech stack:

  • Backend: Node.js + Express + TypeScript, running Gemini 3.0 Flash Preview via @google/generative-ai SDK
  • Browser Automation: Stagehand (DOM-first, Gemini-powered) + BrowserBase (cloud browser sessions with live view and recordings)
  • Observability: W&B Weave for full trace logging — every runTask, planStep, executeAction, validateProgress, and learnMacro operation is wrapped as a Weave op
  • Memory: Redis Cloud for macro caching (30-day TTL), run metadata (7-day TTL), step events (1-day TTL), and evaluation feedback storage
  • Frontend: Next.js 14 with Tailwind CSS — live SSE streaming of agent steps, embedded BrowserBase live view, run comparison, and auto-improve dashboard
  • Evaluation: 5 Weave scorers (task success, efficiency, loop detection, cache utilization, LLM judge) + full Weave Evaluation framework integration

Architecture highlights:

  • Sequence-aware macro caching: Macros are keyed by domain:intent:url_path:action_history_hash, not just page signature. This prevents the "right page, wrong context" bug.
  • Macro confidence scoring: Macros track success/fail counts and are only reused if success rate exceeds 70%.
  • Feedback loop: the Weave API is queried directly for past failure analysis, which feeds back into prompt generation; Redis serves as a fallback when the Weave API is unavailable.
  • Loop detection + breaker: Monitors page signature history and action repetition. When stuck, triggers page reload and re-observation with Gemini.
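The first two highlights can be sketched in a few lines. The helper names and the 12-character hash truncation are assumptions; the key layout (`domain:intent:url_path:action_history_hash`) and the 70% threshold come from the description above.

```typescript
import { createHash } from "node:crypto";

// Sequence-aware key: hashing the last 3 actions into the key means the same
// page reached via a different action history gets a different macro slot,
// avoiding the "right page, wrong context" bug.
function macroKey(domain: string, intent: string, urlPath: string, history: string[]): string {
  const recent = history.slice(-3).join("|");
  const hash = createHash("sha256").update(recent).digest("hex").slice(0, 12);
  return `${domain}:${intent}:${urlPath}:${hash}`;
}

// Confidence gate: only replay a macro whose observed success rate exceeds 70%.
function shouldReuse(successes: number, failures: number): boolean {
  const total = successes + failures;
  return total > 0 && successes / total > 0.7;
}
```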

Challenges we ran into

Gemini as both planner and judge created a feedback loop paradox. If Gemini 3.0 Flash Preview plans a bad action and then judges its own action, will it recognize the mistake? We solved this by using different prompting strategies — structured task-focused prompts for planning, and analytical evaluation prompts for judging — so the same model operates in genuinely different "modes."

Macro cache poisoning was a real problem. Early versions cached macros by page signature alone. The agent would cache "click Add to Cart" on the inventory page, then replay it when revisiting the same page after already adding items — getting stuck in a loop. We redesigned the entire caching system to be sequence-aware, incorporating the last 3 actions into the cache key.

Stagehand + Gemini model naming was tricky. Stagehand expects model names in a specific format (google/gemini-3.0-flash-preview), while the Google AI SDK uses bare names. We had to build a provider-aware model name resolver.
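A minimal sketch of such a resolver, assuming just the two formats mentioned above (Stagehand's `google/<model>` prefix versus the SDK's bare name); the real resolver handles more providers.

```typescript
type Provider = "stagehand" | "google-ai-sdk";

function resolveModelName(provider: Provider, model: string): string {
  const bare = model.replace(/^google\//, ""); // normalize either input form
  return provider === "stagehand" ? `google/${bare}` : bare;
}
```

Normalizing to the bare name first makes the function idempotent, so it is safe to call on a name that is already in the target format.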

SSE streaming across the full stack. Getting real-time step events from the agent runner → Express SSE → Next.js client required careful handling of connection lifecycle, especially for runs that take 30-60 seconds.
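For reference, the SSE wire format being streamed looks like this (the `step` event name and JSON payload are illustrative):

```typescript
// Each SSE frame is an "event:" line plus a "data:" line, terminated by a
// blank line; the client dispatches on the event name.
function formatSSE(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
// Server side: written with res.write() to an Express response opened with
// Content-Type: text/event-stream. Client side: consumed via EventSource in
// the Next.js UI, re-rendering the step list on each message.
```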

Deploying a monorepo with long-running browser sessions. Browser automation tasks can run for minutes. We couldn't use serverless. We deployed to a DigitalOcean droplet with PM2 process management, with the backend running persistent Node.js processes.

Accomplishments that we're proud of

  • Measurable self-improvement: Warm runs consistently show 30-100% fewer LLM calls than cold runs on the same task. This isn't a claim — it's visible in the metrics table and traceable through Weave.

  • The auto-improve loop actually works: The agent can take a failing task, analyze why it failed, rewrite its own prompt with learned rules, and succeed on the next attempt. Watching it go from FAIL → FAIL → PASS with decreasing step counts is genuinely satisfying.

  • 18 benchmark tasks across 5 real web applications: GoCalendar, GoMail, MarriSuite, NetworkIn, and SauceDemo. This isn't a toy demo — these are realistic multi-step workflows.

  • Full observability: Every single decision the agent makes is logged to W&B Weave. You can trace exactly why the agent chose a specific action, whether it used a cached macro, and how the LLM judge scored it.

  • Live browser view: You can watch the agent work in real-time through BrowserBase's live session streaming, embedded directly in the UI.

  • Production deployment: The entire system runs on a DigitalOcean droplet with a one-command deploy script. It's not just a local demo.

What we learned

  • Gemini 3.0 Flash Preview is remarkably good at structured evaluation. When given a clear rubric (VERDICT/SCORE/CONFIDENCE/REASONING format), it produces consistent, useful judgments that actually improve the agent's behavior on subsequent runs.

  • The hardest part of self-improvement isn't the learning — it's knowing what to forget. Bad macros that worked once but fail in different contexts are worse than no cache at all. Success rate tracking and confidence thresholds were essential.

  • DOM-first beats vision-first for structured web tasks. Using Stagehand's observe() to get actionable elements as structured data, then having Gemini reason over text labels, is faster and more reliable than screenshot-based approaches.

  • Weave's tracing is invaluable for debugging agent behavior. When the agent fails, being able to walk through the exact trace of state → plan → execute → validate for each step made debugging 10x faster.

  • The gap between "works once" and "works reliably" is where self-improvement matters most. A browser agent that succeeds 60% of the time with no learning is less useful than one that starts at 40% but climbs to 80% through cached macros and improved prompts.
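The rubric format mentioned in the first bullet is easy to parse mechanically, which is part of why it produced consistent judgments. A hypothetical parser (the real judge output may carry more fields):

```typescript
interface Verdict {
  verdict: "PASS" | "FAIL";
  score: number;      // 0-1
  confidence: string;
  reasoning: string;
}

// Parse a "VERDICT: / SCORE: / CONFIDENCE: / REASONING:" block; returns null
// if the response does not follow the rubric, so callers can retry or discard.
function parseVerdict(text: string): Verdict | null {
  const get = (field: string) =>
    text.match(new RegExp(`^${field}:\\s*(.+)$`, "m"))?.[1].trim();
  const verdict = get("VERDICT");
  const score = Number(get("SCORE"));
  if ((verdict !== "PASS" && verdict !== "FAIL") || Number.isNaN(score)) return null;
  return {
    verdict,
    score,
    confidence: get("CONFIDENCE") ?? "unknown",
    reasoning: get("REASONING") ?? "",
  };
}
```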

What's next for LoopLess

  • Gemini 3.0 Pro integration for complex multi-page reasoning tasks where Flash's context window isn't sufficient

  • Cross-task transfer learning — macros learned on SauceDemo checkout could inform other e-commerce sites with similar flows

  • Visual grounding fallback — when DOM observation fails, fall back to Gemini's vision capabilities for screenshot-based action planning

  • Multi-agent collaboration — separate Gemini instances for planning, execution, and evaluation running in parallel

  • Community macro sharing — a shared Redis-backed macro library where agents deployed by different users contribute learned patterns, creating a collective intelligence layer

  • Adaptive strategy routing — automatically selecting between DOM-first, vision-first, and macro-first strategies based on learned domain performance, all orchestrated by Gemini 3.0

Built With

  • browserbase
  • express.js
  • gemini-3.0-flash-preview
  • google-generative-ai-sdk
  • next.js-14
  • node.js
  • pm2
  • redis-cloud
  • stagehand
  • tailwind-css
  • typescript
  • w&b-weave
  • zod