Inspiration

Applying for jobs is often a soul-crushing, repetitive, and time-consuming process. We've all experienced the fatigue of re-typing the same resume details into hundreds of different web forms. We wanted to build an industrial-grade, autonomous assistant that takes over this burden. Rather than creating a brittle web scraper that breaks every time a website updates its UI, we were inspired by the new wave of multimodal AI. We envisioned an agent that actually sees the screen like a human does, guided by your voice, taking over the granular clicks and keystrokes while leaving you in the driver's seat.

What it does

ResumeApply is an autonomous job search agent that controls a real browser to find and apply for jobs on your behalf.

  • Resume Parsing: You upload your resume, and the system extracts your profile details with high accuracy.
  • Voice-Controlled Browsing: Using a bidirectional voice interface powered by WebRTC, you can talk to the agent in real time. Tell it, "Find me remote Senior Frontend roles," and it gets to work.
  • Autonomous Execution: The agent navigates job boards (like LinkedIn or Indeed), visually inspects each page from a screenshot, clicks "Apply," and fills out complex form fields using your extracted profile data.
  • Human-in-the-Loop: If the agent encounters a CAPTCHA, an MFA prompt, or a highly specific, ambiguous question it doesn't know how to answer, it pauses its execution and pings you via a real-time WebSocket connection to intervene in the dashboard.
  • Real-time Tracking: A synchronized dashboard tracks every application, its status, and the agent's live progress.

How we built it

We built ResumeApply with a modern, real-time cloud-native stack:

  • Frontend: Next.js 16 App Router, styled with TailwindCSS and animated using GSAP. State management is handled by Zustand.
  • Backend: A highly concurrent FastAPI server orchestrating the agent's behavior.
  • The Agent Engine: We used Google's Agent Development Kit (ADK) and Playwright with a stealth plugin. Our biggest innovation here is dropping traditional DOM parsing. Instead, we use Gemini 2.0 Flash's multimodal vision capabilities to analyze raw screenshots of the browser. The model identifies input fields and buttons visually and returns coordinates for Playwright to interact with, mimicking human spatial understanding.
  • Real-time Sync & Voice: We implemented a custom WebRTC (aiortc) bridge for the bidirectional voice interactions and resilient WebSockets for streaming the agent's browser state down to the client dashboard.
  • Database & Auth: Google Cloud Platform (GCP) powers our core infrastructure. We leverage Cloud SQL for highly available PostgreSQL storage, Google Cloud Storage (GCS) for securely hosting resume assets, and Google Identity Platform for seamless user authentication.
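
To illustrate the vision-to-click step in the Agent Engine, here is a minimal sketch of turning a model-returned bounding box into a click point. It assumes the model returns boxes as `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 range, which is the convention Gemini's documentation describes for bounding boxes. The project's actual prompt and response format are not stated here, and `box_center_px` is a hypothetical helper name.

```python
from typing import NamedTuple, Sequence


class Point(NamedTuple):
    x: float
    y: float


def box_center_px(box: Sequence[float], viewport_width: int,
                  viewport_height: int, scale: int = 1000) -> Point:
    """Convert a normalized [ymin, xmin, ymax, xmax] box to a click point.

    Assumes coordinates are normalized to 0..scale (an assumption about
    the model output, not something confirmed by the project).
    """
    ymin, xmin, ymax, xmax = box
    if not (0 <= xmin <= xmax <= scale and 0 <= ymin <= ymax <= scale):
        raise ValueError(f"box out of range or inverted: {box!r}")
    # Center of the box, rescaled from normalized units to viewport pixels.
    cx = (xmin + xmax) * viewport_width / (2 * scale)
    cy = (ymin + ymax) * viewport_height / (2 * scale)
    return Point(cx, cy)


# With Playwright's async API, the result could then be used as:
#     target = box_center_px(box, 1280, 720)
#     await page.mouse.click(target.x, target.y)
```

Because the box is normalized, the same math works whether the screenshot was captured at 1x or at a higher device pixel ratio, as long as the viewport size is given in the CSS pixels that Playwright's mouse uses. Rejecting out-of-range or inverted boxes guards against clicking on a hallucinated location.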

Challenges we ran into

  • Bypassing Bot Detection: Modern job boards deploy aggressive anti-bot protections. We had to carefully configure Playwright Stealth, manage user-agent rotations, and inject human-like delays to prevent the agent's sessions from being blocked.
  • UI Variability & DOM Complexity: Initially, passing massive, deeply nested HTML trees to the LLM was slow and prone to hallucination. We pivoted entirely to a vision-native approach, which was challenging to calibrate but ultimately much more resilient to UI changes.
  • State Synchronization: Keeping the hidden Playwright browser state in sync with the Next.js frontend, especially across disconnects and connection throttling, required building a custom queueing layer over our WebSockets.
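
One common way to build a queueing layer like the one in the State Synchronization item is a bounded replay buffer with sequence numbers. The sketch below shows that general technique under stated assumptions; it is not the project's implementation, and `ReplayBuffer`, `publish`, and `since` are hypothetical names. The idea is that each state update gets a sequence number, and a reconnecting client reports the last number it saw so the server can resend only what was missed.

```python
from collections import deque
from itertools import count
from typing import Optional


class ReplayBuffer:
    """Keeps the most recent events so reconnecting clients can catch up."""

    def __init__(self, max_events: int = 256):
        # deque(maxlen=...) silently drops the oldest event when full.
        self._events = deque(maxlen=max_events)
        self._seq = count(1)

    def publish(self, payload: dict) -> dict:
        """Stamp a payload with the next sequence number and store it."""
        event = {"seq": next(self._seq), "payload": payload}
        self._events.append(event)
        return event

    def since(self, last_seen: int) -> Optional[list]:
        """Events newer than last_seen, or None if some were already evicted.

        None signals that the client fell too far behind and needs a
        full state snapshot instead of an incremental replay.
        """
        if not self._events:
            return []
        oldest = self._events[0]["seq"]
        if last_seen < oldest - 1:
            return None
        return [e for e in self._events if e["seq"] > last_seen]
```

Bounding the buffer keeps memory flat for slow or throttled clients, at the cost of forcing a full resync when a client misses more than `max_events` updates.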

Accomplishments that we're proud of

  • Successfully building an agent that navigates visually. Having Gemini accurately map screen pixels to actionable bounding boxes for Playwright to click is a profound shift from brittle CSS selectors.
  • Achieving a seamless, low-latency voice interaction experience so the user feels like they are pairing with a dedicated human assistant.
  • Creating a robust "human-in-the-loop" handoff that prevents the agent from getting stuck or failing silently on edge cases like CAPTCHAs.
  • Successfully orchestrating complex real-time services across Google Cloud's robust ecosystem.

What we learned

  • In our experience, multimodal vision models (like Gemini 2.0 Flash) were often faster and more accurate for UI navigation than sanitizing and parsing raw HTML DOM structures.
  • Orchestrating headless browsers, real-time WebSockets, WebRTC streams, and LLM calls inside an asynchronous Python backend is complex and requires strict memory and dependency management.
  • Designing for failure (human-in-the-loop) yields a far better user experience than chasing elusive 100% autonomy.

What's next for ResumeApply

  • Complex Portals: Expanding support from "Easy Apply" job boards to complex, multi-page portals like Workday and Greenhouse.
  • Dynamic Cover Letters: Using Gemini to generate targeted cover letters based on the requirements parsed from each job description.
  • Interview Scheduling: Connecting the agent directly to Google Calendar APIs so it can autonomously negotiate and book initial screening calls when recruiters reach out.