Inspiration

Tax filing is still one of the most frustrating paperwork workflows for ordinary users. Most products force people through long questionnaires and static forms, even though the real challenges are understanding a document, mapping its numbers to the right fields, and being able to ask questions naturally.

We wanted to build a next-generation AI agent that feels less like tax software and more like a live assistant. Instead of typing through a long wizard, the user can talk naturally, upload a W-2, let the agent observe the form they are working on, and get help in context.

What it does

LiveTax Agent is a voice-first multimodal tax copilot built with Gemini Live.

The agent can:

  • listen to the user in real time
  • respond with native audio
  • inspect uploaded W-2 images or PDFs
  • observe the live tax-form workspace through shared tab context
  • help the user understand whether visible edits look correct
  • support interruption and barge-in during spoken responses

The demo focuses on a simple but believable workflow: helping a user complete IRS Form 1040 with live multimodal guidance.

How we built it

We built the frontend in Next.js with a minimal split-screen interface:

  • left side: live voice agent, subtle chat/file input, and tab-sharing controls
  • right side: IRS Form 1040 workspace

For realtime interaction, we used Gemini Live through the Google GenAI SDK on Vertex AI.

Our architecture uses:

  • Next.js for the frontend
  • FastAPI for a WebSocket relay backend
  • Vertex AI Gemini Live for multimodal reasoning and native audio
  • Cloud Run for deployment

The browser captures:

  • microphone audio
  • uploaded W-2 images or PDFs
  • live app-tab visual context

The backend relay owns the Gemini Live session: it forwards audio, text, image, and video input from the browser to the model, then streams spoken responses back to the browser.
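As a rough illustration of the relay pattern described above, here is a minimal sketch using only the Python standard library. It is not our production code: the JSON envelope with `kind` and `data` keys, the `UpstreamInput` type, and the queue-based wiring are all assumptions made for the example, and the queues stand in for the real browser WebSocket and the Gemini Live session.

```python
import asyncio
import base64
import json
from dataclasses import dataclass
from typing import Union

# Hypothetical envelope sent by the browser:
#   {"kind": "audio" | "text" | "image" | "video", "data": "<text or base64>"}
MEDIA_KINDS = {"audio", "image", "video"}


@dataclass
class UpstreamInput:
    """One decoded unit of user input, ready to hand to the model side."""
    kind: str
    payload: Union[bytes, str]


def decode_browser_message(raw: str) -> UpstreamInput:
    """Turn one JSON frame from the browser into a typed input."""
    msg = json.loads(raw)
    kind = msg.get("kind")
    if kind == "text":
        return UpstreamInput("text", str(msg["data"]))
    if kind in MEDIA_KINDS:
        # Binary media is assumed to arrive base64-encoded.
        return UpstreamInput(kind, base64.b64decode(msg["data"]))
    raise ValueError(f"unsupported message kind: {kind!r}")


async def pump_upstream(browser_in: asyncio.Queue, model_in: asyncio.Queue) -> None:
    """Forward browser frames to the model until a None sentinel arrives."""
    while (raw := await browser_in.get()) is not None:
        await model_in.put(decode_browser_message(raw))
    await model_in.put(None)


async def pump_downstream(model_out: asyncio.Queue, browser_out: asyncio.Queue) -> None:
    """Stream model audio chunks back to the browser until a None sentinel."""
    while (chunk := await model_out.get()) is not None:
        await browser_out.put(chunk)
    await browser_out.put(None)


async def run_relay(browser_in, model_in, model_out, browser_out) -> None:
    """Run both directions concurrently so audio can flow while input arrives."""
    await asyncio.gather(
        pump_upstream(browser_in, model_in),
        pump_downstream(model_out, browser_out),
    )
```

Running the two pumps concurrently is the important part: the user can keep talking or sharing the tab while spoken output is still streaming back.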

Challenges we ran into

The biggest challenge was making the experience feel truly live instead of like a chatbot with voice layered on top.

A few examples:

  • Browser-direct realtime media was less stable than we wanted, so we moved to a backend-owned Gemini Live session.
  • Static document context was not enough. The agent became much better once it could see the live workspace instead of only the original PDF.
  • Barge-in required transport-level handling. We had to explicitly interrupt playback and notify the live session when the user spoke over the model.
  • Prompting mattered a lot. We had to refine the system prompt so the agent spoke more calmly, handled W-2 extraction more carefully, and continued the conversation naturally instead of restarting after uploads.
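To make the barge-in point concrete, the following is a simplified sketch of the idea rather than our actual transport code. `PlaybackBuffer`, `handle_user_speech`, and the `notify_session` callback are hypothetical names introduced for this example. The generation counter is one possible way to discard audio that was already in flight when the user started speaking.

```python
from collections import deque
from typing import Callable, Deque, Optional


class PlaybackBuffer:
    """Holds model audio chunks that are waiting to be played (illustrative)."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()
        self._generation = 0  # bumped on every interruption

    @property
    def generation(self) -> int:
        return self._generation

    def enqueue(self, chunk: bytes, generation: int) -> bool:
        """Queue a chunk unless it belongs to a response the user talked over."""
        if generation != self._generation:
            return False  # stale audio from an interrupted response
        self._chunks.append(chunk)
        return True

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if nothing is pending."""
        return self._chunks.popleft() if self._chunks else None

    def interrupt(self) -> int:
        """Drop pending audio and start a new generation; returns chunks dropped."""
        dropped = len(self._chunks)
        self._chunks.clear()
        self._generation += 1
        return dropped


def handle_user_speech(buffer: PlaybackBuffer, notify_session: Callable[[], None]) -> int:
    """On detected user speech: stop local playback first, then tell the session."""
    dropped = buffer.interrupt()
    notify_session()  # hypothetical hook that signals the live session
    return dropped
```

Flushing locally before notifying the session keeps the agent from talking over the user even if the notification takes a moment to reach the backend.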

Accomplishments that we're proud of

We are proud that the final experience feels like a real multimodal agent rather than a text chatbot.

In particular:

  • the agent can listen, speak, inspect documents, and reason over live screen context
  • the app runs on Google Cloud using Cloud Run and Vertex AI
  • the W-2 upload flow, voice interaction, and form workspace all come together in one coherent demo
  • we kept the interface minimal and focused while still showing a technically rich workflow
  • we built a working end-to-end system under hackathon time constraints

What we learned

We learned that multimodal context is much more powerful than simply adding voice to a text workflow.

The most important takeaway was that an agent becomes dramatically more useful when it can combine:

  • speech
  • document understanding
  • live visual workspace awareness
  • action-oriented guidance

We also learned that reliability matters more than breadth in a hackathon demo. Focusing on one strong workflow produced a clearer and more compelling experience than trying to build a full tax product.

What's next for LiveTax Agent

The next step is to move from guidance-only assistance toward direct form actions.

We would like to add:

  • live field filling inside the real PDF instead of only guiding around it
  • structured tool calling for tax-field updates
  • stronger document extraction confirmation flows
  • better pointer awareness in shared workspace mode
  • broader support for additional tax forms and filing scenarios
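As one possible shape for the structured tool calling item, the sketch below shows a JSON-schema-style tool declaration and a validating handler. None of this exists in the current demo: the tool name `update_tax_field`, the field identifiers, and the result format are all assumptions chosen for illustration, and a real integration would need to follow the function-calling format of whichever SDK is used.

```python
from decimal import Decimal, InvalidOperation
from typing import Any, Dict

# Hypothetical tool declaration; field ids are illustrative, not a real schema.
UPDATE_FIELD_TOOL = {
    "name": "update_tax_field",
    "description": "Set a dollar amount on the form after the user confirms it.",
    "parameters": {
        "type": "object",
        "properties": {
            "field_id": {
                "type": "string",
                "enum": ["w2_box1_wages", "w2_box2_federal_withheld"],
            },
            "value": {
                "type": "string",
                "description": "Dollar amount as a decimal string, e.g. '52000.00'",
            },
        },
        "required": ["field_id", "value"],
    },
}


def apply_field_update(form: Dict[str, Decimal], args: Dict[str, Any]) -> Dict[str, Any]:
    """Validate model-supplied arguments before touching the form state."""
    allowed = UPDATE_FIELD_TOOL["parameters"]["properties"]["field_id"]["enum"]
    field_id = args.get("field_id")
    if field_id not in allowed:
        return {"ok": False, "error": f"unknown field: {field_id!r}"}
    try:
        amount = Decimal(str(args.get("value")))
        if not amount.is_finite() or amount < 0:
            return {"ok": False, "error": "value must be a non-negative amount"}
        amount = amount.quantize(Decimal("0.01"))  # normalize to cents
    except InvalidOperation:
        return {"ok": False, "error": "value is not a number"}
    form[field_id] = amount
    return {"ok": True, "field_id": field_id, "value": str(amount)}
```

Returning a structured result instead of raising lets the agent read the error back to the user and ask for a correction, which fits the confirmation flows listed above.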

Longer term, the product could become a true paperwork copilot for many complex government or financial workflows, not just tax filing.
