Multimodal AI agent on Google Cloud using Gemini

Creative Storyteller Agent

Inspiration

Modern digital workflows generate massive amounts of information across dashboards, conversations, documents, and visual interfaces. Security analysts, students, and professionals often switch between tools to interpret alerts, create reports, and take action. Traditional AI assistants rely mainly on text input, limiting real-world usability.

We wanted to build an AI agent that interacts more naturally: one that can see, listen, understand context, and respond using multiple forms of media. Inspired by real-world operational environments such as security operations centers and complex digital workflows, we designed a multimodal agent capable of real-time reasoning and interaction using Google’s Gemini models. The goal was to move beyond chatbots and demonstrate how AI agents can become active collaborators rather than passive responders.

What it does

The Multimodal AI Agent is a next-generation assistant built on Google Cloud that combines vision, voice, and reasoning capabilities.

The agent can:

Analyze screenshots or visual dashboards using Gemini’s multimodal understanding
Accept voice or text instructions from users
Generate structured explanations and actionable recommendations
Convert responses into natural speech
Understand UI screens and suggest or execute actions
Create multimedia content including stories, visuals, and narrated outputs

Key capabilities include:

Live Interaction: users can speak naturally and receive spoken responses
Visual Understanding: interprets images, interfaces, and diagrams
UI Navigation: understands on-screen elements for workflow automation
Creative Storytelling: produces interleaved text, images, and narration

The system transforms AI from a question-answering tool into an interactive multimodal agent.

How we built it

The project was developed using Google’s AI and cloud ecosystem.
Core Technologies

Gemini 1.5 Pro for multimodal reasoning
Google GenAI SDK & Agent Development Kit (ADK) for agent orchestration
Google Cloud Run for hosting
Cloud Storage for media handling
Speech-to-Text & Text-to-Speech APIs for live voice interaction

Architecture

User provides input via voice, text, or image.
Inputs are processed and routed through an agent orchestrator.
Gemini analyzes multimodal context.
The agent selects tools (analysis, storytelling, automation).
Outputs are generated as text, audio, or multimedia responses.
Results are stored and served using Google Cloud services.

Components Built

Multimodal reasoning engine
Live voice interaction pipeline
Screen understanding module
Story generation pipeline
Cloud deployment infrastructure

Challenges we ran into

Multimodal orchestration: Combining voice, image, and reasoning workflows required careful input routing and context management.
Latency in real-time interaction: Streaming audio while maintaining responsiveness demanded optimization.
Prompt design: Ensuring Gemini produced structured, actionable outputs instead of generic responses required iterative prompt engineering.
UI interpretation ambiguity: Screenshots lack DOM information, so visual reasoning had to compensate using contextual prompts.
Media synchronization: Aligning generated images, narration, and storytelling outputs into a coherent experience was technically complex.
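The routing step in the architecture (input arrives, the orchestrator normalizes it, a tool is selected) can be sketched roughly as follows. This is a minimal illustration, not the project's actual orchestrator: `AgentInput`, `normalize`, `select_tool`, `TOOLS`, and `run_agent` are hypothetical names, the keyword matching is a naive stand-in for a model-driven tool choice, and the tool handlers are placeholders where a real agent would call Gemini.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class AgentInput:
    """One user turn; any combination of the three modalities may be set."""
    text: Optional[str] = None
    audio: Optional[bytes] = None
    image: Optional[bytes] = None


def normalize(inp: AgentInput, transcribe: Callable[[bytes], str]) -> AgentInput:
    """Turn voice input into text so later steps see one instruction string."""
    if inp.text is None and inp.audio is not None:
        return AgentInput(text=transcribe(inp.audio), image=inp.image)
    return inp


def select_tool(inp: AgentInput) -> str:
    """Pick a tool name with naive substring matching (assumption: a real
    agent would let the model decide)."""
    instruction = (inp.text or "").lower()
    if "story" in instruction:
        return "storytelling"
    ui_verbs = ("click", "open", "navigate")
    if inp.image is not None and any(verb in instruction for verb in ui_verbs):
        return "automation"
    return "analysis"


# Placeholder handlers keyed by the three tool names from the architecture.
TOOLS: Dict[str, Callable[[AgentInput], str]] = {
    "analysis": lambda inp: f"[analysis] {inp.text or 'image only'}",
    "storytelling": lambda inp: f"[story] {inp.text}",
    "automation": lambda inp: f"[ui action] {inp.text}",
}


def run_agent(inp: AgentInput,
              transcribe: Callable[[bytes], str] = lambda audio: "") -> str:
    """Normalize the input, choose a tool, and return that tool's output."""
    inp = normalize(inp, transcribe)
    return TOOLS[select_tool(inp)](inp)
```

Keeping tool selection separate from the handlers means the keyword rule can later be swapped for a model call without touching the rest of the pipeline.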

Built With

2.0
ai
api
assets
cloud
firebase
flash
gateway
gemini
generated
google
multimodal/interleaved
native
next.js
output
storage
storing
vertex
with

Submitted to

Gemini Live Agent Challenge

Created by

The core insight I implemented: Most teams would build this naively: call Gemini for text, call a separate image API, and stitch the results together. I architected it around Gemini 2.0 Flash's response_modalities=["TEXT", "IMAGE"] config, which produces a single interleaved stream where text blocks and image bytes arrive together in one pass. That's the mandatory tech requirement, and it's a non-obvious API pattern.

What I built:

gcp_client.py — the Vertex AI integration layer with authenticated streaming, Cloud Logging telemetry, Secret Manager integration, and GCS image persistence
main.py — FastAPI server with SSE streaming, a /gcp-proof endpoint that returns the exact Vertex AI URL being called, and response headers that expose the GCP project and model
Four distinct system prompts (storybook, marketing, educational, social) each tuned to elicit interleaved output — not just text with images appended at the end
A complete React frontend with real-time block rendering, a demo mode that works without an API key, and an editorial parchment aesthetic
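
As a rough illustration of the interleaved pattern and the SSE streaming described above, the sketch below converts response parts into ordered blocks and formats each block as a Server-Sent Events frame. It is not the code from gcp_client.py or main.py. The helper names (`parts_to_blocks`, `to_sse`, `generate_interleaved`) and the block dictionary layout are assumptions. The Gemini call uses the google-genai SDK with a non-streaming request for simplicity, needs that package plus GCP credentials, and is not exercised by the demo at the bottom, which uses stand-in parts instead.

```python
import base64
import json
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Iterator, List


def parts_to_blocks(parts: Iterable[Any]) -> List[Dict[str, str]]:
    """Convert response parts into text/image blocks, preserving their order."""
    blocks: List[Dict[str, str]] = []
    for part in parts:
        text = getattr(part, "text", None)
        inline = getattr(part, "inline_data", None)
        if text:
            blocks.append({"type": "text", "text": text})
        elif inline is not None and inline.data:
            blocks.append({
                "type": "image",
                "mime_type": inline.mime_type,
                # Base64 keeps raw image bytes JSON-safe for the browser.
                "data_b64": base64.b64encode(inline.data).decode("ascii"),
            })
    return blocks


def to_sse(blocks: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Format each block as one SSE frame: a data line plus a blank line."""
    for block in blocks:
        yield f"data: {json.dumps(block)}\n\n"


def generate_interleaved(prompt: str, model: str, project: str,
                         location: str) -> List[Dict[str, str]]:
    """Request text and images in one call (assumes google-genai is installed
    and that `model` supports image output)."""
    from google import genai
    from google.genai import types

    client = genai.Client(vertexai=True, project=project, location=location)
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )
    return parts_to_blocks(response.candidates[0].content.parts)


# Stand-in parts that mimic the text / inline_data attributes of a response.
fake_parts = [
    SimpleNamespace(text="Once upon a time", inline_data=None),
    SimpleNamespace(text=None, inline_data=SimpleNamespace(
        mime_type="image/png", data=b"\x89PNG")),
]
demo_frames = list(to_sse(parts_to_blocks(fake_parts)))
```

In a FastAPI server, a generator like `to_sse` could be wrapped in a `StreamingResponse` with media type `text/event-stream`, so the frontend can render each block as it arrives.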

Mahuri Bhambure
I am a senior cybersecurity and VAPT engineer working as a freelancer. I have 11+ years of experience in the IT industry.
Developed a multimodal deep learning model integrating image and text features for classification tasks.

Implemented cross-modal attention mechanisms to improve prediction accuracy.

Built a data pipeline for multimodal datasets including text, image, and audio preprocessing.
Evaluated model performance using multimodal benchmarks.

Sneha Taori

Updates

Sneha Taori posted an update — Mar 06, 2026 08:18 AM EST

Mahuri Bhambure and I have worked very hard on this, and we would like to keep developing this tool.

Mahuri Bhambure started this project — Feb 27, 2026 02:09 PM EST
