Ayana

Inspiration ✨

We've all been there: 47 open browser tabs, three spreadsheets, and a group chat full of conflicting opinions, just to plan a weekend trip. Travel planning today is fragmented, overwhelming, and strangely joyless before the journey has even begun.

We wanted to flip that completely. What if planning a trip felt like the trip itself? What if an AI guide could take you by the hand, narrate your destination, fly you over landmarks in cinematic 3D, and let you explore with your voice and hands, not a search bar?

That's what Ayana became: not just a travel tool, but a travel experience.

What it does

Ayana is a multimodal AI travel guide that turns trip planning into an immersive, cinematic journey.

You begin by choosing a persona such as Adventurer, Romantic, or Wanderer. Ayana's prep engine then generates three curated itineraries tailored to your personality using Gemini 2.5 Flash. A cinematic loading screen sets the mood with orchestrated text phases and ambient audio before dropping you into a live Google Maps 3D globe.

From there, Ayana speaks. A real-time voice agent narrates your journey, explains landmarks, and responds to your questions live through the Gemini Live API. You can steer the experience with your voice, hand gestures powered by MediaPipe, or simple taps. At any point, you can explore nearby food, shopping, or activities, and even step inside places through Street View.

When the session ends, Ayana generates a Spotify Wrapped-style recap of your journey, including landmarks visited, food and activity picks, and a traveller DNA profile, powered by Grok (xAI). And as a physical AR extension, our Snap "Teleport Me" lens series places you inside 360-degree Street View panoramas of your destinations, so you can literally see yourself inside places like the Eiffel Tower, Mount Fuji, or Cancun Beach. 🌍

How we built it

Frontend — We built the experience in Next.js 16 with TypeScript. The map layer uses the Google Maps JavaScript API 3D (alpha) with programmatic flyCameraTo and flyCameraAround to create cinematic arrivals at each landmark. We also built a custom SmartRange system that queries the Google Maps ElevationService to prevent camera distortion on elevated terrain, so the same camera logic works from sea-level beaches to mountain peaks without hardcoded location-specific tuning.

Gesture engine — Gesture interaction runs entirely client-side using MediaPipe Hand Landmarker in the browser. This allows users to explore the map more naturally, without relying only on conventional UI controls.
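
As an illustration of how landmark output can become a map gesture, the sketch below detects a pinch from one hand. MediaPipe Hand Landmarker reports 21 normalized landmarks per hand, with index 4 as the thumb tip and index 8 as the index fingertip. The pinch gesture itself, the `isPinching` helper, and the 0.05 threshold are assumptions made for this example, not a description of Ayana's actual gesture set.

```typescript
// One hand landmark in normalized image coordinates, as reported by
// MediaPipe Hand Landmarker (x and y are in the 0..1 range).
interface Landmark {
  x: number;
  y: number;
  z: number;
}

const THUMB_TIP = 4; // MediaPipe hand landmark index
const INDEX_FINGER_TIP = 8; // MediaPipe hand landmark index

// Hypothetical helper: true when the thumb and index fingertips are closer
// than `threshold` in the image plane (threshold is an assumed value).
function isPinching(hand: Landmark[], threshold = 0.05): boolean {
  if (hand.length < 21) return false; // incomplete detection, ignore it
  const thumb = hand[THUMB_TIP];
  const index = hand[INDEX_FINGER_TIP];
  const distance = Math.hypot(thumb.x - index.x, thumb.y - index.y);
  return distance < threshold;
}
```

A pure function like this keeps gesture classification separate from camera and model setup, so it can be tuned without touching the video pipeline.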

Voice agent — Ayana's live guide is powered by the Gemini Live API through the Google Agent Development Kit (ADK), using gemini-2.5-flash-native-audio-preview for real-time bidirectional audio streaming. Raw PCM audio (16kHz, 16-bit) is streamed over WebSocket using custom AudioWorklet processors. The agent is grounded in the generated itinerary and live session state from startup, and controls the experience through typed tool calls such as choose_itinerary, move_to_landmark, show_nearby, and open_place_street_view, each of which triggers real camera movement, overlays, sidebar changes, or Street View transitions.
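
The Web Audio API delivers samples as 32-bit floats in the range -1 to 1, so they need converting before 16-bit PCM can go over the socket. The following is a minimal sketch of that conversion only; `floatTo16BitPcm` is a hypothetical name, and resampling to 16 kHz and the AudioWorklet wiring are left out.

```typescript
// Convert Web Audio float samples (-1..1) to signed 16-bit PCM.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    pcm[i] = Math.round(clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff);
  }
  return pcm;
}
```

The resulting `Int16Array` buffer can be sent as a binary WebSocket frame; the framing the backend expects is not specified here.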

Prep pipeline — On persona selection, Ayana makes a structured Gemini 2.5 Flash call that generates three persona-matched itineraries with city context, landmark stops, and narrative grounding. This runs in parallel with the cinematic loading experience so the wait feels intentional rather than dead.
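
One way to overlap generation with the loading sequence is to start both tasks before awaiting either. The sketch below assumes two hypothetical callbacks, `generateItineraries` and `playLoadingSequence`; whether Ayana waits for the full loading sequence before continuing is also an assumption.

```typescript
// Run itinerary generation and the loading animation concurrently and
// resolve with the itineraries once both have finished.
async function prepareJourney<T>(
  generateItineraries: () => Promise<T>,
  playLoadingSequence: () => Promise<void>,
): Promise<T> {
  // Both calls are made before the await, so the tasks overlap in time.
  const [itineraries] = await Promise.all([
    generateItineraries(),
    playLoadingSequence(),
  ]);
  return itineraries;
}
```

With this shape the total wait is the longer of the two tasks rather than their sum.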

Recap engine — At the end of a session, the conversation transcript and location events are sent to Grok (grok-3-mini), which extracts structured recap data such as stops, food, activities, traveller DNA traits, and exploration stats. We then enrich those stops with Google Places imagery and render the result as an animated recap flow.

Snap AR (Snap Lens Studio Track) — We built a custom Street View cubemap pipeline that resolves the nearest panorama via a metadata radius search and outputs horizontal-cross cubemaps for each city station. We also added an equirectangular converter so the same assets map cleanly onto a single 360° sphere in Lens Studio. The scene uses three sphere “stations” per city with DeviceTracking rotation (gyro look-around), tap zones to walk between stations, and a fade-to-black transition to hide swaps.
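
The core of a cubemap-to-equirectangular converter is a per-pixel lookup: for each output pixel, find which cube face to sample and where. The sketch below shows that lookup only. The face names, the choice of +z as "front", and the up/down orientation are assumptions for illustration; the real pipeline's conventions may differ.

```typescript
type CubeFace = "front" | "back" | "left" | "right" | "up" | "down";

interface CubeSample {
  face: CubeFace;
  u: number; // 0..1 across the face
  v: number; // 0..1 down the face
}

// Map a normalized equirectangular coordinate (u, v in 0..1) to a cube face
// and a position on that face.
function equirectToCubeFace(u: number, v: number): CubeSample {
  // Longitude spans -PI..PI left to right, latitude PI/2..-PI/2 top to bottom.
  const lon = (u - 0.5) * 2 * Math.PI;
  const lat = (0.5 - v) * Math.PI;
  // Unit view direction: +x right, +y up, +z forward (assumed convention).
  const x = Math.cos(lat) * Math.sin(lon);
  const y = Math.sin(lat);
  const z = Math.cos(lat) * Math.cos(lon);
  const ax = Math.abs(x);
  const ay = Math.abs(y);
  const az = Math.abs(z);

  let face: CubeFace;
  let fu: number;
  let fv: number;
  // The axis with the largest magnitude decides which face the ray hits.
  if (az >= ax && az >= ay) {
    face = z > 0 ? "front" : "back";
    fu = (z > 0 ? x : -x) / az;
    fv = -y / az;
  } else if (ax >= ay) {
    face = x > 0 ? "right" : "left";
    fu = (x > 0 ? -z : z) / ax;
    fv = -y / ax;
  } else {
    face = y > 0 ? "up" : "down";
    fu = x / ay;
    fv = (y > 0 ? z : -z) / ay;
  }
  // Rescale from -1..1 to 0..1 texture coordinates.
  return { face, u: (fu + 1) / 2, v: (fv + 1) / 2 };
}
```

A full converter would loop over every output pixel, call this lookup, and copy the color from the matching tile of the horizontal-cross image.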

Challenges we ran into

Real-time agent tool orchestration was the hardest problem by far. The Gemini Live API is a continuous streaming system, not a neat request/response loop. That means voice input, voice output, tool calls, frontend transitions, screenshots, and follow-up narration all have to coexist inside one live session without drifting out of sync.

To solve that, we built an ACK-gated orchestration protocol over WebSocket: the agent emits a tool call, the frontend performs the visible action, then sends back an ACK along with a screenshot of the map state. Only after that does the backend advance the session and let the agent continue narrating. Getting this loop reliable without race conditions, duplicate transitions, or broken narration took a lot of iteration.
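
The gating idea can be sketched as a small state machine that lets only one tool call be in flight until its ACK arrives. This is an illustrative TypeScript sketch, not the team's backend code (the stack listed under Built With includes Python and FastAPI). The `AckGate` class, the `callId` and `screenshot` fields, the argument shape, and the queueing of later calls are all assumptions; only the four tool names come from the description above.

```typescript
type ToolName =
  | "choose_itinerary"
  | "move_to_landmark"
  | "show_nearby"
  | "open_place_street_view";

interface ToolCall {
  callId: string; // hypothetical unique id per tool call
  tool: ToolName;
  args: Record<string, unknown>;
}

interface Ack {
  callId: string;
  screenshot: string; // assumed: encoded image of the map state
}

class AckGate {
  private pending: ToolCall | null = null;
  private queue: ToolCall[] = [];
  private seen = new Set<string>();
  private send: (call: ToolCall) => void;

  constructor(send: (call: ToolCall) => void) {
    this.send = send;
  }

  // Accept a tool call from the agent; duplicates by callId are dropped.
  submit(call: ToolCall): void {
    if (this.seen.has(call.callId)) return;
    this.seen.add(call.callId);
    this.queue.push(call);
    this.pump();
  }

  // Handle an ACK from the frontend. Returns false for stale or unknown ACKs.
  acknowledge(ack: Ack): boolean {
    if (this.pending === null || this.pending.callId !== ack.callId) return false;
    this.pending = null;
    this.pump(); // the session may advance only now
    return true;
  }

  // Forward the next queued call if nothing is awaiting an ACK.
  private pump(): void {
    if (this.pending !== null || this.queue.length === 0) return;
    this.pending = this.queue.shift()!;
    this.send(this.pending);
  }
}
```

Rejecting ACKs whose id does not match the pending call is what guards against duplicate transitions, and holding later calls in a queue keeps narration from running ahead of what is on screen.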

Camera distortion on mountains was another major challenge. In Google Maps 3D, the camera range parameter is effectively a flat-ground distance. At high elevation, like Mount Fuji at 3,776 m, the same range that looks perfect at sea level can zoom straight into a rock face. We solved this dynamically with:

$$\text{adjustedRange} = \sqrt{\text{baseRange}^2 + \text{altitude}^2} \times 1.1$$

where altitude comes from a live ElevationService query. That gave us a general solution that works anywhere on Earth with zero location-specific hacks.
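
In code, the formula is a one-liner. The sketch below assumes both inputs are in meters and leaves out the ElevationService lookup; the function and constant names are hypothetical.

```typescript
// The 1.1 multiplier from the formula above.
const RANGE_SCALE = 1.1;

// Combine the flat-ground range with terrain altitude (both in meters).
function adjustedRange(baseRange: number, altitude: number): number {
  // Math.hypot(a, b) computes sqrt(a^2 + b^2).
  return Math.hypot(baseRange, altitude) * RANGE_SCALE;
}
```

With a hypothetical base range of 2,000 m, Mount Fuji's 3,776 m elevation gives an adjusted range of roughly 4,700 m, while at sea level the same base range becomes 2,200 m.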

MediaPipe in Next.js also took care. Running browser-side WASM and camera pipelines inside a React app meant we had to carefully control initialization, lifecycle timing, and cleanup to avoid double-loading and unstable behavior.
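
A common way to avoid double-loading is to cache the initialization promise so repeated calls share one load, for example when a React effect runs more than once. The sketch below is a generic version of that pattern; `loadOnce` is a hypothetical helper, and the factory passed to it stands in for whatever creates the hand landmarker.

```typescript
// Wrap an async factory so it runs at most once per successful load.
function loadOnce<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | null = null;
  return () => {
    if (cached === null) {
      cached = factory().catch((err) => {
        cached = null; // clear the cache so a failed load can be retried
        throw err;
      });
    }
    return cached;
  };
}
```

Every caller receives the same promise, so the WASM bundle and model are fetched once however many components ask for them. Releasing the camera stream on unmount still needs separate cleanup.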

Recap page performance became its own challenge. During transitions, a MutationObserver scanning the DOM at very high frequency combined with style reinjection caused visible jank. We fixed it by removing the observer-based approach and isolating the progress bar logic into a memoized component, which made the recap feel much smoother and more intentional.

The Street View Static API is a different product from the JavaScript Street View service, with its own product enablement, key restrictions, and quota rules. We had to explicitly handle REQUEST_DENIED responses and key-restriction pitfalls when generating cubemaps and equirectangular images.
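
A sketch of the metadata check, assuming the Street View Static API metadata endpoint and its documented `status` values. The helper names and error wording are hypothetical, and the network call itself is omitted.

```typescript
type MetadataStatus =
  | "OK"
  | "ZERO_RESULTS"
  | "NOT_FOUND"
  | "OVER_QUERY_LIMIT"
  | "REQUEST_DENIED"
  | "INVALID_REQUEST"
  | "UNKNOWN_ERROR";

interface StreetViewMetadata {
  status: MetadataStatus;
  pano_id?: string;
  error_message?: string;
}

// Build a metadata request URL that searches within `radiusMeters`.
function metadataUrl(lat: number, lng: number, radiusMeters: number, apiKey: string): string {
  const location = encodeURIComponent(`${lat},${lng}`);
  const key = encodeURIComponent(apiKey);
  return (
    "https://maps.googleapis.com/maps/api/streetview/metadata" +
    `?location=${location}&radius=${radiusMeters}&key=${key}`
  );
}

// Return the panorama id, null when no panorama is nearby, or throw on
// configuration and quota problems so they are not mistaken for "no imagery".
function resolvePanoId(meta: StreetViewMetadata): string | null {
  switch (meta.status) {
    case "OK":
      return meta.pano_id ?? null;
    case "ZERO_RESULTS":
    case "NOT_FOUND":
      return null;
    case "REQUEST_DENIED":
      throw new Error(
        "Street View Static API request denied: " +
          (meta.error_message ?? "check API enablement and key restrictions"),
      );
    default:
      throw new Error(`Street View metadata failed with status ${meta.status}`);
  }
}
```

Separating "no panorama here" from "the key is misconfigured" keeps a restricted key from silently producing empty stations.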

Accomplishments that we're proud of 🚀

Full end-to-end multimodal loop — voice in, map moves, Ayana narrates the result, all in real time.
Visual grounding via screenshots — the agent effectively "sees" the current map state after a tool action, making its spatial narration much more accurate.
Dynamic altitude-adjusted camera — a single runtime formula handles both sea-level destinations and high-elevation landmarks without hardcoding.
Cinematic polish — the loading screen, map fly-ins, overlay timing, and voice pacing all work together to make the experience feel like a product rather than a prototype.
Web-to-AR bridge — connecting the main web journey to a physical Snap AR teleportation moment creates an experience that neither medium could achieve alone.
A recap that tells a story — instead of a bland session summary, Ayana turns raw conversation and movement into a narrative artifact with personality.

What we learned

The Gemini Live API is extremely powerful, but building on top of it requires strong runtime contracts, not just good prompts.
Multimodal grounding really matters. Once the agent has visual confirmation of what the user is seeing, the quality of narration changes dramatically.
Immersion is fragile. One awkward transition, clipped audio line, or janky UI moment can break the entire effect.
AR and web are stronger together when the transition between them is meaningful, not bolted on.
Realtime systems need explicit orchestration layers. Typed tools, ACKs, session state, and frontend action contracts saved us from a huge amount of debugging later.

What's next for Ayana

Multi-city journeys — let Ayana guide users across multiple cities or countries in one continuous session.
Persistent travel memory — remember what users loved, skipped, or asked for, and adapt future itineraries accordingly.
Collaborative exploration — allow friends to join the same live 3D journey together.
Booking integrations — turn a recap into real flights, hotels, and bookable experiences.
Deeper Snap integration — trigger the teleportation moment at the peak of the live session, not just afterward.
More personas and cities — expand beyond the current set with more culturally grounded itinerary generation across the world.

Built With

fastapi
gemini-2.5-flash-(itinerary-generation)
gemini-live-api-/-gemini-2.5-flash-native-audio-preview-(real-time-voice-agent)
google-agent-development-kit-(adk)
google-maps-elevationservice
google-maps-javascript-api-3d-(alpha)
google-places-api-(new)
grok-grok-3-mini-by-xai-(recap-extraction)
lens-studio
mediapipe-hand-landmarker
next.js-15
python
snap
street-view-static-api
tflite
typescript

Submitted to

YHack Spring 2026
- Winner [MLH] Best Use of Gemini API

Created by

UI/UX flow, gesture recognition integration, API handling

Prithvi Seshadri
Snapchat lens development, recap page full stack

Utkarsh Lal
Master's student at University of Pennsylvania
Vamsi Naghichetty Kishore Kumar
Rajath R