Echoes of Time

Inspiration

The inspiration for Echoes of Time came from the Gemini Workshop during the hackathon, which motivated us to explore the creative potential of Google's multimodal AI. We reflected on a universal human experience: capturing life's precious moments through video. Whether it's a thrilling concert, a family beach vacation, or any meaningful experience, we all document these memories through our cameras.

This led us to an interesting observation—people often say life is like a TV show. But what's missing from our personal "shows"? A soundtrack! While we have the visuals captured, we rarely have music that perfectly complements the mood and atmosphere of our recorded memories. Echoes of Time bridges this gap by generating personalized songs that match the essence of your captured moments.

What It Does

Echoes of Time transforms your everyday videos into personalized musical experiences. Users simply upload a video from their day, and our application analyzes the content to generate a custom song that captures the mood, atmosphere, and emotions of that moment. The result is a unique soundtrack that brings your memories to life.

How We Built It

Our solution leverages Google Cloud Platform throughout the entire pipeline:

Multimodal Analysis: We utilize Gemini's advanced multimodal capabilities to separately analyze both the audio and visual components of uploaded videos, generating detailed descriptions of the emotional undertones and contextual elements.
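The per-modality request can be sketched as a plain request-builder. This is a minimal illustration, not our exact code: the field names follow the Gemini JavaScript SDK's camelCase `contents`/`parts` shape, and the prompt wording is a placeholder.

```javascript
// Builds a Gemini generateContent request that pairs the uploaded media
// with a modality-specific analysis prompt. Prompt text is illustrative.
function buildAnalysisRequest(base64Media, mimeType, modality) {
  const prompts = {
    audio: "Describe the emotional undertones and context conveyed by this clip's audio.",
    visual: "Describe the mood, atmosphere, and contextual elements visible in this clip.",
  };
  return {
    contents: [{
      role: "user",
      parts: [
        { inlineData: { mimeType, data: base64Media } },
        { text: prompts[modality] },
      ],
    }],
  };
}
```

The resulting body would be sent to Gemini via the official SDK or REST endpoint; that call is omitted here since it requires credentials.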

Intelligent Processing: A secondary Gemini instance acts as an AI judge, comparing the audio and video descriptions for consistency. When descriptions align, it creates a unified summary; when they diverge, it prioritizes the video analysis to ensure accuracy.
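The judge's fallback rule can be expressed as a small pure function. The judge itself is another Gemini call; this sketch only shows what we do with its verdict, with the merge format being illustrative.

```javascript
// Applies the judge's verdict: merge the two descriptions when they are
// consistent, otherwise fall back to the video description, which our
// pipeline treats as the more reliable signal.
function reconcileDescriptions(audioDesc, videoDesc, judgeSaysConsistent) {
  if (!judgeSaysConsistent) {
    return videoDesc;
  }
  return `${videoDesc}\n\nAudio context: ${audioDesc}`;
}
```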

Song Generation: A third Gemini instance transforms our refined content summary into detailed song specifications, which are then processed through Lyria's API to generate the actual audio track returned as base64-encoded MP3 files.

Backend Infrastructure: We built a robust Node.js and Express.js server to handle media uploads and processing, ensuring a smooth user experience while managing computational overhead.
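Before handing a file off to the pipeline, the server validates it. A minimal sketch of that check, with an assumed size cap and the currently supported MIME types (MP4 video, MP3 audio):

```javascript
// Assumed 100 MB upload cap; the real limit is a deployment choice.
const MAX_UPLOAD_BYTES = 100 * 1024 * 1024;
const SUPPORTED_TYPES = new Set(["video/mp4", "audio/mpeg"]);

// Validates an uploaded file object (shape matches what middleware like
// multer produces: { mimetype, size }) before starting the pipeline.
function validateUpload(file) {
  if (!file) return { ok: false, error: "no file provided" };
  if (!SUPPORTED_TYPES.has(file.mimetype)) {
    return { ok: false, error: `unsupported type: ${file.mimetype}` };
  }
  if (file.size > MAX_UPLOAD_BYTES) {
    return { ok: false, error: "file too large" };
  }
  return { ok: true };
}
```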

Challenges We Overcame

API Limitations: Working with Lyria's cutting-edge but nascent API presented unique challenges. The system is sensitive to prompt complexity and length, requiring careful optimization of our requests. Additionally, frequent credential refreshes were necessary to maintain stable connections.
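One mitigation for the prompt-length sensitivity is trimming prompts before they reach Lyria. A sketch of that idea; the character limit below is an assumed placeholder, not a documented Lyria value:

```javascript
// Assumed cap; tune against the actual API's tolerance.
const MAX_PROMPT_CHARS = 400;

// Collapse whitespace, then cut at the last whole word that fits,
// so Lyria never sees an over-long or mid-word-truncated prompt.
function trimPrompt(prompt) {
  const compact = prompt.replace(/\s+/g, " ").trim();
  if (compact.length <= MAX_PROMPT_CHARS) return compact;
  const cut = compact.slice(0, MAX_PROMPT_CHARS);
  return cut.slice(0, cut.lastIndexOf(" ")).trim();
}
```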

LLM Reliability: Large language models can sometimes produce inconsistent or hallucinated responses. We addressed this by implementing our multi-stage AI judging system to validate and refine outputs before song generation.
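The validate-before-proceeding pattern can be wrapped generically around any LLM call. A hedged sketch of the idea, not our exact implementation:

```javascript
// Calls `generate` (an async LLM call) up to `maxAttempts` times,
// returning the first response that passes the `isValid` check.
async function callWithValidation(generate, isValid, maxAttempts = 3) {
  let last;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await generate();
    if (isValid(last)) return last;
  }
  throw new Error(`no valid response after ${maxAttempts} attempts`);
}
```

In our pipeline the `isValid` role is played by the Gemini judge stage; a cheap structural check (e.g. non-empty, parseable output) can run first to avoid wasted judge calls.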

Performance Optimization: The end-to-end process currently takes approximately 1.5 minutes per video. While functional, this processing time highlighted the need for future optimization strategies.

Accomplishments We're Proud Of

We successfully created a working end-to-end pipeline that takes raw video input and produces genuinely fitting musical accompaniment. The system demonstrates remarkable accuracy in capturing the emotional essence of uploaded content and translating it into appropriate musical styles and moods.

What We Learned

This project deepened our understanding of multimodal AI applications and the complexities of chaining multiple AI systems together. We gained valuable experience with Google Cloud Platform's AI services and learned important lessons about API management, error handling, and user experience design in AI-powered applications.

What's Next for Echoes of Time

Enhanced User Experience: Redesign the interface with improved error handling, loading states, and interactive feedback to make the application more user-friendly and robust.

Performance Optimization: Explore options for reducing processing time, potentially through custom-trained open-source text-to-music models or optimized prompt engineering strategies.

Expanded Format Support: Extend compatibility beyond MP4 and MP3 to support a wider range of media formats, making the application accessible to more users regardless of their recording device.

Mobile Application: Develop a dedicated mobile app to make video-to-song generation more convenient and accessible for on-the-go content creation.

Spotify Integration: Implement user authentication with Spotify to analyze listening history and preferences, enabling more personalized song generation that aligns with individual musical tastes.

Advanced Customization: Add user controls for musical style preferences, tempo adjustments, and genre specifications to give users more creative control over their generated soundtracks.

TypeScript Migration: Convert the backend from JavaScript to TypeScript for improved error handling, type safety, and development experience. This will provide compile-time error detection and better IDE support, especially important for our complex media processing workflows and Google Cloud API integrations.
