Project ID
01368914242A91D3
Inspiration
Images freeze moments in time, but they don’t speak.
Scene-Speak was inspired by the idea of giving images a voice, transforming static visuals
into cinematic audio experiences filled with emotion, dialogue, atmosphere, and background
sound effects that bring scenes to life.
How We Built It
Users upload an image or enter a text prompt through an interactive Streamlit interface.
A generative AI model analyzes the input and creates a cinematic script with structured
narration and character dialogue. The script is converted into expressive audio using
AI-powered text-to-speech, while background ambient sound effects (SFX) are layered and
mixed using ffmpeg to create an immersive listening experience.
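The final mixing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the file names, and the default SFX volume of 0.25 are all assumptions. It only shows one common way to loop an ambient track under a narration track with ffmpeg's `volume` and `amix` filters.

```python
import subprocess
from typing import List


def build_mix_command(narration: str, ambience: str, output: str,
                      sfx_volume: float = 0.25) -> List[str]:
    """Build an ffmpeg command that layers looped ambience under narration.

    All names and the default volume are hypothetical, chosen for illustration.
    """
    if not 0.0 <= sfx_volume <= 1.0:
        raise ValueError("sfx_volume must be between 0.0 and 1.0")

    # Lower the ambience level, then mix it with the narration.
    # duration=first ends the output when the narration (input 0) ends.
    filter_graph = (
        f"[1:a]volume={sfx_volume}[bg];"
        "[0:a][bg]amix=inputs=2:duration=first:dropout_transition=0[out]"
    )
    return [
        "ffmpeg", "-y",
        "-i", narration,
        # -stream_loop -1 applies to the next input and loops it indefinitely.
        "-stream_loop", "-1", "-i", ambience,
        "-filter_complex", filter_graph,
        "-map", "[out]",
        output,
    ]


def mix(narration: str, ambience: str, output: str,
        sfx_volume: float = 0.25) -> None:
    """Run the mix; requires an ffmpeg binary on PATH."""
    subprocess.run(build_mix_command(narration, ambience, output, sfx_volume),
                   check=True)
```

Calling `mix("narration.mp3", "rain.mp3", "scene.mp3")` would write the mixed file. Keeping command construction separate from execution makes the filter graph easy to inspect without running ffmpeg.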
What We Learned
We learned how to design multi-stage generative AI pipelines that combine vision, language, and audio. The project emphasized prompt engineering and audio processing, and showed how strongly background sound effects shape the emotional impact of AI-generated content.
Challenges
Maintaining narrative coherence while balancing narration, dialogue, and background ambient SFX was the main challenge. We addressed this through iterative prompt tuning, audio caching, and careful sound-level adjustments to ensure clarity without losing atmosphere.
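The audio caching mentioned above could look something like the sketch below. It is an assumption about one reasonable approach, not the project's implementation: `cache_key`, `cached_audio`, the `synthesize` callable, and the `.mp3` extension are hypothetical. The idea is to key each generated clip by a hash of its text and voice, so repeated prompt-tuning runs do not re-synthesize unchanged lines.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(text: str, voice: str) -> str:
    """Derive a stable key from the line of script and the voice used."""
    payload = f"{voice}\n{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_audio(text: str, voice: str,
                 synthesize: Callable[[str, str], bytes],
                 cache_dir: Path) -> Path:
    """Return the path to audio for (text, voice), synthesizing only on a miss.

    `synthesize` is a hypothetical stand-in for whatever text-to-speech call
    the pipeline uses; it must return the encoded audio as bytes.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(text, voice)}.mp3"
    if not path.exists():
        path.write_bytes(synthesize(text, voice))
    return path
```

Because the key depends only on the text and voice, editing one line of dialogue invalidates just that clip while the rest of the scene is reused.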
Drive Link:
https://drive.google.com/file/d/1VYgbkTlbcjwA0sIrjz6g7e-byDQwCpQX/view?usp=sharing