Project ID

01368914242A91D3

Inspiration

Images freeze moments in time — but they don’t speak.
Scene-Speak was inspired by the idea of giving images a voice, transforming static visuals into cinematic audio experiences filled with emotion, dialogue, atmosphere, and background sound effects that bring scenes to life.

How We Built It

Users upload an image or enter a text prompt through an interactive Streamlit interface.
A generative AI model analyzes the input and writes a cinematic script with structured narration and character dialogue. AI-powered text-to-speech then converts the script into expressive audio, and ffmpeg layers ambient background sound effects (SFX) underneath the voices to produce an immersive final mix.
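The mixing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, file names, and the default SFX level of 0.25 are assumptions. It uses ffmpeg's `volume` and `amix` filters to lower the ambience and lay it under the narration, and `-stream_loop -1` to loop the ambience until the narration ends.

```python
import subprocess


def build_mix_command(narration, ambience, output, sfx_volume=0.25):
    """Return an ffmpeg argument list that layers ambience under narration.

    All names and the default sfx_volume are illustrative assumptions.
    """
    if not 0.0 <= sfx_volume <= 1.0:
        raise ValueError("sfx_volume must be between 0.0 and 1.0")

    # Input 0 is the narration, input 1 is the ambience.
    # Lower the ambience level, then mix both streams; duration=first
    # stops the mix when the narration (the first amix input) ends.
    filter_graph = (
        f"[1:a]volume={sfx_volume}[bg];"
        "[0:a][bg]amix=inputs=2:duration=first[out]"
    )
    return [
        "ffmpeg", "-y",                     # overwrite the output if it exists
        "-i", str(narration),
        "-stream_loop", "-1",               # loop the ambience indefinitely
        "-i", str(ambience),
        "-filter_complex", filter_graph,
        "-map", "[out]",
        str(output),
    ]


def mix(narration, ambience, output, sfx_volume=0.25):
    """Run the mix; requires ffmpeg to be installed and on PATH."""
    subprocess.run(
        build_mix_command(narration, ambience, output, sfx_volume),
        check=True,
    )
```

Calling `mix("narration.mp3", "rain.mp3", "scene.mp3")` would write the layered track. Note that `amix` rescales its inputs by default, so the SFX level usually needs tuning by ear.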

What We Learned

We learned how to design multi-stage generative AI pipelines that combine vision, language, and audio. The project gave us hands-on practice with prompt engineering and audio processing, and showed how strongly background sound effects shape the emotional impact of AI-generated content.

Challenges

The main challenge was keeping the narrative coherent while balancing narration, dialogue, and background ambient SFX. We addressed this through iterative prompt tuning, audio caching, and careful sound-level adjustments that keep speech clear without losing atmosphere.
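One way audio caching can work is sketched below, under stated assumptions: the cache key scheme, the `.mp3` extension, and the `synthesize` callable (standing in for whatever text-to-speech call is used) are all hypothetical. Each clip is stored under a hash of its voice and text, so repeating a line while tuning prompts or levels does not trigger another synthesis call.

```python
import hashlib
from pathlib import Path


def cache_key(text, voice):
    """Derive a stable file name from the voice and the line of script."""
    payload = f"{voice}\n{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_synthesize(text, voice, synthesize, cache_dir):
    """Return the path to the audio for (text, voice), synthesizing once.

    `synthesize` is a hypothetical callable taking (text, voice) and
    returning audio bytes; it is only invoked on a cache miss.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(text, voice)}.mp3"
    if not path.exists():
        path.write_bytes(synthesize(text, voice))
    return path
```

Keying on both voice and text means the same line spoken by two characters is stored as two separate clips.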

Drive Link:

https://drive.google.com/file/d/1VYgbkTlbcjwA0sIrjz6g7e-byDQwCpQX/view?usp=sharing
