Scene-Speak (Project ID: 01368914242A91D3)

Project ID

01368914242A91D3

Inspiration

Images freeze moments in time — but they don’t speak.
Scene-Speak was inspired by the idea of giving images a voice, transforming static visuals into cinematic audio experiences filled with emotion, dialogue, atmosphere, and background sound effects that bring scenes to life.

How We Built It

Users upload an image or enter a text prompt through an interactive Streamlit interface.
A generative AI model (Gemini) analyzes the input and writes a cinematic script with structured narration and character dialogue. The script is converted into expressive audio with AI text-to-speech (ElevenLabs), and ambient sound effects (SFX) are layered underneath and mixed with ffmpeg to create an immersive listening experience.
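As a rough illustration of the mixing step, the sketch below builds an ffmpeg command that loops an ambience track under a narration track and lowers its level before mixing. The function name `build_mix_command`, the file paths, and the default `ambience_gain` of 0.25 are assumptions made for this example, not the project's actual code or settings.

```python
from typing import List


def build_mix_command(
    narration_path: str,
    ambience_path: str,
    output_path: str,
    ambience_gain: float = 0.25,  # assumed default, tune by ear
) -> List[str]:
    """Return an ffmpeg argument list that lays ambience under narration."""
    if not 0.0 <= ambience_gain <= 1.0:
        raise ValueError("ambience_gain must be between 0.0 and 1.0")

    # Lower the ambience level first, then mix both streams.
    # duration=first ends the mix when the narration (input 0) ends.
    filter_graph = (
        f"[1:a]volume={ambience_gain}[bg];"
        "[0:a][bg]amix=inputs=2:duration=first:dropout_transition=0[mix]"
    )
    return [
        "ffmpeg", "-y",
        "-i", narration_path,
        # -stream_loop applies to the next input: repeat the ambience
        # so it always covers the full narration.
        "-stream_loop", "-1", "-i", ambience_path,
        "-filter_complex", filter_graph,
        "-map", "[mix]",
        output_path,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. Note that `amix` rescales its inputs by default, so the final levels may still need adjusting.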

What We Learned

We learned how to design multi-stage generative AI pipelines that combine vision, language, and audio. The project emphasized prompt engineering, audio processing, and how background sound effects play a crucial role in shaping the emotional impact of AI-generated content.

Challenges

Maintaining narrative coherence while balancing narration, dialogue, and background ambient SFX was the main challenge. We addressed this through iterative prompt tuning, audio caching, and careful sound-level adjustments to ensure clarity without losing atmosphere.
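One possible shape for the audio caching mentioned above is a content-addressed file cache: each line of the script is hashed together with its voice, and speech is synthesized only when no file exists for that hash. This is a minimal sketch under assumptions; `cached_speech`, the `synthesize` callable, and the `.mp3` extension are hypothetical names chosen for illustration and do not reflect any specific text-to-speech API.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_speech(
    text: str,
    voice: str,
    synthesize: Callable[[str, str], bytes],  # hypothetical TTS wrapper
    cache_dir: Path,
) -> Path:
    """Return the path to audio for (text, voice), synthesizing only on a miss."""
    # The key covers both voice and text, so changing either one
    # produces a new cache entry instead of reusing stale audio.
    key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.mp3"
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(synthesize(text, voice))
    return path
```

With a cache like this, re-running the pipeline after a prompt tweak re-synthesizes only the lines whose text or voice changed, which keeps iteration on the mix fast.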

Drive Link:

https://drive.google.com/file/d/1VYgbkTlbcjwA0sIrjz6g7e-byDQwCpQX/view?usp=sharing

Built With

elevenlabs
ffmpeg
gemini
github
python
streamlit

Updates

Udit Chowdary Jasti started this project on Jan 25, 2026 at 01:04 PM EST
