Inspiration
The internet was supposed to democratize knowledge, but a massive barrier remains: Language. While over 60% of the web’s content is in English, only 16% of the world speaks it.
We realized that millions of hours of high-quality educational tutorials, lectures, and entertainment are locked away from billions of people simply because they don't understand the audio. Conversely, incredible local creators struggle to go global because professional dubbing is prohibitively expensive ($100+/minute) and slow.
We built Nativity.AI to bridge this "Knowledge Gap." We wanted to create a tool that doesn't just translate words, but nativizes content—adapting idioms, humor, and tone so that a viewer in Mumbai feels like a video from New York was made just for them.
What it does
Nativity.AI is an automated "One-Shot" Video Localization Studio.
Upload: A user uploads a video file (e.g., an English coding tutorial).
Process: The AI autonomously transcribes the audio, performs "Cultural Transcreation" (adapting metaphors and cultural references), and generates a synchronized script.
Synthesize: It generates natural-sounding speech in the target language (e.g., Hindi).
Mix: It intelligently mixes the new audio with the original video. Crucially, it uses Smart Audio Ducking to lower the original background music/noise only when the speaker talks, preserving the production value.
Result: The user gets a fully dubbed, studio-quality video in minutes. They can even verify the quality using our "Magic Compare" player, toggling between the original and dubbed versions in real time.
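As a rough illustration of how the Synthesize and Mix steps could line up speech clips with the script timestamps, the sketch below builds an FFmpeg filter graph that delays each synthesized clip to its start time and merges the clips into one dub track. The function name, the `[dub]` label, and the assumption that each clip is a separate stereo input are hypothetical, not taken from the project.

```python
from typing import List


def build_placement_filter(start_times_ms: List[int]) -> str:
    """Build an FFmpeg filter graph that places one clip per input at its start time.

    Assumes input N of the FFmpeg command is the synthesized clip for segment N.
    """
    if not start_times_ms:
        raise ValueError("at least one clip is required")

    chains = []
    labels = []
    for index, start_ms in enumerate(start_times_ms):
        if start_ms < 0:
            raise ValueError(f"clip {index} has a negative start time")
        label = f"[clip{index}]"
        # adelay takes one delay per channel, so the value is repeated for stereo.
        chains.append(f"[{index}:a]adelay={start_ms}|{start_ms}{label}")
        labels.append(label)

    # amix merges the delayed clips into a single track labelled [dub].
    chains.append(
        f"{''.join(labels)}amix=inputs={len(labels)}:duration=longest[dub]"
    )
    return ";".join(chains)
```

Calling `build_placement_filter([0, 2500])` yields a graph that leaves the first clip at the start and shifts the second by 2.5 seconds. Note that `amix` scales its inputs down by default, so a real pipeline would probably need to compensate for the lower volume.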
How we built it
We built a robust, scalable architecture powered by Google Gemini 3:
The Brain (AI): We used Gemini 3 Flash for its massive context window and reasoning speed. It acts as the "Director," analyzing the entire video script to resolve ambiguities and ensure cultural accuracy. We leveraged Gemini's native JSON mode to enforce strict timestamp outputs, which was critical for synchronization.
The Muscle (Processing): FFmpeg handles the heavy lifting—extracting audio tracks, performing side-chain compression (ducking), and rendering the final MP4.
The Voice: We integrated Edge-TTS for low-latency, neural-quality speech synthesis.
The Backend: Built with FastAPI (Python) and containerized with Docker to ensure all system dependencies (like FFmpeg) run smoothly on our Render deployment.
The Frontend: A sleek Next.js 14 dashboard with Tailwind CSS and Framer Motion for a "Cyberpunk" aesthetic.
The Cloud: AWS S3 for secure video storage (using Presigned URLs) and DynamoDB for tracking job history.
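To make the side-chain ducking step under The Muscle more concrete, here is a minimal sketch of how such an FFmpeg command could be assembled in Python. It uses FFmpeg's `sidechaincompress` filter with the dubbed voice as the side-chain signal. The function name, the file paths, and the threshold, ratio, attack, and release values are illustrative assumptions, not the project's actual settings.

```python
from typing import List


def build_ducking_command(
    video_path: str,
    voice_path: str,
    output_path: str,
    threshold: float = 0.05,
    ratio: float = 8.0,
    attack_ms: int = 20,
    release_ms: int = 300,
) -> List[str]:
    """Return an FFmpeg argument list that ducks the original audio under the dub."""
    filter_graph = (
        # The voice track is needed twice: as the side-chain trigger and in the mix.
        "[1:a]asplit=2[sc][voice];"
        # Compress the original audio only while the side-chain (voice) is active.
        f"[0:a][sc]sidechaincompress=threshold={threshold}:ratio={ratio}"
        f":attack={attack_ms}:release={release_ms}[ducked];"
        # Combine the ducked original with the dubbed voice.
        "[ducked][voice]amix=inputs=2:duration=first[mix]"
    )
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", voice_path,
        "-filter_complex", filter_graph,
        "-map", "0:v",      # keep the original video stream
        "-map", "[mix]",    # use the mixed audio
        "-c:v", "copy",     # avoid re-encoding the video
        "-c:a", "aac",
        output_path,
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)`. Building the arguments as a list rather than a shell string sidesteps quoting problems with file names.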
Challenges we ran into
The "Lip-Sync Drift": Early on, the AI's timestamps were slightly off, causing the audio to desync after a few minutes (the "Godzilla movie effect"). We solved this by forcing Gemini 3 to output strict mathematical arrays in JSON Mode and refining our FFmpeg calculations to match those timestamps down to the millisecond.
Background Noise: Simply overlaying the new audio made the video sound messy. We had to implement Audio Ducking (side-chaining) in FFmpeg, which dynamically lowers the volume of the original track only when the translated voice is speaking.
Large File Handling: Uploading large video files directly to our server crashed the backend. We implemented S3 Presigned URLs, allowing the frontend to upload files directly to AWS, bypassing our server bottleneck entirely.
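The drift fix above depends on timestamps that are both well-formed and consistent. As a hedged sketch of what a validation step could look like, the code below parses a JSON array of segments and converts the times to integer milliseconds. The `start`, `end`, and `text` field names, and the use of seconds as the unit, are assumptions, since the actual schema is not shown here.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    start_ms: int
    end_ms: int
    text: str


def parse_segments(raw_json: str) -> List[Segment]:
    """Parse and validate a JSON array of {"start", "end", "text"} segments."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of segments")

    segments = []
    previous_end_ms = 0
    for index, item in enumerate(data):
        try:
            # Integer milliseconds avoid accumulating floating-point error.
            start_ms = round(float(item["start"]) * 1000)
            end_ms = round(float(item["end"]) * 1000)
            text = str(item["text"])
        except (KeyError, TypeError, ValueError) as exc:
            raise ValueError(f"segment {index} is malformed") from exc

        if start_ms < previous_end_ms:
            raise ValueError(f"segment {index} overlaps the previous segment")
        if end_ms <= start_ms:
            raise ValueError(f"segment {index} has a non-positive duration")

        segments.append(Segment(start_ms, end_ms, text))
        previous_end_ms = end_ms
    return segments
```

Rejecting overlapping or out-of-order segments before any audio is rendered means a bad model response fails fast instead of producing a desynced video.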
Accomplishments that we're proud of
True Cultural Adaptation: Seeing the AI successfully translate "It's a piece of cake" into the Hindi equivalent "Baayein haath ka khel" (Left hand's game) instead of a literal translation about cake.
The "Magic Switch": Building the custom video player that switches languages instantly without buffering—it’s a huge "wow" moment for users.
Stability: Successfully deploying a complex FFmpeg pipeline to the cloud using Docker, moving beyond a "localhost only" demo.
What we learned
Structure is King: LLMs are powerful, but for engineering tasks, they need constraints. Using JSON Mode transformed Gemini from a creative writer into a precise data engineer.
Audio Engineering is Hard: We gained a massive appreciation for sound design. Merely generating speech isn't enough; mixing it properly is what makes a video watchable.
The Power of "Flash": We learned that for many tasks, Gemini 3 Flash is not just cheaper/faster, but actually better than larger models because its low latency allows for real-time iterative workflows.
What's next for Nativity.AI
Voice Cloning: Integrating models to clone the original speaker's voice so the dubbed version sounds exactly like them (e.g., MrBeast speaking Hindi in his own voice).
Visual Lip-Sync: Using GANs (Generative Adversarial Networks) to modify the speaker's lip movements to match the new language.
Direct YouTube Export: Automating the upload process so creators can publish to their "Dubbed" channels with a single click.
Built With
- amazon-cloudfront-cdn
- amazon-dynamodb
- amazon-web-services
- clerk
- edge-tts
- fastapi
- ffmpeg
- gemini
- github
- next.js
- render
- typescript
- vercel