Inspiration

The primary inspiration for this project is the internet personality DougDoug, whose content uses API tools very similar to those offered at this Hackathon. His videos chain together generative AI, pairing ChatGPT with AI voices from ElevenLabs, to create fun personalities and chaotic stories.

Once I saw that both a generative AI API and ElevenLabs were available to work with, I wanted to create a project that used them in a manner similar to DougDoug's. With this inspiration, I set out to bring these AI personalities closer to the real world. I realized that with an external camera, I could feed images to an AI narrator that describes my life in real time. By giving the AI eyes, I can transform my ordinary life into an extraordinary tale.

What it does

Personal Narrator is a real-time, semi-autonomous commentary engine that turns your life into a BBC documentary. It captures a live video feed from an external camera and saves a still frame every 10 seconds. These images are piped into Google Gemini 2.0 Flash, which acts as a scriptwriter, analyzing the visual context and writing an exaggerated, dramatic narrative about the subject. The script often laments the absence of my dear friend, Gavin Tranquillino, who unfortunately could not make it.

This script is then streamed into ElevenLabs, where a custom-designed British narrator voice brings the text to life with theatrical emotion. The system uses threading to manage the video feed and audio generation simultaneously, ensuring a seamless handoff between the AI narrator describing your life and the video feed recording it.

How I built it

This project is built entirely in Python, which tied together the many different aspects of this project:

Vision and Capture: OpenCV was used to manage the webcam stream and handle image buffering. This creates a robust loop that captures high-quality frames without freezing the video feed.

Google Gemini AI: I implemented the google-genai SDK, utilizing the Gemini 2.0 Flash model for its speed and its ability to handle multimodal (text + image) prompts. I spent significant time adjusting the system instructions to strip the robotic tendencies from the outputs and fully bring the chaotic British narrator to life.

ElevenLabs: Using the ElevenLabs voice-generation API, I gave the British narrator a voice. I chose the turbo_v2 model for its speed and reduced latency, and used pygame to handle headless audio playback, which allows the narration to play in the background without interrupting the video feed.

Challenges I ran into

The biggest challenge I ran into was concurrency and blocking. Initially, the camera would freeze whenever the AI was generating the output or the voice was playing. The workaround was implementing Python threading to enable both the camera feed and AI prompts to run simultaneously.
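
The pattern boils down to a background worker thread fed by a queue, so the camera loop never waits on the AI. Here is a stdlib-only sketch with the Gemini/ElevenLabs round trip stubbed out:

```python
import queue
import threading
import time

jobs: "queue.Queue[str]" = queue.Queue()     # frame paths awaiting narration
results: "queue.Queue[str]" = queue.Queue()  # finished narrations


def narration_worker() -> None:
    """Drain the job queue forever; runs off the main (camera) thread."""
    while True:
        frame_path = jobs.get()
        time.sleep(0.01)  # stand-in for the Gemini + ElevenLabs round trip
        results.put(f"narrated {frame_path}")
        jobs.task_done()


# Daemon thread dies with the program, so shutdown stays clean.
threading.Thread(target=narration_worker, daemon=True).start()

# The main loop just enqueues and keeps drawing frames:
jobs.put("snapshot.jpg")
jobs.join()  # the real loop polls `results` instead of blocking like this
print(results.get())  # prints "narrated snapshot.jpg"
```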

Another struggle was audio/file locking on Windows. Pygame would refuse to release the MP3 file after playing it, so only the AI's first observation was ever read aloud. I implemented a more robust file loading/unloading system to keep the file system stable. Tuning the 'Busy' flag was tricky as well: I had to ensure the system did not take a new picture while the narrator was talking, which meant creating a custom state machine to manage the flow of time.
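
That 'Busy' logic can be sketched as a small state machine; the state and method names here are illustrative, not the ones in my code. The real loop also calls pygame.mixer.music.unload() when leaving the speaking state, so Windows releases the MP3 before the next narration overwrites it:

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()       # waiting on the 10-second timer
    NARRATING = auto()  # Gemini is writing the script
    SPEAKING = auto()   # ElevenLabs audio is playing


class NarratorFSM:
    def __init__(self) -> None:
        self.state = State.IDLE

    def may_capture(self) -> bool:
        """Only grab a new frame when nothing downstream is busy."""
        return self.state is State.IDLE

    def start_narration(self) -> None:
        assert self.state is State.IDLE
        self.state = State.NARRATING

    def start_speech(self) -> None:
        assert self.state is State.NARRATING
        self.state = State.SPEAKING

    def finish(self) -> None:
        # Real code unloads the MP3 here to release the file lock.
        self.state = State.IDLE


fsm = NarratorFSM()
fsm.start_narration()
assert not fsm.may_capture()  # no new screenshots while the narrator works
fsm.start_speech()
fsm.finish()
assert fsm.may_capture()      # timer may fire again
```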

Accomplishments that I'm proud of

I'm proud of managing the different systems present within this code. There are three distinct functions happening simultaneously (Vision, Generation, Audio), and ensuring they all worked harmoniously was difficult, but rewarding once my British narrator finally came to life.

Another accomplishment I'm proud of is being able to make the narrator consistently funny, in both what he says and how he says it. Making a generative AI sound genuinely human and maintain a sense of humor is difficult, and creating a suitable voice for the AI is even harder. Creating a narrator who mourns his dear friend while maintaining a genuine personality was a unique challenge.

What I learned

I learned how to implement the Google Gemini and ElevenLabs APIs into Python projects. Both these APIs will prove extremely valuable in future endeavors, enabling me to create more AI personalities capable of assisting with tasks.

Another thing I learned is that multimodal AI is all about timing. The individual APIs are fast, but ensuring the audio, video, and text generation all work harmoniously requires careful architecture in the code.

Finally, I learned how to create prompts that give AIs personality. It is very easy to make an AI speak and sound intelligent, but it is more difficult to give it the human inflections that express emotion. Tweaking the 'Stability' and 'Style' sliders in ElevenLabs, combined with careful instructions in the prompt, changed the narrator's personality from robot to storyteller.
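
In the elevenlabs SDK, those sliders correspond (as I understand it) to a VoiceSettings object; the values below are illustrative guesses, not my final settings:

```python
from elevenlabs import VoiceSettings

# Lower stability lets delivery vary line to line; higher style leans
# into the theatrical storyteller accent. Values are illustrative.
narrator_settings = VoiceSettings(
    stability=0.35,        # low: more emotional variation between lines
    similarity_boost=0.8,  # stay close to the custom voice design
    style=0.6,             # high: exaggerate the storyteller delivery
)
```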

What's next for Personal Narrator

Next up for Personal Narrator is the ability to retain memory between prompts. This would allow the AI to build a continuous story, maintaining the story beats and characters it imagines over time. Right now, the program gets a fresh start for each screenshot, which prevents a long-form narrative from being built.
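
One way this could work is a rolling memory replayed into each Gemini prompt; here is a stdlib-only sketch with illustrative names:

```python
from collections import deque


class NarrationMemory:
    """Keep the last few narrations so each prompt can continue the story."""

    def __init__(self, max_beats: int = 5) -> None:
        self.beats: deque = deque(maxlen=max_beats)  # oldest beats fall off

    def remember(self, narration: str) -> None:
        self.beats.append(narration)

    def as_prompt_prefix(self) -> str:
        """Text prepended to the next prompt so the story stays continuous."""
        if not self.beats:
            return "This is the start of the story."
        return "The story so far:\n" + "\n".join(f"- {b}" for b in self.beats)


memory = NarrationMemory(max_beats=2)
memory.remember("Our hero stares into the fridge.")
memory.remember("He selects, against all advice, week-old curry.")
print(memory.as_prompt_prefix())
```

The deque's maxlen keeps the prompt from growing without bound, at the cost of forgetting the oldest beats.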

Another improvement will be experimenting with streaming the output from Gemini to ElevenLabs token-by-token, instead of waiting for the full response to finish before beginning the voiceover, to further reduce latency.
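
The tricky part is chunking: the voice API needs coherent text, so one approach is to buffer Gemini's streamed chunks into whole sentences and hand each to ElevenLabs as soon as it completes. A stdlib-only sketch of that bridging step (the function name is illustrative):

```python
import re
from typing import Iterable, Iterator


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            # A sentence ends at ., !, or ? followed by whitespace.
            match = re.search(r"[.!?]\s+", buffer)
            if not match:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():  # flush whatever trails the final terminator
        yield buffer.strip()


fake_stream = ["Ah, the kettle ", "boils. Gavin would ", "have loved this."]
print(list(sentences_from_stream(fake_stream)))
# ['Ah, the kettle boils.', 'Gavin would have loved this.']
```

Each yielded sentence could then be sent to the TTS call while the next one is still being generated.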

Built With

python, opencv, google-genai, elevenlabs, pygame