My venture into AI filmmaking was inspired by the rapidly advancing capabilities of generative AI and by the desire to prove that a cohesive, visually stunning cinematic narrative could be created with just a creative vision and a handful of cutting-edge tools. Specifically, the exceptional realism and cinematography showcased by new text-to-video models like Veo, coupled with the character consistency promised by image models like Nano Banana, provided the perfect testing ground for a single-creator production pipeline.

How the Project Was Built

This project relied heavily on a multi-AI workflow, using each tool for its specific strength to overcome the typical consistency problems of AI-generated content.

1. Scripting & Visual Strategy (ChatGPT / Gemini)
  • Action: Used Gemini (or ChatGPT) to brainstorm and structure the initial short film idea.
  • Output: Generated a detailed, scene-by-scene script. Critically, I had the model output a JSON video prompt structure that explicitly details the visuals, camera movement, and character descriptions for each clip. This structured format helps keep the prompts consistent when they are fed to the video generator.

2. Consistent Character Generation (Nano Banana)
  • Action: Used Nano Banana (Google's Gemini 2.5 Flash Image model) for all key character and setting visuals.
  • Output: Created a few high-resolution, consistent reference images of the main character, specific props, and core locations. Nano Banana’s strength in iterative editing and in maintaining character consistency across different prompts was vital here, helping the protagonist look the same from scene to scene.

3. Video Clip Generation (Veo)
  • Action: Fed the JSON video prompts (from Step 1) and the reference images (from Step 2) into Veo (Google's advanced video generation model).
  • Output: Generated the main cinematic clips. Veo excelled at translating the detailed prompts into short, high-quality video sequences with natural motion, realistic lighting, and strong camera dynamics, often supporting built-in audio or lip-sync.

4. Dialogue and Sound (ElevenLabs / Gemini)
  • Action: Used a separate text-to-speech tool (like ElevenLabs) to create high-quality, emotionally nuanced voiceovers from the script's dialogue. For sound design ideas and music selection, I consulted Gemini for royalty-free options and mood-setting sound effects.
  • Output: Professional-grade voice tracks and a list of suitable background music and sound effects.

5. Assembly and Refinement (Video Editor)
  • Action: Imported all the generated video clips, voiceovers, and sound effects into a traditional video editing program (e.g., Adobe Premiere Pro or DaVinci Resolve).
  • Output: Cut, trimmed, color-graded, and synced all the elements. This final, human-directed step was where the raw AI-generated assets were truly shaped into a cohesive, flowing narrative.

What I Learned and Challenges Faced

What I Learned

  • Prompt Engineering is King: The level of detail in the prompt directly impacts the quality and consistency of the output. In my experience, structured prompting (like JSON) worked far better than simple descriptive text for multi-scene projects.
  • The Ecosystem Advantage: Tools designed to work together, like Nano Banana and Veo, provide a significant advantage in maintaining style, character, and scene coherence from image to video.
  • AI is a Collaborator, Not an Operator: The final, human-guided editing stage is non-negotiable. AI provides incredible assets, but the creator's vision is still required for pacing, emotional flow, and sound design.

Major Challenges

  • Character and Object Consistency: Despite Nano Banana's strength, maintaining a character's exact appearance across a long film sequence is still a major hurdle. Small, unpredictable variations (a slight change in a shirt logo, a different hat angle) require constant re-generation.
  • Video Length Limits: Veo clips are limited in length (e.g., 8-12 seconds), which means tediously stitching together many short, sequential clips while trying to keep motion seamless between them.
  • Uncanny Valley and Fine Detail: While realism is high, subtle issues in complex actions, hand and finger movements, or emotional expression can still betray the AI origin, so such clips had to be kept very short or used as B-roll.
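
To make the structured prompting from Step 1 more concrete, here is a minimal Python sketch of how per-clip JSON prompts could be assembled. It is an illustration only: the field names (clip_id, duration_seconds, character, visuals, camera), the example character and scenes, and the 8-second cap are all assumptions for this sketch, not a schema required by Veo or the exact prompts used in this project. The idea is to repeat one character description verbatim in every prompt and to split any scene longer than the clip limit into several clips.

```python
import json

# Hypothetical character sheet, repeated verbatim in every clip prompt so the
# video generator always receives the same description (illustrative values).
CHARACTER = {
    "name": "Mara",
    "appearance": "woman in her 30s, short black hair, red canvas jacket",
}

# Assumed per-clip limit in seconds; check the limit of the tool you use.
MAX_CLIP_SECONDS = 8


def build_clip_prompts(scenes, character, max_seconds=MAX_CLIP_SECONDS):
    """Return one prompt dict per clip, splitting scenes longer than max_seconds."""
    prompts = []
    for scene in scenes:
        remaining = scene["seconds"]
        part = 1
        while remaining > 0:
            duration = min(remaining, max_seconds)
            prompts.append({
                "clip_id": f"{scene['id']}-{part}",
                "duration_seconds": duration,
                "character": character,
                "visuals": scene["visuals"],
                "camera": scene["camera"],
            })
            remaining -= duration
            part += 1
    return prompts


# Two made-up scenes: one fits in a single clip, one needs to be split.
scenes = [
    {"id": "s1", "seconds": 8,
     "visuals": "rainy street at night, neon signs", "camera": "slow dolly-in"},
    {"id": "s2", "seconds": 20,
     "visuals": "rooftop at dawn, city skyline", "camera": "wide static shot"},
]

prompts = build_clip_prompts(scenes, CHARACTER)
print(json.dumps(prompts, indent=2))
```

Keeping the character block in one place means a wardrobe or appearance change is made once and propagates to every clip prompt, which is one way to reduce the small variations described under Major Challenges.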

Built With

  • elevenlabs
  • googleflow
  • llm