The Inspiration
As video content production continues to scale, creators face a critical bottleneck: the traditional tools used to optimize for the YouTube algorithm rely solely on shallow text tags, flat titles, and human guesswork. They cannot actually see or hear what is happening inside the footage. This disconnect leads to unexpected viewer drop-offs, poor Click-Through Rates (CTR), and wasted production hours.
We built Youtube_Trendy_Analyst_V1 to solve this problem by leveraging native multimodal AI. Our tool scans the raw visual and audio tracks of a video file to automatically map out performance data, score platform virality, and pinpoint exact timestamps where viewers are at risk of dropping off due to cognitive overload or pacing issues.
How We Built It
The platform is built around a lightweight, local-first architecture designed to process heavy video payloads smoothly without performance lag:
- The Backend: A robust, asynchronous
FastAPIserver that handles multi-gigabit file uploads using optimized temporary filesystem buffers. - The Intelligence Core: Powered by the official
google-genaiSDK, we pass the raw video file directly into thegemini-2.5-flashmodel. By using strictPydanticobject schemas and keeping model configuration temperatures low ($temperature = 0.1$), we enforce deterministic, structured JSON delivery. - Trend Synthesis: The backend concurrent workers tap into the
Serper.devAPI to pull real-time organic search trends, giving the AI a live platform benchmark to evaluate the video against. - The Frontend: A responsive, clean HTML5/CSS3 dashboard that coordinates the media file data streaming via native JavaScript
FetchandFormDatalayers.
Challenges Faced
Our biggest hurdle was preventing data execution race conditions and eliminating model hallucinations. Initially, the model would guess the video topic incorrectly or hit a timeout block if the prompt executed before the multi-gigabit video packets had stabilized on the server.
We solved this by re-engineering the pipeline with an asynchronous check-state loop (while processing) using 5-second interval back-offs. Additionally, we implemented rigid token isolation boundaries ([START OF USER CONTENT CONTEXT]) inside our prompt structure to strictly shield the multimodal core from prompt injection attempts and conflicting search trend data noise.
What We Learned
This project deepened our understanding of building non-blocking asynchronous file pipelines and managing structural telemetry outputs using multimodal LLMs. We learned that isolating user input text blocks from core model instructions is vital for maintaining data schema integrity, ensuring a smooth, crash-free interface deployment for production environments.
Log in or sign up for Devpost to join the conversation.