Inspiration

We were inspired by high school sports teams, where players and coaches often struggle to capture and share highlight footage due to limited editing resources. We wanted to create a tool that could automatically generate professional-looking highlight reels from any game video—helping athletes showcase their talent and schools celebrate their teams without needing a video editor.

What it does

ArenaVision is an autonomous AI video analyst and editor built with Google’s Agent Development Kit (ADK) and Gemini Vision. It accepts a sports game video, identifies the most exciting and meaningful moments (like goals, fast breaks, or clutch plays), and automatically edits them into a highlight reel with captions and optional AI-generated commentary. The system combines video and audio signal analysis with Gemini Vision’s semantic reasoning to detect true scoring moments, not just motion spikes or crowd noise. Users can also make their own cuts and edits to the reel, then post the finished highlight reel directly to X (Twitter) from within the app.

Key capabilities:

  • Automatic highlight detection and ranking that prioritizes real scoring/decisive moments.
  • Human-in-the-loop editing: refine cuts, reorder clips, and update captions via a friendly UI.
  • Optional intro bumper and branded logo overlay (bottom-right watermark) for team identity (see the overlay sketch after this list).
  • Direct social sharing to X with robust, chunked media upload and processing status polling.
  • Ingestion from local files or YouTube links; works across common sports footage formats.
  • Optional AI-generated commentary and captions powered by Gemini, plus basic TTS support (a commentary sketch also follows this list).
  • One-click export to a clean, shareable final MP4.
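
The logo watermark is a straightforward composite. A minimal sketch with MoviePy 1.x, using hypothetical file names (the real pipeline also cleans up the logo background first):

    # a minimal sketch of the bottom-right watermark, assuming MoviePy 1.x
    # and hypothetical file names
    from moviepy.editor import VideoFileClip, ImageClip, CompositeVideoClip

    reel = VideoFileClip("highlight_reel.mp4")
    logo = (ImageClip("team_logo.png")
            .resize(height=72)                  # scale the logo down
            .set_opacity(0.8)                   # semi-transparent watermark
            .set_position(("right", "bottom"))  # bottom-right corner
            .set_duration(reel.duration))       # visible for the whole reel

    CompositeVideoClip([reel, logo]).write_videofile(
        "highlight_reel_branded.mp4", codec="libx264", audio_codec="aac")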
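
The optional commentary is a prompt-and-generate call per highlight. A minimal sketch with the google-generativeai SDK, where the model name and event string are illustrative assumptions:

    # a minimal sketch of commentary generation; the model name and event
    # description are illustrative, not the production prompt
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    def commentary_line(event: str) -> str:
        prompt = ("You are an energetic sports commentator. In one short "
                  f"sentence, call this moment: {event}")
        return model.generate_content(prompt).text.strip()

    print(commentary_line("three-pointer at the buzzer to tie the game"))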

How we built it

We built a Vision Agent using Gemini Vision and Google Video Intelligence APIs to analyze frames and detect key events such as shot attempts and score changes. We fused this with OpenCV motion energy analysis and audio energy detection to locate moments of high excitement. The Planner Agent ranks these moments using a confidence model that blends visual, audio, and semantic cues. Finally, the Editor Agent clips, arranges, and captions the highlights into a polished reel. Everything runs inside a Google ADK-based agent pipeline, allowing the system to autonomously execute the entire workflow from video upload to highlight reel.
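
To make the blending concrete, here is a minimal sketch of the kind of scoring the Planner Agent performs; the weights and field names are illustrative assumptions, not the tuned production model:

    # a minimal sketch of multi-signal highlight ranking; the weights are
    # illustrative assumptions, not the tuned production values
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        start: float          # segment start time (seconds)
        end: float            # segment end time (seconds)
        motion_energy: float  # OpenCV frame-diff energy, normalized to [0, 1]
        audio_energy: float   # crowd/commentary loudness, normalized to [0, 1]
        semantic_conf: float  # Gemini's confidence this is a real scoring play

    def highlight_score(c: Candidate) -> float:
        # semantic reasoning dominates, so motion/audio spikes alone can't win
        return 0.5 * c.semantic_conf + 0.3 * c.audio_energy + 0.2 * c.motion_energy

    def rank(candidates: list[Candidate], top_k: int = 10) -> list[Candidate]:
        return sorted(candidates, key=highlight_score, reverse=True)[:top_k]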

Architecture at a glance:

  • Input/Handlers:
    • YouTube and local upload handlers to fetch/ingest raw footage.
  • Agents:
    • Vision Agent: frame/segment analysis with Gemini Vision + Video Intelligence.
    • Planner Agent: multi-signal scoring and ranking of candidate highlights.
    • Editor Agent: clip extraction, concatenation, crossfades, captions; supports user edits.
    • Commentator Agent: optional commentary line generation.
    • Chatbot Agent: natural-language refinement of edits.
  • UI and Orchestration:
    • Streamlit app for multi-step flow (ingest → analyze → edit → brand → export/share).
    • Logo upload and automatic logo overlay with simple background cleanup.
    • Optional intro/title card and separate “Skip Logo” vs. “Skip Intro” paths to move faster.
  • Social Sharing:
    • OAuth1 flow with requests-oauthlib and Twitter’s chunked media upload (INIT/APPEND/FINALIZE/STATUS) to handle large MP4s reliably (see the upload sketch after this section).
    • MIME detection and processing polling to avoid “unrecognized media” and “too large” errors.
    • Fallback to v2 /2/tweets creation when v1.1 access is limited.
  • Video Processing:
    • MoviePy for clipping, compositing, and exports; ffmpeg under the hood.
    • Compatibility layer for MoviePy 1.x (fallback from subclipped to subclip).
    • Caching and deterministic output paths to prevent unnecessary re-renders.
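
The chunked upload flow referenced above reduces to four calls. A minimal sketch with requests-oauthlib, assuming an already-authorized OAuth1Session (error handling and the v2 tweet fallback are omitted):

    # a minimal sketch of Twitter's v1.1 chunked media upload
    import os, time
    from requests_oauthlib import OAuth1Session

    UPLOAD_URL = "https://upload.twitter.com/1.1/media/upload.json"

    def upload_video(session: OAuth1Session, path: str, chunk_size=4 * 1024 * 1024):
        total = os.path.getsize(path)
        init = session.post(UPLOAD_URL, data={
            "command": "INIT", "total_bytes": total,
            "media_type": "video/mp4", "media_category": "tweet_video"}).json()
        media_id = init["media_id_string"]

        with open(path, "rb") as f:               # APPEND one chunk at a time
            for index, chunk in enumerate(iter(lambda: f.read(chunk_size), b"")):
                session.post(UPLOAD_URL, data={
                    "command": "APPEND", "media_id": media_id,
                    "segment_index": index}, files={"media": chunk})

        status = session.post(UPLOAD_URL, data={
            "command": "FINALIZE", "media_id": media_id}).json()

        # poll STATUS until Twitter finishes transcoding the video
        while status.get("processing_info", {}).get("state") in ("pending", "in_progress"):
            time.sleep(status["processing_info"].get("check_after_secs", 5))
            status = session.get(UPLOAD_URL, params={
                "command": "STATUS", "media_id": media_id}).json()
        return media_id                           # ready to attach to a tweet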

Tech stack:

  • AI/Analysis: Google ADK, Gemini Vision, Google Video Intelligence, OpenCV, NumPy.
  • App/UI: Python, Streamlit.
  • Video: MoviePy, imageio-ffmpeg.
  • Media ingest: yt-dlp/pytube (an ingest sketch follows this list).
  • Social: requests-oauthlib, Twitter v1.1 chunked media API with v2 tweet fallback.
  • Tooling: structured utilities for overlay, keyframes, and clip concatenation.
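
For reference, the YouTube ingest handler is essentially a yt-dlp download wrapped in retries. A minimal sketch, where the output template and retry counts are illustrative assumptions:

    # a minimal sketch of YouTube ingest with simple retries
    from yt_dlp import YoutubeDL

    def fetch_game_video(url: str, out_dir: str = "inputs", attempts: int = 3) -> str:
        opts = {
            "format": "mp4/bestvideo+bestaudio",     # prefer a single MP4 file
            "outtmpl": f"{out_dir}/%(id)s.%(ext)s",  # deterministic file name
            "retries": 5,                            # yt-dlp's own fragment retries
        }
        last_err = None
        for _ in range(attempts):                    # retry whole-download failures
            try:
                with YoutubeDL(opts) as ydl:
                    info = ydl.extract_info(url, download=True)
                    return ydl.prepare_filename(info)
            except Exception as err:                 # network hiccups, etc.
                last_err = err
        raise RuntimeError(f"ingest failed for {url}") from last_err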

Challenges we ran into

  • Signal fusion and precision: Motion or audio spikes alone were too noisy, and pure semantic classification could miss context. Blending frame-level detections, motion/audio energy, and Gemini reasoning made results more robust.
  • Social media posting at scale: X media uploads failed with 413 (too large) and “unrecognized media” errors. We implemented the official chunked INIT/APPEND/FINALIZE/STATUS flow with MIME detection and processing polling, and added a resilient fallback to the v2 tweets API where needed.
  • Library/runtime compatibility: MoviePy API differences (subclipped vs subclip) caused runtime errors. We pinned versions and added compatibility fallbacks (see the shim after this list). We also pinned NumPy to avoid OpenCV/ffmpeg conflicts.
  • Long-form video performance: Full-length games stress CPU/IO and external API quotas. We optimized chunking, avoided re-encoding when possible, and cached intermediate results to keep the loop usable (a caching sketch also follows this list).
  • Ingestion reliability: YouTube links and network hiccups required retries and safe fallbacks to keep the pipeline stable.
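
The MoviePy compatibility fallback is small but saved us repeatedly. A minimal sketch: MoviePy 2.x renamed subclip to subclipped, so we try the new name and fall back to the old one:

    # a minimal sketch of the MoviePy compatibility shim
    def cut(clip, start, end):
        if hasattr(clip, "subclipped"):   # MoviePy 2.x API
            return clip.subclipped(start, end)
        return clip.subclip(start, end)   # MoviePy 1.x API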
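
Caching hinges on deterministic output paths: the same source video and edit parameters always map to the same file, so a finished render can be reused. A minimal sketch, with hypothetical directory names:

    # a minimal sketch of deterministic output paths for render caching
    import hashlib, os

    def render_path(source: str, params: dict, out_dir: str = "outputs") -> str:
        key = hashlib.sha256(
            f"{source}|{sorted(params.items())}".encode()).hexdigest()[:16]
        return os.path.join(out_dir, f"reel_{key}.mp4")

    def needs_render(path: str) -> bool:
        return not os.path.exists(path)   # cache hit: skip the re-render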

Accomplishments that we're proud of

  • End-to-end autonomy: The system can “watch” a game, surface the right moments, and produce a polished reel with minimal human intervention.
  • Reliable social sharing: Chunked uploads, processing polling, and fallbacks turned a fragile step into a one-click share.
  • Polished UX touches: Brandable reels with a logo watermark and optional intro bumper; simple paths to skip branding steps when speed matters.
  • Human-in-the-loop editing: It’s easy to iterate—accept the auto reel, then trim or rearrange quickly before exporting or posting.

What we learned

We learned how to combine traditional computer vision with generative and reasoning models to bridge the “semantic gap” between raw signals and real understanding. We also gained experience working with Google’s ADK and designing agents that plan and execute multi-step workflows autonomously. Most importantly, we learned how small, focused teams can use cutting-edge AI tools to solve real community problems.

Additional takeaways:

  • API ergonomics matter: mastering chunked uploads, MIME correctness, and processing polling turned out to be crucial for a great user experience.
  • Version pinning saves weekends: keeping MoviePy, imageio-ffmpeg, and NumPy in a known-good range (sketched below) avoids painful runtime surprises.
  • A little branding goes a long way: an intro and watermark make reels feel professional and shareable.
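
For instance, a known-good pin set; these versions are illustrative, not our exact lockfile:

    # illustrative pins, not our exact lockfile
    moviepy==1.0.3          # last 1.x release; keeps subclip available
    imageio-ffmpeg==0.4.9   # known-good ffmpeg wrapper for MoviePy 1.x
    numpy<2                 # NumPy 2.x breaks some OpenCV/ffmpeg wheels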

What's next for ArenaVision

Next, we want to deploy ArenaVision for local high schools, making it easy for coaches or parents to upload a game and get back a shareable highlight reel within minutes. We also plan to improve our highlight ranking by fine-tuning on user feedback, and to add automatic stat overlays and player tracking. Eventually, we envision ArenaVision as a plug-and-play solution for schools, youth leagues, and amateur sports—bringing AI-powered media production and social sharing to every level of the game.

Roadmap highlights:

  • Smart overlays: player/score bugs, possession arrows, and key stat callouts.
  • Better tracking: lightweight player tracking for tighter, more cinematic crops.
  • Live workflows: support for live streams and near-real-time highlight pushes.
  • Multi-platform: publish to Instagram, TikTok, YouTube Shorts with auto-aspect adaptations.
  • Deployment: containerize and run on managed compute; add queues, observability, and cost controls.
  • Accessibility and reach: multilingual captions/TTS; templates tailored by sport.

Privacy & Safety

  • Data handling: user uploads are processed locally/in-session; generated media is saved to project outputs for user control. Secrets and large artifacts are excluded from version control.
  • Controls: watermarking options, logo upload, and moderation hooks for commentary/captions.
  • Consent and compliance: intended for owned/authorized footage; easy to remove branding or disable commentary if policies require it.

Limitations

  • Quotas and latency: analysis quality and speed depend on model/API limits and video length.
  • Edge cases: unusual camera angles, low resolution, or partial footage can reduce detection accuracy.
  • Platform policies: social APIs can change; we include fallbacks, but access levels may vary per account.

Typical demo flow

  1. Upload a local game file or paste a YouTube link.
  2. ArenaVision analyzes the footage and proposes a highlight cut list.
  3. Review and optionally refine edits (trim/reorder, recaption).
  4. Add an intro and logo overlay or skip to move faster.
  5. Export the final MP4 and post directly to X from within the app.
