Inspiration
The resale market is booming, but for power sellers on eBay and Depop, the bottleneck isn't finding items, it's listing them. We noticed that many resellers have "death piles": unlisted inventory sitting in corners because photographing, researching, and writing a description for each individual item is tedious. We wanted to build a tool that feels less like data entry and more like something an everyday user can pick up, simplifying the challenge of inventory management.
How we built it
We built HawkEye as a web application to serve as a companion tool in a warehouse or thrift store.
Powered by Agentic AI: "We didn't just use Gemini; we built with Gemini". The Gemini CLI served as an autonomous agent in our dev environment, bridging the gap between concept and code. Instead of constantly context-switching to a browser for documentation or debugging, we kept our flow state unbroken by leveraging the CLI to:
- Scaffold entire file structures for our MVC architecture.
- Refactor complex Python logic for the video processing pipeline on the fly.
- Synthesize clear, concise documentation and README sections directly from our codebase.
By treating the CLI as an intelligent pair programmer, we effectively doubled our engineering velocity.
Surviving the Crunch with Gemini CLI: In the high-pressure environment of a hackathon, speed is everything. The Gemini CLI was our secret weapon for beating the clock. It transformed our terminal into a powerhouse of productivity, allowing us to generate boilerplate code, debug cryptic FFmpeg errors, and push clean commits to GitHub faster than ever before. Whether we needed a quick explanation of a library or a generated script to automate deployment, the Gemini CLI was there to unblock us instantly, turning potential hours of troubleshooting into minutes of progress.
The Brain: We leveraged Google's Gemini 2.5 Flash model for its ability to process video and audio simultaneously. This allowed us to create a "stream-of-consciousness" workflow where a user can film a pile of clothes while narrating defects ("this one has a stain"), and the AI correlates the visual timestamp with the audio transcript.
The Engine: We used Flask to handle the heavy lifting. When a video is uploaded, the backend orchestrates a pipeline: it sends the media to Gemini, parses the JSON response to find timestamps, and then uses FFmpeg to surgically extract high-quality still images from those exact moments in the video.
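The parse-then-extract step described above can be sketched roughly as follows. This is a minimal illustration, not our exact code: the JSON shape (an "items" list with "name" and "timestamp" fields), the function names, and the output file naming are all assumptions made for the example. The FFmpeg flags used (-ss, -i, -frames:v, -q:v) are standard options for seeking and grabbing a single frame.

```python
import json
import os
import subprocess


def parse_timestamp(ts: str) -> float:
    """Convert 'SS', 'MM:SS', or 'HH:MM:SS' into seconds."""
    seconds = 0.0
    for part in ts.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def build_frame_command(video_path: str, seconds: float, out_path: str) -> list:
    """Build an FFmpeg command that grabs one still frame at `seconds`."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{seconds:.3f}",  # seek before decoding the input
        "-i", video_path,
        "-frames:v", "1",         # emit exactly one video frame
        "-q:v", "2",              # high JPEG quality (lower is better)
        out_path,
    ]


def plan_extractions(response_text: str, video_path: str, out_dir: str) -> list:
    """Turn the model's JSON reply into (item name, FFmpeg command) pairs.

    Assumed (hypothetical) reply shape:
    {"items": [{"name": "...", "timestamp": "MM:SS"}, ...]}
    """
    items = json.loads(response_text)["items"]
    plan = []
    for index, item in enumerate(items):
        seconds = parse_timestamp(item["timestamp"])
        out_path = os.path.join(out_dir, f"item_{index:03d}.jpg")
        plan.append((item["name"], build_frame_command(video_path, seconds, out_path)))
    return plan


def extract_frames(plan: list) -> None:
    """Run each planned command; raises CalledProcessError if FFmpeg fails."""
    for _name, command in plan:
        subprocess.run(command, check=True)
```

Building the command list separately from running it keeps the timestamp parsing testable without FFmpeg installed; extract_frames is the only function that actually shells out.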
The Interface: We designed a deliberately simple UI using Tailwind CSS. We focused heavily on the mobile experience, implementing PWA (Progressive Web App) features so it functions like a native iOS app.
Challenges we faced
Mobile Browser Limitations: iOS Safari is notoriously strict with file inputs and PWA behavior. We struggled with the "green box" visual feedback not triggering because iOS wouldn't report the correct MIME type for videos. We had to rewrite our frontend logic to be "permission-based" rather than "type-based" to ensure a smooth experience.
The "Twin Item" Problem: Initially, if a video contained similar items (like two black shirts), the AI would sometimes assign the same timestamp or image to both. We solved this by implementing a "Timestamp Shifting" algorithm in Python that forces a time buffer between detections, and by appending a unique ID to every generated image filename.
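One way to picture the idea is the sketch below. It is an illustration under assumptions rather than our exact implementation: the function names, the one-second default buffer, and the filename pattern are hypothetical. Detections are visited in time order, and any detection that lands too close to the previous one is pushed later until the gap is at least the buffer.

```python
import uuid


def shift_timestamps(timestamps: list, min_gap: float = 1.0) -> list:
    """Return timestamps (in seconds) adjusted so that no two detections
    are closer than `min_gap`. The input order of items is preserved."""
    # Visit detections from earliest to latest without reordering the output.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    shifted = list(timestamps)
    previous = None
    for i in order:
        t = timestamps[i]
        if previous is not None and t < previous + min_gap:
            t = previous + min_gap  # push the near-duplicate later in time
        shifted[i] = t
        previous = t
    return shifted


def unique_image_name(item_index: int, seconds: float) -> str:
    """Build a filename that stays unique even if two items share a timestamp."""
    millis = int(round(seconds * 1000))
    return f"item{item_index}_{millis}ms_{uuid.uuid4().hex[:8]}.jpg"
```

For example, shift_timestamps([5.0, 5.0, 5.5]) gives [5.0, 6.0, 7.0]: the second shirt is moved one second after the first, and the third detection is moved again to stay clear of the second. A real pipeline would probably also clamp shifted values to the video's duration, which this sketch leaves out.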
Latency: Analyzing video takes time. To prevent users from thinking the app crashed, we built a custom "Analyzing" loading overlay and forced browser repaints in JavaScript to ensure the UI updated before the heavy network upload froze the main thread.
What we learned
We learned that "Multi-modal" is more than just a buzzword; it's a workflow unlock. By combining video (visuals) with audio (context), we replaced a 10-step manual form with a single button press. We also gained deep experience in Dockerizing complex Python/FFmpeg environments for the cloud, moving from "it works on my machine" to "it works on any iPhone."