Inspiration
Current vision models struggle with egocentric data, especially when there is a lot of motion involved. While many inference-time changes can improve model performance, their benefits eventually plateau, making post-training or even pre-training necessary. That requires good labelled data, which is expensive and time-consuming to acquire. This matters most for specialized use cases of vision models, such as construction, where most tasks fall out of distribution. In construction, managers and team leaders often want a fine-grained view of how a project is going by knowing what workers are actually working on. That visibility helps predict whether the project is on pace, which is especially important for an industry built on fixed contracts totalling roughly $2T each year.
Additionally, CAPTCHAs are becoming an outdated security measure: they are either too easy for bots to break or difficult for humans and bots alike. According to Searles et al. (2023), AI systems are now better at solving CAPTCHAs than humans, leaving CAPTCHAs as little more than an annoyance that serves no real security purpose. In this project, we target both issues: the lack of training data for vision-model post-training and the need for simpler yet secure CAPTCHAs. Our goal is a CAPTCHA that collects annotated data to help improve these models while still keeping bots out.
What it does
ConstructionCAPTCHA leverages egocentric video of construction sites. It uses gemini-3.1-pro-preview to select 5-second clips from long recordings. Within each clip, claude-haiku identifies the construction-related tools, objects, or people relevant in the frame and turns them into multiple-choice questions for the CAPTCHA. The system asks users two questions: an easier question based on pre-annotated data and a second question about the construction footage itself. The first question determines whether a bot is trying to enter the website; the second collects a new data label. Because current models still struggle with video tasks, this is an effective measure against bots. The labels collected from users can then be used to improve spatial understanding by post-training video models.
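As a minimal sketch of how the two-question flow could be checked on the server (the verifyCaptcha helper, the store types, and the field names here are hypothetical illustrations, not our actual tRPC code):

```typescript
// Hypothetical server-side check for a CAPTCHA submission: the first answer is
// validated against a pre-annotated "gold" label, the second is kept as a new label.
type CaptchaSubmission = {
  clipId: string;
  goldAnswer: string;   // answer to the pre-annotated question
  freshAnswer: string;  // answer to the unlabelled construction question
};

type GoldStore = Map<string, string>; // clipId -> known correct answer
type LabelStore = { save(clipId: string, label: string): Promise<void> };

async function verifyCaptcha(
  sub: CaptchaSubmission,
  gold: GoldStore,
  labels: LabelStore
): Promise<boolean> {
  const expected = gold.get(sub.clipId);
  if (!expected || expected !== sub.goldAnswer) {
    // Likely a bot (or an inattentive human): reject and discard the fresh answer.
    return false;
  }
  // The gold question passed, so we trust the fresh answer enough to keep it
  // as a candidate training label for post-training video models.
  await labels.save(sub.clipId, sub.freshAnswer);
  return true;
}
```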
How we built it
We built it with:
- Next.js 16 + React 19 for the web app
- tRPC for the API layer
- Prisma + Postgres (Supabase) for data
- AWS S3 for storing and serving video clips

The CAPTCHA generation pipeline uses gemini-3.1-pro-preview to identify key moments in the long video and claude-haiku through Bedrock to generate questions. Additionally, FFmpeg is used to cut clips and frames, and a mix of TypeScript and Python scripts automates ingestion and processing.
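To illustrate the clip-cutting step, a simplified FFmpeg wrapper might look like the sketch below (the cutClip helper and file paths are made up for the example and are not our exact scripts):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Cut a short clip out of a long egocentric video with FFmpeg.
// `-ss` seeks to the start timestamp, `-t` limits the duration, and
// `-c copy` avoids re-encoding so cutting many clips stays fast.
async function cutClip(
  inputPath: string,
  startSeconds: number,
  durationSeconds: number,
  outputPath: string
): Promise<void> {
  await run("ffmpeg", [
    "-ss", String(startSeconds),
    "-i", inputPath,
    "-t", String(durationSeconds),
    "-c", "copy",
    "-y", // overwrite an existing output file
    outputPath,
  ]);
}

// Example: extract a 5-second clip starting 123 seconds into the source video.
await cutClip("site-walkthrough.mp4", 123, 5, "clips/site-walkthrough-123.mp4");
```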
Challenges we ran into
Originally, when selecting the 30-second clips, we thought randomly picking segments and showing them to the user would work; however, we quickly realized it is hard to tell what a worker is doing from an arbitrary clip, especially if the current task involves a lot of waiting (e.g. standing at the elevator while in transit). To fix this, we leverage gemini-3.1-pro-preview's video understanding: we prompt it for timestamps where it thinks something relevant is happening and only show those clips in the CAPTCHA.
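A rough sketch of that selection step, assuming the long video has already been uploaded via the Gemini Files API and we hold a fileUri for it (the prompt wording, the Segment shape, and the findInterestingSegments helper are illustrative assumptions, not our exact pipeline code):

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";

// Ask Gemini for timestamps where something identifiable is happening,
// instead of sampling clips at random.
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-3.1-pro-preview" });

type Segment = { start: number; end: number; description: string };

async function findInterestingSegments(fileUri: string): Promise<Segment[]> {
  const prompt = [
    "You are reviewing egocentric footage from a construction site.",
    "List up to 10 moments where a worker is clearly using a tool,",
    "handling material, or interacting with equipment. Skip idle time",
    "such as walking or waiting at an elevator. Respond with JSON only:",
    '[{"start": <seconds>, "end": <seconds>, "description": "<what is happening>"}]',
  ].join(" ");

  const result = await model.generateContent([
    { fileData: { mimeType: "video/mp4", fileUri } },
    { text: prompt },
  ]);

  // We ask for JSON only, but defensively extract the array in case the
  // model wraps its answer in extra text.
  const raw = result.response.text();
  const jsonStart = raw.indexOf("[");
  const jsonEnd = raw.lastIndexOf("]");
  return JSON.parse(raw.slice(jsonStart, jsonEnd + 1)) as Segment[];
}
```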
Accomplishments that we're proud of
We are proud of the final product we created and everything we learned about video models and CAPTCHAs in the process!
What we learned
We learned about the limitations of current vision models and how to put together a data pipeline that takes advantage of current models' abilities.
What's next for ConstructionCAPTCHA
The next step is to iterate on the questions asked for each video clip and deploy ConstructionCAPTCHA.