Inspiration

Complementarity is an emerging research area in AI safety that studies how to structure the interaction between a human and an AI agent so that results improve while the human stays capable, informed, and in control.

The landmark paper “Human-AI Complementarity: A Goal for Amplified Oversight” (Jain et al., 2025) frames the issue as a human-computer interaction (HCI) problem. It emphasizes collaboration that leverages the distinct capabilities of each party to achieve better performance than either could alone, using AI fact-checking as the task.

We were inspired by the practicality of the use case, but noted that complementarity research, which is still relatively nascent, has so far focused on “boxed-in scenarios” even though the subject is fundamentally about messy real-world settings. Our goal was to build infrastructure that automates the development of experiments applying complementarity principles to a real-world problem that has far too often gotten messy for us: agentic coding.

Given the short timespan, we quickly understood that we could not architect a human-AI interaction system within Claude Code that we could claim was fully robust. To make the most of our limited scope, we set out to provide the building blocks for HCI experiments in this domain, prioritizing the extensibility and versatility of the system while validating it with some preliminary results.

Risk management necessitates a complementary approach. The Claude Mythos risk analysis emphasizes that accelerating risk mitigation techniques will become increasingly crucial as models become intelligent and autonomous enough to overstep traditional safety rules. Fulcrum Inc.’s library ramure introduces a framework for tighter control of agent swarms. Agent oversight of this kind requires novel approaches, and human-AI complementarity is a promising layer of control.

What it does

Tether is an agentic research tool designed to analyze and test human-AI interaction dynamics within a coding environment. By allowing researchers to completely customize the parameters of an intermediary judge model, Tether provides a flexible sandbox for evaluating how humans collaborate with autonomous coding agents under different experimental conditions.

Researchers can define the specific starter code, set up the targeted development environment, and explicitly program how and when the judge model intervenes during a coding session. Tether handles the orchestration of this entire environment while automatically tracking and capturing the interaction data. By logging all decision pathways and behavioral metrics directly to a Snowflake database, Tether functions as a robust telemetry engine that provides researchers with the deep empirical data needed to study agentic workflows and AI safety.
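As a rough illustration of what "programming how and when the judge intervenes" could look like, here is a minimal sketch of a threshold-based intervention policy. All names in it (`JudgeConfig`, `AgentAction`, `decide`, the thresholds, and the tool names) are hypothetical and chosen for this example. They are not Tether's actual configuration schema, and the risk score is assumed to come from whatever judge model the researcher plugs in.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    ASK_HUMAN = "ask_human"
    BLOCK = "block"


@dataclass
class JudgeConfig:
    """Hypothetical knobs a researcher might set for one experimental condition."""
    ask_threshold: float = 0.4    # risk at or above this: consult the human
    block_threshold: float = 0.8  # risk at or above this: block outright
    always_ask_tools: set[str] = field(default_factory=set)  # tools that always need sign-off


@dataclass
class AgentAction:
    tool: str          # illustrative tool name, e.g. "write_file" or "run_shell"
    target: str        # file path or command the agent wants to touch
    risk_score: float  # 0.0 to 1.0, assumed to be produced by the judge model


def decide(action: AgentAction, config: JudgeConfig) -> Verdict:
    """Map a scored action to an intervention, checking the strictest rule first."""
    if action.risk_score >= config.block_threshold:
        return Verdict.BLOCK
    if action.tool in config.always_ask_tools or action.risk_score >= config.ask_threshold:
        return Verdict.ASK_HUMAN
    return Verdict.ALLOW


if __name__ == "__main__":
    cfg = JudgeConfig(always_ask_tools={"run_shell"})
    print(decide(AgentAction("write_file", "README.md", 0.1), cfg))
    print(decide(AgentAction("run_shell", "ls", 0.1), cfg))
    print(decide(AgentAction("write_file", ".env", 0.9), cfg))
```

Under this sketch, two experimental conditions would differ only in their `JudgeConfig` values, which is one simple way to compare how intervention frequency affects the human's oversight.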

How we built it

Frontend & UX: We utilized Lovable to rapidly scaffold and iterate on the user interface, allowing us to build an intuitive, responsive dashboard and sandbox environment where researchers can configure judge parameters and track agent behavior.

Core Logic & Orchestration: We leaned heavily on Claude Code for the vast majority of our general development, backend logic, and system integration. This agentic workflow allowed us to move at hyper-speed, writing the evaluation loops that power our judge model's real-time risk analysis.

Infrastructure & Hosting: The backend architecture is hosted on a DigitalOcean Droplet, providing the reliable, high-performance compute necessary to handle concurrent user sessions, orchestrate the judge model, and manage live experimentation pipelines.

Sandboxed Environments: To ensure isolated and reproducible experimental runs, we engineered an automated pipeline that spins up a containerized VS Code environment for each session. This allows researchers to inject specific starter code and run coding sessions within a secure, controlled playground. To best enable this functionality, we deployed our backend using a DigitalOcean Droplet and our frontend on Cloudflare Pages.

Data Logging & Analytics: For our backend telemetry, we connected the application to a Snowflake data warehouse. Every blocked action, user intervention, and agent decision is streamed to Snowflake, providing an immutable data pipeline for academic and safety research.
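To make the logging step concrete, the sketch below shows one way the three event types named above could be serialized and batched before being written to a warehouse. It is an assumption-laden illustration, not Tether's actual pipeline: `TelemetryEvent`, `EventSink`, `InMemorySink`, and `TelemetryLogger` are hypothetical names, and the in-memory sink stands in for a real Snowflake writer so the example stays self-contained.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Protocol


@dataclass
class TelemetryEvent:
    session_id: str
    kind: str      # "blocked_action", "user_intervention", or "agent_decision"
    payload: dict  # free-form details; this schema is an assumption
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


class EventSink(Protocol):
    """Anything that can persist a batch of JSON rows."""
    def write(self, rows: list[str]) -> None: ...


class InMemorySink:
    """Stand-in for a warehouse writer; only appends, never mutates rows."""
    def __init__(self) -> None:
        self.rows: list[str] = []

    def write(self, rows: list[str]) -> None:
        self.rows.extend(rows)


class TelemetryLogger:
    """Buffers events and hands them to the sink in batches."""
    def __init__(self, sink: EventSink, batch_size: int = 50) -> None:
        self.sink = sink
        self.batch_size = batch_size
        self._buffer: list[str] = []

    def log(self, event: TelemetryEvent) -> None:
        self._buffer.append(json.dumps(asdict(event), sort_keys=True))
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Write whatever is buffered, then start a fresh buffer.
        if self._buffer:
            self.sink.write(self._buffer)
            self._buffer = []


if __name__ == "__main__":
    sink = InMemorySink()
    logger = TelemetryLogger(sink, batch_size=2)
    logger.log(TelemetryEvent("s1", "blocked_action", {"tool": "run_shell"}))
    logger.log(TelemetryEvent("s1", "user_intervention", {"approved": False}))
    print(len(sink.rows), "rows written")
```

Keeping the sink behind a small interface means the same logger can be pointed at a test double during development and at the warehouse in a live experiment. Calling `flush()` at session end ensures a partial batch is not lost.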

Challenges we ran into

Framing and Scoping the Problem: Early in the brainstorming phase, we hit a roadblock regarding our project's primary objective. Because our system encourages active human intervention, we initially drifted toward an educational angle: teaching the user why a specific piece of code was dangerous. While compelling, we realized this leaned more toward social impact than pure AI safety. After analyzing current market contexts and reports like the Claude Mythos risk analysis, we pivoted. We recognized that in large-scale, high-stakes development environments, users usually possess the technical expertise; they just lack the visibility. We narrowed our scope to focus strictly on oversight, telemetry, and risk mitigation rather than baseline education.

Continually Realizing the Need for This Research: Building a tool to monitor agentic coders using agentic coders was an exercise in irony. Throughout the hackathon, we leaned heavily on tools like Claude Code to accelerate our development. However, we were frequently bottlenecked by the exact problems we were trying to solve: context-window hallucinations, broken instructions, and silent failures. Watching our development agents occasionally attempt to overwrite critical files or introduce breaking changes didn't demoralize us; it fueled us. It provided real-time, high-frustration validation that the infrastructure Tether provides is needed.

Accomplishments that we're proud of

Full-Stack Tool Integration: We stitched together an ambitious tech stack, combining Vite, local LLM orchestration via Ollama, and enterprise-grade data warehousing with Snowflake, all within a grueling hackathon timeline. We learned a lot about using tools such as Docker and DigitalOcean to fit our use case, and overall, we all grew as software developers and engineers.

From Theory to Infrastructure: We moved past the abstract philosophy of AI complementarity and built a tangible, working platform that researchers can actually use to run concrete HCI experiments.

We built a functional loop where we could see the immediate utility of our judge-model architecture based on the real-world friction we experienced while coding the project itself.

What we learned

Our biggest takeaway was a profound validation of our own thesis: we desperately need tools like Tether. Developing a multi-layered, large-scale project using autonomous agents taught us how incredibly easy it is to lose track of what an agent is recommending under the hood.

More broadly, we learned just how intricate and fragile a modern full-stack application truly is. Balancing real-time state, model inference latencies, and persistent database logging forced us to appreciate the meticulous engineering required to build tight, reliable human-in-the-loop software.

What's next for Tether

The immediate next step for Tether is opening the platform to a wider pool of developers and researchers to gather empirical data. We want to stress-test our judge agents against a broader array of edge cases to analyze where the AI-to-human handoff succeeds and where it falters. Ultimately, we aim to refine our telemetry pipelines in Snowflake, transforming Tether into the definitive open-source sandbox for benchmarking real-world human-AI complementarity. A feature we plan to add is the ability to configure the coding agent (not just the judge), which would open up more research directions, such as evaluating the effect of hallucinations and corrupted context on results.

https://github.com/Fhazara/UncommonHacks26
