Project Story: Self-Correcting Hardware Agent (SCHA)

💡 Inspiration

The inspiration for Self-Correcting Hardware Agent (SCHA) stems from the "Right to Repair" movement. Modern electronics have become "black boxes" that are difficult for average users to fix, leading to massive e-waste. While generative AI can provide generic advice, it often fails at spatial reasoning and technical precision—crucial elements when guiding a human through a complex circuit board or a fragile screen assembly.

We realized that for AI to be a reliable repair partner, it must do more than just "imagine" a solution; it must visually verify its own logic against physical constraints.

🛠️ How We Built It

SCHA is built as an autonomous, multi-turn agent using the Google Gen AI SDK.

  1. Thinking & Planning: The agent first uses Gemini 2.5 Flash to analyze a user-uploaded photo. It generates a structured Repair Graph in JSON, identifying specific layers and technical components.
  2. Multimodal Generation: We utilize Imagen 4.0 to transform the logical plan into a high-fidelity, isometric exploded view.
  3. The Vision Audit Loop: This is our core innovation. As seen in our generate_with_correction method, the agent does not trust its first attempt. It performs a "Vision Audit" where Gemini 2.5 Flash inspects the generated image for duplicate labels, missing parts, or ordering errors. If an error is found, it provides specific corrective feedback (e.g., "Numbers 2 and 3 are duplicated") and triggers a re-generation.

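As an illustration of step 1's output, a Repair Graph for a phone screen assembly might look like the sketch below. The field names here are hypothetical stand-ins, not the project's actual schema; the validation helper shows the kind of ordering and uniqueness checks the later audit relies on.

```python
# Hypothetical Repair Graph as produced by the planning step.
# Field names are illustrative assumptions, not the project's real schema.
repair_graph = {
    "device": "smartphone",
    "layers": [
        {"order": 1, "label": "Back cover", "tool": "suction cup"},
        {"order": 2, "label": "Battery connector", "tool": "spudger"},
        {"order": 3, "label": "Display assembly", "tool": "Phillips #00"},
    ],
}

def validate_plan(plan: dict) -> bool:
    """Check that layer ordering is a clean 1..N sequence with unique labels."""
    orders = [layer["order"] for layer in plan["layers"]]
    labels = [layer["label"] for layer in plan["layers"]]
    return orders == list(range(1, len(orders) + 1)) and len(set(labels)) == len(labels)
```

A plan that passes this structural check is what the Vision Audit later cross-references against the generated image.
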
🚀 Challenges We Faced

The primary challenge was ensuring Visual Logical Consistency. Generative models often "hallucinate" numbers or scramble sequences in technical diagrams.

We solved this by implementing a Closed-Loop Audit System. By forcing the Auditor (Gemini) to cross-reference the pixel data of the generated image against the initial JSON schema, we created a bridge between symbolic logic and visual representation. The success condition is defined as:

Audit(I, P) = pass

where I represents the visual output and P represents the logical plan. If this condition is not met, the agent persists through a defined max_retries cycle until a technically sound guide is produced.
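The write-up references a generate_with_correction method; a minimal sketch of that closed loop is shown below. Here generate_image and audit_image are hypothetical stand-ins for the Imagen generation call and the Gemini vision audit, injected as parameters so the control flow is self-contained.

```python
def generate_with_correction(plan, generate_image, audit_image, max_retries=3):
    """Closed-loop audit sketch (assumed shape, not the project's exact code).

    generate_image(plan, feedback) -> image: stand-in for the Imagen call.
    audit_image(image, plan) -> list[str]: stand-in for the Gemini vision audit;
    an empty defect list means the image is consistent with the plan.
    """
    feedback = None
    for attempt in range(1, max_retries + 1):
        image = generate_image(plan, feedback)
        defects = audit_image(image, plan)
        if not defects:
            return image, attempt  # success condition met: Audit(I, P) = pass
        # Feed the auditor's specific defects back into the next generation.
        feedback = "; ".join(defects)
    raise RuntimeError(f"No consistent image after {max_retries} attempts: {feedback}")
```

The key design choice is that the auditor returns concrete, actionable defects (e.g. "Numbers 2 and 3 are duplicated") rather than a bare pass/fail, so each retry prompt is strictly more informed than the last.
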

🧠 What We Learned

We discovered that Reasoning is the bridge between pixels and reality. Multi-turn multimodal interactions—where the AI looks, thinks, acts, observes the result, and corrects—dramatically increase the reliability of AI agents in high-stakes physical domains.

🔮 Future Work: The "PaperBanana" Evolution

We aim to scale SCHA from a standalone tool into a professional-grade engineering platform by adopting key methodologies from the PaperBanana (arXiv:2601.23265, Jan 2026) framework:

1. Reference-Driven Multi-Agent Collaboration

Following PaperBanana’s architecture, we plan to decompose the repair task into specialized roles:

  • Retriever Agent: To identify high-quality reference examples from engineering databases, providing the generator with structural and stylistic guidance.
  • Specialized Stylist & Planner: To ensure that every repair guide adheres to professional engineering norms and standardized visual clarity.
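The planned role decomposition could be wired together as a simple pipeline of callables; the sketch below is an assumed structure for this future work, with each role reduced to a function signature.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RepairPipeline:
    """Hypothetical PaperBanana-style role split (sketch of planned work)."""
    retriever: Callable[[str], list[str]]       # photo/query -> reference examples
    planner: Callable[[str, list[str]], dict]   # photo + references -> repair graph
    stylist: Callable[[dict], dict]             # enforce engineering style norms

    def run(self, photo: str) -> dict:
        refs = self.retriever(photo)            # gather structural/stylistic guidance
        plan = self.planner(photo, refs)        # plan grounded in the references
        return self.stylist(plan)               # normalize to professional conventions
```

Keeping each role behind a plain function signature would let the team swap a heuristic retriever for a real engineering-database search without touching the rest of the agent.
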

2. Benchmarking with "Repair-Banana-Bench"

Inspired by the PaperBananaBench, we propose the "Repair-Banana-Bench" to objectively measure AI-generated technical documentation.

  • VLM-as-a-Judge: Using Gemini 3 Pro as an automated judge to score repair guides on Fidelity (technical accuracy), Conciseness (focus on essentials), Readability (layout clarity), and Aesthetics (engineering standards).
  • This benchmark will help transition SCHA from "plausible" images to "publication-quality" engineering blueprints.

🛠️ Built With

  • Core Model: Gemini 2.5 Flash (Analysis & Audit)
  • Image Generation: Imagen 4.0
  • Language: Python 3.10+
  • Libraries: google-genai, Pillow (PIL), IPython
