Inspiration
We were frustrated by the "Slot Machine" nature of standard generative AI. You type a prompt like "A cyberpunk alleyway," pull the lever, and hope for the best. If the lighting is wrong, you have to guess a new prompt and try again. It feels less like engineering and more like gambling.
We realized that Bria's FIBO engine offers something unique: deterministic control via JSON. This meant that attributes like lighting direction, camera angle, and texture weren't just vague words—they were code. This sparked the idea: If the controls are code, can't we build an AI Agent to write that code for us?
This inspired FIBO Visual Agent—a system that doesn't just "imagine" an image, but actively "directs" a photoshoot, critiquing and correcting itself until it gets the shot right.
What it does
FIBO Visual Agent is an autonomous, self-correcting design loop.
- Drafts: It takes a simple user goal (e.g., "A Lamborghini in India") and generates an initial draft using Bria FIBO.
- Sees: It uses Google Gemini Vision to look at the generated image, acting as an AI Art Director.
- Critiques: It compares the image pixels against the user's original goal.
- Fixes: Instead of just complaining, it rewrites the JSON parameters programmatically—adjusting lighting, changing textures, or fixing composition.
- Refines: It sends the patched JSON back to FIBO to generate the corrected image.
It turns the creative process into a closed-loop engineering problem, ensuring the final output matches the user's intent without human intervention.
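The draft → see → critique → fix → refine cycle above can be sketched as a small control loop. This is an illustrative skeleton, not the project's actual code: `generate`, `critique`, and `merge` are hypothetical stand-ins for the Bria FIBO call, the Gemini Vision critique, and the JSON patch merge.

```python
# Minimal sketch of the closed design loop described above. The generate,
# critique, and merge callables are hypothetical stand-ins for the real
# Bria FIBO and Gemini Vision calls.
def refine_until_done(goal, generate, critique, merge, max_iters=5):
    """Run the draft -> see -> critique -> fix -> refine loop.

    generate(params) -> image; critique(image, goal) -> JSON patch dict,
    or None when the critic is satisfied; merge(params, patch) -> params.
    A fixed iteration budget prevents infinite loops.
    """
    params = {"prompt": goal}          # initial draft uses the raw goal
    image = generate(params)
    for _ in range(max_iters):
        patch = critique(image, goal)  # the "AI Art Director" step
        if not patch:                  # critic has no complaints: done
            break
        params = merge(params, patch)  # rewrite the JSON parameters
        image = generate(params)       # regenerate with the patched JSON
    return image, params
```

The key design point is that the loop terminates on either of two conditions: the critic returning an empty patch, or the iteration budget running out.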
How we built it
We designed a "Cybernetic Architecture" with three core components:
- The Muscle (Bria FIBO API): We used FIBO in a novel "Hybrid Mode." We start with `text-to-image` to generate a valid initial composition, then switch to `json-to-image` for surgical, pixel-perfect refinements.
- The Brain (Gemini 2.0 Flash): We used Gemini's multimodal capabilities to analyze images. We engineered a strict System Prompt that forces Gemini to output valid JSON patches (e.g., `{"lighting": {"direction": "rim_light"}}`) rather than conversational text.
- The Manager (Python): A custom Python agent orchestrates the loop. It handles the API handshakes, merges the JSON updates recursively, and manages the iteration state to prevent infinite loops.
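Recursively merging a partial patch into the current scene JSON is the heart of the Manager. A minimal sketch of such a deep merge (the key names are illustrative, not the real FIBO schema):

```python
# Sketch of the recursive merge the Python "Manager" performs: nested
# dicts from the critic's patch are merged into the current scene JSON
# instead of overwriting whole sub-objects.
def deep_merge(base: dict, patch: dict) -> dict:
    """Return a new dict with `patch` merged into `base`, recursing into
    nested dicts so {"lighting": {"direction": ...}} only touches one key."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Example: only lighting.direction changes; intensity and camera survive.
scene = {"lighting": {"direction": "top", "intensity": 0.8},
         "camera": {"focal": "50mm"}}
patched = deep_merge(scene, {"lighting": {"direction": "rim_light"}})
```

A naive `dict.update` would replace the entire `lighting` object and silently drop `intensity`, which is exactly the kind of destructive partial update the recursive version avoids.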
Challenges we ran into
- The "Template Trap": Early versions of our agent kept generating sneakers no matter what we asked for! We realized the agent was reusing an old JSON template that had a "Sneaker" object hardcoded inside. We solved this by implementing a "Hybrid Start" strategy—wiping the object list and using a text-based generation for the first iteration to get a clean slate.
- The "Lazy JSON" Bug: Gemini would sometimes try to be efficient by returning only partial updates to the object list (e.g., updating the texture but deleting the location). This caused the FIBO API to crash with `422 Unprocessable Entity` errors. We had to implement a strict verification layer in Python to ensure the JSON structure remained valid before sending it to Bria.
- The Silent Safety Filter: Our prompt "A Lamborghini in India" kept failing silently. It turned out Gemini's safety settings were flagging innocuous concepts. We learned to customize the `HarmBlockThreshold` to allow for creative freedom while maintaining safety.
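The verification layer for the "Lazy JSON" bug might look something like the sketch below. `REQUIRED_KEYS` and the function name are assumptions for illustration, not the real FIBO schema: the idea is simply to restore any required section a partial update silently deleted before the JSON ever reaches the API.

```python
# Minimal sketch of a verification layer guarding against "lazy" partial
# updates. REQUIRED_KEYS is illustrative, not the actual FIBO schema.
REQUIRED_KEYS = {"objects", "lighting", "camera"}

def validate_scene(scene: dict, previous: dict) -> dict:
    """Restore any required section the patch silently dropped, falling
    back to the last known-good scene; raise if a section is gone entirely."""
    repaired = dict(scene)
    for key in REQUIRED_KEYS:
        if key not in repaired and key in previous:
            repaired[key] = previous[key]   # reuse the last good value
    missing = REQUIRED_KEYS - repaired.keys()
    if missing:
        raise ValueError(f"scene missing required sections: {sorted(missing)}")
    return repaired
```

Catching the malformed structure client-side like this turns an opaque `422 Unprocessable Entity` from the server into a local, debuggable error.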
Accomplishments that we're proud of
- True Autonomy: We built an agent that can take a bad image and fix it without human help. Watching the agent realize "The lighting is too flat" and autonomously adding "Neon Rain" was a magical moment.
- Mastering the JSON-Native Workflow: We proved that Bria FIBO's JSON interface is a superpower for developers. We moved beyond simple prompt engineering into Parameter Engineering.
- Resilience: We successfully handled complex API rate limits (`429` errors) and "Silent Blocks" by implementing robust error handling and exponential backoff strategies.
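The exponential-backoff strategy can be sketched as a small retry wrapper. `RateLimitError` here is a hypothetical stand-in for whatever exception the real client raises on an HTTP 429:

```python
# Sketch of retrying with exponential backoff on rate limits. The
# RateLimitError class is a stand-in for the real client's 429 exception.
import time

class RateLimitError(Exception):
    """Raised by the (hypothetical) API client on an HTTP 429 response."""

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call`, sleeping base_delay * 2**attempt seconds between tries."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise                       # budget exhausted: surface the error
            time.sleep(base_delay * (2 ** attempt))
```

Doubling the delay on each failure (1s, 2s, 4s, ...) gives a rate-limited API room to recover instead of hammering it with immediate retries.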
What we learned
- Code > Text: Being able to programmatically set "Rim Lighting" or "50mm Focal Length" via JSON gives you control that natural language prompting never will.
- The Power of Feedback Loops: The magic wasn't in the generation, but in the critique. By giving the AI "eyes" (Vision capabilities), it could verify its own work, effectively closing the loop between intent and result.
- Model "Personality": We learned that different models have different biases. Gemini 2.0 Flash is incredibly fast and precise for logic, making it perfect for this kind of iterative agent loop.
What's next for FIBO Visual Agent
- Multi-Agent Debate: We plan to implement a "Council of Critics"—one agent focusing solely on Lighting, another on Composition, and a third on Color Theory—debating the changes before applying them.
- The "Cookbook" Memory: We want to save successful JSON schemas to a local database. If a user asks for "Cyberpunk" again, the agent should remember the exact JSON settings that worked last time.
- Real-Time UI: Moving from the terminal to a web interface (Streamlit) so users can watch the image transform in real-time as the agent "thinks" and "fixes" it.