Inspiration
Texo was born from the convergence of two deeply human needs: preservation and personalized education.
First, the loss of oral tradition. We realized that in our own families, grandparents hold incredible stories that are rarely written down. We wanted a way to capture the raw voice and emotion of an elder and instantly transmute it into a permanent, illustrated artifact that a grandchild would cherish.
Second, the need for specific social education. Parents and teachers of neurodivergent children often rely on "Social Stories", custom narratives used to teach specific behaviors (e.g., "Going to the Dentist" or "Sharing Toys"). Creating these visual aids manually is exhausting and expensive. We wanted to build a tool that could generate a safe, age-appropriate, and visually consistent social story in seconds, just by speaking the scenario.
What it does
Texo is an autonomous AI Publisher. It transforms raw voice recordings or simple text ideas into fully illustrated, professional-grade children's books in real time.
- It Listens: Texo natively processes raw audio files, using the speaker's tone and hesitation to inform the narrative mood.
- It Weaves: An "Orchestrator Agent" analyzes the transcript to define a consistent "Story Bible", locking in character designs, art styles, and plot arcs.
- It Illustrates: It spins up parallel workers to generate consistent illustrations for every page.
- It Self-Repairs: Uniquely, Texo is self-aware. If an illustration request triggers a safety filter (common with innocent children's prompts like "messy clothes"), the agent pauses, rewrites its own prompt to be compliant while preserving the artistic intent, and retries automatically.
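The self-repair behavior above can be sketched as a small retry loop. This is a minimal illustration, not Texo's actual code: `generate_image` and `rewrite_prompt` are hypothetical stand-ins for the real Imagen call and the LLM-powered prompt rewriter.

```python
class SafetyBlockedError(Exception):
    """Stand-in for the error raised when the image backend rejects a prompt."""

def generate_with_repair(prompt, generate_image, rewrite_prompt, max_attempts=3):
    """Try to render `prompt`; on a safety block, ask an LLM to rewrite the
    prompt so it is compliant while preserving the artistic intent, then retry."""
    for attempt in range(max_attempts):
        try:
            return generate_image(prompt)
        except SafetyBlockedError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            prompt = rewrite_prompt(prompt)  # compliant rewrite, same intent
```

Injecting the two callables keeps the loop testable without touching any real API, and makes the "treat rejection as a logic puzzle" step explicit.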
How we built it
Texo is a Next.js application powered by a FastAPI agentic backend. The intelligence layer is built entirely on Google Vertex AI.
- Gemini 2.5 Flash (The Ear): We use Flash’s massive context window and native audio understanding to process user voice recordings directly. Instead of a fragile Speech-to-Text pipeline, we use a "Chain-of-Thought" prompt that asks Gemini to first transcribe and then analyze the audio for emotional context.
- Gemini 3 Pro (The Brain): The Orchestrator uses the Pro model for high-level creative direction. Its reasoning capabilities are essential for our "Safety Repair Loop", allowing the agent to understand why a prompt failed and fix it intelligently.
- Imagen 3 (The Hand): We utilize Imagen 3 for high-fidelity illustration, using a "Visual Signature" injection technique to ensure the main character looks consistent from Page 1 to Page 10.
- Resilience Engineering: We implemented a custom threading system with exponential backoff to manage API quotas and "Self-Healing" mechanisms for content safety blocks.
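The resilience layer boils down to a backoff wrapper around each model call. Here is a hedged sketch of the pattern: `QuotaExceeded` is a stand-in for whatever rate-limit error the real client raises, and in the actual system this wrapper would run inside each parallel illustration worker.

```python
import random
import time

class QuotaExceeded(Exception):
    """Stand-in for the quota/rate-limit error raised by the real API client."""

def with_backoff(fn, retries=5, base=1.0, cap=30.0):
    """Retry `fn` on quota errors, sleeping base * 2**n seconds (plus a little
    jitter to de-synchronize parallel workers), capped at `cap`."""
    for n in range(retries):
        try:
            return fn()
        except QuotaExceeded:
            if n == retries - 1:
                raise  # out of retries: surface the error
            delay = min(cap, base * (2 ** n)) + random.uniform(0, 0.25)
            time.sleep(delay)
```

The jitter matters once multiple page-illustration workers run in parallel: without it, all workers that hit the quota at the same moment would retry at the same moment too.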
Texo vs Gemini Hackathon Requirements
Texo was built to push the boundaries of what an "AI Agent" can be: moving beyond simple chat interfaces into autonomous, resilient production work.
- Technical Execution (The "Agentic" Standard): We didn't just build a wrapper; we built a self-healing system. Texo demonstrates advanced application architecture by decoupling the "Ear" (Audio processing), the "Brain" (Orchestrator), and the "Hand" (Illustration). Our use of parallelization with exponential backoff and a recursive "Safety Repair Loop" proves that we prioritized reliability and complexity handling over simple API calls.
- Potential Impact (Democratizing Legacy & Education): Texo solves two massive, disparate problems with one solution. For families, it prevents the loss of oral history by making preservation effortless. For the neurodivergent community, it democratizes access to "Social Stories", educational tools that currently cost hundreds of dollars or hours of manual labor to create. The market is as broad as "anyone with a story to tell."
- Innovation (The "Self-Aware" Loop): Our most novel feature is the agent's ability to handle rejection. While most GenAI apps crash when they hit a safety filter, Texo treats it as a logic puzzle. It analyzes the rejection, rewrites its own prompt to be compliant without losing artistic intent, and retries. This "antifragile" workflow is a significant leap forward in autonomous content generation.
- Presentation: We embraced the "Glass Box" philosophy. Instead of hiding the AI's latency, we visualize its "thought process" in the UI, turning the wait time into an engaging part of the user experience.
Texo Gemini Integration Outlined
Texo utilizes a "Hybrid-Model Architecture" to balance speed, reasoning, and creativity, leveraging the full spectrum of the Gemini model family on Vertex AI.
1. The Ear: Gemini 2.5 Flash (Multimodal Audio) We bypass traditional Speech-to-Text services entirely, relying on Gemini Flash’s native audio understanding. By feeding raw audio files directly into the model's massive context window, Texo captures not just the words of a story but the emotion, hesitation, and tone: nuances that are critical for accurate narrative adaptation yet lost in standard transcription.
2. The Brain: Gemini 3 Pro (Reasoning & Orchestration) The core "Orchestrator Agent" relies on Gemini 3 Pro’s superior reasoning capabilities. It performs the heavy lifting: analyzing the narrative structure, maintaining a consistent "Story Bible" for character continuity, and crucially, powering our Self-Correction Loop. When an image generation request fails due to safety filters, Pro analyzes the blocked prompt and intelligently rewrites it to be compliant while preserving the visual style.
3. The Hand: Imagen 3 (Visual Generation) We use Imagen 3 for its state-of-the-art prompt adherence. This is vital for our "Visual Signature" technique, where we inject specific character tokens into every prompt to ensure the protagonist remains recognizable from Page 1 to Page 10.
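The "Visual Signature" technique above amounts to prepending a fixed character-and-style block from the Story Bible to every per-page prompt. The sketch below is illustrative only: the field names (`visual_signature`, `art_style`) are assumptions, not Texo's real schema.

```python
def page_prompt(story_bible: dict, page_text: str) -> str:
    """Build an Imagen prompt for one page. The visual signature — a fixed
    description of the protagonist from the Story Bible — is injected into
    every page prompt so the character stays recognizable across the book."""
    sig = story_bible["visual_signature"]  # e.g. "Mia: curly red hair, yellow raincoat"
    style = story_bible["art_style"]       # e.g. "soft watercolor, storybook style"
    return f"{style}. {sig}. Scene: {page_text}"
```

Because the signature is generated once by the Orchestrator and then treated as immutable, every parallel illustration worker describes the same character even though each page is rendered independently.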
Challenges we ran into
- The "Safety Filter" Loop: Innocent descriptions in children's stories (e.g., "a child in a swimsuit") often trigger aggressive safety filters. Initially, this crashed our app. We solved this by building a Recursive Agentic Loop: if an image is blocked, Texo catches the error, asks an LLM to "sanitize" the prompt, and retries.
- Audio Hallucinations: When audio was silent or unclear, the model would sometimes hallucinate wild stories. We fixed this by enforcing a strict "Transcript-First" schema, forcing the model to ground itself in the actual audio data before generating the narrative.
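The "Transcript-First" fix can be sketched as a strict response schema plus a grounding check: the model must return a verbatim transcript before any narrative, and an empty transcript is rejected instead of letting a hallucinated story through. Field names here are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass

@dataclass
class GroundedNarration:
    transcript: str   # verbatim words first — the grounding step
    mood: str         # emotional read of the audio
    story_seed: str   # narrative derived only from the transcript

def validate_narration(payload: dict) -> GroundedNarration:
    """Reject model output whose transcript is empty: silent or unclear audio
    must surface as an error instead of an invented story."""
    if not payload.get("transcript", "").strip():
        raise ValueError("no usable speech detected; refusing to invent a story")
    return GroundedNarration(**payload)
```

Ordering the schema with the transcript field first also nudges the model to commit to what was actually said before it starts interpreting.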
Accomplishments that we're proud of
- True Agentic Behavior: Watching the "Thinking" logs in the UI is magical. Users can see Texo say: "⚠️ Safety filter hit on Page 3. Tweaking prompt..." and then "✅ Prompt fixed. Generating...". It feels like a living collaborator.
- Preserving Voice: We successfully processed a rambling 2-minute recording of a childhood memory and watched it turn into a cohesive story that still retained the speaker's specific quirks and names.
What's next for Texo
- Voice Cloning (TTS): Using the input audio to clone the narrator’s voice, so the digital book "reads itself" in Grandma’s voice.
- Physical Export: Integration with print-on-demand services for hardcover books.
- Education Mode: Pre-built templates for teachers to generate lesson-specific stories (e.g., "The Water Cycle") in seconds.
Links
GitHub - https://github.com/oadeniran/Texo
Website - https://texo-fe.azurewebsites.net/
Video - https://youtu.be/TJLUr35LW1k
Built With
- fastapi
- gemini
- imagen3
- nextjs