Genesis: A Multimodal AI Assistant with Intelligent Path Planning

TiDB Cloud Account

protagdis@gmail.com

About the Project

🚀 What Inspired This Project

The inspiration for Genesis came from a frustrating experience many of us have encountered - needing to process complex multimodal content but getting stuck in a maze of different tools, file formats, and workflows. Picture this: you have an image with Korean text that needs to be translated to English, the original text removed, and the translated text placed back onto the image.

This seemingly simple task requires orchestrating multiple specialized tools: OCR for text extraction, translation services, image inpainting for text removal, and text overlay systems. Each tool has different input requirements, output formats, and integration challenges. The process becomes a tedious manual pipeline prone to errors and inefficiencies.

We realized that while AI has become incredibly powerful at individual tasks, there's a massive gap in intelligent tool orchestration. Current AI assistants follow rigid, predefined workflows - ask for image processing, get a fixed pipeline. Need document analysis? You're locked into another predetermined path. But real-world problems are messy, interconnected, and require dynamic thinking.

The breakthrough insight came when we asked: What if an AI could automatically discover the optimal sequence of tools to achieve any goal? Instead of hardcoded workflows, what if we treated tool selection as a path planning problem in a type-aware graph?

This vision led to Genesis - an AI that doesn't just execute tasks, but intelligently discovers how to execute them.

How Genesis Works: Data Flow & Integrations

Genesis operates through a sophisticated orchestration system that intelligently processes your files and learns from your workflows:

🔄 Core Data Flow

```mermaid
graph TD
    UserUpload --> PrecedentAgent
    PrecedentAgent --> ClassifierAgent
    ClassifierAgent --> PathGenerator
    PathGenerator --> RouterAgent
    RouterAgent --> ExecutionEngine
    ExecutionEngine --> FinalizerAgent
    FinalizerAgent --> PrecedentStorage

    PrecedentAgent -.-> TiDB
    PrecedentStorage -.-> TiDB
    ExecutionEngine -.-> FileSystem
    FinalizerAgent -.-> SQLite
```

Step-by-Step Process:

  1. Input Analysis → Upload files (images, audio, PDFs) via web or CLI
  2. Precedent Lookup → Search TiDB vector database for similar past workflows
  3. Smart Classification → AI analyzes content type and processing requirements
  4. Path Planning → Generate optimal tool combinations based on input/output types
  5. Intelligent Routing → Select best workflow path with learned preferences
  6. Execution → Run processing tools (OCR, translation, audio processing, etc.)
  7. Result Assembly → Format outputs and save to organized file structure
  8. Learning → Store successful workflow as precedent for future optimization
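As a rough sketch, the eight steps above can be chained as a single orchestration function. All names here are illustrative stand-ins, not Genesis's actual API:

```python
# Illustrative sketch of the Genesis pipeline; every name is hypothetical.

def run_pipeline(upload, precedents, classify, plan_paths, route, execute, finalize):
    """Thread each pipeline stage's output into the next stage."""
    precedent = precedents.lookup(upload)          # 2. search for similar past workflows
    profile = classify(upload, hint=precedent)     # 3. content type + processing needs
    candidate_paths = plan_paths(profile)          # 4. tool combinations by I/O types
    chosen = route(candidate_paths, precedent)     # 5. pick a path via learned preferences
    result = execute(chosen, upload)               # 6. run OCR / translation / audio tools
    output = finalize(result)                      # 7. format and save the outputs
    precedents.store(upload, chosen, output)       # 8. record the workflow as a precedent
    return output
```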

🎯 What We Learned

This project was an intensive exploration into several cutting-edge AI and software engineering domains:

Advanced AI Agent Architecture:

  • Implemented multi-agent systems with specialized reasoning capabilities (Classifier, Router, Finalizer)
  • Learned about structured output generation and chain-of-thought reasoning with language models
  • Discovered the complexities of agent coordination and state management across distributed reasoning systems

Type Theory and Path Planning:

  • Developed a novel type system for multimodal data compatibility (ImageFile, AudioFile, StructuredData)
  • Implemented sophisticated graph algorithms for dynamic path discovery between data types
  • Learned about constraint satisfaction problems and optimization in tool composition
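A minimal version of this idea treats data types as graph nodes and tools as edges, then runs a breadth-first search for the shortest chain. The registry below is made up for illustration and is far simpler than Genesis's actual type system:

```python
from collections import deque

# Hypothetical tool registry: name -> (input type, output type).
TOOLS = {
    "ocr":        ("ImageFile", "StructuredData"),
    "translate":  ("StructuredData", "StructuredData"),
    "tts":        ("StructuredData", "AudioFile"),
    "transcribe": ("AudioFile", "StructuredData"),
}

def shortest_tool_chain(src_type, dst_type, tools=TOOLS):
    """BFS over the type graph: nodes are data types, edges are tools."""
    queue = deque([(src_type, [])])
    seen = {src_type}
    while queue:
        current, chain = queue.popleft()
        if current == dst_type:
            return chain
        for name, (tin, tout) in tools.items():
            if tin == current and tout not in seen:
                seen.add(tout)
                queue.append((tout, chain + [name]))
    return None  # no tool sequence connects the two types
```

With this registry, turning an image into audio routes through OCR and then text-to-speech.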

Real-Time AI Streaming:

  • Built WebSocket-based streaming for live AI reasoning updates
  • Integrated reasoning transparency, letting users watch the AI "think" in real time

  • Mastered the challenges of streaming structured data while maintaining execution performance

LangGraph and Workflow Orchestration:

  • Utilized LangGraph for complex agent workflow management
  • Implemented hybrid execution environments with process isolation for tool safety
  • Learned about state management in distributed AI systems and error recovery patterns

Production AI System Design:

  • Built robust error handling for unpredictable AI outputs and tool failures
  • Implemented automatic JSON repair for inconsistent LLM structured outputs
  • Designed scalable Docker deployment with both CPU and GPU acceleration support

Modern Full-Stack Development:

  • Created responsive Next.js frontend with real-time visualization of AI decision-making
  • Built FastAPI backend with WebSocket support for streaming AI interactions
  • Implemented interactive graph visualization showing tool paths and execution states

🔧 How We Built It

Architecture Overview: Genesis consists of several sophisticated, interconnected components:

  1. AI Agent Trio: Three specialized agents (Classifier, Router, Finalizer) that work together to understand user intent, plan tool sequences, and synthesize results
  2. Type-Aware Tool Registry: A dynamic system that discovers and indexes available tools based on their input/output type signatures
  3. Path Generation Engine: A novel algorithm that finds all possible tool sequences between data types using graph traversal with provenance tracking
  4. Hybrid Execution System: LangGraph workflows combined with process isolation for safe, transparent tool execution
  5. Real-Time Frontend: Interactive interface showing live AI reasoning and tool path visualization
  6. Streaming Infrastructure: WebSocket-based system for real-time updates during AI processing

Technology Stack:

  • AI/ML: OpenAI's gpt-oss models via Ollama, LangGraph for agent orchestration
  • Backend: Python with FastAPI, SQLAlchemy for persistence, WebSocket streaming
  • Path Planning: Custom graph algorithms with type compatibility checking
  • Vector Database: TiDB Cloud for precedent storage and workflow pattern learning
  • Frontend: Next.js with TypeScript, React for component architecture, real-time WebSocket integration
  • Database: TiDB Cloud for production with vector embeddings, SQLite for local development
  • Deployment: Docker Compose with CPU/GPU modes, health monitoring, horizontal scaling support
  • Development: Python 3.12+, Node.js 20+, comprehensive testing with pytest and Jest

Development Process:

  1. Research Phase: Studied agent architectures, tool composition patterns, and type system design
  2. Core Algorithm Development: Implemented the path finding algorithm with strict contribution filtering and canonical ordering
  3. Agent System Design: Created structured agent communication protocols with Pydantic models
  4. Tool Integration: Built the @pathtool decorator system for automatic tool discovery and registration
  5. Execution Engine: Developed hybrid LangGraph + process isolation for safe tool execution
  6. Frontend Development: Created real-time visualization of AI reasoning and tool path selection
  7. Streaming Implementation: Built WebSocket infrastructure for live reasoning updates
  8. Production Hardening: Implemented error recovery, health monitoring, and deployment automation
  9. Integration Testing: Extensive testing with various multimodal workflows and edge cases
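The @pathtool decorator from step 4 can be approximated as a registry that captures each function's I/O type signature; the implementation below is a hedged re-creation, not Genesis's actual code:

```python
# Hypothetical re-creation of the @pathtool decorator and its registry.
REGISTRY = {}

def pathtool(input_type, output_type):
    """Register a function with its input/output type signature so the
    path planner can discover it automatically."""
    def register(fn):
        REGISTRY[fn.__name__] = {"fn": fn, "in": input_type, "out": output_type}
        return fn
    return register

@pathtool("ImageFile", "StructuredData")
def extract_text(image_path):
    ...  # a real OCR implementation would go here
```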

TiDB Cloud Integration - Advanced Vector Database Architecture:

A critical component that sets Genesis apart is its sophisticated database architecture leveraging TiDB Cloud's distributed NewSQL capabilities with native vector search. This wasn't just a storage choice - it was a fundamental design decision that enables Genesis to learn and improve from every workflow.

Why TiDB Cloud for Vector Search:

Genesis required a database that could handle both structured workflow data and high-dimensional vector embeddings for semantic search. Traditional databases struggle with vector operations, while pure vector databases lack the transactional guarantees needed for workflow state management. TiDB Cloud's unique architecture bridges this gap by providing:

  • Native Vector Support: TiDB Cloud's VECTOR(384) columns store embeddings natively, eliminating serialization overhead
  • Distributed Scale: Horizontal scaling ensures performance as the precedent database grows
  • ACID Transactions: Critical for maintaining consistency between workflow metadata and vector embeddings
  • MySQL Compatibility: Familiar SQL interface with specialized vector functions like VEC_COSINE_DISTANCE
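A precedent lookup against such a schema is an ordinary SQL query ranked by VEC_COSINE_DISTANCE. The table and column names below are assumptions for illustration, not Genesis's real schema:

```python
def precedent_query(embedding, limit=5):
    """Build a TiDB vector-search query over a hypothetical `precedents`
    table with a VECTOR embedding column and a workflow_json payload."""
    # Vector literals are bracketed number lists; inlining is safe here
    # because the values are formatted floats, not user-supplied strings.
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in embedding) + "]"
    sql = (
        "SELECT workflow_json, "
        f"VEC_COSINE_DISTANCE(embedding, '{vec_literal}') AS dist "
        "FROM precedents ORDER BY dist LIMIT %s"
    )
    return sql, (limit,)
```

Smaller cosine distance means a closer match, so ordering ascending returns the most similar precedents first.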

Precedent-Based Learning System:

The system implements a novel approach to AI workflow optimization through precedent storage and retrieval.
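The core of precedent retrieval is similarity matching between embeddings. Stripped of the database layer, the idea looks like this plain-Python sketch (the threshold default and data shapes are assumptions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_precedent(query_emb, precedents, threshold=0.8):
    """Return the stored workflow most similar to the query embedding,
    or None when nothing clears the configurable similarity threshold."""
    best_sim, best_wf = -1.0, None
    for emb, workflow in precedents:
        sim = cosine_similarity(query_emb, emb)
        if sim > best_sim:
            best_sim, best_wf = sim, workflow
    return best_wf if best_sim >= threshold else None
```

In production this loop is replaced by TiDB's native vector search, but the matching semantics are the same.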

Connection Management and Security:

The TiDB integration uses a robust connection management pattern:

  • Environment-Based Configuration: Secure credential management through environment variables
  • Connection Pooling: Efficient resource utilization with MySQLdb connection management
  • SSL Security: Optional SSL/TLS encryption for production deployments
  • Health Monitoring: Automatic connection validation and recovery
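Environment-based configuration for the MySQLdb connection (the mysqlclient driver mentioned below) can be sketched as follows; the environment-variable names are assumptions, not Genesis's exact config keys:

```python
import os

def tidb_connect_kwargs(env=os.environ):
    """Assemble MySQLdb.connect() arguments from environment variables.
    Variable names here are illustrative, not Genesis's actual config."""
    kwargs = {
        "host": env["TIDB_HOST"],
        "port": int(env.get("TIDB_PORT", "4000")),   # TiDB's default MySQL port
        "user": env["TIDB_USER"],
        "passwd": env["TIDB_PASSWORD"],
        "db": env.get("TIDB_DATABASE", "genesis"),
    }
    if env.get("TIDB_SSL_CA"):                       # optional TLS for production
        kwargs["ssl"] = {"ca": env["TIDB_SSL_CA"]}
    return kwargs
```

Keeping credentials out of code and in the environment also makes the same build portable across local SQLite-backed development and TiDB Cloud production.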

Performance Optimizations:

The vector search implementation includes several performance optimizations:

  • Threshold Filtering: Configurable similarity thresholds prevent irrelevant matches
  • Index Utilization: TiDB's vector indexes accelerate similarity searches
  • Batch Operations: Efficient bulk precedent storage for training scenarios
  • Connection Reuse: Global connection pooling reduces latency

This TiDB integration represents a significant advancement in AI workflow systems, enabling Genesis to learn from experience and continuously improve its tool selection and orchestration capabilities.

🏆 Challenges Faced

AI and Agent Coordination Challenges:

LLM Output Consistency: The biggest technical challenge was handling inconsistent structured outputs from OpenAI's gpt-oss models. Sometimes the model would return perfect JSON, other times it would include reasoning text or have formatting issues. We solved this with a hybrid approach that combines structured output APIs with fallback JSON parsing and repair mechanisms.
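The fallback-parsing idea can be sketched in a few lines; this is a simplified stand-in for Genesis's repair mechanism, covering only the common cases of reasoning text and markdown fences around the payload:

```python
import json
import re

def parse_llm_json(raw):
    """Best-effort JSON extraction from LLM output that may wrap the
    payload in reasoning text or markdown code fences (simplified sketch)."""
    try:
        return json.loads(raw)           # fast path: the output is clean JSON
    except json.JSONDecodeError:
        pass
    # Strip code fences, then grab the outermost {...} span.
    cleaned = re.sub(r"```(?:json)?", "", raw)
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start != -1 and end > start:
        return json.loads(cleaned[start:end + 1])
    raise ValueError("no JSON object found in model output")
```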

Multi-Agent State Management: Coordinating state between three different AI agents (Classifier, Router, Finalizer) while maintaining conversation context was complex. Each agent needed access to conversation history, previous decisions, and execution results, but we had to prevent state pollution and ensure clean handoffs.

Real-Time Reasoning Streaming: Implementing live reasoning updates while maintaining execution performance required careful architecture. We had to balance between showing users meaningful reasoning progress and not overwhelming them with too much technical detail.

Algorithm and Performance Challenges:

Path Discovery Complexity: As the number of available tools grew, the path discovery algorithm faced exponential complexity. Our initial naive approach couldn't handle more than 10-15 tools efficiently. We solved this with sophisticated pruning, memoization, and the "strict contribution" filtering that ensures every tool in a path is actually used.
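Memoizing on (type, remaining depth) is one way to tame that blow-up: chains that share a suffix are explored once. The sketch below uses made-up tool specs and linear chains only, which makes the "strict contribution" property automatic; the real algorithm also handles multi-input tools, which this sketch sidesteps:

```python
from functools import lru_cache

# Hypothetical tool specs: (name, input type, output type).
TOOL_SPECS = (
    ("ocr", "ImageFile", "StructuredData"),
    ("translate", "StructuredData", "StructuredData"),
    ("tts", "StructuredData", "AudioFile"),
)

def all_chains(src, dst, max_len=4, tools=TOOL_SPECS):
    """Enumerate every tool chain from src to dst up to max_len steps,
    memoizing on (type, remaining depth) so shared suffixes are reused."""
    @lru_cache(maxsize=None)
    def search(current, depth):
        chains = [()] if current == dst else []  # empty chain reaches dst trivially
        if depth == 0:
            return tuple(chains)
        for name, tin, tout in tools:
            if tin == current:
                chains += [(name,) + rest for rest in search(tout, depth - 1)]
        return tuple(chains)
    return [list(c) for c in search(src, max_len)]
```

The depth cap doubles as pruning: paths longer than any plausible workflow are never expanded.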

Type Compatibility System: Creating a flexible type system that could handle semantic compatibility (not just exact type matches) was challenging. For example, VideoFile should be compatible with AudioFile (the audio track can be extracted), but TextFile should not be compatible with AudioFile without an explicit text-to-speech tool.
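A hedged sketch of such semantic compatibility is a lookup table of "can stand in for" relations; the entries below illustrate the VideoFile/AudioFile example and are not the full Genesis type lattice:

```python
# Hypothetical semantic-compatibility table: source type -> the set of
# types it can stand in for, possibly via an implicit extraction step.
COMPATIBLE = {
    "VideoFile": {"VideoFile", "AudioFile"},  # the audio track can be extracted
    "AudioFile": {"AudioFile"},
    "TextFile":  {"TextFile"},                # NOT AudioFile: needs an explicit tool
    "ImageFile": {"ImageFile"},
}

def is_compatible(produced, required):
    """True when a value of type `produced` can feed a tool expecting `required`."""
    return required in COMPATIBLE.get(produced, {produced})
```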

Process Isolation Performance: Running each tool in isolated processes for safety created performance overhead. We had to optimize serialization, implement smart caching, and develop efficient inter-process communication patterns.

Database and Vector Search Challenges:

TiDB Vector Integration Complexity: Integrating TiDB Cloud's vector capabilities required navigating several technical challenges. The biggest hurdle was understanding TiDB's specific VECTOR column format and ensuring compatibility with our embedding pipeline. We also had issues using and understanding the official Python library and the online SQL editor, so we ended up issuing raw SQL through MySQLdb to connect to TiDB.

Integration and User Experience Challenges:

Tool Integration Complexity: Each tool (OCR, translation, image processing, audio processing) had different requirements, dependencies, and failure modes. Creating a unified interface while preserving tool-specific optimizations required careful abstraction design.

Real-Time Visualization: Creating an intuitive interface that shows complex AI decision-making in real-time without overwhelming users was challenging. We iterated through multiple design approaches before finding the right balance of technical insight and usability.

Error Recovery and Transparency: When AI agents make mistakes or tools fail, users need clear explanations and recovery options. Building transparent error handling that maintains user trust while providing actionable feedback required extensive user testing.

🔄 Lessons Learned

This project taught us that building production-ready AI systems requires much more than just training models or calling APIs. The integration of multiple AI agents, complex algorithms, user experience, and reliability considerations is crucial. Key lessons learned:

AI System Design:

  • Agent Specialization: Instead of building one large agent, specialized agents with clear responsibilities are more reliable and maintainable
  • Structured Communication: Using Pydantic models for all inter-agent communication prevents errors and makes debugging much easier
  • Graceful Degradation: AI systems need multiple fallback strategies when components fail or produce unexpected outputs
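A minimal example of such structured inter-agent communication, assuming Pydantic and field names invented for illustration (Genesis's actual models will differ):

```python
from pydantic import BaseModel

# Hypothetical message schema passed from the Router to the execution engine.
class RouterDecision(BaseModel):
    chosen_path: list[str]   # ordered tool names to execute
    confidence: float        # router's confidence in this path
    rationale: str           # human-readable reasoning for transparency

decision = RouterDecision(
    chosen_path=["ocr", "translate", "overlay"],
    confidence=0.92,
    rationale="Matches a stored precedent for image translation.",
)
```

Because the schema is validated at construction time, a malformed agent message fails loudly at the boundary instead of corrupting downstream state.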

Algorithm Engineering:

  • Start Simple, Optimize Later: Our initial path finding algorithm was naive but correct. We optimized for performance only after understanding the real-world usage patterns
  • Type Safety in Dynamic Systems: Even in Python, explicit type checking and validation prevents many runtime errors in complex systems
  • Provenance Tracking: Keeping track of why certain decisions were made is crucial for debugging and user trust

User Experience:

  • Transparency Builds Trust: Users are more forgiving of AI mistakes when they can see the reasoning process
  • Progressive Disclosure: Show simple results first, allow users to dig deeper into technical details if interested
  • Real-Time Feedback: Even small progress indicators during AI processing significantly improve perceived performance

Software Engineering:

  • Process Isolation Saves Sanity: Tool conflicts and resource contention are inevitable in complex systems; isolation prevents cascading failures
  • Comprehensive Testing: AI systems have many more edge cases than traditional software; extensive testing is essential
  • Monitoring and Observability: Production AI systems need detailed logging and metrics to understand behavior and performance
  • Database Schema Evolution: Vector-enabled databases require careful schema management and migration strategies as embedding models evolve

The experience reinforced that successful AI projects require a combination of cutting-edge research, practical engineering, and deep understanding of user needs. It's not just about having the most accurate model, but creating a system that people can trust and that works reliably in real-world scenarios. The TiDB Cloud integration specifically taught us that choosing the right database architecture early is crucial - it affects not just performance, but the entire system's ability to learn and improve over time.

🌟 Future Enhancements

Advanced AI Capabilities:

  • Self-Improving Path Discovery: Learn from usage patterns to optimize tool selection and suggest new tool combinations
  • Multi-Modal Context Understanding: Better integration of visual, audio, and text context for more intelligent tool selection
  • Collaborative Agent Learning: Agents that learn from each other's decisions and improve over time
  • Custom Tool Development SDK: Allow users to create and register their own tools with automatic integration

Enhanced User Experience:

  • Workflow Templates: Save and share successful tool sequences as reusable templates
  • Batch Processing: Handle multiple files or large datasets with intelligent parallelization
  • Voice Interface: Natural language interaction for hands-free multimodal processing
  • Mobile Applications: Native iOS/Android apps with offline processing capabilities

Enhanced Database and Vector Search Capabilities:

  • Multi-Modal Embedding Fusion: Combine text, image, and audio embeddings for richer precedent matching
  • Temporal Workflow Analysis: Track workflow evolution over time using TiDB's temporal features
  • Federated Vector Search: Distributed precedent sharing across multiple TiDB Cloud instances
  • Advanced Similarity Metrics: Implement custom distance functions beyond cosine similarity for domain-specific matching
  • Automated Embedding Model Updates: Seamless migration between embedding model versions with backward compatibility

Enterprise and Scaling Features:

  • Team Collaboration: Shared workspaces with role-based access and approval workflows
  • Advanced Analytics: Usage insights, performance optimization recommendations, and cost tracking
  • API Integration: Connect with existing enterprise tools and workflows
  • Federated Tool Discovery: Share tools across organizations while maintaining security and privacy
  • Multi-Tenant Vector Isolation: Secure precedent storage with tenant-specific vector spaces in TiDB

Research and Innovation:

  • Causal Reasoning: Understand not just what tools to use, but why certain combinations work better
  • Multi-Agent Negotiation: Agents that can debate and negotiate the best approach for complex tasks
  • Adaptive Type System: Types that evolve based on usage patterns and new tool capabilities
  • Quantum-Ready Architecture: Prepare for future quantum computing integration in AI processing

Genesis represents a significant step forward in AI assistant technology, moving from rigid task execution to intelligent problem-solving. The project demonstrates that with careful engineering, thoughtful design, and user-focused development, AI can become a true collaborative partner in creative and analytical work.

Built With

  • audio
  • computer
  • docker
  • fastapi
  • gptoss
  • langchain
  • next.js/react-with-typescript-and-tailwindcss
  • ollama
  • openai
  • opencv
  • processing
  • python
  • sqlalchemy-database
  • tidb