Inspiration
Enterprise executives and operating officers are drowning in cognitive overload. Valuable, mission-critical data does not arrive in neat packages; it is fragmented across unstructured PDF reports, raw financial spreadsheets, and dashboard screenshots. The standard industry solution is to throw disjointed AI chatbots at the problem, forcing leaders to copy-paste data and chat back and forth to extract value. We built ControlTower AI to shift from conversational chat wrappers to deterministic, institutional-grade execution engines. We wanted an autonomous "single pane of glass" that natively ingests multi-format corporate data in parallel and distills it into an actionable, high-contrast executive matrix.
What it does
ControlTower AI is a fully asynchronous, multi-agent decision engine that evaluates raw enterprise data streams and synthesizes them into a minimalist command dashboard. Users drop mixed-media assets (PDF documentation, image telemetry, financial data) into a clean, zero-bloat interface. The system concurrently fires specialized AI sub-agents to parse the inputs. Instead of returning generic summaries, the backend runs cross-agent synthesis to calculate a unified System Health Score, isolate Critical Exceptions, and produce a prioritized Execution Matrix: concrete, sequential steps an executive can take immediately.
How we built it
The application uses a decoupled full-stack architecture:
- The Pipeline Backend: Built on FastAPI (Python) with non-blocking asynchronous routing. The engine avoids sequential processing bottlenecks by using asyncio.gather to run sub-agent reasoning concurrently, so wall-clock time is bounded by the slowest agent rather than the sum of all agents.
- The Multimodal AI Core: Powered by Google Gemini 2.5 Flash. The architecture discards legacy text-extraction wrappers and brittle OCR pipelines. Instead, it sends raw document bytes and image components directly to Gemini’s native multimodal engine.
- Deterministic Governance: Every agent is bound to strict Pydantic validation schemas. This forces probabilistic LLM outputs into strictly typed JSON entities before they reach the frontend, eliminating UI parsing failures.
- The Executive Frontend: Built using React and Vite, stripped of standard AI design clichés (no emojis, heavy borders, or bright gradients). It implements an ultra-dark monochromatic interface focused entirely on data density, clean typography, and subtle interaction states.
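The concurrent fan-out described for the pipeline backend can be sketched as follows. This is a minimal illustration, not the project's actual code: `run_agent` and `analyze` are hypothetical names, and the `asyncio.sleep` call stands in for a network-bound model request.

```python
import asyncio


async def run_agent(name: str, payload: bytes, delay: float = 0.05) -> dict:
    """Hypothetical sub-agent; a real one would call the model API here."""
    await asyncio.sleep(delay)  # simulates network-bound model latency
    return {"agent": name, "bytes_seen": len(payload)}


async def analyze(files: dict[str, bytes]) -> list[dict]:
    """Run one sub-agent per uploaded file, all at the same time."""
    tasks = [run_agent(name, data) for name, data in files.items()]
    # gather schedules every coroutine concurrently and returns the
    # results in the same order as the inputs.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    uploads = {"report.pdf": b"%PDF-1.4", "chart.png": b"\x89PNG"}
    print(asyncio.run(analyze(uploads)))
```

Because the agents spend most of their time waiting on I/O, the total wait is roughly that of the slowest agent, provided the upstream API does not throttle the parallel requests.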
Challenges we ran into
Building a highly parallelized multimodal system exposed significant engineering constraints that forced us to build a more robust product:
- Binary Stream Corruption: Initially, the backend attempted to parse inbound PDFs by decoding them into standard utf-8 strings. This corrupted the binary structure of the files and passed unreadable layout noise to the models. We re-engineered the routing tier to preserve raw byte streams, passing them directly to Gemini along with explicit MIME types for lossless native parsing.
- State-Level Payload Stripping: During frontend integration, our drag-and-drop React architecture stripped native JavaScript File blobs down to plain metadata objects during state updates, causing FastAPI to return a 422 Unprocessable Content validation failure. We rewrote the input handlers to keep the raw File objects in application state.
- Schema and Scope Mismatches: Managing rapid configuration changes across parallel files led to scoping errors where sub-agents used the wrong validation schemas (e.g., text nodes calling data extraction shapes). We resolved this by centralizing our model schemas, and we also wrapped all Gemini API invocations in a strict exponential backoff function.
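An exponential backoff wrapper of the kind mentioned above might look like the sketch below. The names `with_backoff` and `TransientAPIError` are hypothetical, as are the retry count and delay values; the real wrapper would catch whatever rate-limit or timeout errors the Gemini client actually raises.

```python
import asyncio
import random


class TransientAPIError(Exception):
    """Hypothetical stand-in for a rate-limit or timeout error."""


async def with_backoff(call, *, retries: int = 4,
                       base_delay: float = 0.5, max_delay: float = 8.0):
    """Await call(), retrying on TransientAPIError with growing delays."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except TransientAPIError:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps parallel agents from retrying in lockstep.
            await asyncio.sleep(random.uniform(0, ceiling))
```

A sub-agent would then wrap its model request, for example `await with_backoff(lambda: call_model(payload))`, where `call_model` is whatever coroutine performs the request.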
Accomplishments that we're proud of
- Fault-Tolerant Degraded States: We built an error-handling framework that isolates failures. If a single agent hits a rate limit or encounters an unparseable asset, the sub-pipeline contains the error and passes a structured fallback payload. The system still returns a 200 OK and renders gracefully, preventing a single failure from crashing the entire enterprise workflow.
- Zero-Loss Ingestion: Bypassing traditional intermediate parsing libraries allowed us to achieve highly accurate contextual extraction directly from document pixels and raw binaries.
- Production-Grade Aesthetic: We moved completely away from generic template layouts to deliver a clean, hyper-minimalist interface that feels like software crafted by senior human engineers.
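One way to get this degraded-state behavior is to collect exceptions per agent instead of letting the first one abort the batch. The sketch below assumes that approach; the agent functions, the `fallback` helper, and the payload fields are all hypothetical and only illustrate the shape of the idea.

```python
import asyncio


async def parse_ok(name: str) -> dict:
    """Hypothetical agent that succeeds."""
    return {"agent": name, "status": "ok", "findings": ["parsed"]}


async def parse_rate_limited(name: str) -> dict:
    """Hypothetical agent that fails, as if the API had throttled it."""
    raise RuntimeError("429: rate limit exceeded")


def fallback(name: str, error: Exception) -> dict:
    """Structured placeholder so the response keeps a uniform shape."""
    return {"agent": name, "status": "degraded", "findings": [], "error": str(error)}


async def run_pipeline(agents: dict) -> dict:
    names = list(agents)
    # return_exceptions=True hands failures back as values instead of
    # raising, so one bad agent cannot cancel the whole batch.
    outcomes = await asyncio.gather(
        *(agents[n](n) for n in names), return_exceptions=True
    )
    results = [
        fallback(n, o) if isinstance(o, Exception) else o
        for n, o in zip(names, outcomes)
    ]
    return {"degraded": any(r["status"] == "degraded" for r in results),
            "agents": results}
```

The endpoint can return this dictionary with a normal success status, and the frontend can render the degraded agent's card differently based on its `status` field.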
What we learned
- Native Multimodalism > OCR: Feeding raw binaries directly to Gemini 2.5 Flash yields vastly superior contextual comprehension compared to pre-processing files through traditional text extraction tools.
- SDE Rigor in AI Engineering: Building reliable software with AI requires treating LLMs as probabilistic functions that must be rigidly constrained by structural type enforcement (Pydantic) at the entry and exit boundaries of the network.
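The boundary-enforcement idea can be illustrated without any dependencies. The sketch below deliberately swaps Pydantic for a standard-library dataclass plus manual checks so that it runs anywhere; the project itself uses Pydantic schemas, and the `AgentFinding` fields here are assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentFinding:
    health_score: int          # hypothetical field, expected in 0..100
    critical_exceptions: list  # hypothetical field, list of strings


def parse_finding(raw: str) -> AgentFinding:
    """Turn raw model text into a typed AgentFinding or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    score = data.get("health_score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        raise ValueError("health_score must be an integer in [0, 100]")

    items = data.get("critical_exceptions")
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError("critical_exceptions must be a list of strings")

    return AgentFinding(health_score=score, critical_exceptions=items)
```

Anything the model returns either becomes a well-typed object or fails loudly at the boundary, which is the same guarantee a Pydantic model gives with far less hand-written checking.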
What's next for ControlTower AI
- Automated Webhook Execution: Moving the Execution Matrix from passive insight to active automation. Clicking an action step will automatically spin up secure Slack incident rooms, draft precise Jira tickets, or alert specific department heads via API.
- Cross-Document Temporal Analysis: Enhancing the Synthesizer agent to track changes over time, allowing the engine to recognize trends across multiple processing sessions rather than analyzing uploads in isolation.
- Distributed Vector Storage: Integrating a low-latency vector database to enable semantic search across thousands of historical enterprise uploads alongside the real-time upload pipeline.
Built With
- fastapi
- google-gemini-2.5-flash
- react

