About the Project: Multimodal Clinical Triage Agent (MCTA)
The MCTA is an application built to address the data-synthesis problem in emergency triage. It moves beyond basic AI chatbots by combining advanced Gemini API features (multimodal input, function calling, and structured output) to perform verifiable, tool-grounded clinical reasoning across four distinct data modalities simultaneously.
💡 Inspiration
Our primary motivation stemmed from the limitations of current medical AI, which often fails in high-stakes scenarios because it relies solely on text. This leaves the system unable to reconcile conflicting patient data: for example, a low-risk verbal report contradicted by a critical lab value or an alarming X-ray image.
Our goal was to build a system that achieves cross-modal synthesis, mimicking an expert clinician's ability to seamlessly integrate:
Patient narrative (Text)
Objective imaging (Image)
Lab results (Tabular Data)
Vitals trends (Time-Series)
✅ What it Does
The MCTA functions as a Mission Control Dashboard for emergency room staff, delivering a comprehensive, verifiable diagnostic report in seconds.
Four-Modality Fusion: Accepts and synthesizes Text (notes), Image (X-ray), Tabular Data (live lab entries), and Time-Series Vitals (trends).
Autonomous Tool Orchestration: Recognizes when its internal reasoning requires external confirmation and autonomously calls two specialized Python tools for quantitative grounding.
Structured, Transparent Output: Generates a final, schema-enforced JSON report that includes the triage urgency (RED/YELLOW/GREEN) and a traceable summary of the evidence.
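The schema-enforced report described above might look like the following sketch. The exact field names are illustrative assumptions, not the project's actual schema; with the Gemini SDK, a schema like this can be supplied alongside JSON-mode output to constrain the response.

```python
# Illustrative response schema for the final triage report. Field names are
# assumptions for the sketch, not the project's exact schema.
TRIAGE_REPORT_SCHEMA = {
    "type": "object",
    "properties": {
        "urgency": {"type": "string", "enum": ["RED", "YELLOW", "GREEN"]},
        "sepsis_risk_score": {"type": "number"},
        "evidence_summary": {
            "type": "array",
            "items": {"type": "string"},
        },
    },
    "required": ["urgency", "evidence_summary"],
}

def validate_report(report: dict) -> bool:
    """Minimal local sanity check that a model response matches the schema shape."""
    return (
        report.get("urgency") in {"RED", "YELLOW", "GREEN"}
        and isinstance(report.get("evidence_summary"), list)
    )
```

A local validation pass like this is a cheap safety net even when the API enforces the schema server-side.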
🏗️ How We Built It
We followed a strict 8-hour execution plan, focusing intensely on Gemini API integration:
- The Fusion Pipeline
We engineered a pipeline where Text and Image data are ingested raw, but Structured Data (Labs and Vitals) is first passed through a Data Abstraction Layer. This Python layer analyzes the data (e.g., checks if Lactate > 3.0) and converts it into concise, interpretive language features before submission. This ensures the LMM reasons over meaning, not numerical noise.
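A minimal sketch of that Data Abstraction Layer, assuming a plain dict of lab values (the thresholds and field names here are illustrative, not the project's exact clinical cutoffs):

```python
# Sketch of the Data Abstraction Layer: raw lab values are mapped to short
# interpretive phrases before being sent to the model, so the LMM reasons
# over meaning rather than numerical noise. Thresholds are illustrative.
def describe_labs(labs: dict) -> list[str]:
    features = []
    lactate = labs.get("lactate_mmol_l")
    if lactate is not None:
        if lactate > 3.0:
            features.append(
                f"Lactate {lactate} mmol/L: critically elevated, possible hypoperfusion"
            )
        elif lactate > 2.0:
            features.append(f"Lactate {lactate} mmol/L: mildly elevated")
        else:
            features.append(f"Lactate {lactate} mmol/L: within normal range")
    wbc = labs.get("wbc_10e9_l")
    if wbc is not None and (wbc > 12.0 or wbc < 4.0):
        features.append(
            f"WBC {wbc} x10^9/L: abnormal, consistent with possible infection"
        )
    return features
```

The resulting phrases are joined into the prompt alongside the raw notes and image, so every number arrives already interpreted.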
- Dual Function Calling
We built a complex Multi-Turn Conversation Loop (utils.py) to manage agentic behavior:
The agent is configured to use the calculate_sepsis_risk tool to ensure the final triage is based on an algorithmic risk score.
The agent is required to use the generate_vitals_visualization tool, which executes Matplotlib code on the host, renders a chart image, and returns it to the LMM as a Base64 image. The agent then performs self-referential multimodal reasoning by analyzing the chart it just created.
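The two tools could look roughly like the following sketch. The sepsis tool is shown here as a qSOFA-style score (respiratory rate >= 22/min, systolic BP <= 100 mmHg, GCS < 15), which is an assumption standing in for the project's actual algorithm; the parameter names are likewise illustrative.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the tool can run server-side
import matplotlib.pyplot as plt

def calculate_sepsis_risk(resp_rate: int, systolic_bp: int, gcs: int) -> dict:
    """Illustrative stand-in for the project's sepsis tool, using qSOFA-style
    criteria: each met criterion adds one point; a score >= 2 flags high risk."""
    score = sum([resp_rate >= 22, systolic_bp <= 100, gcs < 15])
    return {"qsofa_score": score, "high_risk": score >= 2}

def generate_vitals_visualization(timestamps: list[str], heart_rates: list[float]) -> str:
    """Render a vitals trend chart and return it as a Base64-encoded PNG,
    mirroring how the chart image is passed back to the LMM."""
    fig, ax = plt.subplots(figsize=(4, 2.5))
    ax.plot(timestamps, heart_rates, marker="o")
    ax.set_title("Heart rate trend")
    ax.set_ylabel("bpm")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

In the multi-turn loop, the Base64 string returned by the second tool is wrapped as an inline image part in the next request, which is what enables the agent to inspect its own chart.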
- UX and Transparency
We polished the interface into a three-column dashboard using Streamlit, featuring dynamic color-coded triage badges and a verbose Reasoning Trace that logs every tool call ([ACTION]) and every host response ([OBSERVATION]) for full auditability.
⚙️ Challenges We Ran Into
- API Conflict: JSON vs. Function Calling
The most significant technical hurdle was a 400 INVALID_ARGUMENT error, which revealed that the model cannot combine Function Calling and Structured JSON Output in the same request.
Solution: We implemented a Dynamic Configuration Strategy in utils.py, forcing the model to switch configurations mid-conversation: Tools ON, JSON OFF for the call turn, and then Tools OFF, JSON ON for the final synthesis turn. This was a challenging but necessary engineering fix to preserve both core features.
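The strategy above can be sketched as a per-turn config builder. This is shown with plain dicts to illustrate the switching logic; in the real loop these values would map onto the Gemini SDK's generation config and tool declarations, and the tool names listed here come from the tools described earlier.

```python
# Sketch of the Dynamic Configuration Strategy: the request config is rebuilt
# each turn so function calling and JSON-mode output are never enabled in the
# same request, avoiding the 400 INVALID_ARGUMENT conflict.
TOOL_DECLARATIONS = ["calculate_sepsis_risk", "generate_vitals_visualization"]

def build_turn_config(final_synthesis: bool) -> dict:
    if final_synthesis:
        # Tools OFF, JSON ON: the model must emit the schema-enforced report.
        return {"tools": None, "response_mime_type": "application/json"}
    # Tools ON, JSON OFF: the model is free to emit function calls.
    return {"tools": TOOL_DECLARATIONS, "response_mime_type": None}
```

The conversation loop calls this once per turn, flipping `final_synthesis` to True only after all tool observations have been fed back to the model.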
- Model Instability and Quota
We faced 429 RESOURCE_EXHAUSTED errors using high-tier models.
Lesson: This forced us to prioritize stability by switching to the high-throughput gemini-2.5-flash. While we targeted the higher reasoning of Pro, our ability to run the full complex architecture on a high-speed stack proves the robustness and efficiency of the MCTA pipeline.
🏆 Accomplishments We're Proud Of
Full Four-Modality Synthesis: Successfully demonstrating synthesis across Text, Image, Tabular, and Time-Series data in a single, coherent workflow.
Self-Referential Agent: Achieving the complex loop where the agent autonomously generates a Matplotlib chart, receives the Base64 image, and integrates that visual data into its final diagnosis.
Robust Error Handling: The agent's logic is protected by Exponential Backoff and the Dynamic Configuration Strategy, keeping the system responsive even under transient API instability.
High-Quality UX: Translating complex technical output (tool logs, JSON data) into a clear, professional, color-coded dashboard.
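The exponential backoff mentioned above follows a standard pattern; a minimal sketch (the function name and jitter window are our own choices for illustration):

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=1.0, retriable=(Exception,)):
    """Retry `fn` with exponential backoff plus jitter, the pattern used to
    ride out transient errors such as 429 RESOURCE_EXHAUSTED."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt; jitter avoids synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Each Gemini API call in the loop is wrapped this way, so a burst of quota errors degrades into a short pause rather than a crashed session.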
📚 What We Learned
We gained deep knowledge regarding the constraints and best practices of advanced LMM integration:
LMMs Require Feature Engineering: Raw numerical data must be translated into contextual language features for effective reasoning.
Structured Output is Fragile: Combining Function Calling and Structured Output requires sophisticated dynamic configuration to avoid API conflicts.
Architectural Robustness Wins: A stable, complex architecture running on a high-throughput model like gemini-2.5-flash is more valuable for deployment than a fragile system running on a top-tier reasoning model.
⭐️ What's Next for Multimodal Clinical Triage Agent
Full Data Integration: Implement dynamic frontend inputs for blood pressure and respiratory rate to eliminate current mock variables for the Sepsis Risk calculation.
Real-time Vitals Streaming: Integrate with a mock data stream generator to simulate real-time patient monitoring, allowing the agent to continuously reassess triage priority.
Expanded Medical Tools: Integrate additional tools for specialized scoring, such as Glasgow Coma Scale (GCS) or Wells' Criteria, further increasing the agent's quantitative grounding.