About the Project: Multimodal Clinical Triage Agent (MCTA)
The MCTA is an application built to address the data-synthesis problem in emergency triage. It moves beyond basic AI chatbots by combining advanced Gemini API features (multimodal input, function calling, and structured output) to perform verifiable, tool-grounded clinical reasoning across four distinct data modalities simultaneously.
💡 Inspiration
Our primary motivation stemmed from the limitations of current medical AI, which often fails in high-stakes scenarios because it relies solely on text. This leaves the system unable to reconcile conflicting patient data: for example, a low-risk verbal report contradicted by a critical lab value or an alarming X-ray image.
Our goal was to build a system that achieves cross-modal synthesis, mimicking an expert clinician's ability to seamlessly integrate:
Patient narrative (Text)
Objective imaging (Image)
Lab results (Tabular Data)
Vitals trends (Time-Series)
✅ What it Does
The MCTA functions as a Mission Control Dashboard for emergency room staff, delivering a comprehensive, verifiable diagnostic report in seconds.
Four-Modality Fusion: Accepts and synthesizes Text (notes), Image (X-ray), Tabular Data (live lab entries), and Time-Series Vitals (trends).
Autonomous Tool Orchestration: Recognizes when its internal reasoning requires external confirmation and autonomously calls two specialized Python tools for quantitative grounding.
Structured, Transparent Output: Generates a final, schema-enforced JSON report that includes the triage urgency (RED/YELLOW/GREEN) and a traceable summary of the evidence.
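The schema-enforced report described above might look like the following sketch. The exact field names are illustrative assumptions, not the project's actual schema; with the Gemini SDK, a schema like this can be supplied alongside JSON-mode output to constrain the response.

```python
# Illustrative response schema for the final triage report. Field names are
# assumptions for the sketch, not the project's exact schema.
TRIAGE_REPORT_SCHEMA = {
    "type": "object",
    "properties": {
        "urgency": {"type": "string", "enum": ["RED", "YELLOW", "GREEN"]},
        "sepsis_risk_score": {"type": "number"},
        "evidence_summary": {
            "type": "array",
            "items": {"type": "string"},
        },
    },
    "required": ["urgency", "evidence_summary"],
}

def validate_report(report: dict) -> bool:
    """Minimal local sanity check that a model response matches the schema shape."""
    return (
        report.get("urgency") in {"RED", "YELLOW", "GREEN"}
        and isinstance(report.get("evidence_summary"), list)
    )
```

A local validation pass like this is a cheap safety net even when the API enforces the schema server-side.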
🏗️ How We Built It
We followed a strict 8-hour execution plan, focusing intensely on Gemini API integration:
- The Fusion Pipeline
We engineered a pipeline where Text and Image data are ingested raw, but Structured Data (Labs and Vitals) is first passed through a Data Abstraction Layer. This Python layer analyzes the data (e.g., checks if Lactate > 3.0) and converts it into concise, interpretive language features before submission. This ensures the LMM reasons over meaning, not numerical noise.
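A minimal sketch of that Data Abstraction Layer, assuming a plain dict of lab values (the thresholds and field names here are illustrative, not the project's exact clinical cutoffs):

```python
# Sketch of the Data Abstraction Layer: raw lab values are mapped to short
# interpretive phrases before being sent to the model, so the LMM reasons
# over meaning rather than numerical noise. Thresholds are illustrative.
def describe_labs(labs: dict) -> list[str]:
    features = []
    lactate = labs.get("lactate_mmol_l")
    if lactate is not None:
        if lactate > 3.0:
            features.append(
                f"Lactate {lactate} mmol/L: critically elevated, possible hypoperfusion"
            )
        elif lactate > 2.0:
            features.append(f"Lactate {lactate} mmol/L: mildly elevated")
        else:
            features.append(f"Lactate {lactate} mmol/L: within normal range")
    wbc = labs.get("wbc_10e9_l")
    if wbc is not None and (wbc > 12.0 or wbc < 4.0):
        features.append(
            f"WBC {wbc} x10^9/L: abnormal, consistent with possible infection"
        )
    return features
```

The resulting phrases are joined into the prompt alongside the raw notes and image, so every number arrives already interpreted.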
- Dual Function Calling
We built a complex Multi-Turn Conversation Loop (utils.py) to manage agentic behavior:
The agent is configured to use the calculate_sepsis_risk tool to ensure the final triage is based on an algorithmic risk score.
The agent is required to use the generate_vitals_visualization tool, which executes Matplotlib code on the host, renders a chart image, and returns it to the LMM as a Base64 image. The agent then performs self-referential multimodal reasoning by analyzing the chart it just created.
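The two tools could look roughly like the following sketch. The sepsis tool is shown here as a qSOFA-style score (respiratory rate >= 22/min, systolic BP <= 100 mmHg, GCS < 15), which is an assumption standing in for the project's actual algorithm; the parameter names are likewise illustrative.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the tool can run server-side
import matplotlib.pyplot as plt

def calculate_sepsis_risk(resp_rate: int, systolic_bp: int, gcs: int) -> dict:
    """Illustrative stand-in for the project's sepsis tool, using qSOFA-style
    criteria: each met criterion adds one point; a score >= 2 flags high risk."""
    score = sum([resp_rate >= 22, systolic_bp <= 100, gcs < 15])
    return {"qsofa_score": score, "high_risk": score >= 2}

def generate_vitals_visualization(timestamps: list[str], heart_rates: list[float]) -> str:
    """Render a vitals trend chart and return it as a Base64-encoded PNG,
    mirroring how the chart image is passed back to the LMM."""
    fig, ax = plt.subplots(figsize=(4, 2.5))
    ax.plot(timestamps, heart_rates, marker="o")
    ax.set_title("Heart rate trend")
    ax.set_ylabel("bpm")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

In the multi-turn loop, the Base64 string returned by the second tool is wrapped as an inline image part in the next request, which is what enables the agent to inspect its own chart.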
- UX and Transparency
We polished the interface into a three-column dashboard using Streamlit, featuring dynamic color-coded triage badges and a verbose Reasoning Trace that logs every tool call ([ACTION]) and every host response ([OBSERVATION]) for full auditability.
⚙️ Challenges We Ran Into
- API Conflict: JSON vs. Function Calling
The most significant technical hurdle was a 400 INVALID_ARGUMENT error, which revealed that the model cannot combine Function Calling and Structured JSON Output in the same request.
Solution: We implemented a Dynamic Configuration Strategy in utils.py, forcing the model to switch configurations mid-conversation: Tools ON, JSON OFF for the call turn, and then Tools OFF, JSON ON for the final synthesis turn. This was a challenging but necessary engineering fix to preserve both core features.
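The strategy above can be sketched as a per-turn config builder. This is shown with plain dicts to illustrate the switching logic; in the real loop these values would map onto the Gemini SDK's generation config and tool declarations, and the tool names listed here come from the tools described earlier.

```python
# Sketch of the Dynamic Configuration Strategy: the request config is rebuilt
# each turn so function calling and JSON-mode output are never enabled in the
# same request, avoiding the 400 INVALID_ARGUMENT conflict.
TOOL_DECLARATIONS = ["calculate_sepsis_risk", "generate_vitals_visualization"]

def build_turn_config(final_synthesis: bool) -> dict:
    if final_synthesis:
        # Tools OFF, JSON ON: the model must emit the schema-enforced report.
        return {"tools": None, "response_mime_type": "application/json"}
    # Tools ON, JSON OFF: the model is free to emit function calls.
    return {"tools": TOOL_DECLARATIONS, "response_mime_type": None}
```

The conversation loop calls this once per turn, flipping `final_synthesis` to True only after all tool observations have been fed back to the model.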
- Model Instability and Quota
We faced 429 RESOURCE_EXHAUSTED errors using high-tier models.
Lesson: This forced us to prioritize stability by switching to the high-throughput gemini-2.5-flash. While we targeted the higher reasoning of Pro, our ability to run the full complex architecture on a high-speed stack proves the robustness and efficiency of the MCTA pipeline.
🏆 Accomplishments We're Proud Of
Full Four-Modality Synthesis: Successfully demonstrating synthesis across Text, Image, Tabular, and Time-Series data in a single, coherent workflow.
Self-Referential Agent: Achieving the complex loop where the agent autonomously generates a Matplotlib chart, receives the Base64 image, and integrates that visual data into its final diagnosis.
Robust Error Handling: The agent's logic is protected by Exponential Backoff and the Dynamic Configuration Strategy, keeping the system responsive even under transient API instability.
High-Quality UX: Translating complex technical output (tool logs, JSON data) into a clear, professional, color-coded dashboard.
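The exponential backoff mentioned above follows a standard pattern; a minimal sketch (the function name and jitter window are our own choices for illustration):

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=1.0, retriable=(Exception,)):
    """Retry `fn` with exponential backoff plus jitter, the pattern used to
    ride out transient errors such as 429 RESOURCE_EXHAUSTED."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt; jitter avoids synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Each Gemini API call in the loop is wrapped this way, so a burst of quota errors degrades into a short pause rather than a crashed session.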
📚 What We Learned
We gained deep knowledge regarding the constraints and best practices of advanced LMM integration:
LMMs Require Feature Engineering: Raw numerical data must be translated into contextual language features for effective reasoning.
Structured Output is Fragile: Combining Function Calling and Structured Output requires sophisticated dynamic configuration to avoid API conflicts.
Architectural Robustness Wins: A stable, complex architecture running on a high-throughput model like gemini-2.5-flash is more valuable for deployment than a fragile system running on a top-tier reasoning model.
⭐️ What's Next for Multimodal Clinical Triage Agent
Full Data Integration: Implement dynamic frontend inputs for blood pressure and respiratory rate to eliminate current mock variables for the Sepsis Risk calculation.
Real-time Vitals Streaming: Integrate with a mock data stream generator to simulate real-time patient monitoring, allowing the agent to continuously reassess triage priority.
Expanded Medical Tools: Integrate additional tools for specialized scoring, such as Glasgow Coma Scale (GCS) or Wells' Criteria, further increasing the agent's quantitative grounding.