About the Project: Multimodal Clinical Triage Agent (MCTA)

The MCTA is an application built to address the data-synthesis problem in emergency triage. It moves beyond basic AI chatbots by combining the Gemini API's multimodal input, function calling, and structured output to perform verifiable, tool-grounded clinical reasoning across four distinct data modalities simultaneously.

💡 Inspiration

Our primary motivation stemmed from the limitations of current medical AI, which often fails in high-stakes scenarios because it relies solely on text. Such a system cannot reconcile conflicting patient data, for example a low-risk verbal report contradicted by a critical lab value or an alarming X-ray image.

Our goal was to build a system that achieves cross-modal synthesis, mimicking an expert clinician's ability to seamlessly integrate:

Patient narrative (Text)

Objective imaging (Image)

Calculated metrics (Labs/Vitals)

✅ What it Does

The MCTA functions as a Mission Control Dashboard for emergency room staff, delivering a comprehensive, verifiable diagnostic report in seconds.

Four-Modality Fusion: Accepts and synthesizes Text (notes), Image (X-ray), Tabular Data (live lab entries), and Time-Series Vitals (trends).

Autonomous Tool Orchestration: Recognizes when its internal reasoning requires external confirmation and autonomously calls two specialized Python tools for quantitative grounding.

Structured, Transparent Output: Generates a final, schema-enforced JSON report that includes the triage urgency (RED/YELLOW/GREEN) and a traceable summary of the evidence.
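
The report schema itself is not reproduced in this writeup; the following is a minimal sketch of what a schema-enforced report model could look like, using Pydantic. Field names are illustrative assumptions, not the project's actual schema.

```python
# Illustrative sketch only: field names are hypothetical, not the project's actual schema.
from enum import Enum
from pydantic import BaseModel


class TriageLevel(str, Enum):
    RED = "RED"        # immediate, life-threatening
    YELLOW = "YELLOW"  # urgent, needs timely workup
    GREEN = "GREEN"    # non-urgent


class TriageReport(BaseModel):
    triage_level: TriageLevel
    summary: str                  # traceable synthesis of the evidence
    evidence: list[str]           # e.g. "Lactate 4.1 mmol/L exceeds the 3.0 threshold"
    tool_results: dict[str, str]  # outputs of the quantitative grounding tools
```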

๐Ÿ—๏ธ How We Built It

We followed a strict 8-hour execution plan, focused on Gemini API integration:

  1. The Fusion Pipeline

We engineered a pipeline where Text and Image data are ingested raw, but Structured Data (Labs and Vitals) is first passed through a Data Abstraction Layer. This Python layer analyzes the data (e.g., checks if Lactate > 3.0) and converts it into concise, interpretive language features before submission. This ensures the LMM reasons over meaning, not numerical noise.
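
As an illustration of that abstraction step, here is a minimal sketch; the thresholds and phrasing are assumptions for demonstration and may differ from the project's actual rules.

```python
# Sketch of a Data Abstraction Layer: converts raw lab values into interpretive
# language features before they reach the LMM. Thresholds and phrasing are
# assumptions for demonstration only.

def describe_labs(labs: dict[str, float]) -> list[str]:
    features = []
    lactate = labs.get("lactate_mmol_l")
    if lactate is not None:
        if lactate > 3.0:
            features.append(
                f"Lactate is critically elevated at {lactate} mmol/L (> 3.0), "
                "suggesting possible tissue hypoperfusion."
            )
        else:
            features.append(f"Lactate is acceptable at {lactate} mmol/L.")
    wbc = labs.get("wbc_10e9_per_l")
    if wbc is not None and wbc > 12.0:
        features.append(
            f"WBC is elevated at {wbc} x10^9/L, consistent with infection or stress response."
        )
    return features


# The LMM receives sentences like these instead of raw numbers:
print(describe_labs({"lactate_mmol_l": 4.1, "wbc_10e9_per_l": 15.2}))
```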

  2. Dual Function Calling

We built a complex Multi-Turn Conversation Loop (utils.py) to manage agentic behavior:

The agent is configured to use the calculate_sepsis_risk tool to ensure the final triage is based on an algorithmic risk score.

The agent is mandated to use the generate_vitals_visualization tool, which executes Matplotlib code on the host, creates a chart image, and returns the Base64 image to the LMM. The agent then performs self-referential multimodal reasoning by analyzing the chart it just created.
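
Below are hedged sketches of what the two tools could look like. The sepsis scoring logic is a toy stand-in (loosely modeled on qSOFA-style criteria) and the exact signatures are assumptions; the Matplotlib-to-Base64 pattern follows the behavior described above.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering on the host
import matplotlib.pyplot as plt


def calculate_sepsis_risk(resp_rate: int, systolic_bp: int, lactate: float) -> dict:
    """Toy sepsis-risk score loosely modeled on qSOFA-style criteria.
    The project's actual algorithm and inputs may differ."""
    score = int(resp_rate >= 22) + int(systolic_bp <= 100) + int(lactate > 2.0)
    return {"score": score, "high_risk": score >= 2}


def generate_vitals_visualization(timestamps: list[str], heart_rates: list[float]) -> str:
    """Render a vitals trend chart and return it as a Base64-encoded PNG,
    so the agent can analyze the image it just requested."""
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(timestamps, heart_rates, marker="o")
    ax.set_xlabel("Time")
    ax.set_ylabel("Heart rate (bpm)")
    ax.set_title("Vitals trend")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```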

  3. UX and Transparency

We polished the interface into a three-column dashboard using Streamlit, featuring dynamic color-coded triage badges and a verbose Reasoning Trace that logs every tool call ([ACTION]) and every host response ([OBSERVATION]) for full auditability.
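
A minimal sketch of that layout pattern in Streamlit follows; widget choices, variable names, and sample values are illustrative assumptions, not the project's actual dashboard code.

```python
import streamlit as st

# Layout sketch: three-column dashboard with a color-coded triage badge and a
# reasoning trace. Sample data below is illustrative only.
report = {"triage_level": "RED", "summary": "Findings consistent with early sepsis."}
trace = [
    "[ACTION] calculate_sepsis_risk(...)",
    "[OBSERVATION] {'score': 2, 'high_risk': True}",
]

col_inputs, col_report, col_trace = st.columns(3)  # col_inputs would hold data-entry widgets

with col_report:
    level = report["triage_level"]
    badge = {"RED": st.error, "YELLOW": st.warning, "GREEN": st.success}[level]
    badge(f"Triage: {level}")
    st.write(report["summary"])

with col_trace:
    with st.expander("Reasoning Trace", expanded=True):
        for line in trace:
            st.code(line)
```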

โš™๏ธ Challenges We Ran Into

  1. API Conflict: JSON vs. Function Calling

The most significant technical hurdle was a 400 INVALID_ARGUMENT error, which revealed that the API rejects requests that enable Function Calling and Structured JSON Output at the same time.

Solution: We implemented a Dynamic Configuration Strategy in utils.py, forcing the model to switch configurations mid-conversation: Tools ON, JSON OFF for the call turn, and then Tools OFF, JSON ON for the final synthesis turn. This was a challenging but necessary engineering fix to preserve both core features.
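
A sketch of what that mid-conversation switch could look like with the google-genai SDK is shown below; the exact code in utils.py is not reproduced here, so the configuration details and conversation plumbing are assumptions based on the description above.

```python
from google import genai
from google.genai import types

# Assumes GEMINI_API_KEY (or GOOGLE_API_KEY) is set in the environment.
client = genai.Client()
MODEL = "gemini-2.5-flash"

history = ["Patient presents with fever, confusion, and a lactate of 4.1 mmol/L."]

# Call turn: Tools ON, JSON OFF. The tool functions are the sketches shown earlier.
# (Passing Python callables turns on the SDK's automatic function calling; the
# project's utils.py manages the call/observation loop manually instead.)
tool_config = types.GenerateContentConfig(
    tools=[calculate_sepsis_risk, generate_vitals_visualization],
)
first = client.models.generate_content(model=MODEL, contents=history, config=tool_config)

# ...append the tool calls and their results to `history`, then:

# Synthesis turn: Tools OFF, JSON ON. TriageReport is the Pydantic sketch shown earlier.
json_config = types.GenerateContentConfig(
    response_mime_type="application/json",
    response_schema=TriageReport,
)
final = client.models.generate_content(model=MODEL, contents=history, config=json_config)
print(final.text)  # schema-enforced JSON report
```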

  2. Model Instability and Quota

We faced 429 RESOURCE_EXHAUSTED errors using high-tier models.

Lesson: This forced us to prioritize stability by switching to the high-throughput gemini-2.5-flash. While we had targeted the stronger reasoning of Pro, running the full architecture on a high-throughput stack demonstrated the robustness and efficiency of the MCTA pipeline.

🎉 Accomplishments We're Proud Of

Full Four-Modality Synthesis: Successfully demonstrating synthesis across Text, Image, Tabular, and Time-Series data in a single, coherent workflow.

Self-Referential Agent: Achieving the complex loop where the agent autonomously generates a Matplotlib chart, receives the Base64 image, and integrates that visual data into its final diagnosis.

Robust Error Handling: The agent's logic is protected by exponential backoff and the Dynamic Configuration Strategy, keeping the pipeline responsive even under transient API failures (a generic sketch of the backoff pattern follows this list).

High-Quality UX: Translating complex technical output (tool logs, JSON data) into a clear, professional, color-coded dashboard.
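
Exponential backoff is a standard retry pattern; here is a generic sketch, with attempt counts and delays that are illustrative rather than the project's exact values.

```python
import random
import time


def with_backoff(call, max_attempts: int = 5):
    """Retry a flaky API call with exponential backoff plus jitter.
    Attempt counts and delays are illustrative, not the project's exact values."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # e.g. a 429 RESOURCE_EXHAUSTED response
            if attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) + random.random())


# Example: result = with_backoff(lambda: client.models.generate_content(...))
```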

🎓 What We Learned

We gained deep knowledge regarding the constraints and best practices of advanced LMM integration:

LMMs Require Feature Engineering: Raw numerical data must be translated into contextual language features for effective reasoning.

Structured Output is Fragile: Combining Function Calling and Structured Output requires sophisticated dynamic configuration to avoid API conflicts.

Architectural Robustness Wins: A stable, complex architecture running on a high-throughput model like gemini-2.5-flash is more valuable for deployment than a fragile system running on a top-tier reasoning model.

โญ๏ธ What's Next for Multimodal Clinical Triage Agent

Full Data Integration: Implement dynamic frontend inputs for blood pressure and respiratory rate to eliminate current mock variables for the Sepsis Risk calculation.

Real-time Vitals Streaming: Integrate with a mock data stream generator to simulate real-time patient monitoring, allowing the agent to continuously reassess triage priority.

Expanded Medical Tools: Integrate additional tools for specialized scoring, such as Glasgow Coma Scale (GCS) or Wells' Criteria, further increasing the agent's quantitative grounding.

Built With

  • Programming language: Python 3.11+
  • Generative AI: Gemini API (gemini-2.5-flash model), google-genai SDK
  • Frontend / UI libraries: Streamlit, Pandas (for data editing)
  • Visualization: Matplotlib