Multimodal Diagnostic Copilot
Inspiration
Diagnostic errors in healthcare often occur because specialists work in isolation. Radiologists frequently analyze medical imaging without access to the patient's full clinical history, while primary care physicians read clinical notes without seeing the raw pixels of the scan. In up to 15% of cases, this disconnect leads to missed patterns and misdiagnoses.
We wanted to build a clinical decision support system that actually reflects how a medical specialist thinks. Instead of relying on isolated text searches or standalone image classification, the goal was to create an intelligent tool capable of simultaneous visual and semantic reasoning. Furthermore, we needed to address the primary hurdle of medical AI: hallucinations. By grounding every diagnostic suggestion in verifiable historical precedents, we ensure the system operates with forensic traceability and transparency.
What it does
The Multimodal Diagnostic Copilot is a Retrieval-Augmented Generation (RAG) system that connects new patient data to historical, verified medical outcomes.
When a clinician inputs a patient's chest X-ray and their current symptoms, the system converts both the image and the text into high-dimensional mathematical vectors. It then queries a specialized database to retrieve the top historical cases that match both the visual anomalies and the physical symptoms. Finally, an agentic AI copilot synthesizes these retrieved cases into a structured differential diagnosis, explicitly citing the specific historical records used to formulate its clinical claims.
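As a rough illustration of the retrieval step described above, here is a minimal pure-Python sketch that scores a query against an image collection and a text collection and returns the cases that rank well in both. The writeup does not say how the two result sets are combined, so the mean-score fusion rule, the function names, and the toy vectors are all assumptions made for illustration. In the real system the vectors would be 512- and 384-dimensional and the search would run inside the vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def score_all(query, collection):
    """Map every case id in the collection to its similarity with the query."""
    return {case_id: cosine_similarity(query, vec) for case_id, vec in collection.items()}


def retrieve_cases(image_query, text_query, image_collection, text_collection, k=3):
    """Score each modality separately, then rank by mean score.

    The mean-score fusion is an assumption, not the project's documented method.
    """
    image_scores = score_all(image_query, image_collection)
    text_scores = score_all(text_query, text_collection)
    shared_ids = image_scores.keys() & text_scores.keys()
    fused = [(cid, (image_scores[cid] + text_scores[cid]) / 2.0) for cid in shared_ids]
    fused.sort(key=lambda pair: pair[1], reverse=True)
    return fused[:k]


# Toy low-dimensional vectors standing in for real image and text embeddings.
image_collection = {
    "case_a": [1.0, 0.0, 0.0],
    "case_b": [0.0, 1.0, 0.0],
    "case_c": [0.7, 0.7, 0.0],
}
text_collection = {
    "case_a": [1.0, 0.0],
    "case_b": [0.0, 1.0],
    "case_c": [0.6, 0.8],
}

results = retrieve_cases([0.9, 0.1, 0.0], [1.0, 0.1], image_collection, text_collection, k=2)
for case_id, score in results:
    print(case_id, round(score, 3))
```

Keeping the per-modality scores separate until the final ranking mirrors the idea of matching both the visual anomalies and the symptoms. Other fusion rules, such as weighted sums or rank-based merging, would fit the same structure.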
How we built it
The architecture relies on high-speed vector retrieval and isolated data environments.
- Data Ingestion and Structuring: We utilized the OpenI Chest X-Ray dataset. We built a custom XML parser to extract the "Findings" and "Impression" sections from over 7,000 clinical reports, cleanly mapping them to their corresponding radiograph image files.
- Dual-Embedding Strategy: To handle the multimodal inputs, we deployed two distinct models. We used OpenAI's CLIP (ViT-B/32) to generate 512-dimensional vectors for the X-ray images, and HuggingFace Sentence-Transformers (all-MiniLM-L6-v2) to generate 384-dimensional vectors for the clinical text.
- Actian VectorAI DB Storage: We deployed Actian VectorAI DB locally via Docker. Actian was chosen for its sub-15ms query latency and Single Instruction, Multiple Data (SIMD) vectorized processing capabilities. We created two separate vector collections (cxr_text and cxr_images) to prevent the visual and semantic data from interfering with each other. The retrieval logic relies on cosine similarity to find the nearest neighbors in the vector space: $$\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
- Agentic Copilot Integration: We integrated Google Gemini 1.5 Flash to handle the final reasoning step. The retrieved context from Actian is passed to Gemini, which acts as a conversational interface where doctors can interrogate the AI's logic (e.g., "Why did you rule out pneumonia?").
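The Data Ingestion and Structuring step in the list above could look something like the following sketch, which uses only the Python standard library. The tag names (`AbstractText` with a `Label` attribute, `parentImage` with an `id` attribute) reflect our understanding of the OpenI report layout and should be treated as assumptions. The sample report, its image ids, and the `parse_report` helper are invented for illustration and are not taken from the dataset or the project's code.

```python
import xml.etree.ElementTree as ET

# Synthetic report for illustration only; the tag layout is an assumption
# about the OpenI XML format, and the text is not copied from the dataset.
SAMPLE_REPORT = """<eCitation>
  <MedlineCitation>
    <Article>
      <Abstract>
        <AbstractText Label="INDICATION">Cough for two weeks.</AbstractText>
        <AbstractText Label="FINDINGS">Lungs are clear. Heart size is normal.</AbstractText>
        <AbstractText Label="IMPRESSION">No acute cardiopulmonary abnormality.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
  <parentImage id="example_image_001"/>
  <parentImage id="example_image_002"/>
</eCitation>"""


def parse_report(xml_text):
    """Extract the Findings and Impression text plus the linked image ids."""
    root = ET.fromstring(xml_text)
    sections = {}
    for node in root.iter("AbstractText"):
        label = (node.get("Label") or "").upper()
        if label in ("FINDINGS", "IMPRESSION"):
            # Some reports leave a section empty, so guard against None.
            sections[label.lower()] = (node.text or "").strip()
    image_ids = [img.get("id") for img in root.iter("parentImage") if img.get("id")]
    return {
        "findings": sections.get("findings", ""),
        "impression": sections.get("impression", ""),
        "image_ids": image_ids,
    }


record = parse_report(SAMPLE_REPORT)
print(record)
```

Each resulting record pairs the report text with its image ids, which is the mapping needed before the text goes to the sentence encoder and the images go to CLIP.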
Challenges we ran into
Building a complex system within 36 hours exposed several integration challenges.
- Dependency Resolution: Setting up the environment meant working through severe version conflicts. In particular, the outdated Python packages for OpenAI's CLIP model required legacy versions of PyTorch. We resolved this by installing directly from the GitHub repository and modifying the requirements file.
- Environment Conflicts: We encountered Protobuf and gRPC version incompatibilities when connecting the application to the Actian database. We worked around them by setting environment variables that force the pure-Python implementations, bypassing the native C++ bindings.
- Citation Enforcement: Early iterations of the Gemini copilot would generalize its diagnostic reasoning. We had to implement strict prompt engineering to force the model to output forensic, hyperlinked citations mapped directly to the Actian XML payloads.
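One way to back up the citation-enforcement prompt described above is to check the model's answer after it comes back. The sketch below builds a prompt that lists the retrieved cases and then verifies that every citation in the answer refers to one of them. The `[CASE:<id>]` citation format, the function names, and the example records are hypothetical. The project's actual prompt and payload format are not shown in this writeup, and no Gemini API call is made here.

```python
import re

# Hypothetical citation format: [CASE:<id>]
CITATION_PATTERN = re.compile(r"\[CASE:([A-Za-z0-9_\-]+)\]")


def build_prompt(symptoms, retrieved_cases):
    """Assemble a prompt that exposes case ids and demands a citation per claim."""
    lines = [
        "You are assisting a clinician. Use ONLY the retrieved cases below.",
        "Every clinical claim must end with a citation in the form [CASE:<id>].",
        "",
        f"Current symptoms: {symptoms}",
        "",
        "Retrieved cases:",
    ]
    for case in retrieved_cases:
        lines.append(f"[CASE:{case['id']}] {case['findings']}")
    return "\n".join(lines)


def check_citations(answer, retrieved_cases):
    """Report which cited ids are valid and whether the answer is acceptable."""
    allowed = {case["id"] for case in retrieved_cases}
    cited = set(CITATION_PATTERN.findall(answer))
    unknown = sorted(cited - allowed)
    return {
        "cited": sorted(cited),
        "unknown": unknown,
        # Acceptable only if something is cited and nothing is made up.
        "ok": bool(cited) and not unknown,
    }


# Invented example records and answers.
cases = [
    {"id": "case_12", "findings": "Right lower lobe consolidation."},
    {"id": "case_40", "findings": "Clear lungs, normal heart size."},
]
prompt = build_prompt("productive cough, fever", cases)
grounded = "The consolidation pattern resembles a prior record [CASE:case_12]."
ungrounded = "This is likely pneumonia [CASE:case_99]."

print(check_citations(grounded, cases))
print(check_citations(ungrounded, cases))
```

An answer that fails the check could be rejected or regenerated, so the audit trail does not depend on the prompt alone.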
Accomplishments that we're proud of
We successfully built a multimodal architecture that achieves a 100% self-retrieval accuracy rate (a score of 1.0000): every known case we queried against the database came back as its own top match.
Beyond the raw metrics, we are proud to have built a system that prioritizes "Traceable Medicine." Unlike standard conversational models that generate text without grounding, our system provides an audit trail. Every clinical assertion is backed by a specific, retrievable historical record.
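For readers unfamiliar with the self-retrieval metric mentioned above, the following sketch shows one plausible way to compute it: each stored vector is used as a query, and the check passes when its own id is the nearest neighbor. The function names and toy vectors are illustrative assumptions, and the project's actual evaluation script may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_id(query, collection):
    """Return the id of the stored vector most similar to the query."""
    return max(collection, key=lambda cid: cosine_similarity(query, collection[cid]))


def self_retrieval_accuracy(collection):
    """Fraction of stored cases that retrieve themselves as the top match."""
    hits = sum(1 for cid, vec in collection.items() if nearest_id(vec, collection) == cid)
    return hits / len(collection)


# Toy vectors pointing in distinct directions, so each one is its own best match.
collection = {
    "case_a": [1.0, 0.0],
    "case_b": [0.0, 1.0],
    "case_c": [1.0, 1.0],
}
accuracy = self_retrieval_accuracy(collection)
print(f"{accuracy:.4f}")
```

Because cosine similarity ignores vector length, two stored vectors that point in the same direction tie with each other, which is the main way a self-retrieval score can drop below 1.0 in a check like this.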
What we learned
Building this reinforced that true AI innovation requires a heavy focus on systems engineering and architecture, not just API calls. We learned how to manage high-dimensional arrays, deploy enterprise-grade vector databases inside isolated Docker containers, and coordinate cross-modal attention mechanisms. We also learned the importance of structural logic when applying technology to high-stakes fields like healthcare.
What's next for Multimodal Diagnostic Copilot
The immediate next step is expanding the vector database to handle localized diagnostic retrieval. By implementing a cross-modal attention layer, the system will be able to highlight the specific sector of an X-ray (such as the posterior costophrenic angle) that directly corresponds to a text query. We also plan to integrate a "Human-in-the-Loop" feedback mechanism, allowing clinicians to downvote irrelevant retrieved cases to actively fine-tune the Actian database's retrieval weights.