Inspiration

Medical imaging is incredibly powerful, but for most patients it is completely inaccessible. After receiving an MRI or CT scan, people are often handed files they cannot interpret, forcing them to rely entirely on specialists for even basic understanding. We were inspired by this gap between advanced medical technology and patient comprehension. Graphfol was built to make medical data understandable, interactive, and personal. We wanted to create a system where someone could explore their own scan in 3D, ask questions naturally, and receive clear, human-readable explanations in real time.

What it does

Graphfol is built as a full-stack pipeline that converts raw medical imaging data into interactive visual experiences in the browser. On the frontend, a lightweight interface written in HTML, CSS, and JavaScript handles rendering and interaction. NIfTI scans are visualized in 2D using NiiVue, allowing users to scroll through axial, sagittal, and coronal slices with segmentation overlays. In parallel, Three.js renders a 3D model generated from the same scan, where each anatomical structure is represented as an interactive mesh. A pivot-based grouping system ensures smooth rotation and scaling, enabling users to explore the model naturally from any angle.
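As a rough illustration of the 2D side, the sketch below loads a NIfTI scan plus a segmentation overlay into NiiVue and switches to a multiplanar (axial/sagittal/coronal) view. The canvas id, file paths, and colormap are placeholder assumptions, not the exact wiring used in Graphfol.

```javascript
// Minimal NiiVue sketch: base scan plus a semi-transparent segmentation overlay.
// Canvas id, URLs, and colormap are illustrative placeholders.
import { Niivue } from '@niivue/niivue';

const nv = new Niivue();
await nv.attachTo('scan-canvas'); // expects <canvas id="scan-canvas"></canvas>

await nv.loadVolumes([
  { url: '/uploads/scan.nii.gz' },                                        // base volume
  { url: '/uploads/segmentation.nii.gz', colormap: 'red', opacity: 0.5 }, // labeled overlay
]);

// Show axial, sagittal, and coronal slices side by side.
nv.setSliceType(nv.sliceTypeMultiplanar);
```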

The backend is split between a Node.js/Express server and a Python-based segmentation service. The Node server manages file uploads, routing, and API coordination, handling large NIfTI files efficiently through streaming and multipart processing. These scans are passed to a Python service running TotalSegmentator, which performs deep learning–based segmentation to identify and label over 100 anatomical structures. The output includes both labeled volumetric data for 2D overlays and GLB mesh files for 3D rendering. The system also extracts metadata such as structure names, volumes, and health indicators, which are returned to the frontend for interaction and analysis.
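A minimal sketch of that upload hop is shown below, assuming Node 18+ (built-in fetch/FormData/Blob); the route path, form field name, and segmentation endpoint URL are assumptions for illustration only.

```javascript
// Sketch: accept a NIfTI upload via multipart form data and forward it to the
// Python segmentation service. Paths, field names, and the service URL are assumed.
const express = require('express');
const multer = require('multer');
const fs = require('fs/promises');

const app = express();
const upload = multer({ dest: 'uploads/' }); // buffer multipart uploads to disk

app.post('/api/upload', upload.single('scan'), async (req, res) => {
  const bytes = await fs.readFile(req.file.path);

  const form = new FormData();
  form.append('file', new Blob([bytes]), req.file.originalname);

  // Hypothetical endpoint exposed by the Python/TotalSegmentator service.
  const segResponse = await fetch('http://localhost:8000/segment', {
    method: 'POST',
    body: form,
  });

  // Expected to reference the labeled volume, GLB meshes, and per-structure metadata.
  res.json(await segResponse.json());
});

app.listen(3000);
```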

On top of this pipeline, Graphfol integrates real-time interaction through gesture control, voice input, and AI-driven reasoning. MediaPipe Hands tracks 3D hand landmarks from the webcam to enable touchless control of the model, including rotation, scaling, and selection. Voice queries are captured using the Web Speech API, processed on the backend, and routed to language models that generate contextual explanations based on the full scan data. A parallel intent classification system determines user actions and triggers corresponding visual responses in the AR environment. Responses are then converted into natural-sounding audio using ElevenLabs text-to-speech, creating a seamless multimodal interface where users can both see and hear insights about their scans.
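The voice leg of that loop can be sketched with the Web Speech API as below; the `/api/ask` route and the response fields (`answer`, `audioUrl`) are hypothetical names standing in for the real backend contract.

```javascript
// Sketch of voice capture with the Web Speech API. The /api/ask endpoint and
// the response payload shape are illustrative assumptions.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.interimResults = false;

recognition.onresult = async (event) => {
  const question = event.results[0][0].transcript;

  // The backend classifies intent and builds a scan-aware prompt for the model.
  const res = await fetch('/api/ask', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ question }),
  });
  const { answer, audioUrl } = await res.json();

  new Audio(audioUrl).play(); // ElevenLabs audio returned by the backend (assumed field)
  console.log(answer);
};

recognition.start(); // typically triggered by a button press or gesture
```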

How we built it

Graphfol was built as a full-stack system that transforms raw medical scans into interactive visual and conversational experiences in the browser. On the frontend, we used HTML, CSS, and JavaScript to create a lightweight interface powered by NiiVue for 2D scan exploration and Three.js for real-time 3D rendering. Users can move seamlessly between slice-based views and fully interactive anatomical models, with each segmented structure represented as a selectable mesh. The system is designed around a pivot-based scene graph, allowing smooth rotation and scaling across both standard and augmented reality views.
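The pivot-based scene graph can be sketched roughly as follows: every loaded mesh is parented to a single Three.js `Group`, so rotation and scaling always happen around one shared origin. The GLB path is a placeholder, and the centering logic is a simplified assumption.

```javascript
// Sketch of the pivot-based grouping: anatomical meshes are parented to one
// pivot Group so all interaction transforms share a common origin.
import * as THREE from 'three';
import { GLTFLoader } from 'three/addons/loaders/GLTFLoader.js';

const scene = new THREE.Scene();
const pivot = new THREE.Group();
scene.add(pivot);

const loader = new GLTFLoader();
loader.load('/models/segmentation.glb', (gltf) => {
  // Recenter the model so the pivot sits at its geometric center.
  const box = new THREE.Box3().setFromObject(gltf.scene);
  const center = box.getCenter(new THREE.Vector3());
  gltf.scene.position.sub(center);
  pivot.add(gltf.scene);
});

// Input handlers (mouse, gesture, or AR controls) only ever touch the pivot.
function rotateModel(deltaX, deltaY) {
  pivot.rotation.y += deltaX;
  pivot.rotation.x += deltaY;
}

function scaleModel(factor) {
  pivot.scale.multiplyScalar(factor);
}
```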

On the backend, we combined a Node.js/Express server with a Python-based segmentation pipeline. The Node server handles large NIfTI uploads, streaming, and API orchestration, while the Python service runs TotalSegmentator to extract over 100 anatomical structures from each scan. These outputs are converted into both volumetric overlays and optimized 3D mesh files for rendering. Alongside the geometry, the system computes structured metadata such as volumes, spatial relationships, and health indicators, which form the foundation for higher-level reasoning and interaction.
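For concreteness, the per-structure metadata handed to the frontend might look something like the object below; the field names and values are illustrative assumptions rather than the exact schema.

```javascript
// Illustrative (assumed) shape of the metadata returned per scan: each entry
// pairs a labeled structure with its computed volume and a simple status flag.
const exampleMetadata = {
  scanType: 'abdomen',
  structures: [
    { name: 'liver',        volumeMl: 1480, status: 'normal' },
    { name: 'spleen',       volumeMl: 212,  status: 'normal' },
    { name: 'kidney_right', volumeMl: 155,  status: 'review' },
  ],
};

// The same object drives 2D overlay labels, 3D mesh selection,
// and the context passed to the reasoning model.
```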

A core part of Graphfol is its integration of the K2 Think V2 model as a reasoning engine rather than a simple text generator. Instead of analyzing structures in isolation, we construct prompts that include the full set of detected anatomical data, enabling K2 to perform multi-step reasoning over the entire scan as a connected system. This allows the model to generate context-aware explanations, comparisons, and interpretations that reflect relationships between structures, not just individual labels. K2 is also used in the voice interaction pipeline, where user queries are interpreted against the full scan state to produce grounded, patient-facing answers. These responses are then converted into natural speech using ElevenLabs text-to-speech, completing a real-time loop where users can ask complex questions and receive both visual and spoken insights.
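A minimal sketch of that prompt construction is below, reusing the illustrative metadata shape from earlier; the `/api/k2` route is a hypothetical backend wrapper, not the actual K2 API surface.

```javascript
// Sketch: fold the whole scan context into one reasoning prompt so the model
// can relate structures to each other. Endpoint and payload are assumptions.
function buildPrompt(metadata, userQuestion) {
  const structureLines = metadata.structures
    .map((s) => `- ${s.name}: ${s.volumeMl} mL (${s.status})`)
    .join('\n');

  return [
    `Scan type: ${metadata.scanType}`,
    'Detected structures:',
    structureLines,
    '',
    'Reason over the scan as a connected system and answer in plain language:',
    userQuestion,
  ].join('\n');
}

async function askK2(metadata, userQuestion) {
  const res = await fetch('/api/k2', {              // hypothetical backend route wrapping the K2 API
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: buildPrompt(metadata, userQuestion) }),
  });
  return (await res.json()).answer;                 // spoken downstream via ElevenLabs
}
```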

Challenges we ran into

System Architecture & Deployment
- Built a multi-server pipeline: browser → Cloudflare tunnel → Node.js proxy → SSH tunnel → FastAPI on remote GPU cluster (a proxy configuration sketch follows this list)
- Managed persistent tunneling infrastructure to expose a private research cluster to the public internet through Cloudflare named tunnels
- Built an auto-detection system that identifies the scan type (brain, lung, leg) and routes it to the appropriate 3D model and analysis pipeline
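The Node proxy hop can be sketched with http-proxy-middleware as below; the local tunnel port, route path, and timeout are assumptions chosen for illustration.

```javascript
// Sketch of the proxy hop: requests to /segment on the public-facing server are
// forwarded to the FastAPI service reached through the SSH tunnel.
const express = require('express');
const { createProxyMiddleware } = require('http-proxy-middleware');

const app = express();

app.use(
  '/segment',
  createProxyMiddleware({
    target: 'http://localhost:8001', // assumed local end of the SSH tunnel to the GPU cluster
    changeOrigin: true,
    proxyTimeout: 10 * 60 * 1000,    // segmentation can take minutes on large scans
  })
);

app.listen(3000); // exposed publicly via a Cloudflare named tunnel
```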

AR & 3D Visualization
- Real-time MediaPipe hand tracking with gesture recognition: pinch-to-scale, open-palm rotation, and point-to-select organs (a pinch-to-scale sketch follows this list)
- Per-structure smart rotation that automatically orients the 3D model so the highlighted organ faces the camera
- Contextual color coding per anatomical structure with smooth eased transitions
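A minimal sketch of the pinch-to-scale gesture is below: MediaPipe Hands reports 21 normalized landmarks per hand, and the distance between the thumb tip (index 4) and index fingertip (index 8) drives the scale factor. The smoothing constant and the `scaleModel` helper are assumptions.

```javascript
// Registered via hands.onResults(onHandResults) on the MediaPipe Hands instance.
// Thumb tip (4) vs. index tip (8) distance maps to a lightly smoothed scale factor.
let previousPinch = null;

function onHandResults(results) {
  if (!results.multiHandLandmarks || results.multiHandLandmarks.length === 0) {
    previousPinch = null;
    return;
  }

  const lm = results.multiHandLandmarks[0];
  const thumbTip = lm[4];
  const indexTip = lm[8];
  const pinch = Math.hypot(thumbTip.x - indexTip.x, thumbTip.y - indexTip.y);

  if (previousPinch !== null) {
    const rawFactor = pinch / previousPinch;
    const factor = 1 + (rawFactor - 1) * 0.5; // damped so the model does not jitter
    scaleModel(factor);                       // e.g. pivot.scale.multiplyScalar(factor)
  }
  previousPinch = pinch;
}
```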

AI Agent Pipeline
- Guided walkthrough mode: agent generates a sequenced tour of structures, each step with speech + visualization actions + TTS audio played sequentially
- Conversation memory across turns for contextual follow-up questions
- ElevenLabs text-to-speech integration with the Web Audio API for precise sequential playback during walkthroughs (a playback sketch follows this list)
- Holistic medical reasoning: the AI receives all structure volumes and health statuses to provide contextual assessments about individual organs
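The sequential playback can be sketched with the Web Audio API as below; the per-step `audioUrl` field and the `highlightStructure` helper are hypothetical names for the app-level pieces.

```javascript
// Sketch: each walkthrough step's TTS clip is decoded and played only after
// the previous clip finishes, so speech and visuals stay in lockstep.
const audioCtx = new (window.AudioContext || window.webkitAudioContext)();

async function playClip(arrayBuffer) {
  const buffer = await audioCtx.decodeAudioData(arrayBuffer);
  return new Promise((resolve) => {
    const source = audioCtx.createBufferSource();
    source.buffer = buffer;
    source.connect(audioCtx.destination);
    source.onended = resolve; // resolve only when playback ends
    source.start();
  });
}

async function runWalkthrough(steps) {
  for (const step of steps) {
    highlightStructure(step.structureName); // visualization action (app-level helper, assumed)
    const res = await fetch(step.audioUrl); // TTS clip generated by the backend (assumed field)
    await playClip(await res.arrayBuffer()); // wait for the speech before advancing
  }
}
```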

Accomplishments that we're proud of

Graphfol successfully delivers a fully interactive medical imaging experience that combines 2D visualization, 3D modeling, augmented reality, and voice interaction into a single platform. One of our key accomplishments is building a system that can take complex scan data and transform it into a format that users can directly explore and interact with in real time. This creates a more engaging and intuitive experience for users, helping them better understand their own anatomy and making medical data feel less abstract and more tangible.

Another major accomplishment is the development of a fully connected reasoning layer powered by K2 Think V2. Instead of treating each anatomical structure separately, the system analyzes the entire scan as a unified context and generates explanations that reflect relationships between structures. This enables deeper, more meaningful insights that go beyond surface-level descriptions. For users and customers, this means receiving clearer, more relevant information that can support better understanding, improve communication with healthcare professionals, and reduce confusion around scan results.

We also built a seamless multimodal interaction system that combines gesture control, voice input, and real-time audio feedback. Users can explore scans hands-free, ask questions naturally, and hear responses generated and delivered through ElevenLabs text-to-speech. This creates a more accessible and flexible experience that adapts to different user needs and environments. For customers, this opens up new possibilities for remote consultations, patient education, and interactive demonstrations, making Graphfol not just a visualization tool but a platform for more effective and engaging healthcare communication.

What we learned

We learned that the hardest problems in building AI-driven products are often not the models themselves, but the systems around them. Moving large medical datasets through a pipeline, rendering them interactively, and keeping latency low is a significant engineering challenge.

We also learned that intuitive interaction matters as much as technical capability. A feature like gesture control or voice Q&A only becomes valuable when it feels natural and responsive, so we ended up designing our own interaction schemes, such as the hand controls, and iterating until they felt user-friendly and easy to understand.

Most importantly, we saw how powerful it is to make complex data understandable. When users can directly explore and question their own medical information, it transforms them from passive recipients into active participants in their healthcare.

What's next for Graphfol

Graphfol is just the beginning of making medical imaging accessible and interactive. The next phase focuses on improving realism, accessibility, and clinical usefulness.

One major step is enabling real-time segmentation so users can upload raw scans and receive fully processed 3D models without relying on demo data. This would make the platform viable for real-world use rather than just demonstrations.

We also plan to support DICOM, the standard format used by hospitals, allowing patients to directly upload scans from clinical environments without needing conversion.

Another key direction is comparative analysis. Users will be able to upload multiple scans taken over time and visualize changes in anatomy, helping track disease progression or recovery.

On the interaction side, we want to expand beyond desktop AR into mobile and tablet experiences, making the platform more practical for bedside use and everyday accessibility.

We are also exploring collaborative modes where patients and clinicians can view and interact with the same scan simultaneously, enabling remote consultations in a shared 3D environment.

Finally, we aim to improve the intelligence layer by making explanations more personalized, context-aware, and clinically grounded, while still remaining easy to understand.

The long-term vision is to turn Graphfol into a universal interface for understanding the human body, where anyone can explore, question, and learn from their own medical data without barriers.

Public Frameworks, Libraries, and APIs Used

- NiiVue for 2D NIfTI rendering
- Three.js and built-in addons (OrbitControls, GLTFLoader, ARButton) for 3D and WebXR
- MediaPipe Hands for hand landmark tracking and gesture input
- Express / Multer / http-proxy-middleware / dotenv for server and upload/proxy infrastructure
- K2 API, DeepSeek API, and ElevenLabs API for language, intent/transcription fallback, and text-to-speech
- BiMediX2 endpoint support (if configured) for model inference compatibility

What Our Team Built in This Repository
- End-to-end web application flow from upload -> visualization -> report -> AR voice interaction
- UI/UX logic for structure selection, highlighting, toggles, and report rendering
- API gateway logic connecting the frontend to segmentation and model services
- Voice interaction orchestration pipeline: question capture -> intent classification -> answer generation -> TTS playback
- AR overlay behavior and gesture-driven controls for scaling, rotating, and selecting structures

What We Did Not Build Here
- The underlying open-source/public frameworks listed above
- Hosted external model and speech providers (K2/DeepSeek/ElevenLabs)
- The segmentation model/service itself (this repo expects a separate backend that provides segmentation endpoints)
