Inspiration
Healthcare is drowning in paper. Despite the push toward interoperability standards like HL7 FHIR, a massive share of clinically meaningful information still arrives as unstructured documents: discharge summaries, visit packets, scanned charts, and multi-patient bundles that can stretch past 200 pages. We've seen firsthand how clinicians spend hours manually reviewing charts, and how quality measurement programs like HEDIS depend on structured signals that are buried deep inside narrative notes and tables. When we read the hackathon challenge document, the line that stuck with us was about HEDIS Transitions of Care measures: something as critical as whether a patient received discharge information or had their medications reconciled comes down to finding the right sentence on the right page of a PDF. That's a solvable problem, and we wanted to solve it the right way: not with a brittle script, but with a system that separates what to extract from how to extract it. The concept of a "skill file", a YAML configuration that defines extraction rules independently from the pipeline code, became our north star. If a healthcare organization wants to add a new section-header alias or change how conditions are classified as current vs. historical, they shouldn't need to redeploy their API. They should be able to edit a config file and re-run.
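To make the idea concrete, a skill file might look something like the fragment below. The field names and values here are illustrative placeholders, not our exact schema:

```yaml
# Illustrative skill-file fragment (keys and values are examples only)
sections:
  discharge_summary:
    aliases: ["Discharge Summary", "DC Summary", "Hospital Course"]
conditions:
  current_markers: ["active", "ongoing"]
  historical_markers: ["history of", "resolved", "s/p"]
output:
  confidence_floor: 0.5
```

Adding a new section-header alias is then a one-line change to the `aliases` list, followed by a re-run, with no code deploy.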
What it does
ClearCare is a config-driven REST API that ingests clinical healthcare PDFs and outputs structured, FHIR-ready JSON. Upload a document and our async pipeline extracts patient demographics, visit-by-visit conditions, medications, lab results, imaging findings, and transition-of-care signals into a validated schema that maps directly to HL7 FHIR R4 resources. All extraction rules live in a YAML "skill file" that you can edit live in the browser through our built-in editor with syntax validation, so healthcare teams can change what gets extracted without touching code. The frontend renders results in expandable patient and visit accordions with evidence citations and confidence scores, and one-click FHIR export generates a standards-compliant Bundle with Patient, Encounter, Condition, Observation, and MedicationStatement resources.
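The exported Bundle looks roughly like the abbreviated sketch below. Resource contents and IDs are placeholders; a real export contains many more fields and entries:

```json
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    { "resource": { "resourceType": "Patient", "id": "patient-1",
        "name": [{ "family": "Doe", "given": ["Jane"] }] } },
    { "resource": { "resourceType": "Encounter", "id": "encounter-1",
        "status": "finished",
        "subject": { "reference": "Patient/patient-1" } } },
    { "resource": { "resourceType": "Condition", "id": "condition-1",
        "subject": { "reference": "Patient/patient-1" },
        "encounter": { "reference": "Encounter/encounter-1" },
        "code": { "text": "Type 2 diabetes mellitus" } } }
  ]
}
```

Every Condition, Observation, and MedicationStatement carries `subject` and `encounter` references back to the Patient and Encounter entries, so the Bundle resolves internally.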
How we built it
We built a FastAPI backend with a 6-stage async pipeline: PyMuPDF extracts text blocks, tables, and images from each PDF page, then a deterministic image triage module classifies embedded images as X-rays, charts, or decorative content without any LLM calls. The LLM consolidation stage maps extracted content to our clinical schema using the skill file as configuration, and Pydantic validators with aggressive coercion handle all the output variability. The frontend is a Next.js app with four views: a drag-and-drop upload page, the Skill File Editor with line numbers and YAML validation, a job polling loader, and the results viewer with expandable accordions and raw JSON toggle. Supabase handles document storage and job state, and the FHIR exporter converts validated results into proper R4 Bundles with deterministic resource IDs and correct reference chains.
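The deterministic image triage stage is essentially a set of cheap heuristics over image properties. A minimal sketch of the idea, with illustrative thresholds that are not our production values:

```python
def triage_image(width: int, height: int, is_grayscale: bool, unique_colors: int) -> str:
    """Classify an embedded PDF image without any LLM call.

    Heuristics (thresholds are illustrative):
    - tiny images are logos, icons, or decorative rules
    - large grayscale images are likely radiology (X-ray) content
    - large images with a flat, limited palette are likely rendered charts
    """
    area = width * height
    if area < 10_000:
        return "decorative"
    if is_grayscale:
        return "xray"
    if unique_colors <= 32:
        return "chart"
    return "other"
```

Because the classification is pure arithmetic, it is fast, reproducible across runs, and immune to hallucination.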
Challenges we ran into
LLM output variability was our biggest challenge because the same prompt produces structurally different JSON across runs, with arrays vs objects, null vs missing keys, and nested JSON serialized as quoted strings. We built multiple parsing fallbacks and a full normalization layer to handle every variant gracefully, returning a low-confidence fallback result instead of crashing when all parsing fails. Table extraction from clinical PDFs was deceptively hard because documents use complex layouts where single cells contain label-value pairs separated by newlines, and some tables span page breaks. Our first approach of sending images to the LLM for interpretation produced confident hallucinations, so we redesigned image handling to be entirely deterministic, which dramatically improved accuracy.
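The fallback chain can be sketched as follows. Function and key names are illustrative, but the strategy matches what we described: try several parses, unwrap double-encoded JSON, wrap bare arrays, and never raise:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Parse LLM output defensively, degrading to a low-confidence result."""
    candidates = [raw]
    # Strip markdown code fences the model sometimes wraps around JSON.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.S)
    if fenced:
        candidates.insert(0, fenced.group(1))
    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        # Nested JSON serialized as a quoted string: decode one more level.
        if isinstance(data, str):
            try:
                data = json.loads(data)
            except json.JSONDecodeError:
                continue
        if isinstance(data, dict):
            return data
        if isinstance(data, list):  # model returned an array instead of an object
            return {"items": data}
    # All strategies failed: return a flagged low-confidence result, never crash.
    return {"items": [], "confidence": 0.1, "parse_error": True}
```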
Accomplishments that we're proud of
We shipped a real Skill File Editor with line numbers, YAML syntax validation, and a save workflow that lets you change extraction rules in the browser and immediately re-run, which is something most teams hard-code. Our results viewer renders clinical data in expandable accordions with evidence citations showing page numbers and source snippets, making it actually usable for clinicians rather than just a JSON dump. The multi-stage pipeline (deterministic extraction, deterministic image triage, LLM consolidation, Pydantic validation) catches errors at every layer and produces high-accuracy results. We implemented every required clinical flag: multi-patient detection, transition-of-care signals aligned to HEDIS TRC windows, current vs historical condition separation, PCP change detection, and proper FHIR R4 export with correct resource references and code systems.
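Deterministic resource IDs can be derived with name-based UUIDs, so re-running the same document never mints duplicate resources. A sketch of the approach (the namespace URL is a placeholder, not our actual one):

```python
import uuid

# Hypothetical namespace; any stable URL works for uuid5 derivation.
FHIR_NS = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/clearcare")

def resource_id(resource_type: str, doc_id: str, local_key: str) -> str:
    """Same document + same logical entity always yields the same FHIR id,
    which keeps reference chains (e.g. Condition -> Encounter) stable across runs."""
    return str(uuid.uuid5(FHIR_NS, f"{resource_type}/{doc_id}/{local_key}"))
```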
What we learned
We learned that LLMs are powerful but unreliable as standalone extractors, and the real value comes from using them as one stage in a multi-stage pipeline where deterministic code handles parsing, validation, and coercion. Schema design turned out to be the most underrated part of clinical data engineering because healthcare data arrives in every conceivable format: confidence as "high" or 0.8 or true, dates as "May 12, 1970" or "05/12/1970", evidence as strings or dicts or null. The config-driven architecture paid for itself immediately because iterating on extraction quality became editing YAML in the browser instead of changing Python code and rebuilding. We also learned that image handling in clinical documents requires deliberate restraint because throwing images at an LLM produces confident hallucinations, and the right approach is deterministic classification followed by explicit constraints on what the LLM can infer.
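The confidence coercion we described lives in Pydantic validators; the underlying logic is plain enough to sketch without the framework. Label-to-score mappings and the neutral default below are illustrative:

```python
def coerce_confidence(value) -> float:
    """Normalize the many shapes confidence arrives in: 'high', 0.8, True, '0.8'."""
    labels = {"high": 0.9, "medium": 0.6, "low": 0.3}
    # bool is a subclass of int in Python, so check it first.
    if isinstance(value, bool):
        return 0.9 if value else 0.1
    if isinstance(value, (int, float)):
        return min(max(float(value), 0.0), 1.0)  # clamp to [0, 1]
    if isinstance(value, str):
        v = value.strip().lower()
        if v in labels:
            return labels[v]
        try:
            return min(max(float(v), 0.0), 1.0)
        except ValueError:
            pass
    return 0.5  # unknown shape: neutral default rather than an exception
```

Similar coercion functions handle dates and evidence fields, so the downstream schema always sees one canonical type.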
What's next for ClearCare
We want to add user authentication so that each user's custom skill files persist across sessions, meaning a care coordinator can log in and pick up exactly where they left off with their tailored extraction rules. We plan to build a timeline visualization for each patient that plots their visits, conditions, medications, and lab trends chronologically, giving clinicians an at-a-glance view of how a patient's health has progressed across encounters. We also want to significantly improve our image handling pipeline by integrating OCR for scanned pages and chart/graph images that our triage module already flags, and by adding DICOM metadata parsing for radiology images so we can extract study dates, modalities, and patient identifiers directly from embedded medical imaging rather than relying solely on nearby text context.
Built With
- Backend framework: FastAPI (async REST API with background task processing)
- Frontend framework: Next.js 14 (App Router), React 18
- AI/LLM: Anthropic Claude API (clinical document consolidation with JSON schema enforcement)
- PDF processing: PyMuPDF (fitz) for text extraction, table detection, image extraction, and layout analysis
- Database & storage: Supabase (document storage, job state management)
- Data validation: Pydantic v2 (clinical schema validation with custom coercion validators)
- Healthcare standards: HL7 FHIR R4 (Bundle, Patient, Encounter, Condition, Observation, MedicationStatement, DiagnosticReport, Composition resources)
- Configuration: YAML (skill file format for config-driven extraction rules)
- Styling: custom CSS-in-JS with DM Sans + Outfit typography, teal healthcare color system
- Dev tools: Git
- Also tagged: Python, JavaScript, TypeScript, Tailwind, PaddleOCR, Google GenAI, npm, pip, Uvicorn