Healthcare Multi-Modal Data Plumbers

Inspiration

Andrej Karpathy's idea of Software 2.0 proposes a new paradigm for software development that relies on AI models to replace critical layers of a software stack. In this paradigm, practitioners codify desired behaviors by designing datasets and then training commodity AI models. This approach could significantly benefit healthcare data processing, interoperability, and AI pipelines.

The current paradigm of "AI in healthcare" is unsustainable: developing, deploying, and maintaining a predictive model data pipeline for a single clinical task can cost upwards of $500,000. Commercial solutions also fall short because vendors typically charge health systems per model, per prediction, or per data source. A better paradigm would focus on creating models that are cheaper to build, have reusable parts, can handle multiple data types, and are resilient to changes in the underlying data.

By lowering the time and energy required to create data pipelines, we can focus on ensuring that their use leads to fair allocation of resources with the potential to meaningfully improve clinical care and efficiency. This would create a new, supercharged framework for AI in healthcare.

What it does

Using foundation models, we automate data field and value set mapping, validation and quality checks, and transformation of multi-modal input data, both structured and unstructured, and create a semantic patient representation with minimal human supervision.

How we built it

We use GPT-4 to extract, map, validate, and transform data to create patient representations and embeddings, using sample structured and unstructured datasets as examples. We provide GPT-4 with a target data model and let it infer the appropriate schema mapping.
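As a rough illustration of the schema-mapping step, the sketch below assembles a prompt from source fields, sample rows, and a target data model, then filters the model's answer. It is not our production code: the function names, the prompt wording, and the field names in the usage note are hypothetical, and the model call is passed in as a plain `complete` callable so that no particular API client is assumed.

```python
import json
from typing import Callable


def build_mapping_prompt(source_fields, sample_rows, target_fields):
    """Assemble a prompt asking the model to map source fields to target fields."""
    return (
        "Map each source field to one target field, or to null if none fits.\n"
        f"Target fields: {json.dumps(target_fields)}\n"
        f"Source fields: {json.dumps(source_fields)}\n"
        f"Sample rows: {json.dumps(sample_rows)}\n"
        'Answer with a JSON object of the form {"source_field": "target_field"}.'
    )


def propose_mapping(source_fields, sample_rows, target_fields,
                    complete: Callable[[str], str]):
    """Ask the model for a mapping and separate usable proposals from the rest.

    `complete` is a hypothetical stand-in for whatever function sends a prompt
    to the model and returns its text reply.
    """
    prompt = build_mapping_prompt(source_fields, sample_rows, target_fields)
    mapping = json.loads(complete(prompt))
    # Accept only proposals that name a known source field and a known target field.
    accepted = {
        src: tgt for src, tgt in mapping.items()
        if src in source_fields and tgt in target_fields
    }
    # Everything else is routed to human review.
    needs_review = [src for src in source_fields if src not in accepted]
    return accepted, needs_review
```

Called with source fields such as `["pt_dob", "sex", "misc"]` and target fields such as `["birthDate", "gender", "name"]`, any source field the model maps to an unknown target, or does not map at all, ends up in `needs_review`, which is the kind of list a review UI could present to a human.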

We also built a UI for uploading data and for facilitating human review/feedback.

Challenges we ran into

Converting to the FHIR standard involves a lot of complex logic because mapping from source systems is not 1:1. We then need to map from source terminology to destination terminology and often convert units of measure to a common standard. A historical database of mapping examples may be needed to create a new small model or fine-tune GPT.
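The unit-of-measure step can be pictured with a small lookup table. This is only a minimal sketch under our own assumptions: the `CANONICAL` table, the choice of kg and cm as common units, and the `normalize` helper are illustrative and not part of FHIR or of our pipeline.

```python
# Hypothetical table: quantity -> (common unit, factor from each source unit).
CANONICAL = {
    "weight": ("kg", {"kg": 1.0, "g": 0.001, "lb": 0.45359237}),
    "height": ("cm", {"cm": 1.0, "m": 100.0, "in": 2.54}),
}


def normalize(quantity, value, unit):
    """Convert a value to the common unit chosen for this quantity."""
    target_unit, factors = CANONICAL[quantity]
    key = unit.strip().lower()
    if key not in factors:
        # Unknown units are surfaced rather than guessed.
        raise ValueError(f"no conversion from {unit!r} for {quantity}")
    return value * factors[key], target_unit
```

Raising on an unknown unit, instead of passing the value through, keeps unconverted measurements from silently mixing with converted ones.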

GPT calls took a long time even with the few records we had, which makes this approach unsuitable for heavy production workloads.

Limitations with context: In order to learn the input data fields and how they exist in our structured and unstructured input files, we need to provide GPT with sample records from all input files in one context. With a large number of files and/or large sample records, GPT sometimes falters as we reach the token limit. One solution is an iterative process where we feed sample data in multiple API calls.
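The iterative approach described above could be sketched as follows: sample records are grouped into batches that each stay under a token budget, and each batch is sent in its own API call. The helper names are hypothetical, and the four-characters-per-token estimate is a crude assumption; a real tokenizer would give more accurate counts.

```python
def estimate_tokens(text):
    """Very rough token estimate (assumes about 4 characters per token)."""
    return max(1, len(text) // 4)


def batch_samples(samples, max_tokens):
    """Group sample records into batches whose estimated size fits the budget."""
    batches, current, used = [], [], 0
    for sample in samples:
        cost = estimate_tokens(sample)
        if cost > max_tokens:
            raise ValueError("a single sample exceeds the token budget")
        # Start a new batch when adding this sample would overflow the current one.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(sample)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch would then go into a separate prompt, with the field descriptions learned from earlier batches carried forward as needed.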

Lastly, a separate fine-tuned model is needed to catch data quality issues, both syntactic and semantic. We expect this would substantially improve the robustness of our pipeline.
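To make the distinction concrete, here is a small rule-based sketch. It is a stand-in for illustration, not the fine-tuned model itself. A syntactic issue means a value is malformed; a semantic issue means a well-formed value is implausible. The field names and the heart-rate range are assumptions chosen for the example.

```python
from datetime import date


def check_record(record, today):
    """Return syntactic and semantic issues found in one record."""
    syntactic, semantic = [], []

    # Syntactic: is the value well formed at all?
    try:
        born = date.fromisoformat(record.get("birthDate"))
    except (TypeError, ValueError):
        born = None
        syntactic.append("birthDate is not an ISO 8601 date")
    try:
        rate = float(record.get("heart_rate"))
    except (TypeError, ValueError):
        rate = None
        syntactic.append("heart_rate is not numeric")

    # Semantic: is a well-formed value plausible? (illustrative thresholds)
    if born is not None and born > today:
        semantic.append("birthDate is in the future")
    if rate is not None and not 20 <= rate <= 300:
        semantic.append("heart_rate outside plausible range")

    return {"syntactic": syntactic, "semantic": semantic}
```

Hand-written rules like these only cover cases someone anticipated, which is why we think a learned model is needed for broader coverage.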

Accomplishments that we're proud of

We thought deeply about the healthcare data mapping problem space and reviewed data standards and databases such as FHIR and MIMIC, as well as terminologies and knowledge graphs such as SNOMED. We completed the first iteration of automated mapping with one structured source and one unstructured source (clinical notes), and combined them to create patient attributes and embeddings.

https://enigmatic-sands-54098.herokuapp.com/2/mapper/

What we learned

Healthcare data mapping is a complex space: data models, mappings, and ontologies are all complex. A smaller model, or a GPT model fine-tuned on past mapping and ontology datasets, would be helpful here.

A cloud-hosted (own tenant) or self-hosted version of GPT is required for this type of work; the public GPT version has many limitations, including performance issues.