Unitrial

Inspiration

Drug discovery is one of the largest and most consequential markets for improving human health. For today's most challenging and deadly diseases, such as cancer and neurodegenerative disorders, the cost of developing a drug has ballooned to $2 billion and failure rates have reached 97%. An extensive breakdown of the issues plaguing early-stage oncology trials has revealed that patient recruitment is the most critical pain point in running a successful trial. Cancers are fundamentally heterogeneous and notoriously difficult to develop drugs for. What's especially challenging is that today's patients often have to rely on clinical trials as a last resort for a cure. Recruitment models for today's most pressing clinical trials are severely limited, and local clinics, hospital networks, and patient recruitment laws vary widely. Matching patients to the right trial is therefore difficult: locations vary, EHRs lack interoperability, and patients have few resources for understanding which trials make the most sense to enroll in.

Unitrial aims to solve this bidirectional problem. It is a platform for both patients and CROs that combines a large corpus of live clinical trials, updated daily from ClinicalTrials.gov, with patient EHRs and medical records (genomic screens, lab test results, payer codes, and insurance claims) to match patients with trials and trials with patients. By streamlining this process, we aim to improve affordability, shorten the time to FDA drug approval, and bring patients closer to available trials that can cure their disorders faster.

What it does

Our website takes queries in the form of a patient EHR profile (PDF) or medical condition descriptions (dictionary input) and returns the top clinical trial matches using an efficient RAG system powered by Mistral-7B. We designed a novel data schema that uses FHIR and MeSH IDs to log medically relevant tags, which keeps the model from hallucinating on unstructured data alone. We capture both the technical specifications of trials (enrollment criteria, measures of success, and metadata on trial length, operations, and sponsors) and the unstructured data that drives many of the logistical decisions around enrollment (inclusion description, title and goals, relative rank of sources, sponsor conflict of interest).

We have two views, one for patients and one for CROs. On the patient end, a patient can upload PDFs of their EHR and medical data. We extract the raw text from the PDF and format it into our internal EHR schema so that it is interoperable with existing FHIR protocols. We also de-identify patient information so that the data can be passed back to our system for CROs to search. The patient can then enter a custom prompt, or choose from a set of well-established prompts, to search the clinical trial space with their EHR and medical data.

On the CRO end, the EHR embeddings are uploaded into our system (with real data in the future, this would be done in a HIPAA-compliant way) so that CROs can run more refined, specific searches across patients, either by dictionary key values or through a RAG question-answering system. We believe that pairing AI selection with real-world information from MeSH IDs and other biological tags provides the most accurate information about clinical trials and best helps match patients across different diseases.
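
To make the schema idea concrete, here is a minimal sketch of a de-identified patient record that keeps structured tags (MeSH IDs, biomarkers) next to the raw text, plus a key/value search of the kind a CRO might run. Everything named here (`PatientProfile`, `IDENTIFIER_KEYS`, `deidentify`, `to_profile`, `search_profiles`, and the field names) is hypothetical and for illustration only. It is not our production schema, and a real system would follow a formal de-identification standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical list of direct identifiers to strip. A real system would
# follow a formal de-identification standard, not this short list.
IDENTIFIER_KEYS = {"name", "address", "phone", "email", "ssn", "mrn"}


@dataclass
class PatientProfile:
    """Illustrative internal record: structured tags plus the raw text."""
    patient_ref: str  # opaque reference, not a real-world identifier
    mesh_ids: List[str] = field(default_factory=list)
    biomarkers: Dict[str, str] = field(default_factory=dict)
    raw_text: str = ""


def deidentify(record: dict) -> dict:
    """Drop top-level keys that name a direct identifier (case-insensitive)."""
    return {k: v for k, v in record.items() if k.lower() not in IDENTIFIER_KEYS}


def to_profile(record: dict, patient_ref: str) -> PatientProfile:
    """Build a profile from an extracted record after de-identification."""
    clean = deidentify(record)
    return PatientProfile(
        patient_ref=patient_ref,
        mesh_ids=list(clean.get("mesh_ids", [])),
        biomarkers=dict(clean.get("biomarkers", {})),
        raw_text=clean.get("raw_text", ""),
    )


def search_profiles(
    profiles: List[PatientProfile],
    mesh_id: Optional[str] = None,
    biomarkers: Optional[Dict[str, str]] = None,
) -> List[PatientProfile]:
    """Key/value search: every supplied filter must match."""
    wanted = biomarkers or {}
    hits = []
    for p in profiles:
        if mesh_id is not None and mesh_id not in p.mesh_ids:
            continue
        if any(p.biomarkers.get(k) != v for k, v in wanted.items()):
            continue
        hits.append(p)
    return hits
```

Keeping the structured tags separate from `raw_text` means exact filters can run on the tags while the free text is still available for embedding.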

How we built it

A) Access existing de-identified patient data from EHRs and preprocess the unstructured data into a structured format (formatted key biomarkers plus raw text as input).
B) Pass the formatted key biomarkers into MED-BERT for medical-specific data and BGE for unstructured data to create novel embeddings for our RAG pipeline.
C) Collect clinical trial overviews with the relevant trial parameters, information, timelines, and other necessary variables, and preprocess them using our modified REST API built on top of the existing ClinicalTrials.gov API. This API is designed to be flexible enough for ANY disease query. Simply search for the trials you'd like to learn more about and let the RAG system do the rest.
D) Set up a vector database (using ChromaDB) for the clinical trial corpus and the initial EHR patient cluster.
E) Use vector database filtering to reduce the search space and select the top N documents from the database. The top N is determined by a similarity search across the prompt, the uploaded EHR (embeddings are fetched by document ID in the corpus), and the clinical trial embeddings.
F) Using BERT embeddings of the query, calculate similarity scores between the query and each document and its respective “chunks” (trials), then fetch the top N documents and the top I chunks in each document for the final response.
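
Steps E and F can be sketched as follows. This is a simplified, dependency-free illustration, not our actual pipeline: the `embed` function is a toy bag-of-words stand-in for MED-BERT or BGE embeddings, an in-memory list stands in for ChromaDB, and the trial dictionary keys (`id`, `condition`, `chunks`) and the rule of scoring a trial by its best chunk are assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a MED-BERT or BGE embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, trials, condition=None, top_n=2, top_i=1):
    """Filter by metadata, rank trials, then keep the best chunks per trial."""
    q = embed(query)
    # Step E: a metadata filter shrinks the search space before scoring.
    pool = [t for t in trials if condition is None or t["condition"] == condition]
    scored = []
    for t in pool:
        # Step F: score every chunk of the trial against the query.
        ranked = sorted(
            ((cosine(q, embed(c)), c) for c in t["chunks"]),
            key=lambda pair: pair[0],
            reverse=True,
        )
        best = ranked[0][0] if ranked else 0.0  # trial score = best chunk score
        scored.append((best, t["id"], [c for _, c in ranked[:top_i]]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(trial_id, chunks) for _, trial_id, chunks in scored[:top_n]]
```

Filtering before scoring keeps the similarity search cheap, and returning only the top I chunks per trial keeps the context passed to the language model short.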

Business Model

We believe two things distinguish our platform from existing clinical trial recruitment products. The first is our novel RAG system, which has proven highly accurate at decision-making for trial recruitment. The second is that our application matches multiple stakeholders, building an asymmetric moat across patient data, active patients, multiple clinical trials, dedicated CRO partners, and specialists who mediate the AI model's results. Existing CRO software for clinical trial recruitment is limited to statistical models and existing trial parameters for recruiting from a stratified set of patients.

Unitrial is positioned uniquely at the intersection of cutting-edge AI technology and clinical trial recruitment, offering a solution that leverages the power of real-time data analytics and patient-specific information. Our revenue model will capitalize on this by implementing a subscription service for CROs and healthcare providers who seek access to our refined patient-matching system. Additionally, we will explore value-based pricing models where fees are aligned with the outcomes of successful patient matches, which not only ensures a higher ROI for our clients but also aligns our business interests with the health outcomes of patients.

Moreover, our platform can offer an advertisement option for pharmaceutical companies to feature their trials more prominently, ensuring higher visibility among relevant candidates. This dual-revenue stream from subscriptions and targeted advertising provides a sustainable business model while continuously improving the platform's capabilities.

Challenges we ran into

Finding good data to train our model was quite challenging. Once we found some suitable EHR data, we still had to do a lot of data cleaning and modification with Python scripts. Furthermore, when building our RAG pipeline, we had trouble integrating with LangChain and making the Mistral model rely exclusively on our database of clinical trials. It would frequently hallucinate, which was a major problem given our use case.
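
One common way to keep a model tied to retrieved documents is to put only those documents in the prompt and then check the answer for trial IDs that were never retrieved. The sketch below illustrates that general idea and is not necessarily what we shipped. The function names, the prompt wording, and the `id` and `summary` keys are all hypothetical; the only outside fact it relies on is that ClinicalTrials.gov identifiers have the form `NCT` followed by eight digits.

```python
import re

# ClinicalTrials.gov identifiers: "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"NCT\d{8}")


def build_grounded_prompt(question, trials):
    """Put only retrieved trials in the context and forbid outside knowledge."""
    context = "\n".join(f"[{t['id']}] {t['summary']}" for t in trials)
    return (
        "Answer using ONLY the trials listed below. "
        "Cite each trial by its bracketed ID. "
        "If none of them fits, reply 'No matching trial found.'\n\n"
        f"Trials:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def unsupported_ids(answer, trials):
    """Return trial IDs cited in the answer that were never retrieved."""
    allowed = {t["id"] for t in trials}
    return sorted(set(NCT_PATTERN.findall(answer)) - allowed)
```

If `unsupported_ids` returns anything, the answer cites a trial outside the retrieved set and can be rejected or regenerated before it reaches the patient.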

Accomplishments that we're proud of

We're proud of refining the model and creating a product capable of helping potential patients match with life-saving drugs. We've extensively tested our model and the hallucination rate is far lower than it initially was.

What we learned

We learned a lot about the inequalities patients face in finding clinical trials and about how to integrate various LLM frameworks. We also learned the specifics of building a RAG pipeline, such as working with different embedding techniques, vector space modeling, and querying, as well as accounting for scalability during development.