The following document describes Klìnic, a project developed during the HackUPC 2024 hackathon.
Biomedical research is hard, but we can help. Klìnic is an integrated platform for getting insights into clinical trial trends in a scoped domain, helping researchers design new experiments and analyze past failures.
The idea is to give clinicians and researchers an easy overview of the landscape of clinical research trials in a given field. They only have to input a general description of a disease, such as "A disease that affects young patients, generally male Caucasians". We retrieve the diseases whose descriptions are most similar to that statement by comparing description embeddings. Then we use a knowledge graph of relationships between diseases to find the diseases most similar to those retrieved. This data augmentation step surfaces additional clinical trials related to the diseases the user is interested in. Finally, a language model summarizes the clinical trials and extracts numerical data from them.
Inspiration
Our team is composed of students from different backgrounds, including computer science, mathematics, and biomedicine. We wanted to create a tool that lets clinicians and researchers easily get an overview of the landscape of clinical research trials in a given field. We believe this tool could help them design new experiments and analyze past failures.
What it does
This tool is an integrated platform for getting insights into clinical trial trends in a given domain (for example, diseases that affect young females). The user just has to input a general description of a disease, such as "A disease that affects young patients, generally females, showing symptoms of fatigue and muscle pain".
We retrieve the diseases whose descriptions are most similar to that statement by comparing description embeddings. Then we use a knowledge graph of relationships between diseases to find the diseases most similar to those retrieved. This data augmentation step surfaces additional clinical trials related to the diseases the user is interested in. We then use a language model to summarize the clinical trials and extract numerical data from them.
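The first retrieval step can be sketched as a plain cosine-similarity search over precomputed description embeddings. This is a minimal NumPy sketch, not the project's actual code; the function and variable names are illustrative:

```python
import numpy as np

def top_k_diseases(query_vec, disease_vecs, names, k=5, threshold=0.0):
    """Return up to k disease names whose description embeddings are
    most similar (by cosine similarity) to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    m = disease_vecs / np.linalg.norm(disease_vecs, axis=1, keepdims=True)
    sims = m @ q  # cosine similarity of each row against the query
    order = np.argsort(sims)[::-1][:k]
    return [(names[i], float(sims[i])) for i in order if sims[i] >= threshold]
```

In the project, the equivalent computation runs inside IRIS via its vector search rather than in application code.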
How we built it
We wrote the whole frontend (using Streamlit) and most of the backend in Python, with some backend components in MATLAB. Our system first has to preprocess the data that is fed into IRIS. First, there is our knowledge graph, which holds information about the relationships between different diseases; for this, we downloaded the MedGen dataset and trained an embedding model. We took the same approach for clinical trials (source) to represent their relationships.
The heart of our logic comprises the following nine steps:
- Embed the textual description that the user entered using the model.
- Using a similarity threshold, get the top-k diseases with the highest cosine similarity from the DB.
- Get the pairwise similarities among those diseases (cosine similarity of the embeddings of their nodes).
  - We also show this as a correlation heatmap on our frontend.
- Optionally filter out the diseases that are not similar enough (e.g., similarity < 0.8).
- Augment the set of diseases: add new diseases similar to the ones already in the set until we reach a defined threshold.
  - We also show the selected diseases in a graph view, whose UI was built in MATLAB.
- Query the embeddings of the diseases related to each clinical trial (also in the DB) to get the clinical trials most similar to our set of diseases.
- Use an LLM to produce a plain-text summary of the clinical trials.
- Use an LLM to extract statistical insights from the clinical trials (e.g., average minimum and maximum patient age, average trial timeframe, most common gender in the trials, etc.).
- Show the results to the user in the web app we built. Salient features of the web interface include a graph of the chosen diseases, a summary of the clinical trials, statistical insights about the trials, and a list of the details of the clinical trials considered.
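The augmentation step above (growing the disease set with graph neighbours until a target size is reached) can be sketched as a greedy loop. This is a simplified illustration under stated assumptions: the similarity scores are held in a plain dictionary, and the function name and thresholds are hypothetical, not the project's actual implementation:

```python
def augment_diseases(selected, similarity, min_count=10, min_sim=0.8):
    """Greedily grow `selected` by adding the most similar unselected
    disease reachable from the current set, until `min_count` diseases
    are chosen or no candidate clears `min_sim`.

    `similarity` maps each disease to a dict of {neighbour: cosine score}.
    """
    selected = set(selected)
    while len(selected) < min_count:
        best, best_sim = None, min_sim
        for d in selected:
            for other, s in similarity.get(d, {}).items():
                if other not in selected and s >= best_sim:
                    best, best_sim = other, s
        if best is None:  # nothing similar enough remains
            break
        selected.add(best)
    return selected
```

The trial-retrieval step then simply queries the DB for trials linked to any disease in the augmented set.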
Our setup relies on the demo provided by InterSystems. As seen above, we relied heavily on IRIS's vector search to compute similarities.
Challenges we ran into
| Situation | Task | Action | Result |
|---|---|---|---|
| Generation of a shortlist of diseases most closely related to a given set of diseases | We had to use the distance between the embeddings of every disease node in the DB to obtain similarity scores between diseases | We chose a translation-based algorithm (TransE), which, though somewhat dated, pairs well with cosine similarity because it moves the embeddings of connected entities closer together in the space. | We were able to continue coding the project, as empirical results showed that this approach produced correct results. |
| Deploying the IRIS platform | We had to deploy a Docker environment with the IRIS platform, but the available Docker images were not working properly | Our first approach was to isolate the two services that compose the provided repository (the platform and a Jupyter notebook used to interact with it). We were able to isolate the platform and deploy it in a Docker container, but we could not make it work properly until the mentors showed us the correct way to access the platform. | Finally, we were able to deploy the platform and use it to interact with the data. |
| Unavailability of textual data to use with the LLM | We needed a large amount of textual data for the LLM to do semantic search properly, but most of the databases we found did not include human-readable text | We decided to use the textual data we had available: the descriptions of the diseases and the clinical trials. To refine the disease search, we resorted to the embeddings of the nodes of a disease graph fetched from MedGen. | We obtained a good amount of textual data to use with the LLM and used it to produce summaries of the clinical trials. |
| Hallucinations in the generation of textual data | The LLM was generating text that was not present in the clinical trials | This is a common problem with LLMs, which tend to hallucinate content not present in the prompt, especially when facing technical text. We used a more detailed prompt to mitigate the problem and switched to GPT-4 Turbo, a more advanced model that should generate more coherent text. Beyond that, we substantially limited hallucinations by relying on techniques less vulnerable to the problem, such as the node embeddings of the diseases and text search with Sentence Transformers. | We obtained a more coherent summary of the clinical trials and were able to extract numerical data from them. |
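The TransE intuition mentioned in the first challenge is that a relation is modelled as a translation in embedding space, so a plausible triple (head, relation, tail) satisfies h + r ≈ t; connected diseases therefore end up close together, which is what makes a later cosine-similarity comparison meaningful. A minimal sketch of the scoring function (illustrative only, not the project's training code):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score for a triple (head, relation, tail):
    the L2 norm of h + r - t. Lower means more plausible; training
    pushes this toward zero for true triples and apart for corrupted ones."""
    return float(np.linalg.norm(h + r - t))
```

In practice the embeddings are learned with a margin-based ranking loss over corrupted triples; here we only show the score used at inference time.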
Key takeaways
- LangChain's "stuff" summarization chain was used with GPT-4 Turbo for the best possible text-summarization results. Extending it with a recursive character text splitter to reduce token counts for very large documents is future work for the project.
- LangChain "tagging classes" functionality was used to output specific statistics from our raw JSON data. We used this class to give us specific statistical insights into all the clinical trials similar to the disease description that was entered.
- InterSystems IRIS vector search, which internally uses cosine similarity, was used to find the similarity between diseases via their embeddings stored in the IRIS vector database.
- The InterSystems IRIS vector database was also queried with SQL to extract valuable insights from the embeddings.
- We used OpenAI embeddings with a batch size of 64 and an embedding length of 128 to encode the text prompts into vectors, due to hardware constraints. Learning and building on a bigger feature space is future work for this product.
- We also explored FAISS, a lightweight vector similarity search library that supports semantic search.
- We offer a MATLAB app designed to visualize the significant connections among diseases. Simply input the disease's code name, and the app will generate a node graph illustrating all the direct relationships associated with the entered disease.
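To make the IRIS vector-search takeaway concrete, the kind of SQL the project describes can be sketched as follows. This is a hedged illustration: the table and column names (`Diseases`, `Name`, `Embedding`) are hypothetical, and the exact `TO_VECTOR` signature should be checked against the InterSystems IRIS documentation for the version in use:

```python
# A parameterized IRIS-style vector search query: rank rows by the
# cosine similarity between a stored embedding column and a query vector.
TOP_K_QUERY = """
SELECT TOP 5 Name,
       VECTOR_COSINE(Embedding, TO_VECTOR(?, double)) AS Sim
FROM Diseases
ORDER BY Sim DESC
"""
```

The placeholder `?` would be bound to the serialized query embedding through the DB driver, so the similarity ranking happens entirely inside the database.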
What we learned
- Knowledge Graphs
- Knowledge Graph Embedding Algorithms
- SPARQL queries
- RDF data representation
- Language Models
- Data Augmentation
- Embedding models
- Metrics for translation-based algorithms
- FAISS
- Processing large amounts of data using Python, storing it in a database, and tracking it with git-lfs
- LangChain
- MATLAB (Design Apps)
What's next for Klìnic
While the tool we present is a promising first approach to exploring the research landscape of clinical trials, Klìnic can be taken further. Specifically, we envision a complete studio with more data sources to offer clinicians deeper insights. An interesting first step would be combining our tool with automated surveys of the scientific literature, ideally yielding new insights by leveraging the information available in clinical trial reports.