Biologists are rapidly acquiring enormous amounts of high-dimensional data, and they currently depend on computational biologists to visualize and interpret it. Giving biologists computational capabilities of their own would unlock significant value, because they are the domain experts who understand the biological context of their own genomics data.
What it does
Our tool allows users to load genomics data and query it using natural language in a no-code interface. When given a data analysis prompt, our tool generates accurate analysis and visualization code, presenting the final figure within the chat interface.
How we built it
To develop our tool, we created multiple prompt-code pairs and used GPT-4, along with the data schema, to generate code. We also implemented a simple error-correction approach to handle code execution failures. The frontend was built with Chainlit, and LangChain was used to interact with the LLM. For deployment, we relied on DigitalOcean.
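As a rough illustration of how prompt-code pairs and a data schema could be combined into a single prompt, here is a minimal sketch. It assumes the pairs are used as few-shot examples, which is our reading of the setup rather than a confirmed detail. The names `Example` and `build_prompt`, the prompt wording, and the toy schema are all hypothetical, and the sketch deliberately stops before any LLM call.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One hand-written prompt-code pair (hypothetical structure)."""
    prompt: str
    code: str


def build_prompt(schema: dict[str, str], examples: list[Example], question: str) -> str:
    """Assemble one text prompt: the schema, the example pairs, then the new request."""
    schema_lines = [f"- {column}: {dtype}" for column, dtype in schema.items()]
    parts = [
        "You write Python analysis code for the dataset described below.",
        "Data schema:",
        *schema_lines,
        "",
    ]
    for ex in examples:
        parts += [f"Request: {ex.prompt}", "Code:", ex.code, ""]
    # End with the user's request and an open "Code:" slot for the model to fill.
    parts += [f"Request: {question}", "Code:"]
    return "\n".join(parts)


# Toy schema and example pair, invented purely for illustration.
schema = {"gene": "str", "sample": "str", "expression": "float"}
examples = [
    Example(
        prompt="Plot mean expression per sample.",
        code="df.groupby('sample')['expression'].mean().plot.bar()",
    )
]
prompt = build_prompt(schema, examples, "Show the ten most highly expressed genes.")
print(prompt)
```

Keeping prompt assembly in a plain function like this makes it easy to inspect exactly what the model sees before wiring it into an LLM client.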
Challenges we ran into
During the engineering process, we ran into difficulties with specific bioinformatics Python libraries that have numerous dependencies. We also found that generating proper responses required detailed analytical and technical prompts for the LLM. To avoid the convergence issues of agent-based approaches, we adopted a straightforward error-correction scheme: when generated code fails to execute, we send the prompt back to GPT-4 together with the error information.
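The error-correction scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our exact implementation: `run_with_retries` is a hypothetical name, the model is passed in as a plain callable `generate` so that no particular LLM client API is implied, the retry limit of three is an arbitrary choice, and `exec` is used without any sandboxing.

```python
import traceback
from typing import Callable, Optional


def run_with_retries(
    prompt: str,
    generate: Callable[[str], str],
    max_attempts: int = 3,  # assumed limit, chosen for illustration
) -> Optional[dict]:
    """Generate code, execute it, and on failure re-prompt with the error text."""
    current_prompt = prompt
    for _ in range(max_attempts):
        code = generate(current_prompt)
        namespace: dict = {}
        try:
            exec(code, namespace)  # unsandboxed execution; illustration only
            return namespace       # success: hand back the variables the code defined
        except Exception:
            error = traceback.format_exc()
            # Loop the original prompt back with the failing code and its error.
            current_prompt = (
                f"{prompt}\n\nThe previous code was:\n{code}\n\n"
                f"It failed with:\n{error}\nPlease return corrected code."
            )
    return None  # give up after max_attempts failures


# Stand-in for the LLM: fails on the first call, succeeds once it sees an error report.
def fake_llm(p: str) -> str:
    return "result = 1 + 1" if "failed with" in p else "result = 1 / 0"


ns = run_with_retries("Compute something.", fake_llm)
```

Bounding the number of attempts keeps the loop from running indefinitely when the model cannot repair its own code.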
Accomplishments that we're proud of
Despite the challenge of generating code within the strict domain-specific context of biology, we successfully built a working prototype. Our approach is a first step toward an automated, AI-assisted data analysis and visualization tool that lets biotech and pharma R&D scientists interact with complex, high-dimensional data using natural language.
What we learned
We learned that interactive, LLM-powered biological data analysis is feasible for analyses of limited complexity, but that limitations remain to be addressed before it can give accurate responses to more complex prompts.
Our future plans include:
- Fine-tuning an LLM on genomics data analysis libraries to tailor the code solutions to the specific domain.
- Testing the use of multiple agents to enable more sophisticated planning, execution, and error correction.
- Generating tests to identify and solve a more diverse set of logical and conceptual errors.
- Expanding our tool to support a wider range of analysis and visualization capabilities.