High-Level Overview 😎
Our AI tool, PeptGPT, helps researchers design new fusion protein sequences with specific functions and view their predicted 3D structures, using GPT-4 and other large language models (LLMs).
Inspiration 💡🧬
Researchers face difficulties in designing novel protein sequences with a desired function. The current process is time-consuming and requires extensive expertise and resources to ensure proper folding, stability, and functionality. There is a critical need for an efficient, user-friendly tool that rapidly generates checked protein sequences for a specified function while taking folding requirements into account, making protein engineering both more accessible and more effective.
What it does 🤔
With PeptGPT, users enter specific or generic parameters to create protein sequences tailored to their requirements. The platform identifies relevant protein families and generates a representative seed sequence for each. These seed sequences, along with keyword-derived ones, are combined to produce fusion proteins, expanding the possibilities of protein design.
To check that a design folds, PeptGPT integrates with ESM2, a protein language model used for structure prediction. The graphical interface displays the predicted fold, streamlining the protein engineering process, while the command-line interface and the available functions (which use Biopython types) are aimed at bioinformaticians.
How we built it 🧠🧠
- The code uses OpenAI's GPT-4 model to generate protein family identifiers (PFAM values) based on user prompts (keywords), helping scientists design novel protein sequences with desired functionality.
- A seed sequence is derived for each PFAM value from its set of ~20 seed sequences, retrieved via the EBI InterPro API. We implemented both a ranking approach and a custom "consensus of seeds" sequence generation algorithm, and ultimately settled on the latter, which generates a novel sequence.
- We then combine the generated protein sequences with flexible protein linkers to create fusion proteins.
- We make sure the fusion protein starts with methionine (M), the amino acid encoded by the start codon, so that it can be translated properly.
- Via the ESM2 web "API" (which we reverse-engineered), we fold the resulting fusion protein for the experimenter to view.
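The consensus and fusion steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the majority-vote rule, and the (GGGGS)x3 linker are all assumptions chosen for the example, and the input is assumed to be a set of already-aligned seed sequences of equal length.

```python
from collections import Counter

GAP_CHARS = "-."               # gap symbols commonly seen in alignments
FLEXIBLE_LINKER = "GGGGS" * 3  # assumed linker; the write-up does not name one


def consensus_from_alignment(aligned_seqs):
    """Majority-vote consensus over equal-length aligned sequences.

    Columns whose most common symbol is a gap are dropped.
    (Hypothetical helper, not the project's algorithm.)
    """
    if not aligned_seqs:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned_seqs):
        residue, _count = Counter(c.upper() for c in column).most_common(1)[0]
        if residue not in GAP_CHARS:
            consensus.append(residue)
    return "".join(consensus)


def build_fusion(domains, linker=FLEXIBLE_LINKER):
    """Join domain sequences with a linker and ensure a leading methionine."""
    cleaned = [d.strip().upper() for d in domains if d.strip()]
    if not cleaned:
        raise ValueError("need at least one domain sequence")
    fusion = linker.join(cleaned)
    if not fusion.startswith("M"):
        fusion = "M" + fusion  # prepend Met so the sequence starts with M
    return fusion


if __name__ == "__main__":
    seed = consensus_from_alignment(["MKT-LV", "MKS-LV", "MKTALV"])
    print(build_fusion([seed, "AEEL"]))
```

A simple majority vote is only one possible consensus rule; ties fall to whichever symbol appears first in the column, so a real implementation may want an explicit tie-breaking policy.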
Challenges we ran into 😤
- Inaccurate PFAM Values from GPT-4: GPT-4 occasionally generated PFAM values that did not correspond to any valid protein family entry, so its output had to be checked before use.
- Understanding GPT-4 Parameters: Navigating the GPT-4 playground and fine-tuning parameters for optimal protein sequence generation.
- Interpreting InterPro API: Parsing STOCKHOLM formatted responses from the PFAM database in the InterPro API.
- Reverse Engineering ESM Web API: Reverse engineering the ESM web API by inspecting the POST requests in the browser's developer tools to access the required functionality.
- Challenges with Prompt Engineering: Overcoming difficulties in fine-tuning prompts to incorporate highly specific biological parameters.
- Converting Sequences to 3D Structure: Using the ESM model to predict 3D protein structures from amino acid sequences.
- Thermodynamic Stability Measurement: Estimating the thermodynamic stability of amino acid sequences using YASARA.
- Navigating Available APIs: Selecting and integrating suitable APIs from the available options for efficient data extraction and processing.
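Two of the challenges above, filtering GPT-4's PFAM output and reading STOCKHOLM alignments, can be illustrated with a small sketch. It assumes that Pfam accessions have the form "PF" followed by five digits and that a basic Stockholm file has "#" markup lines, "name sequence" rows, and a terminating "//". The function names are hypothetical and this is not the project's code; Biopython's `AlignIO.read(handle, "stockholm")` is a ready-made alternative for the parsing step.

```python
import re

# Pfam accessions look like PF00069: "PF" plus exactly five digits.
PFAM_ACCESSION = re.compile(r"\bPF\d{5}\b")


def extract_pfam_accessions(llm_text):
    """Return well-formed Pfam accessions in order of first appearance.

    This only checks the format; an accession can be well-formed and still
    not exist, so the later database lookup must handle misses.
    """
    seen = []
    for accession in PFAM_ACCESSION.findall(llm_text):
        if accession not in seen:
            seen.append(accession)
    return seen


def parse_stockholm(text):
    """Parse Stockholm-formatted text into {sequence_name: aligned_sequence}.

    Markup lines (starting with '#') are skipped, and rows that repeat a
    name in later blocks are concatenated. Minimal sketch only.
    """
    records = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "//":  # end-of-alignment marker
            break
        name, aligned = line.split(None, 1)
        records[name] = records.get(name, "") + aligned.replace(" ", "")
    return records


if __name__ == "__main__":
    print(extract_pfam_accessions("Try PF00069 and PF00018, but not PF123."))
    example = "# STOCKHOLM 1.0\nseqA/1-5  MK.LV\nseqB/1-5  MKALV\n//\n"
    print(parse_stockholm(example))
```

Filtering by format first is cheap and removes obviously malformed identifiers before any network request is made.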
Accomplishments that we're proud of ✨
- Using GPT-4's "comprehension" of protein function, identifiers, abbreviations, etc. to identify appropriate protein families
- Making a functional protein seed sequence generator
- Creating a working GUI
- Successfully interfacing with web ESM2 to fold proteins
What we learned 🙌
- Biopython
- HTTP/S
- APIs
- Integrating OpenAI with our application
- Amino acid sequences
- Protein structures
What's next for PeptGPT 🚀
Sequences generated using PeptGPT can be tested experimentally for thermodynamic stability and crystal structure! Beyond that, additional user parameters can be added, a predictive AI model can be built to predict the stability, functionality, and potential interactions of a generated sequence, and an entire LLM for protein sequences can be built.
Built With
- biopython
- esm
- gpt4
- gui
- interpro
- python

