Project Writeup: GenPlasmid
Inspiration
Plasmids are circular DNA molecules capable of replicating independently within a host cell and are essential tools in molecular biology. They play a critical role in gene cloning, protein expression, and reporter assays, making them indispensable for research, synthetic biology, biomanufacturing, and therapeutic development. However, traditional plasmid design is labor-intensive, requiring multiple rounds of experimental validation to achieve optimal gene expression, stability, safety, and host compatibility.
As access to computationally designed proteins becomes democratized, functional protein expression and testing are emerging as significant bottlenecks. Moreover, specific research questions almost always require highly customized experimental systems for validation. Improving plasmid design can bridge the gap between computational protein design and physical testing, enabling faster and more effective wet-lab validation. GenPlasmid was inspired by the need to enhance plasmid engineering, adding a critical component to the design, build, and test stack in generative biology.
What it does
GenPlasmid automates the design of plasmid components. We validated a use case by applying in silico mutagenesis to generate novel promoters for enhanced expression of YFP.
How we built it
First, we compiled a new dataset, OpenPlasmid, consisting of ~150k engineered plasmids from Addgene. We then finetuned gLM2, a mixed-modality DNA and protein sequence model, and evaluated the learned plasmid representations using a new benchmark, showing improvements over several baselines. Finally, we applied in silico mutagenesis to design novel promoters for YFP expression and computationally evaluated these designs through a robust oracle model.
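As a rough illustration of the in silico mutagenesis step, the sketch below enumerates every single-nucleotide substitution of a promoter, scores each variant with an oracle, and greedily keeps the best one per round. It is a minimal sketch under stated assumptions, not the project's actual pipeline: `toy_oracle` is a hypothetical stand-in that only measures similarity to the bacterial -10 consensus `TATAAT`, whereas GenPlasmid used a learned oracle model, and the names `single_mutants` and `greedy_mutagenesis` are invented here for illustration.

```python
from typing import Callable, Iterator, Tuple

BASES = "ACGT"


def single_mutants(seq: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, new_base, mutated_sequence) for each single substitution."""
    for i, ref in enumerate(seq):
        for base in BASES:
            if base != ref:
                yield i, base, seq[:i] + base + seq[i + 1:]


def toy_oracle(seq: str, motif: str = "TATAAT") -> float:
    """Hypothetical placeholder oracle: best fraction of positions matching
    `motif` over all windows of the sequence. A real oracle would be a
    trained expression predictor."""
    k = len(motif)
    best = 0
    for i in range(len(seq) - k + 1):
        matches = sum(a == b for a, b in zip(seq[i:i + k], motif))
        best = max(best, matches)
    return best / k


def greedy_mutagenesis(
    seq: str, oracle: Callable[[str], float], rounds: int = 3
) -> Tuple[str, float]:
    """Apply up to `rounds` single substitutions, each time keeping the
    highest-scoring mutant; stop early when no mutant improves the score.
    Assumes a non-empty sequence."""
    current, score = seq, oracle(seq)
    for _ in range(rounds):
        _, _, candidate = max(single_mutants(current), key=lambda m: oracle(m[2]))
        candidate_score = oracle(candidate)
        if candidate_score <= score:
            break
        current, score = candidate, candidate_score
    return current, score


if __name__ == "__main__":
    start = "GGGTATCATGGG"  # made-up example sequence, not from OpenPlasmid
    designed, designed_score = greedy_mutagenesis(start, toy_oracle)
    print(start, toy_oracle(start))
    print(designed, designed_score)
```

Passing the oracle in as a plain callable keeps the search loop independent of the scoring model, so the toy scorer can be swapped for a learned predictor without changing the mutagenesis code.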
Challenges we ran into
Collecting and organizing the OpenPlasmid dataset was a significant challenge. Initially, we aimed to evaluate the design characteristics of promoters in the context of novel protein sequence variants. We designed thousands of likely functional YFP variants using ESM3 and used these variants for the conditional generation of novel promoters. However, we lacked a robust evaluation framework to report conclusive results, particularly within the time constraints of the hackathon. We plan to continue this work.
Accomplishments that we're proud of
- OpenPlasmid: a new, publicly available dataset of ~150k annotated plasmid designs.
- A state-of-the-art finetuned, mixed-modality genomic language model for plasmid design, with new benchmark evaluations.
- Improved promoter designs generated through in silico mutagenesis, with a new benchmark evaluation.
What we learned
Model and task benchmarks are at a premium for validating new methodologies in generative biodesign. More broadly, ML research in biology is rapidly democratizing, as shown by the ability of our globally distributed team to collaborate. We are excited about how access to these tools will compound progress in the field.
What's next for GenPlasmid
Next, we aim to enhance GenPlasmid’s capabilities by incorporating more structured plasmid annotations and developing new frameworks to evaluate tasks of interest, such as delivery and host compatibility. Ultimately, we want to make GenPlasmid an accessible resource for labs and researchers, enabling plasmid engineering and facilitating data sharing on the success of computational designs.
Built With
- esm3
- glm2
- polaris
- python
- runpod.ai