Mesengenic AI

What Inspired Us

Mesengenic AI was born where two ideas collide: directed evolution and evolutionary statistics. Frances Arnold's lab at Caltech proved you can breed proteins for better function. The Marks Lab at Harvard showed that a protein's evolutionary history contains blueprints for its structural rules. Between them, they proved that evolution is both a process we can harness and a dataset we can read.

But we noticed a gap. Labs working on stem cell reprogramming, like the Thomson Lab at the Morgridge Institute, are limited by the proteins nature already made. When you look at mesenchymal and blood stem cell differentiation, efficiency depends on natural enzymes. We asked a simple question: what if the ideal protein for clinical use doesn't exist in nature yet?

That question became Mesengenic AI, named for mesenchymal stem cells: growth from the middle layer. We treat evolution not as a history book but as an incomplete experiment, and we built a model that predicts what the next page should say.

How We Built It

We built a Variational Autoencoder (VAE) that treats evolutionary data as a mathematical prior: a probability field encoding what biology considers viable. Think of it like music. Play a thousand melodies from the same tradition to someone with a good ear and they start to feel the grammar. They wouldn't just memorise the songs; they would understand the language well enough to hum something new that still sounds right. That is what our VAE does with proteins.

We trained on 10,129 real evolutionary sequences from the TEM-1 beta-lactamase family, the enzyme that lets antibiotic-resistant bacteria break down beta-lactam drugs and a gold-standard benchmark in protein ML. The model compresses each protein into a low-dimensional latent space that we read as a fitness landscape, balancing two forces: reconstruction accuracy, how well it recreates known proteins, and exploration pressure, how aggressively it searches the unknown. The balance between these two is controlled by a single parameter called beta, the dial between exploiting what evolution found and exploring what it missed (see the first code sketch at the end of the next section).

Stack: PyTorch, NumPy, SciPy, Plotly, Streamlit, Three.js, Biopython

Challenges We Faced

Early on, the model went silent. It memorised instead of organising, a failure mode called posterior collapse. We fixed this with beta-annealing: slowly increasing exploration pressure until the model was forced to build a meaningful map before it could take shortcuts. This one fix changed everything.

The evolutionary data was noisy. Thousands of nearly identical bacterial sequences threatened to drown out the rare variants we care about most. We implemented occupancy filtering and subsampling to protect signal over noise (second sketch below).

We had no internet access during the build, so we wrote the entire VAE in pure NumPy with manual backpropagation. Every gradient by hand (third sketch below). We built a full PyTorch pipeline in parallel for production.

And biology does not tolerate nonsense. Because we only train on proteins proven to exist in nature, the model cannot hallucinate impossible chemistry. No D-amino acids, no stoichiometric impossibilities. Evolution already did the quality control.
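To make the beta trade-off and the annealing fix concrete, here is a minimal sketch of the training objective in PyTorch. The layer sizes, sequence dimensions and warm-up length are illustrative assumptions, not our exact production settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SEQ_LEN, N_AA, LATENT = 286, 21, 16  # assumed: TEM-1 length, 20 amino acids + gap, latent size

class ProteinVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(SEQ_LEN * N_AA, 256)
        self.mu = nn.Linear(256, LATENT)
        self.logvar = nn.Linear(256, LATENT)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, SEQ_LEN * N_AA))

    def forward(self, x):                  # x: one-hot sequences, (batch, SEQ_LEN * N_AA)
        h = F.relu(self.encoder(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        return self.decoder(z), mu, logvar

def beta_schedule(epoch, warmup=10, beta_max=1.0):
    # beta-annealing: ramp exploration pressure up from zero so the model must
    # organise a meaningful latent map before the KL term can collapse it
    return beta_max * min(1.0, epoch / warmup)

def loss_fn(logits, x, mu, logvar, beta):
    # reconstruction accuracy: how well known proteins are recreated
    recon = F.cross_entropy(logits.view(-1, N_AA),
                            x.view(-1, N_AA).argmax(dim=-1), reduction='sum')
    # KL divergence to the prior: the exploration-pressure term, scaled by beta
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```

Low beta keeps samples close to known sequences; raising it pushes the model into the unexplored regions of the landscape.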
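The cleaning step can be sketched in a few lines of NumPy. The thresholds here, 50% column occupancy and 90% identity for near-duplicates, are assumed values for illustration.

```python
import numpy as np

def occupancy_filter(msa, min_occupancy=0.5):
    """Drop alignment columns that are mostly gaps ('-')."""
    occupancy = (msa != '-').mean(axis=0)        # fraction of non-gap residues per column
    return msa[:, occupancy >= min_occupancy]

def subsample_redundant(msa, max_identity=0.9, seed=0):
    """Greedily keep sequences less than max_identity identical to anything
    already kept, so near-duplicate bacterial sequences don't drown out
    the rare variants."""
    order = np.random.default_rng(seed).permutation(len(msa))  # avoid file-order bias
    kept = []
    for i in order:
        if all((msa[i] == msa[j]).mean() < max_identity for j in kept):
            kept.append(i)
    return msa[np.array(kept)]
```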
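And "every gradient by hand" looks like this in practice: one dense layer with a hand-derived backward pass, the pattern repeated throughout the pure-NumPy VAE. A minimal illustration, not the full model.

```python
import numpy as np

class Dense:
    """Fully connected layer with manual backpropagation."""
    def __init__(self, n_in, n_out, lr=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out))  # He initialisation
        self.b = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):
        self.x = x                          # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        # chain rule by hand: dL/dW = x^T @ dL/dy, dL/db = column sums of dL/dy
        grad_W = self.x.T @ grad_out
        grad_b = grad_out.sum(axis=0)
        grad_in = grad_out @ self.W.T       # gradient passed to the previous layer
        self.W -= self.lr * grad_W          # plain SGD update
        self.b -= self.lr * grad_b
        return grad_in
```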
What We Accomplished

In under 12 hours, starting from raw genomic data, we:

- parsed and encoded 10,129 evolutionary sequences,
- trained to full convergence, with loss dropping from 689 to 377 over 40 epochs,
- discovered 17 epistatic-peak candidate regions where the model predicts viable proteins should exist but no known sequence sits there,
- mapped 38 high-variability positions onto the 3D crystal structure of TEM-1, where the model found the catalytic active site on its own, without being told where to look (a sketch of one way to score this follows at the end of this post), and
- built an interactive dashboard with a live fitness landscape, 3D structure viewer and training dynamics.

The moment that mattered: our predicted residues clustered around the known active site. Nobody programmed that. The model learned it from evolution.

What We Learned

Evolution is the best dataset we are not using. Every protein that survived millions of years of selection is a validated experiment. The gaps between those proteins are not empty; they are full of possibility. Our model does not invent biology. It reads between the lines of 3.8 billion years of biological text and predicts the sentences that were never written but grammatically should work.

Beta is not just a hyperparameter; it is a philosophical choice. Low beta and the model stays close to what nature already found. High beta and it ventures into the white space, hunting for proteins that are mathematically probable but have never existed. Discovery lives in the tension between those two modes.

And the model does not need to be told what matters. Given only sequences, with no labels, no structures, no annotations, it independently identifies the residues that control function. That is not memorisation. That is understanding.

What's Next

Mesengenic AI is a target-agnostic engine: same workflow, different evolutionary prior, different disease.

- Antimicrobial resistance: predicting collateral sensitivity in superbugs. Prototyped and demonstrated today.
- Cancer: finding rescue mutations that restore broken p53 tumour suppression.
- Gene therapy: designing immune-evasive AAV capsids for drug delivery.
- Myelodysplastic syndrome and autoimmune disease: identifying where blood cell differentiation breaks down at the level of collective cellular organisation.

Our next step is experimental validation: Spearman rank correlation against MaveDB laboratory fitness scores to prove the model's predictions match real biology (sketched below), then wet lab partners to synthesise and test our 17 predicted epistatic peaks. The computation is done. Now we need biology to answer back.
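A sketch of the variability scoring mentioned above: one natural score is per-column Shannon entropy over the alignment, with the most entropic positions mapped onto the crystal structure. This is an illustrative assumption about the scoring, not necessarily the exact statistic in our dashboard.

```python
import numpy as np

def column_entropy(msa):
    """Shannon entropy per alignment column; high entropy = high variability."""
    n_seqs, seq_len = msa.shape
    entropy = np.zeros(seq_len)
    for pos in range(seq_len):
        _, counts = np.unique(msa[:, pos], return_counts=True)
        p = counts / n_seqs
        entropy[pos] = -(p * np.log2(p)).sum()
    return entropy

# usage (msa is the (n_seqs, seq_len) character array from the alignment):
#   top_positions = np.argsort(column_entropy(msa))[::-1][:38]
```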
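The planned validation is simple to express with SciPy: rank-correlate the model's variant scores against MaveDB's measured fitness. A minimal sketch, assuming the two score arrays are already aligned by variant (the names are hypothetical).

```python
from scipy.stats import spearmanr

def validate(model_scores, lab_scores):
    """Spearman rank correlation between VAE scores (e.g. per-variant ELBO)
    and laboratory fitness measurements from MaveDB."""
    rho, pval = spearmanr(model_scores, lab_scores)
    return rho, pval

# hypothetical usage:
#   rho, p = validate(vae_scores_per_variant, mavedb_fitness_scores)
```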
