Inspiration
Our primary inspiration is Don Swanson's 1986 concept of "undiscovered public knowledge": isolated islands of literature often contain complementary findings that nobody has connected. We recognized that many potentially significant breakthroughs stay undiscovered simply because researchers are siloed into hyper-specific disciplines. This led us to view discoveries not as serendipitous inventions, but as inevitable, unmapped edges in a topological graph of existing human knowledge.
What it does
Episteme is a decentralized, multi-disciplinary consensus engine that algorithmically generates scientific hypotheses. It ingests global scientific literature, maps it into a high-dimensional semantic vector space, and estimates the probability that an unmapped link exists between concepts in highly isolated disciplines. The output is a set of novel, verifiable hypotheses for critical domains like drug repurposing and sustainable materials design.
How we built it
We built a scalable natural language processing pipeline using NLTK and custom transformers adapted to scientific text (using resources such as the SciVocab vocabulary and the S2ORC corpus) to process raw literature into a Scientific Knowledge Graph. We applied Topological Data Analysis (TDA), specifically persistent homology, to track "knowledge gaps" across multiple scales. For reasoning, we developed a hybrid Association Engine that combines deterministic association rules mined with the Apriori algorithm with the predictions of deep neural networks. All of this is orchestrated by a multi-agent system running on resilient AWS infrastructure.
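
To make the rule-mining half of the Association Engine more concrete, here is a minimal Apriori sketch in plain Python. It is an illustration under stated assumptions, not the project's actual code: the function name `apriori`, the `min_support` threshold, and the toy `papers` data (loosely modeled on Swanson's fish oil and Raynaud's example) are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single concepts.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy data: each "paper" is the set of concepts it mentions.
papers = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "blood viscosity", "platelet aggregation"},
    {"blood viscosity", "raynaud"},
    {"blood viscosity", "raynaud", "platelet aggregation"},
    {"fish oil", "platelet aggregation"},
]
frequent = apriori(papers, min_support=0.4)
```

In this toy data, "fish oil" and "raynaud" never co-occur, yet each is frequently paired with "blood viscosity". That shared neighbor is the kind of indirect link a hypothesis generator would flag for review.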
Challenges we ran into
A major mathematical hurdle was the "hubness problem" in high-dimensional vector spaces, where a small fraction of generic nodes shows up in a disproportionate share of nearest-neighbor results and hides meaningful cross-disciplinary links. We mitigated this by mean-centering and whitening our embeddings so that their variance is approximately isotropic. We also had to work around the computational cost of the Apriori algorithm on large datasets and handle the extreme class imbalance of specialized scientific terminology.
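
The mean-centering and whitening step can be sketched with NumPy as follows. This is a generic PCA-whitening example, assuming embeddings are stored as rows of a matrix with more rows than dimensions; the function name `whiten` and the `eps` stabilizer are our own illustrative choices, not necessarily what the pipeline uses.

```python
import numpy as np


def whiten(embeddings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Mean-center the rows, then PCA-whiten so the covariance is ~identity."""
    # Mean-centering: remove the shared offset that all embeddings have in common.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Sample covariance of the centered embeddings (dimensions x dimensions).
    cov = centered.T @ centered / (len(embeddings) - 1)
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate onto the principal axes and rescale each axis to unit variance.
    scale = 1.0 / np.sqrt(eigvals + eps)
    return (centered @ eigvecs) * scale


# Usage on synthetic, deliberately anisotropic data (not real embeddings).
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4)) + 5.0
white = whiten(raw)
```

After whitening, no single direction dominates distance computations, which is the property that is intended to reduce hub formation in nearest-neighbor search.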
Accomplishments that we're proud of
We are particularly proud of our core probability equation, which uses cosine similarity to penalize obvious connections and exponentially reward novel, unmapped cross-disciplinary leaps. We also integrated persistent homology into a graph neural network framework, moving from simple statistical co-occurrence to structural, shape-based link prediction. Finally, we designed an agentic system that balances rigorous domain-specific searches with exploratory cross-connections.
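
The exact equation is not reproduced in this write-up. As a purely hypothetical illustration of the idea, the sketch below scores a candidate link between concepts `a` and `c` through a bridging concept `b`: the score grows with the weaker of the two bridge similarities and decays exponentially with the direct similarity between `a` and `c`. The functional form, the name `novelty_score`, and the decay rate `k` are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def novelty_score(a, b, c, k=3.0):
    """Hypothetical score for the indirect link a -> b -> c.

    bridge:      how well b connects to both endpoints (weakest of the two links).
    obviousness: how similar a and c already are to each other directly.
    The exponential term shrinks the score as the direct link becomes obvious.
    """
    bridge = max(0.0, min(cosine(a, b), cosine(b, c)))
    obviousness = max(0.0, cosine(a, c))
    return bridge * math.exp(-k * obviousness)


# Distant endpoints joined by a shared bridge score high ...
distant = novelty_score([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])
# ... while nearly identical endpoints are penalized as obvious.
obvious = novelty_score([1.0, 0.0], [1.0, 0.05], [1.0, 0.1])
```

A larger `k` punishes already-similar concept pairs more aggressively; choosing it would in practice require validation against known discoveries.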
What we learned
We learned that current Large Language Models (LLMs) are fundamentally insufficient for true hypothesis generation because of their susceptibility to hallucination and sycophancy: they optimize for summarizing what already exists rather than computing verifiable new science. We also realized how important it is to model scientific evolution temporally, which led us to represent complex ideas as dynamic hyperedges that evolve over time rather than as static pairwise links.
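
One simple way to picture a dynamic hyperedge is as a set of concepts with time-stamped evidence attached. The data structure below is a minimal sketch of that idea, assuming yearly granularity and additive evidence weights; the class name `TemporalHyperedge` and its methods are hypothetical and only meant to show how a hyperedge differs from a static pairwise link.

```python
from dataclasses import dataclass, field


@dataclass
class TemporalHyperedge:
    """A relation among any number of concepts, with evidence recorded per year."""
    concepts: frozenset
    weights: dict = field(default_factory=dict)  # year -> accumulated evidence

    def observe(self, year, weight=1.0):
        # Record new evidence (for example, a paper mentioning all the concepts).
        self.weights[year] = self.weights.get(year, 0.0) + weight

    def weight_at(self, year):
        # Cumulative evidence up to and including the given year.
        return sum(w for y, w in self.weights.items() if y <= year)


# Usage with made-up years and weights.
edge = TemporalHyperedge(frozenset({"protein A", "pathway B", "material C"}))
edge.observe(2019)
edge.observe(2021, weight=2.0)
```

Querying `weight_at` for successive years gives a trajectory for the idea, which a static edge with a single weight cannot express.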
What's next for Episteme
We plan to implement zero-shot link prediction so that Episteme can form hypotheses about entirely new, unseen entities (such as newly synthesized materials or proteins) without prior training examples. We also aim to scale our multi-agent framework for immediate, high-impact applications, such as finding sustainable, biocompatible substitutes for heavily regulated PFAS chemicals. Ultimately, we envision Episteme operating as a continuously computable world model that updates global scientific hypotheses in real time as new research is published.