PaleoGPT is a command-line interface (CLI) designed to bridge the gap between high-dimensional archaeogenetic data and historical narrative. By grounding the Gemini 3 generative model in the Allen Ancient DNA Resource (AADR), the tool transforms raw genetic metadata into immersive, scientifically accurate biographies of ancient individuals.

In the field of archaeogenetics, findings are often presented in abstract formats:

Standardized sample IDs (e.g., I0124)

Haplogroup designations (e.g., R1b1a1a2)

Principal Component Analysis (PCA) coordinates

To a non-expert, these data points are disconnected from the human experience. PaleoGPT was inspired by the need to humanize this "big data" by providing a narrative layer that respects the underlying genetic truth while offering a window into the life of the individual.

The project architecture follows a modular Retrieval-Augmented Generation (RAG) pattern to preserve data integrity and reduce model hallucination. Raw annotations from the AADR were processed into a relational SQLite database, which allows $O(\log n)$ lookups of a specific sample's metadata when the lookup column is indexed. The retrieved attributes (e.g., age, sex, archaeological culture, and burial location) are then passed as context to the LLM.
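The retrieval step might look roughly like the sketch below. The table name `samples`, its columns, and the helper names are assumptions made for illustration, and the inserted row is placeholder data rather than a real AADR record. Declaring `sample_id` as the primary key is what gives SQLite a B-tree index for the lookup.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `samples` table."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # rows can then be read by column name
    conn.execute(
        """
        CREATE TABLE samples (
            sample_id TEXT PRIMARY KEY,  -- indexed, so lookups use a B-tree
            date_bp   INTEGER,
            sex       TEXT,
            culture   TEXT,
            locality  TEXT
        )
        """
    )
    # Placeholder row for illustration only; not real AADR data.
    conn.execute(
        "INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
        ("DEMO001", 4500, "M", "Example Culture", "Example Site"),
    )
    return conn


def fetch_context(conn: sqlite3.Connection, sample_id: str):
    """Return the sample's metadata as 'key: value' lines, or None if absent."""
    row = conn.execute(
        "SELECT sample_id, date_bp, sex, culture, locality "
        "FROM samples WHERE sample_id = ?",
        (sample_id,),  # parameterized query, never string formatting
    ).fetchone()
    if row is None:
        return None
    return "\n".join(f"{key}: {row[key]}" for key in row.keys())


if __name__ == "__main__":
    print(fetch_context(build_demo_db(), "DEMO001"))
```

Returning `None` for an unknown ID lets the caller stop before the model is ever asked about a sample the database does not contain.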

For comparing ancient samples, the tool uses Global 25 (G25) coordinates. The proximity between two individuals $P$ and $Q$ in the 25-dimensional ancestral space is calculated using the Euclidean distance formula:

$$d(P, Q) = \sqrt{\sum_{i=1}^{25} (P_i - Q_i)^2}$$
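As a minimal sketch, the Euclidean distance between two G25 vectors can be computed as follows. The function name `g25_distance` and the length check are illustrative choices, not taken from the project's code.

```python
import math

G25_DIMENSIONS = 25


def g25_distance(p, q):
    """Euclidean distance between two 25-dimensional coordinate vectors."""
    if len(p) != G25_DIMENSIONS or len(q) != G25_DIMENSIONS:
        raise ValueError("both vectors must have exactly 25 coordinates")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    origin = [0.0] * G25_DIMENSIONS
    other = [3.0, 4.0] + [0.0] * (G25_DIMENSIONS - 2)
    print(g25_distance(origin, other))
```

A smaller distance means the two individuals sit closer together in the coordinate space.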

The Gemini 3 API was selected for its extensive context window and its ability to reason over complex historical datasets. The system instructions enforce a "fact-first" generation policy, so that the model prioritizes database-provided dates over anything it recalls from pre-training.
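One way to express a "fact-first" policy is to keep the instruction text and the database facts in clearly separated blocks, as in the sketch below. The wording of `SYSTEM_INSTRUCTION`, the `build_prompt` helper, and the field names are assumptions for illustration. The call to the Gemini API itself is omitted, since the project's client code is not shown here.

```python
# Hypothetical instruction text; the project's actual wording is not published here.
SYSTEM_INSTRUCTION = (
    "You are writing a biography of an ancient individual. "
    "Treat every value under DATABASE FACTS as authoritative. "
    "If your own knowledge conflicts with a listed date, use the listed date. "
    "Do not state facts about this individual that are not listed."
)


def build_prompt(facts: dict) -> str:
    """Render retrieved metadata as a labeled block followed by the task."""
    # Skip missing values so the model is not invited to fill them in.
    lines = [f"- {key}: {value}" for key, value in facts.items() if value is not None]
    return (
        "DATABASE FACTS\n"
        + "\n".join(lines)
        + "\n\nWrite a short biography grounded in these facts."
    )


if __name__ == "__main__":
    demo = {"sample_id": "DEMO001", "date_bp": 4500, "sex": None}
    print(SYSTEM_INSTRUCTION)
    print(build_prompt(demo))
```

Dropping `None` fields rather than printing them keeps gaps in the database from appearing in the prompt as something to be explained.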

Data Normalization: Mapping diverse archaeological labels from various studies into a unified schema required significant cleaning using the Pandas library.
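A cleaning step of this kind might look like the sketch below. The `culture` column name and the contents of `CULTURE_MAP` are invented for illustration; the real mapping would come from inspecting the labels used across the source studies.

```python
import pandas as pd

# Hypothetical mapping from study-specific spellings to one unified label.
CULTURE_MAP = {
    "bell beaker": "Bell Beaker",
    "bell_beaker": "Bell Beaker",
    "bb": "Bell Beaker",
    "yamnaya": "Yamnaya",
}


def normalize_culture(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the 'culture' column mapped to unified labels."""
    out = df.copy()  # leave the caller's frame untouched
    # Trim whitespace and lowercase before lookup so spelling variants collapse.
    key = out["culture"].str.strip().str.lower()
    # Labels missing from the mapping become "Unknown" instead of NaN.
    out["culture"] = key.map(CULTURE_MAP).fillna("Unknown")
    return out


if __name__ == "__main__":
    raw = pd.DataFrame({"culture": [" Bell Beaker ", "BB", "yamnaya", "???"]})
    print(normalize_culture(raw))
```

Mapping unrecognized labels to an explicit "Unknown" makes gaps in the mapping easy to count and review.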

Hallucination Control: Initial iterations saw the model inventing specific physical traits not present in the data. This was mitigated by strictly limiting the model's creative scope to the environmental and cultural context provided by the database.
