GeoGuessr Guessr: A Retrieval-Augmented Multi-Agent Pipeline for Country Prediction
I built GeoGuessr Guessr as a lightweight AI pipeline that tries to predict the most likely country from a GeoGuessr-style street image.
The idea behind the project was simple: instead of depending on a single model call, I wanted to combine three different strengths in one system:
- Visual similarity search using CLIP + FAISS
- Scene clue extraction using Gemini
- Final country reasoning using Claude
The result is a prototype that takes an input image, extracts geographic clues, retrieves visually similar locations from a dataset, and then asks a reasoning model to make the final country-level prediction.
What the Project Does
At a high level, the system works like this:
- A query image is passed into the pipeline.
- The image is converted into a CLIP embedding.
- That embedding is used to retrieve the nearest visual matches from a FAISS index.
- The same image is sent to Gemini, which extracts clues like:
  - language on signs
  - road markings
  - vegetation
  - architecture
  - region-specific hints
- Those clues, along with the retrieved examples, are passed into Claude.
- Claude generates the most likely country guess and explains the reasoning.
Because the final guess is conditioned on both the extracted clues and the retrieved examples, the prediction is better grounded than a plain one-shot prompt.
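As a rough illustration of the flow above, here is a minimal sketch of how the stages could be chained. Every name in it (`predict_country`, `Neighbor`, `Prediction`, and the four stage callables) is hypothetical and chosen for illustration; the repo's actual function names and signatures may differ. The stages are passed in as plain callables so that the CLIP, FAISS, Gemini, and Claude calls can each be swapped for stubs.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Neighbor:
    """One retrieved dataset image (hypothetical structure)."""
    country: str   # country label of the retrieved image
    score: float   # similarity score, higher means closer


@dataclass
class Prediction:
    """Final output of the reasoning stage (hypothetical structure)."""
    country: str
    reasoning: str


def predict_country(
    image_path: str,
    embed: Callable[[str], Sequence[float]],                    # e.g. a CLIP image encoder
    retrieve: Callable[[Sequence[float], int], List[Neighbor]], # e.g. a FAISS lookup
    extract_clues: Callable[[str], List[str]],                  # e.g. a Gemini vision call
    reason: Callable[[List[str], List[Neighbor]], Prediction],  # e.g. a Claude call
    k: int = 5,
) -> Prediction:
    """Run the stages strictly in sequence and return the final guess."""
    embedding = embed(image_path)         # image -> embedding
    neighbors = retrieve(embedding, k)    # embedding -> k nearest visual matches
    clues = extract_clues(image_path)     # image -> textual geographic clues
    return reason(clues, neighbors)       # clues + matches -> country guess
```

Calling `predict_country` with stub functions in place of the real model calls is one way to exercise the control flow without API keys or a built index.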
Why I Built It This Way
GeoGuessr-style reasoning is interesting because it mixes several kinds of intelligence:
- visual pattern recognition
- geographic intuition
- language and signage interpretation
- contextual reasoning
A single model can do some of this, but I wanted to break the task into stages that feel more structured:
- one stage for seeing
- one stage for retrieving memory
- one stage for reasoning
That made the project more modular and easier to inspect.
Public GitHub Repo Scope
The public GitHub repository contains the full pipeline code, setup instructions, and project documentation, including a breakdown of how the system works internally.
Because of repository size constraints, the dataset and prebuilt vector index artifacts are not included in the public repo. Instead, the repo includes the scripts needed to:
- load a folder-based country dataset
- generate CLIP embeddings
- build a FAISS index
- run the full inference pipeline
- benchmark the system on a sample set
So the public repo reflects the actual architecture and executable logic, while the large data artifacts are expected to be created locally.
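For anyone recreating the data locally, the first step (loading a folder-based country dataset) might look something like the sketch below. It assumes a layout of `root/<country>/<image>` and a small set of image extensions; both are assumptions, as are the names `load_country_dataset` and `IMAGE_SUFFIXES`, and the repo's own loader may be organized differently.

```python
from pathlib import Path
from typing import List, Tuple

# Assumed set of image file types; adjust to match the local dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def load_country_dataset(root: str) -> List[Tuple[Path, str]]:
    """Return (image_path, country_label) pairs from root/<country>/<image>.

    The folder name is used as the country label. Files that sit directly
    in the root, and files without an image extension, are skipped.
    """
    samples: List[Tuple[Path, str]] = []
    for country_dir in sorted(Path(root).iterdir()):
        if not country_dir.is_dir():
            continue
        for image_path in sorted(country_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((image_path, country_dir.name))
    return samples
```

The resulting list of pairs is the natural input for the embedding step: each path would be encoded with CLIP, and the labels would be kept in the same order as the vectors added to the index.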
Tech Stack
This project uses:
- Python
- CLIP for image embeddings
- FAISS for nearest-neighbor retrieval
- Google Gemini for visual clue extraction
- Anthropic Claude for final reasoning
- PIL / Transformers / Torch for image and model handling
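To make the CLIP + FAISS pairing concrete, the sketch below shows the retrieval idea in plain Python. It is deliberately not the real stack: a brute-force cosine-similarity search stands in for FAISS, and short hand-written vectors stand in for CLIP embeddings. In FAISS, the closest exact equivalent would be an inner-product flat index (`IndexFlatIP`) over L2-normalized vectors; whether the repo uses that index type is an assumption. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_neighbors(
    query: Sequence[float],
    index_vectors: Sequence[Sequence[float]],
    labels: Sequence[str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (label, cosine_similarity) pairs closest to the query.

    Brute-force stand-in for a vector index: every stored vector is scored
    against the query, then the best k are kept.
    """
    q = normalize(query)
    scored = []
    for vec, label in zip(index_vectors, labels):
        v = normalize(vec)
        score = sum(a * b for a, b in zip(q, v))
        scored.append((label, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

The country labels of the returned neighbors are what would be handed to the reasoning stage as retrieved examples.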
What I Learned
A few interesting lessons came out of building this:
- Retrieval adds useful grounding, especially when visual scenes are ambiguous.
- Splitting perception and reasoning into separate stages makes the system easier to debug.
- Repository management matters a lot when working with datasets and vector indexes; large files can become a real publishing problem.
- Even when a pipeline is conceptually “multi-agent,” the implementation details matter. In the current version, the stages simply run one after another, which keeps the flow practical rather than overly abstract.
Current State of the Project
Right now, the public version is best described as a retrieval-augmented prototype for country prediction from street-view imagery.
It already demonstrates the full flow clearly:
- dataset preparation
- vector index building
- clue extraction
- reasoning-based prediction
There are still areas I want to improve, especially around:
- stronger evaluation
- more structured final outputs
- better reflection and self-correction
- easier reproducibility for anyone cloning the repo
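On the evaluation point, one simple starting place would be overall and per-country top-1 accuracy over a labeled sample set. The sketch below is an assumption about what such a benchmark could compute, not a description of the repo's benchmark script; `evaluate` and its input format are hypothetical. Labels are compared case-insensitively after trimming whitespace, since free-text model output may vary in formatting.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def evaluate(pairs: List[Tuple[str, str]]) -> Tuple[float, Dict[str, float]]:
    """Compute top-1 accuracy from (true_country, predicted_country) pairs.

    Returns the overall accuracy and a dict of accuracy per true country.
    """
    if not pairs:
        raise ValueError("need at least one (truth, prediction) pair")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for truth, predicted in pairs:
        total[truth] += 1
        if truth.strip().lower() == predicted.strip().lower():
            correct[truth] += 1
    overall = sum(correct.values()) / len(pairs)
    per_country = {country: correct[country] / total[country] for country in total}
    return overall, per_country
```

Per-country numbers tend to be more informative than a single score here, because a dataset dominated by a few countries can hide weak performance on the rest.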
Why I’m Excited About It
What I like most about this project is that it sits at the intersection of computer vision, LLM reasoning, and search. It is small enough to understand end to end, but still rich enough to show how different AI components can work together on a non-trivial task.
It also feels like a fun example of how modern AI systems often work better as pipelines, not just prompts.