GeoGuessr Guessr: A Retrieval-Augmented Multi-Agent Pipeline for Country Prediction

I built GeoGuessr Guessr as a lightweight AI pipeline that tries to predict the most likely country from a GeoGuessr-style street image.

The idea behind the project was simple: instead of depending on a single model call, I wanted to combine three different strengths in one system:

- Visual similarity search using CLIP + FAISS
- Scene clue extraction using Gemini
- Final country reasoning using Claude

The result is a prototype that takes an input image, extracts geographic clues, retrieves visually similar locations from a dataset, and then asks a reasoning model to make the final country-level prediction.

What the Project Does

At a high level, the system works like this:

1. A query image is passed into the pipeline.
2. The image is converted into a CLIP embedding.
3. That embedding is used to retrieve the nearest visual matches from a FAISS index.
4. The same image is sent to Gemini, which extracts clues like:
   - language on signs
   - road markings
   - vegetation
   - architecture
   - region-specific hints
5. Those clues, along with the retrieved examples, are passed into Claude.
6. Claude generates the most likely country guess and explains the reasoning.

Because the final guess draws on both the extracted clues and the retrieved examples, the prediction is better grounded than what a plain one-shot prompt would produce.
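The flow above can be sketched as a single function that wires the stages together. This is a minimal illustration, not the repo's actual code: `predict_country`, `Neighbor`, `Prediction`, and the four stage callables are hypothetical names, and the stub stages stand in for CLIP, FAISS, Gemini, and Claude so the sketch runs without models or API keys.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Neighbor:
    country: str   # label of a retrieved dataset image
    score: float   # similarity to the query (higher means closer)


@dataclass
class Prediction:
    country: str
    reasoning: str


def predict_country(
    image_path: str,
    embed: Callable[[str], Sequence[float]],
    retrieve: Callable[[Sequence[float], int], List[Neighbor]],
    extract_clues: Callable[[str], List[str]],
    reason: Callable[[List[str], List[Neighbor]], Prediction],
    k: int = 5,
) -> Prediction:
    """Run the stages in order: embed, retrieve, extract clues, reason."""
    embedding = embed(image_path)        # CLIP in the real pipeline
    neighbors = retrieve(embedding, k)   # FAISS lookup in the real pipeline
    clues = extract_clues(image_path)    # Gemini in the real pipeline
    return reason(clues, neighbors)      # Claude in the real pipeline


# Stub stages (made-up values) so the sketch is runnable on its own.
def stub_embed(image_path: str) -> Sequence[float]:
    return [0.1, 0.9]


def stub_retrieve(embedding: Sequence[float], k: int) -> List[Neighbor]:
    hits = [Neighbor("Portugal", 0.91), Neighbor("Spain", 0.88), Neighbor("Portugal", 0.85)]
    return hits[:k]


def stub_extract_clues(image_path: str) -> List[str]:
    return ["Portuguese text on a shop sign", "cobblestone pavement"]


def stub_reason(clues: List[str], neighbors: List[Neighbor]) -> Prediction:
    top = neighbors[0].country
    return Prediction(top, f"{len(clues)} clues and {len(neighbors)} neighbors point to {top}.")


result = predict_country(
    "query.jpg", stub_embed, stub_retrieve, stub_extract_clues, stub_reason, k=3
)
print(result.country, "|", result.reasoning)
```

Passing the stages in as callables is one way to keep each step swappable and testable in isolation; the real project may organize this differently.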

Why I Built It This Way

GeoGuessr-style reasoning is interesting because it mixes several kinds of intelligence:

- visual pattern recognition
- geographic intuition
- language and signage interpretation
- contextual reasoning

A single model can do some of this, but I wanted to break the task into stages that feel more structured:

- one stage for seeing
- one stage for retrieving memory
- one stage for reasoning

That made the project more modular and easier to inspect.

Public GitHub Repo Scope

The public GitHub repository contains the full pipeline code, setup instructions, and project documentation, including a breakdown of how the system works internally.

Because of repository size constraints, the dataset and prebuilt vector index artifacts are not included in the public repo. Instead, the repo includes the scripts needed to:

- load a folder-based country dataset
- generate CLIP embeddings
- build a FAISS index
- run the full inference pipeline
- benchmark the system on a sample set

So the public repo reflects the actual architecture and executable logic, while the large data artifacts are expected to be created locally.
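To make "folder-based country dataset" concrete, here is a minimal sketch of a loader. It assumes a layout of one subfolder per country (for example `dataset/Japan/0001.jpg`), with the folder name used as the label. That layout, the `load_country_dataset` name, and the extension list are assumptions for illustration, not details confirmed by the repo.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed; adjust to the dataset


def load_country_dataset(root: str) -> List[Tuple[str, str]]:
    """Return (image_path, country_label) pairs; each subfolder name is the label."""
    samples: List[Tuple[str, str]] = []
    for country_dir in sorted(Path(root).iterdir()):
        if not country_dir.is_dir():
            continue
        for image_path in sorted(country_dir.iterdir()):
            # Skip anything that is not an image (notes, hidden files, etc.).
            if image_path.suffix.lower() in IMAGE_EXTENSIONS:
                samples.append((str(image_path), country_dir.name))
    return samples


# Tiny demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for country, filename in [("Japan", "a.jpg"), ("Brazil", "b.PNG"), ("Brazil", "notes.txt")]:
        folder = Path(tmp) / country
        folder.mkdir(exist_ok=True)
        (folder / filename).touch()
    demo_labels = [label for _, label in load_country_dataset(tmp)]

print(demo_labels)
```

The resulting list of pairs is the kind of input an embedding script would iterate over before building the index.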

Tech Stack

This project uses:

- Python
- CLIP for image embeddings
- FAISS for nearest-neighbor retrieval
- Google Gemini for visual clue extraction
- Anthropic Claude for final reasoning
- PIL / Transformers / Torch for image and model handling
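For readers new to embedding retrieval, the CLIP + FAISS step boils down to "find the stored vectors most similar to the query vector." The sketch below is a brute-force stand-in in plain Python using cosine similarity. It is not FAISS's API, and the toy three-dimensional vectors are made up (real CLIP embeddings have hundreds of dimensions). In FAISS, a flat inner-product index over L2-normalized vectors yields the same ranking, though the post does not say which index type the project uses.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(
    query: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (label, similarity) pairs closest to the query, best first."""
    q = normalize(query)
    scored = [
        (label, sum(a * b for a, b in zip(q, normalize(vec))))
        for label, vec in index
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy "index": country label plus a made-up embedding.
toy_index = [
    ("Japan", [0.9, 0.1, 0.0]),
    ("Brazil", [0.1, 0.9, 0.2]),
    ("Japan", [0.8, 0.3, 0.1]),
    ("Kenya", [0.0, 0.2, 0.9]),
]

hits = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print([label for label, _ in hits])
```

A linear scan like this is fine for a few thousand images; a dedicated index matters once the dataset grows.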

What I Learned

A few interesting lessons came out of building this:

- Retrieval adds useful grounding, especially when visual scenes are ambiguous.
- Splitting perception and reasoning into separate stages makes the system easier to debug.
- Repository management matters a lot when working with datasets and vector indexes; large files can become a real publishing problem.
- Even when a pipeline is conceptually “multi-agent,” the implementation details matter. In the current version, the stages simply run in sequence, which keeps the flow practical rather than overly abstract.
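As an illustration of how retrieval can ground the reasoning step, the retrieved neighbors could be condensed into a short per-country summary before being placed in the prompt. This is a hypothetical sketch: `summarize_neighbors`, `format_retrieval_context`, and the similarity-weighted scheme are assumptions, not the project's documented approach.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize_neighbors(neighbors: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Sum similarity per country and return (country, share) pairs, largest share first."""
    totals: Dict[str, float] = defaultdict(float)
    for country, score in neighbors:
        totals[country] += max(score, 0.0)  # ignore negative similarities
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    shares = [(country, total / grand_total) for country, total in totals.items()]
    shares.sort(key=lambda item: item[1], reverse=True)
    return shares


def format_retrieval_context(shares: List[Tuple[str, float]]) -> str:
    """Render the shares as plain-text lines suitable for inclusion in a prompt."""
    return "\n".join(
        f"- {country}: {share:.0%} of retrieval weight" for country, share in shares
    )


# Made-up neighbors: (country label, similarity to the query image).
neighbors = [("Portugal", 0.9), ("Spain", 0.6), ("Portugal", 0.5)]
shares = summarize_neighbors(neighbors)
context = format_retrieval_context(shares)
print(context)
```

A summary like this gives the reasoning model a compact prior from the dataset, which it can weigh against the clues extracted from the image itself.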

Current State of the Project

Right now, the public version is best described as a retrieval-augmented prototype for country prediction from street-view imagery.

It already demonstrates the full flow clearly:

- dataset preparation
- vector index building
- clue extraction
- reasoning-based prediction

There are still areas I want to improve, especially around:

- stronger evaluation
- more structured final outputs
- better reflection and self-correction
- easier reproducibility for anyone cloning the repo
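On the structured-output point, one possible direction is to ask the reasoning model for a JSON object and validate it before use. The sketch below is purely illustrative: the `CountryGuess` schema, its field names, and `parse_guess` are hypothetical and not part of the current project.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CountryGuess:
    country: str
    confidence: float      # assumed to be a value between 0 and 1
    evidence: List[str]    # short clue strings supporting the guess


def parse_guess(raw: str) -> Optional[CountryGuess]:
    """Parse a model reply expected to be a JSON object; return None if it does not fit."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    country = data.get("country")
    confidence = data.get("confidence")
    evidence = data.get("evidence", [])

    if not isinstance(country, str) or not country.strip():
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    if not isinstance(evidence, list) or not all(isinstance(e, str) for e in evidence):
        return None
    return CountryGuess(country.strip(), float(confidence), evidence)


good = parse_guess('{"country": "Chile", "confidence": 0.8, "evidence": ["Spanish signs"]}')
bad = parse_guess("I think it is probably Chile.")
print(good, bad)
```

A validated structure like this would also make evaluation simpler, since the predicted country can be compared against the label directly instead of being extracted from free text.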

Why I’m Excited About It

What I like most about this project is that it sits at the intersection of computer vision, LLM reasoning, and search. It is small enough to understand end to end, but still rich enough to show how different AI components can work together on a non-trivial task.

It also feels like a fun example of how modern AI systems often work better as pipelines, not just prompts.