GeoGuesser: Guessing Countries from Street View

Made by: Abhyudaya Sharma (ashar161), Klimentina Krstevska (kkrstevs), YouJung Koo (ykoo6)

Introduction

The project takes inspiration from GeoGuessr, an online game that uses Google Street View images to test the players' geography knowledge by having them guess the location of a given image on a world map. Building on this idea, we are developing a new CNN model that can identify the origin country of a given Google Street View image based on its scenery.

We apply the ResNet-50, DenseNet-121 and Inception-v3 architectures, as well as a linear probe on CLIP features, to classify the country in which an input image was taken. We evaluate each model by its top-1, top-2, top-3, and top-5 accuracy.
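As a minimal, framework-free sketch of the metric (not our actual evaluation code), top-k accuracy counts a prediction as correct when the true country is among the k highest-scoring classes. The function name `top_k_accuracy` and the toy scores are illustrative assumptions.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest score for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example: 3 samples, 4 classes (the real task has 43 classes).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1 has the highest score
    [0.5, 0.1, 0.3, 0.1],  # true class 2 has the second-highest score
    [0.4, 0.3, 0.2, 0.1],  # true class 3 has the lowest score
]
labels = [1, 2, 3]

for k in (1, 2, 3):
    print(f"top-{k}: {top_k_accuracy(scores, labels, k):.2f}")
```

By construction, top-k accuracy can only stay the same or grow as k increases, which is why the top-5 numbers are always at least as high as the top-1 numbers.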

Data

We use a pre-existing image dataset from Kaggle (GeoLocation - GeoGuessr Images (50K)) of ~50,000 GeoGuessr images from the GeoWorld challenge. The original dataset consists of ~150 subfolders, each containing images from a different country. After eliminating countries with fewer than 200 images, we were left with 43 countries. We also eliminated ~30 images that deviated from the standard image size of 1536 x 662 pixels. The final dataset is 6.74 GB in size and contains 31,531 images.
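The country filter described above could be sketched as follows, assuming one subfolder per country. The helper names `count_images` and `keep_large_classes` and the example counts are hypothetical; in the project itself the small folders were removed manually.

```python
from pathlib import Path
from typing import Dict

MIN_IMAGES = 200  # threshold stated in the text above


def count_images(dataset_root: Path) -> Dict[str, int]:
    """Count the files in each country subfolder of `dataset_root`."""
    return {
        folder.name: sum(1 for f in folder.iterdir() if f.is_file())
        for folder in sorted(dataset_root.iterdir())
        if folder.is_dir()
    }


def keep_large_classes(counts: Dict[str, int],
                       min_images: int = MIN_IMAGES) -> Dict[str, int]:
    """Drop every country with fewer than `min_images` images."""
    return {name: n for name, n in counts.items() if n >= min_images}


# Made-up counts for illustration only; not the real dataset statistics.
example_counts = {"CountryA": 1250, "CountryB": 199, "CountryC": 200}
print(keep_large_classes(example_counts))
```

On a real copy of the dataset, `keep_large_classes(count_images(Path(...)))` would return the retained countries with their image counts.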

We used stratified sampling so that each country is represented in the same proportion in the training and test sets. With an 8:2 split, the training set comprises 25,248 images and the test set 6,283 images. To preprocess images for the ResNet-50, DenseNet-121 and Inception-v3 models, we drew two black boxes, one in the bottom right and one in the top right, to cover the map and the game score respectively (see images below for details). For the linear probe CLIP model we used CLIP’s own preprocessing.

We intentionally avoided some data augmentation methods, such as flipping and translation, in order to retain crucial features like the side of the road on which cars drive. Furthermore, we did not add noise to the original images: they are very large and contain many features, so it would be hard for the model to overfit on them. We one-hot encoded the 43 unique countries to create the training and test labels.
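The stratified split and the one-hot labels described above can be illustrated with a short standard-library sketch. The function names, the fixed seed, and the two example countries are assumptions for illustration; the project's own script may differ.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str],
                     test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Each class contributes the same fraction of its images to the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


def one_hot(label: str, classes: Sequence[str]) -> List[int]:
    """Encode `label` as a vector with a single 1 at its class index."""
    vector = [0] * len(classes)
    vector[list(classes).index(label)] = 1
    return vector


# Hypothetical labels: 300 images of one country, 200 of another.
labels = ["Japan"] * 300 + ["Brazil"] * 200
train_idx, test_idx = stratified_split(labels)
classes = sorted(set(labels))
print(len(train_idx), len(test_idx))
print(one_hot("Japan", classes))
```

Splitting per class, rather than shuffling the whole dataset once, keeps rare countries from ending up almost entirely in one of the two sets.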

Methodology

Using Google Cloud Platform, we trained two models, ResNet-50 and DenseNet-121, for 10 epochs each with a batch size of 5. We acknowledge that this batch size is small compared to common practice. However, given that our images are 1536 x 662 pixels and we were allocated limited computing resources, this was the largest number of images we could fit on the GPU in a single batch. As a result, each epoch took ~1.5 hours to train. We trained with the Adam optimizer, a learning rate of 0.001, and cross-entropy loss.
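To put these numbers in perspective, the following back-of-the-envelope calculation derives the number of batches per epoch and the approximate time per batch from the figures above. It also compares the pixel count of one full image with a 224 x 224 input (the crop size mentioned for CLIP); that comparison is only an illustration of why so few images fit on the GPU, not a memory measurement.

```python
import math

TRAIN_IMAGES = 25_248      # training set size from the Data section
BATCH_SIZE = 5
HOURS_PER_EPOCH = 1.5      # approximate figure reported above
WIDTH, HEIGHT = 1536, 662  # standard image size in the dataset

batches_per_epoch = math.ceil(TRAIN_IMAGES / BATCH_SIZE)
seconds_per_batch = HOURS_PER_EPOCH * 3600 / batches_per_epoch

# How many times more pixels a full image has than a 224 x 224 input.
pixel_ratio = (WIDTH * HEIGHT) / (224 * 224)

print(batches_per_epoch)            # 5050
print(round(seconds_per_batch, 2))  # 1.07
print(round(pixel_ratio, 1))        # 20.3
```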

We also attempted several times to train the Inception-v3 model for 10 epochs. Each time, the run stopped halfway through the second epoch because it ran out of memory. For Inception-v3 we therefore report test accuracies after only a single epoch of training.

For the linear probe CLIP model, we trained a logistic regression classifier on top of the pre-trained CLIP model’s features, using the SAGA solver with a maximum of 1000 iterations. This model gave us the worst performance, which we attribute to CLIP's image preprocessing: it cuts a 224 x 224 square from the center of the image, which for many images in the dataset shows just a road. We also found that increasing the maximum number of iterations only slightly increased the top-1 accuracy while decreasing the other accuracies.
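A rough calculation shows how much of each image survives such a crop. This sketch assumes the preprocessing first resizes the shorter side to 224 pixels and then takes a centered 224 x 224 square; the function name `center_crop_coverage` is hypothetical.

```python
def center_crop_coverage(width: int, height: int, crop: int = 224) -> float:
    """Fraction of the image area kept after resizing the shorter side to
    `crop` pixels and taking a centered crop x crop square."""
    scale = crop / min(width, height)
    resized_w = round(width * scale)
    resized_h = round(height * scale)
    return (crop * crop) / (resized_w * resized_h)


coverage = center_crop_coverage(1536, 662)
print(f"{coverage:.0%} of the image is kept")  # 43% of the image is kept
```

Under this assumption, more than half of each wide Street View frame, namely the left and right edges where buildings, signs and vegetation tend to appear, never reaches the classifier.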

Results

We trained ResNet-50 and DenseNet-121 individually for 10 epochs each, with a batch size of 5, on both the original images and the preprocessed images with black boxes. The total training time for these two models was 60 hours, with an average of 1.5 hours per epoch. The table and graph below show the models' performance on the test dataset for the top-1, top-2, top-3, and top-5 accuracies. For more details, including the training accuracy per epoch, please refer to the Appendix in the report document.

Model	Top 1 Accuracy	Top 2 Accuracy	Top 3 Accuracy	Top 5 Accuracy
ResNet-50	43.80%	55.98%	63.92%	74.45%
ResNet-50 with black box	36.81%	50.52%	59.11%	70.13%
DenseNet-121	49.56%	60.96%	68.01%	77.65%

[Top-1, top-2, top-3, and top-5 accuracies for the ResNet-50 and DenseNet-121 models on the testing dataset after 10 epochs of training.]

Model	Top 1 Accuracy	Top 2 Accuracy	Top 3 Accuracy	Top 5 Accuracy
Inception-v3	28.68%	42.05%	47.64%	56.10%

[Top-1, top-2, top-3, and top-5 accuracies for the Inception-v3 model on the testing dataset after 1 epoch of training.]

Max iterations	Top 1 accuracy	Top 2 accuracy	Top 3 accuracy	Top 5 accuracy
100 iterations	5.63%	7.24%	8.66%	11.32%
1000 iterations	5.79%	7.11%	8.40%	10.97%

[Top-1, top-2, top-3, and top-5 accuracies for Linear Probe with CLIP using the SAGA solver.]

Challenges

The majority of the challenges we had to overcome were related to the initial project setup, data preprocessing and pipeline building. Because of the volume of our data, we quickly ran out of allocated GCP credits, so we had to set up 4 different VM instances on 3 different accounts and copy over the datasets and models each time. The images in the dataset did not have labels in their file names but were instead organized in folders named after the countries. After manually removing the folders with fewer than 200 images, we had to write a script that extracted the images from each country folder, assigned a label to them, and then put them in the training or testing directories with an 80/20 split. Because of the size of our dataset, we had to write our own iterator for loading the images in each batch. During this step we also found that ~30 images differed from the standard 1536 x 662 pixel size of the rest, and we discarded these images as well. This was the first time any of us had used PyTorch, so we also faced some challenges while building the end-to-end pipeline.
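A batch iterator of the kind mentioned above could look like the following sketch, which loads images lazily so that only one batch is held in memory at a time. The name `iterate_batches` and the `load_image` callback are assumptions; this is not the iterator from our repository.

```python
from typing import Callable, Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iterate_batches(paths: Sequence[str],
                    labels: Sequence[int],
                    batch_size: int,
                    load_image: Callable[[str], T],
                    ) -> Iterator[Tuple[List[T], List[int]]]:
    """Yield (images, labels) batches, loading images only when requested."""
    for start in range(0, len(paths), batch_size):
        stop = start + batch_size
        # Images are read here, one batch at a time, not up front.
        images = [load_image(path) for path in paths[start:stop]]
        yield images, list(labels[start:stop])


# Illustration with a stand-in loader that just returns the file name.
paths = [f"img_{i}.png" for i in range(12)]
labels = [i % 3 for i in range(12)]
batches = list(iterate_batches(paths, labels, batch_size=5, load_image=str))
print([len(images) for images, _ in batches])
```

In practice `load_image` would open the file and convert it to a tensor; the final batch is simply shorter when the dataset size is not a multiple of the batch size.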

By far the biggest challenge we faced was the lack of adequate computing resources, which prevented us from using a bigger batch size and running our models faster.

Reflections

We are very satisfied with the outcome of our final project. We reached our base goal of better-than-random accuracy (2.3256%) as well as our target goal of better top-1 accuracy than the DeepGeo paper [1] while using a single image. We also believe that we are close to achieving our stretch goal of building an interpretable model with 50% top-1 accuracy. Currently we are at 43% top-1 accuracy, and while verifying outputs during testing we found that our model tends to group countries from the same continent or climate together, which is a first step towards interpreting our model.

Our first ResNet-50 model gave a satisfactory top-1 accuracy of 38%. Because we started working on the project early, we had time to improve this accuracy, expand the scope of the project, experiment with other architectures, and even try state-of-the-art models such as linear probes with CLIP.

If we had more time, we would make our code more modular, reduce code redundancy, and pass the model name, batch size, number of epochs and learning rate as command-line arguments instead of hardcoding them.
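A minimal sketch of such a command-line interface, using Python's standard `argparse` module, might look like this. The flag names and model identifiers are hypothetical; the defaults mirror the batch size, epoch count and learning rate reported in the Methodology section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the hyperparameters that are currently hardcoded."""
    parser = argparse.ArgumentParser(description="Train a country classifier.")
    parser.add_argument("--model", default="resnet50",
                        choices=["resnet50", "densenet121", "inception_v3"],
                        help="architecture to train (names are illustrative)")
    parser.add_argument("--batch-size", type=int, default=5)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.001)
    return parser


# Parse an explicit argument list here; a script would call parse_args()
# without arguments to read from the command line instead.
args = build_parser().parse_args(["--model", "densenet121", "--epochs", "3"])
print(args.model, args.batch_size, args.epochs, args.lr)
```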

Ethical Implications

We want to acknowledge that training an AI model on images from Google Street View could raise some privacy concerns, as it captures images of people, homes and other private properties. Even though the faces of people are blurred, there are still ways to identify a person in a photo which can then be maliciously used. We think it’s important for us as engineers to consider the effects our models may have on the general public, whether intentional or not.

Division of Labour

Data retrieval: Youjung Koo

Preprocessing: Youjung Koo, Abhyudaya Sharma, Klimentina Krstevska

Model: Youjung Koo, Klimentina Krstevska, Abhyudaya Sharma

Check output: Youjung Koo

Devpost: Youjung Koo, Klimentina Krstevska, Abhyudaya Sharma

Final presentation: Youjung Koo, Klimentina Krstevska, Abhyudaya Sharma

References

Suresh, S., Chodosh, N., & Abello, M. (2018). DeepGeo: Photo localization with deep neural network. arXiv preprint arXiv:1810.03077.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700-4708).

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).

Facebook. (2016). PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration. Retrieved April 29, 2023, from https://pytorch.org/

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.

We attached three links that direct you to our 1. Presentation slides 2. Report 3. GitHub repository.