Inspiration
We all rely on navigation apps like Google Maps to find where we need to go. However, when navigating dense urban environments, the navigation instructions are often insufficient, leading drivers to take their eyes off the road and glance at the map on their device.
Professionals such as delivery, taxi, and even racing drivers rely heavily on landmarks and points of interest to navigate their environments. I have personally noticed that I make fewer navigational mistakes while driving when a friend in the passenger seat interprets the map and navigation instructions and relates them to visual points of reference on the road. And it is to be expected: the instruction "Turn right at the next stop sign" is easier to follow than "Turn right in 50 meters."
For this project, I wanted to recreate this experience using Gemini and Google Maps. Can we give the driver more intuitive directions by relating the navigation instructions to visible points of reference?
What it does
The app works as follows: given origin and destination points for a route, it fetches the navigation instructions from Google Maps using the Directions API. Next, for each critical point (i.e., wherever a navigation instruction is given) it requests a Street View image from the Street View Static API. The Street View images are paired with their respective navigation instructions and passed to Gemini 1.5 for processing. The prompt is the following:
You are given a street view image and a set of navigation directions corresponding
to the scene in the image. Augment the navigation direction by incorporating static
points of reference from the image to the instructions that work as visual aids for
the driver. Try to favor elements that stand out due to their color. Avoid using
sign names unless they are clearly legible. Suitable points of reference include:
Landmarks with distinct architecture: Iconic or historic buildings or prominent towers, unique skyscrapers.
Major intersections with prominent signage: Points where multiple roads meet, marked by large street signs, traffic lights.
Large shopping centers or malls: Retail complexes with visible signage, expansive parking lots.
Gas stations or convenience stores: Well-known shops, easily recognizable due to signage, fuel pumps.
Hotel chains: Well-known brands with large signs, often near tourist spots.
Fast food chains or drive-thrus: Popular restaurants with bright signage, drive-thru lanes.
Highway exits or entrances: Ramps with large signs, exit numbers.
Car dealerships: Showrooms with banners, along major roads.
Bridges or overpasses: Structures spanning water, serving as landmarks.
Parks or squares: Green spaces or open areas that serve as central points within neighborhoods or districts.
Water bodies: Rivers, lakes, or oceans that provide natural landmarks and orientation points, especially in coastal cities.
Tall communication towers or antennas: Radio towers visible from afar, guiding orientation.
The navigation directions given are the following: {directions}
Keep the responses brief rather than conversational.
The model returns a new set of directions, which incorporate visual elements from the images. These are usually buildings that stand out due to their size or color, but they may also be statues, flags, signs, trees, parks, bodies of water, antennas, towers, and many other things.
The new augmented navigation directions can be accessed through a Flutter web app.
How I built it
- The backend is written in Python and hosted on Google Cloud using Cloud Functions.
- The directions are acquired using the Directions API.
- The street view imagery is acquired using the Street View Static API.
- The front end is built with Flutter and deployed as a web app. I used FlutterFire to connect to the backend.
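As a rough sketch of how the two Maps requests above could be assembled (the helper function names and the 640x640 image size are my own choices, and `api_key` is a placeholder, not a real credential):

```python
from urllib.parse import urlencode

DIRECTIONS_URL = "https://maps.googleapis.com/maps/api/directions/json"
STREETVIEW_URL = "https://maps.googleapis.com/maps/api/streetview"

def directions_request(origin: str, destination: str, api_key: str) -> str:
    """Build a Directions API request URL for a driving route."""
    params = {"origin": origin, "destination": destination,
              "mode": "driving", "key": api_key}
    return f"{DIRECTIONS_URL}?{urlencode(params)}"

def streetview_request(lat: float, lng: float, heading: float, api_key: str) -> str:
    """Build a Street View Static API request URL for one maneuver point,
    oriented along the direction of travel via the heading parameter."""
    params = {"location": f"{lat},{lng}", "heading": f"{heading:.1f}",
              "size": "640x640", "fov": 90, "key": api_key}
    return f"{STREETVIEW_URL}?{urlencode(params)}"
```

Each returned image is then paired with its step's instruction text and sent to Gemini together with the prompt shown earlier.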
Challenges I ran into
Many small challenges made development difficult.
A critical challenge was selecting the correct heading for the Street View images, which is crucial for producing coherent instructions. I solved it by using the last two points of a navigation step's polyline to calculate the heading. This works well enough, but it is not bulletproof. I tried to replicate what Google Maps does, which appears to be taking the street view image for a turn a few meters before the turn, but I could not reproduce it effectively.
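The heading fix above amounts to computing the initial great-circle bearing between the last two polyline points; a minimal sketch (the function name is mine):

```python
import math

def initial_bearing(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Initial great-circle bearing from point 1 to point 2, in degrees [0, 360).

    Point 1 and point 2 would be the last two points of a step's polyline,
    so the bearing approximates the direction of travel at the maneuver
    and can be passed as the Street View camera heading.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlng = math.radians(lng2 - lng1)
    x = math.sin(dlng) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlng))
    return (math.degrees(math.atan2(x, y)) + 360.0) % 360.0
```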
Another issue was with the Street View API. I use both the Static and Embed APIs (the latter for the frontend app), and when requesting a location by coordinates, I sometimes get slightly different panoramas from each. This turns out to be a problem when showcasing the app, as I tend to get a panorama that is slightly forward along the route (usually one panorama ahead in the direction of travel).
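One possible mitigation, sketched here, is to resolve the coordinates to a concrete panorama id via the Street View Image Metadata API and then pass that same `pano` to both the Static and Embed requests, so both show the identical panorama (the helper names are hypothetical):

```python
from urllib.parse import urlencode

METADATA_URL = "https://maps.googleapis.com/maps/api/streetview/metadata"

def metadata_request(lat: float, lng: float, api_key: str) -> str:
    """Build a metadata lookup URL; the JSON response pins the coordinates
    to one concrete panorama (its pano_id), free of charge."""
    params = {"location": f"{lat},{lng}", "key": api_key}
    return f"{METADATA_URL}?{urlencode(params)}"

def pano_id_from_metadata(metadata: dict):
    """Extract the panorama id from a parsed metadata response,
    or None if no panorama covers the requested location."""
    if metadata.get("status") == "OK":
        return metadata.get("pano_id")
    return None
```

With the resolved id in hand, both the Static request (`pano=...`) and the Embed URL can reference the same panorama instead of resolving the coordinates independently.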
Lastly, a serious issue I ran into is Gemini's rate limits. Currently the app only works for very short routes (fewer than 7 steps), which is enough for a proof of concept but obviously not for a product. In its current implementation I prompt the model "naively" for every step up front; I do not check rate limits or defer invocations to avoid hitting them. The fix is simple: call the model only when an instruction is actually needed.
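That lazy-invocation fix could look roughly like this: a per-step wrapper that calls the model at most once per step and caches the result (`LazyDirections` and `augment_fn` are names I made up for illustration; `augment_fn` stands in for the Gemini call):

```python
class LazyDirections:
    """Augment each navigation step only when it is first requested,
    instead of prompting the model for the whole route up front."""

    def __init__(self, steps, augment_fn):
        # steps: list of (street_view_image, raw_instruction) pairs
        # augment_fn: (image, raw_instruction) -> augmented instruction
        self.steps = steps
        self.augment_fn = augment_fn
        self._cache = {}

    def instruction(self, i: int) -> str:
        """Return the augmented instruction for step i, invoking the
        model only on the first request for that step."""
        if i not in self._cache:
            image, raw = self.steps[i]
            self._cache[i] = self.augment_fn(image, raw)
        return self._cache[i]
```

Spreading the calls over the drive keeps the request rate proportional to driving progress rather than route length, which should stay well under the per-minute limits.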
I also ran into a number of other small technical challenges with Google Cloud, as it was only my second time using it and I am not really a full-stack developer, but online resources (and the integrated Gemini bot <3) helped me resolve them quite easily. I still don't think I have everything set up 100% correctly, but it works (most of the time).
Accomplishments that we're proud of & What we learned
- It was fun working on this project. While the implementation is rough, I think it shows that such a feature is viable and could really improve the experience of a service like Google Maps.
- There is a significant, very noticeable improvement between Gemini 1.0 and Gemini 1.5, especially when it comes to text on images. Gemini 1.0 was not a viable option for my use case, while 1.5 works pretty well. It is sometimes overly ambitious and tries to read obscured text, but at least in English, it can process clear signs on the street. (On that note, I also tried it with Greek and it had a hard time, which is understandable, since the dataset is probably smaller).
What's next for Point of Reference Navigation
- Refactor the front end a bit, and polish the widgets.
- Implement lazy model invocation on the backend.