GuideDog

Splash Screen
Obstacle/Object Detection
Image Captioning

💡Inspiration

Technology has made all our lives easier. Wait, has it?

For most, it has, but there are some people whose lives have become more inconvenient than ever because of new technologies. With increasingly more service workers being replaced by automated kiosks, touch screens, and buttons than ever, many visually impaired people find it difficult to navigate even the most basic tasks in everyday life.

📱What it does

GuideDog is an AI/ML-based mobile app designed to assist the lives of the visually impaired, 100% voice-controlled. You may as well think of it as "speaking guide dog," as the name suggests. It has three key features based on the scene captured by your mobile phone, GuideDog: 1) Reads text upon command, 2) Describes the scene around you upon command and, 3) Warns you if there is an obstacle in front of you.

🛠️ How we built it

Technologies that we needed

Diagram circle logo According to the storyboard for our mobile app, we had to implement 5 main features:

Speech-to-text in order to recognize user's verbal command

In order to better interact with the visually-impaired, all of our commands are voice-controlled. To do this, we used Google Cloud's Speech-to-Text API to convert the user's voice input into a command that the app can understand an execute.

Text-to-speech in order to say out loud the output

We use Google Cloud's Text-to-Speech API in our application to guide the user through the app experience, interact with them as personable as it can, and respond back to the user based on the feature it has selected or used.

Image captioning to describe a scene

The camera feature in the application captures an image and then using Azure Vision API to come up with a sentence description of that scene. The user then is able to understand certain situations based on objects it encounters, but also provided context of what's going on.

Object detection to detect obstacles

The application is able to detect obstacles and ultimately warn the user of its surroundings in real-time. This is especially useful when the user is not familiar with a certain location and trying to navigate themselves through the street. This was done by incorporating Android's CameraX API, Google ML Kit's Object Detection and Tracking API, and TensorFlow Lite's MobileNet Image Classification Model.

OCR, text recognition to read text

The text recognition feature is able to detect texts, whether it be typed, printed, or handwritten, and then reads them to the user, when certain situations or locations do not provide adequate accommodations for the visually-impaired.

Backend API

Diagram circle logo Since we are using a lot of different APIs, we decided to create a Flask API gateway that processes and redirects the request from the mobile app to the different machine learning services we are using. We deployed the Flask application on Google App Engine to create a scalable and fast central API with different endpoints. We also have a self-trained image captioning model that we integrated into the Flask application.

Client-side Native Android Application

Diagram circle logo The application integrates the Flask API, which processes the image captioning and optical character recognition, with the other features of our product. It uses Speech-to-Text and Text-to-Speech to interact with the user and detects obstacles and points of interests through the camera.

Training image captioning model

Diagram circle logo We used tensorflow to build and train model for image captioning on MS-COCO 2014 based on the paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. The model uses standard convolutional network as an encoder to extract features from images (we use Inception V3) and feed the generated features into an attention-based decoder generate sentences. While the paper used LSTM model as a decoder, we use a simpler RNN instead. We used all 80,000 train data to train the model for 10 epochs, which took 5 hours on 3 GPUs.

🧗 Challenges we ran into

We first developed an attention base image captioning model (InceptionV3-RNN) using Tensorflow on 80,000 images with 4 GPUs for 10 hours. In order to deploy it, we connected the model using Flask API. While the performance was good, due to the ML model, we had about ~10 seconds of latency. We thought this would be problematics for end-users. Hence we instead decided to use the image captioning API provided by Azure Vision API.
The pretrained model that we use for object detection, MobileNet, was not trained specifically for our use case, so the classification and latency proves that we need to train our own model for detecting obstacles and people we meet on the street.

🏆Accomplishments that we're proud of

Deploy machine learning models with API and serve them to the mobile app.
Create a responsive mobile app that seamlessly incorporates object detection, image captioning, and OCR all in a single weekend

✍️What we learned

API response time is extremely important for app performance.
Larger machine learning models need more powerful backend servers.
First time deploying machine learning functionalities on a mobile app

🕶️What's next for GuideDog

Develop "Smart Sunglasses" that could be connected to our mobile app that captures eye-level video, and allow handsfree, seamless voice control using Bluetooth microphone and speaker.
Develop our own objection detection and image classification models that is catered towards our use cases and train them with objects that may hinder a visually-impared's walk to the grocery store.
Improve the app's latency and performance, so the user is always alert of their surroundings.

Built With

azure-vision-api
figma
flask
google-cloud-vision-api
google-text-to-speech-api
java
keras
ml-kit
python
tensorflow

Submitted to

HopHacks Fall 2021
- Winner Hacking - First Place Overall

Created by

I worked on all the functionalities of the mobile application. I implemented the object detection, tracking, and classification of obstacles and connected the client-side app to the backend APIs. I also developed the voice-activation and response mechanism.

Steven Gunarso
I worked on training image captioning model on MS-COCO. I was also the product manager for this project, defining features, making storyboards, mockups, designing UI/UX, managing scheduling, and pitching for final presentation!

Kyuhee Jo
Raghav Sharma
Jiaqi(Jacky) Wang