Alex K's girlfriend Allie is a writer and loves to read, but has had trouble with reading for the last few years because of an eye tracking disorder. She now tends towards listening to audiobooks when possible, but misses the experience of reading a physical book.
Millions of other people also struggle with reading, whether because of a medical condition, dyslexia (15-43 million Americans) or never having learned to read. They face significant limitations in daily life, from reading books to reading things like street signs. Existing phone apps that read text out loud are cumbersome to use, and existing "reading glasses" cost thousands of dollars!
Thankfully, modern technology makes developing "reading glasses" much cheaper and easier, with advances in AI on the software side and 3D printing for rapid prototyping. We set out to prove through this hackathon that glasses that open the world of written text to those who have trouble entering it themselves can be cheap and accessible.
What it does
Our device attaches magnetically to a pair of glasses so users can wear it comfortably while reading, whether that's on a couch, at a desk or elsewhere. The software tracks what the wearer is seeing and, when written words appear in front of the camera, chooses the clearest frame, transcribes the text and reads it out loud.
How we built it
Software (Alex K): We first needed to get image-to-text (OCR, or optical character recognition) and text-to-speech (TTS) working. After trying a couple of libraries for each, we found Google's Cloud Vision API to have the best performance for OCR and Google Cloud Text-to-Speech to be the top pick for TTS.
The TTS performance was perfect for our purposes out of the box. The OCR API, bizarrely, seemed to predict individual characters with excellent accuracy but whole words poorly, seemingly because it did not use any knowledge of the English language in the process (e.g. errors like "Intreduction"). So the next step was implementing a simple unigram language model to filter the Google library's predictions down to the most likely words.
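To illustrate the idea of a unigram language model cleaning up OCR output, here is a minimal sketch. It is not our actual code: the toy corpus, the function names and the choice to generate candidates one character edit away from the OCR word are all assumptions made for this example. A real model would count words in a large body of English text.

```python
import re
from collections import Counter

# Hypothetical toy corpus, for illustration only. A real unigram model
# would be built from word counts over a large English text.
CORPUS = """
this introduction explains how the reading glasses work
the introduction is short and the glasses read text out loud
"""

# Unigram model: how often each word appears, ignoring context.
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one deletion, replacement or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + replaces + inserts)


def correct_word(word):
    """Return the most frequent known word within one edit of an OCR word."""
    lower = word.lower()
    if lower in WORD_COUNTS or not lower.isalpha():
        return word  # already a known word, or not a plain word at all
    candidates = [w for w in edits1(lower) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing plausible nearby, keep the OCR output
    # Highest unigram count wins; ties are broken alphabetically.
    best = max(candidates, key=lambda w: (WORD_COUNTS[w], w))
    return best.capitalize() if word[0].isupper() else best


def correct_text(text):
    """Apply correct_word to every alphabetic token, keeping punctuation."""
    return re.sub(r"[A-Za-z]+", lambda m: correct_word(m.group()), text)


if __name__ == "__main__":
    print(correct_text("Intreduction"))
```

Words with no known neighbour, such as names, are left untouched, which keeps the filter from "correcting" things it has no evidence about.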
We strung everything together in Python, combining Google API calls with various libraries including OpenCV for camera/image work, pydub for audio, and PIL and matplotlib for image manipulation.
Hardware (Alex G): We tore apart an unsuspecting Logitech webcam, and had to do some minor surgery to focus the lens at an arm's-length reading distance. We CAD-ed a custom housing for the camera with mounts for magnets so it attaches easily to the legs of glasses. The housing was 3D printed on a Form 2 printer, and a set of magnets was glued into the slots, with a corresponding set on some NerdNation glasses.
Challenges we ran into
The Google Cloud Vision API was very easy to use for individual images, but making synchronous batched calls proved to be challenging! Finding the best video frame to use for the OCR software was also not easy, and writing that code took up a good fraction of the total time.
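As a rough illustration of what "finding the best frame" can mean, the sketch below scores each frame by the variance of its Laplacian, a common sharpness measure, and keeps the highest-scoring one. This is an assumption for the example, not necessarily the method we ended up with. It is written in plain Python over nested lists so it runs anywhere; with OpenCV the same score is commonly computed as cv2.Laplacian(gray, cv2.CV_64F).var(). The function names are hypothetical.

```python
def sharpness(gray):
    """Variance of a 4-neighbour Laplacian over a grayscale image.

    gray is a 2D list of pixel intensities, at least 3x3. Blurry frames
    have weak edges, so their Laplacian responses vary little.
    """
    rows, cols = len(gray), len(gray[0])
    responses = []
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def pick_clearest(frames):
    """Return the index of the frame with the highest sharpness score."""
    return max(range(len(frames)), key=lambda i: sharpness(frames[i]))


if __name__ == "__main__":
    # Synthetic frames: a flat grey image (no edges) and a checkerboard
    # (edges everywhere) stand in for a blurry and a sharp video frame.
    blurry = [[128] * 5 for _ in range(5)]
    sharp = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]
    print(pick_clearest([blurry, sharp, blurry]))
```

Scoring a short window of recent frames and sending only the winner to the OCR API also keeps the number of API calls down.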
Perhaps most annoyingly, the Logitech webcam did not focus well at any distance! When we cracked it open, we were able to carefully remove the bits of glue holding the lens at its factory focus setting and dial it to the right distance for holding a book at arm’s length.
We also couldn’t find magnets until the last minute, so we guessed at the magnet mount hole sizes. Fitting the magnets took an exciting Dremel session, which resulted in the part cracking and being beautifully epoxied back together.
The Alexes would like to thank our girlfriends, Allie and Min Joo, for their patience and understanding while we went off to be each other's Valentines at this hackathon.