What is Monocol?
Monocol is an iOS app that takes in live video and automatically outputs audio based on the depth of the user's surroundings. The video feeds image frames into Monodepth2, a deep learning model that outputs a corresponding depth map. A second algorithm then translates the depth map into audio, which is sent back to the user and played in real time.
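As a rough illustration of the depth-to-audio step, here is a minimal Python sketch. The specific mapping (nearer obstacles produce louder, higher-pitched tones) and all names are illustrative assumptions, not Monocol's actual algorithm.

```python
# Illustrative sketch only: map a depth map to audio parameters.
# The mapping and parameter ranges are assumptions, not Monocol's real algorithm.

def depth_to_audio_params(depth_map, min_depth=0.1, max_depth=10.0):
    """Map the nearest depth in a frame to a volume and a tone frequency.

    depth_map: 2D list of depth values in metres.
    Returns (volume, frequency_hz); nearer obstacles -> louder, higher pitch.
    """
    nearest = min(min(row) for row in depth_map)
    nearest = max(min_depth, min(nearest, max_depth))  # clamp to valid range
    # Normalise: 0.0 = farthest away, 1.0 = right in front of the user.
    closeness = (max_depth - nearest) / (max_depth - min_depth)
    volume = closeness                     # 0.0-1.0
    frequency_hz = 220 + closeness * 660   # 220 Hz (far) up to 880 Hz (near)
    return volume, frequency_hz
```

On the phone, values like these could drive a simple tone generator, with the frame rate of the depth stream setting how often the tone updates.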
Monocol has four key advantages over current methods. (1) It is accessible to everyone; all you need is a phone camera and headphones. (2) It is scalable, since it is affordable for both us and our users to maintain. (3) It works in varying light conditions, both indoors and outdoors. (4) It is built on modern machine learning algorithms; with the rapid pace of progress in the field, our backend can easily be upgraded to smarter and faster models.
Monocol can also be useful to people with normal eyesight, as a low-cost alternative to a night-vision camera for those working in low-light environments.
After seeing hardware hacks attempting to help visually impaired people navigate their surroundings, we wanted to build an affordable and accessible hack with devices available to the average person.
How we built Monocol
For our iOS app, we used Swift to build the frontend and Python for the backend, along with a PyTorch implementation of Digging Into Self-Supervised Monocular Depth Estimation (Godard et al., 2019).
Challenges we ran into
Networking across devices was our biggest challenge. This was necessary since the ML algorithm is too compute-heavy for the average phone to handle.
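To give a flavour of the client-to-server hop, here is a sketch of packing a frame for transport. The envelope fields are hypothetical, and zlib stands in for the JPEG-style compression a real client would use; this is not Monocol's actual wire format.

```python
import base64
import json
import zlib

def pack_frame(raw_pixels: bytes, frame_id: int) -> bytes:
    """Compress a raw frame and wrap it in a JSON envelope for the server.

    zlib stands in for real image compression; the envelope fields
    are illustrative assumptions, not Monocol's wire format.
    """
    compressed = zlib.compress(raw_pixels, level=6)
    envelope = {
        "frame_id": frame_id,
        "encoding": "zlib+base64",
        "data": base64.b64encode(compressed).decode("ascii"),
    }
    return json.dumps(envelope).encode("utf-8")

def unpack_frame(payload: bytes):
    """Server side: recover the frame id and raw pixel bytes."""
    envelope = json.loads(payload)
    raw = zlib.decompress(base64.b64decode(envelope["data"]))
    return envelope["frame_id"], raw
```

The server would feed the unpacked frame to the model and send the resulting depth map (or audio parameters) back over the same channel.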
Latency was our biggest priority. We made sure our image data was compressed enough, and in the right format for the model, so that we could run inference and send data back and forth in real time. We can still improve how many times per second we ping our server, but we got it to a viable point.
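Capping how often the client pings the server can be done with a simple rate guard. The sketch below is an illustrative assumption about how such a throttle might look; the actual rate Monocol uses was tuned empirically.

```python
import time

class FrameThrottle:
    """Drop frames so the client never exceeds max_fps requests per second.

    An illustrative client-side guard, not Monocol's actual implementation;
    the clock is injectable so the logic can be tested deterministically.
    """

    def __init__(self, max_fps: float, clock=time.monotonic):
        self.min_interval = 1.0 / max_fps
        self.clock = clock
        self._last_sent = float("-inf")

    def should_send(self) -> bool:
        """Return True (and record the send) if enough time has passed."""
        now = self.clock()
        if now - self._last_sent >= self.min_interval:
            self._last_sent = now
            return True
        return False
```

Frames that arrive while `should_send` returns False are simply skipped, which is usually acceptable since only the latest view of the scene matters.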
Getting the audio to work correctly and consistently was also a big challenge. There were multiple times when the audio would only play long after the image had been processed, but through hours of work and various fixes, we pushed through.
Accomplishments that we're proud of
We built and deployed our first compute engine while learning Swift fundamentals. We had a fun but difficult time navigating our team dynamic with members spread across three time zones. We also pitched to a venture capital fund together for the first time.
What we learned
Some quirks: Google Cloud doesn't like new accounts spinning up GPU instances, because that's how you illegally mine cryptocurrency. AWS requires high upfront costs. CUDA is overpowered. It's hard to get precise depth information using deep learning.
More specific software engineering things: the basics of platform-as-a-service offerings, the difficulties of Swift dependency management, and playing audio in Swift.
What's next for Monocol?
We envision additional functionality using object detection and/or GPS. Not only will users know when something is in front of them, they will also know what it is. This will be invaluable for navigating everyday situations such as crossing the street or telling shops apart at the mall. Additionally, front-camera functionality could be added so users can avoid collisions with objects rapidly approaching from behind.