Inspiration
We have a friend who is visually impaired, and he relies on our input to see the world. As much as we love helping him, that reliance undermines his independence and health, and he does not feel confident navigating the world alone. Before he graduated, we always dreamt of a future in which he would no longer be dependent on us. Well, why wait for the future when we can build it now?
Our vision for Theia was born out of the desire to empower the visually impaired, to help them navigate the world with confidence and ease. We wanted to create a device that would not only provide visual assistance but also be compact, portable, and easy to use. We wanted to give them the freedom to explore the world on their own terms.
Theia is not only a breakthrough in accessibility technology but also a game-changer for health. By giving the visually impaired independence and confidence, Theia promotes physical and mental well-being, and it reduces the health risks that come with that dependency. With Theia, our friend and others like him can lead healthier and happier lives.
What it does
Theia provides visual assistance to the visually impaired. It is designed to answer the user's questions about the world around them in a way that is natural, intuitive, and easy to understand. Theia covers a wide range of everyday visual-assistance tasks, making it a versatile and powerful tool.
With Theia, users can read prescription labels and take their medication with ease, hear what is in front of them, and navigate their surroundings with confidence. Theia is also equipped with face recognition, letting users identify people around them. It reads product labels so users can understand important information on the items they use, and its object detection tells users whether it is safe to proceed. Theia can even identify the food in front of the user, making it an excellent companion in restaurants.
Theia is an intuitive, easy-to-use app powered by advanced machine learning. Its AI agent chains the underlying models and decides which of them are appropriate for the user's question, based on the data it retrieves. Theia's natural language interface lets users ask questions the way they naturally would.
Theia is a powerful tool for the visually impaired, letting them navigate the world on their own terms. Compact, portable, and easy to use, it gives the visually impaired a new level of independence, enabling them to live life to the fullest.
How we built it
Infrastructure
Hardware: a Raspberry Pi (1 GB RAM) with a camera module.
Software: an API deployed on Vultr bare-metal cloud.
API:
Our software is powered by six machine learning pipelines, each specialized in a specific function:
- Object detection: We use the facebook/detr-resnet-101 (DETR) model to detect objects in the user's surroundings.
- Object description: We use the ViT-GPT2 image-captioning model to provide accurate, detailed descriptions of detected objects.
- Object depth: We use the Intel/dpt-large model to estimate the depth of objects, giving users a better sense of their surroundings.
- Obstacle avoidance: By combining the detr-resnet-101 and dpt-large outputs, we build a 3D picture of the user's surroundings and estimate the distance of the three most significant objects in the image.
- Face recognition: Our app uses OpenCV for accurate face recognition, allowing visually impaired users to identify people around them.
- OCR: We use AWS Textract to provide accurate and efficient text reading, enabling users to read and understand text-based information around them.
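The obstacle-avoidance step above can be sketched in a few lines. This is a minimal illustration, not our production code: the model calls are stubbed out with toy data (in the real pipeline the boxes come from detr-resnet-101 and the depth map from Intel/dpt-large, which outputs inverse relative depth, so larger values mean closer), and the function names are our own for this example.

```python
def mean_depth(depth_map, box):
    """Average relative depth inside a bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    values = [depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)

def nearest_obstacles(detections, depth_map, top_k=3):
    """Rank detected objects by estimated proximity.

    DPT-style depth maps are inverse relative depth (larger = closer),
    so we sort descending and keep the top_k most significant obstacles.
    """
    scored = [(label, mean_depth(depth_map, box)) for label, box in detections]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy 4x4 depth map and two fake detections standing in for model output.
depth_map = [
    [0.9, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.1, 0.1],
    [0.2, 0.2, 0.3, 0.3],
    [0.2, 0.2, 0.3, 0.3],
]
detections = [("chair", (0, 0, 2, 2)), ("door", (2, 0, 4, 2))]
print(nearest_obstacles(detections, depth_map))
# the chair (mean inverse depth 0.9) ranks ahead of the door (0.1)
```

Averaging depth inside each box is the simplest way to fuse the two models; the real system then turns these rankings into a spoken warning about the nearest obstacles.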
However, we realised we do not need every model to answer a given question. So we took a page out of LangChain's book and added our own expertise: we built an AI agent that chains the models and decides which of them are appropriate. The agent is trained on synthetic data we generated for our use case.

Based on the data it retrieves from the models, the agent infers an answer and responds in a human-readable way.
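To illustrate the routing idea behind the agent: here a simple keyword heuristic stands in for our trained agent (which actually learns the mapping from synthetic data), just to show what "deciding which pipelines a question needs" looks like. The trigger words and pipeline names are hypothetical.

```python
# Keyword-based stand-in for the learned agent: map a question to the
# pipelines needed to answer it. The real agent is trained on synthetic data.
ROUTES = {
    "read": ["ocr"],
    "label": ["ocr"],
    "prescription": ["ocr"],
    "who": ["face_recognition"],
    "safe": ["object_detection", "depth", "obstacle_avoidance"],
    "far": ["depth"],
}

def route(question: str) -> list[str]:
    """Pick the pipelines whose trigger words appear in the question."""
    words = question.lower().split()
    selected: list[str] = []
    for trigger, pipelines in ROUTES.items():
        if any(trigger in word for word in words):
            for name in pipelines:
                if name not in selected:
                    selected.append(name)
    # Fall back to a general scene description when nothing matches.
    return selected or ["object_detection", "object_description"]

print(route("Can you read this prescription label?"))
print(route("Is it safe to cross?"))
```

Running only the selected pipelines is what makes the system fast enough on a single request; their outputs are then combined into the final human-readable answer.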
Challenges we ran into
- We tried threading to speed up the pipeline, but it caused data loss in the images: dropping just 12 bytes turned an image of spaghetti into a hot dog.
- Hardware limitations: the Raspberry Pi has no sound card, but we wanted the user to be able to give voice commands. We couldn't procure a compact USB mic in time either, so we improvised and made it work with a different mic.
- Issues with deploying the API
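The threading bug came from sharing image buffers between threads. One standard fix, sketched below under our own toy frame format (the camera and model are faked), is to hand complete frames between threads through a `queue.Queue` so no bytes are dropped mid-write:

```python
import queue
import threading

frames: "queue.Queue[bytes]" = queue.Queue()
received = []

def capture(n_frames: int) -> None:
    """Producer: push whole frames onto the queue, never partial buffers."""
    for i in range(n_frames):
        frames.put(bytes([i]) * 1024)  # stand-in for a JPEG from the camera
    frames.put(b"")  # sentinel: no more frames

def infer() -> None:
    """Consumer: pop complete frames; each arrives intact."""
    while True:
        frame = frames.get()
        if not frame:
            break
        received.append(frame)

producer = threading.Thread(target=capture, args=(5,))
consumer = threading.Thread(target=infer)
producer.start(); consumer.start()
producer.join(); consumer.join()

print(f"{len(received)} frames received, all intact: "
      f"{all(len(f) == 1024 for f in received)}")
```

Because `Queue` is thread-safe and each `put` transfers an immutable `bytes` object, a frame can never be half-overwritten the way a shared buffer can.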
Accomplishments that we're proud of
- We initially planned to measure the distance to objects with an ultrasonic sensor, but we were unable to get hold of one. Instead, we used the camera image itself to estimate the distance of surrounding objects via monocular depth estimation.
- We created an AI agent that chains all the models and decides which of them are appropriate based on the data it retrieves, allowing for greater flexibility and efficiency in answering users' questions.
- We created a device that provides visual assistance to the visually impaired, using multiple ML models, in just 24 hours!
What we learned
- Orchestrating multiple transformer models and building an AI agent that decides which of them fit a specific user input was a big challenge, and we overcame it!
- We learned how much we rely on vision as a tool, and how closely we could replicate it for people who are visually impaired.
What's next for Theia
We're aiming to make our current features more accurate and faster. Improving the quality of the wearable is our next goal. We also plan to refine the natural language interface to make it even easier for users to ask questions and get accurate answers.