ThirdEye

Inspiration

The inspiration came from our team leader, Hemangi Patel. She had an idea to build a tech solution that enables visually impaired people to see. We had multiple options and tools available for this. We decided to go ahead with a) a mobile phone camera, because it is easily accessible, and b) an ESP32-based camera, because we intend to turn it into a wearable device that speaks out what it sees. It could be worn as a necklace or pinned on a cap or hat, and it describes to the user what it sees in real time.

What it does?

Our solution is two-pronged:

Web/phone app: In v1 we built the web app, which is opened on a phone via the browser. It takes images from the phone's camera and sends them to Gemini through its API, which recognizes the image and returns a text description. We then convert that text into voice with the ElevenLabs API. The user just has to point their camera and wait for the description of what the phone sees.
Wearable Tech: Our solution can also be used without a phone, through our wearable ESP32-based camera device. It is the size of 2 pencil batteries and very lightweight, and it can be worn as a necklace or clipped to a hat. The device performs the same functions and uses some shared libraries to take the picture input and describe it in words.

How we built it?

Brainstorming: We sat down yesterday and selected this idea.
Tech Stack: We decided to use a web app for v1. We first started with live video streaming, but this functionality was limited with Flutter and Kotlin, so we moved to Python and built a Flask app that uses the device camera. In parallel, we started working on the Arduino development.
Once we were able to connect to the device camera, we first tried WebSockets for video streaming, but the lag was concerning, so we shifted to processing still images.
The current version captures an image from the device camera every second, sends it to the Gemini API via the Flask app, and receives a text description, which the ElevenLabs API converts to voice.
To enable the same functionality on the Arduino/ESP32, we built an endpoint on our Flask app that receives images from the ESP32, sends them to the Gemini API, receives the text description, and plays it as voice via ElevenLabs.
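
The capture, describe, and speak steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's actual code: `capture`, `describe`, and `speak` are hypothetical callables standing in for the device camera, the Gemini API call, and the ElevenLabs API call, and the function names `describe_and_speak` and `run_capture_loop` are chosen only for illustration.

```python
import time
from typing import Callable, Optional


def describe_and_speak(image: bytes,
                       describe: Callable[[bytes], str],
                       speak: Callable[[str], None]) -> str:
    """Run one frame through the pipeline: image -> text -> voice."""
    text = describe(image).strip()
    if text:  # skip speech when the description comes back empty
        speak(text)
    return text


def run_capture_loop(capture: Callable[[], Optional[bytes]],
                     describe: Callable[[bytes], str],
                     speak: Callable[[str], None],
                     interval_s: float = 1.0,
                     max_frames: Optional[int] = None,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Grab a frame every `interval_s` seconds and describe it aloud.

    Returns the number of frames that were sent to `describe`.
    With `max_frames=None` the loop runs until interrupted.
    """
    attempts = 0
    described = 0
    while max_frames is None or attempts < max_frames:
        image = capture()
        attempts += 1
        if image is not None:  # the camera may fail to deliver a frame
            describe_and_speak(image, describe, speak)
            described += 1
        sleep(interval_s)
    return described
```

Passing the API calls in as functions keeps the loop runnable without network access. A real deployment would wrap the Gemini and ElevenLabs requests behind `describe` and `speak`.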

Challenges we ran into:

Lag in video. In v1, we established a video stream to Gemini and an audio stream of the description, but the lag was high, so we shifted to images.
Text to speech. Initially we used the device's built-in text-to-speech function, but the voice did not flow naturally and was barely understandable. We switched to the ElevenLabs API, and the voice became much smoother and easier to understand.
Connecting to the wearable ESP/Arduino. We had some difficulty connecting both parts of our app because of the CORS policies of our hosting servers. We resolved this by building a dedicated endpoint on our main app that receives images from the ESP and then describes them. This worked well.
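
A dedicated image endpoint of the kind described above might look roughly like the sketch below. The project used Flask. This example swaps in a plain WSGI callable (the interface Flask apps also implement) so that it runs with only the Python standard library. The route `/esp/describe`, the factory name `make_esp_app`, and the `describe` callable (a stand-in for the Gemini API call) are all hypothetical, and the sketch assumes the ESP posts the raw JPEG bytes as the request body.

```python
import json
from typing import Callable


def make_esp_app(describe: Callable[[bytes], str],
                 route: str = "/esp/describe"):
    """Build a WSGI app that accepts a raw image POST and returns a description."""

    def app(environ, start_response):
        def reply(status: str, payload: dict):
            body = json.dumps(payload).encode("utf-8")
            start_response(status, [("Content-Type", "application/json"),
                                    ("Content-Length", str(len(body)))])
            return [body]

        if environ.get("PATH_INFO") != route:
            return reply("404 Not Found", {"error": "unknown route"})
        if environ.get("REQUEST_METHOD") != "POST":
            return reply("405 Method Not Allowed", {"error": "use POST"})

        # Read exactly as many bytes as the client declared.
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            length = 0
        image = environ["wsgi.input"].read(length) if length > 0 else b""
        if not image:
            return reply("400 Bad Request", {"error": "empty image body"})

        # Hypothetical hook: forward the image to the description service.
        return reply("200 OK", {"description": describe(image)})

    return app
```

Because the device uploads to the same app that produces the description, no cross-origin request between separately hosted parts is involved. The returned text can then be passed to the text-to-speech step.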

Accomplishments that we're proud of

That four strangers got together and were able to coordinate despite the challenges.
Building an end-to-end solution in 24 hours, which had been in Hemangi's mind for around a year.
Maintaining coordination through the night and into the next day, and fulfilling all the criteria of the competition.

What we learned?

Satinder: Learnt that team coordination is important and amazing.
Hemangi: Learnt that Replit increases development speed.
Ares: Built with an ESP and camera for the first time.
Aiden: Learnt the end-to-end app development cycle.

What's next for ThirdEye?

We are considering developing the wearable prototype further, reducing lag, supporting more languages, and 3D-printing wearable cases.