QueryCap: Ask any Question about What You See

Inspiration

We set out to build an augmented reality system after realizing that combining vision models with large language models could produce an assistant like Iron Man’s Jarvis, one that also helps people with visual impairments understand their surroundings. Instead of using an expensive commercial headset, we decided it would be more fun to build the hardware ourselves from a webcam, a Raspberry Pi, and a helmet.

What it does

QueryCap is a helmet-mounted AI assistant that can see what you see! All you have to do is wear the helmet, look at something you are interested in, and ask a question while holding the button. QueryCap will answer your question in the context of your surroundings and reply to you aloud.

For example, if you are visually impaired, you can enter an unknown space and ask QueryCap to “Describe what objects are in front of me.”

You can also use it to get more information about objects that you are interested in. For example, you can hold up a set of keys and ask “What are these for?”

These are just examples. You can ask anything!

How we built it

The system begins with a webcam mounted on top of the helmet that the user wears. At any time, the user can press a button to ask the system a question. At that point, an image and the user’s speech are sent wirelessly from the Raspberry Pi mounted on the helmet to our laptop, which forwards them to an AWS server. The server then runs a multimodal vision-language model (BLIP-2) on the captured image and the user’s question to generate an answer, which is sent back to the laptop and played through the speaker. The project drew on several different skills, including embedded programming, data networking and routing, and running machine learning models.
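As a rough illustration of the data flow described above, here is a minimal Python sketch of how a captured frame and a question might be packaged, handed to a model, and returned as an answer. It is not our actual code: the function names (build_request, handle_request, fake_model) and the JSON field names are hypothetical, the question is assumed to have already been transcribed from speech to text somewhere along the way, and the model call is a stub standing in for BLIP-2.

```python
import base64
import json
from typing import Callable


def build_request(image_bytes: bytes, question: str) -> str:
    """Helmet/laptop side: package a captured frame and the question as JSON.

    The field names are illustrative assumptions, not a real protocol.
    """
    return json.dumps({
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
        "question": question,
    })


def handle_request(payload: str, answer_fn: Callable[[bytes, str], str]) -> str:
    """Server side: decode the payload, run the model, and wrap the answer."""
    data = json.loads(payload)
    image_bytes = base64.b64decode(data["image_b64"])
    answer = answer_fn(image_bytes, data["question"])
    return json.dumps({"answer": answer})


def fake_model(image_bytes: bytes, question: str) -> str:
    """Stub standing in for the vision-language model (BLIP-2 in our system)."""
    return f"(stub) received {len(image_bytes)} bytes for question: {question}"


if __name__ == "__main__":
    # Placeholder bytes; a real frame would come from the webcam.
    request = build_request(b"\x00\x01\x02", "What are these for?")
    response = json.loads(handle_request(request, fake_model))
    print(response["answer"])  # this text would then be spoken aloud
```

Passing the model in as a function keeps the transport logic separate from inference, so the stub can be swapped for a real model call without touching the request handling.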

Challenges we ran into

Installing things is always harder than we expect, especially when combining several different systems like a Raspberry Pi, an AWS server, and a MacBook. 😭
We wanted to use some newer, more powerful image recognition models, but they were too big to run well on the resources we had.
The system sometimes gets confused and answers only part of the question we asked or answers incorrectly.

Accomplishments that we're proud of

We made a product that can answer a wide range of questions about a person’s surroundings! We did not expect to get it working this well in less than 2 days, and we are excited to share it with others. Overall, working as a team to merge fun back-end software with a hardware product was enjoyable. It was also rewarding to build something complete from scratch that was our own idea and to make the tough design and implementation decisions throughout the whole process.

What we learned

Plan for things to go wrong, because they will. Nothing is ever “that easy.”
Large language models can be surprisingly small and accessible!
Combining technologies on new platforms and in new scenarios can lead to awesome new ideas that help people.

What's next for QueryCap: Ask any Question about What You See

QueryCap could be expanded into a more robust system designed specifically to aid people who are visually impaired. Imagine a system that is always there for someone to ask any question at any time, right when they need it most.