Inspiration
Many Vision-Language-Action models guide users through text alone, which can often get confusing. Our approach cuts through that confusion by directly visualizing what the end result is supposed to look like. You can use it for tasks like cooking, making repairs, assembling furniture, and sorting packages.
What it does
When you put on the headset, it sees what you see and predicts your intent. That prediction is broken down into specific, verifiable steps, and you are guided through them sequentially with textual and visual context. If you make a mistake, it guides you to correct it; if you complete a step correctly, it moves on to the next one.
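Conceptually, the guidance loop is a small state machine: show the current step, capture a passthrough frame, verify it against the expected result, then either advance or correct. Here is a minimal sketch; the names and signatures are illustrative, not our actual code:

```python
# Illustrative sketch of the guidance loop (not our actual code): show the current
# step, capture a passthrough frame, verify it, then advance or correct.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    instruction: str      # verifiable textual instruction
    visualization: bytes  # generated image of the expected result

def run_guidance(
    steps: list[Step],
    capture_frame: Callable[[], bytes],                 # passthrough snapshot
    verify: Callable[[bytes, Step], tuple[bool, str]],  # -> (done?, correction hint)
    show: Callable[[str, bytes], None],                 # render text + image in headset
) -> None:
    current = 0
    while current < len(steps):
        step = steps[current]
        show(step.instruction, step.visualization)
        done, correction = verify(capture_frame(), step)
        if done:
            current += 1  # step completed correctly: move on to the next one
        elif correction:
            show(correction, step.visualization)  # guide the user back on track
```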
How we built it
We used the Quest 3 Passthrough Camera API to capture discrete pictures, which were sent to a Gemini backend: Gemini Robotics handled step planning and verification of the user's progress, while Gemini Nano Banana generated visualizations of each step. The Quest 3 scene was then updated with the upcoming steps to follow.
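The backend orchestration looked roughly like the sketch below. This is a minimal illustration assuming the google-genai Python SDK; the model IDs and prompt strings are placeholders, not our exact production values:

```python
# A minimal sketch of the backend calls, assuming the google-genai Python SDK.
# Model IDs and prompts below are illustrative placeholders, not our exact values.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

PLANNER_MODEL = "gemini-robotics-er-1.5-preview"  # assumed Gemini Robotics ER model ID
IMAGE_MODEL = "gemini-2.5-flash-image-preview"    # assumed Nano Banana model ID

def plan_steps(snapshot_jpeg: bytes) -> list[str]:
    """Infer the user's task from a passthrough snapshot and split it into verifiable steps."""
    response = client.models.generate_content(
        model=PLANNER_MODEL,
        contents=[
            types.Part.from_bytes(data=snapshot_jpeg, mime_type="image/jpeg"),
            "Infer what task the user is performing and list the remaining steps, "
            "one per line, each phrased so it can be visually verified.",
        ],
    )
    return [line.strip() for line in response.text.splitlines() if line.strip()]

def visualize_step(snapshot_jpeg: bytes, step: str) -> bytes | None:
    """Render what the scene should look like once this step is completed."""
    response = client.models.generate_content(
        model=IMAGE_MODEL,
        contents=[
            types.Part.from_bytes(data=snapshot_jpeg, mime_type="image/jpeg"),
            f"Edit this photo to show the result of completing: {step}",
        ],
    )
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            return part.inline_data.data  # raw image bytes for the headset overlay
    return None
```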
Challenges we ran into
Gemini Nano Banana is highly sensitive to prompt wording for any given task. Since we wanted to keep our prompt generally applicable across tasks, we had to do a lot of trial and error to find one that reliably visualized the relevant next step. Constantly rebuilding the APK for the headset was a big hassle for rapid prototyping, but unavoidable since we relied on passthrough camera access.
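For illustration, the task-agnostic template we iterated toward had roughly this shape (the exact wording we shipped differed):

```python
# Illustrative shape of the task-agnostic step-visualization prompt;
# the exact wording we shipped differed after much trial and error.
STEP_VISUALIZATION_PROMPT = (
    "You are given a photo of a workspace and a single instruction step. "
    "Edit the photo to show the workspace exactly as it should look after the "
    "step is completed. Keep the camera angle, lighting, and all unrelated "
    "objects unchanged; modify only what the step affects.\n"
    "Step: {step}"
)
```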
Accomplishments that we're proud of
Integrating the passthrough camera with powerful vision-language-action models proved very effective, even at this early stage. It gives the user a glimpse of a not-so-distant future in which smart glasses serve as capable assistants in daily tasks, from picking healthy items at the grocery store, to spotting a good deal in a sea of ads, to guiding us through difficult manual work.
What we learned
Nano Banana has a specific set of intended uses; once you step outside those use cases, it becomes hard to manage. This was our first time working with Gemini Robotics, and we were quite surprised by how capable it is at spatial reasoning over multiple steps. Unity and the Meta SDKs have become very easy to work with and offer a lot of features out of the box, such as polished UI elements, TTS, dictation, and raycast PlaceBox.
What's next for RealityGuide
Adding full 3D anchoring rather than the current 2.5D overlays. We would also like to port RealityGuide to smart glasses and are eagerly awaiting the Meta Wearables Device Access Toolkit to build features such as web-search-informed, voice-dictated guidance with helpful visualizations.