Inspiration

  • The uncharted road of combining XR with object segmentation.
  • The common situation of having to memorize the parts of complex systems.

What it does

  • Using the two controllers, the user marks the corners of the detection/segmentation box. As output, they receive a 2D mask ("sticker") and a description of the segmented object.

How we built it

Our setup:

  • Lenovo Yoga Pro 9i (RTX 4060) - Unity/MR rig
  • MacBook Pro M1 Max - CV processing
  • Meta Quest 3

Our data flow is as follows:

  1. We capture data over the last N frames from the headset, where N is a constant in the code (we found 5 frames to be a good trade-off between computational latency and data collection).
  2. For each frame, we project the controllers' 3D positions into a 2D box and capture an image from the Passthrough camera (see the projection sketch after this list).

  3. We send the batch of frames and boxes to a FastAPI server that hosts inference for the SAM2 segmentation model (see the server sketch after this list). The model exploits temporal relationships between the captured boxes and frames, so the more frames we send, the more accurate the prediction gets.

  4. SAM2 produces 3 candidate masks for the frame; we pick the one with the highest confidence score and crop it.

  5. We send the cropped mask to the Groq API with a custom prompt to receive a short description of the segmented object (see the Groq sketch after this list).

  6. We send the mask and description back to the device, where we present them on a UI canvas.
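
As a reference for step 2, here is a minimal sketch of how a controller's 3D position can be projected to a pixel. The real implementation lives in our Unity/C# code; this Python/NumPy version assumes a standard pinhole camera model, with the intrinsics (fx, fy, cx, cy) and the 4x4 camera pose matrix standing in for whatever the Passthrough camera reports.

    import numpy as np

    def project_to_pixel(point_world, cam_to_world, fx, fy, cx, cy):
        """Project a 3D world-space point to (u, v) pixel coordinates of a pinhole camera."""
        world_to_cam = np.linalg.inv(cam_to_world)        # invert the 4x4 camera pose
        x, y, z = (world_to_cam @ np.append(point_world, 1.0))[:3]
        u = fx * x / z + cx                               # perspective divide + intrinsics
        v = fy * y / z + cy
        return int(round(u)), int(round(v))

    def controllers_to_box(left_pos, right_pos, cam_to_world, fx, fy, cx, cy):
        """The 2D prompt box is spanned by the two projected controller positions."""
        u1, v1 = project_to_pixel(left_pos, cam_to_world, fx, fy, cx, cy)
        u2, v2 = project_to_pixel(right_pos, cam_to_world, fx, fy, cx, cy)
        return [min(u1, u2), min(v1, v2), max(u1, u2), max(v1, v2)]  # [x_min, y_min, x_max, y_max]

One subtlety: Unity's world space is y-up while image coordinates are y-down, so depending on how the camera pose is defined, an axis flip may be needed; getting that reference wrong is one plausible source of the constant vertical offset mentioned under Challenges.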
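
For steps 3-4, here is a rough sketch of the server side, assuming the SAM2ImagePredictor interface from the sam2 package. The endpoint path, request fields, and checkpoint name are illustrative, and for brevity the sketch segments only the most recent frame rather than exploiting the full temporal batch.

    import base64, io
    import numpy as np
    from PIL import Image
    from fastapi import FastAPI
    from pydantic import BaseModel
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    app = FastAPI()
    predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-small")  # checkpoint is a placeholder

    class SegmentRequest(BaseModel):
        frames: list[str]        # base64-encoded JPEG frames, oldest first
        boxes: list[list[int]]   # one [x_min, y_min, x_max, y_max] box per frame

    @app.post("/segment")
    def segment(req: SegmentRequest):
        # Decode the most recent frame and run the box-prompted predictor on it.
        image = np.array(Image.open(io.BytesIO(base64.b64decode(req.frames[-1]))).convert("RGB"))
        predictor.set_image(image)
        masks, scores, _ = predictor.predict(box=np.array(req.boxes[-1]), multimask_output=True)

        best = masks[int(np.argmax(scores))].astype(np.uint8)  # highest-confidence of the 3 masks
        ys, xs = np.nonzero(best)
        if xs.size == 0:
            return {"mask_png": None, "score": 0.0}
        x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
        sticker = image[y0:y1 + 1, x0:x1 + 1] * best[y0:y1 + 1, x0:x1 + 1, None]  # crop + apply mask

        buf = io.BytesIO()
        Image.fromarray(sticker).save(buf, format="PNG")
        return {"mask_png": base64.b64encode(buf.getvalue()).decode(), "score": float(scores.max())}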
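
For step 5, a sketch of the description call, assuming Groq's OpenAI-style chat completions interface with image_url content. The model name and prompt are placeholders, not the exact ones we used.

    import base64
    from groq import Groq

    client = Groq()  # reads GROQ_API_KEY from the environment

    def describe_sticker(png_bytes: bytes) -> str:
        """Ask a vision-capable Groq-hosted model for a short description of the cropped mask."""
        data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()
        resp = client.chat.completions.create(
            model="meta-llama/llama-4-scout-17b-16e-instruct",   # placeholder vision model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "In one short sentence, name and describe the object in this image."},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }],
        )
        return resp.choices[0].message.content.strip()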

Challenges we ran into

  • The Passthrough API requires a standalone APK, which made it difficult to access the camera feed from the Quest. We had to resort to Unity web requests to transfer the model input data to the local server over a hotspot WLAN connection. We found that the Windows hotspot had trouble routing requests between the device and the server in real time, while an iPhone hotspot had sufficient speed.
  • We had trouble casting the 3D positions of the controllers to (x, y) positions on the screen: the y coordinate was consistently 35-45 pixels below the target point, which may be due to an improper reference position.
  • The SAM2 model we used has a tendency to apply masks to fine-grained details such as the wood pattern on a table or the texture of the ceiling. We had to blur the image to remove these fine details (see the sketch below).
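
A minimal sketch of that blur preprocessing, assuming OpenCV's GaussianBlur with an illustrative kernel size:

    import cv2

    def suppress_fine_detail(image_bgr, ksize=7):
        """Blur the passthrough frame slightly so SAM2 latches onto objects
        rather than fine textures like wood grain or ceiling patterns."""
        return cv2.GaussianBlur(image_bgr, (ksize, ksize), 0)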

Accomplishments that we're proud of

  • On the technical side, we are proud of setting up real-time inference of headset data on the server (a laptop), with enough room to expand the detection methods and the data transferred. We are also proud of figuring out how to accurately map the 3D positions of the controllers to 2D pixel positions in the camera image.
  • We are proud we found a use case for segmentation, a CV task that does not have many real-time XR usage examples.

What we learned

  • We learned a lot about how to access hardware and engine data from an app running directly on the Meta Quest 3, and about camera projections and the challenges of doing CV on an XR device.

What's next for Gigafind

  • Gigafind has the potential to be a useful augmentation application for professionals working on, or training to work on, complex systems with many exposed components, where visual identification could be augmented by CV. It might also find applications in more accurate labelling for future multimodal CV, and more.

Built With

  • c#
  • cv
  • cv2
  • fastapi
  • groq
  • huggingface
  • llm
  • maskgen
  • matplotlib
  • metaxrsdk
  • mr
  • passthroughcameraapi
  • python
  • pytorch
  • sam2
  • segmentation
  • unity
  • unitywebrequest