Inspiration
The project was inspired by a collection of technologies that we wanted to learn during the hackathon:
- VLMs
- Raspberry Pi
- RL

We were also inspired by projects developed with Meta's AR glasses.
What it does
Our project is a tool that you carry around to take pictures of the world, talk about what you see, and have its "brain" detect the items around you. We plan to let it interface with Obsidian so you can drop in notes directly and read from your vault when needed.
How we built it
We built the project with a data pipeline consisting of:
- User query summarization with an LLM
- Object detection by Grounding-DINO
- Semantic segmentation with SAMv2
- Custom short-term memory as inspired by this paper
- An LLM to produce the output (essentially ChatGPT)

The Raspberry Pi was set up with a camera and a microphone and streamed its data to the server over WebRTC; the two were connected via Ethernet.
We optimized the ML models by using multi-threading for object detection and LLM responses. We also skipped frames on a per-model basis to get better FPS.
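The two optimizations above can be sketched roughly as follows. This is a minimal illustration, not our actual code: `FrameSkipper`, `process_stream`, the stride values, and the `detect` / `segment` callables are all hypothetical names standing in for whatever models sit in the pipeline. Each model only runs on every N-th frame and reuses its last result in between, while a thread pool lets two models work on the same frame at once.

```python
from concurrent.futures import ThreadPoolExecutor


class FrameSkipper:
    """Run a model on every `stride`-th frame; reuse the last result otherwise."""

    def __init__(self, model_fn, stride):
        if stride < 1:
            raise ValueError("stride must be >= 1")
        self.model_fn = model_fn  # hypothetical callable: frame -> result
        self.stride = stride
        self.count = 0
        self.last = None

    def __call__(self, frame):
        # Only pay for inference on frames 0, stride, 2 * stride, ...
        if self.count % self.stride == 0:
            self.last = self.model_fn(frame)
        self.count += 1
        return self.last


def process_stream(frames, detect, segment, detect_stride=2, segment_stride=4):
    """Run two models per frame on worker threads, each with its own skip rate."""
    det = FrameSkipper(detect, detect_stride)
    seg = FrameSkipper(segment, segment_stride)
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for frame in frames:
            # Submit both models for the same frame so they overlap in time.
            f_det = pool.submit(det, frame)
            f_seg = pool.submit(seg, frame)
            # Wait for both before moving on, so each skipper is only
            # ever called from one task at a time.
            results.append((f_det.result(), f_seg.result()))
    return results
```

Giving each model its own stride means a slow model can be skipped more aggressively than a fast one. In CPython, threads only help here when the model calls release the GIL, which is an assumption about the underlying inference library.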
Challenges we ran into
Audio transcription during streaming was painful: it is hard to work with audio formats over the network and reconstruct them afterwards. The FPS was low at the start, before we added optimizations. Grounding-DINO also has low accuracy, since it is a small model.
Accomplishments that we're proud of + What we learned
We are proud that we got to learn a bunch of new tech:
- How to set up a LAN for testing
- How to set up RPi hardware and system settings for the camera and microphone
- How to create a data pipeline with VLMs
What's next for Mirror Mind
We gotta make it work with Obsidian!
Built With
- cuda
- depth-anything
- fastapi
- grounding-dino
- huggingface
- opencv
- python
- raspberry-pi
- sam2
- yolo