Inspiration
All large language models, whether visual or text-based, hallucinate, undermining trust in AI systems. They make up false information and can't reliably keep track of things, which is especially problematic in visual settings, where memory is key for tasks like object retrieval and path planning. As a team of CV/robotics researchers, we've seen firsthand how basic computer vision and today's paradigms limit robotic autonomy and path planning. So whether it's for a pair of smart glasses that remember where your keys are or for the next generation of autonomous robotics, solving the problem of memory and hallucination is a key challenge with immense potential.
Even for a subtask like finding lost objects, the impact is tangible: Americans spend an estimated $2.7 billion annually replacing lost items, along with countless hours of frustration and inefficiency. To address this, we present VLMemory, a framework that equips Visual Language Models (VLMs) with spatial and temporal memory, enabling them to remember what they see and recall where things are.
What it does
Imagine you’re late for work, your heart pounding, and you can’t find your keys. Instead of frantically checking every surface in your apartment, you ask VLMemory, “Where are my keys?” It immediately responds: “On the kitchen counter, left of the refrigerator.” You grab them and go—thirty seconds instead of ten minutes.
VLMemory makes this possible by providing continuous, context-aware visual memory. From a live camera stream (in our case, a companion web app on your phone), VLMemory detects, tracks, and updates a database of objects over time. That database can be instantly queried via natural language ("Where's my wallet?") or object-specific searches ("Find the blue mug"). Each object's relative and global position, along with key visual attributes, is recorded and updated over time in a chronological history that can reveal trends and habits.
When a user searches for an object, VLMemory retrieves not only its last-known global location but also a local description of nearby items and current environmental clues, ensuring precise identification.
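To make this concrete, here is a rough sketch of the kind of structured answer a lookup returns; the field names are illustrative rather than our exact schema:

```python
# Illustrative shape of a query result (field names are hypothetical).
answer = {
    "object": "keys",
    "last_seen": "2025-10-12T08:41:07",
    "global_position": [1.8, 0.4, 0.9],   # metres in the apartment frame
    "local_description": "on the kitchen counter, left of the refrigerator",
    "nearby_objects": ["coffee maker", "blue mug"],
}
print(f"Your {answer['object']} were last seen {answer['local_description']}.")
```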
Privacy is built into the system from the ground up. All processing and inference occur locally, so no video or sensor data leaves the device. VLMemory provides powerful functionality without the privacy compromises of cloud-based AI systems.
How we built it
We built VLMemory around three core components: a local VLM pipeline, a spatial memory database, and an interactive web interface.
For video input to our local VLM, we developed a React-based web app that streams the phone's camera and IMU sensor data over a secure, locally hosted WebSocket connection. The incoming stream is processed and fed to our Visual Language Model, which uses knowledge from the database to detect objects, describe them, and track spatial relationships.
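As a rough sketch of how the ingest side might look (the endpoint path, payload shape, and `process_frame` hook are assumptions for illustration, not our exact code), a FastAPI WebSocket handler can receive each frame plus its IMU sample and hand them to the pipeline:

```python
# Minimal local ingest sketch: JPEG frames + IMU samples over a WebSocket.
import base64
import json

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            message = json.loads(await ws.receive_text())
            frame_jpeg = base64.b64decode(message["frame_jpeg"])  # camera frame
            imu_sample = message["imu"]                           # accel/gyro reading
            # Hand the frame and IMU sample to the VLM / tracking pipeline here,
            # e.g. process_frame(frame_jpeg, imu_sample)  (hypothetical hook).
    except WebSocketDisconnect:
        pass
```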
To enhance spatial accuracy, we fuse dead-reckoned IMU data with visual detections, giving each object consistent global and relative coordinates along with descriptors of its surroundings and nearby objects. Each detected object is then logged to a locally hosted SQLite database that maintains timestamped histories and descriptors for everything being tracked.
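The schema below is a simplified sketch of what such a database can look like (table and column names are illustrative): one row per sighting, so every object accumulates a timestamped position history.

```python
# Simplified local SQLite schema sketch for timestamped object histories.
import sqlite3

conn = sqlite3.connect("vlmemory.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS objects (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,          -- e.g. "keys", "blue mug"
    descriptor TEXT                    -- VLM-generated visual description
);
CREATE TABLE IF NOT EXISTS sightings (
    id         INTEGER PRIMARY KEY,
    object_id  INTEGER REFERENCES objects(id),
    seen_at    TEXT NOT NULL,          -- ISO-8601 timestamp
    x REAL, y REAL, z REAL,            -- fused global position (metres)
    context    TEXT                    -- nearby objects / surroundings
);
""")
conn.commit()
```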
On the frontend, a custom API retrieves data from the database, enabling users to search, filter, and view object histories via natural-language or image-based queries. To support flexible, human-like search behavior ("Where did I leave my phone?"), we use embedding-based similarity matching with cosine similarity to interpret queries and rank results.
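In spirit, the matching step works like the sketch below: embed the query and each stored descriptor, then rank by cosine similarity. The `embed` stub is a placeholder for whichever embedding model is plugged in.

```python
# Cosine-similarity ranking sketch; embed() is a placeholder for a real model.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a deterministic unit-norm vector for `text`."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def rank_objects(query: str, descriptors: dict[str, str]) -> list[tuple[str, float]]:
    """Rank stored objects by cosine similarity between query and descriptor embeddings."""
    q = embed(query)
    scores = {name: float(np.dot(q, embed(desc))) for name, desc in descriptors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_objects(
    "Where did I leave my phone?",
    {"phone": "black smartphone on the desk",
     "keys": "key ring on the kitchen counter"},
))
```

Because the placeholder vectors are unit-norm, the dot product here equals cosine similarity.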
Challenges we ran into
The biggest challenges we ran into came from system design and processing time. Because we had to constantly query a VLM to get accurate data, finding the right model, choosing proper preprocessing, and preventing desyncs were essential to keeping the app running smoothly. Our local-only hosting constraint added extra hurdles as well, since we had to design our own custom database and APIs to guarantee privacy.
Accomplishments that we're proud of
We're proud of just how much tech we managed to pack into a seemingly simple project. From dead reckoning to edge VLMs to embedding spaces, the robustness of our system and its transferability to other applications are things we're especially proud of.
We're also proud of our communication and planning, which allowed us to effectively divide tasks, hit our reach goals, and maintain a consistent project timeline with few hiccups.
What we learned
By tackling VLMs, we learned just how powerful they can be for a variety of tasks, but also where their weaknesses lie and where traditional methods can still enhance their capabilities. Along the way, building VLMemory also taught us a lot about API performance, local model hosting and optimization, script writing, and peer-to-peer connections.
What's next for VLMemory
The next steps for VLMemory focus on hardware integration. Whether in a pair of smart glasses or on board a robot, adapting the system for hands-free object tracking is the next big step, and it could open the door to more complex applications down the line.
Built With
- embedding
- fastapi
- gemini
- python
- react
- sql
- vlm
- websockets
