Inspiration
Today’s baby monitors can track sleep cycles and basic vitals, but they fall short where it matters most: understanding what your child is actually doing in real time. Parents don’t just need data; they need context. Is the baby sleeping peacefully, trying to climb out, or in a potentially dangerous situation? A baby faces countless possible dangers, and with the rapid advancement of capable Vision Language Models (VLMs), we saw an opportunity to move beyond passive monitoring and build something that can interpret and explain behavior as it happens. Furthermore, state-of-the-art smart monitors are prohibitively expensive. We wanted to democratize this technology by building an edge pipeline that turns everyday hardware, like older webcams and phones, into highly capable cognitive safety nets.
What it does
- Live Analysis: Provides a real-time overview of your child's activity in an intuitive dashboard.
- Danger Zone: Lets users define custom keep-out zones that trigger an alert when movement is detected within them.
- Status Alerts: Sends status messages to caregivers when abnormal activity is detected.
- IoT Sensor Integrations: Integrates a flame sensor, an MQ-2 gas sensor (CO/H2/LPG/CH4), a temperature sensor, and an active buzzer.
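To make the Danger Zone idea concrete, here is a minimal sketch of a keep-out check. It assumes zones and detected motion are both represented as axis-aligned rectangles in normalized image coordinates; the `Box` class, the `triggered_zones` helper, and the zone names are hypothetical and are not taken from the NeuralNest codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in normalized image coordinates (0.0 to 1.0)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def intersects(self, other: "Box") -> bool:
        # Two rectangles overlap unless one lies entirely to one side of the other.
        return (
            self.x_min < other.x_max
            and other.x_min < self.x_max
            and self.y_min < other.y_max
            and other.y_min < self.y_max
        )


def triggered_zones(zones: dict[str, Box], motion: Box) -> list[str]:
    """Return the names of every keep-out zone that the motion box overlaps."""
    return [name for name, zone in zones.items() if zone.intersects(motion)]


# Illustrative zones only; a real setup would load these from user input.
zones = {
    "outlet": Box(0.80, 0.60, 1.00, 0.90),
    "stairs": Box(0.00, 0.00, 0.20, 0.50),
}
motion = Box(0.75, 0.70, 0.85, 0.80)
print(triggered_zones(zones, motion))  # ['outlet']
```

Rectangles keep the overlap test to four comparisons per zone. Free-form zones would need a polygon test instead.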
How we built it
We built an asynchronous, non-blocking pipeline in which the lightweight IoT telemetry (Arduino) and the heavy cognitive visual inference (Moondream V2) run independently, so neither blocks the other.
- Arduino and Raspberry Pi linked over serial, with network communication over TCP/UDP
- Moondream V2 VLM (1.86B parameters) running continuously over video frames
- Frontend built with Tailwind CSS and shadcn/ui components
- Data ingestion through PocketBase
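The decoupling described above can be sketched with `asyncio`. This is an illustration under stated assumptions, not the project's actual code: `slow_inference` is a hypothetical stand-in that sleeps instead of calling a VLM, and the sensor readings are simulated counters rather than real serial data.

```python
import asyncio
import time


def slow_inference(frame_id: int) -> str:
    """Stand-in for a blocking VLM call (hypothetical; not the Moondream API)."""
    time.sleep(0.2)
    return f"frame {frame_id}: baby is asleep"


async def telemetry_loop(events: list, ticks: int) -> None:
    """Poll (simulated) sensors on a short, fixed interval."""
    for i in range(ticks):
        events.append(("sensor", i))
        await asyncio.sleep(0.05)


async def vision_loop(events: list, frames: int) -> None:
    """Run the slow model in a worker thread so the event loop stays free."""
    for i in range(frames):
        caption = await asyncio.to_thread(slow_inference, i)
        events.append(("vision", caption))


async def main() -> list:
    events: list = []
    # Both loops run concurrently; neither waits for the other to finish.
    await asyncio.gather(
        telemetry_loop(events, ticks=6),
        vision_loop(events, frames=2),
    )
    return events


events = asyncio.run(main())
for kind, payload in events:
    print(kind, payload)
```

Because the blocking call is handed to a worker thread, sensor readings keep arriving while a frame is still being analyzed.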
Challenges we ran into
- We experienced instability with the Raspberry Pi early on, which slowed our progress and led us to pivot to a webcam/phone. We eventually got the camera working, so a Raspberry Pi setup is possible.
Accomplishments that we're proud of
- Getting the VLM to accurately identify danger zones: We achieved accurate detection of hazards such as electrical outlets, sharp objects, and fire sources.
- Categorizing alert severity: We defined Low, Medium, and High severity levels so that each alert conveys how urgent it is.
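One way to represent the three severity levels is an ordered enum plus a lookup table. The sketch below is illustrative only: the event labels and their assigned levels in `RULES` are assumptions made for the example, not the mapping NeuralNest actually uses.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical rule table; the labels and levels are illustrative only.
RULES = {
    "awake": Severity.LOW,
    "standing_in_crib": Severity.MEDIUM,
    "in_keepout_zone": Severity.HIGH,
    "flame_detected": Severity.HIGH,
}


def classify(labels: list[str]) -> Severity:
    """Return the highest severity among the labels; unknown labels count as LOW."""
    return max(
        (RULES.get(label, Severity.LOW) for label in labels),
        default=Severity.LOW,
    )


print(classify(["awake", "standing_in_crib"]).name)  # MEDIUM
```

Using `IntEnum` makes the levels comparable, so the most urgent label in a frame decides the severity of the alert.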
What we learned
- Stereovision is harder than it looks: We ran into difficulties with stereovision cameras, such as depth occlusion, inconsistent lighting, and multiple target subjects, all of which make accurate estimation difficult.
- Prompt engineering matters for VLMs: We needed to steer the VLM's context toward the things that actually matter in the image.
- Real-time systems introduce novel constraints: Cameras buffer, loops get locked, and processing speeds differ, so coordinating multiple devices and protocols has to be done in a non-blocking manner.
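A common remedy for the camera-buffering problem is a single-slot "latest value" holder: the producer overwrites old frames instead of queueing them, so a slow consumer always works on the newest one. The sketch below shows that general pattern; `LatestValue` is a hypothetical helper, and integers stand in for frames.

```python
import threading
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class LatestValue(Generic[T]):
    """Single-slot mailbox: writers overwrite, readers take the newest item."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._value: Optional[T] = None
        self._fresh = False

    def put(self, value: T) -> None:
        # Never blocks on a slow reader; an unread older value is simply replaced.
        with self._cond:
            self._value = value
            self._fresh = True
            self._cond.notify()

    def get(self, timeout: Optional[float] = None) -> Optional[T]:
        # Wait for a value that has not been read yet, or return None on timeout.
        with self._cond:
            if not self._cond.wait_for(lambda: self._fresh, timeout=timeout):
                return None
            self._fresh = False
            return self._value


slot: LatestValue[int] = LatestValue()
for frame_id in range(5):  # the camera produces five frames...
    slot.put(frame_id)
print(slot.get(timeout=0.1))  # 4: the consumer only sees the newest frame
print(slot.get(timeout=0.1))  # None: nothing new since the last read
```

Dropping stale frames trades completeness for latency, which usually suits live monitoring better than processing every frame late.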
What's next for NeuralNest
- Creating a physical layer using a custom PCB, camera, and microcontrollers