Walkie and Talkie

What it does

Walkie and Talkie samples a frame from the phone’s camera feed every three seconds, identifies the most important object in the frame, and uses text-to-speech (TTS) to announce what it is, its distance from the phone in feet, and its position in the camera frame (left, center, or right).
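As a rough illustration of how one announcement could be assembled, here is a minimal Python sketch. It is not our app code: the function names, the split of the frame into equal thirds, the phrase format, and the assumption that distance arrives in meters are all hypothetical choices made for the example.

```python
METERS_TO_FEET = 3.28084


def horizontal_position(x_center: float, frame_width: float) -> str:
    """Map a box center to 'left', 'center', or 'right'.

    Splitting the frame into equal thirds is an assumption for this sketch.
    """
    ratio = x_center / frame_width
    if ratio < 1 / 3:
        return "left"
    if ratio > 2 / 3:
        return "right"
    return "center"


def announcement(label: str, distance_m: float, x_center: float, frame_width: float) -> str:
    """Build the phrase handed to text-to-speech (hypothetical format)."""
    feet = round(distance_m * METERS_TO_FEET)
    unit = "foot" if feet == 1 else "feet"
    position = horizontal_position(x_center, frame_width)
    return f"{label}, {feet} {unit}, {position}"
```

For example, `announcement("person", 2.0, 320, 640)` describes a person about two meters away in the middle of a 640-pixel-wide frame.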

How we built it

We used YOLOv8, trained on the COCO detection dataset, to detect and localize objects within the image frame. We then used the iPhone's native LiDAR depth estimation to compute the distance between the phone and each detected object.
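One way to combine a detection box with a depth map is to take the median depth inside the box, which is less sensitive to background pixels than a single center sample. The Python sketch below illustrates that idea only. It assumes the depth map is a list of rows of distances in meters, that the box has already been scaled to depth-map pixel coordinates, and that non-positive readings are invalid; `object_distance` is a hypothetical helper, not a function from our app.

```python
from statistics import median


def object_distance(depth_map, box):
    """Estimate the distance to an object from the depth values inside its box.

    depth_map: list of rows of depth readings in meters (assumed layout).
    box: (x_min, y_min, x_max, y_max) in depth-map pixel coordinates,
         with the max edges exclusive.
    Returns the median valid depth, or None if the box has no valid readings.
    """
    x_min, y_min, x_max, y_max = box
    samples = [
        depth
        for row in depth_map[y_min:y_max]
        for depth in row[x_min:x_max]
        if depth > 0  # treat non-positive readings as invalid (assumption)
    ]
    return median(samples) if samples else None
```

Returning `None` for an empty or invalid region lets the caller skip the distance in the announcement instead of reporting a misleading number.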

To determine the most important object in the frame, we developed a priority scoring system. Each object in a sampled frame is assigned a priority score based on what it is, how close it is to the camera, and where it is in the frame. Vehicles and people that are close to the camera and centered in the frame are given the highest scores. We then use Deepgram's text-to-speech API to announce details about the highest-scoring object.
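The scoring idea can be sketched in Python as a weighted sum of three terms. Every number here (the class weights, the 5-meter range, and the 0.5/0.3/0.2 mix) is a placeholder chosen for illustration, not the weights used in the app, and the `Detection` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # class name from the detector, e.g. "car"
    distance_m: float  # distance from the phone in meters
    x_center: float    # horizontal box center, normalized to 0..1


# Placeholder weights: vehicles and people rank above everything else.
CLASS_WEIGHTS = {"car": 1.0, "bus": 1.0, "truck": 1.0, "person": 0.9}
DEFAULT_CLASS_WEIGHT = 0.3
MAX_RANGE_M = 5.0  # objects at or beyond this range get no proximity credit


def priority(d: Detection) -> float:
    """Combine object type, proximity, and centeredness into one score."""
    class_score = CLASS_WEIGHTS.get(d.label, DEFAULT_CLASS_WEIGHT)
    proximity = max(0.0, 1.0 - d.distance_m / MAX_RANGE_M)
    centeredness = 1.0 - abs(d.x_center - 0.5) * 2  # 1 at center, 0 at edges
    return 0.5 * class_score + 0.3 * proximity + 0.2 * centeredness


def most_important(detections):
    """Return the highest-scoring detection, or None for an empty frame."""
    return max(detections, key=priority, default=None)
```

With these placeholder weights, a centered car two meters away outranks both a closer chair and a distant person near the edge of the frame.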

Challenges we ran into

-Battery Usage: Running the camera for extended periods quickly drains the phone’s battery, making the app impractical to use throughout the day. We addressed this by disabling the camera’s autofocus and lowering the camera resolution and sampling rate. We experimented with different combinations of resolution and sampling rate to keep announcements accurate and regular without wasting battery.

-Compatibility: As we tested different models, we ran into compatibility issues between interpreter and module versions. We resolved them one at a time by consulting documentation and forums.

Accomplishments that we're proud of

-Priority System Design: To keep the program from being overwhelmed when many objects are on screen, we had to design a way to prioritize certain objects over others. We’re especially proud of how well it works in the application: in the scenarios we tested, it prioritized the same objects our eyes would.

-LiDAR: Object recognition requires light, but LiDAR does not, so in low light we can still announce that an object is present even when we cannot identify it.

What we learned

We learned Swift development, how to make HTTP requests to an app backend, and how to integrate trained detection models.

What's next for Walkie and Talkie

-Tweak Scoring System: We didn’t have much time to tune the weights for each piece of information about an object, so we want to explore which weightings best keep users aware of their surroundings.

-Accessibility for Phones Without LiDAR: We also experimented with relative depth estimation models that work from image data, such as Depth Anything V2. These could help us reach people who don’t have flagship phones.

-Specialized Dataset: The COCO dataset that YOLOv8 is trained on includes many objects that are not relevant to Walkie and Talkie and is missing many key objects that people would likely want to be notified about. By creating our own dataset based on the app’s needs, we could make detection more useful.