Inspiration

I have a friend who lost his sight when he was 12. He uses screen reading software every day. The existing experience can be daunting and stressful at times. We believe that new technologies can and must help people live better lives.

What it does

The AI-powered screen reader can scroll social media feed, describe photos that people post, and do actions with them: like, share, comment etc.

How we built it

We used Mistral Pixtral to describe images and MacOS capabilities to capture screen and convert text to speech and back.

Challenges we ran into

Multimodal LLMs can't work with pixel coordinates on the image.

Accomplishments that we're proud of

A complete end-to-end demo showing how the blind user can listen to a photo description, like it, and scroll down to the next post to repeat the workflow.

What we learned

Zero-shot CV models work well, Apple Shortcuts have really advanced automation capabilities, visual comprehension has some room for improvement, navigate a web page with CV is much tougher challenge then we thought.

What's next for Readie

Better generalisation of capabilities, performance improvements, more actions to support, support showing experience on the web. Potentially a mobile version.

Built With

Share this project:

Updates