Inspiration
I have a friend who lost his sight when he was 12. He uses screen reading software every day. The existing experience can be daunting and stressful at times. We believe that new technologies can and must help people live better lives.
What it does
The AI-powered screen reader can scroll social media feed, describe photos that people post, and do actions with them: like, share, comment etc.
How we built it
We used Mistral Pixtral to describe images and MacOS capabilities to capture screen and convert text to speech and back.
Challenges we ran into
Multimodal LLMs can't work with pixel coordinates on the image.
Accomplishments that we're proud of
A complete end-to-end demo showing how the blind user can listen to a photo description, like it, and scroll down to the next post to repeat the workflow.
What we learned
Zero-shot CV models work well, Apple Shortcuts have really advanced automation capabilities, visual comprehension has some room for improvement, navigate a web page with CV is much tougher challenge then we thought.
What's next for Readie
Better generalisation of capabilities, performance improvements, more actions to support, support showing experience on the web. Potentially a mobile version.
Built With
- apple
- mistral
- nebius
- python
Log in or sign up for Devpost to join the conversation.