About VisionNotes

Inspiration

The inspiration for VisionNotes came from a pressing need to support blind and visually impaired individuals in a world that’s becoming more fast-paced and urbanized by the day. With cities growing more complex and daily life increasingly driven by visual cues, we wanted to create a tool that would empower people to better engage with their surroundings and feel more connected to the world around them.

What We Learned

This project was a major learning journey for us. When we started, we had no experience with computer vision, artificial intelligence, large language models (LLMs), machine learning, or even full-stack development. Building VisionNotes forced us to learn these technologies from scratch, one step at a time. We tackled everything from setting up object detection models to building the backend server and crafting an intuitive front-end interface. The result? A complete system that we’re incredibly proud of, built entirely through dedication and self-learning.

How We Built VisionNotes

  1. Research and Learning: We began by immersing ourselves in tutorials, online courses, and open-source projects. This helped us understand the basics of computer vision, AI, and web development.
  2. Full Stack Development: Starting with a Flask backend, we added a React frontend, focusing on simplicity and accessibility. This way, VisionNotes would be easy for anyone to use (a sketch of the kind of endpoint connecting the two appears after this list).
  3. Computer Vision and Object Detection: We integrated object detection using pre-trained models, then fine-tuned and optimized them for real-time processing. Multithreading and various performance enhancements helped us get the speed we needed to make VisionNotes responsive (a sample detection pass is sketched after this list).
  4. Audio Descriptions: A critical feature was delivering audio descriptions of the recognized objects. We implemented this using natural language processing, giving users a conversational and descriptive experience (a text-to-speech sketch follows the list).
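
To make the backend/frontend split concrete, here is a minimal sketch of the kind of Flask endpoint a React camera feed could POST frames to. The /detect route, the payload shape, and the detect_objects stub are illustrative assumptions, not the actual VisionNotes API.

```python
# Minimal sketch: an endpoint the React frontend could POST camera
# frames to. Route name, payload shape, and detect_objects() are
# illustrative placeholders, not the real VisionNotes API.
from flask import Flask, request, jsonify

app = Flask(__name__)

def detect_objects(image_bytes):
    # Placeholder for the computer vision model (see the next sketch).
    return []

@app.route("/detect", methods=["POST"])
def detect():
    frame = request.files["frame"].read()  # one camera frame per request
    return jsonify({"objects": detect_objects(frame)})

if __name__ == "__main__":
    # threaded=True lets the development server overlap slow requests.
    app.run(threaded=True)
```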
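
The write-up doesn’t name the pre-trained models involved, so as one common choice, here is how a single detection pass could look with a pre-trained YOLOv8 model from the ultralytics package:

```python
# Sketch of a detection pass with one common pre-trained model
# (Ultralytics YOLOv8). The model choice and the 0.5 confidence
# threshold are assumptions, not VisionNotes' actual configuration.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # small pre-trained model; downloads on first use

def detect_objects(image):
    """Return the class names of objects detected in one frame."""
    result = model(image, verbose=False)[0]
    names = []
    for box in result.boxes:
        if float(box.conf[0]) > 0.5:               # confidence threshold
            names.append(result.names[int(box.cls[0])])
    return names

print(detect_objects("https://ultralytics.com/images/bus.jpg"))
```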
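
For the audio side, here is a minimal sketch of turning detected object names into a spoken sentence, assuming pyttsx3 as the text-to-speech layer (one common offline library; the write-up doesn’t name the actual NLP or speech stack):

```python
# Sketch: compose a short natural-language sentence from detected
# object names and read it aloud. pyttsx3 is an assumption, not
# necessarily the library VisionNotes uses.
import pyttsx3

def describe(objects):
    """Turn a list of object names into one conversational sentence."""
    if not objects:
        return "I don't see anything I recognize right now."
    if len(objects) == 1:
        return f"I can see a {objects[0]}."
    listed = ", ".join(f"a {name}" for name in objects[:-1])
    return f"I can see {listed}, and a {objects[-1]}."

engine = pyttsx3.init()
engine.say(describe(["chair", "table", "laptop"]))
engine.runAndWait()
```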

Challenges We Faced

Building VisionNotes was anything but smooth sailing, and we hit plenty of roadblocks along the way:

  1. Integrating Components: Connecting all the moving parts (Flask backend, React frontend, computer vision models, and audio description) was challenging. Each part needed to communicate seamlessly, which required extensive debugging and testing.
  2. Server Responsiveness: Ensuring the Flask server could handle real-time object detection was difficult, especially when processing images from the camera feed. To keep everything responsive, we experimented with multithreading and optimized our server’s request handling.
  3. Speed and Efficiency in Object Detection: Processing video feeds quickly enough to provide real-time feedback required us to optimize every step in our pipeline. We used multithreading, efficient algorithms, and frame-skipping techniques to maintain speed without compromising accuracy (the sketch after this list shows the frame-skipping and threading idea).
  4. User Accessibility: Since VisionNotes is designed to assist visually impaired users, we had to rethink typical UI/UX decisions. It was a constant balancing act between functionality and simplicity.
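
As a concrete illustration of the threading and frame-skipping ideas in items 2 and 3, here is a sketch of a capture loop that reads every frame but only hands every Nth one to the model; the skip interval, queue size, and run_detection placeholder are assumptions, not the production code:

```python
# Sketch of frame skipping with a separate capture thread: read every
# frame so the camera buffer never backs up, but only queue every Nth
# frame for detection. PROCESS_EVERY and run_detection() are
# illustrative placeholders.
import queue
import threading

import cv2

PROCESS_EVERY = 5                   # run the model on every 5th frame
frames = queue.Queue(maxsize=1)     # hold only the freshest frame

def run_detection(frame):
    pass                            # placeholder for the model call

def capture_loop():
    cap = cv2.VideoCapture(0)
    count = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        count += 1
        if count % PROCESS_EVERY:
            continue                # skip this frame to stay real-time
        if frames.full():
            try:
                frames.get_nowait() # drop the stale frame
            except queue.Empty:
                pass
        frames.put(frame)

def detection_loop():
    while True:
        run_detection(frames.get())

threading.Thread(target=capture_loop, daemon=True).start()
detection_loop()
```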

Final Thoughts

Looking back, VisionNotes has been an incredible journey of learning, problem-solving, and growth. We are thrilled with what we have created, and we hope it makes a real difference in the lives of those who use it. This project reminded us that determination and a willingness to learn can go a long way—and we’re excited to see where this experience takes us next!
