About VisionNotes
Inspiration
The inspiration for VisionNotes came from a pressing need to support blind and visually impaired individuals in a world that’s becoming more fast-paced and urbanized by the day. With cities growing more complex and daily life increasingly driven by visual cues, we wanted to create a tool that would empower people to better engage with their surroundings and feel more connected to the world around them.
What We Learned
This project was a major learning journey for us. When we started, we had no experience with Computer Vision, Artificial Intelligence, Large Language Models (LLMs), Machine Learning, or even full-stack development. Building VisionNotes forced us to learn these technologies from scratch, one step at a time. We tackled everything from setting up object detection models to building the backend server and crafting an intuitive front-end interface. The result? A complete system we’re incredibly proud of, built entirely through dedication and self-learning.
How We Built VisionNotes
- Research and Learning: We began by immersing ourselves in tutorials, online courses, and open-source projects. This helped us understand the basics of computer vision, AI, and web development.
- Full Stack Development: We built a Flask backend and paired it with a React frontend, focusing on simplicity and accessibility so that VisionNotes would be easy for anyone to use (a minimal backend sketch follows this list).
- Computer Vision and Object Detection: We integrated object detection using pre-trained models, then fine-tuned and optimized them for real-time processing. Multithreading and other performance work gave us the speed we needed to keep VisionNotes responsive (see the detection sketch after this list).
- Audio Descriptions: A critical feature was delivering audio descriptions of the recognized objects. We implemented this using natural language processing and text-to-speech, giving users a conversational and descriptive experience (see the audio sketch after this list).
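To make the integration concrete, here is a minimal sketch of the kind of Flask endpoint that sits between the React camera feed and the detection code. The route name `/detect` and the `run_detection` helper are illustrative placeholders, not our exact code.

```python
# Minimal sketch of a detection endpoint. The /detect route and
# run_detection helper are illustrative names, not our exact code.
from flask import Flask, request, jsonify
import cv2
import numpy as np

app = Flask(__name__)

def run_detection(frame):
    """Placeholder for the object-detection step (see the next sketch)."""
    return [{"label": "person", "confidence": 0.92}]

@app.route("/detect", methods=["POST"])
def detect():
    # The React frontend POSTs a JPEG frame from the camera feed.
    data = np.frombuffer(request.data, dtype=np.uint8)
    frame = cv2.imdecode(data, cv2.IMREAD_COLOR)
    if frame is None:
        return jsonify({"error": "could not decode image"}), 400
    return jsonify({"detections": run_detection(frame)})

if __name__ == "__main__":
    # threaded=True lets the dev server handle overlapping requests.
    app.run(host="0.0.0.0", port=5000, threaded=True)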
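For the detection step itself, the pre-trained-model pattern looks roughly like the sketch below. We use OpenCV’s DNN module with a MobileNet-SSD model purely as an example; the specific model files, class list, and threshold are assumptions, not necessarily what ships in VisionNotes.

```python
# Sketch of detection with a pre-trained MobileNet-SSD via OpenCV's
# DNN module. The model files and threshold here are assumptions.
import cv2
import numpy as np

CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
           "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
           "dog", "horse", "motorbike", "person", "pottedplant",
           "sheep", "sofa", "train", "tvmonitor"]

net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

def run_detection(frame, conf_threshold=0.5):
    h, w = frame.shape[:2]
    # MobileNet-SSD expects a 300x300, scaled, mean-subtracted input.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()  # shape: (1, 1, N, 7)
    results = []
    for i in range(detections.shape[2]):
        confidence = float(detections[0, 0, i, 2])
        if confidence < conf_threshold:
            continue
        label = CLASSES[int(detections[0, 0, i, 1])]
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        results.append({"label": label, "confidence": confidence,
                        "box": box.astype(int).tolist()})
    return results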
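On the audio side, a simple template over the detection labels plus a text-to-speech engine already gives a conversational result. This sketch uses pyttsx3, a common offline TTS library; both the phrasing logic and the library choice are illustrative assumptions.

```python
# Sketch of turning detections into a spoken sentence. pyttsx3 is one
# common offline TTS library; our exact phrasing logic and speech
# engine are simplified here.
import pyttsx3

def describe(detections):
    """Build a conversational sentence from detection labels."""
    if not detections:
        return "I don't see anything notable right now."
    labels = [d["label"] for d in detections]
    if len(labels) == 1:
        return f"I can see a {labels[0]} ahead of you."
    return ("I can see " + ", ".join(f"a {l}" for l in labels[:-1])
            + f", and a {labels[-1]}.")

def speak(text):
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

speak(describe([{"label": "person"}, {"label": "chair"}]))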
Challenges We Faced
Building VisionNotes was anything but smooth sailing, and we hit plenty of roadblocks along the way:
- Integrating Components: Connecting all the moving parts (Flask backend, React frontend, Computer Vision models, and audio description) was challenging. Each part needed to communicate seamlessly, which required extensive debugging and testing.
- Server Responsiveness: Ensuring the Flask server could handle real-time object detection was difficult, especially when processing images from the camera feed. To keep everything responsive, we experimented with multithreading and optimized our server’s request handling (see the worker-thread sketch after this list).
- Speed and Efficiency in Object Detection: Processing video feeds quickly enough to provide real-time feedback required us to optimize every step in our pipeline. We used multithreading, efficient algorithms, and frame-skipping techniques to maintain speed without compromising accuracy (a frame-skipping sketch also follows this list).
- User Accessibility: Since VisionNotes is designed to assist visually impaired users, we had to rethink typical UI/UX decisions. It was a constant balancing act between functionality and simplicity.
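One pattern that helped with responsiveness was decoupling detection from request handling: a daemon thread keeps running detection on the newest frame, while HTTP requests simply return the most recent cached result. A hedged sketch, with `latest_frame`, `latest_result`, and `run_detection` as hypothetical names carried over from the sketches above:

```python
# Sketch: run detection in a background thread so HTTP requests never
# wait on the model. All names here are illustrative.
import threading
import time

def run_detection(frame):
    """Stand-in for the model call from the detection sketch above."""
    return []

latest_frame = None   # written by the camera/upload path
latest_result = []    # read by the Flask route
lock = threading.Lock()

def detection_worker():
    global latest_result
    last_seen = None
    while True:
        with lock:
            frame = latest_frame
        if frame is None or frame is last_seen:
            time.sleep(0.01)  # nothing new yet; avoid busy-waiting
            continue
        last_seen = frame
        result = run_detection(frame)  # the slow step, off the request path
        with lock:
            latest_result = result

threading.Thread(target=detection_worker, daemon=True).start()

# The Flask route can then respond instantly with the cached result:
# with lock:
#     return jsonify({"detections": latest_result})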
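The frame-skipping idea comes down to never queueing frames: the capture thread overwrites a single slot with the newest camera frame, so the detector always works on current data instead of a backlog. Again a sketch under the same illustrative assumptions:

```python
# Sketch of frame-skipping capture: the reader thread overwrites one
# slot with the newest frame, so stale frames are dropped, not queued.
import threading
import cv2

class LatestFrameReader:
    def __init__(self, source=0):
        self.cap = cv2.VideoCapture(source)
        self.frame = None
        self.lock = threading.Lock()
        self.running = True
        threading.Thread(target=self._reader, daemon=True).start()

    def _reader(self):
        while self.running:
            ok, frame = self.cap.read()
            if not ok:
                continue
            with self.lock:
                self.frame = frame  # overwrite: older frames are skipped

    def read(self):
        with self.lock:
            return self.frame

    def stop(self):
        self.running = False
        self.cap.release()

# Usage: reader = LatestFrameReader(0); frame = reader.read()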
Final Thoughts
Looking back, VisionNotes has been an incredible journey of learning, problem-solving, and growth. We are thrilled with what we have created, and we hope it makes a real difference in the lives of those who use it. This project reminded us that determination and a willingness to learn can go a long way—and we’re excited to see where this experience takes us next!