Inspiration
Many students struggle with homework not because they lack effort, but because they lack access to clear explanations. When students search online, they often find generic answers that do not match the exact problem they are facing.
VisionMentor was inspired by the idea of building an AI tutor that behaves more like a real teacher—one that can understand what the student is looking at, listen to their question, and explain concepts clearly.
My goal is to make high-quality tutoring more accessible to students anywhere in the world.
What it does
VisionMentor is a multimodal AI tutor that helps students understand homework through images, voice, and documents. Students can:
- Ask questions using voice or text
- Upload photos of homework
- Upload PDF assignments or notes

VisionMentor prioritizes the student’s question, analyzes the visual or document context, and provides step-by-step explanations. To make concepts easier to understand, the system can also automatically generate visual diagrams that illustrate the explanation.
The tutor supports multiple languages, making it easier for students from different backgrounds to learn.
How I built it
The project combines several technologies to create a seamless tutoring experience.
The frontend was built with Next.js and TailwindCSS, allowing students to interact with the system through voice, image uploads, or PDFs.
The backend was developed using FastAPI, which manages API routes for homework analysis, PDF processing, and visual generation.
VisionMentor uses Google Gemini multimodal models to analyze questions alongside images or document content.
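One way to keep the model focused on the student's question is to place it first in the prompt and frame any attachment as supporting context. The sketch below shows this question-first ordering; the model name, prompt wording, and helper names are assumptions for illustration, and the API call requires `google-generativeai` plus an API key.

```python
from typing import List, Optional


def build_prompt(question: str, has_image: bool) -> str:
    """Put the student's question first so the model answers it directly,
    treating any attached image or document only as supporting context."""
    parts: List[str] = [
        f"Student question: {question.strip()}",
        "Answer the question above step by step.",
    ]
    if has_image:
        parts.append("Use the attached image only as context for the question.")
    return "\n".join(parts)


def ask_gemini(question: str, image_bytes: Optional[bytes] = None) -> str:
    """Send the prompt (and optional image) to Gemini and return its text reply."""
    # Imported lazily so the pure prompt logic above stays testable offline.
    import google.generativeai as genai

    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    content: list = [build_prompt(question, image_bytes is not None)]
    if image_bytes:
        content.append({"mime_type": "image/png", "data": image_bytes})
    return model.generate_content(content).text
```

Keeping `build_prompt` pure makes it easy to iterate on prompt wording without burning API quota.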
To support document understanding, the system extracts text and preview images from PDFs using PyMuPDF.
The platform can also generate SVG visual diagrams to reinforce explanations and make learning more intuitive.
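Because SVG is just text, diagrams can be assembled programmatically as strings. The toy helper below builds a labeled bar chart this way; it is a sketch of the general technique, not the diagrams VisionMentor actually emits (those come from the model).

```python
from typing import Dict


def bar_chart_svg(values: Dict[str, int], bar_width: int = 40,
                  gap: int = 20, height: int = 120) -> str:
    """Render name -> value pairs as a simple SVG bar chart string."""
    peak = max(values.values()) or 1  # avoid dividing by zero when all values are 0
    width = len(values) * (bar_width + gap) + gap
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for i, (label, value) in enumerate(values.items()):
        bar_h = int(value / peak * (height - 30))
        x = gap + i * (bar_width + gap)
        y = height - 20 - bar_h
        parts.append(f'<rect x="{x}" y="{y}" width="{bar_width}" height="{bar_h}" fill="steelblue"/>')
        parts.append(f'<text x="{x}" y="{height - 5}" font-size="10">{label}</text>')
    parts.append("</svg>")
    return "".join(parts)
```

The resulting string can be returned directly to the frontend and rendered inline by the browser.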
Challenges I ran into
One challenge was ensuring the AI focused on the student’s question first, rather than over-interpreting the uploaded image or document.
Another challenge was handling API rate limits and response latency when generating both explanations and diagrams.
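A common way to absorb rate limits is retrying with exponential backoff. The sketch below shows the pattern under the assumption that any raised exception signals a retryable failure; a real implementation would catch only the API's rate-limit error, and the delays are tunable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], retries: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on failure, doubling the wait before each new attempt."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```

Injecting `sleep` as a parameter keeps the helper testable without real waiting.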
I also had to design the interface carefully so it could support voice input, camera capture, image uploads, and PDF documents without becoming confusing for users.
Accomplishments that I'm proud of
One accomplishment I’m proud of is building a multimodal tutoring system that integrates voice, images, and documents into one learning experience.
Another highlight is the automatic visual explanation feature, which converts AI explanations into diagrams that help students understand concepts faster.
Most importantly, the project demonstrates how AI can be used to create a more personalized and accessible learning experience.
What I learned
This project taught me a lot about designing multimodal AI applications, especially how to combine text, images, and documents into a single workflow.
I also learned how important prompt design and response structure are when building AI systems for education.
Beyond the technical aspects, the project reinforced the importance of designing AI tools that are clear, accessible, and focused on helping people learn.
What's next for VisionMentor
In the future, VisionMentor could evolve into a more complete AI learning platform.
Potential next steps include:
- More advanced visual explanations and interactive diagrams
- Learning history and progress tracking
- AI tutoring that adapts to grade level or subject
- Expanded language support and accessibility features
My long-term vision is to create an AI tutor that can help any student understand difficult concepts more clearly, anywhere in the world.