Inspiration
The inspiration for this project came from the sheer volume of "lost" information trapped in physical notebooks and whiteboard sessions. Despite living in a digital age, we still brainstorm by hand because it’s faster and more intuitive. However, those notes often end up as dead files in a camera roll. I wanted to build a tool that doesn't just "see" handwriting but understands it, turning static images and even video walkthroughs of handwritten diagrams into searchable, digital knowledge.
What it does
Gemini-lens is a multimodal intelligence tool that bridges the gap between physical handwriting and digital productivity. By leveraging the Gemini API, the app lets users upload images or videos of handwritten notes, whether on paper or whiteboards, and instantly extracts the text. Unlike standard OCR, it understands context, enabling it to convert messy scribbles into structured, searchable digital formats.
How we built it
The project is built on a modern full-stack architecture designed for high performance:
- Frontend: Developed with Next.js, providing a fast, responsive user interface for media uploads and real-time text rendering.
- Backend: A Node.js and Express server handles the API routing, file processing, and secure communication with Google’s AI services.
- AI Integration: We utilized the Gemini 1.5 Flash model for its speed and multimodal capabilities, allowing the application to process visual data from both static images and dynamic video frames.
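To make the AI integration concrete, here is a minimal sketch of how a Node.js backend can package an uploaded file for Gemini. The helper name `toGeminiPart` and the commented wiring are our own illustration; only the `{ inlineData: { data, mimeType } }` part shape follows the `@google/generative-ai` SDK.

```javascript
// Sketch: convert an uploaded file buffer into the inline-data part
// shape the Gemini SDK accepts for images and video frames.
function toGeminiPart(buffer, mimeType) {
  return {
    inlineData: {
      data: buffer.toString("base64"), // Gemini takes base64-encoded bytes
      mimeType,                        // e.g. "image/jpeg" or "video/mp4"
    },
  };
}

// Example wiring (requires an API key; shown for shape only):
// const { GoogleGenerativeAI } = require("@google/generative-ai");
// const model = new GoogleGenerativeAI(process.env.GEMINI_API_KEY)
//   .getGenerativeModel({ model: "gemini-1.5-flash" });
// const result = await model.generateContent([
//   "Transcribe the handwriting in this image.",
//   toGeminiPart(fs.readFileSync("notes.jpg"), "image/jpeg"),
// ]);
```

Keeping the conversion in a small pure helper like this makes the Express route thin: it only has to receive the upload, build the part, and forward it.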
Challenges we ran into
The biggest hurdle was "spatial consistency" in video. When a user pans a camera over a long handwritten document, the AI can see the same word multiple times. I had to refine the backend logic to ensure the model synthesized the information rather than duplicating it. Additionally, handling low-light images and varying handwriting styles required extensive prompt engineering to maintain a high accuracy rate.
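The de-duplication idea can be sketched as follows. This is a simplified stand-in for the backend logic, not the project's actual code: when consecutive video frames yield overlapping transcript lines, it appends only the lines the running document has not already seen.

```javascript
// Hedged sketch: merge a new frame's transcript lines into the running
// document by finding the longest suffix of the document that matches
// a prefix of the new lines, then appending only the unseen remainder.
function mergeFrameLines(document, frameLines) {
  const max = Math.min(document.length, frameLines.length);
  for (let overlap = max; overlap > 0; overlap--) {
    const tail = document.slice(document.length - overlap);
    const head = frameLines.slice(0, overlap);
    if (tail.every((line, i) => line === head[i])) {
      return document.concat(frameLines.slice(overlap));
    }
  }
  // No overlap found: the camera moved to an entirely new region.
  return document.concat(frameLines);
}
```

In practice the model's output varies slightly between frames, so a real implementation would match lines fuzzily (or let Gemini itself synthesize the frames), but the exact-match version shows the shape of the problem.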
Accomplishments that we're proud of
I am particularly proud of the video processing feature. Successfully extracting clean, formatted text from a moving video file is a significant step up from traditional image-based scanning. We also achieved a seamless integration between the Express backend and Next.js frontend, resulting in a latency-optimized experience that feels near-instant for the user.
What we learned
This project taught me the power of multimodal prompting. I discovered that providing Gemini with specific instructions on how to handle "noisy" visual data drastically improved the output. On the technical side, I deepened my knowledge of asynchronous stream handling in Node.js and learned how to manage large file buffers efficiently before sending them to the Gemini API.
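The buffering step above follows a standard Node.js pattern: collect an upload stream into a single Buffer before base64-encoding it for the API. This is a generic sketch of that pattern, not the project's exact code.

```javascript
// Collect a readable stream (e.g. an incoming file upload) into one
// Buffer so it can be base64-encoded for the Gemini API.
function streamToBuffer(stream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    stream.on("data", (chunk) => chunks.push(chunk));
    stream.on("end", () => resolve(Buffer.concat(chunks)));
    stream.on("error", reject);
  });
}
```

For very large videos, holding the whole file in memory is the main cost of this approach, which is why managing buffer sizes carefully mattered here.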
What's next for Gemini-lens
The next phase for Gemini-lens involves moving beyond just extraction. I plan to implement "Contextual Actions," such as automatically adding extracted "To-Do" items to a calendar or summarizing handwritten lecture notes into study guides.
Built With
- cloudinary
- javascript
- nextjs