Inspiration

Have you ever stared at a 3-hour-long, untimestamped recorded lecture, knowing the exact formula or concept you need is hidden somewhere inside it? Students and professionals waste countless hours scrubbing through educational videos. Existing AI tools offer generic, abstract summaries and almost always rely on expensive, cloud-based API subscriptions. Worse, they only process audio, completely ignoring the equations, diagrams, and bullet points actually written on the screen.

I was inspired to build LectureLens: a fully local, multimodal search engine that pioneers the simultaneous use of Automatic Speech Recognition (ASR) alongside Optical Character Recognition (OCR). It doesn't just listen to the presenter; it actively reads the screen, instantly turning raw multi-platform video into formatted, offline study materials.

What it does

LectureLens is an end-to-end multimodal pipeline that completely eliminates the need to manually scrub through long-form video content.

Omni-Platform Ingestion: It isn't just limited to YouTube. The custom pipeline seamlessly ingests media from Dailymotion, Twitch VODs, Instagram Reels, TikToks, and even live stream recordings.

Instant Semantic Search: Users can search for a specific concept (e.g., "Quantum Mechanics") and the app instantly returns the exact timestamps where the word was either spoken out loud or shown visually on a slide.
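
Under the hood, each search is a single vector query whose result metadata maps straight back to timestamps. A minimal sketch of that mapping, assuming each indexed chunk carries "start" and "modality" metadata (the field names are illustrative, not the project's actual schema):

```python
def to_hits(res):
    # Flatten a ChromaDB query result into (start_seconds, modality, text)
    # tuples, sorted by timestamp; assumes "start"/"modality" metadata keys
    hits = []
    for doc, meta in zip(res["documents"][0], res["metadatas"][0]):
        hits.append((meta["start"], meta["modality"], doc))
    return sorted(hits)

def search(collection, concept, k=5):
    # collection is a chromadb collection created with an embedding function,
    # so query_texts is embedded automatically before the nearest-neighbor scan
    res = collection.query(query_texts=[concept], n_results=k)
    return to_hits(res)
```

Because both ASR and OCR chunks live in the same collection, one query surfaces moments where the concept was spoken and moments where it appeared on a slide.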

Automated Study Guides: With a single click, LectureLens fuses the spoken context and the visual slide text into a fully formatted PowerPoint (.pptx) or a Notion-ready Markdown study guide, ready for local download.

How I built it

To align with DSOC's focus on practical software engineering and AI implementation, I architected a highly optimized, completely offline pipeline designed to extract maximum performance from consumer hardware:

Algorithmic Frame Extraction: Instead of naive, resource-heavy frame-by-frame sampling, I engineered an ultra-fast parallel FFmpeg pipeline focused on I-frame (intra-frame) extraction. By targeting only the video's keyframes and splitting the media across 16 parallel workers, I bypassed traditional decoding bottlenecks and cut image extraction time dramatically.
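
The keyframe-only extraction can be sketched as follows. The `-skip_frame nokey` flag is FFmpeg's documented way to decode only keyframes; the exact flags, output pattern, and worker count are simplified assumptions, not the project's verbatim pipeline:

```python
import math
import subprocess

def iframe_cmd(src, out_pattern, start, duration):
    # -skip_frame nokey (an input option, so it precedes -i) tells the
    # decoder to skip everything except keyframes, so only I-frames are
    # ever decoded -- far cheaper than filtering after a full decode
    return [
        "ffmpeg", "-hide_banner", "-ss", str(start), "-t", str(duration),
        "-skip_frame", "nokey", "-i", src,
        "-vsync", "vfr", "-q:v", "2", out_pattern,
    ]

def chunk_ranges(total_seconds, workers=16):
    # Split the video into equal time slices, one per parallel worker
    step = math.ceil(total_seconds / workers)
    return [(i * step, min(step, total_seconds - i * step))
            for i in range(workers) if i * step < total_seconds]

def extract_chunk(src, out_dir, start, duration):
    # Each worker extracts its own slice; out_dir must already exist
    pattern = f"{out_dir}/frame_{start}_%04d.jpg"
    subprocess.run(iframe_cmd(src, pattern, start, duration), check=True)
```

Running `extract_chunk` for each range from `chunk_ranges` in a process pool gives the parallel, chunked behavior described above.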

The AI Vision Engine: I implemented PaddleOCR for high-accuracy optical character recognition, specifically tuned to handle challenging visual data: complex mathematical formulas, skewed presentation slides, and low-resolution whiteboard text, all without requiring a large GPU. It runs alongside YOLOv8 for rapid object and context detection.
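
A sketch of the OCR stage. PaddleOCR is handed file paths rather than decoded arrays; the result parsing assumes PaddleOCR's nested `[box, (text, score)]` output shape, and the helper names are my own:

```python
from pathlib import Path

def ocr_frame_paths(frame_dir):
    # Collect the extracted keyframes; passing *paths* into PaddleOCR
    # (rather than decoded numpy arrays) avoids the C++ deadlock noted
    # in the Challenges section
    return sorted(str(p) for p in Path(frame_dir).glob("*.jpg"))

def flatten_ocr(result):
    # PaddleOCR returns [[ [box, (text, score)], ... ]] per image;
    # keep only the recognized text strings
    return [seg[1][0] for page in (result or []) for seg in (page or [])]

def run_ocr(frame_dir, lang="en"):
    # Lazy import so the helpers above work even without paddleocr installed
    from paddleocr import PaddleOCR
    ocr = PaddleOCR(use_angle_cls=True, lang=lang)
    return {p: flatten_ocr(ocr.ocr(p)) for p in ocr_frame_paths(frame_dir)}
```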

Vector Database (RAG): Extracted ASR transcripts and OCR text are vectorized using local SentenceTransformers (all-MiniLM-L6-v2) and indexed into ChromaDB for lightning-fast semantic querying.
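
The indexing step looks roughly like this. The chromadb and sentence-transformers calls follow those libraries' public APIs; the record schema (IDs, `start`/`modality` metadata) is an illustrative assumption:

```python
def make_records(segments, source):
    # segments: list of (start_seconds, text, modality) tuples from ASR/OCR
    ids, docs, metas = [], [], []
    for i, (start, text, modality) in enumerate(segments):
        ids.append(f"{source}-{modality}-{i}")
        docs.append(text)
        metas.append({"start": start, "modality": modality, "source": source})
    return ids, docs, metas

def index_segments(segments, source, collection_name="lectures"):
    # Lazy imports keep make_records testable without these packages
    import chromadb
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    client = chromadb.PersistentClient(path="./chroma")
    col = client.get_or_create_collection(collection_name)
    ids, docs, metas = make_records(segments, source)
    # Embed locally and store text + timestamp metadata in one collection,
    # so ASR and OCR chunks share a single searchable vector space
    col.add(ids=ids, documents=docs,
            embeddings=model.encode(docs).tolist(), metadatas=metas)
```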

Export Engine: I programmed a custom generator using python-pptx to dynamically build formatted presentation slides directly from the vector database results.
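
A minimal sketch of the export step. The python-pptx names are the library's real API; the hit format and slide layout choice are assumptions for illustration:

```python
def fmt_ts(seconds):
    # Render a chunk's start time as MM:SS for the slide bullets
    m, s = divmod(int(seconds), 60)
    return f"{m:02d}:{s:02d}"

def build_deck(title, hits, out_path):
    # hits: list of (start_seconds, text) tuples retrieved from the vector DB
    from pptx import Presentation  # lazy import keeps fmt_ts usable standalone
    prs = Presentation()
    slide = prs.slides.add_slide(prs.slide_layouts[1])  # title + content layout
    slide.shapes.title.text = title
    body = slide.placeholders[1].text_frame
    for start, text in hits:
        # One timestamped bullet per retrieved chunk
        body.add_paragraph().text = f"[{fmt_ts(start)}] {text}"
    prs.save(out_path)
```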

Challenges I ran into

Building a heavy, multimodal AI pipeline locally as a solo developer over a single weekend brought me face-to-face with brutal hardware constraints and system-level bugs:

C++ Deadlocks & Memory Leaks: Feeding raw image arrays (numpy/cv2) into PaddleOCR in a tight loop caused catastrophic C++ deadlocks that froze my entire local server. I had to completely refactor the vision pipeline to pass raw file paths instead, sidestepping unsafe interactions between PaddleOCR's native threads and the Python Global Interpreter Lock (GIL).

OOM (Out of Memory) Errors: Processing hours of video locally initially crashed my system. I engineered a chunked, parallel-processing architecture that safely batched data into ChromaDB without overwhelming system RAM.
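
The batching fix boils down to bounding how much data is in flight at once; a simplified sketch (the batch size is illustrative):

```python
def batched(items, size=256):
    # Yield fixed-size slices so embedding and DB writes never hold
    # more than one batch of chunks in RAM at a time; e.g. the ingest
    # loop becomes: for batch in batched(chunks): collection.add(...)
    for i in range(0, len(items), size):
        yield items[i:i + size]
```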

Environment Locking: Navigating complex environment conflicts and background ghost processes holding file locks (.venv permission denials) required writing custom OS-level cleanup scripts to safely kill zombie processes between runs.
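
One of those cleanup scripts can be sketched like this, assuming a POSIX system with `pgrep` available (the pattern-matching approach is a simplification of the actual scripts):

```python
import os
import signal
import subprocess

def parse_pgrep(stdout):
    # pgrep -f prints one PID per line; keep only numeric tokens
    return [int(tok) for tok in stdout.split() if tok.isdigit()]

def kill_zombies(pattern, sig=signal.SIGTERM):
    # Find stale processes whose full command line matches the pattern
    # (e.g. a stuck OCR worker holding a .venv file lock) and signal them
    out = subprocess.run(["pgrep", "-f", pattern],
                         capture_output=True, text=True)
    for pid in parse_pgrep(out.stdout):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process already exited between pgrep and kill
```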

Accomplishments that I'm proud of

Pioneering OCR + ASR Fusion: LectureLens is the first local application I know of to fuse semantic vector search across both audio and visual domains simultaneously. Mapping disjointed, multimodal data into a single, cohesive ChromaDB vector space to generate a unified PowerPoint presentation and export it straight to a Notion database is my proudest technical achievement.

Zero Paid APIs: I built a powerful AI tool that is 100% free and runs entirely offline. I proved you don't need expensive enterprise SaaS subscriptions to achieve production-grade video intelligence.

What I learned

I learned the harsh, practical realities of local machine learning deployment. Managing memory states, utilizing I-frames to optimize compute load, handling silent failures in batch processing, and understanding the strict interface requirements of local vector databases taught me more about system architecture than any tutorial could. I also learned that a great user experience requires hiding massive backend complexity behind a simple "Search" and "Generate" button.

What's next for LectureLens

GPU Acceleration: Compiling PaddleOCR with full CUDA support to push processing speed from roughly 10x real-time (1:10) to 50x real-time.

Local File Support: Allowing users to drag and drop local .mp4 files directly into the browser.

Local LLM Synthesis: Integrating a lightweight local LLM (like Llama 3 8B) to abstractively summarize the exact database chunks retrieved during a search, providing an intelligent "TL;DR" right inside the generated PowerPoint.

Built With

python · ffmpeg · paddleocr · yolov8 · sentence-transformers · chromadb · python-pptx
