Inspiration

258 million people worldwide are visually impaired. Braille is their primary written communication system but most caregivers, teachers, and accessibility workers cannot read it. We wanted to build something that works on a real phone camera, not just a research demo.

What it does

BrailleVision takes a photo of physical embossed Braille and converts it to English text and speech. Three independent pipelines run in parallel on every image: Pipeline A uses Classical CV with DBSCAN clustering and a Grade 1/2 Braille lookup table. Pipeline B uses a pretrained YOLOv8 model trained on 1,324 real Braille images. Pipeline C uses GPT-4o Vision API. The ensemble layer combines all three outputs using weighted confidence voting with agreement bonuses. Final text is spoken aloud via TTS. It runs live at huggingface.co/spaces/sriksven/braillevision and also has an Android app in the GitHub repo.

How we built it

Python, OpenCV, Flask, YOLOv8, OpenAI GPT-4o Vision API, DBSCAN from scikit-learn, Docker, GitHub Actions CI, Hugging Face Spaces. The Flask backend streams NDJSON so the UI updates progressively. Pipeline A and B return instantly while C arrives 2 to 4 seconds later. The ensemble uses Levenshtein similarity to detect agreement across pipelines and boost confidence when they agree.

Challenges we ran into

Real world Braille images are much harder than synthetic ones. Dark surfaces, angled cameras, glare, and dense multi-row pages all reduce accuracy. Getting GPT-4o to read small embossed dots required forcing high resolution mode and writing a strict deterministic prompt. The Roboflow model struggles on surfaces it was not trained on. Getting all three pipelines to stream results progressively to the UI without blocking was a threading challenge.

Accomplishments that we're proud of

Three fully independent pipelines running in parallel with a live progressive UI. 29 tests passing with GitHub Actions CI green. Deployed on Hugging Face Spaces via Docker. The ensemble correctly overrides wrong answers from individual pipelines using weighted voting.

What we learned

No single approach wins on every image. Classical CV is fast and explainable but fragile on real world variation. YOLOv8 handles varied surfaces better but needs good training data. GPT-4o is the most accurate but conservative on difficult images. The ensemble is more reliable than any individual pipeline.

What's next for BrailleVision

Grade 3 Braille support, improved dot detection on dark surfaces, a finetuned YOLOv8 model trained on the combined Roboflow and Angelina ICCV 2021 dataset, mobile first UI, and multi language translation of the decoded text.

Built With

Share this project:

Updates