Inspiration

We wanted to build something that combines machine learning and classical algorithms in one real pipeline. This challenge was interesting because it required both OCR with neural networks and text compression with a custom algorithm.
What it does

Our project takes a noisy scanned document image, extracts the text using a CNN-based OCR microservice, and then compresses that text using a custom adaptive Huffman compression microservice. It also decompresses the text to prove the process is lossless.
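For a rough picture of how a client might drive the two services end to end, here is a minimal sketch. The endpoint paths, ports, and JSON fields are assumptions for illustration only; the project's actual routes may differ.

```python
import requests

# Hypothetical service URLs; the real routes and ports may differ.
OCR_URL = "http://localhost:8001/ocr"
COMPRESS_URL = "http://localhost:8002/compress"
DECOMPRESS_URL = "http://localhost:8002/decompress"

# 1. OCR: send the noisy scanned image, get the extracted text back.
with open("scan.png", "rb") as f:
    text = requests.post(OCR_URL, files={"image": f}).json()["text"]

# 2. Compress the text with the adaptive Huffman service.
compressed = requests.post(COMPRESS_URL, json={"text": text}).json()

# 3. Decompress and check that the round trip is lossless.
restored = requests.post(DECOMPRESS_URL, json=compressed).json()["text"]
assert restored == text
```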
How we built it

We built two FastAPI microservices. For OCR, we used PyTorch to train a denoising CNN and a CRNN-based text recognizer. For compression, we implemented adaptive Huffman coding from scratch. We also added a web UI, Docker support, and evaluation scripts.
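To give a sense of the recognizer's shape, here is a minimal generic CRNN in PyTorch: convolutional features sliced into a column-wise sequence, a bidirectional LSTM, and a per-timestep classifier suitable for CTC training. This is an illustrative sketch, not the project's trained architecture.

```python
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Minimal CRNN sketch: CNN features -> BiLSTM -> per-timestep logits."""
    def __init__(self, num_classes: int, img_h: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # H/2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # H/4
        )
        feat_h = img_h // 4
        self.rnn = nn.LSTM(64 * feat_h, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes)  # num_classes includes the CTC blank

    def forward(self, x):                # x: (B, 1, H, W) grayscale line image
        f = self.cnn(x)                  # (B, 64, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one timestep per column
        out, _ = self.rnn(f)
        return self.fc(out)              # (B, W/4, num_classes), feed to CTC loss
```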
Challenges we ran into

The biggest challenge was getting CNN-only OCR to work well on noisy scanned pages. We also had to evaluate OCR accuracy without verified ground-truth transcripts and make sure the compression stage stayed fully lossless.
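To show why the compression round trip can be verified as lossless, here is one simple adaptive Huffman variant: both sides start from uniform byte counts, rebuild the code before every symbol, and apply identical count updates, so the decoder stays in sync with the encoder by construction. Rebuilding the tree per symbol is slow and purely illustrative; a from-scratch implementation like the project's would typically use incremental tree updates (FGK/Vitter-style) instead.

```python
import heapq

def build_code(freq):
    """Build a Huffman code (symbol -> bit string) from frequency counts."""
    heap = [(w, s, s) for s, w in freq.items()]  # (weight, tiebreak, tree)
    heapq.heapify(heap)
    next_id = 256                                # unique tiebreaks for merged nodes
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix or "0"
    walk(heap[0][2], "")
    return code

def encode(data: bytes) -> str:
    freq = {b: 1 for b in range(256)}  # uniform start: every byte always has a code
    bits = []
    for b in data:
        bits.append(build_code(freq)[b])
        freq[b] += 1                   # adapt; the decoder mirrors this update
    return "".join(bits)

def decode(bits: str, n_symbols: int) -> bytes:
    freq = {b: 1 for b in range(256)}
    out, i = bytearray(), 0
    for _ in range(n_symbols):
        inv = {v: k for k, v in build_code(freq).items()}
        j = i + 1
        while bits[i:j] not in inv:    # Huffman codes are prefix-free, so the
            j += 1                     # first complete match is the codeword
        b = inv[bits[i:j]]
        out.append(b)
        freq[b] += 1
        i = j
    return bytes(out)

data = "scanned text".encode()
assert decode(encode(data), len(data)) == data  # lossless round trip
```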
Accomplishments that we're proud of

We built a complete end-to-end system with trained models, custom compression, microservices, evaluation tools, and a browser demo. We are especially proud that the compression stage is fully custom and lossless.
What we learned

We learned that OCR depends a lot on preprocessing, line detection, and good training data. We also learned how to connect machine learning and algorithms into one working system.
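As an illustration of why preprocessing matters so much, a classical OpenCV cleanup pass on a noisy scan might look like the sketch below. The project replaces this kind of step with a trained denoising CNN; this is only a hedged example of the lesson, not the project's pipeline.

```python
import cv2

# Classical cleanup for a noisy scan (illustrative; the project uses a CNN denoiser).
img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
img = cv2.medianBlur(img, 3)  # suppress salt-and-pepper scanner noise
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarize
cv2.imwrite("scan_clean.png", img)
```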
What's next for Stage Neural Compression Pipeline

We want to improve CNN-only OCR accuracy further, test on more real noisy documents, and make the system more polished for real-world document digitization use cases.