We learned that big companies charge large amounts of money (starting from $60,000) for simple Optical Character Recognition (OCR) programs, even though OCR software is available completely free and open source. With that in mind, we decided to create our own program that does the exact same thing the big companies' products do, and become part of the OCR market.
What it does
This simple program relies mostly on open source code for PDF mining. We take PDF files, run PDF mining on them, and convert them to .txt files. From there we use topic modeling and sorting methods to search the files for specified words or ideas.

Some files are unreadable to the PDF miner, for example when the file is really an image such as a .jpg. In that case we rely on Tesseract, Wand, and ImageMagick to scan the image, pull out the text, and put it into a file on which we can run our topic modeling and sorting to find the desired information. Concretely, we take a "PDF" file (in quotes because the user saved the document as a PDF, but in reality the file behaves like a .jpg) and pass it through the combination of Wand and ImageMagick to convert it to an image file (.jpg). We then use Tesseract to convert that image file into a .txt file on which we can run our mining and sorting tools.
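The steps above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's actual code: it assumes pdfminer.six as the PDF mining library, the Wand bindings for ImageMagick, and the `tesseract` command-line tool on the PATH. The function names, the `min_chars` threshold, and the `ocr_pages` folder are hypothetical, and the plain regex keyword search stands in for the topic modeling and sorting step, which is not shown.

```python
import re
import subprocess
from pathlib import Path


def extract_text_layer(pdf_path):
    """Return the embedded text of a PDF (assumes pdfminer.six is installed)."""
    from pdfminer.high_level import extract_text  # third-party import kept local
    return extract_text(str(pdf_path))


def needs_ocr(text, min_chars=20):
    """Hypothetical heuristic: almost no extracted text means a scanned 'PDF'."""
    return len(text.strip()) < min_chars


def pdf_to_jpgs(pdf_path, out_dir, dpi=300):
    """Render each PDF page to a .jpg with Wand, which drives ImageMagick."""
    from wand.image import Image  # third-party import kept local
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    jpgs = []
    with Image(filename=str(pdf_path), resolution=dpi) as pdf:
        for number, page in enumerate(pdf.sequence):
            with Image(image=page) as img:
                img.format = "jpeg"
                target = out_dir / f"page-{number:03d}.jpg"
                img.save(filename=str(target))
                jpgs.append(target)
    return jpgs


def ocr_image(jpg_path):
    """Run the tesseract CLI on one image and return the recognized text."""
    out_base = Path(jpg_path).with_suffix("")  # tesseract appends ".txt" itself
    subprocess.run(["tesseract", str(jpg_path), str(out_base)], check=True)
    return out_base.with_suffix(".txt").read_text(encoding="utf-8")


def pdf_to_text(pdf_path, work_dir="ocr_pages"):
    """Try PDF mining first; fall back to image conversion plus OCR."""
    text = extract_text_layer(pdf_path)
    if needs_ocr(text):
        pages = pdf_to_jpgs(pdf_path, work_dir)
        text = "\n".join(ocr_image(jpg) for jpg in pages)
    return text


def find_keywords(text, keywords):
    """Map each keyword to the lines containing it (whole word, any case)."""
    hits = {}
    for word in keywords:
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        hits[word] = [ln.strip() for ln in text.splitlines() if pattern.search(ln)]
    return hits
```

A call such as `find_keywords(pdf_to_text("lease.pdf"), ["contract", "tenant"])` (with a hypothetical file name) would then return the lines that mention each word, whether the text came from the PDF's own text layer or from OCR.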
How we built it
We used Python as our language of choice because of its simplicity. From there we looked for open source code for PDF mining, along with ImageMagick and Wand, to get our desired result.
Challenges we ran into
We encountered many problems throughout the building of this program. One of the main ones was that two of our three members had no experience in Python, which made it a complete uphill battle to get any code written down. Once the programming language became the last thing to worry about, running the actual code became the issue at hand: our program was unable to find the code it needed to run. It took the brain power of 4 individuals and 2 hours to track down the problem. We solved it by redirecting the directory of the files so that the code could find the files it needed to run.
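The exact change is not recorded here, but one common way to avoid this kind of "file not found" problem in Python is to build every path from a fixed base folder instead of relying on the current working directory. The sketch below shows that idea only; the helper name `resolve_from` and the `data/sample.pdf` layout are hypothetical.

```python
from pathlib import Path


def resolve_from(base_dir, *parts):
    """Build an absolute path under base_dir, independent of the working directory."""
    return Path(base_dir).resolve().joinpath(*parts)


# Hypothetical layout: a "data" folder that holds the PDFs to process.
sample_pdf = resolve_from(".", "data", "sample.pdf")
print(sample_pdf)
```

Inside a script, the base folder would usually be the script's own location, `Path(__file__).resolve().parent`, so the program finds its files no matter where it is launched from.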
Accomplishments that we're proud of
"I am proud that I am a first year and I am doing this hack" -Tony
"Traveling 7 hours to get here and being part of Hack Merced." -Isauro
"I am proud of the amount of learning that I have done today!" -Edgar
What we learned
We learned Python, regex, image processing, and PDF mining.
What's next for Hacking The Law for Profit
To hack even bigger companies and expose their corruption.