How did we get here?

I love to program; it's something I do every day and it brings me joy! However, finding a project to do can be very challenging, especially for a hackathon, where there's a deadline not just for coming up with a project idea but for delivering a finished product. Once I realized that, I began brainstorming bad idea after bad idea and eventually almost gave up. I should have been looking for a problem to solve, not trying to create a problem to solve. So I went looking for problems, and I realized that public companies put out annual reports on how well the company is doing. Then it hit me: there are thousands of annual reports online, with thousands of problems in each of them. This is where the idea of Unveil.ai came into existence.

Now what does Unveil.ai do?

Unveil.ai uses deep learning to extract and summarize risk factors from a company's annual reports. These documents are free to everyone with internet access. Simply upload the PDF to Unveil.ai and it will give you back a CSV containing the original risk factors it spotted as well as a summarized version of each one. Instead of scanning through hundreds of pages of annual reports, you can sit back and relax while Unveil.ai does the rest. The benefit for businesses is an easy way to see what risks other companies face; with that knowledge, you can do your best to help your own business avoid those risks.
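As a rough picture of what you get back, here is a tiny, hypothetical example of reading that CSV in Python; the file name and column names are illustrative, not the exact headers Unveil.ai writes.

    import pandas as pd

    # Hypothetical output file with one row per risk factor found in the report.
    results = pd.read_csv("unveil_output.csv")  # assumed columns: risk_factor, summary
    for _, row in results.iterrows():
        print(row["summary"])       # the short, summarized version
        print(row["risk_factor"])   # the original paragraph it was spotted in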

How was Unveil.ai built?

Unveil.ai was built with Flask using Python. Flask handles the UI/UX side of the project, taking inputs and returning outputs: the input is the annual report as a PDF, and the output is a CSV file with the risk-classified paragraphs and their respective summaries.

The project uses two deep-learning models. The first is BERT, which stands for Bidirectional Encoder Representations from Transformers. BERT was created at Google and excels at classification tasks, which proved very useful for this project. I needed to fine-tune BERT so it could classify the data I wanted it to. To do this, I downloaded roughly 30-50 annual reports and stored each paragraph in a MySQL database. If the page a paragraph was extracted from contained the term 'risk', I labeled that paragraph 1; otherwise, I labeled it 0. After creating and cleaning the data, I fine-tuned BERT to classify paragraphs related to risk.

The second deep-learning model I used is called PEGASUS. It was also created at Google and is used for abstractive summarization. Luckily, I did not need to fine-tune it. Bringing all these deep-learning shenanigans together: the fine-tuned BERT model classifies each paragraph as risky or non-risky, and the paragraphs classified as risky are then sent through the PEGASUS model to be summarized.
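To give a sense of what that labeling step might look like, here is a minimal sketch that turns a report PDF into (paragraph, label) rows using the 'risk on the page' heuristic described above. The library calls are real (the PyPDF package is published as pypdf these days), but the paragraph-splitting rule is an assumption for illustration, and in the actual project the rows went into a MySQL database rather than a Python list.

    from pypdf import PdfReader

    def label_paragraphs(pdf_path):
        """Yield (paragraph, label) rows: label 1 if the page mentions 'risk', else 0."""
        reader = PdfReader(pdf_path)
        for page in reader.pages:
            text = page.extract_text() or ""
            label = 1 if "risk" in text.lower() else 0
            # Naive split on blank lines; real report text needs far more cleaning (see Challenges below).
            for paragraph in text.split("\n\n"):
                paragraph = paragraph.strip()
                if paragraph:
                    yield paragraph, label

And here is a rough sketch of the classify-then-summarize pipeline itself, built on the Hugging Face transformers library. The checkpoint path for the fine-tuned BERT classifier, the PEGASUS variant, and the generation settings are assumptions for illustration, not the exact code behind Unveil.ai.

    import torch
    from transformers import (
        AutoTokenizer,
        AutoModelForSequenceClassification,
        PegasusTokenizer,
        PegasusForConditionalGeneration,
    )

    # Fine-tuned BERT classifier (hypothetical local path) and an off-the-shelf PEGASUS summarizer.
    clf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    clf_model = AutoModelForSequenceClassification.from_pretrained("./risk-bert-finetuned")
    sum_tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
    sum_model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

    def is_risky(paragraph: str) -> bool:
        """Return True if the fine-tuned BERT model assigns label 1 (risky) to the paragraph."""
        inputs = clf_tokenizer(paragraph, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = clf_model(**inputs).logits
        return logits.argmax(dim=-1).item() == 1

    def summarize(paragraph: str) -> str:
        """Abstractively summarize a risky paragraph with PEGASUS."""
        inputs = sum_tokenizer(paragraph, truncation=True, max_length=512, return_tensors="pt")
        ids = sum_model.generate(**inputs, num_beams=4, max_length=64)
        return sum_tokenizer.decode(ids[0], skip_special_tokens=True)

    def process(paragraphs):
        """Yield (original paragraph, summary) pairs for everything BERT flags as risky."""
        for p in paragraphs:
            if is_risky(p):
                yield p, summarize(p)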

Challenges I ran into

The biggest challenges I ran into revolved around getting and cleaning the data, as well as fine-tuning the BERT model. I've always heard data scientists and machine learning engineers talk about how messy data can be; I never fully realized how right they were until I tried to organize my data for the data pipeline. First off, I had to extract the text from a PDF using a Python library called PyPDF. This sounded easy, until I got the data: there were too many unknown symbols and random newline characters to count, and words didn't always have one space between them; sometimes they had five. Just random things you cannot plan for. Luckily, there are many helpful tokenizing libraries for transforming messy data into something a deep learning model can understand.

The second challenge was fine-tuning BERT. BERT is a humongous model: it contains 12 stacked transformer layers totaling about 110 million parameters. Someone in a two-bedroom apartment does not have enough money to train this model, which I found out the hard way when my computer said my GPU did not have enough memory to run it. So I moved everything from my local environment to the cloud to train on a Google Colab GPU. That took some time, because I had to not only move all my code but also upload all my data to Colab. I then tried the same thing with a very expensive T4 GPU from Google, and the same thing happened. I was flabbergasted. Luckily, after many hours of stressful Colab runtime restarts, I finally got it to run and saved the model. After finishing the project and asking myself how I could avoid errors like this next time, I realized you can't. Messy data and huge models are just one small part of what it takes to create a successful AI project. So rather than learning how to avoid problems, I'm learning to identify an error and find the solution.
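In practice the tokenizer libraries did much of the heavy lifting, but a lot of the manual cleanup boiled down to throwing out characters the models can't use and collapsing runaway whitespace. Here is a minimal sketch of that kind of normalization; the exact rules I used were messier, and these particular regexes are illustrative.

    import re

    def clean_text(raw: str) -> str:
        """Normalize messy PDF-extracted text: drop odd symbols, collapse newlines and spaces."""
        text = re.sub(r"[^\x20-\x7E\n]", " ", raw)   # unknown / non-printable symbols
        text = re.sub(r"\s*\n\s*", " ", text)        # random newline characters mid-sentence
        text = re.sub(r" {2,}", " ", text)           # five spaces between words -> one space
        return text.strip()

    print(clean_text("Risk   factors\n include  supply\xa0chain   issues."))
    # -> "Risk factors include supply chain issues."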

What I learned

This has been a very eye-opening weekend for me. I was very anxious leading up to this weekend since I had no team and little faith in myself. I proved to myself that I could create something as awesome as this all by myself! I also learned so many technologies I wouldn't have learned otherwise such as MySQL and Flask. It just goes to show that you can never be finished learning in this field and you must be someone who learns every day.

Drawbacks

One of the major drawbacks of this project is its reliance on how well the BERT classifier classifies the text. If BERT does not do a good job at identifying the risk factors, then no matter how well PEGASUS summarizes, the output will not give valuable information. In testing the project, I have even seen instances where BERT classified risky text as non-risky and vice versa. The biggest thing to combat this would be to get cleaner and more structured data; if I had more time, that would be the immediate next thing I do.

What's next for Unveil.ai

The next step for Unveil.ai would be to expand from the public-company domain to both the public and private domains. Since private companies do not post annual reports, Unveil.ai would instead use a Twitter scraper that collects tweets about a specific company and then performs sentiment analysis to gauge the public's perception of it. While this cannot give us facts as concrete as annual reports can, it can give us some idea of the risks the company carries.
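If I went down that road, the sentiment step itself could be as simple as running the collected tweets through an off-the-shelf sentiment model. The sketch below assumes the scraping is already done and uses made-up tweets, so it only illustrates the analysis half.

    from transformers import pipeline

    # Off-the-shelf sentiment model; gathering the tweets is the harder part.
    sentiment = pipeline("sentiment-analysis")

    tweets = [
        "Their latest product launch was a mess.",
        "Great quarter for this company, impressive growth.",
    ]
    for tweet, result in zip(tweets, sentiment(tweets)):
        print(result["label"], round(result["score"], 2), tweet)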
