Stonks Scraper

Inspiration

In an age where efficiency and automation are essential, even reading and analyzing 10-K reports can be automated. This script replaces the time-consuming task of manually reviewing financial reports: it scrapes 10-K report PDFs from specified URLs, extracts the relevant text, cleans and preprocesses the content, and then applies BERT-based NLP processing to derive insights from the report. This saves time and surfaces key information quickly, so businesses and investors can make more informed decisions without the manual labor typically involved.

What it does

The Stonks Scraper automates the extraction and analysis of financial data from 10-K reports. It:

Downloads 10-K report PDFs from the provided URL.

Uses BeautifulSoup to locate the report on the page and PyMuPDF to extract the full text of the PDF.

Preprocesses the extracted data, cleaning it for NLP analysis.

Utilizes BERT-based NLP processing to gain valuable insights from the report, identifying key sections, trends, and data points.
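The first step above, finding the report PDF on a page, can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses Python's standard-library html.parser in place of BeautifulSoup so that it runs without extra dependencies, and the names PdfLinkParser and find_pdf_links, as well as the example.com URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Check the path only, so a query string does not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.links.append(absolute)


def find_pdf_links(html, base_url):
    """Return every PDF link found in an HTML string (hypothetical helper)."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

With BeautifulSoup, the same idea would loop over soup.find_all("a", href=True) and apply the same .pdf check. The links returned here could then be downloaded and handed to the text extraction step.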

How I built it

Web Scraping: I used BeautifulSoup to scrape 10-K reports from their respective URLs.

Text Extraction: PyMuPDF (fitz) was used to extract text from the PDF files.

Data Preprocessing: I cleaned and organized the extracted data to make it suitable for NLP analysis.

NLP Processing: I employed BERT to perform advanced text analysis, including identifying important financial terms, assessing sentiment, and extracting key information for investors.
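As a rough sketch of the text extraction and preprocessing steps above, the snippet below reads a PDF with PyMuPDF and applies a few simple cleaning rules. The specific rules (rejoining hyphenated line breaks, dropping page-number lines, collapsing whitespace) are assumptions chosen for illustration, not necessarily the ones this project uses, and the function names are hypothetical.

```python
import re


def extract_pdf_text(path):
    """Return the text of every page in a PDF, joined by newlines."""
    # PyMuPDF is imported lazily so clean_text can be used without it installed.
    import fitz

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def clean_text(raw):
    """Apply a few illustrative cleaning rules to extracted PDF text."""
    # Rejoin words split by a hyphen at a line break ("finan-\ncial").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop lines that contain nothing but a page number.
    lines = [
        line for line in text.split("\n")
        if not re.fullmatch(r"\s*\d{1,4}\s*", line)
    ]
    # Collapse remaining runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", " ".join(lines)).strip()
```

One known limitation of this sketch: the hyphen rule also merges genuine compounds that happen to break across lines (for example "long-" followed by "term"), so a real pipeline would likely need a dictionary check or a more careful rule.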

Challenges we ran into

Text Extraction Quality: Extracting clean, structured text from PDFs was challenging due to the varied formatting of 10-K reports.

NLP Model Tuning: Fine-tuning BERT to process financial language and identify relevant insights required significant experimentation.

Handling Large Data Volumes: 10-K reports can be quite lengthy, and processing them efficiently while maintaining accuracy was a hurdle.
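One common way to handle the length problem described above is to split a report into overlapping chunks, since standard BERT models accept at most 512 tokens per input. The sketch below is an assumption about how that could be done, not the project's confirmed approach. It counts words as a rough proxy for tokens, and the defaults of 200 words with a 20-word overlap are illustrative values that leave headroom because WordPiece tokenization usually produces more tokens than words.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping word windows (hypothetical helper)."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk can then be passed to the model separately and the per-chunk results aggregated. The overlap reduces the chance that a sentence cut at a chunk boundary is lost to both neighbors.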

Accomplishments that I am proud of

Developing a streamlined solution that can scale for multiple reports and extract actionable insights in real time.

Organizing the hackathon while still participating in it.

What I learned

The intricacies of PDF scraping and how to handle different report formats.

How to preprocess and clean financial text data for better NLP model performance.

The power of BERT and transformer models in processing complex, domain-specific language like finance.

What's next for Stonks Scraper

Extended Data Coverage: Integrating other financial documents (e.g., 10-Q reports, earnings call transcripts) for a broader analysis.

Improved NLP Models: Fine-tuning the BERT model for even better extraction of financial insights.

User Interface: Building a friendlier interface where users can upload reports and receive insights.

Built With

python

Updates

Ivan Widjanarko started this project — Mar 22, 2025 05:00 PM EDT
