As aspiring entrepreneurs with an interest in venture capital, we constantly face the challenge of estimating how a startup or business we are interested in will grow. Looking at historical data on companies that have since gone public, we realized that we could study earlier companies that showed similar initial growth and use that data to get a rough idea of how well a new business idea or startup might fare.
What it does
This tool uses machine learning to combine historical data from company earnings reports, sentiment from news articles, and information about each company's funding rounds. Using the k-means clustering algorithm, it groups together companies with similar quantitative measures. VCs can then estimate the growth of a startup they are evaluating by looking at earlier companies that showed similar early-stage growth, on the assumption that such companies are likely to see similar long-term growth.
How we built it
We used multiple web-scraping tools, written in Python, to collect data about thousands of startups and established companies. After filtering this list of companies by year and industry and running sentiment analysis on hundreds of articles about these startups, we built a model using the k-means clustering algorithm, written with the TensorFlow and scikit-learn libraries. For any startup that we wish to evaluate, we can now use the results of our model to find the startups that are most similar in terms of early-stage growth and press coverage, two factors that may help a VC estimate a startup's long-term growth. Finally, we built a web application to host and display the results of our algorithm.
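The clustering and lookup step described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn only, not our actual pipeline: the company names, the three feature columns, the numbers, the choice of two clusters, and the `similar_companies` helper are all made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one row per company. The columns are illustrative:
# early revenue growth rate, total funding (USD millions), mean news sentiment.
names = ["A", "B", "C", "D", "E", "F"]
features = np.array([
    [0.10, 2.0, 0.1],
    [0.12, 2.5, 0.2],
    [0.11, 1.8, 0.0],
    [0.90, 40.0, 0.7],
    [0.85, 35.0, 0.8],
    [0.95, 45.0, 0.6],
])

# Standardize each column so that no single feature dominates the
# Euclidean distances that k-means relies on.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# Two clusters is an arbitrary choice for this toy dataset.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(scaled)


def similar_companies(new_features):
    """Return the names of known companies in the same cluster as a new startup."""
    point = scaler.transform(np.array([new_features]))
    cluster = model.predict(point)[0]
    return [name for name, label in zip(names, labels) if label == cluster]


# A hypothetical startup with high growth, large funding, and positive press.
peers = similar_companies([0.88, 38.0, 0.75])
print(peers)
```

The companies returned in `peers` are the historical comparables whose later growth a VC would then inspect.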
Challenges we ran into
The major challenge we ran into was figuring out what data would be relevant for our k-means clustering algorithm. We wanted to use the information that VCs and investors rely on to make investment decisions, such as gross profit and margin, previous funding, and operating costs. We also needed to perform a lot of data cleaning in order to reduce the size of the Crunchbase dataset.
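As a rough illustration of this kind of cleaning, the sketch below drops records that lack any of the chosen features and then standardizes the rest, since k-means compares points by Euclidean distance and raw funding figures would otherwise swamp a margin between 0 and 1. The field names and sample records are hypothetical, and plain Python is used here only to keep the example self-contained.

```python
from statistics import mean, pstdev

# Hypothetical field names for the features mentioned above.
REQUIRED = ("gross_margin", "total_funding", "operating_costs")

# Made-up records; B has a missing value and D lacks a field entirely.
records = [
    {"name": "A", "gross_margin": 0.60, "total_funding": 5_000_000, "operating_costs": 1_200_000},
    {"name": "B", "gross_margin": None, "total_funding": 2_000_000, "operating_costs": 800_000},
    {"name": "C", "gross_margin": 0.40, "total_funding": 15_000_000, "operating_costs": 4_000_000},
    {"name": "D", "gross_margin": 0.50, "total_funding": 10_000_000},
]


def clean(rows):
    """Keep only rows that have a non-missing value for every required field."""
    return [r for r in rows if all(r.get(key) is not None for key in REQUIRED)]


def standardize(rows):
    """Rescale each required field to zero mean and unit standard deviation."""
    out = [{"name": r["name"]} for r in rows]
    for key in REQUIRED:
        values = [r[key] for r in rows]
        mu, sigma = mean(values), pstdev(values)
        for scaled_row, row in zip(out, rows):
            # A constant column carries no information, so map it to 0.
            scaled_row[key] = (row[key] - mu) / sigma if sigma else 0.0
    return out


cleaned = clean(records)
scaled = standardize(cleaned)
print([r["name"] for r in cleaned])
```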
Accomplishments that we're proud of
Our team is most proud of building a tool with the potential to change the way VCs analyze, view, and invest in startups. In a very short time, we put together an end-to-end pipeline for web scraping, sentiment analysis, and clustering, achieving much of our ambitious initial goal. The project truly came together with the addition of an accessible frontend, which lets us visualize and display our results in an easily digestible format.
What we learned
This project required us to learn about and build multiple web-scraping tools, something we had very limited experience with beforehand. We had to learn to deal with websites that presented information in several different formats and contained noisy data. We also had limited prior experience with the k-means clustering algorithm and had to quickly learn how to implement it with multiple variables and hundreds of data points. Most importantly, this project gave us the confidence to pursue projects that involve a lot of uncertainty, where we have to figure things out as we go along.
What's next for Speculon
We hope to expand our dataset significantly. So far, we have been limited to companies that have gone public and to whatever data we could find publicly available on Crunchbase and Google Finance. With access to private datasets, we expect to improve our model's accuracy and reliability and make it a practical tool for real-life investors and VCs.