Inspiration
Both of us are huge film buffs and have toyed with the idea of making a short film in the near future. Combining this passion with our knowledge of machine learning, we thought a budget predictor would be a great tool for other budding filmmakers, and even for film studios and investors/financiers.
What it does
Our project is a machine learning model that predicts a movie's budget from the following input parameters:
- Number of Actors in the movie
- Genre(s) of the movie
- Language(s) of the movie
- Release Year of the movie
- Runtime (in minutes) of the movie
- Production company for the movie
How we built it
- The first step in our project was to scrape the data for 100,000 movies from imdb.com.
- The next step was to extract the information in each HTML file (such as cast information, technical details, budget, and production information) and store it in a JSON object, resulting in 100,000 such JSON objects.
- Next, we combined all the JSON files into a single CSV file containing all the data in tabular format.
- The next task was exploratory data analysis and data pre-processing, which included cleaning the data, applying transformations, encoding categorical variables, and reducing the dimensionality of the dataset.
- For the machine learning model, we chose linear regression, with an 80/20 train/test split to evaluate the model's predictions on held-out data.
- The last step was to build a user interface with Streamlit that takes the user's input and performs inference using the model we built.
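The modeling step above could look roughly like the following sketch. It is not our actual training code: the data is synthetic, the column names (`num_actors`, `release_year`, `runtime_min`, `genre`, `language`, `production_company`, `budget`) are hypothetical, and each movie is assumed to have a single genre and language for simplicity. It only illustrates one-hot encoding of the categorical inputs, an 80/20 split, and a linear regression fit with scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the scraped dataset; all names and values are made up.
rng = np.random.default_rng(0)
n = 200
movies = pd.DataFrame({
    "num_actors": rng.integers(2, 40, size=n),
    "release_year": rng.integers(1980, 2024, size=n),
    "runtime_min": rng.integers(70, 180, size=n),
    "genre": rng.choice(["Drama", "Action", "Comedy"], size=n),
    "language": rng.choice(["English", "Hindi", "French"], size=n),
    "production_company": rng.choice(["Studio A", "Studio B", "Other"], size=n),
})
# Made-up budget so that the example has a signal to fit.
movies["budget"] = (
    1e5 * movies["num_actors"]
    + 2e4 * movies["runtime_min"]
    + rng.normal(0, 1e5, size=n)
)

categorical = ["genre", "language", "production_company"]
numeric = ["num_actors", "release_year", "runtime_min"]

# One-hot encode the categorical columns; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

X = movies[categorical + numeric]
y = movies["budget"]

# 80/20 train/test split, as described in the list above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination on the held-out 20%
print(f"R^2 on held-out data: {r2:.3f}")
```

Wrapping the encoder and the regressor in one `Pipeline` means the same preprocessing is applied at inference time, which is convenient when a UI passes raw user input to `model.predict`.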
Challenges we ran into
One of the first challenges we ran into was figuring out how to source the movie data, since we did not want to pay for API usage. Eventually we settled on scraping IMDb web pages to get all the pertinent information. The next major challenge was narrowing the many features we had down to a lower-dimensional dataset. We did this through extensive analysis of the features and their contribution to the dependent variable, movie budget. For certain categorical features, we had to perform binning to reduce the number of categories.
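As an illustration of the binning idea, here is a minimal sketch that groups infrequent categories under a single label. The helper name `bin_rare_categories`, the threshold, and the sample company names are all hypothetical; the criteria we actually used for binning may have differed.

```python
import pandas as pd


def bin_rare_categories(
    values: pd.Series, min_count: int, other_label: str = "Other"
) -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = values.value_counts()
    frequent = counts[counts >= min_count].index
    # Keep frequent categories as-is; everything else becomes other_label.
    return values.where(values.isin(frequent), other_label)


# Made-up production companies for demonstration only.
companies = pd.Series(
    ["Studio A", "Studio A", "Studio A", "Studio B", "Studio B", "Indie X", "Indie Y"]
)
binned = bin_rare_categories(companies, min_count=2)
print(binned.tolist())
```

Binning like this keeps the number of one-hot columns manageable when a feature such as production company has a long tail of values that each appear only a few times.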
Accomplishments that we're proud of
We're proud of the real-world application of our project, particularly for budding filmmakers, film studios, and investors/financiers. Our machine learning model can provide a clearer understanding of potential budgetary implications for upcoming movies. We also believe that our accomplishment is not just in the development of the app but in the collaboration itself. The collective effort and synergy between team members with diverse expertise made this achievement possible, reflecting the power of teamwork in innovation.
What we learned
We learned quite a lot while developing the app. The Extract-Transform-Load process, one of the primary parts of an ML pipeline, took up a major part of the development time and gave us many insights into the different techniques at our disposal. Another major lesson came from developing the machine learning model, particularly the feature engineering phase. Learning how to effectively preprocess and engineer features from various sources (such as actors, genres, and languages) involved a significant learning curve.
What's next for Movie Budget Prediction
The next major step in improving our model's predictions would be to incorporate textual features (such as actor and director names) alongside the numerical and categorical features.
Built With
- docker
- elasticsearch
- kibana
- pandas
- python
- ray
- scikit-learn
- streamlit