Inspiration

Movies have always been a fascinating blend of art and business. While some films become massive box office hits, others struggle to break even. This inspired me to explore whether machine learning could accurately predict a movie’s domestic gross revenue from factors like IMDb rating, runtime, director, genre, and audience votes.

What it does

How we built it

1️⃣ Data Collection: Used a dataset of the top 1,000 IMDb movies, including their runtime, genre, IMDb rating, vote count, and domestic gross revenue.
2️⃣ Data Preprocessing: Cleaned and transformed the dataset, removed missing values, and converted columns like Gross and Runtime into numerical format.
3️⃣ Feature Engineering: Encoded categorical values using one-hot encoding and scaled numerical features.
4️⃣ Model Training: Trained multiple models (Linear Regression, Random Forest, XGBoost) and compared their performance.
5️⃣ Evaluation & Optimization: Measured model performance using R² score, MAE, and RMSE, then fine-tuned hyperparameters for better accuracy.
6️⃣ Final Testing & Prediction: Used the trained model to predict the domestic gross of a new, unseen movie.
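
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, the column names (`Runtime`, `IMDB_Rating`, `No_of_Votes`, `Genre`, `Gross`) are assumptions, and XGBoost is left out so the sketch needs only pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned dataset (made-up values, not real movies).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Runtime": rng.integers(80, 200, n),
    "IMDB_Rating": rng.uniform(7.6, 9.3, n).round(1),
    "No_of_Votes": rng.integers(25_000, 2_000_000, n),
    "Genre": rng.choice(["Drama", "Action", "Comedy"], n),
})
# Assumed relationship for the demo only: gross grows with vote count plus noise.
df["Gross"] = (40.0 * df["No_of_Votes"] + rng.normal(0, 5e6, n)).clip(lower=0)

numeric = ["Runtime", "IMDB_Rating", "No_of_Votes"]
categorical = ["Genre"]

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["Gross"], test_size=0.2, random_state=42
)

def build_pipeline(model):
    """Scale numeric columns, one-hot encode categorical ones, then fit the model."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

fitted, results = {}, {}
for name, model in models.items():
    pipe = build_pipeline(model).fit(X_train, y_train)
    pred = pipe.predict(X_test)
    fitted[name] = pipe
    results[name] = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

# Predict the gross of a new, unseen (hypothetical) movie.
new_movie = pd.DataFrame([{
    "Runtime": 120, "IMDB_Rating": 8.1, "No_of_Votes": 500_000, "Genre": "Drama",
}])
predicted_gross = float(fitted["random_forest"].predict(new_movie)[0])
```

Because the scaler and encoder sit inside the pipeline, they are fitted on the training split only, which is one way to keep information from the test set out of preprocessing.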

Challenges we ran into

🔸 Data Cleaning Issues: Some columns (like Gross) contained commas, and missing values had to be handled properly.
🔸 Feature Encoding: Ensuring that categorical variables like Genre and Certificate were correctly encoded without data leakage.
🔸 Model Overfitting: Some models performed well on training data but poorly on test data, requiring better feature selection and tuning.
🔸 Ensuring Realistic Predictions: The model sometimes predicted unrealistic movie grosses, which I addressed by improving feature selection and adjusting outliers.
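
As an illustration of the cleaning issue, here is a small sketch that converts `Gross` and `Runtime` to numbers and drops rows with missing values. The sample values are made up, and the `"142 min"` runtime format is an assumption about how the raw column looked.

```python
import pandas as pd

# Made-up rows imitating the raw columns: commas in Gross, a unit suffix in Runtime.
raw = pd.DataFrame({
    "Gross": ["1,234,567", "98,765,432", None, "5,000,000"],
    "Runtime": ["142 min", "175 min", "96 min", "152 min"],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric Gross and Runtime, dropping unparseable rows."""
    out = df.copy()
    out["Gross"] = pd.to_numeric(
        out["Gross"].str.replace(",", "", regex=False), errors="coerce"
    )
    out["Runtime"] = pd.to_numeric(
        out["Runtime"].str.replace(" min", "", regex=False), errors="coerce"
    )
    # Missing or unparseable values became NaN above; remove those rows.
    return out.dropna(subset=["Gross", "Runtime"]).reset_index(drop=True)

cleaned = clean(raw)
```

Using `errors="coerce"` turns anything that cannot be parsed into NaN instead of raising, so one `dropna` call handles both missing and malformed entries.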

Accomplishments that we're proud of

What we learned

Throughout this project, I gained hands-on experience with:
🔸 Data Cleaning & Preprocessing – Handling missing values, data type conversion, and feature engineering.
🔸 Feature Selection & Encoding – Transforming categorical features (like genre and certificate) into numerical values.
🔸 Model Training & Evaluation – Comparing Linear Regression, Random Forest, and XGBoost to find the best model.
🔸 Hyperparameter Tuning – Improving model performance using GridSearchCV.
🔸 Debugging & Problem-Solving – Overcoming errors in data processing and model predictions.
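
For the hyperparameter tuning step, a GridSearchCV run might look something like this. It is only a sketch: the data is synthetic, and the parameter grid and RMSE scoring are illustrative choices, not the values used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the prepared movie features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

# Illustrative grid; a real search would likely cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                                   # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
```

Each combination in the grid is scored by cross-validation on the training data, so the test set stays untouched until the final evaluation.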

What's next for Box Office Revenue Prediction

Final Outcome

After testing all three models, Random Forest and XGBoost performed best, giving the most accurate predictions of movie revenue on the test data. This project shows how machine learning can estimate a film’s domestic gross from historical data! 🎉

Built With
