Deploy-ML_Boston-House-Price

Background Project
The main purpose of the Machine Learning Process course is to prepare for applying end-to-end machine learning workflows to real-world tasks, starting from business problems to service deployments.
This case study is based on the famous Boston housing data. It contains the details of 506 houses in Boston city. Your task is to create a machine-learning model which can predict the average price of a house based on its characteristics. In the below case study I will discuss the step-by-step approach to creating a Machine Learning predictive model in such scenarios. You can use this flow as a template to solve any supervised ML Regression problem! The flow of the case study is as below:
- Reading the data in python
- Defining the problem statement
- Identifying the Target variable
- Looking at the distribution of the Target variable
- Basic Data Exploration
- Rejecting useless columns
- Visual Exploratory Data Analysis for data distribution (Histogram and Barcharts)
- Feature Selection based on data distribution
- Outlier treatment
- Missing Values treatment
- Visual correlation analysis
- Statistical correlation analysis (Feature Selection)
- Converting data to numeric for ML
- Sampling and K-fold cross-validation
- Trying multiple Regression algorithms
- Selecting the best Model
- Deploying the best model in production
Data Description
The business meaning of each column in the data is as below:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per 10,000 dollars
- PTRATIO - pupil/teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
Work Instructions
Step 1. Select Dataset
Step 2. Statement of Business Problems
Step 3. Implement Endo to End Machine Learning Workflow
Step 4. Perform and Summary Analysis
Outcome Project
* Block data preparation diagram
Lets check the description of the dataset: print(boston.DESCR):
.. _boston_dataset:
Boston house prices dataset
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
Software And Tools Requirements
Create a new environment
conda create -p venv python==3.7 -y
Framework to Describe These Projects
Goal: The goal of this project was to explore and implement linear regression and random forest machine learning techniques to predict Boston house prices accurately. The project aimed to demonstrate the application of these techniques for real-world tasks and highlight their impact on predicting real estate prices.
Impact: The project's impact lies in its contribution to the field of machine learning and real estate valuation. By utilizing linear regression and random forest techniques, the project provides insights into predicting housing prices based on various features. This can have practical implications for landowners, appraisers, policymakers, and potential buyers by aiding in accurate property valuation and decision-making.
Challenges: The project encountered challenges in various stages of the machine learning workflow, including data collection, preprocessing, feature engineering, modeling, and deployment. Ensuring the accuracy and generalizability of the predictive models while addressing potential overfitting and selecting appropriate features were some of the challenges faced during the project.
Interesting Findings: The project's findings include: • Successful implementation of linear regression and random forest techniques for predicting Boston house prices. • The importance of data preprocessing and feature engineering in improving model performance. • The practical implications of accurate housing price predictions for real estate stakeholders.
Conclusion/Future Works: The project successfully demonstrated the application of linear regression and random forest techniques in predicting real estate prices. The findings highlight the potential benefits of utilizing machine learning for property valuation and decision-making. Future work could involve exploring other advanced techniques, evaluating model robustness on different datasets, and integrating additional features for more accurate predictions. Overall, the project contributes to advancing the understanding and practical application of machine learning workflows in the context of real estate prediction.
Log in or sign up for Devpost to join the conversation.