Deploy-ML_Boston-House-Price

Project Background

The main purpose of the Machine Learning Process course is to prepare students to apply end-to-end machine learning workflows to real-world tasks, from framing the business problem through to deploying a service.

This case study is based on the well-known Boston housing data. It contains 506 records, each describing the housing in one area (census tract) of the Boston region. The task is to create a machine learning model that predicts the median value of owner-occupied homes (MEDV) from the other characteristics. In the case study below, I discuss a step-by-step approach to building a predictive model for this kind of scenario. The same flow can serve as a template for other supervised regression problems. The flow of the case study is as follows:

- Reading the data in Python
- Defining the problem statement
- Identifying the Target variable
- Looking at the distribution of the Target variable
- Basic Data Exploration
- Rejecting useless columns
- Visual Exploratory Data Analysis for data distribution (histograms and bar charts)
- Feature Selection based on data distribution
- Outlier treatment
- Missing Values treatment
- Visual correlation analysis
- Statistical correlation analysis (Feature Selection)
- Converting data to numeric for ML
- Sampling and K-fold cross-validation
- Trying multiple Regression algorithms
- Selecting the best Model
- Deploying the best model in production
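
The sampling and K-fold cross-validation step in the flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual code: the data is synthetic (the `rooms` and `prices` values are made up and only loosely shaped like RM and MEDV), the model is a single-feature least-squares line rather than the multi-feature linear regression and random forest models used in the project, and the helper names `k_fold_indices`, `fit_line`, and `cross_validated_rmse` are hypothetical.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_rmse(xs, ys, k=5, seed=0):
    """Average root-mean-squared error over k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the training rows, then score on the held-out fold.
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        scores.append((sum(errors) / len(errors)) ** 0.5)
    return sum(scores) / len(scores)


# Synthetic stand-in data (assumption): NOT the real Boston values.
rng = random.Random(42)
rooms = [rng.uniform(4.0, 8.5) for _ in range(506)]
prices = [9.0 * r - 34.0 + rng.gauss(0.0, 3.0) for r in rooms]

rmse = cross_validated_rmse(rooms, prices, k=5)
print(f"5-fold RMSE: {rmse:.2f}")
```

Each row is held out exactly once, so the averaged score reflects performance on data the model did not see during fitting. Running the same loop for each candidate algorithm gives a like-for-like basis for selecting the best model.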

Data Description

The business meaning of each column in the data is as below:

CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per 10,000 dollars
PTRATIO - pupil/teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
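
Most columns above are direct measurements, but B is a derived quantity defined by the formula 1000(Bk - 0.63)^2. The short sketch below only evaluates that formula for a few illustrative Bk values; the function name `b_transform` and the sample inputs are assumptions chosen for the example, not values taken from the dataset.

```python
def b_transform(bk):
    """Evaluate B = 1000 * (Bk - 0.63)^2 for a proportion Bk in [0, 1]."""
    return 1000.0 * (bk - 0.63) ** 2


# Illustrative inputs only (assumption): not rows from the dataset.
for bk in (0.0, 0.53, 0.63, 0.73, 1.0):
    print(f"Bk = {bk:.2f} -> B = {b_transform(bk):.1f}")
```

Because the formula is a parabola centred on 0.63, two different proportions (for example 0.53 and 0.73) map to the same B value, so the original Bk cannot be recovered from B alone.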

Work Instructions

Step 1. Select Dataset

Step 2. Statement of Business Problems

Step 3. Implement End-to-End Machine Learning Workflow

Step 4. Perform and Summarize Analysis

Project Outcome

* Block data preparation diagram

Let's check the description of the dataset with print(boston.DESCR):

.. _boston_dataset:

Boston house prices dataset

Data Set Characteristics:

:Number of Instances: 506 

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
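
The description above is the text held in the `DESCR` attribute of the object returned by scikit-learn's `load_boston`. A minimal sketch of how it can be printed follows. As far as I know, `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the sketch assumes an older scikit-learn release and falls back gracefully when the loader is missing. The wrapper name `load_boston_description` is hypothetical.

```python
def load_boston_description():
    """Return the Boston dataset description, or None if it is unavailable.

    Assumption: scikit-learn older than 1.2 is installed. On newer
    releases (or without scikit-learn) the import fails and None is returned.
    """
    try:
        from sklearn.datasets import load_boston
    except ImportError:
        return None
    boston = load_boston()  # may emit a deprecation warning on 1.0 and 1.1
    return boston.DESCR


description = load_boston_description()
if description is None:
    print("load_boston is not available in this environment")
else:
    print(description)
```

On an environment where the loader exists, the returned object also exposes the feature matrix as `data`, the MEDV target as `target`, and the column names as `feature_names`.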

Software And Tools Requirements

Create a new environment

conda create -p venv python==3.7 -y

Framework to Describe These Projects

Goal: The goal of this project was to explore and implement linear regression and random forest machine learning techniques to predict Boston house prices accurately. The project aimed to demonstrate the application of these techniques for real-world tasks and highlight their impact on predicting real estate prices.
Impact: The project's impact lies in its contribution to the field of machine learning and real estate valuation. By utilizing linear regression and random forest techniques, the project provides insights into predicting housing prices based on various features. This can have practical implications for landowners, appraisers, policymakers, and potential buyers by aiding in accurate property valuation and decision-making.
Challenges: The project encountered challenges in various stages of the machine learning workflow, including data collection, preprocessing, feature engineering, modeling, and deployment. Ensuring the accuracy and generalizability of the predictive models while addressing potential overfitting and selecting appropriate features were some of the challenges faced during the project.
Interesting Findings: The project's findings include:
- Successful implementation of linear regression and random forest techniques for predicting Boston house prices.
- The importance of data preprocessing and feature engineering in improving model performance.
- The practical implications of accurate housing price predictions for real estate stakeholders.

Conclusion/Future Works: The project successfully demonstrated the application of linear regression and random forest techniques in predicting real estate prices. The findings highlight the potential benefits of utilizing machine learning for property valuation and decision-making. Future work could involve exploring other advanced techniques, evaluating model robustness on different datasets, and integrating additional features for more accurate predictions. Overall, the project contributes to advancing the understanding and practical application of machine learning workflows in the context of real estate prediction.

Built With

dockerfile
html
jupyter-notebook
python

Updates

Zulkarnain Prastyo started this project — Jan 04, 2023 09:16 AM EST
