Inspiration
Background
India is a country in South Asia. It is one of the world's most populous countries and the seventh-largest by area. Delhi, Mumbai, Chennai and Kolkata are its four most important metro cities. Metro cities are highly populated urban centers. People move to them in search of better jobs, more opportunities and a better life. These cities attract many people because of their superior infrastructure, such as roads, metro rail, safety and good-quality education.
Problem
Choosing one of these cities to settle in is often difficult. The deciding factor is usually the distinctive facilities each city offers compared with the others. This project aims to predict the best places to settle within these metro cities.
Interest
People may want an analysis of the different neighborhoods, and of the facilities and opportunities each one offers, before settling or investing in any of these metro cities.
Data acquisition and Cleaning
Data Sources
Data is the foundation of any data science project. For this project, the data comes from a government portal as CSV files, which we download and load. The dataset lists Indian postal codes along with their state names, but it lacks usable latitude and longitude values. We will use the Google Geocoding API to fill in the coordinates and the Foursquare API to get the venues in each neighborhood.
Data Processing
The dataset has several problems. It is huge and covers every state, whereas we need data for the four metropolitan cities only. A lot of data is also missing.
Data Cleaning
We select only the rows whose taluk (an administrative subdivision) contains the name of one of the four cities. Some pin codes appear in more than one entry, so we keep only the first entry for each.
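The filtering and de-duplication described here could look roughly like the sketch below. The column names (`officename`, `pincode`, `taluk`) and the sample rows are assumptions made up for illustration; the real CSV headers may differ.

```python
import pandas as pd

CITIES = ["Delhi", "Mumbai", "Chennai", "Kolkata"]


def select_metro_rows(df, taluk_col="taluk", pin_col="pincode"):
    """Keep rows whose taluk names one of the four cities, one row per pin code."""
    # Case-insensitive match on any city name; missing taluks never match.
    pattern = "|".join(CITIES)
    in_metro = df[taluk_col].fillna("").str.contains(pattern, case=False)
    metro = df[in_metro]
    # Several entries can share a pin code; keep only the first one.
    return metro.drop_duplicates(subset=pin_col, keep="first").reset_index(drop=True)


# Made-up sample rows, only to show the behaviour.
sample = pd.DataFrame({
    "officename": ["A", "B", "C", "D", "E"],
    "pincode": [110001, 110001, 400001, 560001, 600001],
    "taluk": ["New Delhi", "New Delhi", "Mumbai", "Bangalore North", None],
})
cleaned = select_metro_rows(sample)
print(cleaned)
```

Matching on a substring rather than an exact name is a design choice: it catches values such as "New Delhi", but it may also need tightening if unrelated taluks happen to contain a city name.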
Filling missing data
The data has no coordinate values, so we use the Google Geocoding API to look up the latitude and longitude for each pin code. To make the data easier to interpret, we also add a neighborhood column, since the office name is not very insightful. Some errors may creep in during geocoding. We manually remove rows whose coordinates fall outside India and rows for which no coordinates could be fetched.
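As a rough illustration of the geocoding step, the sketch below builds a Google Geocoding API request for a pin code, parses the JSON response, and applies a sanity check for coordinates outside India. The helper names and the bounding box are assumptions for this example (the box is approximate, and the project itself removed bad rows manually). The network call needs a valid API key, so the demo at the end uses a canned response instead.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

# Approximate bounding box for India (assumption, used only as a sanity check).
INDIA_LAT = (6.0, 37.5)
INDIA_LNG = (68.0, 97.5)


def build_geocode_url(pincode, api_key):
    """Build the request URL for one pin code."""
    query = urllib.parse.urlencode({"address": f"{pincode}, India", "key": api_key})
    return f"{GEOCODE_URL}?{query}"


def parse_geocode_response(payload):
    """Return (lat, lng) from a Geocoding API JSON payload, or None if no match."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def in_india(coords):
    """True when coords fall inside the rough bounding box above."""
    if coords is None:
        return False
    lat, lng = coords
    return INDIA_LAT[0] <= lat <= INDIA_LAT[1] and INDIA_LNG[0] <= lng <= INDIA_LNG[1]


def geocode_pincode(pincode, api_key):
    """Call the API over the network (requires a valid key)."""
    with urllib.request.urlopen(build_geocode_url(pincode, api_key), timeout=10) as resp:
        return parse_geocode_response(json.load(resp))


# Offline demo with a canned payload shaped like an API response.
canned = {"status": "OK", "results": [{"geometry": {"location": {"lat": 28.63, "lng": 77.22}}}]}
coords = parse_geocode_response(canned)
print(coords, in_india(coords))
```

Rows for which `parse_geocode_response` returns `None`, or for which `in_india` is false, are the ones that would be dropped.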
Feature Selection
The dataset has many features, but the analysis needs only each neighborhood and its coordinates. We therefore keep Neighborhood, Taluk, Pin code, Latitude and Longitude. Using the Foursquare API, we then extract the venues in each city.
Methodology
Analyzing data
We one-hot encode the venue categories and group the data by neighborhood. For the analysis, we extract the 10 most common venue categories in each neighborhood.
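A minimal pandas sketch of this step is shown below. The column names and the handful of venue rows are invented for illustration, and taking the mean of the one-hot columns per neighborhood is one common way to do the grouping, not necessarily the exact aggregation used in the project.

```python
import pandas as pd

# Made-up venue rows; real data would come from the Foursquare API.
venues = pd.DataFrame({
    "Neighborhood": ["Adyar", "Adyar", "Adyar", "Bandra", "Bandra"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Bar", "Cafe"],
})

# One column per venue category, 1.0 where the row has that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean per neighborhood = share of that neighborhood's venues in each category.
grouped = onehot.groupby("Neighborhood").mean()


def top_venues(row, n=10):
    """Return up to n category names, most frequent first, skipping absent ones."""
    ranked = row[row > 0].sort_values(ascending=False)
    return ranked.index[:n].tolist()


top = {name: top_venues(row) for name, row in grouped.iterrows()}
print(top)
```

Each row of `grouped` is a venue-frequency profile for one neighborhood, which is also a convenient input for clustering.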
Modelling data
We will use the k-means clustering algorithm to group each city's neighborhoods into five clusters based on their venue profiles.
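The five-cluster k-means step might be sketched as follows with scikit-learn, which is an assumption since the project does not name its clustering library. The sketch also assumes that the rows being clustered are per-neighborhood venue-frequency vectors; random numbers stand in for the real feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for real data: 30 neighborhoods x 6 venue categories,
# each row normalised to sum to 1 like a frequency profile.
rng = np.random.default_rng(0)
features = rng.random((30, 6))
features = features / features.sum(axis=1, keepdims=True)

# Five clusters, as in the project; random_state fixed for repeatability.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(features)
labels = kmeans.labels_

print(labels)
print(kmeans.cluster_centers_.shape)
```

The resulting `labels` array has one cluster id per row and can be attached to the neighborhood table as a new column.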
Visualizing data
We use folium to visualize the results on a map.
Built With
- dataframe
- folium
- google-geocoding
- pandas
- python

