Texas A&M Datathon 2020

Github: https://github.com/johnschwenck/tamu_datathon_2020

Team:

Summary

For our 2020 TAMU Datathon submission, we have elected to solve the City Search challenge. Our model's unique spin offers a complete front-to-end implementation of a K-Means algorithm, which suggest the Top 12 best locations for a given user's preferences. We managed to collect and thoroughly process data from over 10 very different sources using various APIs such as the US Census and COVID-19 data. We also utilized Google tools such as Google Forms and Google Sheets where users submit a form with their preferences to a series of questions, which then get fed into a Google Sheet. Using R, we extract the information from the Google Sheet using the googledrive package and feed it into our K-Means algorithm.

Outline of Our Approach:

  • Identify region of interest (U.S)
  • Decide on the granularity (Metropolitan Area/ CBSA code)
  • Identify variables of interest
  • Obtain relevant dataset for each variable (API, government, other websites)
  • Process each dataset to obtain relevant columns
  • Map each dataset from zip code/city/county level to Metropolitan area through CBSA ID
  • Join all the mapped datasets into one table containing CBSA ID, and all relevant columns
  • Analyze dataset using K-Means Clustering and output cities within the chosen cluster
  • Utilize Power BI to create visuals

Variable Descriptions:

Housing (2019-2020):

Source: https://www.zillow.com/research/data/
Raw Data: ZHVI_1.csv, ZHVI_2.csv, ZHVI_3.csv, ZHVI_4.csv, ZHVI_5.csv, ZHVI_condo.csv
Processed Data: housing.rda

Variable Descriptions:

  • CBSA: Core-based statistical area code
  • one_bed_avg: Average price in dollars of one bedroom house
  • two_bed_avg: Average price in dollars of two bedroom house
  • three_bed_avg: Average price in dollars of three bedroom house
  • four_bed_avg: Average price in dollars of four bedroom house
  • five_bed_avg: Average price in dollars of five bedroom house
  • condo_avg: Average price in dollars of a condo

Rental Prices (2021 Forecast):

Source: https://www.huduser.gov/portal/datasets/fmr.html#2021_data
Raw Data: fy2021-safmrs.xlsx
Processed Data: rent_metro.rda

Variable Descriptions:

  • CBSA: Core-based statistical area code
  • one_bed_rent: Fair Market Rent for one bedroom apartment
  • two_bed_rent: Average price in dollars of two bedroom apartment
  • three_bed_rent: Average price in dollars of three bedroom apartment
  • four_bed_rent: Average price in dollars of four bedroom apartment
  • five_bed_rent: Average price in dollars of five bedroom apartment

Air Quality Index (2019):

Source: https://aqs.epa.gov/aqsweb/airdata/download_files.html#Annual
Raw Data: annual_aqi_by_cbsa_2019.csv
Processed Data: air_quality.rda

Variable Descriptions:

  • City, State: City and State for each entry
  • CBSA: Core-based statistical area code
  • Median AQI: Median Air Quality Index

Public Transportation (2018):

Source: https://api.census.gov/data/2018/acs/acs5/variables.html
Raw Data:
Census API key needed
pub_trans = B08006_008
Processed Data: pub_trans.rda

Variable Description:

  • CBSA: Core-based statistical area code
  • NAME: City, State Abbr. Area Type
  • Avg_User: Average number of users

Daily Average of Temperatures (2011):

Source: https://wonder.cdc.gov/controller/datarequest/D60
Raw Data: max_avg_temps2011.txt and min_avg_temps2011.txt
Processed Data: avg_temps_msa.rda

Variable Description:

  • CBSA: Core-based statistical area code
  • mean(Avg.Min.Temp): mean average of minimum air temperature for each CBSA
  • mean(Avg.Max.Temp): mean average of maximum air temperature for each CBSA

Demographics

Source: https://www.bls.gov/oes/tables.html
Raw Data: B01003_001B01002_001, B19013_001
Processed Data: demographics.rda

Variable Descriptions:

  • CBSA: Core-based statistical area code
  • CBSA_name: Core-based statistical area name
  • tot_emp_hc: Total employees in healthcare
  • loc_quotient_hc: Location quotient for healthcare
  • tot_emp_rest: Total employees in restaurants and food service
  • loc_quotient_rest: Location quotient for restaurants and food service
  • pop: Population
  • pca_hc: Healthcare employees per capita
  • rest_hc: Restaurant and food service workers per capita

Healthcare and Restaurant Labor Statistics (2019):

Source: https://www.bls.gov/oes/tables.html
Raw Data: oesm19ma.zip; MSA_M2019_dl.xlsx
Processed Data: healthcare_restaurants_labor.rda

Variable Descriptions:

  • CBSA: Core-based statistical area code
  • CBSA_name: Core-based statistical area name
  • tot_emp_hc: Total employees in healthcare
  • loc_quotient_hc: Location quotient for healthcare
  • tot_emp_rest: Total employees in restaurants and food service
  • loc_quotient_rest: Location quotient for restaurants and food service
  • pop: Population
  • pca_hc: Healthcare employees per capita
  • rest_hc: Restaurant and food service workers per capita

COVID Cases and Deaths (2020):

Source: https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html
Raw Data: us_crime_counties.csv
Processed Data: covid_14dayavg.rda

Variable Descriptions:

  • CBSA: Core-based statistical area code
  • mean(cases): 14-day average of cases
  • mean(deaths): 14-day average of deaths

Transportation (2016-2018):

Source: https://api.census.gov/data/2018/acs/acs5/variables.html Raw Data: B08101_025, B08134_061, B08101_009, B08134_011, B08134_021, B08301_004, B08134_031, B08101_033, B08101_041, B08134_111, B08101_049, B08134_001
Processed Data: transportation.rda

Variable Description:

  • CBSA: Core-based statistical area code
  • public_trans_qty: Average number of people who took public transportation
  • priv_trans_all_qty: Average number of people who took private transportation
  • priv_trans_carpool_qty: Average number of people who took private transportation and carpooled
  • walked_qty: Average number of people who walk
  • bike_cab_other_qty: Average number of bike, cab and other transportation usage
  • work_from_home_qty: Average number of people who work from home
  • avg_time_to_work_all: Average number of seconds for a person to get to work

Religion (2010):

Source: https://www.thearda.com/Archive/ChCounty.asp Raw Data: religion.csv
Processed Data: religion.rda

Variable Description:

  • CBSA: Core-based statistical area code
  • sum(TOTCNG): All denominations/groups - Total number of congregations
  • sum(TOTADH): All denominations/groups - Total number of adherents
  • sum(NONDCNG): Non-denominational - Total number of congregations
  • sum(NONDADH): Non-denominational- Total number of adherents
  • mean(TOTRATE): All denominations/groups - Rates of adherence per 1,000 population
  • mean(NONDRATE): Non-denominational - Rates of adherence per 1,000 population

Education

Source: https://libguides.princeton.edu/c.php?g=464855&p=3181354
Raw Data: B15003_017, B15003_018, B15003_022, B15003_023, B15003_024, B15003_025
Processed Data: education.rda

Variable Description:

  • CBSA: Core-based statistical area code
  • edu_hs_only: Total number of people who only have a high school diploma
  • edu_ged_only: Total number of people who only have a GED
  • edu_bs: Total number of people who have a Bachlor's degree
  • edu_ms: Total number of people who hae a masters
  • edu_prof: Total number of professional persons
  • edu_phd: Total number of people with a PhD

Built With

  • api
  • powerbi
  • r
Share this project:

Updates