Inspiration
We compiled a bunch of brainstorming ideas and this happened to be the one we collectively were interested in purusing.
What it does
It shows compiled open New York City data dating from 2013-2017. It relates Housing & Development property sales to open parking and camera violations and parking tickets to try to discover if a relationship exists between property values and the number of tickets issued in that area.
How we built it
We cleaned our data through R and Python.
Challenges we ran into
There were numerous zero sale prices in a neighborhood. We developed an equation to combat this by finding the mean of property sales to estimate those sale prices.
Accomplishments that we're proud of
We managed to clean out data effectively!
What we learned
This experience has offered the opportunity to experiment with different ways of approaching code. It was fun to be connected within the CUNY community to take on a project together with people of similar interests.
What's next for The Correlation of Ticket Violations & Property Sales in NYC
We went through various options:
Mean sales price in borough
Mean sales price within neighborhood
If time allowed: a. Would have wanted to determine mean pricing within a neighborhood going through slicing methods. As we iterate through each row, take average of pervious rows and replace zero with mean value. b. compare neighborhood with residential/commericial building type and replace with avg.
Went with: Determined to check normal distribution of sales pricing to either replace with mean or median depending on skewness
CODE (R):
2013 2014
Keeping necessary columns
keeps <- c("Summons.Number", "Plate.ID", "Registration.State", "Plate.Type", "Issue.Date", "Violation.Code", "Violation.County", "Violation.Precinct")
Aug_2013_June_2014 <- CUNY.Hackathon.Aug2013June2014
cleaning neighborhood names.
Aug_2013_June_2014 <- Aug_2013_June_2014 %>% mutate(Violation.County= case_when(Violation.County == "BX" ~ "BRONX", Violation.County == "K" ~ "KINGS", Violation.County == "KING" ~ "KINGS", Violation.County == "Q" ~ "QUEENS", Violation.County == "QUEEN" ~ "QUEENS", Violation.County == "RC" ~ "RICHMOND", Violation.County == "RICH" ~ "RICHMOND", Violation.County == "R" ~ "RICHMOND", Violation.County == "NEWY" ~ "MANHATTAN", Violation.County == "MAN" ~ "MANHATTAN", Violation.County == "NEW Y" ~ "MANHATTAN", Violation.County == "QU" ~ "QUEENS", Violation.County == "MH" ~ "MANHATTAN", Violation.County == "QNS" ~ "QUEENS", Violation.County == "BK" ~ "KINGS", Violation.County == "QN" ~ "QUEENS", Violation.County == "MN" ~ "MANHATTAN", Violation.County == "ST" ~ "STATEN_ISLAND", Violation.County == "NYC" ~ "MANHATTAN", Violation.County == "NY" ~ "MANHATTAN", TRUE ~ Violation.County))
Aug_2013_June_2014 = Aug_2013_June_2014[keeps]
changing name for violation.precinct column for later join
names(Aug_2013_June_2014)[8] <- "pct"
writing csv file
write.csv(Aug_2013_June_2014, file = "CUNY Hackathon Aug_2013_June_2014.csv")
2015 cleaning
Keeping necessary columns
keeps <- c("Summons.Number", "Plate.ID", "Registration.State", "Plate.Type", "Issue.Date", "Violation.Code", "Violation.County", "Violation.Precinct")
renaming file
fiscal_2015 <- CUNY.Hackathon.Year_2015
cleaning violation.county
fiscal_2015 <- fiscal_2015 %>% mutate(Violation.County= case_when(Violation.County == "BX" ~ "BRONX", Violation.County == "K" ~ "KINGS", Violation.County == "KING" ~ "KINGS", Violation.County == "Q" ~ "QUEENS", Violation.County == "QUEEN" ~ "QUEENS", Violation.County == "RC" ~ "RICHMOND", Violation.County == "RICH" ~ "RICHMOND", Violation.County == "R" ~ "RICHMOND", Violation.County == "NEWY" ~ "MANHATTAN", Violation.County == "MAN" ~ "MANHATTAN", Violation.County == "NEW Y" ~ "MANHATTAN", Violation.County == "QU" ~ "QUEENS", Violation.County == "MH" ~ "MANHATTAN", Violation.County == "QNS" ~ "QUEENS", Violation.County == "BK" ~ "KINGS", Violation.County == "QN" ~ "QUEENS", Violation.County == "MN" ~ "MANHATTAN", Violation.County == "ST" ~ "STATEN ISLAND", Violation.County == "NYC" ~ "MANHATTAN", Violation.County == "NY" ~ "MANHATTAN", TRUE ~ Violation.County))
CODE (PYTHON):
getting rid of unnecessary columns
fiscal_2015 = fiscal_2015[keeps]
changing name of violation. Precinct
names(fiscal_2015)[8] <- "pct"
writing csv
write.csv(fiscal_2015, file = "CUNY Hackathon Year_2015.csv")
def meanMedianImputation(df): # import pdb; pdb.set_trace();
# cleaning
# remove \n from columns
df.columns = df.columns.str.replace('\n','')
# remove whitespace from both sides
df["NEIGHBORHOOD"] = df_list["NEIGHBORHOOD"].str.strip()
# get unique neighborhood names
neighborhoods = df["NEIGHBORHOOD"].unique()
# loop over neighborhoods
for x in range(len(neighborhoods)):
# get neighborhood sale price
neighborhood_sales = df.loc[df["NEIGHBORHOOD"]==neighborhoods[x], "SALE PRICE"]
# get median for each neighborhood
median = neighborhood_sales.median()
# get mean for each neighborhood
mean = neighborhood_sales.mean()
# get skewness, if greater than 1, replace with median
if ((neighborhood_sales.skew() > 1) | (neighborhood_sales.skew() < -1)) :
# neighborhood_sales.replace(0.0, median, inplace=True)
df.loc[df["NEIGHBORHOOD"]==neighborhoods[x], "SALE PRICE"] = df.loc[df["NEIGHBORHOOD"]==neighborhoods[x], "SALE PRICE"].replace(0, median)
# else replace with mean
elif neighborhood_sales.skew()==0:
df.loc[df["NEIGHBORHOOD"]==neighborhoods[x], "SALE PRICE"] = df.loc[df["NEIGHBORHOOD"]==neighborhoods[x], "SALE PRICE"].replace(0, mean)
else:
df.loc[df["NEIGHBORHOOD"]==neighborhoods[x], "SALE PRICE"] = df.loc[df["NEIGHBORHOOD"]==neighborhoods[x], "SALE PRICE"].replace(0, mean)
**We will have a video prepared by 4:30 P.M. the video posted now at 2PM is a placeholder*
Log in or sign up for Devpost to join the conversation.