Investigating A Baseball Dataset

Source: 2014 Lahman's Baseball Database

Udacity Data Analyst Nanodegree - Project 2

I used "Salaries.csv" and "AwardsSharePlayers.csv" for my project and my question to this database is "What factors have an effect on team salaries?" There are total 5 columns and I am going to analyze yearID, teamID and igID to investigate how these columns are related to salaries.

First, I am going to import these two csv files and transform to Pandas DataFrame and do some data wrangling.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# import Salaries data
filename = 'Salaries.csv'
salariesDF = pd.read_csv(filename)

# import Players data
filename = 'AwardsSharePlayers.csv'
playersDF = pd.read_csv(filename)

salariesDF.head()

	yearID	teamID	lgID	playerID	salary
0	1985	ATL	NL	barkele01	870000
1	1985	ATL	NL	bedrost01	550000
2	1985	ATL	NL	benedbr01	545000
3	1985	ATL	NL	campri01	633333
4	1985	ATL	NL	ceronri01	625000

playersDF.head()

	awardID	yearID	lgID	playerID	pointsWon	pointsMax	votesFirst
0	Cy Young	1956	ML	fordwh01	1.0	16	1.0
1	Cy Young	1956	ML	maglisa01	4.0	16	4.0
2	Cy Young	1956	ML	newcodo01	10.0	16	10.0
3	Cy Young	1956	ML	spahnwa01	1.0	16	1.0
4	Cy Young	1957	ML	donovdi01	1.0	16	1.0

And I am going to see if there is any null value in the data. First salaries data.

pd.isnull(salariesDF).sum()

yearID      0
teamID      0
lgID        0
playerID    0
salary      0
dtype: int64

salariesDF.describe()

	yearID	salary
count	24758.000000	2.475800e+04
mean	1999.891752	1.932905e+06
std	8.324492	3.190334e+06
min	1985.000000	0.000000e+00
25%	1993.000000	2.600000e+05
50%	2000.000000	5.250000e+05
75%	2007.000000	2.199643e+06
max	2014.000000	3.300000e+07

Although there is no null value in the data, there are some zeros in salary column. I am going to find these rows with zero salary.

salariesDF[salariesDF["salary"] == 0]

	yearID	teamID	lgID	playerID	salary
6179	1993	NYA	AL	jamesdi01	0
12007	1999	PIT	NL	martija02	0

There are total two players with 0 salary. I am going to find more details about those two players.

salariesDF[salariesDF["playerID"] == "jamesdi01"]

	yearID	teamID	lgID	playerID	salary
1302	1987	ATL	NL	jamesdi01	120000
1929	1988	ATL	NL	jamesdi01	360000
2591	1989	ATL	NL	jamesdi01	400000
3536	1990	CLE	AL	jamesdi01	650000
5293	1992	NYA	AL	jamesdi01	387500
6179	1993	NYA	AL	jamesdi01	0
8019	1995	NYA	AL	jamesdi01	350000

salariesDF[salariesDF["playerID"] == "martija02"]

	yearID	teamID	lgID	playerID	salary
11011	1998	PIT	NL	martija02	170000
12007	1999	PIT	NL	martija02	0

Clearly for jamesdi01, the 0 value is wrong. I am going to assume the salary in 1993 as the average of 1992 and 1995. For martija02, I will ignore these two rows from the database.

# Taking care of jamesdi01
salariesDF["salary"].iloc[6179] = (387500 + 350000) / 2
salariesDF[salariesDF["playerID"] == "jamesdi01"]

	yearID	teamID	lgID	playerID	salary
1302	1987	ATL	NL	jamesdi01	120000
1929	1988	ATL	NL	jamesdi01	360000
2591	1989	ATL	NL	jamesdi01	400000
3536	1990	CLE	AL	jamesdi01	650000
5293	1992	NYA	AL	jamesdi01	387500
6179	1993	NYA	AL	jamesdi01	368750
8019	1995	NYA	AL	jamesdi01	350000

# Taking care of martija02
salariesDF = salariesDF[salariesDF["playerID"] != "martija02"]
salariesDF.describe()

	yearID	salary
count	24756.000000	2.475600e+04
mean	1999.891865	1.933069e+06
std	8.324817	3.190411e+06
min	1985.000000	1.090000e+04
25%	1993.000000	2.600000e+05
50%	2000.000000	5.250000e+05
75%	2007.000000	2.200000e+06
max	2014.000000	3.300000e+07

Now I am going to clean the AwardsSharePlayers data.

# Printout the total number of null values of each column
pd.isnull(playersDF).sum()

awardID         0
yearID          0
lgID            0
playerID        0
pointsWon       0
pointsMax       0
votesFirst    358
dtype: int64

playersDF.describe()

	yearID	pointsWon	pointsMax	votesFirst
count	6705.000000	6705.000000	6705.000000	6347.000000
mean	1971.344817	43.168829	266.107233	1.612573
std	27.172773	67.714919	128.980771	4.868226
min	1911.000000	0.000000	16.000000	0.000000
25%	1950.000000	4.000000	140.000000	0.000000
50%	1974.000000	12.000000	336.000000	0.000000
75%	1994.000000	52.000000	336.000000	0.000000
max	2014.000000	448.000000	448.000000	32.000000

Although there are 358 null values in "votesFirst" column, I don't plan to use this column. Since other columns don't have a null value, I can leave this data as it is.

Now salary is investigated using the single-variable (1d) exploration

# single-variable investigation - Salary
# plot boxplot and histogram
%matplotlib inline
plt.boxplot(salariesDF["salary"])
plt.ylabel("Salary")
plt.show()
plt.hist(salariesDF["salary"]/1000000, bins = 10)
plt.ylabel("Frequency")
plt.xlabel("Salary in millions")
plt.show()

png

In histogram, you can see that the most of the salaries are below 5 millions and very little number of people are more than 5 millions range. However, the box plot is very difficult to understand because there are too many points and most of them seem to be outliers. In order to decrease the number of points, the set of the mean salaries per year is used to plot the boxplot.

# Get a dataframe of the set of the mean salaries per year
salaryYearDF = salariesDF[["yearID","salary"]].groupby("yearID").mean()
salaryYearDF

	salary
yearID
1985	476299
1986	417147
1987	434729
1988	453171
1989	506323
1990	511973
1991	894961
1992	1047520
1993	977366
1994	1049588
1995	964979
1996	1027909
1997	1218687
1998	1281958
1999	1486794
2000	1992984
2001	2279841
2002	2392526
2003	2573472
2004	2491776
2005	2633830
2006	2834520
2007	2941435
2008	3136517
2009	3277646
2010	3278746
2011	3318838
2012	3458421
2013	3723344
2014	3980445

Now plot the boxplot using this dataframe.

# plotting box plot and histogram
%matplotlib inline
plt.boxplot(salaryYearDF.values / 1000)
plt.ylabel("Salary in thousands")
plt.show()

meanYearDF = float(salaryYearDF.mean())
medianYearDF = float(salaryYearDF.median())
print "Mean:", meanYearDF
print "Median:", medianYearDF

png

Mean: 1902124.83333
Median: 1739889.0

Mean is greater than Median as expected since the historam above shows the right skewed graph.

Now I am going to investigate the relation between the teamID and salaries. Find sum, max, min, mean of salaries by team ids.

# Database with two columns: teamID and salary
salariesDF_reduced = salariesDF[["teamID","salary"]]

salariesDF_summary_by_teamID = salariesDF_reduced.groupby("teamID").agg([np.sum,np.max,np.min,np.mean])
salariesDF_summary_by_teamID.columns = ["sum", "max", "min", "mean"]
salariesDF_summary_by_teamID.head()

	sum	max	min	mean
teamID
ANA	468091973	13166667	150000	1895109
ARI	1209975228	16000000	170000	2499948
ATL	1882946784	16061802	60000	2184393
BAL	1630200038	17000000	60000	1846206
BOS	2432700935	22500000	60000	2755040

def bargraph(column): # column = sum, max, min, mean
    frequency = salariesDF_summary_by_teamID[column].sort_values(axis=0, ascending = False).head(10)
    objects = frequency.index
    y_pos = np.arange(len(objects))
    plt.bar(y_pos, frequency, align='center')
    plt.xticks(y_pos, objects)
    plt.ylabel('Salary')
    plt.title("Top 10 {} of salaries by teamID".format(column))
    print "Top 10 {} of salaries by teamID".format(column)
    print frequency
    plt.show()

Change False to True to see the bargraphs.

""" Bar graph of the number of players by teamID """
if True:
    %matplotlib inline
    bargraph("sum")
    bargraph("max")
    bargraph("min")
    bargraph("mean")

Top 10 sum of salaries by teamID
teamID
NYA    3283488084
BOS    2432700935
LAN    2237766703
PHI    1982355800
NYN    1965737231
ATL    1882946784
CHN    1859833315
SLN    1775631332
DET    1771197687
CHA    1699865143
Name: sum, dtype: int64

png

Top 10 max of salaries by teamID
teamID
NYA    33000000
LAA    26187500
LAN    26000000
PHI    25000000
TEX    24000000
SEA    24000000
NYN    23145011
DET    23000000
MIN    23000000
BOS    22500000
Name: max, dtype: int64

png

Top 10 min of salaries by teamID
teamID
NYM    501560
SFG    500500
MIA    480000
WAS    316000
LAA    316000
ARI    170000
MIL    170000
TBA    170000
ANA    150000
COL    109000
Name: min, dtype: int64

png

Top 10 mean of salaries by teamID
teamID
SFG    5315191
LAA    4186119
NYA    3735481
BOS    2755040
LAN    2505897
ARI    2499948
MIA    2480984
WAS    2466758
NYN    2345748
NYM    2283624
Name: mean, dtype: int64

png

multiple-variable (2d) explorations

I believe that the player' pointsWon affects salaries. (i.e. palyers with more points have higher salaries) So I set the pointsWon as an independent variable and salary as a dependent variable.

First, I reindex the salary dataframe with a key value set (year, player ID) so that I can join the two dataframes.

# Set key value as (year, player ID) for salariesDF
keyValue = ("{}, {}".format(salariesDF["yearID"].iloc[i],salariesDF["playerID"].iloc[i]) for i in range(len(salariesDF)))

salariesDFkv = pd.DataFrame(salariesDF.values, columns = list(salariesDF.columns.values), index = list(keyValue))
salariesDFkv

	yearID	teamID	lgID	playerID	salary
1985, barkele01	1985	ATL	NL	barkele01	870000
1985, bedrost01	1985	ATL	NL	bedrost01	550000
1985, benedbr01	1985	ATL	NL	benedbr01	545000
1985, campri01	1985	ATL	NL	campri01	633333
1985, ceronri01	1985	ATL	NL	ceronri01	625000
1985, chambch01	1985	ATL	NL	chambch01	800000
1985, dedmoje01	1985	ATL	NL	dedmoje01	150000
1985, forstte01	1985	ATL	NL	forstte01	483333
1985, garbege01	1985	ATL	NL	garbege01	772000
1985, harpete01	1985	ATL	NL	harpete01	250000
1985, hornebo01	1985	ATL	NL	hornebo01	1500000
1985, hubbagl01	1985	ATL	NL	hubbagl01	455000
1985, mahleri01	1985	ATL	NL	mahleri01	407500
1985, mcmurcr01	1985	ATL	NL	mcmurcr01	275000
1985, mumphje01	1985	ATL	NL	mumphje01	775000
1985, murphda05	1985	ATL	NL	murphda05	1625000
1985, oberkke01	1985	ATL	NL	oberkke01	616667
1985, perezpa01	1985	ATL	NL	perezpa01	450000
1985, perryge01	1985	ATL	NL	perryge01	120000
1985, ramirra01	1985	ATL	NL	ramirra01	750000
1985, suttebr01	1985	ATL	NL	suttebr01	1354167
1985, washicl01	1985	ATL	NL	washicl01	800000
1985, boddimi01	1985	BAL	AL	boddimi01	625000
1985, dauerri01	1985	BAL	AL	dauerri01	480000
1985, davisst02	1985	BAL	AL	davisst02	437500
1985, dempsri01	1985	BAL	AL	dempsri01	512500
1985, dwyerji01	1985	BAL	AL	dwyerji01	375000
1985, flanami01	1985	BAL	AL	flanami01	641667
1985, fordda01	1985	BAL	AL	fordda01	450000
1985, grosswa01	1985	BAL	AL	grosswa01	483333
...	...	...	...	...	...
2014, blevije01	2014	WAS	NL	blevije01	1675000
2014, cedenxa01	2014	WAS	NL	cedenxa01	507800
2014, clippty01	2014	WAS	NL	clippty01	5875000
2014, desmoia01	2014	WAS	NL	desmoia01	6500000
2014, detwiro01	2014	WAS	NL	detwiro01	3000000
2014, espinda01	2014	WAS	NL	espinda01	540580
2014, fistedo01	2014	WAS	NL	fistedo01	7200000
2014, frandke01	2014	WAS	NL	frandke01	900000
2014, gonzagi01	2014	WAS	NL	gonzagi01	8500000
2014, hairssc01	2014	WAS	NL	hairssc01	2500000
2014, harpebr03	2014	WAS	NL	harpebr03	2150000
2014, jordata01	2014	WAS	NL	jordata01	504300
2014, koberje01	2014	WAS	NL	koberje01	501000
2014, larocad01	2014	WAS	NL	larocad01	12000000
2014, leonsa01	2014	WAS	NL	leonsa01	501000
2014, lobatjo01	2014	WAS	NL	lobatjo01	900000
2014, matthry01	2014	WAS	NL	matthry01	520000
2014, mclouna01	2014	WAS	NL	mclouna01	5000000
2014, moorety01	2014	WAS	NL	moorety01	507900
2014, ramoswi01	2014	WAS	NL	ramoswi01	2095000
2014, rendoan01	2014	WAS	NL	rendoan01	1800000
2014, roarkta01	2014	WAS	NL	roarkta01	506100
2014, soriara01	2014	WAS	NL	soriara01	11000000
2014, spande01	2014	WAS	NL	spande01	6500000
2014, stammcr01	2014	WAS	NL	stammcr01	1375000
2014, storedr01	2014	WAS	NL	storedr01	3450000
2014, strasst01	2014	WAS	NL	strasst01	3975000
2014, werthja01	2014	WAS	NL	werthja01	20000000
2014, zimmejo02	2014	WAS	NL	zimmejo02	7500000
2014, zimmery01	2014	WAS	NL	zimmery01	14000000

24756 rows × 5 columns

Similarly reindex the award dataframe with the key value set.

# Set key value as (year, player ID) for playersDF
keyValue = ("{}, {}".format(playersDF["yearID"].iloc[i],playersDF["playerID"].iloc[i]) for i in range(len(playersDF)))

playersDFkv = pd.DataFrame(playersDF.values, columns = list(playersDF.columns.values), index = list(keyValue))
playersDFkv.head()

	awardID	yearID	lgID	playerID	pointsWon	pointsMax	votesFirst
1956, fordwh01	Cy Young	1956	ML	fordwh01	1	16	1
1956, maglisa01	Cy Young	1956	ML	maglisa01	4	16	4
1956, newcodo01	Cy Young	1956	ML	newcodo01	10	16	10
1956, spahnwa01	Cy Young	1956	ML	spahnwa01	1	16	1
1957, donovdi01	Cy Young	1957	ML	donovdi01	1	16	1

Create new database that joins the salary dataframe with the player dataframe.

salaryPlayerDF = pd.merge(salariesDFkv, playersDFkv, how = "inner")
salaryPlayerDF.head()

	yearID	teamID	lgID	playerID	salary	awardID	pointsWon	pointsMax
0	1985	ATL	NL	murphda05	1625000	MVP	63	336
1	1985	BAL	AL	murraed02	1472819	MVP	130	392
2	1985	BAL	AL	ripkeca01	800000	MVP	9	392
3	1985	BOS	AL	boggswa01	1000000	MVP	159	392
4	1985	BOS	AL	gedmari01	477500	MVP	1	392

I now have a merged dataframe so I can find the correlation between pointsWon and salary. First I am going to standardize each column for pointsWon and salary.

# Standardizing
def standard(sr):
    return (sr - sr.mean())/sr.std(ddof=0)

salaryPlayerDF = salaryPlayerDF[["pointsWon", "salary"]]

# Scatter plot for awardsWon and salary
%matplotlib inline
plt.scatter(salaryPlayerDF["pointsWon"], salaryPlayerDF["salary"])
plt.show()

png

This part is for looking at the trend of the salary change whether it increases or decreases.

""" data by year """
salariesDF_by_year = salariesDF[["yearID","salary"]].groupby("yearID").agg([np.sum,np.max,np.min,np.mean])
salariesDF_by_year.columns = ["sum", "max", "min", "mean"]

if True:
    %matplotlib inline
    plt.title('Sum by year')
    plt.xlabel('Year')
    plt.ylabel('Salary')
    plt.plot(salariesDF_by_year.index, salariesDF_by_year["sum"])
    plt.show()
    plt.title('Maximum by year')
    plt.xlabel('Year')
    plt.ylabel('Salary')
    plt.plot(salariesDF_by_year.index, salariesDF_by_year["max"])
    plt.show()
    plt.title('Minimum by year')
    plt.xlabel('Year')
    plt.ylabel('Salary')
    plt.plot(salariesDF_by_year.index, salariesDF_by_year["min"])
    plt.show()
    plt.title('Mean by year')
    plt.xlabel('Year')
    plt.ylabel('Salary')
    plt.plot(salariesDF_by_year.index, salariesDF_by_year["mean"])
    plt.show()

png

All of these factors increase as year gets close to 2014. But when you look at the second graph there is a peak around at 2010 and I decided to analyze that.

""" Finding maximum year """
maxSalaryByYear = salariesDF_by_year["max"].argmax()
print "\nMaximum salary occurs in {}, $ {}\n".format(maxSalaryByYear, salariesDF_by_year["max"].loc[maxSalaryByYear])

Maximum salary occurs in 2009, $ 33000000

""" Find player with the maximum in 2009 """
maxDFyear = salariesDF[salariesDF["yearID"]==2009]
maxPlayer2009 = maxDFyear[maxDFyear["salary"] == salariesDF_by_year["max"].loc[maxSalaryByYear]]
maxPlayerID = maxPlayer2009["playerID"].values[0]
print maxPlayerID
print "Detailed information of the player"
print maxPlayer2009, "\n"

rodrial01
Detailed information of the player
       yearID teamID lgID   playerID    salary
20286    2009    NYA   AL  rodrial01  33000000

The maximum value occurs in 2009 and the team id and player id is shown above. He had the highest salary of all time.

""" maximum player's yearly graph """
maxPlayerDF = salariesDF[salariesDF["playerID"] == maxPlayerID]

if True:
    %matplotlib inline
    plt.plot(maxPlayerDF["yearID"], maxPlayerDF["salary"])
    plt.title("Year data of {}".format(maxPlayerID))
    plt.show()

png

He started from very low salary and reached to the highest salary range in 2009. Although the salary goes down after 2010, he is the highest paid player.

The correlation between year and salary is calculated. The correlation will be between -1 and 1. I expect the range would be positive since there is an increase in salary as year goes to 2014.

""" Relationship between year and salary """
salaryYearDF = salariesDF[["yearID","salary"]]
standardizedDF = salaryYearDF.apply(standard)

if True:
    %matplotlib inline
    plt.scatter(standardizedDF["yearID"], standardizedDF["salary"])
    plt.title("Salaries by year")
    plt.show()

png

The code above first take yearID and salary columns and standardize those two columns. And then, the mean of product of each x coordinate and y coordinate is calculated. One interesting thing to note is that although the highest paid salaries is getting increased, the minimum salaries are almost consistent.

""" Calculating correlation """
correlation = (standardizedDF["yearID"] * standardizedDF["salary"]).mean()
print "Correlation between salaries and year:",correlation

Correlation between salaries and year: 0.345339080873

As expected the correlation comes out as a positive number. There is a positive correlation between year and players' salaries.

""" Relationship between year and salary """
salaryYearDF = salariesDF[["yearID","teamID","salary"]].groupby(["yearID","teamID"]).mean()

x = salaryYearDF.index.values
x = np.array(map(lambda x: int(x[0]), x))

y = salaryYearDF.values.reshape(1, len(salaryYearDF.values))[0]

x = (x - x.mean()) / x.std()
y = (y - y.mean()) / y.std()

if True:
    %matplotlib inline
    plt.scatter(x,y)
    plt.title("Means of teams by year")
    plt.show()

png

This graph is showing only the means of teams' salaries not the entire salaries of players. As you can see that the number of points have been reduced compared to the previous graph. The mean of each team is calculated by using "groupby" in pandas. Another notable difference is that the minimum salaries of each year is increasing. We might be able to expect more positive correlation. Since there will be less negative values when we calculate the mean.

""" Calculating correlation """
correlation = (x * y).mean()
print "Correlation between mean salaries by team and year:", correlation

Correlation between mean salaries by team and year: 0.725273264404

As expected, the correlation is calculated to be a positive number and is greater than the previous correlation value. So there is an obvious trend of increase in salaries.

Lastly, I decide to find the highest paid basedball team in 2014 because when you look at the scatter plot above, the highest paid team in 2014 is the highest paid of all time.

""" Find the country with the maximum mean in 2014 """
salaryYearDF = pd.DataFrame((salaryYearDF.index.values), index =np.array(range(len(salaryYearDF))), columns = ["mean"])

salaryYearDFyear = salaryYearDF["mean"].map(lambda x: x[0])
salaryYearDFteam = salaryYearDF["mean"].map(lambda x: x[1])
salaryYearDFsalary = pd.DataFrame(y, index =np.array(range(len(salaryYearDF))), columns = ["mean"])

maxidxSalary2014 = salaryYearDFsalary[salaryYearDFyear == 2014].idxmax()
maxTeam2014 = salaryYearDFteam[maxidxSalary2014.values[0]]
print "Team with the maximum mean of salaries in 2014:", maxTeam2014

Team with the maximum mean of salaries in 2014: SFN

Built With

html
jupyter-notebook
python

investigate-a-dataset