Investigating A Baseball Dataset

Source: 2014 Lahman's Baseball Database

Udacity Data Analyst Nanodegree - Project 2

I used "Salaries.csv" and "AwardsSharePlayers.csv" for my project and my question to this database is "What factors have an effect on team salaries?" There are total 5 columns and I am going to analyze yearID, teamID and igID to investigate how these columns are related to salaries.

First, I am going to import these two csv files and transform to Pandas DataFrame and do some data wrangling.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# import Salaries data
filename = 'Salaries.csv'
salariesDF = pd.read_csv(filename)

# import Players data
filename = 'AwardsSharePlayers.csv'
playersDF = pd.read_csv(filename)

salariesDF.head()
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
playersDF.head()
awardID yearID lgID playerID pointsWon pointsMax votesFirst
0 Cy Young 1956 ML fordwh01 1.0 16 1.0
1 Cy Young 1956 ML maglisa01 4.0 16 4.0
2 Cy Young 1956 ML newcodo01 10.0 16 10.0
3 Cy Young 1956 ML spahnwa01 1.0 16 1.0
4 Cy Young 1957 ML donovdi01 1.0 16 1.0

And I am going to see if there is any null value in the data. First salaries data.

pd.isnull(salariesDF).sum()
yearID      0
teamID      0
lgID        0
playerID    0
salary      0
dtype: int64
salariesDF.describe()
yearID salary
count 24758.000000 2.475800e+04
mean 1999.891752 1.932905e+06
std 8.324492 3.190334e+06
min 1985.000000 0.000000e+00
25% 1993.000000 2.600000e+05
50% 2000.000000 5.250000e+05
75% 2007.000000 2.199643e+06
max 2014.000000 3.300000e+07

Although there is no null value in the data, there are some zeros in salary column. I am going to find these rows with zero salary.

salariesDF[salariesDF["salary"] == 0]
yearID teamID lgID playerID salary
6179 1993 NYA AL jamesdi01 0
12007 1999 PIT NL martija02 0

There are total two players with 0 salary. I am going to find more details about those two players.

salariesDF[salariesDF["playerID"] == "jamesdi01"]
yearID teamID lgID playerID salary
1302 1987 ATL NL jamesdi01 120000
1929 1988 ATL NL jamesdi01 360000
2591 1989 ATL NL jamesdi01 400000
3536 1990 CLE AL jamesdi01 650000
5293 1992 NYA AL jamesdi01 387500
6179 1993 NYA AL jamesdi01 0
8019 1995 NYA AL jamesdi01 350000
salariesDF[salariesDF["playerID"] == "martija02"]
yearID teamID lgID playerID salary
11011 1998 PIT NL martija02 170000
12007 1999 PIT NL martija02 0

Clearly for jamesdi01, the 0 value is wrong. I am going to assume the salary in 1993 as the average of 1992 and 1995. For martija02, I will ignore these two rows from the database.

# Taking care of jamesdi01
salariesDF["salary"].iloc[6179] = (387500 + 350000) / 2
salariesDF[salariesDF["playerID"] == "jamesdi01"]
yearID teamID lgID playerID salary
1302 1987 ATL NL jamesdi01 120000
1929 1988 ATL NL jamesdi01 360000
2591 1989 ATL NL jamesdi01 400000
3536 1990 CLE AL jamesdi01 650000
5293 1992 NYA AL jamesdi01 387500
6179 1993 NYA AL jamesdi01 368750
8019 1995 NYA AL jamesdi01 350000
# Taking care of martija02
salariesDF = salariesDF[salariesDF["playerID"] != "martija02"]
salariesDF.describe()
yearID salary
count 24756.000000 2.475600e+04
mean 1999.891865 1.933069e+06
std 8.324817 3.190411e+06
min 1985.000000 1.090000e+04
25% 1993.000000 2.600000e+05
50% 2000.000000 5.250000e+05
75% 2007.000000 2.200000e+06
max 2014.000000 3.300000e+07

Now I am going to clean the AwardsSharePlayers data.

# Printout the total number of null values of each column
pd.isnull(playersDF).sum()
awardID         0
yearID          0
lgID            0
playerID        0
pointsWon       0
pointsMax       0
votesFirst    358
dtype: int64
playersDF.describe()
yearID pointsWon pointsMax votesFirst
count 6705.000000 6705.000000 6705.000000 6347.000000
mean 1971.344817 43.168829 266.107233 1.612573
std 27.172773 67.714919 128.980771 4.868226
min 1911.000000 0.000000 16.000000 0.000000
25% 1950.000000 4.000000 140.000000 0.000000
50% 1974.000000 12.000000 336.000000 0.000000
75% 1994.000000 52.000000 336.000000 0.000000
max 2014.000000 448.000000 448.000000 32.000000

Although there are 358 null values in "votesFirst" column, I don't plan to use this column. Since other columns don't have a null value, I can leave this data as it is.

Now salary is investigated using the single-variable (1d) exploration

# single-variable investigation - Salary
# plot boxplot and histogram
%matplotlib inline
plt.boxplot(salariesDF["salary"])
plt.ylabel("Salary")
plt.show()
plt.hist(salariesDF["salary"]/1000000, bins = 10)
plt.ylabel("Frequency")
plt.xlabel("Salary in millions")
plt.show()

png

png

In histogram, you can see that the most of the salaries are below 5 millions and very little number of people are more than 5 millions range. However, the box plot is very difficult to understand because there are too many points and most of them seem to be outliers. In order to decrease the number of points, the set of the mean salaries per year is used to plot the boxplot.

# Get a dataframe of the set of the mean salaries per year
salaryYearDF = salariesDF[["yearID","salary"]].groupby("yearID").mean()
salaryYearDF
salary
yearID
1985 476299
1986 417147
1987 434729
1988 453171
1989 506323
1990 511973
1991 894961
1992 1047520
1993 977366
1994 1049588
1995 964979
1996 1027909
1997 1218687
1998 1281958
1999 1486794
2000 1992984
2001 2279841
2002 2392526
2003 2573472
2004 2491776
2005 2633830
2006 2834520
2007 2941435
2008 3136517
2009 3277646
2010 3278746
2011 3318838
2012 3458421
2013 3723344
2014 3980445

Now plot the boxplot using this dataframe.

# plotting box plot and histogram
%matplotlib inline
plt.boxplot(salaryYearDF.values / 1000)
plt.ylabel("Salary in thousands")
plt.show()

meanYearDF = float(salaryYearDF.mean())
medianYearDF = float(salaryYearDF.median())
print "Mean:", meanYearDF
print "Median:", medianYearDF

png

Mean: 1902124.83333
Median: 1739889.0

Mean is greater than Median as expected since the historam above shows the right skewed graph.

Now I am going to investigate the relation between the teamID and salaries. Find sum, max, min, mean of salaries by team ids.

# Database with two columns: teamID and salary
salariesDF_reduced = salariesDF[["teamID","salary"]]

salariesDF_summary_by_teamID = salariesDF_reduced.groupby("teamID").agg([np.sum,np.max,np.min,np.mean])
salariesDF_summary_by_teamID.columns = ["sum", "max", "min", "mean"]
salariesDF_summary_by_teamID.head()
sum max min mean
teamID
ANA 468091973 13166667 150000 1895109
ARI 1209975228 16000000 170000 2499948
ATL 1882946784 16061802 60000 2184393
BAL 1630200038 17000000 60000 1846206
BOS 2432700935 22500000 60000 2755040
def bargraph(column): # column = sum, max, min, mean
    frequency = salariesDF_summary_by_teamID[column].sort_values(axis=0, ascending = False).head(10)
    objects = frequency.index
    y_pos = np.arange(len(objects))
    plt.bar(y_pos, frequency, align='center')
    plt.xticks(y_pos, objects)
    plt.ylabel('Salary')
    plt.title("Top 10 {} of salaries by teamID".format(column))
    print "Top 10 {} of salaries by teamID".format(column)
    print frequency
    plt.show()

Change False to True to see the bargraphs.

""" Bar graph of the number of players by teamID """
if True:
    %matplotlib inline
    bargraph("sum")
    bargraph("max")
    bargraph("min")
    bargraph("mean")

Top 10 sum of salaries by teamID
teamID
NYA    3283488084
BOS    2432700935
LAN    2237766703
PHI    1982355800
NYN    1965737231
ATL    1882946784
CHN    1859833315
SLN    1775631332
DET    1771197687
CHA    1699865143
Name: sum, dtype: int64

png

Top 10 max of salaries by teamID
teamID
NYA    33000000
LAA    26187500
LAN    26000000
PHI    25000000
TEX    24000000
SEA    24000000
NYN    23145011
DET    23000000
MIN    23000000
BOS    22500000
Name: max, dtype: int64

png

Top 10 min of salaries by teamID
teamID
NYM    501560
SFG    500500
MIA    480000
WAS    316000
LAA    316000
ARI    170000
MIL    170000
TBA    170000
ANA    150000
COL    109000
Name: min, dtype: int64

png

Top 10 mean of salaries by teamID
teamID
SFG    5315191
LAA    4186119
NYA    3735481
BOS    2755040
LAN    2505897
ARI    2499948
MIA    2480984
WAS    2466758
NYN    2345748
NYM    2283624
Name: mean, dtype: int64

png

multiple-variable (2d) explorations

I believe that the player' pointsWon affects salaries. (i.e. palyers with more points have higher salaries) So I set the pointsWon as an independent variable and salary as a dependent variable.

First, I reindex the salary dataframe with a key value set (year, player ID) so that I can join the two dataframes.

# Set key value as (year, player ID) for salariesDF
keyValue = ("{}, {}".format(salariesDF["yearID"].iloc[i],salariesDF["playerID"].iloc[i]) for i in range(len(salariesDF)))

salariesDFkv = pd.DataFrame(salariesDF.values, columns = list(salariesDF.columns.values), index = list(keyValue))
salariesDFkv
yearID teamID lgID playerID salary
1985, barkele01 1985 ATL NL barkele01 870000
1985, bedrost01 1985 ATL NL bedrost01 550000
1985, benedbr01 1985 ATL NL benedbr01 545000
1985, campri01 1985 ATL NL campri01 633333
1985, ceronri01 1985 ATL NL ceronri01 625000
1985, chambch01 1985 ATL NL chambch01 800000
1985, dedmoje01 1985 ATL NL dedmoje01 150000
1985, forstte01 1985 ATL NL forstte01 483333
1985, garbege01 1985 ATL NL garbege01 772000
1985, harpete01 1985 ATL NL harpete01 250000
1985, hornebo01 1985 ATL NL hornebo01 1500000
1985, hubbagl01 1985 ATL NL hubbagl01 455000
1985, mahleri01 1985 ATL NL mahleri01 407500
1985, mcmurcr01 1985 ATL NL mcmurcr01 275000
1985, mumphje01 1985 ATL NL mumphje01 775000
1985, murphda05 1985 ATL NL murphda05 1625000
1985, oberkke01 1985 ATL NL oberkke01 616667
1985, perezpa01 1985 ATL NL perezpa01 450000
1985, perryge01 1985 ATL NL perryge01 120000
1985, ramirra01 1985 ATL NL ramirra01 750000
1985, suttebr01 1985 ATL NL suttebr01 1354167
1985, washicl01 1985 ATL NL washicl01 800000
1985, boddimi01 1985 BAL AL boddimi01 625000
1985, dauerri01 1985 BAL AL dauerri01 480000
1985, davisst02 1985 BAL AL davisst02 437500
1985, dempsri01 1985 BAL AL dempsri01 512500
1985, dwyerji01 1985 BAL AL dwyerji01 375000
1985, flanami01 1985 BAL AL flanami01 641667
1985, fordda01 1985 BAL AL fordda01 450000
1985, grosswa01 1985 BAL AL grosswa01 483333
... ... ... ... ... ...
2014, blevije01 2014 WAS NL blevije01 1675000
2014, cedenxa01 2014 WAS NL cedenxa01 507800
2014, clippty01 2014 WAS NL clippty01 5875000
2014, desmoia01 2014 WAS NL desmoia01 6500000
2014, detwiro01 2014 WAS NL detwiro01 3000000
2014, espinda01 2014 WAS NL espinda01 540580
2014, fistedo01 2014 WAS NL fistedo01 7200000
2014, frandke01 2014 WAS NL frandke01 900000
2014, gonzagi01 2014 WAS NL gonzagi01 8500000
2014, hairssc01 2014 WAS NL hairssc01 2500000
2014, harpebr03 2014 WAS NL harpebr03 2150000
2014, jordata01 2014 WAS NL jordata01 504300
2014, koberje01 2014 WAS NL koberje01 501000
2014, larocad01 2014 WAS NL larocad01 12000000
2014, leonsa01 2014 WAS NL leonsa01 501000
2014, lobatjo01 2014 WAS NL lobatjo01 900000
2014, matthry01 2014 WAS NL matthry01 520000
2014, mclouna01 2014 WAS NL mclouna01 5000000
2014, moorety01 2014 WAS NL moorety01 507900
2014, ramoswi01 2014 WAS NL ramoswi01 2095000
2014, rendoan01 2014 WAS NL rendoan01 1800000
2014, roarkta01 2014 WAS NL roarkta01 506100
2014, soriara01 2014 WAS NL soriara01 11000000
2014, spande01 2014 WAS NL spande01 6500000
2014, stammcr01 2014 WAS NL stammcr01 1375000
2014, storedr01 2014 WAS NL storedr01 3450000
2014, strasst01 2014 WAS NL strasst01 3975000
2014, werthja01 2014 WAS NL werthja01 20000000
2014, zimmejo02 2014 WAS NL zimmejo02 7500000
2014, zimmery01 2014 WAS NL zimmery01 14000000

24756 rows × 5 columns

Similarly reindex the award dataframe with the key value set.

# Set key value as (year, player ID) for playersDF
keyValue = ("{}, {}".format(playersDF["yearID"].iloc[i],playersDF["playerID"].iloc[i]) for i in range(len(playersDF)))

playersDFkv = pd.DataFrame(playersDF.values, columns = list(playersDF.columns.values), index = list(keyValue))
playersDFkv.head()
awardID yearID lgID playerID pointsWon pointsMax votesFirst
1956, fordwh01 Cy Young 1956 ML fordwh01 1 16 1
1956, maglisa01 Cy Young 1956 ML maglisa01 4 16 4
1956, newcodo01 Cy Young 1956 ML newcodo01 10 16 10
1956, spahnwa01 Cy Young 1956 ML spahnwa01 1 16 1
1957, donovdi01 Cy Young 1957 ML donovdi01 1 16 1

Create new database that joins the salary dataframe with the player dataframe.

salaryPlayerDF = pd.merge(salariesDFkv, playersDFkv, how = "inner")
salaryPlayerDF.head()
yearID teamID lgID playerID salary awardID pointsWon pointsMax votesFirst
0 1985 ATL NL murphda05 1625000 MVP 63 336 0
1 1985 BAL AL murraed02 1472819 MVP 130 392 0
2 1985 BAL AL ripkeca01 800000 MVP 9 392 0
3 1985 BOS AL boggswa01 1000000 MVP 159 392 0
4 1985 BOS AL gedmari01 477500 MVP 1 392 0

I now have a merged dataframe so I can find the correlation between pointsWon and salary. First I am going to standardize each column for pointsWon and salary.

# Standardizing
def standard(sr):
    return (sr - sr.mean())/sr.std(ddof=0)

salaryPlayerDF = salaryPlayerDF[["pointsWon", "salary"]]
# Scatter plot for awardsWon and salary
%matplotlib inline
plt.scatter(salaryPlayerDF["pointsWon"], salaryPlayerDF["salary"])
plt.show()

png

This part is for looking at the trend of the salary change whether it increases or decreases.

""" data by year """
salariesDF_by_year = salariesDF[["yearID","salary"]].groupby("yearID").agg([np.sum,np.max,np.min,np.mean])
salariesDF_by_year.columns = ["sum", "max", "min", "mean"]

if True:
    %matplotlib inline
    plt.title('Sum by year')
    plt.xlabel('Year')
    plt.ylabel('Salary')
    plt.plot(salariesDF_by_year.index, salariesDF_by_year["sum"])
    plt.show()
    plt.title('Maximum by year')
    plt.xlabel('Year')
    plt.ylabel('Salary')
    plt.plot(salariesDF_by_year.index, salariesDF_by_year["max"])
    plt.show()
    plt.title('Minimum by year')
    plt.xlabel('Year')
    plt.ylabel('Salary')
    plt.plot(salariesDF_by_year.index, salariesDF_by_year["min"])
    plt.show()
    plt.title('Mean by year')
    plt.xlabel('Year')
    plt.ylabel('Salary')
    plt.plot(salariesDF_by_year.index, salariesDF_by_year["mean"])
    plt.show()

png

png

png

png

All of these factors increase as year gets close to 2014. But when you look at the second graph there is a peak around at 2010 and I decided to analyze that.

""" Finding maximum year """
maxSalaryByYear = salariesDF_by_year["max"].argmax()
print "\nMaximum salary occurs in {}, $ {}\n".format(maxSalaryByYear, salariesDF_by_year["max"].loc[maxSalaryByYear])
Maximum salary occurs in 2009, $ 33000000
""" Find player with the maximum in 2009 """
maxDFyear = salariesDF[salariesDF["yearID"]==2009]
maxPlayer2009 = maxDFyear[maxDFyear["salary"] == salariesDF_by_year["max"].loc[maxSalaryByYear]]
maxPlayerID = maxPlayer2009["playerID"].values[0]
print maxPlayerID
print "Detailed information of the player"
print maxPlayer2009, "\n"

rodrial01
Detailed information of the player
       yearID teamID lgID   playerID    salary
20286    2009    NYA   AL  rodrial01  33000000 

The maximum value occurs in 2009 and the team id and player id is shown above. He had the highest salary of all time.

""" maximum player's yearly graph """
maxPlayerDF = salariesDF[salariesDF["playerID"] == maxPlayerID]

if True:
    %matplotlib inline
    plt.plot(maxPlayerDF["yearID"], maxPlayerDF["salary"])
    plt.title("Year data of {}".format(maxPlayerID))
    plt.show()

png

He started from very low salary and reached to the highest salary range in 2009. Although the salary goes down after 2010, he is the highest paid player.

The correlation between year and salary is calculated. The correlation will be between -1 and 1. I expect the range would be positive since there is an increase in salary as year goes to 2014.

""" Relationship between year and salary """
salaryYearDF = salariesDF[["yearID","salary"]]
standardizedDF = salaryYearDF.apply(standard)

if True:
    %matplotlib inline
    plt.scatter(standardizedDF["yearID"], standardizedDF["salary"])
    plt.title("Salaries by year")
    plt.show()

png

The code above first take yearID and salary columns and standardize those two columns. And then, the mean of product of each x coordinate and y coordinate is calculated. One interesting thing to note is that although the highest paid salaries is getting increased, the minimum salaries are almost consistent.

""" Calculating correlation """
correlation = (standardizedDF["yearID"] * standardizedDF["salary"]).mean()
print "Correlation between salaries and year:",correlation
Correlation between salaries and year: 0.345339080873

As expected the correlation comes out as a positive number. There is a positive correlation between year and players' salaries.

""" Relationship between year and salary """
salaryYearDF = salariesDF[["yearID","teamID","salary"]].groupby(["yearID","teamID"]).mean()

x = salaryYearDF.index.values
x = np.array(map(lambda x: int(x[0]), x))

y = salaryYearDF.values.reshape(1, len(salaryYearDF.values))[0]

x = (x - x.mean()) / x.std()
y = (y - y.mean()) / y.std()

if True:
    %matplotlib inline
    plt.scatter(x,y)
    plt.title("Means of teams by year")
    plt.show()

png

This graph is showing only the means of teams' salaries not the entire salaries of players. As you can see that the number of points have been reduced compared to the previous graph. The mean of each team is calculated by using "groupby" in pandas. Another notable difference is that the minimum salaries of each year is increasing. We might be able to expect more positive correlation. Since there will be less negative values when we calculate the mean.

""" Calculating correlation """
correlation = (x * y).mean()
print "Correlation between mean salaries by team and year:", correlation

Correlation between mean salaries by team and year: 0.725273264404

As expected, the correlation is calculated to be a positive number and is greater than the previous correlation value. So there is an obvious trend of increase in salaries.

Lastly, I decide to find the highest paid basedball team in 2014 because when you look at the scatter plot above, the highest paid team in 2014 is the highest paid of all time.

""" Find the country with the maximum mean in 2014 """
salaryYearDF = pd.DataFrame((salaryYearDF.index.values), index =np.array(range(len(salaryYearDF))), columns = ["mean"])

salaryYearDFyear = salaryYearDF["mean"].map(lambda x: x[0])
salaryYearDFteam = salaryYearDF["mean"].map(lambda x: x[1])
salaryYearDFsalary = pd.DataFrame(y, index =np.array(range(len(salaryYearDF))), columns = ["mean"])

maxidxSalary2014 = salaryYearDFsalary[salaryYearDFyear == 2014].idxmax()
maxTeam2014 = salaryYearDFteam[maxidxSalary2014.values[0]]
print "Team with the maximum mean of salaries in 2014:", maxTeam2014

Team with the maximum mean of salaries in 2014: SFN

Built With

Share this project:

Updates