S&DS 230 Final Project: Per Game Team Statistics in the NBA
Daniel Tokarz
May 1, 2018
Part 1: Introduction
In this project, I’ve set out to create a model that can predict team win percentage in the NBA. The National Basketball Association consists of 30 teams that each play 82 games throughout the regular season, which lasts from late October to mid April. The league keeps track of numerous statistics throughout the season. Beyond the obvious points scored and points against, these also include statistics like steals, blocks, and free throw percentage. Check out the Data section for a more complete explanation of each variable. With win percentage as the response variable, I’d like to find out which statistics are most highly correlated with win percentage and are thus most useful for prediction. Based on previous knowledge, my hypothesis is that point differential (points for minus points against) will be the most significant predictor. This model can be useful in finding out which statistics are most important to pay attention to when it comes to predicting future success.
Part 2: Data
The data that I’ve used in this project comes from the team statistics section of the NBA website. Click here to see an example of one of the pages I used. This is the data for the 2017/2018 season. I downloaded this grid for each of the last 10 NBA seasons, dating back to the 2008/2009 season. I pasted them into an Excel sheet and saved it as a CSV. This raw data can be viewed in a Google Sheet here. After putting the raw data into a CSV, I got to work cleaning the data. All the statistics here are kept as per game values. This is important because the 2011/2012 season was shortened, so teams played fewer games, meaning the totals for their statistics would have been lower and thus would have thrown off any analysis of their win percentage. Using the per game values accounts for this.
#I'll need these libraries throughout the project
library(car)
library(leaps)
source("http://www.reuningscherer.net/s&ds230/Rfuncs/regJDRS.txt")
#This navigates to the folder where the file is stored
setwd("C:/Users/Daniel/Documents/programs/SPRING_2018_HW/STAT230/final_project")
#This reads in the csv to a data frame and keeps the strings not as factors
nba_team_data <- read.csv("nba_team_per_game.csv", header = TRUE, as.is = TRUE)
Data Cleaning I: Fixing / Removing Variables
Now that we have our data entered into R, we can get to work cleaning it.
#In our csv, the headers are repeated every 30 rows. The first thing we do is delete these rows
dead_slots <- c(31,62,93,124,155,186,217,248,279)
nba_team_data <- nba_team_data[-(dead_slots),]
#We have an excess column on the right as well as a column without a name and one with an incorrect name. We fix those problems here
nba_team_data <- nba_team_data[,1:27]
names(nba_team_data)[27] <- "PTD"
names(nba_team_data)[1] <- "TEAM"
#Most of the data is numeric, so we convert it to this format
for(i in 2:27){
  nba_team_data[,i] <- as.numeric(nba_team_data[,i])
}
#4 teams have changed their names throughout the last 10 years.
#We can recode the TEAM column to account for this
nba_team_data$TEAM <- recode(nba_team_data$TEAM," 'LA Clippers' = 'Los Angeles Clippers'; 'New Jersey Nets' = 'Brooklyn Nets'; 'Charlotte Bobcats' = 'Charlotte Hornets'; 'New Orleans Hornets' = 'New Orleans Pelicans'")
#Every team plays 82 games, loses (82 - wins) games, and has a win% equal to wins / 82.
#Because of this, games played, wins, and losses carry no information beyond win%. We delete those variables here.
nba_team_data <- nba_team_data[,-c(2,3,4)]
Data Cleaning II: Adding New Variables
There are two categorical variables I would like to add: year and conference.
Year could be useful because certain years have more good teams. For example, if there were more good teams in 2015, a team that scores the same number of points would probably lose more games. It is not continuous because, for this project, 2009 is as related to 2010 as it is to 2018.
The NBA is divided into two conferences: EAST and WEST. Teams play the teams within their own conference more often to cut down on travel time. This variable could be useful because the conferences could be lopsided. If there are better teams in the WEST, a team in the WEST that scores the same number of points as a team in the EAST would probably win fewer games since it plays tougher opponents.
#First we make our two variables. They will start as blank until we fill them in.
#The default for conference is east.
nba_team_data$CONF <- rep("east",300)
nba_team_data$YEAR <- rep(0,300)
#This vector contains the teams in the west. We will switch the teams that match an entry in this vector to west
west <- c("Golden State Warriors", "Dallas Mavericks", "Memphis Grizzlies", "Phoenix Suns", "Sacramento Kings", "San Antonio Spurs", "Utah Jazz", "Los Angeles Lakers","Oklahoma City Thunder", "Minnesota Timberwolves", "New Orleans Pelicans", "Denver Nuggets", "Houston Rockets", "Los Angeles Clippers", "Portland Trail Blazers")
#We go through each row and fill in the correct value for each categorical variable
for(i in 1:300){
  #This line calculates the year for a given row based on the premise that the data is simply stacked season on season
  nba_team_data$YEAR[i] <- floor((300 - i) / 30) + 2009
  #If a team matches the name of a west team, the conference switches to west
  if(is.element(nba_team_data$TEAM[i], west)){
    nba_team_data$CONF[i] <- "west"
  }
}
#This converts conference to a factor (categorical variable)
nba_team_data$CONF <- as.factor(nba_team_data$CONF)
Explaining Our Variables
Now that we have cleaned our data, we can start by looking at what kind of information we have to work with.
#In our updated data frame, we have 26 variables. The names are below
names(nba_team_data)
## [1] "TEAM" "WIN." "MIN" "PTS" "FGM" "FGA" "FG." "X3PM" "X3PA" "X3P."
## [11] "FTM" "FTA" "FT." "OREB" "DREB" "REB" "AST" "TOV" "STL" "BLK"
## [21] "BLKA" "PF" "PFD" "PTD" "CONF" "YEAR"
The explanation for each variable is listed below:

TEAM = Team Name (corresponds to name in the 2017-2018 season)

WIN. = Win Percentage (percentage of played games that were won)

MIN = Minutes Played (higher for teams that have played overtime games)

PTS = Points Scored (points scored per game)

FGM = Field Goals Made (baskets made per game, not counting free throws)

FGA = Field Goals Attempted (shots taken per game, not counting free throws)

FG. = Field Goal Percentage (column 5 divided by column 6)

X3PM = 3 Point Shots Made (shots made from beyond the 3 point arc)

X3PA = 3 Point Shots Attempted (shots taken from beyond the 3 point arc)

X3P. = 3 Point Shot Percentage (column 8 divided by column 9)

FTM = Free Throws Made (free foul shots made per game)

FTA = Free Throws Attempted (free foul shots taken per game)

FT. = Free Throw Percentage (column 11 divided by column 12)

OREB = Offensive Rebounds (missed team shots that were recovered)

DREB = Defensive Rebounds (missed opponents’ shots that were recovered)

REB = Rebounds (column 14 added to column 15)

AST = Assists (field goals that were assisted on)

TOV = Turnovers (possessions given away)

STL = Steals (opponents’ possessions taken away)

BLK = Blocks (opponents’ field goal attempts blocked)

BLKA = Attempted Shots That Were Blocked (own field goal attempts blocked by opponents)

PF = Personal Fouls (fouls committed by the team)

PFD = Personal Fouls Drawn (fouls committed by opponents)

PTD = Point Differential (points scored minus points given up)

CONF = Conference (either “east” or “west”)

YEAR = Year (year that the season ends, from 2009 to 2018)
Here is a quick look at what the start of the dataset looks like.
#This shows the first few rows of the frame
head(nba_team_data)
## TEAM WIN. MIN PTS FGM FGA FG. X3PM X3PA X3P.
## 1 Houston Rockets 0.793 48.2 112.4 38.7 84.2 46.0 15.3 42.3 36.2
## 2 Toronto Raptors 0.720 48.4 111.7 41.3 87.4 47.2 11.8 33.0 35.8
## 3 Golden State Warriors 0.707 48.1 113.5 42.8 85.1 50.3 11.3 28.9 39.1
## 4 Boston Celtics 0.671 48.3 104.0 38.3 85.1 45.0 11.5 30.4 37.7
## 5 Philadelphia 76ers 0.634 48.2 109.8 40.8 86.6 47.2 11.0 29.8 36.9
## 6 Cleveland Cavaliers 0.610 48.1 110.9 40.4 84.8 47.6 12.0 32.1 37.2
## FTM FTA FT. OREB DREB REB AST TOV STL BLK BLKA PF PFD PTD CONF
## 1 19.6 25.1 78.1 9.0 34.5 43.5 21.5 13.8 8.5 4.8 4.4 19.5 20.4 8.5 west
## 2 17.3 21.8 79.4 9.8 34.2 44.0 24.3 13.4 7.6 6.1 4.9 21.7 19.9 7.8 east
## 3 16.6 20.3 81.5 8.4 35.1 43.5 29.3 15.4 8.0 7.5 3.7 19.6 18.5 6.0 west
## 4 16.0 20.7 77.1 9.4 35.1 44.5 22.5 14.0 7.4 4.5 4.4 19.7 19.2 3.6 east
## 5 17.1 22.8 75.2 10.9 36.5 47.4 27.1 16.5 8.3 5.1 5.1 22.1 20.4 4.5 east
## 6 18.1 23.3 77.9 8.5 33.7 42.1 23.4 13.7 7.1 3.8 4.1 18.6 20.7 0.9 east
## YEAR
## 1 2018
## 2 2018
## 3 2018
## 4 2018
## 5 2018
## 6 2018
Now that we have a pretty clean dataset and are more familiar with the included information, we can get to work doing some analysis.
Part 3: Findings / Analysis
Case Study 2: Does Conference Matter?
One of the cries often heard throughout the NBA fanbase today is that the eastern conference is soft compared to the western conference. We can perform a two sample t-test on win% in each conference to see if there has been a statistically significant difference over the last ten years.
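A side-by-side boxplot appears at this point in the knitted report; it is not reproduced in this text, but a minimal sketch that could recreate it (assuming the cleaned nba_team_data frame from above; the colors are an arbitrary choice) is:

```r
#Boxplots of win percentage, separated by conference
boxplot(nba_team_data$WIN. ~ nba_team_data$CONF,
        col = c("lightblue", "salmon"),
        main = "Win Percentage by Conference",
        xlab = "Conference", ylab = "Win Percentage")
```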
There certainly does seem to be a difference between the win percentages in each conference. We can use a two sample t-test to check whether this difference is significant.
#We can perform a two sample t-test on our data
t.test(nba_team_data$WIN.~nba_team_data$CONF)
##
## Welch Two Sample t-test
##
## data: nba_team_data$WIN. by nba_team_data$CONF
## t = -2.2778, df = 297.2, p-value = 0.02345
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.076075635 -0.005551032
## sample estimates:
## mean in group east mean in group west
## 0.4796000 0.5204133
Because the p-value from this t-test is less than .05, we can say that we are 95% confident that there is a difference in mean win% between the eastern and western conferences.
Bootstrapping
We can use bootstrapping to visualize this test. Bootstrapping a large number of samples creates a distribution that tells us the probability that the true difference between the win percentages by conference is less than 0. We do so by checking whether or not the middle 95% of our samples falls entirely on one side of 0.
#This shows the procedure used to create the bootstrapped confidence interval
N <- 20000
diffWN <- rep(NA,N)
for(i in 1:N){
  se <- sample(nba_team_data$WIN.[nba_team_data$CONF == "east"], sum(nba_team_data$CONF == "east"), replace = TRUE)
  sw <- sample(nba_team_data$WIN.[nba_team_data$CONF == "west"], sum(nba_team_data$CONF == "west"), replace = TRUE)
  diffWN[i] <- mean(se) - mean(sw)
}
(ci <- quantile(diffWN, c(0.025, .5, 0.975)))
## 2.5% 50% 97.5%
## -0.076126667 -0.040936667 -0.005446333
Now that we have created a bootstrapped confidence interval for the difference in win percentage by conference, we can plot our data to get a clearer look at what it means to say we are 95% confident that there is a significant difference.
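The plot referenced here is rendered in the knitted report; a sketch of code that could produce it (assuming the diffWN vector and ci quantiles computed above) is:

```r
#Histogram of the bootstrapped differences in mean win% (east minus west)
hist(diffWN, breaks = 50, col = "lightblue",
     main = "Bootstrapped Differences in Mean Win%",
     xlab = "Difference in Mean Win% (east - west)")
#Dashed lines mark the bounds of the 95% confidence interval
abline(v = ci[c(1, 3)], lwd = 2, lty = 2, col = "red")
```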
We can see that 0 is not included in this confidence interval, which shows that there is a statistically significant difference in win percentage by conference.
Perhaps there are more high powered offenses in the west, and this is a possible explanation for the discrepancy in win percentage. We can look at the breakdown of points per game scored by teams in different conferences. We can start with a boxplot to get a general sense of the distributions.
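The boxplot described here appears in the knitted report; a sketch that could recreate it (again assuming the cleaned data frame from above) is:

```r
#Boxplots of points per game, separated by conference
boxplot(nba_team_data$PTS ~ nba_team_data$CONF,
        col = c("lightblue", "salmon"),
        main = "Points Per Game by Conference",
        xlab = "Conference", ylab = "Points Per Game")
```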
It looks like there is a difference between points per game separated by conference. A two sample t-test can tell us how significant this difference is.
#We can perform a two sample t-test on our data
t.test(nba_team_data$PTS~nba_team_data$CONF)
##
## Welch Two Sample t-test
##
## data: nba_team_data$PTS by nba_team_data$CONF
## t = -5.4225, df = 297.4, p-value = 1.218e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.059703 -1.897630
## sample estimates:
## mean in group east mean in group west
## 99.50667 102.48533
Because the p-value is less than .05, we can say that we are 95% confident that the mean points per game scored by eastern conference teams is lower than that for western conference teams. One way to visualize this is with a permutation test.
Permutation Test
We can use a permutation test to see how likely it is that the observed difference in points per game is due to random chance. We do this by taking many random samples and seeing what proportion of the samples has a difference in mean points per game by conference equal to or larger than our observed difference.
#This shows the procedure used in performing our permutation test
N <- 20000
diffvals <- rep(NA,N)
for(i in 1:N){
  fakecon <- sample(nba_team_data$CONF)
  diffvals[i] <- mean(nba_team_data$PTS[fakecon == "east"]) - mean(nba_team_data$PTS[fakecon == "west"])
}
(cip <- quantile(diffvals, c(0.025, .5, 0.975)))
(mean(nba_team_data$PTS[nba_team_data$CONF == "west"]) - mean(nba_team_data$PTS[nba_team_data$CONF == "east"]))
## [1] 2.978667
As we can see, the observed difference is not within the 95% confidence interval generated by our permutation test, meaning it is quite unlikely that the observed difference is due to random chance. We can see this on a histogram.
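The histogram is rendered in the knitted report; a sketch that could produce it (assuming the diffvals vector from the permutation loop above) is:

```r
#Histogram of the permuted differences in mean points per game
hist(diffvals, breaks = 50, col = "lightblue",
     main = "Permuted Differences in Mean PPG",
     xlab = "Difference in Mean PPG (east - west)")
#The blue line marks the observed east - west difference
obs <- mean(nba_team_data$PTS[nba_team_data$CONF == "east"]) -
  mean(nba_team_data$PTS[nba_team_data$CONF == "west"])
abline(v = obs, lwd = 2, col = "blue")
```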
This histogram displays the results of our permutation test as well as the actual observed difference in points per game by conference. As we can see, almost none of the random samples produced a difference in mean points per game as large as the observed difference. This shows that there is a very low chance that the observed difference happened by chance, leading us to conclude that there is a significant difference in points per game scored by each conference.
Case Study 3: Have Different Seasons Been Significantly Different?
The other categorical variable we created was season. By virtue of the teams all playing each other, with every game having 1 winner and 1 loser, it is not very interesting to look at mean win percentage by year. Something we can look at, however, is points per game. The boxplots showing points per game by year (pictured below) suggest that there is a difference.
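The boxplots by year are pictured in the knitted report; a sketch that could recreate them is:

```r
#Boxplots of points per game for each season; the formula call groups PTS by YEAR
boxplot(nba_team_data$PTS ~ nba_team_data$YEAR,
        col = "lightblue",
        main = "Points Per Game by Season",
        xlab = "Year Season Ended", ylab = "Points Per Game")
```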
In order to test whether there is a significant difference in points per game by year, we can use an ANOVA test, which tests whether at least one of the mean points per game values is significantly different from the rest.
#This performs an ANOVA test on the distribution of points per game separated by season
summary(aov(nba_team_data$PTS~nba_team_data$YEAR))
## Df Sum Sq Mean Sq F value Pr(>F)
## nba_team_data$YEAR 1 1380 1380.1 68.21 4.84e-15 ***
## Residuals 298 6029 20.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see, the ANOVA test above results in a very low p-value, which suggests that the probability that the observed differences in points per game by year are random is quite low. We can conclude that there is a significant difference in points per game by year.
Case Study 4: Building a Model for Win Percentage
As stated in the introduction, one of the most useful applications of this dataset is to create a linear model that predicts win percentage by looking at the other variables. In order to create a linear model, we want to make sure our response variable is normally distributed. We can look at a quantile plot to test this.
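The quantile plot appears in the knitted report; one way to recreate it, using qqPlot() from the car package loaded earlier, is:

```r
#Normal quantile plot of win percentage; roughly linear points suggest normality
qqPlot(nba_team_data$WIN.,
       main = "Normal Quantile Plot of Win Percentage",
       ylab = "Win Percentage")
```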
Because the quantile plot for win percentage is pretty linear, we can say that the distribution of win percentage over the last ten years has been approximately normal. In order to see how large a model and which variables we should use, we can use best subsets regression. The code below creates models with varying numbers of variables and stores them in the object “mods”.
#This performs best subsets analysis on a linear model for win percentage using every one of our remaining variables
mods <- summary(regsubsets(nba_team_data$WIN. ~ ., data = nba_team_data, nvmax = 26))
There are a few methods we can use when deciding which model to use, including r^2, adjusted r^2, and BIC. r^2 is not very useful because it simply increases as the number of variables increases. We can use adjusted r^2, however, because it penalizes a model for each additional variable it uses.
#This finds which model had the highest adjusted r^2 value and returns which variables were included
which.max(mods$adjr2)
## [1] 4
names(nba_team_data)[mods$which[4,]][-1]
## [1] "FGM" "FGA" "PF" "PTD"
It looks like the fourth model has the highest adjusted r^2 value. When we look at which variables are used, we see that it uses Field Goals Made, Field Goals Attempted, Personal Fouls, and Point Differential. Another metric is the Bayesian information criterion, or BIC. The recommended model is the one that minimizes BIC.
#This finds which model had the lowest BIC value and returns which variables were included
which.min(mods$bic)
## [1] 1
names(nba_team_data)[mods$which[1,]][-1]
## [1] "PTD"
Using BIC, the very first model is the recommended one. When we look at this first model, we see that it uses only point differential to predict win percentage. The incredibly high correlation between these two variables means that point differential acting alone can do a pretty good job of predicting win percentage. We can create both of these models to gain more insight into which one works better.
#This creates the linear model that resulted in the highest adjusted r^2 value
WIN_mod <- lm(nba_team_data$WIN. ~ nba_team_data$FGM + nba_team_data$FGA + nba_team_data$PF + nba_team_data$PTD)
summary(WIN_mod)
##
## Call:
## lm(formula = nba_team_data$WIN. ~ nba_team_data$FGM + nba_team_data$FGA +
## nba_team_data$PF + nba_team_data$PTD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.109766 -0.025057 -0.001059 0.024566 0.091536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6702292 0.0699227 9.585 <2e-16 ***
## nba_team_data$FGM 0.0033589 0.0022044 1.524 0.1287
## nba_team_data$FGA -0.0027389 0.0012047 -2.274 0.0237 *
## nba_team_data$PF -0.0034412 0.0015144 -2.272 0.0238 *
## nba_team_data$PTD 0.0316927 0.0006212 51.017 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03658 on 295 degrees of freedom
## Multiple R-squared: 0.9459, Adjusted R-squared: 0.9452
## F-statistic: 1290 on 4 and 295 DF, p-value: < 2.2e-16
Our linear model uses the 4 predictors as well as an intercept to predict win percentage. If we look at our variables, we see that point differential is the only one significant at the .01 level, and field goals made is not even significant at the .05 level. It’s a pretty accurate model, but we might not need all of these predictors. We can look at the residual plots to further check the model.
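The residual plots are rendered in the knitted report; a sketch that could reproduce plots like them (assuming WIN_mod from the chunk above; qqPlot() is from the car package) is:

```r
#Normal quantile plot of the studentized residuals
qqPlot(rstudent(WIN_mod), main = "NQ Plot of Studentized Residuals",
       ylab = "Studentized Residuals")
#Fitted values vs. studentized residuals, with reference lines at 0 and +/- 2
plot(fitted(WIN_mod), rstudent(WIN_mod),
     main = "Fits vs. Studentized Residuals",
     xlab = "Fitted Win Percentage", ylab = "Studentized Residuals")
abline(h = c(-2, 0, 2), lty = 2, col = "red")
```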
The first plot shows that the studentized residuals are pretty linear, indicative of a well fit model. The second plot shows that almost all points are within 2 studentized residuals with no noticeable trends, also a sign of a well fit model. It’s pretty good, but we can also check our model that minimized the BIC.
#This creates the linear model that resulted in the lowest BIC value
BAS_mod <- lm(nba_team_data$WIN. ~ nba_team_data$PTD)
summary(BAS_mod)
##
## Call:
## lm(formula = nba_team_data$WIN. ~ nba_team_data$PTD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.109057 -0.025210 0.000321 0.024248 0.083832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4999740 0.0021347 234.22 <2e-16 ***
## nba_team_data$PTD 0.0326389 0.0004596 71.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03697 on 298 degrees of freedom
## Multiple R-squared: 0.9442, Adjusted R-squared: 0.944
## F-statistic: 5042 on 1 and 298 DF, p-value: < 2.2e-16
This linear model uses the single predictor point differential as well as an intercept to predict win percentage. Looking at the model summary, we see that this predictor alone creates a very accurate prediction model, with an r^2 value pretty close to that of the first model. We can use residual plots to check how well this model works.
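The same residual diagnostics can be run on the single-predictor model; a sketch (assuming BAS_mod from the chunk above; qqPlot() is from the car package) is:

```r
#Normal quantile plot of the studentized residuals for the BIC model
qqPlot(rstudent(BAS_mod), main = "NQ Plot of Studentized Residuals",
       ylab = "Studentized Residuals")
#Fitted values vs. studentized residuals, with reference lines at 0 and +/- 2
plot(fitted(BAS_mod), rstudent(BAS_mod),
     main = "Fits vs. Studentized Residuals",
     xlab = "Fitted Win Percentage", ylab = "Studentized Residuals")
abline(h = c(-2, 0, 2), lty = 2, col = "red")
```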