Per Game Team Statistics in the NBA

S&DS 230 Final Project: Per Game Team Statistics in the NBA

Part 1: Introduction

In this project, I’ve set out to create a model that can predict team win percentage in the NBA. The National Basketball Association consists of 30 teams that each play 82 games throughout the regular season, which lasts from late October to mid April. The league keeps track of numerous statistics throughout the season. Beyond the obvious points scored and points against, these also include statistics like steals, blocks, and free throw percentage. Check out the Data section for a more complete explanation of each variable. With win percentage as the response variable, I’d like to find out which statistics are most highly correlated with win percentage and are thus most useful for prediction. Based on previous knowledge, my hypothesis is that point differential (points for - points against) will be the most significant predictor. This model can be useful in finding out which statistics are most important to pay attention to when it comes to predicting future success.

Part 2: Data

The data that I’ve used in this project comes from the team statistics section of the NBA website. One example of the pages I used is the grid for the 2017/2018 season. I downloaded this grid for the last 10 NBA seasons, dating back to the 2008/2009 season. I pasted them into an Excel sheet and saved it as a csv. The raw data is also kept in a Google Sheet. After putting the raw data into a csv, I got to work cleaning the data. All the statistics here are kept as per game values. This is important because the 2011/2012 season was shortened, so teams played fewer games; season totals for that year would be lower and would have thrown off any analysis of win percentage. Using the per game values accounts for this.
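To make the per game point concrete, here is a small sketch with made-up numbers. The 100 points per game figure is hypothetical; the 66-game length of the 2011/2012 season is the only real value used.

```r
# Hypothetical team that scores 100 points per game in both seasons
pts_per_game <- 100
games_full <- 82    # standard regular season
games_short <- 66   # shortened 2011/2012 season

# Season totals differ even though the team is equally good in both years
total_full <- pts_per_game * games_full
total_short <- pts_per_game * games_short

# Dividing by games played puts both seasons back on the same scale
per_game_full <- total_full / games_full
per_game_short <- total_short / games_short

c(total_full = total_full, total_short = total_short)
c(per_game_full = per_game_full, per_game_short = per_game_short)
```

The totals differ by 1600 points while the per game values are identical, which is why the per game grid is the safer input for comparing seasons.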

#I'll need these libraries throughout the project 
library(car)    #assumed from the recode() syntax used below
library(leaps)  #assumed from the regsubsets() call used below
#This navigates to the folder where the file is stored

#This reads in the csv to a data frame and keeps the strings not as factors
nba_team_data <- read.csv("nba_team_per_game.csv",header = TRUE, as.is = TRUE)

Data Cleaning I: Fixing / Removing Variables

Now that we have our data entered into R, we can get to work cleaning it.

#In our csv, the headers are repeated every 30 rows.  The first thing we do is delete these rows
dead_slots <- c(31,62,93,124,155,186,217,248,279) 
nba_team_data <- nba_team_data[-(dead_slots),]

#We have an excess column on the right as well as a column without a name and one with an incorrect name.  We fix those problems here
nba_team_data <- nba_team_data[,1:27]
names(nba_team_data)[27] <- "PTD"
names(nba_team_data)[1] <- "TEAM"

#Most of the data is numeric, so we convert it to this format
for(i in 2:27){
  nba_team_data[,i] <- as.numeric(nba_team_data[,i]) 
}

#4 teams have changed their names throughout the last 10 years. 
#We can recode the TEAM column to account for this
nba_team_data$TEAM <- recode(nba_team_data$TEAM," 'LA Clippers' = 'Los Angeles Clippers'; 'New Jersey Nets' = 'Brooklyn Nets'; 'Charlotte Bobcats' = 'Charlotte Hornets'; 'New Orleans Hornets' = 'New Orleans Pelicans'")

#Win% is wins divided by games played, and losses are games played minus wins. 
#Because of this, we cannot use games played, wins, or losses to predict win%. We delete those variables (columns 2 to 4) here.
nba_team_data <- nba_team_data[,-c(2,3,4)]
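The header positions stored in dead_slots follow a regular pattern, so they could also be generated instead of typed by hand. A minimal sketch, assuming (as the comment above says) that one repeated header row follows every block of 30 teams; the names n_seasons and dead_slots_generated are introduced here for illustration only:

```r
# Each block is 30 team rows plus 1 repeated header row,
# so the header rows sit at multiples of 31
n_seasons <- 10
dead_slots_generated <- seq(from = 31, by = 31, length.out = n_seasons - 1)
dead_slots_generated
```

Generating the positions this way avoids a mistyped row number and adapts if more seasons are stacked onto the file.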

Data Cleaning II: Adding New Variables

There are two categorical variables I would like to add: year and conference.

Year could be useful because certain years have more good teams. For example, if there are more good teams in 2015, a team that scores the same number of points would probably lose more games. I treat year as categorical rather than continuous because, in this project, 2009 is no more related to 2010 than it is to 2018.

The NBA is divided into two conferences, EAST and WEST. Teams play the teams within their own conference more often to cut down on travel time. This variable could be useful because the conferences could be lopsided. If there are better teams in the WEST, a team in the WEST that scores the same number of points as a team in the EAST would probably win fewer games since they play tougher opponents.

#First we make our two variables.  They will start as blank until we fill them in.  
#The default for conference is east.
nba_team_data$CONF <- rep("east",300)
nba_team_data$YEAR <- rep(0,300)

#This vector contains the teams in the west.  We will switch the teams that match an entry in this vector to west
west <- c("Golden State Warriors", "Dallas Mavericks", "Memphis Grizzlies", "Phoenix Suns", "Sacramento Kings", "San Antonio Spurs", "Utah Jazz", "Los Angeles Lakers","Oklahoma City Thunder", "Minnesota Timberwolves", "New Orleans Pelicans", "Denver Nuggets", "Houston Rockets", "Los Angeles Clippers", "Portland Trail Blazers")

#We go through each entry and find the correct entry for each categorical variable
for(i in 1:300){
  #This line calculates the year for a given entry based on the premise that the data is simply stacked on itself
  nba_team_data$YEAR[i] <- floor((300 - i) / 30) + 2009
  #If a team matches the name of a west team, the conference switches to west
  if(nba_team_data$TEAM[i] %in% west){
    nba_team_data$CONF[i] <- "west"
  }
}

#This converts conference to a categorical variable
nba_team_data$CONF <- as.factor(nba_team_data$CONF)
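The same year and conference assignment can be written without a loop. The sketch below runs on a made-up toy data frame (three seasons of two teams; the names toy, west_teams, teams_per_season and last_year are illustrative), and assumes the seasons are stacked newest first as in the real data.

```r
# Toy stand-in: 3 seasons of 2 teams each, newest season first
toy <- data.frame(TEAM = rep(c("Utah Jazz", "Boston Celtics"), times = 3),
                  stringsAsFactors = FALSE)
west_teams <- c("Utah Jazz")   # the only west team in this toy example
teams_per_season <- 2
last_year <- 2018

# Rows 1-2 belong to 2018, rows 3-4 to 2017, and so on
toy$YEAR <- last_year - (seq_len(nrow(toy)) - 1) %/% teams_per_season

# %in% checks every row against the west vector at once
toy$CONF <- factor(ifelse(toy$TEAM %in% west_teams, "west", "east"))
toy
```

Integer division by the number of teams per season plays the same role as the floor() expression in the loop.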

Explaining Our Variables

Now that we have cleaned our data, we can start by looking at what kind of information we have to work with.

#In our updated data frame, we have 26 variables.  The names are below
names(nba_team_data)
##  [1] "TEAM" "WIN." "MIN"  "PTS"  "FGM"  "FGA"  "FG."  "X3PM" "X3PA" "X3P."
## [11] "FTM"  "FTA"  "FT."  "OREB" "DREB" "REB"  "AST"  "TOV"  "STL"  "BLK" 
## [21] "BLKA" "PF"   "PFD"  "PTD"  "CONF" "YEAR"

The explanation for each variable is listed below:

  1. TEAM = Team Name (corresponds to name in 2017-2018 season)

  2. WIN. = Win Percentage (fraction of played games that were won, on a 0 to 1 scale)

  3. MIN = Minutes Played (per game; higher for teams that have played overtime games)

  4. PTS = Points Scored (points scored per game)

  5. FGM = Field Goals Made (baskets made per game, not counting free throws)

  6. FGA = Field Goals Attempted (shots taken per game, not counting free throws)

  7. FG. = Field Goal Percentage (column 5 divided by column 6)

  8. X3PM = 3 Point Shots Made (shots made per game from beyond the 3 point arc)

  9. X3PA = 3 Point Shots Attempted (shots taken per game from beyond the 3 point arc)

  10. X3P. = 3 Point Shot Percentage (column 8 divided by column 9)

  11. FTM = Free Throws Made (free throws made per game)

  12. FTA = Free Throws Attempted (free throws taken per game)

  13. FT. = Free Throw Percentage (column 11 divided by column 12)

  14. OREB = Offensive Rebounds (the team's own missed shots that were recovered, per game)

  15. DREB = Defensive Rebounds (opponents’ missed shots that were recovered, per game)

  16. REB = Rebounds (column 14 added to column 15)

  17. AST = Assists (field goals per game that were assisted on)

  18. TOV = Turnovers (possessions given away per game)

  19. STL = Steals (opponents’ possessions taken away per game)

  20. BLK = Blocks (opponents’ field goal attempts blocked per game)

  21. BLKA = Attempted Shots That Were Blocked (the team's field goal attempts blocked by opponents, per game)

  22. PF = Personal Fouls (fouls committed by the team per game)

  23. PFD = Personal Fouls Drawn (fouls committed by opponents per game)

  24. PTD = Point Differential (points scored - points given up, per game)

  25. CONF = Conference (either “east” or “west”)

  26. YEAR = Year (year that season ends, from 2009 to 2018)

Here is a quick look at what the start of the dataset looks like.

#This shows the first few rows of the frame
head(nba_team_data)
##                    TEAM  WIN.  MIN   PTS  FGM  FGA  FG. X3PM X3PA X3P.
## 1       Houston Rockets 0.793 48.2 112.4 38.7 84.2 46.0 15.3 42.3 36.2
## 2       Toronto Raptors 0.720 48.4 111.7 41.3 87.4 47.2 11.8 33.0 35.8
## 3 Golden State Warriors 0.707 48.1 113.5 42.8 85.1 50.3 11.3 28.9 39.1
## 4        Boston Celtics 0.671 48.3 104.0 38.3 85.1 45.0 11.5 30.4 37.7
## 5    Philadelphia 76ers 0.634 48.2 109.8 40.8 86.6 47.2 11.0 29.8 36.9
## 6   Cleveland Cavaliers 0.610 48.1 110.9 40.4 84.8 47.6 12.0 32.1 37.2
##    FTM  FTA  FT. OREB DREB  REB  AST  TOV STL BLK BLKA   PF  PFD PTD CONF
## 1 19.6 25.1 78.1  9.0 34.5 43.5 21.5 13.8 8.5 4.8  4.4 19.5 20.4 8.5 west
## 2 17.3 21.8 79.4  9.8 34.2 44.0 24.3 13.4 7.6 6.1  4.9 21.7 19.9 7.8 east
## 3 16.6 20.3 81.5  8.4 35.1 43.5 29.3 15.4 8.0 7.5  3.7 19.6 18.5 6.0 west
## 4 16.0 20.7 77.1  9.4 35.1 44.5 22.5 14.0 7.4 4.5  4.4 19.7 19.2 3.6 east
## 5 17.1 22.8 75.2 10.9 36.5 47.4 27.1 16.5 8.3 5.1  5.1 22.1 20.4 4.5 east
## 6 18.1 23.3 77.9  8.5 33.7 42.1 23.4 13.7 7.1 3.8  4.1 18.6 20.7 0.9 east
##   YEAR
## 1 2018
## 2 2018
## 3 2018
## 4 2018
## 5 2018
## 6 2018

Now that we have a pretty clean dataset and are more familiar with the included information, we can get to work doing some analysis.

Part 3: Findings / Analysis

Case Study 1: Which Statistics are correlated with Win Percentage?

If our goal in this project is to see which statistics can be best used to predict win percentage and are thus most indicative of team strength, we can start by finding the correlation of win percentage with each of our continuous variables.

#This gets the correlation between Win Percentage and every other variable
(win_correlations <- sort(round(cor(nba_team_data[,c(2:24)]),digits = 2)[,1],decreasing = T))
##  WIN.   PTD   FG.  X3P.   PTS   FGM  DREB   AST  X3PM   BLK   REB   FTM 
##  1.00  0.97  0.65  0.53  0.44  0.37  0.36  0.35  0.28  0.28  0.25  0.20 
##   STL  X3PA   PFD   FTA   FT.   MIN   FGA  OREB    PF   TOV  BLKA 
##  0.20  0.19  0.17  0.15  0.14  0.04 -0.16 -0.16 -0.24 -0.27 -0.43

As we can see, some variables have a positive relation with win percentage, while others have a negative relation. Almost all the variables have the relation we would expect. For example, field goals made has a positive relation, which makes sense: more baskets -> more wins. On the other side, turnovers is negatively related, which also makes sense: more turnovers -> fewer wins. We can use correlation tests to see which relations are actually significant. First, however, we can pause to appreciate the beauty of this scatter plot showing the remarkable relation between point differential and win percentage.

Correlation Tests

A correlation test asks how likely it would be to see a sample correlation this strong if the true correlation between two variables were zero. Here is the output for the Win Percentage vs Point Differential correlation.

#This performs a correlation test between Point Differential and Win Percentage
cor.test(nba_team_data$WIN., nba_team_data$PTD)
##  Pearson's product-moment correlation
## data:  nba_team_data$WIN. and nba_team_data$PTD
## t = 71.01, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9645985 0.9773914
## sample estimates:
##       cor 
## 0.9716988

Because the p value is exceptionally low, we can reject the idea that the true correlation is zero and say that there is a non-zero true correlation between these two variables. We can calculate the p values for the correlation between win percentage and the other variables as well to test which relations are significant at a reasonable level such as 95%.
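The t statistic in the output above can be recovered from the sample correlation alone, using the standard relation t = r * sqrt(n - 2) / sqrt(1 - r^2). A short sketch using only the values printed above (the variable names are illustrative):

```r
r <- 0.9716988   # sample correlation reported in the output above
n <- 300         # 30 teams x 10 seasons, so df = n - 2 = 298

# Test statistic for H0: true correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(t_stat, 2)
p_value
```

The rounded statistic agrees with the t = 71.01 reported by the correlation test, which shows how a correlation near 1 with 300 observations drives the p value to essentially zero.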

#This section creates a data frame that compiles the useful information from a variety of correlation tests
#(the data frame setup and final sort are reconstructed to match the columns and ordering of the output below)
sig_vars <- data.frame(variable = names(nba_team_data)[2:24], low_bound = NA, high_bound = NA, p_value = NA)
sig_vars$win_correlation <- round(cor(nba_team_data[,c(2:24)]),digits = 2)[,1]
for(i in 1:23){
  win_test <- cor.test(nba_team_data[,2],nba_team_data[,(i+1)])
  sig_vars$p_value[i] <- win_test$p.value
  sig_vars$low_bound[i] <- round(win_test$conf.int[1],digits = 2)
  sig_vars$high_bound[i] <- round(win_test$conf.int[2],digits = 2)
}
sig_vars[order(sig_vars$p_value),]
##    variable low_bound high_bound       p_value win_correlation
## 1      WIN.      1.00       1.00  0.000000e+00            1.00
## 23      PTD      0.96       0.98 8.453665e-189            0.97
## 6       FG.      0.58       0.71  4.977766e-37            0.65
## 9      X3P.      0.44       0.60  8.269595e-23            0.53
## 3       PTS      0.34       0.52  2.469792e-15            0.44
## 20     BLKA     -0.52      -0.34  3.751202e-15           -0.43
## 4       FGM      0.26       0.46  6.813857e-11            0.37
## 14     DREB      0.26       0.45  1.501862e-10            0.36
## 16      AST      0.24       0.44  5.927472e-10            0.35
## 7      X3PM      0.18       0.39  5.584905e-07            0.28
## 19      BLK      0.17       0.38  1.269302e-06            0.28
## 17      TOV     -0.37      -0.16  1.764639e-06           -0.27
## 15      REB      0.14       0.35  1.245509e-05            0.25
## 21       PF     -0.34      -0.13  2.854687e-05           -0.24
## 18      STL      0.09       0.31  4.300958e-04            0.20
## 10      FTM      0.08       0.30  6.409895e-04            0.20
## 8      X3PA      0.08       0.29  1.077275e-03            0.19
## 22      PFD      0.06       0.28  2.505024e-03            0.17
## 5       FGA     -0.27      -0.05  4.389663e-03           -0.16
## 13     OREB     -0.27      -0.05  5.546764e-03           -0.16
## 11      FTA      0.04       0.26  7.327788e-03            0.15
## 12      FT.      0.02       0.24  1.927797e-02            0.14
## 2       MIN     -0.07       0.16  4.616511e-01            0.04

At the 95% confidence level, every statistic is significantly correlated with win percentage with the exception of minutes played. This makes sense, as there’s no reason I could think of that the number of overtime games a team plays would impact the win percentage. As we continue our analysis and eventually create our prediction model, we might use any statistic except minutes as a predictor.

Case Study 2: Does Conference Matter?

One of the cries often heard throughout the NBA fanbase today is that the east is soft as compared to the western conference. We can perform a two sample t-test on win% in each conference to see if there has been a statistically significant difference over the last ten years.

There certainly does seem to be a difference between the win percentages in each conference. We can use a two sample t-test to check whether this difference is significant.

#We can perform a two sample t-test on our data
t.test(nba_team_data$WIN. ~ nba_team_data$CONF)
##  Welch Two Sample t-test
## data:  nba_team_data$WIN. by nba_team_data$CONF
## t = -2.2778, df = 297.2, p-value = 0.02345
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.076075635 -0.005551032
## sample estimates:
## mean in group east mean in group west 
##          0.4796000          0.5204133

Because the p-value from this t-test is less than .05, we reject the hypothesis of equal means at the 5% significance level: there is a statistically significant difference in mean win% between the eastern and western conferences, with the west about 4 percentage points higher.
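The confidence interval in the t-test output can be rebuilt from the other printed numbers, which makes the link between the t statistic and the interval explicit. This is a sketch using only the values shown above; the standard error is backed out from t = difference / SE rather than computed from the raw data.

```r
mean_east <- 0.4796000
mean_west <- 0.5204133
t_stat <- -2.2778
deg_free <- 297.2     # Welch degrees of freedom from the output

diff_means <- mean_east - mean_west
se_diff <- diff_means / t_stat              # standard error implied by the t statistic
half_width <- qt(0.975, deg_free) * se_diff

ci <- diff_means + c(-1, 1) * half_width
round(ci, 4)
```

Up to rounding of the printed t statistic, this gives back the interval reported by the test, whose upper end sits just below 0.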


We can use bootstrapping to see a visualization of this test. Resampling each conference with replacement a large number of times creates a distribution of the difference in mean win percentage (east minus west). We then check whether or not the middle 95% of these bootstrapped differences falls entirely on one side of 0.

#This shows the procedure used to create the bootstrapped confidence interval
N <- 20000

diffWN <- rep(NA,N)

for(i in 1:N){
  se <- sample(nba_team_data$WIN.[nba_team_data$CONF== "east"],sum(nba_team_data$CONF=="east"),replace = TRUE)
  sw <- sample(nba_team_data$WIN.[nba_team_data$CONF== "west"],sum(nba_team_data$CONF=="west"),replace = TRUE)
  diffWN[i] <- mean(se) - mean(sw)
}

(ci <- quantile(diffWN,c(0.025,.5,0.975)))
##         2.5%          50%        97.5% 
## -0.076126667 -0.040936667 -0.005446333

Now that we have created a bootstrapped confidence interval for the difference in win percentage by conference, we can plot our data to get a clearer look at what it means to say that there is a significant difference at the 95% level.

We can see that 0 is not included in this confidence interval, which shows that there is a statistically significant difference in win percentage by conference.

Perhaps there are more high powered offenses in the west, and this is a possible explanation for the discrepancy in win percentage. We can look at the breakdown of Points per Game scored by teams in different conferences. We can start out with a boxplot to get a general sense of the distributions.

It looks like there is a difference in points per game between the conferences. A two sample t-test can tell us how significant this difference is.

#We can perform a two sample t-test on our data
t.test(nba_team_data$PTS ~ nba_team_data$CONF)
##  Welch Two Sample t-test
## data:  nba_team_data$PTS by nba_team_data$CONF
## t = -5.4225, df = 297.4, p-value = 1.218e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.059703 -1.897630
## sample estimates:
## mean in group east mean in group west 
##           99.50667          102.48533

Because the p value is less than .05, we can say that the mean points per game scored by eastern conference teams is significantly lower than for western conference teams, by about 3 points per game. One way to visualize this is with a permutation test.

Permutation Test

We can use a permutation test to see how likely it is that the observed difference in points per game is due to random chance. We do this by randomly shuffling the conference labels many times and seeing what proportion of the shuffles produce a difference in mean points per game at least as large as our observed difference.

#This shows the procedure used in performing our permutation test
N <- 20000
diffvals <- rep(NA,N)
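for(i in 1:N){
  fakecon <- sample(nba_team_data$CONF)
  diffvals[i] <- mean(nba_team_data$PTS[fakecon == "east"]) - mean(nba_team_data$PTS[fakecon =="west"])
}

#The quantiles are taken from diffvals, the permutation results. The values originally printed here came from diffWN,
#the bootstrap vector from the win percentage analysis, so they are not repeated.
(cip <- quantile(diffvals,c(0.025,.5,0.975)))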
(mean(nba_team_data$PTS[nba_team_data$CONF == "west"]) - mean(nba_team_data$PTS[nba_team_data$CONF == "east"]))
## [1] 2.978667

The observed difference of about 2.98 points per game lies outside the middle 95% of the differences produced by the permutation test, meaning it is quite unlikely that the observed difference is due to random chance. We can see this on a histogram.

This histogram displays the results of our permutation test as well as the actual observed difference in points per game by conference. As we can see, almost none of the random shuffles produced a difference in mean points per game as large as the observed difference, showing that there is a very low chance that the observed difference happened by chance. This leads us to conclude that there is a significant difference in points per game scored by each conference.
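A permutation test can also be summarized as a p-value: the share of shuffles whose difference is at least as extreme as the observed one. The sketch below runs on simulated stand-in data, not the NBA data; the seed, the means and the standard deviation are assumptions chosen only to resemble the setting.

```r
set.seed(230)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in: 150 team seasons per conference, west scoring about 3 more points
pts <- c(rnorm(150, mean = 99.5, sd = 4.5), rnorm(150, mean = 102.5, sd = 4.5))
conf <- rep(c("east", "west"), each = 150)

observed <- mean(pts[conf == "west"]) - mean(pts[conf == "east"])

# Shuffle the conference labels and recompute the difference each time
n_perm <- 5000
perm_diffs <- replicate(n_perm, {
  shuffled <- sample(conf)
  mean(pts[shuffled == "west"]) - mean(pts[shuffled == "east"])
})

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
p_perm <- mean(abs(perm_diffs) >= abs(observed))
p_perm
```

Taking absolute values makes the test two-sided, so it does not matter whether the difference is computed as west minus east or east minus west.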

Case Study 3: Have Different Seasons Been Significantly Different?

The other categorical variable we created was season. By virtue of the teams all playing each other and every game having 1 winner and 1 loser, it is not very interesting to look at mean Win Percentage by year. Something we can look at, however, would be points per game. The boxplots showing points per game by year (pictured below) suggest that there is a difference.

In order to test whether there is a significant difference in points per game by year, we can use an ANOVA test, which tests whether at least one of the mean points per game values is significantly different from the rest.

#This performs an ANOVA test on the distribution of points per game separated by season
summary(aov(nba_team_data$PTS ~ nba_team_data$YEAR))
##                     Df Sum Sq Mean Sq F value   Pr(>F)    
## nba_team_data$YEAR   1   1380  1380.1   68.21 4.84e-15 ***
## Residuals          298   6029    20.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

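As we can see, the ANOVA test above results in a very low p-value, which suggests that the probability the observed differences in points per game by year were random is quite low. Note that the table shows 1 degree of freedom for YEAR, so year entered the test as a number; the result is therefore evidence of a linear trend in scoring across seasons rather than a comparison of ten separate season means. With that caveat, we can conclude that points per game differs significantly by year.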
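To compare seasons as separate groups, YEAR would need to be a factor. With YEAR stored as a number, aov() fits a single slope (1 Df, as in the table above); with factor(YEAR) it fits one mean per season (9 Df for 10 seasons). The sketch below uses simulated data, not the NBA data, purely to show how the degrees of freedom change.

```r
set.seed(230)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in: 10 seasons of 30 teams each
toy <- data.frame(YEAR = rep(2009:2018, each = 30),
                  PTS = rnorm(300, mean = 100, sd = 4.5))

# YEAR as a number: tests for a linear trend across seasons
trend_fit <- summary(aov(PTS ~ YEAR, data = toy))

# YEAR as a factor: one-way ANOVA comparing the 10 season means
group_fit <- summary(aov(PTS ~ factor(YEAR), data = toy))

trend_fit[[1]]$Df   # degrees of freedom for YEAR, then residuals
group_fit[[1]]$Df
```

Checking the Df column is a quick way to confirm which of the two tests was actually run.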

Case Study 4: Building a Model for Win Percentage

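As stated in the introduction, one of the most useful applications of this dataset is to create a linear model that predicts win percentage by looking at the other variables. Before fitting it, we check whether our response variable is approximately normally distributed. (Strictly speaking, the linear model assumes normal residuals rather than a normal response, which we check with residual plots later, but this is a useful first look.) We can look at a quantile plot to test this.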

Because the quantile plot for win percentage is pretty linear, we can say that the distribution of win percentage over the last ten years has been approximately normal. In order to see how large a model and which variables we should use, we can use best subsets regression. The code below creates models with varying numbers of variables and stores them in the object “mods”.

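#This performs best subsets analysis on a linear model for win percentage using every one of our remaining variables
#Writing the response as WIN. (not nba_team_data$WIN.) keeps the "." from including win percentage among its own predictors
mods <- summary(regsubsets(WIN. ~ ., data = nba_team_data,nvmax = 26))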

There are a few methods we can use when deciding which model to use, including r^2, adjusted r^2, and BIC. r^2 is not very useful because it never decreases as the number of variables increases. We can use adjusted r^2, however, because it penalizes each additional variable, so it only rises when a new variable improves the fit enough to justify its inclusion.

#Which finds which model had the highest adjusted r^2 value and returns which variables were included
## [1] 4
## [1] "FGM" "FGA" "PF"  "PTD"

It looks like the fourth model has the highest adjusted r^2 value. When we look at which variables are used, we see that it uses Personal Fouls, Field Goals Made, Field Goals Attempted, and Point Differential. Another metric is the Bayesian information criterion, or BIC. The recommended model is the one that minimizes BIC.

#Which finds which model had the lowest BIC value and returns which variables were included
## [1] 1
## [1] "PTD"

Using BIC, the very first model is the recommended one. When we look at this first model, we see that it uses only point differential to predict win percentage. The incredibly high correlation between these two variables means that point differential acting alone can do a pretty good job of predicting win percentage. We can create both of these models to gain more insight into which one works better.

#This creates the linear model that resulted in the highest adjusted r^2 value
WIN_mod <- lm(nba_team_data$WIN. ~ nba_team_data$FGM + nba_team_data$FGA + nba_team_data$PF + nba_team_data$PTD)
summary(WIN_mod)
## Call:
## lm(formula = nba_team_data$WIN. ~ nba_team_data$FGM + nba_team_data$FGA + 
##     nba_team_data$PF + nba_team_data$PTD)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.109766 -0.025057 -0.001059  0.024566  0.091536 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.6702292  0.0699227   9.585   <2e-16 ***
## nba_team_data$FGM  0.0033589  0.0022044   1.524   0.1287    
## nba_team_data$FGA -0.0027389  0.0012047  -2.274   0.0237 *  
## nba_team_data$PF  -0.0034412  0.0015144  -2.272   0.0238 *  
## nba_team_data$PTD  0.0316927  0.0006212  51.017   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.03658 on 295 degrees of freedom
## Multiple R-squared:  0.9459, Adjusted R-squared:  0.9452 
## F-statistic:  1290 on 4 and 295 DF,  p-value: < 2.2e-16

Our linear model uses the 4 predictors as well as an intercept to predict win percentage. If we look at our variables, we see that point differential is the only one significant at the .01 level and field goals made is not even significant at the .05 level. It’s a pretty accurate model, but we might not need all of these predictors. We can look at the residual plots to further check the model.
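The adjusted R-squared in the summary above can be recomputed from the multiple R-squared, the number of observations and the number of predictors, which shows how the penalty for extra variables works. A short sketch using only values from the output (variable names are illustrative):

```r
r_squared <- 0.9459   # Multiple R-squared from the summary above
n <- 300              # observations: 295 residual df + 4 predictors + intercept
p <- 4                # number of predictors

# Adjusted R-squared shrinks R-squared by a factor that grows with p
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)
```

With 300 observations the penalty for four predictors is small, which is why the adjusted value is only slightly below the multiple R-squared.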

The first plot shows that the normal quantile plot of the studentized residuals is pretty linear, indicative of a well fit model. The second plot shows that almost all of the studentized residuals are within 2 of zero with no noticeable trends, also a sign of a well fit model. It’s pretty good but we can also check our model that minimized the BIC.

#This creates the linear model that resulted in the lowest BIC value
BAS_mod <- lm(nba_team_data$WIN. ~ nba_team_data$PTD)
summary(BAS_mod)
## Call:
## lm(formula = nba_team_data$WIN. ~ nba_team_data$PTD)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.109057 -0.025210 -0.000321  0.024248  0.083832 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.4999740  0.0021347  234.22   <2e-16 ***
## nba_team_data$PTD 0.0326389  0.0004596   71.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.03697 on 298 degrees of freedom
## Multiple R-squared:  0.9442, Adjusted R-squared:  0.944 
## F-statistic:  5042 on 1 and 298 DF,  p-value: < 2.2e-16

This linear model uses the single predictor point differential as well as an intercept to predict win percentage. Looking at the model summary, we see that this predictor variable alone can create a very accurate prediction model, with an r^2 value (0.9442) pretty close to that of the first model (0.9459). We can use residual plots to check how well this model works.
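To see what the single-predictor model says in practice, the fitted coefficients can be applied by hand. The sketch below hard-codes the intercept and slope from the summary above and uses the 2018 Houston Rockets row from the first rows of the dataset as a worked example; predict_win_pct is a helper name introduced here for illustration.

```r
intercept <- 0.4999740   # from the model summary above
slope <- 0.0326389       # change in win percentage per point of differential

# Predicted win percentage for a given per game point differential
predict_win_pct <- function(ptd) intercept + slope * ptd

houston_ptd <- 8.5       # 2018 Houston Rockets, PTD column
houston_actual <- 0.793  # 2018 Houston Rockets, WIN. column

houston_predicted <- predict_win_pct(houston_ptd)
c(predicted = houston_predicted,
  actual = houston_actual,
  residual = houston_actual - houston_predicted)

# Expressed as wins over an 82 game season
houston_predicted * 82
```

The intercept of roughly 0.5 means a team with a point differential of zero is predicted to win half its games, and each additional point of differential is worth about 3.3 percentage points of win percentage.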