R: NBA Power Rankings



How We Make the YUSAG Power Rankings

In this article, I’ll go through the R Script I use to make the YUSAG NBA Power Rankings

The first step of the process is to scrape the 2017-2018 game data. I’m going to get it from basketball reference because it’s stored in a table that makes it pretty easy to grab with readHTMLtable()

#Here, we load the necessary packages and grab the data for each game.  
#Because each month is stored on a different page, a for loop allows us to collect the data for each month
library(XML)
library(RCurl)
## Loading required package: bitops
months <- c("october", "november", "december", "january", "february", "march")

for(i in 1:length(months)){
  u <- paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-",months[i],".html")
  newu <- getURL(u)
  data <- readHTMLTable(newu)
  
  if(i==1){
    
    frame <- data$schedule
    frame$`Visitor/Neutral` <- as.character(frame$`Visitor/Neutral`)
    frame$`Home/Neutral` <- as.character(frame$`Home/Neutral`)
    frame$PTSV <- as.numeric(as.character(frame[[4]]))
    frame$PTSH <- as.numeric(as.character(frame[[6]]))
    frame <- frame[!is.na(frame$PTSV),]
    
  }
  else{
    frame1 <- data$schedule
    frame1$`Visitor/Neutral` <- as.character(frame1$`Visitor/Neutral`)
    frame1$`Home/Neutral` <- as.character(frame1$`Home/Neutral`)
    frame1$PTSV <- as.numeric(as.character(frame1[[4]]))
    frame1$PTSH <- as.numeric(as.character(frame1[[6]]))
    frame1 <- frame1[!is.na(frame1$PTSV),]
    frame <- rbind(frame,frame1)
    
  }
  if(i != 1 && i != length(months)){
    remove(frame1)
  }
  
}

The dataframe “frame” now holds the basic game data for the 2017-2018 season. Let’s take a look at what we’ve got

head(frame)
##                Date Start (ET)   Visitor/Neutral PTS          Home/Neutral
## 1 Tue, Oct 17, 2017    8:01 pm    Boston Celtics  99   Cleveland Cavaliers
## 2 Tue, Oct 17, 2017   10:30 pm   Houston Rockets 122 Golden State Warriors
## 3 Wed, Oct 18, 2017    7:30 pm   Milwaukee Bucks 108        Boston Celtics
## 4 Wed, Oct 18, 2017    8:30 pm     Atlanta Hawks 117      Dallas Mavericks
## 5 Wed, Oct 18, 2017    7:00 pm Charlotte Hornets  90       Detroit Pistons
## 6 Wed, Oct 18, 2017    7:00 pm     Brooklyn Nets 131        Indiana Pacers
##   PTS            Attend. Notes PTSV PTSH
## 1 102 Box Score     20,562         99  102
## 2 121 Box Score     19,596        122  121
## 3 100 Box Score     18,624        108  100
## 4 111 Box Score     19,709        117  111
## 5 102 Box Score     20,491         90  102
## 6 140 Box Score     15,008        131  140

This is pretty good, but we need to a couple of annoying yet necessary data cleaning steps

#Here, we create a new data frame called games with the data we want, as well as a new column with the home point differential of each game
games <- data.frame(frame$`Visitor/Neutral`,frame$PTSV,frame$`Home/Neutral`,frame$PTSH)
games$frame..Visitor.Neutral. <- as.character(games$frame..Visitor.Neutral.)
games$frame..Home.Neutral. <- as.character(games$frame..Home.Neutral.)


games$pt_dif <- games$frame.PTSH - games$frame.PTSV

head(games)
##   frame..Visitor.Neutral. frame.PTSV  frame..Home.Neutral. frame.PTSH
## 1          Boston Celtics         99   Cleveland Cavaliers        102
## 2         Houston Rockets        122 Golden State Warriors        121
## 3         Milwaukee Bucks        108        Boston Celtics        100
## 4           Atlanta Hawks        117      Dallas Mavericks        111
## 5       Charlotte Hornets         90       Detroit Pistons        102
## 6           Brooklyn Nets        131        Indiana Pacers        140
##   pt_dif
## 1      3
## 2     -1
## 3     -8
## 4     -6
## 5     12
## 6      9

Now we can put this clean data into a final data frame that will contain 4 columns, team, opponent, ptdif and location

team <- character(length = 2*length(games$pt_dif))
opponent <- character(length = 2*length(games$pt_dif))
location <- character(length = 2*length(games$pt_dif))
ptdif <- vector(mode='numeric',length = 2*length(games$pt_dif))


clean <- data.frame(team,opponent,location,ptdif)

clean$team <- as.character(clean$team)
clean$opponent <- as.character(clean$opponent)
clean$location <- as.character(clean$location)


for(i in 1:length(games$pt_dif)){
  
  clean$team[i] <- games$frame..Home.Neutral.[i]
  clean$opponent[i] <- games$frame..Visitor.Neutral.[i]
  clean$location[i] <- "H"
  clean$ptdif[i] <- games$pt_dif[i]
  
  
  clean$team[i + length(games$pt_dif)] <- games$frame..Visitor.Neutral.[i]
  clean$opponent[i + length(games$pt_dif)] <- games$frame..Home.Neutral.[i]
  clean$location[i + length(games$pt_dif)] <- "A"
  clean$ptdif[i + length(games$pt_dif)] <- (-1)*games$pt_dif[i]
  
  
}

#Here, we correct for the 3 NBA games this season that were played at neutral locations
clean$location[363] <- "N"
clean$location[376] <- "N"
clean$location[615] <- "N"
clean$location[363 + length(games$pt_dif)] <- "N"
clean$location[376 + length(games$pt_dif)] <- "N"
clean$location[615+ length(games$pt_dif)] <- "N"

head(clean)
##                    team          opponent location ptdif
## 1   Cleveland Cavaliers    Boston Celtics        H     3
## 2 Golden State Warriors   Houston Rockets        H    -1
## 3        Boston Celtics   Milwaukee Bucks        H    -8
## 4      Dallas Mavericks     Atlanta Hawks        H    -6
## 5       Detroit Pistons Charlotte Hornets        H    12
## 6        Indiana Pacers     Brooklyn Nets        H     9

Now we’re ready to go creating our model. Here’s a fun boxplot of the home point differentials for this season

boxplot(clean$ptdif[1:(.5*length(clean$ptdif))], col = "blue", horizontal = TRUE, main = "Home Point Differentials for 2017-2018 NBA Season", xlab = "Home Score - Away Score")
abline( v = 0, col = "red")

Our NBA model (spoiler alert) is a linear model that uses team, opponent, and location to predict ptdif. We fit a coefficient for each team, as well as for home court advantage, such that the formula for a given game’s predicted pt dif is R1 - R2 + H where R1 and R2 are the power coefficients (YUSAG coefficients) for the home and away teams respectively. Our model does this by finding the set of coefficients that minimize the (predictedscore - acutalscore)^2 over the course of the entire season

lm.NBAhoops <- lm(ptdif ~ team + opponent + location, data = clean) 

#Let's take a look at some of the values
lm.NBAhoops$coefficients[1:16]
##               (Intercept)        teamBoston Celtics 
##                -2.2319493                 8.9367978 
##         teamBrooklyn Nets     teamCharlotte Hornets 
##                 1.4332406                 4.5728856 
##         teamChicago Bulls   teamCleveland Cavaliers 
##                -0.4063552                 5.4354603 
##      teamDallas Mavericks        teamDenver Nuggets 
##                 3.5956409                 6.3379414 
##       teamDetroit Pistons teamGolden State Warriors 
##                 4.1123920                13.0223506 
##       teamHouston Rockets        teamIndiana Pacers 
##                14.4417820                 6.3578205 
##  teamLos Angeles Clippers    teamLos Angeles Lakers 
##                 6.3850592                 4.0390877 
##     teamMemphis Grizzlies            teamMiami Heat 
##                 0.4379501                 5.3903852

Now we can create the rankings using these coefficients. We linearly adjust each value so that the mean of all the rankings is 0. This allows us to see teams as better or worse than an average NBA team immediately with their value.

rankings <- data.frame("team" = sort(unique(clean$team)),
                       "yusag_coeff" = rep(NA, 30))
scale_factor <- mean(lm.NBAhoops$coefficients[2:30])
rankings$yusag_coeff <- c(0, lm.NBAhoops$coefficients[2:30]) - scale_factor
rankings <- rankings[(order(rankings$yusag_coeff, decreasing = T)),]
rankings
##                      team yusag_coeff
## 11        Houston Rockets   8.7892184
## 28        Toronto Raptors   7.8391149
## 10  Golden State Warriors   7.3697870
## 2          Boston Celtics   3.2842342
## 29              Utah Jazz   3.2167448
## 27      San Antonio Spurs   2.9261685
## 21  Oklahoma City Thunder   2.8773180
## 23     Philadelphia 76ers   2.6016207
## 25 Portland Trail Blazers   2.4838489
## 18 Minnesota Timberwolves   2.3983290
## 30     Washington Wizards   1.0499319
## 13   Los Angeles Clippers   0.7324956
## 12         Indiana Pacers   0.7052569
## 8          Denver Nuggets   0.6853778
## 19   New Orleans Pelicans   0.5842646
## 6     Cleveland Cavaliers  -0.2171033
## 16             Miami Heat  -0.2621784
## 17        Milwaukee Bucks  -0.8344661
## 4       Charlotte Hornets  -1.0796780
## 9         Detroit Pistons  -1.5401716
## 14     Los Angeles Lakers  -1.6134759
## 7        Dallas Mavericks  -2.0569227
## 20        New York Knicks  -3.3257917
## 3           Brooklyn Nets  -4.2193230
## 22          Orlando Magic  -4.7416734
## 15      Memphis Grizzlies  -5.2146135
## 1           Atlanta Hawks  -5.6525636
## 5           Chicago Bulls  -6.0589188
## 26       Sacramento Kings  -7.2745375
## 24           Phoenix Suns  -9.1048574

Here’s a stripchart of the rankings.

stripchart(rankings$yusag_coeff, pch = 19 , col = "blue", xlab = "YUSAG Coefficient", main = "YUSAG Coefficients for 2017-2018 Season")

Next, we can create another model, based on the first model, that finds the probability that a team will win given their predicted point differential. First we add two new columns showing the predicted ptdif of a game, as well as wheather or not that team actually won. Then we fit a logistic function that takes in 1 paramater (predicted point differntial) and outputs the win probability based on this season’s game data. This function has limits such that as x goes to infinity or negative infinity, y will go to 1 or 0

clean$predscore <- predict(lm.NBAhoops, newdata = clean)
clean$win <- ifelse(clean$ptdif > 0, 1, 0)
glm.pointspread <- glm(win ~ predscore, data = clean, family = "binomial")
clean$winprob <- predict(glm.pointspread, newdata = clean, type = "response")

We can plot the win prob vs the predicted score to see this function in action.

plot(clean$predscore, clean$winprob, xlab = "Home Predicted Score Differential", ylab = "Home Win Probability", main = "Logistic Function for Predicting Games with the YUSAG NBA Model", pch = 4, col = c("red","blue")[(clean$ptdif > 0) + 1])
legend(5,.5,legend=c("Actual Home Win", "Actual Road Win"), fill=c("blue","red"))

The last we have to do is clean our or environment and write a csv with our resulting power rankings.

remove(data)
remove(frame)
remove(frame1)
remove(games)
write.csv(rankings, "NBA_RANKINGS.csv", row.names = FALSE)

Let us know if you have any feedback, recommendations, and feel free to build on our work and create your own model!