How We Make the YUSAG Power Rankings
Daniel Tokarz
March 20, 2018
In this article, I’ll go through the R Script I use to make the YUSAG NBA Power Rankings
The first step of the process is to scrape the 2017-2018 game data. I’m going to get it from basketball reference because it’s stored in a table that makes it pretty easy to grab with readHTMLtable()
#Here, we load the necessary packages and grab the data for each game.
#Because each month is stored on a different page, a for loop allows us to collect the data for each month
library(XML)
library(RCurl)
## Loading required package: bitops
months <- c("october", "november", "december", "january", "february", "march")
for(i in 1:length(months)){
u <- paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-",months[i],".html")
newu <- getURL(u)
data <- readHTMLTable(newu)
if(i==1){
frame <- data$schedule
frame$`Visitor/Neutral` <- as.character(frame$`Visitor/Neutral`)
frame$`Home/Neutral` <- as.character(frame$`Home/Neutral`)
frame$PTSV <- as.numeric(as.character(frame[[4]]))
frame$PTSH <- as.numeric(as.character(frame[[6]]))
frame <- frame[!is.na(frame$PTSV),]
}
else{
frame1 <- data$schedule
frame1$`Visitor/Neutral` <- as.character(frame1$`Visitor/Neutral`)
frame1$`Home/Neutral` <- as.character(frame1$`Home/Neutral`)
frame1$PTSV <- as.numeric(as.character(frame1[[4]]))
frame1$PTSH <- as.numeric(as.character(frame1[[6]]))
frame1 <- frame1[!is.na(frame1$PTSV),]
frame <- rbind(frame,frame1)
}
if(i != 1 && i != length(months)){
remove(frame1)
}
}
The dataframe “frame” now holds the basic game data for the 2017-2018 season. Let’s take a look at what we’ve got
head(frame)
## Date Start (ET) Visitor/Neutral PTS Home/Neutral
## 1 Tue, Oct 17, 2017 8:01 pm Boston Celtics 99 Cleveland Cavaliers
## 2 Tue, Oct 17, 2017 10:30 pm Houston Rockets 122 Golden State Warriors
## 3 Wed, Oct 18, 2017 7:30 pm Milwaukee Bucks 108 Boston Celtics
## 4 Wed, Oct 18, 2017 8:30 pm Atlanta Hawks 117 Dallas Mavericks
## 5 Wed, Oct 18, 2017 7:00 pm Charlotte Hornets 90 Detroit Pistons
## 6 Wed, Oct 18, 2017 7:00 pm Brooklyn Nets 131 Indiana Pacers
## PTS Â Â Attend. Notes PTSV PTSH
## 1 102 Box Score 20,562 99 102
## 2 121 Box Score 19,596 122 121
## 3 100 Box Score 18,624 108 100
## 4 111 Box Score 19,709 117 111
## 5 102 Box Score 20,491 90 102
## 6 140 Box Score 15,008 131 140
This is pretty good, but we need to a couple of annoying yet necessary data cleaning steps
#Here, we create a new data frame called games with the data we want, as well as a new column with the home point differential of each game
games <- data.frame(frame$`Visitor/Neutral`,frame$PTSV,frame$`Home/Neutral`,frame$PTSH)
games$frame..Visitor.Neutral. <- as.character(games$frame..Visitor.Neutral.)
games$frame..Home.Neutral. <- as.character(games$frame..Home.Neutral.)
games$pt_dif <- games$frame.PTSH - games$frame.PTSV
head(games)
## frame..Visitor.Neutral. frame.PTSV frame..Home.Neutral. frame.PTSH
## 1 Boston Celtics 99 Cleveland Cavaliers 102
## 2 Houston Rockets 122 Golden State Warriors 121
## 3 Milwaukee Bucks 108 Boston Celtics 100
## 4 Atlanta Hawks 117 Dallas Mavericks 111
## 5 Charlotte Hornets 90 Detroit Pistons 102
## 6 Brooklyn Nets 131 Indiana Pacers 140
## pt_dif
## 1 3
## 2 -1
## 3 -8
## 4 -6
## 5 12
## 6 9
Now we can put this clean data into a final data frame that will contain 4 columns, team, opponent, ptdif and location
team <- character(length = 2*length(games$pt_dif))
opponent <- character(length = 2*length(games$pt_dif))
location <- character(length = 2*length(games$pt_dif))
ptdif <- vector(mode='numeric',length = 2*length(games$pt_dif))
clean <- data.frame(team,opponent,location,ptdif)
clean$team <- as.character(clean$team)
clean$opponent <- as.character(clean$opponent)
clean$location <- as.character(clean$location)
for(i in 1:length(games$pt_dif)){
clean$team[i] <- games$frame..Home.Neutral.[i]
clean$opponent[i] <- games$frame..Visitor.Neutral.[i]
clean$location[i] <- "H"
clean$ptdif[i] <- games$pt_dif[i]
clean$team[i + length(games$pt_dif)] <- games$frame..Visitor.Neutral.[i]
clean$opponent[i + length(games$pt_dif)] <- games$frame..Home.Neutral.[i]
clean$location[i + length(games$pt_dif)] <- "A"
clean$ptdif[i + length(games$pt_dif)] <- (-1)*games$pt_dif[i]
}
#Here, we correct for the 3 NBA games this season that were played at neutral locations
clean$location[363] <- "N"
clean$location[376] <- "N"
clean$location[615] <- "N"
clean$location[363 + length(games$pt_dif)] <- "N"
clean$location[376 + length(games$pt_dif)] <- "N"
clean$location[615+ length(games$pt_dif)] <- "N"
head(clean)
## team opponent location ptdif
## 1 Cleveland Cavaliers Boston Celtics H 3
## 2 Golden State Warriors Houston Rockets H -1
## 3 Boston Celtics Milwaukee Bucks H -8
## 4 Dallas Mavericks Atlanta Hawks H -6
## 5 Detroit Pistons Charlotte Hornets H 12
## 6 Indiana Pacers Brooklyn Nets H 9
Now we’re ready to go creating our model. Here’s a fun boxplot of the home point differentials for this season
boxplot(clean$ptdif[1:(.5*length(clean$ptdif))], col = "blue", horizontal = TRUE, main = "Home Point Differentials for 2017-2018 NBA Season", xlab = "Home Score - Away Score")
abline( v = 0, col = "red")
Our NBA model (spoiler alert) is a linear model that uses team, opponent, and location to predict ptdif. We fit a coefficient for each team, as well as for home court advantage, such that the formula for a given game’s predicted pt dif is R1 - R2 + H where R1 and R2 are the power coefficients (YUSAG coefficients) for the home and away teams respectively. Our model does this by finding the set of coefficients that minimize the (predictedscore - acutalscore)^2 over the course of the entire season
lm.NBAhoops <- lm(ptdif ~ team + opponent + location, data = clean)
#Let's take a look at some of the values
lm.NBAhoops$coefficients[1:16]
## (Intercept) teamBoston Celtics
## -2.2319493 8.9367978
## teamBrooklyn Nets teamCharlotte Hornets
## 1.4332406 4.5728856
## teamChicago Bulls teamCleveland Cavaliers
## -0.4063552 5.4354603
## teamDallas Mavericks teamDenver Nuggets
## 3.5956409 6.3379414
## teamDetroit Pistons teamGolden State Warriors
## 4.1123920 13.0223506
## teamHouston Rockets teamIndiana Pacers
## 14.4417820 6.3578205
## teamLos Angeles Clippers teamLos Angeles Lakers
## 6.3850592 4.0390877
## teamMemphis Grizzlies teamMiami Heat
## 0.4379501 5.3903852
Now we can create the rankings using these coefficients. We linearly adjust each value so that the mean of all the rankings is 0. This allows us to see teams as better or worse than an average NBA team immediately with their value.
rankings <- data.frame("team" = sort(unique(clean$team)),
"yusag_coeff" = rep(NA, 30))
scale_factor <- mean(lm.NBAhoops$coefficients[2:30])
rankings$yusag_coeff <- c(0, lm.NBAhoops$coefficients[2:30]) - scale_factor
rankings <- rankings[(order(rankings$yusag_coeff, decreasing = T)),]
rankings
## team yusag_coeff
## 11 Houston Rockets 8.7892184
## 28 Toronto Raptors 7.8391149
## 10 Golden State Warriors 7.3697870
## 2 Boston Celtics 3.2842342
## 29 Utah Jazz 3.2167448
## 27 San Antonio Spurs 2.9261685
## 21 Oklahoma City Thunder 2.8773180
## 23 Philadelphia 76ers 2.6016207
## 25 Portland Trail Blazers 2.4838489
## 18 Minnesota Timberwolves 2.3983290
## 30 Washington Wizards 1.0499319
## 13 Los Angeles Clippers 0.7324956
## 12 Indiana Pacers 0.7052569
## 8 Denver Nuggets 0.6853778
## 19 New Orleans Pelicans 0.5842646
## 6 Cleveland Cavaliers -0.2171033
## 16 Miami Heat -0.2621784
## 17 Milwaukee Bucks -0.8344661
## 4 Charlotte Hornets -1.0796780
## 9 Detroit Pistons -1.5401716
## 14 Los Angeles Lakers -1.6134759
## 7 Dallas Mavericks -2.0569227
## 20 New York Knicks -3.3257917
## 3 Brooklyn Nets -4.2193230
## 22 Orlando Magic -4.7416734
## 15 Memphis Grizzlies -5.2146135
## 1 Atlanta Hawks -5.6525636
## 5 Chicago Bulls -6.0589188
## 26 Sacramento Kings -7.2745375
## 24 Phoenix Suns -9.1048574
Here’s a stripchart of the rankings.
stripchart(rankings$yusag_coeff, pch = 19 , col = "blue", xlab = "YUSAG Coefficient", main = "YUSAG Coefficients for 2017-2018 Season")
Next, we can create another model, based on the first model, that finds the probability that a team will win given their predicted point differential. First we add two new columns showing the predicted ptdif of a game, as well as wheather or not that team actually won. Then we fit a logistic function that takes in 1 paramater (predicted point differntial) and outputs the win probability based on this season’s game data. This function has limits such that as x goes to infinity or negative infinity, y will go to 1 or 0
clean$predscore <- predict(lm.NBAhoops, newdata = clean)
clean$win <- ifelse(clean$ptdif > 0, 1, 0)
glm.pointspread <- glm(win ~ predscore, data = clean, family = "binomial")
clean$winprob <- predict(glm.pointspread, newdata = clean, type = "response")
We can plot the win prob vs the predicted score to see this function in action.
plot(clean$predscore, clean$winprob, xlab = "Home Predicted Score Differential", ylab = "Home Win Probability", main = "Logistic Function for Predicting Games with the YUSAG NBA Model", pch = 4, col = c("red","blue")[(clean$ptdif > 0) + 1])
legend(5,.5,legend=c("Actual Home Win", "Actual Road Win"), fill=c("blue","red"))
The last we have to do is clean our or environment and write a csv with our resulting power rankings.
remove(data)
remove(frame)
remove(frame1)
remove(games)
write.csv(rankings, "NBA_RANKINGS.csv", row.names = FALSE)
Let us know if you have any feedback, recommendations, and feel free to build on our work and create your own model!