By Luke Benz
November 21, 2017
Having acquired complete men’s college basketball play-by-play data for the 2016-17 season, I set out to build an in-game win probability model which updates after each play. The input variables to my model are score differential, seconds remaining in the game, and the home team’s pre-game win probability (used as a prior/measure of relative team strength), with the response being the probability that the home team wins, given the game situation.
The pre-game win probability was computed using a logistic regression predicting game outcome from the game’s Vegas line as listed on ESPN. In the event that a Vegas line was not listed, the pre-game line was taken to be the predicted score differential from our Division-1 Power Rankings. While this method of imputation covers most cases, it fails when one team is not in Division 1 and/or the game was played prior to 2016. In these cases, the pre-game win probability is simply set to 50%.
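A minimal sketch of this pre-game step, assuming a negative Vegas line means the home team is favored; the toy data and function names here are hypothetical, not the code from the actual model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row is one game's Vegas line
# (negative = home team favored) and whether the home team won.
lines = np.array([-12.5, -3.0, 1.5, -7.0, 4.5, -1.0, 9.0, -15.0]).reshape(-1, 1)
home_win = np.array([1, 1, 0, 1, 0, 1, 0, 1])

pregame_model = LogisticRegression()
pregame_model.fit(lines, home_win)

def pregame_win_prob(vegas_line):
    """Pre-game home win probability from the Vegas line.

    Falls back to 50% when neither a Vegas line nor a
    ranking-based predicted score differential is available.
    """
    if vegas_line is None:
        return 0.5
    return pregame_model.predict_proba([[vegas_line]])[0, 1]
```

In the real model, the imputed line from the power rankings would simply be passed through the same fitted regression in place of the missing Vegas line.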
I first tried to build a single logistic regression with my three input variables. While this initial attempt worked reasonably well for most of the game, it performed very badly near the end of games, seemingly missing the fact that non-zero score differentials at the end of games are deterministic. After doing some reading online, it seemed I wasn’t the only one to have encountered this problem. Namely, both Bart Torvik and Brian Burke have noted that creating separate logistic regressions for fixed-time intervals solves this issue. The crux of the issue lies in the fact that there is a non-linear relationship between time remaining and win probability. One would imagine that a team with a five point lead with 12:00 remaining in the first half is about as likely to win as that same team with a five point lead 20 seconds later. However, time becomes much more important later in the game. A five point lead with 10 seconds left yields a significantly larger win probability than that same lead with 30 seconds left to play.
Thus, my win probability model is really a sequence of 280 logistic regressions: one for each ten-second interval between 1-40 minutes remaining in the game, one for each two-second interval between 30-60 seconds remaining, and one for each one-second interval between 0-30 seconds remaining. Each logistic regression uses only score differential and pre-game win probability.
The only thing left to do was deal with overtime. I treated regulation in overtime games identically to regulation in non-overtime games. That is, the end of regulation was always marked as 0 seconds remaining in the game, 1:00 was marked as 60 seconds remaining, etc. Each overtime period was considered a distinct 300-second period, since during an actual game we would not know beforehand how long a given game might last.
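One way to sketch this clock bookkeeping, with each overtime treated as its own 300-second period (a hypothetical helper, not the code from the repository):

```python
def secs_remaining(period, mins, secs):
    """Seconds remaining in the current playing span.

    period 1-2: regulation halves (20:00 each), counted down to 0 at the
    end of regulation. period >= 3: each overtime is its own 300-second
    span, since we can't know in advance how many overtimes will be played.
    """
    on_clock = 60 * mins + secs
    if period == 1:          # first half: add the entire second half
        return on_clock + 1200
    return on_clock          # second half or any overtime period
```

Under this scheme, every overtime reuses the same late-game regressions fit on the final 300 seconds, rather than requiring overtime-specific models.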
Now let’s compare some example charts my model generates to those of FiveThirtyEight and ESPN. I have chosen charts for two of the crazier games in recent memory. First, we’ll look at Texas A&M’s crazy comeback against Northern Iowa in the 2016 NCAA Tournament.
Next, let’s check out the 4 OT thriller between Ohio and Indiana State in this year’s Charleston Classic.
Not too shabby! My model seems to hold up pretty well against some of the most respected analytics in the industry. One pitfall of my model is that it doesn’t incorporate possession. This creates a hiccup near the end of close games, where possession by the winning team might warrant higher win probabilities than my model will predict. In any case, this model seems to do the job of telling the key story lines of a game.
All code used in this model can be found on my GitHub. If you want a custom chart made or have any questions you’d like to bring up, email me at luke.benz@yale.edu or tweet @YaleSportsGroup/@recspecs730.