NCAA Baseball Model Methodology

By Luke Benz

Updated June 16th, 2017

Recently, we built a model to predict the outcome of NCAA Baseball games, and we have been using it to analyze the NCAA College World Series. Our linear model uses team, opponent, and game location to predict run differential. From there, we use a simple logistic regression to translate run differential into win probability. Each team’s model coefficient is the number of runs by which it would be expected to beat the model’s baseline team (A&M Corpus Christi, first alphabetically) on a neutral field. Taking the difference between two teams’ coefficients gives the margin by which Team A would be expected to beat Team B on a neutral field. The model also has coefficients for relative game location (Home, Away, Neutral). Home field advantage is worth roughly 1/3 of a run, according to our model.
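
To make the arithmetic concrete, here is a minimal sketch of how a prediction could be assembled from fitted coefficients. Everything in it is illustrative: the team names, coefficient values, the way location effects are encoded, and the logistic slope are all assumptions, not output from our model.

```python
import math

# Hypothetical coefficients, in runs relative to the baseline team (fixed at 0).
team_coef = {"Baseline": 0.0, "Team A": 4.0, "Team B": 2.5}

# Hypothetical location effects from the first team's point of view.
# The values and the symmetric Home/Away encoding are assumptions.
location_coef = {"Home": 1 / 3, "Neutral": 0.0, "Away": -1 / 3}


def expected_run_diff(team, opponent, location="Neutral"):
    """Predicted runs by which `team` beats `opponent` at `location`."""
    return team_coef[team] - team_coef[opponent] + location_coef[location]


def win_probability(run_diff, slope=0.4):
    """Map a predicted run differential to a win probability.

    Uses a logistic curve with no intercept; `slope` is a made-up value
    standing in for a fitted logistic regression coefficient.
    """
    return 1.0 / (1.0 + math.exp(-slope * run_diff))


neutral = expected_run_diff("Team A", "Team B")        # difference of coefficients
at_home = expected_run_diff("Team A", "Team B", "Home")  # plus the location effect
print(neutral, at_home, win_probability(neutral))
```

With these made-up numbers, Team A is favored over Team B by the difference of their coefficients on a neutral field, and slightly more when playing at home.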

In essence, this method determines how much of a team’s results can be explained by its own strength compared to the strength of its opponent. This method works well because so many games are played over the course of a season (on the order of 10,000) and there are enough cross-conference games during the non-conference schedule for the model to connect conferences and estimate their relative strengths.

Note: A decrease in a team’s model coefficient doesn’t necessarily mean the team has declined. It could be that A&M Corpus Christi, the model’s baseline team, has improved. This doesn’t affect prediction results because all teams’ coefficients are shifted accordingly.
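
A small sketch of why a baseline shift leaves predictions alone, using made-up coefficients (none of these numbers come from our model): if the baseline team improves by half a run, every other team’s coefficient drops by half a run, but the difference between any two of those teams is unchanged.

```python
def run_diff(coefs, team, opponent):
    """Predicted neutral-field margin: difference of the two coefficients."""
    return coefs[team] - coefs[opponent]


# Hypothetical coefficients before the baseline team improves.
before = {"Baseline": 0.0, "Team A": 4.0, "Team B": 2.5}

# Suppose the baseline team gets 0.5 runs better. Its coefficient stays
# pinned at 0, so every other team's coefficient falls by 0.5 instead.
shift = 0.5
after = {
    name: (coef if name == "Baseline" else coef - shift)
    for name, coef in before.items()
}

# Team A's coefficient is lower, but its edge over Team B is the same.
print(before["Team A"], after["Team A"])
print(run_diff(before, "Team A", "Team B"), run_diff(after, "Team A", "Team B"))
```

The only matchups whose predictions change are those involving the baseline team itself, which is what we would want, since that team is the one that actually improved.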

A complete list of our baseball power rankings can be found here, our predictions for the NCAA Super Regionals can be found here, and our CWS odds can be found here.