Applying Regression to the Mean and Final Adjustments — Creating a College Baseball Projection System: Part 3

This is the third part in my college baseball projection system series. Part one can be found here and part two can be found here.

Andrew Grenbemer
Feb 17, 2021

Welcome to the third part in my college baseball projection series. In the first two parts I described the significance of a college baseball projection model, laid out how the data flows through the model, and went over the first step when creating projections. In this part I will be covering a couple of the adjustments that are applied to a player’s statistics once we create a weighted average.

Regression to the Mean

Regression to the mean is a complex topic and a statistical tendency that shows up when creating forecasts and projections. The idea is that any sample of data from a large population (think stats from one month of the season) may not be perfectly in line with the underlying average (a player’s true talent level or his career averages), but you would expect the next sample to be closer to that average than an unusual previous one was. Observations tend to cluster around the average even if the previous observation was unusual.

Regression to the mean matters when creating baseball projections because we can never directly measure a player’s true talent level; we simply make educated guesses based on the player’s outcomes on the field. Baseball includes a high degree of randomness that can impact those outcomes. For example, a .300 career hitter may have a bad month where he keeps hitting the ball directly at fielders, causing his average to drop. However, we know that this player has a history of being a .300 hitter, and we expect that in the future he will perform closer to that average than to his recent slump. Because of this randomness, it takes a long time for statistics to stabilize and give us a true idea of how good a player is. This is especially true in the college game, where the most seasons a player can have is four. We can think of this simple equation when considering an observation on a player:

Outcomes = Talent + Randomness
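
To make the equation concrete, here is a small simulation sketch in Python. Every number in it (a .270 league average, the spread of talent, the size of the random noise, the seed) is an assumption chosen for illustration, not a value taken from the projection system. It gives each simulated hitter a fixed talent level, draws two noisy samples around it, and then checks how the hitters with the best first sample do in their second one.

```python
import random


def simulate_two_samples(n_players=2000, league_avg=0.270,
                         talent_sd=0.020, noise_sd=0.040, seed=0):
    """Return a (first_sample, second_sample) batting average per player.

    Each sample is Outcomes = Talent + Randomness, with both parts drawn
    from normal distributions. All default values are illustrative.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_players):
        talent = rng.gauss(league_avg, talent_sd)   # fixed for the player
        first = talent + rng.gauss(0.0, noise_sd)   # talent + randomness
        second = talent + rng.gauss(0.0, noise_sd)  # same talent, new luck
        samples.append((first, second))
    return samples


def mean(values):
    return sum(values) / len(values)


samples = simulate_two_samples()

# Take the 10% of hitters with the best first sample.
hot_starters = sorted(samples, key=lambda s: s[0], reverse=True)[:200]
hot_first = mean([first for first, _ in hot_starters])
hot_second = mean([second for _, second in hot_starters])

print(f"hot starters, first sample:  {hot_first:.3f}")
print(f"hot starters, second sample: {hot_second:.3f}")
```

The group’s second-sample average should land between the league average and its own first-sample average. These hitters are, on balance, somewhat more talented than average, but much of the hot start was luck that does not repeat.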

This is why our projections account for regression to the mean. Because of small samples, we know that the data we have may not be 100% reflective of a player’s actual talent level, and that the randomness of the sport may have impacted the player’s outcomes. There is no single correct way to account for regression to the mean in a projection; some systems regress observations more heavily than others. However, correlation can give us an idea of how strongly our data will regress to the mean. A statistic that is perfectly correlated from one year to the next will not regress to the mean at all, while a statistic with no correlation at all will regress all the way to the mean. So when deciding how to weight each statistic’s regression factor, I looked at how much correlation it had from one year to the next.
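
As a rough sketch of how a year-to-year correlation can be measured, the Python below computes the Pearson correlation coefficient between the same players’ rates in consecutive seasons. The function name and the five pairs of walk rates are made up for illustration; a real calculation would presumably run over every player with back-to-back seasons in the data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a list is constant")
    return covariance / sqrt(var_x * var_y)


# Hypothetical walk rates (walks per plate appearance) for five players.
# Index i in both lists refers to the same player.
walk_rate_year_one = [0.08, 0.12, 0.10, 0.15, 0.06]
walk_rate_year_two = [0.09, 0.10, 0.11, 0.13, 0.08]

r = pearson_r(walk_rate_year_one, walk_rate_year_two)
print(f"year-to-year correlation: {r:.2f}")
```

Five players is far too few for a stable estimate; the tiny sample is only there to keep the example readable.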

To give a general idea of how I handled this, I will go over two variables my system projects: walks and triples. The correlation coefficient of walks from one year to the next was pretty high at .61 (0 being not correlated and 1 being perfectly correlated). This means that a player’s walks in the previous year were a fairly strong predictor of his walks in the next year, so walks need less regression to the mean. With triples we have the opposite case. The correlation coefficient for triples was .35. This means that the number of triples hit in one year told us less about the next year, so triples need more regression to the mean than walks. When we think about how these outcomes are generated, we can understand why they are more or less correlated. A triple is a pretty hard feat to accomplish in baseball, and the amount of luck and randomness that needs to occur in order to get one is pretty high. A walk has less randomness involved. Sure, the pitcher has to throw four balls, but the batter controls which pitches he swings at and which he lays off. With a triple, a batter has far less control over where the ball goes and what happens to it once it lands in the field of play, which is why triples are regressed more heavily to the mean than walks.
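
One common way to turn a correlation into a regression weight, shown here as a sketch rather than the exact formula used in this system, is to keep a share of the player’s gap from the population mean equal to the correlation and discard the rest. The .61 and .35 correlations come from the text above; the player rates and NCAA-average rates are hypothetical.

```python
def regress_to_mean(observed_rate, population_mean, correlation):
    """Shrink an observed rate toward the population mean.

    correlation = 1 keeps the observed rate unchanged;
    correlation = 0 returns the population mean.
    """
    if not 0.0 <= correlation <= 1.0:
        raise ValueError("correlation must be between 0 and 1")
    return population_mean + correlation * (observed_rate - population_mean)


WALK_CORRELATION = 0.61    # year-to-year correlation quoted in the article
TRIPLE_CORRELATION = 0.35  # year-to-year correlation quoted in the article

# Hypothetical per-plate-appearance rates for one player and for the NCAA.
projected_walk_rate = regress_to_mean(
    observed_rate=0.150, population_mean=0.100, correlation=WALK_CORRELATION)
projected_triple_rate = regress_to_mean(
    observed_rate=0.010, population_mean=0.004, correlation=TRIPLE_CORRELATION)

print(f"walk rate:   {projected_walk_rate:.4f}")
print(f"triple rate: {projected_triple_rate:.4f}")
```

Under this assumption the player keeps 61% of his above-average walk rate but only 35% of his above-average triple rate, which matches the idea that triples are regressed more heavily.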

So here’s a quick recap, because what I just covered is a pretty complex topic. We account for regression to the mean because small samples contain an element of randomness that needs to be accounted for. We can do this by regressing an observation to a population mean, which in this case is the average of NCAA players from 2013–2020. I do this in my projections by finding the year-to-year correlation coefficient of each statistic I’m trying to project and then applying a regression weight based on that coefficient, meaning some statistics are regressed more heavily than others.

Final Factors and Adjustments

Once we create the weighted average and regress to the mean, there are two more factors I apply to a player’s rates. The first is a conference factor. This factor is needed because Division 1 is much different from Major League Baseball. MLB is the highest level a player can reach, so the talent at the major league level is the highest you can get. Some teams are clearly better than others, but the gaps are not as extreme as in Division 1 baseball. Since Division 1 has many more conferences and 300 teams compared to 30 at the MLB level, the variance of talent from conference to conference is going to be a lot larger than anything at the major league level. To help offset this, I apply a small conference adjustment that looks at whether the conference a player plays in is above or below the NCAA average. The player’s rate gets a small increase or decrease depending on how far above or below average that conference is. These adjustments aren’t anything huge, typically only a couple of percentage points, and may add or subtract a couple of observations from a player’s total.
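
A minimal sketch of how such a conference factor could be applied is below. The conference names, the adjustment sizes, and the choice to multiply the rate by one plus the adjustment are all assumptions for illustration; the text only says that the adjustment is a couple of percentage points up or down.

```python
# Hypothetical adjustments, expressed as signed fractions of the rate.
# A positive value raises the player's rate, a negative value lowers it.
CONFERENCE_ADJUSTMENTS = {
    "Conference A": 0.02,
    "Conference B": -0.02,
}


def apply_conference_factor(rate, conference,
                            adjustments=CONFERENCE_ADJUSTMENTS):
    """Scale a regressed rate by a small conference-specific factor.

    Conferences missing from the table are treated as NCAA average
    and leave the rate unchanged.
    """
    adjustment = adjustments.get(conference, 0.0)
    return rate * (1.0 + adjustment)


regressed_walk_rate = 0.100  # hypothetical rate after regression
print(apply_conference_factor(regressed_walk_rate, "Conference A"))
print(apply_conference_factor(regressed_walk_rate, "Conference B"))
print(apply_conference_factor(regressed_walk_rate, "Unlisted Conference"))
```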

The next factor I apply to the rate is an age factor. This is similar to the aging curves you would see at the major league level, where players peak around 27–29 years old. However, since most players in college are only 18–22, we typically see players get better as they get older. To create this aging factor, I used historical NCAA data to create unique aging curves for each statistic I project and then applied them to players based on what class they are in. Once we apply this step, we have our projected rates for our players!
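
One simple way to build a class-based aging factor from historical data, offered as a sketch and not as the method actually used here, is to compare what players in each class did in one season with what the same players did the next. The class labels, the sample rows, and the ratio-of-totals calculation are all assumptions for illustration.

```python
from collections import defaultdict


def class_aging_factors(paired_seasons):
    """Estimate a next-season multiplier for each class.

    paired_seasons: iterable of (class_label, rate_this_year, rate_next_year)
    for players who appear in consecutive seasons.
    Returns {class_label: total next-year rate / total this-year rate}.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for class_label, rate_now, rate_next in paired_seasons:
        totals[class_label][0] += rate_now
        totals[class_label][1] += rate_next
    return {
        label: next_total / now_total
        for label, (now_total, next_total) in totals.items()
        if now_total > 0
    }


# Hypothetical walk rates for the same players in back-to-back seasons.
history = [
    ("FR", 0.08, 0.10),
    ("FR", 0.12, 0.12),
    ("SO", 0.10, 0.105),
    ("SO", 0.10, 0.105),
]

factors = class_aging_factors(history)

# Apply the freshman factor to a hypothetical freshman's regressed rate.
projected_rate = 0.090 * factors["FR"]
print(factors)
print(projected_rate)
```

A separate table would be built for each projected statistic, since the article describes a unique aging curve per statistic.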

Now that we know all the factors that go into a player’s rate, we have all the information we need to create season totals. In the next part of the series I will go over how my model projects playing time and how it handles players with no data, such as freshmen or transfers from junior college.

Hey, thanks for reading this article on college baseball projections! If you found this interesting or relevant, go ahead and leave a like, or in this case a clap. I love sharing things about baseball and data, so if you like either of these, consider giving me a follow and sharing this content with your friends. I am going to be trying to post on here regularly, so be sure to look out for more parts of this series and other data and baseball related articles!

Andrew Grenbemer

University of Oregon grad, avid baseball fan with a passion and interest for the data and analytics side of the game, aspiring baseball front office personnel