Data Flow and Weighted Averages — Creating a College Baseball Projection System: Part 2

This is the second part of my college baseball projection system series; the first part can be found here.

8 min read · Feb 13, 2021

Welcome to the second part of my college baseball projection system series. In part one I described what a projection system is and how it can be used at the college level. In this part I’m going to go into more depth on the actual system and explain how it takes a player’s data and uses it to decide which calculations need to be applied to create that projection. I’ll also cover the first step of the system once the data is sorted: creating a weighted average of prior data.

Going with the Flow

Since these projections will be computed in R, the first step in creating the projection system is deciding how the computer will know what to look for. A function takes arguments, which are values you supply when you run the function and which it uses to determine what to look for and what to do. For this baseball projection system we need four arguments: playername, teamid, Yr, and Pos.

playername: This argument looks for a player’s name in the data frame of past college baseball data. This is how the function knows what kind of data the player has and determines what rates will be used. It must be formatted Last, First to match up with the names in the data set.

teamid: This is a unique code assigned to each team by the NCAA website. This is used to identify what team and what conference the player plays in.

Yr: This is a column in the data set that lists the player’s year in school. This input is used to apply the aging factor and determine how much experience the player has in college baseball.

Pos: This is the positional argument, used to determine if the player needs hitting or pitching data or both.

Once all of these inputs are accounted for, the function can use them to determine what process each player must undergo. Below is a basic flowchart I created that shows how a player is grouped and what calculations are applied to them.

Basic flowchart for the projection system

As you can see, the data starts out with the four inputs, then sends each player down a different “path” depending on those inputs and applies unique calculations for each. What I didn’t show in the flowchart is that once the data reaches the conference and age factor, it branches out for each individual combination of conference and class, which would be way too many to include in the chart. After the calculations are completed, the function combines them and returns a data frame with the player’s projected stats.
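To make the flow a little more concrete, here is a minimal R sketch of how a function with these four arguments might route a player. The function name `project_player`, the column names, the team codes, and the position codes (`"P"`, `"TWP"`) are all assumptions made up for illustration; this is not the actual implementation, and it stops before any projection math.

```r
# Hypothetical skeleton: names, columns, and position codes are assumptions.
project_player <- function(playername, teamid, Yr, Pos, data) {
  # Pull this player's past seasons; names are expected as "Last, First".
  history <- data[data$Player == playername & data$teamid == teamid, ]

  # Decide which paths the player goes down based on position.
  needs_pitching <- Pos %in% c("P", "TWP")  # "TWP" = assumed two-way code
  needs_hitting  <- Pos != "P"

  list(
    player         = playername,
    class_year     = Yr,             # later used for the aging factor
    seasons_found  = nrow(history),
    needs_hitting  = needs_hitting,
    needs_pitching = needs_pitching
  )
}

# Toy data frame standing in for the real data set (made-up rows).
toy <- data.frame(
  Player = c("Doe, John", "Doe, John", "Roe, Sam"),
  teamid = c(110, 110, 245),
  year   = c(2019, 2020, 2020)
)

route <- project_player("Doe, John", teamid = 110, Yr = "Jr", Pos = "SS", data = toy)
str(route)
```

In a fuller version, each branch would call its own hitting or pitching calculations and the results would be bound into one data frame at the end.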

Weight, what?

What a great question! When I talk about weighted averages, that’s probably the first thing someone with no experience with the term may ask. What are we weighting, what is the purpose of weighting, and why do we use it when creating these projections? These are all excellent questions that I’ll answer here.

First I want to start with an example, and to keep it simple I will use a baseball example. I’m sure most of the people reading this are familiar with Albert Pujols; if not, look up his career stats. One thing you may notice is the trend of his stats, which have steeply declined from his earlier days. This is mostly due to aging, but I’m using it to show why weighting our data is important, especially in sports. If we took Albert Pujols’ stats and simply averaged every season from the beginning of his career until now, projections based on that average would be pretty far off: his year-to-year production has dipped substantially, and those earlier years keep the average higher than the recent trend of his performance. This is why we apply weighting. We know that more recent data is more closely related to Pujols’ current performance level, so when predicting his performance the recent data matters more to our projections. I’m not saying throw out his past data entirely and just use the previous season, but the more recent data is going to help us more when trying to predict his performance.
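A tiny R sketch of the idea, using made-up home run rates for a hypothetical declining hitter (these are not Pujols’ actual numbers) and illustrative weights that simply count newer seasons more:

```r
# Made-up home run rates for a hypothetical hitter in decline, oldest season first.
hr_rate <- c(0.070, 0.065, 0.050, 0.035)

# Plain average: every season counts the same.
plain <- mean(hr_rate)

# Illustrative recency weights (assumed, not fitted): newer seasons count more.
w <- c(1, 2, 3, 4)
weighted <- weighted.mean(hr_rate, w)

round(c(plain = plain, weighted = weighted, latest = hr_rate[4]), 3)
```

The plain average comes out at 0.055 and the weighted one at 0.049, so the weighted average sits closer to the most recent season without discarding the older ones.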

In order to capture this at the college level I used linear regression to measure how well past seasons of data predict the next one. I started off using three seasons of prior data to predict the fourth (for example, 2016, 2017, and 2018 to predict 2019); however, I noticed that the third season back carried very little significance in the regressions. At first I wondered why that was, but then it became clear. One main difference between college and MLB is the number of years you spend at each level. In college the most seasons of data a player can have is four, while in MLB many players play far longer than four seasons. If we also look at the top talent in college baseball, we can assume that most will be drafted after their junior season. This means that most of the top talent in college baseball will only have two main seasons of data to make projections from, since after their third season they will be drafted and no longer in college. This led me to use two seasons of data for the weighted average. Sure, some seniors will have three seasons of data, but since most of the top juniors get drafted, most seniors who remain weren’t the top talent of the previous season. I’m not saying seniors in college baseball are bad; most are quite good, having the most experience. I just opted to use two years of data rather than three to keep it consistent with the juniors.

Calculating the Weights

So how did I actually come up with these weights? Some projection systems use arbitrary numbers; Marcel, for example, I believe uses 3 years of data weighted 5/4/3. However, I wanted my weights to be unique to each statistic I calculate. Some stats are more predictive from one year to the next, like walks and strikeouts, while something like triples may be less predictive due to the difficulty and sometimes luck needed to get a triple. So in order to create these weights I again used linear regression with the same method I mentioned earlier. If you’re not familiar with linear regression, my best one-sentence explanation is that it takes predictor variables, measures how much each one impacts the response variable, and builds a linear equation from those predictors to predict the response. Here’s an example of a linear regression output using home runs hit in 2018 and 2017 to predict home runs in 2019.

Call:
lm(formula = HR_19 ~ HR_18 + HR_17, data = D1_Join)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.3996  -1.9458  -0.5867   1.3296  22.1947 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.42361    0.13021  10.933   <2e-16 ***
HR_18        0.65157    0.03339  19.515   <2e-16 ***
HR_17        0.38771    0.03915   9.902   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The Estimate column under Coefficients gives the formula used to calculate a player’s home runs in 2019. So if a player hit 10 home runs in 2018 and 8 home runs in 2017, that would be 1.42361 + .65157(10) + .38771(8), which equals about 11 home runs. This would be a very basic way of predicting home runs. I’m not using this to make my predictions, though; I just want to see how much more significant the previous season was than the season two years ago. In this example .65157/.38771 equals 1.68, meaning that the previous season’s home runs were worth 1.68 times as much as those from two seasons ago. To capture how much each individual statistic should be weighted, I followed this same process for each pairing of years (so the next group would be predicting 2018 home runs with 2017 and 2016 data) and went back as far as the data frame allowed, which is the 2013 season. I then calculated the weight for each regression and took the average of those to come up with the weight for that statistic. I did that for each individual statistic so that each one would have a unique weight.
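Here is a rough R sketch of that process. The first part just re-checks the arithmetic using the coefficients printed in the output. The second part runs the regress-then-average loop on simulated data, because the real `D1_Join` data frame isn’t shown here; the helper names `recency_weight` and `make_pairing`, the column names, and the simulated values are all assumptions for illustration.

```r
# Ratio of the recent-season coefficient to the older-season coefficient.
# Assumes the model was fit as: response ~ recent + older (in that order).
recency_weight <- function(fit) {
  b <- coef(fit)          # b[1] is the intercept
  unname(b[2] / b[3])
}

# Re-check the numbers using the estimates from the regression output.
est <- c(intercept = 1.42361, HR_18 = 0.65157, HR_17 = 0.38771)
pred_hr  <- est[["intercept"]] + est[["HR_18"]] * 10 + est[["HR_17"]] * 8
hr_ratio <- est[["HR_18"]] / est[["HR_17"]]
round(c(pred_hr = pred_hr, hr_ratio = hr_ratio), 2)  # 11.04 and 1.68

# Simulated stand-in for two pairings of seasons (not real NCAA data).
set.seed(42)
make_pairing <- function(n = 300) {
  older  <- rpois(n, 4)
  recent <- rpois(n, 4)
  data.frame(older, recent,
             target = 1.4 + 0.65 * recent + 0.39 * older + rnorm(n, sd = 2))
}
pairings <- list(make_pairing(), make_pairing())

# One regression per pairing, then average the ratios into a single weight.
fits <- lapply(pairings, function(d) lm(target ~ recent + older, data = d))
hr_weight <- mean(sapply(fits, recency_weight))
hr_weight
```

With real data, each element of `pairings` would be one set of three consecutive seasons, and the same loop would be repeated for every statistic.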

Applying the Weights

Once we find how much each statistic needs to be weighted, we can add the weights to the formula and create a weighted average for the player we are looking at. I also opted to weight each player’s rates by the number of plate appearances they had. So our equation looks something like this:

(Player’s 2020 rate * 2020 PA * weight + Player’s 2019 rate * 2019 PA) / (weight * 2020 PA + 2019 PA)

I’m going to go through a quick example with a college player to see how this impacts their projected home run rate. Below I’ve filled in the same formula with the values for UCLA’s Matt McLain (former first round draft pick and projected top ten pick in the upcoming draft).

(.046875*64*1.3+.016064*249)/(1.3*64+249)

7.9/332.2 = .02378

We can see that the weighted home run rate for McLain is about 2.4%, which is lower than his 2020 rate but higher than his 2019 rate. By weighting by PA as well, we account for the shortened season in 2020. Could McLain have sustained that rate through the 2020 season if it hadn’t been cut short? Sure, absolutely, but based on his past performance, which was uncharacteristically poor his freshman year, we can’t be sure of that. So by weighting by PA we account for the lack of data. Had 2020 been a full season and had McLain performed at a rate similar to the one he posted in the shortened season, his projected rate would be closer to his 2020 rate than his 2019 rate due to the 1.3 weight applied to his 2020 data.
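The same calculation as a small R function, as a sketch. The function and argument names are my own for illustration; the inputs are the McLain numbers from the example. The second call is a what-if that assumes he kept his 2020 rate over as many PA as he had in 2019.

```r
# PA- and recency-weighted rate, following the formula from this section.
weighted_rate <- function(rate_recent, pa_recent, rate_prior, pa_prior, weight) {
  (rate_recent * pa_recent * weight + rate_prior * pa_prior) /
    (weight * pa_recent + pa_prior)
}

# McLain's home run rates and plate appearances from the worked example.
actual <- weighted_rate(0.046875, 64, 0.016064, 249, weight = 1.3)

# What-if: same 2020 rate, but over 249 PA like 2019 (an assumption).
full_season <- weighted_rate(0.046875, 249, 0.016064, 249, weight = 1.3)

round(c(actual = actual, full_season = full_season), 4)  # 0.0238 and 0.0335
```

With equal PA in both seasons the 1.3 weight gives 2020 about 57% of the say (1.3 / 2.3), so the what-if lands at roughly 3.3%, past the midpoint of the two rates and toward 2020.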

One other thing I’d like to touch on is the fact that some players only have data in 2020 or 2019. We can’t really create a weighted average with just one season of data, so we need to provide something to average their rates against. This is where we compare them to the average across the Division 1 level. For whichever year is missing, we substitute the Division 1 average rates and plate appearances for that year. So essentially we are regressing them toward the mean, which is the topic I will cover in the next part of this series.
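A short R sketch of that substitution. The Division 1 average rate and PA below are placeholder values invented for illustration, not real NCAA figures, and the player is hypothetical; only the weighted-average formula comes from this article.

```r
# PA- and recency-weighted rate, same formula as in the McLain example.
weighted_rate <- function(rate_recent, pa_recent, rate_prior, pa_prior, weight) {
  (rate_recent * pa_recent * weight + rate_prior * pa_prior) /
    (weight * pa_recent + pa_prior)
}

# Placeholder Division 1 averages for the missing year (made-up values).
d1_avg <- list(hr_rate = 0.020, pa = 150)

# Hypothetical player with 2020 data only; 2019 is filled with the D1 average.
rate_2020 <- 0.040
pa_2020   <- 60

projected <- weighted_rate(rate_2020, pa_2020,
                           d1_avg$hr_rate, d1_avg$pa,
                           weight = 1.3)
round(projected, 4)
```

Because the missing season is replaced by the average, the result always falls between the player’s own rate and the Division 1 rate, which is what pulls single-season players toward the mean.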

The key takeaways from this part of the series are how data goes into the function and what comes out, as well as the first main step of creating the weighted average. This is important because all of the projections start with this weighted average, and this is where a player’s own statistics can help or hurt them when it comes to their projections. As I mentioned, in the next part of this series I will cover the next steps in the projection process: applying regression to the mean as well as the various factors that are applied.

Hey, thanks for reading this article on college baseball projections! If you found this interesting or relevant, go ahead and leave a like, or in this case a clap. I love sharing things about baseball and data, so if you like either of these, consider giving me a follow and sharing this content with your friends. I am going to try to post on here regularly, so be sure to look out for more parts of this series and other data and baseball related articles!