Creating a College Baseball Projection System: Part 1

Andrew Grenbemer
Feb 12, 2021

My first series of posts on this blog will cover a project I have been working on in anticipation of the return of baseball season, and of college baseball in particular. With the 2020 college season cancelled in mid-March, we are long overdue for college baseball and the excitement it brings. Having experience within college baseball, I know the feeling of anticipation leading up to that opening series, so what better time to share some college baseball content than with the start of the 2021 season rapidly approaching?

My goal with this series is to explain and share the process of creating a projection system for college baseball. To my knowledge, this is the first publicly available player-level projection system for the college level. Other sites may post projected standings or projected NCAA tournament brackets, but this system goes down to the player level, creating a projection for each player on a Division 1 roster. That seems pretty significant to me, since I may be one of the first to explore this topic publicly.

A baseball projection is a forecast of a player’s performance for the upcoming season. Projections at the major league level are quite common. Some of the best-known projection systems include Marcel (created by Tom Tango), PECOTA (created by Nate Silver and featured on Baseball Prospectus), Steamer (created by Jared Cross and featured on FanGraphs), and ZiPS (created by Dan Szymborski and also featured on FanGraphs). Most of these systems share fairly similar ideas but use different processes for calculating their projections. I have included links to the MLB glossary articles for each of these systems if you’d like a brief explanation of each. When researching how to create a baseball projection system, I looked closely at the Marcel and Steamer methodologies, as those two systems are pretty straightforward in how they calculate their projections. That straightforward methodology, along with the lack of data at the NCAA level, led me to believe that this would be the best approach.
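To give a feel for why a Marcel-style approach counts as straightforward, here is a minimal Python sketch of its core idea: weight recent seasons more heavily, then pull the result toward the league average. This is an illustration only, not the system built in this series (which is written in R). The function name `marcel_style_rate` is hypothetical, the 5/4/3 weights and 1200 plate appearances of regression are the values commonly cited for Marcel's MLB hitter projections and may not suit college data, and the age adjustment Marcel also applies is left out.

```python
def marcel_style_rate(seasons, league_rate, weights=(5, 4, 3), regression_pa=1200):
    """Project a rate stat (e.g. on-base percentage) from up to three seasons.

    seasons: list of (successes, opportunities) tuples, most recent season first.
    league_rate: league-average rate used as the regression target.
    weights, regression_pa: assumed defaults taken from the commonly cited
    MLB Marcel description; they are placeholders for college data.
    """
    weighted_successes = 0.0
    weighted_opportunities = 0.0
    # zip stops at the shorter input, so players with one or two seasons still work
    for (successes, opportunities), weight in zip(seasons, weights):
        weighted_successes += weight * successes
        weighted_opportunities += weight * opportunities
    # Regress to the mean by adding regression_pa league-average opportunities
    numerator = weighted_successes + league_rate * regression_pa
    denominator = weighted_opportunities + regression_pa
    return numerator / denominator


# Made-up example: 80 times on base in 200 PA last year, 60 in 200 the year before
projected_obp = marcel_style_rate([(80, 200), (60, 200)], league_rate=0.350)
print(round(projected_obp, 3))
```

A player with little or no playing time ends up close to the league average, which is the behavior that makes this kind of method workable when data is sparse.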

Why Build a College Baseball Projection System?

Looking at the evolution of college baseball, we are finally starting to see the data and statistics revolution that has taken over MLB trickle down into the college levels. Teams are investing more in data and analytics to help them make decisions about players and player development. The importance of data is increasing at the college level, and projections can go hand in hand with the data teams already use to scout and evaluate talent. There is also a lack of public college data out there, so using the tools and data at my disposal I can provide some level of public data analysis for college baseball, a niche in the baseball community that seems to be lacking in content. This can give coaches and teams an idea of where they stack up relative to their conference and to other teams across Division 1. It can also give them an idea of how they can expect players on their roster to perform. Now, this system isn’t meant to perfectly predict what is going to happen at the NCAA level. I’ll say it now: there are going to be places where the system is way off, but there are also going to be instances where it’s pretty much spot on. Like I said earlier, it’s a best “guess” at a player’s performance given their past data.

Getting Started

To build a projection system, I first need to acquire all the data that will go into the projections. Thankfully, RStudio and the baseballr package make this easy. I am able to scrape NCAA statistics dating back to the 2013 season from the NCAA stats page, so my data set contains the yearly statistics of every Division 1 player since 2013. Here’s a sample of the data set with only a few of the first columns:

As you can see, this data set contains plenty of the data needed to create projections. The remaining columns, not shown in the image, include yearly totals for any of the statistics you’d find on an NCAA stats page. I have also done the same thing with pitching data in a separate data frame. Just like that, I have all the prior data I need to build my projection system. The flow of a projection system breaks down to this: data goes in one end, the algorithm looks at the past data and performs various calculations on it, and out the other end comes that player’s projected performance. In the next part of the series I will go over the flow of data through the projection system and the first element used when creating these projections.
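The data-in, calculations, projection-out flow described above can be sketched in a few lines. This is a hedged Python illustration, not the R code behind this project: the column names (`player`, `year`, `pa`, `on_base`) and function names are hypothetical, the sample rows are made up, and the calculation step is a deliberately simple placeholder (career on-base rate) standing in for whatever the real system computes.

```python
from collections import defaultdict


def group_by_player(rows):
    """Data in: collect each player's season rows, most recent season first."""
    by_player = defaultdict(list)
    for row in rows:
        by_player[row["player"]].append(row)
    for seasons in by_player.values():
        seasons.sort(key=lambda season: season["year"], reverse=True)
    return by_player


def project_player(seasons):
    """Calculation: placeholder that returns the career on-base rate.

    Returns None when the player has no plate appearances on record.
    """
    total_pa = sum(season["pa"] for season in seasons)
    total_on_base = sum(season["on_base"] for season in seasons)
    return total_on_base / total_pa if total_pa else None


def run_projections(rows):
    """Projection out: map each player to a projected value."""
    grouped = group_by_player(rows)
    return {player: project_player(seasons) for player, seasons in grouped.items()}


# Made-up rows shaped like one line per player per season
sample_rows = [
    {"player": "Player A", "year": 2019, "pa": 200, "on_base": 70},
    {"player": "Player A", "year": 2020, "pa": 50, "on_base": 20},
    {"player": "Player B", "year": 2020, "pa": 0, "on_base": 0},
]
print(run_projections(sample_rows))
```

Keeping the calculation in its own function means the placeholder can be swapped for a more realistic method without touching the grouping or output steps.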

Hey, thanks for reading this article on college baseball projections! If you found this interesting or relevant, go ahead and leave a like, or in this case a clap. I love sharing things about baseball and data, so if you like either of these, consider giving me a follow and sharing this content with your friends. I am going to try to post here regularly, so be sure to look out for more parts of this series and other data- and baseball-related articles!

Andrew Grenbemer

University of Oregon grad, avid baseball fan with a passion and interest for the data and analytics side of the game, aspiring baseball front office personnel