It's a matter of faith among college football fans that good recruiting leads to winning. A few skeptics suggest causation runs the other direction -- teams recruit well because they win. And there is a third hypothesis -- both winning and recruiting are caused by a third factor, maybe coaching or just general "goodness." I’ve tried to look at some data to see if we can figure it out.
As usual I've half-assed this analysis, but hopefully we can get some insights anyway. At the very least we might learn how to think about the debate. My basic idea is to look at historical winning and recruiting and try to figure out how they are correlated and whether we can point to one as the cause of the other.
The data set
For recruiting, I looked up past recruiting for the Pac-10 teams from CBSSports. (I left out Colorado and Utah for obvious reasons.) Recruiting rankings go back to 2000. I decided to work with rankings within the Pac-10 rather than raw recruiting scores because I had more confidence in the ability of CBSSports to tell me whether UCLA out-recruited Arizona in 2002 than I did in their ability to tell me whether UCLA’s recruiting class was "really REALLY good," or just "really good." In retrospect, that may have been a mistake, but we’re talking about water over the dam.
Anyway, I translated the recruiting into a 1-10 Pac-10 ranking for each team in each year. Then I translated each team's conference win total into a similar 1-10 ranking going back to 1996. What we have, then, is a big database of each team's win rank back to 1996 and recruiting rank back to 2000. The table below shows what this looked like for a few recent years. The highlighted columns are the ones we’re interested in; they show how each team ranked each year within the Pac-10 in conference wins and recruiting.
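For anyone who wants to try the ranking step themselves, here is a minimal sketch in Python. The team names and win totals are made up, the function name conference_ranks is just for illustration, and the tie handling (tied teams share the average of the spots they occupy) is an assumption, since ties aren't discussed above.

```python
def conference_ranks(totals):
    """Map {team: value} to {team: rank}, where rank 1 is the highest value.

    Tied teams share the average of the positions they occupy. This is an
    assumption for illustration; other tie rules are equally plausible.
    """
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j across the run of teams tied with ordered[i].
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # average of 1-based positions
        for team, _ in ordered[i:j + 1]:
            ranks[team] = shared
        i = j + 1
    return ranks


# Made-up conference win totals, for illustration only.
wins = {"Team A": 7, "Team B": 5, "Team C": 5, "Team D": 2}
win_ranks = conference_ranks(wins)
```

The same function would work for recruiting scores, as long as a higher score means a better class.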
Stat 101: Correlation
The idea is to ask how well past and future recruiting correlate with current wins. Correlation is a statistical concept that basically tells you how well you could predict one thing if you knew the other. Let's say there's a good correlation between the weather in North Bend and how sad Brad Johnson is. That means that if you tell me it's raining in North Bend, I can make a pretty good guess that Brad is feeling a bit weepy and might even be writing a poem about his cat.
It also means that if you tell me Brad is making balloon animals and blasting Katrina and the Waves, I can tell you it's sunny and 75 in North Bend. In other words, correlation is not causation; it could go either way. We only know that they go together.
The "correlation coefficient" goes from 0 (no correlation) to 1 (perfect correlation). (It can also be negative, but let's not cloud the issue.) A big coefficient means good correlation, a low one means bad correlation. If we multiply the correlation coefficient by itself, we get what's called the "R-squared" value, which tells us what percentage of one variable is "explained" by the other. Say the R-squared between weather in North Bend and Brad's mood is 60%. That would mean that 60% of Brad's mood swings have to do with the weather and 40% have to do with something else. Probably reruns of Sex and the City.
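If you'd rather see the arithmetic than take my word for it, here is a small sketch of the standard Pearson correlation coefficient and its square, written in plain Python. The ranks are made up for illustration and are not the Pac-10 data; pearson_r is just a name I picked.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up ranks for ten teams (1 = best); not the article's data.
recruit_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
win_rank = [3, 1, 2, 6, 4, 8, 5, 10, 7, 9]

r = pearson_r(recruit_rank, win_rank)
r_squared = r ** 2  # share of one variable "explained" by the other
```

Note that pearson_r(recruit_rank, win_rank) and pearson_r(win_rank, recruit_rank) give the same number, which is the "it could go either way" point in code form.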
Let's look at a simple example using our rankings of wins and recruiting. The figure below plots the Pac-10 teams in terms of win rank in 2014 and recruiting rank in 2012 (i.e., the guys who were juniors and redshirt-sophomores in 2014). Each dot is one team. The trend-line and R-squared are shown. For this example, we can say that about 20% of win rank in 2014 is explained by recruiting rank in 2012, or vice versa. This would be considered a weak to moderate correlation.
"But wait!," I can hear you say. "Does it really make sense that winning in 2014 could explain -- could cause -- good recruiting in 2012?" Good question. No, it doesn't. So we do have some causal information after all: since time travel is not possible, present winning cannot explain past recruiting. Likewise, present recruiting cannot explain past winning. This does not mean past recruiting causes current winning, however. It could be that something else, like a new coach, causes both.
We noted in the graph above that the R-square was 20%. This is just for one pair of win-rank/recruiting rank years. Below is another graph, for recruiting in 2008 and winning in 2010. In this example the correlation is very small -- R-square about 7% -- which means that recruiting in 2008 had very little to do with winning in 2010.
The point is that there's a lot of noise. We want to look at as many correlations as we can to filter out as much of the noise as possible. Also, we’re interested not just in a two-year interval (i.e., recruiting in 2008, winning in 2010) but in every interval from five years up to zero. We’re also interested in "negative" intervals, such as how does winning in 2010 correlate with recruiting in 2012.
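Here is a rough sketch of what "every interval" could look like in code. It assumes the ranks live in dictionaries keyed by (team, year), which is my own hypothetical layout, and it pools every team-year pair for a given lag before computing R-square. The data is synthetic: win rank is built to copy recruiting rank from two years earlier, so the lag-2 value comes out as a perfect 1 by construction.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_r_squared(win_rank, recruit_rank, lag):
    """R-square between win rank in year y and recruiting rank in year y - lag.

    A positive lag means the recruiting class came before the season;
    a negative lag means it came after. Pairs are pooled over all teams.
    """
    xs, ys = [], []
    for (team, year), w in win_rank.items():
        r = recruit_rank.get((team, year - lag))
        if r is not None:
            xs.append(r)
            ys.append(w)
    return pearson_r(xs, ys) ** 2


# Synthetic ranks for four made-up teams, for illustration only.
rng = random.Random(1)
teams = ["A", "B", "C", "D"]
recruit_rank = {}
for year in range(2000, 2012):
    order = teams[:]
    rng.shuffle(order)
    for rank, team in enumerate(order, start=1):
        recruit_rank[(team, year)] = rank

# By construction, win rank in year y copies recruiting rank from y - 2.
win_rank = {(t, y): recruit_rank[(t, y - 2)]
            for t in teams for y in range(2002, 2012)}

# Lags from five years before the season down to three years after it.
r2_by_lag = {lag: lagged_r_squared(win_rank, recruit_rank, lag)
             for lag in range(5, -4, -1)}
```

Pooling all the seasons together for each lag is what filters out the year-to-year noise; a single pair of years only gives you ten dots.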
Let's look at some results
The figure below shows the R-square values when we correlate recruiting in years leading up to and following the wins in a given year. How well does the redshirt-senior class explain wins? The true freshman class? How well do wins explain the following recruiting class? The one in two years? Correlations that are statistically significant are shown in blue, those that aren’t are shown in grey.
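A word on "statistically significant." One simple way to check whether a correlation could just be luck is a permutation test: shuffle one of the two lists many times and see how often the shuffled correlation is at least as strong as the real one. To be clear, this is a swapped-in technique for illustration and not necessarily the test behind the figure. The data is made up, and the names and the 2000-shuffle default are my own choices.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Two-sided permutation p-value for the correlation between xs and ys."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        # Count shuffles whose correlation is at least as strong as observed.
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p of exactly 0.
    return (hits + 1) / (n_shuffles + 1)


# Made-up, strongly related ranks for ten teams; not the article's data.
recruit_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
win_rank = [2, 1, 3, 5, 4, 6, 8, 7, 10, 9]

p = permutation_p_value(recruit_rank, win_rank)
significant = p < 0.05  # the usual, somewhat arbitrary, cutoff
```

A small p means shuffled data almost never looks as correlated as the real data, which is roughly what a blue bar is claiming.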
The figure has several interesting features. First, the correlation between winning and recruiting increases as you get closer to the year in question. This means that if you were asked to predict how a team was going to finish in the conference, you’d be better off knowing about its true freshman recruiting class than any other one. I’m not sure the people who think recruiting drives winning would have guessed that. Second, the correlations are larger on the downhill side -- for subsequent recruiting classes. So it looks like a reasonable case can be made that winning is more a cause of good recruiting than an effect of it.
We can improve on this a little bit, though. We’re not just interested in how a single recruiting class relates to winning, we’re interested in how recruiting in all the years leading up to a given season relates to the success of that season. Your first thought might be that we can add up the contributions of each recruiting class and that the sum will tell us how well we can predict the win ranking. So 4% for seniors, 4% for juniors, 6% for sophomores, and 7% for freshmen equals 21% total "explanatory power." Right?
Not quite. We would expect that recruiting classes are correlated among themselves. All things being equal, a team that recruits well one year is likely to recruit well next year, too. So that means that some of the correlation between, say, your junior class and your win rank is actually due to your sophomore and senior classes. To handle this we use a tool called multiple regression, which calculates the correlation for each class at the same time it controls for all the other classes’ contributions. Make sense? Me neither.
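If that didn't make sense, a toy example may help. Below is a minimal sketch of ordinary least-squares multiple regression in plain Python, which is one standard way to do this; I'm not claiming it matches whatever tool produced the graphs. The data is invented so that y depends only on x1, while x2 is merely correlated with x1. The function names solve and ols are my own.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(rows, y):
    """Fit y ~ intercept + predictors; return (coefficients, R-square)."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1 - ss_res / ss_tot


# Invented data: y = 1 + 2 * x1 exactly; x2 tracks x1 but adds nothing.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3, 5, 7, 9, 11, 13]

beta, r_squared = ols(list(zip(x1, x2)), y)
# beta is [intercept, coefficient on x1, coefficient on x2]
```

On its own, x2 correlates nicely with y. Once x1 is in the model, the coefficient on x2 drops to essentially zero, which is the same thing that happens to a recruiting class whose apparent effect was really borrowed from its neighbors.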
When we do the multiple correlation between win rank and the recruiting ranks for each of the classes leading up, we get the graph below.
This is pretty interesting. The statistical significance of the RS senior, senior, and junior classes disappeared. We’re pretty much only interested in the sophomores and freshmen. And mostly the freshmen. Does that make any sense if you believe recruiting causes winning? Maybe you could argue that a great freshman class can make you turn the corner or something, but I think that’s stretching it. You’re just making up a story so you can hold onto your cherished world-view. I think there are other explanations that make more sense.
(Note: the R-square value for redshirt seniors is higher than for seniors and juniors. However, the correlation is actually negative, though not statistically significant. I think we should just interpret this to mean that when you consider all the recruiting classes on the roster, redshirt seniors are not important for predicting the outcome of your season.)
What about the other direction? Let’s look at the multiple regression for subsequent recruiting rankings. Pow!
This strikes me as intuitive. The class right after a good (or bad) year is not much affected. The next two are. Notice that the total R-square is 20%, exactly the same as for the multiple regression when we looked at recruiting leading up to the season. That means that if you asked me to predict where the Dawgs will finish next year, I’d be just as happy to know about recruiting in 2016-2019 as I would 2012-2015. But that number is not huge. We can’t say that recruiting explains winning, but we can’t exactly say that winning explains recruiting, either. In fact 80% of winning is not explained by its relationship to recruiting. Don James, call your office.
(This is the point where I wished I’d used raw recruiting scores instead of rankings. I also wish I’d used a more rigorous measure of "goodness" – probably Sagarin’s ELO-Chess method. My guess is that I gave up a lot of precision by converting to rankings.)
One last multiple correlation. Let’s look at how well the recruiting just before and just after a season correlates with win ranking. In other words, when we consider a general climate of good recruiting, how does the team do?
Well, now. This is pretty interesting. It says that although the classes leading up to a season were good predictors of wins when considered on their own (the first graph), they ceased to be good predictors when considered along with the classes subsequent to that year. The effect of the freshman and sophomore classes is completely washed out when they are considered alongside classes following the season in question. Also, our R-square value went up a bit, from 20% to 26%. So we can conclude that knowing recruiting classes a couple years down the line is the best predictor we have looked at of how good a team will be.
The Deadspin data
Brad linked to an article on Deadspin comparing recruiting and overall ranking for all NCAA teams. This is a better dataset in one way (more teams), but a worse one in another (fewer seasons). How do their results compare?
First, Deadspin correlated the average recruiting rankings for 2009-2013 with the BCS rank at the end of the 2013 season. They report an R-square value of 77%, but I think that’s a mistake. When I recreate their analysis I get a correlation coefficient of 0.77, which gives an R-square of 59%. Still a lot better than the results I was getting above. What happens if we do a similar averaging thing with our Pac-10 data?
The R-square value between the Pac-10 win ranking and the average recruiting ranking over the preceding four years is only about 13%. This is a lot worse than the Deadspin result. I’m not sure why – it could be that their larger dataset reveals some things mine doesn’t, it could be that there’s a lot of randomness in the Pac-10, or it could be that one or both of our results is anomalous. Very hard to say.
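For the curious, the averaging step could look something like the sketch below. It assumes ranks stored in a dictionary keyed by (team, year), which is a hypothetical layout, and it reads "preceding four years" as the four classes before the season in question, not counting that season's own class. That reading is an assumption. The ranks are made up.

```python
def trailing_mean_rank(recruit_rank, team, year, window=4):
    """Average recruiting rank over the `window` years before `year`.

    Uses years year - window through year - 1 (an assumed reading of
    "preceding"). Returns None when any of those years is missing.
    """
    years = range(year - window, year)
    values = [recruit_rank[(team, y)] for y in years
              if (team, y) in recruit_rank]
    if len(values) < window:
        return None
    return sum(values) / window


# Made-up recruiting ranks for one team, for illustration only.
recruit_rank = {
    ("Team A", 2010): 1,
    ("Team A", 2011): 3,
    ("Team A", 2012): 2,
    ("Team A", 2013): 6,
}

avg_2014 = trailing_mean_rank(recruit_rank, "Team A", 2014)  # uses 2010-2013
avg_2013 = trailing_mean_rank(recruit_rank, "Team A", 2013)  # 2009 is missing
```

Each (team, season) pair with a full four-year history then contributes one dot: trailing average recruiting rank on one axis, win rank on the other.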
What about looking at the Deadspin data the other way? In other words, what if we compare this year’s recruiting with past years’ average BCS ranking? When we do this, we get almost exactly the same R-square value: 60%. In other words, Deadspin’s own data offers support for all three of our theories: recruiting causes winning, winning causes recruiting, or something else causes both. That’s the same result we got with the Pac-10 data. Here’s a graph similar to the first one we generated for the Pac-10 data, just for comparison. It shows much better correlation than we had (using more teams seems to be better), and we still see a general upward trend (i.e., this year’s wins are a better predictor of future recruiting than previous years’ recruiting was a predictor of this year’s wins), but it still doesn’t tell a clear story.
And I would say that’s the basic takeaway from our look at the Deadspin data: although it suggests a bigger dataset might be helpful, it doesn’t change our basic observations much.
Long term averages
There is one more bit of evidence that seems to support the theory that "something else" predicts both recruiting and winning. If we compare the average recruiting ranking and the average win ranking for all Pac-10 teams over the whole period 1996-2015, the R-square value is 56%. This is consistent with the Deadspin result comparing average 2009-2013 recruiting rankings for all teams to wins in 2013, and it’s also consistent with the fact that both the Pac-10 data and the Deadspin data show correlations both ways – past recruiting correlates with present winning, and past winning correlates with present recruiting. So over the long haul good recruiting and winning go together better than the specific yearly winning and recruiting for a given season.
Sort of. It seems to happen a lot when I try this kind of thing that the conclusion is some version of, "You can't tell." This time is no exception. I would say the evidence seems to point most strongly to our third hypothesis: something else causes both winning and recruiting. Good teams win and good teams recruit well. But I wouldn't call this a slam-dunk.
We can make one strong statement: people who are just sure recruiting leads to winning should tone down their confidence a bit. It’s not clear that’s true, no matter how intuitive it may seem. We should start being more concerned about player development and execution (coaching) or maybe stadiums and media coverage ("other") or whatever that dreaded Third Thing is. I really have no idea.