This Saturday is the NFL draft, which means that NFL scouts have spent the past months poring over 40-yard dash times and college game tape, and fans have debated which prospect would be the best fit for their team. It also means it's time for media and fans to recycle the usual punchlines about the folly of giving an aptitude test like the Wonderlic to NFL prospects. Football, more than any other American team sport, is about physicality, and the idea that performance on an aptitude test could have much to do with success on the football field seems absurd. Skeptics point out that a low Wonderlic score didn't prevent Dan Marino from becoming one of the most prolific passers in history, or Vince Young from making the Pro Bowl in his rookie year. When Criteria works with customers to gather evidence for the validity of our employment tests at their organization, we sometimes hear similar anecdotes. (Criteria offers an aptitude test, the CCAT, that is similar to the Wonderlic.) I've often heard HR managers express concern that "one of our best performers did poorly on the test." Such reactions are understandable, but a test's predictive validity can't be judged from a single score: the only meaningful way to measure a test's ability to predict productivity is to study the correlation between test scores and job performance across a broad sample of people. By this standard, the Wonderlic may be a better predictor of performance in the NFL than you might think.
Two business professors from the University of Louisville recently did just such a study with NFL data. They correlated test scores with performance measures, concluded that there was no association between the two, and were critical of the selection measures the NFL uses. But if there really is no association, why is the Wonderlic given to NFL prospects at all?
This is the kind of study we often conduct for our clients, but we also point out that you have to be careful when evaluating how well a selection measure predicts performance: the success criteria must be chosen appropriately, and so must the sample. I have concerns on exactly these counts with the Louisville study.
As a performance measure, the authors use average salary over a player's first three years as one of their "success metrics," but any football fan knows that a player's salary in his first years in the league is a function of draft order, not performance, since he hasn't played a single game when he signs his contract. The authors also use draft order itself as a "success measure." Both draft order and early-career salary are meaningful measures of success only from the point of view of the player: they reflect the collective wisdom about a player's future prospects. To owners and fans, on-field performance after entering the NFL is a far more meaningful measure of productivity.
The second problem is that the authors include every player in the performance evaluation, even those who never had a chance to perform. They found data on 68 quarterbacks drafted between 1999 and 2004 and included them all in the analysis comparing test scores to "success." The problem is that many of these QBs saw little or no action in the NFL. What does it mean to assess the performance of players who never got to perform?
We tried a similar study ourselves, using data from NFL.com and other websites on QBs drafted between 2000 and 2004. (We didn't use data from before 2000 because players' scores from those years are unreliable and incomplete.) The simplest way to measure the predictive validity of an employment test is to compare test scores to one or more metrics used to measure productivity in a given job. We chose QBs because the position requires the decision-making and problem-solving skills that aptitude tests are supposed to measure, and as productivity metrics we chose passing yards and TDs thrown over a player's first four years (four years is the average length of an NFL career). Passing yards and TDs aren't perfect metrics (did you know Joey Harrington threw for more yards in his first four years than Tom Brady, who didn't start until his second year?). You can check out the data we used below: 68 QBs were drafted from 2000 to 2004, but we eliminated the 5 for whom we couldn't find Wonderlic scores, as well as two others who ended up playing another position (Ronald Curry) or another sport (Drew Henson).
The data are very interesting. Across all 61 QBs there is only a fairly weak correlation between aptitude and passing yards (r=.19) or TDs thrown (r=.20). But when we plotted test scores (Figure 1, x-axis) against passing yards (y-axis), we saw that the story was more complicated than that. There does appear to be a strong association between test score and performance (yards thrown); you just don't see it until you look at QBs who threw for 1,000 or more yards (which is where we drew the horizontal line).
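The coefficients above come from a plain Pearson correlation between test scores and yardage. As a minimal sketch of the computation (with invented numbers, not the actual study data), it looks like this:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented example: Wonderlic scores and four-year passing yards.
scores = [19, 24, 27, 30, 36]
yards = [3500, 5200, 4100, 6800, 6100]
r = pearson_r(scores, yards)
```

The same number falls out of any stats package, but writing it out makes clear that a single extreme pair of values feeds directly into the covariance term, which matters for the small-sample caveats later.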
A performance measure can have multiple meanings. Some QBs don't throw for many yards because they barely get on the field, and that can happen for many reasons: they might not be good enough, but they might also have been drafted by a team with an established starter, or gotten injured, and so on. Below the 1,000-yard mark the data are spread out across the whole score spectrum, with no correlation at all. Above the line, however, the correlation is a whopping r=.51 (r=.49 for TDs thrown), right up there with some of the strongest coefficients reported anywhere in organizational psychology.
Another way to see the strength of this relationship: in our sample, QBs who scored below the median Wonderlic score for QBs (27) averaged 5,202 passing yards and 31.2 TDs over their first four years, whereas those scoring above the median averaged 6,570 yards and 40.8 TDs over the same period. It seems the cognitive measure might be worth something after all!
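A median split like this is straightforward to compute. A sketch, again with invented records rather than the real data:

```python
from statistics import mean, median

# Invented (wonderlic, four_year_yards) records, for illustration only.
qbs = [(21, 4100), (24, 4900), (26, 5400), (28, 6000), (31, 6900), (35, 7500)]

cut = median(s for s, _ in qbs)           # median Wonderlic score
below = [y for s, y in qbs if s < cut]    # yards for below-median scorers
above = [y for s, y in qbs if s > cut]    # yards for above-median scorers
avg_below = mean(below)
avg_above = mean(above)
```

Comparing group means is coarser than a correlation, but it is often easier to explain to a non-statistical audience.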
Of course, you might ask where the cut-off should be. How did we pick 1,000 yards? We ran the analysis again with QBs who had started more than 5 games, and the same pattern replicated, with one caveat. Craig Krenzel, who studied molecular genetics at Ohio State and had a less-than-stellar stint with the Bears, scored very high on his aptitude test, yet threw for only about 800 yards in the pros and started just 5 games. If you change the threshold to include him, the overall predictive validity of the aptitude test drops a little. That's the problem with small sample sizes: a single anecdote can affect the statistics in a non-trivial way.
All in all, we think the data linking aptitude test scores to NFL performance are much more interesting than is currently recognized. For QBs drafted between 2000 and 2004, the data suggest a real link between aptitude test scores and on-field performance. We went through this exercise because it illustrates lessons we often share with our customers: think carefully about your measure of performance; recognize that there can be many reasons for good or bad performance that are unrelated to the test; plot your data so you can see what's going on; and beware of drawing inferences from small samples.