Posted by Josh Millet on Thu, Dec 18, 2008 @ 12:24 PM
In my last post I compared a speech given by Malcolm Gladwell in the spring to the content of his new book Outliers, and wondered what had happened to the employee selection angle he had promised in the speech. Well, no sooner did my post go live than my New Yorker magazine showed up in my mailbox with the answer — this week's cover story is an article by Gladwell entitled "Most Likely to Succeed: How do we hire when we don't know who's right for the job?"
In the article the author describes the problems inherent in evaluating talent and predicting job performance, and cites three examples of jobs where he sees this as a problem: pro football quarterbacks, teachers, and financial analysts. I'm going to focus here just on the issue of predicting success for NFL quarterbacks.
Gladwell describes the challenges faced by NFL scouts who evaluate college quarterbacks, and relates the examples of some prominent "can't miss" prospects who became NFL busts. Gladwell is at his most comfortable spinning an anecdote about a single subject, and he structures this article around the story of Chase Daniel of Missouri. But somehow in trying to tell the story of how difficult it is for NFL teams to decide who to draft, Gladwell makes the ludicrous statement that the entire NFL selection process is fraught with error. He concludes that "there are certain jobs where almost nothing you can learn about candidates before they start predicts how they'll do once they're hired."
In fact, when one looks at the NFL's record of predicting quarterback success, Gladwell's conclusion is on very shaky ground. The collective opinion of an NFL player's prospects is reflected in the order in which players are drafted by NFL teams. It turns out that draft order is a very accurate predictor of subsequent statistical performance for quarterbacks.
To take the most recent decade as an example, when one looks at all the quarterbacks (67 in all) who were drafted by NFL teams from 2000 to 2004, and compares their overall draft position to their statistics in their first four years in the league, it is clear that on balance NFL teams are very accurate in predicting statistical success in the NFL. Organizational psychologists measure the predictive validity of an employee selection technique by quantifying the strength of the relationship between selection measure and job performance; the strength of the association is expressed as a correlation coefficient. For the whole group, the correlation between draft order and passing yardage is very strong (-.73; the coefficient is negative because the higher a player is drafted, the lower their draft rank). For those concerned that a measure of total productivity such as passing yardage is somewhat correlated with opportunity, we can consider passer efficiency, as measured by QB rating. Only 51 of the 67 quarterbacks drafted attempted a pass in the NFL, a necessary requirement for calculating a QB rating; for this group there was a -.34 correlation between draft position and QB rating. This is still a strong association, and shows a clear, statistically significant correlation between draft order and future statistical success in the NFL.
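For readers who want to see what a correlation coefficient actually computes, here is a minimal sketch in Python. The function name `pearson_r` and the numbers below are made up purely for illustration; they are not the draft data discussed in this post, and this is not necessarily how the figures above were produced.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative numbers, NOT real draft or passing data.
draft_pick = [1, 12, 33, 65, 110, 199]
passing_yards = [9000, 7000, 7500, 2000, 500, 3000]

r = pearson_r(draft_pick, passing_yards)
# A negative r means earlier picks (smaller pick numbers) tend to go with more yards.
print(f"r = {r:.2f}")
```

The sign convention is the same as in the paragraph above: because early picks have small pick numbers, a draft that predicts well produces a negative coefficient.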
Like the fans of the teams that drafted them, Gladwell has let the Ryan Leafs (a high draft choice who flopped) and the Tom Bradys (a low draft choice who became a superstar) of the world influence his thinking. These are outliers, a concept with which Gladwell should be familiar given the title of his latest book. (If you take Brady out of the mix the correlations strengthen considerably!) It turns out, in fact, that on average the NFL draft process is highly accurate at predicting QB success, and the draft is based entirely on things that Gladwell dismisses as useless: college performance, scouting, and performance at the NFL combine.
If Gladwell had considered any quantitative measures at all relating to the efficacy of the draft he'd have no basis for his conclusion that "a prediction, in a field where prediction is not possible, is nothing but a prejudice." Gladwell, we fear, gets swept up in his own storytelling, and in the process badly misconstrues the alleged "quarterback problem."
Gladwell's approach reminds me of a challenge we face every day in discussing our pre-employment testing services with customers and prospective clients. When evaluating a selection instrument, there's a strong tendency to take an anecdotal approach and fixate on the outlier for whom the selection instrument was not an accurate predictor. For example, when a sales manager administers one of our tests to a few dozen existing employees, we'll often hear about the one high performer who didn't fare well on the test. Anecdotes are powerful, and it is sometimes difficult to persuade the manager to focus on how well the assessment predicts performance across the whole group. No selection measure is perfect, and we must be careful when evaluating the efficacy of our selection processes not to follow Gladwell's lead and let anecdotal evidence trump more rigorous analysis.
Posted by Josh Millet on Thu, Apr 24, 2008 @ 11:49 AM
This Saturday is the NFL draft, which means that NFL scouts have spent the past months going over 40-yard dash times and college game tapes, and fans have debated which prospect would be the best fit for their team. It also means it's time for media and fans to recycle the usual punchlines about the folly of using an aptitude test like the Wonderlic on NFL prospects. Football, more than any other American team sport, is about physicality, and the idea that performance on an aptitude test could have much to do with success on the football field seems absurd. Skeptics point out that a low Wonderlic score didn't prevent Dan Marino from becoming one of the most prolific passers in history, or Vince Young from making the Pro Bowl in his rookie year. When Criteria works with customers to gather evidence for the validity of our employment tests at their organization, we sometimes hear similar anecdotes. I've often heard HR managers express concern that "one of our best performers did poorly on the test." (Criteria has an aptitude test, the CCAT, that is similar to the Wonderlic.) Such reactions are understandable, but a test's predictive validity can't be judged from one test score--the only meaningful way to measure a test's ability to predict productivity is to study the correlations between test scores and job performance across a broad sample of people. Based on this standard, the Wonderlic may be a better predictor of performance in the NFL than you might think.
Two business professors from the University of Louisville recently did such a study with NFL data. They correlated test scores with performance measures and concluded that there was no association between test scores and performance in the NFL. If there is no association between the two, why is the Wonderlic used on NFL prospects? The study was critical of the selection measures used by the NFL.
This is the kind of study we often conduct for our clients, BUT we also point out that you have to be careful when evaluating how well a selection measure predicts performance. Success criteria must be chosen appropriately, and the sample has to be appropriate. I have concerns with exactly these issues in the Louisville study.
As a performance measure, the authors use average salary in a player's first three years as one of the "success metrics," but any football fan knows that a player's salary in his first years in the league is a function of draft order, not performance in the league, since he hasn't played any games when he signs a contract. The authors also use draft order as a "success measure." Both draft order and first-year salary are meaningful measures of a player's success only from the player's point of view--they reflect the collective wisdom about a player's future prospects. To owners and fans, on-field performance after entering the NFL is a much more meaningful measure of productivity.
The second problem with the study is that the authors include everyone in the performance evaluation, even if they never had a chance to perform. They found data on 68 quarterbacks drafted between 1999 and 2004, and included them all in the analysis comparing test scores to "success." The problem is that many of these QBs saw little or no action in the NFL. So what does it mean to assess their performance when they didn't get to perform?
We tried a similar study, using NFL.com and other websites to find data on QBs drafted between 2000 and 2004. (We didn't use data from before 2000 because the data on players' scores is unreliable and incomplete.) The simplest way to measure the predictive validity of an employment test is to compare test scores to one or more metrics used to measure productivity in a given job. We chose QBs because that position requires the decision-making and problem-solving skills that aptitude tests are supposed to measure, and as productivity metrics we chose yards passing and number of TDs thrown in the first four years (four years is the average length of an NFL player's career). Passing yards and TDs thrown aren't perfect metrics (did you know Joey Harrington threw for more yards in his first four years than did Tom Brady, who didn't start until his second year?). You can check out the data we used here: there were 68 QBs drafted from 2000 to 2004, but we eliminated the 5 QBs for whom we couldn't find Wonderlic scores, as well as two others who ended up playing other positions (Ronald Curry) or other sports (Drew Henson).
The data is very interesting. If you look at the data for all 61 QBs, there is only a fairly weak correlation between aptitude and passing yards (r=.19) and TDs thrown (r=.20). But we made a plot of the test scores (Figure 1, x-axis) and the passing yards (y-axis) and saw that the story was much more complicated than that. As it turns out, there does appear to be a strong association between test score and performance (yds thrown)--you just don't see it until you look at QBs who threw for 1000 or more yards (which is where we put the horizontal line).
A performance measure can have multiple meanings. Some QBs don't throw for many yards because they barely get on the field, and this can happen for many reasons; they might not be good enough, but they also could be drafted to a team with a good starter in place, or get injured, etc. Below the 1000 yards passing mark, the data are all spread out across the score spectrum--there is no correlation there. Above the line, however, the correlation is a whopping r=.51 (r=.49 for TDs thrown), right up there with some of the strongest coefficients reported anywhere in organizational psychology.
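The "above the line" comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `pearson_r` and `r_above_cutoff` are ours, the cutoff is applied as "1000 or more yards," and the numbers are made up rather than taken from the QB data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

def r_above_cutoff(scores, yards, min_yards):
    """Correlate test score with yards, keeping only players at or above min_yards."""
    kept = [(s, y) for s, y in zip(scores, yards) if y >= min_yards]
    if len(kept) < 3:
        raise ValueError("too few players above the cutoff to say anything useful")
    kept_scores = [s for s, _ in kept]
    kept_yards = [y for _, y in kept]
    return pearson_r(kept_scores, kept_yards)

# Made-up illustrative numbers, NOT the real Wonderlic or passing data.
scores = [15, 38, 22, 30, 20, 26, 31, 35]
yards = [200, 150, 600, 90, 2500, 4800, 5200, 9000]

print(f"all players:      r = {pearson_r(scores, yards):.2f}")
print(f"1000+ yards only: r = {r_above_cutoff(scores, yards, 1000):.2f}")
```

In the toy numbers, the players who barely played are scattered across the score range and drag the overall coefficient down, which is the same pattern described above.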
Another way to look at the strength of these correlations is that for this sample, the QBs who scored below the median Wonderlic score (for QBs) of 27 averaged 5,202 passing yards and 31.2 TDs over their first four years, whereas those scoring above the median averaged 6,570 yards and 40.8 TDs over the same period. Seems like the cognitive measure might be worth something after all!
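A median split like this is easy to reproduce. Here is a rough sketch; the function name `median_split_means` is hypothetical, the numbers are invented, and one detail is our assumption rather than something stated above: players who score exactly at the median are left out of both groups.

```python
from statistics import mean, median

def median_split_means(scores, outcomes):
    """Average outcome for players scoring below vs. above the median score.

    Assumption: players exactly at the median are excluded from both groups.
    """
    cut = median(scores)
    low = [o for s, o in zip(scores, outcomes) if s < cut]
    high = [o for s, o in zip(scores, outcomes) if s > cut]
    # mean() raises StatisticsError if a group is empty.
    return cut, mean(low), mean(high)

# Made-up illustrative numbers, NOT the real Wonderlic or passing data.
scores = [20, 25, 27, 30, 35]
yards = [1000, 2000, 9999, 4000, 6000]

cut, low_avg, high_avg = median_split_means(scores, yards)
print(f"median score {cut}: below averages {low_avg}, above averages {high_avg}")
```

A median split throws away information compared with a correlation, but the two group averages are often easier to explain to a hiring manager.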

Of course, you might ask where should the cut-off be? How did we pick 1,000 yards? We tried it again with QBs who had started more than 5 games, and the same pattern replicates, but there is a bit of a caveat. Craig Krenzel, who studied molecular genetics at Ohio State and had a less-than-stellar stint with the Bears, scored very high on his aptitude test. He threw for about 800 yards in the pros, and started 5 games. If you change the thresholds and include him, then the overall predictive validity of the aptitude test goes down a little--that's the problem with small sample sizes--the anecdotes can actually affect the statistics in a non-trivial way.
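One way to see how much a single player can move the result in a small sample is to recompute the correlation with each player left out in turn. The sketch below is only an illustration: the helper names are ours, the numbers are made up, and the last "player" is a deliberately extreme stand-in for a high scorer with low yardage, not Krenzel's actual figures.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

def leave_one_out_r(xs, ys):
    """Recompute r once per observation, each time with that observation dropped."""
    xs, ys = list(xs), list(ys)
    return [
        pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]

# Made-up illustrative numbers; the last point is a high scorer with low yardage.
scores = [20, 24, 28, 32, 38]
yards = [2000, 3000, 4000, 5000, 800]

print(f"everyone included: r = {pearson_r(scores, yards):.2f}")
for i, r in enumerate(leave_one_out_r(scores, yards)):
    print(f"without player {i}: r = {r:.2f}")
```

If dropping one player changes the coefficient a lot, that is a sign the sample is too small for the exact value of r to be taken at face value.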
All in all we think the data linking aptitude test scores with NFL performance is much more interesting than is currently recognized. In fact, for QBs drafted between 2000 and 2004 the data suggest there is a definite link between aptitude test scores and on-field performance. And we went through this exercise because it illustrates a lot of lessons we try to share with our customers. Think carefully about the measure of performance; make sure you recognize that there can be many reasons for good or bad performance that are unrelated to the test; plot your data so that you can visualize what's going on; and beware of making inferences in small samples.