Criteria’s Employee Testing Blog


The Obama Effect?


Last week the New York Times published an article on a possible Obama effect on test scores of black test takers. It was unusual for a major newspaper to publish a story on a social science study before that study has been published, let alone reviewed. But when you hear that so-and-so reported their results at some national conference, that isn't really peer reviewed either. The conference organizers have often only seen a 200-word description of what the researchers thought they would present. So although unusual, it's not entirely out of line to try to get the jump on a story like this, and the Times did circulate the study to some academics to get professional opinions.

Let me say at the outset that I hope the central result is true. The authors claim that they gave a short academic aptitude type test to black and white test-takers. When they administered the test last summer, they noted a difference between average scores for blacks and whites. However, after (now) President Obama had received his party's nomination and given his acceptance speech, the difference in scores disappeared. The theory is that Obama's rise has had a positive motivating influence on test taking performance.

The story has legs because there is a well-documented body of research on test performance, and how it can be affected by contextual cues. You can start with the cultural beliefs about aptitude tests in general. If there is a belief among one target group that the tests always show underperformance, then that belief can have a self-fulfilling aspect. Researchers have experimentally manipulated that contextual cue by describing tests differently to participants before they take them. Researchers have also manipulated the race and gender of the test administrator and done a variety of clever tricks to see to what extent performance can be affected by context. One enterprising team actually had women of Asian heritage take a math test, randomly dividing them into one group who answered a questionnaire designed to get them to think of their female identity, while the other group answered questions about their Asian identity. Guess what? One group underperformed relative to the other, and because the study was conducted as a randomized experiment, the authors are allowed to infer that their contextual manipulation caused the differences in performance.

So I'm sympathetic to the study described in the Times, and I fully appreciate the research tradition it comes from. That said, there are a couple of warning flags about the study. First, it is unclear from the Times piece whether there was any reference at all to Obama before the participants took the test. If not, then the story must be that if there was a difference in performance over time it was because Obama was "in the air". That's true enough – he certainly was in the air. The country was electrified. But most studies on test taking performance try to make the contextual cue more closely connected to the test taking event. Lots of things happened from last summer to now...millions of jobs were lost, the stock market tanked, Tom Brady was injured, and the seasons changed.

But the more worrisome concern is the quality of the data. Based on the Times article, it seems like there were four tests, and at each occasion there were maybe 20 black participants. Furthermore, the age range of the participants was around 50 years. I don't want to make your eyes swirl with statistical mumbo jumbo...but let me throw out these two points. The degree of sampling variability from occasion to occasion would be huge. Would you trust the results of an opinion poll that gathered a group of 20 participants? So why trust the results of a test taken by 20 people? It's all the more problematic that the researchers are trying to prove a lack of difference. With such a small sample size, and such wide variability of participants in age and occupation, it becomes very difficult to prove that a difference exists. But as I have to remind my PhD students every day: failing to prove that there is a difference is not the same thing as proving that there is no difference. Their eyes swirl at me too.
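To put a rough number on that sampling variability, here is a minimal sketch; it is not the study's own analysis. It assumes scores standardized to a standard deviation of 1, independent random samples, and a comparison group of the same size as the black group (an assumption, since the Times piece gives only the approximate number of black participants). The function names are illustrative.

```python
import math


def se_of_mean_difference(sd: float, n1: int, n2: int) -> float:
    """Standard error of the difference between two independent group means,
    assuming both groups share the same standard deviation `sd`."""
    return sd * math.sqrt(1.0 / n1 + 1.0 / n2)


def margin_95(sd: float, n1: int, n2: int) -> float:
    """Approximate 95% margin of error for the observed difference
    (normal approximation, 1.96 standard errors)."""
    return 1.96 * se_of_mean_difference(sd, n1, n2)


# Hypothetical group sizes; only the 20 comes from the article's description.
for n in (20, 200, 2000):
    print(f"n per group = {n:5d}: margin = +/- {margin_95(1.0, n, n):.3f} SD")
```

Under these assumptions, with 20 people per group the observed gap can land roughly 0.62 standard deviations away from the true gap by chance alone, which is why a gap that appears on one occasion and vanishes on the next is weak evidence of any real change.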

Come to think of it, it makes you wonder why everyone is looking at the data in this particular way. The story is that on the one testing occasion before Obama's meteoric rise, there was a black-white difference, and then it disappeared over the next three testing occasions. The implicit reasoning is that something has happened. But why privilege the summer result so much? Why not ask "What was happening last summer that made a black-white difference show up?" Why assume that that result is somehow "true" and that it has recently "disappeared"?

At any rate, more data is already in hand. There have been several administrations of the SAT during the election run, and even one since President Obama's inauguration. Let's take a look at the national trend based on millions of scores. I'd be very happy if there is something to write about. I personally expect that there will be something to write about over time, but I also believe that the evidence is going to take some time to develop. Let's hope the New York Times is still paying attention then, and not just trying to front-run another study that has barely been mailed out for review.

Gladwell's New Yorker Article on Hiring


In my last post I compared a speech given by Malcolm Gladwell in the spring to the content of his new book Outliers, and wondered what had happened to the employee selection angle he had promised in the speech. Well, no sooner did my post go live than my New Yorker magazine showed up in my mailbox with the answer — this week's cover story is an article by Gladwell entitled "Most Likely to Succeed: How do we hire when we don't know who's right for the job?"

In the article the author describes the problems inherent in evaluating talent and predicting job performance, and cites three examples of jobs where he sees this as a problem: pro football quarterbacks, teachers, and financial analysts. I'm going to focus here just on the issue of predicting success for NFL quarterbacks.

Gladwell describes the challenges faced by NFL scouts who evaluate college quarterbacks, and relates the examples of some prominent "can't miss" prospects who became NFL busts. Gladwell is at his most comfortable spinning an anecdote about a single subject, and he structures this article around the story of Chase Daniel of Missouri. But somehow in trying to tell the story of how difficult it is for NFL teams to decide who to draft, Gladwell makes the ludicrous statement that the entire NFL selection process is fraught with error. He concludes that "there are certain jobs where almost nothing you can learn about candidates before they start predicts how they'll do once they're hired."

In fact, when one looks at the NFL's record of predicting quarterback success, Gladwell's conclusion is on very shaky ground. The collective opinion of an NFL player's prospects is reflected in the order in which players are drafted by NFL teams. It turns out that draft order is a very accurate predictor of subsequent statistical performance for quarterbacks.

To take the most recent decade as an example, when one looks at all the quarterbacks (67 in all) who were drafted by NFL teams from 2000 to 2004, and compares their overall draft position to their statistics in their first four years in the league, it is clear that on balance NFL teams are very accurate in predicting statistical success in the NFL. Organizational psychologists measure the predictive validity of an employee selection technique by quantifying the strength of the relationship between selection measure and job performance; the strength of the association is expressed as a correlation coefficient. For the whole group, the correlation between draft order and passing yardage is very strong (-.73; the coefficient is negative because the higher a player is drafted, the lower their draft rank). For those concerned that a measure of total productivity such as passing yardage is somewhat correlated with opportunity, we can consider passer efficiency, as measured by QB rating. Only 51 of the 67 quarterbacks drafted attempted a pass in the NFL, a necessary requirement for calculating a QB rating. For this group there was a -.34 correlation between draft position and QB rating. This is still a strong association, and shows a clear, statistically significant correlation between draft order and future statistical success in the NFL.
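For readers who want to see what sits behind those coefficients, the sketch below computes a Pearson correlation and the usual t-statistic for testing whether a correlation differs from zero. The draft picks and yardage figures in it are invented for illustration and are not the data analyzed in this post; only the reported values (r = -.73 with n = 67, and r = -.34 with n = 51) come from the text above. Function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t-statistic for the null hypothesis of zero correlation (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Hypothetical quarterbacks: overall draft pick and four-year passing yards.
draft_pick = [1, 3, 18, 33, 65, 110, 199, 250]
passing_yards = [11000, 9500, 7000, 4000, 2500, 3000, 800, 0]
print("illustrative r:", round(pearson(draft_pick, passing_yards), 2))

# Correlations reported in the post, with their sample sizes.
for r, n in ((-0.73, 67), (-0.34, 51)):
    print(f"r = {r}, n = {n}: t = {t_statistic(r, n):.2f}")
```

With samples of this size, an absolute t-value above roughly 2 is significant at the conventional 5% level, so both reported correlations clear that bar. The illustrative correlation comes out negative for the reason given above: better prospects carry lower draft numbers.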

Like the fans of the teams that drafted them, Gladwell has let the Ryan Leafs (a high draft choice who flopped) and the Tom Bradys (a low draft choice who became a superstar) of the world influence his thinking. These are outliers, a concept with which Gladwell should be familiar given the title of his latest book. (If you take Brady out of the mix the correlations strengthen considerably!) It turns out, in fact, that on average the NFL draft process is highly accurate at predicting QB success, and the draft is based entirely on things that Gladwell dismisses as useless: college performance, scouting, and performance in the NFL combine.

If Gladwell had considered any quantitative measures at all relating to the efficacy of the draft he'd have no basis for his conclusion that "a prediction, in a field where prediction is not possible, is nothing but a prejudice." Gladwell, we fear, gets swept up in his own storytelling, and in the process badly misconstrues the alleged "quarterback problem."

Gladwell's approach reminds me of a challenge we face every day in discussing our pre-employment testing services with customers and prospective clients. When evaluating a selection instrument, there's a strong tendency to take an anecdotal approach and fixate on the outlier for whom the selection instrument was not an accurate predictor. For example, when a sales manager administers one of our tests to a few dozen existing employees, we'll often hear about the one high performer who didn't fare well on the test. Anecdotes are powerful, and it is sometimes difficult to persuade the manager to focus on how well the assessment predicts performance across the whole group. No selection measure is perfect, and we must be careful when evaluating the efficacy of our selection processes not to follow Gladwell's lead and let anecdotal evidence trump more rigorous analysis. 

Malcolm Gladwell's Outliers


Recently on a plane the guy beside me was reading the same book I was – Malcolm Gladwell's Outliers. My fellow passenger didn't think this was remarkable as the airport bookstores had huge displays. Gladwell has become somewhat of a household name for his skill at popularizing social science through collecting compelling anecdotes. Blink and The Tipping Point were entertaining enough to read, and that's why the guy beside me had made an impulse purchase.

I had been more proactive in getting my hands on the book. We here at Criteria had actually been eagerly waiting for Gladwell's book ever since May, when we stumbled across a truly odd speech he gave at a New Yorker conference. Gladwell's speech, which was explicitly delivered as a sneak preview of the book, covered what he called the "mismatch" problem. His thesis was that the way employers evaluate prospective employees — including the practice of pre-employment testing — is at a complete "mismatch" with what is required. As evidence he offered three loosely linked examples – sports combines where amateur athletes are evaluated before a draft; certification requirements for teachers; and the University of Michigan's affirmative action program for law school admissions.

Gladwell opens with anecdotes from the National Hockey League's pre-draft combine, and then goes on to discuss the NBA and NFL drafts. He relates the finding that the aptitude test given to the NFL quarterbacks has no correlation with their performance. As usual, his evidence for this contention is entirely anecdotal. In a blog post in the spring we described evidence showing that the test may be predictive of QB success, and it doesn't help Gladwell's case that he mocks the fact that Eli Manning and Tony Romo scored well on the test, while Vince Young and David Garrard did not. Even back in May 2008 he should have understood that his alleged exceptions weren't exactly disproving the rule. By the end of the opening example, Gladwell has shown that he probably doesn't know all that much about sports, and has launched a puzzling, and difficult to support, argument. We were eager to see him make his point in print, but this line of thought didn't make it into the book.

In his speech, Gladwell next moved on to discussing teachers. We can't disagree with him that good teachers are important, or that it might be a good idea to broaden the pool from which new teachers are selected. But we were a bit confused by his argument about hiring standards. Teacher quality, he tells us, is much more predictive of student achievement than classroom size, and so it is worth investing in. But according to Gladwell, there are no criteria to predict whether someone will be a successful teacher. This is quite a jump, and again one is left with only colorful anecdotes for what is a very sweeping point with broad social significance.

Finally, Gladwell points out that even though the University of Michigan Law School had, in the past, given extra consideration to minority applicants and made concessions on testing standards, there was no evidence of differences in "success" years after graduation. Gladwell does not address that the admissions tests were not designed to predict success after graduation, but rather performance in the first year of law school. Nor does he address that there are statistical problems with evaluating a selection tool on a sample that was selected in the first place using that tool. But most important, Gladwell seems to be arguing that based on the experiences of the University of Michigan, law schools should abandon altogether their use of test scores to select applicants. 
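The statistical problem with judging a selection tool on people already selected by it is usually called range restriction, and a small simulation can show its effect. This is a sketch under invented assumptions, not an analysis of Michigan's data: test scores and later outcomes are simulated as standard normal variables with a true correlation of 0.5, and only the top 20% of scorers are "admitted". All names and parameter values are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, true_r=0.5, admit_fraction=0.2, seed=0):
    """Return (correlation among all applicants, correlation among admitted)."""
    rng = random.Random(seed)
    scores = [rng.gauss(0, 1) for _ in range(n)]
    noise_sd = math.sqrt(1 - true_r ** 2)
    outcomes = [true_r * s + noise_sd * rng.gauss(0, 1) for s in scores]
    # Admit only applicants at or above the cutoff score.
    cutoff = sorted(scores)[int(n * (1 - admit_fraction))]
    admitted = [(s, o) for s, o in zip(scores, outcomes) if s >= cutoff]
    r_all = pearson(scores, outcomes)
    r_admitted = pearson([s for s, _ in admitted], [o for _, o in admitted])
    return r_all, r_admitted


r_all, r_admitted = simulate()
print(f"all applicants: r = {r_all:.2f}; admitted only: r = {r_admitted:.2f}")
```

In this simulation the correlation among the admitted group is noticeably weaker than among all applicants, even though the test works equally well for everyone. A weak relationship observed only among people who were admitted therefore says little about how well the test predicts across the full applicant pool.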

Well, somewhere between his May speech and the November publication of Outliers Gladwell must have realized he had his own mismatch problem. His evidence didn't match his thesis. He must have changed the direction of his book significantly, because Outliers is barely relevant to employee selection. Instead, it's a motley collection of examples arguing that exceptionally successful people are not entirely self-made, and that their ascent is due also to extraordinary good fortune with regard to the opportunities they were presented with. (This is hardly a shocker of a thesis, and is right up there with Blink's bold contention that first impressions are often correct, except when they are not.) But where's the employee selection angle? Where his speech proclaimed that it was "time to shut down the combines", his book only discusses the mildly interesting fact that the birthdays of professional athletes tend to cluster near critical cut-off dates for selection into elite programs. Where he promised to show how the process of hiring and training pilots is completely at odds with what is required, he only ends up discussing how cultural factors can have a tragic influence on the dynamics between pilots in a cockpit. Rather than discuss the process of hiring teachers, he describes one successful charter school, and the demands it makes of its students. He also speculates on cultural and linguistic factors that might correlate with math perseverance.

We're left to wonder why the change in focus from the speech to the book... could it be Gladwell realized that after all there is substantial evidence for the effectiveness of aptitude testing as a predictor of job success? Or did he just decide that "what determines exceptional success?" is a more interesting question (and easier to write about) than "how can we hire better?"

Using College Admissions Exams: Part II


Today's blog post is the second by Dr. Howard Wainer, who is the Distinguished Research Scientist at the National Board of Medical Examiners, as well as Professor of Statistics at the Wharton School of the University of Pennsylvania.  Dr. Wainer is also a member of Criteria's Scientific Advisory Board.

 

In an earlier post I commented on one aspect of a report, commissioned by the National Association for College Admission Counseling, that was critical of the current college admission exams, the SAT and the ACT. The commission was chaired by William R. Fitzsimmons, the dean of admissions and financial aid at Harvard.

One of the recommendations of the Commission was for colleges to consider making their admissions tests (SAT or ACT) optional. Using data from Bowdoin College, which has had such a policy for almost 40 years, I showed that those students who did not submit their SAT scores had, in fact, scored about a standard deviation lower than those students who did submit them. This isn't surprising. More important, the students who did not submit SAT scores also performed about a standard deviation lower in their freshman grade point average at Bowdoin. This would have been predictable from their SAT scores had the College insisted on them. My conclusion is that colleges deny themselves useful information by making SATs optional. And the Commission, by making their recommendations in the absence of such data, was shooting in the dark.

In this post I'd like to discuss another of their principal recommendations:

Schools should consider eliminating the SAT/ACT altogether and substituting instead achievement tests. They cite the unfair effect of coaching as the motivation for this — they weren't naive enough to suggest that if achievement tests were to become more high stakes coaching for them would not be offered. Rather, they argued that such coaching would be related to schooling and hence more beneficial to education than is coaching that focuses on test-taking skills.

Driving the Commission's recommendations was the notion that the differential availability of commercial coaching made admissions testing unfair. They recognized that the 100 point gain (on the 1200 point SAT scale) test prep providers often tout as a typical outcome was hype and agreed with the estimates from more neutral sources that about 20 points was more likely. However, they deemed even 20 points too many. The Commission pointed out that there was no wide-spread coaching for achievement tests, but agreed that should the admissions option shift to achievement tests the coaching would likely follow. This would be no fairer to those applicants who could not afford extra coaching, but at least the coaching would be of material more germane to the subject matter and less related to test-taking strategies.

One can argue with the logic of this: that a test that is less subject oriented and related more to the estimation of a general aptitude might have greater generality, and that a test that is less related to specific subject matter might be fairer to those students whose schools have more limited resources for teaching a broad range of courses. I find these arguments persuasive, but I have no data at hand to support them. So instead I will take a different, albeit more technical, tack. I will argue that the psychometric reality associated with replacing general aptitude tests with achievement tests makes the kinds of comparisons that schools need among different candidates impossible.

When all students take the same tests we can compare their scores on the same basis. The SAT and ACT were constructed specifically to be suitable for a wide range of curricula. SAT-Math is based on mathematics no more advanced than 8th grade. Contrast this with what would be the case with achievement tests. There would need to be a range of tests and students would choose a subset of them that best displayed both the coursework they had had and the areas they felt they were best in. Some might take chemistry, others physics; some French, others music. The current system has students typically taking three achievement tests (SAT-II). How can such very different tests be scored so that the outcome on different tests can be compared? Do you know more French than I know physics? Was Mozart a better composer than Einstein was a physicist? How can admissions officers make sensible decisions through incomparable scores?

How are SAT-II exams scored currently? Or more specifically, how had they been scored for decades as of when I left the employ of ETS seven years ago? I don't know whether anything has changed in the interim. They were all scored on the familiar 200-800 scales, but similar scores on two different tests are only vaguely comparable. How could they be? What is currently done is that tests in mathematics and science are roughly equated using the SAT-Math, the aptitude test that everyone takes, as an equating link. In the same way tests in the humanities and social sciences are equated using the SAT-Verbal. This is not a great solution, but is the best that can be done in a very difficult situation. Comparing history with physics is not worth doing for even moderately close comparisons.

One obvious approach would be to norm reference each test, so that someone who scores average for all those who take a particular test gets a 500 and someone a standard deviation higher gets a 600, etc. This would work if the people who take each test were, in some sense, of equal ability. But that is not only unlikely, it is empirically false. The average student taking the French achievement test might starve to death in a French restaurant, whereas the average person taking the Hebrew achievement test might do just fine if dropped in the middle of the night onto the streets of Tel Aviv. Happily the latter students also do much better on the SAT-Verbal test and so the equating helps. This is not true for the Spanish test, where a substantial portion of those taking it come from Spanish speaking homes.
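The norm-referencing rule just described (average scorer gets 500, one standard deviation higher gets 600) can be written down in a few lines. This is a minimal sketch of that rule only, not of how SAT-II tests are or were actually scaled; the raw scores, and the pretense that both tests share a common raw proficiency scale, are invented for illustration.

```python
import statistics


def norm_reference(raw_score, all_raw_scores):
    """Scale a raw score so the group mean maps to 500 and each
    standard deviation above or below the mean adds or subtracts 100."""
    mean = statistics.mean(all_raw_scores)
    sd = statistics.pstdev(all_raw_scores)
    return 500 + 100 * (raw_score - mean) / sd


# Hypothetical raw proficiency on one common scale (higher = more fluent).
french_takers = [30, 40, 50, 60, 70]
hebrew_takers = [60, 70, 80, 90, 100]

# The average taker of each test gets the same scaled score,
# even though the two groups differ greatly in underlying proficiency.
print(norm_reference(50, french_takers))
print(norm_reference(80, hebrew_takers))
```

Both average test takers come out at 500 although, in this made-up example, the average Hebrew taker is far more proficient than the average French taker. That is the sense in which norm referencing within each test fails when the groups choosing each test are not of equal ability.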

Substituting achievement tests is not a practical option unless admissions officers are prepared to have subject matter quotas. I believe that solution would be too inflexible to be feasible.


Study Examines Effectiveness of Employment Testing


The May-June edition of the APA's journal American Psychologist contains an important new study on the effectiveness and validity of employment testing. The study examines the predictive validity of testing in both educational and employment settings. There's a good summary of the study's findings on my favorite statistics blog. Essentially, the study shows that employment aptitude tests are a generally valid way of predicting a wide variety of aspects of job performance. It also contains encouraging conclusions about the fairness of aptitude tests.

While most of the conclusions of the study will not be surprising to people familiar with the field of employment testing, my sense is that this is an important study because of the amount of evidence it considers. The study is a meta-analysis (a review of many different studies) and may be the most comprehensive attempt to examine the effectiveness and validity of employment testing since Hunter and Schmidt's seminal study in the late 1990s.

If you want to cough up $12 to read the whole study you can do so here. I'll probably get into the details of the study in subsequent posts.
