Posted by Eric Loken on Wed, Jan 28, 2009 @ 01:40 PM
Last week the New York Times published an article on a possible Obama
effect on test scores of black test takers. It was unusual for a
major newspaper to publish a story on a social science study before that study
has been published, let alone reviewed. But when you hear that so-and-so
reported their results at some national conference, that isn't really peer
reviewed either. The conference organizers have often only seen a 200
word description of what the researchers thought they would present. So
although unusual, it's not entirely out of line to try to get a head start on
a story like this, and the Times did circulate the study to some academics to
get professional opinions.
Let me say at the outset that I hope the central result is true. The authors
claim that they gave a short academic aptitude type test to black and white
test-takers. When they administered the test last summer, they
noted a difference between average scores for blacks and whites. However,
after (now) President Obama had received his party's nomination and given
his acceptance speech, the difference in scores disappeared. The theory
is that Obama's rise has had a positive motivating influence on test taking
performance.
The story has legs because there is a well-documented body of research on test performance,
and how it can be affected by contextual cues. You can start with the
cultural beliefs about aptitude tests in general. If there is a belief
among one target group that the tests always show underperformance, then that
belief can have a self-fulfilling aspect. Researchers have
experimentally manipulated that contextual clue by describing tests differently
to participants before they take them. Researchers have also manipulated
the race and gender of the test administrator and done a variety of clever
tricks to see to what extent performance can be affected by context. One
enterprising team actually had women of Asian heritage take a math test, randomly
dividing them into one group who answered a questionnaire designed to get them
to think of their female identity, while the other group answered questions
about their Asian identity. Guess what? One group underperformed
relative to the other, and because the study was conducted as a randomized
experiment, the authors are allowed to infer that their contextual manipulation
caused the differences in performance.
So I'm sympathetic to the study described in the Times, and I fully appreciate
the research tradition it comes from. That said, there are a couple of
warning flags about the study. First, it is unclear from the Times piece
whether there was any reference at all to Obama before the participants took
the test. If not, then the story must be that if there was a difference
in performance over time it was because Obama was "in the air". That's true
enough – he certainly was in the air. The country was electrified.
But most studies on test taking performance try to make the contextual cue more
closely connected to the test taking event. Lots of things happened
from last summer to now...millions of jobs were lost, the stock market tanked,
Tom Brady was injured, and the seasons changed.
But the more worrisome concern is the quality of the data. Based on the Times
article, it seems like there were four tests, and at each occasion there were
maybe 20 black participants. Furthermore, the age range of the
participants was around 50 years. I don't want to make your eyes
swirl with statistical mumbo jumbo...but let me throw out these two points.
The degree of sampling variability from occasion to occasion would be
huge. Would you trust the results of an opinion poll that gathered a
group of 20 participants? So why trust the results of a test taken by 20
people? It's all the more problematic that the researchers are trying to
prove a lack of difference. With such a small sample size, and such
wide variability of participants in age and occupation, it becomes very
difficult to prove that a difference exists. But as I have to
remind my PhD students every day – failing to prove that there is a difference
is not the same thing as proving that there is no difference. Their eyes swirl at me too.
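To put rough numbers on the sampling variability point, here is a minimal sketch. The function names are mine, and the equal group sizes of 20 and the standardized score scale (standard deviation of 1) are assumptions for illustration, not figures from the study.

```python
import math

def margin_of_error_proportion(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a poll proportion from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

def se_mean_difference(sd, n1, n2):
    """Standard error of the difference between two independent group means,
    assuming both groups share the same standard deviation."""
    return sd * math.sqrt(1 / n1 + 1 / n2)

# An opinion poll of 20 people: margin of error of roughly 22 percentage points.
poll_margin = margin_of_error_proportion(20)

# Two groups of 20 on a standardized scale (sd = 1, an assumption):
# the 95% band around the observed gap is about +/- 0.62 standard deviations.
gap_se = se_mean_difference(1.0, 20, 20)
gap_margin = 1.96 * gap_se

print(round(poll_margin, 3), round(gap_se, 3), round(gap_margin, 3))
```

Under these assumptions, a gap would have to be well over half a standard deviation before a sample this small could reliably distinguish it from zero, which is why a "disappearing" difference is weak evidence that the difference is gone.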
Come to think of it, it makes you wonder why everyone is looking at the data in
this particular way. The story is that on the one testing occasion before
Obama's meteoric rise, there was a black-white difference, and then it
disappeared over the next three testing occasions. The implicit
reasoning is that something has happened. But why privilege the summer
result so much? Why not ask "What was happening last summer that made a
black-white difference show up?" Why assume that that result is somehow
"true" and that it has recently "disappeared"?
At any rate, more data is already in hand. There have been several
administrations of the SAT during the election run, and even one since
President Obama's inauguration. Let's take a look at the national trend
based on millions of scores. I'd be very happy if there is something to
write about. I personally expect that there will be something to write about
over time, but I also believe that the evidence is going to take some time to
develop. Let's hope the New York Times is still paying attention then,
and not just trying to front-run another study that has barely been mailed out
for review.
Posted by Josh Millet on Thu, Dec 18, 2008 @ 12:24 PM
In my last post I compared a speech given by Malcolm Gladwell in the spring to the content of his new book Outliers, and wondered what had happened to the employee selection angle he had promised in the speech. Well, no sooner did my post go live than my New Yorker magazine showed up in my mailbox with the answer — this week's cover story is an article by Gladwell entitled "Most Likely to Succeed: How do we hire when we don't know who's right for the job?"
In the article the author describes the problems inherent in evaluating talent and predicting job performance, and cites three examples of jobs where he sees this as a problem: pro football quarterbacks, teachers, and financial analysts. I'm going to focus here just on the issue of predicting success for NFL quarterbacks.
Gladwell describes the challenges faced by NFL scouts who evaluate college quarterbacks, and relates the examples of some prominent "can't miss" prospects who became NFL busts. Gladwell is at his most comfortable spinning an anecdote about a single subject, and he structures this article around the story of Chase Daniel of Missouri. But somehow in trying to tell the story of how difficult it is for NFL teams to decide who to draft, Gladwell makes the ludicrous statement that the entire NFL selection process is fraught with error. He concludes that "there are certain jobs where almost nothing you can learn about candidates before they start predicts how they'll do once they're hired."
In fact, when one looks at the NFL's record of predicting quarterback success, Gladwell's conclusion is on very shaky ground. The collective opinion of an NFL player's prospects is reflected in the order in which players are drafted by NFL teams. It turns out that draft order is a very accurate predictor of subsequent statistical performance for quarterbacks.
To take the most recent decade as an example, when one looks at all the quarterbacks (67 in all) who were drafted by NFL teams from 2000 to 2004, and compares their overall draft position to their statistics in their first four years in the league, it is clear that on balance NFL teams are very accurate in predicting statistical success in the NFL. Organizational psychologists measure the predictive validity of an employee selection technique by quantifying the strength of the relationship between selection measure and job performance; the strength of the association is expressed as a correlation coefficient. For the whole group, the correlation between draft order and passing yardage is very strong (-.73; the coefficient is negative because the higher a player is drafted, the lower their draft rank). For those concerned that a measure of total productivity such as passing yardage is somewhat correlated with opportunity, we can consider passer efficiency, as measured by QB rating. Only 51 of the 67 quarterbacks drafted attempted a pass in the NFL, a necessary requirement for calculating a QB rating: for this group there was a -.34 correlation between draft position and QB rating. This is still a strong association, and shows a clear, statistically significant correlation between draft order and future statistical success in the NFL.
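For readers who want to check the significance claim, a correlation r from n pairs can be converted to a t statistic with n - 2 degrees of freedom. The sketch below uses only the correlations and sample sizes quoted above; the function name is mine, and the standard test assumes the usual conditions for a Pearson correlation hold.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero.
    Has n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Draft position vs. passing yardage: r = -.73 across all 67 quarterbacks.
t_yardage = correlation_t_statistic(-0.73, 67)   # about -8.61

# Draft position vs. QB rating: r = -.34 across the 51 who attempted a pass.
t_rating = correlation_t_statistic(-0.34, 51)    # about -2.53

print(round(t_yardage, 2), round(t_rating, 2))
```

With 49 degrees of freedom the two-sided 5% cutoff is about 2.01 in absolute value, so both correlations clear it, the yardage correlation by a very wide margin.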
Like the fans of the teams that drafted them, Gladwell has let the Ryan Leafs (a high draft choice that flopped) and the Tom Bradys (a low draft choice who became a superstar) of the world influence his thinking. These are outliers, a concept with which Gladwell should be familiar given the title of his latest book. (If you take Brady out of the mix the correlations strengthen considerably!) It turns out, in fact, that on average the NFL draft process is highly accurate at predicting QB success, and the draft is based entirely on things that Gladwell dismisses as useless--college performance, scouting, performance in the NFL combine.
If Gladwell had considered any quantitative measures at all relating to the efficacy of the draft he'd have no basis for his conclusion that "a prediction, in a field where prediction is not possible, is nothing but a prejudice." Gladwell, we fear, gets swept up in his own storytelling, and in the process badly misconstrues the alleged "quarterback problem."
Gladwell's approach reminds me of a challenge we face every day in discussing our pre-employment testing services with customers and prospective clients. When evaluating a selection instrument, there's a strong tendency to take an anecdotal approach and fixate on the outlier for whom the selection instrument was not an accurate predictor. For example, when a sales manager administers one of our tests to a few dozen existing employees, we'll often hear about the one high performer who didn't fare well on the test. Anecdotes are powerful, and it is sometimes difficult to persuade the manager to focus on how well the assessment predicts performance across the whole group. No selection measure is perfect, and we must be careful when evaluating the efficacy of our selection processes not to follow Gladwell's lead and let anecdotal evidence trump more rigorous analysis.
Posted by Josh Millet on Fri, Dec 12, 2008 @ 12:12 PM
Recently on a plane the guy beside me was reading the same book I
was – Malcolm Gladwell's Outliers. My fellow passenger didn't think
this was remarkable as the airport bookstores had huge displays.
Gladwell has become somewhat of a household name for his skill at
popularizing social science through collecting compelling anecdotes. Blink and The Tipping Point were entertaining enough to read, and
that's why the guy beside me had made an impulse purchase.
I had been more proactive in getting my hands on the book. We here
at Criteria had actually been eagerly waiting for Gladwell's book ever
since May, when we stumbled across a truly odd speech
he gave at a New Yorker conference. Gladwell's speech, which was
explicitly delivered as a sneak preview of the book, covered what he
called the "mismatch" problem. His thesis was that the way employers
evaluate prospective employees — including the practice of
pre-employment testing — is at a complete "mismatch" with what is
required. As evidence he offered three loosely linked examples – sports
combines where amateur athletes are evaluated before a draft;
certification requirements for teachers; and the University of
Michigan's affirmative action program for law school admissions.
Gladwell opens with anecdotes from the National Hockey League's
pre-draft combine, and then goes on to discuss the NBA and NFL drafts.
He relates the finding that the aptitude test given to the NFL
quarterbacks has no correlation with their performance. As usual, his
evidence for this contention is entirely anecdotal. In a blog post
in the spring we described evidence showing that the test may be
predictive of QB success, and it doesn't help Gladwell's case that he
mocks the fact that Eli Manning and Tony Romo scored well on the test,
while Vince Young and David Garrard did not. Even back in May 2008 he
should have understood that his alleged exceptions weren't exactly
disproving the rule. By the end of the opening example, Gladwell has
shown that he probably doesn't know all that much about sports, and has
launched a puzzling, and difficult to support, argument. We were eager
to see him make his point in print, but this line of thought didn't
make it into the book.
In his speech, Gladwell next moved on to discussing teachers. We
can't disagree with him that good teachers are important, or that it
might be a good idea to broaden the pool from which new teachers are
selected. But we were a bit confused by his argument about hiring
standards. Teacher quality, he tells us, is much more predictive of
student achievement than classroom size, and so it is worth investing
in. But according to Gladwell, there are no criteria to predict whether
someone will be a successful teacher. This is quite a jump, and again
one is left with only colorful anecdotes for what is a very sweeping
point with broad social significance.
Finally, Gladwell points out that even though the University of
Michigan Law School had, in the past, given extra consideration to
minority applicants and made concessions on testing standards, there
was no evidence of differences in "success" years after graduation.
Gladwell does not address that the admissions tests were not designed
to predict success after graduation, but rather performance in the
first year of law school. Nor does he address that there are
statistical problems with evaluating a selection tool on a sample that
was selected in the first place using that tool. But most important,
Gladwell seems to be arguing that based on the experiences of the
University of Michigan, law schools should abandon altogether their use
of test scores to select applicants.
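The statistical problem with judging a selection tool on people it already selected is usually called range restriction. Here is a small simulation sketch of the effect. Every number in it is an assumption chosen for illustration (a true correlation of 0.5, 20,000 simulated applicants, the top 20% admitted); none of it is Michigan data.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_range_restriction(rho=0.5, n=20000, admit_fraction=0.2, seed=0):
    """Return (correlation among all applicants, correlation among admits)."""
    rng = random.Random(seed)
    scores, outcomes = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)  # hypothetical test score
        # Outcome built so that its true correlation with the score is rho.
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        scores.append(x)
        outcomes.append(y)
    full_r = pearson(scores, outcomes)
    # Admit only applicants at or above the cutoff score.
    cutoff = sorted(scores)[int(n * (1 - admit_fraction))]
    admitted = [(x, y) for x, y in zip(scores, outcomes) if x >= cutoff]
    restricted_r = pearson([x for x, _ in admitted], [y for _, y in admitted])
    return full_r, restricted_r

full_r, restricted_r = simulate_range_restriction()
print(round(full_r, 2), round(restricted_r, 2))
```

In this sketch the correlation among admitted applicants comes out well below the correlation in the full applicant pool, even though the test predicts exactly as well for everyone. A weak relationship among graduates therefore says little about how useful the test was at the admissions stage.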
Well, somewhere between his May speech and the November publication
of Outliers Gladwell must have realized he had his own mismatch
problem. His evidence didn't match his thesis. He must have changed the
direction of his book significantly, because Outliers is barely
relevant to employee selection. Instead, it's a motley collection of
examples arguing that exceptionally successful people are not entirely
self-made, and that their ascent is due also to extraordinary good
fortune with regard to the opportunities they were presented with.
(This is hardly a shocker of a thesis, and is right up there with
Blink's bold contention that first impressions are often correct,
except when they are not.) But where's the employee selection angle?
Where his speech proclaimed that it was "time to shut down the
combines", his book only discusses the mildly interesting fact that the
birthdays of professional athletes tend to cluster near critical
cut-off dates for selection into elite programs. Where he promised to
show how the process of hiring and training pilots is completely at
odds with what is required, he only ends up discussing how cultural
factors can have a tragic influence on the dynamics between pilots in a
cockpit. Rather than discuss the process of hiring teachers, he
describes one successful charter school, and the demands it makes of
its students. He also speculates on
cultural and linguistic factors that might correlate with math
perseverance.
We're left to wonder why the change in focus from the speech to the
book... could it be Gladwell realized that after all there is
substantial evidence for the effectiveness of aptitude testing as a
predictor of job success? Or did he just decide that "what determines
exceptional success?" is a more interesting question (and easier to
write about) than "how can we hire better?"
Posted by Howard Wainer on Mon, Nov 10, 2008 @ 02:49 PM
Today's blog post is the second by Dr. Howard Wainer, who is the Distinguished
Research Scientist at the National Board of Medical Examiners, as well
as Professor of Statistics at the Wharton School of the University of
Pennsylvania. Dr. Wainer is also a member of Criteria's Scientific Advisory Board.
In an earlier post
I commented on one aspect of a report, commissioned by the National
Association for College Admission Counseling, that was critical of the
current college admission exams, the SAT and the ACT. The commission
was chaired by William R. Fitzsimmons, the dean of admissions and
financial aid at Harvard.
One of the recommendations of the Commission was for colleges to
consider making their admissions tests (SAT or ACT) optional. Using
data from Bowdoin College, which has had such a policy for almost 40
years, I showed that those students who did not submit their SAT scores
had, in fact, scored about a standard deviation lower than those
students that did submit them. This isn't surprising. More important,
the students who did not submit SAT scores also performed about a
standard deviation lower in their freshman grade point average at
Bowdoin. This would have been predictable from their SAT scores had the
College insisted on them. My conclusion is that colleges deny
themselves useful information by making SATs optional. And the
Commission, by making their recommendations in the absence of such
data, was shooting in the dark.
In this post I'd like to discuss another of their principal recommendations:
Schools should consider eliminating the SAT/ACT altogether
and substituting instead achievement tests. They cite the unfair effect
of coaching as the motivation for this — they weren't naive enough to
suggest that if achievement tests were to become more high stakes
coaching for them would not be offered. Rather, they argued that such
coaching would be related to schooling and hence more beneficial to
education than is coaching that focuses on test-taking skills.
Driving the Commission's recommendations was the notion that the
differential availability of commercial coaching made admissions
testing unfair. They recognized that the 100 point gain (on the 1200
point SAT scale) test prep providers often tout as a typical outcome
was hype and agreed with the estimates from more neutral sources that
about 20 points was more likely. However, they deemed even 20 points
too many. The Commission pointed out that there was no wide-spread
coaching for achievement tests, but agreed that should the admissions
option shift to achievement tests the coaching would likely follow.
This would be no fairer to those applicants who could not afford extra
coaching, but at least the coaching would be of material more germane
to the subject matter and less related to test-taking strategies.
One can argue with the logic of this – that a test that is less
subject oriented and related more to the estimation of a general
aptitude might have greater generality. And that a test that is less
related to specific subject matter might be fairer to those students
whose schools have more limited resources for teaching a broad range of
courses. I find these arguments persuasive, but I have no data at hand
to support them. So instead I will take a different, albeit more
technical, tack. I will argue that the psychometric reality associated
with replacing general aptitude tests with achievement tests makes
the kinds of comparisons that schools need among different
candidates impossible.
When all students take the same tests we can compare their scores on
the same basis. The SAT and ACT were constructed specifically to be
suitable for a wide range of curricula. SAT–Math is based on
mathematics no more advanced than 8th grade. Contrast this
with what would be the case with achievement tests. There would need to
be a range of tests and students would choose a subset of them that best
displayed both the coursework they had had and the areas they felt they
were best in. Some might take chemistry, others physics; some French,
others music. The current system has students typically taking three
achievement tests (SAT-II). How can such very different tests be scored
so that the outcome on different tests can be compared? Do you know
more French than I know physics? Was Mozart a better composer than
Einstein was a physicist? How can admissions officers make sensible
decisions through incomparable scores?
How are SAT-II exams scored currently? Or more specifically, how
had they been scored for decades when I left the employ of ETS seven
years ago – I don't know if they have changed anything in the interim.
They were all scored on the familiar 200-800 scales, but similar scores
on two different tests are only vaguely comparable. How could they be?
What is currently done is that tests in mathematics and science are
roughly equated using the SAT-Math, the aptitude test that everyone
takes, as an equating link. In the same way tests in the humanities and
social sciences are equated using the SAT-Verbal. This is not a great
solution, but is the best that can be done in a very difficult
situation. Comparing history with physics is not worth doing for even
moderately close comparisons.
One obvious approach would be to norm reference each test, so that
someone who scores average for all those who take a particular test
gets a 500 and someone a standard deviation higher gets a 600, etc.
This would work if the people who take each test were, in some sense,
of equal ability. But that is not only unlikely, it is empirically
false. The average student taking the French achievement test might
starve to death in a French restaurant, whereas the average person
taking the Hebrew achievement test might do just fine if dropped in
the middle of the night onto the streets of Tel Aviv. Happily the
latter students also do much better on the SAT-Verbal test and so the
equating helps. This is not true for the Spanish test, where a
substantial portion of those taking it come from Spanish speaking homes.
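The norm-referencing idea can be written down in a few lines. This is only a sketch of the simple rule described above (mean maps to 500, one standard deviation to 100 points); the function name and the toy raw scores are hypothetical, and it is not a description of how ETS actually scales these tests.

```python
import statistics

def norm_referenced_score(raw, population_raw_scores):
    """Scale a raw score so that the test-taking population's mean maps to 500
    and each standard deviation above or below the mean is worth 100 points."""
    mean = statistics.mean(population_raw_scores)
    sd = statistics.pstdev(population_raw_scores)
    return 500 + 100 * (raw - mean) / sd

# Two hypothetical tests whose takers differ greatly in raw proficiency.
weaker_group = [40, 50, 60]
stronger_group = [70, 80, 90]

# The average taker of each test gets the same scaled score, 500,
# even though their raw proficiency differs by 30 points.
print(norm_referenced_score(50, weaker_group))
print(norm_referenced_score(80, stronger_group))
```

The scaled score only locates a student within the group that chose the same test. It carries no information about how the two groups compare with each other, which is exactly the gap the equating step is trying to fill.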
Substituting achievement tests is not a practical option unless
admissions officers are prepared to have subject matter quotas. I believe that solution would be too inflexible to be feasible.
Posted by Josh Millet on Fri, Jun 06, 2008 @ 03:18 PM
The May-June edition of the APA's journal American Psychologist contains an important new study on the effectiveness and validity of employment testing. The study examines the predictive validity of testing in both educational and employment settings. There's a good summary of the study's findings on my favorite statistics blog. Essentially, the study shows that employment aptitude tests are a generally valid way of predicting a wide variety of aspects of job performance. It also contains encouraging conclusions about the fairness of aptitude tests.
While most of the conclusions of the study will not be surprising to people familiar with the field of employment testing, my sense is that this is an important study because of the amount of evidence it considers. The study is a meta-analysis (a review of many different studies) and may be the most comprehensive attempt to examine the effectiveness and validity of employment testing since Hunter and Schmidt's seminal study in the late 1990s.
If you want to cough up $12 to read the whole study you can do so here. I'll probably get into the details of the study in subsequent posts.