Is the "Ultimate Test" in Golf Unreliable?
Posted by Eric Loken on Wed, Jun 11, 2008 @ 07:13 PM
The US Open golf tournament is often called the “ultimate
test” in championship golf and its goal is to crown the US champion. That got us thinking about viewing golf
tournaments as a selection process. In a
typical tournament, players play 4 rounds of golf. (There’s usually a cut after the second day
where the field is reduced by half. This
complicates the analysis below somewhat, but not enough to change the main point.) When all is said and done by Sunday afternoon,
millions of dollars in prize money are distributed purely according to rank
order from best to worst score.
Suppose this were a process to select the top employees from
an applicant pool. Suppose the
candidates were assessed four times on essentially the same exercise (the same golf
course, in this case). If the
assessment tool is doing its job well, you would expect the rank ordering of
the applicant pool to remain relatively stable from assessment to
assessment. Psychometricians have a name
for this stability: the degree to which an assessment produces
consistent results over repeated administrations is called its test-retest
reliability, and it is quantified with a correlation coefficient. The correlation coefficient is a number
between -1 and 1, where 0 means that there is no linear relationship whatsoever
between the test and the retest, and 1 means that the retest scores are
perfectly predictable from the original test scores. (A correlation of
-1 means perfect inverse prediction from test to retest,
i.e. the worst became the best and the best became the worst.)
Obviously a higher number means a more stable and reliable
measure. How high is good enough for a
test to be considered reliable? As a reference, if a random sample of high
school students took the SAT twice, you’d expect the two sets of test scores to
correlate somewhere close to 0.9. Decent
assessment measures should have a correlation of at least 0.8 for most
purposes. If the retest reliability is
low, then one strategy is to average over many repetitions. If a certain test correlates 0.5 on repetition,
then averaging over 5 repeated measurements would give you an acceptable
overall reliability (about 0.83). At lower reliabilities than that, you start to hear
scores from test to test compared to random shots at a
dartboard.
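The averaging claim can be checked with the Spearman-Brown prediction formula, which the post does not name but which is the standard way to estimate the reliability of an average of repeated measurements. The sketch below assumes the repetitions are parallel (equally reliable, with independent errors); the function name `spearman_brown` is just an illustrative choice.

```python
def spearman_brown(single_r, k):
    """Predicted reliability of the average of k parallel measurements,
    each of which correlates single_r with any other (an assumption)."""
    return k * single_r / (1 + (k - 1) * single_r)


# A test that correlates 0.5 with itself on repetition, averaged k times.
for k in (1, 2, 5, 10):
    print(f"{k:>2} repetitions -> reliability {spearman_brown(0.5, k):.2f}")
```

With a single-administration correlation of 0.5, five repetitions give 2.5 / 3, or roughly 0.83, which clears the 0.8 benchmark mentioned above.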
With that set-up, here are the correlations for the four
rounds played at the Memorial Tournament two weeks ago. These correlations all sit
close to 0; the largest is only 0.22. There is no evidence here
that the rank ordering of participants from round to round has any appreciable
level of stability. And yet $6 million
of prize money was doled out on the basis of this selection process. What gives?
|        | Round1 | Round2 | Round3 | Round4 |
| Round1 | 1.00   | 0.10   | 0.22   | 0.17   |
| Round2 | 0.10   | 1.00   | 0.05   | 0.06   |
| Round3 | 0.22   | 0.05   | 1.00   | 0.02   |
| Round4 | 0.17   | 0.06   | 0.02   | 1.00   |
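To put the table in perspective, one can ask how many rounds a tournament would need before the averaged score reached the 0.8 benchmark. The sketch below is a back-of-the-envelope calculation, not part of the original analysis: it averages the six distinct round-to-round correlations and inverts the Spearman-Brown formula, assuming the rounds behave like parallel measurements and ignoring the cut. The helper name `rounds_needed` is hypothetical.

```python
import math

# The six distinct off-diagonal correlations from the table (each pair once).
pairwise = [0.10, 0.22, 0.17, 0.05, 0.06, 0.02]
mean_r = sum(pairwise) / len(pairwise)


def rounds_needed(single_r, target):
    """Inverse Spearman-Brown: number of parallel repetitions needed for the
    averaged score to reach the target reliability (assumes parallel rounds)."""
    return target * (1 - single_r) / (single_r * (1 - target))


k = rounds_needed(mean_r, 0.8)
print(f"mean round-to-round correlation: {mean_r:.2f}")
print(f"rounds needed for 0.8 reliability: {math.ceil(k)}")
```

Under these assumptions the answer comes out at roughly 35 rounds, not 4.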
Well, ordinarily your first reaction here would be that the
assessment is poor. When reapplied on a
second (and third and fourth) occasion, the information you get is almost
totally different, an indication that the assessment has weak internal
consistency. That’s usually the sign of
a deeply flawed measure, and the question arises, “How can it be valid when it
is so unreliable?” But hold on a
second: there can be no more valid measure of golf ability than golf
score. Some people might worry that the
quality of intelligence is nothing more than what intelligence tests measure,
and they are rightly concerned that if this were true it would have
implications for how we talk about (and use) employment aptitude tests. But there
should be unanimous agreement that a person’s typical score on a round of golf
is exactly what is meant when describing a person’s golf ability.
So with such immediate links (no pun intended) between
construct and measure, it can’t be argued that a professional golf tournament
is not a valid assessment of golf ability.
So why is it so unreliable? Why would it fail by the standards of a
psychometric assessment?
The answer lies in the concept of range restriction. The
most prestigious tournaments in golf attract the best players in the
world. If you could assign to every
player a “true score” – a hypothetical number that might be the average of an
infinite number of rounds of golf played on that particular golf course – then
the variability from player to player in true scores is much smaller than the
variability in scores seen from round to round.
According to a psychometrician, the results of the Memorial Tournament
were dominated by little more than random noise – the random bounces, the random
gusts of wind, etc.
Psychometricians would not support a selection process to choose applicants that operates like the rounds of a professional golf tournament. The chances are quite small that the final selection of top applicants would match up with the hypothetical rankings that you would see if God could come down and tell you all the "true scores." The lesson for businesses is that the properties of an assessment are not fixed, but depend on the population to which it is applied. If the applicant pool is highly uniform on the construct of interest, then even an excellent measure will be unreliable. For example, the SAT is a less reliable instrument when used to rank order the freshman class at Harvard than when it is used to rank order a more nationally representative sample. (But don't think it would look anywhere near as bad as the golf data above!)
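The range-restriction argument can be illustrated with a small simulation. This is a minimal sketch under made-up numbers, not an analysis of real golf data: each player gets a "true score", every round adds independent noise, and the only thing that differs between the two fields is how spread out the true scores are. The standard deviations (1 stroke for an elite field, 8 for a broad field, 3 for round-to-round noise) and the function names are assumptions chosen purely for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def round_to_round_correlation(sd_true, sd_noise, n_players=2000, seed=0):
    """Simulate two rounds for a field of players and correlate the scores.

    Each observed score is a fixed true score plus independent round noise.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(72.0, sd_true) for _ in range(n_players)]
    round1 = [t + rng.gauss(0.0, sd_noise) for t in true_scores]
    round2 = [t + rng.gauss(0.0, sd_noise) for t in true_scores]
    return pearson(round1, round2)


# Hypothetical spreads: same noise per round, very different fields.
elite = round_to_round_correlation(sd_true=1.0, sd_noise=3.0)
broad = round_to_round_correlation(sd_true=8.0, sd_noise=3.0)
print(f"elite field (narrow true scores): r = {elite:.2f}")
print(f"broad field (wide true scores):   r = {broad:.2f}")
```

In this model the expected correlation is the true-score variance divided by the total variance, so the same course with the same noise looks unreliable for the narrow field (expected value 1/10) and quite reliable for the broad one (expected value 64/73).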
A company or school can adjust its selection process based on what it knows about both its tests and its applicants. Golf tournaments, however, operate strictly by the scores. The sports lesson is that poor reliability is why Vegas finds it difficult to handicap golf for betting purposes. It's really more like Keno when compared with other sports betting. It should also be clear why it is an absolutely staggering accomplishment that Tiger Woods has won about 30% of the tournaments that he has entered. A psychometrician would have predicted that this was almost impossible, and would instead hypothesize that Tiger isn't quite like the rest of the world's elite golfers. A golf fan, of course, would have told you that years ago.