The US Open golf tournament is often called the “ultimate test” in championship golf, and its goal is to crown the US champion. That got us thinking about viewing golf tournaments as a selection process. In a typical tournament, players play four rounds of golf. (There’s usually a cut after the second day where the field is reduced by half. This complicates the analysis below somewhat, but not enough to change the main point.) When all is said and done by Sunday afternoon, millions of dollars of prize money are distributed purely according to rank order from best to worst score.
Suppose this were a process to select the top employees from an applicant pool. Suppose the candidates were assessed four times on essentially the same exercise (the same golf course, in this case). If the assessment tool is doing its job well, you would expect the rank ordering of the applicant pool to remain relatively stable from assessment to assessment. Psychometricians have a name for this stability of measures: the degree to which an assessment produces consistent results over repeated administrations is referred to as its test-retest reliability, and they quantify it with a correlation coefficient. The correlation coefficient is a number between -1 and 1, where 0 means that there is no linear relationship between the test and the retest, and 1 means that the retest scores were perfectly predictable from the original test scores. (A correlation of -1 means that there is perfect inverse prediction from test to retest – i.e. the worst became the best and the best became the worst.)
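To make the coefficient concrete, here is a minimal sketch of how a test-retest correlation could be computed. The function name pearson_r and the five candidates' scores are made up for illustration; they are not data from any real assessment.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five candidates on a test and two possible retests.
test = [70, 72, 68, 75, 71]
retest_same_order = [71, 73, 69, 76, 72]  # everyone shifts by one point
retest_reversed = [72, 70, 74, 67, 71]    # the ordering flips exactly

print(pearson_r(test, retest_same_order))  # very close to 1
print(pearson_r(test, retest_reversed))    # very close to -1
```

The first retest preserves the ordering perfectly and the second inverts it, which is why the two results land at the opposite ends of the scale.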
Obviously a higher number means a more stable and reliable measure. How high is good enough for a test to be considered reliable? As a reference, if a random sample of high school students took the SAT twice, you’d expect the two sets of test scores to correlate somewhere close to 0.9. Decent assessment measures should have a correlation of at least 0.8 for normal purposes. If the retest reliability is low, then one strategy is to average over many repetitions. If a certain test correlates 0.5 on repetition, then averaging over 5 repeated measurements would give you an acceptable overall reliability (about 0.83, by the Spearman-Brown formula). When a single administration is much less reliable than that, you start to hear analogies of the scores from test to test resembling random shots at a dartboard.
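The averaging claim can be checked with the Spearman-Brown formula, which predicts the reliability of an average of k repetitions from the reliability of a single one. This is a short sketch that assumes the repetitions are parallel measurements (same true score, independent errors of equal size); the function name is ours.

```python
def spearman_brown(r_single, k):
    """Predicted reliability of the average of k parallel measurements,
    given the reliability r_single of one measurement."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r_single / (1 + (k - 1) * r_single)


# The example from the text: a test that correlates 0.5 on repetition,
# averaged over 1 to 5 administrations.
for k in range(1, 6):
    print(k, round(spearman_brown(0.5, k), 3))
```

With a single-test reliability of 0.5, four repetitions land right at 0.8 and five push the average to roughly 0.83.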
With that set-up, here are the correlations for the four rounds played at the Memorial Tournament two weeks ago. Basically these correlations all hover around 0. There is no evidence here that the rank ordering of participants from round to round has any appreciable level of stability. And yet $6 million of prize money was doled out on the basis of this selection process. What gives?
[Table: correlations between scores in Round 1, Round 2, Round 3, and Round 4]
Well, ordinarily your first reaction here would be that the assessment is poor. When reapplied on a second (and third and fourth) occasion, the information you get is almost totally different, an indication that the assessment has weak test-retest reliability. That’s usually the sign of a deeply flawed measure, and the question arises, “How can it be valid when it is so unreliable?” But hold on a second: there can be no more valid measure of golf ability than golf score. Some people might worry that the quality of intelligence is nothing more than what intelligence tests measure, and they are rightly concerned that if this were true it would have implications for how we talk about (and use) employment aptitude tests. But there should be unanimous agreement that a person’s typical score on a round of golf is exactly what is meant when describing a person’s golf ability.
So with such immediate links (no pun intended) between construct and measure, it can’t be argued that a professional golf tournament is not a valid assessment of golf ability. So why is it so unreliable? Why would it fail by the standards of a psychometric assessment?
The answer lies in the concept of range restriction. The most prestigious tournaments in golf attract the best players in the world. If you could assign to every player a “true score” – a hypothetical number that might be the average of an infinite number of rounds of golf played on that particular golf course – then the variability from player to player in true scores is much smaller than the variability in any one player’s scores from round to round. Reliability is the share of the variation in observed scores that comes from true-score differences, so when that share is tiny the round-to-round correlation falls toward 0. According to a psychometrician, the results of the Memorial Tournament were dominated by little more than random noise – the random bounces, the random gusts of wind, etc.
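A small simulation makes the point. The sketch below assumes each observed round is a player's true score plus independent random noise; every number in it (the standard deviations, the average of 72, the 2,000 simulated players) is a hypothetical value chosen for illustration, not an estimate from tournament data.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def expected_r(sd_true, sd_noise):
    """Round-to-round correlation implied by the model:
    true-score variance divided by total observed variance."""
    return sd_true ** 2 / (sd_true ** 2 + sd_noise ** 2)


def simulated_r(sd_true, sd_noise, n_players=2000, seed=0):
    """Simulate two rounds for n_players and correlate them.
    A large n_players is used only to keep the estimate stable."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(72, sd_true) for _ in range(n_players)]
    round_1 = [t + rng.gauss(0, sd_noise) for t in true_scores]
    round_2 = [t + rng.gauss(0, sd_noise) for t in true_scores]
    return pearson_r(round_1, round_2)


# Hypothetical settings: the same round-to-round noise (3 strokes) in both
# fields, but very different spreads of true ability.
print("mixed field:", expected_r(6.0, 3.0), simulated_r(6.0, 3.0))
print("elite field:", expected_r(0.7, 3.0), simulated_r(0.7, 3.0))
```

Under these made-up settings the same noisy game produces a correlation near 0.8 in the mixed field and one near 0.05 in the elite field. Nothing about the measure changed; only the spread of true ability did.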
Psychometricians would not support a selection process to choose applicants that operates like the rounds of a professional golf tournament. The chances are quite small that the final selection of top applicants would match up with the hypothetical rankings that you would see if God could come down and tell you all the “true scores.” The lesson for businesses is that the properties of an assessment are not fixed, but depend on the population to which it is applied. If the applicant pool is highly uniform on the construct of interest, then even an excellent measure will be unreliable. For example, the SAT is a less reliable instrument when used to rank-order the freshman class at Harvard than when it is used to rank-order a more nationally representative sample. (But don’t think it would look anywhere near as bad as the golf data above!)
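The SAT comparison can be roughed out with the standard range-restriction adjustment, which assumes the test's error variance is the same in both groups. In the sketch below, the 0.9 reliability comes from the SAT figure quoted earlier; the assumption that the selective group's scores are half as spread out as the national sample's is purely hypothetical.

```python
def restricted_reliability(r_full, sd_full, sd_restricted):
    """Reliability in a range-restricted group, assuming the error
    variance is the same as in the full group."""
    # Error variance implied by the full group's reliability.
    error_var = sd_full ** 2 * (1 - r_full)
    # Share of the restricted group's variance that is not error.
    # Clamped at 0 in case the inputs violate the model's assumptions.
    return max(0.0, 1 - error_var / sd_restricted ** 2)


# Hypothetical: the selective group's score spread is half the national one.
print(restricted_reliability(0.9, sd_full=1.0, sd_restricted=0.5))
```

Under that assumption the reliability drops from 0.9 to about 0.6: noticeably worse, but still far from the near-zero correlations in the golf data.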
A company or school can adjust its selection process based on what it knows about both its tests and its applicants. Golf tournaments, however, operate strictly by the scores. The sports lesson is that poor reliability is why Vegas finds it difficult to handicap golf for betting purposes. It’s really more like Keno when compared with other sports betting. It should also be clear why it is an absolutely staggering accomplishment that Tiger Woods has won about 30% of the tournaments that he has entered. A psychometrician would predict that this was almost impossible, and would instead hypothesize that Tiger isn’t quite like the rest of the world’s elite golfers. A golf fan, of course, would have told you that years ago.