Posted by Wayne Chuen on Thu, Jun 19, 2008 @ 07:23 PM
Today's blog post is by Wayne Chuen, the lead software engineer in Criteria's product development group. Wayne directs Criteria's software development initiatives.
Over the last few months, the product development team at Criteria has been hard at work on a new tool that will help customers visualize and analyze their test results. Instead of simply delivering a list of test results by candidate, the Results Analysis tool uses interactive charts to present an aggregate view of a company's candidate pool, while allowing customers to drill down to the candidate level.
When you enter the Results Analysis section, you are asked to choose which data you wish to analyze (for example, you might want to look at all Customer Service Representative candidates that you've tested in the past year). Once you select a data set, you'll see an interactive page that presents data by test, with a separate tab for each test. For example, in the tab for the Criteria Basic Skills Test (CBST), a pie chart shows the percentage of candidates that scored in the Excellent, Good, Fair, or Low range. Clicking any of these pie slices will highlight the relevant candidates in the Candidate Scores bar chart. Additionally, there is a section that shows the average and median scores of a candidate pool, along with a suggested cutoff score. You can compare those statistics to your past candidates, overall national norms, or position-specific norms by simply selecting the appropriate category from a drop-down list. Clicking any of these scores brings up another pie chart showing the percentage of your candidate pool that scored at that level or above.

In the tab for the Customer Service Aptitude Profile (CSAP), clicking the bars in the Customer Service Characteristics bar chart will show you the percentage of your candidate pool that scored Low, Average, or High for that trait.

The big payoff comes in the Candidate Selection tab. After analyzing your results in each of the test-specific tabs, you can use sliders in the Candidate Selection section to set minimum scores, and immediately see the percentage of matching candidates in a pie chart. The list of matching candidates will also automatically update, and show only the candidates that qualify based on the scores you set. With the click of a button, you can then export the qualified candidates list as an Excel spreadsheet.

We believe that the Results Analysis section is a big step forward as HireSelect continues to evolve. It will make it much easier for organizations to analyze aggregate testing data from their applicant pool. If you’re currently a HireSelect subscriber, there's no additional charge or anything you need to do to activate this feature; it's already available to you as the second tab in the Results section. However, if you have any suggestions or comments, please let us know in the Comments section below or by giving us a call. If you're not a subscriber, but would like to check out this new feature, along with what we believe to be the most user-friendly employee testing solution available today, feel free to begin a free trial of HireSelect.
Posted by Eric Loken on Wed, Jun 11, 2008 @ 07:13 PM
The US Open golf tournament is often called the “ultimate
test” in championship golf and its goal is to crown the US champion. That got us thinking about viewing golf
tournaments as a selection process. In a
typical tournament, players play 4 rounds of golf. (There’s usually a cut after the second day
where the field is reduced by half. This
complicates the analysis below somewhat, but not enough to change the main point.) When all is said and done by Sunday afternoon,
millions of dollars of prize money is distributed purely according to rank
order from best to worst score.
Suppose this were a process to select the top employees from
an applicant pool. Suppose the
candidates were assessed four times on essentially the same exercise (the same golf
course, in this case). If the
assessment tool is doing its job well, you would expect the rank ordering of
the applicant pool to remain relatively stable from assessment to
assessment. Psychometricians have a name
for this stability of measures: the degree to which an assessment produces
consistent results over repeated administrations is referred to as its test-retest
reliability, and they quantify it with a correlation coefficient. The correlation coefficient is a number
between -1 and 1, where 0 means that there is no linear relationship whatsoever
between the test and the retest, and 1 means that the retest scores were
perfectly predictable given the original test scores. (A correlation of
negative 1 means that there is perfect inverse prediction from test to retest –
i.e. the worst became the best and the best became the worst).
Obviously a higher number means a more stable and reliable
measure. How high is good enough for a
test to be considered reliable? As a reference, if a random sample of high
school students took the SAT twice, you’d expect the two sets of test scores to
correlate somewhere close to 0.9. Decent
assessment measures should have at least a correlation of 0.8 for normal
purposes. If the retest reliability is
low, then one strategy is to average over many repetitions. If a certain test correlates .5 on repetition,
then averaging over 5 repeated measurements would give you an acceptable
overall reliability. At lower reliabilities than that you start to hear
analogies of the scores from test to test resembling random shots at a
dartboard.
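The arithmetic behind "averaging over 5 repeated measurements" can be checked with the Spearman-Brown formula, which is the standard psychometric tool for this kind of calculation (the post does not name it, so treat its use here as our assumption). The sketch below is a minimal illustration; the function names are made up for this example.

```python
def spearman_brown(r_single: float, k: float) -> float:
    """Predicted reliability of the average of k parallel measurements,
    given the reliability r_single of one measurement."""
    return k * r_single / (1 + (k - 1) * r_single)


def repetitions_needed(r_single: float, r_target: float) -> float:
    """Number of parallel measurements needed to reach r_target
    (the Spearman-Brown formula solved for k)."""
    return r_target * (1 - r_single) / (r_single * (1 - r_target))


# A test that correlates 0.5 on repetition, averaged over 5 administrations.
five_rep_reliability = spearman_brown(0.5, 5)
print(round(five_rep_reliability, 3))  # 0.833

# Repetitions needed to reach the 0.8 benchmark from a starting point of 0.5.
print(round(repetitions_needed(0.5, 0.8), 2))  # 4.0
```

Under these assumptions, five repetitions of a 0.5-reliability test land just above the 0.8 benchmark, which matches the claim in the text.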
With that set-up, here are the correlations for the four
rounds played at the Memorial Tournament two weeks ago. Basically these correlations all hover
around 0. There is no evidence here
that the rank ordering of participants from round to round has any appreciable
level of stability. And yet $6 million
of prize money was doled out on the basis of this selection process. What gives?
|        | Round1 | Round2 | Round3 | Round4 |
|--------|--------|--------|--------|--------|
| Round1 | 1.00   | 0.10   | 0.22   | 0.17   |
| Round2 | 0.10   | 1.00   | 0.05   | 0.06   |
| Round3 | 0.22   | 0.05   | 1.00   | 0.02   |
| Round4 | 0.17   | 0.06   | 0.02   | 1.00   |
Well, ordinarily your first reaction here would be that the
assessment is poor. When reapplied on a
second (and third and fourth) occasion, the information you get is almost
totally different, an indication that the assessment has weak internal
consistency. That’s usually the sign of
a deeply flawed measure, and the question arises, “How can it be valid when it
is so unreliable?” But hold on a
second, there can be no more valid measure of golf ability than golf
score. Some people might worry that the
quality of intelligence is nothing more than what intelligence tests measure,
and they are rightly concerned that if this were true it would have
implications for how we talk about (and use) employment aptitude tests. But there
should be unanimous agreement that a person’s typical score on a round of golf
is exactly what is meant when describing a person’s golf ability.
So with such immediate links (no pun intended) between
construct and measure, it can’t be argued that a professional golf tournament
is not a valid assessment of golf ability.
So why is it so unreliable? Why would it fail by the standards of a
psychometric assessment?
The answer lies in the concept of range restriction. The
most prestigious tournaments in golf attract the best players in the
world. If you could assign to every
player a “true score” – a hypothetical number that might be the average of an
infinite number of rounds of golf played on that particular golf course – then
the variability from player to player in true scores is much smaller than the
variability in scores seen from round to round.
According to a psychometrician, the results of the Memorial Tournament
were dominated by little more than random noise – the random bounces, the random
gusts of wind, etc.
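A small simulation makes the range-restriction argument concrete. It is only a sketch under stated assumptions: each observed round is modeled as a player's true score plus independent random noise, and every number (the mean of 72, the standard deviations, the field size) is invented for illustration and not taken from any tournament data. Under this model the expected round-to-round correlation is the true-score variance divided by the total variance.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_round_correlation(true_sd, noise_sd, n_players=5000, seed=0):
    """Correlate two simulated rounds: observed score = true score + noise."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(72, true_sd) for _ in range(n_players)]
    round1 = [t + rng.gauss(0, noise_sd) for t in true_scores]
    round2 = [t + rng.gauss(0, noise_sd) for t in true_scores]
    return pearson(round1, round2)


# Hypothetical broad field: players differ a lot in true ability.
# Expected correlation = 36 / (36 + 9) = 0.8.
wide_field = simulate_round_correlation(true_sd=6.0, noise_sd=3.0)

# Hypothetical elite field: same round-to-round noise, tiny spread in ability.
# Expected correlation = 1 / (1 + 9) = 0.1.
elite_field = simulate_round_correlation(true_sd=1.0, noise_sd=3.0)

print(f"wide field:  r = {wide_field:.2f}")
print(f"elite field: r = {elite_field:.2f}")
```

The measuring instrument (the noise level) is identical in both runs; only the spread of true ability in the field changes, and that alone is enough to move the reliability from high to near zero.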
Psychometricians would not support a selection process to choose applicants that operates like the rounds of a professional golf tournament. The chances are quite small that the final selection of top applicants would match up with the hypothetical rankings that you would see if God could come down and tell you all the "true scores." The lesson for businesses is that the properties of an assessment are not fixed, but depend on the population in which it is applied. If the applicant pool is highly uniform on the construct of interest, then even an excellent measure will be unreliable. For example, the SAT is a less reliable instrument when used to rank order the freshman class at Harvard than when it is used to rank order a more nationally representative sample. (But don't think it would look anywhere near as bad as the golf data above!)
A company or school can adjust its selection process based on what it knows about both its tests and its applicants. Golf tournaments, however, operate strictly by the scores. The sports lesson is that poor reliability is why Vegas finds it difficult to handicap golf for betting purposes. It's really more like Keno when compared with other sports betting. It should also be clear why it is an absolutely staggering accomplishment that Tiger Woods has won about 30% of the tournaments that he has entered. A psychometrician would have predicted that this was almost impossible, and would instead hypothesize that Tiger isn't quite like the rest of the world's elite golfers. A golf fan, of course, would have told you that years ago.
Posted by Josh Millet on Fri, Jun 06, 2008 @ 03:18 PM
The May-June edition of the APA's journal American Psychologist contains an important new study on the effectiveness and validity of employment testing. The study examines the predictive validity of testing in both educational and employment settings. There's a good summary of the study's findings on my favorite statistics blog. Essentially, the study shows that employment aptitude tests are a generally valid way of predicting a wide variety of aspects of job performance. It also contains encouraging conclusions about the fairness of aptitude tests.
While most of the conclusions of the study will not be surprising to people familiar with the field of employment testing, my sense is that this is an important study because of the amount of evidence it considers. The study is a meta-analysis (a review of many different studies) and may be the most comprehensive attempt to examine the effectiveness and validity of employment testing since Hunter and Schmidt's seminal study in the late 1990s.
If you want to cough up $12 to read the whole study you can do so here. I'll probably get into the details of the study in subsequent posts.
Posted by Josh Millet on Wed, Jun 04, 2008 @ 02:02 PM
To finish off our discussion about personality tests, I wanted to discuss ways in which test developers are moving beyond the Big Five. The Big Five is sometimes too broad to predict work behaviors for specific jobs, where more fine-grained personality measures may be useful. For example, it has been shown that certain jobs such as sales positions are best performed by people with a set of personality characteristics that correspond to the work activities involved in sales jobs. Sales jobs often require cold-calling, initiating social interactions, prospecting, and building relationships. It won't be surprising to most people that assertiveness, extraversion, competitiveness, and self-confidence could help an individual perform well in such roles. For work in the field of customer service, on the other hand, qualities such as patience, cooperativeness, and personal diplomacy would be most important given the job activities of most customer service positions.
Because there is growing evidence of the predictive validity of personality measures for jobs such as sales and customer service, many test publishers have developed employment personality tests focused on these areas. For example, Criteria has a sales aptitude test and a customer service test that measure 18 different personality traits that predict performance in these jobs. These tests can have far greater utility than a Big Five-based test for a given position, because they provide much more targeted and fine-grained information based on the specific requirements of a given job. Because they have been customized to specific positions, the score reports for such tests are also typically easier to interpret than are general Big Five inventories. As personality research continues to advance, expect to see targeted, job-specific personality tests for a much wider range of positions in the years ahead.
Posted by Josh Millet on Tue, May 27, 2008 @ 12:53 PM
Following up on the discussion I started last time about the Big Five personality traits, I want to provide a little more context on the Big Five and how they relate to the field of personality testing as a whole. The Big Five are personality dimensions that describe the ways in which an individual reacts to other people and to the world around them. For example, the Extraversion/Introversion dimension describes the extent to which an individual is more or less outgoing, gregarious and in need of social stimulation. If a personality test determines that an individual is in the 65th percentile for "Extraversion," this means that the individual is more extraverted than 65 percent of the individuals in the norm group.
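The percentile statement above can be written as a one-line calculation. This is a minimal sketch that takes the definition literally (the share of the norm group scoring strictly below the individual); the function name and the toy norm group are invented for illustration, and real test publishers may handle ties or smoothing differently.

```python
def percentile_rank(score, norm_group):
    """Percentage of the norm group that scored strictly below `score`."""
    below = sum(1 for s in norm_group if s < score)
    return 100.0 * below / len(norm_group)


# Toy norm group of 100 people with raw Extraversion scores 1 through 100.
toy_norm_group = list(range(1, 101))

# A raw score of 66 beats the 65 people who scored 1 through 65.
print(percentile_rank(66, toy_norm_group))  # 65.0
```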
The notion of personality "traits" is now fairly widely accepted, and is superseding an older paradigm of personality "types" that originated with Carl Jung and relied on a view of personality that grouped people into one of two distinct types, such as introvert or extravert, thinker or feeler. The traits model is gaining credence in personality research because of growing evidence that suggests that a strict dichotomy between two distinct types does not sufficiently describe the nuances in the extent to which individuals tend to one side or the other.
The best-known example of a test based on the older model is the Myers-Briggs Type Indicator (MBTI). Since the MBTI is probably the most widely known and thoroughly studied personality test today, and since we get asked about it all the time, I thought I'd offer some thoughts on it. Or one thought, to be exact. Do not use the MBTI to make hiring decisions! I repeat, the MBTI should not be used for the purpose of employee selection...ever. I say this because the MBTI, which has a large and enthusiastic following, is often used in just this way, even though it shouldn't be. There are many reasons the MBTI should never be used to inform hiring decisions, many of which are described here. But the most important is simply that there's no convincing evidence to link MBTI results to job performance. In order to ward off the anticipated deluge of angry emails from MBTI-devotees, I would just say that if you don't believe me, take it from the MBTI's publisher. Even they do not suggest it should be used for employee selection...they provide a table that lists every conceivable use for a test, but note the complete lack of check marks in the "Selection" column.
There's plenty of evidence, on the other hand, to link the Big Five Traits to job performance for a variety of positions. Conscientiousness, which measures the extent to which an individual is reliable, organized, persistent, and responsible (those who score low in Conscientiousness may be more impulsive and at times unreliable) has been shown to be moderately predictive of success across many job types, but particularly for entry-level positions where characteristics like reliability and punctuality may be more valuable than creativity. Certain Big Five traits are useful for certain types of jobs; for example, extraverts perform better in sales than do introverts, and highly agreeable people are well-suited for customer service but might not make good judges or CEOs, because those jobs require objective decision-making that highly agreeable people may not be comfortable with. Other Big Five traits are much less relevant to employee selection: for example, there isn't much evidence that Openness (the extent to which an individual is imaginative and creative, rather than down to earth and conventional) is predictive of work success, even though it seems logical that people with high Openness scores would be better suited for jobs that require imagination, creativity or abstract thinking.
Alright, that's enough for now. Next time I'll finish up with this thread by discussing ways in which some employment personality tests move beyond the Big Five by measuring more fine-grained traits that have been shown to predict success for specific jobs.
Posted by Josh Millet on Wed, May 14, 2008 @ 04:34 PM
A recent report summarized here suggests that personality testing is the fastest growing segment of the pre-employment testing market. The survey of HR professionals revealed that the percentage of respondents whose firms used personality tests has grown from 21% to 59% in the last five years alone. Unfortunately, there are still a lot of misconceptions about what personality tests are, and how they should be used. Since we get so many questions about how personality tests work, from both HR professionals and job candidates, I thought I'd try to explain some of the basics.
Employment personality tests are designed to measure personality traits that may be related to job performance. Most personality tests consist of a series of self-evaluative "prompts" and ask a test-taker to indicate the extent to which they agree or disagree with each statement. An example of a prompt might be "Meeting new people is enjoyable to me." There are no right or wrong answers to these questions; instead, the responses are used to indicate behavioral tendencies that may or may not fit a particular job. For example, the prompt above might be one of many used to calculate an individual's "Extraversion" rating. A high score in Extraversion is not necessarily better than a low one, but an extraverted individual might be better suited for some jobs and less well-suited for others. A classic example is that highly extraverted individuals tend to perform better in sales jobs because not only are they more comfortable dealing with people, they are also better at initiating interactions. On the other hand, highly extraverted people might be less suited for jobs that require very little or even no social contact. They could potentially become bored or frustrated without social stimulation.
Professionally developed and validated employment personality tests attempt to measure constant, fixed "traits" thought to influence an individual's behavioral tendencies across a variety of contexts. The use of personality tests by companies during the hiring process has grown rapidly in recent years, with the dominant framework being the "Big Five." The Big Five are five dimensions of personality that seem to emerge consistently in empirical research: Agreeableness, Conscientiousness, Extraversion, Openness (to Experience) and Stability.
How are the Big Five used to help companies with their hiring process? I'll get into more details on the applied use of personality tests in my next post.
Posted by Eric Loken on Mon, May 05, 2008 @ 03:01 PM
Today's blog post is by Eric Loken, Criteria's Chief Research Scientist and a member of Criteria's Scientific Advisory Board. Eric plays a leading role in the development of Criteria's employment tests.
Last week there was an article in the New York Times that described a study finding that intelligence might not be the constant, innate quality that it is usually assumed to be. Researchers at Michigan showed that when a group of participants practiced a challenging cognitive task for two to three weeks, they scored better on a standardized measure of intelligence.
At first this sounds like the kind of obvious effect that commercial test preparation companies pass off as a marketable service. It's well known that if you take a group of students and give them practice SATs over and over again, their scores will go up slightly, even if they haven't paid $1,000 for the privilege of practicing.
But the Michigan study is different because they showed something called transfer. The participants in the study started by taking a matrices pattern test, supposed to be a culture-free intelligence test where success doesn't depend on the kind of skills and knowledge developed in school. Then they trained on a difficult attention and working memory task called the n-back test (Criteria's MRAB aptitude test contains a very similar task). The participants in the training group were pushed 20 minutes a day for up to 19 days to get better on this task, and they did. (Now the control group during this time was basically doing nothing which is a bit of a flaw in the experiment, but we'll let that go for now.)
The point of the study is that the matrices intelligence test is a different task from the one the group was training on, and yet the training transferred over to yield improved performance. This study caught our attention for a few reasons. First, the control group showed improvement in their matrices test scores (despite just sitting around). In general, people don't perform at their best the first time they take a test, and they will improve the second time around just because of practice or familiarity. This is something to keep in mind with employee testing: if for whatever reason you have to give a candidate a test for a second time, even if you use a different form of the test, you shouldn't be surprised to see a mild improvement over the first score (this is sometimes called the "practice effect").
But the most important finding of the study is that the group who practiced the memory task improved their scores by a wider margin. It's interesting to think about what the study says about the effects of the workplace on intelligence. Employers are obviously looking for intelligent employees who will have a positive impact on their organization. Employers should also keep in mind that the workplace environment will impact the intelligence of the employees. We're not sure it would serve the interests of productivity to set aside 20 minutes a day for "cognitive training" (although similar proposals exist in the interests of maintaining employee health and thus reducing healthcare costs). But it is worth remembering that a challenging work environment will likely keep skills and minds sharp.
This study is the latest in the age-old debate over "brain plasticity" and the extent to which our mental ability is fixed. We'll probably have more discussion on this topic as we keep track of which way the pendulum is swinging.
Posted by Josh Millet on Thu, Apr 24, 2008 @ 11:49 AM
This Saturday is the NFL draft, which means that NFL scouts have spent the past months going over 40-yard dash times and college game tapes, and fans have debated which prospect would be the best fit for their team. It also means it's time for media and fans to recycle the usual punchlines about the folly of using an aptitude test like the Wonderlic on NFL prospects. Football, more than any other American team sport, is about physicality, and the idea that performance on an aptitude test could have much to do with success on the football field seems absurd. Skeptics point out that a low Wonderlic score didn't prevent Dan Marino from becoming one of the most prolific passers in history, or Vince Young from making the Pro Bowl in his rookie year. When Criteria works with customers to gather evidence for the validity of our employment tests at their organization, we sometimes hear similar anecdotes. I've often heard HR managers express concern that "one of our best performers did poorly on the test." (Criteria has an aptitude test, the CCAT, that is similar to the Wonderlic.) Such reactions are understandable, but a test's predictive validity can't be judged from one test score--the only meaningful way to measure a test's ability to predict productivity is to study the correlations between test scores and job performance across a broad sample of people. By this standard, the Wonderlic may be a better predictor of performance in the NFL than you might think.
Two business professors from the University of Louisville recently did such a study with NFL data. They correlated test scores with performance measures and concluded that there was no association between test scores and performance in the NFL. If there is no association between the two, why is the Wonderlic used on NFL prospects? The study was critical of the selection measures used by the NFL.
This is the kind of study we often conduct for our clients, but we also point out that you have to be careful when evaluating how well a selection measure predicts performance. Success criteria must be chosen appropriately, and the sample has to be appropriate. I have concerns with exactly these issues in the Louisville study.
As a performance measure, the authors use average salary in a player's first three years as one of the "success metrics," but any football fan knows that a player's salary in his first years in the league is a function of draft order, not performance in the league, since he hasn't played any games when he signs a contract. The authors also use draft order as a "success measure." Both draft order and first-year salary are meaningful measures of a player's success only from the point of view of the player--they reflect the collective wisdom about a player's future prospects. To owners and fans, on-field performance after entering the NFL is a much more meaningful measure of productivity.
The second problem with the study is that the authors include everyone in the performance evaluation, even if they never had a chance to perform. They found data on 68 quarterbacks drafted between 1999 and 2004, and included them all in the analysis comparing test scores to "success." The problem is that many of these QBs saw no or limited action in the NFL. So what does it mean to assess their performance when they didn't get to perform?
We tried a similar study, using NFL.com and other websites to find data on QBs drafted between 2000 and 2004. (We didn't use data from before 2000 because the data on players' scores is unreliable and incomplete.) The simplest way to measure the predictive validity of an employment test is to compare test scores to one or more metrics used to measure productivity in a given job. We chose QBs because that position requires the decision-making and problem-solving skills that aptitude tests are supposed to measure, and as productivity metrics we chose yards passing and number of TDs thrown in the first four years (four years is the average length of an NFL player's career). Passing yards and TDs thrown aren't perfect metrics (did you know Joey Harrington threw for more yards in his first four years than did Tom Brady, who didn't start until his second year?). You can check out the data we used here: there were 68 QBs drafted from 2000 to 2004, but we eliminated the 5 QBs for whom we couldn't find Wonderlic scores, as well as two others who ended up playing other positions (Ronald Curry) or other sports (Drew Henson), leaving 61.
The data is very interesting. If you look at the data for all 61 QBs, there is only a fairly weak correlation between aptitude and passing yards (r=.19) and TDs thrown (r=.20). But we made a plot of the test scores (Figure 1, x-axis) and the passing yards (y-axis) and saw that the story was much more complicated than that. As it turns out, there does appear to be a strong association between test score and performance (yds thrown)--you just don't see it until you look at QBs who threw for 1000 or more yards (which is where we put the horizontal line).
A performance measure can have multiple meanings. Some QBs don't throw for many yards because they barely get on the field, and this can happen for many reasons; they might not be good enough, but they also could be drafted to a team with a good starter in place, or get injured, etc. Below the 1000 yards passing mark, the data are all spread out across the score spectrum--there is no correlation there. Above the line, however, the correlation is a whopping r=.51 (r=.49 for TDs thrown), right up there with some of the strongest coefficients reported anywhere in organizational psychology.
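The comparison described here (correlation over the whole sample versus correlation among only those QBs at or above a yardage threshold) can be sketched in a few lines. This is an illustration of the method only: the six score/yardage pairs below are made-up toy values, not the QB data discussed in this post, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_above_threshold(scores, yards, min_yards):
    """Correlate test score with yards using only players at or above
    min_yards. Returns (r, number of players kept)."""
    kept = [(s, y) for s, y in zip(scores, yards) if y >= min_yards]
    if len(kept) < 3:
        raise ValueError("too few players above the threshold")
    kept_scores, kept_yards = zip(*kept)
    return pearson(kept_scores, kept_yards), len(kept)


# Made-up toy values for illustration only (NOT real Wonderlic or NFL data).
toy_scores = [10, 20, 30, 40, 15, 35]
toy_yards = [2000, 4000, 6000, 8000, 500, 100]

r_all = pearson(toy_scores, toy_yards)
r_above, n_above = correlation_above_threshold(toy_scores, toy_yards, 1000)
print(f"all players: r = {r_all:.2f}")
print(f"1000+ yards: r = {r_above:.2f} (n = {n_above})")
```

In the toy data, the two players who barely played add scatter that weakens the overall correlation, while the players above the threshold line up cleanly. As the discussion of the cut-off later in this post shows, where the threshold is set can change the result when the sample is small, so it is worth trying more than one value.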
Another way to look at the strength of these correlations is that for this sample, the QBs who scored below the median Wonderlic score (for QBs) of 27 averaged 5,202 passing yards and 31.2 TDs over their first four years, whereas those scoring above the median averaged 6,570 yards and 40.8 TDs over the same period. Seems like the cognitive measure might be worth something after all!

Of course, you might ask where the cut-off should be. How did we pick 1,000 yards? We tried it again with QBs who had started more than 5 games, and the same pattern replicates, but there is a bit of a caveat. Craig Krenzel, who studied molecular genetics at Ohio State and had a less-than-stellar stint with the Bears, scored very high on his aptitude test. He threw for about 800 yards in the pros, and started 5 games. If you change the thresholds and include him, then the overall predictive validity of the aptitude test goes down a little--that's the problem with small sample sizes--the anecdotes can actually affect the statistics in a non-trivial way.
All in all we think the data linking aptitude test scores with NFL performance is much more interesting than is currently recognized. In fact, for QBs drafted between 2000 and 2004 the data suggest there is a definite link between aptitude test scores and on-field performance. And we went through this exercise because it illustrates a lot of lessons we try to share with our customers. Think carefully about the measure of performance; make sure you recognize that there can be many reasons for good or bad performance that are unrelated to the test; plot your data so that you can visualize what's going on; and beware of making inferences in small samples.
Posted by Josh Millet on Tue, Apr 22, 2008 @ 01:07 PM
Hello all. Well, we're taking the plunge into the blogosphere. We're launching our blog to discuss trends in employee testing and Human Resources in general, to provide updates about new features in HireSelect, Criteria's web-based pre-employment testing system, and to share news about our company that may be of interest to our customers.
First, if you're new to the subject of pre-employment testing you may want to check out our white paper. There are also a couple of blogs that I read regularly that I would recommend to HR people who are interested in following trends in HR testing. The one I want to mention today is this HR Testing blog that may be of interest to those of you who are already familiar with the basic issues surrounding employee testing. One of the recent posts gives a good summary of a recent survey of trends in employment testing.
Well, that's all for now, I'll keep the first one short and sweet. But check back later this week when I'll post a summary of a recent study we did examining the use of the Wonderlic Personnel Test by teams in the National Football League. We came up with some findings that I think may surprise you.