Criteria's Employee Testing Blog

Help! One of my top performers bombed your test!

The most effective method we have for selling our pre-employment testing software, HireSelect, is our 30-day free trial. It allows prospective customers to try the tests, preview the software, and ask our sales team questions about how best to use HireSelect. We also encourage people to evaluate HireSelect by administering the tests to a group of their existing employees. Since companies have a good idea of how their existing employees are performing, testing incumbents can be an effective way to analyze the accuracy and predictive validity of our tests. Most testing companies won’t let potential customers preview their tests in such a comprehensive way, but for us it’s a great sales tool: we have plenty of evidence about the predictive accuracy of our tests, and we want to make sure people see the value in our assessments before they invest in using our service.

One scenario we face regularly is a comment like, “I don’t know… one of my top performers failed your tests.” Our sales staff hear this from people who doubt the effectiveness of the tests because of a notable case where the results don’t correspond with what they know about a particular employee. When this happens, we ask if they’d be willing to share performance data for the employees they tested. We often get back something like this (the data below is not real, but is pretty typical of the data sets we frequently review):

Employee #   CCAT Percentile   SalesAP Score        Monthly Sales
1            71                Highly Recommended   $69,243
2            34                Not Recommended      $67,445
3            84                Recommended          $55,767
4            71                Highly Recommended   $50,240
5            61                Recommended          $46,772
6            58                Not Recommended      $41,389
7            92                Recommended          $40,102
8            65                Recommended          $37,655
9            45                Highly Recommended   $34,241
10           74                Recommended          $31,498
11           53                Recommended          $31,400
12           65                Recommended          $30,084
13           45                Recommended          $29,751
14           50                Not Recommended      $27,782
15           41                Recommended          $26,997
16           45                Not Recommended      $24,408
17           29                Highly Recommended   $21,126
18           38                Not Recommended      $18,665
19           78                Recommended          $12,505
20           34                Not Recommended      $9,449

Correlation with monthly sales: 0.34 (CCAT Percentile), 0.25 (SalesAP Score)

In this case it seems clear that employee #2, who is one of the company’s top-performing salespeople, didn’t do very well on either the Criteria Cognitive Aptitude Test (CCAT) or the SalesAP, our sales personality test. In employee #2’s case, the test “didn’t work” in the sense that it dramatically under-predicted her potential. In a sample of any size, though, there can always be cases where the test results “didn’t work”; no test is a crystal ball. The way to evaluate the predictive accuracy of selection tools is to look at the whole data set and see how well the tests predicted performance across the sample population. With this in mind, take another look at the table above.

If you are looking for instances where the test “didn’t work,” you might also notice that employee #19 got good scores on both tests, but evidently can’t sell a lick. But other than these two outliers, the correlation between test results and job performance (as measured in this case by monthly sales) is pretty strong. How can we be sure of this? (Besides noticing that the scores at the top of the chart, which is sorted by monthly sales, tend to be higher than those at the bottom.) Organizational psychologists measure the predictive validity of a test by calculating a correlation coefficient, a measure statisticians use to represent the strength of a relationship between two things: in this case, test scores and job performance. The correlations for the two tests in this case are .34 and .25, respectively. A correlation coefficient can range from -1 (a perfect negative relationship) through 0 (no relationship) to 1 (a perfect positive relationship). For a pre-employment test, a correlation of .21 to .35 is likely to be useful, and anything higher than .35 is very beneficial as a predictor. Correlation coefficients of .34 and .25 are respectable: although this particular sample is small, a 20-person sample is much more representative than a one-person sample. Calculating the correlation coefficient is a great way to combat “the curse of the anecdote”: letting one prominent data point obscure the trend that is the real story of this data set. The scatter plot below provides another way to visualize this data. It shows that as CCAT scores increase, so does performance, with the two notable outliers as exceptions to the rule. Remember, don’t look at anecdotal evidence if you have a whole data set to examine.

CCAT Scores and Performance
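For readers who want to check the CCAT figure themselves, here is a minimal sketch in Python that computes a Pearson correlation from the CCAT percentiles and monthly sales in the table above. The function and variable names are ours, chosen for illustration. The SalesAP correlation is not reproduced because the post does not say how the three recommendation levels were coded as numbers.

```python
from math import sqrt

# (CCAT percentile, monthly sales in dollars) for employees 1-20,
# copied from the illustrative table in this post.
ccat_and_sales = [
    (71, 69243), (34, 67445), (84, 55767), (71, 50240), (61, 46772),
    (58, 41389), (92, 40102), (65, 37655), (45, 34241), (74, 31498),
    (53, 31400), (65, 30084), (45, 29751), (50, 27782), (41, 26997),
    (45, 24408), (29, 21126), (38, 18665), (78, 12505), (34, 9449),
]

def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)

# Correlation across the whole sample, anecdotes included.
r_all = pearson_r(ccat_and_sales)

# Same calculation with employees #2 and #19 (the two outliers) left out,
# to see how much those two cases pull the coefficient down.
r_without_outliers = pearson_r(
    [pair for number, pair in enumerate(ccat_and_sales, start=1)
     if number not in (2, 19)]
)

print(round(r_all, 2), round(r_without_outliers, 2))
```

The first number should land close to the .34 reported above. The second is there only to show how much two memorable cases can move the coefficient in a sample this small; dropping inconvenient data points is not something we recommend in a real analysis.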


America’s Computer Literacy Problem

As we announced in this blog post earlier this year, our newest test is called the Computer Literacy and Internet Knowledge test (CLIK). We developed the CLIK because many of our customers requested a test of general computer literacy. The CLIK consists of two short simulations in which the test-taker is asked to perform basic tasks (opening a document, copying and pasting, sending an email, doing a Google search, etc.) on a simulated desktop, followed by some multiple choice questions. The CLIK has quickly become one of our most popular tests, which to me is a sign that employers are definitely seeing the need for a test that measures basic computer skills, rather than specific knowledge of a particular application, like Microsoft Excel or Word.

As with all of our tests, we have monitored the data collected from the CLIK, and we recently did a thorough analysis of item-by-item responses for 20,000 CLIK administrations. The findings were pretty surprising. First of all, 24% of all test-takers received an overall score of “Not Proficient.”  But the more alarming data came from the item-by-item analysis, which showed that some very basic elements of computer literacy were not performed correctly by large numbers of test-takers.  Specifically, 37% of people were unable to retrieve basic information through a Google search, 32% were unable to correctly format and send an email, and 21% were unable to copy and paste a text passage.

Now, we should caution that the sample of people who took the CLIK may not be representative of the general population. Our customers tend to administer the CLIK for entry-level positions for which basic computer proficiency is required but perhaps cannot be assumed; it would be uncommon, for example, to administer the CLIK when screening for a professional position. The CLIK tends to be used for positions like customer service reps, medical billers, clerical workers, etc. Conversely, however, it’s also true that there are many positions for which computer literacy may not be necessary, and one would assume that the applicant pools for these positions might be made up of people with even lower rates of computer literacy. So although it’s difficult to draw any definitive conclusions on the basis of 20,000 test results, it certainly looks like America has a computer literacy problem. The data we examined confirms what we were hearing from our customers who asked for this kind of test: too many job applicants lack basic computer proficiency.



Moneyball and Pre-employment Testing

I finally got around to seeing “Moneyball” this weekend, the movie adaptation of the Michael Lewis book of the same name. The movie documents the role played by Billy Beane, General Manager of the Oakland A’s, in transforming the way baseball teams drafted and evaluated players a decade ago. Beane and his staff pioneered the application of sophisticated statistical analysis to the process of player selection. In so doing, Beane was able to help his chronically underfunded Oakland A’s compete with big-budget teams like the Yankees, whose payroll was four times that of the A’s. His methods have since been imitated by many other teams, including the Boston Red Sox, who used them to win two World Series championships.

The lessons of Moneyball have obvious implications that reach beyond baseball, and they have garnered some lively discussion in HR circles. Beane’s breakthrough was that he found objective, quantifiable ways to measure player potential that turned out to be more accurate predictors of on-field success than the collective wisdom of baseball scouts and insiders. This is exactly the promise of pre-employment testing. Well-designed tests provide employers a way to gather objective, reliable data that predicts performance more accurately than traditional, more subjective methods of employee selection such as interviews. Most of our clients are small and medium-sized businesses, for whom hiring smarter is one of their best chances to compete with the bigger “Yankees” of their respective fields.


The HAI and Tomorrow’s Jobs Report

This holiday-shortened week has brought some weak economic news to the fore, and yesterday the stock market took a steep loss. Some analysts are pointing to tomorrow’s monthly payrolls numbers as an important event that could significantly impact the markets. As avid readers of this blog will remember, our Hiring Activity Index (HAI) is a metric based on the proportion of our customers who are actively conducting pre-employment testing in a given month. The HAI touched an all-time high for the month of May. It was also stable and high for March and April.

We’re not sure whether this strength reflects the hiring environment, or a new maturity in our business model as we build a loyal customer base.  Either way, we’re pleased with the indicators and expect that the jobs number tomorrow will likely be decent. The consensus estimate seems to be that we’ll have added between 90,000 and 200,000 non-farm payroll jobs: if the HAI is any guide, as it has been in the past, we’d be surprised if the number isn’t on the high end of that range.


More NFL Draft Selection Geekiness

Today is that time of year again, the NFL draft. Not quite the same with the labor situation overhang, but that doesn’t seem to have slowed the perennial debates about draft order and the professional prospects of various members of this year’s draft class.  Our blog posts on the NFL draft are always among our more widely read posts, and we’re very interested in the draft because it is such an iconic example of NFL teams devising methods to tackle the challenge we think about every day: devising employee selection systems that help organizations hire better and derive long-term competitive advantages.  I don’t have much new to add on the draft this year, but did want to highlight a really interesting article on a new entrant to the field.

http://www.slate.com/id/2292312

If you have thoughts on this approach let us know in the comments section.


New Computer Skills Test (CLIK) Added to HireSelect

Just a quick heads up that we introduced a new computer skills test today. If you want to read the press release it is here. A lot of customers have been asking about a quick assessment of basic computer literacy, so we created one. It’s designed to determine whether a candidate has fluency with basic computer skills that many employers take for granted: using the internet, email, etc. Very often employers find out too late that they should NOT be taking these skills for granted, because even in the US, they are far from universal. The name of the new test is the Computer Literacy and Internet Knowledge Test (CLIK).


Last Place is Second Best? Ridiculous Advice From a Dating Website

Okay, first an upfront explanation of why we are even blogging about this. We are a web-based business that generates vast amounts of data. We continuously monitor and analyze our data, and even sometimes blog about what we find. So when we saw that a blog from OKCupid was the source of headlines such as, “The Curse of Being Cute” we had to see what they had done with their data. They did this.

OKCupid is an internet dating site. They have millions of users, and their users have rated each other on attractiveness (on a 1 to 5 scale, low to high) and sent messages to each other. Of course getting a lot of 5s was most predictive of getting lots of messages. A second claim by the company, and the one getting most of the attention, is that being rated 1 was the next best thing to being rated 5 when it comes to getting more messages, and being rated 4 could even have a negative impact on the flow of messages. Hence the “curse of cuteness” and the discussion that being polarizing can be beneficial.

To translate it to an employment setting, let’s suppose that on a job search site, prospective employers can rate potential applicants on a 1 to 5 scale. And suppose that it’s also possible to track the number of interview offers made to job applicants. Then here is the graph offered up by the folks at OKCupid:

Messages Received Last Month vs. Attractiveness

Of course, as an employment blog we’ll read the x-axis as average employer rating and the y-axis as interview requests.

Notice the clear upward trend. Notice also that it is an accelerating upward trend and not just a straight line relationship. The OKCupid folks calculated the positive slope, but then noticed that there was also still a lot of scatter around the straight line they drew. They found that in addition to the average rating being predictive of getting messages, the variability of the ratings was also predictive. All else being equal, it was better to have variability in the profile of ratings. Imagine two people who both have an average rating of 4.0. If one had all ratings of 4.0 (i.e., no variability) and another had some variability, the second person would be expected to get more messages. (Hopefully you see that this is because that person must have more 5 ratings than the other person… but we’ll get to that.)

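Okay OKCupid, so far so good. But next comes the egregious error. They decided to construct a regression equation that used all the individual rating profiles to predict the number of messages received. As inputs to their regression they put the number of 1 ratings, the number of 2 ratings, the number of 4 ratings and the number of 5 ratings (they left 3 out because it caused some redundancy). Here’s the equation they came up with (written out here using the weights quoted later in this post, with k standing for the constant term):

expected messages = k + .4*(number of 1 ratings) - .5*(number of 2 ratings) - .1*(number of 4 ratings) + .9*(number of 5 ratings)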

Now it would have been helpful for them to have given the uncertainty in those regression weights, but let’s just see how they interpreted them. Correctly, they noticed that the largest positive weight was given to the 5 ratings. Indeed, there is almost a 1-1 payoff in that for every additional highest rating someone receives, they are expected to get .9 more messages. But then they went on to claim that the lowest category appears to have the next highest positive association. They actually conclude that the next best thing to getting a 5 rating is to get a 1 rating. By their logic, it would be better to be a person who has 10 of the worst rating (expected messages = k + .4*10 = k + 4) than to be a person who has 100 of the 4 rating (expected messages = k – .1*100 = k -10). Really??
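To make that arithmetic easy to replay, here is a small Python sketch that applies the quoted weights term by term, which is exactly the reading we are objecting to. The function name and the choice of k = 0 are our own assumptions; the post never gives the constant, and only the difference between the two profiles matters here.

```python
# Regression weights quoted in this post for the counts of 1, 2, 4 and 5
# ratings (3 ratings were left out of the regression).
WEIGHTS = {1: 0.4, 2: -0.5, 4: -0.1, 5: 0.9}

def predicted_messages(rating_counts, k=0.0):
    """Naive term-by-term prediction: k + sum(weight * count).

    `k` is the unspecified constant from the equation; it defaults to 0
    because it cancels out when two profiles are compared.
    """
    return k + sum(
        WEIGHTS.get(rating, 0.0) * count
        for rating, count in rating_counts.items()
    )

ten_ones = predicted_messages({1: 10})        # k + .4*10
hundred_fours = predicted_messages({4: 100})  # k - .1*100

print(ten_ones, hundred_fours, ten_ones - hundred_fours)
```

Read this way, the person with ten 1 ratings comes out about 14 messages ahead of the person with a hundred 4 ratings, which is the absurd result that should make anyone suspicious of interpreting the weights one at a time.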

Where did they go wrong? First they looked at the individual parts of the equation without realizing that it has to be looked at as a whole.  You can’t just imagine adding or subtracting specific ratings without thinking about the whole profile.  The number of 1, 2, 3, 4 & 5 ratings received are interrelated – the more high ones you have, the fewer low ones and vice versa.  It’s not easy to imagine dialing up and down the number of low ratings while holding everything else constant.  So because the rating counts are so interrelated, the regression weights are also interrelated, and they should not be interpreted on their own.  Otherwise you quickly come to silly examples such as it is better to receive all low ratings than to receive all 4 ratings.  That just can’t be true.  Second, they are ignoring that receiving a lot of 1 ratings has two meanings: it might mean you’re very unpopular, but it also might mean your profile has been rated a whole lot of times and when that happens ratings will accumulate in all the slots from 1 through 5.  When predictors in a regression have double meanings like this, you often see some paradoxical behavior.  In this case, 1s on their own are probably neutrally or negatively related to number of messages; but when the number of 4s and 5s are also included, things can change.  And finally, the association between messages and ratings, as seen in the first graph, is non-linear meaning that the pattern of regression weights will favor the highest category.  The weights were [.4, -.5, -.1, .9].  It’s not a coincidence that the graph has a non-linear feel, and that those weights trace out a curve.
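The “double meaning” point can be illustrated with a toy simulation in Python. To be clear, this is not OKCupid’s data or model: every number below (the appeal scale, the exposure distribution, the chance of a 1 rating, the message rate) is a made-up assumption, chosen only so that some profiles get rated far more often than others.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the illustration is repeatable

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

ones_count, ones_share, messages = [], [], []
for _ in range(2000):
    appeal = random.random()  # hypothetical "true" appeal, 0 to 1
    # Hypothetical exposure: a few profiles are rated far more often.
    n_ratings = 5 + int(random.lognormvariate(5, 0.8))
    # Less appealing profiles draw 1 ratings more often.
    p_one = 0.05 + 0.4 * (1 - appeal)
    n_ones = sum(random.random() < p_one for _ in range(n_ratings))
    ones_count.append(n_ones)
    ones_share.append(n_ones / n_ratings)
    # Messages grow with exposure and with appeal.
    messages.append(n_ratings * (0.05 + 0.1 * appeal))

r_count = pearson_r(ones_count, messages)  # raw number of 1 ratings
r_share = pearson_r(ones_share, messages)  # fraction of ratings that are 1s

print(round(r_count, 2), round(r_share, 2))
```

Under these assumptions the raw count of 1 ratings is positively related to messages, simply because heavily viewed profiles collect more of everything, while the share of 1 ratings is negatively related. A count-based regression mixes those two signals together, which is one way a low rating can end up with a positive weight.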

Students in an introductory statistics class would easily spot the errors in the OKCupid analysis and realize that the authors had jumped to an unsupported conclusion. But judging from the retweets and the Facebook shares, people are just running with the study’s conclusion and not bothering to evaluate the evidence. Let me be clear: if you are a job seeker, it would be a ludicrous strategy to say, “Gee, if I can’t get prospective employers to rate me as the best possible candidate, the next best thing I can do is get them to rate me as the worst possible candidate.” Such a strategy would be just flat out… well, I don’t want to be rude but I’m thinking of a word that rhymes with “cupid”.


Criteria Radio Interview

I was interviewed the other day by Ric Franzi, who does a twice weekly business show where he interviews CEOs of companies based here in SoCal. It’s actually fairly extensive–25 minutes long, as we get into how Criteria’s employee testing services help small companies address their hiring challenges. If anyone is interested here’s the interview.


Twice As Many Job Seekers For Every Job

A recent New York Times article highlighted the challenges faced by a small business owner in devoting sufficient time to hiring new staff. The challenges are particularly acute in the current economic climate, it was argued, because the number of applicants per open position has risen dramatically.  An unemployment rate that hovers stubbornly around 10% has certainly meant more competition for fewer jobs. For job searchers, this makes it even more important to stand out in a crowded field. For employers, this “buyer’s market” means that although there is plenty of talent available, it can be difficult and time consuming to find the right person for the job when there are so many applicants.

Of course, pre-employment testing is a very efficient way to filter through large applicant pools. Because companies use our HireSelect pre-employment testing software to help them do just this, we are able to gather data that offers real-time insights into hiring trends. For example, we can track, for every job opening on our site, the number of test takers who show up to take the tests.  This presumably works out to a reasonable measure of the number of applicants per position.  Since I haven’t seen much hard data out there about the extent to which applicant pools have grown, I thought it would be interesting to share our findings. Our data confirms the anecdotal reports of much bigger applicant pools. In fact, as the graph below shows, amongst small businesses the average number of applicants per job has just about doubled since 2008, when the recession began.

Just to be clear on what the data represent, we’ve actually tallied test batteries by company, and test takers per battery.  Often larger companies set up a test battery with the intent of multiple hires.  That was our rationale for restricting our sample to smaller companies (companies of fewer than 50 employees), as they are more likely to be making only one hire per position opening.  We should also note that our company is relatively young and growing rapidly, so the pool of companies and how they use our services has evolved since Q3 2008.  All that considered though, the basic point remains that by a constant metric on our site we have seen a large increase in the number of test takers showing up to take pre-employment tests on our site, even when considered on a per-opening basis.


The NFL Draft as a Predictor of Success (Round 3)

We made a few posts last year about the NFL and whether or not draft order is related to productivity. The core issue for us was a claim Malcolm Gladwell repeatedly asserted: that the draft order of NFL quarterbacks (QBs) is unrelated to performance. Well, the issue was raised again over the Labor Day weekend, and we were alerted to some more recent material we hadn’t seen because, to be honest, we thought we were done with the whole thing. We found this very sensible WSJ blog from last December, but then we also found this CNBC blog from May of this year. Darren Rovell, the CNBC blogger, reproduced the following table from economist Dave Berri. It purports to show that performance of lower drafted QBs is similar to that of the top drafted QBs. Now to be fair, the table was used to argue that the cost-benefit of the lower picks might exceed that of the higher picks, and that is entirely plausible. But Berri also uses a table like this to argue that draft order is not a good predictor of success.

Performance of Quarterbacks Selected from 1980-2009
WP100 = Wins Produced per 100 plays
Performance Adjusted for Average Observed in Each Season

PICKS     SEASONS OBSERVED   TOTAL PLAYS   RELATIVE WINS   RELATIVE WP100
1-10      281                104084        442.7           0.425
11-50     325                102009        456.6           0.448
51-90     259                42660         146.1           0.343
91-150    294                54800         207.3           0.378
151-250   334                58835         229.3           0.390

Punch-line: Quarterbacks selected between picks 11-50 outperform picks 1-10 (and cost less)
Source: Dave Berri/Stumbling On Wins

Because we blog about issues in employee selection and we want to be helpful to our clients and anyone else interested in the topic, we just have to say: please don’t analyze your data by summarizing it in a table like this one. The real unit of analysis is individual QBs and whether they are good selections. The table aggregates into “QB years,” so some QBs are contributing 5 to 10 years of data and some are contributing just one. And what happens to the QBs who have no data to contribute? They disappear. Compare the first row of the table (draft order 1-10) and the third row (draft order 51-90). Loosely speaking, you’d think that there would be about 4 times as much data from a range of 40 draft positions compared to a range of 10. But instead, there are fewer “QB years” and less than half as many plays in the larger pool compared with the elite pool. It’s because a huge proportion of the lower draft order QBs never made any contribution at all. The average performance documented in the fourth row of the table is the average performance of the lower drafted QBs who succeeded in the NFL.
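The “loosely speaking” comparison can be made concrete with a few lines of Python that divide each row of the table by the number of draft positions it covers. This is only a rough normalization and an assumption on our part: the table does not say how many QBs were actually drafted in each range, so draft slots stand in for the number of players.

```python
# (pick range, seasons observed, total plays), copied from the table above.
rows = [
    ((1, 10), 281, 104084),
    ((11, 50), 325, 102009),
    ((51, 90), 259, 42660),
    ((91, 150), 294, 54800),
    ((151, 250), 334, 58835),
]

per_slot = {}
for (first, last), seasons, plays in rows:
    slots = last - first + 1  # number of draft positions in this range
    per_slot[(first, last)] = (seasons / slots, plays / slots)
    print(
        f"picks {first}-{last}: "
        f"{seasons / slots:.1f} seasons and {plays / slots:,.0f} plays per draft slot"
    )
```

On a per-slot basis, picks 1-10 account for roughly ten times as many plays as picks 51-90, which is the missing data the table’s averages quietly leave out.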

Berri argues that aggregating this way is necessary to avoid a self-confirming “opportunity bias”. The legitimate point is that highly drafted QBs will get more playing time, and so total production might be a misleading metric of success. But the table goes to the other extreme and says that anyone who never played in the NFL was just denied the chance to show what they could do, and they should be omitted from the analysis.

If you are evaluating your own employee selection system, you have to think carefully about the correct way to organize and analyze your data. Any selection mechanism, including pre-employment testing, yields false positives (people who were significantly overrated by the selection system relative to their contribution) and false negatives (people who were underrated by the selection system relative to what they did contribute – or what they could have contributed had they been selected at all). Before you say, “Hmm, I think the weekly sales of my staff who scored low on the selection test is pretty similar to the weekly sales of those who scored high” you have to consider the full sample. What fraction of those hired with low scores didn’t pass training, or were fired, or quit? If you find that the rates of attrition and failure are similar for the low and high performing groups, then maybe you should indeed wonder if the selection tool matters all that much for predicting sales. But if you find that attrition in the lower group is double or triple that of the higher group, then you have to realize that you are only including the best of the low qualifiers in your comparison. Comparing the high and low selection groups on total sales would immediately reveal the hidden risk and cost associated with the lower qualified group. Only you will know, for the purposes of your specific business setting, whether any opportunity bias that might sneak in from considering total productivity is of more or less concern than the bias that comes from pretending the failed hires never happened.
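Here is a minimal sketch of that check in Python. The two lists of weekly sales are invented purely for illustration, and treating a failed hire as zero sales is a simplifying assumption; the point is only to show how the comparison changes once failed hires are counted.

```python
# Hypothetical weekly sales for ten hires in each score band. None marks a
# hire who washed out (failed training, was fired, or quit) and so has no
# sales figure to report.
low_scorers = [5200, 4800, None, None, 5100, None, None, 4900, None, None]
high_scorers = [5000, 5300, 4700, None, 5100, 4900, 5200, 4800, None, 5000]

def summarize(group):
    """Compare a survivors-only average with an average over every hire."""
    survivors = [sales for sales in group if sales is not None]
    return {
        "attrition_rate": 1 - len(survivors) / len(group),
        # What you see if failed hires silently drop out of the data.
        "mean_sales_survivors": sum(survivors) / len(survivors),
        # Counting each failed hire as zero sales gives sales per hire made.
        "mean_sales_per_hire": sum(survivors) / len(group),
    }

low = summarize(low_scorers)
high = summarize(high_scorers)

for label, stats in (("low scorers", low), ("high scorers", high)):
    print(label, stats)
```

In this made-up example the survivors in both groups sell the same amount on average, so the test looks useless if you only compare survivors. Counted per hire made, the high scorers produce twice as much, because attrition among the low scorers is three times higher.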
