Military Psychology. 2022 May 31;35(4):308–320. doi: 10.1080/08995605.2022.2050165

Identification and evaluation of criterion measurement methods

Matthew Allen a, Teresa Russell b, Laura Ford a, Thomas Carretta c, Angela Lee a, Cristina Kirkendall d
PMCID: PMC10291913  PMID: 37352453

ABSTRACT

Criterion measures vary greatly in terms of their psychometric quality and ease of use. This paper serves two purposes. First, it provides a general summary of different approaches to criterion measurement in a military context. Second, it provides an extensive review of 16 specific types of criterion measurement methods (e.g., job performance rating scales, self-report questionnaires, job knowledge tests) on nine psychometric and ease-of-use evaluation factors. Eight criterion measurement experts read a summary of extant research and made ratings to evaluate each measurement method on the evaluation factors. Rater intra-class correlations (ICCs) were high, ranging from .75 to .95 across the evaluation dimensions with a median of .91. Data showed a quality-feasibility tradeoff, where criterion data that are easy to obtain often have technical flaws. Recommendations for military services and future directions in criterion measurement (e.g., applications of machine learning) are discussed.

KEYWORDS: Job performance, criterion measurement, measurement methods, self-report, job knowledge tests, work sample tests, simulations, performance ratings, administrative records


What is the public significance of this article?—When conducting research to support organizational decision-making, researchers have several measurement tools at their disposal. However, each of these measurement tools comes with advantages and disadvantages. The purpose of the current paper is to support organizational researchers by systematically and empirically reviewing the tradeoffs associated with these measurement tools and combining the relevant research into a single paper.

The purpose of this paper is to summarize and evaluate available measurement methods for use in criterion-related validation studies. While other papers in this special issue discuss criterion constructs or measurement of specific constructs, our intention is to take a high-level, summary view of criterion measurement in military contexts. Our hope is that this summary will provide military psychology researchers with a useful reference for determining a criterion measurement approach that is most appropriate to their unique circumstances. To accomplish this objective, we first provide a brief history of criterion measurement in the U.S. military. We then describe our approach to identifying a comprehensive set of criterion measurement methods used previously in military criterion-related validation. This is followed by a description of those measurement methods, culminating in a table summarizing descriptive information about each method, constructs that are suitable for measurement with that method, and key references for learning more about the appropriate design for each method. We then evaluate each method using input from expert raters, drawing conclusions about appropriate uses for each criterion measure in criterion-related validation. We conclude with some implications for future research.

Brief history of criterion measurement in the U.S. military

Approaches to measuring the criterion domain have been studied for more than a century. As described by Knapp and Rumsey (2023), the U.S. military conducted landmark studies in the 1980s and 1990s to examine the relations of cognitive and non-cognitive assessments to a variety of criteria (i.e., the Job Performance Measurement [JPM] program; Knapp & Campbell, 1993; Kavanagh et al., 1987). In their literature review, Kavanagh et al. (1987) observed that military occupational research historically relied heavily on “broad-based generic indices, performance ratings, or operational measures with their inherent problems of inflation and halo effects” (p. i). They further noted that these broad measures were unsuccessful because they could not take into account specific task-level influences, such as opportunities to perform or training differences, and they identified several factors that can affect measurement quality.

The Army’s JPM project, called Project A, was unique in that the Army pursued the development of an empirical job performance model. Using data from a concurrent validation study, confirmatory factor analysis identified five broad factors of first-term job performance. One of the most important findings from Project A was that the Armed Services Vocational Aptitude Battery (ASVAB) was the best predictor of technical performance and general soldiering proficiency, but non-cognitive measures of temperament and interest were better than the ASVAB at predicting other dimensions such as effort and leadership (ELS), personal discipline (PD), and physical fitness and bearing (PFB; McHenry et al., 1990; Oppler et al., 2001). The research showed that non-cognitive measures should supplement cognitive assessments if the objective is to optimize ELS, PD, and PFB aspects of job performance.

Over the years, researchers (e.g., Campbell & Wiernik, 2015) expanded upon this initial 5-factor model to better capture the job performance space. Recently, as described in Russell, Allen et al. (2022), the U.S. military built on this research to develop a comprehensive performance taxonomy for entry-level military occupations. The taxonomy includes training and in-unit performance in the first term of enlistment and, consistent with best practice, is organized hierarchically with four performance categories at the highest level and 33 specific dimensions.

In addition to performance, researchers performing criterion-related validation studies are often interested in other outcomes. A common example is turnover – while most treatments of criterion-related validation assume job performance as the criterion, turnover is just as important to organizations. The gains realized by a selection system that maximizes job performance will not be realized if employees leave early in their tenure (Sackett et al., 2012). Similarly, researchers are often interested in examining attrition (early separation from a term of service) as an outcome as it can be very costly to the military services (Hughes et al., 2020; Marrone, 2020). Thus, criterion-related validation studies frequently expand the criterion space beyond performance to include outcomes of interest (e.g., attrition, absences) and antecedents of those outcomes, such as job attitudes (Griffeth et al., 2000).

Application of assessment development standards

The U.S. military services have developed and adopted technical standards for new assessments based on professional standards (American Educational Research Association, American Psychological Association & National Council on Measurement in Education, 2014; Society for Industrial and Organizational Psychology, 2018). The Manpower Accession Policy Working Group (MAPWG), which oversees the development and use of the ASVAB, developed a checklist targeted toward the assessment of new predictors proposed as adjuncts to the ASVAB (T. Carretta, MAPWG member, personal communication, September 20, 2021). Both the Air Force (LeBreton, 2021; Ployhart, 2021) and Navy (Held et al., 2014, 2015) also have published guidelines for conducting personnel measurement, selection, and classification studies. However, the MAPWG and service guidelines focus on predictors of performance, with little discussion of criterion development.

Relationship between constructs and methods

All sciences are based on systems of constructs and their interrelations. Psychological constructs are used to facilitate understanding of human behavior. Occupational performance constructs include broad domains that in turn include more specific constructs. For example, as described in Russell, Allen et al. (2022), job performance for entry-level enlisted personnel can be described by four broad domains: (a) Technical Proficiency, (b) Organizational Citizenship and Peer Leadership, (c) Psychosocial Well-Being, and (d) Physical Performance. While methods and constructs are two different aspects of a criterion measure (though often conflated; Arthur & Villado, 2008), certain measures are better suited to assess some constructs than others. Other criterion measures are in themselves the “construct” of interest, though in these cases, they would be better characterized as “outcomes” because they are relevant to organizational objectives rather than individual behaviors or attitudes (Yu et al., 2022). As mentioned earlier, attrition is one such measure, but other examples include absenteeism and productivity indices (e.g., sales volume).

Special considerations for criterion measurement in military contexts

The U.S. military services select and train over 150,000 new recruits annually.1 Recruits receive training in basic skills needed in their service as well as advanced technical training in a specific occupation. Many kinds of outcomes matter. How successful are recruits in training? How well do new service members perform the technical duties of their jobs? How many high performers choose to stay beyond their first term of service? How well do service members perform in combat, when required? Consequently, criterion measurement in the military services is complex in terms of what to measure and how and when to collect those measurements. Taxonomic papers in this issue discuss what to measure (Russell, Allen et al., 2022; Russell, Ingerick et al., 2022). The focus of the current paper is on methods of measuring criterion constructs (e.g., performance ratings, job knowledge tests).

Method and results

Identification of criterion measures

Our first step in reviewing and evaluating different criterion measurement methods was to systematically review previous military validation research to ensure we had a comprehensive list of criterion measure types. Given the long history of military personnel assessment research described in this issue (Knapp & Rumsey, 2023) and elsewhere (Sellman et al., 2017), and considering the cross-service efforts described previously (Russell, Allen et al., 2022), we limited our review to instruments that (a) could apply across services, excluding job-specific training and performance criteria; (b) were developed in 1980 or later, aligning to the JPM research described above; and (c) measured enlisted servicemember outcomes.

We began our review by identifying criteria that had been used recently in validation studies by U.S. military services. To accomplish this, researchers identified one personnel assessment professional from each of the five services (Army, Air Force, Navy, Marine Corps, Coast Guard) and asked them for (a) technical reports or presentations that used criterion measures to validate selection instruments and, if possible, (b) copies of the measures themselves. We supplemented the information provided by these individuals with information from historical validation studies known to the research team and keyword searches for validation studies in the Defense Technical Information Center (DTIC; https://discover.dtic.mil/). We further supplemented this review with military-specific academic resources: the journal Military Psychology and the conference proceedings of the International Military Testing Association (IMTA). For Military Psychology, we reviewed all article abstracts from 2013 to 2017 and identified those with potential criterion measures (e.g., the article was about personnel selection). For IMTA, we reviewed abstracts for all conference presentations from 2000 to 2017. Through this process, we identified over 200 unique criterion measures.2

We determined that all of the specific criterion measures identified through this process could be sorted into one of four broad categories:

  1. Job Performance Rating Scales – Job performance rating scales ask one or more judges to evaluate the behavior of a target individual. In the criterion measures identified above, most job performance rating scales measure typical performance dimensions – that is, how an individual performs day-to-day (or what they “will-do”) on the job or in training as opposed to maximal (or “can-do”) dimensions (Klehe & Anderson, 2007; Russell, Allen et al., 2022). Job performance rating scales can be further sorted by the source of judges – supervisory, peer, self, and multisource.

  2. Direct Assessments – Direct assessments require individuals to demonstrate their performance on a task or articulate what they would do in response to a situation. Unlike job performance rating scales, most direct assessments identified in our research measure maximal/can-do performance dimensions: individuals are placed in clear testing/assessment situations (e.g., taking a test, sitting down for an interview) and asked to demonstrate their proficiency, rather than being judged on day-to-day or “typical” performance.3 Direct assessments include work sample tests, simulations, interviews, situational judgment tests (SJTs), and job knowledge tests (JKTs).

  3. Self-Report Questionnaires – As the name implies, self-report questionnaires ask individuals to review a stimulus (such as a question or a statement) and provide a response (e.g., an open-ended response, a rating) based on their personal experience, attitudes, or opinions. We identified two fundamental types of self-report items used in criterion-related validation research: attitudinal items that measure dimensions such as job satisfaction, and items that capture objective performance information, such as self-reported disciplinary incidents.

  4. Administrative Records – Administrative records are a broad category subsuming a wide range of constructs. Most criterion measures based on administrative records used in previous military research – such as attrition, school grades, and promotion rates – are best classified as “outcomes” rather than direct measures of performance or attitudes (Campbell et al., 1990). There are many considerations in evaluating whether to use administrative records in validation studies. This topic is beyond the scope of the current paper but is covered in detail in Yu et al. (2022).

We describe the component measurement methods within the above categories next.

Measurement methods

Historically, most criterion measures used in military validation research are aimed at estimating job performance or predicting attrition (or measuring attrition after the fact). Job performance (JP), as defined by Campbell et al. (1993), is a function of declarative knowledge (DK), procedural knowledge and skill (PKS), and motivation (M); specifically, JP = DK × PKS × M. Job performance is behavior. DK is knowing facts relevant to the job, PKS is knowing what to do and how to do it, and M is choosing to expend effort on the job. Job performance rating scales and direct assessments (e.g., work samples and simulations) attempt to measure behavior. Oral interviews, JKTs, and SJTs are best thought of as measures of DK and PKS, depending upon how much factual versus procedural knowledge is tested. Self-report questionnaires of attitudes attempt to assess motivation, and they typically serve as indicators of attrition. Finally, some administrative measures, such as rifle qualification scores or awards, are closely tied to job performance, but others, such as training school grades and promotion rate, are outcome measures. They are outcomes of the individual’s performance, but outcome measures can be influenced by variables outside the individual’s control (e.g., the number of promotion slots available in a job). Therefore, administrative measures are often (though not exclusively; see Yu et al., 2022, for details) considered to be indirect measures of an individual’s job performance.
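To make the multiplicative form of this model concrete, the minimal sketch below (our illustration with an assumed 0–1 scaling, not anything from Campbell et al.) shows its key implication: if any determinant is near zero, predicted performance collapses, which an additive combination would not predict.

```python
# Minimal illustration (our assumption: determinants scaled 0-1) of the
# multiplicative model JP = DK x PKS x M from Campbell et al. (1993).

def job_performance(dk: float, pks: float, m: float) -> float:
    """Product of declarative knowledge (DK), procedural knowledge and
    skill (PKS), and motivation (M)."""
    return dk * pks * m

# A knowledgeable, skilled, but unmotivated worker: predicted performance
# collapses to 0, which an additive combination (dk + pks + m) would not show.
print(job_performance(dk=0.9, pks=0.8, m=0.0))  # 0.0
print(job_performance(dk=0.9, pks=0.8, m=0.5))  # 0.36 (0.9 * 0.8 * 0.5)
```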

Table 1 presents criterion measures for each measurement method type along with a brief definition, examples, a list of constructs they are best suited to measure, and key references. Note that each measurement method has advantages and disadvantages, which are summarized here:

Table 1.

Descriptive information for criterion measurement methods.

Job Performance Rating Scales

Supervisor Ratings – Performance ratings provided by a supervisor.
  • Examples: behaviorally anchored rating scales (BARS); behavior summary scales (BSS); behavior observation scales (BOS); computer adaptive rating scales (CARS)
  • Construct suitability: Job performance (technical proficiency; organizational citizenship and peer leadership; psychosocial well-being; physical performance)
  • Key references: Carpenter et al. (2014); Facteau and Craig (2001); Heidemeier and Moser (2009); Knapp and Campbell (1993); O’Leary and Pulakos (2017); Viswesvaran et al. (2005)

Peer Ratings – Performance ratings provided by peer(s).

Self-Ratings – Ratings made by a person regarding his/her own performance.

Multisource Ratings – Performance ratings provided by a combination of supervisors, subordinates, peer(s), and/or self (e.g., 360-degree feedback).

Direct Assessments

Work Samples/Hands-On Tests – Tests simulating performance of a job task or relevant skill; typically involve use of equipment or audio/visual aids.
  • Examples: cleaning or shooting a rifle; operating a piece of equipment; transcribing a memo
  • Construct suitability: Job performance (technical proficiency; physical performance)
  • Key references: Roth et al. (2005); Roth et al. (2008); Wigdor and Green (1991)

Simulations/Assessment Centers – Tests simulating tasks, equipment, and job setting without using actual equipment or job aids.
  • Examples: in-basket; role play; leaderless group discussion; case analysis; oral presentation
  • Construct suitability: Job performance (technical proficiency; organizational citizenship and peer leadership; psychosocial well-being; physical performance)
  • Key references: Arthur et al. (2003); Gaugler et al. (1987); Hoffman et al. (2015)

Oral Interviews – Questions eliciting verbal responses that are scored against rating scales.
  • Examples: type (situational, behavioral); format (individual, panel/group); amount of structure (structured, unstructured)
  • Construct suitability: Job performance (technical proficiency; organizational citizenship and peer leadership); Job attitudes (person-environment fit)
  • Key references: Conway et al. (1995); McDaniel et al. (1994); Thorsteinson (2018)

Situational Judgment Tests – Tests presenting challenging, job-related scenarios that assess judgment about actions to take.
  • Examples: format (text-based, video-based); response instructions (would do, should do); response type (pick the best/worst, rate the effectiveness)
  • Construct suitability: Job performance (technical proficiency; organizational citizenship and peer leadership)
  • Key references: Chan and Schmitt (1997); Hough et al. (2001); McDaniel et al. (2001)

Job Knowledge Tests – Tests consisting of questions designed to assess technical or professional expertise in specific knowledge areas.
  • Examples: format (multiple choice, select all that apply)
  • Construct suitability: Job performance (technical proficiency)
  • Key references: Knapp and Campbell (1993); Hunter (1986); Roth et al. (2003)

Self-Report Questionnaires

Attitudes – Ratings made by a person regarding his/her job-related attitudes.
  • Examples: level (state, trait); response format (Likert scale, forced choice, open-ended)
  • Construct suitability: Job attitudes (work satisfaction; morale; organizational commitment; withdrawal cognitions/intentions; person-environment fit); Organizational outcomes
  • Key references: Griffeth et al. (2000); Judge et al. (2001); Meyer et al. (2002)

Objective Performance/Personnel Data – Data provided by a person regarding his/her performance or administrative records.
  • Examples: physical fitness test scores; reprimands; rifle/pistol qualification score
  • Construct suitability: Job performance (technical proficiency; organizational citizenship and peer leadership; psychosocial well-being; physical performance); Organizational outcomes (reenlistment; promotion rate; training school grades)
  • Key references: Adler et al. (2005); Knapp et al. (2004); Kuncel et al. (2005)

Administrative Records

Attrition – Separation before the end of a service member’s enlistment contract.
  • Examples: delayed entry program attrition; basic/boot camp attrition; advanced training attrition; in-unit attrition
  • Construct suitability: Organizational outcomes (attrition/turnover)
  • Key references: Borman et al. (2017); Campbell (1987); Campion (1991); Ford et al. (2020); General Accounting Office (1998); Knapp and Campbell (1993)

Reenlistment – Agreement to serve an additional term of service after the initial service obligation is met.
  • Examples: failure to reenlist after one’s active-duty service obligation
  • Construct suitability: Organizational outcomes (reenlistment)

Promotion Rate – A deviation score calculated by comparing a service member’s pay grade with the average pay grade for service members in his/her occupation having the same time in service.
  • Examples: ranks or grades advanced per year; recommendation for an accelerated promotion (for second-term service members)
  • Construct suitability: Organizational outcomes (performance)

Training School Grades – Standardized grades from operational training courses for first-term enlistees.
  • Examples: course pass/fail
  • Construct suitability: Organizational outcomes (performance)
  • Key references: Borman et al. (2017); Campbell (1987); Campion (1991); Ford et al. (2020); General Accounting Office (1998); Knapp and Campbell (1993)

Personnel File Records: Performance Scores – Variety of scores on measures of performance.
  • Examples: physical fitness test scores; weapon qualification scores
  • Construct suitability: Organizational outcomes (performance)
Job performance rating scales

A particular advantage of performance rating scales is that, compared to other measurement methods, all performance requirements of a job can be included in the rating scales (Knapp & Campbell, 1993; O’Leary & Pulakos, 2017). Ratings are most commonly provided by supervisors but can also be provided by peers, subordinates, and the individual themselves. In addition, rating scales are less expensive to develop than most other measures and are relatively easy and convenient to administer. Drawbacks to rating scales include (a) rater contamination, that is, ratings can be distorted by factors such as personal likability, stereotype generalizations, and carelessness; and (b) attenuated variance, because many employees are rated at the middle (i.e., central tendency) or high end of the scale (O’Leary & Pulakos, 2017). Thus, rating errors and bias can be problematic with operational performance ratings (Knapp & Campbell, 1993). Frame-of-reference training, which familiarizes raters with the content of the rating dimensions, has been shown to decrease rating errors (Borman et al., 2017).

Direct assessments

Direct assessments can be described along three dimensions (Knapp, 2011):

  • Authenticity – The extent to which the assessment simulates real-world experience.

  • Response flexibility – The extent to which response options are structured. For example, multiple-choice tests are low in response flexibility whereas essay tests and constructed response tests are high in response flexibility.

  • Stimulus flexibility – The extent to which test questions vary based on the test-taker’s responses. For example, a JKT that is static (i.e., contains the same questions for all test-takers) is low in stimulus flexibility while an adaptive one (i.e., contains different questions depending on the individual test-taker’s performance) is high in stimulus flexibility.

The direct assessments in Table 1 are listed from highest in authenticity (work samples/hands-on tests) to lowest (JKTs). Work samples and hands-on tests have high authenticity because they mimic real-life work activities. Therefore, they are most useful in situations where measurement of critical, frequently performed, or difficult tasks is warranted. The disadvantage of these tests is that they can be time-consuming to administer, which in turn can limit the number of tasks or skills that can be tested. Thus, work samples and hands-on tests are generally less comprehensive than other measurement methods. At the low end of the authenticity spectrum are JKTs, which have the advantage of being able to sample more tasks with greater efficiency (Knapp & Campbell, 1993) and are a good format to use when a task is procedural (i.e., requiring knowledge about steps or actions to take). However, tasks involving complex motor skills are better assessed using hands-on performance tests (Borman et al., 2017). Tests falling in the middle of the authenticity spectrum include simulations and assessment centers, oral interviews, and SJTs.

Note that some direct assessments (e.g., assessment centers, oral interviews) often rely on ratings of stimulus material gathered during the administration process. These are distinct from job performance rating scales in that the stimulus for the ratings is contained within the assessment process. However, the addition of subjective judgment to the assessment process introduces psychometric and operational considerations of its own. Although this discussion provides an organizing framework for describing the direct assessments, the degree to which specific criterion formats vary on key evaluative criteria is described in the next section.

Self-report questionnaires

Table 1 describes two types of self-report measures: (a) attitudes and (b) objective performance/administrative data, such as physical fitness test scores and training school grades. Self-reported attitudes are typically used as indicators of work motivation and attrition. An advantage of these surveys is that they are efficient tools for collecting data on a wide variety of attitudinal and training/work experience variables. Additionally, self-report attitudinal questionnaires are typically less expensive to develop than most other measures. One disadvantage is that self-report questionnaires are susceptible to distortion (Chan, 2009; John & Robins, 1993); in particular, self-reported performance is prone to leniency. However, this potential for distortion seems to be of little concern to researchers, perhaps because attitude surveys are typically low stakes (i.e., having no operational consequences for respondents).

Administrative records

The advantage of administrative records is that organizations, including the U.S. military, collect extensive data with the potential to be used as criterion measures, such as attrition, reenlistment, promotion rate, training school grades, and personnel file records. One limitation is that these outcome measures usually assess a narrow portion of the total criterion space (Borman et al., 2017). Additionally, the quality of administrative measures depends on how well training and personnel database systems are populated and updated. One source of error for administrative variables is datedness. Another is whether processes within or between the services have been standardized. For instance, training school grades have been used extensively for ASVAB validation over the years (e.g., Carretta & King, 2008; Wise et al., 1992). In particular, the Air Force has standardized the process of grading and collecting grades across schoolhouses, but other services encounter difficulties in doing so. The topic of administrative records is described in more detail by Yu et al. (2022).
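As one concrete illustration of deriving a criterion from administrative records, the promotion-rate measure defined in Table 1 is a computed deviation score rather than a raw field. A minimal sketch of that derivation follows; the file and column names are hypothetical.

```python
# Hypothetical sketch of the promotion-rate deviation score from Table 1:
# each member's pay grade minus the mean pay grade of peers in the same
# occupation with the same time in service. Column names are illustrative.
import pandas as pd

records = pd.read_csv("personnel_records.csv")  # occupation, months_in_service, pay_grade

# Mean pay grade among peers with the same occupation and time in service.
peer_mean = records.groupby(["occupation", "months_in_service"])["pay_grade"].transform("mean")
records["promotion_rate"] = records["pay_grade"] - peer_mean
```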

It is worth mentioning that there is overlap between extant objective performance data collected through self-report questionnaires and those collected through administrative records. Military researchers may collect the same information from different sources depending on circumstances. For example, if the researchers have concerns about the quality or availability of certain administrative records, they may opt to collect those data through self-report. On the other hand, if there are concerns with self-report (e.g., the respondent may not recall the information), the researchers may opt to obtain the relevant information through administrative records. As with any criterion measurement approach, there are pros and cons to each, an issue that we explore in more detail in the next section.

Measurement methods evaluation

We conducted an expert judgment exercise to evaluate the measurement methods described in Table 1 for use in criterion-related validation studies in the U.S. military. First, drawing on extant research, we developed rating scales comprising nine evaluation dimensions grouped into psychometric considerations and operational considerations. These dimensions and scales are described below:

Psychometric Considerations:

  • Comprehensiveness (vs. Deficiency) – The extent to which the content of the measure covers the criterion domain (1 = Low comprehensiveness; 5 = High comprehensiveness).

  • Susceptibility to Contamination – The extent to which score variance is attributable to job irrelevant determinants (1 = High susceptibility; 5 = Low susceptibility).

  • Reliability – The extent to which scores produced by the measurement method are consistent over time (1 = Low reliability; 5 = High reliability).

  • Discriminability – The extent to which the measurement method distinguishes between good and poor performers (1 = Low discriminability; 5 = High discriminability).

  • Validation Uses – The extent to which the measurement method has proven useful for assessing the validity of cognitive, personality, interest, or physical ability constructs (1 = Low validation uses; 5 = High validation uses).

Operational Considerations:

  • Ease and Cost of Measure Development – The cost associated with developing new measurement tools (1 = High cost; 5 = Low cost).

  • Ease and Quality of Administration – The extent to which high-quality data can be collected efficiently (1 = Low ease/quality; 5 = High ease/quality).

  • Ease and Quality of Data Management – The extent to which data can be stored and managed easily, securely, and accurately (1 = Low ease/quality; 5 = High ease/quality).

  • Ease and Cost of Maintenance – The cost associated with updating, revising, or maintaining the measurement tools and databases over time (1 = High cost; 5 = Low cost).

We asked eight individuals, all with Ph.D.s in industrial and organizational psychology or a related field and extensive experience with both criterion measurement and military personnel research, to complete the rating task. Specifically, raters were asked to (a) read through a summary of extant research, prepared by the research team, on the measurement methods described in Table 1 and (b) evaluate each measurement method on a 5-point rating scale. Rater intra-class correlations (ICC[C,8]) were very high, ranging from .74 to .95 across the evaluation dimensions with a median of .91 (a computational sketch of this reliability estimate follows the list below). Table 2 presents the results of this rating activity. Composite “Psychometric Considerations” and “Operational Considerations” means and standard deviations are presented along with advantages and disadvantages for each method, informed by the component scale ratings. Closer inspection of Table 2 suggests the following:

  • No one measurement method is perfect. Each measurement method has pros and cons, suggesting the most comprehensive approaches to criterion-related validation will employ a variety of methods.

  • There is some tradeoff between psychometric and operational considerations. The correlation between the mean ratings of psychometric considerations and the mean ratings of operational considerations in Table 2 is −.14. This suggests that assessments that are easier to develop, administer, manage, and maintain are also somewhat more likely to have measurement limitations.

  • Some measurement methods have high ratings on both evaluation factors relative to other methods. JKTs, attitudes, and multisource ratings yield the highest scores related to psychometric considerations, while attitudes, self-report objective performance/personnel data, and self-ratings yield the highest ratings for operational considerations. Averaging across the two metrics results in attitudes, JKTs, ratings (supervisor, multisource, peer), and objective performance/personnel data having the highest overall scores. All approaches can be incorporated into existing business processes (e.g., knowledge tests as end-of-training measures, ratings for developmental feedback).

  • Certain methods may be more situationally desirable for use in validation despite low ratings on certain dimensions. For example, organizations that have high quality data in administrative records may find those criteria more desirable. Additionally, some constructs (e.g., interpersonal skills) are measured particularly well by certain methods (e.g., SJTs). Thus, the chosen criterion measurement method should be appropriate to the constructs being measured and other situational factors, such as the state of administrative records data.
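For readers who want to reproduce the interrater reliability estimate reported above, the sketch below shows one way to compute ICC(C,8), the two-way, consistency, average-measures coefficient (labeled ICC3k in the open-source pingouin library). The data file and column names are hypothetical.

```python
# Hypothetical sketch: estimating ICC(C,8) for each evaluation dimension.
# ICC(C,k) (two-way model, consistency, average of k raters) corresponds to
# "ICC3k" in the pingouin library's output.
import pandas as pd
import pingouin as pg  # pip install pingouin

# Assumed long format: one row per rater x measurement method x dimension.
ratings = pd.read_csv("method_ratings.csv")  # columns: dimension, method, rater, rating

for dim, block in ratings.groupby("dimension"):
    icc = pg.intraclass_corr(data=block, targets="method",
                             raters="rater", ratings="rating")
    icc3k = icc.loc[icc["Type"] == "ICC3k", "ICC"].iloc[0]
    print(f"{dim}: ICC(C,8) = {icc3k:.2f}")
```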

Table 2.

Mean psychometric quality and application ease ratings.

Supervisor Ratings – Psychometric: M = 3.40 (SD = 1.20); Operational: M = 4.22 (SD = 0.40)
  • Pros: can be used for a wide range of constructs; easy to develop, administer, and maintain
  • Cons: susceptible to contamination and unreliability, particularly with a low number of raters; can lack variance/discriminability

Peer Ratings – Psychometric: M = 3.03 (SD = 1.24); Operational: M = 4.19 (SD = 0.39)
  • Pros: same as for supervisor ratings
  • Cons: same as for supervisor ratings, but more susceptible to unreliability and contamination

Self-Ratings – Psychometric: M = 2.88 (SD = 1.33); Operational: M = 4.44 (SD = 0.16)
  • Pros: can be used for any construct; very easy to execute operationally
  • Cons: high susceptibility to contamination; low discriminability

Multisource Ratings – Psychometric: M = 3.63 (SD = 0.97); Operational: M = 3.88 (SD = 0.53)
  • Pros: similar to single-source ratings, but comparatively lower susceptibility to contamination and higher discriminability
  • Cons: similar challenges to those listed for single-source ratings; more difficult to administer than single-source ratings

Work Samples/Hands-On Tests (Direct) – Psychometric: M = 3.33 (SD = 0.81); Operational: M = 2.41 (SD = 1.19)
  • Pros: strong psychometrics (e.g., reliability, discriminability); yields high-quality data
  • Cons: best used for specific (e.g., can-do) constructs; difficult to measure many constructs; difficult to develop, administer, and maintain

Simulations/Assessment Centers (Direct) – Psychometric: M = 3.15 (SD = 0.51); Operational: M = 2.38 (SD = 1.23)
  • Pros: similar to work samples, but can cover a wider range of constructs
  • Cons: as with work samples, very difficult to develop and maintain; assessment centers particularly difficult to administer

Oral Interviews (Direct) – Psychometric: M = 3.45 (SD = 0.56); Operational: M = 3.34 (SD = 0.82)
  • Pros: can be used for a wide range of constructs; easy to develop and maintain; good psychometric properties, particularly with multiple interviewers
  • Cons: resource-intensive to administer compared to more automated methods; difficult to scale

Situational Judgment Tests (Direct) – Psychometric: M = 3.53 (SD = 0.50); Operational: M = 3.25 (SD = 1.10)
  • Pros: strong psychometric properties; easy to administer once set up
  • Cons: resource-intensive to develop and maintain; best for a limited range of constructs (e.g., interpersonal and judgment-oriented constructs)

Job Knowledge Tests (Direct) – Psychometric: M = 4.18 (SD = 0.68); Operational: M = 3.69 (SD = 0.77)
  • Pros: strong psychometric properties; easy to administer and manage data
  • Cons: effective for a narrow range of constructs

Attitudes (Questionnaire) – Psychometric: M = 3.70 (SD = 0.56); Operational: M = 4.63 (SD = 0.18)
  • Pros: very easy to develop, administer, and maintain; good psychometric properties when used for appropriate constructs
  • Cons: susceptible to contamination/low-variance responding (particularly socially desirable responding)

Objective Performance/Personnel Data (Questionnaire) – Psychometric: M = 3.01 (SD = 0.67); Operational: M = 4.56 (SD = 0.26)
  • Pros: easy to develop, administer, and maintain; yields valid, reliable data under the right circumstances
  • Cons: dependent on availability of objective measures; susceptible to psychometric problems (e.g., faking) in other (e.g., high-stakes) circumstances

Attrition (Administrative) – Psychometric: M = 2.69 (SD = 1.17); Operational: M = 4.03 (SD = 0.89)
  • Pros: commonly used criterion in military research; typically collected and maintained in administrative records; easy to use in validation research
  • Cons: narrow focus in the criterion space; psychometric quality depends on quality of input data

Reenlistment (Administrative) – Psychometric: M = 2.53 (SD = 0.96); Operational: M = 4.19 (SD = 0.65)
  • Pros: same as attrition
  • Cons: same as attrition

Promotion Rate (Administrative) – Psychometric: M = 2.62 (SD = 0.54); Operational: M = 4.38 (SD = 0.68)
  • Pros: similar to attrition/reenlistment
  • Cons: similar to attrition/reenlistment, but more difficult to construct given occupational differences in promotion rates

Training School Grades (Administrative) – Psychometric: M = 3.10 (SD = 0.47); Operational: M = 4.16 (SD = 0.69)
  • Pros: if data are well maintained, easy to obtain high-quality data for validation
  • Cons: data typically limited to a narrow range of the criterion space (e.g., technical performance); psychometric quality depends on quality of input data and diligence in tracking outcomes

Personnel File Records: Performance Scores (Administrative) – Psychometric: M = 3.17 (SD = 0.75); Operational: M = 4.16 (SD = 0.64)
  • Pros: high-quality criterion, particularly when derived from high-stakes evaluations (e.g., fitness scores, weapon qualifications); easy to access if maintained in centralized administrative records
  • Cons: narrow focus in the criterion space

Note. The scale ranged from 1 (Low) to 5 (High). M = average rating across component scale means across raters (susceptibility to contamination, reliability, discriminability, validation uses); SD = standard deviation across component scale means across raters. k = 8 raters. Missing data (indicated by a “not relevant” rating) were replaced with the harmonic mean for the scale across raters.
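The harmonic-mean replacement rule in the table note is simple to implement; a minimal sketch (with made-up ratings) follows.

```python
# Minimal sketch (made-up data) of the Table 2 note's missing-data rule:
# a "not relevant" rating is replaced with the harmonic mean of the
# observed ratings for that scale across raters.
import numpy as np
from scipy.stats import hmean

def impute_not_relevant(scale_ratings):
    """scale_ratings: one scale's ratings across raters; np.nan = 'not relevant'."""
    arr = np.asarray(scale_ratings, dtype=float)
    arr[np.isnan(arr)] = hmean(arr[~np.isnan(arr)])  # harmonic mean of observed
    return arr

# Example: 8 raters, one of whom marked the scale "not relevant".
print(impute_not_relevant([4, 5, 3, 4, np.nan, 4, 5, 3]))  # nan -> ~3.85
```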

Discussion

A key objective for this paper is to have it serve as a reference for military psychology researchers conducting criterion-related validation studies. Criterion measurement has a long and rich history in the U.S. military. Thus, the first activity was to systematically survey criterion measurement research (and the broader military psychology literature) to construct a comprehensive list of potential measurement approaches. This complements other papers in the current special issue that describe criterion constructs (e.g., Russell, Allen et al., 2022; Russell, Ingerick et al., 2022); specifically, Table 1 lists constructs that are suitable for each identified method. Content in Tables 1 and 2 expands on this by providing (a) pros and cons for each measurement method and (b) resources for developing those methods. With these resources, we hope to assist military personnel psychologists in addressing the “criterion problem” in their own research (Austin & Villanova, 1992).

A few caveats should be noted about the current research. First, evaluations of the measurement methods (provided in Table 2) are based on “standard” or “typical” approaches found in military personnel research. However, there is substantial variation within each of these approaches, such that choices made in the design of an instrument can have a significant impact on key evaluation criteria such as reliability. For example, the type of structure in oral interviews (e.g., interviewer training) can have a significant positive effect on interrater reliability (Conway et al., 1995). Conforming to best practices in SJTs can also have a significant positive effect on their utility and psychometric quality (Whetzel et al., 2020). Thus, how a criterion measure is developed and implemented can be just as critical as the measurement method chosen.

Second, the current paper focuses on existing research into criterion measurement methods rather than possible measurement methods. Modern analytic methods such as machine learning are opening up new avenues in assessment that can and should be explored for criterion measurement. For example, expert ratings for oral interviews, assessment centers, and so forth are based on the understanding that these approaches require trained raters for scoring. Machine learning methods offer the possibility of automated scoring, which would allow these methods to scale to larger populations more effectively and would raise their ratings for ease of administration. Relatedly, more and more organizations are relying on passive means of collecting data on employees to make judgments about key variables of interest. Sources for these data include internal organizational communications (e.g., e-mail, instant messages). In addition, passive data collection and analysis are being used to gauge employee sentiment (e.g., engagement) and to identify potential insider threats. The use of these passive methods for criterion measurement has not been explored in a systematic way.
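To illustrate the automated-scoring idea in hypothetical terms, the sketch below trains a simple text-regression model to approximate expert ratings of transcribed interview responses. This is our illustration of the general approach, not a description of any operational military system; all data and names are invented.

```python
# Hypothetical sketch of ML-based automated scoring: learn to approximate
# expert ratings of transcribed interview responses from text features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Invented training data: transcribed responses with expert ratings (1-5 scale).
responses = [
    "Described concrete steps to de-escalate the conflict and follow up.",
    "Said they would ignore the issue and hope it resolves itself.",
    "Escalated to the supervisor with a written plan and timeline.",
]
expert_ratings = [4.5, 1.5, 4.0]

scorer = make_pipeline(TfidfVectorizer(), Ridge())
scorer.fit(responses, expert_ratings)  # learn to mimic the expert raters
print(scorer.predict(["Proposed a plan, briefed the team, and followed up."]))
```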

Finally, comparing the history of criterion measurement in the U.S. military with the methods in Table 1 shows that many of the methods in current use date back to the JPM projects described above. A key difference is that, more and more, organizations (and the military services are no exception) prefer to rely on administrative records rather than on active data collection to inform personnel decisions. Given that these records are readily available, it makes sense to make full use of those data before initiating any additional criterion-related data collection activities. However, the use of administrative records can be fraught with peril, particularly when the records are not well maintained. Proper data stewardship and other best practices can help to ensure data have maximum utility for organizational decision makers, a topic that is explored in more detail in another paper in this special issue (Yu et al., 2022).

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Funding Statement

This research was funded under Contract [#: FA8650-14-D-6500-0007], Development of Criterion Measures for Evaluating Accession and Classification Testing, with Infoscitex Corporation.

Notes

2.

Due to space constraints, the full list of measures initially identified is not provided here but is available upon request from the corresponding author.

3.

Note that this is not a clean distinction. Maximal and typical performance form more of a continuum than discrete categories (Russell, Allen et al., 2022), and direct assessments can be designed to orient toward typical performance. For example, an SJT can instruct respondents to indicate what they “would typically” do in a certain situation, thereby eliciting their typical responses (McDaniel et al., 2007). Interview questions also frequently ask respondents to describe what they have done in previous situations or what they would do (Taylor & Small, 2002), which is likewise typical-performance oriented. Despite these caveats, however, we feel the distinction between ratings and assessments remains useful given the latter’s orientation to formal assessment situations (i.e., direct assessment of the respondent rather than indirect).

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data that support the findings of this study are available from the corresponding author, Matthew Allen, upon reasonable request.

References

  1. Adler, A. B., Thomas, J. L., & Castro, C. A. (2005). Measuring up: Comparing self-reports with unit records for assessing soldier performance. Military Psychology, 17(1), 3–24. 10.1207/s15327876mp1701_2
  2. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing.
  3. Arthur, W., Jr, Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related validity of assessment center dimensions. Personnel Psychology, 56(1), 125–153. 10.1111/j.1744-6570.2003.tb00146.x
  4. Arthur, W., Jr, & Villado, A. J. (2008). The importance of distinguishing between constructs and methods when comparing predictors in personnel selection research and practice. Journal of Applied Psychology, 93(2), 435–442. 10.1037/0021-9010.93.2.435
  5. Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77(6), 836–874. 10.1037/0021-9010.77.6.836
  6. Borman, W. C., Grossman, M. R., Bryant, R. H., & Dorio, J. (2017). The measurement of task performance as criteria in selection research. In Farr J. L. & Tippins N. T. (Eds.), Handbook of employee selection (2nd ed., pp. 429–447). Routledge.
  7. Campbell, J. P. (1987). Improving the selection, classification, and utilization of Army enlisted personnel: Annual report, 1985 fiscal year–supplement to ARI Technical Report TR 746 (ARI Research Note 87-54). Army Research Institute for the Behavioral and Social Sciences. https://apps.dtic.mil/dtic/tr/fulltext/u2/a188267.pdf
  8. Campbell, J. P., McHenry, J. J., & Wise, L. L. (1990). Modeling job performance in a population of jobs. Personnel Psychology, 43(2), 313–333. 10.1111/j.1744-6570.1990.tb01561.x
  9. Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In Schmitt N. & Borman W. C. (Eds.), Personnel selection in organizations (pp. 35–70). Jossey-Bass.
  10. Campbell, J. P., & Wiernik, B. M. (2015). The modeling and assessment of work performance. Annual Review of Organizational Psychology and Organizational Behavior, 2(1), 47–74. 10.1146/annurev-orgpsych-032414-111427
  11. Campion, M. A. (1991). Meaning and measurement of turnover: Comparison of alternative measures and recommendations for research. Journal of Applied Psychology, 76(2), 199–212. 10.1037/0021-9010.76.2.199
  12. Carpenter, N. C., Berry, C. M., & Houston, L. (2014). A meta-analytic comparison of self-reported and other-reported organizational citizenship behavior. Journal of Organizational Behavior, 35(4), 547–574. 10.1002/job.1909
  13. Carretta, T. R., & King, R. E. (2008). Improved military air traffic controller selection methods as measured by subsequent training performance. Aviation, Space, and Environmental Medicine, 79(1), 36–43. 10.3357/ASEM.2166.2008
  14. Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82(1), 143–159. 10.1037/0021-9010.82.1.143
  15. Chan, D. (2009). So why ask me? Are self-report data really that bad? In Lance C. E. & Vandenberg R. J. (Eds.), Statistical and methodological myths and urban legends: Doctrine, verity and fable in the organizational and social sciences (pp. 309–336). Routledge.
  16. Conway, J. M., Jako, R. A., & Goodman, D. F. (1995). A meta-analysis of interrater and internal consistency reliability of selection interviews. Journal of Applied Psychology, 80(5), 565–579. 10.1037/0021-9010.80.5.565
  17. Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources comparable? Journal of Applied Psychology, 86(2), 215–227. 10.1037/0021-9010.86.2.215
  18. Ford, L. A., Yu, M. C., Graves, C. R., Huber, C. R., Russell, T. L., Wilmot, M. P., & Ellis, B. (2020). Development of joint-service criterion instruments for enlisted jobs (AFRL-RH-WP-TR-2020-0036). Wright-Patterson AFB, OH: 711 Human Performance Wing, Warfighter Interface Division, Collaborative Interfaces and Teaming Branch.
  19. Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72(3), 493–511. 10.1037/0021-9010.72.3.493
  20. General Accounting Office. (1998). Military attrition: DOD needs to better analyze reasons for separation and improve recruiting systems (GAO/T-NSIAD-98-117).
  21. Griffeth, R. W., Hom, P. W., & Gaertner, S. (2000). A meta-analysis of antecedents and correlates of employee turnover: Update, moderator tests, and research implications for the next millennium. Journal of Management, 26(3), 463–488. 10.1177/014920630002600305
  22. Heidemeier, H., & Moser, K. (2009). Self–other agreement in job performance ratings: A meta-analytic test of a process model. Journal of Applied Psychology, 94(2), 353–370. 10.1037/0021-9010.94.2.353
  23. Held, J. D., Carretta, T. R., Johnson, J. W., & McCloy, R. A. (Eds.). (2014). Introductory guide for conducting ASVAB validation/standards studies in the U.S. Navy (NPRST-TR-15-1). Navy Personnel Research, Studies, and Technology.
  24. Held, J. D., Carretta, T. R., Hazlett, S. A., Johnson, J. W., Mendoza, J. L., Abraham, N. M., Drasgow, F., McCloy, R. A., & Wolfe, J. H. (Eds.). (2015). Technical guidance for conducting ASVAB validation/standards studies in the U.S. Navy (NPRST-TR-15-2). Navy Personnel Research, Studies, and Technology.
  25. Hoffman, B. J., Kennedy, C. L., LoPilato, A. C., Monahan, E. L., & Lance, C. E. (2015). A review of the content, criterion-related, and construct-related validity of assessment center exercises. Journal of Applied Psychology, 100(4), 1143–1168. 10.1037/a0038707
  26. Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration of adverse impact in personnel selection procedures: Issues, evidence and lessons learned. International Journal of Selection and Assessment, 9(1–2), 152–194. 10.1111/1468-2389.00171
  27. Hughes, M. G., O’Brien, E. L., Reeder, M. C., & Purl, J. (2020). Attrition and reenlistment in the Army: Using the Tailored Adaptive Personality Assessment System (TAPAS) to improve retention. Military Psychology, 32(1), 36–50. 10.1080/08995605.2019.1652487
  28. Hunter, J. E. (1986). Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of Vocational Behavior, 29(3), 340–362. 10.1016/0001-8791(86)90013-8
  29. John, O. P., & Robins, R. W. (1993). Determinants of interjudge agreement on personality traits: The Big Five domains, observability, evaluativeness, and the unique perspective of the self. Journal of Personality, 61(4), 521–551. 10.1111/j.1467-6494.1993.tb00781.x
  30. Judge, T. A., Thoresen, C. J., Bono, J. E., & Patton, G. K. (2001). The job satisfaction–job performance relationship: A qualitative and quantitative review. Psychological Bulletin, 127(3), 376–407. 10.1037/0033-2909.127.3.376
  31. Kavanagh, M. J., Borman, W. C., Hedge, J. W., & Gould, R. B. (1987). Job performance measurement in the military: A classification scheme, literature review, and directions for research (AFHR-TR-87-15). Air Force Human Resources Laboratory.
  32. Klehe, U. C., & Anderson, N. (2007). Working hard and working smart: Motivation and ability during typical and maximum performance. Journal of Applied Psychology, 92(4), 978–992. 10.1037/0021-9010.92.4.978
  33. Knapp, D. J., & Campbell, J. P. (1993). Building a joint-service classification research roadmap: Criterion-related issues (AL/HR-TP-1993-0028). Armstrong Laboratory.
  34. Knapp, D. J., McCloy, R. A., & Heffner, T. S. (2004). Validation of measures designed to maximize 21st-century Army NCO performance (TR 1145). U.S. Army Research Institute for the Behavioral and Social Sciences.
  35. Knapp, D. J. (2011, January). Approaches to developing assessments of 21st century skills [Paper presentation]. National Research Council’s Workshop on Assessment of 21st Century Skills, Washington, D.C.
  36. Knapp, D., & Rumsey, M. (2023). Introduction to the special issue on criterion measurement. Military Psychology, 35(4), 273–282. 10.1080/08995605.2022.2050165
  37. Kuncel, N. R., Credé, M., & Thomas, L. L. (2005). The validity of self-reported grade point averages, class ranks, and test scores: A meta-analysis and review of the literature. Review of Educational Research, 75(1), 63–82. 10.3102/00346543075001063
  38. LeBreton, J. M. (2021). Air Force Personnel Center best practices guide: Test development and validation (AFRL-RH-WP-TR-2021-0002). Wright-Patterson AFB, OH: 711 Human Performance Wing, Airman Biosciences Division, Performance Optimization Branch.
  39. Marrone, J. V. (2020). Predicting 36-month attrition in the U.S. military: A comparison across service branches. RAND.
  40. McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79(4), 599–616. 10.1037/0021-9010.79.4.599
  41. McDaniel, M. A., Morgeson, F. P., Finnegan, E. B., Campion, M. A., & Braverman, E. P. (2001). Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86(4), 730–740. 10.1037/0021-9010.86.4.730
  42. McDaniel, M. A., Hartman, N. S., Whetzel, D. L., & Grubb, W. L. (2007). Situational judgment tests, response instructions, and validity: A meta-analysis. Personnel Psychology, 60(1), 63–91. 10.1111/j.1744-6570.2007.00065.x
  43. McHenry, J., Hough, L. M., Toquam, J. L., Hanson, M. A., & Ashworth, S. (1990). Project A validity results: The relationship between predictor and criterion domains. Personnel Psychology, 43(2), 335–354. 10.1111/j.1744-6570.1990.tb01562.x
  44. Meyer, J. P., Stanley, D. J., Herscovitch, L., & Topolnytsky, L. (2002). Affective, continuance, and normative commitment to the organization: A meta-analysis of antecedents, correlates, and consequences. Journal of Vocational Behavior, 61(1), 20–52. 10.1006/jvbe.2001.1842
  45. O’Leary, R. S., & Pulakos, E. D. (2017). Defining and measuring results of workplace behavior. In Farr J. L. & Tippins N. T. (Eds.), Handbook of employee selection (2nd ed., pp. 509–529). Routledge.
  46. Oppler, S. J., McCloy, R. A., Peterson, N. G., Russell, T. L., & Campbell, J. P. (2001). The prediction of multiple components of entry-level performance. In Campbell J. P. & Knapp D. J. (Eds.), Exploring the limits in personnel selection and classification (pp. 349–388). Erlbaum.
  47. Ployhart, R. E. (2021). Air Force Personnel Center best practices guide: Selection and classification model development (AFRL-RH-WP-TR-2021-0003). Wright-Patterson AFB, OH: 711 Human Performance Wing, Airman Biosciences Division, Performance Optimization Branch.
  48. Roth, P. L., Huffcutt, A. I., & Bobko, P. (2003). Ethnic group differences in measures of job performance: A new meta-analysis. Journal of Applied Psychology, 88(4), 694–706. 10.1037/0021-9010.88.4.694
  49. Roth, P. L., Bobko, P., & McFarland, L. A. (2005). A meta-analysis of work sample test validity: Updating and integrating some classic literature. Personnel Psychology, 58(4), 1009–1037. 10.1111/j.1744-6570.2005.00714.x
  50. Roth, P., Bobko, P., McFarland, L., & Buster, M. (2008). Work sample tests in personnel selection: A meta-analysis of black–white differences in overall and exercise scores. Personnel Psychology, 61(3), 637–661. 10.1111/j.1744-6570.2008.00125.x
  51. Russell, T., Allen, M., Ford, L., Carretta, T., & Kirkendall, C. (2022). Development of a performance taxonomy for entry-level military occupations. Military Psychology, 35(4), 283–294. 10.1080/08995605.2022.2050163
  52. Russell, T., Ingerick, M. J., & Barron, L. G. (2022). Defining occupation-specific performance components for military selection and classification. Military Psychology, 35(4), 295–307. 10.1080/08995605.2022.2065901
  53. Sackett, P. R., Putka, D. J., & McCloy, R. A. (2012). The concept of validity and the process of validation. In Schmitt N. (Ed.), Oxford handbook of assessment and selection (pp. 91–118). Oxford University Press.
  54. Sellman, W. S., Russell, T. L., & Strickland, W. J. (2017). Selection and classification in the U.S. military. In Farr J. L. & Tippins N. T. (Eds.), Handbook of employee selection (2nd ed., pp. 697–721). Taylor and Francis Group.
  55. Society for Industrial and Organizational Psychology. (2018). Principles for the validation and use of personnel selection procedures (5th ed.). SIOP.
  56. Taylor, P. J., & Small, B. (2002). Asking applicants what they would do versus what they did do: A meta-analytic comparison of situational and past behaviour employment interview questions. Journal of Occupational and Organizational Psychology, 75(3), 277–294. 10.1348/096317902320369712
  57. Thorsteinson, T. J. (2018). A meta-analysis of interview length on reliability and validity. Journal of Occupational and Organizational Psychology, 91(1), 1–32. 10.1111/joop.12186
  58. Viswesvaran, C., Schmidt, F. L., & Ones, D. S. (2005). Is there a general factor in ratings of job performance? A meta-analytic framework for disentangling substantive and error influences. Journal of Applied Psychology, 90(1), 108–131. 10.1037/0021-9010.90.1.108
  59. Whetzel, D. L., Sullivan, T. S., & McCloy, R. A. (2020). Situational judgment tests: An overview of development practices and psychometric characteristics. Personnel Assessment and Decisions, 6(1), 1–16. 10.25035/pad.2020.01.001
  60. Wigdor, A. K., & Green, B. F., Jr. (1991). Performance assessment for the workplace, Vol. 1; Vol. 2: Technical issues. National Academy Press.
  61. Wise, L., Welsh, J., Grafton, F., Foley, P., Earles, J., Sawin, L., & Divgi, D. R. (1992). Sensitivity and fairness of the Armed Services Vocational Aptitude Battery (ASVAB) technical composites. Defense Manpower Data Center.
  62. Yu, M., Reeder, M., Dorsey, D., & Allen, M. (2022). Administrative records-based criterion measures. Military Psychology, 35(4), 351–363. 10.1080/08995605.2022.2063614
