ABSTRACT
The Bureau of Labor Statistics reported that 39.1% of the civilian workforce in the United States performs physically demanding jobs that require lifting, carrying, pushing/pulling, kneeling, stooping, crawling, and climbing activities in varied environmental conditions. United States military occupations are similar to those in the civilian sector involving equipment installation, emergency rescues, and maintenance, along with combat arms occupations. This article provides an overview of the types of criterion measures used to assess the physical domain and approaches for designing and evaluating the criteria. It also includes a method for generating criterion measures that are applicable across multiple jobs.
KEYWORDS: Criterion measure, physical, work sample, supervisor ratings, simulations
What is the public significance of this article?—This paper presents strategies for developing criterion measures that assess the performance of physically demanding tasks in military and civilian workplaces. Additionally, it highlights how to gather and integrate physiological and ergonomic information related to work tasks into the design of the criterion measures.
Introduction
The Bureau of Labor Statistics (Bureau of Labor Statistics, U.S. Department of Labor, 2018, 2020) reported that 39.1% of the civilian workforce in the United States performs physically demanding jobs. Physical demand indicates the performance of activities involving lifting, carrying, pushing/pulling, kneeling, stooping, crawling, digging, standing, and climbing at varying levels of intensity and duration. There are hundreds of arduous jobs across the military and civilian sectors (e.g., electric, telecommunications, railroad, public safety) that require equipment installations and repairs, citizen rescues, construction, and other maintenance tasks. The U.S. Army, Air Force, and Navy contain jobs comparable to those in the civilian sector, along with combat arms occupations (e.g., Infantryman, Special Warfare Combatant-Craft Crewman, Combat Engineer). These military jobs involve installation, maintenance, construction, transportation, and combat and have moderate to heavy physical demands.
To ensure individuals can perform arduous jobs in a safe, effective, and efficient manner, many organizations require pre-employment assessments of an individual’s physical capabilities in relation to the job demands. The U.S. Army’s Occupational Physical Assessment Test (OPAT) is an example of a test that determines whether an inductee has the physical capabilities to perform combat arms and non-combat jobs. To ascertain how well an assessment predicts present or future performance, researchers developed alternate measures of job performance, or criterion measures. A variety of criterion measures are used in this context (e.g., work samples, supervisor ratings). In the physical domain, the criterion measures can include work samples, supervisor ratings, and reductions in injuries, attrition, and medical costs. These types of measures are discussed in this article. However, there is no single best criterion, and one must consider the strengths and weaknesses of each type in relation to job outcomes.
This paper focuses on the types of criterion measures used in military and civilian research to provide an understanding of their utility as indicators of job performance. It includes an overview of the development of multiple types of physical criterion measures and a discussion of designing physical criterion measures for multiple jobs. We also address scoring metrics, standardization of criterion performance, administration, and identifying minimum acceptable criterion performance. Finally, we provide examples of physical criterion measures used in the military and civilian settings.
Similarities and differences between physical and other types of criterion measures
Taxonomies of human performance by Fleishman and Bloom describe four domains: physical, cognitive, psychomotor, and behavioral/affective (Fleishman, Quaintance, & Broedling, 1984; Forehand, 2010). Researchers have used objective and subjective criterion measures to assess these domains in personnel selection settings. It should be clarified that the physical and psychomotor domains are separate. The O*NET defines psychomotor abilities as those involved in the manipulation and control of objects (National Center for O*NET Development, 2022). It defines physical abilities as muscular strength, muscular endurance, cardiovascular endurance, flexibility, balance, and coordination. For clarity, we refer to the cognitive, psychomotor, and behavioral/affective domains as nonphysical. There are differences and similarities in the criterion measures used to assess the physical and nonphysical domains (Gebhardt & Baker, 2017; Kehoe & Sackett, 2017). Regardless of format, criterion measures must include important tasks, behaviors, knowledges, skills, and/or abilities. Accurate identification of these criterion components is especially important for combat and non-combat jobs because of the consequences of misidentifying levels of acceptable task skills and behaviors in the military selection process (e.g., improper equipment repair, injury, death).
Generating accurate criterion measures is one of the most difficult components of personnel research. Both physical and nonphysical criterion measures include work samples, supervisor/peer ratings, training outcomes, injury identification, and attrition. However, there are differences in the development, scoring, and administration of these criterion measure types. Table 1 provides an overview of the similarities and differences in physical and nonphysical criterion measure components for several criterion types (e.g., injury rates). For components that are comparable between the physical and nonphysical measures, the term “similar” is used. The differences between physical and nonphysical components across criterion types center on the inclusion of physiological, ergonomic, and biomechanical factors that affect human movement.
Table 1.
Unique physical performance components related to criterion measure types.
| Criterion Measure Type | Components Specific to Physical Criterion Measures | Components Similar to Nonphysical Criterion Measures |
|---|---|---|
| Work samples | (a) physiological parameters (e.g., aerobic); (b) ergonomic parameters (e.g., weight, distance); (c) biomechanical parameters (e.g., forces, torques); (d) specialized clothing and equipment worn; (e) safety and medical criteria for subject inclusion; (f) instructions describing movement patterns and disqualification criteria; (g) individual’s anthropometric structure; (h) burden of data collection (e.g., purchase/build equipment, cost, logistics) | (a) equipment for high-fidelity work samples; (b) tasks performed by multiple workers; (c) balance of practicality, reliability, and fidelity |
| Supervisor/peer ratings | | Behaviorally anchored rating scales with ergonomic data |
| Injuries, attrition, & production rates | | (a) time period until injury or turnover; (b) type of injury |
Overview of physical criterion measures
Work samples
Work samples provide an objective, structured method to measure the types of behaviors exhibited when performing physical job tasks and are the most common type of criterion used in physical test validity studies. The advantages of work samples are the direct assessment of job-relevant skills, use as a criterion measure or predictor, and face validity. Limitations include the inability of novices to perform tasks requiring practice/training, increased costs, and design complexity.
Lievens and De Soete (2012) identified seven factors that impact the effectiveness of work samples: (a) consistency between job and simulation behaviors, (b) content that reflects essential job parameters, (c) fidelity with job task(s), (d) response to dynamic cues (i.e., interactivity), (e) standardization in presentation of stimuli, (f) scoring, and (g) cost. Achieving these baseline standards for physical criterion measures can be more complex than for nonphysical measures due to the presence of machinery, construction of the work setting, safety, and anthropometric constraints. Further, replicating actual on-the-job performance is challenging because of the potential for criterion contamination. For example, physical work samples that rely on complex verbal instructions can lead to criterion contamination in which the cognitive load affects physical performance (Chan & Schmitt, 1997; Lievens & De Soete, 2012). Therefore, the criterion measures should assess only the physical demands, in terms of the duration and exertion required to perform occupational tasks.
The first step toward developing criterion measures is the completion of a job analysis that identifies the essential/critical job tasks and other behavioral components (T. J. Reilly et al., 2015; Lee-Bates et al., 2017). An analysis of physically demanding jobs includes gathering ergonomic data (e.g., via questionnaire) that delineate the arduousness of essential tasks (Gebhardt & Baker, 2017). Ergonomic measures are used to define levels of performance in terms of distance, duration, weight, height, reach (e.g., overhead), and other factors (Table 2). When direct measurement is unattainable (e.g., the task is performed in a variety of scenarios), questionnaires that address these parameters are an effective alternative. An example of an ergonomic question related to military personnel dragging a casualty is: “When you dragged a casualty, what percentage of the time did you drag a person at a distance of (a) less than 5 m, (b) 5–9 m, (c) 10–14 m, (d) 15–19 m, and (e) 20 m or greater?” Foulis et al. (2015) found that responses to this question indicated that 55% of the time, the drag distance was 15–19 m. Based on the responses coupled with the military training criteria, the distance selected for the criterion measure was 15 m (Foulis et al., 2015).
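The questionnaire-to-criterion step above can be sketched in a few lines. This is an illustrative aggregation only: the response percentages are hypothetical values loosely patterned on the published casualty-drag example, and the rule of taking the lower bound of the modal bin is one reasonable choice, not the authors' documented procedure.

```python
# Hypothetical mean percentage of time respondents reported each
# drag-distance bin (values invented for illustration).
responses = {
    "<5 m": 0.10,
    "5-9 m": 0.15,
    "10-14 m": 0.15,
    "15-19 m": 0.55,   # modal bin, as in the published example
    ">=20 m": 0.05,
}

# Select the modal bin, then use its lower bound as the criterion distance
# (consistent with the 15 m distance chosen in the article's example).
modal_bin = max(responses, key=responses.get)
criterion_distance_m = int(modal_bin.split("-")[0]) if "-" in modal_bin else None

print(modal_bin, criterion_distance_m)
```

In practice the aggregation would also weight responses by respondent experience and reconcile the result with training doctrine, as the article notes.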
Table 2.
Methods to gather ergonomic, physiological, and biomechanical essential task data.
| Ergonomic Data | Physiological Data | Biomechanical Data |
|---|---|---|
| (a) ergonomic questionnaire; (b) tape measure; (c) load cell or dynamometer for forces; (d) scale for weight; (e) observation (e.g., counting repetitions, body position); (f) research literature | (a) aerobic demand (oxygen consumption [VO2]/energy expenditure via portable gas analyzer); (b) heart rate (e.g., convert to aerobic demand, % of maximum); (c) research literature | (a) muscular force (e.g., push/pull) via load cell or dynamometer; (b) force and torque measurements from film analysis; (c) mathematical modeling; (d) research literature |
The second step involves selecting a subset of the essential physical tasks for criterion measure consideration based on their physical demands, frequency, feasibility of replication, and cost. A prime consideration is the balance between fidelity and feasibility. High-fidelity work samples have greater face validity but may be cost prohibitive, difficult to replicate, and not generalizable across multiple jobs when generalization is needed. When high-fidelity work samples are not feasible, lower fidelity work samples offer lower or similar cost and increased generalization across multiple jobs (Gebhardt & Baker, 2017; T. J. Reilly et al., 2015; T. Reilly et al., 2019).
The third step entails gathering the physiological, ergonomic, and/or biomechanical data that define the physical demands of the essential physical tasks in greater detail and, in turn, support the design of higher fidelity work samples and other evaluations (Table 2). Ergonomic measures are the most common data obtained in the workplace. Researchers collect heights, distances, and weights for the selected essential physical tasks using tape measures and scales. Load cells that convert mechanical force into digital values measure the forces required to move objects (e.g., drag equipment, clear a malfunction in gun systems). These data are input into the criterion measure design process to ensure replication of the job task demands. In addition, a worker’s uniform and cramped workspaces must be reflected in the work samples. For example, soldiers perform foot marches and other tasks while wearing a rucksack and body armor (e.g., 90–110 lb; Sharp et al., 2017). These uniform components (e.g., weight, restrictiveness) should be included in work samples to accurately reflect the physical demands of a task.
The primary physiological measure obtained is aerobic demand (VO2/oxygen uptake), which reflects the ability of the cardiovascular system (heart and lungs) to meet the increased oxygen demand of an activity. To assess aerobic demand, researchers use metabolic and cardio-pulmonary equipment to obtain measures of oxygen consumption during actual task performance (McArdle et al., 2014). For example, researchers used this methodology to define the demands of firefighter tasks on U.S. Navy ships and in public safety settings (e.g., Bilzon et al., 2001; Gledhill & Jamnik, 1992; Siddall et al., 2016). In addition, the Compendium of Physical Activities and other research contain lists of the aerobic demand of work tasks, daily living activities (e.g., walking), and sports activities (Ainsworth et al., 2000; McArdle et al., 2014).
Heart rate (beats/minute) is an alternate method to classify tasks by the level of exertion. The American College of Sports Medicine (ACSM) classifies heart rate response to activity using a scale related to the percentage of maximum heart rate (e.g., task heart rate/maximum heart rate; 130/180 = 72% of max; Riebe et al., 2018). Past research in the warehouse industry showed that orderfillers sustained 71–81% of their maximum heart rate across a 3–4-hour period indicating they were performing well above the normal working level (e.g., 40–50% of max; Gebhardt et al., 2006). Thus, work sample criterion measures should reflect the same level of aerobic and/or heart rate demand as the actual job tasks.
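The percent-of-maximum heart rate computation described above is simple enough to show directly. This sketch reproduces only the article's worked arithmetic (task HR divided by maximum HR); it does not reproduce the full ACSM classification table.

```python
def percent_max_hr(task_hr: float, max_hr: float) -> float:
    """Return task heart rate as a percentage of maximum heart rate."""
    return 100.0 * task_hr / max_hr

# The article's worked example: 130 beats/min during a task, 180 beats/min max.
pct = percent_max_hr(130, 180)
print(round(pct))  # ~72% of max, as in the text
```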
Biomechanical data provide insight into the forces and torques acting upon the muscular and skeletal systems during human movement. They provide information about the interaction between a worker’s movements and the task, equipment, and environment (Table 2). Ergonomic data and film analysis form the basis for modeling task performance and provide information to help ensure safe performance of work samples. An example of this approach involved determining the forces required to lift the head and foot ends of a patient-loaded stretcher into an ambulance (Gebhardt & Crump, 1984). The model used patient heights and weights obtained from the job analysis (i.e., 200 lb) and stretcher parameters (e.g., weight, length) to identify the forces required to lift the head (152 lb) and foot (135 lb) ends of the stretcher. Similarly, a film analysis (velocity and acceleration data) of two powered stretchers showed that lower back forces were lower for one stretcher than the other when loading a patient into an ambulance (Lad et al., 2018). Thus, when designing a criterion measure involving human transport, only the tasks least likely to incur injury (e.g., lifting the foot end, lower back forces) should be selected.
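The kind of static model behind the stretcher example can be sketched with a simple moment balance: treat the patient-loaded stretcher as a rigid beam and solve the two end forces from force and moment equilibrium. The weights, length, and center-of-mass location below are hypothetical parameters chosen for illustration; this is not the actual Gebhardt and Crump (1984) model.

```python
def end_forces(total_weight_lb: float, length_in: float, com_from_foot_in: float):
    """Forces at the head and foot ends of a beam supported at both ends.

    Moment balance about the foot end: F_head * L = W * d_com.
    Force balance: F_head + F_foot = W.
    """
    f_head = total_weight_lb * com_from_foot_in / length_in
    f_foot = total_weight_lb - f_head
    return f_head, f_foot

# Hypothetical inputs: 200-lb patient plus an assumed 87-lb stretcher, with
# the combined center of mass slightly toward the head end.
head, foot = end_forces(total_weight_lb=287, length_in=80, com_from_foot_in=42)
print(round(head, 1), round(foot, 1))
```

Because the center of mass sits closer to the head end, the head-end lift force exceeds the foot-end force, which is the rationale for selecting the foot-end lift in the criterion measure.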
Another factor that affects work sample design is when tasks are performed by two or more person teams. This occurs most often in manual materials handling, which is the process of moving objects horizontally and vertically (e.g., lift, carry, push, hold) from one location to another location (M. A. Sharp et al., 2019). The U.S. and British Armies reported 66% and 48% of their lift and lift/carry tasks, respectively, required two or more soldiers (M. A. Sharp et al., 2019). For instance, when developing a single-person work sample that simulates two soldiers loading a 25 mm gun barrel (107 lb) onto a vehicle, one must determine whether to distribute the weight evenly (53.5 lb) or use the weight incurred at the heavier end of the barrel. Therefore, the use of multi-person tasks as criterion measures is only viable if the researcher identifies an accurate single-person load.
When constructing work samples, one must consider whether it is practical to use the actual work setting and equipment or create a replication. In one study, it was not reasonable to use an $8,000 piece of aircraft equipment in the work sample. Thus, a substitute was fabricated to replicate the length, width, and weight of the aircraft equipment as shown in Figure 1 (Gebhardt, Baker, Linnenkohl et al., 2015).
Figure 1.

Example of simulation in confined spaces.
Note: Equipment held for specific duration. Copyright 2015 by Deborah Gebhardt.
We ensure the safety of individuals performing work samples by addressing the (a) feasibility of replicating actual equipment, (b) movement patterns, and (c) medical criteria for participation. The equipment must not only replicate that used in the actual job task but also be constructed to withstand the forces and torques applied to it during work sample performance. Although there is a desire to replicate the aspects of a task involving movement in multiple directions (e.g., move under fire, foot pursuit), researchers must account for the potential for slips, trips, and falls (e.g., a 90-degree turn followed by a 180-degree turn). Finally, individuals participating in work samples should be medically screened to ensure that their participation does not exacerbate an existing medical condition. Medical questionnaires such as the 2014 PAR-Q+ provide an avenue for screening participants, as do examinations by medical personnel (Riebe et al., 2018).
In summary, work samples should (a) not be overly complex or exceed the demands of the job; (b) strike a fidelity/feasibility balance to ensure reliability, effectiveness, and safety; and (c) minimize criterion contamination.
Job performance ratings
Although objective measures of job performance are preferable, they are not always feasible. Subjective measures (e.g., ratings) are used in the physical and nonphysical domains. In the physical domain, both supervisors and peers typically provide assessments of job performance. In some instances, peer ratings are needed because supervisors are unable to observe a subordinate’s performance. For example, military police patrol with a partner and the supervisor (e.g., sergeant) typically arrives after an incident is controlled. Thus, only the partner observed the physical aspects related to controlling the incident.
Behaviorally anchored rating scales (BARS) are the most common type of subjective assessment used in the physical domain. To achieve accurate ratings of individuals, it is important to specify a frame of reference that defines the performance dimensions, levels of quality, and rationale for distinguishing performance levels (Lievens & De Soete, 2012; Woehr, 1994). This involves defining the physical parameters of the job(s) within the scale anchors. The rating scales used in the physical domain are typically absolute (e.g., the rater selects one response) and contain 5–7 anchors with gradations of traits, which are intended to improve intra- and inter-rater reliability (Roch et al., 2012; Woehr & Roch, 2012). The scale anchors describe observable behaviors and include ergonomic data that define the physical demands (e.g., distance marched, weight of equipment, frequency of performance). If physical tasks have observable gradations (e.g., lift boxes weighing 20, 30, 40, and 50 lb), the rater selects the descriptor that best describes the worker’s performance. An example is: Able to lift a patient-loaded backboard from the ground to a standing position at the (a) foot end for a 150-lb patient, (b) head end for a 150-lb patient, (c) foot end for a 200-lb patient, (d) head end for a 200-lb patient, (e) foot end for a 250+-lb patient, and (f) head end for a 250+-lb patient (Gebhardt & Crump, 1984). When rating physical abilities (e.g., muscular strength), each ability is defined and combined with examples of tasks requiring moderate to high levels of that ability (Gebhardt & Baker, 2017). Relative scales, which use a forced distribution to place each employee on a continuum relative to other employees (e.g., worst = 0, moderate = 50, best = 100), are infrequently used in the physical domain due to the presence of multiple supervisors across different shifts and/or locations and the inability to ensure supervisors have an equal distribution of workers across job performance levels.
Job performance ratings can be direct (in real time) and indirect (past behaviors). Direct observation of task performance (work sample) is an evaluation of quantifiable criteria that includes effectiveness of task performance, quality of performance, safety violations, and task completion time. In the military and many industries (e.g., electric, natural gas), job tasks contain technical movement and safety components that if not adhered to can result in injury or death (e.g., Army Power Distribution Specialist [12Q], Air Force Electrical Systems [3E0X1]). The following example describes a high-fidelity work sample for a lineworker job that utilized ratings that addressed safety and quality of performance.
Researchers developed a work sample that consisted of climbing a utility pole to a height of 30 ft., performing tasks on the pole (e.g., install/remove a 90-lb piece of equipment), and descending the pole (Gebhardt et al., 2012). Due to the consequences of on-the-job performance mistakes, it was important to identify risky behaviors and safety violations. Subject matter experts (SMEs) identified quality of climbing and equipment installation as important to task efficiency and safety and identified the subtasks that represented performance components (e.g., take 8- to 12-inch steps during pole ascent). The subtasks for ascending a pole were rated using a dichotomous rating (yes/no) and a 5-point quality scale, along with errors. Figure 2 presents an excerpt of the rating form used in the evaluation process.
Figure 2.

Example of combined task and quality assessor ratings.
Other types of criterion measures
Table 1 presents four additional criterion measure types: (a) production rates, (b) attrition and absences, (c) injury rates, and (d) medical costs, which are measures of utility to an organization in terms of cost reduction. These measures may not be attainable because the data are (a) not available, (b) not in a useable format, or (c) lacking the detail required for effective analysis. In the physical domain, these variables are represented by either archival or prospective data. It can take years to amass an adequate sample size of prospective data for these types of measures.
Military and other organizations use injury and attrition data to assess the efficacy of training interventions, selection procedures, and modifications to task protocols. Due to the large number of annual U.S. Army (~90,000) and Air Force (~35,000) inductees, injury reduction is paramount to achieving lower medical costs and attrition (McGurk, 2018; Nye et al., 2016). Similar to other militaries (e.g., UK, Australia, Canada), the U.S. military has extensive databases to classify and interpret the impact of injuries on job performance in garrison and during deployment (Hauret et al., 2019). The U.S. Army has gathered injury data associated with combat training and injury risk for decades and continues today with research to identify physiological factors related to injury risk (e.g., Hughes et al., 2019; Jones et al., 1993; Knapik et al., 2001). Recently, the U.S. Army assessed the injury rates of trainees who passed the OPAT prior to entering Initial Entry Training. Recruits who were not injured during initial training (i.e., no medical calls) had significantly higher OPAT scores than those who were injured (Hauret et al., 2018). Finally, a reduction in injuries translates to lower medical and attrition costs. Assessing turnover of incumbents utilizes data acquisition and analysis techniques similar to those used for injury assessment.
Today’s warehouse operations (e.g., Amazon) maximize efficiency by instituting a floor plan that optimizes process flow, organizes merchandise storage in relation to product type and weight, minimizes product handling, and uses electronic tracking systems (Naqvi et al., 2001). Parameters such as distance traveled to complete an order, number of items in the order, and shelf heights provide the basis to define an engineered standard production rate: the time for a qualified worker (orderfiller) to complete an order and the number of orders processed during a shift. The engineered standard accounts for normal worker fatigue, work delays, and performing job tasks safely. A tracking system gathers an orderfiller’s actual time to complete an order and the type of order (e.g., weight, number of products) and compares it to the engineered standard to produce a production rate. Orderfillers must process orders at a specified percentage (e.g., 95%) of the 100% engineered standard to retain their jobs. Past research found that production rates were significantly related (p < .01) to selection tests and work samples in a warehouse setting (Gebhardt et al., 2009).
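The comparison against the engineered standard described above reduces to a ratio. The sketch below illustrates the idea with hypothetical times and the 95% retention cutoff mentioned in the text; the parameter names and numbers are invented for illustration.

```python
def production_rate(standard_minutes: float, actual_minutes: float) -> float:
    """Performance as a percentage of the engineered standard.

    Completing an order faster than the standard yields a rate above 100%.
    """
    return 100.0 * standard_minutes / actual_minutes

# Hypothetical order: engineered at 20.0 minutes, completed in 21.5 minutes.
rate = production_rate(standard_minutes=20.0, actual_minutes=21.5)
meets_retention_threshold = rate >= 95.0  # e.g., the 95% cutoff from the text
print(round(rate, 1), meets_retention_threshold)
```

A real system would aggregate rates over a shift and adjust the standard for order type (weight, item count), as the article describes.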
In summary, the use of injury, attrition, and production data requires (a) an intact data collection process, (b) a detailed database for classifying and interpreting these three types of data, and (c) demographic variables (e.g., sex, time of hire, age) to determine the impact of ancillary variables (e.g., years in job) on job/criterion measure performance.
Designing criterion measures for multiple jobs
Designing criterion measures for a single job involves identifying essential tasks, physiological and ergonomic parameters, and scoring procedures. However, many organizations have multiple physically demanding jobs, and it becomes untenable from a labor and cost standpoint to develop separate criterion measures for each job. To address this issue, Gebhardt and Baker (2015) developed a model for classifying large numbers of jobs by their physical demands. The classification system consists of 16 movement categories (MvCat) such as lift, carry, climb, dig, crawl, and others. Each MvCat contains subdivisions that define the ergonomic, physiological, and work environment parameters (e.g., duration, distance, surface condition) that affect the physical demands of a task. Table 3 shows an example of the Hold MvCat that includes gradations of ergonomic parameters associated with the hold height, body position, and distance from torso. Each parameter includes descriptors of performance difficulty in ascending order. For instance, one can hold a 30-lb kettle ball with one hand at chest level and be incapable of holding the same weight 12 inches anterior to the torso.
Table 3.
Hold movement category by ergonomic parameter gradations.
| Height Held At | Body Position | Hold Distance from Torso |
|---|---|---|
| 1 = Ankle level | 1 = Standing upright | 1 = 1–6 inches |
| 2 = Knee level | 2 = Stooping/squatting | 2 = 7–12 inches |
| 3 = Waist level | 3 = Kneeling | 3 = 13–18 inches |
| 4 = Chest level | 4 = On back | 4 = 19–24 inches |
| 5 = Shoulder level | 5 = On stomach | 5 = 25 inches or greater |
| 6 = Above shoulder level | 6 = On side | |
MvCat data are gathered via observations and questionnaires for the essential physically demanding tasks within each job of interest. The data for each MvCat (e.g., hold, climb) are consolidated across jobs to identify similarities and differences in the physical demands. The development of the criterion measures focuses on the MvCats that are most prevalent across the jobs and includes gradations of performance to accommodate jobs with lesser or greater physical demand. For instance, a work sample involving holding equipment in place for installation may include holding several pieces of equipment of increasing weight at waist, chest, and shoulder level to simulate tasks across multiple jobs. Thus, levels of performance can be defined for each job (e.g., time to complete each hold activity, inability to perform some task segments). Using the MvCat system is an efficient approach to identify jobs that are highly similar and dissimilar in physical demand, thus enabling the design of common criterion measures applicable to multiple jobs. For example, this methodology allowed for identifying the physical demands of 32 jobs in the shipbuilding industry (e.g., very heavy to light) and resulted in the design of four work samples that addressed the essential tasks and MvCats across all jobs (Gebhardt & Baker, 2016; Gebhardt, Baker, Volpe et al., 2015).
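The consolidation step above can be sketched as a simple prevalence count. The job names, MvCat assignments, and gradation values below are hypothetical; the point is only the mechanic of ranking MvCats by how many jobs share them and tracking the demand range each common measure must cover.

```python
from collections import defaultdict

# Hypothetical data: for each job, the MvCats flagged on its essential tasks,
# with the maximum gradation observed (higher = more demanding).
job_mvcats = {
    "Rigger":      {"lift": 5, "carry": 4, "hold": 3},
    "Electrician": {"climb": 5, "hold": 4, "carry": 2},
    "Welder":      {"hold": 5, "crawl": 3, "lift": 3},
}

# Count how many jobs involve each MvCat and collect the gradations seen.
prevalence = defaultdict(list)
for job, cats in job_mvcats.items():
    for cat, gradation in cats.items():
        prevalence[cat].append(gradation)

# MvCats present in the most jobs are candidates for common criterion measures;
# the gradation spread shows the difficulty range the measure must span.
ranked = sorted(prevalence.items(), key=lambda kv: len(kv[1]), reverse=True)
top_cat, gradations = ranked[0]
print(top_cat, len(gradations), min(gradations), max(gradations))
```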
Scoring physical criterion measures
Physical criterion measures may be scored based on: (a) time to complete a task, (b) number of repetitions completed in a specific time period, (c) force produced, (d) completion of subtasks (e.g., checklist), and (e) categorical and quality ratings of performance. Regardless of scoring type, the unit of measurement must reveal individual differences in performance.
Work samples
The two most common approaches to scoring work samples are time to complete a task and number of repetitions completed in a specific time period. The choice between them depends on the type of job being assessed. When speed is a factor in performance, time to complete is the appropriate scoring metric. Examples of soldier tasks with varying speed requirements include casualty evacuation, foot march, and move under fire (Foulis, Sharp et al., 2017). When a task has a standard completion time, the other scoring metric, number of repetitions completed in a specific time period, is employed. Tasks in the military and civilian sectors with a specified time frame include carrying supplies from storage to the worksite and equipment repair/installation. An example occurs in a warehouse work sample that replicates completing orders in a defined time period (e.g., Gebhardt et al., 2009). Thus, it is important to determine the speed required to successfully complete a job task and reflect that pace in the work sample.
The third scoring method uses a load cell to measure the force exerted during a work sample. An example is placing a load cell behind a steel plate to record the force of a sledgehammer striking the plate (International Association of Fire Fighters (IAFF), 2007). The value recorded is compared to the force required to break down doors at a fire. The task is completed when the cumulative forces from multiple strikes reach the level needed to complete an entry task.
Supervisor/Peer/SME ratings
Typically, supervisors, peers, and SMEs provide direct and indirect evaluations of tasks, abilities, safety, and quality of performance using BARS with numerical values for each anchor (Smith & Kendall, 1963). Common scoring metrics include the mean and the sum of the ratings. In many instances, assessors provide ratings for multiple types of constructs (e.g., behaviors, abilities). To expand the criterion space for these ratings, researchers may generate a single composite score to define job performance.
Use of multiple types of criterion measures
Many validation studies utilize multiple criterion measures (e.g., work samples, supervisor ratings) in an effort to expand the criterion space and enhance the prediction of job performance. Factors to consider when generating composite criterion measures are the (a) relevance of each criterion to job performance, (b) number and spread of observations in the criterion space, (c) percentage of content covered by each criterion measure, and (d) statistical properties (e.g., reliability, range restriction) of each criterion. These factors and weighting strategies (e.g., standardized scores) are similar for physical and nonphysical work. For an overview of generating composite criterion measures that include objective and subjective data, see Borman and Smith (2012).
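One common weighting strategy noted above is a unit-weighted composite of standardized scores. The sketch below illustrates it for two hypothetical criterion measures; the measure names and data are invented, and the negation of completion time (so that higher always means better) is one standard convention, not a prescription from the article.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of scores to mean 0, SD 1 (population SD)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical data for five participants: work-sample completion time
# (seconds; lower is better, so negated below) and supervisor BARS ratings.
times = [95, 110, 102, 120, 88]
ratings = [6, 4, 5, 3, 7]

z_time = [-z for z in zscores(times)]   # negate so higher = better
z_rating = zscores(ratings)

# Unit-weighted composite: mean of standardized components per participant.
composite = [(t + r) / 2 for t, r in zip(z_time, z_rating)]
best = composite.index(max(composite))
print(best)
```

Differential weights (e.g., by criterion relevance or reliability) would replace the simple average in the last step.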
Standardizing criterion measure administration
Work samples
Standardizing criterion measures is key to criterion reliability and effectiveness. Factors to address for work samples are: (a) equipment, (b) environment, (c) location logistics, (d) administration and instructions, and (e) administrator training. Physical criterion measures use a variety of equipment ranging from the actual job apparatus to replicas that must be inspected prior to each criterion measure administration for maintenance and safety issues. For instance, a casualty drag task must use a standardized mannequin and drag surface (e.g., carpet) to ensure that the force required to move a mannequin does not vary across multiple trials and locations (Frykman et al., 2019; Gebhardt & Baker, 2017).
Temperature extremes (e.g., ≥85°F) coupled with high humidity can have a detrimental impact on physical performance (e.g., muscular endurance) and a potential for heat stress due to increased core temperature and the effects of dehydration (Riebe et al., 2018). Thus, researchers use the Wet Bulb Globe Temperature (WBGT) Index to determine whether to conduct or cancel administration sessions in hot environments (Department of the Army, 2016; Department of the Navy, 2016). The WBGT Index uses flag colors to designate levels of physical activity in relation to temperature conditions. The flag colors and temperatures range from green (80–84.9°F), requiring discretion when doing heavy exercise, to black (90°F and above), resulting in suspended activity (Liljegren, 2008).
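The flag lookup lends itself to a small sketch. Only the green and black bands are stated in the text; the intermediate yellow and red cutoffs below follow commonly published Army heat-injury guidance and should be treated as illustrative assumptions.

```python
def wbgt_flag(wbgt_f):
    """Map a WBGT Index reading (degrees F) to a flag category used to
    regulate physical activity. Green and black bands follow the text
    (Liljegren, 2008); yellow/red cutoffs are assumed for illustration."""
    if wbgt_f >= 90.0:
        return "black"   # suspend strenuous physical activity
    if wbgt_f >= 88.0:
        return "red"     # assumed intermediate band
    if wbgt_f >= 85.0:
        return "yellow"  # assumed intermediate band
    if wbgt_f >= 80.0:
        return "green"   # use discretion when doing heavy exercise
    return "no flag"     # below 80 F: no restriction

for reading in (78.0, 82.5, 86.0, 91.4):
    print(reading, wbgt_flag(reading))
```

A researcher scheduling work sample sessions could log each session's WBGT reading and cancel automatically whenever the flag reaches black.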
Cold temperatures (e.g., 32–40°F) can also result in decreased physical performance (e.g., muscle weakness, decreased dexterity, and speed) due to body core heat loss and vasoconstriction at the extremities (Young et al., 1996). For example, tasks involving manipulation of tools or turning of valves would be affected due to decreases in grip strength. In summary, researchers must evaluate the intensity and duration of the activities in relation to WBGT and cold temperatures (Department of the Army, 2016; Department of the Navy, 2016).
In many instances, administration of criterion measures occurs in multiple locations. To standardize settings, the layout of the criterion measure must be identical in terms of length, width, movement patterns, equipment placement, and turn angles. It is not acceptable to set up a work sample with a 90-degree turn in one location and a 45-degree turn in another location. In addition, the floor surface at each location should be comparable.
Finally, the instructions for performing a work sample must be detailed but not too cumbersome. They should include step-by-step directions that specify the type and sequence of actions, and those actions that result in performance termination. For example, instructions for a manual materials handling work sample specified the (a) sequence for lifting and moving objects to multiple height platforms, (b) acceptable pace (brisk walking), and (c) criteria for stopping performance (e.g., running; Gebhardt et al., 2009). In addition, administrator training is important to ensure participants receive clear instructions and timely sequential cues when performing a work sample. Adhering to these administration components results in work samples that show consistent performance by workers and applicants.
Job performance ratings
Standardizing subjective ratings involves setting the context, providing rater training and clear instructions, and motivating raters. Past research illustrates that rater training leads to increased rater accuracy and reduction in rater error (e.g., halo, range restriction; Clark & Rooney, 2021; Murphy & Balzer, 1989; Woehr & Roch, 2012). Providing clear instructions and selecting individuals who have sufficient knowledge of ratees’ performance helps reduce rater errors. Finally, raters’ motivation to provide accurate appraisals increased when raters were assured of the confidentiality of their ratings and informed that their rating accuracy would dictate the efficacy of future applicant selection procedures (Park & Hubert, 2017).
Identification of minimally acceptable criterion measure performance
Identifying minimum proficiency for physical criterion measures involves a multifaceted challenge and should adhere to legal and professional standards (e.g., Canadian Human Rights Act, RSC, 1985; Equality Act (UK) 2010; Civil Rights Act of 1964 (Title VII), 42 U.S.C. §2000e-2, et seq, 1964). Setting minimum standards involves the integration of job analysis data, physiological and ergonomic parameters, and training standards, along with input from SMEs. Further, minimally acceptable criterion performance should align with normal expectations of proficient job task performance (Cascio, 1998; Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor and Department of Justice, 1978).
In this section, we cover four approaches to defining minimum levels of performance for work samples and describe approaches for using supervisor/peer ratings and multiple criterion measures to set standards. The first approach uses job analysis data from the frequency and time spent scales. The time spent scale (e.g., 10 s, 5 min) indicates the task duration, and the frequency scale assists in determining the number of times a task is completed. These two metrics serve as the basis for setting a specific duration for a work sample and the number of task iterations (e.g., replenish ammunition cans) within that timeframe. This information can be enhanced with ergonomic data indicating that a specific criterion was met (e.g., force exerted with a sledgehammer).
The second approach utilizes established training standards. However, training standards do not always reflect current on-the-job performance and thus should be evaluated prior to their use in setting a minimum criterion measure standard. For example, the Army Training and Doctrine Command (TRADOC) supplied the standards (e.g., drag a 207-lb casualty 15 m in 60 s) for the essential physical tasks common to multiple Combat Arms Military Occupational Specialties (MOS; Foulis et al., 2015). To assess the efficacy of these standards, Army personnel observed over 500 soldiers performing the tasks. If fewer than 90% of the soldiers met the training standard for a task, the standard was revised until 90% could perform the task. This process resulted in modifications to several standards and task statements to reflect soldier performance.
The third technique involves the use of pacing studies that employ videotapes of work sample performance at varying speeds. This approach is appropriate for time-sensitive occupations in which an unacceptable pace may pose a threat to the safety of coworkers or others. For example, a firefighter-pacing study used six videotapes of a fire suppression evolution, ranging from very fast (1 SD faster than mean) to very slow (3 SD slower than mean), to identify the minimum acceptable pace (Sothmann et al., 2004). A sample of female and male firefighters (n = 41) indicated whether the videotape performance paces were acceptable or unacceptable. Statistical analyses determined the percentage of firefighters who correctly identified the time sequence of the paces (slowest to fastest) and the percentages who classified each pace as acceptable or unacceptable. This approach resulted in the identification of a minimally acceptable pace for the criterion measure and served as a benchmark for acceptable job performance.
The fourth approach uses a combination of benchmarking and other information. A U.S. Air Force study utilized airmen’s performance distributions for several work samples coupled with the estimates of acceptable criterion performance by incumbent airmen who performed the actual tasks in training and deployment operations (Robson et al., 2020). The airmen’s estimates were 1.4 times slower than their actual performance times. To ensure that airmen were trained to perform their job tasks, the researchers discarded the lowest 10% of the airmen’s time estimates. Senior personnel reviewed all data and used a consensus approach (Cizek, 2012) to finalize the minimum acceptable standards.
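A rough sketch of the trimming step is below. The helper name, the interpretation of "lowest 10%" as the most lenient (slowest) estimates, and the use of the slowest surviving estimate as the provisional cutoff are all assumptions for illustration, not the exact Robson et al. (2020) procedure, which finished with a consensus review by senior personnel.

```python
def provisional_standard(time_estimates_s, trim_frac=0.10):
    """Drop the most lenient (slowest) trim_frac of incumbents'
    acceptable-time estimates, then take the slowest remaining
    estimate as a provisional minimum standard for SME review.
    Both choices are illustrative assumptions."""
    ordered = sorted(time_estimates_s)  # fastest to slowest
    n_keep = max(1, int(len(ordered) * (1 - trim_frac)))
    return ordered[:n_keep][-1]

# Hypothetical acceptable-time estimates (seconds) from ten incumbents
estimates = [60, 62, 64, 66, 68, 70, 72, 74, 76, 120]
print(provisional_standard(estimates))  # the 120-s outlier is trimmed
```

Whatever value the trimming produces would still be reviewed and finalized by consensus, as in the study.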
When using ratings as a criterion measure, alone or with other data, the anchors should define specific job relevant components and behaviors of varying degrees of acceptable and unacceptable performance. The most efficient approach to determining a minimum acceptable level for ratings is predetermining the anchor level that describes the minimum acceptable behaviors and attributes (Gebhardt et al., 2009).
Pretesting criterion measures
Pretesting of physical and nonphysical work samples and ratings addresses issues related to standardization of the criterion measure content, administration protocols, and scoring procedures. Although pretesting is similar for both physical and nonphysical criterion measures, pilot tests of physical work samples address additional issues related to equipment, layout, instructions, level of difficulty (e.g., too hard, too easy), and safety. To assess these issues, a sample of job incumbents who represent the demographic make-up (e.g., sex, race/ethnicity, job tenure) of the job population should complete the pretest. This is important because of differences across incumbents in relation to their anthropometric proportions, interpretation of instructions, and job performance ability.
In both military and civilian settings, there is a paucity of women performing arduous jobs. Thus, women should be recruited for pretests to assess their performance level, to ensure they provide input into the parameters of performing arduous job tasks, and for the future defensibility of selection procedures based on the criterion measures. For example, there were no women in the Army’s Combat Arms MOSs to complete the criterion measures related to the OPAT (Foulis, Redmond et al., 2017; Foulis, Sharp et al., 2017). Thus, the research team recruited female soldiers from other physically demanding MOSs. Similarly, when there were no women in a statewide fire department, researchers recruited 30 female firefighters from different departments in two adjoining states to ensure the work samples were appropriate for both women and men (Gebhardt & Baker, 1999).
Finally, direct observational ratings of work sample performance require assessors to simultaneously observe, evaluate, and record performance. A pretest detects weaknesses in (a) performing multiple evaluation tasks simultaneously, (b) rating form configuration, and (c) differences across raters. It also addresses calibration issues among raters (e.g., large scoring differences), thus indicating a need for more rater training. Similarly, a pretest of non-direct observational ratings can identify issues such as a need for more training and scale anchor revisions.
Determining the efficacy of work samples and job performance ratings
Measurement of the efficacy of work samples and job performance ratings is similar across the physical and nonphysical domains. Issues with ratings (e.g., rating error) have been detailed in the literature many times (e.g., Murphy, 2008; Putka, 2017; Woehr & Roch, 2012). Thus, we will focus on physical work samples.
Methods used to evaluate the reliability of physical work samples include test–retest, repeated measure analysis of variance, and intra-class correlation coefficients (ICC; M. A. Sharp et al., 2019; Crocker & Algina, 1986; Gebhardt & Baker, 2017; Shrout & Fleiss, 1979). When using a test–retest approach, it is important to allow adequate time periods (e.g., 1 hour to 1 day) between trials to ensure fatigue or other forms of criterion contamination are not factors in subsequent trials (Baker & Gebhardt, 2012). Further, the learning effect and performance improvement across trials will dictate the number of trials required (Gebhardt & Baker, 2017). Foulis et al. (2017) found that non-combat arms soldiers’ performance of combat arms criterion tasks improved significantly from the first to second trials and the second to third trials for more complex tasks (e.g., move under fire). Past research by the Canadian Forces found similar work sample reliability results using ICCs (Stockbrugger et al., 2018). Thus, the impact of job experience and improvement across trials should be investigated to determine whether the criterion measures should be modified, or whether additional trials are needed.
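For the ICC approach, a single-measure consistency coefficient can be computed directly from a subjects-by-trials score layout. The sketch below implements ICC(3,1) from the Shrout and Fleiss (1979) taxonomy using only the standard library; the drag-task times in the example are hypothetical.

```python
def icc_3_1(data):
    """Single-measure consistency ICC(3,1) from a two-way
    subjects x trials layout (Shrout & Fleiss, 1979).
    data: n subject rows, each holding k trial scores."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# Hypothetical casualty-drag times (s) for four soldiers over two trials
trials = [[10.0, 11.0], [20.0, 22.0], [30.0, 29.0], [40.0, 41.0]]
print(round(icc_3_1(trials), 3))  # -> 0.995
```

Other ICC forms (e.g., absolute agreement) use the trial mean square as well; the consistency form shown here ignores a uniform shift between trials, such as a learning effect.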
Examples of physically oriented criterion measurement studies
The above sections described a variety of physical performance criterion measures and issues related to their design and efficacy. This section provides examples of physical criterion measure implementation in military and civilian settings.
U.S. Army OPAT
In 2013, the Department of Defense (DoD) rescinded its direct combat assignment rules, thus providing women with the opportunity to serve in direct combat roles in all military branches. This change in DoD policy was codified by the U.S. Congress with the passage of the Carl Levin and Howard P. “Buck” McKeon National Defense Authorization Act for Fiscal Year 2015 (P.L. 113-291; 128 Stat. 1919; 2014, 2015) that mandated the standards for each combat arms job be based on documented job requirements and that there should be no “artificial barriers” to women’s entry into a combat arms career field. To address the Congressional mandate, the U.S. Army conducted a four-year study to provide physical assessment procedures for individuals entering seven combat arms MOSs (e.g., Infantryman, Armor Crewman). Here we summarize the design, evaluation, and implementation of the criterion measures used to validate a physical test battery.
The Army conducted a job analysis involving focus groups and surveys that identified 32 essential physically demanding tasks across the seven MOSs (Sharp et al., 2017). Based on input from TRADOC and high-level SMEs, the Army identified a subset of these critical tasks for simulation. To assess the aerobic and strength demands of the subset tasks, researchers gathered measures of oxygen consumption (VO2), heart rate, ratings of perceived exertion, and force for each task (Foulis et al., 2015; Foulis, Sharp et al., 2017). These data, along with consideration of work sample feasibility and applicability to the seven MOSs, resulted in the design of eight criterion measure task simulations (CMTS; Table 4). Figure 3 illustrates loading ammunition (25 kg) and demonstrates the difficulty performing this task in a limited space while wearing approximately 22 kg of task-specific equipment.
Table 4.
U.S. Army Criterion Measure Task Simulations (CMTS).
| CMTS | Ergonomic Parameters |
|---|---|
| Casualty Evacuation (BFV/Abrams Tank) | Vertical lift of incremental weight (23–95 kg) in 4.5 kg increments |
| Casualty Drag | Drag casualty (123 kg) 15 m |
| Foot March | March 6.4 km |
| Prepare Fighting Position-Sandbag Carry | Carry 16 sandbags (18 kg each) |
| Move Under Fire | Rise from prone position, sprint for 3–5 s, & lower body; repeat across 100 m |
| Field Artillery Ammunition Supply (FAASV) | Transfer 30 rounds (45 kg each) from floor to ammunition rack |
| Stow Ammunition on Abrams Tank | Transfer 18 rounds (25 kg each) from supply area to deck of tank |
| Load main gun on Abrams Tank | Move 120 mm rounds (25 kg each) from ready rack to main gun area in confined space |
Figure 3.

Soldier loading rounds from Abrams tank ready rack into main gun breech.
Note: Load Ammunition. Copyright 2014 by Deborah Gebhardt.
A unique challenge to the research was the absence of women in the seven MOSs at all phases of the study. Therefore, the Army recruited female soldiers to participate in the physical data collection phases. To evaluate the reliability of the CMTS, male and female soldiers (men = 79, women = 70) from 48 different MOSs completed all CMTS two to four times over a multiday period. The reliability of the CMTS using ICCs was .87 or higher, except for the Foot March (.76; Foulis, Redmond et al., 2017).
Due to safety and logistics parameters (i.e., space requirements, resources, multiple administration locations), it was not feasible to administer MOS simulations at the Military Entrance Processing Stations (MEPS). Therefore, the Army chose to use basic ability predictors that evaluate the physical abilities required to perform in combat arms jobs. Over 800 soldiers (608 men, 230 women) completed the eight CMTS and 14 basic physical ability tests (e.g., Beep Test, Seated Power Throw, Handgrip). The validity results yielded a four-test model (Seated Power Throw, Squat Lift, Beep Test, Standing Long Jump) that significantly predicted CMTS performance (adjusted R2 = .79, p < .01). A longitudinal study confirmed this relationship with an R2 of .70 (p < .01; Sharp et al., 2018).
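As a sketch of the statistics involved, the single-predictor validity coefficient and the adjusted R2 reported for the test models can be computed as follows. The r and n values in the example are hypothetical illustrations, not the study's data, and multi-predictor models would additionally require a matrix least-squares fit.

```python
import math

def pearson_r(x, y):
    """Validity coefficient between a predictor test and a criterion."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def adjusted_r2(r2, n, p):
    """Shrink R-squared for sample size n and number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical: multiple R of .89 for a 4-predictor model on n = 838
r2 = 0.89 ** 2
print(round(adjusted_r2(r2, n=838, p=4), 3))  # -> 0.791
```

With samples in the hundreds, the adjustment barely moves R2; it matters most when validation samples are small relative to the number of predictor tests.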
The Army generated four OPAT passing score levels (Black, Gray, Gold, White) based on the physical demands of Army MOSs. The Black category passing scores aligned with the combat arms MOSs, while the Gray and Gold categories aligned with MOSs with lower physical demands. The White category represented OPAT scores below the Gold category. Use of these score categories showed that 76% of the recruits who achieved the OPAT passing score for their MOS met minimum performance standards for their MOS CMTSs. To further evaluate the efficacy of the OPAT, the Army conducted injury and attrition analyses. The injury analysis, conducted after 10 weeks of Initial Entry Training, found that higher OPAT composite scores within the Gray (p < .02) and White (p < .01) groups were associated with significantly lower injury rates (Hauret et al., 2018). Further, recruits in the lower scoring White category had higher attrition rates (p < .001) than those in the Black category.
The costs associated with attrition were substantial. TRADOC indicated the cost to train a soldier from recruitment to Initial Entry Training graduation is $50k–75k with a recruiting cost of $22,334 per recruit (Hauret et al., 2018; McGurk, 2018). In 2016, the Army had an attrition rate from training of 11.4% across 10,795 recruits (Research and Analysis Directorate, 2018). Using only the recruiting costs ($22k) and medical costs ($872) associated with an injury, a 1% reduction in attrition when using OPAT would yield an estimated cost savings of at least $2.5 million per year (Hauret et al., 2018; Research and Analysis Directorate, 2018). Regardless of the method used to calculate potential savings, these values are conservative when one considers the costs associated with recruiting and training additional soldiers the next year for slots not filled due to initial training attrition the previous year.
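The savings arithmetic above can be reproduced directly. The helper below is a sketch using only the recruiting and medical figures quoted in the text; the function name is illustrative.

```python
def attrition_savings(n_recruits, reduction_pts, recruit_cost, medical_cost):
    """Conservative annual savings from cutting training attrition by
    reduction_pts percentage points: each retained recruit avoids a
    replacement recruiting cost and one injury-related medical cost."""
    retained = n_recruits * reduction_pts / 100.0
    return retained * (recruit_cost + medical_cost)

# Figures cited in the text: 10,795 recruits, $22,334 recruiting cost,
# $872 medical cost per injury, and a 1-point attrition reduction
print(f"${attrition_savings(10_795, 1.0, 22_334, 872):,.0f}")
```

The result is roughly $2.5 million, consistent with the estimate in the text, and it excludes the $50k–75k training cost per soldier, which is why the published figure is described as conservative.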
U.S. Air Force
With the congressional mandate to open all jobs in the military to women, the U.S. Air Force conducted a study to evaluate the validity of their entry-level physical test and determine whether additional physical tests (e.g., aerobic demand) would better ascertain a recruit’s physical capabilities. The Air Force utilizes a strength-based assessment to evaluate whether recruits meet the physical demands of their Air Force Specialty Code (AFSC; Air Force, 2013). This assessment, the Strength Aptitude Test (SAT), uses a weight-lifting machine to measure the ability to lift specific weights (40–200 lb) in 10-lb increments to a height of 6 ft. (McDaniel et al., 1983).
To assess the validity of the SAT and additional tests for evaluating recruits, Gebhardt and Baker (2015) initially used their movement category model (see earlier section in this paper “Designing Criterion Measures for Multiple Jobs”) to identify a common set of MvCats associated with physical tasks for over 100 AFSCs (Gebhardt, Baker, & Linnenkohl, 2015). This analysis showed that 75% of the AFSCs contained the following MvCats: lift, carry, push/pull, climb, stand, kneel, hold, and operate non-powered tools. Review of these MvCats by their ergonomic parameters across the AFSCs resulted in generation of four common work sample criterion tasks (Table 5).
Table 5.
U.S. Air Force criterion measures.
| Criterion Measures | Ergonomic Parameters |
|---|---|
| Lift/Carry | Lift and carry 10 pieces of equipment (17–59 lb) to/from different heights (30–72 in) |
| Push/Pull | Push/pull aerospace ground equipment (e.g., portable lights, carts; 500–1,500 lb) |
| Climb/Carry | Carry 12-ft. ladder, climb 24-ft. ladder to 12 ft., and move 14-lb toolbox to 2 positions |
| Hold | Hold equipment at different heights (chest and above shoulder level; 14–45 lb) from standing and kneeling/squatting positions |
Airmen from representative AFSCs performed the work sample criterion tasks on consecutive days, yielding test–retest reliabilities of .77 to .93 (p < .01). During the validity study, airmen performed the four criterion tasks and basic physical ability tests involving muscular strength, muscular endurance, and aerobic capacity. The statistical analyses generated and compared linear and non-linear models. The linear model resulted in an R2 of .58 for the SAT (Gebhardt, Baker, & Linnenkohl, 2015; Robson et al., 2019). Adding a second predictor, Arm Endurance (AE), increased the R2 to .70. The non-linear models’ R2s for the SAT and SAT plus AE were lower (e.g., SAT+AE, R2 = .63), and the one- and two-test linear models were more accurate. In summary, classifying job tasks by movement category provided a methodology to develop criterion measures for over 100 AFSCs and resulted in establishing the validity of the SAT and an additional test.
Warehouse industry
A study in the warehouse industry used production rates, along with other criterion measures, to validate a manual materials handling physical test (Gebhardt et al., 2009). Organizational archive data showed that an individual orderfiller completes approximately 20 orders per shift or 1,740 orders in an 8-month period. Orderfillers retrieved products of varying sizes from different shelf heights and stacked them on pallets to heights up to 6 ft. (Gebhardt et al., 2009). The total weight of products handled in a shift was greater than 6,000 lb. To ensure an adequate level of experience and account for days off (e.g., holidays, vacation), study subjects had to have completed a minimum of 1,000 orders in the 8 months prior to the study.
The study employed three criterion measures: (a) production rate, (b) work sample, and (c) supervisor ratings. Production rates were retrieved from the tracking system that gathers an orderfiller’s actual time to complete an order and type of order (e.g., weight, number of products) and compares it to the engineered standard to produce a production rate (i.e., percentage of the engineered standard). Orders completed more slowly than the engineered standard time resulted in a production rate less than 100% (e.g., 92%).
The work sample included essential job tasks and engineering data that specified the (a) types, sizes, and weight of the grocery items (e.g., 20-lb bag of potatoes, 80-lb piece of meat); (b) heights of storage shelves and pallets; and (c) number of items moved in a typical order. The scoring metric for the work sample was time to complete an order. For the third criterion measure, supervisors rated subjects on their job performance and their physical capabilities relative to job performance.
The predictor tests consisted of basic physical ability assessments that evaluated muscular strength, muscular endurance, and flexibility (e.g., Arm Endurance, Step Test). All three criterion measures were significantly related to at least six of the eight predictor tests. The work sample had the highest correlations (r = .34–.68), followed by the ratings (r = .31–.49) and production rate (r = .13–.22). The three criterion measures were combined into an equally weighted composite. Use of the composite criterion measure yielded a 3-test battery (i.e., Carton Lift, Arm Endurance, Sit-ups) with an R2 of .44. The lower R2 for this study compared to the Army and Air Force studies most likely occurred because the military studies included only work sample criterion measures, while this study included three different criterion measures.
Electric industry lineworkers
Lineworkers in the military (e.g., Army Power Distribution Specialist MOS, Air Force Electrical Systems AFSC) and civilian electric industry install, maintain, and repair electrical power infrastructures on overhead energized and non-energized structures (e.g., utility poles) at heights of 30 to over 100 ft. above the ground. They ascend utility poles manually by using climbers (spikes, gaffs) and a climbing belt (20–30 lb), while wearing personal protection equipment. Lineworkers must also be capable of lifting and installing heavy equipment (50–90 lb) on a pole. Line work is time and safety sensitive and can lead to on-the-job injuries and death (e.g., fall from pole).
An electric utility industry company wanted to hire fully qualified journeyman lineworkers. To be a journeyman, a lineworker must demonstrate mastery of the physical, psychomotor, and cognitive job components and possess a journeyman card/certification. To ensure the applicants had the knowledge, skills, and abilities to safely perform pole climbing tasks “day one” on the job, researchers designed three job simulation tests of increasing proficiency that required (a) climbing to a height of 65 ft. (65 Foot Pole Climb), (b) installing a 90-lb piece of equipment onto the pole (Frame Pole), and (c) using a 6-foot pole (hot stick) at a height of 30 ft. to attach equipment to the electrical lines (Ground Conductor; Gebhardt et al., 2012). The scoring for the three work samples consisted of the time to complete a work sample, a checklist of items performed, quality of performance ratings, and performance errors. For example, cutting out (feet disengaging from the pole), causing a lineworker to slide down the pole, was an error that resulted in task termination due to the high injury risk. A summary of the checklist and quality evaluation procedures is provided above in the Job Performance Ratings section and Figure 2.
To assess the validity of these job simulation tests, an equally weighted composite of three criterion measures was generated. The first, Dead Ending, involved climbing over obstacles during the pole ascent and descent, installing multiple pieces of equipment (e.g., 10–90 lb), and connecting and disconnecting equipment (e.g., insulators). The scoring for this criterion measure was identical to that used for the work samples predictors. The second, Truck Unload/Reload, consisted of lifting and carrying equipment weighing 20–48 lb (e.g., cables) from a truck to the worksite and was scored as time to unload and reload a truck. The third criterion measure consisted of supervisor ratings of essential physical tasks and abilities.
To determine whether job simulations with lower skill difficulty were predictive of higher-level skills, we conducted a validation study. Prior to administration of the job simulations and criterion measures, experienced linework assessors completed multiple training sessions that involved rating videotaped and live performance of multiple levels of pole climbing tasks. A sample of incumbent journeyman lineworkers (n = 87) completed the three job simulation predictor tests (65-ft. pole climb, frame pole, ground conductor) and two of the three criterion measures (truck unload/reload, dead ending). Supervisors completed ratings of the validation subjects’ job performance.
The correlations between the Dead Ending criterion measure and the three individual predictors’ time to complete scores were moderately high, ranging from 0.58 to 0.80 (p < .01). However, only the 65-ft. climb and frame pole work samples were significantly related to the two other criterion measures (supervisor ratings, truck unload/reload; r = −.32 to −.33, p < .01). Correlations between a standardized composite criterion measure (zTruck unload/reload + zDead ending + zSupervisor ratings) and each individual predictor simulation (e.g., Frame Pole = zTime to complete + zQuality rating) were significant (r = .44–.48, p < .01). A regression on the composite criterion measure showed that each job simulation added significantly to the prediction of job performance; the final test battery had an R2 of .31. These results showed that use of lower difficulty work samples with performance quality ratings was indicative of higher-level performance.
Summary
The intent of this article was to provide an overview of the types of criterion measures used in physical assessment research. We described objective and subjective criterion measures that apply to physical and nonphysical domains (e.g., supervisor ratings, attrition), along with unique approaches for designing physical criterion measures. These additional approaches involved use of ergonomic, physiological, and/or biomechanical data, quality of movement assessment, production rates, and injury evaluation. Also outlined was a method to design criterion measures for multiple jobs using movement categories (MvCat) to identify the human movements involved in essential tasks (e.g., lift, climb) and the ergonomic measures associated with each type of movement (e.g., climb height, push/pull distance). Further, monetary savings were realized in studies that used injury and attrition data as criterion measures. Finally, it was important to consider feasibility, fidelity, reliability, cost, safety, and logistics of the criterion measures and to use objective and subjective measures when possible.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Data availability
The authors confirm that the data supporting this paper are available within the article and cited research. However, some data are not publicly available due to restrictions related to release of data and privacy of research participants.
References
- Ainsworth, B. E., Haskell, W. L., Whitt, M. C., Irwin, M. L., Swartz, A. M., Strath, S. J., O’Brien, W. L., Bassett, D. R., Schmitz, K. H., Emplaincourt, P. O., Jacobs, D. R., & Leon, A. S. (2000). Compendium of physical activities: An update of activity codes and MET intensities. Medicine & Science in Sports & Exercise, 32(9), S498–S516. 10.1097/00005768-200009001-00009 [DOI] [PubMed] [Google Scholar]
- Air Force . (2013). Air force enlisted classification directory. [Google Scholar]
- Baker, T. A., & Gebhardt, D. L. (2012). Chapter 13: The assessment of physical capabilities in the workplace. In Schmitt N. (Ed.), Handbook of assessment and selection (pp. 274–296). Oxford University Press, Inc. [Google Scholar]
- Bilzon, J. L., Scarpello, E. G., Smith, C. V., Ravenhill, N. A., & Rayson, M. P. (2001). Characterization of the metabolic demands of simulated shipboard Royal Navy fire-fighting tasks. Ergonomics, 44(8), 766–780. 10.1080/00140130118253 [DOI] [PubMed] [Google Scholar]
- Borman, W. C., & Smith, T. N. (2012). Chapter 23: The use of objective measures as criteria in I/O psychology. In Schmitt N. (Ed.), Handbook of assessment and selection (pp. 532–542). Oxford University Press, Inc. [Google Scholar]
- Bureau of Labor Statistics, U.S. Department of Labor . (2018). The economics daily: Physical strength required for jobs in different occupations in 2018. [online 2019]. http://www.bls.gov/opub/ted/2017/physical-strength-required-for-jobs-in-different-occupations-in-2016.htm
- Bureau of Labor Statistics, U.S. Department of Labor . (2020). Occupational requirements surgery, [online 2021]. https://data.bls.gov/pdq/SurveyOutputServlet
- Canadian Human Rights Act, RSC . (1985). c H-6.
- Cascio, W. F. (1998). Applied psychology in human resource management. Prentice Hall. [Google Scholar]
- Chan, D., & Schmitt, N. (1997). Video-based versus paper and pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82(1), 143–159. 10.1037/0021-9010.82.1.143
- Civil Rights Act of 1964 (Title VII), 42 U.S.C. §2000e-2, et seq. (1964).
- Cizek, G. J. (2012). Setting performance standards: Foundation, methods, and innovations. Routledge.
- Clark, C. C., & Rooney, N. J. (2021). Does benchmarking of rating scales improve ratings of search performance given by specialist search dog handlers? Frontiers in Veterinary Science, 8(2), 545398–545413. 10.3389/fvets.2021.545398
- Crocker, L. A., & Algina, J. (1986). Introduction to classical and modern test theory. Harcourt.
- Department of the Army. (2016). Prevention of heat and cold casualties. United States Army Training and Doctrine Command. (TRADOC Regulation 350-29).
- Department of the Navy. (2016). Naval Academy preparatory school instruction 6110.1A. Naval Academy Preparatory School.
- Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, and Department of Justice. (1978). Uniform guidelines on employee selection procedures. Bureau of National Affairs, Inc.
- Equality Act (UK). (2010). The national archives. [Retrieved March, 2013]. https://www.equalityhumanrights.com/en/equality-act-2010/what-equality-act
- Fleishman, E. A., Quaintance, M. K., & Broedling, L. A. (1984). Taxonomies of human performance: The description of human tasks. Academic Press.
- Forehand, M. (2010). Chapter 3: Bloom’s taxonomy: Original and revised. In Orey M. (Ed.), Emerging perspectives on learning, teaching, and technology (pp. 41–47). Jacobs Foundation.
- Foulis, S. A., Redmond, J. E., Frykman, P. N., Warr, B. J., Zambraski, E. J., & Sharp, M. A. (2017). U.S. Army physical demands study: Reliability of simulations of physically demanding tasks performed by combat arms soldiers. Journal of Strength and Conditioning Research, 31(12), 3245–3252. 10.1519/JSC.0000000000001894
- Foulis, S. A., Redmond, J. E., Warr, B. J., Zambraski, E. J., Frykman, P. N., Gebhardt, D. L., & Sharp, M. A. (2015). Development of occupational physical assessment test (OPAT) for Combat Arms soldiers (Report No. T16-2). U.S. Army Research Institute of Environmental Medicine.
- Foulis, S. A., Sharp, M. A., Redmond, J. E., Frykman, P. N., Warr, B. J., Gebhardt, D. L., … Zambraski, E. J. (2017). U.S. Army physical demands study: Development of occupational physical assessment test for combat arms soldiers. Journal of Science and Medicine in Sport, 63(4), 571–579.
- Frykman, P. N., Foulis, S. A., Canino, M. C., Hydren, J. R., Redmond, J. E., & Sharp, M. A. (2019). Development of criterion measure task simulations for physically demanding tasks (T19-05). U.S. Army Research Institute of Environmental Medicine.
- Gebhardt, D. L., & Baker, T. A. (1999). Development and validation of a physical performance test for the selection of firefighters in the State of New Jersey. Human Performance Systems, Inc.
- Gebhardt, D. L., & Baker, T. A. (2015). Development of models to predict United States Air Force specialty physical demand (2015 No. 056). Human Resources Research Organization.
- Gebhardt, D. L., & Baker, T. A. (2016). Development and validation of physical assessments for Huntington-Ingalls Shipyard jobs, Volume 2: Test development and validation. Human Resources Research Organization.
- Gebhardt, D. L., & Baker, T. A. (2017). Chapter 12: Physical performance tests. In Farr J. & Tippins N. (Eds.), Handbook on employee selection (2nd ed., pp. 277–297). Routledge.
- Gebhardt, D. L., Baker, T. A., & Linnenkohl, K. A. (2015). Development and validation of physical performance tests for selection into United States Air Force specialty (2015 No. 051). Human Resources Research Organization.
- Gebhardt, D. L., Baker, T. A., & Thune, A. (2006). Development and validation of physical performance, cognitive, and personality assessments for selectors and delivery drivers. Human Performance Systems, Inc.
- Gebhardt, D. L., Baker, T. A., Volpe, E. K., & Billerbeck, K. T. (2009). Development and validation of physical performance tests for selection of orderfillers. Human Performance Systems, Inc.
- Gebhardt, D. L., Baker, T. A., Volpe, E. K., & St Ville, K. A. (2012). Development and validation of physical performance assessments for Southern California Edison linemen. Human Performance Systems, Inc.
- Gebhardt, D. L., Baker, T. A., Volpe, E. M., & St Ville, K. A. (2015). Development and validation of physical assessments for Huntington-Ingalls Shipyard jobs, Volume 1: Job analysis. Human Performance Systems, Inc.
- Gebhardt, D. L., & Crump, C. E. (1984). Validation of physical performance selection tests for paramedics. Advanced Research Resources Organization.
- Gledhill, N., & Jamnik, V. K. (1992). Characterization of the physical demands of firefighting. Canadian Journal of Sport Science, 17(3), 207–213.
- Hauret, K., Drain, J., Fieldhouse, A., Reilly, T., & Jackson, S. (2019). Chapter 6: The role of physical employment standards in musculoskeletal injury prevention. In Reilly T. (Ed.), Combat integration: Implications for physical employment standards (STRO-TR-HFM-269) (pp. 6-1–6-23). North Atlantic Treaty Organization (NATO) – Science and Technology Organization.
- Hauret, K., Steelman, R., Pierce, J., Alemany, J., Sharp, M., Foulis, S., Redman, J., & Jones, B. (2018). Association of performance on the Occupational Physical Assessment Test (OPAT), injuries, and attrition during Initial Entry Training – OPAT Phase I (PHR No. S.0047229-18b). Army Public Health Center.
- Hughes, J., Foulis, S., Taylor, K., Guerrier, K., Walker, L., Hand, A., Popp, K. L., Gaffney-Stomberg, E., Heaton, K. J., Sharp, M. A., Grier, T. L., Hauret, K. G., Jones, B. H., Bouxsein, M. L., McClung, J. P., Matheny, R. W., & Proctor, S. (2019). A prospective field study of U.S. Army trainees to identify the physiological bases and key factors influencing musculoskeletal injuries: A study protocol. BMC Musculoskeletal Disorders, 20(1), 282–289. 10.1186/s12891-019-2634-9
- International Association of Fire Fighters (IAFF). (2007). Candidate physical ability test.
- Jones, B., Cowan, D., Tomlison, J., Robinson, J., Polly, D., & Frykman, P. (1993). Epidemiology of injuries associated with physical training among young men in the Army. Medicine and Science in Sports and Exercise, 25(2), 197–203. 10.1249/00005768-199302000-00006
- Kehoe, J. F., & Sackett, P. R. (2017). Chapter 3: Validity considerations in the design and implementation of selection systems. In Farr J. & Tippins N. (Eds.), Handbook on employee selection (2nd ed., pp. 56–92). Routledge.
- Knapik, J. J., Canham-Chervak, M., Hauret, K., Hoedebecke, E., Laurin, M. J., & Cuthie, J. (2001). Discharges during U.S. Army basic training: Injury rates and risk factors. Military Medicine, 166(7), 641–647. 10.1093/milmed/166.7.641
- Lad, U., Oomen, N., Callaghan, J., & Fischer, S. (2018). Comparing the biomechanical and psychophysical demands imposed on paramedics when using manual and powered stretchers. Applied Ergonomics, 70, 167–174. 10.1016/j.apergo.2018.03.001
- Lee-Bates, B., Billing, D. C., Caputi, P., Carstairs, G. L., Linnane, P., & Middleton, K. (2017). The application of subjective job task analysis techniques in physically demanding occupations: Evidence for the presence of self-serving bias. Ergonomics, 60(9), 1240–1249. 10.1080/00140139.2016.1262063
- Lievens, F., & De Soete, B. (2012). Chapter 17: Simulations. In Schmitt N. (Ed.), Handbook of assessment and selection (pp. 383–410). Oxford University Press, Inc.
- Liljegren, J. (2008). Wet Bulb Globe Temperature (WBGT), Version 1.2. Argonne National Laboratory, University of Chicago Argonne, LLC.
- McArdle, W. D., Katch, F. I., & Katch, V. L. (2014). Exercise physiology: Energy, nutrition, and human performance (8th ed.). Lippincott Williams & Wilkins.
- McDaniel, J. W., Skandis, R. J., & Madole, S. W. (1983). Weight lift capabilities of Air Force basic trainees (Report No. AFAMRL-TR-83-0001). Air Force Aerospace Medical Research Laboratory.
- McGurk, M. S. (personal communication, May 1, 2018).
- Murphy, K. R. (2008). Explaining the weak relationship between job performance and ratings of job performance. Industrial and Organizational Psychology, 1(2), 148–160. 10.1111/j.1754-9434.2008.00030.x
- Murphy, K. R., & Balzer, W. K. (1989). Rater errors and rating accuracy. Journal of Applied Psychology, 74(4), 619–624. 10.1037/0021-9010.74.4.619
- Naqvi, S. A., King, A., & Rook, C. (2001). Engineering standards development and ergonomics – A literature perspective with special focus on warehousing. Proceedings of SELF-ACE 2001 Conference – Ergonomics for Changing Work, Volume 4. University Park, PA: The Pennsylvania State University.
- National Center for O*NET Development. (2022). O*NET online. Retrieved January 19, 2022, from https://www.onetonline.org/
- Nye, N. S., Pawlak, M. T., Webber, B. J., Tchandja, J. N., & Milner, M. R. (2016). Description and rate of musculoskeletal injuries in Air Force basic military trainees - 2012-2014. Journal of Athletic Training, 51(11), 858–865. 10.4085/1062-6050-51.10.10
- Park, S., & Hubert, M. (2017). Motivating raters through work design: Applying the job characteristics model to the performance appraisal context. Cogent Psychology, 4(1), 1287320. 10.1080/23311908.2017.1287320
- P.L. 113-291; 128 Stat. 1919; September 19, 2014. [Carl Levin and Howard P. “Buck” McKeon National Defense Authorization Act for Fiscal Year 2015; Section 524]. 113th Congress.
- Putka, D. J. (2017). Chapter 1: Reliability. In Farr J. & Tippins N. (Eds.), Handbook on employee selection (2nd ed., pp. 3–33). Routledge.
- Reilly, T., Blacker, S., Sharp, M., Gebhardt, D., Brown, P., Drain, J., & Kilding, H. (2019). Chapter 3: NATO guide for PES development. In Reilly T. (Ed.), Combat integration: Implications for physical employment standards (STRO-TR-HFM-269) (pp. 3-1–3-48). North Atlantic Treaty Organization (NATO) – Science and Technology Organization.
- Reilly, T. J., Gebhardt, D. L., Billing, D. C., Greeves, J. P., & Sharp, M. A. (2015). Development and implementation of evidence based physical employment standards: Key challenges in the military context. Journal of Strength and Conditioning Research, 29(Suppl. 11), S28–S33. 10.1519/JSC.0000000000001105
- Research and Analysis Directorate. (2018). TRADOC center for initial military training: Attrition summary initial entry training. TRADOC Center for Initial Military Training.
- Riebe, D., Ehrman, J., Liguori, G., & Magal, M. (2018). ACSM’s guidelines for exercise testing and prescription (10th ed.). Wolters Kluwer.
- Robson, S., Lytell, M. C., Atler, A., Campbell, J. H., & Sims, C. S. (2020). Physical task simulations: Performance measures for the validation of physical tests and standards for Battlefield Airmen. RAND Corporation.
- Robson, S., Pezard, S., Lytell, M. C., Sims, C. S., Boon, J. E., Etchegaray, J. M., … Linnenkohl, K. A. (2019). Evaluation of the Strength Aptitude Test and other fitness tests to qualify Air Force recruits for physically demanding specialties. RAND Corporation.
- Roch, S. G., Woehr, D. J., Mishra, V., & Kieszczynska, U. (2012). Rater training revisited: An updated meta-analytic review of frame-of-reference training. Journal of Occupational and Organizational Psychology, 85(2), 370–395. 10.1111/j.2044-8325.2011.02045.x
- Sharp, M., Cohen, B., Boye, M., Foulis, S., Redman, J., Larcom, K., & Zambraski, E. (2017). U.S. Army physical demands study: Identification and validation of physically demanding tasks of combat arms occupations. Journal of Science and Medicine in Sport, 20(S4), S62–S67. 10.1016/j.jsams.2017.09.013
- Sharp, M., Foulis, S., Redmond, J., Canino, M., Cohen, B., Hauret, K., Frykman, P., & Zambraski, E. (2018). Longitudinal validation of the Occupational Physical Assessment Test (OPAT) (Report No. T18-05). U.S. Army Research Institute of Environmental Medicine.
- Sharp, M. A., Rossberger, M., & Knapik, J. (2019). Chapter 5: Common military task: Materials handling. In Reilly T. (Ed.), Combat integration: Implications for physical employment standards (STRO-TR-HFM-269) (pp. 5-1–5-48). North Atlantic Treaty Organization (NATO) – Science and Technology Organization.
- Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. 10.1037/0033-2909.86.2.420
- Siddall, A. G., Stevenson, R. D., Turner, P. F., Stokes, K. A., & Bilzon, J. L. (2016). Development of role-related minimum cardiorespiratory fitness standards for firefighters and commanders. Ergonomics, 59(10), 1335–1343. 10.1080/00140139.2015.1135997
- Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47(2), 149–155. 10.1037/h0047060
- Sothmann, M. S., Gebhardt, D. L., Baker, T. A., Kastello, G. M., & Sheppard, V. A. (2004). Performance requirements of physically strenuous occupations: Validating minimum standards for muscular strength and endurance. Ergonomics, 47(8), 864–875. 10.1080/00140130410001670372
- Stockbrugger, B., Reilly, T., Blacklock, R., & Gagnon, P. (2018). Reliability of individual components of the Canadian Armed Forces physical employment standard. Applied Physiology, Nutrition, and Metabolism, 43(7), 663–668. 10.1139/apnm-2017-0650
- Woehr, D. J. (1994). Understanding frame-of-reference training: The impact of training on the recall of performance information. Journal of Applied Psychology, 79(4), 525–534. 10.1037/0021-9010.79.4.525
- Woehr, D. J., & Roch, S. (2012). Chapter 22: Supervisory performance ratings. In Schmitt N. (Ed.), Handbook of assessment and selection (pp. 517–531). Oxford University Press, Inc.
- Young, A. J., Sawka, M. N., & Pandolf, K. B. (1996). Chapter 7: Physiology of cold exposure. In Marriott B. M. & Carlson S. J. (Eds.), Nutritional needs in cold and high altitude environments: Applications for military personnel in field operations (pp. 127–148). National Academies Press (US).
Data Availability Statement
The authors confirm that the data supporting this paper are available within the article and cited research. However, some data are not publicly available due to restrictions related to release of data and privacy of research participants.
