Published in final edited form as: Int J Sel Assess. 2022 Jan 23;30(1):167–181. doi: 10.1111/ijsa.12375

Using game-like animations of geometric shapes to simulate social interactions: An evaluation of group score differences.

Matt I Brown 1, Andrew B Speer 2, Andrew P Tenbrink 2, Christopher F Chabris 1
PMCID: PMC9355331  NIHMSID: NIHMS1771567  PMID: 35935096

Abstract

This study introduces a novel, game-like method for measuring social intelligence: the Social Shapes Test. Unlike existing video- or game-based tests, the Shapes Test uses animations of abstract shapes to represent social interactions. We explore demographic differences in Shapes Test scores compared to a written situational judgment test. Gender and race/ethnicity had meaningful effects on written SJT scores but not on Shapes Test scores. This pattern of results remained after controlling for general mental ability and English language exposure. We also found metric invariance across demographic groups for both tests. Our results demonstrate the potential for using animated shape tasks as an alternative to written SJTs when designing future game-based assessments.


The gamification of psychometric tests is widely acknowledged as an emerging trend in the practice of industrial and organizational psychology (Woods et al., 2020). Gamification is generally defined as the application of game design techniques to real-world processes (Landers et al., 2018). Despite strong demand among employers and popular media attention to new technology firms marketing gamified tests (e.g., Carr, 2018), there are few peer-reviewed studies on the development and validation of game-based assessments (Chamorro-Premuzic et al., 2016). Many of the gamified tests in the literature were designed using existing ability test formats or item types (e.g., Leutner et al., 2020; Weiner & Sanchez, 2020). In particular, the situational judgment test (SJT) method has commonly been used to develop game-based measures (Georgiou et al., 2019; Landers et al., 2020). Although the SJT format is commonly used to measure interpersonal skills in organizational research and practice (Christian et al., 2010), a variety of alternative methods for assessing these skills can be found in other areas of psychology and neuroscience research (Eddy, 2019; Pinkham et al., 2018). Therefore, the purpose of the present study is to introduce an alternative, game-like approach for measuring social intelligence and to observe group mean score differences compared to a standard, text-based SJT. We also hope that this study serves as a proof-of-concept endeavor for a novel test method for designing future game-based assessments.

The Social Shapes Test (Shapes Test; Brown et al., 2019) is a novel assessment of social intelligence that is the focus of this research. Unlike most measures of social intelligence or related ability constructs in organizational research (e.g., emotional intelligence or applied interpersonal skills assessed using SJTs), the Shapes Test uses animated, two-dimensional shapes to represent social interactions (Figure 1). These animations are similar in design to the appearance of early video games (e.g., Asteroids, Space Invaders, or Tetris) and are a more cost-effective alternative to creating higher-fidelity videos of human interactions (e.g., Bardach et al., 2021; Golubovich et al., 2017). We are specifically interested in demonstrating whether the animated shape task used in the Shapes Test yields weaker mean group score differences relative to a text-based situational judgment test (SJT) of social intelligence. This focus is informed by initial findings which suggest that animated shape tasks tend to show little to no mean score differences based on gender or race and ethnicity (Brown et al., 2019; Lee et al., 2018). Given the concerns regarding fairness and bias in modern selection practices (Burgoyne et al., 2021), we explore whether this test method may be useful for minimizing the threat of adverse impact when designing game-based assessments to measure social intelligence.

Figure 1.

Sample Item from the Social Shapes Test (SST; Brown et al., 2019). This is a screenshot of how each test item appeared when completing the SST. Each item consists of a short animation paired with a single multiple-choice question with four response options.

Origins of Animated Shape Tasks

Animated shape tasks originated from the pioneering work of Heider and Simmel (1944), who observed that their participants, after viewing a short film with moving geometric shapes, described the movements of the shapes in social terms and assigned social roles to individual shapes (see Figure 2). This finding inspired modern researchers to use the original Heider and Simmel film, or newer animations mimicking the original, to assess difficulties in social intelligence among children and adults affected by psychological conditions characterized by challenges in social functioning (e.g., autism spectrum disorder or schizophrenia; Klin, 2000; Martinez et al., 2019). These researchers discovered that typically developing individuals were generally better able than clinically affected individuals to identify the social relationships between these abstract shapes, as indicated by moderate meta-analytic effect sizes across studies of adults and children (Wilson, 2021). Moreover, performance on these animated shape tasks tends to correlate positively with other ability measures of social or emotional intelligence (Johannesen et al., 2018). These tasks have also been used to identify areas of brain activation involved in social cognition (e.g., Isik et al., 2017; Ludwig et al., 2020). In particular, animations designed to simulate social interactions have been found to activate different areas of the brain compared to animations meant to represent strictly mechanical or random motion (Moessnang et al., 2020; Vandewouw et al., 2021).

Figure 2.

Examples of animated shape tasks. A frame from the original Heider and Simmel (1944) film is displayed on the left (A) and a frame from one of the Social Shapes Test animations is shown on the right (B). This figure also appears in Ratajska et al. (2020).

Despite these findings in developmental and neuroimaging research, there has been little work to adapt animated shape tasks for use in employee selection or development. Not only is it difficult to translate social or emotional intelligence research across fields due to differences in terminology and taxonomies (Olderbak & Wilhelm, 2020), but there are also likely several specific limitations that have made it challenging to incorporate existing animated shape tasks into organizational research. One such limitation is that many applications of animated shape tasks require researchers to code narrative descriptions provided by test takers after viewing the shape animations (Klin, 2000). This is an obstacle to organizational use, where assessments are typically administered and scored electronically without a proctor or administrator (Tippins, 2009). In addition, several shape tasks rely only on the original Heider and Simmel film (e.g., Klin, 2000) or relatively few unique animations (e.g., Martin & Weisberg, 2003). Although this is typical practice in neuroimaging studies, it is difficult to see how a fine-grained assessment could be created using so few stimuli. Among the exceptions where a multiple-choice format has been used, these tests have primarily been studied in research on social impairments due to autism or schizophrenia (e.g., Bell et al., 2010; White et al., 2011), and there is little psychometric evidence for their use among typically-developing adults. This has potentially limited the adoption of this method to measure individual differences in social intelligence.

Unlike existing measures in the developmental or neuroscience literature, however, the Shapes Test was specifically designed and validated as a self-administered test to identify individual differences in social intelligence among typically-developing adults. This format can easily be implemented as a non-proctored, online assessment which makes it convenient for organizational research. The animations used in the Shapes Test have been found to evoke a similar degree of social attribution compared to the original Heider and Simmel film (Ratajska et al., 2020). Scores on this test also correlate positively with other validated social intelligence measures, including the Reading the Mind in the Eyes test (Baron-Cohen et al., 2001) and the Situational Test of Emotion Understanding (MacCann & Roberts, 2008), while displaying discriminant validity from measures of verbal ability or abstract reasoning (Brown et al., 2019). Taken together, the Shapes Test appears to be a useful alternative to existing methods for assessing social intelligence in selection or development, and we believe that the simple, abstract nature of the shape animations presents a novel approach for designing future game-based assessments. Thus, we seek to provide proof-of-concept evidence for using these animated shape tasks in organizational research and as an inspiration for future game-based assessments. For the purposes of this study, we refer to the Shapes Test as a game-like test of social intelligence due to the visual style of the animated shape videos but concede that the assessment is not fully gamified as defined by organizational scholars (Landers et al., 2018).

Present Study

The use of animated shape tasks provides several potential advantages compared to existing social intelligence tests in organizational research. Generally speaking, some scholars suggest that using video stimuli in place of text may help minimize group score differences (Chan & Schmitt, 1997). This is theorized to be caused by minimizing construct-irrelevant variance, such as reading ability or cognitive load (Karakolidis et al., 2021). However, most of the research on video elements has focused on tests involving human actors or animated characters (Golubovich et al., 2017). These visual representations of humans may contain cultural or gender cues which are more easily detected by in-group members (Adams et al., 2010; Golubovich & Ryan, 2021). In contrast, the shape animations in the Social Shapes Test simulate social interactions more abstractly without these cues. The Social Shapes Test and similar animated shape tasks have displayed a lack of subgroup score differences based on self-identified gender and racial or ethnic group in past research (e.g., Brown et al., 2019; Lee et al., 2018) but there have been few primary studies to formally replicate these findings.

We chose to compare group mean score differences on the Shapes Test to a written SJT (Situational Social Intelligence Test, SSIT; Speer et al., 2019) given that the SJT format is widely used to measure applied interpersonal skills (Patterson et al., 2009) and social or emotional intelligence (MacCann & Roberts, 2008; Schlegel & Mortillaro, 2019) in organizational research. Moreover, several gamified assessments use a written SJT format (Georgiou et al., 2019; Landers et al., 2020). In these latter two examples, the respective authors added game-like elements, such as character avatars, continuous storylines, and fantasy elements, to a traditional SJT structure which included a verbal description of a specific scenario and behavioral response options. The SSIT, on the other hand, is a more traditional SJT that includes only written descriptions of social, work-related contexts and a list of possible behavioral responses. Thus, it does not possess any features of gamification.

SJTs are widely assumed to yield weaker group score differences relative to general mental ability (GMA) tests, and yet interpersonal skill SJTs exhibit moderate group score differences favoring White test takers (e.g., Bobko & Roth, 2013; Whetzel et al., 2008). Similar group mean score differences have been found among applicants who completed SJTs designed for one of four different occupational groups (Herde et al., 2019). Likewise, female test takers are commonly found to outperform males on SJTs across a variety of domains and settings (Lievens et al., 2016; Weekley et al., 2015). In contrast, Brown et al. (2019) reported little to no differences in Shapes Test scores based on gender and race or ethnicity even though significant differences were detected for other social intelligence measures. This lack of a meaningful gender effect has also been observed for similar shape tasks administered to participants in clinical populations (Bell et al., 2010; Johannesen et al., 2018). Based on this research and the few studies which have examined group score differences in different animated shape tasks, we propose the following two hypotheses:

Hypothesis 1: Larger racial/ethnic group score differences will be observed for written SI scores (SSIT) compared to game-like SI scores (Social Shapes Test).

Hypothesis 2: Larger gender differences (favoring women) will be observed for written SI (SSIT) than for game-like SI (Social Shapes Test).

We also examine several factors that might help explain why people may score differently on written and game-like social intelligence tests. First, English language proficiency and exposure to the English language may contribute to differences in test scores between demographic groups; respondents with less exposure to the English language (i.e., those who were not raised speaking English) should perform better on the game-like test than on a written SJT, which requires substantially more reading. Along these lines, Hausdorf and Robie (2018) recently reported that racial/ethnic differences in GMA scores could be meaningfully reduced after controlling for differences in English-language fluency, generational status, and other demographic variables. Given the expected cognitive saturation of both assessments used in this study, we expect that these past findings may generalize to our study. Another potential explanation for group differences in social intelligence test scores is the influence of GMA. Even though social intelligence scores are expected to correlate with measures of GMA or other broad cognitive abilities (Mayer et al., 2016), it is unclear to what extent observed group differences in social intelligence test scores can be explained by differences in GMA. Both tests have been observed to correlate positively with measures of GMA or second-order cognitive abilities (e.g., verbal or abstract reasoning), and conventional wisdom states that cognitively-loaded tests will display sizable mean group differences (Dahlke & Sackett, 2017). We thus explored whether controlling for English language exposure and GMA would reduce demographic differences in written and game-like test scores.

Research Question 1: Will demographic group differences in SI scores be reduced after controlling for English language exposure?

Research Question 2: Will demographic group differences in SI scores be reduced after controlling for GMA?

Method

Participants

We recruited 548 undergraduate participants who were enrolled in psychology courses at a large United States university. All participants received course credit for participating in the study. We removed a total of 148 participants who either failed to correctly respond to an explicit attention check item (e.g., “Choose ‘somewhat disagree’ for this statement”) or completed the entire study faster than a minimum plausible completion time (a cutoff set by the researchers, who completed the assessment as quickly as possible while remaining attentive). Applying these exclusion criteria did not change the overall demographics of our sample based on gender or race/ethnicity. The final sample size was 400 people. Most participants identified as female (82%), and the sample was diverse in terms of race/ethnicity: White (38%), Black/African American (17%), Arab/Middle Eastern (23%), and Asian (21%). Participants were recruited in order to obtain a relatively balanced sample based on race/ethnicity. Participants were also asked whether they were born in an English-speaking country (ESC; 84% of full sample) and whether they learned and used English from early childhood (native English speaker; 91% of full sample), allowing for comparisons in scores based on language factors besides race. We provide a full summary of participant demographics in Table 1.

Table 1.

Participants by Gender and Race/Ethnicity

                                          White        Black        Arab         Asian
                                          n     %      n     %      n     %      n     %
Gender
 Female                                   132   87%    58    91%    65    73%    62    78%
 Male                                     20    13%    6     9%     24    27%    17    22%
Born in an English-speaking country?
 Yes                                      145   95%    63    98%    67    75%    46    58%
 No                                       7     5%     1     2%     22    25%    33    42%
Identify as a native English speaker?
 Yes                                      143   94%    62    97%    75    84%    68    85%
 No                                       9     6%     2     3%     14    16%    11    15%
Total Sample                              152   38%    64    17%    89    23%    79    21%

Note. 16 individuals did not report a Race or Ethnicity (including 10 individuals who identified as Female)

Procedure

Participants completed all measures using an online survey hosted by Qualtrics. The order in which participants completed the Shapes Test and the SSIT was randomly assigned to prevent any order effects on our within-person analyses. We chose this within-person design so that true ability differences would not confound our comparisons of gender or race/ethnicity effects on game-like versus written social intelligence test scores. After completing each of the tests, participants were asked to self-identify their gender, race/ethnicity, and age along with other demographic variables. The median time to complete the study was 52 minutes, with an interquartile range of 40 to 75 minutes.

Measures

Game-like Social Intelligence Test (Social Shapes Test).

The Shapes Test consists of 23 items, each pairing a short (13–23 second) video involving a set of two-dimensional geometric shapes with a multiple-choice question offering four response options. The same set of shapes (e.g., a yellow plus, a purple star, a red square, and a blue triangle) appears in all videos. Participants are allowed to view the animation as many times as needed to complete the item. Some items ask participants to correctly identify which character expressed a specific mental or emotional state (e.g., “Who is annoyed?”) while others require participants to make inferences about a character’s intentions (e.g., “Who is being rude?”). Other items ask participants to take the perspective of one of the characters in the animation (e.g., “Who does X think is in the house at the end of the video?”).

Each Shapes Test question is scored as correct or incorrect based on an objective scoring key. The scoring key was initially created by the test developers who identified the correct response for each item based on the intended interpretation of the animated social interaction (Brown et al., 2019). This key was then empirically validated by observing that a consensus of test takers identified the intended correct response for all items. Consensus-based scoring is a common method for identifying correct responses to emotional intelligence test items (Barchard et al., 2013). In addition, all 23 items had positive corrected item-test correlations and positive correlations with other, ability-based measures of social intelligence (Brown et al., 2019). These findings provide support for the validity of the scoring key. Shapes Test scores were calculated as the sum of correct responses and displayed modest internal consistency (α = .71). All Social Shapes Test items and animations are freely available to view or download at OSF (https://osf.io/sqxy6).
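
To make the number-correct scoring and reliability estimate described above concrete, the sketch below shows how keyed item responses could be scored, summed, and checked for internal consistency in R. The response matrix, answer key, and sample size are hypothetical placeholders rather than the authors' data or scoring script; only the general procedure (keyed 0/1 scoring, sum scores, Cronbach's alpha) follows the description above.

```r
# Minimal sketch of sum scoring and internal consistency for a 23-item test.
# `responses` (one row per participant, one column per item) and `key` are
# hypothetical placeholders, not the study's data or the authors' scoring code.
library(psych)

set.seed(1)
n_items <- 23
key <- sample(c("A", "B", "C", "D"), n_items, replace = TRUE)   # intended correct options
responses <- matrix(sample(c("A", "B", "C", "D"), 400 * n_items, replace = TRUE),
                    nrow = 400, ncol = n_items)

# Score each item 1 if the chosen option matches the key, 0 otherwise
scored <- sweep(responses, 2, key, FUN = "==") * 1

total_score <- rowSums(scored)          # number-correct total, as used for the Shapes Test
alpha_est   <- psych::alpha(scored)     # Cronbach's alpha (reported as .71 in the study)
summary(total_score)
alpha_est$total$raw_alpha
```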

Social intelligence falls within a larger umbrella of social aptitude constructs (Lievens & Chan, 2010). As a measure of cognitive social intelligence, the Shapes Test is designed to assess one’s ability to decode nonverbal behavior and accurately infer the emotional or mental states of others (Wong et al., 1995). Within the organizational sciences, this operationalization is most similar to the perceiving emotions dimension within Mayer et al.’s (2016) model of emotional intelligence, though emotional intelligence is typically assessed by identifying emotions from visual representations (e.g., static images of faces) or from textual descriptions. The Shapes Test also involves inferring the mental states of others, akin to theory of mind (Premack & Woodruff, 1978). Although there has been a problem with construct label proliferation (Olderbak & Wilhelm, 2020), most social aptitudes (Ferris et al., 2001) have their roots in the longer-standing construct of social intelligence (Thorndike, 1920; Lievens & Chan, 2010), and thus we use the term social intelligence because it is the basis for many of the more narrowly defined concepts currently used in the organizational sciences (e.g., Lievens & Chan, 2010). Using terminology outlined by Lievens and Chan, the Shapes Test can be thought to measure cognitive SI as it focuses on understanding and decoding nonverbal behavior.

Written Social Intelligence Test (SSIT).

The SSIT is an SJT-based measure of social intelligence in workplace settings. This test consists of 29 items where test-takers read workplace scenarios and identify the most and least effective behavioral responses among four options (α = .86). The SSIT is scored using an expert-based key which assigns points to each most and least effective choice (either 0, 1, or 2 points). A total score was calculated by summing all the points scored across both most and least effective choices for all 29 items (maximum possible score = 110). The SSIT demonstrates construct validity evidence across multiple studies (Speer et al., 2019). We provide a sample item in Figure 3. Similar to the Shapes Test, the SSIT is also designed to measure cognitive social intelligence as defined by Lievens and Chan (2010). We selected the SSIT because it is an exemplar of typical SJT-based measures commonly used to assess interpersonal skill in selection and development contexts (Christian et al., 2010). Therefore, the SSIT is likely similar to many tests currently available to practitioners as an alternative to the Shapes Test. The SSIT is also representative of the test format which is commonly used for designing game-based assessments in past research.
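
The expert-keyed scoring described above can be illustrated with a short R sketch. The point matrices and response vectors below are hypothetical placeholders (the actual SSIT key is not reproduced here); the sketch only mirrors the general logic of awarding 0, 1, or 2 points for each most- and least-effective choice and summing across the 29 items.

```r
# Sketch of expert-keyed SJT scoring under assumed inputs:
# `most_pts` and `least_pts` are 29 x 4 matrices of expert-assigned points
# (0, 1, or 2) for selecting each option as "most" / "least" effective;
# `chosen_most` / `chosen_least` are one participant's selected options (1-4) per item.
score_ssit <- function(chosen_most, chosen_least, most_pts, least_pts) {
  n_items <- nrow(most_pts)
  pts <- vapply(seq_len(n_items), function(i) {
    most_pts[i, chosen_most[i]] + least_pts[i, chosen_least[i]]
  }, numeric(1))
  sum(pts)   # total SSIT score (the published key has a maximum of 110)
}
```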

Figure 3.

Sample Item from the Written SJT (SSIT; Speer et al., 2019). Each question includes a workplace scenario followed by a list of four behavioral response options. Test-takers are asked to indicate which option they would be most likely and which they would be least likely to do in response to the situation.

In the present study, SSIT scores correlated positively with performance on the Shapes Test (r = .40, p < .001), which indicated modest convergent validity between the two tests. This magnitude of correlation is typical among broad tests of emotional intelligence (Elfenbein & MacCann, 2017; Schlegel & Mortillaro, 2019) or more narrowly defined tests of emotion recognition (Kittel et al., 2021). More specifically, Bryan and Meyer (2021) reported a meta-analytic correlation of r = .43 (95% CI = .39, .48) between people-centered ability measures across 87 unique studies. Our observed correlation may also be somewhat attenuated by unshared method variance due to the differences in format between the two tests (Spector et al., 2019). Using different item formats or test stimuli can artifactually diminish the observed correlation between tests designed to measure the same underlying construct. This has been widely observed across various constructs and between different test modalities (e.g., game-based versus written SJT: Georgiou et al., 2019; behavioral versus written responses: Lievens et al., 2015).

Content Validity Evidence for the Social Intelligence Measures.

Even though both the Shapes Test and SSIT were developed to assess social intelligence, and even though both exhibit construct validity evidence in assessing their intended construct domains (Brown et al., 2019; Speer et al., 2019), an additional layer of validity evidence was obtained for this study. Specifically, content validity judgments were independently made for each test as to the degree to which the tests measure social intelligence. Subject matter experts (SMEs) were contacted and asked to complete these content validity judgments. All SMEs held a Ph.D. in industrial-organizational psychology and had experience in the research or practice of selection testing. SMEs were assigned to either the Shapes Test or the SSIT, and they were first given the definition of social intelligence used in this study. Then, they reviewed all test items for their assigned test. Upon doing this, they made judgments regarding the degree to which the test assessed social intelligence. The SMEs made judgments globally and also regarding specific components of the social intelligence definition, thus ensuring the tests were both relevant and representative of the intended construct domain. Ratings were made on a 1–5 scale (1=”Not at all”, 2 = “Minimally”, 3 = “Somewhat measures the domain”, 4 = “Measures the domain”, 5 = “Very much measures the domain”), with ratings made regarding the overall definition, knowledge of social norms, appraisal of context-specific social demands, insight into others (thoughts, emotions, intentions), and the ability to predict behaviors of others. Four SMEs assessed the Shapes Test and four assessed the SSIT. We observed adequate interrater agreement for ratings of each test (Shapes Test rwg = .88; SSIT rwg = .73). Both tests exhibited evidence of content validity, with an average rating across all raters and judgments being 3.60 for the Shapes Test and 4.25 for the SSIT. Both tests received similar ratings for measuring social norms (both M = 4.00) and insight into thoughts, emotions, and intentions (Shapes Test M = 3.75; SSIT M = 4.00) and overall social intelligence (Shapes Test M = 3.75; SSIT M = 4.25). However, the SSIT was rated higher for assessing context-specific social demands (M = 4.50) compared to the Shapes Test (M = 3.50). This suggests that both tests measure similar aspects of social intelligence but the SSIT involves more contextual information. Considering these ratings and findings reported in past research (Brown et al., 2019; Speer et al., 2019), we conclude that the Shapes Test and SSIT are comparable measures of social intelligence.
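
As a point of reference for the agreement statistics reported above, the following R sketch computes a single-item rwg index (James, Demaree, & Wolf, 1984) for ratings on a 1–5 scale. The example ratings are invented for illustration and are not the SMEs' actual judgments.

```r
# Sketch of a single-item rwg agreement index for SME ratings on a 1-5 scale;
# `ratings` is a hypothetical vector of four SMEs' ratings on one judgment.
rwg_single <- function(ratings, scale_points = 5) {
  var_expected <- (scale_points^2 - 1) / 12   # variance expected under uniform (random) responding
  var_observed <- var(ratings)
  1 - (var_observed / var_expected)
}

rwg_single(c(4, 4, 3, 4))   # returns ~0.875 for these illustrative ratings
```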

SAT/ACT.

To investigate whether group test score differences can be explained by differences in GMA, we asked participants to self-report their overall SAT and/or ACT scores as a proxy measure (n = 350). Each score was transformed into a percentile score based on national norms provided by the test publishers. Past research has suggested the use of admissions test scores as a proxy for GMA in situations where administering GMA tests to participants is not feasible (Wai et al., 2018). Both ACT and SAT scores have generally been observed to be strongly correlated with more general measures of GMA (Frey & Detterman, 2004). In the present study, men reported higher test scores compared to women (d = −0.33, t(345) = 2.38, p = .02), and White participants reported significantly higher scores compared to Black participants (d = 0.95, t(331) = 6.08, p < .001), but not compared to Asian (d = −0.27, t(331) = 1.51, p = .13) or Arab participants (d = 0.00, t(331) = 0.18, p = .86).

Results

Overall Comparisons by Measurement Method

Correlations between all study variables are reported in Table 2. We first tested Hypotheses 1 and 2 using ANOVA (Table 3). Comparisons based on racial or ethnic group membership were tested using multiple regression with dummy-coded variables using White as the referent group. We report standardized mean differences as effect sizes, where positive values correspond to higher scores for female participants or White participants. Although scores on the written SJT and the Shapes Test were positively correlated (r = .40, p < .001), we found a different pattern of group differences for each test. Collectively, we observed stronger gender and racial differences for the written SJT compared to game-like social intelligence. We did not detect a difference in Shapes Test scores based on participant gender (d = −0.15, p = .24) or meaningful differences between White participants and Black (d = −0.23, p = .43), Middle Eastern (d = −0.15, p = .68), or Asian participants (d = −0.03, p = .99). In comparison, there were mean differences for the written SJT based on gender (F(1, 397) = 7.71, p = .006) and race/ethnicity (F(3, 380) = 4.61, p = .004). Women outperformed men on the written SJT (d = 0.35, p = .006). Further, White participants scored significantly higher than Black (d = −0.44, p = .002), Arab (d = −0.37, p = .004), and Asian participants (d = −0.29, p = .04). It is worth noting that these differences are still weaker than what is commonly found for GMA tests in past literature and for self-reported admissions tests within the current sample.
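
A minimal R sketch of this analytic approach appears below, assuming a hypothetical data frame `dat` with columns `sst` (Shapes Test total), `ssit` (written SJT total), `gender`, and `race`. It is not the authors' analysis script; it only illustrates the one-way ANOVA, the dummy-coded regression with White as the referent group, and a pooled-SD standardized mean difference.

```r
# Sketch of the group-comparison analyses, assuming the hypothetical data frame `dat`.
dat$race <- relevel(factor(dat$race), ref = "White")   # White as the referent group

# Gender effect on each test (one-way ANOVA)
summary(aov(ssit ~ gender, data = dat))
summary(aov(sst  ~ gender, data = dat))

# Race/ethnicity effects via dummy-coded multiple regression
summary(lm(ssit ~ race, data = dat))

# Standardized mean difference (Cohen's d with a pooled SD) for one contrast
cohens_d <- function(x, y) {
  sp <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
             (length(x) + length(y) - 2))
  (mean(x) - mean(y)) / sp
}
with(dat, cohens_d(ssit[race == "Black"], ssit[race == "White"]))
```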

Table 2.

Descriptive Statistics and Correlation Matrix

                               M      SD      1      2      3      4      5      6      7      8      9      10
1. Game-Like Shapes Test       15.39  3.49
2. Written SJT                 77.30  12.92   .40*
3. Age                         20.28  3.22    −.13*  .01
4. Gender                      0.82   0.39    −.06   .13*   .02
5. Black/African American      0.17   0.37    −.07   −.10*  .20*   .09
6. Arab/Middle Eastern         0.23   0.42    −.04   −.09   −.08   −.14   −.24*
7. Asian                       0.21   0.40    .02    −.03   −.11*  −.05   −.23*  −.28*
8. White                       0.38   0.49    .08    .18*   .01    .10*   −.36*  −.44*  −.41*
9. ESC                         0.84   0.36    .01    .11*   −.04   .05    .18*   −.12*  −.35*  .24*
10. Native English Speaker     0.91   0.29    −.01   .02    .01    −.08   .09    −.12*  −.08   .08    .51*
11. SAT/ACT                    0.76   0.18    .19*   .19*   .12*   −.13*  −.35*  .05    .16*   .09    −.05   .03

Note. Gender (1 = female, 0 = male); Black/African American (1 = Black, 0 = White, Asian, or Arab/Middle-Eastern); Arab/Middle-Eastern (1 = Arab/Middle-Eastern, 0 = White, Asian, or Black); Asian (1 = Asian, 0 = White, Black, or Arab/Middle-Eastern); White (1 = White, 0 = Black, Asian, or Arab/Middle-Eastern); ESC = born in an English-speaking country (1 = yes, 0 = no); Native English speaker (1 = yes, 0 = no); SAT/ACT scores were self-reported and transformed onto a common, percentile metric based on test norms

* p < .05 (two-tailed)

To statistically test for differences in gender or race/ethnicity effects between the Shapes Test and written SJT, we next conducted a mixed ANOVA. Each model included one between-person factor (gender or race/ethnicity) and one within-person factor (test type: game-like or written social intelligence). We converted the Shapes Test and written SJT total scores into z-scores in order to conduct this analysis. We failed to detect a significant interaction between race/ethnicity and test type, F(3, 380) = 1.17, p = .32. Even though we observed nominally weaker differences in game-like SI between White and Black, Arab, or Asian participants, the effects for race/ethnicity were not significantly lower than those estimated for the written social intelligence task. These results suggest only partial support for Hypothesis 1. We did find a significant interaction between test type and gender, which indicates that the gender effect was significantly different between written and game-like tests, F(1, 382) = 6.63, p = .001. This provided support for Hypothesis 2.
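
The mixed ANOVA described above can be sketched in R as follows, again using the hypothetical `dat` from the earlier sketch (with an added participant `id` column). The key term is the interaction between the between-person factor and the within-person test-type factor.

```r
# Sketch of the mixed ANOVA on z-scored test scores, assuming the hypothetical `dat`.
dat$z_sst  <- scale(dat$sst)[, 1]
dat$z_ssit <- scale(dat$ssit)[, 1]

# Reshape to long format: one row per participant per test type
long <- reshape(dat[, c("id", "gender", "race", "z_sst", "z_ssit")],
                direction = "long",
                varying   = c("z_sst", "z_ssit"),
                v.names   = "z_score",
                timevar   = "test_type",
                times     = c("game_like", "written"),
                idvar     = "id")
long$test_type <- factor(long$test_type)

# Between-person factor (gender) crossed with the within-person factor (test type);
# the gender x test_type interaction tests whether the gender effect differs by test
summary(aov(z_score ~ gender * test_type + Error(factor(id)/test_type), data = long))
```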

We further investigated these differences by plotting the score distributions for each gender and racial/ethnic group (Figure 4). For the Shapes Test, we observed a slightly wider distribution of scores among women compared to men. However, the scores at the median and 75th percentile appeared roughly equivalent. We also found consistent results for scores at the median and 75th percentile within each racial/ethnic category. In contrast, women had higher scores on the written SJT (SSIT) at each quartile compared to men. Moreover, the median written SJT score for White participants was roughly equal to the 75th percentile score for Black and Arab participants. These results further suggest that the Shapes Test may be less susceptible to gender- or race/ethnicity-based adverse impact than the written SJT if using top-down selection.

Figure 4.

Differences in Game-Like Shapes Test (top row) and Written SJT (bottom row) test scores by Gender (first column) and Race/Ethnicity (second column). All scores are reported as z-scores. Each box represents the middle 50% of observed scores for each group (ranging from the 25th percentile to the 75th percentile).

Effects of Exposure to English Language and GMA

To investigate Research Question 1, we explored whether there were any differences in written or game-like SI due to being born in an English-speaking country (ESC) or growing up as a native English speaker using ANOVA (Table 3). It is important to note that a large majority of our participants reported being a native English speaker or born in an English-speaking country. Not only does this limit our statistical power to detect the effects of these variables, but it also limits our ability to generalize our findings beyond other, primarily English-speaking samples. Participants who reported being born in an English-speaking country scored higher on the written SJT (d = −0.28, p = .04) but not the non-verbal, game-like Shapes Test (d = −0.03, p = .67). However, we did not find any differences between native and non-native English speakers for either the written SJT (d = −0.11, p = .52) or the game-like Shapes Test (d = −0.07, p = .67). These mixed findings suggest that growing up in an English-speaking country has a stronger, yet still modest, effect on written SJT performance compared to the game-like task. Next, we tested whether controlling for differences in English language exposure reduced group differences in social intelligence using multiple regression (Table 4, Model 1). Race and gender variables accounted for incremental variance in both written SJT (ΔR2 = .05, p < .001) and game-like Shapes Test scores (ΔR2 = .02, p = .04), beyond past English language exposure. All race and gender effects remained significant for the written SI test, except for the White-Asian comparison. On the other hand, we only observed a significant White-Middle Eastern difference for Shapes Test scores.

Table 3.

Group Differences in Social Intelligence Test Scores

                            Game-Like Shapes Test                    Written SJT (SSIT)
                            M      SD    d                           M      SD     d
Gender
 Female (n = 327)           15.31  3.45                              78.15  12.71
 Male (n = 72)              15.81  3.67  −0.14 [−0.40, 0.11]         73.51  13.34   0.36* [0.11, 0.62]
Race/Ethnicity
 White (n = 152)            15.77  3.34                              80.33  12.09
 Black (n = 64)             14.92  3.77  −0.24 [−0.54, 0.04]         74.45  13.38  −0.46* [−0.77, −0.18]
 Arab (n = 89)              15.20  3.67  −0.16 [−0.43, 0.10]         75.43  13.45  −0.38* [−0.65, −0.12]
 Asian (n = 79)             15.56  3.42  −0.06 [−0.33, 0.20]         76.73  12.56  −0.29* [−0.57, −0.02]
Born in an ESC?
 Yes (n = 337)              15.41  3.47                              77.93  12.84
 No (n = 63)                15.30  3.66  −0.03 [−0.30, 0.24]         73.95  12.97  −0.31* [−0.58, −0.04]
Native English speaker?
 Yes (n = 369)              15.62  3.51                              77.14  13.09
 No (n = 37)                15.36  3.52  −0.07 [−0.41, 0.26]         75.68  13.27  −0.11 [−0.45, 0.22]

Note. Positive d values indicate higher scores for Female (Sex), White (Race/Ethnicity), or for participants who responded “Yes” (Born in an English-speaking country or Native English speaker); 95% confidence intervals are reported in brackets; Effect of Gender on SST scores = F(1,397) = 1.18, p = .28; Effect of Gender on SSIT scores = F(1,397) = 7.71, p = .006; Omnibus effect of Race/Ethnicity on SST scores F(3,380) = 1.08, adjusted R2 < .01, p = .36; Omnibus effect of Race/Ethnicity on SSIT scores F(3,380) = 4.61, adjusted R2 = .03, p = .004; Effect of ESC on SST scores F(1,398) = 0.05, p = .82; Effect of ESC on SSIT scores F(1,398) = 5.07, p = .02; Effect of NES on SST scores F(1,398) = 0.02, p = .89; Effect of NES on SSIT scores F(1,398) = 0.17, p = .68.

* p < .05 (two-tailed)

Table 4.

Hierarchical Multiple Regression Analyses Predicting SI Scores from Demographic Variables

Game-Like Shapes Test Written SJT (SSIT)
B (SE) β R2 ΔR2 B (SE) β R2 ΔR2
Model 1 .02 .02* .06* .05*
 Black −1.03 (0.55) −.11 −6.99 (2.03) −.20*
 Arab −1.23 (0.50) −.15* −4.61 (1.85) −.15*
 Asian −0.43 (0.54) −.05 −2.09 (1.99) −.07
 Gender −0.85 (0.49) −.10 3.88 (1.81) .12*
 ESC −0.13 (0.56) −.01 3.46 (2.05) .10
Model 2 .04 .01 .07* .04*
 Black −0.50 (0.57) −.05 −4.86 (2.11) −.14*
 Arab −1.16 (0.48) −.14* −5.19 (1.79) −.17*
 Asian −0.50 (0.50) −.06 −3.78 (1.83) −.12*
 Gender −0.71 (0.49) −.08 4.39 (1.80) .13*
 SAT/ACT 3.32 (1.11) .17* 12.66 (4.10) .18*
Model 3 .04* .01 .08* .04*
 Black −0.49 (0.57) −.05 −4.97 (2.11) −.14*
 Arab −1.20 (0.50) −.15* −4.53 (1.83) −.15*
 Asian −0.57 (0.54) −.07 −2.60 (1.98) −.08
 Gender −0.71 (0.49) −.08 4.40 (1.80) .13*
 ESC −0.20 (0.55) −.02 3.21 (2.03) .09
 SAT/ACT 3.33 (1.11) .17* 12.40 (4.09) .17*

Note. Listwise n = 335 (due to missing SAT/ACT data); Model 1: effects of demographic variables after controlling for English language exposure; Model 2: effects of demographic variables after controlling for SAT/ACT scores; Model 3: effects of demographic variables after controlling for SAT/ACT and English language exposure; Gender (1 = female, 0 = male); Race/ethnicity effects were tested using dummy-coded variables (referent group = White); ESC = born in an English-speaking country (1 = yes, 0 = no); SAT/ACT scores were self-reported and transformed onto a common, percentile metric based on test norms; B = unstandardized regression coefficient; SE = standard error; β = standardized regression coefficient

* p < .05 (two-tailed)

To test Research Question 2, we first examined the relationship between scholastic test scores and the two measures of social intelligence. SAT/ACT scores correlated .19 (p = .001) with written SI and .19 (p < .001) with non-verbal, game-like test scores. Next, we removed the effects of SAT/ACT scores and then reexamined the degree of group differences for each test (Model 2 in Table 4). Despite controlling for SAT/ACT scores, the male-female, White-Black, White-Middle-Eastern, and White-Asian score differences in written SJT scores all remained statistically significant (incremental ΔR2 = .04, p < .001). Controlling for SAT/ACT scores did not change any of the observed demographic effects for the game-like Shapes Test except that the White-Middle-Eastern difference became statistically significant (despite an incremental ΔR2 of only .01, p = .14). Based on these results, it does not appear that controlling for SAT/ACT scores or English-speaking background decreases group differences in either Shapes Test or written SJT scores.
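
The hierarchical regression logic summarized in this paragraph and in Table 4 can be sketched as two nested linear models compared for incremental R2, as shown below. The data frame and column names (`dat`, `sat_act`) are hypothetical; cases with missing SAT/ACT scores are dropped listwise by lm(), mirroring the listwise n noted under Table 4.

```r
# Sketch of the Table 4, Model 2 logic under assumed inputs:
# step 1 enters SAT/ACT only, step 2 adds gender and race/ethnicity dummies.
step1 <- lm(ssit ~ sat_act, data = dat)
step2 <- lm(ssit ~ sat_act + gender + race, data = dat)

# Incremental variance explained by gender and race/ethnicity beyond SAT/ACT
delta_r2 <- summary(step2)$r.squared - summary(step1)$r.squared
anova(step1, step2)   # F test of the increment
delta_r2
```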

Concurrent Effects of Gender, Race/Ethnicity and other Individual Differences

We further tested Hypotheses 1 and 2 by regressing written SJT and Shapes Test scores onto all of the demographic variables of interest after entering SAT/ACT scores and English language exposure (Model 3; Table 4). These regressions also act to parse out the unique effects of gender and race/ethnicity from one another. Here, gender and race/ethnicity failed to predict meaningful incremental variance in game-like test scores yet accounted for 5% of incremental variance in written SJT scores. All of the group differences in written SJT remained statistically significant except the White-Asian dummy-coded variable. These results further indicate that the differences in observed gender and race/ethnicity effects between written and game-like test scores could not be fully attributed to English language exposure or GMA as estimated using self-reported SAT/ACT scores.

Measurement Invariance and Differential Item Functioning

It is important to establish measurement invariance in order to properly interpret mean score differences or the lack thereof (Vandenberg & Lance, 2000). We first tested metric equivalence of the factor loadings for the Shapes Test and the written SJT separately for both gender and race/ethnicity using confirmatory factor analysis. To test metric equivalence, we estimated a single factor model for each test and estimated the change in model fit after constraining all factor loadings to be equal across gender or racial/ethnic groups. For gender, we found that the factor loadings were invariant for the Shapes Test (df = 22, Δχ2 = 17.51, p = .73) and the written SJT (df = 28, Δχ2 = 28.41, p = .44) as indicated by minimal change in model fit after applying the invariance constraints. We also found that the factor loadings were invariant when comparing Whites versus all other racial and ethnic groups for the Shapes Test (df = 22, Δχ2 = 20.11, p = .58) and the written SJT (df = 28, Δχ2 = 29.29, p = .40). Thus, the items were similarly related to latent construct scores for different demographic groups across both tests, suggesting that the observed mean differences between demographic groups were not due to some form of measurement bias but rather reflect real differences on test scores. These differences could possibly be explained by methodological differences between the two tests, including the use of text-based scenarios, work-specific context, or the presence of human characters in the SSIT. Although we are not able to deduce the exact cause for the lack of group differences in Shapes Test scores, our results suggest that this test provides a comparable measure of social intelligence while minimizing score differences based on self-identified gender or race/ethnicity.
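
For readers who wish to reproduce this type of metric invariance test, the sketch below uses the lavaan package in R to fit a single-factor model with and without equality constraints on the loadings across groups and compares the two models with a chi-square difference test. The item-level data frame, its column names, and the treatment of the binary items as continuous indicators are assumptions for illustration, not the authors' exact model specification.

```r
# Sketch of a metric (factor-loading) invariance test with lavaan, assuming a
# hypothetical data frame `sst_items` with columns item1-item23 and a grouping
# column `gender`.
library(lavaan)

model <- paste("SI =~", paste0("item", 1:23, collapse = " + "))

fit_configural <- cfa(model, data = sst_items, group = "gender")
fit_metric     <- cfa(model, data = sst_items, group = "gender",
                      group.equal = "loadings")

# Chi-square difference test: a nonsignificant change supports metric invariance
anova(fit_configural, fit_metric)
```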

Lastly, we also tested for differential item functioning (DIF) on the Shapes Test using the difR package in R. We used this package to test for uniform and non-uniform DIF using binomial logistic regression (Magis et al., 2010; Rogers & Swaminathan, 1993). We failed to detect DIF based on either gender or race/ethnicity for 20 of the 23 SST items (Table 5). For race/ethnicity, no evidence of DIF could be detected. Even if we had loosened our statistical significance criterion to p < .10, only two items would have met this threshold for DIF. These results indicate that the lack of differences in SST scores based on racial or ethnic group membership is not merely due to DIF. Likewise, we found evidence for gender DIF for only three of the 23 SST items. In each of these three cases, we detected only uniform DIF, with two items displaying an advantage for men (items 1 and 17). However, the observed effect sizes for all three items were relatively weak in magnitude (Nagelkerke's R2 = .03 or less; Jodoin & Gierl, 2001). We conclude that there is practically no evidence for systematic DIF in the Shapes Test based on gender or race/ethnicity.
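
A minimal sketch of this DIF analysis with the difR package is shown below. The scored item matrix and grouping vector are hypothetical placeholders; difLogistic() implements the logistic-regression DIF procedure cited above, and difGenLogistic() extends it to multiple focal groups (e.g., the four race/ethnicity categories).

```r
# Sketch of logistic-regression DIF with difR, assuming `sst_scored` is a 0/1
# item-response matrix (23 columns) and `group` is a binary grouping vector
# (e.g., 1 = female, 0 = male).
library(difR)

dif_gender <- difLogistic(Data = sst_scored, group = group, focal.name = 1,
                          type = "both")   # tests uniform and nonuniform DIF
dif_gender                                 # prints Wald statistics and flagged items
```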

Table 5.

DIF Results for Game-Like Shapes Test Items

           Race/Ethnicity DIF      Gender DIF
           Wald χ2      p          Wald χ2      p
Item 1     1.90         .39        6.03         .05
Item 2     2.38         .30        1.82         .40
Item 3     0.68         .71        1.24         .54
Item 4     2.95         .23        1.85         .40
Item 5     0.02         .99        1.64         .44
Item 6     2.64         .27        3.23         .20
Item 7     3.31         .19        1.05         .59
Item 8     3.36         .19        11.25        <.01
Item 9     2.21         .33        0.99         .61
Item 10    1.21         .55        0.11         .95
Item 11    5.91         .05        1.01         .60
Item 12    3.31         .19        2.74         .25
Item 13    0.68         .71        0.40         .82
Item 14    0.48         .79        0.75         .69
Item 15    0.90         .64        3.24         .20
Item 16    0.43         .81        3.82         .15
Item 17    4.31         .12        6.20         .05
Item 18    2.92         .23        2.04         .36
Item 19    2.14         .34        0.16         .92
Item 20    0.36         .83        3.84         .15
Item 21    3.98         .14        3.05         .22
Item 22    5.86         .05        2.61         .27
Item 23    1.45         .48        2.46         .29

Note. DIF = differential item functioning; All estimates were calculated using the difR package (Magis et al., 2010); Each statistic represents an omnibus test for uniform and nonuniform DIF for each item.

Discussion

Collectively, our findings suggest that the game-like test in our study yielded minimal subgroup differences relative to a written SJT. These findings also represent a constructive replication by recruiting participants from a different population (undergraduate students instead of MTurk workers) and holding educational attainment and age relatively constant compared to past research (Brown et al., 2019). We also expanded the findings of Brown et al. by including Arab/Middle-Eastern participants, finding little difference in scores between the groups. Further, we tested for measurement invariance by gender or race/ethnicity and found no evidence of bias for the Shapes Test or the written SJT. These results suggest that animated shape tasks, like those used in the Shapes Test and other tests (Bell et al., 2010), provide a promising and novel alternative to written SJTs when building game-based assessments.

The animated shape task paradigm featured in the Shapes Test provides several unique advantages compared to other forms of video-based tests. Even though researchers have theorized that video-based tests should yield weaker group score differences compared to text-based tests (Chan & Schmitt, 1997), this effect has not always been replicated in subsequent research (e.g., Bardach et al., 2021). One reason why some video-based tests may still yield group score differences is that many include cultural cues expressed by actors or characters through verbal or non-verbal behavior. These cues, in either video or static form, have been found to have differential effects based on the test taker’s cultural background and can contribute to group score differences (Adams et al., 2010; Golubovich & Ryan, 2021). In contrast, shape animations represent social interactions in a way that is relatively free of cultural cues. This lack of culturally-specific cues may explain why we observed weaker group score differences for the Shapes Test. However, future research is needed to determine whether abstract, animated shape tasks like those in the Shapes Test yield weaker group score differences compared to a video-based test including human actors or characters. This comparison is necessary to see whether the abstract nature of the animated shape videos further minimizes demographic group effects beyond simply replacing text with video. This would also be an important contribution to adverse impact research, where scholars are still working to identify ways of mitigating group score differences on cognitively-loaded assessments.

Despite this uncertainty, animated shape tasks like the ones featured in the Shapes Test provide several logistical advantages for designing game-based assessments. These shape animations can be developed without human actors or highly-skilled digital animators. This makes them less expensive and time intensive to develop compared to stimuli used in other video-based tests in research and practice (e.g., Golubovich et al., 2017). As noted by Weekley and colleagues (2015), the typical videos used in high-fidelity tests remain costly to create even as computer and recording equipment has become more affordable. Moreover, the current form of the Shapes Test is freely available for research use via the Open Science Framework and all of the animated videos can be used or adapted for future research purposes (https://osf.io/sqxy6). We hope that this enables future research on gamified assessments given that existing gamified tests are often proprietary or otherwise difficult to access for researchers.

This method may not only help alleviate validity-diversity tradeoffs; it is also designed to measure social skills that have grown increasingly relevant in the U.S. job market (Deming, 2017). Specific forms of social intelligence, including social perception and emotion understanding, are considered important characteristics for leadership (Zaccaro et al., 2018) and managerial effectiveness (Côté, 2017). These skills may also be especially important for performance in highly complex, teamwork settings (Farh et al., 2012; Woolley et al., 2010). Along these lines, there is a strong desire to measure these qualities in prospective job candidates and current employees while also balancing diversity goals. When doing so, performance-based measures are less susceptible to faking and response distortion and thus more suitable for high-stakes contexts compared to trait-based, self-report measures (Christiansen et al., 2010). Moreover, the Shapes Test also appears to yield minimal subgroup differences despite being positively related to measures of GMA. Therefore, we argue that the Shapes Test, and similar animated shape tasks, provide a promising and novel approach for designing gamified assessments.

Implications for Research and Practice

In past research, group differences in test scores have often been attributed to the influence of GMA (Dahlke & Sackett, 2017) or confounding effects of reading or verbal ability (Ployhart & Holtz, 2008; Whetzel et al., 2008). Our results indicate that cognitive saturation may not always guarantee substantial subgroup test score differences. Scores on the Shapes Test tend to correlate positively with more general measures of verbal ability, abstract reasoning, and the cognitive reflection task (Brown et al., 2019), and we observed a positive correlation between Shapes Test scores and self-reported admissions test scores in the present study. Despite these relationships, the Shapes Test has yielded weaker group mean differences compared to other, cognitively-loaded tasks. We hope that these findings help spur further research to uncover the conditions or methods which may help mitigate observed group differences in test scores. According to Cottrell and colleagues (2015), subgroup differences in cognitive test scores can potentially be minimized by developing tests that are equally unfamiliar to all test-takers. It is possible that most of our participants had never performed a task like the Shapes Test, given that tasks like this have rarely been used outside of clinical or neuroimaging research. If this is the case, our findings suggest that novel approaches to measuring social intelligence or other cognitive skills may be a viable way to minimize differences in test scores based on gender, race, or ethnicity. This could help alleviate concerns about adverse impact while still retaining the predictive validity of a cognitively-based task.

Although the Shapes Test does not meet all of the criteria for a gamified assessment in its current form, we believe that the animated shape paradigm can be further developed to create a more game-like test experience. Future versions of this assessment could potentially allow test takers to control the movements of one of the shape characters. This would enable test takers to move their character in response to the other shapes and to receive feedback based on their movements. Another potential feature is to replace multiple-choice response options by allowing the test taker to indicate a character or behavior by touch. For example, test takers could be instructed to identify the character who is displaying a specific emotion or mental state (e.g., who is the bully?) and would respond by tapping the shape during the video. Another possible direction is that the interactions between shapes could be designed to be continuous rather than short, individual videos. Furthermore, a more gamified version of this test might involve branching questions where subsequent shape interactions are shown based on responses to each question (Reddock et al., 2020) instead of following a uniform, static sequence. This would enable the test to become more interactive and to potentially simulate more complex social interactions. These modifications or adaptations are beyond the scope of the present study, but we believe that our results highlight the utility of this method for further development and gamification in the future.

Directions for Future Research

Our results suggest that the stimulus methods in the Shapes Test may help minimize group score differences, potentially by minimizing verbal content and using abstract animations to represent social interactions. Although we found that the Shapes Test yielded weaker score differences between racial or ethnic groups relative to a written SJT, it would be useful to observe how the Shapes Test performs compared to other video-based tests involving human actors or characters. This would allow researchers to determine whether using abstract representations of social interactions can result in weaker subgroup differences while achieving similar criterion-related validity relative to a comparable video-based test. Along these lines, past research suggests that higher fidelity test methods, such as video-based social skill SJTs, provide stronger criterion-related validity compared to SJTs using written narratives (corrected validity coefficients of .47 and .27, respectively; Christian et al., 2010). According to Lievens and Sackett (2006), higher fidelity tests may provide a more realistic representation of the desired construct. This may be especially true for social intelligence and other forms of social acuity, since videos and animations can provide a broader range of cues for test takers to observe and judge. However, more research is needed to verify whether using higher fidelity stimuli generally improves validity. With a deeper understanding of group differences in narrow aspects of social intelligence, it may be possible to build a battery of social intelligence assessments which yields the most criterion-related validity while also minimizing potential for adverse impact (e.g., using Pareto optimization; De Corte et al., 2011).

To date, many scholars have assumed that any increase in predictive validity results in greater adverse impact potential or vice versa (De Corte et al., 2011). As a result, most methods for achieving greater diversity in selection focus on decreasing the degree to which cognitively-loaded assessment scores are weighted in decision making. Although cognitively-loaded tests are expected to yield substantial subgroup differences based on race or ethnicity, the Shapes Test appears to be an exception to this rule. Our findings suggest that it might be possible to design assessments which can be used to achieve both validity and diversity goals. We hope that our results help inspire future research to better understand the construct or method factors affecting subgroup test score differences. Scholars have theorized about the benefits of using alternative testing methods (e.g., videos or constructed response formats; Lievens et al., 2019; Ployhart & Holtz, 2008) but there has been little systematic research to test these propositions across different content or construct domains. These advances would be greatly influential and would help further inform the development of game-based assessments for organizational use for selection or development purposes.

Study Limitations

When drawing conclusions, it is important to acknowledge some of the limitations of the sample in our study. First, the proportion of male and female participants in our sample is not representative of the general population. However, our results for gender differences in Shapes Test scores replicate the null findings reported by Brown et al. (2019), which were based on a sample of 505 MTurk workers with a more balanced gender composition (44% female). Likewise, a recent report indicates that women constitute over three quarters of frontline health care (76.8%) and social services workers in the United States (85.2%; Rho et al., 2020). These occupations are especially relevant given that they involve a high degree of emotional labor (Grandey, 2000), which is theorized to strengthen the relationship between social or emotional intelligence and job performance (Joseph & Newman, 2010). Although our results replicate previous findings by Brown et al. using a different participant population, both samples also consisted mostly of U.S. participants who were native English speakers and had some level of college education. Further research is needed to determine whether these results can be replicated in more diverse samples or among participants outside of the U.S.

Another limitation is that the internal consistency for the game-like Shapes Test was somewhat weaker than that for the written SJT. This limitation was also acknowledged in the development of the Shapes Test (Brown et al., 2019). In its current form, the Shapes Test is likely not suitable nor intended for use in high-stakes selection settings. Our interest in featuring the test in our study was to introduce the novel method that it uses as a potential alternative to text-based testing methods when designing future game-based assessments. That said, the underlying construct of social intelligence is likely multidimensional, and there is much confusion over distinctions among more narrowly defined facets (e.g., Olderbak & Wilhelm, 2020; Wong et al., 1995). In such a case, alternative procedures for reliability estimation, such as test-retest reliability, may be more suitable than internal consistency. Although the lack of group score differences on the game-like Shapes Test could be explained by unreliability, the corrected group differences for the game-like test remain smaller compared to those for the written SJT (e.g., White-Black corrected d for game-like SI = 0.29; corrected d for written SJT = 0.51). This also fails to account for the differences in direction for the gender effects for each test. Lastly, neither the present study nor Brown et al. tested for group differences in criterion-related validity. Even though our results suggest that the game-like Shapes Test may lessen the potential for adverse impact, further research is needed to observe whether the game-like test also provides comparable prediction of behavioral criteria or job performance.
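
As an illustration of the reliability correction implied by these corrected d values, the sketch below applies the standard single-test attenuation correction (dividing an observed d by the square root of the test's internal consistency). Whether this is precisely the correction underlying the comparison above is an assumption; the values are shown only to make the formula concrete.

```r
# Illustrative sketch only: standard correction of an observed standardized mean
# difference for unreliability in the test scores (d / sqrt(r_xx)).
corrected_d <- function(d_obs, reliability) d_obs / sqrt(reliability)
corrected_d(0.24, 0.71)   # ~0.29 for the game-like test (alpha = .71)
corrected_d(0.46, 0.86)   # ~0.50 for the written SJT (alpha = .86)
```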

Conclusions

Our results demonstrate the potential benefits of designing gamified assessments using the animated shape paradigm featured in the Shapes Test. These animations require little reading compared to traditional, text-based measures of interpersonal skills and include fewer cultural cues than other video-based assessments. Animated shape tasks like those in the Shapes Test may help researchers and practitioners design gamified social intelligence tests for selection or development that pose less risk of adverse impact than traditional, text-based methods (e.g., written SJTs). Not only does this method show promise for balancing validity and diversity goals, but it may also broaden access to gamification research because it is cheaper to develop and use than proprietary game-based assessments. In particular, the video files for the Shapes Test are freely available for research purposes. We hope that our study introduces this methodology to researchers and practitioners interested in developing novel, gamified assessments and inspires others to seek out alternative testing methods from outside the organizational literature.

Acknowledgements:

Data from this study were presented as a poster at the 35th Annual Meeting of the Society for Industrial and Organizational Psychology (virtual conference).

Funding Information:

This work was supported by funds from the National Institutes of Health (grant U01MH119705).

Footnotes

Conflict of Interest Statement: We do not have any conflicts of interest to disclose.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

1. Adams RB, Rule NO, Franklin RG, Wang E, Stevenson MT, Yoshikawa S, Nomura M, Sato W, Kveraga K, & Ambady N (2010). Cross-cultural Reading the Mind in the Eyes: An fMRI investigation. Journal of Cognitive Neuroscience, 22, 97–108. 10.1162/jocn.2009.21187
2. Arthur W Jr., & Villado AJ (2008). The importance of distinguishing between constructs and methods when comparing predictors in personnel selection research and practice. Journal of Applied Psychology, 93, 435–442. 10.1037/0021-9010.93.2.435
3. Barchard KA, Hensley S, & Anderson E (2013). When proportion consensus scoring works. Personality and Individual Differences, 55, 14–18. 10.1016/j.paid.2013.01.017
4. Bardach L, Rushby JV, Kim LE, & Klassen RM (2021). Using video- and text-based situational judgment tests for teacher selection: A quasi-experiment exploring the relations between test format, subgroup differences, and applicant reactions. European Journal of Work and Organizational Psychology, 30, 251–264. 10.1080/1359432X.2020.1736619
5. Baron-Cohen S, Wheelwright S, Hill J, Raste Y, & Plumb I (2001). The “Reading the Mind in the Eyes” test revised version: A study with normal adults, and adults with Asperger syndrome or high-functioning autism. Journal of Child Psychology and Psychiatry, 42, 241–251. 10.1111/1469-7610.00715
6. Bell MD, Fiszdon JM, Greig TC, & Wexler BE (2010). Social attribution test – multiple choice (SAT-MC) in schizophrenia: Comparison with community sample and relationship to neurocognitive, social cognitive, and symptom measures. Schizophrenia Research, 122, 164–171. 10.1016/j.schres.2010.03.024
7. Bobko P, & Roth PL (2013). Reviewing, categorizing, and analyzing the literature on Black-White mean differences for predictors of job performance: Verifying some perceptions and updating/correcting others. Personnel Psychology, 66, 91–126. 10.1111/peps.12007
8. Brown MI, Ratajska A, Hughes SL, Fishman JB, Huerta E, & Chabris CF (2019). The social shapes test: A new measure of social intelligence, mentalizing, and theory of mind. Personality and Individual Differences, 143, 107–117. 10.1016/j.paid.2019.01.035
9. Bryan VM, & Mayer JD (2021). Are people-centered intelligences psychometrically distinct from thing-centered intelligences? A meta-analysis. Journal of Intelligence, 9, 48. 10.3390/jintelligence9040048
10. Burgoyne AP, Mashburn CA, & Engle RW (2021). Reducing adverse impact in high-stakes testing. Intelligence, 87, 101561. 10.1016/j.intell.2021.101561
11. Carr A (2018). Moneyball for business. Fast Company, 226, 60–67.
12. Chamorro-Premuzic T, Winsborough D, Sherman RA, & Hogan R (2016). New talent signals: Shiny new objects or a brave new world? Industrial and Organizational Psychology, 9, 621–640. 10.1017/iop.2016.6
13. Chan D, & Schmitt N (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and validity perceptions. Journal of Applied Psychology, 82, 143–159. 10.1037/0021-9010.82.1.143
14. Christian MS, Edwards BD, & Bradley JC (2010). Situational judgment tests: Constructs assessed and a meta-analysis of their criterion-related validities. Personnel Psychology, 63, 83–117. 10.1111/j.1744-6570.2009.01163.x
15. Christiansen ND, Janovics JE, & Siers BP (2010). Emotional intelligence in selection contexts: Measurement method, criterion-related validity, and vulnerability to response distortion. International Journal of Selection and Assessment, 18, 87–101. 10.1111/j.1468-2389.2010.00491.x
16. Côté S (2017). Enhancing managerial effectiveness via four core facets of emotional intelligence: Self-awareness, social perception, emotion understanding, and emotion regulation. Organizational Dynamics, 46, 140–147. 10.1016/j.orgdyn.2017.05.007
17. Cottrell JM, Newman DA, & Roisman GI (2015). Explaining the Black-White gap in cognitive test scores: Toward a theory of adverse impact. Journal of Applied Psychology, 100, 1713–1736. 10.1037/apl0000020
18. Dahlke JA, & Sackett PR (2017). The relationship between cognitive-ability saturation and subgroup mean differences across predictors of job performance. Journal of Applied Psychology, 102, 1403–1420. 10.1037/apl0000234
19. De Corte W, Sackett PR, & Lievens F (2011). Designing pareto-optimal selection systems: Formalizing the decisions required for selection system development. Journal of Applied Psychology, 96, 907–926. 10.1037/a0023298
20. Deming DJ (2017). The growing importance of social skills in the labor market. The Quarterly Journal of Economics, 132, 1593–1640. 10.1093/qje/qjx022
21. Eddy CM (2019). What do you have in mind? Measures to assess mental state reasoning in neuropsychiatric populations. Frontiers in Psychiatry, 10, 425. 10.3389/fpsyt.2019.00425
22. Elfenbein HA, & MacCann C (2017). A closer look at ability emotional intelligence (EI): What are its component parts, and how do they relate to one another? Social and Personality Psychology Compass, 11, e12324. 10.1111/spc3.12324
23. Farh CICC, Seo MG, & Tesluk PE (2012). Emotional intelligence, teamwork effectiveness, and job performance: The moderating role of job context. Journal of Applied Psychology, 97, 890–900. 10.1037/a0027377
24. Ferris GR, Witt LA, & Hochwarter WA (2001). Interaction of social skill and general mental ability on job performance and salary. Journal of Applied Psychology, 86, 1075–1082. 10.1037/0021-9010.86.6.1075
25. Frey MC, & Detterman DK (2004). Scholastic assessment or g? The relationship between the Scholastic Assessment Test and general cognitive ability. Psychological Science, 15, 373–378. 10.1111/j.0956-7976.2004.00687.x
26. Georgiou K, Gouras A, & Nikolaou I (2019). Gamification in employee selection: The development of a gamified assessment. International Journal of Selection and Assessment, 27, 91–103. 10.1111/ijsa.12240
27. Golubovich J, & Ryan AM (2021). Performance on video-based situational judgment test items: Simulated interracial interactions. Journal of Business and Psychology, 36, 693–711. 10.1007/s10869-020-09697-1
28. Golubovich J, Seybert J, Martin-Raugh M, Naemi B, Vega RP, & Roberts RD (2017). Assessing perceptions of interpersonal behavior with a video-based situational judgment test. International Journal of Testing, 17, 191–209. 10.1080/15305058.2016.1194275
29. Grandey A (2000). Emotion regulation in the workplace: A new way to conceptualize emotional labor. Journal of Occupational Health Psychology, 5, 95–110. 10.1037/1076-8998.5.1.95
30. Hausdorf PA, & Robie C (2018). Understanding subgroup differences with general mental ability tests in employment selection: Exploring socio-cultural factors across inter-generational groups. International Journal of Selection and Assessment, 26, 176–190. 10.1111/ijsa.12226
31. Heider F, & Simmel M (1944). An experimental study of apparent behavior. American Journal of Psychology, 57, 243–259. 10.2307/1416950
32. Isik L, Koldewyn K, Beeler D, & Kanwisher N (2017). Perceiving social interactions in the posterior superior temporal sulcus. Proceedings of the National Academy of Sciences, 114, E9145–E9152. 10.1073/pnas.1714471114
33. Jodoin MG, & Gierl MJ (2001). Evaluating Type I error and power rates using an effect size measure with logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349. 10.1207/S15324818AME1404_2
34. Johannesen JK, Fiszdon JM, Weinstein A, Ciosek D, & Bell MD (2018). The social attribution task – multiple choice (SAT-MC): Psychometric comparison with social cognitive measures for schizophrenia research. Psychiatry Research, 262, 154–161. 10.1016/j.psychres.2018.02.011
35. Joseph DL, & Newman DA (2010). Emotional intelligence: An integrative meta-analysis and cascading model. Journal of Applied Psychology, 95, 54–78. 10.1037/a0017286
36. Karakolidis A, O’Leary M, & Scully D (2021). Animated videos in assessment: Comparing validity evidence from and test-takers’ reactions to an animated and a text-based situational judgment test. International Journal of Testing, 21, 57–79. 10.1080/15305058.2021.1916505
37. Kittel AFD, Olderbak S, & Wilhelm O (2021). Sty in the mind’s eye: A meta-analytic investigation of the nomological network and internal consistency of the “Reading the Mind in the Eyes” test. Assessment. Advance online publication. 10.1177/1073191121996469
38. Klin A (2000). Attributing social meaning to ambiguous visual stimuli in higher-functioning autism and Asperger syndrome: The social attribution task. Journal of Child Psychology and Psychiatry, and Allied Disciplines, 41, 831–846. 10.1111/1469-7610.00671
39. Landers RN (2019). Gamification misunderstood: How badly executed and rhetorical gamification obscures its transformative potential. Journal of Management Inquiry, 28, 137–140. 10.1177/1056492618790913
40. Landers RN, Auer EM, & Abraham JD (2020). Gamifying a situational judgement test with immersion and control game elements. Journal of Managerial Psychology. 10.1108/JMP-10-2018-0446
41. Landers RN, Auer EM, Collmus AB, & Armstrong MB (2018). Gamification science, its history and future: Definitions and a research agenda. Simulation & Gaming, 49, 315–337. 10.1177/1046878118774385
42. Lee HS, Corbera S, Poltorak A, Park K, Assaf M, Bell MD, Wexler BE, Cho YI, Jung S, Brocke S, & Choi KH (2018). Measuring theory of mind in schizophrenia research: Cross-cultural validation. Schizophrenia Research, 201, 187–195. 10.1016/j.schres.2018.06.022
43. Leutner F, Codreanu SC, Liff J, & Mondragon N (2020). The potential of game- and video-based assessments for social attributes: Examples from practice. Journal of Managerial Psychology. Advance online publication. 10.1108/JMP-01-2020-0023
44. Lievens F (2013). Adjusting medical school admission: Assessing interpersonal skills using situational judgment tests. Medical Education, 47, 182–189. 10.1111/medu.12089
45. Lievens F, & Chan D (2010). Practical intelligence, emotional intelligence, and social intelligence. In Farr JL & Tippins NT (Eds.), Handbook of employee selection (pp. 342–364). New York, NY: Routledge.
46. Lievens F, De Corte W, & Westerveld L (2015). Understanding the building blocks of selection procedures: Effects of response fidelity on performance and validity. Journal of Management, 41, 1604–1627. 10.1177/0149206312463941
47. Lievens F, & Motowidlo SJ (2016a). Situational judgment tests: From measures of situational judgment to measures of general domain knowledge. Industrial & Organizational Psychology, 9, 3–22. 10.1017/iop.2015.71
48. Lievens F, Patterson F, Corstjens J, Martin S, & Nicholson S (2016b). Widening access in selection using situational judgment tests: Evidence from the UKCAT. Medical Education, 50, 624–636. 10.1111/medu.13060
49. Lievens F, & Sackett PR (2006). Video-based versus written situational judgment tests: A comparison in terms of predictive validity. Journal of Applied Psychology, 91, 1181–1188. 10.1037/0021-9010.91.5.1181
50. Lievens F, Sackett PR, Dahlke JA, Oostrom JK, & De Soete B (2019). Constructed response formats and their effects on minority–majority differences and validity. Journal of Applied Psychology, 104, 715–726. 10.1037/apl0000367
51. Ludwig NN, Hecht EE, King TZ, Revill KP, Moore M, Fink SE, & Robins DL (2020). A novel social attribution paradigm: The Dynamic Interacting Shape Clips (DISC). Brain and Cognition, 138, 105507. 10.1016/j.bandc.2019.105507
52. MacCann C, & Roberts RD (2008). New paradigms for assessing emotional intelligence: Theory and data. Emotion, 8, 540–551. 10.1037/a0012746
53. Magis D, Béland S, Tuerlinckx F, & De Boeck P (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862. 10.3758/BRM.42.3.847
54. Martin A, & Weisberg J (2003). Neural foundations for understanding social and mechanical concepts. Cognitive Neuropsychology, 20, 575–587. 10.1080/02643290342000005
55. Martinez G, Mosconi E, Daben-Huard C, Parellada M, Fananas L, Gaillard R, Fatjo-Vilas M, Krebs MO, & Amado I (2019). “A circle and a triangle dancing together”: Alteration of social cognition in schizophrenia compared to autism spectrum disorders. Schizophrenia Research, 210, 94–100. 10.1016/j.schres.2019.05.043
56. Mayer JD, Caruso DR, & Salovey P (2016). The ability model of emotional intelligence: Principles and updates. Emotion Review, 8, 290–300. 10.1177/1754073916639667
57. McCord JL, Harman JL, & Purl J (2019). Game-like personality testing: An emerging mode of personality assessment. Personality and Individual Differences, 143, 95–102. 10.1016/j.paid.2019.02.017
58. Moessnang C, Baumeister S, Tillmann J, Goyard D, Charman T, Ambrosino S, … & Meyer-Lindenberg A (2020). Social brain activation during mentalizing in a large autism cohort: The Longitudinal European Autism Project. Molecular Autism, 11, 1–17. 10.1186/s13229-020-0317-x
59. O’Boyle E, Humphrey R, Pollack J, Hawver T, & Story P (2011). The relation between emotional intelligence and job performance: A meta-analysis. Journal of Organizational Behavior, 32, 788–818. 10.1002/job.714
60. Olderbak S, & Wilhelm O (2020). Overarching principles for the organization of socioemotional constructs. Current Directions in Psychological Science, 29, 63–70. 10.1177/0963721419884317
61. Olderbak S, Wilhelm O, Hildebrandt A, & Quoidbach J (2018). Sex differences in facial emotion perception ability across the lifespan. Cognition and Emotion, 33, 579–588. 10.1080/02699931.2018.1454403
62. Patterson F, Baron H, Carr V, Plint S, & Lane P (2009). Evaluation of three short-listing methodologies for selection into postgraduate training in general practice. Medical Education, 43, 50–57. 10.1111/j.1365-2923.2008.03238.x
63. Pinkham AE, Harvey PD, & Penn DL (2018). Social cognition psychometric evaluation: Results of the final validation study. Schizophrenia Bulletin, 44, 737–748. 10.1093/schbul/sbx117
64. Ployhart RE, & Holtz BC (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and sex group differences and adverse impact in selection. Personnel Psychology, 61, 153–172. 10.1111/j.1744-6570.2008.00109.x
65. Premack D, & Woodruff G (1978). Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1, 515–526. 10.1017/S0140525X00076512
66. Ratajska A, Brown MI, & Chabris CF (2020). Attributing social meaning to animated shapes: A new experimental study of apparent behavior. American Journal of Psychology, 133, 295–312. https://www.jstor.org/stable/10.5406/amerjpsyc.133.3.0295
67. Reddock CM, Auer EM, & Landers RN (2020). A theory of branched situational judgment tests and their applicant reactions. Journal of Managerial Psychology, 35, 255–270. 10.1108/JMP-10-2018-0434
68. Rho HJ, Brown H, & Fremstad S (2020). A basic demographic profile of workers in frontline industries [white paper]. Washington, DC: Center for Economic and Policy Research.
69. Rogers HJ, & Swaminathan H (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105–116. 10.1177/014662169301700201
70. Roth PL, Bevier CA, Bobko P, Switzer FS, & Tyler P (2001). Ethnic group differences in cognitive ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330. 10.1111/j.1744-6570.2001.tb00094.x
71. Roth PL, Van Iddekinge CH, DeOrtentiis PS, Hackney KJ, Zhang L, & Buster MA (2017). Hispanic and Asian performance on selection procedures: A narrative and meta-analytic review of 12 common predictors. Journal of Applied Psychology, 102, 1178–1202. 10.1037/apl0000195
72. Schlegel K, & Mortillaro M (2019). The Geneva Emotional Competence Test (GECo): An ability measure of workplace emotional intelligence. Journal of Applied Psychology, 104, 559–580. 10.1037/apl0000365
73. Spector PE, Rosen CC, Richardson HA, Williams LJ, & Johnson RE (2019). A new perspective on method variance: A measure-centric approach. Journal of Management, 45, 855–880. 10.1177/0149206316687295
74. Speer AB, Christiansen ND, & Laginess AJ (2019). Social intelligence and interview accuracy: Individual differences in the ability to construct interviews and rate accurately. International Journal of Selection and Assessment, 27, 104–128. 10.1111/ijsa.12237
75. Stark S, Chernyshenko OS, & Drasgow F (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91, 1292–1306. 10.1037/0021-9010.91.6.1292
76. Thorndike EL (1920). Intelligence and its uses. Harper’s Magazine, 140, 227–235.
77. Tippins N (2009). Where is the unproctored internet testing train headed now? Industrial and Organizational Psychology, 2, 69–76. 10.1111/j.1754-9434.2008.01111.x
78. Vandenberg RJ, & Lance CE (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70. 10.1177/109442810031002
79. Vandewouw MM, Safar K, Mossad SI, Lu J, Lerch JP, Anagnostou E, & Taylor MJ (2021). Do shapes have feelings? Social attribution in children with autism spectrum disorder and attention-deficit/hyperactivity disorder. Translational Psychiatry, 11, 493. 10.1038/s41398-021-01625-y
80. Wai J, Brown MI, & Chabris CF (2018). Using standardized test scores to include general cognitive ability in education research and policy. Journal of Intelligence, 6, 37. 10.3390/jintelligence6030037
81. Weekley JA, Hawkes B, Guenole N, & Ployhart RE (2015). Low-fidelity simulations. Annual Review of Organizational Psychology and Organizational Behavior, 2, 295–322. 10.1146/annurev-orgpsych-032414-111304
82. Weiner EJ, & Sanchez DR (2020). Cognitive ability in virtual reality: Validity evidence for VR game-based assessments. International Journal of Selection and Assessment, 28, 215–235. 10.1111/ijsa.12295
83. Whetzel DL, McDaniel MA, & Nguyen NT (2008). Subgroup differences in situational judgment test performance: A meta-analysis. Human Performance, 21, 291–309. 10.1080/08959280802137820
84. Whetzel DL, Sullivan TS, & McCloy RA (2020). Situational judgment tests: An overview of development practices and psychometric characteristics. Personnel Assessment and Decisions, 6, 1–16. https://scholarworks.bgsu.edu/pad/vol6/iss1/1
85. White SJ, Coniston D, Rogers R, & Frith U (2011). Developing the Frith-Happé animations: A quick and objective test of Theory of Mind for adults with autism. Autism Research, 4, 149–154. 10.1002/aur.174
86. Wilson AC (2021). Do animated triangles reveal a marked difficulty among autistic people with reading minds? Autism, 25, 1175–1186. 10.1177/1362361321989152
87. Wong CMT, Day JD, Maxwell SE, & Meara NM (1995). A multitrait-multimethod study of academic and social intelligence in college students. Journal of Educational Psychology, 87, 117–133. 10.1037/0022-0663.87.1.117
88. Woods SA, Ahmed S, Nikolaou I, Costa AC, & Anderson NR (2020). Personnel selection in the digital age: A review of validity and applicant reactions, and future research challenges. European Journal of Work and Organizational Psychology, 29, 64–77. 10.1080/1359432X.2019.1681401
89. Woolley AW, Chabris CF, Pentland A, Hashmi N, & Malone TW (2010). Evidence for a collective intelligence factor in the performance of human groups. Science, 330, 686–688. 10.1126/science.1193147
90. Zaccaro SJ, Green JP, Dubrow S, & Kolze MJ (2018). Leader individual differences, situational parameters, and leadership outcomes: A comprehensive review and integration. The Leadership Quarterly, 29, 2–43. 10.1016/j.leaqua.2017.10.003
