Author manuscript; available in PMC: 2019 Dec 1.
Published in final edited form as: Behav Res Methods. 2018 Dec;50(6):2399–2407. doi: 10.3758/s13428-018-1016-9

Reliability of the sliding scale for collecting affective responses to words

C Imbault 1,*, D Shore 2,1, V Kuperman 1,2
PMCID: PMC6060013  NIHMSID: NIHMS937758  PMID: 29372489

Abstract

Warriner, Shore, Schmidt, Imbault, and Kuperman (2017) have recently proposed a slider task in which participants move a manikin on a computer screen towards or away from a word, and the distance (in pixels) is a measure of the word’s valence. Warriner et al. showed this task to be more valid than the widely used rating task but did not examine the reliability of the new methodology. This paper investigates multiple aspects of the task's reliability. In Experiment 1 (E1.1–1.6), we show that the sliding scale has high split-half reliability (r = 0.868 to 0.931). In Experiments 2 and 3, we show that the slider task elicits consistent repeated responses within a single session (Experiment 2: r = 0.804) and across two sessions separated by one week (Experiment 3: r = 0.754). Overall, the slider task, in addition to having high validity, is highly reliable.

Keywords: Valence, Arousal, Emotion, Reliability


Words evoke affective responses, which can be indexed through subjective ratings of valence. Norms for these responses are available in English (Bradley & Lang, 1999; Warriner, Kuperman, & Brysbaert, 2013), French (Bonin, Aubert, Malardier, & Niedenthal, 2016; Monnier & Syssau, 2014), Spanish (Hinojosa et al., 2015; Redondo, Fraga, Padrón, & Comesaña, 2007; Stadthagen-Gonzalez, Imbault, Pérez Sánchez, & Brysbaert, 2016), Dutch (Moors et al., 2013) and other languages. All of these studies, dating back to 1999 (Bradley & Lang, 1999), have one method in common; they present words in isolation and instruct participants to evaluate their valence (from negative to positive) and arousal (from calm to excited) on a rating scale from 1 to 9 (or in some cases, 1 to 7). A number of criticisms of this methodology led us to develop a new methodology (Warriner et al., 2017), summarized below. The present paper evaluates the reliability of this sliding scale methodology.

Criticisms of the popular rating scale fall into two main categories (Warriner et al., 2017; Westbury, Keith, Briesemeister, Hofmann, & Jacobs, 2015). The first class of criticism concerns the nature of the data collected. A rating scale with 9 (or 7) individual points produces an ordinal measure, whereas an interval measure is preferable for most statistical tests. In fact, most papers treat this ordinal scale as an interval scale for statistical analyses, which can lead to violations of assumptions. Related to this general criticism, the typical rating scale does not allow a fine-grained output: observers can only provide integer responses, which can fail to capture the subtle effects of emotion evoked by a word, or individual variability in affective behaviour.

The second class of criticism concerns the necessity, within a typical rating study, to anchor the affective evaluation. Depending on the words chosen for the anchors, observers can be biased to overemphasize some words and diminish the impact of other words (Westbury et al., 2015). Typically, the words chosen (e.g., “pleasant” and “unpleasant”) are too mild for words that raters encounter in these studies, such as “rapist”. Inappropriate anchoring can lead to drift across the study once extreme words are encountered. Ideally, the scale should allow observers to maximize the range of values used without reliance on the specific words presented.

To counter these concerns, we developed the slider task (Warriner et al., 2017). A humanoid manikin is placed in the centre of a vertical line, with a word at the top or the bottom of the line. Participants, who are instructed that the manikin represents themselves, can slide the manikin as close to or as far away from the word as they prefer. The distance from the word represents the participant’s affective response to the word; a greater distance indicates lower valence (more negative) and a smaller distance indicates higher valence (more positive). Distance is a continuous interval-scale variable, which is in practice discretized into a number of pixels (see Footnote 1) on a computer screen, just as the continuous variable of time is discretized into fractions of seconds by chronometric instruments. Warriner et al. (2017) found that distance was negatively correlated with word valence in a variety of populations (undergraduates in the lab: r = −0.62; adults online: r = −0.58). Additionally, Warriner et al. (2017) found that the slider task is sensitive to individual differences: those who are shyer tend to position the manikin farther from all stimuli than those who are less shy (by 56 pixels, or 10% of the scale). The slider is also sensitive to gender differences: females position the manikin closer to words that are rated as more positive by female than by male raters, and the same is true for males. The range of 600 pixels that we utilize in the slider task also allows researchers to capture subtle individual differences that are lost on a smaller 9-point ordinal scale (e.g., the subtle tendency of relatively sociable individuals to keep a shorter distance to all words; Warriner et al., 2017). These findings are in line with an earlier proposal by Albaum, Best, and Hawkins (1981) to move from a discrete rating scale to a continuous slider. They reported that a discrete rating scale and a continuous scale produced similar aggregate data, but that the continuous scale allowed for greater discrimination at the individual level.

Finally, this task has no anchors and removes the mention of valence from task instructions. By tapping into implicit approach–avoidance tendencies, the task avoids explicit linkages to artificial valence terminology (e.g., “pleasant” and “unpleasant”). The lack of traditional semantic labels alluding to valence may alter the psychological construct measured by the slider task, and may cause participants to tap into approach-avoidance behaviour instead of producing affective responses (see the General Discussion below and the Future Directions in Warriner et al., 2017). An argument can also be made that the lack of anchors may cause participants to be confused and not perform the task as intended. Although possible, it is unlikely that participants are confused or engage in a different behavioral pattern. Warriner et al. (2017) administered the slider task with and without anchors and found that there was no difference in the functional relationship between valence and distance. In sum, the slider task provides a new method of collecting affective ratings that is more valid than past methods. The utility of a similar Affective Slider method for measuring the valence and arousal of pictures taken from the International Affective Picture System (IAPS) has been demonstrated by Betella and Verschure (2016).

Given the validity of the task (Warriner et al., 2017), we expect many researchers and clinicians to be interested in the assessments of emotion that come from this task; however, before we can evaluate individual and group differences in emotional responses, we must ask about its reliability. The present paper evaluates the reliability of the slider task through several converging methods: Experiment 1 utilizes previously collected data and applies a split-half analysis (cf. MacLeod et al., 2010); Experiment 2 collects new data in a repeated-measures design within a single data collection session; and Experiment 3 uses a standard test–retest design with two sessions separated by a week. In all cases, the measure of interest was the slope of the best-fitting regression line between distance from the word (cf. Warriner et al., 2017) as the dependent measure and the normed rated valence of the word (cf. Warriner et al., 2013) as the independent variable. To be specific, we assessed to what extent the slope of this line from one sample predicted the slope from a second sample within the same individual. With the split-half analysis, the two samples came from the same session (Experiment 1). With the repeated-measures design, observers rated the same words in two separate blocks of trials in one session (Experiment 2); and, with the test–retest design, the same words were rated on two separate occasions separated by one week (Experiment 3). We performed an additional analysis in Experiments 2 and 3 to remove the effect of valence and instead measure the reliability of distance: we assessed to what extent the distance from the word in one sample predicted the distance from the word in a second sample within the same individual.

Experiment 1: Split Half Analysis

The split-half analysis utilized previously collected data (Warriner et al., 2017) to estimate the reliability of the slider task. We combined the standard split-half analysis with a resampling technique to enhance the robustness of our estimate and provide confidence intervals on our reliability estimate (cf. MacLeod et al., 2010). Each experiment presented the same 250 words drawn from separate quintiles of valence; there were 50 words rated for each of five different valence subranges. To determine split-half reliability, we randomly selected two samples of twenty words from each quintile without replacement, creating two groups of 100 words. For each group and each participant, we estimated a regression slope for distance as predicted by valence: two regression slopes were thus generated for each participant. Reliability was estimated by examining the correlation of these two slopes across participants. This sampling was conducted 10,000 times for each of the five experiments (labeled here Experiments 1.1–1.5); four of these are reported in Warriner et al. (2017), and the fifth uses data from the first session of the current paper’s Experiment 3. The data from Experiment 2 were not included in the split-half analysis because Experiment 2 presented only half of the 250 words used in Experiments 1.1–1.5. These five experiments varied in experimental settings, task instructions, and participant demographics. Experiment 1.6 analyzed the combined data from all five experiments, weighting each experiment by the number of participants.

Method

Participants and Procedure

All experiments used a similar set of instructions and similar numbers of participants (see Table 1 for specifics from each experiment; see Warriner et al., 2017, for detailed methods). Each participant was seated approximately 57 cm in front of a computer monitor with a screen resolution of 1024 × 768 (E1.4 and 1.5 used an online version of the task, so the specifics of the monitor and computer varied across participants). Following a central fixation point, a humanoid manikin was presented at the centre of the computer monitor along a line with a single word at the top or bottom of the screen (see Figure 1). Participants moved the manikin up or down to position it at their preferred location, as close to or as far away from the word as they wished. Experiments 1.1, 1.2, 1.3, and 1.4 all had the same instructions, which read:

[… On a] screen, you will see a word at the top of the screen with a vertical line below it. There will be a person in the centre of that line. The person represents you. Your job is to assess how close you would like to be to the word and communicate that by clicking a point on the line to position the person (you). For example, if the word was DISASTER, you'd probably want to be far away and would click somewhere on the line far away from the word. But if the word was TRIUMPH you might want to be close and would place the manikin somewhere on the line really close to the word. […]

Table 1.

A summary of participants, procedure, and critical results (correlation between valence and distance, and split-half reliability) of Experiments 1.1–1.6. For each experiment, the split-half results give the mean split-half correlation with its 95% confidence interval, and the Spearman-Brown correction gives the corresponding corrected reliability.

Experiment 1.1. Participants: 43 participants (35 female), ranging from 17 to 25 years of age (M = 19.07, SD = 1.98). Procedure: Participants at McMaster University took part in the study for partial course credit. No more than 10 participants at a time completed the study in a computer lab on campus. These data are taken from Warriner et al.’s (2017) Experiment 1. Split-half results: ρ = 0.873, 95% CI [0.800, 0.926]. Spearman-Brown correction: ρ = 0.932, 95% CI [0.889, 0.962].

Experiment 1.2. Participants: 30 participants (all female), ranging in age from 18 to 21 (M = 19, SD = 1.02). Procedure: Participants at McMaster University took part in the study for partial course credit. No more than 10 participants at a time completed the study in a computer lab on campus. Prior to the start of the experiment, participants were given 4 personality questionnaires to fill out. These data are taken from Warriner et al.’s (2017) Experiment 2. Split-half results: ρ = 0.868, 95% CI [0.786, 0.929]. Spearman-Brown correction: ρ = 0.929, 95% CI [0.880, 0.963].

Experiment 1.3. Participants: 36 participants (33 female), ranging in age from 18 to 27 (M = 18.69, SD = 1.59). Procedure: Participants at McMaster University took part in the study for partial course credit. No more than 10 participants at a time completed the study in a computer lab on campus. These data are taken from the first session of the present Experiment 3. Split-half results: ρ = 0.883, 95% CI [0.812, 0.934]. Spearman-Brown correction: ρ = 0.938, 95% CI [0.896, 0.966].

Experiment 1.4. Participants: 32 participants (13 female), ranging in age from 19 to 55 (M = 34.06, SD = 8.68). Procedure: An ad was posted on Amazon Mechanical Turk for participants to complete the study from their home for monetary compensation. All participants were based in the USA. These data are taken from Warriner et al.’s (2017) Experiment 3. Split-half results: ρ = 0.931, 95% CI [0.886, 0.963]. Spearman-Brown correction: ρ = 0.964, 95% CI [0.939, 0.981].

Experiment 1.5. Participants: 36 participants (13 female), ranging in age from 21 to 60 (M = 34.42, SD = 8.95). Procedure: An ad was posted on Amazon Mechanical Turk for participants to complete the study from their home for monetary compensation. All participants were based in the USA. Anchors were removed from the instructions given to participants. These data are taken from Warriner et al.’s (2017) Experiment 4. Split-half results: ρ = 0.898, 95% CI [0.831, 0.945]. Spearman-Brown correction: ρ = 0.946, 95% CI [0.908, 0.972].

Experiment 1.6. Participants: 177 participants (124 female), ranging in age from 17 to 60 (M = 24.81, SD = 9.31). Procedure: The data from Experiments 1.1 through 1.5 were compiled into one data set. Split-half results: ρ = 0.891, 95% CI [0.859, 0.917]. Spearman-Brown correction: ρ = 0.942, 95% CI [0.924, 0.957].
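The Spearman-Brown values in Table 1 follow from applying the standard two-half prophecy formula to the split-half correlations; as a worked check using the Experiment 1.1 value from the table (this check is our own arithmetic, not reported in the original analysis):

$$ r_{\mathrm{SB}} \;=\; \frac{2\,r_{\mathrm{half}}}{1 + r_{\mathrm{half}}} \;=\; \frac{2 \times 0.873}{1 + 0.873} \;\approx\; 0.932. $$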
Figure 1. The slider scale: a humanoid manikin and word for each trial.

Experiment 1.5 had a slightly modified set of instructions where anchor words like disaster or triumph were not mentioned to the participant. Those instructions read:

[…] On each of the following screens, you will first see a plus sign in the centre. That's to center the mouse for the next screen. Click on the plus and you will see a word either at the top or the bottom with a vertical line below or above it. There will be a person in the centre of that line. The person represents you. You can move "yourself" closer to or further away from the word. Position yourself where you prefer to be. […]

Stimuli

The word set for each experiment was the same. It consisted of 250 monosyllabic words chosen from a set of 13,763 words that had previously been rated for valence and arousal (Warriner et al., 2013). The words were divided into 25 bins (5 quintiles of valence × 5 quintiles of arousal), with 10 words randomly drawn from each bin. Thus, there were 50 words at each of 5 levels of valence, which varied in their arousal levels. This ensured that valence and arousal were not correlated (r = −0.019), and thus in what follows we measure only effects of valence, not arousal. The mean word length was 4.4 characters, and the mean natural-log SUBTLEX frequency was 6.3 (Brysbaert & New, 2009). We used natural-log frequency because raw word frequencies are heavily right-skewed; the log transformation makes the distribution closer to normal and easier to interpret.
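A minimal sketch of this binning scheme in R, assuming the norms are available as a data frame named norms with columns word, valence, and arousal (these names are placeholders; the published norms file uses different headers):

```r
# Sketch of the 5 x 5 quintile binning used to build the 250-word stimulus list.
set.seed(1)
quintile <- function(x) cut(x, quantile(x, probs = seq(0, 1, 0.2)),
                            include.lowest = TRUE, labels = FALSE)
norms$val_bin <- quintile(norms$valence)
norms$aro_bin <- quintile(norms$arousal)
# Draw 10 words at random from each of the 25 valence-by-arousal bins (250 words total).
stimuli <- do.call(rbind, lapply(split(norms, list(norms$val_bin, norms$aro_bin)),
                                 function(bin) bin[sample(nrow(bin), 10), ]))
# Because arousal is balanced within each valence quintile, valence and arousal
# should be nearly uncorrelated in the resulting list.
cor(stimuli$valence, stimuli$arousal)
```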

Results and discussion

The primary dependent variable for all experiments was the distance, measured in pixels, between the presented word and the position of the manikin when the participant pressed the Submit button (range = 1, closest to the word, to 600, farthest from the word). Participants were able to move the manikin as many times as they wanted before clicking Submit; only the final location of the manikin was used in our data analysis. The independent variable of interest for all experiments presented here was the word’s valence rating (Warriner et al., 2013), which varied between 1 (very unhappy) and 9 (very happy); each word was rated by at least 20 raters.

To assess the split-half reliability of each experiment, we separated the 250 words in the stimulus list into 5 valence bins, broken down by quintiles. We randomly selected 20 words (from approximately 50 words), without replacement, from each bin twice, to construct two separate groups of 100 words (20 words × 5 bins). In each group, we computed, for each participant, the slope (i.e., the beta coefficient) of the regression line between valence and distance. This resulted in 2 slope coefficients for each participant. We then computed the correlation between these two slope estimates across all participants. We repeated this process 10,000 times, which resulted in 10,000 correlations. The mean of these 10,000 correlations provided our estimate of the split-half reliability; the distribution of correlations allowed us to estimate confidence intervals. We used R version 3.0.1 (R Core Development Team, 2013) to perform our statistical analyses in this and subsequent experiments.
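A minimal sketch of one resampling iteration, assuming a long-format data frame trials with columns subject, word, valence, distance, and val_bin (the valence quintile); all object and column names here are placeholders rather than the original analysis code:

```r
# One iteration of the split-half resampling (a sketch).
split_half_once <- function(trials) {
  words <- unique(trials[, c("word", "val_bin")])
  # Two non-overlapping random samples of 20 words from each valence quintile.
  picks <- lapply(split(words$word, words$val_bin), sample, size = 40)
  half1 <- unlist(lapply(picks, `[`, 1:20))
  half2 <- unlist(lapply(picks, `[`, 21:40))
  slope <- function(d) coef(lm(distance ~ valence, data = d))[["valence"]]
  sub1 <- trials[trials$word %in% half1, ]
  sub2 <- trials[trials$word %in% half2, ]
  # One regression slope per participant in each half, then correlate across participants.
  s1 <- sapply(split(sub1, sub1$subject), slope)
  s2 <- sapply(split(sub2, sub2$subject), slope)
  cor(s1, s2)
}
# Repeat 10,000 times; the mean is the reliability estimate, and the 2.5th and
# 97.5th percentiles give the confidence interval.
# reps <- replicate(10000, split_half_once(trials))
# c(mean = mean(reps), quantile(reps, c(0.025, 0.975)))
```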

The split-half reliability for each of Experiments 1.1–1.5 ranged from 0.868 to 0.931 (see Table 1 and Figure 2), which can be considered a very high level of reliability. Experiment 1.6 combines the data from all 5 experiments and weights each experiment by the number of participants. This analysis produced a reliability of 0.891, 95% CI [0.859, 0.917]. Thus, this new method of collecting affective ratings towards words provides a reliable measure of valence. This finding held across multiple experiments with different participant pools, recruitment methods, and experimental settings.

Figure 2. Estimated split-half reliability of Experiments 1.1–1.5 with 95% confidence intervals mapped on the y-axis. The grey shaded areas depict the 95% confidence interval of the weighted mean of the reliability estimates.

Experiment 2: Within session repeated measures reliability

Although the split-half analysis provides a good estimate of reliability, there are some challenges in the present case. First, different words are used across the two samples for each participant in each simulated set. Second, the participant is in the same experimental session and thus the same mental state—if we want to conclude that this task provides a stable measure of valence we need to test participants multiple times. Specifically, we need to assess the ability of the task to elicit similar responses to multiple presentations of the same stimuli. The present experiment used a test–retest paradigm where the same words were presented twice within a one-hour experimental session. Both blocks, which were completed sequentially without any substantial break, contained identical stimuli presented in pseudorandom order. In order to test the reliability of distance as a metric of valence, we evaluated the correlation between two slopes (one from each block) of the best-fitting regression line where valence predicts distance from the word (Analysis 1). Unlike Experiments 1.1–1.5, we used all words in the two blocks to compute the regression slope estimates. In order to test the reliability of distance to a word as a behavioral outcome (regardless of valence), we evaluated the correlation between responses to any given word produced by each participant in block 1 vs block 2 (Analysis 2).

Method

Participants

Sixty-four undergraduate students at McMaster University in Hamilton, ON, participated in this experiment for partial course credit. The data from twelve participants were removed: seven participants did not make a response on more than 25% of the trials (38 trials), and an additional five participants were not native speakers of English. The remaining 52 participants (43 women, 48 right-handed) ranged in age from 17 to 22 (M = 18.75, SD = 1.17).

Affective stimuli

We randomly selected half of the words (125 words) from Warriner et al.’s (2017) Experiment 1 (for the word selection criteria, see Experiment 1 above). All participants in this experiment saw the same 125 words.

Procedure

Participants were tested in groups of ten or fewer in a computer lab. Each participant was seated in front of a monitor with a screen resolution of 1024 × 768. After completing a set of demographic questions (including age, sex, handedness, and education), participants were instructed to complete the slider task. The task began with a fixation cross centred on the screen; each trial started by clicking on the cross. The fixation cross was then replaced with a humanoid manikin centred on a vertical line in the centre of the computer screen; the word for that trial was randomly presented at either the top or the bottom of the vertical line, and participants were instructed to slide the manikin (or click a destination on the line) to position it as close to or as far from the word as they wanted. The instructions were the same as in Experiments 1.1–1.4. After the participants moved the manikin to its final position, they clicked a Submit button located to the right of the slider (see Figure 1). The experiment was programmed using Experiment Builder software (SR Research, Kanata, ON, Canada).

Participants completed five practice trials and were then asked if they had any questions before the experiment proceeded. Each participant saw each of the 125 words twice, once in the first half of the experiment (Block 1) and once in the second half (Block 2). The word order was randomized separately in each block. Participants were not made aware of the two separate blocks and were not told the test–retest purpose of the study. Because of a programming error, some participants saw an additional five words for a third time at the end of the experiment. These additional responses were removed from data analysis.

Results and Discussion

Analysis 1

The variables used in this analysis were the same as those used in Experiment 1 with the manikin’s distance from the word as the dependent variable and word valence as the independent variable.

To compare individual performance between experimental blocks, we first estimated the effect of valence on the distance of the manikin from the word for each participant and for each half of the experiment. Specifically, for each participant we fitted two ordinary linear regression models (one for the 125 words in each half) estimating the effect of valence on the manikin’s distance from the word. We operationalized the effect as the slope (the beta coefficient) of the regression line. Since the words presented in both blocks were identical (though ordered differently), reliable performance would produce similar slopes for an individual throughout the experiment. This step generated two sets of 52 slope coefficients (corresponding to 52 participants and the two blocks). We then calculated the correlation between the two sets of regression coefficients as a measure of the test–retest reliability within one experimental session.
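A minimal sketch of this step, assuming a long-format data frame trials2 with columns subject, block (1 or 2), valence, and distance (these names are placeholders, not the original analysis code):

```r
# Per-participant regression slopes of distance on valence, one per block (a sketch).
slopes_by_subject <- function(d) {
  sapply(split(d, d$subject),
         function(s) coef(lm(distance ~ valence, data = s))[["valence"]])
}
slopes_b1 <- slopes_by_subject(trials2[trials2$block == 1, ])
slopes_b2 <- slopes_by_subject(trials2[trials2$block == 2, ])
# Within-session test-retest reliability: correlation of the two slope sets across participants.
cor.test(slopes_b1, slopes_b2)
```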

The distance of the manikin from the word was negatively correlated with valence of the word in both blocks (first block: Pearson’s r = −0.572, df = 6498, 95% CI [−0.587,−0.555], p < .001; second block: Pearson’s r = −0.570, df = 6498, 95% CI [−0.586,−0.553], p < .001) as well as the whole experiment together (Pearson’s r = −0.570, df = 12998, 95% CI [−0.582,−0.559], p < .001). As observed in our previous work (Experiment 1; Warriner et al., 2017), participants moved the manikin closer to positive words, and moved it further away from negative words.

The correlation between two sets of slope coefficients was high (r = 0.804, df = 50, 95% CI [0.680, 0.883], p < .001). This demonstrates that valence affects participants’ performance in the slider task similarly in the first and second half of the experiment, and also elicits similar responses to the same stimuli presented a second time. The slider task has a high test-retest reliability within one experimental session.

Analysis 2

The variables used in this analysis are the manikin’s distance from the word as the dependent variable and block number as the independent variable.

We compared the distance from the word within participants across the two blocks. Specifically, for each word, we generated two lists of distance values, one for Block 1 and one for Block 2: for example, the distance to the word “chore” produced by Participant 1 in Block 1 would be the first element in List 1, and the distance to the same word produced by the same participant in Block 2 would be the first element in List 2. Thus, each list contained 52 elements (one per participant), and a pair of lists was constructed for each word. We correlated the two lists for each word, taking the correlation coefficient as a measure of how reliably that word was rated within participants across blocks. The second step was to evaluate the central tendency and dispersion of correlation strength across words. The mean correlation, or the reliability of the distance measurement, within a single session was moderately high (rmean = 0.603, SD = 0.155, CI [0.297, 0.867]). This demonstrates that participants responded with similar distances in the first and second block within a single session.
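A minimal sketch of this per-word analysis, using the same placeholder data frame trials2 as above:

```r
# For each word, correlate participants' block-1 and block-2 distances (a sketch).
word_rel <- sapply(split(trials2, trials2$word), function(d) {
  b1 <- d[d$block == 1, ]
  b2 <- d[d$block == 2, ]
  b2 <- b2[match(b1$subject, b2$subject), ]  # align rows by participant
  cor(b1$distance, b2$distance)
})
# Central tendency and dispersion of the per-word correlations.
c(mean = mean(word_rel), sd = sd(word_rel))
range(word_rel)
```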

Experiment 3: Across Session Test Retest Reliability

In Experiment 2, we showed that participants’ performance was replicable within one test–retest session. Another aspect of reliability is whether the task elicits similar responses to the same stimuli over time. We presented the same slider task to participants twice, with experimental sessions separated by a week. The task was expected to produce similar results over the two sessions.

Method

Participants

Forty-five undergraduate students at McMaster University participated in this study for partial course credit. None of the participants had taken part in similar studies prior to completing this study. Forty-two of the participants attended the second (retest) session of the experiment; the three who did not return were excluded from data analysis. Three additional participants were removed from analysis because they did not respond on more than 25% of trials (62 trials). An additional three participants were removed from analysis because they were not native speakers of English. The remaining 36 native English speakers (33 female) ranged in age from 18 to 27 (M = 18.69, SD = 1.58).

Affective Stimuli

The stimuli used in this experiment were the same 250 words from Warriner et al. (2017), as described in Experiment 1.

Procedure

The procedure for this experiment was nearly identical to that of Experiment 2. Participants signed up to take part in a two-part study. The second session took place a week after the first session. The instructions given to participants were the same as those given in Experiment 2. Participants were first shown 5 practice words and were then presented with 250 experimental trials. The 250 words were presented in one randomly ordered block. Participants were given the same set of instructions and the same words the second time they came in for the study, though the order of words in each session was randomized differently. They were not told that the words presented would be the same as in the first session, and they were not informed of the purpose of the study. Because of a programming error, some participants saw some words for a third or fourth time at the end of the experiment. These additional responses were removed from data analysis.

Results and Discussion

Analysis 1

The variables used in this analysis were the same as those used in Experiment 1 and Analysis 1 of Experiment 2, with the manikin’s distance from the word as the dependent variable and word valence as the independent variable.

To measure the reliability of individual performance in the slider task over time, we first estimated the effect of valence on the distance of the manikin from the word for each participant and for each half of the experiment, using ordinary linear regression models (see Experiment 2). This step generated two sets of 36 slope coefficients (corresponding to 36 participants and the two blocks). We then calculated the correlation between the two sets of regression coefficients as a measure of the test-retest reliability within two experimental sessions separated by one week.

The distance of the manikin was negatively correlated with the valence of the word for both sessions (First session: Pearson’s r = −0.559, df = 9178, 95% CI [−0.573,−0.544], p < .001; second session: Pearson’s r = −0.532, df = 9178, 95% CI [−0.546,−0.517], p < .001) and the two sessions together (Pearson’s r = −0.545, df = 18358, 95% CI [−0.555,−0.535], p < .001). Participants moved the manikin closer to positive words, and moved it further away from negative words.

We calculated the effect of valence on distance for each participant and for each experimental session. A correlation between the two sets of slope coefficients was strong and significant (Pearson’s r = 0.754, df = 34, 95% CI [0.566, 0.868], p < .001). This demonstrates that valence of the same set of stimuli affects participants’ performance in the slider task similarly in the first and second experimental sessions. The slider task has high test-retest reliability over time.

Analysis 2

The variables and statistical analysis are identical to those of Analysis 2 of Experiment 2, with the manikin’s distance from the word as the dependent variable and block number as the independent variable. The only difference is that Block 1 and Block 2 of Experiment 2 are replaced here by Session 1 and Session 2.

The mean correlation, or the reliability of the distance measurement, across two experimental sessions was moderate (rmean = 0.436, SD = 0.201, CI [0.047, 0.805]). This demonstrates that the participants reliably responded with a similar distance in the first and second experimental session.

The reliability in Experiment 3 is, unsurprisingly, lower than in Experiment 2 (r = 0.754 and 0.804, respectively). This likely reflects the difference in the length of time between the retest administrations: Experiment 2 had no delay between blocks, whereas Experiment 3 had a week between sessions. The same decrease is found in the reliability of the distance measurement, where a separation of a week (rmean = 0.436) produced more variability in responses than the same-day session (rmean = 0.603).

General Discussion

The slider task provides a reliable, valid and easily reproducible way to capture human affective judgements of words presented in isolation. We hypothesize that this conclusion can be extended to include other types of stimuli, such as pictures, sounds or linguistic phrases. As argued in Warriner et al. (2017), the slider task has addressed numerous criticisms relating to the validity of rating-scale affective judgements. The slider task removes any emotional anchors from the instructions that may potentially affect participants’ ratings (see Westbury et al., 2015). Warriner et al. (2017) demonstrated that the functional relationship between valence and distance was not affected by the presence or absence of anchors. The present study additionally demonstrates that there is no difference in the reliability of the task (see Experiments 1.1–1.4 with anchors, and Experiment 1.5 without anchors). Critically, the slider task also allows for a more fine-grained measure of affective judgements, and robust detection of subtle individual differences in response patterns on an interval scale. A question that was left unanswered in Warriner et al. (2017) is whether the slider task and its measurement of individual differences is reliable.

This paper confirms that the slider task performed well on multiple psychometric measures of reliability. In Experiment 1, we demonstrated that the slider task has high split-half reliability across multiple studies: participants responded to a word’s valence with a high degree of similarity across randomly divided halves of experiments. In Experiment 2, we further confirmed that the slider task is reliable across one hour-long session in which a set of words was presented twice: participants responded similarly to a word’s valence in the first and second half of the session. In Experiment 3, we showed that the slider task is reliable over time: participants were affected by valence similarly in two sessions, separated by a week, that presented the same words. We also observed in Analyses 2 of Experiments 2 and 3 that the reliability of producing a distance value is consistently lower than the reliability of that distance value as an index of valence. In other words, over two sessions participants were less consistent in how far they approached or withdrew from a given word, but in both sessions those responses were driven by the word’s valence to a greater degree and with greater consistency. This suggests that absolute values of the distance to words varying in valence are less reliable (and theoretically less informative) than values of distance relative to the word’s valence.

We believe these reliability results, along with the findings from Warriner et al. (2017), show that a sliding scale is a useful tool for collecting affective responses to words, in that it is both valid and reliable. The slider scale shows added utility over a typical discrete rating scale. Both scales produce similar aggregate data, but the continuous slider scale allows for more detailed responses that are crucial when studying individual differences. As well, unlike a rating scale, a sliding scale does not rely on anchor terms that are related to valence, a problem discussed in Westbury, Keith, Briesemeister, Hofmann, and Jacobs (2015). Additionally, affective ratings are often treated in analyses as interval data, even though rating scales collect ordinal data. The data obtained from the sliding scale are interval.

One possible limitation of the slider task is discussed at length in Warriner et al. (2017): the slider task may be tapping into valence or a related but different psychological construct such as approach-avoidance behaviour, or both. Valence is known to be strongly linked to motivational systems which drive appetitive and aversive responses, to a degree that disentangling these constructs presents a considerable difficulty (see among others, Carver & White, 1994; Lang, 1995). The present method does not enable us to partial out an independent contribution of motivational systems to individual affective responses. While the slider scale is demonstrably a reliable tool for quantifying valence, the question of its link to the approach-avoidance behavior will need to be studied further.

Moreover, it is important to note that we examined only one of the two dimensions of affect (Russell, 1980), namely valence and not arousal. Valence measures a continuum from negative to positive, whereas arousal measures a continuum from calm to excited (Russell, 1980). The wording of our sliding scale taps only into the measurement of valence, whereas other slider scales have been able to capture arousal as well (Betella & Verschure, 2016).

In sum, multiple methods confirm that the slider task is very reliable within participants. Additionally, our split-half analysis of a combined dataset (see Exp. 1.6) showed the task’s reliability between participants. We conclude that the task has the high degree of reliability that is essential not only for the overall utility of the task for measuring affective responses, but also for its ability to detect subtle patterns of individual variability.

Acknowledgments

This work was supported by the Patterson-Wilson Ontario Graduate Scholarship (OGS) to Constance Imbault. Victor Kuperman’s contribution was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 402395–2012, the NIH R01 HD 073288 (PI Julie A. Van Dyke), the SSHRC Partnership Training grant 895-2016-1008 (PI Gary Libben), the Early Research Award from the Ontario Ministry of Research and Innovation, and the Canada Research Chair (Tier 2) award. David I. Shore was supported by an NSERC Discovery Grant.

Footnotes

1. We thank our reviewers for bringing double density displays to our attention. In future studies, we will refer to “steps” as our dependent variable of interest, rather than “pixels”, which may be an inaccurate measurement on double density displays.

References

1. Albaum G, Best R, Hawkins DEL. Continuous vs discrete semantic differential rating scales. Psychological Reports. 1981;49:83–86.
2. Betella A, Verschure PFMJ. The affective slider: A digital self-assessment scale for the measurement of human emotions. PLoS ONE. 2016;11(2):1–11. doi: 10.1371/journal.pone.0148037.
3. Bonin P, Aubert L, Malardier N, Niedenthal PM. Subjective et de valence émotionnelle pour 866 mots. 2016;103:655–694.
4. Bradley MM, Lang PJ. Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report C-1. Gainesville, FL: The Center for Research in Psychophysiology, University of Florida; 1999.
5. Carver CS, White TL. Behavioral inhibition, behavioral activation and affective responses to impending reward and punishment. Journal of Personality and Social Psychology. 1994;67:319–333.
6. Hinojosa JA, Martínez-García N, Villalba-García C, Fernández-Folgueiras U, Sánchez-Carmona A, Pozo MA, Montoro PR. Affective norms of 875 Spanish words for five discrete emotional categories and two emotional dimensions. Behavior Research Methods. 2015. doi: 10.3758/s13428-015-0572-5.
7. Lang PJ. The emotion probe: Studies of motivation and attention. American Psychologist. 1995;50(5):372–385. doi: 10.1037/0003-066X.50.5.372.
8. MacLeod JW, Lawrence MA, McConnell MM, Eskes G, Klein RM, Shore DI. Appraising the ANT: Psychometric and theoretical considerations of the Attention Network Test. Neuropsychology. 2010;24(5):637–651. doi: 10.1037/a0019803.
9. Monnier C, Syssau A. Affective norms for French words (FAN). Behavior Research Methods. 2014;46(4):1128–1137. doi: 10.3758/s13428-013-0431-1.
10. Moors A, De Houwer J, Hermans D, Wanmaker S, van Schie K, Van Harmelen A-L, Brysbaert M. Norms of valence, arousal, dominance, and age of acquisition for 4,300 Dutch words. Behavior Research Methods. 2013;45(1):169–177. doi: 10.3758/s13428-012-0243-8.
11. Redondo J, Fraga I, Padrón I, Comesaña M. The Spanish adaptation of ANEW (Affective Norms for English Words). Behavior Research Methods. 2007;39(3):600–605. doi: 10.3758/BF03193031.
12. Russell JA. A circumplex model of affect. Journal of Personality and Social Psychology. 1980;39(6):1161–1178.
13. Stadthagen-Gonzalez H, Imbault C, Pérez Sánchez MA, Brysbaert M. Norms of valence and arousal for 14,031 Spanish words. Behavior Research Methods. 2016:1–45. doi: 10.3758/s13428-015-0700-2.
14. Warriner AB, Kuperman V, Brysbaert M. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods. 2013;45(4):1191–1207. doi: 10.3758/s13428-012-0314-x.
15. Warriner AB, Shore DI, Schmidt LA, Imbault CL, Kuperman V. Sliding into happiness: A new tool for measuring affective responses to words. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale. 2017;71(1):71–88. doi: 10.1037/cep0000112.
16. Westbury C, Keith J, Briesemeister BB, Hofmann MJ, Jacobs AM. Avoid violence, rioting, and outrage; approach celebration, delight, and strength: Using large text corpora to compute valence, arousal, and the basic emotions. The Quarterly Journal of Experimental Psychology. 2015:1–24. doi: 10.1080/17470218.2014.970204.
