Significance
The dominant way in which human behavior, feelings, and attitudes are measured is by self-report on Likert-type scales. However, these methods are plagued by the confounding effects of response bias. Indeed, high-profile journals, including PNAS and Science, recently published official guidelines requiring transparent and unambiguous reporting of survey data to improve the trustworthiness of self-report methods. Here, we present a complementary approach to address this pervasive problem: an easy-to-use model that produces less biased scores of the variable(s) of interest. Given the ubiquitous use of Likert scale surveys in politics, psychology, health, and economics, our model’s demonstrated ability to improve the precision with which we measure latent states is a necessary step in reducing the significant consequences of contaminated data.
Keywords: survey responding, decision-making, response bias, Likert scale, response styles
Abstract
Self-reports are used ubiquitously to probe people’s thoughts, feelings, and behaviors and inform medical decisions, enterprise operations, and government policy and legislation. Despite their pervasive use, self-report measures such as Likert scales have a profound problem: Standard analytic approaches do not control for the confounding effects of idiosyncratic response biases. Here, we present a model-based solution to this problem. Our model disentangles response bias from latent constructs of interest to obtain less biased scores of the latent states of respondents. Inspired by Thurstonian approaches in the psychophysics literature, the model requires nothing further than standard Likert scale design assumptions. The model uses a data-driven approach to control for response biases, without the need to prespecify bias types or response strategies. We demonstrate the model’s ability to uncover more precise estimates of latent state associations, outperforming bias-affected standard scoring techniques, and garner insights into previously undetected codependencies between certain latent states and particular forms of response bias. The model is thus a tool which outperforms standard scoring methods and generates insights into, and controls for, the potentially confounding effects of response bias on self-report Likert scale data.
To predict human behavior, it is important to understand people’s values, attitudes, perceptions, and feelings, and one of the most common ways to measure these psychological factors is with survey questionnaires. Surveys are used ubiquitously to measure a wide range of thoughts and behaviors (1) and the results can profoundly impact individuals and societies alike. Psychologists and healthcare professionals use survey responses as screening and diagnostic tools (2–6) and to inform care plans (7–11), while businesses and government agencies use them to assess the efficacy of an organization, program, or product (12–15) and/or employee attitudes (16, 17). At the broadest level, attitudes measured about societal issues, such as climate change (18, 19) or disease-prevention strategies (20, 21), are used to implement or shape public policy (22) and legislation (23), affecting entire populations.
The most common approach to surveying attitudes and behaviors is with a Likert scale. Likert scale surveys present a person with several questions and/or statements (called items) and a fixed number of labeled options which they must choose from in response, as shown in Fig. 1A. Likert scale surveys vary in the number of items a person is presented with, the number and wording of response options and, importantly, the construct(s) they aim to measure (e.g., political attitudes, satisfaction, mental state, and so on). Given their breadth of applications, for simplicity, throughout we will use the abstract term latent state to encompass the wide variety of attitudes, feelings, behaviors, and perceptions that Likert scale surveys may aim to measure. Because they are so simple, Likert scale surveys are cheap to source and construct, quick to administer, and easy for participants to understand. Together, this means Likert scale surveys are used to assess latent states on a large scale (21) and reach populations usually beyond the scope of in-lab research methods (24).
Fig. 1.
A visual representation of the model. (A) Example Likert scale item and response options. (B) Standard scoring methods (Left) assume the latent state directly generates Likert scale responses. Our model-based method (Right) considers Likert scale responses as indicators of the latent state filtered through a decision process. (C and D) schematically illustrate the model’s structure: A respondent’s latent state is represented by a distribution which can take different positions on a latent continuum, and their decision process discretizes the continuum with response thresholds. (C) Different response patterns can result from differing threshold placement, even when the true distributions are identical. (D) The same response patterns can result from differing threshold placement, even when the true distributions are markedly different.
While Likert scale surveys have many advantages, there remains controversy about their ability to accurately measure latent states. For instance, many aspects of a scale’s construction can bias the way people respond, such as the inclusion of negatively worded items (25–28), the number of response options (29–33), their physical presentation (34, 35), and their labeling (36, 37). Many recommendations also exist for how best to mitigate the impact of these scale-induced biases (26, 29, 35, 37), though these usually involve making adjustments to the survey design or recommendations for transparency and clarity in reporting (38, 39). While these mitigation strategies address some problems, others remain largely unaddressed, namely: How do people make decisions about Likert scale items (36)?
To motivate the problem further, consider Liz and Jenna, two people who regularly use a ride-share application to travel around the city together, each providing a star rating at the completion of every trip. Over their last ten trips together, Liz gave four 1-star reviews when the drivers used their horns too often and gave the other six drivers 5 stars; her response bias reflects the “extreme” style illustrated in Fig. 1C. Jenna, however, left eight 3-star reviews, one 2-star review, and one 5-star review for a driver who offered bottled water and breath mints; her response bias is better represented by the midpoint style shown in Fig. 1C. Since they shared these trips, it is reasonable to assume their experiences on each trip were comparable; their latent states should be similar. However, due to their different response biases, their reviews of each ride are often quite different. When the ride-share company compares Liz and Jenna’s star ratings for each driver (analogous to scores on different latent state scales), it would undoubtedly interpret these responses to mean different things: one passenger had either a great or terrible experience with this driver (Liz), the other only a satisfactory one (Jenna), much the way researchers often assume different scores on a survey reflect different underlying attitudes.
Conversely, Fig. 1D shows how two people with very different underlying latent states may obtain similar scores because of their response biases. Imagine Christopher and John, two middle-aged men who respond “Strongly Agree” to a statement which asks whether they feel lonely. John lives alone on an isolated farm in the Swiss mountains and has a response bias which is not easy to describe in words, while Christopher is a sociable American man with a right-biased response style. In this instance, the high score John reports reflects his high latent state (i.e., he feels lonely), but Chris’ response bias produces the same score on the Likert scale, despite his much lower latent state levels (i.e., he does not feel particularly lonely, though he tends to agree with statements). Given his bias toward agreement, Chris’ responses would often fall in the right-hand response bin, resulting in high scores on most Likert scales, regardless of the latent state they aim to assess. Current approaches to analyzing data of this type cannot always distinguish between responses which result from truly high latent states such as John’s and biased response styles such as Chris’. The individual differences in response styles to Likert scale surveys reflected in Fig. 1 are well documented in the literature (37, 40–46).
These toy examples highlight the problem that Likert scale responses are typically interpreted as representing the latent state without consideration of how people make decisions about the items presented to them, even when the latter may drive, at least partially, response differences across people. As a high-profile example, Infurna et al. (47) recently found that American participants reported noticeably higher loneliness scores on a three-item Likert scale than their European counterparts. The authors proposed several plausible economic, political, and social explanations for these group differences. Nevertheless, it is possible that documented cultural differences in survey response patterns (48–50) may have impacted these findings. Given responses to Likert surveys such as Infurna et al.’s can be used to inform policy and population-level decisions (22, 23), it is essential to control for the confounding effects of response bias which may lead different national, ethnic, or language groups to obtain scores which do not reflect differences in the latent state the scale aims to measure.
1. A Solution: Debiased Survey Responding
We propose a model-based solution that captures individual respondent response biases alongside their degree of endorsement of a latent state. In doing so, we remove the effect of response bias to obtain a cleaner measure of latent states—the target of interest. We start by highlighting the extent of naturally occurring variability in response biases using one of the largest, openly available Likert scale surveys we could access; the Open-Source Psychometrics Project International Personality Item Pool [IPIP; Open Psychometrics Project](51), with over 1 million respondents. We do not intend to comment on the IPIP as a measure of personality, or on personality theory at all. We simply use this dataset because it is openly available and a widely used example of Likert scale responding. The IPIP measures the Big Five personality traits (52)—extraversion, openness, conscientiousness, agreeableness, and neuroticism. Each trait is measured with 10 items and responses given on a 5-point Likert scale with response options labeled “Disagree” (1), “Neutral” (3), and “Agree” (5).
Fig. 2A shows response patterns of five exemplar participants from the IPIP—how often they selected each of the 5 options on the Likert response scale. Four of these participants are each representative of one common response bias observed in Likert scale surveys (e.g., see refs. 40 and 53), while the fifth is an example of a participant who cannot be easily classified. The response biases represent a deviation from the “expected” response pattern, in which participants use each response option equally often except to show that they are high or low in a particular latent state. This is the “expected” response pattern in that it is assumed by standard scoring methods, which sum a numerically coded response (1 to 5) for each item that contributes to each latent state measured in the survey, with items reverse scored as necessary. Since the IPIP includes a similar number of negatively and positively scored questions, visible asymmetry in responses, as seen with left- and right-biased responding, is a deviation from the expected response behavior. Such asymmetric response biases reflect internally inconsistent responding: they involve endorsing contradictory questions, a pattern with little theoretical explanation other than response bias.
Fig. 2.
Prototypical response biases observed in a Likert scale survey of the Big 5 Personality Traits (see text for details). (A) Response frequency (y-axes) for each of the 5 response options (x-axes) for each of 5 participants (panels). There is 1 exemplar participant for each of 4 common response biases and 1 whose response bias is not easily classified. See SI Appendix for details on response bias classification. (B) When subsetting to each response bias, the relationship between total openness and neuroticism scores, two of the Big 5 Personality Traits, differs. (C) A random sample of 1,000 participants shows that participants who adopt different response biases inhabit different regions of the data space when evaluating the relationship between total openness and neuroticism scores.
Fig. 2 B and C demonstrate that different response biases can impact the interpretation of data when the standard scoring methods are applied (see also refs. 40 and 53). Fig. 2B shows correlations between neuroticism and openness for each response bias; we highlight one pairwise comparison for brevity, though note that all pairwise combinations of the 5 latent states showed similar overall patterns. The latent states that compose the personality inventory are theoretically orthogonal by design (54, 55). When participants are separated into groups governed by their response bias, the association between latent states varied from essentially no relationship (midpoint responders) to strong (left-biased responders), with clear differences in the spread, clustering, direction, and location of the response points, indicating qualitative differences in response patterns.
Fig. 2C shows that aggregating response biases in proportion to their base rates in the population attenuates the associations observed in most of the response bias subgroups. This leads to the impression that there is little to no association between the pair of latent states when considered across the sample of participants. Although this matches the goal of the survey instrument, whose latent states are orthogonal by design, the apparent orthogonality we show here may be an artifact of aggregating subgroups that are in fact quite different (Fig. 2B).
Bootstrap resampling from the 1 million participants showed an expected frequency of 20.8% midpoint responders, 8.7% extreme responders, 0.002% left-biased, and 4.9% right-biased responders; combined, over a third of the sample use one of the four characteristic response biases. This likely underestimates the scale of the problem, however, because a participant’s response bias could deviate from the expected response behavior in many more ways that are impossible to know a priori; we return to this point later. Since it is impossible to predict which response biases will be present in a given dataset, or their relative frequency, it is also impossible to predict how these response biases may systematically confound Likert scale response data. Hence, standard scoring methods may bias measurement of the latent state and confound conclusions, which can have a profound influence when these responses inform important policy, legislative, and social decisions.
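To make the base-rate calculation concrete, the sketch below bootstraps response-style proportions from per-participant response frequencies in R. The classification rule shown here is a deliberately simple, hypothetical heuristic, and the object responses (a participants-by-items matrix of codes 1 to 5) is assumed; the classification actually used in this paper is the regression approach described in SI Appendix.

# Hypothetical sketch: bootstrap base rates of simple response-style heuristics.
# `responses` is assumed to be an N x 50 matrix of raw Likert codes (1-5), one row per person.
set.seed(1)

classify_style <- function(freq) {
  # freq: proportion of responses falling in each of the 5 options for one person
  if (freq[3] > 0.5) return("midpoint")
  if (freq[1] + freq[5] > 0.6) return("extreme")
  if (freq[1] + freq[2] > 0.6) return("left")
  if (freq[4] + freq[5] > 0.6) return("right")
  "other"
}

person_freq <- t(apply(responses, 1, function(r) tabulate(r, nbins = 5) / length(r)))
styles <- apply(person_freq, 1, classify_style)

# Bootstrap the expected base rate of each style
boot_rates <- replicate(1000, {
  prop.table(table(factor(sample(styles, replace = TRUE),
                          levels = c("midpoint", "extreme", "left", "right", "other"))))
})
round(rowMeans(boot_rates), 3)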
2. From Response Biases to Decision Styles: A Model
When response patterns are considered a contamination of the data (40), it may seem reasonable to use stigmatizing terms such as “response bias” and to imbue the discussion with negative connotations. We propose instead that different response biases simply reflect a decision-making process that can be quantitatively separated from the target latent state. From this perspective, responding to a survey involves making a series of decisions about how to respond to each item; different people simply adopt different decision-making strategies when approaching the task. Hence, from hereon we refer to different styles of survey responding as “decision styles” rather than “response biases.” Fig. 1B schematically contrasts our proposal with the prevailing literature. The standard approach assumes the latent state directly generates responses (Left). We assume the latent state and a decision-making process interact to generate responses (Right).
Our flexible model of survey responding separates individual differences in decision-making strategies from the latent states measured in a Likert scale, all at the subject level. Our approach is scale-agnostic so it can be applied to Likert scale data after it is collected, regardless of the latent state being measured, the number of response options, and/or their labels. The model involves data-driven determination of decision style so it does not require specification or a priori identification of decision styles. Instead, it combines and contrasts information from multiple latent states to deconfound decision styles from latent states.
Our model adopts an ordinal-probit approach akin to those used in psychophysics (56) and psychometrics (57–59) research and applies it to the preferential judgments elicited in rating scales. Similar models have also been used in psychology to improve the statistical utility of data gained from ordinal scales (57) and make more meaningful between-group comparisons (58). Like the more widely known signal detection models, ordinal-probit models assume that a stimulus can be represented in some dimension by a normal distribution (56) and that discriminability is determined by the difference between distribution means. The assumption of a continuous underlying representation allows an agent to internally represent their beliefs and flexibly adapt how they communicate those beliefs with others based on the situational demands of the question or task. This is achieved by thresholds which divide the continuous distribution into discrete response options which represent those provided on the Likert scale (53, 56–58). These techniques are commonly applied to small-N study designs where very few participants make many judgments about perceptual stimuli (60). The small number of participants in these studies allows for variance in perceptual ability to be easily modeled at the subject level, meaning these models are grounded in strong theoretical frameworks and able to produce robust and precise model estimates (60). Inspired by these previous applications, our proposed approach assumes an interaction between the stimulus properties and a person’s subjective judgments (60) and can take advantage of recent improvements in computational and sampling efficiency, to apply a similar theoretical framework to rating scale decisions.
Ordinal-probit models assume that the latent state of interest is represented by a normal distribution which is divided into n bins, where n is the number of response options on the Likert scale. The model does not assume equal distance between thresholds delineating the response bins and we assume a person’s response to an item is the result of a randomly sampled value from their latent state distribution. As shown in Fig. 1 C and D, different threshold placements allow some response options to be more likely than others and can manifest in different response patterns, even when the mean and SD of a latent distribution are identical. The proposed model freely estimates a mean for each latent state and one SD which is assumed to be the same across all latent states. The estimated variance of the distribution works with the order-constrained yet flexible thresholds to allow the distribution of responses to shift asymmetrically. For more details, see Model Specification.
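A minimal generative sketch of this assumption, in R, helps make it concrete: a response to an item is a single latent draw, binned by the respondent’s thresholds. The threshold values below are illustrative only and are chosen to mimic the extreme and midpoint styles in Fig. 1C.

# Minimal sketch of the generative assumption: one latent draw per item,
# discretized by person-specific, order-constrained thresholds. Values are illustrative.
simulate_item_response <- function(mu, sigma, thresholds) {
  x <- rnorm(1, mean = mu, sd = sigma)   # latent sample for this item
  sum(x > thresholds) + 1                # response = number of thresholds exceeded, plus 1
}

thr_extreme  <- c(-0.3, -0.1, 0.1, 0.3)  # narrow inner bins: most mass falls in options 1 and 5
thr_midpoint <- c(-2.0, -1.5, 1.5, 2.0)  # wide central bin: most mass falls in option 3

set.seed(1)
replicate(10, simulate_item_response(mu = 0, sigma = 1, thresholds = thr_extreme))
replicate(10, simulate_item_response(mu = 0, sigma = 1, thresholds = thr_midpoint))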
The key improvement our model makes on existing approaches is the assumption of a single response style estimated across multiple latent dimensions. Through constraint in response style across the different latent states, the model identifies person-level thresholds and derives individual latent state estimates conditional on them. The model combines well-established Thurstonian statistical theory (56, 60) with hierarchical sampling techniques to support a flexible tool that provides information about both latent state expression and response tendencies for each subject and the group overall. Much like item-response theory approaches to modeling Likert scale data (61, 62), our model acknowledges the subject-level random effects which may influence choice behaviors by assuming the same decision thresholds affect different latent state responses. In this way, our model provides a method for exploring and understanding people’s survey response behaviors that is a useful advancement on existing applications.
Our model extends work by Taylor et al. (53), who recently adopted a similar approach to demonstrate how different decision styles can affect Likert scale responses in norm-rating tasks. An important feature of Taylor et al.’s model is the assumption of homogeneity of decision styles; all participants approach the task with the same decision style (e.g., everyone is an extreme responder). While this assumption seems appropriate for norm rating studies like theirs, it is unlikely that all participants will have the same decision style when responding to Likert scales in other (perhaps more common) settings (cf. Fig. 2). The variety of possible person-level differences in how people approach most survey-style tasks cannot be captured when only one (or a few) response styles are assumed to be present in the data. Previous item response theory approaches have aimed to address this by modeling multiple response tendencies concurrently and separating them from latent state-level factors (61, 62). Though these approaches are conceptually similar to ours, they require a priori decisions about which decision styles will be measured or are expected in the data. Similarly, factor mixture models which may be able to control for non-content-based responding also require assumptions about the types of responses which will be present in a dataset to be built into the model (63). A major advantage of our approach is that it captures individual decision styles in data, even when those styles cannot be easily verbalized. It does not require researchers to prespecify decision styles or develop additional tasks to identify decision styles [as proposed by Bolt et al. (64)].
3. Results
3.1. The Model Reveals Important Information About Latent States.
Reducing the confounding effect of decision styles reveals stronger correlations between the latent states investigated in a survey than is observed with standard scoring methods (see Fig. 3). Across all pairwise correlations of latent states, Spearman’s rho coefficients are 71.83% larger, on average, for our debiased scores than for the standard sum-scoring method. Interestingly, the same general patterns exist in both the standard scores and the debiased scores: the subscale pairs with the strongest and weakest correlations are the same with both methods. In this way, by removing the influence of decision style, the model uncovers correlations that were attenuated by the aggregation of different decision styles. This finding is in line with existing literature demonstrating that measurement error (here, the differential impact of uncontrolled decision styles) can attenuate correlations (65, 66). To confirm that these findings did not result from some irregularity in model behavior, we simulated data with known latent state correlations ranging from 0 to 1 (in increments of 0.1). This confirmed that the debiased scores are closer to the ground truth than standard scores in all cases and that standard scores attenuated the latent state correlations increasingly as the true correlations grew larger. See SI Appendix for more details.
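The following sketch illustrates the logic of such a simulation under simplified assumptions: two latent states with a known correlation, a mixture of threshold styles, and sum scores computed from the generated responses. The specific values and the mixture of styles are illustrative, not the settings reported in SI Appendix.

# Illustrative simulation: a known latent correlation is attenuated in sum scores
# obtained under a mixture of decision styles. Settings are illustrative only.
library(MASS)
set.seed(2)

n_people <- 1000; n_items <- 10; true_r <- 0.6
latent <- mvrnorm(n_people, mu = c(0, 0),
                  Sigma = matrix(c(1, true_r, true_r, 1), 2))

# Randomly assign each person a threshold style (extreme, midpoint, or "neutral")
styles <- list(extreme  = c(-0.3, -0.1, 0.1, 0.3),
               midpoint = c(-2.0, -1.5, 1.5, 2.0),
               neutral  = c(-1.5, -0.5, 0.5, 1.5))
style_id <- sample(names(styles), n_people, replace = TRUE)

sum_score <- function(mu, thr) {
  x <- rnorm(n_items, mean = mu, sd = 1)
  sum(sapply(x, function(xi) sum(xi > thr) + 1))   # sum of 1-5 codes over items
}

scores <- t(sapply(seq_len(n_people), function(i) {
  thr <- styles[[style_id[i]]]
  c(sum_score(latent[i, 1], thr), sum_score(latent[i, 2], thr))
}))

cor(scores[, 1], scores[, 2], method = "spearman")   # typically clearly smaller than true_r
cor(latent[, 1], latent[, 2], method = "spearman")   # close to true_r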
Fig. 3.

Standard scoring methods attenuate correlations between the latent states measured in a Likert scale survey compared to our model-based methods. (A) Standard scoring methods attenuate the correlation between latent states when many decision styles coexist in the sample; the scatterplot colors match the decision styles in Fig. 2. (B) Our model-based method overcomes the attenuation of latent state associations observed in the standard scoring methods. The outcome is a stronger association between latent states than previously observed.
The model’s ability to discover stronger correlations between latent states may be, in part, because the edge effects induced by standard scores can attenuate correlation coefficients (clearly visible in Fig. 3A), and these edge effects are eliminated with the model’s debiased scores as they exist on an unbounded continuum. With standard scoring methods, participant scores are limited by the number of items and response options available, restricting the range of possible scores and biasing estimates of the strength of association downward. Since our model naturally produces continuous estimates of the latent state, there is no restriction of range for the latent state which increases discriminability and provides a method for obtaining cleaner measures of scale validity. As an added benefit, the continuous data produced by the model are appropriate for many common statistical analyses which require assumptions of continuity to be met [including but not limited to measures of central tendency (57) and between-group comparisons (58)], which is not always the case for ordinal data obtained through standard scoring methods.
Because the model debiases estimates of the latent state in this way, it can also provide additional information about the cognitive mechanisms associated with the latent states themselves. As in the motivating example of Chris and John’s self-reported loneliness, some of the mean differences observed between populations under standard scoring may be the result of different decision styles. For equal-sized subgroups of the population that adopt different decision styles, Fig. 4A demonstrates that left-biased responders have noticeably lower mean scores than most other decision style groups when standard scoring methods are used (Bottom row). This was the case across all latent states except extraversion but was nonexistent in the debiased scores (Top row). In this case, the group differences present with standard scoring approaches are likely the result of different decision styles and not true differences in the latent states between groups.
Fig. 4.
The model provides interesting insight into, and accurate estimates of, latent state expression, even with missing data. (A) Debiased scores (Top row) reduce the impact of response styles found with standard scoring approaches (Bottom row) and reveal true group differences between latent state estimates for some decision style groups. Error bars reflect 95% credible intervals across sampling iterations and the SE of the mean in the Top and Bottom rows, respectively. As such, they are only interpretable within a row and cannot be appropriately compared between rows. This analysis was performed on equal-sized groups. (B) In a leave-one-out cross-validation, the subject-level (conditional) estimates the model provides for the held-out latent state make predictions which are always far more likely (y-axis) than those made with the raw data response frequency proportions (model-free), alone. Often, this is also the case with the group-level (marginal) estimates. The different plots show the values for the different held-out latent states. The error bars show the SE, controlled for within-subject variability.
Fig. 4A also shows extreme responders have higher debiased scores than other decision style groups across all latent states, except neuroticism where their scores were noticeably lower. Importantly, this remains true even when the influence of decision style is partialled out by the model (top row). This supports existing research which has demonstrated positive relationships between extreme responding and measures of openness (62, 67–69), conscientiousness (67–70), extraversion (62, 67, 69, 70), agreeableness (67–69) and emotional stability (which is generally considered the inverse of neuroticism) (67). Additionally, negative correlations have been found between neuroticism and extreme responding (68, 69). Our findings suggest the same psychological processes responsible for the extreme decision style may be systematically related to the expression of latent states. In this way, our model provides a useful tool to explore whether group differences result from meaningful variation in the latent state or superficial differences caused by decision style. This finding is especially important when we consider that different subpopulations may be more or less strongly associated with certain decision styles (50). Without a method for separating decision styles from latent state measurement, it would be easy to draw incorrect conclusions about differences between those subpopulations. Instead, our model provides a method for establishing whether true latent state differences exist or if people are simply adopting different decision styles.
The strong influence that response styles have on data collected using Likert scales can be seen in the model’s ability to predict missing data using information about response style. In a leave-one-out cross-validation, we removed one of the five personality dimensions from each participant’s data (200 participants per latent state), such that each participant had information about only four of the five latent states measured by the IPIP. The model’s predictions for responses to the items from the held-out states were then compared against the held-out data. In the “conditional” version, participant-level information about response style and the four observed dimensions was used to estimate each participant’s mean of the latent state for the held-out dimension. In the “marginal” version, participant-level information about response style was used to generate predictions but only group-level information was used to obtain an estimate of the mean of the latent state for the held-out dimension. To compare these to model-free estimates, we calculated the proportion of responses made for each response option in the raw data and used these to inform the held-out latent state responses. For full method details, see SI Appendix.
Fig. 4B shows the log-likelihood of predictions made with the different approaches outlined above, where larger log-likelihoods indicate better predictions. First and foremost, the conditionally informed model estimates make predictions about the held-out latent state which are always considerably more likely than those made with the model-free information and have the closest log-likelihood values to the fully informed model. For some latent states (openness, conscientiousness, and agreeableness), the held-out latent state data are also more likely under the marginally informed estimate predictions than under those obtained from the model-free information. Importantly, the model-free predictions for the held-out latent state never outperform any of the model-informed approaches. The accuracy with which the conditionally informed estimates, especially, can predict the held-out latent state data demonstrates that the subject-level response style and group-level latent state covariate information the model obtains can make predictions which are impressively accurate when latent state data are thin or, as in this case, even completely absent. Not only does the model provide a useful tool for incomplete datasets (a common problem for researchers), but it indicates that response style, alone, can predict much of the apparent latent state expression. In this way, the estimation of a common set of response style thresholds across multiple latent states is a powerful advantage over model-free methods because it not only facilitates more precise latent state estimation but can also predict latent state manifestation.
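The comparison itself reduces to scoring each participant’s held-out responses under different predictive category probabilities. The sketch below illustrates only that scoring step with toy values; pred_probs_conditional stands in for the model’s conditional predictions, and the smoothing in the model-free baseline is a choice made for this sketch rather than part of the procedure in SI Appendix.

# Sketch of the prediction comparison: log-likelihood of held-out responses under
# predicted category probabilities vs. a model-free baseline. All inputs are toy values.
loglik_heldout <- function(heldout_codes, probs) {
  # heldout_codes: integer codes 1-5 for the held-out items of one participant
  # probs: length-5 vector of predicted response probabilities for that participant
  sum(log(probs[heldout_codes]))
}

model_free_probs <- function(observed_codes) {
  # Raw response-option proportions; 0.5 is added to each count to avoid zero
  # probabilities (a smoothing choice made for this sketch only)
  (tabulate(observed_codes, nbins = 5) + 0.5) / (length(observed_codes) + 2.5)
}

heldout  <- c(4, 5, 4, 4, 5, 4, 3, 4, 4, 5)                # hypothetical held-out responses
observed <- c(3, 4, 3, 2, 3, 3, 4, 3, 3, 2)                # hypothetical responses to retained states
pred_probs_conditional <- c(0.02, 0.06, 0.12, 0.50, 0.30)  # hypothetical model output

loglik_heldout(heldout, pred_probs_conditional)
loglik_heldout(heldout, model_free_probs(observed))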
3.2. The Model Captures Individual Differences in Decision Style.
Fig. 2 and previous research (40) demonstrate the existence of different decision styles, and so we confirm the model’s efficacy in capturing and controlling for the impact of these response behaviors. Fig. 5A shows the same exemplar participants as Fig. 2A alongside the model-generated posterior predictive data in black. The model closely captures the qualitative and quantitative trends across the variety of decision styles observed in the data, even when the styles are asymmetric, such as the left- and right-biased styles.
Fig. 5.
The model captures the qualitative and quantitative response patterns of the exemplar participants shown in Fig. 2A (A), and discovers the response styles of participants who use distinct decision styles that are not easily categorized (B). The frequency (y-axis) of each response option (x-axis) is shown in black for posterior predictive data of the model (with 95% credible intervals) and in colors/gray for the observed data.
Fig. 5B highlights a key strength of our model: the ability to detect response patterns that do not fall neatly into easily verbalized categories. Each participant in Fig. 5B was categorized as the “other” decision style in Fig. 2, but nonetheless has clear preferences for some response options over others. It is also clear that the model can generate complementary decision styles; the two participants shown on the left-hand side of Fig. 5B show essentially opposite response preferences (overuse vs. underuse of response options 2 and 4), as do the easily verbalized “mirrored” decision styles of extreme vs. midpoint responding, and left-biased vs. right-biased responding. Beyond the nine participants shown in Fig. 5, we confirmed that there was strong agreement between data and model predictions across the sample of 1,000 participants, and that the model predictions reflect the same decision style group patterns as the observed data; see SI Appendix. This ability to successfully capture a wide range of person-level decision styles without limitation on the type or variety of styles is a strength of our model that is a step-change beyond previous research (53, 61, 62).
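A check of this kind boils down to comparing observed response-option frequencies against frequencies tabulated from posterior predictive datasets. The sketch below uses toy inputs in place of a participant’s observed responses and the model’s posterior predictive draws.

# Sketch of the posterior predictive check in Fig. 5: observed vs. predicted
# response-option frequencies for one participant. Toy inputs stand in for the
# observed responses and the 20 posterior predictive datasets per participant.
set.seed(3)
observed_responses <- sample(1:5, 50, replace = TRUE, prob = c(.05, .10, .60, .15, .10))
post_pred <- replicate(20, sample(1:5, 50, replace = TRUE,
                                  prob = c(.05, .12, .55, .18, .10)), simplify = FALSE)

freq5 <- function(codes) tabulate(codes, nbins = 5)
observed_freq <- freq5(observed_responses)
pred_freqs    <- sapply(post_pred, freq5)          # 5 x 20 matrix of predicted frequencies
pred_mean     <- rowMeans(pred_freqs)
pred_ci       <- apply(pred_freqs, 1, quantile, probs = c(0.025, 0.975))

cbind(option = 1:5, observed = observed_freq, predicted = round(pred_mean, 1),
      lower = pred_ci[1, ], upper = pred_ci[2, ])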
Our model provides solutions for many problems caused by response styles, but not all. For example, two people can produce exactly the same Likert scale responses even though they differ in their latent states, as long as they also differ by exactly the same amount in how they interpret response scales. Although it is unlikely that two people would differ by exactly the same amount on all latent states when they are measured on a continuous scale (as is the case in the present model), it is theoretically possible. This highlights a limitation of Likert scales as a measurement tool and ordinal scale data more generally, rather than a limitation of any particular analysis approach, and is beyond the scope of this research. The hierarchical structure of our proposed model may provide interesting avenues for solutions to this difficult problem in the future, for example, in cases where other data streams from the same respondents can reveal response styles. We therefore suggest that, while our model may not provide a solution to all problems with Likert scale data, it is an important and meaningful step-change above existing methods.
Box 1: A note for model users.
We have demonstrated the model’s efficacy for a Likert-scale survey instrument that studies more than two latent states, with a response scale that contains more than two response options and is relatively balanced with positively and negatively worded items. The model can, in principle, be applied to multiple surveys that each investigate a single latent state, provided those surveys have the same response option anchoring. In all cases, our approach assumes the response scale has three or more response options. In its current, most flexible form, the model also requires the presence of reverse-scored items—these are necessary in order to tease apart the effects of overall left/right bias from actual changes in underlying preferences. Surveys without reverse-scored items could be handled by omitting the model elements related to left/right bias. The model may be applied to surveys (or combinations thereof) that extend beyond the strict criteria above, though we have not explored its statistical properties in all possible use cases. In these circumstances, we recommend simulation studies to determine the model’s suitability before each new use. To this end, code for applying the model to data is available at https://osf.io/e5zc4/.
4. Conclusions
We have presented a model that capitalizes on well-established statistical and psychological theory (56, 60) to provide a quantitative framework for understanding response behavior on surveys. Decision styles clearly influence responses to Likert scales (40) and, since these tasks are used ubiquitously to measure people’s attitudes, thoughts, and behaviors, it is essential to account for their confounding impact. The model can be applied to most Likert scale data after it is collected, regardless of the latent state, number of items, number of response options, response option labeling, or decision styles present. We have demonstrated that the model adjusts for the differential impact of decision style across latent states and generates individual-respondent estimates of latent states which are closer to the ground truth than those from standard scoring methods. Importantly, it also provides additional information about the psychological processes which may be responsible for latent state expression and may be a fruitful avenue for exploring differences across subpopulations or cultures. Since responses to surveys can have such a wide variety of implications and are known to vary between groups (48–50), it is important to have a tool which can control for these confounding effects. In this way, our model provides an accessible and easy-to-use tool for better understanding human behavior, attitudes, values, and feelings.
5. Materials and Methods
Regression analyses (details in SI Appendix) on participants’ raw response frequencies were used to classify response style groups and explore the biasing influence of response style on survey responding (Fig. 2). We then applied the model to a random sample of 1,000 participants and used the posterior means to explore latent state correlations (Fig. 3) and group-level differences (Fig. 4A). The model was estimated in a hierarchical Bayesian framework with random effects for individual-level parameters (see Model Estimation for details). Posterior predictive data from these estimates then demonstrate the model’s successful fit to data across all participants (Fig. 5 and SI Appendix). Analyses were repeated on two different random samples of 1,000 participants to confirm the reliability of results. We also performed a simulation study to determine the model’s ability to recover known latent state correlations (details in SI Appendix) and a leave-one-out cross-validation to demonstrate that the model could generate accurate predictions for entirely withheld latent states (Fig. 4B). All materials to conduct the analyses in this paper are available at https://osf.io/e5zc4/.
5.1. Data.
We report data from the Open-Source Psychometrics Project IPIP dataset (51). This dataset (which can be found on the Open Psychometrics website labeled “Answers to the IPIP Big Five Factor Markers,” updated 11/8/2018) contains responses from over a million participants to Goldberg’s (71) 50-item personality inventory. This survey measures the Big Five personality traits: extraversion, openness, conscientiousness, agreeableness, and emotional stability (which we refer to here as neuroticism for simplicity). Each trait is measured by 10 items. Each scale has some positively and some negatively worded items, though the number differs between scales, ranging from 2 (neuroticism) to 5 (extraversion) negatively worded items. This is a 5-point Likert scale with response options labeled “Disagree,” “Neutral,” and “Agree,” with unlabeled response options on either side of “Neutral.” With over 8,000 citations, the IPIP is one of the most widely used personality inventories and this dataset, specifically, is demonstrably comparable to data collected on Amazon Mechanical Turk (see Open Psychometrics website) (51).
Due to the size of the dataset, we randomly sampled 1,000 participants to simulate a more realistic sample size which would still be considered appropriate for the domain of study (72). To calculate the “Standard Scoring Method” scores, each response option was assigned a numerical value (“Disagree” = 1, “Neutral” = 3, “Agree” = 5) and negatively worded items were reverse-scored (e.g., scores of 1 transformed to 5). The scores were then summed for each latent state.
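For reference, this standard scoring computation amounts to a few lines of R; the column names and list of reverse-keyed items below are placeholders rather than the exact item keys in the IPIP data file.

# Standard sum scoring, as described above. `dat` is assumed to be a data frame of
# raw 1-5 codes with columns named by item (e.g., "E1"..."E10" for extraversion);
# `reverse_items` lists the negatively worded items for that scale (placeholders).
sum_score_scale <- function(dat, items, reverse_items) {
  coded <- dat[, items, drop = FALSE]
  coded[, reverse_items] <- 6 - coded[, reverse_items]   # reverse-score: 1 <-> 5, 2 <-> 4
  rowSums(coded)
}

# Hypothetical usage for the extraversion scale:
# extraversion <- sum_score_scale(dat, paste0("E", 1:10), c("E2", "E4", "E6", "E8", "E10"))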
5.2. Model Specification.
This hierarchical model provides estimates for all decision style and latent state parameters at both the subject- and group-level. The model assumes that for each Likert scale item k targeting latent state j, a person i randomly samples a value x_ijk from a normal distribution with mean μ_ij and SD σ_i:

x_ijk ∼ Normal(μ_ij, σ_i),

where μ_ij is freely estimated for each latent state j. Higher estimated means correspond to greater endorsement of the latent state of interest. The SD σ_i is also freely estimated, though it is assumed to be the same across all latent states examined in the survey instrument.

The distribution is discretized into n bins, where n is the number of points available on the Likert scale, by ordered thresholds τ_i,1 < τ_i,2 < … < τ_i,n−1 for person i on the latent scale. When person i is queried for a response to item k about latent state j, the response r_ijk is determined by the comparison of the sample x_ijk against the thresholds τ_i:

r_ijk = c if τ_i,c−1 < x_ijk ≤ τ_i,c, with τ_i,0 = −∞ and τ_i,n = +∞.

Therefore, the probability that response c is given follows from the cumulative normal distribution with mean μ_ij and SD σ_i, Φ(·; μ_ij, σ_i):

P(r_ijk = c) = Φ(τ_i,c; μ_ij, σ_i) − Φ(τ_i,c−1; μ_ij, σ_i).
For parameter identifiability, the location of one of the central thresholds is fixed (not estimated from data) to be one unit above the next lowest threshold. If n is even, the central threshold is fixed; if n is odd, the higher of the two central thresholds is fixed. Given that n response options require n − 1 thresholds, the model fixes one and estimates n − 2 thresholds.
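In code, these response probabilities are simply differences of normal CDFs evaluated at adjacent thresholds; a minimal R sketch with illustrative parameter values follows.

# Probability of each response option under the ordinal-probit model:
# P(response = c) = pnorm(tau_c, mu, sigma) - pnorm(tau_{c-1}, mu, sigma),
# with tau_0 = -Inf and tau_n = Inf. Parameter values here are illustrative.
response_probs <- function(mu, sigma, thresholds) {
  cuts <- c(-Inf, thresholds, Inf)
  diff(pnorm(cuts, mean = mu, sd = sigma))
}

thresholds <- c(-1.2, -0.2, 0.8, 1.8)              # 4 thresholds for a 5-point scale
response_probs(mu = 0.5, sigma = 1, thresholds)    # sums to 1 across the 5 options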
5.3. Model Estimation.
The model was estimated with the PMwG algorithm in R (73). This hierarchical Bayesian sampling approach first estimates group-level effects with a Gibbs sampling step before estimating subject-level random effects with a particle Metropolis sampling step for each specified iteration in the sampler, occurring over three stages (74, 75). In the first stage (“burn-in”), we sampled 1,000 iterations of 100 particles. In the second (“adaptation”) and third (“sampling”) stages, we sampled 5,000 iterations of 50 particles, though the adaptation stage obtained an adequate number of unique samples before this. In all three stages, the target particle acceptance rate was set to 0.8. Start points were sampled from the prior and both the group- and subject-level effects were sampled from a multivariate normal proposal distribution, with sampling settings and group-level prior distributions specified as described in ref. 73. The group-level prior distribution followed the marginally noninformative distribution of Huang and Wand (76) with a mean vector set to 0 and a covariance matrix with diagonal elements set to 1 and off-diagonal elements set to 0. Visual inspection of parameter chains confirmed these settings were sufficient to achieve stationarity. Twenty posterior predictive samples were generated from the model for each participant.
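For orientation, the sketch below shows how an estimation of this kind can be set up with the pmwg package: a subject-level log-likelihood for the ordinal-probit model is written and passed to the sampler, which is then run through the three stages described above. The data object survey_data, the parameterization (log SD and log threshold increments), and the prior structure are assumptions of this illustrative sketch, not the exact specification used here; argument names follow the pmwg package documentation as we understand it, and the released code at https://osf.io/e5zc4/ is the authoritative implementation.

# Hedged sketch of the estimation setup for a 5-point scale with 5 latent states.
library(pmwg)

pars <- c(paste0("mu_", 1:5), "log_sigma", "thr2", "log_inc_low", "log_inc_high")

# Subject-level log-likelihood of the kind supplied to the sampler.
# `data` is assumed to be one subject's long-format data with columns
# `state` (1-5) and `response` (1-5); pmwg is assumed to require a `subject` column.
ll_ordinal <- function(x, data, sample = FALSE) {
  mu    <- x[paste0("mu_", 1:5)]
  sigma <- exp(x["log_sigma"])
  thr2  <- x["thr2"]
  thr   <- c(thr2 - exp(x["log_inc_low"]),    # lowest threshold
             thr2,                            # lower central threshold (free)
             thr2 + 1,                        # higher central threshold (fixed offset)
             thr2 + 1 + exp(x["log_inc_high"]))
  cuts  <- c(-Inf, thr, Inf)
  p <- pnorm(cuts[data$response + 1], mu[data$state], sigma) -
       pnorm(cuts[data$response],     mu[data$state], sigma)
  sum(log(pmax(p, 1e-10)))
}

prior   <- list(theta_mu_mean = rep(0, length(pars)),
                theta_mu_var  = diag(length(pars)))
sampler <- pmwgs(data = survey_data, pars = pars, ll_func = ll_ordinal, prior = prior)
sampler <- init(sampler)
sampler <- run_stage(sampler, stage = "burn",   iter = 1000, particles = 100)
sampler <- run_stage(sampler, stage = "adapt",  iter = 5000, particles = 50)
sampler <- run_stage(sampler, stage = "sample", iter = 5000, particles = 50)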
Supplementary Material
Appendix 01 (PDF)
Acknowledgments
This project was possible because of funding provided by the University of Newcastle’s Research Training Program and the Vice Chancellor’s Academic Career Preparation scholarship.
Author contributions
J.G., S.D.B., and G.E.H. designed research; J.G. performed research; J.G. and G.E.H. analyzed data; S.D.B. and G.E.H. supervised the research; and J.G., S.D.B., and G.E.H. wrote the paper.
Competing interests
The authors declare no competing interest.
Footnotes
This article is a PNAS Direct Submission.
Data, Materials, and Software Availability
Previously published data were used for this work (https://openpsychometrics.org/_rawdata/) (51).
References
- 1. Balasubramanian N., Likert technique of attitude scale construction in nursing research. Asian J. Nurs. Educ. Res. 2, II (2012).
- 2. Zimmerman M., Sheeran T., Young D., The diagnostic inventory for depression: A self-report scale to diagnose DSM-IV major depressive disorder. J. Clin. Psychol. 60, 87–110 (2004).
- 3. Glasser M., Gravdal J. A., Assessment and treatment of geriatric depression in primary care settings. Arch. Fam. Med. 6, 433 (1997).
- 4. Guck T. P., Elsasser G. N., Kavan M. G., Eugene E. J., Depression and congestive heart failure. Congest. Heart Fail. 9, 163–169 (2003).
- 5. Wang Y. P., Gorenstein C., “The Beck Depression Inventory: Uses and applications” in The Neuroscience of Depression, Martin C. R., Hunter L.-A., Patel V. B., Preedy V. R., Rajendram R., Eds. (Elsevier, 2021), pp. 165–174.
- 6. von Glischinski M., von Brachel R., Hirschfeld G., How depressed is “depressed”? A systematic review and diagnostic meta-analysis of optimal cut points for the Beck Depression Inventory revised (BDI-II). Qual. Life Res. 28, 1111–1118 (2019).
- 7. Morgan A. L., et al., Difficulty taking medications, depression, and health status in heart failure patients. J. Card. Fail. 12, 54–60 (2006).
- 8. Improta G., Perrone A., Russo M. A., Triassi M., Health technology assessment (HTA) of optoelectronic biosensors for oncology by analytic hierarchy process (AHP) and Likert scale. BMC Med. Res. Methodol. 19, 1–14 (2019).
- 9. Sockalingam S., Clarkin C., Serhal E., Pereira C., Crawford A., Responding to health care professionals’ mental health needs during COVID-19 through the rapid implementation of Project ECHO. J. Contin. Educ. Health Prof. 40, 211–214 (2020).
- 10. Labrague L. J., et al., The association of nurse caring behaviours on missed nursing care, adverse patient events and perceived quality of care: A cross-sectional study. J. Nurs. Manag. 28, 2257–2265 (2020).
- 11. Orsulic-Jeras S., Whitlatch C. J., Szabo S. M., Shelton E. G., Johnson J., The SHARE program for dementia: Implementation of an early-stage dyadic care-planning intervention. Dementia 18, 360–379 (2019).
- 12. Horan T. A., Abhichandani T., Rayalu R., “Assessing user satisfaction of E-government services: Development and testing of quality-in-use satisfaction with advanced traveler information systems (ATIS)” in Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS’06) (IEEE, 2006), vol. 4, pp. 83b–83b.
- 13. Shuib L., Yadegaridehkordi E., Ainin S., Malaysian urban poor adoption of E-government applications and their satisfaction. Cogent Soc. Sci. 5, 1565293 (2019).
- 14. Nguyen T. T., Phan D. M., Le A. H., Nguyen L. T. N., The determinants of citizens’ satisfaction of E-government: An empirical study in Vietnam. J. Asian Financ. Econ. Bus. 7, 519–531 (2020).
- 15. Thomas R., et al., Concerns and misconceptions about the Australian government’s COVIDSafe app: Cross-sectional survey study. JMIR Public Health Surveill. 6, e23081 (2020).
- 16. Croasmun J. T., Ostrom L., Using Likert-type scales in the social sciences. J. Adult Educ. 40, 19–22 (2011).
- 17. Mullei K., et al., Attracting and retaining health workers in rural areas: Investigating nurses’ views on rural posts and policy interventions. BMC Health Serv. Res. 10, 1–10 (2010).
- 18. Tobler C., Visschers V. H., Siegrist M., Addressing climate change: Determinants of consumers’ willingness to act and to support policy measures. J. Environ. Psychol. 32, 197–207 (2012).
- 19. Hong J., She Y., Wang S., Dora M., Impact of psychological factors on energy-saving behavior: Moderating role of government subsidy policy. J. Clean. Prod. 232, 154–162 (2019).
- 20. Purnama S. G., Susanna D., Attitude to COVID-19 prevention with large-scale social restrictions (PSBB) in Indonesia: Partial least squares structural equation modeling. Front. Public Health 8, 570394 (2020).
- 21. Sabat I., et al., United but divided: Policy responses and people’s perceptions in the EU during the COVID-19 outbreak. Health Econ. 124, 909–918 (2020).
- 22. Ho A. T. K., Cho W., Government communication effectiveness and satisfaction with police performance: A large-scale survey study. Public Adm. Rev. 77, 228–239 (2017).
- 23. Tapia M. E. Á., Raúl D. C. M., Vinicio C. N. M., Indeterminate Likert Scale for the Analysis of the Incidence of the Organic Administrative Code in the Current Ecuadorian Legislation (Infinite Study, 2020), vol. 37.
- 24. Perrins G., Ferdous T., Hay D., Harreveld B., Reid-Searl K., Conducting health literacy research with hard-to-reach regional culturally and linguistically diverse populations: Evaluation study of recruitment and retention methods before and during COVID-19. JMIR Form. Res. 5, e26136 (2021).
- 25. Paap K. R., Anders-Jefferson R. T., Balakrishnan N., Majoubi J. B., The many foibles of Likert scales challenge claims that self-report measures of self-control are better than performance-based measures. Behav. Res. Methods 56, 908–933 (2024).
- 26. Weijters B., Baumgartner H., Misresponse to reversed and negated items in surveys: A review. J. Mark. Res. 49, 737–747 (2012).
- 27. Suárez Álvarez J., et al., Using reversed items in Likert scales: A questionable practice. Psicothema 30, 149–158 (2018).
- 28. Wong N., Rindfleisch A., Burroughs J. E., Do reverse-worded items confound measures in cross-cultural consumer research? The case of the material values scale. J. Consum. Res. 30, 72–91 (2003).
- 29. Kusmaryono I., Wijayanti D., Maharani H. R., Number of response options, reliability, validity, and potential bias in the use of the Likert scale in education and social science research: A literature review. Int. J. Educ. Methodol. 8, 625–637 (2022).
- 30. Leung S. O., A comparison of psychometric properties and normality in 4-, 5-, 6-, and 11-point Likert scales. J. Soc. Serv. Res. 37, 412–421 (2011).
- 31. Chyung S. Y., Roberts K., Swanson I., Hankinson A., Evidence-based survey design: The use of a midpoint on the Likert scale. Perform. Improv. 56, 15–23 (2017).
- 32. Nadler J. T., Weston R., Voyles E. C., Stuck in the middle: The use and interpretation of mid-points in items on questionnaires. J. Gen. Psychol. 142, 71–89 (2015).
- 33. Garland R., The mid-point on a rating scale: Is it desirable? Mark. Bull. 2, 66–70 (1991).
- 34. Nicholls M. E., Orr C. A., Okubo M., Loftus A., Satisfaction guaranteed: The effect of spatial biases on responses to Likert scales. Psychol. Sci. 17, 1027–1028 (2006).
- 35. Maeda H., Response option configuration of online administered Likert scales. Int. J. Soc. Res. Methodol. 18, 15–26 (2015).
- 36. Friedman H. H., Herskovitz P. J., Pollack S., “The biasing effects of scale-checking styles on response to a Likert scale” in JSM Proceedings, Survey Research Methods Section (American Statistical Association, Alexandria, VA, 1993), pp. 792–795.
- 37. Weijters B., Geuens M., Baumgartner H., The effect of familiarity with the response category labels on item response to Likert scales. J. Consum. Res. 40, 368–381 (2013).
- 38. Jamieson K. H., et al., Protecting the integrity of survey research. PNAS Nexus 2, pgad049 (2023).
- 39. Berenbaum M. R., Thorp H. H., Publishing survey research. Proc. Natl. Acad. Sci. U.S.A. 121, e2406826121 (2024).
- 40. Baumgartner H., Steenkamp J. B. E., Response styles in marketing research: A cross-national investigation. J. Mark. Res. 38, 143–156 (2001).
- 41. Böckenholt U., Measuring response styles in Likert items. Psychol. Methods 22, 69 (2017).
- 42. Park M., Wu A. D., Item response tree models to investigate acquiescence and extreme response styles in Likert-type rating scales. Educ. Psychol. Meas. 79, 911–930 (2019).
- 43. Harzing A. W., Response styles in cross-national survey research: A 26-country study. Int. J. Cross Cult. Manag. 6, 243–266 (2006).
- 44. Javaras K. N., Ripley B. D., An “unfolding” latent variable model for Likert attitude data: Drawing inferences adjusted for response style. J. Am. Stat. Assoc. 102, 454–463 (2007).
- 45. Johnson T. R., On the use of heterogeneous thresholds ordinal regression models to account for individual differences in response style. Psychometrika 68, 563–583 (2003).
- 46. Weijters B., Geuens M., Schillewaert N., The stability of individual response styles. Psychol. Methods 15, 96 (2010).
- 47. Infurna F. J., et al., Loneliness in midlife: Historical increases and elevated levels in the United States compared with Europe. Am. Psychol., 10.1037/amp0001322 (2024).
- 48. Lee J. W., Jones P. S., Mineyama Y., Zhang X. E., Cultural differences in responses to a Likert scale. Res. Nurs. Health 25, 295–306 (2002).
- 49. Dolnicar S., Grün B., Cross-cultural differences in survey response patterns. Int. Mark. Rev. 24, 127–143 (2007).
- 50. Clarke I. III, Extreme response style in cross-cultural research: An empirical investigation. J. Soc. Behav. Pers. 15, 137–152 (2000).
- 51. Open-Source Psychometrics Project, Data from “Answers to the IPIP Big Five Factor Markers”. Raw data from online personality tests. https://openpsychometrics.org/_rawdata/IPIP-FFM-data-8Nov2018.zip. Deposited 11 August 2018.
- 52. Goldberg L. R., et al., A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. Pers. Psychol. Eur. 7, 7–28 (1999).
- 53. Taylor J. E., Rousselet G. A., Scheepers C., Sereno S. C., Rating norms should be calculated from cumulative link mixed effects models. Behav. Res. Methods 55, 2175–2196 (2023).
- 54. Goldberg L. R., The structure of phenotypic personality traits. Am. Psychol. 48, 26 (1993).
- 55. Costa P. T., McCrae R. R., Solid ground in the wetlands of personality: A reply to Block. Psychol. Bull. 117, 216–220 (1995).
- 56. Lee H. S., O’Mahony M., The evolution of a model: A review of Thurstonian and conditional stimulus effects on difference testing. Food Qual. Prefer. 18, 369–383 (2007).
- 57. Liddell T. M., Kruschke J. K., Analyzing ordinal data with metric models: What could possibly go wrong? J. Exp. Soc. Psychol. 79, 328–348 (2018).
- 58. Schnuerch M., Haaf J. M., Sarafoglou A., Rouder J. N., Meaningful comparisons with ordinal-scale items. Collabra Psychol. 8, 38594 (2022).
- 59. Lee M. D., Ke M. Y., “Modeling individual differences in beliefs and opinions using Thurstonian models” in The Cognitive Science of Belief: A Multidisciplinary Approach, Musolino J., Sommer J., Hemmer P., Eds. (Cambridge University Press, 2022), pp. 488–518.
- 60. Smith P. L., Little D. R., Small is beautiful: In defense of the small-N design. Psychon. Bull. Rev. 25, 2083–2101 (2018).
- 61. Khorramdel L., von Davier M., Measuring response styles across the Big Five: A multiscale extension of an approach using multinomial processing trees. Multivar. Behav. Res. 49, 161–177 (2014).
- 62. Wetzel E., Carstensen C. H., Multidimensional modeling of traits and response styles. Eur. J. Psychol. Assess. 33, 352–364 (2015).
- 63. Arias V. B., et al., Detecting non-content-based response styles in survey data: An application of mixture factor analysis. Behav. Res. Methods 56, 3242–3258 (2023).
- 64. Bolt D. M., Lu Y., Kim J. S., Measurement and control of response styles using anchoring vignettes: A model-based approach. Psychol. Methods 19, 528 (2014).
- 65. Fisher C. R., A pedagogic demonstration of attenuation of correlation due to measurement error. Spreadsheets Educ. 7, Article 4 (2014).
- 66. Carroll R. J., Ruppert D., Stefanski L. A., Crainiceanu C. M., Measurement Error in Nonlinear Models: A Modern Perspective (Chapman and Hall/CRC, 2006).
- 67. Hibbing M. V., Cawvey M., Deol R., Bloeser A. J., Mondak J. J., The relationship between personality and response patterns on public opinion surveys: The Big Five, extreme response style, and acquiescence response style. Int. J. Public Opin. Res. 31, 161–177 (2019).
- 68. He J., Van de Vijver F. J., A general response style factor: Evidence from a multi-ethnic study in the Netherlands. Pers. Individ. Differ. 55, 794–800 (2013).
- 69. Klar A., Costello S. C., Sadusky A., Kraska J., Personality, culture and extreme response style: A multilevel modelling analysis. J. Res. Pers. 101, 104301 (2022).
- 70. Austin E. J., Deary I. J., Egan V., Individual differences in response scale use: Mixed Rasch modelling of responses to NEO-FFI items. Pers. Individ. Differ. 40, 1235–1245 (2006).
- 71. Goldberg L. R., The development of markers for the Big-Five factor structure. Psychol. Assess. 4, 26 (1992).
- 72. Comrey A. L., Lee H. B., A First Course in Factor Analysis (Psychology Press, 2013).
- 73. Gunawan D., Hawkins G. E., Tran M. N., Kohn R., Brown S., New estimation approaches for the hierarchical linear ballistic accumulator model. J. Math. Psychol. 96, 102368 (2020).
- 74. Kuhne C., et al., Hierarchical Bayesian estimation for cognitive models using Particle Metropolis within Gibbs (PMwG): A tutorial. PsyArXiv [Preprint] (2024). https://osf.io/xr37a (Accessed 21 April 2024).
- 75. Cooper G., et al., Particle Metropolis within Gibbs [Computer software manual]. R package version 0.2.7 (2024). https://cran.r-project.org/web/packages/pmwg/index.html. Accessed 21 April 2024.
- 76. Huang A., Wand M. P., Simple marginally noninformative prior distributions for covariance matrices. Bayesian Anal. 8, 439–452 (2013).