Improved Items for Estimating SF-36 Profile and Summary Component Scores: Construction and Validation of an 8-Item QOL General (QGEN) Survey

John E Ware, Jr

doi:10.1097/MLR.0000000000002122

2025 Jan 20;63(4):300–310. doi: 10.1097/MLR.0000000000002122

PMCID: PMC11888827 PMID: 39823550

Abstract

Background:

Comprehensive health-related quality of life (QOL) assessment under severe respondent burden constraints requires improved single-item scales for frequently surveyed domains. This article documents how new single-item-per-domain (SIPD) QOL General (QGEN-8) measures were constructed for domains common to SF-36 and results from the first psychometric tests comparing scores for the new measure in relation to those for the SF-36 profile and summary components.

Research Design:

Online NORC surveys of adults, ages 19–93 (mean=52 y) representing the US population in 2020 (N=1648) included QGEN-8 and SF-36 items measuring physical (PF), social (SF), role physical (RP) and role emotional (RE) functioning and feelings of bodily pain (BP), vitality (VT), and mental health (MH). QGEN-8 items were constructed with response categories increasing score ranges for functioning (PF, SF, RP, RE) and directly measuring first-order factors for feelings (BP, VT, and MH). Analyses compared ceiling effects, convergent-discriminant correlations, classic and confirmatory factor analysis (CFA) testing for higher-order physical and mental components, and validity in discriminating across groups differing in comorbid condition severity.

Results:

QGEN-8 reduced response times by 75% and lowered ceiling effect percentages (−2.2% to −27.8%, median=−14%) in comparison with SF-36. Their common measurement model was supported by: (1) substantial convergent correlations (r=0.576–0.778, median r=0.721) between methods for all domains; (2) lower discriminant correlations between different domains; (3) patterns of factor loadings equivalent to previous studies and adequate CFA model fit; (4) high correlations between methods for physical (r=0.813) and mental (r=0.761) component scores; and (5) equivalent average declines across groups reporting worse comorbid conditions.

Conclusions:

Overall, results support the use of QGEN-8 to reduce respondent burden and ceiling effects while maintaining convergent and discriminant validity sufficient to estimate group-level SF-36 physical (PCS) and mental (MCS) summary scores. To facilitate its use, QGEN-8 has been made available in multiple languages from the non-profit Mapi Research Trust at https://eprovide.mapi-trust.org.

Key Words: health-related quality of life, SF-36, QGEN-8, single-item measures, ceiling effects, convergent discriminant validity

Widespread adoption of standardized health-related quality of life (QOL) measures has been enabled by surveys¹ that (1) represent the QOL domain content most frequently studied in population and patient surveys; (2) are psychometrically sound; and (3) yield results that agree with valid measures and enable other guidelines that make results interpretable. Consistent with psychometric theory, historically these desirable measurement characteristics have often been satisfied for each QOL domain using multiple survey items.

In contrast, practicality without loss of comprehensiveness has been pursued for decades^2,3 using single-item-per-domain (SIPD) surveys, the ultimate short form, for different purposes^4–7 and with various approaches.^8–11 Single-item selections for commonly measured domains have often been the best-performing available items despite shortcomings,¹² including coarseness inherent in categorical ratings, response distribution skewness, and ceiling effects, particularly for negatively defined functional limitations and specific symptoms. Historically, such problems have been minimized by selecting the best items, most recently fewer and sometimes only one per domain.^7,11,12

The current study approach to improving measures was based on lessons learned from the Health Insurance Experiment (HIE)¹³ and quasiexperimental Medical Outcomes Study (MOS)¹⁴ evaluations of population health care outcomes. The lessons include (1) demonstrations of the usefulness of repeated health status surveys to quantify health status outcomes; (2) the feasibility of self-administered surveys that are preferred and reduce administration costs to a fraction of personal interview costs across diverse populations; (3) conceptually and empirically distinct physical, mental, and role/social functioning and well-being domains^12,15,16; (4) discovery that the great majority of domain covariances are explained by physical and mental components¹⁶ as hypothesized by WHO¹⁷; (5) longitudinal surveys adding health outcomes to health services research^13,14; and (6) evidence that such measures are among the most valid predictors of health care costs, job loss, and mortality.¹⁸ These advances also influenced clinical outcomes research (see Discussion section).

The transition from full-length HIE methods to the first MOS short forms and their translations has been summarized elsewhere.¹⁹ Briefly, to compare patient outcomes across medical care settings,²⁰ MOS monitoring began with a first-generation “shorter” HIE-influenced 149-item baseline survey of 40 QOL domains.¹⁶ In time for more practical 4-year outcome measures, 36 short-form (SF-36) items achieved psychometrically sound estimates of 8 domains: (1) 4 functioning domains: physical (PF), social (SF), role physical (RP), and role emotional (RE); (2) 3 feeling domains: bodily pain (BP), vitality (VT), and mental health (MH); and (3) an evaluation of general health (GH).¹⁴ In parallel, a MOS global item for each of the 8 domains was administered longitudinally.^7,16

During baseline MOS psychometric evaluations of construct validity, physical (PCS) and mental (MCS) components capturing 80% of scale reliable variance were discovered, in support of WHO component and well-being definitions of health.²¹ And, for the first time, PCS and MCS summary components simplified analyses and public reporting of differences in medical care outcomes.^14,21 Because fewer items representing the same 8 domains proved to be sufficient for estimating 2 outcomes, PCS and MCS components spawned a third-generation short form with 1–2 items per domain, SF-12.²²

The current study assumed that a fourth generation of measurement improvements for MOS domains and summary components requires new item construction as opposed to further searches. Fortunately, analyses of comprehensive item banks for frequently measured domains have identified approaches to item construction likely to yield more information/item. For example, at a specific PF difficulty level (eg, walking the length of a soccer field), which about half of the general population can do without any limitation, item response theory (IRT) modeling results suggest that the item response category defining the “ceiling” can simply be raised by about 1.0 SD units by redefining it as “easy to do” in place of the common “not at all limited” category. This finding, reported decades ago,²³ has been replicated recently and independently.^24,25 The current study is the first to standardize this solution across multiple functioning (PF, RP, SF, RE) domains. For feelings conceptualized and modeled as bipolar (ie, having opposite poles), for example, vitality, short-form measurement was achieved using multiple “energy” and “fatigue” items²⁶ or relied on a much more narrowly defined unipolar fatigue¹¹ or energy²² item. In contrast, the current study constructed a direct SIPD measure that extended the item stem and response categories to cover both of the opposite poles for both the VT and MH domains.

This paper documents (1) the approaches used to achieve a fourth generation of improvements in SIPD measures of the 8 most frequently measured QOL domains, yielding a new 8-item short form named QOL General (QGEN-8), and (2) results from the first study of all QGEN-8 and SF-36 items administered to the same respondents, in contrast to previous single domain evaluations.^23,24,26–29 In addition to comparing response distributions and ceiling effects across QGEN-8 and SF-36 methods, this is the first study of their convergent validity across methods and validity in discriminating across common domains as well as the equivalence of their PCS and MCS component estimations across groups known to differ in a representative sample of the US general population. Potential advantages of QGEN-8 include greater survey efficiency in terms of respondent burden reduction, information per item, and coverage of a wider score range sufficient to reduce ceiling effects. Achieving these advantages while maintaining comparability with familiar SF-36 metrics makes a very large body of published norms and other information available for interpreting QGEN-8 differences in QOL status and outcomes.

METHODS

Sampling and Data Collection

Adults (N=1648), ages 19–93, from the National Opinion Research Center (NORC) AmeriSpeak true probability panel²⁹ representing 97% of US households completed surveys on the internet (98.6%) or by telephone (1.3%) in 2020 as documented elsewhere.^30,31 All participants provided informed consent, and surveys were fielded in accordance with the guidelines of the American Association for Public Opinion Research (AAPOR) and were approved by the NORC Institutional Review Board (protocol number 20.05.29).

Survey modules were administered in the following order: QOL General (QGEN-8) survey items (Auxiliary Appendix A, Supplemental Digital Content 1, http://links.lww.com/MLR/C957) and 2 additional QGEN items not analyzed here; a 35-item comorbid conditions checklist³² and for conditions reported, as many as 72 (median=6) QOL Disease Impact Scale (QDIS) and other disease-specific items measuring QOL impact attributed to comorbid conditions;^32,33 SF-36v2 Health Survey;^19,34 ratings of health now versus 3 months ago and symptoms common to respiratory, flu, and COVID conditions, and ratings of COVID-19 outbreak impact.

Survey Item Improvements

Explorations of improvements in SIPD items for frequently measured domains began decades ago.^2,7,8,12,16 To construct item improvements for the current study, new compilations of domain-specific wordings and corresponding changes in response categories proven to be sufficiently homogeneous in previous studies were preserved within each domain. While preserving original attributions to health and specifically to physical and/or mental health (eg, for role and social domains), additions to terminology included references to “your life,” “quality of life,” and “daily” and “everyday activities,” to improve face and content validity. Table 1 documents and compares the wording of QGEN-8 and SF-36 item stems, item response categories that determine ceiling and floor effects, and the 3 approaches applied in constructing improved QGEN-8 items. The rightmost columns are discussed in the Results section.

TABLE 1.

Abbreviated Item Stem and Response Category Content, Ceiling Effect Percentages, and Other Information by Domain and Method

Domain/method	Abbreviated SIPD item stem content^†	Response category range	QGEN-8 revision methods^‡	% at ceiling^§	Difference^‖
Physical function (PF)
QGEN-8	How easy-hard to do physical activity (walk, climb)	Very easy-very hard	AG, RC	34.9^*	−2.2^*
SF-36^‡	Vigorous activities, running, lifting, strenuous sports^†	Not limited at all-limited a lot		37.1
Role-physical (RP)
QGEN-8	How easy-hard physical health makes work/home activities	Very easy-very hard	RC	27.9^*	−20.9^*
SF-36^‡	Accomplish less at work or other daily activities	Not at all		48.8
Pain (BP)
QGEN-8	How much pain limits everyday activities or QOL	No pain-extremely limited	AG, RC, DH	31.0^*	−14.3^*
SF-36^‡	How much pain interfered with normal work or home	Not at all-Extremely		45.3
General health (GH)
QGEN-8 SF-36	Overall, how would you rate your health	Excellent-Poor		6.9	NA
Social functioning (SF)
QGEN-8	How easy-hard physical health makes having a social life	Very easy-Very hard	RC	31.3^*	−18.8^*
SF-36^‡	How much physical/mental problems limit social activities	Not at all-Extremely		50.1
Vitality (VT)
QGEN-8	On average, feel tired or energetic most of the time	Tired-energetic all of the time	AG, RC, DH	2.5^*	−4.4^*
SF-36^‡	How much of the time have a lot of energy	None-all of the time		6.9
Role-emotional (RE)
QGEN-8	How easy-hard emotional health makes work	Very easy-Very hard	RC	19.7^*	−27.8^*
SF-36^‡	Pers/emot probs keep from daily work	All of time		47.5
Mental health (MH)
QGEN-8	How happy, satisfied with your life	Extremely happy-very unhappy	AG, RC, DH	6.7^*	−2.1^*
SF-36^‡	How much of the time felt calm and peaceful	None-all of the time		8.8


Note: Percentages at the ceiling across all SF-36 items by domain are PF-10 item (37%–87%, med=71%); RP-4 item (49%–58%, med=50%); SF-2 item (50%–52%); VT-4 item (7%–18%, med=8%); RE-3 item (47%–60%, med=57%); MH-5 item (9%–53%, med=38%).

^* All QGEN-8 item ceiling percentage estimates are significantly (P<0.001) lower than those for the corresponding best performing SF-36 item (next to last column) with estimates of their differences (last/rightmost column). Differences were significant for 23 other SF-36 same-domain items.

^† Abbreviated item content; for QGEN-8, see Appendix; for SF-36.³⁵

^‡ QGEN-8 item revision methods: AG: Aggregation of content; RC: Response category revision; DH: Direct higher-order domain measurement.

^§ Percentage at the ceiling for QGEN-8 and SF-36 best-performing (lowest %) same-domain item.

^‖ Difference = QGEN-8 percentage minus SF-36 percentage.

Approaches to item improvement along with published evidence supporting them are summarized below:

Aggregation (AG) of domain-specific item content to broaden content validity, for 4 of 8 domains. This is the logic underlying “global” single-item measures^7,8,11,12 applied successfully to physical and mental domains. Physical functioning SF-8,⁷ COOP Chart,⁸ and PROMIS global¹¹ items aggregated different physical activities (eg, walking, climbing stairs) in a single item. For mental health, research findings^7,8,11,12 support aggregating sufficiently homogeneous feelings of psychological distress (eg, anxious, depressed) for SIPD measurement of MH, as opposed to a single item representing only 1 MH symptom. It is noteworthy that wording very similar to original MOS global items has been adopted in legacy and contemporary PF and MH global items.^7,11 For global item stems, the current study also improved the range of item response categories.
Response category (RC) changes were adopted for all improved items on the strength of previously published item bank studies demonstrating the IRT advantages of bipolar QGEN-8 response categories across “Easy-Difficult” or “Hard” levels for PF^23,24 and “Tired-Energetic” categories for Vitality.²⁶ In contrast to widely used PF unipolar 5-level “Not at all limited-Limited a lot” categories,^11,23,24,36 a new 5-category bipolar “very easy-unable to do” set of response choices and corresponding item stems were constructed to extend the range of measurement for those items. Upon replicating these improvements for PF, the current study implemented and tested these response categories for all 4 (PF, RP, SF, RE) functioning domains (Auxiliary Appendix A, Supplemental Digital Content 1, http://links.lww.com/MLR/C957). Due to the very small samples using the 2 worst “very hard” and “unable to do” response categories, they were combined into a single very hard/unable to do category for all 4 functioning domains. Further, as documented in Auxiliary Appendix B, Supplemental Digital Content 2, http://links.lww.com/MLR/C958, real data simulation methods were applied to the patterns of observed responses to evaluate whether the addition of a new middle response category between “easy” and “hard” is likely to reduce coarseness enough to improve reliability and validity in future studies.

Other noteworthy improvements in response category wording lowered item reading levels, as defined by the Kincaid score,³⁷ by changing the least favorable category from the “difficult” or “impossible” wording used in previous item bank studies^23–25 to “hard” or “very hard.”
Direct higher-order (DH) factor measurement methods were applied to construct a new 5-category item representing conceptually/empirically distinct aspects within each of 3 first-order feeling factors: vitality (energy vs. fatigue), mental health (psychological distress vs. well-being) and bodily pain (frequency vs. QOL impact attributed to pain). Each had previously been shown to be sufficiently homogeneous to justify a valid domain summary score.^16,26,27 This approach contrasts with the alternative of selecting a single item representing only one aspect of each domain with consequences of limited content validity, covering a fraction of the score range and substantial ceiling effect. As documented in Table 1 for vitality (VT), for example, using the DH approach, a 5-category QGEN-8 rating item anchored by words defining opposite poles (“tired” vs. “energetic”) was selected from the 4-item SF-36 VT scale. For BP in Table 1, distinct frequency and QOL impact SF-36 item content were selected to define a 5-category rating scale ranging from “no pain” to pain “extremely” in response to “how much pain limits your everyday activities or your quality of life.” For MH in Table 1, a 5-category bipolar version of the widely used and previously validated^16,18,27 Mental Health Inventory (MHI) was constructed to represent “happy” versus “unhappy” opposite poles of the 5-item SF-36 MH (psychological distress vs. well-being) short form and its full-length original 38-item MHI.

Scoring Methods

As in previous psychometric evaluation studies of SF-36,^21,34,36 SF-12,²² and SF-8,⁷ QGEN-8 items were scored using 2 methods: (1) raw (1–5) category scores used in testing item-scale convergence and validity in discriminating between domains³⁶ and (2) by using US general population SF-36 norm-based means observed for those choosing each QGEN-8 response category for the same domain as used in scoring SF-8.⁷ The latter is equivalent to predicting the SF-36 score for each domain using QGEN-8 item response category dummy variables for the same domain. Because the current study was the first QGEN-8 and most recent SF-36 US population norming, 2020 norms were applied. Results were compared with those based upon previously published³⁴ norms to confirm the equivalence of conclusions (results not reported).
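The norm-based scoring described above can be sketched as a lookup of category means. This is a minimal illustration, not the authors' scoring software: the function names and the toy data are hypothetical, and the direction of the raw 1–5 coding is an assumption. It shows why assigning each QGEN-8 response category the mean SF-36 norm-based score of respondents who chose it gives the same predictions as a regression on category dummy variables.

```python
from collections import defaultdict
from statistics import mean

def category_means(qgen_responses, sf36_scores):
    """Mean SF-36 norm-based domain score among respondents
    choosing each QGEN-8 response category (same domain)."""
    buckets = defaultdict(list)
    for category, score in zip(qgen_responses, sf36_scores):
        buckets[category].append(score)
    return {category: mean(scores) for category, scores in buckets.items()}

def score_item(qgen_responses, lookup):
    """Replace each raw category with its norm-based category mean."""
    return [lookup[category] for category in qgen_responses]

# Hypothetical toy data: raw QGEN-8 PF categories and SF-36 PF norm-based scores
qgen_pf = [1, 1, 2, 2, 3]
sf36_pf = [58.0, 56.0, 50.0, 48.0, 35.0]

lookup = category_means(qgen_pf, sf36_pf)
scored = score_item(qgen_pf, lookup)
```

In this toy example every respondent who chose category 1 receives the mean of 58.0 and 56.0, which is exactly the fitted value a dummy-variable regression would return for that category.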

For both SF-36 and QGEN-8 Physical (PCS) and Mental (MCS) component summaries, standardized domain scale scores were aggregated using the original developer-recommended domain weights (factor score coefficients) derived in the US 1990 population for orthogonal factors.^21,34 All QGEN-8 and SF-36 domain scale and summary components were scored positively (higher is better) and transformed linearly to have a mean of 50 and an SD of 10 in the current study US 2020 general population.
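A rough sketch of the aggregation and linear transformation follows. The weights below are hypothetical placeholders, not the published 1990 factor score coefficients, and standardizing within the analysis sample with the population SD formula is an assumption made for illustration only.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize to mean 0 and SD 1 within the sample."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def t_scores(values):
    """Linear transformation to mean 50 and SD 10."""
    return [50 + 10 * z for z in z_scores(values)]

def component_scores(domain_scores, weights):
    """Weighted sum of standardized domain scores, rescaled to T-scores.
    domain_scores: dict of domain -> list of respondent scores
    weights: dict of domain -> factor score coefficient"""
    standardized = {d: z_scores(v) for d, v in domain_scores.items()}
    n = len(next(iter(standardized.values())))
    raw = [sum(weights[d] * standardized[d][i] for d in weights) for i in range(n)]
    return t_scores(raw)

# Hypothetical toy data for two domains and placeholder weights
domains = {"PF": [10, 20, 30, 40], "MH": [40, 10, 30, 20]}
placeholder_pcs_weights = {"PF": 0.4, "MH": -0.2}  # NOT the published coefficients

pcs = component_scores(domains, placeholder_pcs_weights)
```

Because the final step is a linear transformation, the resulting scores have a mean of 50 and an SD of 10 in whatever sample is used for standardization.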

Qualitative Analyses

Although source short-form item stem and response category wordings have been frequently evaluated qualitatively and quantitatively, specific qualitative evaluations of new QGEN-8 item wording configurations were completed in very small samples (N=5) that included chronically and acutely ill adults. These were conducted during evaluations of paper versions of the source English and other language translations. Evaluations addressed clarity, understandability, and cultural relevance using methods very similar to those used in original SF-36 qualitative evaluations.^38,39 Supervisor overview and developer review at each step in the process were certified by ICON language services.³⁹

Empirical Analyses

Data quality was evaluated in terms of missing data rates and the SF-36 Response Consistency Index (RCI), which estimates how well respondents understand items from the consistency of their responses across 14 pairs of items; in previous supervised and Internet-based surveys, 97.8% of respondents had 3 or fewer inconsistencies.³⁴ Response times were recorded by survey administration software. Other empirical methods are summarized below. Analyses were performed using IBM SPSS Statistics and Stata software.

Score Distributions, Ceiling Effects, and Reliability

To test whether QGEN-8 item response distributions differed in comparison with SF-36 items for each domain, χ² tests were performed and ceiling effect percentages were compared. It was hypothesized that changes in QGEN-8 items would shift responses away from the most favorable category.
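The ceiling effect comparison can be illustrated with a short sketch. The data and the coding of category 1 as the most favorable response are hypothetical, and the significance test is omitted because the text does not specify the form of the χ² test.

```python
def ceiling_percent(responses, best_category):
    """Percentage of respondents choosing the most favorable category."""
    at_ceiling = sum(1 for r in responses if r == best_category)
    return 100.0 * at_ceiling / len(responses)

# Hypothetical toy responses from the same 10 respondents (1 = most favorable)
qgen_item = [1, 1, 1, 2, 2, 3, 3, 4, 5, 2]
sf36_item = [1, 1, 1, 1, 1, 2, 2, 3, 3, 2]

qgen_ceiling = ceiling_percent(qgen_item, best_category=1)
sf36_ceiling = ceiling_percent(sf36_item, best_category=1)

# Same sign convention as Table 1: QGEN-8 percentage minus SF-36 percentage
difference = qgen_ceiling - sf36_ceiling
```

A negative difference means that fewer respondents chose the most favorable category of the QGEN-8 item, which is the direction hypothesized.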

Reliability was estimated using internal consistency methods: Cronbach's alpha for multi-item SF-36 scales and average inter-item correlations for QGEN-8 items, as in previous studies.^7,16,22,34
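For readers unfamiliar with the internal consistency estimate used for the multi-item scales, Cronbach's alpha can be sketched as below. The item scores are hypothetical, and the use of sample (n−1) variances is an assumption about a detail the text does not state.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, all in the same respondent order.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)"""
    k = len(items)
    totals = [sum(respondent) for respondent in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical 3-item scale answered by 4 respondents
toy_items = [
    [1, 2, 3, 4],
    [2, 1, 4, 3],
    [1, 3, 2, 4],
]
alpha = cronbach_alpha(toy_items)

# Two identical items are perfectly consistent
alpha_identical = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])
```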

Construct Validity

To maintain direct comparability with published results, the same classical psychometric methods of evaluation and scoring used in previous US evaluations of SF-36,³⁶ SF-12,²² and SF-8⁷ and in international studies⁴⁰ were employed in the current study. In addition, for purposes of adding more formal tests of goodness of fit in relation to their common higher-order orthogonal measurement model, structural equation modeling (SEM) of QGEN-8 items was also performed.

Classic construct validity⁴¹ tests evaluated convergent and discriminant criteria applied to 128 product-moment correlations among QGEN-8 single-item-per-domain (SIPD) and SF-36 multi-item methods of estimating the same 8 domains. Convergent validity was supported by substantial (r>0.40) matrix diagonal (same domain, different methods) correlations between SF-36 and QGEN-8 for each domain. Discriminant validity was supported by correlations between different domains that were lower than convergent estimates for 8 domains.
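The logic of these convergent and discriminant tests can be sketched on a small cross-method matrix. The values are hypothetical, and the comparison rule (each off-diagonal correlation must be lower than the convergent correlations in its own row and column) is an assumption, since the text does not spell out the exact rule used for counting.

```python
def convergent_ok(matrix, threshold=0.40):
    """True if every same-domain, different-method correlation exceeds the threshold."""
    return all(matrix[i][i] > threshold for i in range(len(matrix)))

def discriminant_successes(matrix):
    """Count off-diagonal correlations lower than both the row and the
    column convergent correlations. Returns (successes, total tests)."""
    n = len(matrix)
    successes = total = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            total += 1
            if matrix[i][j] < matrix[i][i] and matrix[i][j] < matrix[j][j]:
                successes += 1
    return successes, total

# Hypothetical 3-domain example: rows = multi-item method, columns = single-item method
toy = [
    [0.72, 0.40, 0.30],
    [0.45, 0.68, 0.70],
    [0.25, 0.35, 0.75],
]
all_convergent = convergent_ok(toy)
passed, tests = discriminant_successes(toy)
```

In the toy matrix one discriminant correlation (0.70) exceeds the convergent correlation in its row (0.68), so 5 of 6 tests succeed.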

Principal component analysis (PCA) methods tested construct validity in relation to the 2-component (physical and mental) orthogonal model previously reported for all MOS short forms.^7,21,22 The current study is the first known PCA of correlations among SF-36 and another method for the same 8 domains. The number of factors extracted was determined by multiple criteria (Scree test, Eigenvalue >1, 5% of variance rule, and common factor rule). From previous studies, a 2-factor (physical and mental) higher-order measurement model was hypothesized with four primary domain factor loadings for PCS (PF, RP, BP, and GH), four primary domain factor loadings for MCS (SF, VT, RE, and MH) and only 3 (GH, SF, and VT) secondary loadings (see Hypotheses in Table 2). Opposite rank orderings of magnitudes of loadings across domains for PCS (PF high to MH low) in comparison with MCS (MH high to PF low) previously observed, were hypothesized.

TABLE 2.

Factor Structure of QGEN-8 and SF-36 Measures of Common Domains

Domain/Form	Hypotheses*		Factor loadings
Domain/Form	PCS	MCS	PCS	MCS	h²
Physical functioning (PF)
QGEN	●	–	0.813	0.168	0.690
SF36	●	–	0.848	0.177	0.750
Role Physical functioning (RP)
QGEN	●	–	0.827	0.263	0.753
SF36	●	–	0.776	0.351	0.725
Bodily Pain (BP)
QGEN	●	–	0.786	0.178	0.650
SF36	●	–	0.785	0.285	0.698
General Health (GH)
QGEN	●	○	0.578	0.379	0.478
SF36	●	○	0.609	0.502	0.623
Social Functioning (SF)
QGEN	○	●	0.668	0.398	0.604
SF36	○	●	0.437	0.690	0.668
Vitality (VT)
QGEN	○	●	0.396	0.631	0.555
SF36	○	●	0.334	0.796	0.745
Role Emotional (RE)
QGEN	–	●	0.270	0.749	0.633
SF36	–	●	0.403	0.666	0.606
Mental Health (MH)
QGEN	–	●	0.113	0.816	0.678
SF36	–	●	0.113	0.892	0.808


● Strong (>0.70), primary loading estimated.

○ Moderate to substantial (r=0.30 to <0.70), secondary loading.

– Weak (r<0.30), no loading or path estimated.

The SEM test of the goodness of fit of QGEN-8 in relation to the original^19,21,40 SF-36 measurement model (Fig. 1) used confirmatory factor analysis (CFA) of the 28 correlations among QGEN-8 items (see the upper left triangle of Table 3). Specifically, the model specified one primary (either physical or mental) path or item factor loading for each domain and 3 secondary paths. All other model parameters were not constrained or estimated. Model fit criteria included comparative fit index (CFI), largest observed standardized residual (r<0.20), and root mean square error of approximation (RMSEA).

FIGURE 1. QGEN-8 measurement model and estimates of confirmatory factor analysis (CFA) primary and secondary loadings.

TABLE 3.

Estimates of Reliability (Italics), Convergent between Methods (Diagonal Bolded), and Discriminant Correlations Among QGEN-8 and SF-36 Scores

				QGEN-8								SF-36
	PF	RP	BP	GH	SF	VT	RE	MH	PF	RP	BP	GH	SF	VT	RE	MH
QGEN-8
Physical functioning (PF)	0.688
Role-physical (RP)	0.722**	0.553
Bodily pain (BP)	0.566**	0.665**	0.753
General health (GH)	0.529**	0.531**	0.445**	0.744
Social functioning (SF)	0.588**	0.680**	0.555**	0.485**	0.810
Vitality (VT)	0.448**	0.491**	0.455**	0.484**	0.486**	0.861
Role-emotional (RE)	0.391**	0.451**	0.412**	0.385**	0.515**	0.505**	0.856
Mental health (MH)	0.267**	0.354**	0.339**	0.373**	0.403**	0.493**	0.607**	0.858
SF-36
Physical functioning (PF)	0.723**	0.670**	0.663**	0.489**	0.573**	0.412**	0.361**	0.256**	0.939
Role-physical (RP)	0.605**	0.673**	0.698**	0.487**	0.614**	0.482**	0.454**	0.343**	0.779**	0.840
Bodily pain (BP)	0.580**	0.677**	0.756**	0.466**	0.573**	0.458**	0.411**	0.344**	0.676**	0.702**	0.871
General health (GH)	0.547**	0.590**	0.538**	0.778**	0.566**	0.562**	0.474**	0.441**	0.559**	0.588**	0.553**	0.796
Social functioning (SF)	0.443**	0.517**	0.553**	0.437**	0.606**	0.505**	0.586**	0.543**	0.503**	0.621**	0.549**	0.560**	0.822
Vitality (VT)	0.416**	0.481**	0.501**	0.476**	0.473**	0.721**	0.609**	0.640**	0.438**	0.524**	0.511**	0.601**	0.607**	0.825
Role-emotional (RE)	0.384**	0.458**	0.515**	0.393**	0.491**	0.489**	0.576**	0.490**	0.511**	0.669**	0.505**	0.501**	0.687**	0.586**	0.810
Mental health (MH)	0.263**	0.339**	0.377**	0.368**	0.406**	0.523**	0.638**	0.683**	0.271**	0.392**	0.372**	0.500**	0.671**	0.724**	0.633**	0.878


Upper diagonal shaded correlations in italics are, upper left, QGEN-8 item score reliability and, lower right, SF-36 multi-item score reliability estimates. In the lower left box diagonal are bolded convergent validity correlations between QGEN-8 and SF-36 methods of measuring the same domains. All other off-diagonal correlations are discriminant validity estimates of correlations between different domains, hypothesized to be lower.

**Correlation is significant at the 0.01 level (2-tailed).

Criterion Validity

For a validity test using an external criterion, analyses of variance (ANOVA) independently compared physical and mental summary components estimated using norm-based QGEN-8 and SF-36 scoring methods across groups known to differ in terms of the severity of comorbid chronic conditions. The criterion was a previously validated QOL Disease Impact Scale (QDIS)^32,33 stratification of chronically ill respondents into 1 of 5 categories according to the severity of their worst comorbid condition (see Auxiliary Appendix C, Supplemental Digital Content 3, http://links.lww.com/MLR/C959 table footnotes for classification details). Separately for physical and mental outcomes, QGEN-8 and SF-36 results were compared to determine whether (1) the methods reached the same overall conclusion regarding whether group means differed, (2) whether PCS or MCS was most responsive to group differences, and (3) whether the relative magnitudes (effect sizes) for group differences were the same across methods.
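The ANOVA comparison across severity groups can be illustrated with a one-way F statistic computed from first principles. The group data are hypothetical, and this sketch does not reproduce the authors' SPSS or Stata analyses; it only shows the quantity that would be compared across QGEN-8 and SF-36 scoring methods.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical summary scores for 3 severity groups (the study used 5 QDIS categories)
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(toy_groups)
```

Running the same function on component scores from each method, for the same groups, would allow the conclusions about group differences to be compared.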

RESULTS

As summarized in Table 4, survey respondents ranged from 19 to 93 years of age, averaged 47.2 years; 51% were female, nearly 40% were non-White, 34% had a college degree, more than half were employed, and 48% were married. One or more chronic conditions were reported by 31.6%; half reported three or more. Internet data collection facilitated easy survey responding and enabled missing data and survey response time monitoring. Surveys yielded very low (0.6%–1.4% per item) missing item response rates. Overall, 97.5% were complete for both SF-36 and QGEN-8 methods. Most respondents required 6–10 seconds/item for 5-category rating items. The SF-36 Response Consistency Index (RCI) was used to monitor data quality defined as the frequency of contradictory responses across 14 pairs of items known to yield 97.8% with 3 or fewer contradictions consistent with other Internet-based surveys.⁴²

TABLE 4.

Characteristics of NORC AmeriSpeak Probability Sample Respondents (N=1648)

Mean age, y (SD)	47.17 (17.33)
Range	19–93
Gender, N (%)
Female	838 (50.9)
Male	810 (49.2)
Relationship, N (%)
Divorced	180 (11)
Living with partner	153 (9.3)
Married	783 (47.6)
Never married	435 (26.4)
Separated	30 (1.9)
Widowed	67 (4.1)
Employment status, N (%)
Working—paid employee	902 (54.8)
Not working—retired	270 (16.4)
Working—self-employed	145 (8.8)
Not working—other	113 (6.9)
Not working—disabled	107 (6.5)
Not working—looking for work	98 (6)
Not working—temporary layoff	13 (0.8)
Education, N (%)
Bachelor's or above	558 (33.9)
Some college	669 (40.6)
High school graduate or equivalent	339 (20.6)
No high school diploma	82 (5)
Race, N (%)
White	1028 (62.4)
Black	193 (11.7)
Hispanic	26 (1.6)
Other, non-Hispanic	287 (17.4)
Asian	59 (3.6)
Income, N (%)
Less than $25,000	316 (19.2)
$25,000 to $49,999	436 (26.4)
$50,000 to $74,999	309 (18.7)
$75,000 to $99,999	231 (14.0)
$100,000 to $149,999	223 (13.6)
$150,000 or more	133 (8.1)


Qualitative Evaluations

As documented in linguistic validation reports, evaluations of item clarity, understandability, and cultural relevance were deemed satisfactory based on methods very similar to those used in original SF-36 evaluations and were certified by ICON language services.³⁹

The 2 most consequential results from the qualitative evaluations were the clarity of a new “Sometimes Easy-Sometimes Hard” middle response category and its placement in between the current study's “Easy” and “Hard” response categories for all 4 functioning items. The specific middle response category wording (Auxiliary Appendix B, Supplemental Digital Content 2, http://links.lww.com/MLR/C958) was favored over one with fewer words.

Response Distributions and Ceiling Effects

Table 1, in the rightmost columns, compares ceiling effect percentages for QGEN-8 and SF-36 items for each domain. QGEN-8 items shifted score distributions significantly (P<0.001) away from the most favorable category with differences between percentages at the ceiling ranging from −2.2% to −27.8% (median=−14%) in comparison with the best-performing (lowest ceiling %) SF-36 item for that domain. Table 1 footnotes document even higher ceiling effect percentages observed for all other (than best performing) SF-36 items for each domain and document the percentages for SF-8 items from previous US general population studies,⁷ all of which are higher in comparison with QGEN-8 item percentages. For SF-36 multi-item measures of functioning, ceiling effects also remained high: PF (30.1%), RP (37.8%), SF (43.5%), and RE (45.2%). For MH and VT bipolar domains, ceiling effects for feeling measures depend on whether ill or well-being pole items were selected from SF-36 and SF-8 for comparison. Floor effects were very small (0.6%–4.1%) across domains. For GH there is no method comparison as both methods used the same item, which is also used by other methods.^8,11

Construct Validity

Convergent correlations between QGEN-8 item and SF-36 multi-item methods and discriminant (different domains) correlations are presented in Table 3 in the same format used in previous convergent-discriminant validity tests. Reliability estimates in italics are documented in the main diagonal. Convergent validity correlations, bolded in the lower left Table 3 box matrix diagonal between QGEN-8 and SF-36 estimates for each domain, are consistently the highest (r=0.576–0.778 with a median=0.721). For the 4 function domains (PF, RP, SF, and RE), real data simulation estimates suggest that enough respondents are likely to choose a new middle “Sometimes Easy-Sometimes Hard” response category to justify tests of whether they improve convergent validity for all 4 functioning domains (Auxiliary Appendix B, Supplemental Digital Content 2, http://links.lww.com/MLR/C958).

In support of validity in discriminating between different domains, in comparison with the 8 convergent correlations in the Table 3 box, 52/56 off-diagonal discriminant correlations are lower. As hypothesized, in support of discriminant validity, correlations in the upper left triangle (r=0.267–0.722, median r=0.432) for QGEN-8 were lower for 26/28 tests and for SF-36 (r=0.271–0.724, median r=0.559) were lower for 18/28 tests across different domains. For both QGEN-8 and SF-36 methods, the best (lowest) discriminators (r=0.263–0.271) were the very low correlations observed between PF and MH domains.
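
The counting logic behind such convergent-discriminant comparisons can be sketched as follows. This is a simplified illustration, not the study's procedure: it assumes each off-diagonal (different-domain) correlation in a heteromethod block is compared only with the convergent correlation in its own row, and the 3-domain matrix is invented.

```python
def count_discriminant_successes(block):
    """Count off-diagonal entries smaller than the diagonal entry of their row.

    `block` is a square heteromethod matrix: rows are domains measured by
    one method, columns are the same domains measured by the other, so the
    diagonal holds convergent correlations. Comparing within rows only is
    a simplifying assumption of this sketch.
    """
    successes = total = 0
    for i, row in enumerate(block):
        for j, r in enumerate(row):
            if i == j:
                continue  # skip the convergent correlation itself
            total += 1
            if r < block[i][i]:
                successes += 1
    return successes, total

# Hypothetical 3-domain example (values invented for illustration).
block = [
    [0.72, 0.45, 0.30],
    [0.48, 0.60, 0.65],
    [0.28, 0.41, 0.75],
]
successes, total = count_discriminant_successes(block)
print(f"{successes}/{total} discriminant correlations are lower")
```

With 8 domains, the same loop would tally 56 off-diagonal comparisons, which matches the denominator reported above.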

The PCA justified the extraction and rotation of 2 higher-order factors by all criteria, accounted for 67.1% of the total variance, and yielded patterns of loadings consistent with physical and mental factors. Table 2 documents hypotheses for rotated orthogonal factor loadings and communalities (h²). The reverse rank ordering of their magnitudes across domains for PCS (PF highest to MH lowest) and for MCS (PF lowest to MH highest) and the equivalence of patterns of primary and secondary loadings hypothesized for SF-36 were also observed for QGEN-8, with one exception. For the SF domain, opposite primary and secondary loading patterns were observed for QGEN-8 (PCS>MCS) in comparison with SF-36 (MCS>PCS), consistent with their differences in attributions: SF-36 to both physical and mental health and QGEN-8 only to physical health (Table 1). The very low correlation between PF and MH (Table 3) and the opposite patterns of very high and very low PCS and MCS factor loadings, respectively, for RP and RE items, which differ only in attributions of limitations to physical versus mental health, support the orthogonal measurement model for both methods.
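
For readers less familiar with the quantities involved, the sketch below shows how eigenvalues, the share of total variance, and communalities (h²) follow from a correlation matrix. It is an illustration under stated assumptions: the 4-variable matrix is invented, the eigenvalue-greater-than-1 rule is shown only as one commonly used extraction criterion (the study's criteria are described in its Methods), and the rotation step is omitted because communalities are unchanged by orthogonal rotation.

```python
import numpy as np

def pca_summary(corr, n_components):
    """Eigenvalues, variance share, and communalities for unrotated components."""
    corr = np.asarray(corr, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loadings are eigenvectors scaled by the square root of their eigenvalues.
    loadings = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
    explained = eigvals[:n_components].sum() / corr.shape[0]
    communalities = (loadings ** 2).sum(axis=1)  # h2 for each variable
    return eigvals, explained, communalities

# Invented correlation matrix with two clusters of two variables each.
corr = [
    [1.0, 0.7, 0.2, 0.2],
    [0.7, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.7],
    [0.2, 0.2, 0.7, 1.0],
]
eigvals, explained, h2 = pca_summary(corr, n_components=2)
n_above_one = int((eigvals > 1.0).sum())
print(n_above_one, round(float(explained), 3), np.round(h2, 3))
```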

Further support for QGEN-8 item construct validity came from the CFA test of their goodness of fit in relation to the original SF-36 measurement model (see Methods). As shown in Figure 1, this CFA yielded 8 substantial-to-high (0.49–0.81) primary and 3 lower (0.20–0.28) secondary factor loadings, a satisfactory comparative fit index (CFI) of 0.983, an RMSEA of 0.064 (0.054–0.075), a standardized root mean squared residual (SRMR) of 0.023, and no residual correlations >0.20.
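
To make two of these fit indices concrete, the following sketch applies the textbook chi-square-based formulas for RMSEA and CFI. The input values are invented and are not the study's statistics; the use of N−1 in the RMSEA denominator is one common convention (some software uses N), and SRMR is omitted because it requires the full observed and model-implied correlation matrices.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation from a model chi-square."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

# Hypothetical inputs chosen only so the arithmetic is easy to follow.
example_rmsea = rmsea(chi2=100.0, df=20, n=401)
example_cfi = cfi(chi2_model=100.0, df_model=20, chi2_base=1628.0, df_base=28)
print(round(example_rmsea, 3), round(example_cfi, 3))
```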

In support of the equivalence of summary component scores across methods, correlations between QGEN-8 and SF-36 estimates were high for both PCS (r=0.813) and MCS (r=0.761), with only 6.2% and 3.3% of method differences, respectively, falling outside the ±2 SD threshold in the scatterplots for each summary component.
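
One way to compute the share of between-method differences beyond a ±2 SD threshold is sketched below. It assumes the threshold is centered on the mean difference and uses the sample SD of the differences; whether the study centered the threshold this way is an assumption, and the paired scores are invented.

```python
import statistics

def pct_outside_2sd(scores_a, scores_b):
    """Percent of paired differences farther than 2 SD from the mean difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the differences
    outside = sum(abs(d - mean_diff) > 2 * sd_diff for d in diffs)
    return 100.0 * outside / len(diffs)

# Invented paired summary scores; only the last pair disagrees.
method_a = [50, 52, 48, 55, 45, 60, 40, 51, 49, 60]
method_b = [50, 52, 48, 55, 45, 60, 40, 51, 49, 50]
pct_outside = pct_outside_2sd(method_a, method_b)
print(f"{pct_outside:.1f}% of differences fall outside the threshold")
```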

Known Groups Validity

In support of known groups validity, ANOVA comparisons of QGEN-8 and SF-36 group mean physical and mental score estimates led to the same overall conclusion: for both methods, QOL declined for groups with more severe comorbid conditions (the Auxiliary Appendix C table, Supplemental Digital Content 3, http://links.lww.com/MLR/C959). Further, with each increase in group severity level, both QGEN-8 and SF-36 means declined for both physical and mental health. Other substantive findings regarding disease severity and QOL are noteworthy. In relation to the population average (mean=50), the most well (no comorbid condition impact) group exceeded the population norm, and means declined by important amounts for both methods, as hypothesized. For both methods, comparisons of ANOVA variances (Eta²) show that the impact of worse comorbid severity is greater for physical than for mental outcomes. Further, comparisons of ANOVA variances (Eta²) show that SF-36 discriminated better than QGEN-8 for physical but not for mental outcomes. Finally, for both methods, magnitudes of physical mean declines tended to increase with increasing comorbidity severity, ie, declines were nonlinear (Auxiliary Appendix C table footnotes, Supplemental Digital Content 3, http://links.lww.com/MLR/C959).
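
Eta² in a one-way ANOVA is the between-group sum of squares divided by the total sum of squares. The sketch below computes it for three invented severity groups; the scores are hypothetical and serve only to show the calculation.

```python
def eta_squared(groups):
    """Proportion of total variance explained by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total

# Invented scores for three groups ordered from least to most severe.
groups = [[55, 57, 53], [50, 52, 48], [40, 42, 44]]
eta = eta_squared(groups)
print(round(eta, 3))
```

Computing this separately for each method's physical and mental scores gives the kind of comparison of discrimination described above.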

DISCUSSION

Study findings constitute a considerable body of evidence that improvements in item comprehensiveness and score range are possible using QGEN-8, in comparison with SF-36, for their 8 common domains. Also, although a 75% reduction in survey length using SIPD measures is not new,⁷ it explains the increasing adoption of shorter measures in population and clinical research; they are more likely to be used.¹ It is good news that brevity can be achieved while reducing ceiling effects in comparison with SF-36 items and SF-36 multi-item scales measuring functioning in the current study as well as published SF-8 ceiling effects.⁷ It is also noteworthy that QGEN-8 item improvements were achieved while maintaining satisfactory construct validity in tests using both classic and modern CFA psychometric methods, with a noteworthy exception for one domain discussed below.

Collectively, current study results also support the 8-domain and 2 higher-order physical and mental component models underlying all MOS short forms.^7,22,34 This model and standardized scoring enable a crosswalk between QGEN-8 and the very large body of SF-36 findings available for use in interpreting group-level QOL domain profiles and component outcomes. These gains have the potential to increase the number of population and clinical applications for which SIPD survey measures can satisfy study objectives. It is noteworthy that ceiling and floor effects are virtually eliminated with the use of physical and mental component summary scores.

Reducing respondent burden is important because it determines whether and how QOL is surveyed in population and clinical research, particularly under severe survey length constraints. The US government's annual monitoring of Medicare plans was changed from SF-36 to SF-12 methods after their estimates of PCS and MCS outcomes were shown to be sufficiently comparable.²² A randomized public experiment comparing outcomes with and without Medicaid insurance for the poor used the shortest available, comprehensive method, SF-8, to minimize nonresponse bias while monitoring both physical and mental outcomes.⁴³ QGEN-8 may satisfy these practical needs while also providing more information per item.

Whether it is important to extend measurement into higher levels of functioning and well-being has been debated for decades. Soon after such HIE population survey improvements were published, their clinical usefulness was tested in a randomized trial showing significant treatment differences in general and psychological well-being linked to side effects across groups receiving equally clinically efficacious antihypertensive medications.⁴⁴ In addition to withdrawal from therapy, the consequences of such QOL differences included stress equivalent to job loss and increased mental health expenditures sufficient to precipitate an NEJM editorial calling for QOL measurement routinely in clinical trials.⁴⁵ In a clinical trial linking hematocrit and QOL, regression models revealed that improvements throughout the hematocrit range occurred with improvements in well-being, particularly VT.⁴²

A systematic review of the first 17 years of well-controlled drug trials reported 70%–100% (78% overall) concordance between clinically defined treatment efficacy and SF-36 PCS/MCS-defined efficacy across 14 treatment areas.⁴⁶ In the HIE, a first-generation bipolar MH measure, the MHI, significantly improved predictions of mental health care costs, controlling for functional limitations, demonstrating practical economic consequences of differences in well-being.¹⁸ Soon afterward, when the UK National Health Service sought to improve the range of health benefits surveyed, short form well-being measures were shown to better identify health problems missed among those at the ceiling for symptom measures of VT and to predict practical consequences including employment loss and health care costs.⁴⁷

Despite such evidence, SIPD representation of proven bipolar domains has most often been limited to measuring only one of their poles, for example, a specific symptom such as depression for MH or fatigue for VT,²⁸ resulting in substantial ceiling effects.²⁶ In part, this reflects an emphasis on usefulness in tests of clinical validity and regulatory guidelines favoring disease-specific symptoms over broader QOL outcomes for clinical trials of drugs.⁴⁸ Given differences in objectives for clinical and population outcomes research, it is not surprising that they appear to continue to evolve in opposite directions, focusing on the more severe levels useful in clinical decision-making versus the disease burden and treatment benefits that occur more often and matter most for most of the population. Current study findings suggest that practical alternative approaches based upon improved SIPD measurement can yield information about both ill and well-being for bipolar feeling domains, extended ranges of physical and role/social functioning, and common differences in evaluations of health status.

The substantial current study exception to replication of the measurement model common to QGEN-8 and MOS short forms, a change in the factor structure of Social Functioning (SF), was not unexpected. The change in wording of attributions for SF limitations from “physical or mental” health problems in SF-36 and other MOS short forms to only “physical health” for QGEN-8 reversed the pattern of primary and secondary PCS and MCS factor loadings in comparison with the SF-36 pattern (Table 2). This finding underscores the power of specific attributions and the now primary physical determinants of QGEN-8 SF domain scores. Whether an additional SF item with mental attribution (analogous to separate role functioning items for physical and emotional problems) is needed is currently under study. Preliminary results suggest that distinct attributions will improve validity for purposes of discriminating between physical and mental higher-order factors.

Study Limitations

This study has noteworthy shortcomings inherent in cross-sectional measurement research, including the lack of tests of responsiveness and an entire reliance on self-reports, both worthy of further study. Hopefully, a more practical QGEN-8 form will facilitate such studies. Although it is doubtful that they changed conclusions regarding the relative performance of the 2 methods compared, other current study limitations are discussed below, including the focus on only SF-36 domains and summary components, effects of disease-specific surveys administered between methods, coarseness inherent in categorical ratings, use of classic psychometric methods, and only 1 external criterion test of validity.

The current study has the disadvantage inherent in being limited to a profile of only 8 of 40 MOS domains.^1,12 However, 8 have repeatedly been sufficient to estimate physical and mental summary components.^7,21,34,49 Regardless, measurement model standardization also has the disadvantage of stymying new domain representation and scoring alternatives. To facilitate reanalyses addressing such measurement model issues, the QGEN-8 and SF-36 combined current study correlation matrices are documented in Table 3. Research in Japan and the United States is evaluating additions of Health Distress (HD)¹⁶ and Role General (RG) domains to broaden content validity and enable the study of east-west differences in the structure of physical and mental functioning and well-being domains.⁵⁰ To facilitate this research, a 10-item short form (QGEN-10) adds previously validated SIPD measures of HD and RG domains (https://eprovide.mapi-trust.org/login).

Order effects from modules administered after QGEN-8, including a lengthy comorbid condition checklist and disease-specific QOL ratings of all reported conditions, but before SF-36 (see Methods section) probably attenuated the magnitude of convergent correlations between QGEN-8 and SF-36 methods. Counterbalanced orders of administration are recommended for future studies. The lowest convergent correlations were observed between methods for the 4 functioning domains, for which the largest percentages of multi-item SF-36 scores at the ceiling (30%–45%) were observed. Because all respondents at the SF-36 ceiling were assigned the same criterion score but were divided into higher and lower QGEN item response categories, the latter could not be distinguished by that criterion. When criteria cover the full range for PF, QGEN-8 type 5-category item improvements have been shown to yield much higher correlations.^23–25 Further research should explore these issues and others such as reducing SIPD coarseness (Auxiliary Appendix B, Supplemental Digital Content 2, http://links.lww.com/MLR/C958) and more accurate SIPD reliability estimates, best achieved using test-retest methods that treat unique reliable variance as reliable.⁷ Regardless of such potential limitations, the observed patterns of convergent and discriminant correlations for QGEN-8 in relation to the SF-36 methods support the validity of QGEN-8.
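
The attenuating effect of a criterion ceiling on a convergent correlation can be illustrated with a small simulation. Everything in this sketch is hypothetical: the scores are simulated rather than observed, both measures are treated as continuous for simplicity, and the error variances and the 40% ceiling are arbitrary choices loosely echoing the 30%–45% range noted above.

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000
true_score = [rng.gauss(0, 1) for _ in range(n)]
item = [t + rng.gauss(0, 0.6) for t in true_score]       # stand-in for a single item
criterion = [t + rng.gauss(0, 0.4) for t in true_score]  # stand-in for a multi-item scale

# Impose a ceiling: roughly the top 40% of criterion scores share one value.
cutoff = sorted(criterion)[int(0.6 * n)]
criterion_ceiling = [min(c, cutoff) for c in criterion]

r_full = pearson_r(item, criterion)
r_ceiling = pearson_r(item, criterion_ceiling)
print(round(r_full, 3), round(r_ceiling, 3))
```

In this setup the correlation with the ceiling-limited criterion is lower than the correlation with the full-range criterion, even though the item itself is unchanged.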

Precision-reducing coarseness is a well-known limitation of SIPD and other less reliable measures.⁷ The tradeoff is between a more practical SIPD representation of essential domains and reduced precision per domain. Further research is warranted to determine the limits of QGEN-8 SIPD estimates of domain-specific scores and the applications for which QGEN-8 increases in item content and score range, and reduced item coarseness with improved item response categories, are sufficient to justify them as alternatives to SF-36 multi-item estimates of the same domains and summary components.

Because measurement precision standards are less stringent for comparisons of group mean scores, ANOVA comparisons of QGEN-8 and SF-36 norm-based scores yielded the same overall conclusions regarding group mean differences. Equivalent patterns of declines across methods were also observed in response to increasing comorbid condition severity because confidence intervals for group means are largely determined by sample size.⁴⁰ With sufficiently large samples, results from a head-to-head method comparison suggest that QGEN-8 can provide a more practical alternative to SF-36 multi-item scales for comparisons of acute symptom severity with and without chronic respiratory conditions.³¹ Medical product industry sponsorship of the latter methodological comparison underscores the importance of achieving comprehensive and more practical QOL survey tools enabling larger population and clinical studies.
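
The dependence of group-mean precision on sample size can be shown with the usual normal-approximation confidence interval. The sketch assumes a score SD of 10 (the conventional SD for norm-based scoring with mean=50) and hypothetical group sizes; neither value is taken from the study's groups.

```python
import math

def ci_half_width(sd, n, z=1.96):
    """Approximate 95% CI half-width for a group mean (normal approximation)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * sd / math.sqrt(n)

# Quadrupling the group size halves the interval width.
for n in (25, 100, 400, 1600):
    print(n, round(ci_half_width(10.0, n), 2))
```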

Classic psychometric methods used in developing and evaluating the SF-36 and in the current study may be seen as a limitation from the perspective of modern method (eg, IRT, CFA) advocates. However, as discussed above, the current study's classic methods of scale construction and scoring enabled direct comparisons with previously published results from convergent-discriminant validity evaluations, factor analyses (eg, PCF), and norm-based comparisons.^7,16,21,22,34,36 For 2 domains (PF, VT) sufficiently common to the current study and other (eg, PROMIS) surveys, IRT methods have proven useful in providing better descriptions of score ranges and should be applied in future research for other essential domains. With respect to the current study objective of reducing ceiling effect percentages, it should be noted that estimates of percentages in the most favorable response category (ie, ceiling effects) are identical with classic and IRT item scoring. Further, with respect to the classic (eg, PCA) and confirmatory SEM analyses (eg, CFA) of construct validity, the current study's overall conclusions are the same, although SEM model fit statistics are deemed by some to be more objective.

CONCLUSIONS

Overall, results support the use of the 8-item, approximately 1-minute, QGEN-8 survey for purposes of estimating 8 frequently measured MOS QOL domains. Advantages include reduced respondent burden and reduced ceiling effects while maintaining convergent-discriminant validity in relation to SF-36 profile domains sufficient for purposes of estimating group-level SF-36 physical (PCS) and mental (MCS) summary scores. To assure QGEN-8 availability for population and clinical outcomes research, it is available in multiple languages from the nonprofit Mapi Research Trust (MRT) at https://eprovide.mapi-trust.org.

Supplementary Material

mlr-63-300-s001.docx (17KB, docx)

mlr-63-300-s002.docx (73.3KB, docx)

mlr-63-300-s003.docx (16.9KB, docx)

ACKNOWLEDGMENTS

The author acknowledges the contributions of the late Barbara Gandek, PhD, during the research enabling the current study and gratefully acknowledges the helpful comments from anonymous reviewers.

Footnotes

J.E.W. has received US federal grants from the National Institutes of Health (NIH), the National Institute on Aging, and the Agency for Healthcare Research and Quality; unrestricted grants from foundations and industry; and honoraria from academic institutions. He is a shareholder in the NIH-funded Small Business Innovation Research (SBIR) company, JWRG Incorporated.

Supplemental Digital Content is available for this article. Direct URL citations are provided in the HTML and PDF versions of this article on the journal’s website, www.lww-medicalcare.com.

REFERENCES

1. McDowell I, Newell C. Measuring health: A guide to rating scales and questionnaires, 2nd ed. Oxford University Press; 1996.
2. Bowling A. Just one question: if one question works, why ask several? J Epidemiol Community Health. 2005;59:342–345.
3. Verbrugge LM, Merrill SS, Liu X. Measuring disability with parsimony. Disability and Rehabilitation. 1999;21:295–306.
4. Hennessy CH, Moriarty DG, Zack MM, et al. Measuring health-related quality of life for public health surveillance. Public Health Rep. 1994;109:665–672.
5. DeSalvo KB, Jones TM, Peabody J, et al. Health care expenditure prediction with a single item, self-rated health measure. Med Care. 2009;47:440–447.
6. Bierman AS, Bubolz TA, Fisher ES, et al. How well does a single question about health predict the financial health of Medicare managed care plans? Eff Clin Pract. 1999;2:56–62.
7. Ware JE, Jr, Kosinski M, Dewey JE, et al. How to Score and Interpret Single-Item Health Status Measures: A Manual for Users of the SF-8 Health Survey (With a Supplement on the SF-6 Health Survey). QualityMetric Incorporated; 2001.
8. Nelson EC, Wasson J, Kirk J, et al. Assessment of function in routine clinical practice: description of the COOP Chart method and preliminary findings. J Chronic Dis. 1987;40:55S–69S.
9. Kind P, Dolan P, Gudex C, et al. Variations in population health status: results from a United Kingdom national questionnaire survey. BMJ. 1998;316:736–741.
10. Shaw JW, Johnson JA, Coons SJ. US valuation of the EQ-5D health states: development and testing of the D1 valuation model. Med Care. 2005;43:203–220.
11. Hays RD, Bjorner JB, Revicki DA, et al. Development of physical and mental health summary scores from the patient-reported outcomes measurement information system (PROMIS) global items. Qual Life Res. 2009;18:873–880.
12. Ware JE, Jr. The status of health assessment 1994. Annu Rev Public Health. 1995;16:327–354.
13. Brook RH, Ware JE, Jr, Rogers WH, et al. Does free care improve adults’ health? Results from a randomized controlled trial. N Engl J Med. 1983;309:1426–1434.
14. Ware JE, Jr, Sherbourne CD. The MOS 36-Item Short-Form Health Survey (SF-36). I. Conceptual framework and item selection. Med Care. 1992;30:473–483.
15. Brook RH, Ware JE, Jr, Davies-Avery A, et al. Overview of adult health measures fielded in Rand’s health insurance study. Med Care. 1979;17:1–131.
16. Stewart AL, Ware JE, Jr. Measuring functioning and well-being: The Medical Outcomes Study approach. Duke University Press; 1992.
17. World Health Organization. World Health Organization Constitution, in Basic Documents. WHO; 1948.
18. Ware JE, Jr, Manning WG, Duan N, et al. Health status and the use of outpatient mental health services. Am Psychol. 1984;39:1090–1100.
19. Ware JE, Jr. SF-36 Health Survey update. Spine. 2000;25:3130–3139.
20. Ware JE, Jr, Bayliss MS, Rogers WH, et al. Differences in 4-year health outcomes for elderly and poor, chronically ill patients treated in HMO and fee-for-service systems. Results from the Medical Outcomes Study. J Am Med Assoc. 1996;276:1039–1047.
21. Ware JE, Jr, Kosinski M, Keller SD. SF-36 Physical and Mental Summary Scales: A User’s Manual. Boston, MA: The Health Institute; 1994.
22. Ware JE, Jr, Kosinski M, Keller SD. A 12-item short-form health survey: construction of scales and preliminary tests of reliability and validity. Med Care. 1996;34:220–233.
23. Fisher WP, Jr, Eubanks RL, Marier RL. Equating the MOS SF36 and the LSU HSI Physical Functioning Scales. J Outcome Meas. 1997;1:329–362.
24. Rose M, Bjorner JB, Becker J, et al. Evaluation of a preliminary physical function item bank supported the expected advantages of the Patient-Reported Outcomes Measurement Information System (PROMIS). J Clin Epidemiol. 2008;61:17–33.
25. Liegl G, Gandek B, Fischer FH, et al. Varying the item format improved the range of measurement in patient-reported outcome measures assessing physical function. Arth Res Ther. 2017;19:66.
26. Deng N, Guyer R, Ware JE. Energy, fatigue or both? A bifactor modeling approach to the conceptualization and measurement of vitality. Qual Life Res. 2015;24:81–93.
27. Veit CT, Ware JE, Jr. The structure of psychological distress and well-being in general populations. J Consult Clin Psychol. 1983;51:730–742.
28. Cella D, Riley W, Stone A, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005–2008. J Clin Epidemiol. 2010;63:1179–1194.
29. National Opinion Research Center (NORC) at the University of Chicago. AmeriSpeak. 2019. https://amerispeak.norc.org/Pages/default.aspx
30. Ware JE, Coutinho G, Smith AB, et al. The effects of greater frequency of two most prevalent bothersome acute respiratory symptoms on health-related quality of life in the 2020 US General Population. Qual Life Res. 2023;32:1043–1051.
31. Smith AB, Ware JE, Aluko P, et al. The validity of single-item measures of health-related quality of life across groups differing in acute respiratory symptom severity. Qual Life Res. 2024;33:2773–2780.
32. Ware JE, Gandek B, Guyer R, et al. Standardizing disease-specific quality of life measures across multiple chronic conditions: development and initial evaluation of the QOL Disease Impact Scale (QDIS®). Health Qual Life Outcomes. 2016;14:84.
33. McEntee ML, Gandek B, Ware JE. Improving multimorbidity measurement using an individualized quality of life impact assessments: predictive validity of a New Comorbidity Index. Health Qual Life Outcomes. 2022;20:108.
34. Ware JE, Jr, Kosinski M, Bjorner JB, et al. User’s Manual for the SF-36v2® Health Survey, 2nd ed. QualityMetric Incorporated; 2007.
35. Norman GR, Sloan JA, Wyrwich KW. Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation. Med Care. 2003;41:582–592.
36. McHorney CA, Ware JE, Jr, Lu JF, et al. The MOS 36-item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care. 1994;32:40–66.
37. Paz SH, Liu H, Fongwa MN, et al. Readability estimates for commonly used health-related quality of life surveys. Qual Life Res. 2009;18:889–900.
38. Acquadro C, Conway K, Giroudet C, et al. Linguistic Validation Manual for Health Outcomes Assessments, 2nd ed. Mapi Institute; 2012. ISBN: 2-9522021-0-9.
39. https://www.mapi-trust.org/author-collaboration/linguistic-validation-of/
40. Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Psychology Press; 1988.
41. Campbell DT, Fiske DW. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychol Bull. 1959;56:81–105.
42. Beusterien KM, Nissenson AR, Port FK, et al. The effects of recombinant human erythropoietin on functional health and well-being in chronic dialysis patients. J Am Soc Nephrol. 1996;7:763–773.
43. Baicker K, Taubman SL, Allen HL, et al. The Oregon experiment: effects on Medicaid clinical outcomes. N Engl J Med. 2013;368:1713–1722.
44. Croog SH, Levine S, Testa MA, et al. The effects of antihypertensive therapy on the quality of life. N Engl J Med. 1986;314:1657–1664.
45. Chobanian AV. Antihypertensive therapy in evolution. N Engl J Med. 1986;314:1701–1702.
46. Frendl DM, Ware JE, Jr. Patient-reported functional health and well-being outcomes with drug therapy: a systematic review of randomized trials using the SF-36 Health Survey. Med Care. 2014;52:439–445.
47. Brazier JE, Harper R, Jones NM, et al. Validating the SF-36 health survey questionnaire: new outcome measure for primary care. Brit Med J. 1992;305:160–164.
48. Fehnel S, DeMuro C, McLeod L, et al. US FDA patient-reported outcome guidance: great expectations and unintended consequences. Expert Rev Pharmacoecon Outcomes Res. 2013;13:441–446.
49. Keller SD, Ware JE, Jr, Bentler PM, et al. Use of structural equation modeling to test the construct validity of the SF-36 Health Survey in ten countries: results from the IQOLA Project. J Clin Epidemiol. 1998;51:1179–1188.
50. Suzukamo Y, Fukuhara S, Green J, et al. Validation testing of a three-component model of Short Form-36 scores. J Clin Epidemiol. 2011;64:301–308.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

SUPPLEMENTARY MATERIAL

mlr-63-300-s001.docx^{(17KB, docx)}

mlr-63-300-s002.docx^{(73.3KB, docx)}

mlr-63-300-s003.docx^{(16.9KB, docx)}

[R1] 1.McDowell I, Newell C. Measuring health: A guide to rating scales and questionnaires, 2nd ed. Oxford University Press; 1996. [Google Scholar]

[R2] 2.Bowling A. Just one question: if one question works, why ask several? J Epidemiol Community Health. 2005;59:342–455. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Verbrugge LM, Merrill SS, Liu X. Measuring disability with parsimony. Disability and Rehabilitation. 1999;21:295–306. [DOI] [PubMed] [Google Scholar]

[R4] 4.Hennessy CH, Moriarty DG, Zack MM, et al. Measuring health-related quality of life for public health surveillance. Public Health Rep. 1994;109:665–672. [PMC free article] [PubMed] [Google Scholar]

[R5] 5.DeSalvo KB, Jones TM, Peabody J, et al. Health care expenditure prediction with a single item, self-rated health measure. Med Care. 2009;47:440–447. [DOI] [PubMed] [Google Scholar]

[R6] 6.Bierman AS, Bubolz TA, Fisher ES, et al. How well does a single question about health predict the financial health of Medicare managed care plans? Eff Clin Pract. 1999;2:56–62. [PubMed] [Google Scholar]

[R7] 7.Ware JE, Jr., Kosinski M, Dewey JE, et al. How to Score and Interpret Single-Item Health Status Measures: A Manual for Users of the SF-8 Health Survey (With a Supplement on the SF-6 Health Survey). QualityMetric Incorporated; 2001. [Google Scholar]

[R8] 8.Nelson EC, Wasson J, Kirk J, et al. Assessment of function in routine clinical practice: description of the COOP Chart method and preliminary findings. J Chronic Dis. 1987;40:55S–69S. [DOI] [PubMed] [Google Scholar]

[R9] 9.Kind P, Dolan P, Gudex C, et al. Variations in population health status: results from a United Kingdom national questionnaire survey. BMJ. 1998;316:736–741. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.Shaw JW, Johnson JA, Coons SJ. US valuation of the EQ-5D health states: development and testing of the D1 valuation model. Med Care. 2005;43:203–220. [DOI] [PubMed] [Google Scholar]

[R11] 11.Hays RD, Bjorner JB, Revicki DA, et al. Development of physical and mental health summary scores from the patient-reported outcomes measurement information system (PROMIS) global items. Qual Life Res. 2009;18:873–880. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Ware JE, Jr. The status of health assessment 1994. Annu Rev Public Health. 1995;16:327–354. [DOI] [PubMed] [Google Scholar]

[R13] 13.Brook RH, Ware JE, Jr, Rogers WH, et al. Does free care improve adults’ health? Results from a randomized controlled trial. N Engl J Med. 1983;309:1426–1434. [DOI] [PubMed] [Google Scholar]

[R14] 14.Ware JE, Jr, Sherbourne CD. The MOS 36-Item Short-Form Health Survey (SF-36). I. Conceptual framework and item selection. Med Care. 1992;30:473–483. [PubMed] [Google Scholar]

[R15] 15.Brook RH, Ware JE, Jr., Davies-Avery A, et al. Overview of adult health measures fielded in Rand’s health insurance study. Med Care. 1979;17:1–131. [PubMed] [Google Scholar]

[R16] 16.Stewart AL, Ware JE, Jr. Measuring functioning and well-being: The Medical Outcomes Study approach. Duke University Press; 1992. [Google Scholar]

[R17] 17.World Health Organization . World Health Organization Constitution, in Basic Documents. WHO; 1948. [Google Scholar]

[R18] 18. Ware JE, Jr, Manning WG, Duan N, et al. Health status and the use of outpatient mental health services. Am Psychol. 1984;39:1090–1100.

[R19] 19. Ware JE, Jr. SF-36 Health Survey update. Spine. 2000;25:3130–3139.

[R20] 20. Ware JE, Jr, Bayliss MS, Rogers WH, et al. Differences in 4-year health outcomes for elderly and poor, chronically ill patients treated in HMO and fee-for-service systems. Results from the Medical Outcomes Study. J Am Med Assoc. 1996;276:1039–1047.

[R21] 21. Ware JE, Jr, Kosinski M, Keller SD. SF-36 Physical and Mental Summary Scales: A User’s Manual. Boston, MA: The Health Institute; 1994.

[R22] 22. Ware JE, Jr, Kosinski M, Keller SD. A 12-item short-form health survey: construction of scales and preliminary tests of reliability and validity. Med Care. 1996;34:220–233.

[R23] 23. Fisher WP, Jr, Eubanks RL, Marier RL. Equating the MOS SF36 and the LSU HSI Physical Functioning Scales. J Outcome Meas. 1997;1:329–362.

[R24] 24. Rose M, Bjorner JB, Becker J, et al. Evaluation of a preliminary physical function item bank supported the expected advantages of the Patient-Reported Outcomes Measurement Information System (PROMIS). J Clin Epidemiol. 2008;61:17–33.

[R25] 25. Liegl G, Gandek B, Fischer FH, et al. Varying the item format improved the range of measurement in patient-reported outcome measures assessing physical function. Arth Res Ther. 2017;19:66.

[R26] 26. Deng N, Guyer R, Ware JE. Energy, fatigue or both? A bifactor modeling approach to the conceptualization and measurement of vitality. Qual Life Res. 2015;24:81–93.

[R27] 27. Veit CT, Ware JE, Jr. The structure of psychological distress and well-being in general populations. J Consult Clin Psychol. 1983;51:730–742.

[R28] 28. Cella D, Riley W, Stone A, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005–2008. J Clin Epidemiol. 2010;63:1179–1194.

[R29] 29. National Opinion Research Center (NORC) at the University of Chicago. AmeriSpeak. 2019. https://amerispeak.norc.org/Pages/default.aspx

[R30] 30. Ware JE, Coutinho G, Smith AB, et al. The effects of greater frequency of two most prevalent bothersome acute respiratory symptoms on health-related quality of life in the 2020 US General Population. Qual Life Res. 2023;32:1043–1051.

[R31] 31. Smith AB, Ware JE, Aluko P, et al. The validity of single-item measures of health-related quality of life across groups differing in acute respiratory symptom severity. Qual Life Res. 2024;33:2773–2780.

[R32] 32. Ware JE, Gandek B, Guyer R, et al. Standardizing disease-specific quality of life measures across multiple chronic conditions: development and initial evaluation of the QOL Disease Impact Scale (QDIS®). Health Qual Life Outcomes. 2016;14:84.

[R33] 33. McEntee ML, Gandek B, Ware JE. Improving multimorbidity measurement using an individualized quality of life impact assessments: predictive validity of a New Comorbidity Index. Health Qual Life Outcomes. 2022;20:108.

[R34] 34. Ware JE, Jr, Kosinski M, Bjorner JB, et al. User’s Manual for the SF-36v2® Health Survey, 2nd ed. QualityMetric Incorporated; 2007.

[R35] 35. Norman GR, Sloan JA, Wyrwich KW. Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation. Med Care. 2003;41:582–592.

[R36] 36. McHorney CA, Ware JE, Jr, Lu JF, et al. The MOS 36-item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care. 1994;32:40–66.

[R37] 37. Paz SH, Liu H, Fongwa MN, et al. Readability estimates for commonly used health-related quality of life surveys. Qual Life Res. 2009;18:889–900.

[R38] 38. Acquadro C, Conway K, Giroudet C, et al. Linguistic Validation Manual for Health Outcomes Assessments, 2nd ed. Lyon, France: Mapi Institute; January 2012. ISBN: 2-9522021-0-9.

[R39] 39. https://www.mapi-trust.org/author-collaboration/linguistic-validation-of/

[R40] 40. Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Psychology Press; 1988.

[R41] 41. Campbell DT, Fiske DW. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychol Bull. 1959;56:81–105.

[R42] 42. Beusterien KM, Nissenson AR, Port FK, et al. The effects of recombinant human erythropoietin on functional health and well-being in chronic dialysis patients. J Am Soc Nephrol. 1996;7:763–773.

[R43] 43. Baicker K, Taubman SL, Allen HL, et al. The Oregon experiment: effects of Medicaid on clinical outcomes. N Engl J Med. 2013;368:1713–1722.

[R44] 44. Croog SH, Levine S, Testa MA, et al. The effects of antihypertensive therapy on the quality of life. N Engl J Med. 1986;314:1657–1664.

[R45] 45. Chobanian AV. Antihypertensive therapy in evolution. N Engl J Med. 1986;314:1701–1702.

[R46] 46. Frendl DM, Ware JE, Jr. Patient-reported functional health and well-being outcomes with drug therapy: a systematic review of randomized trials using the SF-36 Health Survey. Med Care. 2014;52:439–445.

[R47] 47. Brazier JE, Harper R, Jones NM, et al. Validating the SF-36 health survey questionnaire: new outcome measure for primary care. Brit Med J. 1992;305:160–164.

[R48] 48. Fehnel S, DeMuro C, McLeod L, et al. US FDA patient-reported outcome guidance: great expectations and unintended consequences. Expert Rev Pharmacoecon Outcomes Res. 2013;13:441–446.

[R49] 49. Keller SD, Ware JE, Jr, Bentler PM, et al. Use of structural equation modeling to test the construct validity of the SF-36 Health Survey in ten countries: results from the IQOLA Project. J Clin Epidemiol. 1998;51:1179–1188.

[R50] 50. Suzukamo Y, Fukuhara S, Green J, et al. Validation testing of a three-component model of Short Form-36 scores. J Clin Epidemiol. 2011;64:301–308.
