Validity and reliability of the Patient-Reported Outcomes Measurement Information System (PROMIS®) using computerized adaptive testing in patients with advanced chronic kidney disease

Esmee M van der Willik; Fenna van Breda; Brigit C van Jaarsveld; Marlon van de Putte; Isabelle W Jetten; Friedo W Dekker; Yvette Meuleman; Frans J van Ittersum; Caroline B Terwee

2022 Aug 1;38(5):1158–1169. doi: 10.1093/ndt/gfac231

PMCID: PMC10157750 PMID: 35913734

ABSTRACT

Background

The Patient-Reported Outcomes Measurement Information System (PROMIS®) has been recommended for computerized adaptive testing (CAT) of health-related quality of life. This study compared the content, validity, and reliability of seven PROMIS CATs to the 12-item Short-Form Health Survey (SF-12) in patients with advanced chronic kidney disease.

Methods

Adult patients with chronic kidney disease and an estimated glomerular filtration rate under 30 mL/min/1.73 m² who were not receiving dialysis treatment completed seven PROMIS CATs (assessing physical function, pain interference, fatigue, sleep disturbance, anxiety, depression, and the ability to participate in social roles and activities), the SF-12, the PROMIS Pain Intensity single item, and the Dialysis Symptom Index at inclusion and after 2 weeks. A content comparison was performed between PROMIS CATs and the SF-12. Construct validity of PROMIS CATs was assessed using Pearson's correlations. We assessed the test-retest reliability of all patient-reported outcome measures by calculating the intraclass correlation coefficient and minimal detectable change.

Results

In total, 207 patients participated in the study. A median of 45 items (10 minutes) were completed for PROMIS CATs. All PROMIS CATs showed evidence of sufficient construct validity. PROMIS CATs, most SF-12 domains and summary scores, and Dialysis Symptom Index showed sufficient test-retest reliability (intraclass correlation coefficient ≥ 0.70). PROMIS CATs had a lower minimal detectable change compared with the SF-12 (range, 5.7–7.4 compared with 11.3–21.7 across domains, respectively).

Conclusion

PROMIS CATs showed sufficient construct validity and test-retest reliability in patients with advanced chronic kidney disease. PROMIS CATs required more items but showed better reliability than the SF-12. Future research is needed to investigate the feasibility of PROMIS CATs for routine nephrology care.

Keywords: chronic kidney disease (CKD), minimal detectable change, patient-reported outcome measures (PROMs), reliability, validity

KEY LEARNING POINTS.

What is already known about this subject?

Patient-reported outcome measures (PROMs) are increasingly being used in nephrology care, but there is no consensus on the preferred PROMs because of lack of knowledge of psychometric properties.
The Patient-Reported Outcomes Measurement Information System (PROMIS®) using computerized adaptive testing (CAT) is one of the proposed PROMs to measure generic health-related quality of life (HRQOL) in patients with advanced chronic kidney disease (CKD).
PROMIS CAT is a relatively novel measurement method in health care and has several advantages compared with traditional, nonadaptive PROMs, but it has not yet been validated in patients with CKD.

What this study adds?

Psychometric properties of seven PROMIS CATs (assessing physical function, pain interference, fatigue, sleep disturbance, anxiety, depression, and the ability to participate in social roles and activities) are compared with the 12-item Short-Form Health Survey (SF-12), which is a validated and commonly used PROM and currently used in Dutch routine nephrology care.
Results show evidence for sufficient construct validity and test-retest reliability of all PROMIS CATs.
Seven PROMIS CATs required more items but showed better reliability than the SF-12.

What impact this may have on practice or policy?

This study provides valuable information about the psychometric properties of seven PROMIS CATs compared with a commonly used PROM (i.e. the SF-12) to assess HRQOL in patients with CKD.
Content comparison and reliability parameters, such as the minimal detectable change, are informative in the interpretation of PROM scores in routine nephrology care.
Knowledge of validity and reliability can support considerations about which PROMs best fit routine nephrology care.

INTRODUCTION

Patients with advanced chronic kidney disease (CKD) experience numerous physical and emotional disease-related symptoms that are associated with a decreased health-related quality of life (HRQOL) [1–4]. Although several symptoms and the impact on physical, mental, and social functioning have been considered of great importance by patients and health care professionals alike [5, 6], these patient-relevant outcomes may still be regularly underrecognized and therefore insufficiently managed in routine nephrology care [4, 7]. Patient-reported outcome measures (PROMs) can be used to improve insight into these important outcomes. PROMs have been incorporated into Dutch routine dialysis care [3] and are now also being implemented into the care for Dutch patients with advanced CKD and kidney transplant recipients [8].

Many different generic and disease-specific PROMs are being used within and across countries [9, 10]. In Dutch nephrology care, the 12-item Short-Form Health Survey (SF-12) and the Dialysis Symptom Index (DSI) are used to assess generic HRQOL and disease-related symptom burden, respectively [3]. A major advantage of using the same PROMs is that doing so enables comparison and monitoring of outcomes across CKD stages and treatments.

Recently, a consensus group of the International Consortium of Health Outcomes Measurement (ICHOM) selected the Patient-Reported Outcomes Measurement Information System (PROMIS®) as one of the recommended PROMs to measure generic HRQOL in patients with CKD [11]. Additionally, PROMIS was recommended by the Linnean initiative, a nationwide network of stakeholders in the Netherlands, for all patient populations to standardize outcome measurement across medical conditions [12]. PROMIS consists of a collection of item banks (i.e. large sets of questions), developed to measure commonly relevant domains across patient conditions, such as physical function, fatigue, and anxiety. Because PROMIS item banks were developed using item response theory (IRT) models, they can also be administered as computerized adaptive tests (CATs). The use of CATs is relatively novel in health care and has several advantages compared with traditional fixed (i.e. nonadaptive) PROMs. In a CAT, the computer selects questions from an item bank based on the answers to previous questions. With this method, the PROM is adapted to the patient, resulting in questions that are likely more relevant to that patient. In addition, on average, fewer questions will be required to obtain similar or even more precise measurements compared with fixed PROMs [13, 14]. Sufficient validity and reliability of fixed PROMIS measures were found in several disease populations [15–17], including patients with CKD [18, 19]. However, the psychometric properties of PROMIS CATs have not yet been studied in patients with CKD.

Therefore, this study aimed to examine and compare the content, construct validity, and test-retest reliability (including minimal detectable change [MDC]) of seven PROMIS CATs (assessing physical function, pain interference, fatigue, sleep disturbance, anxiety, depression, and the ability to participate in social roles and activities) with the SF-12 in patients with advanced CKD. Additionally, we assessed the test-retest reliability of the PROMIS Pain Intensity single item and the DSI, as these PROMs are often used together with the PROMIS CATs and SF-12.

MATERIALS AND METHODS

Study design and population

This observational study included adult patients with advanced CKD and an estimated glomerular filtration rate (eGFR) less than 30 mL/min/1.73 m² who were not receiving dialysis treatment. Exclusion criteria were kidney replacement therapy (KRT; i.e. dialysis or kidney transplantation) planned within 4 weeks, rapid deterioration of kidney function (i.e. decrease in eGFR >20 mL/min/1.73 m² during the past 6 months), inability to complete PROMs because of cognitive impairment, poor knowledge of the Dutch language, and absence of informed consent. Patients were recruited between November 2020 and August 2021 by their nephrologist at the outpatient clinics of Amsterdam University Medical Centre in Amsterdam and Niercentrum aan de Amstel in Amstelveen, the Netherlands. Eligible patients received written information by mail and were, if needed, approached by telephone after 2 weeks for further information. After providing written informed consent, patients were invited by email to complete the PROMs digitally in the Kwaliteit van Leven In Kaart (KLIK; www.hetklikt.nu) research platform at inclusion (i.e. baseline), after 2 weeks, and after 6 months. If necessary, two reminders were sent by email or patients were contacted by telephone. Patients without access to an electronic device with an internet connection could participate by telephone. This study used the baseline and 2-week measurements (Fig. 1).

FIGURE 1: Flow diagram of patient inclusion for baseline and 2-week measurements. All patients who completed the baseline measurement constitute the study sample for validity analyses. All patients who completed the 2-week measurement within 28 days after baseline are included for reliability analyses. The patient indicated the reason for exclusion. Patients who were not digitally skilled were offered participation by telephone but were not willing to participate in that manner.

The study was reviewed by the Medical Ethics Review Committee of VU University Medical Centre in the Netherlands, which confirmed that the Medical Research Involving Human Subjects Act does not apply.

Measures

Demographic and clinical characteristics, including age, sex, primary kidney disease according to European Renal Association codes [20], body mass index (BMI), smoking status, comorbidities (hypertension, diabetes mellitus, cardiovascular disease [CVD], lung disease, liver disease, and malignancy), as defined by ICHOM [11], eGFR (mL/min/1.73 m²), KRT in medical history, start of KRT and death during follow-up were collected from medical records. Educational level and ethnocultural background were self-reported at baseline.

The PROMs included in this study are seven PROMIS CATs, the SF-12, one PROMIS single item, and the DSI. The SF-12 and DSI have demonstrated validity within patients with CKD [10, 21–24]. PROMs were presented in random order across patients but with fixed order within patients during follow-up. The research platform to complete PROMs did not allow for any missing values within a PROM.

Seven Dutch-Flemish PROMIS CATs [25] were administered: version 1.2 Physical Function, version 1.1 Pain Interference, version 1.0 Fatigue, version 1.0 Sleep Disturbance, version 1.0 Anxiety, version 1.0 Depression, and version 2.0 Ability to Participate in Social Roles and Activities. All items have five response options, ranging from ‘never’ to ‘always’ or from ‘not at all’ to ‘very much’. PROMIS CATs are presented as T-scores, where 50 (SD, 10) represents the average score of the US general population. A difference greater than 2 points was considered relevant [26]. Higher scores indicate more of the construct (e.g. a higher Depression score means more depression, a higher Physical Function score means more [better] function). Within each PROMIS CAT, questions were selected one by one from an underlying item bank. The starting item is the item with the highest information value for the average level of the domain in the general population. The next items are subsequently selected from the item bank based on the respondent's answers to previous items. For example, a respondent reports having difficulties doing 2 hours of physical labor (first item). Then, the second item will be an ‘easier’ activity (e.g. a question about ability to do chores, such as vacuuming). The respondent is not asked about more ‘difficult’ activities (e.g. running 5 miles) that he or she is presumably not able to do. By tailoring the next item to the person's ability, questions are more often relevant to that person, and on average, patients must complete fewer questions. (See Supplement A for a visual illustration of a CAT.) After each item, the score and standard error (SE) are estimated based on all items completed so far. In this study, the CAT stopped when an SE of 2.2 on the T-score metric was reached (comparable to a reliability of approximately 0.95) or when a maximum of 12 items per CAT had been administered. We used a lower SE compared with the standard stopping rule (i.e. SE, 3.0) [13] because a higher reliability may be preferable for routine care; by using this setting, the optimal performance of PROMIS CATs could be investigated. PROMIS CATs were administered using CAT software from the Dutch-Flemish Assessment Center, part of the Dutch-Flemish PROMIS National Center [27].
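The link between the SE thresholds and reliability mentioned above can be illustrated with a short sketch. It assumes the commonly used approximation reliability ≈ 1 − (SE/SD)² on the T-score metric (SD, 10), and it reads "an SE of 2.2 was reached" as SE ≤ 2.2. The function names are hypothetical and do not represent the Dutch-Flemish Assessment Center software.

```python
T_SCORE_SD = 10.0  # SD of the T-score metric in the reference population


def reliability_from_se(se: float, sd: float = T_SCORE_SD) -> float:
    """Approximate reliability implied by a standard error (assumed formula)."""
    return 1.0 - (se / sd) ** 2


def cat_should_stop(se: float, n_items: int,
                    se_target: float = 2.2, max_items: int = 12) -> bool:
    """Stopping rule as described in the text: SE target reached or item cap hit."""
    return se <= se_target or n_items >= max_items


print(round(reliability_from_se(2.2), 3))  # 0.952 (stopping rule in this study)
print(round(reliability_from_se(3.0), 3))  # 0.91 (standard stopping rule)
```

Under this approximation, the stricter threshold of 2.2 corresponds to the reliability of approximately 0.95 stated in the text, whereas the standard threshold of 3.0 corresponds to approximately 0.91.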

The SF-12, version 2 [28, 29] is a 12-item generic PROM that assesses 8 domains of HRQOL: physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional, and mental health. Additionally, a physical component summary (PCS) score (including physical functioning, role-physical, bodily pain, and general health) and a mental component summary (MCS) score (including vitality, social functioning, role-emotional, and mental health) can be calculated. Domain and summary scores range from 0 to 100, and the US general population was used as reference, with an average score of 50 (SD, 10). Higher scores indicate a better HRQOL.

The PROMIS item version 1.0 Numerical Rating Scale Pain Intensity 1a is a single item with a 0 to 10 scale, with higher scores indicating more pain.

The DSI [21] is a 30-item disease-specific PROM for assessing physical and emotional symptom burden. Patients report the presence of 30 symptoms (yes/no) during the past week and, if present, the burden of each symptom on a 5-point Likert scale, ranging from 1 (‘not at all’ bothersome) to 5 (‘very much’ bothersome). Two overall scores were calculated: (i) total number of symptoms (0–30 symptoms) and (ii) total symptom burden score, which is the sum of burden on individual symptoms, ranging from 0 (no symptoms) to 150 (all 30 symptoms are very much bothersome) [3, 30]. The DSI items ‘feeling tired or lack of energy’ and ‘feeling anxious’, as well as ‘trouble falling asleep’ and ‘trouble staying asleep’ (the latter two hereafter combined as ‘sleep problems’), were used as comparison items in the construct validity analyses because these items intend to measure constructs comparable to the PROMIS CATs Fatigue, Anxiety, and Sleep Disturbance.
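The two DSI overall scores described above can be sketched as follows. The input layout (one entry per symptom, `None` when the symptom is absent) and the function name are assumptions made for illustration only.

```python
from typing import Optional, Sequence, Tuple

N_SYMPTOMS = 30  # number of DSI symptoms, as stated in the text


def dsi_scores(burden: Sequence[Optional[int]]) -> Tuple[int, int]:
    """Return (total number of symptoms, total symptom burden score).

    `burden` holds one entry per symptom: None if the symptom is absent,
    otherwise the 1-5 burden rating.
    """
    if len(burden) != N_SYMPTOMS:
        raise ValueError("expected one entry per DSI symptom")
    present = [b for b in burden if b is not None]
    if any(not 1 <= b <= 5 for b in present):
        raise ValueError("burden ratings must lie between 1 and 5")
    return len(present), sum(present)


# Hypothetical respondent: three symptoms present, rated 1, 3, and 5.
answers = [None] * 27 + [1, 3, 5]
print(dsi_scores(answers))  # (3, 9)
```

A respondent who rates all 30 symptoms as 5 obtains the maximum scores of 30 symptoms and a burden of 150, matching the ranges given in the text.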

Content comparison

To provide insight into the comparability of PROMIS CATs and the SF-12, we compared their content by providing (i) an overview of the PROM characteristics (e.g. domains, number of items, recall period, scoring, and interpretation) and (ii) a visual comparison of the domain score distributions using an interpretative color indication (from green [better] to red [worse] HRQOL), in line with the use in routine care [31, 32].

Construct validity

We assessed the construct validity of PROMIS CATs by using Pearson's correlations. Hypotheses were formulated a priori about the expected correlations between PROMIS CATs and the SF-12 and DSI based on literature [15–18] and expert judgement (E.vdW. and C.T.). We expected strong correlations (r ≥ 0.7) between PROMIS CATs and comparable SF-12 domains and similar DSI items, moderate correlations (r = 0.5–0.7) between PROMIS CATs and largely related SF-12 domains, and no strong correlations for other comparisons (r ≤ 0.6) (see Table 1). Construct validity was considered sufficient if 75% or more of the results were in accordance with the hypotheses.

Table 1.

Hypotheses for construct validity

PROMIS CAT	Strong correlation: Pearson's r ≥ 0.7	Moderate correlation: Pearson's r = 0.5–0.7	No strong correlation: Pearson's r ≤ 0.6
Physical Function	SF-12 physical functioning	SF-12 general health	All other SF-12 domains
	SF-12 PCS^a	SF-12 bodily pain	DSI total number of symptoms and symptom burden score
Pain Interference	SF-12 bodily pain	SF-12 physical functioning	All other SF-12 domains
		SF-12 PCS	DSI total number of symptoms and symptom burden score
Fatigue	SF-12 vitality		All other SF-12 domains
	DSI feeling tired or lack of energy (1 item)		DSI total number of symptoms and symptom burden score
Sleep Disturbance	DSI sleep problems (2 items)^b		All other SF-12 domains
			DSI total number of symptoms and symptom burden score
Anxiety	SF-12 mental health		All other SF-12 domains
	SF-12 MCS^a		DSI total number of symptoms and symptom burden score
	DSI feeling anxious (1 item)		
Depression	SF-12 mental health		All other SF-12 domains
	SF-12 MCS^a		DSI total number of symptoms and symptom burden score
Ability to Participate in Social Roles and Activities	SF-12 social functioning	SF-12 role-physical	All other SF-12 domains
		SF-12 role-emotional	DSI total number of symptoms and symptom burden score

^a SF-12 PCS includes the domains physical functioning, role-physical, bodily pain, and general health; SF-12 MCS includes the domains vitality, social functioning, role-emotional, and mental health.

^b DSI sleep problems were defined as trouble falling asleep and/or trouble staying asleep.
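The 75% decision rule for construct validity can be sketched as below. The sketch compares the absolute value of each correlation with the hypothesized category, which is an assumption: some PROMIS CATs and SF-12 domains are scored in opposite directions, so the sign of r depends on the pair. The correlation values in the example are invented for illustration and are not study results.

```python
from typing import Iterable, Tuple


def hypothesis_met(r: float, expected: str) -> bool:
    """Check one Pearson's r against its a priori category (absolute value assumed)."""
    magnitude = abs(r)
    if expected == "strong":
        return magnitude >= 0.7
    if expected == "moderate":
        return 0.5 <= magnitude <= 0.7
    if expected == "not_strong":
        return magnitude <= 0.6
    raise ValueError(f"unknown category: {expected}")


def construct_validity(results: Iterable[Tuple[float, str]],
                       threshold: float = 0.75) -> Tuple[bool, float]:
    """Return (sufficient?, share of hypotheses confirmed)."""
    outcomes = [hypothesis_met(r, expected) for r, expected in results]
    share = sum(outcomes) / len(outcomes)
    return share >= threshold, share


# Invented correlations for one hypothetical PROMIS CAT.
example = [(0.78, "strong"), (-0.55, "moderate"),
           (0.41, "not_strong"), (0.66, "not_strong")]
print(construct_validity(example))  # (True, 0.75)
```

In this invented example, three of four hypotheses are confirmed, which just meets the 75% criterion.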

Test-retest reliability

We assessed the test-retest reliability of PROMIS CATs, SF-12, PROMIS Pain Intensity single item, and DSI by calculating the intraclass correlation coefficient (ICC) in patients with valid baseline and 2-week measurements (Fig. 1). We calculated the ICC using a two-way random-effects model for absolute agreement: $\mathrm{ICC} = \sigma^2_p / (\sigma^2_p + \sigma^2_m + \sigma^2_e)$, whereby $\sigma^2_p$ is the variation between patients, $\sigma^2_m$ is the variation between measurements, and $\sigma^2_e$ is random error variance. ICC ≥ 0.70 was considered sufficient [33].
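As an illustration, the ICC for absolute agreement can be computed from the variance components of a two-way layout (patients by measurements): the between-patient variance is divided by the sum of the between-patient, between-measurement, and random error variances. The following minimal sketch estimates these components from ANOVA mean squares; it is not the SPSS procedure used in the study, the function names are hypothetical, and negative variance estimates are not truncated.

```python
from typing import Sequence, Tuple


def variance_components(
    scores: Sequence[Sequence[float]],
) -> Tuple[float, float, float]:
    """Estimate (patient, measurement, error) variances from an n x k table."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ms_patients = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_measurements = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_error = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_error = ss_error / ((n - 1) * (k - 1))
    var_patients = (ms_patients - ms_error) / k
    var_measurements = (ms_measurements - ms_error) / n
    return var_patients, var_measurements, ms_error


def icc_agreement(scores: Sequence[Sequence[float]]) -> float:
    """Single-measurement ICC for absolute agreement (two-way random effects)."""
    var_p, var_m, var_e = variance_components(scores)
    return var_p / (var_p + var_m + var_e)


# Invented T-scores for five patients at baseline and at 2 weeks.
t_scores = [[42.0, 44.5], [55.1, 53.0], [61.3, 60.2], [48.7, 50.9], [37.4, 39.0]]
print(round(icc_agreement(t_scores), 2))
```

Because the between-measurement variance is part of the denominator, a systematic shift between baseline and retest lowers this ICC, which is what distinguishes absolute agreement from consistency.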

We computed the ICC for each PROMIS CAT and SF-12 domain separately. Additionally, we calculated the ICC for the PROMIS Pain Intensity single item and for the DSI total number of symptoms and symptom burden score. Although the DSI was not designed to be interpreted as an overall score (as it measures 30 different symptoms), the total number of symptoms and symptom burden score are often used within health care, and insight into the reliability of these scores is therefore of clinical relevance.

The MDC was also calculated for each domain of the PROMIS CATs and SF-12, the PROMIS Pain Intensity single item, and the DSI total number of symptoms and symptom burden score. The MDC is a parameter of measurement error and is defined as the ‘smallest change in score that can be detected beyond measurement error’ with 95% confidence [33]. Two different methods were applied to calculate the MDC, in line with the underlying measurement theories—namely, classical test theory (CTT) or IRT, which assume a constant or varying standard error of measurement (SEM) across the PROM scale, respectively [34, 35].

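The MDC, based on CTT, of the SF-12 domains, PROMIS Pain Intensity single item, and the DSI total number of symptoms and symptom burden score was calculated using the following formula: $\mathrm{MDC} = 1.96 \times \sqrt{2} \times \mathrm{SEM}$, whereby SEM was calculated as $\sqrt{\sigma^2_m + \sigma^2_e}$ (the square root of the sum of the variation between measurements and the random error variance).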

The MDC, based on IRT, of each PROMIS CAT varies by patient (because with IRT the SE of each score is different) and was calculated using the following formula: $\mathrm{MDC} = 1.96 \times \sqrt{SE_1^2 + SE_2^2}$, whereby $SE_1$ and $SE_2$ are the patient's IRT-estimated SEs of the T-score at baseline and at the 2-week measurement, respectively. A mean MDC of each PROMIS CAT was subsequently calculated for the whole group.
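The two MDC approaches can be contrasted in a short sketch. It assumes the standard formulations for a 95% confidence level: MDC = 1.96 × √2 × SEM under CTT, with SEM taken as the square root of the summed between-measurement and error variances, and MDC = 1.96 × √(SE₁² + SE₂²) per patient under IRT. Function names and input values are hypothetical.

```python
import math
from statistics import mean
from typing import Iterable, Tuple

Z_95 = 1.96  # two-sided 95% confidence


def sem_agreement(var_measurements: float, var_error: float) -> float:
    """SEM from between-measurement and random error variance (assumed form)."""
    return math.sqrt(var_measurements + var_error)


def mdc_ctt(sem: float) -> float:
    """CTT-based MDC: one constant SEM applies to the whole scale."""
    return Z_95 * math.sqrt(2) * sem


def mdc_irt(se_baseline: float, se_retest: float) -> float:
    """IRT-based MDC for one patient, using that patient's two SEs."""
    return Z_95 * math.sqrt(se_baseline ** 2 + se_retest ** 2)


def mean_mdc_irt(se_pairs: Iterable[Tuple[float, float]]) -> float:
    """Group-level MDC as the mean of the patient-level values."""
    return mean(mdc_irt(a, b) for a, b in se_pairs)


# A patient whose CAT stopped exactly at the SE target of 2.2 on both occasions.
print(round(mdc_irt(2.2, 2.2), 1))  # 6.1
```

Under these assumptions, a patient who reaches the stopping threshold of SE 2.2 at both measurements has an MDC of about 6.1 T-score points, which lies within the range reported for the PROMIS CATs in the abstract.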

Data analyses were performed using SPSS statistical software, version 25.0 (IBM Corp., Armonk, NY, USA).

RESULTS

Study participants

Almost half of the patients approached provided written informed consent. In total, 207 participants completed the baseline measurement and were included in the current analyses. Of them, 179 (86.5%) participants completed the 2-week measurement within 28 days and were eligible for reliability analyses (Fig. 1). The average time between the baseline and 2-week measurement was 14.1 (SD, 3.7) days. Eleven patients participated by telephone. Sociodemographic and clinical characteristics of the participants at baseline and 2-week measurements are shown in Table 2. The baseline and 2-week study samples were comparable. About 60% of patients were male, the mean (SD) age was 65.5 (13.8) years, and the majority (85%) had a Dutch ethnocultural background. Mean (SD) eGFR was 21.4 (6.7) mL/min/1.73 m², and 17% of patients had received KRT in the past.

Table 2.

Characteristics of study sample at baseline and 2-week measurement

Characteristic	Study sample at baseline^a (n = 207)	Study sample at 2 weeks^a (n = 179)
Sex, male, n (%)	124 (59.9)	107 (59.8)
Age (years), mean (SD)	65.5 (13.8)	66.1 (13.1)
Ethnocultural group^b, Dutch, n (%)	176 (85.0)	152 (84.9)
Educational level^c, n (%)
Low	85 (41.0)	74 (41.3)
Middle	49 (23.7)	43 (24.0)
High	73 (35.3)	62 (34.6)
Primary kidney disease, n (%)
Glomerulonephritis/sclerosis	34 (16.6)	33 (18.6)
Pyelonephritis	7 (3.4)	7 (4.0)
Polycystic kidney disease	16 (7.8)	15 (8.5)
Other congenital/hereditary kidney diseases	15 (7.3)	13 (7.3)
Hypertension/renal vascular disease	46 (22.5)	42 (23.7)
Diabetes mellitus	14 (6.8)	12 (6.8)
Miscellaneous	63 (30.7)	49 (27.7)
Unknown	10 (4.9)	6 (3.4)
Kidney function (eGFR, mL/min/1.73 m²), mean (SD)	21.4 (6.7)	21.6 (6.6)
KRT in medical history^d, yes, n (%)	35 (17.0)	30 (16.9)
BMI (kg/m²), mean (SD)	26.8 (5.2)	26.9 (5.2)
Smoking, n (%)
Yes	25 (13.2)	19 (11.7)
No, stopped	94 (49.7)	82 (50.6)
No, never smoked	70 (37.0)	61 (37.7)
Comorbidities, n (%)
Hypertension, yes	164 (79.2)	140 (78.2)
Diabetes mellitus, yes	62 (30.0)	53 (29.6)
CVD, yes	53 (25.6)	43 (24.0)
Lung disease, yes	30 (14.5)	28 (15.6)
Liver disease, yes	11 (5.3)	8 (4.5)
Malignancy, yes	50 (24.2)	43 (24.0)

Missing values at baseline: primary kidney disease: n = 2 (1.0%); KRT in medical history: n = 1 (0.5%); BMI: n = 11 (5.3%); smoking: n = 18 (8.7%). Missing values at 2 weeks: primary kidney disease: n = 2 (1.1%); KRT in medical history: n = 1 (0.6%); BMI: n = 9 (5.0%); smoking: n = 17 (9.5%).

^a Study sample at baseline was used for validity analyses. Study sample at 2-week measurement was used for reliability analyses.

^b Self-reported ethnocultural group: ‘What ethnic group do you consider yourself to belong to?’

^c Educational level according to International Standard Classification of Education levels 2011, classified as low (primary, lower secondary, or lower vocational education), middle (upper secondary or upper vocational education), and high (tertiary education [college/university]).

^d KRT in medical history includes patients who have undergone (temporary) dialysis treatment or a kidney transplant in the past. At study inclusion, all patients had an eGFR < 30 mL/min/1.73 m² and did not require dialysis treatment, in accordance with inclusion criteria.