2024 Mar 15;14(3):643–669. doi: 10.1007/s13555-024-01114-2

Table 2.

Summary of psychometric analyses performed in the phase 3 clinical trial data

Analysis Description
Stage 1: Item properties
Quality of completion The quality of completion of the HESD was evaluated to identify any items with unexpectedly high levels of missing data. Missing data at the form-level and item-level were summarized at Baseline and Weeks 2, 4, 8, 12 and 16.
Item response distributions

Response distributions for each HESD item were examined to assess the frequency and percentage of each endorsed response and to identify any response options that were overly favored or evidence of unexpected or skewed distributions. This was assessed at Baseline and Weeks 2, 4, 8, 12 and 16.

Percentages of minimum and maximum responses were also calculated to examine floor and ceiling effects for all items. A floor effect was defined as a high percentage of patients endorsing the response option 0, and a ceiling effect as a high percentage of patients endorsing the response option 10. A substantial effect was defined as > 15% of respondents. Items with a substantial proportion of patients scoring at the floor or ceiling were flagged for further consideration.
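As a minimal sketch of the floor/ceiling screen described above (assuming a 0–10 numeric rating scale and the pre-specified > 15% criterion; the function name and data layout are illustrative, not from the source):

```python
import numpy as np

def floor_ceiling_flags(responses, floor=0, ceiling=10, threshold=0.15):
    """Flag items where the share of floor (0) or ceiling (10)
    responses exceeds the pre-specified 15% criterion.

    responses: 2-D array-like, rows = patients, columns = items.
    Returns a list of (item_index, floor_pct, ceiling_pct, flagged).
    """
    responses = np.asarray(responses, dtype=float)
    results = []
    for j in range(responses.shape[1]):
        col = responses[:, j]
        col = col[~np.isnan(col)]  # ignore missing responses
        floor_pct = float(np.mean(col == floor))
        ceil_pct = float(np.mean(col == ceiling))
        flagged = bool(floor_pct > threshold or ceil_pct > threshold)
        results.append((j, floor_pct, ceil_pct, flagged))
    return results
```

In practice this check would be repeated at each visit (Baseline and Weeks 2, 4, 8, 12 and 16), as the table specifies.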

Stage 2: Dimensionality and scoring
Inter-item correlations Inter-item correlations were examined for each pair of items in the HESD to ensure each item measured a distinct concept without any redundancy. Items that correlated very highly with one another (≥ 0.90) were flagged for review.
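A sketch of the redundancy screen, assuming Pearson correlations on the patients-by-items matrix (the function name is illustrative; the ≥ 0.90 cutoff is from the analysis plan):

```python
import numpy as np

def flag_redundant_pairs(responses, cutoff=0.90):
    """Flag item pairs whose inter-item correlation meets or exceeds
    the redundancy cutoff (>= 0.90 in the analysis plan).

    responses: 2-D array-like, rows = patients, columns = items.
    Returns a list of (item_i, item_j, correlation).
    """
    corr = np.corrcoef(np.asarray(responses, dtype=float), rowvar=False)
    n_items = corr.shape[0]
    flagged = []
    for i in range(n_items):
        for j in range(i + 1, n_items):
            if corr[i, j] >= cutoff:
                flagged.append((i, j, float(corr[i, j])))
    return flagged
```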
Rasch analysis

Rasch analysis was conducted to confirm the underlying unidimensional structure of the HESD and assess item performance.

Item fit and person fit statistics were examined, with values between 0.5 and 1.5 considered optimal for infit mean square (MNSQ) and values between 0.5 and 2.0 considered optimal for outfit MNSQ [24].

Item characteristic curves were evaluated to graphically represent the probability of a participant selecting each response category (response option) for each item given the latent severity as estimated by the Rasch model. Evidence of overlapping categories (suggesting there may be too many response options), or disordered curves (suggesting that the response options are not behaving as intended) was flagged for consideration.

Yen’s Q3 local dependence indices were produced, with the expectation that item residual correlations should deviate < 0.30 from the average residual correlation; higher deviation indicates that the responses to one item depend on the responses to another (i.e., the items are locally dependent) [25–27].

Item-person maps presenting the location of both respondents and items on the same latent trait were examined to evaluate the spread of item difficulty parameters and their ability to capture the severity of participants in the population.

Stage 3: Reliability and validity of scores
Reliability
Internal consistency reliability

Internal consistency reliability of the HESD score was evaluated using Cronbach’s alpha coefficient (≥ 0.70 for good internal consistency) to assess the homogeneity of items within the HESD score [28].

The impact of item removal on internal consistency reliability was examined by calculating Cronbach’s alpha with each item removed from the HESD score in turn. If the removal of an item notably increased the alpha value, that item might not fit well within its domain.

Corrected item-total correlations were also calculated by computing the Pearson correlation coefficient of each item with the sum of the remaining items within its corresponding score. A correlation < 0.40 was considered evidence that the item did not fit well with the other items [29].
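The three internal-consistency checks above can be sketched as follows (a minimal illustration, assuming complete-case data; function names are not from the source):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a patients x items response matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def item_diagnostics(items):
    """Alpha-if-item-deleted and corrected item-total correlations.

    Returns a list of (item_index, alpha_without_item, corrected_itc);
    an alpha that rises on deletion, or an ITC < 0.40, flags the item.
    """
    items = np.asarray(items, dtype=float)
    out = []
    for j in range(items.shape[1]):
        rest = np.delete(items, j, axis=1)
        alpha_wo = cronbach_alpha(rest)
        # corrected item-total: item vs sum of the *remaining* items
        itc = float(np.corrcoef(items[:, j], rest.sum(axis=1))[0, 1])
        out.append((j, alpha_wo, itc))
    return out
```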

Test-retest reliability

Test-retest reliability was evaluated for the HESD Itch score (weekly average), HESD Pain score (weekly average) and HESD score (weekly average) by examining the stability of scores between Week 2 and 4 and between Week 4 and 8 in patients defined as having ‘stable’ CHE based on other assessments in the trial.

Subgroups of patients with ‘stable’ CHE were defined as patients with no change in the relevant PGI-S measure, no change on the PaGA or no change in the IGA-CHE in separate analyses.

Intra-class correlation coefficients (ICCs) were calculated and evaluated using pre-specified cut-off criteria: < 0.50 indicating poor reliability, 0.5–0.75 indicating moderate reliability, 0.75–0.90 indicating good reliability and > 0.90 indicating excellent reliability [30]. Pearson’s correlation coefficients were also calculated.
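The ICC computation and the pre-specified interpretation cut-offs might be sketched as below. The source does not state which ICC form was used; ICC(2,1) (two-way random effects, absolute agreement, Shrout–Fleiss) is shown here as one common choice for test-retest data, so treat the formula as an assumption:

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1) for an n_subjects x k_timepoints score matrix
    (two-way random effects, absolute agreement; an assumed form)."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-subject means
    col_means = x.mean(axis=0)   # per-timepoint means
    ssr = k * ((row_means - grand) ** 2).sum()  # between subjects
    ssc = n * ((col_means - grand) ** 2).sum()  # between timepoints
    sse = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def classify_icc(icc):
    """Pre-specified cut-offs from the analysis plan [30]."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"
```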

Construct validity
Convergent validity

Convergent validity was evaluated by calculating polyserial and Spearman’s correlations of the HESD Itch score, HESD Pain score and HESD score with the DLQI total score, DLQI symptoms and feelings score and DLQI item 1 (which assesses itch, pain, soreness and stinging).

Convergent validity evaluates the relationship with other measures that assess similar or related concepts. Correlations < 0.50 were defined a priori as ‘weak’, those ≥ 0.50 and < 0.70 as ‘moderate’, those ≥ 0.70 and < 0.90 as ‘strong’, and those ≥ 0.90 as ‘very strong’ [31].

The following relationships were hypothesized to exist:

 DLQI total score > 0.30 with HESD Itch score, HESD Pain score and HESD score

 DLQI symptoms and feelings subscale > 0.50 with HESD Itch score, HESD Pain score and HESD score

 DLQI item 1: itch, pain, soreness and stinging > 0.50 with HESD Itch score and HESD Pain score
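Checking a hypothesized correlation against its pre-specified threshold might look as follows. The source used polyserial as well as Spearman correlations; only a Spearman version is sketched here (rank-based Pearson with average ranks for ties), and the function names are illustrative:

```python
import numpy as np

def spearman_r(a, b):
    """Spearman rank correlation: Pearson on average ranks."""
    def ranks(v):
        v = np.asarray(v, dtype=float)
        order = v.argsort()
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        for val in np.unique(v):       # average ranks for ties
            mask = v == val
            r[mask] = r[mask].mean()
        return r
    ra, rb = ranks(a), ranks(b)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

def check_hypothesis(hesd_scores, other_scores, min_r):
    """Return (r, met) for a pre-specified convergent-validity
    hypothesis, e.g. r > 0.30 with the DLQI total score."""
    r = spearman_r(hesd_scores, other_scores)
    return r, bool(r > min_r)
```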

Known-groups analysis

The known-groups method was used to evaluate differences in HESD Itch, HESD Pain and HESD total score among groups of patients expected to differ in severity. Known groups were defined using scores on the PaGA, IGA-CHE, PGI-S Itch, PGI-S Pain and HESD PGI-S.

Mean HESD Itch, HESD Pain and HESD total scores were compared between severity groups, with F-test one-way ANOVAs used to test the mean score differences between groups. The pre-specified criterion for known-groups validity was considered met if statistically significant differences (p < 0.05) in mean HESD Itch, HESD Pain and HESD total scores were observed between the known groups, and scores increased monotonically as expected.

Between-group effect sizes (ES) were also calculated as a measure of the magnitude of differences in scores between groups. The following pre-specified cut-offs were used to interpret the magnitude of each ES: small (ES = 0.20), moderate (ES = 0.50) and large (ES = 0.80) [32].
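A minimal sketch of the known-groups comparison: a one-way ANOVA F statistic across severity groups, plus a pooled-SD standardized difference (Cohen's d is assumed here as the effect-size form, since the source cites Cohen's benchmarks [32] but does not name the formula):

```python
import numpy as np

def one_way_anova_f(groups):
    """One-way ANOVA F statistic; each group is a list of scores."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_obs = np.concatenate(groups)
    grand = all_obs.mean()
    k = len(groups)
    n = len(all_obs)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def cohens_d(a, b):
    """Between-group effect size with pooled SD, interpreted against
    the pre-specified cut-offs 0.20 (small), 0.50 (moderate), 0.80 (large)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                      (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled
```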

Ability to detect change

Ability to detect change was assessed for the HESD Itch (weekly average), HESD Pain (weekly average) and HESD score (weekly average) using data from Baseline to Week 16 in the ability to detect change analysis population.

Within-group effect sizes [32, 33] and between-group one-way ANOVA F-test were calculated to evaluate the magnitude and significance of differences in change scores between each group, respectively.

Patients were categorized into ‘improved’, ‘no change’ and ‘worsened’ groups as follows:

PaGA, IGA-CHE, PGI-S Itch, PGI-S Pain, HESD PGI-S:

 Improved: ≥ 1-level improvement

 Stable: No change

 Worsened: ≥ 1-level worsening

PGI-C Itch, PGI-C Pain, HESD PGI-C:

 Improved: ‘Much better’ or ‘A little better’

 Stable: ‘No change’

 Worsened: ‘Much worse’ or ‘A little worse’
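The grouping rules above can be written as two small mappings. These are a sketch of the stated rules only; the sign convention for level changes (negative = improvement) is an assumption, not stated in the source:

```python
def change_group_from_pgic(pgic_response):
    """Map a PGI-C response to the improved / stable / worsened
    grouping used in the ability-to-detect-change analysis."""
    if pgic_response in ("Much better", "A little better"):
        return "improved"
    if pgic_response == "No change":
        return "stable"
    if pgic_response in ("A little worse", "Much worse"):
        return "worsened"
    raise ValueError(f"unknown PGI-C response: {pgic_response!r}")

def change_group_from_level(change_in_levels):
    """Map a change in PGI-S / PaGA / IGA-CHE level (follow-up minus
    baseline; negative assumed to mean improvement) to the grouping."""
    if change_in_levels <= -1:
        return "improved"
    if change_in_levels == 0:
        return "stable"
    return "worsened"
```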

Interpretation of scores
Anchor-based methods

Anchor-based analyses were performed in the psychometric analysis population for the HESD Itch (weekly average), HESD Pain (weekly average) and HESD score (weekly average) using data on change from Baseline to Week 16.

First, the suitability of proposed anchors was tested using a polyserial correlation coefficient to establish the relationship between the anchor categories and change in HESD scores. Anchors with correlations of < 0.3 were not taken forward for analysis [34].

Within-individual change thresholds were recommended by considering the mean change in PROM score for participants classified as either ‘moderately’ or ‘minimally’ improved according to PGI-S (Itch, Pain and HESD), PGI-C (Itch, Pain and HESD), PaGA and IGA-CHE.

Empirical cumulative distribution function (eCDF) and probability density function (PDF) plots were used to allow the various proposed responder definitions to be evaluated simultaneously.

Between-group differences in mean change PROM score were calculated for participants as defined above. The minimal important difference (MID) estimate for each anchor was defined as the difference in mean change score between the minimal improvement and no change groups.

Additionally, to guide triangulation of estimates (but not explicitly define them), a correlation-weighted average was calculated, where estimates were weighted by the observed correlations between change in anchor and score as follows:

$$M_{\text{weighted}} = \frac{\sum_{i=1}^{n} |r_i|\, x_i}{\sum_{i=1}^{n} |r_i|}$$

where $x_i$ denotes each estimate and $|r_i|$ the absolute correlation coefficient of the corresponding anchor–scale combination, for each $i$ of $n$ total estimates. Fisher’s z transformation was applied to the correlation coefficients [35].
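The correlation-weighted average can be sketched directly from the formula (applying Fisher's z, i.e. arctanh, to the absolute correlations before weighting is one reading of the text; the function name is illustrative):

```python
import math

def weighted_mid(estimates, correlations, fisher_z=True):
    """Correlation-weighted average of anchor-based MID estimates:
    M = sum(|r_i| * x_i) / sum(|r_i|), optionally transforming the
    weights with Fisher's z (arctanh) first.

    Note: arctanh is unbounded as |r| -> 1, so perfect correlations
    would dominate (or break) the weighting.
    """
    weights = []
    for r in correlations:
        w = abs(r)
        if fisher_z:
            w = math.atanh(w)  # Fisher's z transformation
        weights.append(w)
    num = sum(w * x for w, x in zip(weights, estimates))
    return num / sum(weights)
```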

Distribution-based methods

Distributional properties of the HESD Itch (weekly average), HESD Pain (weekly average) and HESD score (weekly average) were used to provide an indication of the amount of change beyond measurement error that may be considered meaningful.

Estimates were calculated as 0.5 of the standard deviation (SD) at baseline [36, 37] and the standard error of measurement (SEM). The SEM was calculated as the baseline SD multiplied by the square root of one minus the reliability of the score at baseline, SEM = SD × √(1 − r). The ICC calculated as part of the test-retest reliability analyses using the PGI-S anchor was used as the reliability coefficient.
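The two distribution-based estimates are straightforward to compute; a minimal sketch (using the sample SD at baseline and the test-retest ICC as the reliability coefficient, per the text; the function name is illustrative):

```python
import math

def distribution_based_thresholds(baseline_scores, test_retest_icc):
    """Distribution-based change estimates: half the baseline SD,
    and the standard error of measurement SEM = SD * sqrt(1 - r),
    with the test-retest ICC as the reliability coefficient r."""
    n = len(baseline_scores)
    mean = sum(baseline_scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in baseline_scores) / (n - 1))
    return {"half_sd": 0.5 * sd,
            "sem": sd * math.sqrt(1 - test_retest_icc)}
```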