. 2024 Mar 20;316(4):110. doi: 10.1007/s00403-024-02818-3

Table 2.

Summary of psychometric analyses in the phase 3 clinical trial

Analysis	Description
Test–retest reliability	Test–retest reliability evaluated consistency in scores between Weeks 2 and 4 and between Weeks 4 and 8 in a subset of subjects defined as having ‘stable’ CHE severity using other trial measures (detailed below). Test–retest reliability was evaluated through calculation of Cohen’s Kappa (k) coefficient with quadratic weighting for subjects defined as stable
	The following cutoffs were employed to interpret the kappa values: > 0.75 indicated excellent agreement, 0.40–0.75 indicated good–fair agreement, and < 0.40 indicated poor agreement [16]
	Stability was defined based on subjects with: no change on the PaGA between Weeks 2 and 4; no change on the PaGA between Weeks 4 and 8; no change on the HESD PGI-S between Weeks 2 and 4; no change on the HESD PGI-S between Weeks 4 and 8; change on the HECSI of < 0.50 of the Baseline standard deviation (SD) between Weeks 2 and 4; change on the HECSI of < 0.50 of the Baseline SD between Weeks 4 and 8
Convergent validity	Convergent validity of the IGA–CHE was evaluated using data collected at Week 4, by examining correlations with the PaGA, HESD PGI-S, and HECSI scores
	When evaluating convergent validity, scores assessing similar or related concepts are expected to be at least moderately correlated. It was hypothesized that all of the above concurrent measures would correlate at ≥ 0.40 with the IGA–CHE [17]
	Correlation size was interpreted as: correlations of < 0.50 were defined a priori as ‘weak’, those ≥ 0.50 and < 0.70 as ‘moderate’, those ≥ 0.70 and < 0.90 as ‘strong’, and those ≥ 0.90 were considered ‘very strong’ [18]
	Week 4 was chosen for the convergent validity and known groups validity (see row below) analyses as it was expected that there would be a greater distribution of scores across the IGA–CHE scale than at Baseline (when trial inclusion criteria required that all subjects would be at the upper end of the response scale)
Known-groups validity	Construct validity was also assessed using the known-groups method to evaluate differences in scores among groups of patients who differ on variables hypothesized to influence the construct of interest
	Again, this analysis was performed at Week 4
	CHE severity groups for comparison were defined by responses to the PaGA (comparison of patients scoring: 0–1 = ‘clear or almost clear’, 2 = ‘mild’, 3 = ‘moderate’, and 4 = ‘severe’) and HESD PGI-S (comparison of patients scoring: ‘none’, ‘mild’, ‘moderate’, and ‘severe’)
	The magnitude of differences between the groups was evaluated using between-group effect size estimates, calculated using the pooled standard deviation as the denominator, and based on the differences between each adjacent pair of groups [19]. Use of the pooled SD assumed both groups have similar variance
	The following cutoffs were used to interpret the magnitude of each effect size: small difference = 0.20, moderate difference = 0.50, large difference = 0.80 [20]
	The F test from a one-way ANOVA (for comparisons of more than two groups) and Fisher’s exact test were used to evaluate whether differences among the groups were statistically significant (p ≤ 0.05)
Ability to detect change	Ability to detect change assesses whether a score fluctuates in line with true change in the construct it measures
	Changes in IGA–CHE scores from baseline to Week 16 were compared between groups defined as ‘improved’, ‘stable’, and ‘worsened’ based on changes in PaGA, HESD PGI-S, HESD PGI-C, and HECSI scores
	Within- and between-group effect sizes and a between-group one-way ANOVA F test were calculated to evaluate the magnitude and significance of the differences in change scores within and between these groups, respectively
	Patients were categorized into ‘improved’, ‘stable’, and ‘worsened’ as follows:
	PaGA and HESD PGI-S: Improved: ≥ 1 grade improvement; Stable: no change; Worsened: ≥ 1 grade worsening. HESD PGI-C: Improved: ‘a little better’ or ‘much better’; Stable: ‘no change’; Worsened: ‘a little worse’ or ‘much worse’. HECSI: Improved: subjects who have a HECSI improvement ≥ 0.50 Baseline SD; Stable: subjects who have a HECSI change score < 0.50 Baseline SD; Worsened: subjects who have a HECSI worsening ≥ 0.50 Baseline SD
	The between-group effect sizes were calculated and interpreted as described for the known-groups analysis. Within-group effect sizes [21] were calculated as the mean change score divided by the SD of the score at the earlier of the two timepoints. The same thresholds as in the known-groups analysis were used to interpret the changes within groups and the differences in changes between groups
Interpretation of scores: anchor-based analyses to inform within-subject meaningful change thresholds	Score interpretation characterizes how meaning is attributed to observed changes and differences in scores, beyond that provided for by statistically significant results
	In anchor-based approaches to defining meaningful change thresholds, an external indicator is used to identify subjects who have experienced an improvement in the concept being measured
	The suitability of the proposed anchors was tested by examining the correlation of the change in anchor and change in IGA–CHE scores. Anchors with correlations of < 0.30 were not taken forward for analysis [22].
	Thresholds for within-subject and between groups meaningful change were estimated by calculating the mean changes in IGA–CHE scores for subjects classified as ‘moderately improved’ or ‘minimally improved’ based on the following anchors: PaGA, HESD PGI-S, HESD PGI-C, HECSI-75 (subjects who improved in their HECSI scores by 75%), and HECSI-90 (subjects who improved in their HECSI scores by 90%)
	Estimates were plotted on a forest plot to visualise the range of estimates and identify a plausible range of values for meaningful change
	A correlation-weighted average with Fisher’s Z transformation (considering the strength of each anchor’s correlation with the target score) was used to identify a single value [23]
	Analyses were conducted for change from Baseline to Week 16
	In addition, a cross-tabulation of IGA–CHE scores with PaGA scores was performed at Baseline, Week 8 and Week 16 to further aid score interpretation; this was the only post-hoc analysis
Interpretation of scores: distribution-based analyses	In addition to the anchor-based analyses, distribution-based analyses involved using the distributional properties of the IGA–CHE score to provide an indication of the amount of change beyond measurement error that may be considered meaningful
	This involved calculation of 0.5 of the SD at Baseline and the standard error of measurement (SEM) [24, 25]. The SEM was calculated as the SD at Baseline multiplied by the square root of one minus the reliability of the score at Baseline [SEM = SD × √(1 – r)]. The Kappa coefficient calculated within the test–retest analyses using the HESD PGI-S anchor (Weeks 2–4) was used for the reliability coefficient. A value of ‘1 SEM’ was used as the estimate of the meaningful change threshold

ANOVA analysis of variance, CHE Chronic Hand Eczema, HECSI hand eczema severity index, HESD PGI-C hand eczema symptom diary patient global impression of change, HESD PGI-S hand eczema symptom diary patient global impression of severity, IGA–CHE Investigator Global Assessment of Chronic Hand Eczema, PaGA patient global assessment of disease severity
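
As a rough illustration of three calculations summarized in Table 2 (Cohen's kappa with quadratic weighting, the between-group effect size with a pooled SD denominator, and the SEM), a minimal Python sketch is given here. It is not the trial's analysis code: the function names and all example inputs are hypothetical, and the kappa interpretation assumes the lowest band is read as below 0.40.

```python
import math
from statistics import mean, variance


def quadratic_weighted_kappa(ratings_1, ratings_2, categories):
    """Cohen's kappa with quadratic weights for paired ordinal ratings.

    `categories` lists the ordered response levels (e.g. 0..4).
    Undefined (division by zero) if all ratings fall in one category.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_1)
    # Observed joint proportions of (first rating, second rating).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_1, ratings_2):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]
    return 1.0 - weighted_observed / weighted_expected


def interpret_kappa(kappa):
    """Bands from the table; the lowest band is assumed to be < 0.40."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "good-fair"
    return "poor"


def pooled_sd_effect_size(group_1, group_2):
    """Difference in means divided by the pooled SD of the two groups."""
    n1, n2 = len(group_1), len(group_2)
    pooled_var = ((n1 - 1) * variance(group_1)
                  + (n2 - 1) * variance(group_2)) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / math.sqrt(pooled_var)


def standard_error_of_measurement(sd_baseline, reliability):
    """SEM = SD x sqrt(1 - r), with r the reliability coefficient."""
    return sd_baseline * math.sqrt(1 - reliability)


if __name__ == "__main__":
    # Hypothetical scores for illustration only, not trial data.
    week_2 = [0, 1, 2, 3, 4, 2, 3]
    week_4 = [0, 1, 2, 3, 4, 3, 3]
    kappa = quadratic_weighted_kappa(week_2, week_4, range(5))
    print(round(kappa, 3), interpret_kappa(kappa))
    print(pooled_sd_effect_size([2, 4, 6], [1, 3, 5]))
    print(standard_error_of_measurement(1.0, 0.75))
```

The pooled-SD helper assumes the two groups have similar variance, as the table notes. The resulting effect sizes would be read against the 0.20, 0.50, and 0.80 cutoffs listed there.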