Supplemental Digital Content is Available in the Text.
Keywords: outcomes, evaluation, psychometrics, measurement, assessment
Abstract
Introduction:
Traditional evaluation models, often linear and outcome-focused, are increasingly inadequate for the complexities of modern medical education, which demands more comprehensive and nuanced assessment approaches.
Methods:
A standardized continuing professional development activity evaluation instrument was developed and implemented. An iterative process was performed, using a repeat Rasch analysis, to improve reliability of the evaluation instrument. Category Probability Curves and Test Information Function were generated by the Rasch analysis to refine the construction of the assessment. All educational activities completed between 2022 and 2024 were eligible for inclusion. The study incorporated a diverse range of educational activities and included multiple health care professions.
Results:
The pilot analysis included 250 educational activities with 26,554 individual learners completing evaluations for analysis. Initial Rasch findings demonstrated a need to remove redundancies and change from a five-point to a four-point rating scale. The final instrument validation included 21 activities and 529 learners. Reliability improved after modifications, with Cronbach alpha increasing from 0.72 to 0.80.
Discussion:
Use of psychometrics to improve assessments can yield a more reliable and less redundant evaluation instrument. This research demonstrates a psychometrically informed, flexible evaluation tool that can inform future educational efforts and serve as a data-driven metric to enhance the quality of continuing professional development programs.
In the rapidly evolving landscape of medical education, particularly within Continuing Medical Education (CME) and Continuing Professional Development (CPD), there is a pressing need for innovative and effective evaluation methods and outcomes. The Accreditation Council for Continuing Medical Education (ACCME) and the Joint Accreditation for Interprofessional Education have been instrumental in setting standards in this field.1,2 In addition, the application of Moore's framework, ranging from participation to changes in patient health, has become a benchmark in evaluating CME effectiveness.3 Despite this, there is a notable gap in the exploration of methodologies addressing the validity and reliability of these assessment tools. Furthermore, the adaptation of educational activities to diverse medical specialties and providers calls for tailored evaluation approaches.4
Validity and reliability are seldom discussed in research articles that report CME/CPD activity satisfaction or evaluation data.5 However, psychometric advancements in medical education assessments have increasingly been recognized for minimizing measurement biases, ensuring fairness and accuracy in evaluations.6 Still, few studies delve into robust item response theory models, such as the Rasch model, which enhances the interpretation of CME evaluation data and is commonplace in educational measurement research. This model, rooted in psychometric theory, provides a modern approach to analyzing assessment data, facilitating the development of reliable and valid measures.7
Our study aimed to fill these gaps by examining a unified psychometric evaluation tool's efficacy in CPD.8 By using the Rasch model, this study seeks to enhance the validity and reliability of evaluation assessment tools, aligning with Joint Accreditation standards and interprofessional education goals. This approach contributes significantly to medical education by offering a more comprehensive evaluation method, thus enhancing the measurement of satisfaction and quality of CME/CPD programs.
METHODS
Instrument Development and Pilot Rasch Analysis
Key stakeholders, composed of educational experts and staff within the organization, collaborated to develop a single evaluation form generated from the best parts of all available evaluation forms within the organization. Before consensus, five forms in circulation were reviewed. Items were assessed, with considerations for discernability, readability, and terminology. To test the instrument, we used the Rasch measurement model for both initial pilot analysis and final instrument validation. The Rasch calibration provides latent trait estimates of where each item and person falls on a single unidimensional measurement scale. These estimates are expressed in logits, which are log-odds units that relate each respondent's estimated latent satisfaction to the probability of selecting each agreement level.
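For reference, one common polytomous Rasch formulation, the Andrich rating scale model, is written out below. The article does not state which polytomous Rasch model was fitted, so this form and its symbols are assumptions made for illustration: \theta_n is a learner's latent satisfaction, \delta_i is the difficulty of endorsing item i, \tau_k is the k-th category threshold, and M is the highest category. All are in logits.

```latex
\ln\!\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \tau_k ,
\qquad
P_{nik} = \frac{\exp\left(\sum_{j=1}^{k} (\theta_n - \delta_i - \tau_j)\right)}
               {\sum_{m=0}^{M} \exp\left(\sum_{j=1}^{m} (\theta_n - \delta_i - \tau_j)\right)} ,
\qquad k = 0, 1, \dots, M
```

The empty sum for k = 0 (and m = 0) is taken to be zero. The first expression shows why the unit is a log-odds: a one-logit increase in \theta_n raises the log-odds of choosing category k over category k-1 by one.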
Final Instrument Validation
The pilot analysis was used to evaluate the success of the standardized template and identify opportunities for improvement to the instrument. This was performed on learner responses collected in 2023. After adjustments to the evaluation tool, a repeat Rasch analysis, using the methods outlined below, was performed on data collected in March and April 2024. This research was found exempt by Sterling IRB (IRB #11700).
Statistical Analysis
Winsteps (version 5.7.1.0) Rasch Measurement software was used to analyze the data. The Rasch model facilitated the computation of person (learner) satisfaction measures and the calibration of item difficulties. We then used Category Probability Curves and a Test Information Function (TIF) generated by the Rasch analysis to refine the construction of the assessment. The TIF illustrated the precision of our test across the satisfaction spectrum, while the probability curves informed the fine-tuning of our response scales.
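A minimal sketch of how category probabilities and a Test Information Function can be computed, assuming an Andrich rating scale model. This is not a reproduction of the Winsteps computation. The function names, item difficulties, and thresholds are hypothetical and are not estimates from this study.

```python
import math

def category_probabilities(theta, delta, thresholds):
    """Rating scale model probabilities for categories 0..len(thresholds)."""
    # Cumulative sums of (theta - delta - tau_k); category 0 has an empty sum of 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - delta - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def item_information(theta, delta, thresholds):
    """In Rasch models, item information is the variance of the item score at theta."""
    probs = category_probabilities(theta, delta, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))

def total_information(theta, deltas, thresholds):
    """Test information is the sum of item information across all items."""
    return sum(item_information(theta, d, thresholds) for d in deltas)

# Hypothetical calibration values in logits (illustration only).
item_difficulties = [-1.0, -0.5, 0.0, 0.5]
thresholds = [-1.5, 0.0, 1.5]

for theta in (-2.0, 0.0, 2.0, 4.0):
    print(theta, round(total_information(theta, item_difficulties, thresholds), 3))
```

Plotting `total_information` over a grid of theta values gives a curve of the kind shown in a TIF graph. Where the curve is low, the instrument separates respondents less precisely.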
Reliability and Item Efficiency
From our Rasch analysis, we also obtained summary statistics to assess the reliability of the evaluation measure, which included an average measure, SD, and a reliability coefficient. In addition, we examined the largest standardized residual correlations between items to ensure the efficiency of our instrument by identifying any potential redundancies, as indicated by high residual correlations between items.
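As an illustration of the internal consistency coefficient reported later in Table 3, the sketch below computes Cronbach alpha from a complete response matrix using the standard formula. It assumes no missing responses, so it may not match the software's computation exactly. The data are invented.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one row per respondent, each row a list of item scores."""
    n_items = len(responses[0])
    # Transpose the matrix so each tuple holds all responses to one item.
    item_columns = list(zip(*responses))
    item_variances = [variance(col) for col in item_columns]
    total_scores = [sum(row) for row in responses]
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / variance(total_scores))

# Hypothetical ratings from five respondents on three items (illustration only).
responses = [
    [4, 4, 3],
    [3, 3, 3],
    [4, 3, 4],
    [2, 2, 1],
    [3, 4, 4],
]
print(round(cronbach_alpha(responses), 2))
```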
Transforming Measures
Finally, we transformed the Rasch logit estimates into scaled scores to facilitate interpretability. A linear transformation of the logit metric produced in the Rasch calibration placed person scores on a scale ranging from 0 to 100 to facilitate score use and reporting. This allowed us to compare the effectiveness of various educational activities and to identify which ones were most successful in achieving high learner satisfaction, guiding improvements in educational content delivery.
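A linear rescaling of this kind can be sketched as follows. The article does not report the slope, intercept, or anchor points that were used, so mapping the lowest and highest observed person logits to 0 and 100 is an assumption. The logit values are hypothetical.

```python
def make_logit_scaler(min_logit, max_logit, low=0.0, high=100.0):
    """Return a function that maps logits linearly onto the range [low, high]."""
    slope = (high - low) / (max_logit - min_logit)
    intercept = low - slope * min_logit

    def scale(logit):
        return slope * logit + intercept

    return scale

# Hypothetical person measures in logits (illustration only).
person_logits = [-3.2, -0.4, 1.1, 2.7, 5.8]
scale = make_logit_scaler(min(person_logits), max(person_logits))
print([round(scale(x), 1) for x in person_logits])
```

Because the transformation is linear, it preserves the interval properties of the logit scale. Differences between scaled scores remain proportional to differences in logits.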
RESULTS
Pilot Analysis
A pilot test was performed, and the collaborative instrument was used to evaluate a broad variety of learning activities (n = 250), including many content types and specialty areas; 26,554 individuals completed the evaluation form between December 28, 2022, and September 19, 2023. Most of the participants were advanced practice nurses (41%), physicians (18%), and nurses (15%). Participants used a 1 to 5 agreement scale to rate various facets of the educational activity, including content clarity, evidence base, relevance, knowledge enhancement, and appropriateness.
The overall category frequencies summarized in Table 1 for responses to the questions measuring agreement show that most respondents (67%) indicated “Strongly Agree,” followed by 25% who selected “Agree.” For the Practice Change Response Set, the data show that 47% of respondents plan to implement changes, while 45% believe that their current practice has been reinforced.
TABLE 1.
Overall Category Frequencies
| Scale Options | No. (%) |
| 2023 Agreement Scale | |
| Strongly agree | 142,705 (67) |
| Agree | 53,263 (25) |
| Neutral | 13,696 (6) |
| Disagree | 733 (0) |
| Strongly disagree | 2832 (1) |
| 2024 Agreement Scale | |
| Strongly agree | 3559 (61) |
| Agree | 1914 (33) |
| Somewhat agree | 304 (5) |
| Disagree | 42 (1) |
| 2023 Practice Change | |
| Yes, I plan to implement changes | 12,445 (47) |
| My current practice has been reinforced | 11,962 (45) |
| I need more information before I will change my practice | 2247 (8) |
| 2024 Practice Change | |
| Yes, I plan to implement changes | 249 (47) |
| My current practice has been reinforced | 228 (43) |
| I need more information before I will change my practice | 52 (10) |
Residual correlations (ie, correlations among items after accounting for variance explained by the latent trait) identified potential redundancies among six items, with correlations ranging from 0.36 to 0.71, suggesting an overlap in the concepts they measure.
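The idea behind a residual correlation can be sketched as follows. The sketch assumes that each learner's model-expected rating and model variance for an item are already available, for example from a fitted Rasch model. All numbers and function names are hypothetical.

```python
import math

def standardized_residuals(observed, expected, variances):
    """(observed - expected) / sqrt(model variance), one value per learner."""
    return [(x - e) / math.sqrt(v) for x, e, v in zip(observed, expected, variances)]

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical data for two items across five learners (illustration only):
# observed ratings, model-expected ratings, and model variances.
item_a = standardized_residuals(
    [3, 2, 3, 1, 2], [2.6, 2.1, 2.8, 1.5, 2.2], [0.5, 0.6, 0.4, 0.6, 0.6]
)
item_b = standardized_residuals(
    [3, 2, 3, 1, 3], [2.5, 2.0, 2.7, 1.4, 2.1], [0.5, 0.6, 0.4, 0.6, 0.6]
)
print(round(pearson(item_a, item_b), 2))
```

A high correlation between two items' residuals suggests that they share variance beyond the latent trait, which is the signal used here to flag possible redundancy.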
Changes Derived From Pilot Analysis
The inability to differentiate among individuals with high levels of satisfaction resulted in a lower reliability at high levels of satisfaction. Furthermore, the lack of endorsement by respondents at the lower end of the rating scale resulted in effectively having a four-point scale instead of the desired five-point. A rating scale change from five-point to four-point agreement scale, with three of the four options being some degree of agreement, was made to improve reliability and differentiation among highly satisfied learners.
Items with high residual correlations were evaluated and questions were modified or removed to reduce redundancies and make room for new questions. The final instrument adjusted the multiselect options and overall terminology to include more appropriate health care terminology, including soft-skill changes, and reduce overlap in selectable options.
Final Instrument Validation
The final instrument had a smaller sample size for validation, although still more than satisfactory for a Rasch calibration; it included 21 activities and a sample size of 529 learners. Sample representation was similar across the two samples (Table 2).
TABLE 2.
Demographics
| Profession | 2023 Sample, No. (%) | 2024 Sample, No. (%) |
| Advanced practice nurse | 10,994 (41) | 164 (31) |
| Physician | 4849 (18) | 114 (22) |
| Nurse | 4125 (15) | 128 (24) |
| Pharmacist | 3561 (13) | 80 (15) |
| Physician associate/physician assistant | 1868 (7) | 27 (5) |
| Pharmacy technician | 781 (3) | 5 (1) |
| Other HCP | 194 (1) | 3 (1) |
| Non-HCP | 105 (0) | 5 (1) |
| Psychologist | 105 (0) | |
| Social work | 41 (0) | |
| Clinical laboratory professional | 17 (0) | |
| Nutrition dietetics | 8 (0) | |
| Dentist | 3 (0) | 2 (0) |
| Optometrist | 1 (0) | |
| Genetic counselor | 1 (0) | |
From the Rasch model, we derived summary statistics for each iteration, as summarized in Table 3, signifying an improved measurement instrument between the pilot and final analysis. The TIF graphs presented in Figure 1 indicate the improved discernment at higher levels of learner satisfaction after implementation. However, the scale still performs best at distinguishing between learners with a lower level of satisfaction.
TABLE 3.
Scale Reliability
| Statistic | 2023 Form | 2024 Form |
| Average Scaled Score | 75.94 | 71.35 |
| Standard deviation | 21.75 | 20.33 |
| Cronbach alpha (internal consistency reliability) | 0.72 | 0.80 |
| Test Information at Average Scaled score | 1.38 | 2.25 |
FIGURE 1.

Test Information Function Changes.
The examination of the Category Probability Curves for the Agreement Response Set helps evaluate the Rasch expectation of monotonicity in the progression of response probabilities.8 This expectation is met when each response option is the most likely to be selected at some point on the latent trait continuum and the peak probabilities are ordered as theoretically expected (ie, from Disagree to Strongly Agree in order of increasing magnitude). The Category Probability Curves for the 2023 and 2024 scales (Fig. 2) indicate improved adherence to this expectation after the transition from a five-point (2023 scale) to a four-point rating scale (2024 scale).
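The modal-category expectation described above can be checked numerically. The sketch below assumes a rating scale model and scans a grid of latent trait values to record which categories are ever the most probable. The threshold values are hypothetical and are not the estimates from the 2023 or 2024 scales.

```python
def modal_category(theta, delta, thresholds):
    """Index of the most probable category at theta under a rating scale model."""
    # Unnormalized log-probabilities; the largest one marks the modal category.
    log_weights = [0.0]
    for tau in thresholds:
        log_weights.append(log_weights[-1] + (theta - delta - tau))
    return log_weights.index(max(log_weights))

def modal_categories(thresholds, delta=0.0, lo=-6.0, hi=6.0, steps=241):
    """Sorted list of categories that are most probable somewhere on the grid."""
    found = set()
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        found.add(modal_category(theta, delta, thresholds))
    return sorted(found)

# Hypothetical thresholds in logits for a five-category scale (illustration only).
ordered = [-2.0, -0.5, 0.5, 2.0]
disordered = [-2.0, 0.5, -0.5, 2.0]

print(modal_categories(ordered))
print(modal_categories(disordered))
```

With the ordered thresholds every category is modal somewhere on the continuum. With the disordered thresholds one middle category never is. That pattern is the kind that can motivate collapsing or removing a response option.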
FIGURE 2.

2023 and 2024 Scale Category Probabilities.
The review of the person-item distribution map (see Supplementary Appendix 1, Supplemental Digital Content 1, http://links.lww.com/JCEHP/A354) reveals a common challenge with education satisfaction measurement. In our experience, continuing education learners tend to be quite satisfied with the education they consume. This appears in the person-item distribution map as a large number of people whose ability (satisfaction) estimates are far higher than the highest item. This creates challenges for differentiating among learners with very high satisfaction, but such scales can still be used to identify poorly performing learning interventions and differentiate them from high-performing activities. Still, the observed shift to the right of the TIF graph in Figure 1 indicates that changes in the scale were somewhat successful in introducing items that were harder to endorse (ie, required more latent satisfaction).
DISCUSSION
Implications of Findings
The novel evaluation tool we have developed in our study has far-reaching implications for CME/CPD. Grounded in psychometric theory and using the Rasch model, it significantly improves on traditional evaluation tools. By offering a validated assessment tool, we can reliably capture and address the reported complexities inherent in health care, especially in diverse professions or specialties and interprofessional education contexts.
Conducting a pilot analysis allowed for insights that were pivotal for refining the evaluation tool. The ability to set a benchmark and examine relationality highlighted high residual correlations, allowing for purposeful modification of the pilot instrument to eliminate potential redundancy. In addition, the improvement seen during the repeat TIF analysis indicates improvement in the differentiation of levels of satisfaction (Figure 1). More information indicates greater differentiation between respondents' levels of satisfaction. Whereas both scales have peaks of information at the low end of the scale, the 2024 version of the scale successfully moved some of the information to higher score points. This helps the metric detect smaller differences between respondent satisfaction levels.
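The link between information and precision can be made concrete. In Rasch measurement, the standard error of a person measure is approximately the reciprocal of the square root of the test information at that point. The sketch below applies this relation to the two test information values in Table 3, assuming they are expressed on the logit metric, which the article does not state. The function name is illustrative.

```python
import math

def standard_error(information):
    """Approximate standard error of measurement from test information."""
    return 1.0 / math.sqrt(information)

# Test information at the average scaled score, taken from Table 3.
for form, info in (("2023 form", 1.38), ("2024 form", 2.25)):
    print(form, round(standard_error(info), 2))
```

Under that assumption, the higher information for the 2024 form corresponds to a smaller standard error near the average score. This is what allows the metric to detect smaller differences between respondents.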
In education and social sciences, debate continues over the appropriate number of rating options in educational research.9 The Rasch framework, however, stresses the importance of making scaling decisions based on how well the data fit the model, and our analysis of the data led to the decision to remove a rarely used category (Figure 2). The Category Probability Curves for the 2023 Agreement Response Set show that the “Disagree” option is never the most probable choice at any measured person ability level. This observation suggests that respondents have a lower propensity to select “Disagree” than other response options, irrespective of their position on the latent trait continuum. Consequently, the absence of a peak for the “Disagree” response may imply potential redundancy in this option. The transition to a four-point rating scale improved the ability to discern different levels of agreement among respondents and created a more efficient measurement tool. In the 2024 curves, each category is the most probable response at some point along the ability continuum. This demonstrates that each option captures a distinct level of the respondents' agreement, with clear peaks indicating the most probable response based on the respondents' latent trait levels. The 2024 scale category probability curves showed the monotonicity expected by the Rasch model.
Integration With Recent Studies
Use of psychometric analysis is common in formal standardized testing and in educational assessment. However, published use in the CPD space is limited, and a recent call to action provides multiple use cases for Rasch analysis in CPD.7 Ramazanzadeh et al10 published a study on developing a clinical reasoning rubric in nursing, tailored to specific educational objectives. Their method was similar to ours, including initial rubric creation by key stakeholders, pilot testing, and a validation process. Inter-rater intraclass correlation coefficients were calculated to assess reliability. Although that study was conducted with undergraduate nursing students, the authors illustrate the importance of context-specific tools, resonating with our results showcasing applicability across various medical professions and specialties.
A review of the CME/CPD literature yields a single study describing creation and validation of a CME evaluation instrument.11 Authors report creation of a physician-specific evaluation instrument, grounded in perceptions of service quality using the SERVQUAL instrument. Additional refinement was conducted using physician feedback and several iterations were considered during the pilot testing. The use of this validated instrument has been well-established but does include some limitations. The sample only included physician learners and feedback, instead of the interprofessional cohort presented in our study. In addition, the process reported here included a larger sample size and more comprehensive psychometrics for analysis, building on the aforementioned study. As we continue to depend on interdisciplinary and interprofessional education to improve patient outcomes, education and assessment tools must evolve to consider the varied needs of these learners. Our findings help move the needle forward on the importance of continued development and assessment of evaluation tools in CME/CPD.
The findings from this research can be applicable across the CME/CPD space, irrespective of provider. Partnership with a psychometrician or individuals with formal educational measurement and assessment training can provide robust analysis for similar efforts. Ultimately, the varied performance of different educational activities, as indicated by the scaled score distribution example from applying the scaled score across many activities (see Supplementary Appendix 2, Supplemental Digital Content 1, http://links.lww.com/JCEHP/A354), highlights the tool's potential in identifying both highly effective and underperforming activities. This differentiation is vital for organizations to continuously improve their educational offerings. Finally, the use of these data can feed development of other metrics, such as a faculty scorecard to help identify impactful speakers or testing of new metrics or business analytics to feed growth in specific areas, nuanced by content type, health care profession or specialty, and years in practice. When scaled scores, such as those developed in this article, are consistently applied to a learning portfolio, an organization can truly move toward data-driven decision-making.
Our study marks a significant advancement in CME/CPD evaluations, presenting a psychometrically informed tool that aligns with ACCME standards and interprofessional education goals. The insights from both historical and recent research can guide the refinement of our tool for broader applications. Future research might focus on optimizing the tool for a wider range of satisfaction levels and exploring its applicability in diverse cultural and educational contexts. Longitudinal studies examining the long-term impact of educational activities on practice change would be valuable in confirming the tool's effectiveness in enhancing patient care outcomes. Nontraditional metrics are also worth considering. The Net Promoter Score is a common customer service score that helps standardize satisfaction, and capturing it during educational programming may offer another lens through which to examine satisfaction data.12
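For readers unfamiliar with the metric, the conventional Net Promoter Score calculation is sketched below. It uses the standard 0 to 10 likelihood-to-recommend scale, on which ratings of 9 or 10 count as promoters and ratings of 0 to 6 as detractors. The ratings are invented and are not data from this study.

```python
def net_promoter_score(ratings):
    """Percentage of promoters (9-10) minus percentage of detractors (0-6)."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)

# Hypothetical likelihood-to-recommend ratings (illustration only).
ratings = [10, 9, 8, 7, 6, 10, 9, 3, 8, 10]
print(net_promoter_score(ratings))
```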
CONCLUSION AND FUTURE DIRECTIONS
This research documents several aspects of content and construct validity for our measure. Other elements of validity were not applied or discussed, as our approach was designed to be iterative and incremental. Future work should consider unidimensionality, local item independence, and group invariance. Our hope is that we learn what is successful with each iteration and apply those learnings to future enhancements. Future studies should explore additional aspects of validity and new applications for validated metrics such as this one.
In summary, research to date suggests the necessity of diverse, innovative, and adaptable evaluation tools in medical education. While our research contributes a psychometrically robust and versatile tool, the integration of different evaluation methods should be performed to create a more rounded and comprehensive understanding of program effectiveness. This multifaceted approach ensures that our evaluation methods remain relevant and effective in addressing the complex and evolving needs of modern health care professional education.
Lessons for Practice
■ Application of educational measurement, such as the Rasch model, can be used to improve reliability of CPD assessment tools.
■ Health care education providers can use psychometrics to enhance the rigor of educational measurement and quality assurance.
ACKNOWLEDGMENTS
The authors thank Summer Alvarez for her contributions to this research.
Footnotes
Disclosures: The authors declare no conflict of interest.
This research was approved and found exempt by Sterling IRB (ID#11700).
Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal's Web site (www.jcehp.org).
Contributor Information
Anthony Gage, Email: agage@cealliance.com.
Sarah A. Nisly, Email: snisly@cealliance.com.
REFERENCES
- 1. Standards for integrity and independence in accredited continuing education. ACCME. Available at: https://accme.org/accreditation-rules/standards-for-integrity-independence-accredited-ce. Accessed August 16, 2023.
- 2. Accreditation Council for Continuing Medical Education. Standards for integrity and independence in accredited continuing education; 2020. Available at: https://accme.org/wp-content/uploads/2020/12/884_20241028_standardsforintegrityandindependenceinaccreditedcontinuingeducation-1.pdf. Accessed March 4, 2024.
- 3. Moore DE, Green JS, Gallis HA. Achieving desired results and improved outcomes: integrating planning and assessment throughout learning activities. J Contin Educ Health Prof. 2009;29:1–15.
- 4. Yong E, Manoharan K, Gent D. The European examination in core cardiology in focus: evaluation and recommendations using educational theory. J Eur CME. 2022;11:2055266.
- 5. Ratanawongsa N, Thomas PA, Marinopoulos SS, et al. The reported validity and reliability of methods for evaluating continuing medical education: a systematic review. Acad Med. 2008;83:274–283.
- 6. Tavakol M, O'Brien D. Psychometrics for physicians: everything a clinician needs to know about assessments in medical education. Int J Med Educ. 2022;13:100–106.
- 7. Farlie M, Johnson C, Wilkinson T, et al. Refining assessment: Rasch analysis in health professional education and research. Focus Health Prof Educ A Multi-Professional J. 2021;22:88–104.
- 8. Smith EV, Smith RM, eds. Introduction to Rasch Measurement: Theory, Models and Applications. Maple Grove, MN, USA: JAM Press; 2004.
- 9. Kusmaryono I, Wijayanti D, Maharani HR. Number of response options, reliability, validity, and potential bias in the use of the Likert scale education and social science research: a literature review. Int J Educ Methodol. 2022;8:625–637.
- 10. Ramazanzadeh N, Ghahramanian A, Zamanzadeh V, et al. Development and psychometric testing of a clinical reasoning rubric based on the nursing process. BMC Med Educ. 2023;23:98.
- 11. Shewchuk RM, Schmidt HJ, Benarous A, et al. A standardized approach to assessing physician expectations and perceptions of continuing medical education. J Contin Educ Health Prof. 2007;27:173–182.
- 12. Lucero KS. Net promoter score (NPS): what does net promoter score offer in the evaluation of continuing medical education? J Eur CME. 2022;11:2152941.
