2024 Sep 6;16(9):5718–5726. doi: 10.21037/jtd-24-841

Table 1. Different sources of validity evidence based on Messick’s validity framework (19).

| Source of evidence for validity and description | Validity evidence for AIBA |
| --- | --- |
| Content: the test content should measure what it is supposed to measure | Three pulmonary consultants (A.O.N., C.S.A. and S.S.) with more than 40 years’ combined experience in bronchoscopy, a thoracic surgeon and professor of medical education with a PhD in bronchoscopy assessment (L.K.), and two doctors with research experience in bronchoscopy education (K.M.C. and K.A.) hypothesized that simulated bronchoscopy performance can be assessed automatically using AIBA, relying on the following outcome measures: DC, SP, PT, and MIT |
| Response process: integrity of data should always be maintained. Test administration should be controlled or standardized to the greatest extent possible | All procedures were performed in a controlled, simulated environment, making the tests comparable because all participants used the same scope, monitor and phantom. For data integrity and to avoid bias, all recordings were automatically rated by AIBA. All videos were additionally rated in a blinded fashion by two expert bronchoscopists (A.O.N. and C.S.A.) using an established rating tool (11) |
| Internal structure: this refers to the reliability of the test results. The outcome measures should correlate with one another | DC correlated significantly with SP (Pearson’s r=0.75, P<0.001). DC did not correlate significantly with PT (r=0.22, P=0.11), nor did SP (r=−0.09, P=0.51). DC correlated significantly with MIT (r=−0.55, P<0.001), as did SP (r=−0.51, P<0.001) |
| Relationship to other variables: assessment scores should correlate with known measures of competence: AIBA should correlate with the experts’ ratings | All four outcome measures of AIBA correlated significantly with the experts’ anatomy rating: DC (Pearson’s r=0.47, P<0.001), SP (r=0.57, P<0.001), PT (r=−0.32, P=0.02) and MIT (r=−0.55, P<0.001), and with the experts’ dexterity rating: DC (r=0.38, P=0.006), SP (r=0.53, P<0.001), PT (r=−0.34, P=0.01) and MIT (r=−0.47, P<0.001) |
| Consequences: the consequences of testing relate to the pass/fail standard that is set | The pass/fail criterion of 8 points on the anatomy rating resulted in 30 participants failing and 22 passing. The participants who passed performed significantly better on all four outcome measures: DC (P=0.01), SP (P=0.004), PT (P=0.03), MIT (P<0.001) |

AIBA, artificial intelligence bronchoscopy assessment; DC, diagnostic completeness; SP, structured progress; PT, procedure time; MIT, mean intersegmental time.
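
The correlation coefficients reported in Table 1 can be illustrated with a short sketch. This is a minimal example, not the authors' analysis code: it shows how Pearson's r is computed from paired scores and how a reported r converts to a t statistic for testing against zero correlation. The function names `pearson_r` and `t_statistic` and the toy scores are hypothetical, and the sample size n = 52 is an assumption inferred from the 30 failing and 22 passing participants.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))  # sum of cross-deviations
    var_x = sum(a * a for a in dx)            # sum of squared deviations in x
    var_y = sum(b * b for b in dy)            # sum of squared deviations in y
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical toy scores for illustration only, NOT data from the study.
dc_toy = [10, 12, 14, 15, 18]        # e.g. diagnostic completeness
mit_toy = [9.0, 8.0, 6.5, 6.0, 4.0]  # e.g. mean intersegmental time
r_toy = pearson_r(dc_toy, mit_toy)   # negative: higher DC pairs with lower MIT

# Reported r for DC vs SP from Table 1; n = 52 is an assumption (30 + 22).
t_dc_sp = t_statistic(0.75, 52)
```

Under that assumption, the t statistic would be compared against a t distribution with n − 2 = 50 degrees of freedom to obtain a two-sided P value. A statistics package such as SciPy (`scipy.stats.pearsonr`) returns both r and P in one call.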