Abstract
Objectives
The Ki-67 proliferation index is integral to gastroenteropancreatic neuroendocrine tumor (GEP-NET) assessment. Automated Ki-67 measurement would aid clinical workflows, but adoption has lagged owing to concerns about nonequivalency. We sought to address this concern by comparing two digital image analysis (DIA) platforms to manual counting with same-case/different-hotspot and same-hotspot/different-methodology concordance assessments.
Methods
We assembled a cohort of GEP-NETs (n = 20) from 16 patients. Two sets of Ki-67 hotspots were counted manually by three observers and digitally by two DIA platforms, QuantCenter and HALO. Concordance between methods and observers was assessed using intraclass correlation coefficient (ICC) measures. For each comparison pair, the number of cases within ±0.2xKi-67 of the comparator was assessed.
Results
DIA Ki-67 showed excellent correlation with manual counting, and ICC was excellent in both within-hotspot and case-level assessments. Across expert-vs-DIA, DIA-vs-DIA, and expert-vs-expert comparisons, the best-performing method was DIA Ki-67 by QuantCenter, which showed 65% of cases within ±0.2xKi-67 of manual counting.
Conclusions
Ki-67 measurement by DIA is highly correlated with expert-assessed values. However, close concordance by strict criteria (>80% within ±0.2xKi-67) is not seen with DIA-vs-expert or expert-vs-expert comparisons. The results show analytic noninferiority and support widespread adoption of carefully optimized and validated DIA Ki-67.
Keywords: Ki-67, Grading, Neuroendocrine, Digital image analysis, Whole-slide imaging, Concordance, Validation, Immunohistochemistry
Key Points
Expert Ki-67 measurement in gastroenteropancreatic neuroendocrine tumors by manual counting exhibits the same variability and reproducibility as digital image analysis (DIA).
Measurement of Ki-67 proliferation index by whole-slide DIA is a robust method that can be adopted in routine clinical practice.
We propose a validation framework for digital Ki-67 that (1) ensures DIA Ki-67% exhibits high correlation (r > 0.9) with manual counting and (2) shows more than 50% of cases within ±0.2xKi-67 of manual counting.
The Ki-67 proliferation index is integral to the diagnosis and prognosis of gastroenteropancreatic neuroendocrine tumors (GEP-NETs) and in many instances guides therapy.1 The Ki-67 index has been shown to be more powerful than tumor stage as a prognostic factor in pancreatic tumors,2 and there are significant overall survival differences between World Health Organization (WHO) G1 and G3 tumors.3 Manual counting of camera-captured images represents the de facto gold standard for Ki-67 assessment. Computer-assisted systems (eg, the CAS 200)4,5 dating back to the mid-1990s did not widely penetrate clinical practice. Given increased access to whole-slide imaging and technological leaps in computing, the time is ripe for more widespread adoption of Ki-67 digital image analysis (DIA), although concerns about nonequivalency persist.6
The University of Iowa Hospitals and Clinics has a large neuroendocrine tumor program; we follow over 2,000 patients and see 200 new patients a year. We perform Ki-67 immunohistochemistry on all neuroendocrine tumors (NETs) (ie, primary, recurrent, metastatic), and on resections we separately evaluate primary, regional, and distant disease.7 Although we gestalt (aka “eyeball estimate”) rare Ki-67 proliferation indices (eg, those close to 0%), we manually count most (especially those around the WHO G1/2 and 2/3 thresholds), which is exceedingly time intensive. We perform manual counting of camera-captured images, as eyeball estimation and manual counting under the microscope have been shown to be clearly inferior.8-11 Some in our group have used the ImmunoRatio web application,12,13 which is an automated counting method with specific image input requirements. Some observers have contended that DIA-based quantification and immunohistochemical assessment carry limitations. Reid et al,10 for example, ascribed high costs to implementing digital quantification. In addition, they contended that miscounting due to inclusion of nontarget cells was a contributor to inaccuracy with digital methods. This study partly grew from our Ki-67 DIA validation process. One of us is an immunohistochemistry laboratory director, and we drew heavily on that experience and the College of American Pathologists guideline for the analytic validation of immunohistochemical assays (eg, cohort size of 20 cases, expected concordance between comparator and the gold standard of 90%, and investigation of sources of lower concordance).14
The Ki-67 proliferation index is unique in surgical pathology in that it represents a continuous variable ranging from 0% to 100%. Studies evaluating Ki-67 DIA assessment across tumor types,15-22 including GEP-NETs,9-11,13,23-26 typically report the Pearson correlation coefficient (r) or the intraclass correlation coefficient (ICC).27 Despite the fact that Ki-67 is a continuous variable, NET grades are assigned at (arbitrary) proliferation index thresholds (ie, G1/2 at 3%, G2/3 at 20%), and crossing a grade threshold can have clinical import (eg, clinical trial eligibility). Two assays can be highly correlated yet consistently produce results on opposite sides of these thresholds. Thus, we concluded that correlation coefficients alone were an insufficient metric for a validation and that, more broadly, there was a need for a simpler index or parameter for evaluating agreement between methods in histopathology that produce continuous data. The 3% GEP-NET G1/G2 threshold is frequently encountered in clinical practice; counts within ±10% of this value range from 2.7% to 3.3%. However, a ±10% tolerance was intuitively and by consensus deemed “too difficult to reproduce.” The ±20% range around the same value is 2.4% to 3.6%; this represented our reproducibility target, and we refer to comparisons within ±0.2x of the Ki-67 index as “close matches.”
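As a concrete illustration of this tolerance (a minimal Python sketch, not part of the study software), the ±0.2x close-match window can be computed for any Ki-67 value:

```python
def close_match_window(ki67: float) -> tuple[float, float]:
    """Return the ±0.2x ("close match") window around a Ki-67 index (%)."""
    return 0.8 * ki67, 1.2 * ki67


def is_close_match(reference: float, comparator: float) -> bool:
    """True if the comparator falls within ±0.2x of the reference Ki-67 index."""
    lo, hi = close_match_window(reference)
    return lo <= comparator <= hi


# At the 3% G1/G2 threshold, the ±20% window spans 2.4%-3.6%
lo, hi = close_match_window(3.0)
print(f"{lo:.1f}-{hi:.1f}")      # 2.4-3.6
print(is_close_match(3.0, 3.5))  # True
print(is_close_match(3.0, 3.8))  # False
```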
With the above, we sought to perform an in-depth examination of the reproducibility of digital Ki-67 measurement in GEP-NET with two widely available commercial whole-slide image–based DIA platforms with both hotspot- and case-level concordance assessment in comparison to manual counting. We examined the effect of strict predetermined thresholds to quantify the proximity of Ki-67 indices obtained by manual and digital methods. In addition, we examined the extent of grade-discrepant Ki-67 values in cases obtained by manual and digital methods.
Materials and Methods
Case Inclusion
The validation study was carried out as part of a quality improvement project. The University of Iowa Institutional Review Board determined the validation to be exempt from a full review by the board. We initially assembled a cohort of 20 GEP-NETs, including a mix of primary and metastatic tumors, from 16 patients. The cohort was assembled with consideration of the grade distribution in our patient population (66% G1, 30% G2, <5% G3). We intentionally overrepresented G2 tumors (50% of the cohort) (as G1 tumors with proliferation indices frequently close to 0% were not considered a “fair test”) and intentionally challenged the G1/G2 threshold (35% of the cohort with reported proliferation indices from 2%-5%). The cases were selected from 2017 and 2018 to reduce variation in counterstain (hematoxylin) quality and intensity. They were all initially signed out by a single expert gastrointestinal (GI) pathologist with 11 years of experience (observer 1), who had performed manual counting of camera-captured Ki-67 hotspot images at the time of clinical reporting. Mitotic activity was not considered in grade assignment.
G3 GEP-NETs are relatively infrequently encountered, although they are overrepresented in our institutional consults. At the recommendation of the reviewers, we evaluated an additional group of five tumors from four patients, targeting Ki-67 proliferation indices in the 10% to 30% range. Clinicopathologic features of these 25 total cases are presented in Table 1.
Table 1.
Patient Demographics of the Study Casesa
| Patient No. | Slide No. | Age, y | Sex | Tumor Location | Primary or Metastatic | Clinical Reported Ki-67 PI | WHO Tumor Grade |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 51 | F | Distal pancreas | Primary | 6.20 | G2 |
| 2 | 2 | 58 | F | Terminal ileum | Primary | 2.00 | G1 |
| 3 | 3 | 75 | M | Duodenum | Primary | 2.00 | G1 |
| 4 | 4 | 62 | M | Distal pancreas | Primary | 23.00 | G3 |
| 5 | 5 | 51 | F | Duodenum | Primary | 1.20 | G1 |
| | 6 | 51 | F | Lymph node | Metastatic | 1.60 | G1 |
| 6 | 7 | 75 | F | CBD lymph node | Metastatic | 4.00 | G2 |
| | 8 | | | Duodenum | Primary | 5.50 | G2 |
| 7 | 9 | 69 | F | Duodenum | Primary | 9.00 | G2 |
| 8 | 10 | 54 | F | Proximal ileum | Primary | 3.00 | G2 |
| | 11 | | | Ileal tumor nodule | Metastatic | 9.00 | G2 |
| | 12 | | | Liver | Metastatic | 8.50 | G2 |
| 9 | 13 | 54 | F | Proximal ileum | Primary | 0.70 | G1 |
| 10 | 14 | 70 | F | Pararenal node | Metastatic | 10.70 | G2 |
| | 15 | | | Liver | Metastatic | 10.10 | G2 |
| 12 | 16 | 68 | M | Duodenum | Primary | 4.50 | G2 |
| 13 | 17 | 63 | M | Liver | Metastatic | 2.30 | G1 |
| 14 | 18 | 56 | M | Terminal ileum | Primary | 1.90 | G1 |
| 15 | 19 | 67 | M | Diaphragmatic nodule | Metastatic | 2.99 | G1 |
| 16 | 20 | 59 | F | Stomach | Primary | 2.00 | G1 |
| 17 | 21 | 77 | M | Stomach | Primary | 15 | G2 |
| 18 | 22 | 67 | F | Liver | Metastatic | 7.1 | G2 |
| 19 | 23 | 60 | F | Ileum | Primary | 8 | G2 |
| 20 | 24 | 78 | F | Liver metastasis | Metastatic | 16 | G2 |
| | 25 | | | Pancreas | Primary | 36 | G3 |
CBD, common bile duct; PI, proliferation index; WHO, World Health Organization.
aKi-67 proliferation index values at clinical reporting and corresponding tumor grades according to the WHO cutoffs (<3%, grade 1; 3%-20%, grade 2; >20%, grade 3) are shown.
Ki-67 Counting and Digital Measurement
Manual Counting of Hotspot Photomicrographs From Glass Slides
Immunohistochemistry for Ki-67 was performed using standard techniques (MIB1; 1:200 dilution, heat-induced epitope retrieval; Dako). In routine reporting, Ki-67 proliferation index values were measured by manual counting of immunostained nuclei (all by observer 1).

Hotspot Selection.—In brief, hotspot selection was performed by light microscopy on glass slides using carefully combined low- and high-magnification examination of tumor areas to identify the area/focus with the highest apparent density of immunopositive tumor nuclei (Figure 1). Depending on the proportions of tumor and stroma, two to three nonoverlapping photomicrographs were acquired (×400, 2,448 × 1,920 pixels, 72 pixels per inch) from each hotspot area with an Olympus DX27 CCD camera (Olympus) (Figure 2A). The captured images together were known from routine practice to contain approximately 1,000 tumor cells.

Manual Counting.—Manual counting was performed with cellSens version 1.7 (Olympus) using the point-annotation function applied to the hotspot images viewed in cellSens on computer monitors (Figure 2C). Immunonegative tumor nuclei were included, and stromal, inflammatory, and endothelial cells were excluded. Ki-67–positive tumor nuclei were counted separately, and positive nontumor elements (inflammatory, stromal, and endothelial nuclei) were excluded by morphologic assessment. The historical/retrospective data from routine reporting as described above were included in the analysis (Table 2). The remainder of the measurements were obtained prospectively.
Figure 1.
Study design. The circles indicate hotspots and the site of hotspot selection (glass vs whole-slide image [WSI]) and the connected rectangles the method of Ki-67 measurement and the image substrate used to perform counting (photomicrograph [PMG]). “Study hotspots” were annotated by the digital image analysis (DIA) pathologist on WSIs and were counted by observer 1 and observer 3 and measured by HALO and QuantCenter. Observer 1 and observer 2 hotspots were identified on glass slides and counted using on-screen photomicrographs in Olympus cellSens. The dashed lines (green, expert vs expert; blue, expert vs DIA; red, DIA vs DIA) indicate the intercomparisons performed in the analysis.
Figure 2.
Ki-67 measurement by manual and digital image analysis (DIA) methods. A representative case (A) with segmented nuclei in HALO (B) and the same field with manual counting annotation of nuclei performed in Olympus cellSens (C). An annotated hotspot on a digital whole-slide image (D). This is subject to DIA measurement in 3D Histech Quantcenter (E) and manual counting in cellSens (F). The open plus symbols in C and F are from manual counting.
Table 2.
Summary of Ki-67 Measurement Experiments Performed or Included in the Studya
| Hotspot Selector | Manual Counting | Analysis Platform |
|---|---|---|
| DIA pathologist | Observer 3, Observer 1 | QuantCenter, HALO |
| Observer 1 | Observer 1 | |
| Observer 2 | Observer 2 | HALO |
aEach row indicates the hotspot selector and the methods of counting to which each of the selected hotspots was subjected. The digital image analysis (DIA) pathologist study hotspots and the observer 2–chosen hotspots were counted by both DIA platforms. Observer 1 counts on observer 1–chosen hotspots were performed as part of the clinical reporting of each case.
In the study, a second independent set of hotspots was prospectively chosen on glass slides by another expert GI pathologist (observer 2). Using the same method as described above, two to three TIFF images were acquired with an Olympus DX27 camera at ×400 from each hotspot area. The acquired photomicrographs were manually counted in cellSens by observer 2, as described above.
Manual Counting of Hotspots Acquired From Digital Whole Slides
Snapshot images containing study hotspots from whole-slide images (WSIs; see below) were manually counted by two independent observers (observer 1, observer 3) using cellSens’s point-annotation function, as described above (Figure 2C and Figure 2F).
DIA-Based Ki-67 Measurement on WSI and Hotspot Photomicrographs
The Ki-67 immunohistochemistry slides were digitally scanned with the ×20 objective (0.24-μm/pixel resolution) by a P1000 Pannoramic scanner (3DHistech) to yield .mrxs digital WSIs.

Hotspot Selection.—On WSI slides, a ×400 high-power field–equivalent circular area (0.152 mm²), known from prior experience to contain ~1,000 tumor cells, was annotated by a third pathologist, independent of the prior two selections, using 3DHistech Caseviewer’s gradient visualization function, which facilitates identification of the most highly Ki-67–labeled areas (Figure 2D). This was carried out by the pathologist performing digital image analysis (DP), and the selected hotspots are termed study hotspots. In addition, the study hotspots were extracted using Caseviewer’s Snapshot function as 1,920 × 1,017-pixel JPEG images that contained within them the circular annotation.

Automated Counting by DIA.—The DP-annotated study hotspots were analyzed in 3DHistech QuantCenter with the CellQuant algorithm to digitally enumerate nuclei on WSIs (Figure 2E). Because of a lack of cross-compatibility between 3DHistech Caseviewer and HALO (Indica Labs), digital whole-slide annotated study hotspots could not be directly opened in HALO. Instead, study hotspot circular annotated areas were exported using Caseviewer’s Export function as standalone .mrxs WSI files, opened in HALO, and analyzed using the CytoNuclear algorithm. Similarly, observer 2–chosen TIFF hotspot images acquired by photomicrography from glass slides were imported and analyzed in HALO using the CytoNuclear algorithm (Figure 2B).
The selection of hotspots and counting methods are depicted in Figure 1, Figure 2, and Table 2. The CellQuant and CytoNuclear algorithms were independently optimized for tumor nuclear detection and positive-nucleus classification using multiple areas from nonstudy neuroendocrine tumor and other WSI slides. Within each platform, the same algorithm settings were used whether analyzing WSIs or imported hotspot photomicrographs. Nuclear count data were exported as .xls files. The optimized algorithm parameters for 3DHistech CellQuant and HALO CytoNuclear, and example results from both platforms, are included in the supplementary data (Supplementary Figure S1, Supplementary Figure S2, and Supplementary Figure S3; all supplemental materials can be found at American Journal of Clinical Pathology online).
Five cases with higher clinically reported Ki-67 index values were selected and analyzed separately. For these slides, hotspots were chosen and annotated on WSIs by an expert GI pathologist (observer 2). The hotspots were extracted using Caseviewer’s Snapshot function, resulting in JPEG images that contained the circular annotation. These were manually counted by two pathologists (observer 1, observer 2) in Olympus cellSens. The annotated hotspots were also analyzed in 3DHistech QuantCenter with the CellQuant algorithm to digitally enumerate nuclei on WSIs. Ki-67 measurement by observer 3 manual counting and by HALO was not performed. Because these five cases were not subjected to the same uniform measurement process as the remaining 20, their data are analyzed and presented separately (Supplementary Figure S4).
Analysis
Ki-67 proliferation indices were calculated by the standard method: positive tumor nuclei/total tumor nuclei. For every comparison, the Pearson correlation coefficient (r) was calculated. Concordance within methodologic and observer groups was assessed by ICC measures. As a measure of the closeness of match between measured Ki-67 indices, a preset threshold was assessed, namely, the number of cases for each method that fell within ±0.2x of its comparator. In all comparisons, where applicable, the manual counts were used to derive close-match thresholds. In addition, for each comparison pair, κ statistics and the number of cases that were WHO GEP-NET grade discrepant were assessed. Confidence intervals (CIs) for mean Ki-67 (%) were estimated using 1,000 bootstrap samples. Differences between mean Ki-67 values were compared using the Student t test, and differences in proportions were analyzed using the Fisher exact test. Statistical analysis was performed using SPSS version 26 and RStudio version 1.2.1335 running R 64-bit version 3.6.1.
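The per-pair metrics above can be sketched in a few lines. The study itself used SPSS and R; the Python below is only an illustrative reimplementation of the Pearson correlation and the ±0.2x close-match count, run on hypothetical paired Ki-67 values rather than study data:

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def close_match_count(manual: list[float], dia: list[float]) -> int:
    """Number of cases whose DIA value lies within ±0.2x of the manual count."""
    return sum(0.8 * m <= d <= 1.2 * m for m, d in zip(manual, dia))


# Hypothetical paired Ki-67 indices (%), not study data
manual = [1.2, 2.0, 3.1, 6.2, 9.0, 23.0]
dia = [1.0, 2.3, 2.9, 5.7, 6.6, 24.3]
print(round(pearson_r(manual, dia), 3))
print(f"{close_match_count(manual, dia)}/{len(manual)} close matches")
```

Note that the two metrics can disagree, as in the paper's argument: these hypothetical pairs correlate at r > 0.99 yet one of the six pairs misses the ±0.2x window.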
Results
Profile of Cases
The ages of the patients (n = 20) ranged from 51 to 78 years. There were 7 men and 13 women (male/female ratio, 1:1.8). Sections from 25 tumors from these patients were studied; of these, 15 were primary tumors, and the remaining 10 were metastatic foci. The primary sites studied included pancreas (n = 3), duodenum (n = 5), ileum (n = 5), and stomach (n = 2). There were four liver metastases, with nodal metastases (n = 3) and peritoneal nodules (n = 3) forming the remainder. The clinically reported Ki-67 proliferation index values ranged from 0.70% to 36%, with a mean of 5.5%, and comprised 9 G1, 14 G2, and 2 G3 (WHO) tumors. The five cases analyzed separately showed a mean Ki-67 of 16.42%, with three having reported Ki-67 values of more than 10%. The details of the included cases and slides are presented in Table 1.
Correlation Analysis
The measured Ki-67 values are shown in Table 3 and Supplementary Table S1. Ki-67 measurement by both DIA platforms showed excellent correlation with their human counterparts (Figure 3 and Table 4). Table 4 displays the findings for the Ki-67 tumor slides that were subjected to the full analysis (n = 20) described above. The Pearson correlation coefficient (r) ranged from 0.881 to 0.980 (P < .001, all possible pairs), indicating a high degree of correlation between different observers and between methods (Table 4). The ICC was excellent in both within-hotspot (0.94 and 0.89) and case-level assessments (0.926; all P < .001). In addition, ICC values were high when examined expert vs expert (0.97; 95% CI, 0.95-0.98) and expert vs DIA (0.91; 95% CI, 0.84-0.96). Reproducibility between different DIA platforms (ie, DIA-vs-DIA assessments) was 0.88 (95% CI, 0.77-0.94).
Table 3.
Ki-67 Index Values (%) Calculated by Different Methodsa
| Tumor No. | Manual: Observer 1 (DP) | Manual: Observer 1 (Ob 1) | Manual: Observer 2 (Ob 2) | Manual: Observer 3 (DP) | DIA: QuantCenter (DP) | DIA: HALO (DP) | DIA: HALO (Ob 2) |
|---|---|---|---|---|---|---|---|
| 1 | 6.31 | 6.20 | 5.85 | 5.92 | 5.70 | 4.70 | 5.72 |
| 2 | 1.95 | 2.00 | 2.40 | 1.72 | 3.34 | 2.49 | 3.40 |
| 3 | 1.85 | 2.00 | 1.24 | 1.04 | 2.16 | 2.06 | 1.85 |
| 4 | 24.67 | 23.00 | 23.84 | 22.61 | 24.33 | 20.90 | 36.50 |
| 5 | 0.60 | 1.20 | 1.24 | 0.60 | 0.52 | 0.43 | 1.04 |
| 6 | 1.37 | 1.60 | 1.64 | 1.26 | 2.75 | 1.13 | 1.69 |
| 7 | 3.81 | 4.00 | 4.74 | 3.79 | 3.46 | 3.14 | 4.23 |
| 8 | 4.80 | 5.50 | 4.39 | 5.38 | 5.48 | 4.80 | 3.78 |
| 9 | 6.42 | 9.00 | 5.77 | 7.06 | 6.62 | 2.11 | 5.08 |
| 10 | 2.12 | 3.00 | 2.44 | 2.12 | 1.95 | 1.32 | 1.88 |
| 11 | 8.13 | 9.00 | 4.27 | 7.95 | 4.36 | 3.34 | 3.42 |
| 12 | 5.95 | 8.50 | 7.57 | 6.61 | 6.07 | 4.80 | 6.09 |
| 13 | 0.44 | 0.70 | 0.98 | 0.44 | 1.16 | 0.90 | 1.04 |
| 14 | 11.73 | 10.70 | 9.51 | 11.40 | 10.29 | 3.82 | 7.91 |
| 15 | 8.08 | 10.10 | 8.58 | 7.71 | 6.05 | 4.61 | 8.01 |
| 16 | 2.41 | 4.50 | 2.13 | 2.34 | 2.63 | 2.36 | 1.49 |
| 17 | 3.61 | 2.30 | 4.65 | 4.11 | 4.24 | 4.13 | 4.04 |
| 18 | 1.72 | 1.90 | 2.51 | 1.71 | 1.61 | 1.64 | 1.91 |
| 19 | 3.41 | 2.99 | 4.11 | 3.13 | 1.94 | 1.51 | 2.82 |
| 20 | 2.09 | 2.00 | 2.00 | 1.96 | 1.60 | 1.38 | 1.56 |
| 21 | 14.9 | 15 | 20.2 | — | 23.15 | — | — |
| 22 | 7.9 | 7.1 | 9.8 | — | 13.48 | — | — |
| 23 | 7.1 | 8 | 8.1 | — | 6.45 | — | — |
| 24 | 26 | 16 | 29.6 | — | 27.56 | — | — |
| 25 | 21.1 | 36 | 26.5 | — | 24.5 | — | — |
DP, digital image analysis pathologist; Ob, observer; —, data unavailable.
aThe column names list the method of counting (manual counting vs QuantCenter vs HALO) and the observer performing the count; listed in parentheses is the observer who chose the hotspots. For cases 21 to 25, hotspots were chosen by observer 2.
Figure 3.
Correlation coefficients between manual counting and all digital image analysis (DIA) methods. In each panel (A, observer 2; B, observer 1; C, observer 1 [clinical reporting]), manual counts are plotted along the x-axis against all corresponding DIA Ki-67 values (both hotspot and case level) from both platforms (QuantCenter and HALO) on the y-axis. A, P < .001; Pearson r = 0.93; 99% confidence interval [CI], 0.87-0.97. B, P < .001; Pearson r = 0.91; 99% CI, 0.83-0.95. C, P < .001; Pearson r = 0.88; 99% CI, 0.77-0.94.
Table 4.
Comparisons Between Different Methods of Ki-67 Measurementa
| Comparison Pair | Pearson Correlation Coefficient | Closeness of Match: No. (%) of Cases Within ±0.2xKi-67 Index | No. of Grade-Discordant Cases | κ |
|---|---|---|---|---|
| Manual counting vs manual counting | | | | |
| Observer 1 (DP) vs observer 2 (Ob 2) (case level) | 0.976 | 10/20 (50) | 4 | 1 |
| Observer 1 (DP) vs observer 1 (Ob 1) (case level) | 0.977 | 12/20 (60) | 4 | 0.63 |
| Observer 1 (Ob 1) vs observer 2 (Ob 2) (case level) | 0.952 | 10/20 (50) | 0 | 0.63 |
| Manual counting vs DIA | | | | |
| Observer 2 (Ob 2) vs HALO (Ob 2) (hotspot level) | 0.971 | 11/20 (55) | 2 | 0.81 |
| Observer 2 (Ob 2) vs HALO (DP) (case level) | 0.949 | 7/20 (35) | 2 | 0.81 |
| Observer 2 (Ob 2) vs QuantCenter (DP) (case level) | 0.978 | 10/20 (50) | 2 | 0.81 |
| Observer 1 (Ob 1) vs HALO (Ob 2) (case level) | 0.902 | 7/20 (35) | 2 | 0.81 |
| Observer 1 (Ob 1) vs HALO (DP) (case level) | 0.881 | 7/20 (35) | 4 | 0.81 |
| Observer 1 (Ob 1) vs QuantCenter (DP) (case level) | 0.946 | 8/20 (40) | 4 | 0.81 |
| Observer 1 (DP) vs HALO (Ob 2) (case level) | 0.942 | 9/20 (45) | 2 | 0.63 |
| Observer 1 (DP) vs HALO (DP) (hotspot level) | 0.922 | 9/20 (45) | 2 | 0.63 |
| Observer 1 (DP) vs QuantCenter (DP) (hotspot level) | 0.976 | 13/20 (65) | 2 | 0.63 |
| DIA vs DIA | | | | |
| QuantCenter (DP) vs HALO (DP) (hotspot level) | 0.953 | 10/20 (50) | 2 | 0.81 |
| QuantCenter (DP) vs HALO (Ob 2) (case level) | 0.969 | 9/20 (45) | 0 | 1 |
| HALO (DP) vs HALO (Ob 2) (case level) | 0.980 | 6/20 (30) | 2 | 0.81 |
DIA, digital image analysis; DP, digital image analysis pathologist; Ob, observer.
aThe comparison pairs are grouped under three headings. For each method, the name/label in the leftmost column indicates the observer or software performing the count, followed by the person choosing the hotspot in parentheses. The grade-discordant cases column enumerates the number of cases that fall under different World Health Organization grades (G1/G2/G3) for each comparison, and κ indicates the Cohen κ statistic for the pair.
Close-Match Analysis
Among possible meaningful pairs (expert vs DIA, DIA vs DIA, expert vs expert), the number of cases that were within the ±0.2xKi-67 index of each other ranged from 6 of 20 to 13 of 20, with the lowest closeness of match occurring in HALO measurements from different hotspots (Table 4). The median close match was 45%. The highest (13/20; 65%) occurred with QuantCenter measurements compared with the expert pathologist’s manual counting of study hotspots. Mean close-match levels between experts were comparable to expert-vs-DIA and DIA-vs-DIA matches (53% vs 45% vs 41% of cases, respectively). The number of closely matched cases was not significantly different between within-hotspot and case-level measurement comparisons (mean, 8.8 ± 1.3 vs 10.75 ± 2.7; P = .09, t test).
The five separately analyzed cases showed a mean close match of 40% for expert-vs-expert comparisons and a slightly higher mean close match of 46% for expert-vs-DIA comparisons.
Close concordance by strict criteria (>80% cases within ±0.2xKi-67) was not seen with DIA-vs-expert or expert-vs-expert comparisons (Table 4).
Comparison of Dispersion of Values Between Manual Counting and DIA Methods
Mean Ki-67 (%) values by DIA (three sets of counting) and manual counting (four sets of measurements) fell within the 95% CIs of each other in 7 (35%) of 20 cases (Table 5 and Figure 4). Manual counting, overall, showed a lower dispersion of counts, with a mean 95% CI width of 1.14%, compared with 2.057% for the DIA methods. Among the nonoverlapping cases, two patterns were seen: manual counting values with wide CIs for the means (eg, cases 9, 11, 12, 14, and 15; Figure 4) and tightly constrained manual and DIA counts with entirely nonoverlapping CIs (eg, cases 2, 13, and 20; Figure 4).
Table 5.
Mean Ki-67 Proliferation Index (%) and Bootstrapped 95% Confidence Intervals (1,000 Samples) by Manual Counting and Digital Image Analysis
| Case No. | Manual Counting, Mean (95% CI) | Digital Image Analysis, Mean (95% CI) |
|---|---|---|
| 1 | 6.07 (5.88-6.26) | 5.38 (4.70-5.72) |
| 2 | 2.02 (1.79-2.29) | 3.08 (2.49-3.40) |
| 3 | 1.53 (1.14-1.96) | 2.02 (1.85-2.12) |
| 4 | 23.53 (22.80-24.25) | 27.25 (20.90-36.50) |
| 5 | 0.91 (0.60-1.22) | 0.66 (0.43-1.04) |
| 6 | 1.47 (1.32-1.62) | 1.86 (1.13-2.40) |
| 7 | 4.09 (3.80-4.51) | 3.61 (3.14-4.23) |
| 8 | 5.02 (4.59-5.47) | 4.69 (3.78-5.25) |
| 9 | 7.06 (6.10-8.52) | 4.60 (2.11-6.11) |
| 10 | 2.42 (2.12-2.78) | 1.72 (1.32-1.93) |
| 11 | 7.34 (5.24-8.78) | 3.71 (3.34-4.04) |
| 12 | 7.16 (6.28-8.04) | 5.65 (4.80-6.09) |
| 13 | 0.64 (0.44-0.85) | 1.03 (0.90-1.12) |
| 14 | 10.84 (9.98-11.57) | 7.34 (3.82-9.49) |
| 15 | 8.62 (7.90-9.60) | 6.22 (4.61-8.01) |
| 16 | 2.85 (2.20-3.98) | 2.16 (1.49-2.54) |
| 17 | 3.67 (2.63-4.39) | 4.14 (4.04-4.20) |
| 18 | 1.96 (1.72-2.31) | 1.72 (1.62-1.91) |
| 19 | 3.41 (3.02-3.87) | 2.09 (1.51-2.82) |
| 20 | 2.01 (1.97-2.07) | 1.51 (1.38-1.59) |
CI, confidence interval.
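The bootstrapped intervals in Table 5 can be reproduced in outline as follows. This is an illustrative Python sketch of a percentile bootstrap of the mean (the study used SPSS/R, and the exact bootstrap variant and random seed are assumptions); the input values are case 1's four manual counts from Table 3:

```python
import random


def bootstrap_ci_mean(values, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the mean (1,000 resamples, as in Table 5)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, compute each resample's mean, sort them
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


# Case 1 manual counts (%) from Table 3 (four observer/hotspot sets)
manual_counts = [6.31, 6.20, 5.85, 5.92]
lo, hi = bootstrap_ci_mean(manual_counts)
print(f"Mean {sum(manual_counts) / 4:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

With so few values per case, the percentile interval is necessarily bounded by the observed minimum and maximum, which is consistent with the narrow manual-counting CIs reported above.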
Figure 4.
A, Mean Ki-67 proliferation index (%) and bootstrapped 95% confidence intervals (1,000 samples). Black horizontal bars depict Ki-67 by manual counting (four observers) and blue bars Ki-67 by digital image analysis (three sets). Open circles indicate the upper and lower limits and means. For values, see Table 5; case 4 is excluded for ease of plotting. B, Ki-67 proliferation index values by all methods. Colored lines depict each method of Ki-67 measurement, and dots depict the values obtained. For each, the hotspot selector is listed in parentheses (Ob 1, observer 1; Ob 2, observer 2). DP, pathologist performing digital image analysis.
WHO Tumor Grade Comparisons
Interobserver agreement between experts on the assigned WHO grade according to the Ki-67 index (<3%, grade 1; 3%-20%, grade 2; >20%, grade 3) showed κ values ranging from 0.63 to 1 (good to excellent). Agreement in expert-vs-DIA and DIA-vs-DIA comparisons was 0.63 to 0.81 and 0.81 to 1, respectively (Table 4). Despite a slightly better close match between experts, grade-discordant cases in expert-vs-DIA comparisons were seen at a rate no greater than in expert-vs-expert comparisons (P = .6, Fisher exact test) Table 6.
Table 6.
Tumor Grade According to the World Health Organization Cutoffs (<3%, Grade 1; 3%-20%, Grade 2; >20%, Grade 3) From Different Methodsa
| Case No. | Manual: Observer 1 (DP) | Manual: Observer 1 (Ob 1) | Manual: Observer 2 (Ob 2) | Manual: Observer 3 (DP) | DIA: QuantCenter (DP) | DIA: HALO (DP) | DIA: HALO (Ob 2) |
|---|---|---|---|---|---|---|---|
| 1 | G2 | G2 | G2 | G2 | G2 | G2 | G2 |
| 2 | G1 | G1 | G1 | G1 | G2 | G1 | G2 |
| 3 | G1 | G1 | G1 | G1 | G1 | G1 | G1 |
| 4 | G3 | G3 | G3 | G3 | G3 | G3 | G3 |
| 5 | G1 | G1 | G1 | G1 | G1 | G1 | G1 |
| 6 | G1 | G1 | G1 | G1 | G1 | G1 | G1 |
| 7 | G2 | G2 | G2 | G2 | G2 | G2 | G2 |
| 8 | G2 | G2 | G2 | G2 | G2 | G2 | G2 |
| 9 | G2 | G2 | G2 | G2 | G2 | G1 | G2 |
| 10 | G1 | G2 | G1 | G1 | G1 | G1 | G1 |
| 11 | G2 | G2 | G2 | G2 | G2 | G2 | G2 |
| 12 | G2 | G2 | G2 | G2 | G2 | G2 | G2 |
| 13 | G1 | G1 | G1 | G1 | G1 | G1 | G1 |
| 14 | G2 | G2 | G2 | G2 | G2 | G2 | G2 |
| 15 | G2 | G2 | G2 | G2 | G2 | G2 | G2 |
| 16 | G1 | G2 | G1 | G1 | G1 | G1 | G1 |
| 17 | G2 | G1 | G2 | G2 | G2 | G2 | G2 |
| 18 | G1 | G1 | G1 | G1 | G1 | G1 | G1 |
| 19 | G2 | G1 | G2 | G2 | G1 | G1 | G1 |
| 20 | G1 | G1 | G1 | G1 | G1 | G1 | G1 |
DP, digital image analysis pathologist; Ob, observer.
aSee Table 3 for Ki-67 proliferation index (%) values. The column names list the method of counting (manual counting vs QuantCenter vs HALO) and the observer performing the count.
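The grade assignment and agreement statistics used above can be sketched as follows. This is an illustrative Python reimplementation of the WHO cutoff logic and Cohen's κ (the study computed κ in SPSS/R); the paired Ki-67 values in the example are hypothetical, not study data:

```python
from collections import Counter


def who_grade(ki67: float) -> str:
    """WHO grade from the Ki-67 index alone (<3%, G1; 3%-20%, G2; >20%, G3)."""
    if ki67 < 3.0:
        return "G1"
    if ki67 <= 20.0:
        return "G2"
    return "G3"


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's κ for two raters' categorical (grade) assignments."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical Ki-67 (%) values from two methods on the same six cases
method_1 = [1.2, 2.9, 3.4, 9.0, 16.0, 23.0]
method_2 = [1.0, 3.1, 3.2, 8.5, 15.0, 24.3]
grades_1 = [who_grade(v) for v in method_1]
grades_2 = [who_grade(v) for v in method_2]
print(grades_1, grades_2)
print(round(cohens_kappa(grades_1, grades_2), 2))
```

The second case (2.9% vs 3.1%) illustrates how two closely matched Ki-67 values can still be grade discordant because they straddle the 3% threshold.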
Discussion
We assessed and characterized the level of agreement between manual counting–based and semiautomated DIA-based measurement of Ki-67 proliferation index (%) by several metrics. Ki-67 measurement by DIA showed excellent agreement with expert-performed manual counting values and with each other. To our knowledge, this is the first study to perform an interplatform comparison between two widely used commercial DIA platforms, 3DHistech Quantcenter and HALO, in GEP-NETs. The high interplatform reproducibility is similar to results obtained in Ki-67 measurement in breast carcinoma using the same two platforms.21
The employment of a threshold-based requirement yielded several insights into the nature of Ki-67 assessment by manual counting. While high correlation coefficients suggest a high degree of agreement, the introduction of a requirement that the Ki-67 index be within 0.2x of a target value showed that expert-vs-expert close-match reproducibility was met in ~53% of cases on average, comparable to the 45% average close-match rate for expert-vs-DIA reproducibility. When strict criteria for the number of cases meeting the threshold are added (>80% of cases within ±0.2xKi-67 index by manual counting), close concordance is not seen with either method (Table 4). In other words, the performance of the de facto gold standard was equivalent to that of DIA.
The ostensible superiority of manual counting rests on the notion that expert morphologic assessment directly underpins the quantitative evaluation. The task of meaningfully separating positive and negative tumor nuclei from stromal, inflammatory, and endothelial cells and other noncellular elements is thought to contribute to the accuracy of the Ki-67 proliferation index.10 On a per-nucleus basis, however, decisions to include or exclude objects in the count are driven by a multitude of factors: the morphologic features, the judgment of whether the object lies in the plane of focus and is to be included, whether it represents a faintly immunopositive vs a darkly counterstained negative nucleus, and whether it truly represents a tumor nucleus. At this level, fatigue, interpretive variation, differing stringency levels for object identification, and the limitations imposed by print/monitor resolution—all causes of variability in visual assessment—have a greater impact. With a perfect image substrate, say a fully white background with a low number of widely spaced, perfectly dark, circular objects as nuclei, decisions could be expected to be made with a high degree of reproducibility. With imperfectly defined, less-than-ideally contrasted objects, numerous admixed mimickers, and the number of decisions required in the evaluation of a hotspot (~1,000), manual counting itself is a noisy process. The resulting variability ensures that the same image/field counted by two different human observers will yield slightly different counts.
Analogously, similar variability is present in image analysis algorithms, whose parameters, such as size limits, color characteristics, and edge contrast, function as the machine vision counterparts of human morphologic assessment and likewise determine the inclusion or exclusion of objects as nuclei.26 In both instances, the uncertainty of within-hotspot enumeration is additive to that imposed by the morphologic variation of nuclei between different hotspots. The net effect, as the results above show, is the emergence of variability inherent in nuclear object enumeration, comparable in extent between manual counting and semiautomated image analysis and possibly irreducible.
The findings suggest the existence of a “concordance ceiling” independent of the method of assessment. This has implications for the setup and calibration of DIA algorithms in routine practice: typically, an expert user “tweaks” algorithm settings based on visual inspection of nuclear identification to finalize an algorithm state. Even if an algorithm instance can thereby be tuned to achieve high or near-perfect concordance with that expert, its expected rate of close agreement with other, independent observers can be no higher than what the observers achieve among themselves.
Interobserver variability in case-level evaluation of Ki-67 has been examined in the literature. In a recent large-scale study28 in which a panel of experts (n = 23) performed manual Ki-67 measurement in breast carcinoma on step-sectioned glass slides, the median Ki-67 varied across observers by ~10% over 30 cases. However, examination of their raw data shows that within the range of Ki-67 index values comparable to the present study (0%-20%), the average difference between observers was much higher (18.5%). Saadeh et al26 reported interobserver variability in the counts of negative (counterstained, or blue) nuclei among three observers in gastrointestinal neuroendocrine tumors but did not report significantly large differences in the Ki-67 index. Volynskaya et al29 compared manual counting with automated analysis in primary and metastatic gastrointestinal neuroendocrine tumors: four observers performed manual counts on printed hotspot images from 20 cases that were also subjected to DIA counting. A range of variation of up to ~15% in absolute terms was noted between observers in Ki-67 counts, but this was mainly confined to cases with high Ki-67 values, with low dispersion among counts at lower Ki-67 values. Apart from the breast carcinoma study, the variability reported in these studies is comparable to the findings of the present study.
Our study has certain limitations. Although each slide was subjected to multiple rounds of Ki-67 measurement by different methods, the total number of cases subjected to full analysis (n = 20) is low. The wider 95% CIs for mean Ki-67 (%) from manual counting and the relative inability of DIA methods to attain Ki-67 values close to manual counting may reflect the low case numbers. There are additional identifiable sources of variability arising from the methodology. First, the Ki-67 values obtained at clinical reporting were included in the analysis. Although this incorporates data from real-world reporting, and although the measurements were performed by a single pathologist (observer 1) using the same method throughout, they were made over a long period of time and under different conditions than the study slides. Second, both the 3DHistech QuantCenter and HALO CellQuant algorithms were tuned to identify neuroendocrine tumor nuclei by a single pathologist. A further possible deficiency in the data is the low number of cases with Ki-67 values close to the G2/G3 thresholds. Further studies with larger sample sizes and/or different software platforms are needed to explore whether a higher close-match rate with manual counting is possible with DIA Ki-67 measurement.
It is often forgotten that hotspot manual counting owes its emergence as a reproducible and clinically adopted method more to the limitations of what is practicable30,31 in routine practice than to any inherent superiority. With WSI analysis, a proliferation index can be calculated over far larger samples of tumor nuclei, approaching what is possible in quantitative analyses such as flow cytometry. New thresholds based on WSI analysis may need to evolve, permitting a more granular correlation with clinical outcomes.
In conclusion, based on our findings, we propose a validation framework for digital Ki-67 measurement in tumors that examines whether (1) the DIA Ki-67 index (%) exhibits a high degree of correlation (r > 0.9) with manual counting on the same hotspot images under uniform conditions of immunostaining, counterstaining, and WSI acquisition and (2) the proportion of cases showing a close match, defined as a DIA Ki-67 index (%) within ±0.2x the Ki-67 index by manual counting, is 50% or more. Our results show that expert-vs-expert Ki-67 measurements exhibit the same variability and reproducibility as DIA: ICCs and close-match rates were no worse for manual count–DIA comparisons than for manual counts between experts. Given this noninferiority and the substantial time savings, we support wider adoption of DIA Ki-67 measurement in routine clinical practice.
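The proposed two-criterion framework reduces to two simple computations over paired case-level values. A minimal sketch, assuming hypothetical paired DIA and manual Ki-67 series (the function names and example values are illustrative, not from the study):

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def validate_dia_ki67(dia, manual, r_min=0.9, match_min=0.5):
    """Apply the two proposed criteria to paired DIA/manual Ki-67 (%) series.

    Criterion 1: Pearson r > r_min between DIA and manual values.
    Criterion 2: the fraction of cases with DIA within +/-0.2x of the
    manual value is at least match_min.
    Returns (passes, r, close_match_rate).
    """
    r = pearson_r(dia, manual)
    matches = sum(abs(d - m) <= 0.2 * m for d, m in zip(dia, manual))
    rate = matches / len(dia)
    return (r > r_min and rate >= match_min), r, rate


# Hypothetical paired values (%): high correlation, all close matches.
passes, r, rate = validate_dia_ki67([1.1, 5.2, 10.0, 19.0],
                                    [1.0, 5.0, 11.0, 20.0])
print(passes, round(r, 3), rate)
```

The 0.9 and 50% cutoffs mirror the thresholds proposed above; a laboratory adopting this framework would run the check against its own validation cohort before clinical deployment.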
Supplementary Material
This work was supported by NIH grant P50 CA174521-01A1 (A.M.B.).
References
- 1. Rindi G, Klimstra DS, Abedi-Ardekani B, et al. A common classification framework for neuroendocrine neoplasms: an International Agency for Research on Cancer (IARC) and World Health Organization (WHO) expert consensus proposal. Mod Pathol. 2018;31:1770-1786.
- 2. Martin-Perez E, Capdevila J, Castellano D, et al. Prognostic factors and long-term outcome of pancreatic neuroendocrine neoplasms: Ki-67 index shows a greater impact on survival than disease stage. The large experience of the Spanish National Tumor Registry (RGETNE). Neuroendocrinology. 2013;98:156-168.
- 3. Richards-Taylor S, Ewings SM, Jaynes E, et al. The assessment of Ki-67 as a prognostic marker in neuroendocrine tumours: a systematic review and meta-analysis. J Clin Pathol. 2016;69:612-618.
- 4. Makkink-Nombrado SV, Baak JP, Schuurmans L, et al. Quantitative immunohistochemistry using the CAS 200/486 image analysis system in invasive breast carcinoma: a reproducibility study. Anal Cell Pathol. 1995;8:227-245.
- 5. Pinder SE, Wencyk P, Sibbering DM, et al. Assessment of the new proliferation marker MIB1 in breast carcinoma using image analysis: associations with other prognostic factors and survival. Br J Cancer. 1995;71:146-149.
- 6. Klimstra DS, Modlin IR, Adsay NV, et al. Pathology reporting of neuroendocrine tumors: application of the Delphic consensus process to the development of a minimum pathology data set. Am J Surg Pathol. 2010;34:300-313.
- 7. Keck KJ, Choi A, Maxwell JE, et al. Increased grade in neuroendocrine tumor metastases negatively impacts survival. Ann Surg Oncol. 2017;24:2206-2212.
- 8. Tang LH, Gonen M, Hedvat C, et al. Objective quantification of the Ki67 proliferative index in neuroendocrine tumors of the gastroenteropancreatic system: a comparison of digital image analysis with manual methods. Am J Surg Pathol. 2012;36:1761-1770.
- 9. Young HT, Carr NJ, Green B, et al. Accuracy of visual assessments of proliferation indices in gastroenteropancreatic neuroendocrine tumours. J Clin Pathol. 2013;66:700-704.
- 10. Reid MD, Bagci P, Ohike N, et al. Calculation of the Ki67 index in pancreatic neuroendocrine tumors: a comparative analysis of four counting methodologies. Mod Pathol. 2016;29:93.
- 11. Cottenden J, Filter ER, Cottreau J, et al. Validation of a cytotechnologist manual counting service for the Ki67 index in neuroendocrine tumors of the pancreas and gastrointestinal tract. Arch Pathol Lab Med. 2018;142:402-407.
- 12. Tuominen VJ, Ruotoistenmäki S, Viitanen A, et al. ImmunoRatio: a publicly available web application for quantitative image analysis of estrogen receptor (ER), progesterone receptor (PR), and Ki-67. Breast Cancer Res. 2010;12:R56.
- 13. Jin M, Roth R, Gayetsky V, et al. Grading pancreatic neuroendocrine neoplasms by Ki-67 staining on cytology cell blocks: manual count and digital image analysis of 58 cases. J Am Soc Cytopathol. 2016;5:286-295.
- 14. Fitzgibbons PL, Bradley LA, Fatheree LA, et al.; College of American Pathologists Pathology and Laboratory Quality Center. Principles of analytic validation of immunohistochemical assays: guideline from the College of American Pathologists Pathology and Laboratory Quality Center. Arch Pathol Lab Med. 2014;138:1432-1443.
- 15. Koopman T, Buikema HJ, Hollema H, et al. Digital image analysis of Ki67 proliferation index in breast cancer using virtual dual staining on whole tissue sections: clinical validation and inter-platform agreement. Breast Cancer Res Treat. 2018;169:33-42.
- 16. Pham DT, Skaland I, Winther TL, et al. Correlation between digital and manual determinations of Ki-67/MIB-1 proliferative indices in human meningiomas. Int J Surg Pathol. 2020;28:273-279.
- 17. Sugita S, Hirano H, Hatanaka Y, et al. Image analysis is an excellent tool for quantifying Ki-67 to predict the prognosis of gastrointestinal stromal tumor patients. Pathol Int. 2018;68:7-11.
- 18. Liu SZ, Staats PN, Goicochea L, et al. Automated quantification of Ki-67 proliferative index of excised neuroendocrine tumors of the lung. Diagn Pathol. 2014;9:174.
- 19. Wang M, McLaren S, Jeyathevan R, et al. Laboratory validation studies in Ki-67 digital image analysis of breast carcinoma: a pathway to routine quality assurance. Pathology. 2019;51:246-252.
- 20. Wang HY, Li ZW, Sun W, et al. Automated quantification of Ki-67 index associates with pathologic grade of pulmonary neuroendocrine tumors. Chin Med J (Engl). 2019;132:551-561.
- 21. Acs B, Pelekanou V, Bai Y, et al. Ki67 reproducibility using digital image analysis: an inter-platform and inter-operator study. Lab Invest. 2019;99:107-117.
- 22. Del Rosario Taco Sanchez M, Soler-Monsó T, Petit A, et al. Digital quantification of KI-67 in breast cancer. Virchows Arch. 2019;474:169-176.
- 23. Kroneman TN, Voss JS, Lohse CM, et al. Comparison of three Ki-67 index quantification methods and clinical significance in pancreatic neuroendocrine tumors. Endocr Pathol. 2015;26:255-262.
- 24. Basile ML, Kuga FS, Del Carlo Bernardi F. Comparation of the quantification of the proliferative index KI67 between eyeball and semi-automated digital analysis in gastro-intestinal neuroendrocrine tumors. Surg Exp Pathol. 2019;2:21.
- 25. Dogukan FM, Yilmaz Ozguven B, Dogukan R, et al. Comparison of monitor-image and printout-image methods in Ki-67 scoring of gastroenteropancreatic neuroendocrine tumors. Endocr Pathol. 2019;30:17-23.
- 26. Saadeh H, Abdullah N, Erashdi M, et al. Histopathologist-level quantification of Ki-67 immunoexpression in gastroenteropancreatic neuroendocrine tumors using semiautomated method. J Med Imaging (Bellingham). 2020;7:012704.
- 27. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428.
- 28. Leung SCY, Nielsen TO, Zabaglo LA, et al.; International Ki67 in Breast Cancer Working Group of the Breast International Group and North American Breast Cancer Group (BIG-NABCG). Analytical validation of a standardised scoring protocol for Ki67 immunohistochemistry on breast cancer excision whole sections: an international multicentre collaboration. Histopathology. 2019;75:225-235.
- 29. Volynskaya Z, Mete O, Pakbaz S, et al. Ki67 quantitative interpretation: insights using image analysis. J Pathol Inform. 2019;10:8.
- 30. Cavalcanti MS, Gönen M, Klimstra DS. The ENETS/WHO grading system for neuroendocrine neoplasms of the gastroenteropancreatic system: a review of the current state, limitations and proposals for modifications. Int J Endocr Oncol. 2016;3:203-219.
- 31. Scarpa A, Mantovani W, Capelli P, et al. Pancreatic endocrine tumors: improved TNM staging and histopathological grading permit a clinically efficient prognostic stratification of patients. Mod Pathol. 2010;23:824-833.