Author manuscript; available in PMC: 2013 Jun 1.
Published in final edited form as: Osteoarthritis Cartilage. 2012 Feb 18;20(6):476–485. doi: 10.1016/j.joca.2011.12.018

COMPARISON OF CARTILAGE HISTOPATHOLOGY ASSESSMENT SYSTEMS ON HUMAN KNEE JOINTS AT ALL STAGES OF OSTEOARTHRITIS DEVELOPMENT

C Pauli 1,2, R Whiteside 3, F Las Heras 4, D Nesic 5, J Koziol 1, SP Grogan 1,2, J Matyas 6, KPH Pritzker 7, DD D’Lima 1,2, MK Lotz 1,*
PMCID: PMC3348372  NIHMSID: NIHMS354306  PMID: 22353747

Abstract

Objective

To compare the MANKIN and OARSI cartilage histopathology assessment systems using human articular cartilage from a large number of donors across the adult age spectrum representing all levels of cartilage degradation.

Design

Human knees (n=125 from 65 donors; age range 23–92) were obtained from tissue banks. All cartilage surfaces were macroscopically graded. Osteochondral slabs representing the entire central regions of both femoral condyles, tibial plateaus, and the patella were processed for histology and Safranin O – Fast Green staining. Slides representing normal, aged, and OA tissue were scanned and electronic images were scored online by five observers. Statistical analysis was performed for inter- and intra-observer variability, reproducibility and reliability.

Results

The inter-observer agreement among five observers for the MANKIN system was good (intra-class correlation coefficient, ICC >0.81) and similar to that for the OARSI system (ICC >0.78). Repeat scoring by three of the five readers showed very good agreement (ICC >0.94). Both systems showed high reproducibility among four of the five readers, as indicated by Spearman’s rho. For the MANKIN system, the surface parameter, represented by lesion depth, was the one on which all readers showed excellent agreement. Other parameters such as cellularity, Safranin O staining intensity and tidemark had greater inter-reader disagreement.

Conclusion

Both scoring systems were reliable but appeared too complex and time consuming for assessment of lesion severity, the major parameter determined in standardized scoring systems. To rapidly and reproducibly assess severity of cartilage degradation, we propose to develop a simplified system for lesion volume.

Keywords: Osteoarthritis, cartilage, histology, grading

Introduction

The histologic/histochemical grading system (MANKIN system) proposed by Mankin et al. in 1971 has been widely used for the evaluation of osteoarthritic (OA) cartilage [1, 2]. This system was developed originally for the assessment of human hip OA cartilage, and it has subsequently also been used to evaluate cartilage degradation, repair and regeneration in various animal models of OA. The MANKIN system assesses four parameters: cartilage structure, cellularity, Safranin O staining, and tidemark integrity. Each parameter has subcategories, and the scores are summed to provide a total score ranging from 0 (normal) to 14 (most severe OA).
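As a minimal sketch of the summation described above, the total can be computed from the four sub-scores. The structure (0–6) and tidemark (0–1) ranges follow the figure templates used in this study; the cellularity (0–3) and Safranin O (0–4) ranges are assumed from the standard Mankin scheme so that the maxima sum to 14. The names `MANKIN_MAX` and `mankin_total` are hypothetical.

```python
# Maximum sub-score per MANKIN parameter.
# Assumption: cells 0-3 and Safranin O 0-4 are taken from the standard
# Mankin scheme; only structure 0-6 and tidemark 0-1 appear in this text.
MANKIN_MAX = {"structure": 6, "cells": 3, "safranin_o": 4, "tidemark": 1}


def mankin_total(subscores):
    """Sum the four MANKIN sub-scores after range-checking each one."""
    if set(subscores) != set(MANKIN_MAX):
        raise ValueError("expected exactly the four MANKIN parameters")
    for name, value in subscores.items():
        if not 0 <= value <= MANKIN_MAX[name]:
            raise ValueError(f"{name} score {value} outside 0..{MANKIN_MAX[name]}")
    return sum(subscores.values())


# Illustrative section: clefts to transitional zone, mild cell and stain changes.
example = {"structure": 3, "cells": 1, "safranin_o": 2, "tidemark": 0}
print(mankin_total(example))  # 6
```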

Over the last four decades, several “modified Mankin scores” have been developed. These systems assess parameters similar to those of the original MANKIN system, but parameters such as Safranin O staining intensity or cellularity are either scored in a different fashion, or an overall score is applied instead of separate subscores [3–8]. Since the MANKIN system was based on specimens with advanced OA, it may have limitations for mild and moderate OA [9]. Additionally, the horizontal extent to which the cartilage surface is affected by the disease process is not assessed with this system. Scoring features such as ‘pannus’ and ‘surface irregularities’ worsen the score, but these features may also be found in certain areas of healthy or regenerative tissue. In the past, different authors have validated [10] but also questioned the reproducibility and the validity of the MANKIN system [9, 11]. There are also conflicting reports with respect to intra- and inter-observer variability [9, 10, 12].

To address limitations of the MANKIN system and obtain a useful method for applications in clinical as well as experimental OA assessment, the OARSI System Working Group developed the Osteoarthritis Cartilage Histopathology Assessment System (in this manuscript referred to as ‘OARSI system’) [13]. With this system, the “stage” of OA is based on the extent of the joint cartilage surface, area or volume involved in the local OA process, and points are assigned ranging from 0 [normal] to 4 [>50%]. The “grade” of OA is based on the extent of pathology into the depth of the cartilage, and points are assigned ranging from 0 [surface intact] to 6 [full-thickness loss of cartilage and bone deformation]. Optional “subgrades” were also proposed, ranging from 1.0 (cells intact) to 6.5 (joint margin and central osteophytes). The values of “stage” and “grade” are multiplied to yield an overall joint “score”. The OARSI system was intended to be more sensitive to different grades of mild OA and to be applied more consistently by less experienced observers than the MANKIN system. The OARSI system was published as a model to be validated in other studies.
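The grade-times-stage calculation can be sketched as follows. This is an illustration only: it uses whole grades (0–6) and stages (0–4), ignores the optional subgrades, and enforces the rule mentioned later in this manuscript that grade 0 requires stage 0. The function name `oarsi_score` is hypothetical.

```python
def oarsi_score(grade, stage):
    """Return the OARSI score as grade x stage (range 0-24).

    Assumptions of this sketch: whole grades 0-6 and stages 0-4 only;
    optional subgrades (1.0 to 6.5) are not handled.
    """
    if grade not in range(0, 7):
        raise ValueError("grade must be a whole number from 0 to 6")
    if stage not in range(0, 5):
        raise ValueError("stage must be a whole number from 0 to 4")
    if grade == 0 and stage != 0:
        raise ValueError("grade 0 (no OA) requires stage 0")
    return grade * stage


print(oarsi_score(1, 4))  # 4
print(oarsi_score(6, 4))  # 24
```

A one-step change in grade from 0 to 1 can therefore move the stage from 0 to 4, which is worth keeping in mind when reading agreement statistics for the staging component.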

Thus far, three comparative studies of the MANKIN and OARSI systems have been performed, using goat [12] and human tissue from patients that underwent unilateral knee arthroplasty [14, 15]. The studies on human tissues used OA knees with very advanced disease and did not include knees with early OA changes.

The objective of the present study was to use a large collection of human knee joints from donors across the entire adult age spectrum and covering the complete range of cartilage pathology to compare the two systems. A detailed data analysis reveals potential limitations of each system and suggests that they should be further simplified for use as a standardized, easy to use lesion severity assessment tool, or modified to address specific questions on disease mechanisms.

Materials and methods

Human cartilage procurement

Human knees (n=125) from 65 donors (29 males, 36 females; age range = 23–92) were obtained from tissue banks (approved by Scripps Institutional Review Board) and processed within 24–72 hours post mortem.

Tissue harvesting and processing

Sagittal osteochondral slabs were harvested from both femoral condyles. A coronal osteochondral slab through the central part of the tibial plateau was harvested. The location of the slabs was selected to represent the central region in each compartment that is most exposed to mechanical loading. A transverse osteochondral slab was harvested from the patella (Supplementary Fig. 1). The samples were fixed in Z-Fix (Anatech, Battle Creek, MI) immediately after harvesting and subsequently decalcified with TBD-2 (Thermo Fisher). Decalcified specimens were cut to smaller tissue blocks at defined locations. Each femoral condyle was divided into 5–7 tissue blocks, the patella into 3 and the entire tibia into 4 tissue blocks. After dehydration in an alcohol series and clearing in Pro-Par (Anatech), the tissue blocks were infiltrated and embedded in paraffin. From this collection of 125 knees, we prepared approximately 1600 osteochondral tissue blocks. Five micron-thick sections were cut from each block and stained with Safranin O - Fast Green. These sections were scored by an experienced observer using the Mankin and OARSI system.

Histological scoring

From the collection of approximately 1600 scored slides, a set of 300 slides was selected. These slides represented all locations (femoral condyles, tibia, patella) and all grades and subgrades. All sections were scanned with a digital slide scanner (Aperio ScanScope System, Aperio Technologies, Vista, CA) at a magnification of 40x (pixel size = 0.25 µm²), and the scans were evaluated online with WebScope (Spectrum Digital Information Management System, Aperio Technologies). We recruited five observers who were familiar with cartilage histopathology and had different levels of experience with both grading systems. The observers were blinded regarding donor age, gender and disease state for all specimens as well as to the grades of the other observers. Three observers scored the 300 slides twice, at least three weeks apart, with both grading systems (Supplementary Tables 1–3).

The manuscript of Pritzker et al. was used as the OARSI system grading template [13]. The staging parameter of the OARSI system was applied to the entire tissue section on each slide. For the MANKIN system, we prepared a template with representative images (Figs. 2–5).

Fig. 2.

Histological assessment of the surface structure parameter according to MANKIN on sections from femoral condyles: (A) Normal (intact smooth surface), score 0. (B) Surface irregularities (undulations), score 1. (C) Pannus and surface irregularities (fibrillation), score 2. (D) Clefts to transitional zone, score 3. (E) Clefts to radial zone, score 4. (F) Clefts to calcified zone, score 5. (G) Complete disorganization, score 6. Safranin O - Fast Green, pictures taken with 4x and 40x objectives.

Fig. 5.

Histological tidemark assessment according to MANKIN on sections from femoral condyles: (A) Tidemark intact, score 0. H&E stain, pictures taken with 4x and 40x objectives. (B) Tidemark crossed by blood vessels (± tidemark duplication), score 1. Safranin O - Fast Green stain, pictures taken with 4x and 40x objectives.

Statistical analyses

Reliability and reproducibility

Reliability and reproducibility were assessed by comparing scores from all observers for all histological specimens and for both scoring systems. Two methods were used to quantify and summarize intra- and inter-observer agreement. Intra-class correlation coefficients (ICCs) [16, 18] were determined for all pairwise comparisons among and within observers. These were calculated from a two-way random-effects analysis of variance, with absolute agreement as the objective [17]. Intra-observer ICCs were calculated with the initial and repeat scores from 3 observers, and inter-observer ICCs were calculated with the initial scores from 5 observers. Bootstrap resampling with 1000 samples was used to construct 95% confidence intervals for the ICCs, via the percentile method. We also used the Bland-Altman limits of agreement (LOA) method [18, 19] to assess intra- and inter-observer agreement. We report 95% limits of agreement for these pairwise comparisons. These limits define intervals within which 95% of differences between the two measurements are expected to lie.
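The ICC and its bootstrap interval can be sketched in a few lines. This is not the Stata or SPSS procedure used in the study; it is a plain-Python illustration that assumes the single-measure form of the two-way random-effects, absolute-agreement ICC (the text does not say whether single or average measures were reported). The function names and the example scores are hypothetical.

```python
import random


def icc_absolute_agreement(ratings):
    """Single-measure ICC, two-way random effects, absolute agreement.

    ratings holds one row per specimen and one column per observer (or session).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between specimens
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between observers
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def bootstrap_percentile_ci(ratings, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the ICC, resampling specimens with replacement."""
    rng = random.Random(seed)
    n = len(ratings)
    estimates = []
    for _ in range(n_boot):
        sample = [ratings[rng.randrange(n)] for _ in range(n)]
        try:
            estimates.append(icc_absolute_agreement(sample))
        except ZeroDivisionError:
            continue  # degenerate resample with no variance; skip it
    estimates.sort()
    lower = estimates[int(alpha / 2 * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Illustrative scores only (not study data): 6 specimens rated by 4 observers.
scores = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
          [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
print(round(icc_absolute_agreement(scores), 2))  # 0.29
print(bootstrap_percentile_ci(scores))
```

For a pairwise comparison as reported in the tables, `ratings` would have two columns: two observers for an inter-observer ICC, or the first and second session of one observer for an intra-observer ICC.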

Correlations between the 2 scoring systems

We used Spearman’s nonparametric correlation coefficient rho to compare the scores of the MANKIN and OARSI scoring systems. Spearman’s rho is preferable to Pearson’s (parametric) correlation coefficient in this setting since both scoring systems represent ordinal rather than continuous scales. Bootstrap resampling with 1000 samples was used to construct 95% confidence intervals for rho via the percentile method. Calculations were performed in Stata 9.2 (Statacorp, College Station, TX) and SPSS 16.0 (SPSS Inc., Chicago, IL).
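A rank correlation with a percentile bootstrap interval can be sketched as below. This is an illustration in plain Python rather than the software used in the study; ties receive average ranks, and the example scores are invented for demonstration. All function names are hypothetical.

```python
import random


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for idx in order[start:end + 1]:
            ranks[idx] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed scores."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_rho_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for rho, resampling specimen pairs."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        picks = [rng.randrange(n) for _ in range(n)]
        try:
            estimates.append(spearman_rho([x[i] for i in picks], [y[i] for i in picks]))
        except ZeroDivisionError:
            continue  # resample with constant scores; rho is undefined
    estimates.sort()
    return (estimates[int(alpha / 2 * len(estimates))],
            estimates[int((1 - alpha / 2) * len(estimates)) - 1])


# Illustrative scores only (not study data).
mankin = [0, 1, 3, 4, 6, 8, 9, 11, 12, 14]
oarsi = [0, 1, 2, 6, 4, 12, 10, 16, 20, 24]
print(round(spearman_rho(mankin, oarsi), 3))  # 0.976
print(bootstrap_rho_ci(mankin, oarsi))
```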

Results

MANKIN system reliability

The inter-observer variability among five observers for the MANKIN system showed a good intra-class correlation coefficient (ICC) range of 0.811 to 0.961. The ICC between the readers for the surface parameter ranged from 0.832 to 0.945, while the other parameters, such as cellularity, Safranin O staining intensity, and tidemark, showed a lower range of ICCs. The ICCs for intra-observer variability were higher for all parameters and the overall score (Table 4).

Table 4.

Intra-class correlation coefficients for MANKIN total scores and each parameter: MANKIN total scores and each parameter on 300 specimens were assessed by each of five observers. Intra-class correlation coefficients (ICC) and associated 95% confidence intervals were calculated from the observers’ scores. The three entries per cell are: lower 95% confidence limit, observed ICC (in bold), and upper 95% confidence limit. The diagonal cell entries, that is, the (A, A), (B, B), and (C, C) cells, compare the replicate scores of observer A, B, and C respectively. Graders D and E did not perform second scoring. The off-diagonal entries correspond to inter-observer comparisons.

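| Parameter | Grader | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- | --- |
| Total score | A | .957, **.966**, .973 | .935, **.950**, .961 | .933, **.946**, .957 | .903, **.933**, .952 | .694, **.847**, .911 |
| Total score | B | | .979, **.983**, .986 | .941, **.959**, .970 | .951, **.961**, .969 | .580, **.839**, .920 |
| Total score | C | | | .954, **.963**, .971 | .895, **.932**, .954 | .714, **.844**, .905 |
| Total score | D | | | | | .484, **.811**, .909 |
| Surface | A | .935, **.948**, .958 | .900, **.920**, .936 | .891, **.912**, .929 | .884, **.907**, .926 | .571, **.840**, .921 |
| Surface | B | | .969, **.975**, .980 | .931, **.945**, .956 | .889, **.914**, .933 | .675, **.879**, .940 |
| Surface | C | | | .955, **.964**, .971 | .894, **.915**, .932 | .527, **.867**, .942 |
| Surface | D | | | | | .417, **.832**, .928 |
| Cells | A | .784, **.831**, .867 | .562, **.720**, .811 | .600, **.746**, .830 | .421, **.673**, .800 | .639, **.714**, .773 |
| Cells | B | | .891, **.912**, .930 | .784, **.824**, .858 | .801, **.839**, .870 | .671, **.731**, .781 |
| Cells | C | | | .843, **.874**, .899 | .748, **.799**, .840 | .679, **.736**, .784 |
| Cells | D | | | | | .638, **.720**, .782 |
| Safranin O | A | .902, **.921**, .937 | .827, **.860**, .887 | .777, **.844**, .887 | .793, **.832**, .864 | .472, **.780**, .887 |
| Safranin O | B | | .901, **.920**, .936 | .733, **.842**, .899 | .899, **.919**, .935 | .370, **.772**, .892 |
| Safranin O | C | | | .879, **.902**, .921 | .715, **.826**, .886 | .704, **.799**, .859 |
| Safranin O | D | | | | | .377, **.757**, .881 |
| Tidemark | A | .661, **.720**, .770 | .589, **.658**, .718 | .381, **.534**, .646 | .451, **.568**, .660 | .436, **.523**, .601 |
| Tidemark | B | | .882, **.905**, .923 | .516, **.635**, .723 | .617, **.701**, .766 | .609, **.675**, .733 |
| Tidemark | C | | | .650, **.711**, .763 | .563, **.636**, .699 | .418, **.541**, .639 |
| Tidemark | D | | | | | .510, **.607**, .685 |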

MANKIN system reproducibility

Average differences and 95% limits of agreement (LOA) for intra- and inter-observer differences are given in Table 5. The 95% LOA for intra-observer [test-retest] differences were typically within 2 points for total scores, and 1 point for surface, cellularity, safranin O staining intensity and tidemark scores. The 95% LOA were somewhat wider for inter-observer differences: with the exception of observer E, inter-observer differences were typically within 3 points for total scores, within 2 points for surface and cellularity scores, and 1 point for safranin O staining intensity and tidemark scores. The LOAs indicate that scores from observer E were lower, and had much higher variability than scores from the other observers: one cannot rule out a 5 point discrepancy in MANKIN total scores between grader E and each of the other observers, on a 14 point scale.
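The limits of agreement can be illustrated with a short sketch. It assumes the usual Bland-Altman form, mean difference ± 1.96 standard deviations of the paired differences; the symmetric intervals reported in Table 5 are consistent with this form, but the exact computation used in the study is not stated. The function name and example scores are hypothetical.

```python
import statistics


def limits_of_agreement(first, second):
    """Return (lower 95% LOA, mean difference, upper 95% LOA) for paired scores.

    Assumption: limits are mean +/- 1.96 x sample SD of the differences.
    """
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = sum(diffs) / len(diffs)
    sd = statistics.stdev(diffs)
    return mean_diff - 1.96 * sd, mean_diff, mean_diff + 1.96 * sd


# Illustrative scores only (not study data): two sessions, six specimens.
session_1 = [4, 6, 9, 2, 11, 7]
session_2 = [5, 6, 8, 3, 10, 7]
print(tuple(round(v, 3) for v in limits_of_agreement(session_1, session_2)))
# (-1.753, 0.0, 1.753)
```

Differences follow the table convention of (row observer) minus (column observer), so a positive mean difference means the row observer scored higher on average.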

Table 5.

Limits of agreement for MANKIN total scores and each parameter: Total scores and each parameter on 300 specimens were assessed by each of five observers. Differences in the total scores and the parameters were calculated as (row designated observer score) minus (column designated observer score). The three entries per cell are: lower 95% limit of agreement, average difference (in bold), and upper 95% limit of agreement. The diagonal cell entries represent intra-observer differences: (A, A), (B, B), and (C, C) compare the replicate scores, namely first score – second score, of graders A, B, and C respectively. Graders D and E did not undertake replicate scoring. The off-diagonal entries correspond to inter-observer differences.

| Parameter | Grader | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- | --- |
| Total score | A | −1.963, **−0.144**, 1.676 | −2.596, **−0.288**, 2.020 | −2.313, **0.060**, 2.433 | −2.949, **−0.441**, 2.066 | −2.747, **1.140**, 5.028 |
| Total score | B | | −1.468, **−0.054**, 1.361 | −1.695, **0.348**, 2.391 | −2.225, **−0.154**, 1.918 | −2.444, **1.428**, 5.300 |
| Total score | C | | | −2.012, **−0.120**, 1.772 | −2.986, **−0.502**, 1.983 | −2.911, **1.080**, 5.072 |
| Total score | D | | | | | −2.421, **1.582**, 5.584 |
| Surface | A | −0.927, **0.067**, 1.061 | −1.307, **0.064**, 1.434 | −1.424, **−0.030**, 1.364 | −1.486, **−0.114**, 1.259 | −1.107, **0.611**, 2.238 |
| Surface | B | | −0.860, **−0.037**, 0.787 | −1.281, **−0.094**, 1.094 | −1.594, **−0.177**, 1.239 | −0.983, **0.550**, 2.084 |
| Surface | C | | | −0.984, **−0.064**, 0.857 | −1.478, **−0.084**, 1.311 | −0.827, **0.641**, 2.109 |
| Surface | D | | | | | −0.865, **0.725**, 2.315 |
| Cells | A | −1.253, **−0.140**, 0.973 | −1.681, **−0.334**, 1.012 | −1.496, **−0.294**, 0.908 | −1.714, **−0.408**, 0.898 | −1.693, **−0.207**, 1.278 |
| Cells | B | | −0.787, **0.000**, 0.787 | −0.972, **−0.972**, 1.053 | −1.026, **−0.074**, 0.879 | −1.260, **0.127**, 1.514 |
| Cells | C | | | −0.815, **−0.064**, 0.688 | −1.078, **−0.114**, 0.851 | −1.206, **0.087**, 1.380 |
| Cells | D | | | | | −1.084, **0.201**, 1.485 |
| Safranin O | A | −0.850, **−0.043**, 0.763 | −1.166, **−0.060**, 1.046 | −0.851, **0.194**, 1.239 | −1.309, **−0.074**, 1.162 | −0.769, **0.455**, 1.678 |
| Safranin O | B | | −0.881, **−0.043**, 0.794 | −0.778, **0.254**, 1.287 | −0.892, **−0.013**, 0.866 | −0.701, **0.515**, 1.731 |
| Safranin O | C | | | −0.796, **0.007**, 0.809 | −1.383, **−0.268**, 0.848 | −0.991, **0.261**, 1.513 |
| Safranin O | D | | | | | −0.769, **0.528**, 1.825 |
| Tidemark | A | −0.761, **−0.027**, 0.707 | −0.763, **0.043**, 0.850 | −0.675, **0.191**, 1.056 | −0.703, **0.154**, 1.010 | −0.926, **0.030**, 0.987 |
| Tidemark | B | | −0.395, **0.027**, 0.448 | −0.619, **0.147**, 0.913 | −0.602, **0.110**, 0.823 | −0.801, **−0.013**, 0.774 |
| Tidemark | C | | | −0.681, **0.000**, 0.681 | −0.812, **−0.037**, 0.738 | −1.029, **−0.161**, 0.707 |
| Tidemark | D | | | | | −0.947, **−0.124**, 0.699 |

OARSI system reliability

For the OARSI system, the inter-observer ICC for the grade ranged from 0.781 to 0.965 among the five observers. For the staging component, the inter-observer ICC ranged from 0.365 to 0.902. This high variability was mainly due to the divergent scores of one observer. The OARSI system score showed an inter-observer intra-class correlation coefficient (ICC) range of 0.790 to 0.974. For the staging component, the ICC for intra-observer variability ranged from 0.698 to 0.895 (Table 6).

Table 6.

Intra-class correlation coefficients for OARSI grade, stage and total scores: OARSI grade, stage and total scores on 300 specimens were assessed by each of 5 observers. Intra-class correlation coefficients and associated 95% confidence intervals were calculated from the observers’ scores. The three entries per cell are: lower 95% confidence limit, observed ICC (in bold), and upper 95% confidence limit. The diagonal cell entries, that is, the (A, A), (B, B), and (C, C) cells, compare the replicate scores of graders A, B, and C respectively. Observers D and E did not perform second scoring. The off-diagonal entries correspond to inter-observer comparisons.

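| Parameter | Grader | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- | --- |
| Grade | A | .960, **.968**, .975 | .924, **.940**, .953 | .901, **.920**, .936 | .881, **.934**, .960 | .675, **.829**, .898 |
| Grade | B | | .973, **.979**, .983 | .937, **.954**, .965 | .948, **.965**, .975 | .594, **.810**, .895 |
| Grade | C | | | .928, **.942**, .953 | .861, **.926**, .955 | .715, **.827**, .888 |
| Grade | D | | | | | .416, **.781**, .894 |
| Stage | A | .634, **.698**, .752 | .856, **.889**, .913 | .838, **.870**, .895 | .777, **.840**, .882 | .161, **.365**, .521 |
| Stage | B | | .870, **.895**, .916 | .876, **.900**, .920 | .878, **.902**, .922 | .273, **.464**, .602 |
| Stage | C | | | .697, **.752**, .799 | .794, **.834**, .866 | .215, **.394**, .533 |
| Stage | D | | | | | .280, **.440**, .566 |
| Score | A | .955, **.964**, .972 | .943, **.956**, .966 | .918, **.934**, .947 | .933, **.951**, .964 | .644, **.815**, .891 |
| Score | B | | .975, **.980**, .984 | .936, **.951**, .962 | .968, **.974**, .980 | .545, **.798**, .892 |
| Score | C | | | .927, **.941**, .953 | .922, **.942**, .957 | .663, **.820**, .892 |
| Score | D | | | | | .514, **.790**, .890 |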

OARSI system reproducibility

For total scores, 95% LOA for intra-observer [test-retest] differences were typically within 3 points with observer B, 4 points with observer A, and 5 points with observer C (Table 7). Intra-observer 95% LOAs were somewhat tighter for grade and stage: differences were typically within 2 points for stage, and 1 point for grade. Inter-observer differences were somewhat greater: with the exception of observer E, 95% LOAs are typically within 2 (in absolute value) for grade and stage, and within 5 for total score.

Table 7.

Limits of agreement for OARSI grade, stage and total scores: OARSI grade, stage and total scores on 300 specimens were assessed by each of 5 observers. Differences in the grades were calculated as (row designated grader’s grade) – (column designated grader’s grade). The three entries per cell are: lower 95% limit of agreement, average difference (in bold), and upper 95% limit of agreement. The diagonal cell entries represent intra-observer differences: (A, A), (B, B), and (C, C) compare the replicate grade, namely first grade – second grade, of observers A, B, and C respectively. Observer D and E did not undertake replicate scoring. The off-diagonal entries correspond to inter-observer differences.

| Parameter | Grader | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- | --- |
| Grade | A | −0.696, **0.072**, 0.839 | −1.219, **−0.109**, 1.002 | −1.277, **0.040**, 1.357 | −1.307, **−0.259**, 0.789 | −1.292, **0.513**, 2.319 |
| Grade | B | | −0.695, **−0.012**, 0.671 | −0.858, **0.149**, 1.156 | −0.983, **−0.151**, 0.682 | −1.292, **0.622**, 2.536 |
| Grade | C | | | −1.085, **0.047**, 1.178 | −1.465, **−0.299**, 0.867 | −1.478, **0.473**, 2.424 |
| Grade | D | | | | | −1.129, **0.773**, 2.674 |
| Stage | A | −1.463, **−0.094**, 1.275 | −0.827, **0.120**, 1.068 | −0.971, **0.077**, 1.125 | −0.887, **0.191**, 1.269 | −2.003, **0.833**, 3.668 |
| Stage | B | | −1.016, **0.000**, 1.016 | −1.036, **−0.043**, 0.949 | −0.876, **0.070**, 1.017 | −2.016, **0.712**, 3.440 |
| Stage | C | | | −1.604, **−0.127**, 1.350 | −1.120, **0.114**, 1.348 | −2.177, **0.756**, 3.689 |
| Stage | D | | | | | −2.136, **0.642**, 3.420 |
| Total score | A | −3.340, **0.154**, 3.648 | −4.431, **−0.485**, 3.461 | −4.877, **0.057**, 4.991 | −4.770, **−0.627**, 3.515 | −5.450, **2.251**, 9.952 |
| Total score | B | | −2.688, **0.087**, 2.862 | −3.692, **0.542**, 4.775 | −3.340, **−0.142**, 3.056 | −5.205, **2.736**, 10.677 |
| Total score | C | | | −4.460, **0.100**, 4.660 | −5.277, **−0.684**, 3.909 | −5.575, **2.194**, 9.963 |
| Total score | D | | | | | −5.205, **2.878**, 10.961 |

Correlations between MANKIN and OARSI systems

As expected, there was a strong positive correlation between the two scoring systems. Individually, Spearman’s rho values (95% CIs) from the two scoring systems were: observer A, 0.921 (95% CI 0.898 to 0.937); observer B, 0.945 (95% CI 0.928 to 0.956); observer C, 0.939 (95% CI 0.917 to 0.955); observer D, 0.915 (95% CI 0.888 to 0.935); observer E, 0.886 (95% CI 0.835 to 0.926). We also averaged the scores from observers A, B, C, and D, and found that rho improved to 0.960 (95% CI 0.949 to 0.970). Averaging is a proven technique for smoothing minor fluctuations, and can result in more stable scores. We considered observer E’s scores to evince more than minor fluctuations relative to the others, hence excluded this individual’s scores from the summary calculation.

A plot of average MANKIN versus average OARSI system among 4 graders showed a strong monotonic relationship between the two scoring systems [reflective of the Spearman rho value near 1], which is roughly linear (Fig. 6).

Fig. 6.

Summary plot of average MANKIN versus average OARSI system: Summary plot comprising 95% confidence ellipse, and marginal histograms of the frequency distribution. The graph shows a strong monotonic relationship between the two scoring systems, reflective of the Spearman rho value of 0.96. The relationship is roughly, but not perfectly, linear: given a particular MANKIN [or OARSI] score, there is a fair amount of scatter around the corresponding OARSI [or MANKIN] scores. The marginal MANKIN and OARSI system distributions are rather flat.

Level of experience

We hypothesized that the level of experience could be an important factor in inter- and intra-observer variability. In our study, all graders had more than 5 years of experience in the analysis of articular cartilage, but they differed in their level of experience with the two systems and with different species.

From examination of the LOA tables (Tables 5, 7), observers A and C, on average, scored slightly less OA severity than observers B and D on both scales, with comparable levels of variability. Observer E scored slides at significantly lower severity than the other raters: on average, discrepancies were 1 to 1.5 for the MANKIN system total scores, and 2 to 3 for the OARSI system total scores; and LOA were typically twice the widths of all other pairwise comparisons. Observer E’s high level of variability was also reflected in reduced ICC values, and increased widths of LOA intervals, relative to the other observers.

Discussion

Standardized histological scoring systems for cartilage are needed to assess the severity of degradation in human tissues and experimental models. The MANKIN system proposed by Mankin in 1971 is the most widely used system yet has several limitations [9, 10, 12]. To overcome these limitations, the OARSI system Working Group postulated five principles for an ideal cartilage histopathology system: simplicity, utility, scalability, extensibility and comparability. In 2006, the Osteoarthritis Cartilage Histopathology Assessment System (OARSI system) was published [13]. Thus far, the MANKIN system continues to be the most widely used system, with modifications across different studies [24]. As a consequence, studies with animal models are difficult to compare because of the varied assessment systems employed [20]. On the other hand, the OARSI system has not yet been widely implemented, in part because of the historical bias towards the MANKIN system, and in part because it has not yet been sufficiently validated.

Three studies compared the MANKIN and OARSI systems. The observers were experienced with the MANKIN system but were new to the OARSI system [12, 14, 15]. In the study using goats, cartilage sections were collected from four animals that developed cartilage damage on the femoral condyle due to articulation with a chromium-cobalt implant on the tibial plateau. While both MANKIN and OARSI systems were equally reproducible, the OARSI system was more reliable. Observer experience appeared less important when using the OARSI system, but the value of its staging component was difficult to determine [12]. In the study by Pearson et al. on ten human knee OA specimens, both systems proved reliable and reproducible, and exhibited similar variability [14]. Rout et al. performed a study on sixteen cases undergoing unicompartmental knee arthroplasty and concluded that both the modified MANKIN and OARSI systems are useful for histological grading, although the OARSI system was easier and quicker to use [15]. While these studies provide helpful information on the relative utility of the two systems in severe or end-stage cartilage degradation, validation on a broad range of severities remained to be performed. To address this, the present study used an extensive collection of human knee joints across the entire adult age spectrum at all stages of OA severity.

This study is the first comparison of the OARSI and MANKIN systems using a large number of human knee joints including a selection of donors representing all stages of OA severity. For each knee, a standard topographic sampling for each joint compartment was used. Standardized histology preparation and staining of the sections was used to minimize technical variability.

Intra- and inter-observer variability, reproducibility and reliability

In our study, the inter-reader agreement was good for both systems, with the ICC range for the total score of the MANKIN system slightly higher than that of the OARSI system. Both scoring systems appeared reliable and reproducible among the five readers, and especially among four of them, across all stages of OA and not only for normal and end-stage OA tissue as previously validated.

There is no gold standard for either MANKIN or OARSI scoring. Nevertheless, our study does provide some guidelines for rater performance, relative to individuals undertaking MANKIN or OARSI scoring. We suggest that intra-rater intra-class correlations on MANKIN or OARSI scores should exceed 0.95, and inter-rater intra-class correlations should exceed 0.90, in representative samples. Or, in testing scenarios as undertaken here, about 95% of intra-rater differences on MANKIN scores should be within 2 units of each other, and 95% of inter-rater differences should be within 3 units of each other. With OARSI scores, about 95% of intra-rater differences should be within 1 unit of each other, and 95% of inter-rater differences should be within 2 units of each other.
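The suggested benchmarks can be written down as a small check. This sketch makes two assumptions that the text leaves open: "within N units" is read as both 95% limits of agreement lying within ±N, and the ICC and LOA criteria are reported separately because the text offers them as alternatives. The names `ICC_MIN`, `LOA_MAX` and `check_rater` are hypothetical.

```python
# Benchmarks as stated in the paragraph above.
ICC_MIN = {"intra": 0.95, "inter": 0.90}
LOA_MAX = {
    "MANKIN": {"intra": 2, "inter": 3},
    "OARSI": {"intra": 1, "inter": 2},
}


def check_rater(system, kind, icc, loa_lower, loa_upper):
    """Compare one pairwise result against the suggested benchmarks.

    Assumption: 'within N units' means both 95% limits lie inside +/- N.
    """
    return {
        "icc_ok": icc > ICC_MIN[kind],
        "loa_ok": max(abs(loa_lower), abs(loa_upper)) <= LOA_MAX[system][kind],
    }


# MANKIN total scores from Tables 4 and 5.
print(check_rater("MANKIN", "intra", 0.983, -1.468, 1.361))  # observer B replicate
# {'icc_ok': True, 'loa_ok': True}
print(check_rater("MANKIN", "inter", 0.847, -2.747, 5.028))  # observer A versus E
# {'icc_ok': False, 'loa_ok': False}
```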

It was reported that the OARSI system is easier and quicker to use, presumably because it requires assessment of fewer parameters as compared to the MANKIN system. However, the OARSI system appears more difficult to apply by inexperienced readers. Even experienced readers in this study did find the OARSI system more complex to use and there was less agreement especially in the staging parameter.

Detecting early histological changes

The OARSI system was designed to detect histological features prior to the recognition of overt clinical OA. None of the published validation approaches included samples with early OA changes [12, 14, 15]. Our data showed a high agreement level between the two systems in the overall scores and compartment scores. According to the OARSI system, cartilage matrix swelling is the earliest histologically detectable change, which in an extreme form would lead to cartilage hypertrophy. Edema in cartilage may reflect condensation of the collagen fibers in the superficial and/or upper mid zones or variation in matrix cationic staining [13]. We question how accurately this parameter can be recognized and whether it can be differentiated from artifacts due to processing. The approach towards detecting early changes depends on the type of study and question being addressed. Changes observed on Toluidine Blue or Safranin O stained sections are not quantitative indicators of proteoglycan depletion and may be at least to some extent reversible. On the other hand, macroscopic structural defects such as fibrillations and partial erosions are relatively late features that are preceded by molecular changes. In this regard, detection of changes in gene expression by in situ hybridization or in protein expression and degradation of matrix components by immunohistochemistry would be more sensitive and accurate. For example, immunohistochemistry for collagen type II and aggrecan helped to identify differences within the lowest histological grades of articular cartilage [21]. Although such markers are useful to detect early changes that precede manifestations on standard histology, such changes may also be reversible and not necessarily reflect OA initiation. However, this approach is more appropriate for a specific mechanistic study rather than as a routine assessment tool.

Potential limitations of the MANKIN system

The MANKIN system describes features such as surface irregularities with pannus and complete disorganization (Supplementary Table 1). We consider these two parameters as critical for the assessment of cartilage degradation. Characteristics such as pannus and surface irregularities without proper classification may also be found in healthy cartilage and lead to a worse score [2].

Safranin O or Toluidine Blue staining intensity as a parameter in a grading system may lead to false results. In cartilage where Safranin O staining was not detectable, monoclonal antibodies revealed the presence of both keratan sulphate and chondroitin sulphate [22]. In addition, fixation, decalcification and protocol variability can affect Safranin O staining intensity [23] and therefore it has to be questioned how sensitive it would be for detection of early changes. It has been suggested that assessing certain parameters such as cellularity, cell morphology and tidemark need more reader consensus or better illustrations. In our study, we had the highest reader variability for the assessment of tidemark and the cellularity. Moreover, the MANKIN system does not include a staging component for the extent of degeneration across the tissue and is therefore mainly useful for localized lesions. Finally, detection of changes is confined mainly to cartilage while bone changes are not considered.

Potential limitations of the OARSI system

The OARSI system describes the grade as an index of the severity of the OA process and can serve as a good indicator of disease progression (Supplementary Tables 2, 3A and 3B). Grade 1 in the OARSI system is considered the threshold for OA. The primary criterion for Grade 1 is intact surface with other features of OA present such as uneven surface or fibrillation within the superficial zone being present. The challenge is to reproducibly score slight to mild edema, uneven surface or slight fibrillation and distinguishing this from surface artifacts during tissue processing. Staging is not a representative measurement when only specific regions are analyzed as with a histology section from a larger animal or human joint. In our study, we observed smaller ICCs for the staging component for the inter-reader agreement compared to the grade and the total score. This was caused mainly by a disagreement between grade 0 (normal cartilage, no OA), which requires a stage of 0 (no OA involvement) and a grade 1 (threshold for OA), which in most cases was staged with a 4 (more than 50% involvement). The ICC is substantially lower as we can observe 0 (for a grade 0) and 4 (for a grade of 1) in replicates, since 0 and 4 are not considered "close" in a range of 0–4. The individual components - grade and stage - can be misleading in isolation but may prove more insightful when used to calculate an overall score (with a range from 0–24), yet the staging appears definitely more critical for agreement. Bone changes are not examined in the MANKIN system. In the OARSI system, subchondral bone lesions are not included in the detection of early changes as OARSI grades 1–4 address only cartilage changes. OARSI grades 5 and 6 integrate bone changes. This arrangement implies a progression from initial cartilage damage to subsequent bone involvement, which may not apply to all cases of human OA. 
Bone changes at later OA stages include subchondral bone sclerosis, microfracture with reparative tissue, subchondral cysts, osseous repair and osteophytosis [24, 25]. The OARSI system parameter for bone deformation depends on the location within the joint that is represented by the section and is thus not useful as a routine tool.
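
The effect of the grade 0 / grade 1 staging discrepancy described above can be illustrated with a small numerical sketch. It assumes that the overall OARSI score is the product of grade (0–6) and stage (0–4), consistent with the stated range of 0–24, and it uses a one-way random-effects ICC, which is not necessarily the ICC form used in this study. The function names and the stage ratings are hypothetical and serve only to show how a single 0 versus 4 disagreement lowers the ICC.

```python
import numpy as np


def oarsi_score(grade: int, stage: int) -> int:
    """Overall score, assumed here to be grade (0-6) times stage (0-4)."""
    if not 0 <= grade <= 6:
        raise ValueError("grade must be between 0 and 6")
    if not 0 <= stage <= 4:
        raise ValueError("stage must be between 0 and 4")
    return grade * stage


def icc_oneway(ratings) -> float:
    """One-way random-effects ICC for an (n_sections x n_readers) array.

    Assumption: this simple ICC form is used for illustration only.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    row_means = x.mean(axis=1)
    grand_mean = x.mean()
    # Between-section and within-section mean squares.
    ms_between = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((x - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical stage ratings by two readers for six sections.
# In the first set the readers differ by at most one stage.
stages_close = [[0, 0], [1, 1], [2, 3], [3, 3], [4, 4], [0, 0]]

# Same data, but for the last section one reader assigns grade 0 / stage 0
# and the other grade 1 / stage 4.
stages_split = [[0, 0], [1, 1], [2, 3], [3, 3], [4, 4], [0, 4]]

if __name__ == "__main__":
    print("score for grade 0, stage 0:", oarsi_score(0, 0))
    print("score for grade 1, stage 4:", oarsi_score(1, 4))
    print("ICC, readers close:", round(icc_oneway(stages_close), 3))
    print("ICC, one 0 vs 4 split:", round(icc_oneway(stages_split), 3))
```

In this sketch the two readings of the last section differ by the full width of the stage scale, although the corresponding overall scores (0 and 4) lie close together on the 0–24 range. This is one way to see why the stage ICC can fall below that of the total score.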

While the OARSI system has not yet been validated through correlations with macroscopic or biochemical parameters [2], the MANKIN system was correlated to a macroscopic score [9] and to biochemical parameters [1].

Conclusions

The most common and important question being addressed with cartilage scoring systems is lesion severity. For obtaining this information, both systems appear complex and time-consuming, and both generate variability. In fact, most publications that score the large number of parameters in the two systems do not discuss them in detail but only interpret overall lesion severity.

Semi-quantitative histological scoring systems such as the MANKIN and OARSI systems are observer-dependent and thus subjective. Automated computerized histomorphometry might enable more objective, accurate and reproducible analysis of cartilage [26, 27]. An automated image analysis program based on the MANKIN system has been developed [28]. Among the four subcomponents of the MANKIN scale, the computer program correlations with observer scores were best for surface defect and proteoglycan depletion, but less favorable for cellularity and tidemark invasion. These results are similar to our present observations, underscoring the advantages of a system based on fewer parameters, restricted to the most reliable ones.

In conclusion, for the purpose of rapidly assessing severity of cartilage degradation, we propose to develop a simplified system for scoring lesion volume as measured by lesion depth and width. A similar system regarding lesion depth was proposed by Glasson for experimental OA in mice [29] and may serve as a model for a generally applicable system. Furthermore, a library of images and illustrations of stained tissue sections, similar to that used for grading radiographs [30], would be a valuable tool for training observers and would facilitate reproducible and consistent scoring within and between studies. This library could also be used to develop online training programs.

Supplementary Material

01

Fig. 3.

Histological assessment of cellularity according to MANKIN on sections from femoral condyles: (A) Normal (1–2 cells/chondron), score 0. (B) Diffuse hypercellularity, score 1. (C) Chondrocyte cloning (clusters), score 2. (D) Hypocellularity, score 3. Safranin O – fast green, pictures taken with 4x and 40x objectives.

Fig. 4.

Histological assessment of the Safranin O staining intensity parameter according to MANKIN on sections from femoral condyles. (A) Normal (staining except for surface zone), score 0. (B) Slight reduction (particularly superficial zone), score 1. (C) Moderate reduction (extending down to mid zone), score 2. (D) Severe reduction (entire cartilage thickness), score 3. (E) No dye noted, score 4. Safranin O – fast green, pictures taken with 4x and 40x objectives.

Acknowledgments

We are grateful to Lilo Creighton, Margaret Chadwell and Anita San Soucie for histologic processing of the specimens, and to Thomas Kryton for digitizing slides. This study was supported by the National Institutes of Health (AG007996), by Donald and Darlene Shiley, the Arthritis Foundation and the Sam and Rose Stein Endowment Fund, and the Canadian Arthritis Network, Network Centres of Excellence.

Footnotes

Author contributions

Study conception and design: Chantal Pauli, Darryl D’Lima, Martin Lotz.

Acquisition of data: Chantal Pauli, Robert Whiteside, Dobrila Nesic, Facundo Las Heras, John Matyas.

Statistical analysis: Jim Koziol.

Drafting of article or revising it critically for important intellectual content: Chantal Pauli, Jim Koziol, Darryl D’Lima, Shawn Grogan, Ken Pritzker, Robert Whiteside, Dobrila Nesic, Facundo Las Heras, John Matyas and Martin Lotz.

Conflict of interest

No author has any conflict of interest related to this work.

References

  • 1. Mankin HJ. Biochemical and metabolic aspects of osteoarthritis. Orthop Clin North Am. 1971;2:19–31.
  • 2. Rutgers M, van Pelt MJ, Dhert WJ, Creemers LB, Saris DB. Evaluation of histological scoring systems for tissue-engineered, repaired and osteoarthritic cartilage. Osteoarthritis Cartilage. 2010;18:12–23. doi: 10.1016/j.joca.2009.08.009.
  • 3. Kuroki H, Nakagawa Y, Mori K, Ohba M, Suzuki T, Mizuno Y, et al. Acoustic stiffness and change in plug cartilage over time after autologous osteochondral grafting: correlation between ultrasound signal intensity and histological score in a rabbit model. Arthritis Res Ther. 2004;6:R492–R504. doi: 10.1186/ar1219.
  • 4. Thomas CM, Fuller CJ, Whittles CE, Sharif M. Chondrocyte death by apoptosis is associated with cartilage matrix degradation. Osteoarthritis Cartilage. 2007;15:27–34. doi: 10.1016/j.joca.2006.06.012.
  • 5. Piskin A, Gulbahar MY, Tomak Y, Gulman B, Hokelek M, Kerimoglu S, et al. Osteoarthritis models after anterior cruciate ligament resection and medial meniscectomy in rats. A histological and immunohistochemical study. Saudi Med J. 2007;28:1796–1802.
  • 6. Irlenbusch U, Schaller T. Investigations in generalized osteoarthritis. Part 1: genetic study of Heberden's nodes. Osteoarthritis Cartilage. 2006;14:423–427. doi: 10.1016/j.joca.2005.11.016.
  • 7. Otte P. [The nature of coxarthrosis and principles of its management] Dtsch Med J. 1969;20:341–346.
  • 8. Saal A, Gaertner J, Kuehling M, Swoboda B, Klug S. Macroscopic and radiological grading of osteoarthritis correlates inadequately with cartilage height and histologically demonstrable damage to cartilage structure. Rheumatol Int. 2005;25:161–168. doi: 10.1007/s00296-004-0582-6.
  • 9. Ostergaard K, Andersen CB, Petersen J, Bendtzen K, Salter DM. Validity of histopathological grading of articular cartilage from osteoarthritic knee joints. Ann Rheum Dis. 1999;58:208–213. doi: 10.1136/ard.58.4.208.
  • 10. Van der Sluijs JA, Geesink RG, Van der Linden AJ, Bulstra SK, Kuyer R, Drukker J. The reliability of the Mankin score for osteoarthritis. J Orthop Res. 1992;10:58–61. doi: 10.1002/jor.1100100107.
  • 11. Ostergaard K, Petersen J, Andersen CB, Bendtzen K, Salter DM. Histologic/histochemical grading system for osteoarthritic articular cartilage: reproducibility and validity. Arthritis Rheum. 1997;40:1766–1771. doi: 10.1002/art.1780401007.
  • 12. Custers RJ, Creemers LB, Verbout AJ, van Rijen MH, Dhert WJ, Saris DB. Reliability, reproducibility and variability of the traditional Histologic/Histochemical Grading System vs the new OARSI Osteoarthritis Cartilage Histopathology Assessment System. Osteoarthritis Cartilage. 2007;15:1241–1248. doi: 10.1016/j.joca.2007.04.017.
  • 13. Pritzker KP, Gay S, Jimenez SA, Ostergaard K, Pelletier JP, Revell PA, et al. Osteoarthritis cartilage histopathology: grading and staging. Osteoarthritis Cartilage. 2006;14:13–29. doi: 10.1016/j.joca.2005.07.014.
  • 14. Pearson RG, Kurien T, Shu KS, Scammell BE. Histopathology grading systems for characterisation of human knee osteoarthritis--reproducibility, variability, reliability, correlation, and validity. Osteoarthritis Cartilage. 2011;19:324–331. doi: 10.1016/j.joca.2010.12.005.
  • 15. Rout R, McDonnell S, Benson R, Athanasou N, Carr A, Doll H, et al. The histological features of anteromedial gonarthrosis--the comparison of two grading systems in a human phenotype of osteoarthritis. Knee. 2011;18:172–176. doi: 10.1016/j.knee.2010.04.010.
  • 16. Schuster C. A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales. Educational and Psychological Measurement. 2004;64:243–253.
  • 17. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychological Methods. 1996;1:30–46.
  • 18. Bland JM, Altman DG. Agreement between methods of measurement with multiple observations per individual. Journal of Biopharmaceutical Statistics. 2007;17:571–582. doi: 10.1080/10543400701329422.
  • 19. Bland JM, Altman DG. Measuring agreement in method comparison studies. Statistical Methods in Medical Research. 1999;8:135–160. doi: 10.1177/096228029900800204.
  • 20. Aigner T, Cook JL, Gerwin N, Glasson SS, Laverty S, Little CB, et al. Histopathology atlas of animal model systems - overview of guiding principles. Osteoarthritis Cartilage. 2010;18(Suppl 3):S2–S6. doi: 10.1016/j.joca.2010.07.013.
  • 21. Barley RD, Bagnall KM, Jomha NM. Histological scoring of articular cartilage alone provides an incomplete picture of osteoarthritic disease progression. Histol Histopathol. 2010;25:291–297. doi: 10.14670/HH-25.291.
  • 22. Camplejohn KL, Allard SA. Limitations of safranin 'O' staining in proteoglycan-depleted cartilage demonstrated with monoclonal antibodies. Histochemistry. 1988;89:185–188. doi: 10.1007/BF00489922.
  • 23. Hyllested JL, Veje K, Ostergaard K. Histochemical studies of the extracellular matrix of human articular cartilage--a review. Osteoarthritis Cartilage. 2002;10:333–343. doi: 10.1053/joca.2002.0519.
  • 24. Gelse K, Soder S, Eger W, Diemtar T, Aigner T. Osteophyte development--molecular characterization of differentiation stages. Osteoarthritis Cartilage. 2003;11:141–148. doi: 10.1053/joca.2002.0873.
  • 25. Felson DT, Gale DR, Elon Gale M, Niu J, Hunter DJ, Goggins J, et al. Osteophytes and progression of knee osteoarthritis. Rheumatology (Oxford). 2005;44:100–104. doi: 10.1093/rheumatology/keh411.
  • 26. O'Driscoll SW, Marx RG, Fitzsimmons JS, Beaton DE. Method for automated cartilage histomorphometry. Tissue Eng. 1999;5:13–23. doi: 10.1089/ten.1999.5.13.
  • 27. O'Driscoll SW, Marx RG, Beaton DE, Miura Y, Gallay SH, Fitzsimmons JS. Validation of a simple histological-histochemical cartilage scoring system. Tissue Eng. 2001;7:313–320. doi: 10.1089/10763270152044170.
  • 28. Moussavi-Harami SF, Pedersen DR, Martin JA, Hillis SL, Brown TD. Automated objective scoring of histologically apparent cartilage degeneration using a custom image analysis program. J Orthop Res. 2009;27:522–528. doi: 10.1002/jor.20779.
  • 29. Glasson SS, Chambers MG, Van Den Berg WB, Little CB. The OARSI histopathology initiative - recommendations for histological assessments of osteoarthritis in the mouse. Osteoarthritis Cartilage. 2010;18(Suppl 3):S17–S23. doi: 10.1016/j.joca.2010.05.025.
  • 30. Altman RD, Gold GE. Atlas of individual radiographic features in osteoarthritis, revised. Osteoarthritis Cartilage. 2007;15(Suppl A):A1–A56. doi: 10.1016/j.joca.2006.11.009.
