Abstract
Objectives: The aim of the study was to investigate the interobserver agreement for categorical and quantitative scores of liver fibrosis.
Methods: Sixty-five consecutive biopsy specimens from patients with mixed liver disease etiologies were assessed by three pathologists using the Ishak and nonalcoholic steatohepatitis Clinical Research Network (NASH CRN) scoring systems, and the fibrosis area (collagen proportionate area [CPA]) was estimated by visual inspection (visual-CPA). A subset of 20 biopsy specimens was analyzed using digital imaging analysis (DIA) for the measurement of CPA (DIA-CPA).
Results: The bivariate weighted κ between any two pathologists ranged from 0.57 to 0.67 for Ishak staging and from 0.47 to 0.59 for NASH CRN staging. Bland-Altman analysis showed poor agreement between all possible pathologist pairings for visual-CPA but good agreement between all pathologist pairings for DIA-CPA. For the two pathologists who assessed biopsy specimens by both visual-CPA and DIA-CPA, there was good agreement between the two methods. The intraclass correlation coefficient, which is equivalent to the κ statistic for continuous variables, was 0.78 for visual-CPA and 0.97 for DIA-CPA.
Conclusions: These results suggest that DIA-CPA is the most robust method for assessing liver fibrosis, followed by visual-CPA. Categorical scores perform less well than both of the quantitative CPA scores assessed here.
Keywords: Collagen proportionate area, Digital imaging analysis, Bland-Altman plot, κ statistic
Fibrosis represents the final common pathway of injury in chronic liver pathologies, including chronic viral hepatitis and nonalcoholic fatty liver disease (NAFLD). The assessment of liver fibrosis is therefore central to the evaluation of patients with liver disease, and liver biopsy is frequently needed to achieve this. Differences in the pathophysiology of each liver disease can result in distinct patterns of fibrosis distribution, and this has led to the development of disease-specific semiquantitative categorical scores like the Ishak score1 for viral hepatitis or the scoring system for fibrosis developed by the nonalcoholic steatohepatitis Clinical Research Network (NASH CRN) for NAFLD.2
Quantitative digital imaging analysis (DIA) techniques have also been used for fibrosis evaluation. The quantification of collagen as a proportion of the total liver biopsy area (collagen proportionate area [CPA]) was found to correlate with categorical scores of fibrosis and with portal pressure,3 as well as to predict outcomes in patients with chronic hepatitis C4 and cirrhosis.5 Despite this, the use of CPA remains restricted to research studies and has not seen widespread clinical application yet.
In clinical practice, liver fibrosis assessment is needed to prioritize patients for treatment and to monitor disease progression over time. Until now, most of the development and validation of histologic scores has focused on hepatitis C. However, the recent development of highly effective direct-acting antiviral drugs (DAAs) is likely to change this. There is already evidence suggesting that the most cost-effective way to manage hepatitis C in the era of DAAs would be to offer treatment to all patients irrespective of disease severity,6 which would negate the need for fibrosis assessment. The focus of liver fibrosis evaluation is therefore shifting toward other etiologies, NAFLD in particular, and histologic scores now need to be reexamined in this context.
The reproducibility of categorical scores for viral hepatitis has been studied extensively.7-9 DIA-CPA techniques, which in theory could be used universally irrespective of the underlying etiology, have also been studied primarily in chronic hepatitis C.3,10 The interobserver variability of categorical scores, visual-CPA, and DIA-CPA remains largely untested in routine clinical populations and in cohorts with mixed etiologies.
The aim of this study was to assess interobserver variability in the interpretation of biopsy specimens from a patient population with mixed liver disease etiologies using both categorical and continuous histologic scores of liver fibrosis.
Materials and Methods
Study Design and Patient Population
Sixty-five consecutive liver biopsy slides from patients participating in a study of a multiparametric magnetic resonance imaging technique for liver fibrosis evaluation11 were assessed. Biopsies were performed under radiologic guidance using 18-gauge cutting needles, according to the local clinical practice. The study was approved by the UK National Research Ethics Service (Ref 11/H0504/2), and all patients gave written informed consent.
Routine Clinical Reporting of Biopsy Specimens
All biopsies were clinically indicated and were included irrespective of biopsy length. Biopsy specimens were processed according to the local clinical routine, which involves review by two pathologists and a discussion in a clinicopathologic meeting before a final consensus report is issued. For the purposes of this study, this consensus report was used as the reference standard.
In our local practice, fibrosis is staged using the Ishak score1 in all biopsy specimens. The NASH CRN score2 is used in addition for cases of NAFLD.
Blinded Reporting Using Standard Microscopy
At least 6 months after the routine clinical reporting, three liver pathologists (L.M.W., D.D., and K.A.F.) independently and blindly reassessed the biopsy specimens. They evaluated fibrosis using the Ishak score1 and the NASH CRN fibrosis score2 in all biopsy specimens, irrespective of the underlying etiology. The amount of excess fibrosis (collagen) as a percentage of the total biopsy area (CPA) was also estimated by visual inspection of Sirius red–stained slides, using a standard light microscope (visual-CPA).
DIA-CPA
At least 6 months later, three pathologists (L.M.W., K.A.F., and E.F.) used DIA to calculate the CPA (DIA-CPA) in a subset of 20 slides that were representative of the disease etiologies and staging distribution of the whole cohort. Slides stained with Sirius red were scanned using a Hamamatsu Nanozoomer 2.0 HT Digital Pathology System (Hamamatsu, Hamamatsu City, Japan) to produce high-quality digital images. Image processing was done using ImageJ (version 1.47; US National Institutes of Health, Bethesda, MD).
Low-magnification images were used so that the whole biopsy sample was visible in one frame. The digital images were manually cropped to remove native collagen from large portal tracts or liver capsule, artifacts in the background, and any nonliver tissue (eg, skeletal muscle). The images were then split into the red, green, and blue channels, and the green channel was used for further processing. The “threshold” function of the software was used to estimate the total area of liver tissue and the total area of collagen. The area of collagen was expressed as a percentage of the total biopsy area (CPA). Image 1 illustrates the digital image-processing steps.
Image 1.
Digital imaging analysis for estimation of collagen proportionate area. A, Liver biopsy image produced using a high-definition slide scanner. B, Manual editing step to crop out artifacts in the background and any nonliver tissue. The image is then split into its red, green, and blue components, and the green channel (C) is selected for further analysis. Manual thresholding is used to select the collagen in the slide (D), and the number of pixels in this selection is then counted automatically by the software. A further step superimposes the collagen selection (yellow outline) onto the original image (E and magnified section in F), which allows the reporting pathologist to validate the collagen selection. The total area of the biopsy can be estimated with the same technique, and collagen proportionate area is calculated as (number of pixels in the collagen area/number of pixels in the whole biopsy specimen) × 100.
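The pixel-counting step described in Image 1 can be sketched in Python as follows. This is a minimal illustration, not the ImageJ workflow used in the study: it assumes the image has already been cropped and reduced to its green channel, stored as rows of 8-bit intensities, and the two cut-offs (`TISSUE_MAX_GREEN`, `COLLAGEN_MAX_GREEN`) are hypothetical placeholders for the manual thresholds that the pathologists set for each slide.

```python
from typing import Sequence

# Hypothetical cut-offs on an 8-bit green channel (0 = dark, 255 = bright).
# In the study, thresholds were set manually in ImageJ for each image.
TISSUE_MAX_GREEN = 230    # brighter pixels are treated as empty background
COLLAGEN_MAX_GREEN = 120  # darker pixels are treated as Sirius red-stained collagen


def collagen_proportionate_area(green: Sequence[Sequence[int]],
                                tissue_max: int = TISSUE_MAX_GREEN,
                                collagen_max: int = COLLAGEN_MAX_GREEN) -> float:
    """Return CPA (%) = collagen pixels / biopsy pixels x 100.

    `green` is the green channel of an already-cropped image, as rows of
    8-bit intensities. Red-stained collagen absorbs green light, so it
    appears dark in this channel.
    """
    if collagen_max > tissue_max:
        raise ValueError("collagen threshold must not exceed tissue threshold")
    tissue_pixels = 0
    collagen_pixels = 0
    for row in green:
        for value in row:
            if value <= tissue_max:          # pixel is part of the biopsy
                tissue_pixels += 1
                if value <= collagen_max:    # ...and is stained collagen
                    collagen_pixels += 1
    if tissue_pixels == 0:
        raise ValueError("no pixels fall below the tissue threshold")
    return 100.0 * collagen_pixels / tissue_pixels
```

For example, a 2 × 4 toy image such as `[[255, 255, 200, 90], [200, 200, 80, 255]]` contains five tissue pixels, two of which fall below the collagen cut-off, so it gives a CPA of 40%. Counting collagen only among tissue pixels mirrors the caption's formula, in which the whole biopsy area, not the whole frame, is the denominator.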
The DIA technique was developed by M.P. (hepatology clinical researcher) and L.M.W. (liver histopathologist) based on the published literature.3 The other pathologists (E.F. and K.A.F.) received training from M.P. and L.M.W. in the use of the DIA technique. The training involved a 45-minute session in which the use of the software and the DIA technique was demonstrated. A leaflet with stepwise instructions on how to use the software was supplied. The pathologists were provided with a training set of five slides and asked to practice the DIA technique on these until they were confident in the use of the software. The five training slides were different from the 20 used in the final analysis.
Statistical Analysis
The statistical analysis plan was designed and executed by J.B., an experienced medical statistician. The weighted κ statistic was used to assess interobserver agreement in the ordinal fibrosis scores (Ishak and NASH CRN fibrosis). Higher κ values indicate better agreement between observers, and agreement is generally considered almost perfect if κ is 0.81 to 0.99, substantial if κ is 0.61 to 0.80, moderate if κ is 0.41 to 0.60, fair if κ is 0.21 to 0.40, and only slight if κ is 0.01 to 0.20.
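The weighted κ and the agreement bands above can be sketched in Python as follows. The study does not state whether linear or quadratic weights were used, so both are offered and linear weighting is only an assumed default; the function names are hypothetical and the code is an illustration of Cohen's weighted κ, not the authors' analysis script.

```python
from typing import Hashable, Sequence


def weighted_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable],
                   categories: Sequence[Hashable], weighting: str = "linear") -> float:
    """Cohen's weighted kappa for two raters scoring the same items on an ordered scale."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}  # category -> ordered position
    n = len(rater_a)

    # Observed joint proportions: rows = rater A's stage, columns = rater B's.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions; their products give the chance-expected table.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i: int, j: int) -> float:
        # 0 on the diagonal, 1 for the most distant pair of stages.
        d = abs(i - j) / (k - 1)
        return d if weighting == "linear" else d * d

    observed_dis = sum(disagreement(i, j) * observed[i][j]
                       for i in range(k) for j in range(k))
    expected_dis = sum(disagreement(i, j) * row[i] * col[j]
                       for i in range(k) for j in range(k))
    if expected_dis == 0:
        raise ValueError("kappa is undefined when chance disagreement is zero")
    return 1.0 - observed_dis / expected_dis


def agreement_label(kappa: float) -> str:
    """Verbal label using the bands quoted in the Statistical Analysis section."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    if kappa > 0.00:
        return "slight"
    return "below slight"
```

For Ishak staging, `categories` would be `range(7)`. Because categories are looked up by value, NASH CRN substages could be passed as ordered labels (for example `["0", "1a", "1b", "1c", "2", "3", "4"]`), although how substages were handled in this study is not stated. With these bands, the Ishak κ values in Table 2 fall in the moderate-to-substantial range.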
Bland-Altman plots were used to calculate the mean of the difference between pairs of assessments for visual-CPA and DIA-CPA by different pathologists or where the same pathologist assessed both visual-CPA and DIA-CPA in the same slide. A mean difference of 0 would be expected if the two sets of measurements are in complete agreement. A significant deviation from 0 would suggest poor agreement. The Student t test was used to evaluate whether the mean difference of the two measurements was significantly different from 0.
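The paired-difference calculation behind the Bland-Altman tables can be sketched as below. This is a hedged illustration rather than the study's code: the function name is hypothetical, the caller must supply the t critical value (for example, about 1.998 for 64 degrees of freedom or 2.093 for 19), and the 95% limits of agreement, which the study does not report, are included only because they are the usual companion output of a Bland-Altman analysis.

```python
import math
import statistics
from typing import Dict, Sequence


def bland_altman(first: Sequence[float], second: Sequence[float],
                 t_crit: float) -> Dict[str, float]:
    """Paired-difference summary for two sets of CPA measurements on the same slides.

    t_crit is the 0.975 quantile of Student's t distribution with n - 1
    degrees of freedom (for a two-sided 95% CI), supplied by the caller.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample SD (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    se = sd_diff / math.sqrt(n)         # standard error of the mean difference
    return {
        "mean": mean_diff,
        "sd": sd_diff,
        "t": mean_diff / se,            # one-sample t test of mean difference vs 0
        "ci_low": mean_diff - t_crit * se,
        "ci_high": mean_diff + t_crit * se,
        # Conventional 95% limits of agreement (not reported in the study).
        "loa_low": mean_diff - 1.96 * sd_diff,
        "loa_high": mean_diff + 1.96 * sd_diff,
    }
```

As a check on the arithmetic, the summary values for pathologists 1 vs 2 in Table 3 (mean 2.87, SD 7.26, n = 65) give SE = 7.26/√65 ≈ 0.90, t ≈ 3.19, and, with t_crit ≈ 1.998, a 95% CI of about 1.07 to 4.67, matching the table.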
A mixed-model method taking into account the correlation between measurements made on the same slide was used to calculate the between-slide and within-slide variation for the two methods of CPA evaluation. The between-slide variation is a measure of the variation of CPA across the whole cohort of slides and largely depends on the range of “true CPA” values in the cohort. The within-slide variation is a measure of the variation between repeated CPA assessments of the same slide (ie, the smaller the within-slide variation, the closer the agreement between assessors for each individual slide).
Two pathologists assessed a set of 20 biopsy specimens by all three methods (categorical scores, visual-CPA, and DIA-CPA). The intraclass correlation coefficient (ICCC) was calculated for the interobserver agreement in the visual-CPA and DIA-CPA analysis. The ICCC is equivalent to the κ statistic for the interobserver assessment of continuous variables.
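The study does not give the specification of its mixed model or the form of ICC it used. As a hedged stand-in, the sketch below uses the classical one-way random-effects ANOVA (method-of-moments) estimators for a balanced table of slides × assessors, and defines the ICC as the share of total variance attributable to slides, often written ICC(1). The function names are hypothetical, and whether this matches the authors' ICC is an assumption.

```python
import statistics
from typing import Sequence, Tuple


def variance_components(ratings: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Between-slide and within-slide variance from a balanced slides x assessors table."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two slides")
    k = len(ratings[0])
    if k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need the same number (>= 2) of assessments for every slide")
    slide_means = [statistics.fmean(r) for r in ratings]
    grand_mean = statistics.fmean(slide_means)  # equals the overall mean when balanced
    # Mean squares between slides and within slides (one-way ANOVA).
    ms_between = k * sum((m - grand_mean) ** 2 for m in slide_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for r, m in zip(ratings, slide_means) for x in r) / (n * (k - 1))
    # Moment estimator of the slide variance, truncated at zero.
    var_between = max((ms_between - ms_within) / k, 0.0)
    return var_between, ms_within


def icc_one_way(ratings: Sequence[Sequence[float]]) -> float:
    """ICC as between-slide variance / (between-slide + within-slide variance)."""
    var_between, var_within = variance_components(ratings)
    total = var_between + var_within
    if total == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return var_between / total
```

For example, three slides rated `[[10, 12], [20, 22], [30, 32]]` give a between-slide variance of 99 and a within-slide variance of 2, so the ICC is 99/101 ≈ 0.98: assessors differ by little relative to the spread of CPA across slides, which is the sense in which a small within-slide variation signals good agreement.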
Results
In the overall biopsy cohort (n = 65), nearly half of the patients had mild fibrosis (F1 in 15 [23%] and F2 in 17 [26%]), and cirrhosis (F6) was present in 11 (17%) cases. The two main primary pathologies were NAFLD (n = 21; 32%) and chronic viral hepatitis (n = 28; 43%). The biopsy specimens had a median (interquartile range [IQR]) length of 20 mm (16-29 mm) and contained a median (IQR) of 11 (8-15) portal tracts and 6 (4-9) central veins. The distribution of fibrosis stage, primary diagnosis, and quality measures for the biopsy specimens assessed by DIA (n = 20) was similar to that of the overall cohort (Table 1).
Table 1.
Liver Biopsy Characteristics
Characteristic | Overall Cohort (n = 65) | DIA-CPA (n = 20) |
---|---|---|
Ishak stage, No. (%) | ||
0 | 3 (5) | 0 |
1 | 15 (23) | 3 (15) |
2 | 17 (26) | 5 (25) |
3 | 10 (15) | 3 (15) |
4 | 4 (6) | 3 (15) |
5 | 5 (8) | 1 (5) |
6 | 11 (17) | 5 (25) |
Primary diagnosis, No. (%) | ||
NAFLD | 21 (32) | 7 (35) |
ALD | 5 (8) | 3 (15) |
Chronic viral hepatitis | 28 (43) | 9 (45) |
Cholestatic and autoimmune | 4 (6) | 1 (5) |
Other | 7 (11) | 0 |
Biopsy specimen quality, median (IQR) | ||
Biopsy specimen length, mm | 20 (16-29) | 19 (16-29) |
Portal tracts, No. | 11 (8-15) | 10 (7-16) |
Central veins, No. | 6 (4-9) | 6 (4-11) |
ALD, alcoholic liver disease; DIA-CPA, digital imaging analysis for the measurement of collagen proportionate area; IQR, interquartile range; NAFLD, nonalcoholic fatty liver disease.
Interobserver Agreement—Categorical Scores of Fibrosis (n = 65)
The bivariate weighted κ statistics between any two pathologists ranged from 0.57 to 0.67 for the Ishak score and from 0.47 to 0.59 for the NASH CRN fibrosis score (Table 2).
Table 2.
Bivariate Weighted κ for Categorical Fibrosis Scores (n = 65)
Assessors | Ishak, wκ (95% CI) | NASH CRN, wκ (95% CI) |
---|---|---|
1 vs 2 | 0.67 (0.55-0.79) | 0.57 (0.44-0.70) |
1 vs 3 | 0.62 (0.49-0.74) | 0.59 (0.45-0.74) |
2 vs 3 | 0.57 (0.44-0.70) | 0.47 (0.33-0.61) |
CI, confidence interval; NASH CRN, nonalcoholic steatohepatitis Clinical Research Network; wκ, weighted κ.
Interobserver Agreement—Visual-CPA (n = 65)
The mean difference between the visual-CPA assessments was 2.87 between pathologists 1 and 2, −3.68 between pathologists 1 and 3, and −6.55 between pathologists 2 and 3. In all three possible pairings, the mean difference was significantly different from 0, indicating poor agreement (Table 3). In the 20 slides that were subsequently assessed by DIA-CPA, the between-slide variation was 73, and the within-slide variation was 21.
Table 3.
Bland-Altman Test for Interobserver Agreement in Collagen Proportionate Area Estimation by Visual Inspection (n = 65)
Assessors | Mean (SD) of Difference | 95% CI | t | P Value |
---|---|---|---|---|
1 vs 2 | 2.87 (7.26) | 1.07 to 4.67 | 3.19 | .0022 |
1 vs 3 | −3.68 (8.48) | −5.78 to −1.57 | −3.50 | .0009 |
2 vs 3 | −6.55 (7.85) | −8.49 to −4.60 | −6.73 | <.00001 |
CI, confidence interval.
Interobserver Agreement—DIA-CPA (n = 20)
The mean of the difference between the DIA-CPA assessments was –0.60 between pathologists 1 and 3, 0.04 between pathologists 1 and 4, and 0.64 between pathologists 3 and 4. None of the three pairings showed any significant difference from 0, indicating good agreement (Table 4). The between-slide variation was 82, and the within-slide variation was 4.6.
Table 4.
Bland-Altman Test for Interobserver Agreement in Collagen Proportionate Area Computation by Digital Imaging Analysis (n = 20)
Assessors | Mean (SD) of Difference | 95% CI | t | P Value |
---|---|---|---|---|
1 vs 3 | −0.60 (2.84) | −1.93 to 0.73 | −0.95 | .36 |
1 vs 4 | 0.04 (3.26) | −1.49 to 1.56 | 0.05 | .96 |
3 vs 4 | 0.64 (3.20) | −0.86 to 2.14 | 0.89 | .38 |
CI, confidence interval.
Agreement Between Methods for CPA Assessment (n = 20)
Pathologists 1 and 3 assessed the 20 slides using both methods (visual-CPA and DIA-CPA). The mean difference between visual-CPA and DIA-CPA was −0.49 for pathologist 1 and 1.78 for pathologist 3; neither was significantly different from 0, indicating good agreement (Table 5).
Table 5.
Bland-Altman Test for Agreement Between Methods of Collagen Proportionate Area Measurement (n = 20)
Assessors | Mean (SD) of Difference | 95% CI | t | P Value |
---|---|---|---|---|
1 | −0.49 (6.26) | −3.42 to 2.44 | −0.35 | .73 |
3 | 1.78 (5.70) | −0.89 to 4.45 | 1.40 | .18 |
CI, confidence interval.
ICCCs for CPA Assessments (n = 20)
The ICCC between the assessments of pathologists 1 and 3 was 0.78 for the visual-CPA and 0.97 for DIA-CPA.
Discussion
The study evaluated the interobserver variability of quantitative (DIA-CPA and visual-CPA) and categorical (Ishak and NASH CRN) scores for the histologic assessment of liver fibrosis in biopsy specimens from patients with mixed liver disease etiologies. DIA-CPA was the most reproducible score, with the highest ICCC (0.97), followed by visual-CPA (ICCC, 0.78). Both categorical scores performed less well (the best weighted κ between pathologist pairings was 0.67 for Ishak and 0.59 for NASH CRN). Furthermore, Bland-Altman analysis showed good agreement for all pairs of assessors for DIA-CPA but for none of the pairs for visual-CPA. DIA-CPA also had a lower within-slide variation than visual-CPA (4.6 vs 21).
This finding is even more notable when training and experience are considered. All the pathologists who took part in the study have been trained in, and have extensive experience with, the categorical scores, which are routinely used at our center. In contrast, they received only a 45-minute training session in the DIA-CPA technique and were able to practice on only five slides. No specific training was given in the reporting of visual-CPA.
The high interobserver agreement for DIA-CPA reported here is in keeping with other studies.3,10,12 However, comparisons between DIA-CPA techniques and categorical scores have produced some conflicting results. For example, Pilette et al12 found DIA-CPA to have an ICCC of 0.996, compared with κ values ranging from 0.29 to 0.87 for the Knodell scoring system.13 In another study, however, categorical scores were reported to have better interobserver agreement than DIA-CPA methods.14 The most likely explanation for this discrepancy is the different method used in the study by Wright et al,14 in which different sections from the same biopsy core were used to assess observer-dependent agreement; differences in staining quality between the two sections may have introduced bias. Despite these conflicting results, we feel that our results, in the context of the published literature, suggest that DIA-CPA is the most robust and reproducible method for histologic liver fibrosis assessment and should be used where the resources are available.
However, considerable hurdles need to be overcome if this technique is to see widespread use. DIA-CPA techniques depend on the quality of the digital images used for the analysis. To achieve the necessary image quality in this study, a slide scanner was used together with freely available image analysis software. Other DIA-CPA methods rely on both proprietary image acquisition systems and software costing up to 6,850 US dollars.3,10 Furthermore, DIA-CPA requires additional time for the preparation and reporting of digital images. We did not assess the time required for DIA-CPA analysis, but previous work has reported that image analysis alone adds 5 to 10 minutes per biopsy specimen.14
Several aspects of visual-CPA assessment make it appealing for further evaluation as an alternative to DIA-CPA in routine clinical practice. Visual-CPA can be done quickly and requires no extra equipment. Furthermore, in this study, visual-CPA achieved lower interobserver variability than the routinely used categorical scores and showed good agreement with DIA-CPA. More experience and training may therefore improve the performance of visual-CPA further. Visual-CPA could also be reported alongside routine categorical scores, and this may provide additional information that could be used clinically.
There are very limited published data for CPA techniques in NAFLD. For example, in the study by Hall et al,15 evaluating DIA-CPA in explanted livers, patients with NAFLD were not included as it was difficult to establish any contribution to disease from alcohol. Validation of CPA techniques in prospective NAFLD cohorts is therefore required.
Study Limitations
The study evaluated the interobserver agreement of different histologic fibrosis scoring systems in a cohort of patients with mixed liver disease etiologies. The categorical scores we chose to assess were not necessarily the ones designed for each specific etiology. For example, the Ishak scoring system1 was designed for patients with chronic hepatitis C, whereas here it was applied to patients of all etiologies. However, both categorical scores are descriptive and rely on the recognition of specific features by the reporting pathologists (eg, cirrhotic nodules or fibrotic bridges between portal tracts). The recognition of these features should therefore be possible in biopsy specimens from any liver disease etiology.
Furthermore, no minimum biopsy quality criteria were applied, despite evidence showing that accuracy depends on biopsy length and the number of portal tracts present.16,17 Nevertheless, the biopsy specimens included in our study are representative of those in routine clinical care, where biopsies are rarely repeated for reasons of adequacy.
In conclusion, our data suggest that DIA-CPA is the most reproducible method of histologic liver fibrosis assessment and should be performed where the necessary resources are available. Furthermore, visual-CPA is an attractive technique that warrants further evaluation as it could be easily implemented alongside traditional reporting. Both CPA techniques would provide adjunctive data to semiquantitative scores, but further studies are needed to examine whether quantitative techniques could replace categorical reporting in routine practice.
Funding
This work was supported by grants from the Oxford NIHR Biomedical Research Centre and the British Heart Foundation.
M.P., R.B., S.N., and E.B. are shareholders in Perspectum Diagnostics (PD), a university spin-out company. R.B. and S.N. are on the board of directors of PD. R.B. is employed by PD. J.B., E.F., D.D., K.A.F., and L.M.W. declare no conflict of interest.
References
- 1. Ishak K, Baptista A, Bianchi L, et al. Histological grading and staging of chronic hepatitis. J Hepatol. 1995;22:696-699.
- 2. Kleiner DE, Brunt EM, Van Natta M, et al. Design and validation of a histological scoring system for nonalcoholic fatty liver disease. Hepatology. 2005;41:1313-1321.
- 3. Calvaruso V, Burroughs AK, Standish R, et al. Computer-assisted image analysis of liver collagen: relationship to Ishak scoring and hepatic venous pressure gradient. Hepatology. 2009;49:1236-1244.
- 4. Huang Y, de Boer WB, Adams LA, et al. Image analysis of liver biopsy samples measures fibrosis and predicts clinical outcome. J Hepatol. 2014;61:22-27.
- 5. Tsochatzis E, Bruno S, Isgro G, et al. Collagen proportionate area is superior to other histological methods for sub-classifying cirrhosis and determining prognosis. J Hepatol. 2014;60:948-954.
- 6. Tsochatzis EA, Crossan C, Longworth L, et al. Cost-effectiveness of noninvasive liver fibrosis tests for treatment decisions in patients with chronic hepatitis C. Hepatology. 2014;60:832-843.
- 7. The French Metavir Cooperative Study Group. Intraobserver and interobserver variations in liver biopsy interpretation in patients with chronic hepatitis C. Hepatology. 1994;20:15-20.
- 8. Goldin RD, Goldin JG, Burt AD, et al. Intra-observer and inter-observer variation in the histopathological assessment of chronic viral hepatitis. J Hepatol. 1996;25:649-654.
- 9. Gronbaek K, Christensen PB, Hamilton-Dutoit S, et al. Interobserver variation in interpretation of serial liver biopsies from patients with chronic hepatitis C. J Viral Hepat. 2002;9:443-449.
- 10. Campos CF, Paiva DD, Perazzo H, et al. An inexpensive and worldwide available digital image analysis technique for histological fibrosis quantification in chronic hepatitis C. J Viral Hepat. 2014;21:216-222.
- 11. Banerjee R, Pavlides M, Tunnicliffe EM, et al. Multiparametric magnetic resonance for the non-invasive diagnosis of liver disease. J Hepatol. 2014;60:69-77.
- 12. Pilette C, Rousselet MC, Bedossa P, et al. Histopathological evaluation of liver fibrosis: quantitative image analysis vs semi-quantitative scores. Comparison with serum markers. J Hepatol. 1998;28:439-446.
- 13. Knodell RG, Ishak KG, Black WC, et al. Formulation and application of a numerical scoring system for assessing histological activity in asymptomatic chronic active hepatitis. Hepatology. 1981;1:431-435.
- 14. Wright M, Thursz M, Pullen R, et al. Quantitative versus morphological assessment of liver fibrosis: semi-quantitative scores are more robust than digital image fibrosis area estimation. Liver Int. 2003;23:28-34.
- 15. Hall A, Germani G, Isgrò G, et al. Fibrosis distribution in explanted cirrhotic livers. Histopathology. 2012;60:270-277.
- 16. Bedossa P, Dargère D, Paradis V. Sampling variability of liver fibrosis in chronic hepatitis C. Hepatology. 2003;38:1449-1457.
- 17. Hall AR, Tsochatzis E, Morris R, et al. Sample size requirement for digital image analysis of collagen proportionate area in cirrhotic livers. Histopathology. 2013;62:421-430.