Reproducibility and Feasibility of Strategies for Morphologic Assessment of Renal Biopsies Using the Nephrotic Syndrome Study Network Digital Pathology Scoring System

Jarcy Zee; Jeffrey B Hodgin; Laura H Mariani; Joseph P Gaut; Matthew B Palmer; Serena M Bagnasco; Avi Z Rosenberg; Stephen M Hewitt; Lawrence B Holzman; Brenda W Gillespie; Laura Barisoni

doi:10.5858/arpa.2017-0181-OA

Author manuscript; available in PMC: 2019 May 1.

Published in final edited form as: Arch Pathol Lab Med. 2018 Feb 19;142(5):613–625. doi: 10.5858/arpa.2017-0181-OA


PMCID: PMC5946059 NIHMSID: NIHMS957480 PMID: 29457738

Abstract

Context

Testing reproducibility is critical for the development of methodologies for morphologic assessment. Our previous study using the descriptor-based Nephrotic Syndrome Study Network Digital Pathology Scoring System (NDPSS) on glomerular images revealed variable reproducibility.

Objective

To test reproducibility and feasibility of alternative scoring strategies for digital morphologic assessment of glomeruli and explore use of alternative agreement statistics.

Design

The original NDPSS was modified (NDPSS1 and NDPSS2) to evaluate (1) independent scoring of each individual biopsy level, (2) use of continuous measures, (3) groupings of individual descriptors into classes and subclasses prior to scoring, and (4) indication of pathologists’ confidence/uncertainty for any given score. Three and 5 pathologists scored 157 and 79 glomeruli using the NDPSS1 and NDPSS2, respectively. Agreement was tested using conventional (Cohen κ) and alternative (Gwet agreement coefficient 1 [AC₁]) agreement statistics and compared with previously published data (original NDPSS).

Results

Overall, pathologists’ uncertainty was low, favoring application of the Gwet AC₁. Greater agreement was achieved using the Gwet AC₁ compared with the Cohen κ across all scoring methodologies. Mean (standard deviation) differences in agreement estimates using the NDPSS1 and NDPSS2 compared with the single-level original NDPSS were −0.09 (0.17) and −0.17 (0.17), respectively. Using the Gwet AC₁, 79% of the original NDPSS descriptors had good or excellent agreement. Pathologist feedback indicated the NDPSS1 and NDPSS2 were time-consuming.

Conclusions

The NDPSS1 and NDPSS2 increased pathologists’ scoring burden without improving reproducibility. Use of alternative agreement statistics was strongly supported. We suggest using the original NDPSS on whole slide images for glomerular morphology assessment and for guiding future automated technologies.

In the setting of clinical trials and translational research, the morphologic evaluation of renal biopsies has progressively transitioned from use of conventional light microscopy to digital pathology on whole slide images (WSIs).^1–3 Previous studies have revealed that interreader and intrareader reproducibility of morphology scoring or diagnoses are generally higher when using WSIs and enhanced by annotation.^1,4–9 The establishment of digital pathology repositories also facilitates the testing of different scoring systems and metrics, simultaneously or at different times, using the same set of WSIs.²

The Nephrotic Syndrome Study Network (NEPTUNE) pathology working group pioneered the establishment of the NEPTUNE digital pathology protocol to enable standardized morphologic assessment of digital renal biopsies from children and adults with minimal change disease (MCD), focal segmental glomerulosclerosis (FSGS), and membranous nephropathy (MN).^2,10 The NEPTUNE digital pathology protocol includes protocols to populate a digital pathology repository, to annotate (enumerate) individual glomeruli across biopsy levels, to morphologically assess renal biopsies using the descriptor-based NEPTUNE Digital Pathology Scoring System (NDPSS), and for digital morphometry.^2,11 This multicenter effort has served as a model for other international consortia such as the International Digital Nephropathology Network.¹²

A critical element in establishing new scoring systems, besides their clinical significance, is their reproducibility. Reproducibility can be modulated by several factors, including pathologists’ training, the type of lesions being scored, the metrics, or the statistical approach applied.¹² For example, cross-training of pathologists prior to scoring and grouping of individual descriptors that share common features into categories can increase reproducibility.^3,4 Furthermore, morphologic features captured as dichotomous measures (ie, present versus absent) may be better represented by continuous measures. Lastly, the agreement statistic used to evaluate reproducibility needs to be carefully chosen. For example, the Cohen κ is conventionally used in pathology partly because it makes a correction for agreement by chance, but it also inherently assumes that all ratings may be rated randomly.^13,14 However, when the scoring process is performed by experts and preceded by rigorous cross-training processes, it is plausible that only a portion of observations is subject to random ratings, in which case the Cohen κ may overestimate and therefore overcorrect for chance agreement. An alternative agreement statistic that tends to be more liberal by assuming a lower proportion of random ratings may be more suitable in this case, such as the Gwet agreement coefficient 1 (AC₁).^14,15
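To make the contrast between the two statistics concrete, the sketch below computes both for 2 raters scoring a single dichotomous descriptor. It assumes the standard 2-rater definitions: the Cohen κ estimates chance agreement from the product of each rater's marginal proportions, whereas the Gwet AC₁ estimates it from the average prevalence π of each category as Σπ(1 − π)/(Q − 1) for Q categories. The function names and the example scores are hypothetical illustrations, not data or code from this study.

```python
from collections import Counter


def observed_agreement(r1, r2):
    """Proportion of items on which the two raters gave the same score."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Cohen kappa for two raters (undefined if chance agreement equals 1)."""
    n = len(r1)
    c1, c2 = Counter(r1), Counter(r2)
    po = observed_agreement(r1, r2)
    # Chance agreement: product of the two raters' marginal proportions.
    pe = sum((c1[k] / n) * (c2[k] / n) for k in set(r1) | set(r2))
    return (po - pe) / (1 - pe)


def gwet_ac1(r1, r2, categories=(0, 1)):
    """Gwet AC1 for two raters over a fixed set of categories."""
    n = len(r1)
    c1, c2 = Counter(r1), Counter(r2)
    po = observed_agreement(r1, r2)
    # Average prevalence of each category across the two raters.
    pis = [(c1[k] + c2[k]) / (2 * n) for k in categories]
    # Chance agreement: sum of pi * (1 - pi), scaled by (Q - 1).
    pe = sum(p * (1 - p) for p in pis) / (len(categories) - 1)
    return (po - pe) / (1 - pe)


# Hypothetical rare descriptor scored on 100 glomeruli (1 = present).
# Both present: 2; rater 1 only: 4; rater 2 only: 4; both absent: 90.
rater1 = [1] * 2 + [1] * 4 + [0] * 4 + [0] * 90
rater2 = [1] * 2 + [0] * 4 + [1] * 4 + [0] * 90

kappa = cohen_kappa(rater1, rater2)
ac1 = gwet_ac1(rater1, rater2)
print(f"observed agreement = {observed_agreement(rater1, rater2):.2f}")
print(f"Cohen kappa = {kappa:.2f}, Gwet AC1 = {ac1:.2f}")
```

In this made-up example the raters agree on 92 of 100 glomeruli, yet the Cohen κ is about 0.29 because the descriptor is rare, whereas the Gwet AC₁ is about 0.91.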

Although the NDPSS was designed to include all biopsy levels available for assessment, our first reproducibility test was conducted on single static (JPEG) images of glomeruli.³ With the current study, we aim to explore reproducibility and feasibility of alternative scoring strategies, metrics, and statistical approaches for optimizing the original NDPSS, with the goal of establishing a robust methodology for morphologic assessment of digital renal biopsies in the settings of clinical research, clinical trials, and ultimately routine practice.

MATERIALS AND METHODS

Study Cohort

The WSIs included in this study are part of the set of cases enrolled in the multicenter and multiethnic prospective cohort study NEPTUNE.¹⁰ As previously described, renal biopsy material was collected according to the NEPTUNE digital pathology protocol and made available to study pathologists through password-protected access to the NEPTUNE digital pathology repository.²

Overall Study Design

Our study was designed to address 3 goals: (1) to test whether alternative scoring strategies and metrics improve interpathologist reproducibility, we modified the original NDPSS to create the NDPSS1 and NDPSS2; (2) to determine the statistical approach that would most accurately measure the agreement (or disagreement) among pathologists, we compared Cohen κ and Gwet AC₁ statistics across all scoring strategies; and (3) to evaluate the feasibility of each scoring strategy, we collected pathologists’ feedback on the use of the different approaches.

Scoring Systems

Original NDPSS

Previously published scoring data using the original NDPSS were retrieved from the NEPTUNE database and reanalyzed in the current study. Data were previously obtained by 12 pathologists, who reviewed 315 JPEG images of individual glomeruli (equivalent to assessing the glomerular profile on a single biopsy level) and recorded the presence or absence of 51 glomerular descriptors using an electronic scoring matrix (Figure 1, A).³ In the current study, we used scores from 39 of 51 descriptors pertinent to MCD, FSGS, and MN; we also generated classes and subclasses of descriptors by applying postscoring grouping strategies mimicking those used in NDPSS1 and NDPSS2 described below (Tables 1 and 2; Figure 2, A through D).
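The postscoring grouping can be sketched as follows, assuming that a class is counted as present whenever any of its component descriptors was scored present (the rule stated for the prescoring groupings in Table 3). The mapping below copies only 3 of the 6 classes from Table 1 for brevity; the function name and the example glomerulus are hypothetical.

```python
# Partial class membership copied from Table 1 (3 of the 6 classes).
CLASS_MEMBERS = {
    "Global obliteration": {
        "Global sclerosis with hyalinosis",
        "Global sclerosis without hyalinosis",
        "Global collapse",
        "Global deflation",
        "Obsolescence",
        "Global mesangial sclerosis",
    },
    "Mesangiopathic changes": {
        "Segmental mesangial expansion",
        "Global mesangial expansion",
        "Segmental mesangial cell proliferation",
        "Global mesangial cell proliferation",
    },
    "GBM spikes": {"Segmental spikes", "Global spikes"},
}


def group_scores(descriptor_scores, class_members=CLASS_MEMBERS):
    """Derive dichotomous class scores from dichotomous descriptor scores.

    A class is present if any member descriptor is present. Descriptors
    missing from `descriptor_scores` are treated as absent (an assumption).
    """
    return {
        cls: any(descriptor_scores.get(d, False) for d in members)
        for cls, members in class_members.items()
    }


# Hypothetical scores for one glomerulus.
example = {
    "Global deflation": True,
    "Segmental spikes": False,
    "Global spikes": False,
}
class_scores = group_scores(example)
print(class_scores)
```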

Table 1.

Groupings of Individual Descriptors Into Classes Included in Each Set of Scoring Strategies

	Classes
	Any Sclerosis, Wrinkling, or Tip	Global Obliteration	Segmental Obliteration	Podocyte Injury	Mesangiopathic Changes	GBM Spikes
Individual descriptor
No/minimal changes
Global sclerosis with hyalinosis	X	X
Global sclerosis without hyalinosis	X	X
Global collapse	X	X
Global deflation	X	X
Obsolescence	X	X
Global mesangial sclerosis		X
Segmental perihilar sclerosis	X		X
Segmental extended perihilar sclerosis	X		X
Segmental sclerosis away from vascular and tubular pole	X		X
Segmental sclerosis cannot determine location	X		X
Cellular tip lesion	X		X
Sclerosing tip lesion	X		X
Extended cellular tip lesion	X		X
Extended sclerosing tip lesion	X		X
Midglomerular sclerosis			X
Cellular nontip	X		X
Segmental collapse	X		X
Segmental deflation	X		X
Periglomerular fibrosis
Glomerular foam cells			X
Segmental podocyte hyaline droplets^a				X
Global podocyte hyaline droplets^a				X
Hyalinosis at the vascular pole			X
Hyalinosis at the tubular pole			X
Hyalinosis away from vascular and tubular pole			X
Hyalinosis cannot determine location			X
Synechia			X
Segmental podocyte hypertrophy				X
Global podocyte hypertrophy				X
Segmental podocyte hyperplasia				X
Global podocyte hyperplasia				X
Halo				X
Segmental mesangial expansion					X
Global mesangial expansion					X
Segmental mesangial cell proliferation					X
Global mesangial cell proliferation					X
Segmental spikes						X
Global spikes						X
Marginating leukocytes
Set of Scoring Strategies
Original NDPSS^b	D	D	D	D	D	D
NDPSS1		P	P	P	P	P%
NDPSS2	D	D	D	D	D	D


Abbreviations: D, dichotomous; GBM, glomerular basement membrane; NDPSS, Nephrotic Syndrome Study Network Digital Pathology Scoring System; P, probability; %, percentage.

^a Podocyte hyaline droplets was originally scored as a single descriptor and split into 2 individual descriptors (global versus segmental) for the NDPSS1 only.

^b Postscoring grouping into classes.

Table 2.

Groupings of Individual Descriptors Into Subclasses Included in Each Set of Scoring Strategies

	Subclasses
	Any Sclerosis or Tip	Any Wrinkling	Any Deflation	Any Collapse	Global Sclerosis	Global Wrinkling
Individual descriptor
No/minimal changes
Global sclerosis with hyalinosis	X				X
Global sclerosis without hyalinosis	X				X
Global collapse		X		X		X
Global deflation		X	X			X
Obsolescence	X				X
Global mesangial sclerosis
Segmental perihilar sclerosis	X
Segmental extended perihilar sclerosis	X
Segmental sclerosis away from vascular and tubular pole	X
Segmental sclerosis cannot determine location	X
Cellular tip lesion	X
Sclerosing tip lesion	X
Extended cellular tip lesion	X
Extended sclerosing tip lesion	X
Midglomerular sclerosis	X
Cellular nontip	X
Segmental collapse		X		X
Segmental deflation		X	X
Periglomerular fibrosis
Glomerular foam cells
Segmental podocyte hyaline droplets^a
Global podocyte hyaline droplets^a
Hyalinosis at the vascular pole
Hyalinosis at the tubular pole
Hyalinosis away from vascular and tubular pole
Hyalinosis cannot determine location
Synechia
Segmental podocyte hypertrophy
Global podocyte hypertrophy
Segmental podocyte hyperplasia
Global podocyte hyperplasia
Halo
Segmental mesangial expansion
Global mesangial expansion
Segmental mesangial cell proliferation
Global mesangial cell proliferation
Segmental spikes
Global spikes
Marginating leukocytes
Set of Scoring Strategies
Original NDPSS^b	D	D	D	D	D	D
NDPSS1			%	%	P	P
NDPSS2	%	%	%	%	D	D

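	Subclasses (continued)
	Segmental Sclerosis	Segmental Wrinkling	Tip Lesions	Segmental Hyalinosis	Other Segmental Lesions	Mesangial Expansion	Mesangial Cell Proliferation	Podocyte Hypertrophy	Podocyte Hyperplasia	Podocyte Hyaline Droplets^a
Individual descriptor
No/minimal changes
Global sclerosis with hyalinosis
Global sclerosis without hyalinosis
Global collapse
Global deflation
Obsolescence
Global mesangial sclerosis
Segmental perihilar sclerosis	X
Segmental extended perihilar sclerosis	X
Segmental sclerosis away from vascular and tubular pole	X
Segmental sclerosis cannot determine location	X
Cellular tip lesion			X
Sclerosing tip lesion			X
Extended cellular tip lesion			X
Extended sclerosing tip lesion			X
Midglomerular sclerosis	X
Cellular nontip	X
Segmental collapse		X
Segmental deflation		X
Periglomerular fibrosis
Glomerular foam cells					X
Segmental podocyte hyaline droplets^a										X
Global podocyte hyaline droplets^a										X
Hyalinosis at the vascular pole				X
Hyalinosis at the tubular pole				X
Hyalinosis away from vascular and tubular pole				X
Hyalinosis cannot determine location				X
Synechia					X
Segmental podocyte hypertrophy								X
Global podocyte hypertrophy								X
Segmental podocyte hyperplasia									X
Global podocyte hyperplasia									X
Halo
Segmental mesangial expansion						X
Global mesangial expansion						X
Segmental mesangial cell proliferation							X
Global mesangial cell proliferation							X
Segmental spikes
Global spikes
Marginating leukocytes
Set of Scoring Strategies
Original NDPSS^b	D	D	D	D	D	D	D	D	D	D
NDPSS1	P	P	P	P	P	P%	P%	P%	P%	P%
NDPSS2	D	D	D	D	D	%	%	%	%	D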


Abbreviations: D, dichotomous; NDPSS, Nephrotic Syndrome Study Network Digital Pathology Scoring System; P, probability; %, percentage.

^a Podocyte hyaline droplets was originally scored as a single descriptor and split into 2 individual descriptors (global versus segmental) for the NDPSS1 only.

^b Postscoring grouping into subclasses.

Figure 2. Example of classes and subclasses of descriptors (images) and how they are organized in the modified Nephrotic Syndrome Study Network Digital Pathology Scoring System (NDPSS), NDPSS1, and NDPSS2. The NDPSS2 class any sclerosis, wrinkling, or tip includes the subclasses any sclerosis (A and B) and any wrinkling (C and D). The NDPSS2 subclass any sclerosis contains additional subclasses global sclerosis (A) and segmental sclerosis (B); the NDPSS2 subclass any wrinkling contains additional subclasses global wrinkling (C) and segmental wrinkling (D). The NDPSS1 and 2 class global obliteration includes the subclasses global sclerosis (A) and global wrinkling (C); the NDPSS1 and 2 class segmental obliteration includes the subclasses segmental sclerosis (B) and segmental wrinkling (D). Examples of descriptors in the various classes and subclasses: A, The descriptors global sclerosis with hyalinosis (periodic acid–Schiff) and obsolescence (hematoxylin-eosin) are grouped in the NDPSS1 and 2 subclass global sclerosis, the NDPSS2 subclass any sclerosis, and the NDPSS1 and 2 class global obliteration. B, The descriptors segmental sclerosis away from vascular and tubular pole (silver stain), tip lesion (silver stain; yellow arrows), and segmental perihilar sclerosis (periodic acid–Schiff; blue arrow) are grouped in the NDPSS1 and 2 subclass segmental sclerosis, the NDPSS2 subclass any sclerosis, and the NDPSS1 and 2 class segmental obliteration. C, The descriptors global collapse (trichrome) and global deflation (silver stain) are grouped in the NDPSS1 and 2 subclass global wrinkling, the NDPSS2 subclass any wrinkling, and the NDPSS1 and 2 class global obliteration. D, The descriptors segmental collapse (silver stain; green arrows) and segmental deflation (periodic acid–Schiff; red arrows) are grouped in the NDPSS1 and 2 subclass segmental wrinkling, the NDPSS2 subclass any wrinkling, and the NDPSS1 and 2 class segmental obliteration. The descriptors global collapse and segmental collapse are also grouped in the NDPSS2 subclass any collapse, and the descriptors global deflation and segmental deflation are also grouped in the NDPSS2 subclass any deflation (not shown in figure) (original magnifications ×60 [A, global sclerosis with hyalinosis] and ×40 [A, obsolescence, and B through D]).

The NDPSS1

Scoring Strategies

An electronic scoring matrix specifically designed for NDPSS1 (Figure 1, B) was used to test alternative scoring strategies (Table 3), including the use of all biopsy levels (Figure 3, A through F), grouping descriptors prior to scoring (Tables 1 and 2), and the application of ordinal-scale and continuous-scale scoring (Tables 1 and 2).

Table 3.

Scoring Strategies Tested in First (Nephrotic Syndrome Study Network Digital Pathology Scoring System [NDPSS] 1) and Second (NDPSS2) Modifications of the NDPSS

Purpose	Scoring Strategy
NDPSS1
To evaluate the agreement in descriptor scoring using all biopsy levels rather than a single image	All tuft cross sections for a given annotated glomerulus were reviewed and collectively used to generate a single descriptor score (Figure 3), ie, the presence of an individual or group of descriptors was recorded if it appeared in one or more tuft cross sections. Although this strategy is part of the NDPP and NDPSS, our previously published study tested agreement using individual JPEG images only. One of the 39 individual descriptors from the original NDPSS was split into 2 (segmental versus global) for NDPSS1, so NDPSS1 included 40 individual descriptors.
To test if grouping descriptors with common characteristics prior to scoring improves agreement	40 individual glomerular descriptors relevant to MCD, FSGS, and MN were organized into 5 classes and 12 subclasses (Tables 1 and 2; Figure 2). In contrast to the previously published study where grouping was performed after the scoring process, pathologists directly scored classes, subclasses, and individual descriptors in a hierarchical fashion. Each class or subclass was endorsed if any one of the component descriptors was present.
To identify the descriptors for which pathologists had some uncertainty and to test whether scoring on an ordinal scale would improve agreement	Pathologists indicated their confidence in scoring the presence of any given class, subclass, or individual descriptor as a probability (0 = no, 0.25 = probably not, 0.50 = maybe, 0.75 = probably yes, or 1 = yes).
To test whether scoring on a continuous measure improves agreement compared with a dichotomous approach	The percentage of the glomerular tuft involved (0%, 5%, 10%, 20%, …, 90%, 100%) was indicated for 8 classes or subclasses of descriptors (Tables 1 and 2).
NDPSS2
To test whether reproducibility was modulated by having pathologists focus on a single glomerular level at a time, independently from descriptors present in other levels	Biopsy sections/levels containing each annotated glomerulus were individually scored using separate columns in the NDPSS2 scoring matrix. These section/level-specific scores were later combined to obtain a glomerulus-specific score, such that presence in any section implies presence in the glomerulus (Figure 3).
To test whether reproducibility could be increased by grouping descriptors in different ways than previously done	Descriptors were reorganized into 6 classes and 16 subclasses (Tables 1 and 2; Figure 2). Only 7 individual descriptors were included in NDPSS2 for scoring.
To test whether scoring on a continuous measure improves agreement compared with a dichotomous approach	In 8 of 16 subclasses, the score was recorded as a percentage of the glomerular tuft involved (Tables 1 and 2). For the remaining 8 subclasses, 6 classes, and 7 individual descriptors, dichotomous metrics (ie, present versus absent) were used for scoring.
To test reproducibility of each pathologist’s subjective interpretation of the overall severity of damage in the biopsy	Pathologists were asked to indicate a gestalt overall damage score (from 1 = good prognosis to 5 = really bad prognosis). No cross-training was provided for this measure.
To evaluate whether removal of poor quality images or stratification by stain type affected reproducibility results	Pathologists indicated the stain type for each biopsy section and whether there were any images with poor quality.


Abbreviations: FSGS, focal segmental glomerulosclerosis; MCD, minimal change disease; MN, membranous nephropathy; NDPP, Nephrotic Syndrome Study Network Digital Pathology Protocol; NDPSS, Nephrotic Syndrome Study Network Digital Pathology Scoring System.
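The rule used to combine NDPSS2 section/level-specific scores (presence in any section implies presence in the glomerulus) amounts to a logical OR across levels. A minimal sketch follows, for dichotomous scores only and with hypothetical level numbers and scores; how percentage scores were combined across levels is not covered here.

```python
def combine_levels(level_scores):
    """Collapse level-specific dichotomous scores into one glomerulus score.

    level_scores: {level: {descriptor: bool}}
    Returns {descriptor: bool}, True if present in at least one level.
    """
    combined = {}
    for scores in level_scores.values():
        for descriptor, present in scores.items():
            combined[descriptor] = combined.get(descriptor, False) or bool(present)
    return combined


# Hypothetical scores for one annotated glomerulus seen on three levels.
levels = {
    2: {"Glomerular foam cells": True, "Segmental obliteration": False},
    5: {"Glomerular foam cells": True, "Segmental obliteration": True},
    10: {"Glomerular foam cells": False, "Segmental obliteration": False},
}
glomerulus_score = combine_levels(levels)
print(glomerulus_score)
```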

Figure 3. Multilevel representation of a single glomerulus showing different descriptors in different levels. A, Level 2, intraglomerular foam cells. B, Level 5, an example of segmental obliteration involving at least 75% of the glomerular tuft, with foam cells and segmental podocyte hypertrophy and hyperplasia. C, Level 7; here the segmental obliteration involves less than 50% of the glomerular tuft. Other descriptors present in this section are foam cells and segmental podocyte hypertrophy. D, Level 10, no/minimal changes. E, Level 11, no/minimal changes. F, Level 12, segmental mesangial proliferation (circled) (hematoxylin-eosin, original magnification ×40 [A through C and F]; trichrome, original magnification ×40 [D]; silver, original magnification ×40 [E]).

Pathologist Training

Three NEPTUNE pathologists received 2 hours of training using an online webinar to review the NDPSS1 scoring protocol and the corresponding electronic scoring matrix (Figure 1, B). Understandability of the NDPSS1 protocol was then tested by having each pathologist score 4 example glomeruli, which was then followed by an additional 2 hours of webinar discussion and cross-training to increase reproducibility.

Case Selection and Distribution

The NEPTUNE database contains cases previously scored using the original NDPSS. From these data, we identified glomeruli with high numbers of structural features present to maximize the information gained from each glomerulus. A total of 157 glomeruli from 60 FSGS/MCD and 2 MN cases were selected to test NDPSS1. Each case contributed between 1 and 5 glomeruli and had at least 1 WSI of a biopsy section stained with hematoxylin-eosin, periodic acid–Schiff, trichrome, or silver. Cases were randomly assigned to each of the 3 scoring pathologists such that each pathologist scored about 100 glomeruli, with overlap such that each glomerulus would have 2 sets of scores.
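One way to obtain this kind of overlapping assignment is to shuffle the cases and deal each one to a pair of pathologists, rotating through all possible pairs. This is only an illustrative sketch, not the procedure actually used in the study; the case identifiers, pathologist labels, and random seed are hypothetical. With 157 glomeruli each scored twice by 3 pathologists, the expected load is 157 × 2 / 3 ≈ 105 glomeruli per pathologist, consistent with the figure of about 100 given above.

```python
import random
from itertools import combinations, cycle


def assign_cases(case_ids, pathologists, seed=0):
    """Randomly assign each case to a pair of pathologists.

    Cases are shuffled, then dealt to pairs in rotation so that every
    case is scored by exactly 2 pathologists and case loads stay balanced.
    """
    rng = random.Random(seed)
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    pairs = cycle(combinations(pathologists, 2))
    return {case: next(pairs) for case in shuffled}


pathologists = ["P1", "P2", "P3"]               # hypothetical labels
cases = [f"case_{i:02d}" for i in range(62)]    # 60 FSGS/MCD + 2 MN cases
assignment = assign_cases(cases, pathologists)

# Number of cases each pathologist scores.
load = {p: sum(p in pair for pair in assignment.values()) for p in pathologists}
print(load)
```

Because the sketch balances cases rather than glomeruli, and each case contributes between 1 and 5 glomeruli, the resulting glomerulus counts per pathologist would only be approximately equal.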

The NDPSS2

Scoring Strategies

Based on initial reproducibility estimates using the Cohen κ statistic and pathologists’ feedback from the NDPSS1 study (see Results), a second set of scoring strategies, the NDPSS2, was implemented (Table 3) and was recorded on an electronic scoring matrix specifically designed for the NDPSS2 (Figure 1, C). The scoring strategies included scoring of individual biopsy sections/levels independently (Figure 3), different groupings of individual descriptors (Tables 1 and 2; Figure 2), continuous-scale scoring (Tables 1 and 2), a gestalt overall damage score, and indication of poor image quality and stain type.

Pathologist Training

Two additional NEPTUNE pathologists were added to the study, and all 5 pathologists collectively reviewed the results of the NDPSS1 data using case examples and discussed disagreements. The 5 pathologists received a 2-hour webinar training to review the NDPSS2 scoring protocol and the corresponding electronic scoring matrix (Figure 1, C). All pathologists participated in a practice round by scoring every level of 2 glomeruli to ensure understandability of the scoring protocol, followed by an additional 2 hours of cross-training to improve reproducibility.

Case Selection and Distribution

A total of 79 annotated glomeruli on WSIs from the same 60 FSGS/MCD and 2 MN NEPTUNE cases were scored. Each case contributed between 1 and 5 glomeruli. Cases were randomly assigned to each of the 5 scoring pathologists, with overlap such that each glomerulus would have at least 2 sets of scores to evaluate interpathologist reproducibility.