Abstract
Although still in its infancy, artificial intelligence (AI) analysis of kidney biopsy images is anticipated to become an integral aspect of renal histopathology. As these systems mature, the focus will understandably be on building ever more accurate models, but successful translation to the clinic will also depend on other characteristics of the system.
In the extreme, deployment of highly performant but “black box” AI is fraught with risk, and high-profile errors could damage future trust in the technology. Furthermore, a major factor determining whether new systems are adopted in clinical settings is whether they are “trusted” by clinicians. Key to unlocking trust will be designing platforms optimized for intuitive human-AI interactions and ensuring that, where judgment is required to resolve ambiguous areas of assessment, the workings of the AI image classifier are understandable to the human observer. Therefore, determining the optimal design for AI systems depends on factors beyond performance, with considerations of goals, interpretability, and safety constraining many design and engineering choices.
In this article, we explore challenges that arise in the application of AI to renal histopathology, and consider areas where choices around model architecture, training strategy, and workflow design may be influenced by factors beyond the final performance metrics of the system.
Keywords: arteriosclerosis, glomerulosclerosis, interstitial fibrosis, kidney biopsy, renal fibrosis, renal pathology, renal transplantation, transplant pathology, artificial intelligence, AI
Integrated Discussion
Broadly, there are two contrasting approaches to developing AI systems. One view is that intelligent machines ought to follow a set of rules that are accessible to human intuition (e.g., good old-fashioned AI). The other view considers human understanding of AI processes unnecessary,1 and perhaps even counterproductive, because it limits the full potential of AI to discover complex, high-order relationships in data (e.g., deep learning).2 For visual recognition tasks, humans naturally blend semantic and perceptual knowledge to recognize objects across variations in object location, perspective, and magnification.3,4 Attempts to recreate these capacities in computers using rule-based approaches alone have not been particularly successful, whereas learning-based approaches have shown great recent potential.2,5 AI models “learn” to recognize objects by adapting their network weights and biases to fit statistical patterns in data. This approach is flexible, and with modifications AI models can perform many of the tasks required for complex image analysis, including the recognition of histopathologic features in animal and human whole slide images.6–11
The performance of contemporary AI systems appears to have settled the question in favor of data-driven approaches that sideline human oversight. However, because the inner workings of these systems are poorly understood (effectively a black box), selection between competing models is limited to a comparison of some aspect of their output, typically their accuracy. In this regard, there are numerous examples that illustrate the tendency of AI to underperform, or exhibit unwanted behaviors, when learning from a limited research dataset is translated to the wider world, where models are exposed to “out-of-distribution” examples. This phenomenon can result from trivial changes to data sources or image preparation, or from a well-designed adversarial attack.12,13 It is tempting to believe these errors will be resolved by better versions in the future, yet cautious observers would note that even highly evolved mammalian visual systems are still easily fooled by low-tech visual illusions, and it may be that real-world AI can never be finally “solved.” If so, implementing black box systems in areas of social importance is fraught with risk; high-profile errors could damage trust in the technology, particularly if clinicians are provided with AI-generated predictions or assessments for which they assume responsibility in patient care, but which they cannot fully interrogate.
In this article, we take the example of automated renal pathology assessment to explore the challenges that arise in the application of AI to clinical problems, highlighting areas where choices around model architecture, training strategy, and workflow design may be influenced by factors beyond performance.
Goals Determine Which Capabilities To Prioritize
Although extensive deployment of AI systems in renal histopathology is anticipated,14,15 the choice of capabilities will be governed by the clinical expectations for any given system. One area where AI may be particularly useful is in characterizing chronic injury in deceased donor kidneys that have been retrieved for transplantation. It is now widespread practice (up to 80% of retrieved kidneys in some US states)16 to perform urgent preimplantation biopsy analysis of kidneys from elderly, “expanded criteria” deceased donors, in order to identify those kidneys that have no, or minimal, chronic injury, and that are therefore suitable for transplantation.14,17–23 However, time and cost constraints, allied to a relative scarcity of specialist expertise, limit evaluation of preimplantation biopsies to a single slide. There are concerns that this approach may be inadequate: small biopsies overestimate renal injury24; only moderate agreement is achieved when comparing biopsies with whole kidney assessments25; biopsy of the contralateral kidney can offer conflicting results26–29; and even expert renal pathologists do not perfectly agree.30 Limited evaluation of a single biopsy slide may therefore not necessarily reflect the overall “quality” of a kidney, nor its suitability for transplantation.
Attempts to improve the reliability and reproducibility of biopsy assessment may require an increase in the amount of renal tissue analyzed (e.g., multiple biopsies, multilevel/whole biopsy assessment). Because they can perform intelligent work rapidly, AI systems could theoretically process much larger tissue volumes in a set time. In this way alone, AI assessment of preimplantation renal biopsies could transform clinical practice. In support, Marsh and colleagues have shown that pooling tissue sections reduced the effect of variability on assessments of chronic injury for both model and human assessments. Importantly, they estimated that pooled analysis of glomerulosclerosis would hypothetically have reduced kidney discard rates from 14% to <2%.31 Human assessment of multiple slides (>10 slides) from a preimplantation biopsy is impractical: it is time-consuming and laborious; often occurs at unsociable hours; and risks introducing delays to a time-critical process. However, multilevel or multislide assessment would be comparatively straightforward to implement as part of a workflow that used AI to assist the rapid assessment of multiple samples in parallel, with human oversight to identify and address errors. In this case, marginal improvements in accuracy that depend on significant increases in computational effort, and consequently analytical time, would not necessarily be advantageous; less accurate but more rapid assessment methods could be preferred if they facilitate bulk assessment of renal biopsies.
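The statistical effect of pooling can be illustrated with a short simulation. The numbers below are invented for illustration and are not taken from Marsh and colleagues; the sketch only shows how the spread of a glomerulosclerosis estimate narrows as sections are pooled.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_GS = 0.15          # assumed true fraction of sclerosed glomeruli in the kidney
GLOMS_PER_SECTION = 20  # assumed glomeruli captured per tissue section

def estimate_gs(n_sections: int, n_trials: int = 10_000) -> np.ndarray:
    """Simulate %GS estimates when n_sections are pooled into one assessment."""
    n = n_sections * GLOMS_PER_SECTION
    sclerosed = rng.binomial(n, TRUE_GS, size=n_trials)
    return sclerosed / n

for sections in (1, 3, 10):
    est = estimate_gs(sections)
    print(f"{sections:>2} section(s): mean={est.mean():.3f}, sd={est.std():.3f}")
```

Under these toy assumptions, the mean estimate is unchanged but its standard deviation shrinks as more sections contribute glomeruli, which is the property that pooled assessment exploits.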
Other applications, such as the detection of more complex processes that could be missed during human assessment (e.g., transplant glomerulitis, transplant glomerulopathy,32 segmental scars), require a focus on sensitivity. For these applications, all potentially abnormal areas ought to be flagged for additional review by a histopathologist, thereby ensuring detection of subtle lesions that have clinical significance. In this context, sensitivity might be prioritized at the expense of precision (positive predictive value), noting that it is better to highlight an area for review that a pathologist might not have seen than to miss a detection. However, the optimum balance between these characteristics may be further influenced by user factors (such as an individual's intolerance of excess flagging of essentially normal areas) and the prevalence of these objects in the image (e.g., common objects require high precision to avoid excess flagging).
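These operating points can be reasoned about numerically. Below is a minimal sketch computing sensitivity and precision from detection counts at two confidence thresholds; the counts are hypothetical and chosen only to illustrate the trade-off.

```python
def sensitivity_precision(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (sensitivity, precision) from detection counts."""
    sensitivity = tp / (tp + fn)   # fraction of true lesions flagged
    precision = tp / (tp + fp)     # fraction of flags that are true lesions
    return sensitivity, precision

# Illustrative counts for a lesion detector: a lower confidence threshold
# flags more regions, catching more lesions but adding false alarms that
# the pathologist must dismiss.
for label, (tp, fp, fn) in {"low threshold": (48, 30, 2),
                            "high threshold": (40, 5, 10)}.items():
    sens, prec = sensitivity_precision(tp, fp, fn)
    print(f"{label}: sensitivity={sens:.2f}, precision={prec:.2f}")
```

In a screening role, the low-threshold operating point may be preferable despite its extra false positives, provided users tolerate the additional flags.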
Addressing AI Misbehavior
The fact that AI models learn exclusively from data is both a strength (supporting general applicability) and a source of weakness. These models are unconstrained in their approach to analyzing data,33 and their behavior may parallel Clever Hans,34,35 the horse that appeared to learn to “count” by recognizing expectation in human faces.36 AI models have similarly learned to “detect pneumonia” on chest x-rays by reading the source label and finding other unintended cues.37 Researchers were likewise disappointed to find that their CycleGAN model, a “master of steganography,” had solved an image-to-image translation task simply by hiding a copy of the original within its translated version.38 Rather than solving the hard problem, it simply recovered the precursor from these imperceptible changes. One might wish to dismiss these examples as curiosities in the context of a rapidly evolving technology, but if such behaviors were replicated in clinical applications they might profoundly undermine the technology's perceived safety. Additionally, there is growing recognition that AI systems may, by learning associations between immutable characteristics (e.g., race or sex) and poor outcomes, entrench disparities in decision-making processes (AI bias).15,39,40 The transition of AI tools to clinical use therefore depends not only on competence (getting the “right” answer) but also on a continual process of assurance that the finished system is both capable and trustworthy. Systems that provide observers with insights into decision making may be more readily accepted, even if these features or designs do not enhance performance.
Much of image recognition relates to three basic tasks: classification (determining the type of object)41; localization (determining where an object is)42; and segmentation (defining its boundaries).43 Although the underlying process of statistical learning is shared, different types of AI models place a different emphasis on each of these tasks, producing varied outputs (Figure 1). Detecting, and therefore correcting, misbehavior may be less straightforward if models solve tasks nonintuitively. Assessing the extent of global glomerulosclerosis is a common target for automation in renal pathology, and several excellent tools have used multiclass segmentation models for this purpose. Segmentation models produce a vector map of the image and assign each pixel in the map to a class (e.g., glomerulus, cortex, background), yielding impressive feature maps. Accuracy is assessed by overlaying “ground truth” annotations on the model's prediction and measuring the extent of pixel overlap. Marsh and colleagues reported that their segmentation model demonstrated high concordance with ground truth annotations, performing better than pathologists.31 Nevertheless, most organ assessments require identification and quantification of objects, not pixels. Segmentation maps achieve counts indirectly, either via a post hoc processing step (e.g., blob detection) or by human counting (defeating the purpose). Critically, the subsequent human-AI interaction with a segmentation map is binary, limited to acceptance or rejection of the prediction, with the latter necessitating side-by-side reassessment of the slide (hard disagreement). Architectures that instead allow the machine to share the concept of an object and display bounding boxes would allow a supervisor to take a gestalt view and attend only to missed or miscategorized objects (soft disagreement). These models may be preferred for use in clinical assessment because of the straightforward interaction they offer, even if there were a small performance penalty (Figure 2).44
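Pixel-overlap concordance of this kind is typically summarized with metrics such as the Dice coefficient or intersection-over-union (IoU). A minimal sketch with toy masks and illustrative values:

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, truth: np.ndarray) -> tuple[float, float]:
    """Pixel-overlap agreement between a predicted and a ground truth mask."""
    inter = np.logical_and(pred, truth).sum()
    dice = 2 * inter / (pred.sum() + truth.sum())
    iou = inter / np.logical_or(pred, truth).sum()
    return dice, iou

# Toy 10x10 binary masks: the prediction is shifted one pixel diagonally.
truth = np.zeros((10, 10), dtype=bool); truth[2:7, 2:7] = True
pred = np.zeros((10, 10), dtype=bool);  pred[3:8, 3:8] = True
print("Dice=%.2f, IoU=%.2f" % dice_and_iou(pred, truth))
```

Note that these metrics score pixels, not objects: a map can achieve a high Dice score while still miscounting the glomeruli it depicts.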
Figure 1.
Common image recognition methods. Classification models (left) apply a label to the entire image. Semantic segmentation models (center left) assign a label to each pixel, but do not differentiate instances of objects. Object detection models (center right) predict object locations with imprecise boundaries, whereas instance segmentation models (right) combine these approaches.
Figure 2.
Example of a renal whole slide image with AI model predictions demonstrating typical errors. Green indicates sclerosed glomeruli, blue indicates nonsclerosed glomeruli. Original image (left). Object detection prediction (middle): the object detection model is unsure and predicts two bounding boxes for the same object in the upper left aspect of the image (gray pointer) and fails to detect a nonsclerosed glomerulus at the upper right border (red pointer). Semantic segmentation prediction (right): the model makes an equivalent error in the uppermost part of the image (red pointer). Because individual objects are not represented by the model, detected areas are either manually reassessed with close reference to the original (hard disagreement), or automated methods are used to count the number of clustered objects within the map (blob detection). Accurate counting is complicated by additional error types (e.g., two closely located objects share borders [blue pointer]) and by the scattering of misclassified pixels that arise throughout the image (green pointer). Human-in-the-loop workflows may prioritize outputs that offer the most intuitive human-AI interaction, whereas research or anatomic tools may prefer models that offer the highest pixel concordance with ground truth annotations.
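To illustrate the post hoc counting step, the sketch below applies connected-component labeling (a simple form of blob detection) to a toy segmentation mask. It reproduces two of the failure modes shown in Figure 2: stray misclassified pixels can be filtered by size, but touching objects merge into a single blob and are undercounted.

```python
import numpy as np
from scipy import ndimage

# Toy semantic segmentation mask: 0 = background, 1 = "glomerulus" pixels.
mask = np.zeros((12, 12), dtype=int)
mask[1:4, 1:4] = 1      # one glomerulus
mask[1:4, 4:7] = 1      # a touching neighbour: merges into the first blob
mask[8:11, 8:11] = 1    # a separate glomerulus
mask[6, 0] = 1          # a stray misclassified pixel

labels, n_blobs = ndimage.label(mask)
print("raw blob count:", n_blobs)  # 3: the two touching objects count as one

# Filtering tiny blobs removes scattered misclassified pixels...
sizes = ndimage.sum_labels(mask, labels, index=range(1, n_blobs + 1))
print("count after size filter:", int((sizes >= 4).sum()))
# ...but no size filter can split the two merged glomeruli back apart.
```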
Training Sets and the Importance of Data Curation
Even common disease states such as glomerulosclerosis may affect as few as 10% of the glomeruli in training sets.45 In the presence of such unbalanced class labels, lazy strategies (e.g., “always guess nonsclerosed glomerulus”) may yield high accuracy (i.e., 90%) and are readily settled upon. Specific actions may combat this, such as data augmentation, in which images depicting the underrepresented feature are transformed (e.g., flipped, recolored, rotated) to boost their numbers and provide “new” training examples. The concept of creating new images for training can be taken further using generative adversarial networks, transformers, or graph representation learning, which can create synthetic images for various purposes, such as virtual staining (taking, for example, a hematoxylin-and-eosin–stained image and outputting a periodic acid–Schiff–stained equivalent), data sharing, and education.46–49 These strategies, while useful, imperfectly approximate new data, and augmented samples retain deep statistical consistencies with the originals. Their use may therefore increase the risk of “overfitting,” whereby performance degrades rapidly upon exposure to images from outside the training set.50 The risks of lazy AI or overfitting are ever present during model training and are closely monitored by engineers, using predefined test images that are withheld from the model during its development. Although this is standard practice for assessing performance, these datasets require careful curation to ensure that they are not contaminated with related images (e.g., a different level from the same biopsy, or a similar level with a different stain) and that objects are balanced and representative across datasets. An even more stringent assessment is to use unrelated images from an external institution.
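One common safeguard against such contamination is to split at the level of the biopsy rather than the image, so that all levels and stains of one core fall on the same side of the train/test boundary. A minimal sketch using scikit-learn's GroupShuffleSplit, with hypothetical file names and biopsy identifiers:

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical image records: several levels/stains per biopsy.
images = [
    {"file": "b1_level1_PAS.png", "biopsy": "b1"},
    {"file": "b1_level2_PAS.png", "biopsy": "b1"},
    {"file": "b1_level1_HE.png",  "biopsy": "b1"},
    {"file": "b2_level1_PAS.png", "biopsy": "b2"},
    {"file": "b3_level1_PAS.png", "biopsy": "b3"},
    {"file": "b3_level2_HE.png",  "biopsy": "b3"},
]
groups = [img["biopsy"] for img in images]

# Splitting on the biopsy ID guarantees that different levels or stains of
# the same core never straddle the train/test boundary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(images, groups=groups))
print("train:", [images[i]["file"] for i in train_idx])
print("test: ", [images[i]["file"] for i in test_idx])
```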
Accurate image labeling is clearly critical for effective AI model learning. Interstitial fibrosis and tubular atrophy (IFTA) are, however, complex features to recognize and label. Human estimations of IFTA are highly variable29 because our brains are poorly adapted to numerical estimations of complex areas. Moreover, IFTA is often patchy, its borders indistinct, and co-existent pathologies (e.g., edema, acute tubular injury) may mimic the changes seen in IFTA. Tubular atrophy also varies in severity. One challenge in training models to detect IFTA relates to the context-dependent nature of recognition. Pathologists will typically scan slides at low magnification to appreciate context, but AI labeling and training are performed at the individual pixel level. Labeling images for training is therefore difficult, as perfect (pixel-by-pixel) annotation may be beyond reasonable human effort for large numbers of images, and consequently labeled areas may contain contradictory training pressures (i.e., some pixels within an area labeled “IFTA” will be histologically normal). This reduces training efficacy and complicates interpretation of performance metrics. The idiosyncrasies of labeled data, either due to error or due to bias from individual labelers, raise a more general issue. For published research, it can be unclear whether the performance reported by researchers relates to their model architecture and training strategy or is specific to their labeled training and/or test sets.
Gray Areas and Computing Judgments
A core premise of AI training is that objects can be reliably categorized by an expert and provided to the model as labeled images. However, reliable categorization can be problematic for some elements of slide assessment, and the determination of arteriosclerosis in particular poses a number of challenges. To begin with, recognition of arterial objects is complicated by their resemblance to other cortical structures such as veins and tubules. Furthermore, even when objects are successfully recognized, the distinction between those labeled as arteries and those determined to be arterioles is essentially arbitrary. Pathologic distinctions, such as the presence of an internal elastic lamina or three layers of medial smooth muscle, rely on extraneous concepts and complicate machine recognition. Classification is only one area of difficulty. Arteriosclerosis is recognized by a narrowing of the vessel lumen with respect to the wall, but an apparently narrow lumen may arise not from disease but from oblique sectioning (Figure 3). This is a critical issue because it means that objects of interest (arteries with severe disease) and objects that ought to be discarded (obliquely sectioned arteries) can appear similar.
Figure 3.
The same artery is sectioned in two different planes according to its orientation when the cut is made. This results in an oval-shaped object with a narrow lumen (gray) and a patent object (blue). These objects are likely to attain different chronic injury scores, and automated systems ought to prefer the blue object for scoring.
If multiple arteries are present in an image, automated assessment tools require a method to choose which artery best represents the degree of arteriosclerosis. Selecting the artery with the smallest measurable lumen, assuming that this is the most affected, risks oversampling poorly sectioned objects. Conversely, there is no minimum lumen size that might be used as a flag for a sectioning issue, as completely occluded arteries can also arise from disease. Alternative strategies, such as averaging the “lumen:wall” ratio over all detected objects, risk underestimating the extent of hypertensive injury, which may be pronounced in only one or two arteries. Furthermore, this strategy worsens the bias resulting from the inclusion of arterioles wrongly classified as arteries; arterioles are less likely to be affected by hypertension and their inclusion may therefore underestimate the severity of injury.
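The trade-off between these two selection strategies can be made concrete with a toy calculation over hypothetical vessel measurements; the field names, flags, and values below are illustrative only, not drawn from any published tool.

```python
# Hypothetical detected vessels: lumen:wall ratio plus flags that an
# upstream step might supply (all values invented for illustration).
vessels = [
    {"id": "a1", "lumen_wall": 0.60, "oblique": False, "arteriole": False},
    {"id": "a2", "lumen_wall": 0.15, "oblique": False, "arteriole": False},  # true severe disease
    {"id": "a3", "lumen_wall": 0.10, "oblique": True,  "arteriole": False},  # sectioning artefact
    {"id": "a4", "lumen_wall": 0.70, "oblique": False, "arteriole": True},   # misclassified arteriole
]

# Strategy 1: smallest lumen -> oversamples the obliquely sectioned object.
worst = min(vessels, key=lambda v: v["lumen_wall"])
print("smallest-lumen pick:", worst["id"])  # a3, an artefact

# Strategy 2: averaging -> the healthy arteriole dilutes the severe lesion.
mean_ratio = sum(v["lumen_wall"] for v in vessels) / len(vessels)
print(f"mean lumen:wall ratio: {mean_ratio:.2f}")  # masks a2's 0.15
```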
Testament to the challenge of resolving these ambiguities is the lack of any published automated solutions for arteriosclerosis, whereas multiple teams have presented AI systems capable of assessing other chronic injury features. On the one hand, the complex and contingent nature of arterial selection and scoring makes it a tempting candidate to offload onto the learning capabilities of an AI system, relying on its ability to extract complex associations from data. With this approach, the researcher would label “suitable” and “unsuitable” arteries separately and trust the learning process to decipher these complex distinctions implicitly during training. Unfortunately, ambiguity is not resolved by such labeling; if pathologists cannot reliably distinguish between suitable and unsuitable arteries, such that multiple assessors label similar arterial objects differently, the ability of the model to learn is severely compromised.51 To improve the reliability of labeling, researchers could instead try to create more uniform groups by increasing the number of classes. Instead of a basket category for unsuitable arteries, one might choose to assign objects to groups for each reason the artery is deemed unsuitable (e.g., those with multiple lumens, absent lumens, abnormal aspect ratios). This approach improves the homogeneity of each group, but fragments the dataset, potentially leaving the training process underpowered, with insufficient examples in each group to support robust learning across all categories. Given these difficulties, one could consider the use of handwritten rules to define features of suitability. Unsuitable arteries would be expected to have characteristic profiles that differ from a well-sectioned artery in the typical transverse view (e.g., measurable versus no measurable lumen). The presence of unsuitable features could be used to assign a negative preference, and therefore an ordering of objects from most to least suitable for assessment. A complete solution of this kind, however, ultimately requires one to provide a complete description of the visual characteristics that define a well-sectioned artery, and this is precisely the complex task that the AI was intended to solve.
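As a concrete illustration of such a rule-based approach, the sketch below encodes a few hypothetical handwritten suitability rules as penalties and orders candidate arteries from most to least suitable. The rules, thresholds, and weights are invented for illustration; a real system would need pathologist-derived criteria.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    has_measurable_lumen: bool
    n_lumens: int
    aspect_ratio: float  # major/minor axis of the vessel outline

# Hypothetical handwritten rules: each violated rule adds a penalty, and
# objects are then ordered from most to least suitable for scoring.
RULES = [
    (lambda c: not c.has_measurable_lumen, 3.0),  # occluded or badly cut
    (lambda c: c.n_lumens != 1,            2.0),  # branching / multiple lumens
    (lambda c: c.aspect_ratio > 2.0,       1.0),  # elongated: likely oblique cut
]

def penalty(c: Candidate) -> float:
    return sum(weight for rule, weight in RULES if rule(c))

arteries = [
    Candidate("a1", True,  1, 1.2),
    Candidate("a2", False, 1, 1.1),
    Candidate("a3", True,  1, 3.5),
]
for c in sorted(arteries, key=penalty):
    print(c.name, "penalty =", penalty(c))
```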
To resolve this, future implementations could seek to blend data-driven and rule-based methods, aiming to offer the performance of state-of-the-art deep learning models alongside “explainers” that provide human-understandable insights into decision making.52–54 Random forests, a classic approach in machine learning, are an attractive choice to bridge these approaches. Random forests use decision trees as a building block; they combine observations of data with rules for decision making (e.g., if no measurable lumen, classify the artery as unsuitable). Simple rules (nodes) can be combined into vast networks (forests) capable of fine discrimination, but these models rarely match the performance of the most capable deep learning algorithms. To address this, boosting algorithms can advance the performance of random forest–based models by using them to approximate the workings of highly performant neural networks.54 Post hoc, researchers aim to extract the nodes from the forest that are most important to its performance and use these key rules to provide explanations of “why” a particular choice was made.55 Alternatively, key nodes could be used to select “concepts” that are provided to a neural network model during its training. In this context, rather than the standard approach of asking a neural network to directly predict an object classification from an image, so-called “concept bottleneck models” include an intermediate concept layer that represents higher-order attributes of the image (e.g., no measurable lumen).56 The combination of concepts that an image evokes (e.g., no measurable lumen + side-on view +…+ transverse section) is used as the basis for making a prediction (e.g., unsuitable) rather than the image itself. The benefit of this approach is that the activated concepts can be presented to users as the “reason” for a decision made by a concept bottleneck model, providing a means to understand the otherwise black box model and to intervene on the decision-making process, a critical feature for soft disagreement and human-AI collaboration. Recent work has shown that the combination of logic, attention, and concepts can improve the explainability of deep learning networks,53 and progress on cell graph segmentation may suggest an opportunity to enhance the predictive accuracy of classification systems used for clinical prognostication.57
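To make the bottleneck idea concrete, the following is a minimal sketch of a concept bottleneck classifier in PyTorch. The encoder, concept names, and suitability head are hypothetical placeholders (an untrained toy, not a published architecture); in practice the concept layer would be supervised with pathologist-provided concept labels alongside the task label, a step omitted here.

```python
import torch
import torch.nn as nn

CONCEPTS = ["no_measurable_lumen", "multiple_lumens", "oblique_profile"]

class ConceptBottleneck(nn.Module):
    """Image -> interpretable concept scores -> suitability prediction."""
    def __init__(self, n_concepts: int = len(CONCEPTS)):
        super().__init__()
        # Stand-in image encoder; a real system would use a CNN backbone.
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 32),
                                     nn.ReLU(), nn.Linear(32, n_concepts))
        # The final prediction sees ONLY the concepts, not the raw image.
        self.head = nn.Linear(n_concepts, 1)

    def forward(self, x):
        concepts = torch.sigmoid(self.encoder(x))  # per-concept probabilities
        suitability = torch.sigmoid(self.head(concepts))
        return concepts, suitability

model = ConceptBottleneck()  # untrained: outputs are illustrative only
concepts, suitable = model(torch.randn(1, 64, 64))
for name, score in zip(CONCEPTS, concepts[0].tolist()):
    print(f"{name}: {score:.2f}")   # the "reasons" shown to the user
print(f"suitable for scoring: {suitable.item():.2f}")
```

Because the prediction is forced through the concept layer, a supervisor can inspect, and in principle correct, the activated concepts before accepting the final call, which is the interaction that supports soft disagreement.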
Whichever approach is taken, the choice of which artery to score, and how, may ultimately remain ambiguous, and assessors could reasonably disagree. This has implications for determining the performance of the model against a perceived “gold standard.” Although this problem is not limited to AI assessments,22 AI models are expected to work well for all users, and designs that provide the highest quality insight into the decision-making process may therefore be favored. These should allow a human overseer to intuitively grasp and interact with model output. Engineers building for this kind of task may find solutions emanating from research in “explainable AI”; alternatively, they may consider limiting AI processes to more objective areas (e.g., object detection) and leaving preference or suitability assessments entirely to a human overseer.
Conclusion
AI assessment tools are anticipated to provide substantial benefits to the delivery of renal histopathologic expertise and may help improve the reliability and accuracy of renal digital slide assessments. However, the black box approaches many currently adopt allow the machine to consider tasks in fundamentally different ways from the human assessor. They may also entrust decision making to implicit and uninterpretable processes inside the neural network. Perfectly accurate and perfectly intelligible systems are not yet a reality, but they are not required; clinically useful tools are likely to blend islands of AI automation with rule-based workflows that improve the interpretability of the decision-making process and allow effective human supervision.58 Determining the optimal design for AI systems depends on factors beyond performance, with considerations of interpretability and the nature of the intended human-AI interaction also informing many design and engineering choices. To effectively appraise research and new market tools, clinicians need an awareness of the common problems that arise in AI training and deployment. In parallel, teams developing AI image classifiers need continual dialogue with clinicians, centered on the trade-offs that their chosen designs entail, to secure trust and facilitate a smooth translation to the clinic.
Disclosures
G. Pettigrew and J. Ayorinde report research funding from Addenbrooke's Charitable Trust (ACT) and the Medical Research Council. V. Bardsley reports consultancy fees from Pathognomics Ltd. F. Citterio is employed by SAS Institute. T. Islam is employed by Google Cloud and SAS Institute and reports ownership interest in Alphabet. E. Peruzzo is employed by SAS Institute and the University of Trento. S. Tilley is employed by SAS Institute. M. Landrò is employed by SAS Institute. G. Taylor is employed by SAS. P. Liò reports consultancy for a GSK advisory panel in 2021; research funding from AstraZeneca (one PhD fellowship) and GSK (one PhD fellowship); honoraria from GSK for consulting on an AI panel; and an advisory or leadership role as a member of an AI advisory panel for GSK (15 hours in 2021). All remaining authors have nothing to disclose.
Funding
This work was funded by awards from the Medical Research Council (Confidence in Concept A094757) and Addenbrooke’s Charitable Trust in the UK (900189).
Acknowledgments
We thank SAS Institute for in-kind support for the clinician/data scientist collaboration through their Data for Good funding mechanism, and we thank Larry Orimoloye for his involvement in the early stages of the project. This work would not have taken place without the determined support of the Office for Translational Research, University of Cambridge.
This study was also supported by the National Institute for Health and Care Research (NIHR) Blood and Transplant Research Unit in Organ Donation and Transplantation (NIHR203332), a partnership between NHS Blood and Transplant, University of Cambridge and Newcastle University. The views expressed are those of the author(s) and not necessarily those of the NIHR, NHS Blood and Transplant, or the Department of Health and Social Care.
Footnotes
Published online ahead of print. Publication date available at www.jasn.org.
Author Contributions
J. Ayorinde, T. Islam, G. Pettigrew, and A. Samoshkin conceptualized the study; J. Ayorinde, M. Landrò, and E. Peruzzo were responsible for data curation; J. Ayorinde, F. Citterio, M. Landrò, and E. Peruzzo were responsible for formal analysis; J. Ayorinde, G. Pettigrew, A. Samoshkin, and S. Tilley were responsible for funding acquisition; J. Ayorinde, V. Bardsley, F. Citterio, E. Peruzzo, and G. Pettigrew were responsible for investigation; J. Ayorinde, F. Citterio, T. Islam, and G. Pettigrew were responsible for methodology; J. Ayorinde, F. Citterio, G. Pettigrew, A. Samoshkin, G. Taylor, and S. Tilley were responsible for project administration; J. Ayorinde, F. Citterio, T. Islam, M. Landrò, E. Peruzzo, and G. Taylor were responsible for software; J. Ayorinde was responsible for validation and wrote the original draft; J. Ayorinde, M. Landrò, and G. Pettigrew were responsible for visualization; V. Bardsley, G. Pettigrew, A. Samoshkin, G. Taylor, and S. Tilley were responsible for supervision; A. Samoshkin and G. Taylor were responsible for resources; and J. Ayorinde, V. Bardsley, F. Citterio, T. Islam, M. Landrò, P. Liò, E. Peruzzo, and G. Pettigrew reviewed and edited the manuscript.
References
- 1. Frankish K, Ramsey W, editors: The Cambridge Handbook of Artificial Intelligence, Cambridge, MA, Cambridge University Press, 2014
- 2. LeCun Y, Bengio Y, Hinton G: Deep learning. Nature 521: 436–444, 2015
- 3. Puri AM, Wojciulik E: Expectation both helps and hinders object perception. Vision Res 48: 589–597, 2008
- 4. Teufel C, Fletcher PC: Forms of prediction in the nervous system. Nat Rev Neurosci 21: 231–242, 2020
- 5. Djuric U, Zadeh G, Aldape K, Diamandis P: Precision histology: How deep learning is poised to revitalize histomorphology for personalized cancer care. NPJ Precis Oncol 1: 22, 2017
- 6. Kolachalama VB, Singh P, Lin CQ, Mun D, Belghasem ME, Henderson JM, et al.: Association of pathological fibrosis with renal survival using deep neural networks. Kidney Int Rep 3: 464–475, 2018
- 7. Marsh JN, Matlock MK, Kudose S, Liu TC, Stappenbeck TS, Gaut JP, et al.: Deep learning global glomerulosclerosis in transplant kidney frozen sections. IEEE Trans Med Imaging 37: 2718–2728, 2018
- 8. Abdeltawab H, Shehata M, Shalaby A, Khalifa F, Mahmoud A, El-Ghar MA, et al.: A novel CNN-based CAD system for early assessment of transplanted kidney dysfunction. Sci Rep 9: 5948, 2019
- 9. Ginley B, Jen K-Y, Rosenberg A, Yen F, Jain S, Fogo A, et al.: Neural network segmentation of interstitial fibrosis, tubular atrophy, and glomerulosclerosis in renal biopsies. 2020
- 10. Kannan S, Morgan LA, Liang B, Cheung MG, Lin CQ, Mun D, et al.: Segmentation of glomeruli within trichrome images using deep learning. Kidney Int Rep 4: 955–962, 2019
- 11. Gadermayr M, Dombrowski A-K, Klinkhammer BM, Boor P, Merhof D: CNN cascades for segmenting sparse objects in gigapixel whole slide images. Comput Med Imaging Graph 71: 40–48, 2019
- 12. Finlayson SG, Bowers JD, Ito J, Zittrain JL, Beam AL, Kohane IS: Adversarial attacks on medical machine learning. Science 363: 1287–1289, 2019
- 13. Buckner C: Understanding adversarial examples requires a theory of artefacts for deep learning. Nat Mach Intell 2: 731–736, 2020
- 14. Ayorinde JOO, Summers DM, Pankhurst L, Laing E, Deary AJ, Hemming K, et al.: PreImplantation Trial of Histopathology In renal Allografts (PITHIA): A stepped-wedge cluster randomised controlled trial protocol. BMJ Open 9: e026166, 2019
- 15. Ayorinde JOO, Saeb-Parsy K, Hossain A: Opportunities and challenges in using social media in organ donation. JAMA Surg 155: 797–798, 2020
- 16. Lentine KL, Naik AS, Schnitzler MA, Randall H, Wellen JR, Kasiske BL, et al.: Variation in use of procurement biopsies and its implications for discard of deceased donor kidneys recovered for transplantation. Am J Transplant 19: 2241–2251, 2019
- 17. Remuzzi G, Cravedi P, Perna A, Dimitrov BD, Turturro M, Locatelli G, et al.; Dual Kidney Transplant Group: Long-term outcome of renal transplantation from older donors. N Engl J Med 354: 343–352, 2006
- 18. Aubert O, Kamar N, Vernerey D, Viglietti D, Martinez F, Duong-Van-Huyen JP, et al.: Long term outcomes of transplantation using kidneys from expanded criteria donors: Prospective, population based cohort study. BMJ 351: h3557, 2015
- 19. Remuzzi G, Grinyò J, Ruggenenti P, Beatini M, Cole EH, Milford EL, et al.; Double Kidney Transplant Group (DKG): Early experience with dual kidney transplantation in adults using expanded donor criteria. J Am Soc Nephrol 10: 2591–2598, 1999
- 20. Kosmoliaptsis V, Salji M, Bardsley V, Chen Y, Thiru S, Griffiths MH, et al.: Baseline donor chronic renal injury confers the same transplant survival disadvantage for DCD and DBD kidneys. Am J Transplant 15: 754–763, 2015
- 21. Summers DM, Watson CJ, Pettigrew GJ, Johnson RJ, Collett D, Neuberger JM, et al.: Kidney donation after circulatory death (DCD): State of the art. Kidney Int 88: 241–249, 2015
- 22. Ayorinde JOO, Hamed M, Goh MA, Summers DM, Dare A, Chen Y, et al.: Development of an objective, standardized tool for surgical assessment of deceased donor kidneys: The Cambridge Kidney Assessment Tool. Clin Transplant 34: e13782, 2020
- 23. Dare AJ, Pettigrew GJ, Saeb-Parsy K: Preoperative assessment of the deceased-donor kidney: From macroscopic appearance to molecular biomarkers. Transplantation 97: 797–807, 2014
- 24. Muruve NA, Steinbecker KM, Luger AM: Are wedge biopsies of cadaveric kidneys obtained at procurement reliable? Transplantation 69: 2384–2388, 2000
- 25. Mazzucco G, Magnani C, Fortunato M, Todesco A, Monga G: The reliability of pre-transplant donor renal biopsies (PTDB) in predicting the kidney state. A comparative single-centre study on 154 untransplanted kidneys. Nephrol Dial Transplant 25: 3401–3408, 2010
- 26. Husain SA, Chiles MC, Lee S, Pastan SO, Patzer RE, Tanriover B, et al.: Characteristics and performance of unilateral kidney transplants from deceased donors. Clin J Am Soc Nephrol 13: 118–127, 2018
- 27. Wang HJ, Kjellstrand CM, Cockfield SM, Solez K: On the influence of sample size on the prognostic accuracy and reproducibility of renal transplant biopsy. Nephrol Dial Transplant 13: 165–172, 1998
- 28. Snoeijs MGJ, Boonstra LA, Buurman WA, Goldschmeding R, van Suylen RJ, van Heurn LW, et al.: Histological assessment of pre-transplant kidney biopsies is reproducible and representative. Histopathology 56: 198–202, 2010
- 29. Furness PN, Taub N, Assmann KJ, Banfi G, Cosyns JP, Dorman AM, et al.: International variation in histologic grading is large, and persistent feedback does not improve reproducibility. Am J Surg Pathol 27: 805–810, 2003
- 30. Liapis H, Gaut JP, Klein C, Bagnasco S, Kraus E, Farris AB 3rd, et al.; Banff Working Group: Banff histopathological consensus criteria for preimplantation kidney biopsies. Am J Transplant 17: 140–150, 2017
- 31. Marsh JN, Liu T-C, Wilson PC, Swamidass SJ, Gaut JP: Development and validation of a deep learning model to quantify glomerulosclerosis in kidney biopsy specimens. JAMA Netw Open 4: e2030939, 2021
- 32. Aubert O, Higgins S, Bouatou Y, Yoo D, Raynaud M, Viglietti D, et al.: Archetype analysis identifies distinct profiles in renal transplant recipients with transplant glomerulopathy associated with allograft survival. J Am Soc Nephrol 30: 625–639, 2019
- 33. Schramowski P, Stammer W, Teso S, Brugger A, Herbert F, Shao X, et al.: Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nat Mach Intell 2: 476–486, 2020
- 34. Lapuschkin S, Wäldchen S, Binder A, Montavon G, Samek W, Müller KR: Unmasking Clever Hans predictors and assessing what machines really learn. Nat Commun 10: 1096, 2019
- 35. Kauffmann J, Ruff L, Montavon G, Müller K-R: The Clever Hans effect in anomaly detection. 2020
- 36. Pfungst O: Clever Hans (the Horse of Mr. Von Osten): A Contribution to Experimental Animal and Human Psychology, New York, NY, Holt, Rinehart and Winston, 1911
- 37. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK: Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med 15: e1002683, 2018
- 38. Chu C, Zhmoginov A, Sandler M: CycleGAN, a master of steganography. 2017
- 39. Ntoutsi E, Fafalios P, Gadiraju U, Iosifidis V, Nejdl W, Vidal M-E, et al.: Bias in data-driven artificial intelligence systems—An introductory survey. WIREs Data Min Knowl 10: e1356, 2020
- 40. Parikh RB, Teeple S, Navathe AS: Addressing bias in artificial intelligence in health care. JAMA 322: 2377–2378, 2019
- 41. Krizhevsky A, Sutskever I, Hinton GE: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, Vol. 25, New York, Curran Associates, Inc., 2012
- 42. Zhao Z-Q, Zheng P, Xu S, Wu X: Object detection with deep learning: A review. 2019
- 43. Ronneberger O, Fischer P, Brox T: U-Net: Convolutional networks for biomedical image segmentation. 2015
- 44. Jiang L, Chen W, Dong B, Mei K, Zhu C, Liu J, et al.: A deep learning-based approach for glomeruli instance segmentation from multistained renal biopsy pathologic images. Am J Pathol 191: 1431–1441, 2021
- 45. Bago-Horvath Z, Kozakowski N, Soleiman A, Bodingbauer M, Mühlbacher F, Regele H: The cutting (w)edge--comparative evaluation of renal baseline biopsies obtained by two different methods. Nephrol Dial Transplant 27: 3241–3248, 2012
- 46. Vasiljević J, Feuerhake F, Wemmert C, Lampert T: Towards histopathological stain invariance by unsupervised domain augmentation using generative adversarial networks. Neurocomputing 460: 277–291, 2021
- 47. Huo Y, Deng R, Liu Q, Fogo AB, Yang H: AI applications in renal pathology. Kidney Int 99: 1309–1320, 2021
- 48. Gadermayr M, Gupta L, Appel V, Boor P, Klinkhammer BM, Merhof D: Generative adversarial networks for facilitating stain-independent supervised and unsupervised segmentation: A study on kidney histology. IEEE Trans Med Imaging 38: 2293–2302, 2019
- 49. Falahkheirkhah K, Guo T, Hwang M, Tamboli P, Wood CG, Karam JA, et al.: A generative adversarial approach to facilitate archival-quality histopathologic diagnoses from frozen tissue sections. Lab Invest 102: 554–559, 2022
- 50. Webster R, Rabin J, Simon L, Jurie F: Detecting overfitting of deep generative networks via latent recovery. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11265–11274, 2019
- 51. Girolami I, Gambaro G, Ghimenton C, Beccari S, Caliò A, Brunelli M, et al.: Pre-implantation kidney biopsy: Value of the expertise in determining histological score and comparison with the whole organ on a series of discarded kidneys. J Nephrol 33: 167–176, 2020
- 52. Müller TT, Lio P: PECLIDES Neuro: A personalisable clinical decision support system for neurological diseases. Front Artif Intell 3: 23, 2020
- 53. Ciravegna G, Giannini F, Gori M, Maggini M, Melacci S: Human-driven FOL explanations of deep learning. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2234–2240, 2020
- 54. Barbiero P, Ciravegna G, Giannini F, Lio P, Gori M, Melacci S: Entropy-based logic explanations of neural networks. 2022
- 55. Shams Z, Dimanov B, Kola S, Simidjievski N, Terre HA, Scherer P, et al.: REM: An integrative rule extraction methodology for explainable data analysis in healthcare. 2021
- 56. Koh PW, Nguyen T, Tang YS, Mussmann S, Pierson E, Kim B, et al.: Concept bottleneck models. In: Proceedings of the 37th International Conference on Machine Learning, 5338–5348, 2020
- 57. Wang Y, Wang YG, Hu C, Li M, Fan Y, Otter N, et al.: Cell graph neural networks enable the precise prediction of patient survival in gastric cancer. NPJ Precis Oncol 6: 45, 2022
- 58. Rudin C: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1: 206–215, 2019