Journal of the American Society of Nephrology : JASN. 2012 Jan;23(1):13–21. doi: 10.1681/ASN.2010111124

Imperfect Gold Standards for Kidney Injury Biomarker Evaluation

Sushrut S Waikar, Rebecca A Betensky, Sarah C Emerson, Joseph V Bonventre
PMCID: PMC3695762  PMID: 22021710

Abstract

Clinicians have used serum creatinine in diagnostic testing for acute kidney injury for decades, despite its imperfect sensitivity and specificity. Novel tubular injury biomarkers may revolutionize the diagnosis of acute kidney injury; however, even if a novel tubular injury biomarker is 100% sensitive and 100% specific, it may appear inaccurate when using serum creatinine as the gold standard. Acute kidney injury, as defined by serum creatinine, may not reflect tubular injury, and the absence of changes in serum creatinine does not assure the absence of tubular injury. In general, the apparent diagnostic performance of a biomarker depends not only on its ability to detect injury, but also on disease prevalence and the sensitivity and specificity of the imperfect gold standard. Assuming that, at a certain cutoff value, serum creatinine is 80% sensitive and 90% specific and disease prevalence is 10%, a new perfect biomarker with a true 100% sensitivity may seem to have only 47% sensitivity compared with serum creatinine as the gold standard. Minimizing misclassification by using more strict criteria to diagnose acute kidney injury will reduce the error when evaluating the performance of a biomarker under investigation. Apparent diagnostic errors using a new biomarker may be a reflection of errors in the imperfect gold standard itself, rather than poor performance of the biomarker. The results of this study suggest that small changes in serum creatinine alone should not be used to define acute kidney injury in biomarker or interventional studies.


Diagnostic tests are judged on the basis of their ability to classify individuals according to disease status. The actual disease status of an individual is often not known with certainty in clinical medicine. Myocardial infarction, for example, is ultimately a pathologic diagnosis; however, in practice, the diagnosis must be made premortem on the basis of blood biochemical markers of myocardial necrosis along with electrocardiographic changes or ischemic symptoms. Cardiac troponins (I and T) are widely considered to be adequate biomarkers for the diagnosis of acute myocardial infarction on the basis of their myocardial tissue specificity1 and their association with important clinical outcomes.2,3 Few diagnostic tests, however, enjoy such acceptance as biomarkers.

Serum creatinine (SCr) concentration is the surrogate test used to diagnose AKI. Importantly, SCr is acknowledged as an inadequate gold standard for several reasons. SCr has poor specificity in the settings of prerenal azotemia, changes in dietary intake, and drug-induced changes in tubular secretion of creatinine, all of which may lead to changes in SCr without actual injury to the kidney.4 SCr has poor sensitivity in the setting of adequate renal reserve,5 when SCr may not change despite acute tubular injury because of compensatory increases in function by other nephrons. The use of SCr may also lead to delays in diagnosis because of the relatively slow kinetics of the rise in SCr after injury.6 A recent study of nephrotoxicity in rodents showed that the sensitivity of SCr is poor, especially when histologic injury to the kidney is mild.7 Because of the limitations of SCr, there has been considerable interest recently in identifying a troponin for the kidney.8 Despite widespread acknowledgment of the limitations, definitions of AKI continue to rely on SCr as a diagnostic standard, perhaps because of the historical absence of validated primary biomarkers of injury.6,9,10 New biomarkers of tubular injury have been sought because the kidney tubule is the most metabolically active segment of the nephron and is uniquely susceptible to ischemic and nephrotoxic insults.11 Animal and human studies have resulted in a number of promising biomarkers that may revolutionize the diagnosis of AKI, enabling more accurate and earlier diagnosis of tubular injury, and clinical studies of these biomarkers in humans are increasing.

In clinical studies, researchers have used changes in SCr as the gold standard against which to test novel tubular injury biomarkers. This method is less than ideal, however, because the inadequacies of SCr are the raison d’être for kidney injury biomarker discovery and qualification studies in the first place. Furthermore, SCr is an approximate and imperfect measure of glomerular filtration and does not directly reflect tubular function or injury, which many kidney injury biomarkers identify. Change in SCr is a continuous variable, but it is dichotomized to define a binary outcome (AKI present or absent). The choice of a cutoff value will directly affect the true sensitivity and specificity of SCr as the gold standard. Using small changes in SCr to define AKI will lead to relatively higher sensitivity but lower specificity, whereas using larger changes in SCr—or the need for renal replacement therapy in severe AKI—will result in lower sensitivity but higher specificity for true tubular injury. As was recently discussed12 and now addressed in detail in this study, even minor imperfections in the diagnostic performance of a gold standard test such as SCr can result in significant misinterpretations of the diagnostic performance of a novel biomarker under investigation. The conceptual framework for this exercise is that true AKI—the actual disease or clinical condition that our diagnostic tests intend to identify—is not synonymous with the clinical conditions identified by changes in SCr. Using changes in SCr as the gold standard may therefore lead to substantial distortions of the apparent diagnostic performance characteristics of a novel biomarker, as discussed later in this article. For the purposes of this exercise, we will assume that the tubular injury process we are attempting to identify with a biomarker can be unequivocally known and dichotomized. 
It may be reasonable to use histopathology as a dichotomized marker, although even histopathology may not be adequate due to a time delay in the development of pathologic lesions and insensitivity of pathology when injury is subtle.7 Furthermore, histopathology is not practical for clinical studies, in which biopsies are infrequent.

MODELING THE EFFECTS OF USING AN IMPERFECT GOLD STANDARD ON EVALUATION OF THE PERFORMANCE OF AN IDEAL BIOMARKER

To understand how an imperfect gold standard can distort the apparent diagnostic performance of a new test, consider the following scenario. In a study of 1000 individuals, assume that 200 truly have AKI with tubular injury, with the diagnosis based not on changes in SCr, but on another ideal diagnostic test that has 100% sensitivity and 100% specificity. The imperfect gold standard test, SCr, would have its own sensitivity and specificity for the true diagnosis of AKI: SCr would have neither perfect sensitivity due to renal reserve in some patients nor perfect specificity due to prerenal azotemia. Even if the sensitivity and specificity of SCr are each 90% (likely to be overestimates), a 2 × 2 contingency table (Table 1) can be constructed that shows how many individuals are correctly and incorrectly classified by SCr as having AKI.

Table 1.

The effect of an imperfect gold standard on the sensitivity and specificity of a new biomarker that is in fact 100% sensitive and specific for AKI

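SCr compared with true AKI (i.e., tubular injury):

| | True AKI | No true AKI | Total |
|---|---|---|---|
| AKI according to SCr | 180 | 80 | 260 |
| No AKI according to SCr | 20 | 720 | 740 |
| Total | 200 | 800 | 1000 |

Sensitivity = 90%; specificity = 90%

New biomarker compared with AKI according to SCr:

| | AKI according to SCr | No AKI according to SCr | Total |
|---|---|---|---|
| New biomarker positive | 180 | 20 | 200 |
| New biomarker negative | 80 | 720 | 800 |
| Total | 260 | 740 | 1000 |

Apparent sensitivity = 69%; apparent specificity = 97%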

Of the 800 individuals without true AKI as defined by tubular injury, SCr would falsely identify 80 as having AKI. Of the 200 individuals with true AKI, SCr would falsely identify 20 as not having AKI. Now imagine that a new biomarker is studied in this cohort of patients and that the new biomarker is in fact perfect compared with the true gold standard. How would such a perfect biomarker seem to perform compared with SCr? Table 1 shows the results: The apparent sensitivity of the perfect biomarker is only 69%, and the apparent specificity is 97%. Figure 1 illustrates the phenomenon graphically for sensitivity and specificity of 80%.
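The counting argument behind Table 1 and Figure 1 can be reproduced in a few lines. The following is a minimal sketch, not the authors' code; the function name `cross_classify` and its parameters are hypothetical, and it assumes a biomarker that is positive in exactly the truly diseased individuals.

```python
def cross_classify(n, prevalence, se_gold, sp_gold):
    """Cross-tabulate a perfect biomarker against an imperfect gold standard.

    Returns the apparent sensitivity and specificity of the perfect biomarker,
    plus the four gold-standard cell counts for inspection.
    """
    diseased = round(n * prevalence)
    healthy = n - diseased
    gold_tp = round(diseased * se_gold)   # true AKI, flagged by the gold standard
    gold_fn = diseased - gold_tp          # true AKI, missed by the gold standard
    gold_tn = round(healthy * sp_gold)    # no AKI, cleared by the gold standard
    gold_fp = healthy - gold_tn           # no AKI, flagged by the gold standard

    # A perfect biomarker is positive only in the truly diseased, so among
    # gold-standard positives it agrees only with gold_tp, and among
    # gold-standard negatives it agrees only with gold_tn.
    apparent_se = gold_tp / (gold_tp + gold_fp)
    apparent_sp = gold_tn / (gold_tn + gold_fn)
    counts = {"tp": gold_tp, "fn": gold_fn, "tn": gold_tn, "fp": gold_fp}
    return apparent_se, apparent_sp, counts


if __name__ == "__main__":
    scenarios = [("Table 1", (1000, 0.2, 0.9, 0.9)),
                 ("Figure 1", (100, 0.2, 0.8, 0.8))]
    for label, args in scenarios:
        se, sp, counts = cross_classify(*args)
        print(f"{label}: apparent Se = {se:.0%}, apparent Sp = {sp:.0%}, {counts}")
```

With the Table 1 inputs the gold-standard cells come out as 180, 20, 720, and 80; with the Figure 1 inputs the apparent sensitivity is 16/32 and the apparent specificity is 64/68.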

Figure 1.

Apparent diagnostic performance characteristics of a perfect biomarker compared with an imperfect gold standard. (A) Actual disease prevalence = 20%. Red squares represent true disease positives. Blue squares represent true disease negatives. (B) Imperfect gold standard sensitivity = 80%. Of the 20 true disease positives, 4 are classified as disease negative by the imperfect gold standard (darker blue). (C) Imperfect gold standard specificity = 80%. Of the 80 true disease negatives, 16 are classified as disease positive by the imperfect gold standard (darker red). (D) The imperfect gold standard has classified 32 individuals as disease positive (red) and 68 individuals as disease negative (blue). A perfect biomarker will correctly identify only 16 of the original true positives (lighter red) and fail to identify the 16 false positives (darker red), leading to an apparent sensitivity of 16/32 = 50%. A perfect biomarker will identify as disease negative only 64 of the original true negatives (lighter blue) and fail to identify the 4 false negatives (darker blue), leading to an apparent specificity of 64/68 = 94%.

Assuming that the results of the gold standard and the novel biomarker are independent given disease status (an assumption termed conditional independence), the equations that describe the apparent sensitivity (Equation 1) and apparent specificity (Equation 2) of a novel biomarker are as follows:

$$\text{Apparent sensitivity}_B = \frac{Se_G\,Se_B\,P + (1 - Sp_G)(1 - Sp_B)(1 - P)}{Se_G\,P + (1 - Sp_G)(1 - P)} \tag{1}$$

$$\text{Apparent specificity}_B = \frac{Sp_G\,Sp_B\,(1 - P) + (1 - Se_G)(1 - Se_B)\,P}{Sp_G\,(1 - P) + (1 - Se_G)\,P} \tag{2}$$

where Se and Sp denote sensitivity and specificity, P is the true disease prevalence, and the subscripts G and B refer to the imperfect gold standard and the novel biomarker, respectively. Receiver operating characteristic (ROC) curves are graphical plots of sensitivity versus 1 − specificity; the area under the ROC curve (AUC-ROC) is a summary statistic widely used to assess diagnostic test performance characteristics. Because ROC curves are monotonic, the upper and lower bounds of the AUC-ROC are calculated for a given sensitivity and specificity value as follows:

$$Se \times Sp \;\le\; \text{AUC-ROC} \;\le\; Sp + Se\,(1 - Sp) \tag{3}$$

The lower and upper bounds for AUC-ROC curves are derived by plotting the point (1 − specificity, sensitivity) and finding monotone curves through the given point that have minimal and maximal AUC-ROCs, respectively. These curves will be step functions with a vertical jump at X = 1 − specificity.
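As a hedged illustration, the apparent sensitivity, apparent specificity, and AUC-ROC bounds described above can be computed as follows. The formulas are written from the conditional-independence assumption and the step-function argument in the text; the function and parameter names are hypothetical, and the defaults `se_b=1.0, sp_b=1.0` encode the special case of a perfect biomarker.

```python
def apparent_sensitivity(p, se_g, sp_g, se_b=1.0, sp_b=1.0):
    """P(biomarker positive | gold standard positive), conditional independence."""
    agree = se_g * se_b * p + (1 - sp_g) * (1 - sp_b) * (1 - p)
    gold_positive = se_g * p + (1 - sp_g) * (1 - p)
    return agree / gold_positive


def apparent_specificity(p, se_g, sp_g, se_b=1.0, sp_b=1.0):
    """P(biomarker negative | gold standard negative), conditional independence."""
    agree = sp_g * sp_b * (1 - p) + (1 - se_g) * (1 - se_b) * p
    gold_negative = sp_g * (1 - p) + (1 - se_g) * p
    return agree / gold_negative


def auc_bounds(se, sp):
    """Smallest and largest area under a monotone ROC curve through (1 - sp, se).

    Lower bound: the curve is 0 up to x = 1 - sp, then stays at se.
    Upper bound: the curve is se up to x = 1 - sp, then jumps to 1.
    """
    lower = se * sp
    upper = sp + se * (1 - sp)
    return lower, upper


if __name__ == "__main__":
    # Abstract example: gold standard 80% sensitive, 90% specific, prevalence 10%.
    se = apparent_sensitivity(0.10, 0.80, 0.90)
    sp = apparent_specificity(0.10, 0.80, 0.90)
    lo, hi = auc_bounds(se, sp)
    print(f"apparent Se = {se:.2f}, apparent Sp = {sp:.2f}, AUC in [{lo:.2f}, {hi:.2f}]")
```

The functions do not guard against a zero denominator, which occurs only when the gold standard never returns a positive (or never returns a negative) result.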

We now consider the implications for diagnostic studies of novel biomarkers using the imperfect gold standards likely used in diagnostic studies of AKI. For simplicity, we consider the special case in which the novel biomarker is in fact a perfect test.

Defining AKI by Small Changes in SCr

A recently proposed consensus definition of AKI is an increase in SCr of as little as 0.3 mg/dl over 48 h.9 The motivation for this definition (known as the Acute Kidney Injury Network [AKIN] stage 1) was the finding that small changes in SCr herald a significantly increased risk of death in hospitalized individuals.13,14 Lowering the threshold for a diagnostic test usually increases sensitivity for disease detection at the expense of specificity.

The diagnostic performance characteristics of SCr have not been adequately assessed in humans, but results from animal models of nephrotoxicity have shown poor sensitivity when using histopathology as the gold standard.7 We generously assume that AKIN stage 1 has 80% sensitivity (i.e., 20% of individuals with true AKI have no increase in SCr, perhaps due to renal reserve) and 90% specificity (i.e., 10% of individuals with no actual parenchymal kidney injury have a ≥0.3 mg/dl increase in SCr, perhaps due to prerenal azotemia). At a true disease prevalence of 20%, the apparent sensitivity of a perfect biomarker (compared with AKIN stage 1) is 67%, apparent specificity is 95%, and the lower and upper bounds of the AUC-ROC are 0.63 and 0.98, respectively (0.81 under the assumption of conditional independence).

Figure 2 illustrates how differences in true disease prevalence affect these estimates. At low disease prevalence, the dominant effect is on a perfect biomarker’s apparent sensitivity. In contrast, apparent specificity remains high at low disease prevalence. The apparent AUC-ROC exhibits a wide range of possible values, depending on the nature of the concordance between the imperfect gold standard and the perfect biomarker (Table 2).

Figure 2.

Apparent diagnostic performance characteristics of a perfect biomarker when an imperfect gold standard has 80% sensitivity and 90% specificity. (A) The apparent sensitivity and specificity. (B) The apparent AUC-ROC curve (solid line, under the assumption of conditional independence) with lower and upper bounds (dotted lines). Results are plotted for a range of true disease prevalence estimates.

Table 2.

Apparent diagnostic performance characteristics of a perfect biomarker as a function of disease prevalence and sensitivity and specificity of an imperfect gold standard

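| Imperfect Gold Standard Diagnostic Performance | True Disease Prevalence, % | Apparent Sensitivity, % | Apparent Specificity, % | Lower Bound of AUC-ROC | Upper Bound of AUC-ROC | AUC-ROC Assuming Conditional Independence |
|---|---|---|---|---|---|---|
| 80% sensitive, 90% specific | 5 | 30 | 99 | 0.29 | 0.99 | 0.64 |
| 80% sensitive, 90% specific | 10 | 47 | 98 | 0.46 | 0.99 | 0.72 |
| 80% sensitive, 90% specific | 20 | 67 | 95 | 0.63 | 0.98 | 0.81 |
| 80% sensitive, 90% specific | 40 | 84 | 87 | 0.73 | 0.98 | 0.86 |
| 80% sensitive, 90% specific | 60 | 92 | 75 | 0.69 | 0.98 | 0.84 |
| 80% sensitive, 80% specific | 5 | 17 | 99 | 0.17 | 0.99 | 0.58 |
| 80% sensitive, 80% specific | 10 | 31 | 97 | 0.30 | 0.98 | 0.64 |
| 80% sensitive, 80% specific | 20 | 50 | 94 | 0.47 | 0.97 | 0.72 |
| 80% sensitive, 80% specific | 40 | 73 | 86 | 0.62 | 0.96 | 0.79 |
| 80% sensitive, 80% specific | 60 | 86 | 73 | 0.64 | 0.96 | 0.79 |
| 25% sensitive, 100% specific | 5 | 100 | 96 | 0.96 | 1.00 | 0.98 |
| 25% sensitive, 100% specific | 10 | 100 | 92 | 0.92 | 1.00 | 0.96 |
| 25% sensitive, 100% specific | 20 | 100 | 84 | 0.84 | 1.00 | 0.92 |
| 25% sensitive, 100% specific | 40 | 100 | 67 | 0.67 | 1.00 | 0.83 |
| 25% sensitive, 100% specific | 60 | 100 | 47 | 0.47 | 1.00 | 0.74 |
| 25% sensitive, 99% specific | 5 | 57 | 96 | 0.55 | 0.98 | 0.76 |
| 25% sensitive, 99% specific | 10 | 74 | 92 | 0.68 | 0.98 | 0.83 |
| 25% sensitive, 99% specific | 20 | 86 | 84 | 0.72 | 0.98 | 0.85 |
| 25% sensitive, 99% specific | 40 | 94 | 66 | 0.63 | 0.98 | 0.80 |
| 25% sensitive, 99% specific | 60 | 97 | 47 | 0.46 | 0.99 | 0.72 |
| 100% sensitive, 25% specific | 5 | 7 | 100 | 0.07 | 1.00 | 0.53 |
| 100% sensitive, 25% specific | 10 | 13 | 100 | 0.13 | 1.00 | 0.56 |
| 100% sensitive, 25% specific | 20 | 25 | 100 | 0.25 | 1.00 | 0.63 |
| 100% sensitive, 25% specific | 40 | 47 | 100 | 0.47 | 1.00 | 0.74 |
| 100% sensitive, 25% specific | 60 | 67 | 100 | 0.67 | 1.00 | 0.83 |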
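The rows of Table 2 can be regenerated with a short loop, shown here as a sketch for the perfect-biomarker case only. Two points are assumptions rather than statements from the article: the column "AUC-ROC Assuming Conditional Independence" is computed as the midpoint (Se + Sp)/2, which matches the printed values in the rows checked, and the names `SCENARIOS`, `PREVALENCES`, and `perfect_biomarker_row` are hypothetical.

```python
SCENARIOS = [(80, 90), (80, 80), (25, 100), (25, 99), (100, 25)]  # (Se_G %, Sp_G %)
PREVALENCES = [5, 10, 20, 40, 60]                                  # true prevalence, %


def perfect_biomarker_row(se_g_pct, sp_g_pct, prev_pct):
    """Return (apparent Se %, apparent Sp %, lower AUC, upper AUC, midpoint AUC)."""
    se_g, sp_g, p = se_g_pct / 100, sp_g_pct / 100, prev_pct / 100
    # A perfect biomarker agrees with the gold standard only on true cases.
    app_se = se_g * p / (se_g * p + (1 - sp_g) * (1 - p))
    app_sp = sp_g * (1 - p) / (sp_g * (1 - p) + (1 - se_g) * p)
    lower = app_se * app_sp
    upper = app_sp + app_se * (1 - app_sp)
    midpoint = (app_se + app_sp) / 2  # ASSUMPTION: inferred from the table values
    return (round(100 * app_se), round(100 * app_sp),
            round(lower, 2), round(upper, 2), round(midpoint, 2))


if __name__ == "__main__":
    for se_g_pct, sp_g_pct in SCENARIOS:
        for prev_pct in PREVALENCES:
            row = perfect_biomarker_row(se_g_pct, sp_g_pct, prev_pct)
            print(f"Se_G={se_g_pct}% Sp_G={sp_g_pct}% prevalence={prev_pct}%: {row}")
```

Under this reading, one printed entry differs from the computed value: for the 80% sensitive, 80% specific gold standard at 60% prevalence, the product of apparent sensitivity and specificity is about 0.62 rather than 0.64. Values that fall exactly on a rounding boundary may also differ in the last digit.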

Contrast nephropathy is one example in which the risk of significant AKI is low.15 The majority of cases of contrast nephropathy are diagnosed on the basis of small changes in SCr alone,16 which may very well represent misclassifications of conditions such as prerenal azotemia. If 100 patients are studied and 5 have true AKI (i.e., tubular injury from contrast media) and 5 have prerenal azotemia (i.e., no tubular injury but SCr changes), a perfect biomarker of tubular injury may seem to have only 50% sensitivity using the small increase in SCr as the gold standard.
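The arithmetic behind the 50% figure can be written out explicitly. This sketch assumes, as the example implies but does not state, that SCr rises in all 5 patients with true AKI and in all 5 with prerenal azotemia, and in none of the remaining 90:

$$\text{apparent sensitivity} = \frac{\text{biomarker positive and SCr positive}}{\text{SCr positive}} = \frac{5}{5 + 5} = 50\%$$

$$\text{apparent specificity} = \frac{\text{biomarker negative and SCr negative}}{\text{SCr negative}} = \frac{90}{90} = 100\%$$

The perfect biomarker is penalized only for the 5 prerenal cases that the gold standard mislabels as AKI.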

Defining AKI According to the Need for Dialysis

The need for renal replacement therapy after AKI usually reflects severe parenchymal kidney injury. Rare exceptions may include dialysis initiation in patients with advanced chronic kidney disease or dialysis for volume overload, electrolyte abnormalities, or toxic ingestions. Assume that the gold standard in this case, the need for acute dialysis, has a specificity of 100% (i.e., among patients without true AKI or tubular injury, no one requires acute dialysis) and a sensitivity of 25% (i.e., of all individuals with true AKI, only 25% require dialysis). At a true disease prevalence of 20%, the apparent sensitivity of a perfect biomarker is 100%, apparent specificity is 84%, and the lower and upper bounds of the apparent AUC-ROC are 0.84 and 1.00, respectively (0.92 under the assumption of conditional independence).

Figure 3 illustrates how differences in true disease prevalence affect these estimates. Note that, in this case, the apparent sensitivity and upper bound of the AUC-ROC remain 100% only when the gold standard is in fact perfectly specific. Even rare false positives (specificity of 99% of the imperfect gold standard) lead to an apparent sensitivity of 86% and lower and upper bounds of the apparent AUC-ROC of 0.72 and 0.98, respectively (0.85 under the assumption of conditional independence; Table 2).
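The contrast between a perfectly specific and a nearly perfectly specific gold standard can be worked through directly. Here $Se_G$ and $Sp_G$ denote the sensitivity and specificity of the gold standard and $P$ the true prevalence (notation chosen for this sketch), and the biomarker is assumed perfect:

$$\text{apparent sensitivity} = \frac{Se_G\,P}{Se_G\,P + (1 - Sp_G)(1 - P)}$$

With $Sp_G = 1$ the second term in the denominator vanishes, so the apparent sensitivity is 100% at any prevalence. With $Sp_G = 0.99$, $Se_G = 0.25$, and $P = 0.20$:

$$\text{apparent sensitivity} = \frac{0.25 \times 0.20}{0.25 \times 0.20 + 0.01 \times 0.80} = \frac{0.050}{0.058} \approx 86\%$$

The apparent specificity at $Sp_G = 1$ is driven entirely by the cases the gold standard misses:

$$\text{apparent specificity} = \frac{Sp_G\,(1 - P)}{Sp_G\,(1 - P) + (1 - Se_G)\,P} = \frac{0.80}{0.80 + 0.75 \times 0.20} = \frac{0.80}{0.95} \approx 84\%$$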

Figure 3.

Apparent diagnostic performance characteristics of a perfect biomarker when an imperfect gold standard has 25% sensitivity and 100% specificity. (A) The apparent sensitivity and specificity. (B) The apparent AUC-ROC curve (solid line, under the assumption of conditional independence) with lower and upper bounds (dotted lines). Results are plotted for a range of true disease prevalence estimates.

PREVIOUS WORK ON THE IMPERFECT GOLD STANDARD

The effect of imperfect reference standards has generally been neglected in the expanding clinical literature on diagnostic test accuracy. In the biostatistical literature, several approaches have been proposed based on the assumption of conditional independence. If the gold standard has a known false positive and false negative rate, and the true disease prevalence is known, the apparent sensitivity and specificity of a new diagnostic test can be calculated.17–19 Unfortunately, the required parameters are not usually known with certainty. Hui and Walter20 proposed a method to estimate the error rate of a diagnostic test even when the error rates of the gold standard are unknown by applying both tests simultaneously in two populations with different disease prevalence. Walter and Irwig21 reviewed latent class models for use when no gold standard exists; the approach requires a minimum of three (imperfect) diagnostic tests and the use of maximum likelihood techniques to yield estimates of disease prevalence and test accuracy. All of these approaches make the assumption of conditional independence of the new diagnostic test and the gold standard, which may not be a reasonable assumption in many clinical settings. Analytical approaches that incorporate conditional dependence have been described by Vacek22 and Phelps and Hutson.19
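To make the two-test, two-population idea concrete, the sketch below fits a conditional-independence latent class model to two 2 × 2 tables. It is an illustration only: the expectation-maximization (EM) fitting loop is a technique swapped in here for simplicity and is not claimed to be the estimation procedure of the cited authors, and all names, starting values, and the synthetic inputs are hypothetical.

```python
from itertools import product


def expected_counts(n, prev, se_g, sp_g, se_b, sp_b):
    """Expected counts {(g, b): count} for tests G and B, conditional independence."""
    table = {}
    for g, b in product((0, 1), repeat=2):
        p_dis = (se_g if g else 1 - se_g) * (se_b if b else 1 - se_b)
        p_non = ((1 - sp_g) if g else sp_g) * ((1 - sp_b) if b else sp_b)
        table[(g, b)] = n * (prev * p_dis + (1 - prev) * p_non)
    return table


def hui_walter_em(tables, n_iter=20000):
    """Estimate prevalences and test accuracies from two cross-classified tables."""
    prev = [0.2, 0.4]                    # arbitrary, distinct starting values
    se_g = sp_g = se_b = sp_b = 0.8      # start from "both tests are informative"
    for _ in range(n_iter):
        dis_tot = non_tot = 0.0
        g_pos_dis = b_pos_dis = g_neg_non = b_neg_non = 0.0
        new_prev = []
        for k, table in enumerate(tables):
            dis_k = 0.0
            for (g, b), n in table.items():
                # E-step: posterior probability that a subject in this cell is diseased.
                like_d = prev[k] * (se_g if g else 1 - se_g) * (se_b if b else 1 - se_b)
                like_n = (1 - prev[k]) * ((1 - sp_g) if g else sp_g) * ((1 - sp_b) if b else sp_b)
                w = like_d / (like_d + like_n)
                dis_k += n * w
                g_pos_dis += n * w * g
                b_pos_dis += n * w * b
                g_neg_non += n * (1 - w) * (1 - g)
                b_neg_non += n * (1 - w) * (1 - b)
            total_k = sum(table.values())
            new_prev.append(dis_k / total_k)
            dis_tot += dis_k
            non_tot += total_k - dis_k
        # M-step: weighted proportions become the updated parameters.
        prev = new_prev
        se_g, se_b = g_pos_dis / dis_tot, b_pos_dis / dis_tot
        sp_g, sp_b = g_neg_non / non_tot, b_neg_non / non_tot
    return {"prevalence": prev, "se_g": se_g, "sp_g": sp_g, "se_b": se_b, "sp_b": sp_b}


if __name__ == "__main__":
    truth = dict(se_g=0.80, sp_g=0.90, se_b=0.95, sp_b=0.95)  # hypothetical values
    tables = [expected_counts(1000, 0.10, **truth), expected_counts(1000, 0.50, **truth)]
    print(hui_walter_em(tables))
```

The model has six unknowns (two prevalences, two sensitivities, two specificities) and two 2 × 2 tables supply six independent proportions, which is why two populations with different prevalence are needed. The estimates remain only as good as the conditional-independence assumption.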

EXAMPLES ASIDE FROM NEPHROLOGY: IRON DEFICIENCY ANEMIA

AKI is not the only condition in clinical medicine that is classified on the basis of an imperfect and continuous value gold standard. Iron deficiency anemia, for example, is diagnosed in clinical practice on the basis of serum ferritin concentration because the gold standard—examination of a bone marrow aspirate for stainable iron—is invasive, expensive, and time-consuming, and carries its own set of diagnostic uncertainties. Another potential gold standard examination—assessing the therapeutic response to a trial of iron supplementation—is not practical for making an initial diagnosis. Soluble transferrin receptor (sTfR) concentration was examined as an alternative index of iron status in anemia. Wians et al.,23 for example, reported an AUC-ROC of 0.958 for sTfR as a biomarker to distinguish iron deficiency from anemia of chronic disease: ferritin <20 ng/ml was used to define iron deficiency, and ferritin >1.5 × the sex-specific upper limit of the normal reference range defined anemia of chronic disease. In this study, exclusion of intermediate values for ferritin may have led to overestimation of the accuracy of sTfR. Furthermore, the authors did not address misclassification of disease status by the imperfect gold standard, ferritin. Although ferritin has become the de facto gold standard for the assessment of iron stores, ferritin does not perfectly correlate with the result of the true gold standard: examination of stainable iron in bone marrow aspirates. Reports of the sensitivity and specificity of ferritin have ranged from 52% to 71% and 93% to 100%, respectively.24,25

DEFINING AKI FOR BIOMARKER STUDIES

Despite consensus definitions of AKI,9,26 recent clinical studies of novel biomarkers used a variety of SCr criteria to define AKI (e.g., a 50% increase over baseline,27 0.3 mg/dl or 50% increase over baseline,28 and 25% increase over baseline29). As discussed in this study, varying the choice of definition for AKI will affect the apparent diagnostic performance of a novel biomarker. This has, in fact, been shown for the biomarker neutrophil gelatinase-associated lipocalin (NGAL). Haase et al.30 reported that the AUC-ROC for NGAL was 0.75 when using the AKIN stage 1 definition (0.3 mg/dl or 50% increase within 48 h) and 0.83 when defining AKI as the need for dialysis.

The finding of low sensitivity of a promising kidney injury biomarker in a setting such as contrast nephropathy, in which the expected true prevalence of disease is low, should raise the question of disease misclassification by the gold standard. In fact, this has been reported in a meta-analysis30 of NGAL’s performance as a biomarker of AKI in various clinical settings. Sensitivity was only 77.8%, whereas specificity was 96.3% for three pooled studies involving contrast nephropathy defined by using small changes in SCr (25% or 0.5 mg/dl in two adult studies,31,32 and 50% in a pediatric study with mean baseline SCr values < 1.0 mg/dl33).

In clinical settings with a higher expected prevalence of true AKI, apparent sensitivity is less affected by false positive designations by the gold standard, but apparent specificity may be reduced by false negative designations. Higher true AKI prevalence may occur in biomarker studies performed in the intensive care unit (ICU), in which the frequency of hypotension and sepsis may lead to frequent tubular injury. The prediction of lower specificity for a biomarker in high-risk clinical settings has been borne out for NGAL, which had a pooled specificity of 75.5% for the diagnosis of AKI in critically ill patients.30

Mishra et al.27 reported an AUC-ROC of 0.998 for NGAL as a biomarker in pediatric cardiac surgery using a 50% rise in SCr as the definition of AKI. The inability to replicate this level of performance in subsequent larger studies34–36 suggests that the findings were either specific to the unique pediatric population studied or reflect conditional dependence between SCr and NGAL. The latter point merits emphasis: an AUC-ROC of 1.0 may in fact indicate a biomarker that is subject to the same limitations as SCr.

NON–CREATININE-BASED ENDPOINTS FOR BIOMARKER STUDIES

Longer-term outcomes such as mortality, or the eventual need for renal replacement therapy, may be used to compare a new biomarker against an imperfect gold standard. Indeed, the association of troponin with mortality,2,3 in conjunction with its known tissue specificity,1 contributed to its adoption for the diagnosis of myocardial infarction.37 One difficulty with extrapolating this approach to AKI biomarker studies may be the large sample sizes required for statistical power, the long latency between an episode of kidney injury and outcomes such as progressive chronic kidney disease, and confounding by other risk factors and clinical events.

A biomarker may also be associated with mortality or another long-term outcome because of an association with sepsis or inflammation, without being reflective of actual kidney injury. The observation that even a 0.1 mg/dl increase in SCr is associated with mortality does not assure that this small change in SCr is a reliable predictor of kidney injury.38 This may have important implications for using a biomarker as a surrogate endpoint in an interventional study because a surrogate biomarker should be in the causal pathway of the disease process.

Another possible study design involves using exposure status to test a biomarker’s accuracy. Consider, for example, a study in which biomarkers are tested after exposure to a drug with known nephrotoxic potential, such as cisplatin. If biomarkers are measured in well matched patients who did and did not receive cisplatin, exposure status could be used as the criterion against which biomarkers are compared, assuming that there is a high correlation between exposure status and kidney injury. In this type of design, SCr does not need to be used as a gold standard. The risk of such a study design, however, is the identification of biomarkers that are too sensitive to be of clinical use. Tubular enzymes such as N-acetyl-β-(d)-glucosaminidase as well as α-glutathione and π-glutathione S-transferases, for example, are known to be elevated in the urine after cardiac surgery39–41 but have not been adopted (perhaps inappropriately) into clinical practice because of concerns regarding possible nonspecificity of the appearance of tubular enzymes in the urine. Nevertheless, these types of studies may be useful to identify biomarkers that fulfill the vision put forth by the US Food and Drug Administration regarding qualification and use of biomarkers in drug development, dose regulation, and clinical monitoring of nephrotoxic drug exposure.42

The ultimate validation of a biomarker’s utility would be to demonstrate, ideally in a randomized controlled trial, that biomarker measurement actually alters clinical management and improves clinical outcomes. For example, results might show that clinical decisions aided by knowledge of AKI status as inferred by biomarker elevations lead to reductions in length of stay, ICU-related complications, need for renal replacement therapy, long-term renal function decline, or mortality.

BIOMARKERS OF TUBULAR INJURY VERSUS BIOMARKERS OF GFR

It should be noted that tubular injury—the focus of many biomarkers—may not always couple with reductions in GFR, thereby leading to apparent false positive results. Conversely, reductions in GFR from prerenal azotemia may not always reflect tubular injury, thereby leading to apparent false negative results with the biomarker. In this regard, one unresolved question is which pathophysiological process is more likely to be clinically relevant and important to monitor: changes in GFR or tubular injury? The answer to this question is clear in some contexts, such as preclinical nephrotoxicity studies, in which SCr underperforms tubular injury biomarkers when histology is used as the arbiter.7

In hospital-acquired AKI, the measure of a biomarker’s performance may be the additional clinical value afforded by its measurement. In this context, validation of a biomarker may come from a trial in which patients at risk for AKI (e.g., after cardiac surgery, in the ICU, or with acute decompensated heart failure) are randomized to a diagnostic strategy involving conventional biomarkers alone (BUN, SCr, or urine output) or conventional plus novel biomarkers of tubular injury. Fluid or hemodynamic management may be adjusted on the basis of information on kidney injury. Rational endpoints for such a trial could include: net fluid balance, duration of mechanical ventilation, length of stay, or need for renal replacement therapy. Such an approach will inform the assessment of biomarkers relative to functional changes reflected by changes in SCr.

SUGGESTED SCr-BASED DEFINITIONS FOR BIOMARKER STUDIES

Using a highly sensitive but poorly specific definition for AKI will lead to a relatively large number of false positive cases compared with true positive cases, thereby significantly reducing the apparent sensitivity of a novel biomarker. In contrast nephropathy, for example, one currently accepted definition43 of an increase of ≥0.5 mg/dl or 25% in SCr may be adequate for epidemiology studies but is inappropriate for a diagnostic biomarker study. Figure 4 shows that, for the low prevalence estimates we may expect in many AKI studies, maintaining high specificity of an AKI definition is more important than high sensitivity. If the choice is between a highly sensitive definition of AKI with some false positives versus a perfectly specific AKI definition with some false negatives, the latter leads to far less distortion of biomarker performance. Accordingly, we suggest at a minimum using RIFLE I, AKIN stage 2, or stage 2 of the definition based on creatinine kinetics that we previously proposed.6 Higher specificity will be achieved with definitions that require larger changes in SCr. Studies should ideally be adequately powered for hard endpoints, such as the need for renal replacement therapy for uremia, to reduce ambiguity with respect to the presence of true kidney injury, because smaller studies with liberal endpoints may lead to gross misrepresentations of biomarker performance. Consideration should also be given to approaches suggested in the biostatistical literature when a reliable gold standard test does not exist; for example, measuring two biomarkers that are assumed to be conditionally independent in individuals drawn from two populations with different disease prevalence.20,21 As described in Table 3, several study designs should be considered to prevent researchers from prematurely discarding novel biomarkers.

Figure 4.

AUC-ROC curve of a perfect biomarker when an imperfect gold standard has variable sensitivity or specificity, as shown. Results are plotted for a true disease prevalence of 10%.
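The asymmetry that Figure 4 describes at 10% prevalence can be sketched numerically. The gold-standard values swept below are illustrative choices, not the ones plotted in the figure, and the apparent AUC-ROC is taken as the midpoint (Se + Sp)/2 of a single operating point, which is an assumption about how the conditional-independence curve is summarized.

```python
def apparent_auc_perfect_biomarker(prevalence, se_gold, sp_gold):
    """Apparent AUC-ROC of a perfect biomarker judged against an imperfect gold standard."""
    p = prevalence
    app_se = se_gold * p / (se_gold * p + (1 - sp_gold) * (1 - p))
    app_sp = sp_gold * (1 - p) / (sp_gold * (1 - p) + (1 - se_gold) * p)
    return (app_se + app_sp) / 2  # ASSUMPTION: single-point (trapezoid) summary


PREVALENCE = 0.10
LEVELS = (0.9, 0.7, 0.5, 0.3)  # hypothetical gold-standard accuracy levels

# Perfectly specific definitions (e.g., hard endpoints) with falling sensitivity.
specific_defs = {se: apparent_auc_perfect_biomarker(PREVALENCE, se, 1.0) for se in LEVELS}
# Perfectly sensitive definitions (e.g., small SCr changes) with falling specificity.
sensitive_defs = {sp: apparent_auc_perfect_biomarker(PREVALENCE, 1.0, sp) for sp in LEVELS}

if __name__ == "__main__":
    for level in LEVELS:
        print(f"accuracy {level:.0%}: specific definition AUC = {specific_defs[level]:.2f}, "
              f"sensitive definition AUC = {sensitive_defs[level]:.2f}")
```

In this sketch, even a perfectly specific definition that misses 70% of true cases distorts the apparent AUC-ROC less than a perfectly sensitive definition with 90% specificity.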

Table 3.

Suggested study designs of biomarker studies in AKI

1. Use an AKI definition that minimizes false positive misclassifications of disease status, particularly in clinical settings with relatively low expected prevalence of true AKI. Examples of appropriate AKI definitions include the following: Acute Kidney Injury Network (AKIN) stage 2 or 3, RIFLE stage “I” or “F,” or stage 2 or 3 of a definition based on SCr kinetics.6
2. Examine hard endpoints (e.g., in-hospital or 28-d mortality, need for renal replacement therapy (RRT), or substantial and sustained reduction in kidney function at 28 d) instead of short-term changes in SCr—recognizing, however, that the “injury” marker is then being tested as a prognostic marker and not necessarily an injury marker.
3. Test whether biomarkers outperform conventional measures of kidney function in the accurate identification of individuals exposed to nephrotoxic agents versus well matched, nonexposed individuals.
4. Test whether clinical outcomes (most likely intermediate endpoints such as length of stay or postoperative fluid balance, but ideally hard endpoints such as renal replacement therapy or mortality) are improved in patients who do versus do not undergo kidney injury biomarker testing with subsequent clinical decision making on the basis of biomarker results.
5. Use the approaches suggested in the biostatistical literature when a reliable gold standard does not exist20,21: for example, measurement of two biomarkers that are assumed to be conditionally independent (e.g., SCr and a novel tubular injury biomarker) in individuals drawn from two populations with different disease prevalence.

CONCLUSION

Biomarker development in nephrology is crucial in the development of therapeutic strategies for AKI prevention and treatment. Underpowered studies using small changes in SCr as endpoints may have the unintended and perverse effect of underestimating the utility of novel biomarkers that actually outperform SCr itself. When using a nonideal gold standard to evaluate novel biomarkers, appropriate study design considerations become critical to avoid misleading conclusions that would preclude the acceptance into clinical medicine of new useful biomarkers that have the chance to revolutionize the approach to AKI diagnosis and therapeutics.

DISCLOSURES

This paper discusses biomarkers. One such biomarker is KIM-1. J.V.B. is a co-inventor on KIM-1 patents that have been licensed by the Partners Office for Research Ventures & Licensing (RVL) to a number of companies, including Johnson & Johnson, Biogen Idec, R&D systems, BioAssay Works, Rules-Based Medicine, and Genzyme. J.V.B. had a significant equity interest in Genzyme, which, for 1 year during the study, created a conflict of interest. After review by the institution, J.V.B. relinquished this interest to be consistent with Harvard and Partners Healthcare System policies, and Partners put a management plan into place.

Acknowledgments

This work was supported by National Institutes of Health Grants R33DK074099 (S.S.W., R.A.B., and J.V.B.), K23DK075941 (S.S.W.), and T32NS048005 (S.C.E.). This work was conducted with support from Harvard Catalyst | The Harvard Clinical and Translational Science Center (NIH Award UL1 RR 025758 and financial contributions from Harvard University and its affiliated academic health care centers).

The content is solely the responsibility of the authors and does not necessarily represent the official views of Harvard Catalyst, Harvard University and its affiliated academic health care centers, the National Center for Research Resources, or the National Institutes of Health.

Footnotes

Published online ahead of print. Publication date available at www.jasn.org.

REFERENCES

1. Wade R, Eddy R, Shows TB, Kedes L: cDNA sequence, tissue-specific expression, and chromosomal mapping of the human slow-twitch skeletal muscle isoform of troponin I. Genomics 7: 346–357, 1990
2. Antman EM, Tanasijevic MJ, Thompson B, Schactman M, McCabe CH, Cannon CP, Fischer GA, Fung AY, Thompson C, Wybenga D, Braunwald E: Cardiac-specific troponin I levels to predict the risk of mortality in patients with acute coronary syndromes. N Engl J Med 335: 1342–1349, 1996
3. Hamm CW, Ravkilde J, Gerhardt W, Jørgensen P, Peheim E, Ljungdahl L, Goldmann B, Katus HA: The prognostic value of serum troponin T in unstable angina. N Engl J Med 327: 146–150, 1992
4. Blantz RC: Pathophysiology of pre-renal azotemia. Kidney Int 53: 512–523, 1998
5. Bosch JP, Saccaggi A, Lauer A, Ronco C, Belledonne M, Glabman S: Renal functional reserve in humans. Effect of protein intake on glomerular filtration rate. Am J Med 75: 943–950, 1983
6. Waikar SS, Bonventre JV: Creatinine kinetics and the definition of acute kidney injury. J Am Soc Nephrol 20: 672–679, 2009
7. Vaidya VS, Ozer JS, Dieterle F, Collings FB, Ramirez V, Troth S, Muniappa N, Thudium D, Gerhold D, Holder DJ, Bobadilla NA, Marrer E, Perentes E, Cordier A, Vonderscher J, Maurer G, Goering PL, Sistare FD, Bonventre JV: Kidney injury molecule-1 outperforms traditional biomarkers of kidney injury in preclinical biomarker qualification studies. Nat Biotechnol 28: 478–485, 2010
8. Szczech LA: The development of urinary biomarkers for kidney disease is the search for our renal troponin. J Am Soc Nephrol 20: 1656–1657, 2009
9. Molitoris BA, Levin A, Warnock DG, Joannidis M, Mehta RL, Kellum JA, Ronco C, Shah SV; Acute Kidney Injury Network working group: Improving outcomes of acute kidney injury: report of an initiative. Nat Clin Pract Nephrol 3: 439–442, 2007
10. Wu I, Parikh CR: Screening for kidney diseases: older measures versus novel biomarkers. Clin J Am Soc Nephrol 3: 1895–1901, 2008
11. Bonventre JV, Vaidya VS, Schmouder R, Feig P, Dieterle F: Next-generation biomarkers for detecting kidney toxicity. Nat Biotechnol 28: 436–440, 2010
12. Siew ED, Ware LB, Ikizler TA: Biological markers of acute kidney injury. J Am Soc Nephrol 22: 810–820, 2011
13. Chertow GM, Burdick E, Honour M, Bonventre JV, Bates DW: Acute kidney injury, mortality, length of stay, and costs in hospitalized patients. J Am Soc Nephrol 16: 3365–3370, 2005
14. Lassnigg A, Schmidlin D, Mouhieddine M, Bachmann LM, Druml W, Bauer P, Hiesmayr M: Minimal changes of serum creatinine predict prognosis in patients after cardiothoracic surgery: a prospective cohort study. J Am Soc Nephrol 15: 1597–1605, 2004
15. Parfrey PS, Griffiths SM, Barrett BJ, Paul MD, Genge M, Withers J, Farid N, McManamon PJ: Contrast material-induced renal failure in patients with diabetes mellitus, renal insufficiency, or both. A prospective controlled study. N Engl J Med 320: 143–149, 1989
16. Mitchell AM, Jones AE, Tumlin JA, Kline JA: Incidence of contrast-induced nephropathy after contrast-enhanced computed tomography in the outpatient setting. Clin J Am Soc Nephrol 5: 4–9, 2010
17. Buck AA, Gart JJ: Comparison of a screening test and a reference test in epidemiologic studies. I. Indices of agreement and their relation to prevalence. Am J Epidemiol 83: 586–592, 1966
18. Staquet M, Rozencweig M, Lee YJ, Muggia FM: Methodology for the assessment of new dichotomous diagnostic tests. J Chronic Dis 34: 599–610, 1981
19. Phelps CE, Hutson A: Estimating diagnostic test accuracy using a “fuzzy gold standard”. Med Decis Making 15: 44–57, 1995
20. Hui SL, Walter SD: Estimating the error rates of diagnostic tests. Biometrics 36: 167–171, 1980
21. Walter SD, Irwig LM: Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol 41: 923–937, 1988
22. Vacek PM: The effect of conditional dependence on the evaluation of diagnostic tests. Biometrics 41: 959–968, 1985
23. Wians FH, Jr, Urban JE, Keffer JH, Kroft SH: Discriminating between iron deficiency anemia and anemia of chronic disease using traditional indices of iron status vs transferrin receptor concentration. Am J Clin Pathol 115: 112–118, 2001
24. Ali MA, Luxton AW, Walker WH: Serum ferritin concentration and bone marrow iron stores: a prospective study. Can Med Assoc J 118: 945–946, 1978
25. Mast AE, Blinder MA, Lu Q, Flax S, Dietzen DJ: Clinical utility of the reticulocyte hemoglobin content in the diagnosis of iron deficiency. Blood 99: 1489–1491, 2002
26. Kellum JA, Bellomo R, Ronco C, Mehta R, Clark W, Levin NW: The 3rd International Consensus Conference of the Acute Dialysis Quality Initiative (ADQI). Int J Artif Organs 28: 441–444, 2005
27. Mishra J, Dent C, Tarabishi R, Mitsnefes MM, Ma Q, Kelly C, Ruff SM, Zahedi K, Shao M, Bean J, Mori K, Barasch J, Devarajan P: Neutrophil gelatinase-associated lipocalin (NGAL) as a biomarker for acute renal injury after cardiac surgery. Lancet 365: 1231–1238, 2005
28. Wagener G, Jan M, Kim M, Mori K, Barasch JM, Sladen RN, Lee HT: Association between increases in urinary neutrophil gelatinase-associated lipocalin and acute renal dysfunction after adult cardiac surgery. Anesthesiology 105: 485–491, 2006
29. Koyner JL, Bennett MR, Worcester EM, Ma Q, Raman J, Jeevanandam V, Kasza KE, O’Connor MF, Konczal DJ, Trevino S, Devarajan P, Murray PT: Urinary cystatin C as an early biomarker of acute kidney injury following adult cardiothoracic surgery. Kidney Int 74: 1059–1069, 2008
30. Haase M, Bellomo R, Devarajan P, Schlattmann P, Haase-Fielitz A; NGAL Meta-analysis Investigator Group: Accuracy of neutrophil gelatinase-associated lipocalin (NGAL) in diagnosis and prognosis in acute kidney injury: a systematic review and meta-analysis. Am J Kidney Dis 54: 1012–1024, 2009
31. Ling W, Zhaohui N, Ben H, Leyi G, Jianping L, Huili D, Jiaqi Q: Urinary IL-18 and NGAL as early predictive biomarkers in contrast-induced nephropathy after coronary angiography. Nephron Clin Pract 108: c176–c181, 2008
32. Makris K, Demponeras C, Zoubouloglou F, Potamitis S, Kafkas N, Drakopoulos I, Rizos D, Nikolaou A, Babalis D, Haliassos A: The role of urinary NGAL to urinary creatinine ratio in the early detection of contrast agent induced acute kidney injury after coronary artery angiography. Poster presented at the American Association for Clinical Chemistry; July 19–23, 2009; Chicago, IL
33. Hirsch R, Dent C, Pfriem H, Allen J, Beekman RH, 3rd, Ma Q, Dastrala S, Bennett M, Mitsnefes M, Devarajan P: NGAL is an early predictive biomarker of contrast-induced nephropathy in children. Pediatr Nephrol 22: 2089–2095, 2007
34. Wagener G, Gubitosa G, Wang S, Borregaard N, Kim M, Lee HT: Urinary neutrophil gelatinase-associated lipocalin and acute kidney injury after cardiac surgery. Am J Kidney Dis 52: 425–433, 2008
35. Koyner JL, Vaidya VS, Bennett MR, Ma Q, Worcester EM, Akhter SA, Raman J, Jeevanandam V, O’Connor MF, Devarajan P, Bonventre JV, Murray PT: Urinary biomarkers in the clinical prognosis and early detection of acute kidney injury. Clin J Am Soc Nephrol 5: 2154–2165, 2010
36. Han WK, Wagener G, Zhu Y, Wang S, Lee HT: Urinary biomarkers in the early detection of acute kidney injury after cardiac surgery. Clin J Am Soc Nephrol 4: 873–882, 2009
37. Thygesen K, Alpert JS, White HD, Jaffe AS, Apple FS, Galvani M, Katus HA, Newby LK, Ravkilde J, Chaitman B, Clemmensen PM, Dellborg M, Hod H, Porela P, Underwood R, Bax JJ, Beller GA, Bonow R, Van der Wall EE, Bassand JP, Wijns W, Ferguson TB, Steg PG, Uretsky BF, Williams DO, Armstrong PW, Antman EM, Fox KA, Hamm CW, Ohman EM, Simoons ML, Poole-Wilson PA, Gurfinkel EP, Lopez-Sendon JL, Pais P, Mendis S, Zhu JR, Wallentin LC, Fernández-Avilés F, Fox KM, Parkhomenko AN, Priori SG, Tendera M, Voipio-Pulkki LM, Vahanian A, Camm AJ, De Caterina R, Dean V, Dickstein K, Filippatos G, Funck-Brentano C, Hellemans I, Kristensen SD, McGregor K, Sechtem U, Silber S, Tendera M, Widimsky P, Zamorano JL, Morais J, Brener S, Harrington R, Morrow D, Lim M, Martinez-Rios MA, Steinhubl S, Levine GN, Gibler WB, Goff D, Tubaro M, Dudek D, Al-Attar N; Joint ESC/ACCF/AHA/WHF Task Force for the Redefinition of Myocardial Infarction: Universal definition of myocardial infarction. Circulation 116: 2634–2653, 2007
38. Newsome BB, Warnock DG, McClellan WM, Herzog CA, Kiefe CI, Eggers PW, Allison JJ: Long-term risk of mortality and end-stage renal disease among the elderly after small increases in serum creatinine level during hospitalization for acute myocardial infarction. Arch Intern Med 168: 609–616, 2008
39. Boldt J, Brenner T, Lang J, Kumle B, Isgro F: Kidney-specific proteins in elderly patients undergoing cardiac surgery with cardiopulmonary bypass. Anesth Analg 97: 1582–1589, 2003
40. Eijkenboom JJ, van Eijk LT, Pickkers P, Peters WH, Wetzels JF, van der Hoeven HG: Small increases in the urinary excretion of glutathione S-transferase A1 and P1 after cardiac surgery are not associated with clinically relevant renal injury. Intensive Care Med 31: 664–667, 2005
41. Hamada Y, Kanda T, Anzai T, Kobayashi I, Morishita Y: N-acetyl-beta-D-glucosaminidase is not a predictor, but an indicator of kidney injury in patients with cardiac surgery. J Med 30: 329–336, 1999
42. Amur S, Frueh FW, Lesko LJ, Huang SM: Integration and use of biomarkers in drug development, regulation and clinical practice: a US regulatory perspective. Biomarkers Med 2: 305–311, 2008
43. Mehran R, Nikolsky E: Contrast-induced nephropathy: definition, epidemiology, and patients at risk. Kidney Int Suppl 100: S11–S15, 2006
