Abstract
Background
High-throughput laboratory technologies coupled with sophisticated bioinformatics algorithms have tremendous potential for discovering novel biomarkers, or profiles of biomarkers, that could serve as predictors of disease risk, response to treatment or prognosis. We discuss methodological issues in wedding high-throughput approaches for biomarker discovery with the case-control study designs typically used in biomarker discovery studies, focusing especially on nested case-control designs.
Methods
We review principles for nested case-control study design in relation to biomarker discovery studies and describe how the efficiency of biomarker discovery can be affected by study design choices. We develop a simulated prostate cancer cohort data set and a series of biomarker discovery case-control studies nested within the cohort to illustrate how study design choices can influence the biomarker discovery process.
Results
Common elements of nested case-control design, incidence density sampling and matching of controls to cases, are not typically factored correctly into biomarker discovery analyses, inducing bias in the discovery process. We illustrate how incidence density sampling and matching of controls to cases reduce the apparent specificity of truly valid biomarkers “discovered” in a nested case-control study. We also propose and demonstrate a new case-control matching protocol, which we call “anti-matching”, that improves the efficiency of biomarker discovery studies.
Conclusions
For a valid, but as yet undiscovered, biomarker(s), disjunctions between correctly designed epidemiologic studies and the practice of biomarker discovery reduce the likelihood that the true biomarker(s) will be discovered and increase the false positive discovery rate.
Keywords: High Through-put, Biomarker Discovery, Epidemiology, case-control
Introduction
High-throughput laboratory technologies that measure thousands of parameters per study subject, coupled with sophisticated bioinformatics systems, have tremendous potential for discovering novel biomarkers, or profiles of biomarkers, that could serve as predictors of disease risk, response to treatment or prognosis [1–4]. Using what are often referred to as “-omic” approaches, investigators may search for a disease relevant biomarker through multitudes of expressed genes (transcriptomics), proteins present in blood or other biological fluids (proteomics), metabolites in biological fluids (metabolomics) or CpG island methylation sites (epigenomics) [2, 4–8]. RNA transcription, protein, metabolite and epigenetic markers of disease incidence are thought to represent biological changes arising during disease development (see Figure 1). For instance, during carcinogenesis various genes are abnormally expressed, giving rise to differences in the type and quantity of RNA transcription and protein expression. Regardless of the “–omic” target, the basic approach for discovery studies is to apply a high-throughput technology (e.g. gene methylation chips) to measure enormous numbers of biomarker targets in biological samples from a limited series of subjects with contrasting health states (e.g. with and without cancer, or exhibiting an aggressive or non-aggressive disease course). Bioinformatics tools are then applied to the biomarker data from the subjects to ‘discover’ biomarkers that have the highest sensitivity and specificity for discriminating between subjects with the contrasting health states. Generally, biomarker discovery studies use a training set of subjects in which potential biomarkers are identified and an independent validation or test set of subjects in which the predictive ability of the biomarkers is validated [9–12].
These studies assume that there is a truly valid biomarker(s) that can identify individuals with the health state of interest, that awaits discovery, and that bioinformatics tools can identify this needle in the haystack of biomarker data generated by the high-throughput technology.
Figure 1.

Schematic illustration of types of biomarkers arising during carcinogenesis amenable to high-throughput approaches to biomarker discovery. During carcinogenesis aberrant gene expression is thought to occur, giving rise to differences in RNA transcription, protein expression and metabolic pathways that produce targets for biomarker discovery studies. Similarly, during carcinogenesis there can be changes in the patterns of CpG methylation in DNA promoter regions that also generate targets for biomarker discovery studies. Inter-individual differences in genetic polymorphisms or mutations are thought to produce differences in genetic susceptibility that interact with exposures to trigger carcinogenesis. Key elements of this genetic variation can be discovered using genome wide association studies (GWAS).
There are a number of critiques of this overall approach, including technology issues with the various platforms, the potential for a high false-positive rate and, in the case of proteomics, for the failure of the approach to identify previously validated cancer biomarkers such as prostate specific antigen [2, 5–7, 10, 13–16]. In addition, there are critiques of laboratory procedures for collecting and handling biological samples [10, 14–17]. Ransohoff has commented on the issue of bias in the design of high-throughput biomarker discovery studies, and pointed out that errors due to bias can become “hard-wired into data by faulty design…. then statistical analysis or data-mining cannot eliminate bias” [4]. He notes that laboratory researchers steeped in traditions of biological reasoning and controlled experimental conditions are not prepared for the non-controlled conditions of observational epidemiology when they apply their molecular assays to population studies of diagnosis and prognosis [3, 4]. Several recent publications have proposed ‘pipeline’ or stepped approaches to improving biomarker discovery and validation, but few have focused on issues of bias in the epidemiologic design of such studies [9, 11, 12, 18]. Regardless of how sophisticated the bioinformatics algorithms are, if the underlying design of the discovery study is faulty or if assumptions factored into study subject selection are not factored into the discovery data analysis process, the search for informative biomarkers will be jeopardized. The bioinformatics algorithms will find discriminating biomarkers but the question is whether the discovered biomarkers will ultimately prove to be valid and whether the epidemiologic design of the studies can be optimized to increase the probability that an identified biomarker is valid.
1. Epidemiologic Study Design and Bias
Most biomarker discovery studies have taken the form of case-control studies but have been conducted with little regard to major design concerns in such studies [19–33]. Recent recommendations for approaches to evaluating biomarkers for the early detection of cancer have focused on the nested case-control study design [11, 12, 18, 34]. Nesting a case-control study within a prospective cohort study provides, on average, the same result that a full analysis of the entire cohort data set would otherwise have produced, but with much greater logistical efficiency. The advantages of the design include its prospective nature and collection of biological samples often years before clinical diagnoses, a clear definition of the source population from which to select controls, and ease in matching biological samples for differences in handling, processing and duration of storage [4, 11, 16, 35]. However, there has not been an in-depth discussion of how design issues for nested case-control studies impact biomarker discovery using data from high-throughput technologies [18].
This paper will describe the nested case-control design and how assumptions built into the design are not generally acknowledged in the discovery process of biomarkers for disease incidence and thus generate bias. The major issues to be discussed are the sampling of controls and the matching of controls to cases and how faulty design or failure to incorporate these design elements into the discovery process can “hard wire” bias into the resulting data set. The effects of study choices will be illustrated using a simulated cohort of prostate cancer, with a series of case-control studies nested in it using different design options. The paper will make recommendations for best practices in the use of nested case-control studies for biomarker discovery and will propose a new design option, ‘anti-matching’, that we believe will maximize the potential for valid biomarker discovery.
a. Control Selection
The most challenging part of a case-control study is appropriately selecting healthy controls to serve as the reference group to which the cases are compared. In traditional epidemiologic terms, the role of controls is to estimate the prevalence of exposure in the source population from which the cases were derived [33]. Analogously, for a biomarker discovery study the role of controls is to estimate the distribution of biomarker levels in the source population from which the cases were derived. Rigorous attention to control selection is not often apparent in biomarker discovery studies, and convenience and hospital based controls are often used, making valid control selection particularly difficult [19–33]. A major strength of the nested case-control design is that the source population from which the cases were derived is the cohort, which is readily defined and controls can be randomly selected from it with ease [33, 35].
Control selection in the nested case-control design can appear counterintuitive. Each time a case is diagnosed during cohort follow-up, a control is selected from among the cohort members who have not yet developed the disease of interest and who are still under follow-up [35]. Thus, if the first case is diagnosed three years into cohort follow-up, a control is randomly selected from all those still under follow-up three years after entering the cohort (see Figure 2) [35]. The counterintuitive part of the control selection process is that the control selected to match the first case is still eligible to be matched again as a control to a subsequent case diagnosed later during follow-up [35]. Furthermore, if an individual selected as a control later develops disease, this individual can serve as both a control and a case in the data set [35]. This approach to control selection is standard practice for nested case-control studies and is known as “risk set sampling” or “incidence density sampling” [33]. It is distinctly different from cumulative incidence sampling, the approach typically used in biomarker discovery studies, in which controls are selected from among those who remain disease free throughout follow-up.
Figure 2.
This diagram illustrates the selection of controls for two cases in a nested case control study. For subject 1, at the time of disease development (D) subjects 2, 4–7 and 9 are eligible to be selected as controls (C). Subject 6 is eligible to be a control for subject 1, despite the fact that subject 6 later develops disease. At the time that subject 6 develops disease subjects 2, 7 and 9 are eligible to be selected as controls, regardless of whether they were selected as controls for subject 1.
In our experience researchers commonly suppose that the control series in a nested case-control biomarker discovery study should be comprised of subjects who remain disease free until the end of cohort follow-up, that is, controls should be sampled in a cumulative incidence fashion. However, because of loss to follow-up and competing causes of mortality, in most large cohorts length of follow-up varies across subjects. Thus, most cohort analyses of cancer outcomes use time to event analyses rather than analyses of the cumulative incidence of disease during follow-up. Failure to use incidence density sampling of controls, essentially a time to event analysis, in a case-control study nested in a cohort will introduce bias into the results, the magnitude of which increases for more common outcomes. Nested case-control studies that use incorrect sampling of controls will produce results that, on average, differ from those that would have been observed in a full cohort analysis.
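The risk set logic described above can be made concrete with a small sketch. The exit times below are hypothetical values chosen only so that the risk sets match those described for Figure 2; the function names are illustrative and are not taken from the study.

```python
import random

# Hypothetical exit times (years), chosen to reproduce the risk sets in Figure 2.
# For cases this is the time of diagnosis; for everyone else, the end of follow-up.
exit_time = {1: 3, 2: 10, 3: 2, 4: 5, 5: 5, 6: 7, 7: 10, 8: 2, 9: 10}
case_ids = [1, 6]

def risk_set(case_id, exit_time):
    """Subjects still under follow-up and disease free when the case is diagnosed."""
    t = exit_time[case_id]
    return sorted(s for s, end in exit_time.items() if s != case_id and end > t)

def incidence_density_sample(case_ids, exit_time, rng):
    """Draw one control per case from that case's risk set.

    Future cases stay eligible, and a subject already used as a control
    is not removed from later risk sets.
    """
    pairs = []
    for case_id in sorted(case_ids, key=exit_time.get):
        eligible = risk_set(case_id, exit_time)
        if eligible:
            pairs.append((case_id, rng.choice(eligible)))
    return pairs

pairs = incidence_density_sample(case_ids, exit_time, random.Random(0))
print(risk_set(1, exit_time))  # subject 6, a future case, is eligible
print(risk_set(6, exit_time))
print(pairs)
```

Cumulative incidence sampling would instead restrict every draw to subjects who are never diagnosed, which excludes subject 6 from the first risk set.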
b. Matching
Another prominent feature of case-control designs in general, and in the biomarker discovery literature, is matching of controls to cases [19, 25, 28, 35–37]. In a design utilizing matching, controls are selected such that the distributions of important confounders in the case and control populations are equal [33]. The utility of matching in a case-control study is that it improves the statistical efficiency of controlling for confounding factors [33]. However, without appropriate statistical analyses, matching does not remove confounding; in fact, it hard wires a specific pattern of bias into the data. Matching makes the control series more similar to the cases on all matching factors, and on correlates of the matching factors, than would have occurred under a purely random selection of controls [33]. While a matched design with appropriate acknowledgement of the matching during data analysis produces more statistically efficient results, failure to account for matching produces results that generally are biased towards the null [33]. As will be discussed below, matching in a biomarker discovery study is likely to make a valid biomarker appear less efficient at discriminating between cases and controls.
Intricate matching strategies have several ramifications for the overall design of a nested case-control study. The first issue is that for some combinations of matching factors there may be no controls available that can be matched to a specific case, reducing the number of case-control pairs that can be analyzed, or there may be relatively few individuals available to be selected as controls, increasing the likelihood that an individual is selected multiple times as a control. The second issue is that since cases and controls are usually matched on risk factors for the disease, matched controls have a higher than average risk of later becoming a case, increasing the likelihood that a study subject who is selected as a control early during follow-up will later become a case. The extent to which study subjects are selected into the final data set multiple times is a function of how common the disease outcome is and the extent to which the matching factors are strong risk factors for the disease. The third issue is that matching makes the control series more similar to the cases on all correlates of the matching factors than would have occurred under a purely random selection of controls [33]. If the targets of the high-throughput laboratory assay are associated with the matching factors (which is the necessary condition for a factor to be a confounder, the primary indication for matching), then the cases and controls will have similar biomarker profiles.
In their proteomic study of lung cancer, Yildiz and colleagues individually matched controls to cases on age, gender, smoking status and pack years to “…avoid confounding” [25]. While the matched pairs were assigned to the training and test sets without breaking the matching, there is no indication in the paper that the bioinformatics analyses accounted for matching when identifying biomarkers. In the context of a case-control study, matching without statistical adjustment for the matching does not control for confounding but instead produces biased results [33]. If indeed lung cancers are associated with a specific proteomic signature that can be detected in blood samples, smoking is probably best thought of as a causal antecedent to this proteomic signature. Smoking causes a carcinogenic process that leads to lung cancer, and the carcinogenic process is the cause of the protein pattern the investigators sought to discover [1, 38]. Thus matching on smoking is likely to increase the prevalence of this protein pattern among the controls. Similarly, McLerran and colleagues, in careful studies of proteomic patterns in prostate cancer that uncovered bias due to differences in sample storage duration between cases and controls, frequency matched controls to cases on age and race [10]. However, their statistical analysis of the data did not appear to adjust for age and race, and since these are major risk factors for prostate cancer, it is likely that the epidemiological design of their study also obscured any true differences in protein profiles. Without appropriate statistical analyses, matching is expected to reduce the likelihood of a discovery study correctly identifying biomarkers that best distinguish between cases and controls.
2. Illustration of Design Issues for a Nested Case-Control Study of Prostate Cancer
To illustrate principles of study design, we generated a simulated cohort of 10,000 men followed up for prostate cancer. The design for this simulation was motivated by ongoing work on a nested case-control study of prostate cancer at the Henry Ford Health System in Detroit, Michigan [39]. The case-control study is nested in a cohort of African American and Caucasian men with prior prostate biopsies that yielded benign diagnoses. A biomarker discovery study is being planned for this study, and the simulations were designed to help inform design choices for this planned study. Simulated subjects were assigned to be either African American or Caucasian and to be either 50–64 years old or 65–79 years old, and the Michigan State prostate cancer rates for these age-race groups were applied. The upper 95th percentile rates were used to simulate a high risk cohort (Black, age 65–79: N=3,000, rate = 1,659.8 per 100,000/year; Black, age 50–64: N=2,000, rate = 660.4 per 100,000/year; White, age 65–79: N=3,500, rate = 964.7 per 100,000/year; White, age 50–64: N=1,500, rate = 315.5 per 100,000/year). Life table analyses were conducted to model the prostate cancer incidence of these men each year over a 10 year period. To simplify the example, complete follow-up of the cohort was assumed.
The simulated cohort included a truly valid biomarker with a sensitivity and specificity of 80%. In this illustrative cohort the biomarker completely explains the risk of prostate cancer associated with age and race, that is, it is a perfect biomarker for the carcinogenic process associated with age and race. As a consequence, among those who are biomarker negative, age and race are not associated with the development of prostate cancer. Overall for predicting disease, the biomarker has a relative risk of 11.59 and a rate ratio of 13.55. This illustrative cohort also includes a false positive biomarker randomly distributed across strata of age and race, and that due to random chance appears in this cohort to be associated with prostate cancer risk with a sensitivity and specificity of 80%. This marker is said to be a false positive in the sense that it would not survive a validation study and in another independent cohort would not be associated with prostate cancer risk [16]. The false positive biomarker represents a biomarker that, in a real world discovery study in which 1,000s or 10,000s of targets are measured, appears to be associated with disease but the association is a random chance occurrence observed in the context of multiple testing.
a. The Effects of Epidemiologic Design Choices
Life-table calculations using the prostate cancer rates applied to the number of men in each age and race category yielded 960 cases over ten years of follow-up in the illustrative cohort. To illustrate the effects of the design issues discussed here, matched and unmatched nested case-control studies were sampled from the cohort and analyzed. It is acknowledged that the sample size of 960 cases is much larger than a typical biomarker discovery study, but the large sample size reduces the impact of variation due to the random selection of controls in this illustration. To further reduce the impact of random selection, the creation of matched and unmatched case-control studies was repeated using the same cases and new random draws of controls for a total of 10 trials each and the mean of the results is reported.
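As a rough check on the case count, the life-table calculation can be sketched from the stratum sizes and rates quoted for the simulated cohort. This is a minimal sketch that assumes each annual rate is applied as a yearly probability to the men still disease free, with complete follow-up and no competing risks; the authors' exact life-table method is not specified, so small differences are possible.

```python
# Stratum sizes and annual incidence rates (per 100,000 per year) from the text.
strata = {
    "Black, 65-79": (3000, 1659.8),
    "Black, 50-64": (2000, 660.4),
    "White, 65-79": (3500, 964.7),
    "White, 50-64": (1500, 315.5),
}

def expected_cases(n_at_start, rate_per_100k, years=10):
    """Expected number of incident cases over the follow-up period.

    Assumption: the rate acts as an annual probability on those still at risk.
    """
    at_risk = float(n_at_start)
    total = 0.0
    for _ in range(years):
        new_cases = at_risk * rate_per_100k / 100_000
        total += new_cases
        at_risk -= new_cases  # diagnosed men leave the at-risk pool
    return total

total_cases = sum(expected_cases(n, rate) for n, rate in strata.values())
print(round(total_cases, 1))  # close to the 960 cases reported above
```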
Within each of the 20 sampled control series there were subjects who were selected multiple times during follow-up to be matched to different cases. This occurred because for some strata of age, race and follow-up duration there were relatively few controls eligible to be matched to cases. Additionally, within each of the 20 sampled control series, individuals were selected as controls who later in follow-up became cases. As expected, matching increased the number of individuals selected as controls who later became cases (see Table 1).
Table 1.
Number of subjects sampled multiple times into the nested case control studies.
| Design | Average¹ number of subjects sampled into the control group multiple times | Average¹ number of controls who are also cases |
|---|---|---|
| Matched design | 114.6 | 60.5 |
| Un-matched design | 93.3 | 44.9 |

¹ Mean of 10 independently sampled nested case-control samples drawn from the underlying cohort.
The requirement that individuals be allowed to serve as both cases and controls in a nested case-control study reduces the apparent specificity of the biomarker in the final case-control data set. For a truly valid biomarker with 80% sensitivity, on average 80% of the individuals selected to be controls who later become cases will be biomarker positive, and thus the selected control series will be enriched with biomarker positive individuals. Matching of controls to cases on major risk factors for disease further enriches the control series with individuals who are biomarker positive. Table 2 shows the apparent sensitivity and specificity of the true biomarker and the false positive biomarker in the selected case-control subjects, for matched and unmatched designs. The table illustrates that incidence density sampling reduces the specificity of both the true and false positive biomarker observed in the case-control study, and that matching substantially reduces the observed specificity of only the true biomarker. The design effect of incidence density sampling of controls reduces the apparent specificity of the valid biomarker by 2.3 percentage points (from 80.0% to 77.7%), and the design effect of matching on risk factors reduces it by a further 7.6 percentage points (to 70.1%). A nested case-control biomarker discovery study that employs matching on disease risk factors without appropriate statistical adjustment for matching will underestimate the specificity of a truly valid biomarker and has a higher likelihood of identifying a false positive biomarker.
Table 2.
Apparent sensitivity¹ and specificity¹ of the truly valid biomarker and the false positive biomarker, as observed in the nested case-control study
| Design | True biomarker: sensitivity (%) | True biomarker: specificity (%) | False positive biomarker: sensitivity (%) | False positive biomarker: specificity (%) |
|---|---|---|---|---|
| Matched design | 80.0 | 70.1 | 80.0 | 76.3 |
| Un-matched design | 80.0 | 77.7 | 80.0 | 77.2 |

¹ Mean from 10 independently sampled nested case-control samples drawn from the underlying cohort. Within the cohort the true biomarker and the false positive biomarker both have an 80% sensitivity and 80% specificity for predicting prostate cancer incidence.
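The way controls who later become cases pull down the observed specificity can be written as a simple mixture. The sketch below is illustrative only: `future_case_fraction` is a hypothetical input (the share of control selections filled by subjects who are diagnosed later), and the calculation isolates this dual-role effect. It does not model the additional enrichment of biomarker positive controls that matching on risk factors produces.

```python
def apparent_specificity(sensitivity, specificity, future_case_fraction):
    """Specificity observed in a control series that includes future cases.

    Controls who never become cases are biomarker negative with probability
    `specificity`; controls who later become cases are biomarker negative
    with probability 1 - `sensitivity`.
    """
    if not 0.0 <= future_case_fraction <= 1.0:
        raise ValueError("future_case_fraction must lie between 0 and 1")
    return ((1.0 - future_case_fraction) * specificity
            + future_case_fraction * (1.0 - sensitivity))

# Hypothetical example: 5% of control selections are future cases.
print(round(apparent_specificity(0.80, 0.80, 0.05), 3))  # 0.77
```

Because the calculation depends only on sensitivity and specificity, it applies equally to the true and the false positive biomarker, which is consistent with both showing a lower specificity under incidence density sampling in the un-matched design.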
Recommendations
1. Development of new bioinformatics tools
The nested case-control design is a logistically efficient and valid approach to the prospective study of the effects of exposures and biomarkers on subsequent disease risk. However, the validity of the design depends upon the availability of appropriate statistical methods for analyzing the data generated by these studies [33, 35]. The current approach for candidate biomarker analyses is to use conditional logistic regression analyses that make a series of comparisons within individually matched pairs of cases and controls; the individual nature of the matching is never broken [35, 40]. This approach adjusts the regression estimates for the confounding effects of the matching variables, and since the matched pairs are chosen using a time to event paradigm, the issue of some individuals selected as controls later becoming cases is addressed within the statistical analyses. Failure to conduct appropriate matched pair statistical analyses yields biased results because the design hardwires a specific set of correlations into the data. It is unclear at this point whether analogous matched pair approaches to data analysis exist for biomarker discovery studies. Such an approach would compare the high-throughput data array from an individual case to that from only its matched control. Current commonly used bioinformatics tools for biomarker discovery make group wise comparisons between a set of cases and a set of controls. For nested case-control designs to be optimally used in biomarker discovery studies, new statistical techniques that address the specific biases built into the design must be developed.
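For a single binary biomarker and 1:1 matched pairs, the within-pair comparison has a simple closed form: the conditional maximum likelihood estimate of the odds ratio is the ratio of the two kinds of discordant pairs. The sketch below illustrates this special case only; the pair counts are hypothetical, and a full conditional logistic regression would be needed for continuous markers, multiple covariates or more than one control per case.

```python
def matched_pair_odds_ratio(pairs):
    """Conditional odds ratio for 1:1 matched pairs and a binary biomarker.

    pairs: iterable of (case_positive, control_positive) booleans.
    Only discordant pairs contribute; concordant pairs carry no information.
    """
    pairs = list(pairs)
    case_only = sum(1 for case_pos, ctrl_pos in pairs if case_pos and not ctrl_pos)
    control_only = sum(1 for case_pos, ctrl_pos in pairs if ctrl_pos and not case_pos)
    if control_only == 0:
        raise ValueError("no pairs in which only the control is positive")
    return case_only / control_only

# Hypothetical data: 30 pairs both positive, 50 case only, 10 control only, 10 neither.
example_pairs = ([(True, True)] * 30 + [(True, False)] * 50
                 + [(False, True)] * 10 + [(False, False)] * 10)
print(matched_pair_odds_ratio(example_pairs))  # 5.0
```

A group wise comparison would pool these 100 cases and 100 controls and discard the pairing, which is the step that the matched analysis avoids.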
2. Selection of the cohort in which to nest the case-control study
In the absence of optimal data analysis techniques for biomarker discovery in a nested case-control design, there are practical steps that investigators can take in the study design phase to reduce bias and increase the probability that the study will correctly discover a valid biomarker. The use of lower risk general population cohorts, such as EPIC, the Physicians' Health Study and the Nurses' Health Study, will reduce the frequency with which high risk individuals are selected as controls and later go on to become cases. One of the reasons that control selection in the simulated prostate cancer study yielded so many controls who later became cases was that the simulations were modeled on a high risk cohort under relatively intense medical surveillance [39]. Nested case-control studies of relatively rare diseases in general population cohorts will less often include individuals who are present in the data as both controls and cases. However, in general population cohorts longer follow-up may be needed before sufficient cases have accrued, and the longer time period between biological sample collection at enrollment and disease may reduce the sensitivity of discovered markers.
3. Appropriate matching strategies
Matching makes the controls similar to the cases on the matching factors and on all correlates of the matching factors, including biomarkers. Thus, a matching factor that is expected to be associated with the biomarker, or with the biological processes that generate the biomarker, will cause the selected controls to have a spectrum of biomarkers that is similar to that observed in the cases. This makes it more difficult to identify biomarkers that validly distinguish between cases and controls. The most problematic matching scenario is when controls are matched to cases on a strong correlate or determinant of the target biomarker profile and the matching factor is not otherwise a predictor of disease [33]. This would be a clear case of overmatching.
Matching on major risk factors for the disease that themselves are likely antecedents of target biomarkers is also expected to undermine the discovery process. Such matching makes the controls more similar to the cases in respect to the biological process that generates the biomarker profiles. For instance, in proteomic studies of lung cancer the inclination has been to match on smoking status [25]. In part the concern is that without matching the bioinformatics algorithm will merely identify protein expression patterns that are associated with smoking status. However, in a study that does not match on smoking, any biomarker that is merely a marker for current smoking status will have a low sensitivity for predicting lung cancer, because current smoking itself has a low sensitivity for predicting lung cancer. If indeed there are biomarkers arising from the carcinogenic process for which smoking is a distal cause, the association between the biomarker and case-control status is expected to be larger in magnitude than the association between smoking and case-control status [41]. Furthermore, if the smoking induced carcinogenic process does produce a useful biomarker, matching on smoking is likely to make the prevalence of the biomarker more similar in cases and controls. As such, in the absence of statistical techniques for discovery that adjust for matching, a design that does not match on smoking is more likely to discover valid biomarkers for lung cancer than a design that does match.
In regard to matching strategies, genome wide association studies (GWAS) differ from other ‘–omic’ biomarker discovery studies that seek to identify biomarkers arising from underlying disease processes or pathology. Germ line single nucleotide polymorphisms (SNPs) and mutations are generally thought to be associated with disease incidence, not because they are caused by disease processes, but because they alone or through interactions with exposures cause disease. Thus, as illustrated in Figure 1, genetic determinants of disease occur at a different position on the causal pathway from exposure to disease than other “–omic” biomarker targets commonly assayed for in discovery studies. In GWAS, matching is predominantly used to address population stratification issues and so may be less of a threat to valid biomarker discovery.
Even without data analysis techniques for biomarker discovery that appropriately account for matching, there are still some factors on which it may be appropriate to match controls to cases. These include variables that relate to the logistics of study conduct and may influence the validity with which the biomarkers can be measured, but that are not related to the biological process that generates the biomarkers. Matching factors falling into this category include differences in biological sample collection, handling and processing, and the number of freeze-thaw cycles the biological sample was subjected to during storage [10, 35]. In nested case-control studies, matching of controls to cases on duration of follow-up should automatically match the cases and controls on duration of sample storage.
4. Data Analysis Approaches
In the data analysis phase, an understanding of the types of biases hardwired into the data set can be used to help identify biomarkers that are more likely to be valid. If high-throughput data from a matched pair nested case-control study are analysed using biomarker discovery tools that rely on group wise comparisons, the pattern of bias will cause a truly valid biomarker to appear to have a lower specificity than would be apparent in a full cohort analysis. Once a series of candidate biomarkers are identified by the bioinformatics algorithms, further traditional candidate biomarker case-control analyses of the discovered biomarkers can be used to identify biomarkers for which the specificity has been biased downwards. For truly valid biomarkers where matching has caused the apparent specificity to be low, the crude odds ratio will be substantially smaller than the odds ratio calculated from a conditional logistic regression that adjusts for the matching factors. Biomarkers that are false positives and appear to discriminate between cases and controls due to random chance are unlikely to produce crude and adjusted odds ratios that substantially differ.
Table 3 shows the crude and adjusted odds ratios from case-control analyses of the true and false positive biomarker, for matched and unmatched case-control designs. Analyses of data for the true biomarker from the matched designs produce crude odds ratios that are substantially smaller than the adjusted odds ratios. For the false positive biomarker the crude and adjusted odds ratios are similar. For a study with matching on risk factors for disease, the observation that an adjusted odds ratio is substantially larger than the crude odds ratio would be further evidence that the biomarker in question is a valid biomarker. Thus, biomarkers identified through the discovery process can be further screened in this manner to identify likely candidates for replication studies.
Table 3.
Crude and Adjusted Odds Ratios¹ for the Biomarkers Predicting Prostate Cancer Incidence.
| Design | True Biomarker: Crude OR | True Biomarker: Adjusted OR | False Positive Biomarker: Crude OR | False Positive Biomarker: Adjusted OR |
|---|---|---|---|---|
| Matched design | 9.41 | 14.02 | 12.74 | 12.52 |
| Un-matched design | 14.00 | 14.89 | 13.40 | 14.74 |
¹ Mean from 10 independently sampled nested case-control samples drawn from the underlying cohort. The adjusted rate ratio for the true biomarker in the underlying cohort is 13.46 and for the false positive biomarker in the underlying cohort the adjusted rate ratio is 12.90.
5. Alternative Approaches to Matching
Matching in case-control studies does not, in itself, control for confounding; it is a tool for improving the statistical efficiency of multivariate approaches to controlling for confounding [33]. Matching establishes a particular form of bias in the data that allows for the statistically efficient estimation of de-confounded causal effects through multivariate analyses that acknowledge the matching scheme [33]. We propose a matching scheme, which we call ‘anti-matching’, that we expect will increase the observed specificity of the biomarker in a case-control study, making it easier for bioinformatics tools that use group wise comparisons to discover a valid biomarker. We use the phrase ‘observed specificity’ because the true specificity of the biomarker in the underlying cohort population remains the same, but the approach effectively selects controls from the cohort who are less likely to be positive for the true biomarker, causing the specificity observed in the case-control study for valid biomarkers to be higher. By matching controls to cases counter to known risk factors or causes of the disease, the investigator imprints a known pattern of bias onto the data that increases the signal-to-noise ratio (higher specificity) for biomarkers that reflect the underlying disease process. Because the sampling fractions are known and set by the investigator, case-control analyses of the discovered biomarker using conditional logistic regression will still produce an odds ratio that correctly estimates the hazard ratio for the biomarker in the underlying cohort.
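One way to see why the mix of controls drives the observed specificity is to write it as a weighted average over risk-factor strata. The following is an illustrative sketch, not a derivation taken from the simulation. It assumes that controls are sampled at random with respect to the biomarker within each stratum; the symbols $w_s$ (the fraction of sampled controls in stratum $s$), $B^{+}$ (biomarker positive) and the numerical values are assumptions introduced here.

```latex
\[
\mathrm{Spec}_{\mathrm{obs}} \;=\; \sum_{s} w_s \,\mathrm{Spec}_s ,
\qquad
\mathrm{Spec}_s \;=\; 1 - \Pr(B^{+} \mid \text{control},\, S = s),
\qquad
\sum_{s} w_s = 1 .
\]
% Hypothetical two-stratum illustration (assumed values):
% high-risk stratum Spec = 0.70, low-risk stratum Spec = 0.90.
\[
\text{70\% high-risk controls: } 0.7(0.70) + 0.3(0.90) = 0.76 ,
\qquad
\text{30\% high-risk controls: } 0.3(0.70) + 0.7(0.90) = 0.84 .
\]
```

Matching on risk factors shifts the weights toward the high-risk strata in which cases concentrate, whereas anti-matching shifts them toward the low-risk strata. The stratum-specific specificities themselves do not change.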
If anti-matching is applied to the simulated prostate cancer data, an African American control would be selected for a Caucasian case; likewise, a 65–79 year old control would be selected for a 50–64 year old case. Since the true biomarker is positively associated with risk of developing prostate cancer, the prevalence of the biomarker will be lower in controls than would otherwise occur in a non-matched random selection of controls. However, the false positive biomarker, which is randomly assorted with prostate cancer risk factors in the underlying cohort, will be relatively unaffected by the matching scheme. Thus, the observed specificity of the true biomarker in the resulting case-control data set will increase, while the observed specificity of the false positive biomarker is not affected. Table 4 shows the results of analyses for anti-matched case-control studies nested in the simulated cohort. The crude OR for the true biomarker is substantially higher than that observed in the unmatched design and in the underlying cohort; however, adjustment for the matching scheme produces an OR similar to that observed in the unmatched design. The crude and adjusted ORs for the false positive biomarker are similar to each other and to the ORs observed in an unmatched design and in the underlying cohort. Thus the signal-to-noise ratio for the true biomarker has effectively been increased, making it easier to detect in a discovery study.
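The control selection rule described above might be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Subject` class, the function name and the choice to anti-match on race and age group jointly are hypothetical, and the caller is assumed to supply the risk set, that is, the cohort members still at risk when the case is diagnosed.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Subject:
    subject_id: int
    race: str       # "African American" or "Caucasian"
    age_group: str  # "50-64" or "65-79"

# Opposite categories for the two anti-matching factors (assumed dichotomous).
OPPOSITE_RACE = {"African American": "Caucasian", "Caucasian": "African American"}
OPPOSITE_AGE = {"50-64": "65-79", "65-79": "50-64"}

def select_anti_matched_control(
    case: Subject, risk_set: List[Subject], rng: random.Random
) -> Optional[Subject]:
    """Randomly pick a control whose race and age group are both opposite to the case's.

    Returns None when the risk set holds no eligible subject.
    """
    target = (OPPOSITE_RACE[case.race], OPPOSITE_AGE[case.age_group])
    eligible = [
        s for s in risk_set
        if s.subject_id != case.subject_id and (s.race, s.age_group) == target
    ]
    return rng.choice(eligible) if eligible else None

# Hypothetical usage with a tiny risk set.
rng = random.Random(0)
case = Subject(1, "Caucasian", "50-64")
risk_set = [
    Subject(2, "Caucasian", "50-64"),
    Subject(3, "African American", "65-79"),
    Subject(4, "African American", "50-64"),
]
control = select_anti_matched_control(case, risk_set, rng)
print(control)
```

A study applying this rule would also need a policy for cases with no eligible anti-matched control, for example relaxing one factor; that choice is left open here.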
Table 4.
Performance of the Anti-Matched Nested Case-Control Studies.
| Measure | True Biomarker | False Positive Biomarker |
|---|---|---|
| Apparent sensitivity observed in the nested case-control study¹ (%) | 80.00 | 80.00 |
| Apparent specificity observed in the nested case-control study¹ (%) | 90.15 | 78.04 |
| Crude OR² | 36.87 | 14.07 |
| Adjusted OR² | 13.01 | 14.23 |
¹ Mean from 10 independently sampled nested case-control samples drawn from the underlying cohort.
² Mean OR from the 10 independently sampled nested case-control samples drawn from the underlying cohort. The adjusted rate ratio for the true biomarker in the underlying cohort is 13.46 and the adjusted rate ratio for the false positive biomarker in the underlying cohort is 12.90.
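As a rough consistency check on Table 4, the crude odds ratio for a dichotomous biomarker can be computed from the apparent sensitivity and specificity as the odds of a positive result among cases divided by the odds of a positive result among controls. The short Python sketch below applies that identity to the tabulated means; the function name is hypothetical. Because the table reports the mean of ten sample-specific ORs rather than the OR of the mean sensitivity and specificity, only approximate agreement should be expected.

```python
def odds_ratio_from_sens_spec(sensitivity: float, specificity: float) -> float:
    """Crude OR = [sens / (1 - sens)] * [spec / (1 - spec)], with proportions in (0, 1)."""
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    return (sensitivity / (1.0 - sensitivity)) * (specificity / (1.0 - specificity))

# Mean apparent sensitivity and specificity from Table 4, as proportions.
true_or = odds_ratio_from_sens_spec(0.80, 0.9015)
false_positive_or = odds_ratio_from_sens_spec(0.80, 0.7804)

print(f"true biomarker: {true_or:.1f}")                      # 36.6
print(f"false positive biomarker: {false_positive_or:.1f}")  # 14.2
```

These values are close to the tabulated crude ORs of 36.87 and 14.07.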
This strategy is expected to be applicable for biomarkers that are thought to arise from the underlying pre-clinical disease process, such as changes in protein expression or RNA transcription during carcinogenesis. Appropriate factors to anti-match on would be risk factors that are thought to drive the disease process. The relative effect of anti-matching on the apparent specificity of the biomarker depends on where along the causal pathway from exposure to disease the events represented by the biomarker occur. Anti-matching will have a relatively larger effect on apparent specificity for intermediate biomarkers more distal to disease on the causal pathway (i.e. biomarkers more proximal to the risk factors used for matching) than on intermediate biomarkers that are more proximal to disease. As an example, within the context of smoking and lung cancer, a causal pathway has been proposed in which mutagens in cigarette smoke cause DNA damage, higher levels of DNA damage increase the risk of mutations occurring in oncogenes and tumor suppressor genes, and cells affected by these mutations then grow uncontrollably [42, 43]. Within this causal scenario, anti-matching on smoking status will have a larger effect on apparent specificity for biomarkers related to DNA damage (biomarkers more distal to lung cancer) than on biomarkers related to cell growth (biomarkers more proximal to lung cancer).
While this matching strategy sounds counter-intuitive, it extends the logic of conventional matching, in that matching is a tool for imprinting a particular, known pattern of bias onto the data that improves the efficiency of an analytical strategy. The approach is similar to the counter-matching approach of Langholz and colleagues, which increases the efficiency of nested case-control studies for identifying exposure-disease relationships and gene-environment interactions [44, 45]. We expect that anti-matching techniques will produce data sets better suited to bioinformatics tools that use group wise comparisons of data derived from nested case-control studies.
Conclusions
Putting aside critiques of laboratory procedures and specific technology platforms, the literature on biomarker discovery studies is replete with examples of poor epidemiological study design. Almost all published studies use either hospital-based designs, where control selection is most difficult, or convenience samples of controls. Some discovery studies include matching in the design without apparent statistical adjustment for the correlations that matching builds into the data set. While we fully endorse recommendations for improved sample handling, the use of higher quality biological samples, the development of improved high-throughput technologies, and more robust approaches to replication and validation, we note that none of these advances will improve biomarker discovery if the underlying case-control comparisons are flawed. Nested case-control studies simplify and improve the validity of control selection. However, it must be understood that the incidence density sampling design inherent in nested case-control studies often generates data sets that include the same individual multiple times, sometimes as a control for multiple cases and sometimes as a control and later in follow-up as a case. Since current bioinformatics approaches to biomarker discovery do not account for matching, matching on major risk factors for disease should be avoided, as it will reduce the observed specificity for a truly valid biomarker in the resulting data set. We believe that greater attention to the epidemiologic aspects of discovery study design will increase the probability that a discovered biomarker is valid and will survive replication studies.
Acknowledgments
This work was supported by grants from the National Cancer Institute (R01-CA127532-01, R01CA102484, R01CA107431, R01CA122171, and P30CA14599).
Footnotes
Sometimes controls are matched two or three to one with cases, in which case the regression model makes comparisons between individual cases and pairs or triplets of controls.
References
- 1. Liotta LA, Ferrari M, Petricoin EF. Written in blood. Nature. 2003;425:905. doi: 10.1038/425905a.
- 2. Kiehntopf M, Siegmund R, Deufel T. Use of SELDI-TOF mass spectrometry for identification of new biomarkers: potential and limitations. Clin Chem Lab Med. 2007;45:1435–49. doi: 10.1515/CCLM.2007.351.
- 3. Ransohoff DF. Research opportunity at the interface of molecular biology and clinical epidemiology. Gastroenterology. 2002;122:1199.
- 4. Ransohoff DF. Bias as a threat to the validity of cancer molecular-marker research. Nat Rev Cancer. 2005;5:142–9. doi: 10.1038/nrc1550.
- 5. Bhattacharya S, Mariani TJ. Array of hope: expression profiling identifies disease biomarkers and mechanism. Biochem Soc Trans. 2009;37:855–62. doi: 10.1042/BST0370855.
- 6. Claudino WM, Quattrone A, Biganzoli L, Pestrin M, Bertini I, Di Leo A. Metabolomics: available results, current research projects in breast cancer, and future applications. J Clin Oncol. 2007;25:2840–6. doi: 10.1200/JCO.2006.09.7550.
- 7. Zhu J, Yao X. Use of DNA methylation for cancer detection: promises and challenges. Int J Biochem Cell Biol. 2009;41:147–54. doi: 10.1016/j.biocel.2008.09.003.
- 8. Vaissiere T, Cuenin C, Paliwal A, Vineis P, Hoek G, Krzyzanowski M, et al. Quantitative analysis of DNA methylation after whole bisulfitome amplification of a minute amount of DNA from body fluids. Epigenetics. 2009;4:221–30. doi: 10.4161/epi.8833.
- 9. Rifai N, Gillette MA, Carr SA. Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat Biotechnol. 2006;24:971–83. doi: 10.1038/nbt1235.
- 10. McLerran D, Grizzle WE, Feng Z, Bigbee WL, Banez LL, Cazares LH, et al. Analytical validation of serum proteomic profiling for diagnosis of prostate cancer: sources of sample bias. Clin Chem. 2008;54:44–52. doi: 10.1373/clinchem.2007.091470.
- 11. Baker SG. Improving the biomarker pipeline to develop and evaluate cancer screening tests. J Natl Cancer Inst. 2009;101:1116–9. doi: 10.1093/jnci/djp186.
- 12. Baker SG, Kramer BS, McIntosh M, Patterson BH, Shyr Y, Skates S. Evaluating markers for the early detection of cancer: overview of study designs and methods. Clin Trials. 2006;3:43–56. doi: 10.1191/1740774506cn130oa.
- 13. Diamandis EP. Mass spectrometry as a diagnostic and a cancer biomarker discovery tool: opportunities and potential limitations. Mol Cell Proteomics. 2004;3:367–78. doi: 10.1074/mcp.R400007-MCP200.
- 14. Baggerly KA, Morris JS, Coombes KR. Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics. 2004;20:777–85. doi: 10.1093/bioinformatics/btg484.
- 15. Marshall E. Getting the noise out of gene arrays. Science. 2004;306:630–1. doi: 10.1126/science.306.5696.630.
- 16. Ransohoff DF. Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer. 2004;4:309–14. doi: 10.1038/nrc1322.
- 17. van der Merwe DE, Oikonomopoulou K, Marshall J, Diamandis EP. Mass spectrometry: uncovering the cancer proteome for diagnostics. Adv Cancer Res. 2007;96:23–50. doi: 10.1016/S0065-230X(06)96002-3.
- 18. Pepe MS, Feng Z, Janes H, Bossuyt PM, Potter JD. Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. J Natl Cancer Inst. 2008;100:1432–8. doi: 10.1093/jnci/djn326.
- 19. Lofton-Day C, Model F, Devos T, Tetzner R, Distler J, Schuster M, et al. DNA methylation biomarkers for blood-based colorectal cancer screening. Clin Chem. 2008;54:414–23. doi: 10.1373/clinchem.2007.095992.
- 20. Cottrell S, Jung K, Kristiansen G, Eltze E, Semjonow A, Ittmann M, et al. Discovery and validation of 3 novel DNA methylation markers of prostate cancer prognosis. J Urol. 2007;177:1753–8. doi: 10.1016/j.juro.2007.01.010.
- 21. Navaglia F, Fogar P, Basso D, Greco E, Padoan A, Tonidandel L, et al. Pancreatic cancer biomarkers discovery by surface-enhanced laser desorption and ionization time-of-flight mass spectrometry. Clin Chem Lab Med. 2009;47:713–23. doi: 10.1515/CCLM.2009.158.
- 22. Li J, Zhang Z, Rosenzweig J, Wang YY, Chan DW. Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin Chem. 2002;48:1296–304.
- 23. Rui Z, Jian-Guo J, Yuan-Peng T, Hai P, Bing-Gen R. Use of serological proteomic methods to find biomarkers associated with breast cancer. Proteomics. 2003;3:433–9. doi: 10.1002/pmic.200390058.
- 24. Drukier AK, Ossetrova N, Schors E, Krasik G, Grigoriev I, Koenig C, et al. High-sensitivity blood-based detection of breast cancer by multi photon detection diagnostic proteomics. J Proteome Res. 2006;5:1906–15. doi: 10.1021/pr0600834.
- 25. Yildiz PB, Shyr Y, Rahman JS, Wardwell NR, Zimmerman LJ, Shakhtour B, et al. Diagnostic accuracy of MALDI mass spectrometric analysis of unfractionated serum in lung cancer. J Thorac Oncol. 2007;2:893–901. doi: 10.1097/JTO.0b013e31814b8be7.
- 26. Petricoin EF, 3rd, Ornstein DK, Paweletz CP, Ardekani A, Hackett PS, Hitt BA, et al. Serum proteomic patterns for detection of prostate cancer. J Natl Cancer Inst. 2002;94:1576–8. doi: 10.1093/jnci/94.20.1576.
- 27. Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet. 2002;359:572–7. doi: 10.1016/S0140-6736(02)07746-2.
- 28. Zhang YF, Wu DL, Guan M, Liu WW, Wu Z, Chen YM, et al. Tree analysis of mass spectral urine profiles discriminates transitional cell carcinoma of the bladder from noncancer patient. Clin Biochem. 2004;37:772–9. doi: 10.1016/j.clinbiochem.2004.04.002.
- 29. Liu W, Guan M, Wu D, Zhang Y, Wu Z, Xu M, et al. Using tree analysis pattern and SELDI-TOF-MS to discriminate transitional cell carcinoma of the bladder cancer from noncancer patients. Eur Urol. 2005;47:456–62. doi: 10.1016/j.eururo.2004.10.006.
- 30. Wei YS, Zheng YH, Liang WB, Zhang JZ, Yang ZH, Lv ML, et al. Identification of serum biomarkers for nasopharyngeal carcinoma by proteomic analysis. Cancer. 2008;112:544–51. doi: 10.1002/cncr.23204.
- 31. Ho DW, Yang ZF, Wong BY, Kwong DL, Sham JS, Wei WI, et al. Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry serum protein profiling to identify nasopharyngeal carcinoma. Cancer. 2006;107:99–107. doi: 10.1002/cncr.21970.
- 32. Engwegen JY, Helgason HH, Cats A, Harris N, Bonfrer JM, Schellens JH, et al. Identification of serum proteins discriminating colorectal cancer patients and healthy controls using surface-enhanced laser desorption ionisation-time of flight mass spectrometry. World J Gastroenterol. 2006;12:1536–44. doi: 10.3748/wjg.v12.i10.1536.
- 33. Rothman K. Modern Epidemiology. Boston: Little, Brown and Company; 1986.
- 34. Baker SG, Kramer BS, Srivastava S. Markers for early detection of cancer: statistical guidelines for nested case-control studies. BMC Med Res Methodol. 2002;2:4. doi: 10.1186/1471-2288-2-4.
- 35. Rundle AG, Vineis P, Ahsan H. Design options for molecular epidemiology research within cohort studies. Cancer Epidemiol Biomarkers Prev. 2005;14:1899–907. doi: 10.1158/1055-9965.EPI-04-0860.
- 36. Han KQ, Huang G, Gao CF, Wang XL, Ma B, Sun LQ, et al. Identification of lung cancer patients by serum protein profiling using surface-enhanced laser desorption/ionization time-of-flight mass spectrometry. Am J Clin Oncol. 2008;31:133–9. doi: 10.1097/COC.0b013e318145b98b.
- 37. Liu XP, Shen J, Li ZF, Yan L, Gu J. A serum proteomic pattern for the detection of colorectal adenocarcinoma using surface enhanced laser desorption and ionization mass spectrometry. Cancer Invest. 2006;24:747–53. doi: 10.1080/07357900601063873.
- 38. Liotta LA, Kohn EC. The microenvironment of the tumour-host interface. Nature. 2001;411:375–9. doi: 10.1038/35077241.
- 39. Kryvenko ON, Jankowski M, Chitale DA, Tang D, Rundle A, Trudeau S, et al. Inflammation and preneoplastic lesions in benign prostate as risk factors for prostate cancer. Mod Pathol. doi: 10.1038/modpathol.2012.51.
- 40. Tang DL, Rundle A, Warburton D, Santella RM, Tsai WY, Chiamprasert S, et al. Associations between both genetic and environmental biomarkers and lung cancer: evidence of a greater risk of lung cancer in women smokers. Carcinogenesis. 1998;19:1949–53. doi: 10.1093/carcin/19.11.1949.
- 41. Rundle A, Schwartz S. Issues in the epidemiologic analysis and interpretation of intermediate biomarkers. Cancer Epidemiol Biomarkers Prev. 2003;12:491–6.
- 42. Spivack SD, Fasco MJ, Walker VE, Kaminsky LS. The molecular epidemiology of lung cancer. Crit Rev Toxicol. 1997;27:319–65. doi: 10.3109/10408449709089898.
- 43. Reid ME, Santella R, Ambrosone CB. Molecular epidemiology to better predict lung cancer risk. Clin Lung Cancer. 2008;9:149–53. doi: 10.3816/CLC.2008.n.022.
- 44. Langholz B, Clayton D. Sampling strategies in nested case-control studies. Environ Health Perspect. 1994;102 (Suppl 8):47–51. doi: 10.1289/ehp.94102s847.
- 45. Andrieu N, Goldstein AM, Thomas DC, Langholz B. Counter-matching in studies of gene-environment interaction: efficiency and feasibility. Am J Epidemiol. 2001;153:265–74. doi: 10.1093/aje/153.3.265.

