Abstract
How can the “strengths” of risk factors, in the sense of how well they discriminate cases from controls, be compared when they are measured on different scales such as continuous, binary, and integer? Given that risk estimates take into account other fitted and design-related factors (and that is how risk gradients are interpreted), so should the presentation of risk gradients. Therefore, for each risk factor X0, I propose using appropriate regression techniques to derive from appropriate population data the best fitting relationship between the mean of X0 and all the other covariates fitted in the model or adjusted for by design (X1, X2, … , Xn). The odds per adjusted standard deviation (OPERA) presents the risk association for X0 in terms of the change in risk per s = standard deviation of X0 adjusted for X1, X2, … , Xn, rather than the unadjusted standard deviation of X0 itself. If the increased risk is relative risk (RR)-fold over A = 1/s adjusted standard deviations, then $\mathrm{OPERA} = \exp[\ln(\mathrm{RR})/A] = \mathrm{RR}^{s}$. This unifying approach is illustrated by considering breast cancer and published risk estimates. OPERA estimates are by definition independent and can be used to compare the predictive strengths of risk factors across diseases and populations.
Keywords: breast cancer, relative risk, risk factor, standard deviation, strength of association
Comparing the “strengths” of risk factors in the sense of how well they discriminate cases from controls is problematic when they are measured on different scales. For example, how does one compare mammographic density and breast and ovarian cancer susceptibility gene (BRCA1) mutation status as risk factors for breast cancer, and how do these compare with number of livebirths as a protective factor? The first is measured on a continuous scale, the second is binary, and the third is on an integer scale.
First, we need to think about what is meant by “strength.” While carrying a BRCA1 mutation has a substantial influence on individual risk, carriers are rare. A binary risk factor with the same influence on risk but that is more common in the population will better discriminate cases from controls. Therefore, for a binary risk factor, the prevalence (and not just the relative risk) plays a role in defining strength in the sense above.
When it comes to comparing strengths across continuous measures, one approach is to consider categories defined by the percentiles of the risk factor for the controls and then to estimate the ratio of risks across extreme but arbitrary categories (e.g., interquartile risk ratio estimates). Another is to fit a log-linear association (refer to the Appendix) and to present it in terms of the (unadjusted) cross-sectional standard deviation of the risk factor.
Neither of those 2 approaches, however, takes into account that the estimated risk gradient for a given factor is the risk gradient for people of the same status for the other factors that are in the fitted model, as well as for the factors that have been matched on or controlled for by design. The standard approach of using the cross-sectional distribution to determine the standard deviation can be deceptive.
A SOLUTION
The scale on which the risk gradient is judged, and therefore its standard deviation, should be relevant to both the fitted model and the study design because that is how the risk gradients are interpreted. Furthermore, because every random variable has a standard deviation, this concept can be applied to continuous, binary, and ordered categorical variables, including integer scales.
For each risk factor, one can apply to appropriate population data an appropriate regression technique, with careful attention to outliers, goodness of fit, and model assumptions, to derive the best fitting relationship between the mean of that risk factor (X0) in terms of all the other covariates fitted in the risk analysis and the covariates implicitly adjusted for by design (X1, X2, … , Xn). The risk association for factor X0 should be presented in terms of the change in risk per standard deviation of the residuals of X0 after adjusting for X1, X2, … , Xn, rather than the unadjusted standard deviation of X0 itself. I refer to this as the odds per adjusted standard deviation (OPERA).
For binary and ordered categorical variables, including integer scales, adjustment of the means of these risk factors for the relevant covariates can be done using, for example, logistic or Poisson regression. The adjusted standard deviation can then be derived from these regression analyses and used to calculate OPERA.
If s is the adjusted standard deviation and the increased risk is relative risk (RR)-fold over A = 1/s adjusted standard deviations, then $\mathrm{RR} = \mathrm{OPERA}^{A}$, so

$$\mathrm{OPERA} = \exp[\ln(\mathrm{RR})/A] = \mathrm{RR}^{s}. \tag{1}$$
Note that, in using OPERA to measure discrimination between cases and controls and to compare across risk factors (irrespective of the direction of their associations), by definition RR > 1 in the above formula.
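Equation 1 can be sketched as a small helper. This is a minimal illustration, not code from the paper: the function name `opera` is hypothetical, and inverting a relative risk below 1 so that the association is always expressed as RR > 1 is an assumption about how to apply the convention stated above.

```python
import math


def opera(rr: float, s: float) -> float:
    """Odds per adjusted standard deviation, OPERA = exp(ln(RR) / A) = RR ** s.

    rr: relative risk per unit change of the risk factor.
    s:  adjusted standard deviation of the risk factor (A = 1 / s).
    """
    if rr <= 0 or s <= 0:
        raise ValueError("rr and s must be positive")
    # Assumption: orient the association so that RR > 1, as the text requires.
    if rr < 1:
        rr = 1.0 / rr
    a = 1.0 / s  # number of adjusted standard deviations per unit change
    return math.exp(math.log(rr) / a)


# A 100-fold risk over A = 2 standard deviations (s = 0.5) is 10-fold per SD.
print(round(opera(100, 0.5), 2))
```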
The example of published associations of relevant risk factors with breast cancer and of their distributions in a given population will be used below to illustrate this concept and how it applies to different types of risk factors. For simplicity, I have assumed various values of the population parameters p and μ, but in practice they would be estimated for the population about which the study is making inference.
APPLICATION TO BINARY RISK FACTORS
For a binary factor with prevalence p, the standard deviation is $s = [p(1 - p)]^{0.5}$, so the number of standard deviations between the 2 values of the binary factor is A = 1/s.
Consider breast cancer and sex and, for simplicity, assume that women are at RR = 100 times the risk of men. This risk factor is a binary variable (0 = male, 1 = female), and the probability of each category is p = 0.5. The standard deviation s is $[p(1 - p)]^{0.5} = 0.5$ (i.e., A = 2), and from equation 1 it is seen that $\mathrm{OPERA} = \exp[\ln(100)/2] = 100^{0.5} = 10$. That is, the change from 0 to 1 is A = 2 standard deviations, and given that the odds increase by 100 over 2 standard deviations, under a multiplicative model they must increase 10-fold over 1 standard deviation.
Next consider family history as a binary variable, such as having an affected first-degree relative (0 = no, 1 = yes). For simplicity, assume that p = 0.1 and that there is a 2-fold increased risk for having such a family history. The standard deviation is then s = 0.3 and RR = 2, so from equation 1, OPERA = 1.23.
For BRCA1 and BRCA2, the probability of being a mutation carrier for either gene has been estimated to be about 1 in 600 (1), though it can be as high as 1 in 40 for some populations such as Ashkenazi Jewish women. The increased risk for mutation carriers is about 10-fold, though it can be considerably higher for BRCA1 carriers at a young age, for example, 30-fold at age 30 years (1). Therefore, if p = 1/600, OPERA = 1.10 if RR = 10 and 1.15 if RR = 30, while if p = 1/40, OPERA = 1.43 if RR = 10 and 1.70 if RR = 30.
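The binary examples in this section can be reproduced with a few lines of arithmetic. The sketch below assumes only the prevalences and relative risks quoted in the text; the function names are illustrative.

```python
import math


def binary_sd(p: float) -> float:
    """Standard deviation of a binary factor with prevalence p."""
    return math.sqrt(p * (1.0 - p))


def opera_binary(rr: float, p: float) -> float:
    """OPERA = RR ** s for a binary factor, with s = sqrt(p * (1 - p))."""
    return rr ** binary_sd(p)


# (label, relative risk, prevalence), using the values assumed in the text
examples = [
    ("sex", 100, 0.5),
    ("family history", 2, 0.1),
    ("BRCA1/2, p = 1/600, RR = 10", 10, 1 / 600),
    ("BRCA1/2, p = 1/600, RR = 30", 30, 1 / 600),
    ("BRCA1/2, p = 1/40, RR = 10", 10, 1 / 40),
    ("BRCA1/2, p = 1/40, RR = 30", 30, 1 / 40),
]
for label, rr, p in examples:
    print(f"{label}: s = {binary_sd(p):.3f}, OPERA = {opera_binary(rr, p):.2f}")
```

The rare-mutation rows show how a large relative risk can still correspond to a modest OPERA when the prevalence, and hence s, is small.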
APPLICATION TO COMBINED FAMILIAL AND POLYGENIC RISK FACTORS
Consider now the multitude of familial factors that must exist so as to explain the 2-fold increased breast cancer risk for having an affected first-degree relative. Under a multiplicative polygenic model in which the polygenic risk score is assumed to be normally distributed, the correlation in “polygene” between first-degree relatives is 0.5, and as the risk increases exponentially across the polygene, it has been shown that the interquartile risk ratio across those underlying factors must be about 20-fold (2, 3). Given that the mean of the upper quartile of a normal distribution is 1.27 standard deviations above the overall mean and, by symmetry, the mean of the lower quartile is 1.27 standard deviations below it, there is a 20-fold increased risk across 2.54 standard deviations, so the OPERA must be 3.25.
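The arithmetic behind this figure can be checked as follows. The sketch takes the 20-fold interquartile risk ratio from the text as given and, following the text, treats it as applying between the means of the upper and lower quartiles of a standard normal score. The function name is illustrative.

```python
from statistics import NormalDist


def upper_tail_mean(tail: float = 0.25) -> float:
    """Mean of the upper `tail` fraction of a standard normal distribution."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - tail)  # cut point of the upper tail
    return std_normal.pdf(z) / tail     # E[Z | Z > z] = pdf(z) / P(Z > z)


quartile_mean = upper_tail_mean(0.25)    # about 1.27 SD above the mean
span_in_sd = 2.0 * quartile_mean         # distance between quartile means
interquartile_rr = 20.0                  # value quoted in the text
opera_polygenic = interquartile_rr ** (1.0 / span_in_sd)

print(f"upper-quartile mean = {quartile_mean:.2f} SD")
print(f"OPERA = {opera_polygenic:.2f}")
```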
APPLICATION TO MEASURED COMMON GENETIC RISK-ASSOCIATED VARIANTS
The 70 or so independent common genetic markers currently known to predict breast cancer risk have been found, from analysis of a very large genome-wide association study, to explain approximately 14% of the familial aggregation of breast cancer. By creating a polygenic risk score based on the study findings, Mavaddat et al. (4) estimated that the OPERA is about 1.55.
APPLICATION TO CONTINUOUS RISK FACTORS
Mammographic density, the white or bright areas on a mammogram, is an established risk factor for breast cancer. For women of the same age, body mass index (BMI), and other breast cancer risk factor profiles, those with greater amounts of either absolute dense area (DA) or percent dense area (PDA) are at greater risk. Taking age and BMI into account is important; for the age range in which women are having mammograms, as these factors increase so does breast cancer risk, but both DA and PDA decrease, and this negative confounding is especially strong for PDA. Moreover, after adjustment for age and BMI, the residuals in DA and PDA are highly correlated at about 0.9. Observations show that the OPERA is about 1.40 for both DA and PDA (5), while crude comparisons of the extreme (and unadjusted) quartiles would have suggested (inappropriately) that PDA was the “stronger” risk factor.
To date, comparisons of the relative strengths of these 2 risk factors have been based on the cross-sectional standard deviation, and when viewed this way PDA appears to have a stronger risk gradient. However, this is deceptive. Because age and BMI explain about 29% of the variance of unadjusted PDA (6), the adjusted standard deviation is $(1 - 0.29)^{0.5} \approx 0.84$ times the cross-sectional standard deviation. Hence, the logarithm of OPERA is about 0.84 times the logarithm of the odds ratio per cross-sectional standard deviation, which is a decrease of about 16%. In contrast, for the dense area, only 5% of the variance is explained by age and BMI, so the logarithm of OPERA is $(1 - 0.05)^{0.5} \approx 0.975$ times the logarithm of the odds ratio per cross-sectional standard deviation, only a 2.5% decrease. Therefore, mammographic density is an important example of why OPERA is the appropriate way to compare risk factors.
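The conversion described here, from an odds ratio per cross-sectional standard deviation to OPERA, depends only on the proportion of variance explained by the adjusting covariates. A minimal sketch follows. The function names and the odds ratio of 1.5 used in the demonstration are hypothetical, not estimates from the cited studies.

```python
import math


def adjusted_sd_ratio(variance_explained: float) -> float:
    """Adjusted SD as a fraction of the cross-sectional SD, sqrt(1 - R^2)."""
    if not 0.0 <= variance_explained < 1.0:
        raise ValueError("variance_explained must be in [0, 1)")
    return math.sqrt(1.0 - variance_explained)


def opera_from_or_per_sd(or_per_sd: float, variance_explained: float) -> float:
    """Rescale an odds ratio per cross-sectional SD to an odds ratio per adjusted SD.

    ln(OPERA) = sqrt(1 - R^2) * ln(OR per cross-sectional SD).
    """
    return or_per_sd ** adjusted_sd_ratio(variance_explained)


# Variance explained by age and BMI, as quoted in the text
for label, r2 in [("PDA", 0.29), ("DA", 0.05)]:
    ratio = adjusted_sd_ratio(r2)
    # 1.5 is a hypothetical odds ratio per cross-sectional SD, for illustration
    print(f"{label}: ratio = {ratio:.3f}, OPERA = {opera_from_or_per_sd(1.5, r2):.3f}")
```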
OPERA has been applied to measures of bone mineral density adjusted for age and menopausal status and estimated to be 1.35, with a wide confidence interval, for Korean women (7).
APPLICATION TO ORDINAL RISK FACTORS
Increasing number of childbirths is associated with a decrease in risk. Number of births has an approximate Poisson distribution, so the standard deviation, s, is approximately equal to the square root of the mean, $\mu^{0.5}$. Suppose that women have, on average, μ = 2 children and that each successive child is associated with an x = 7% reduction in risk (8), so that risk is decreased by RR = (1 + x)-fold over A = $1/2^{0.5}$ standard deviations. Therefore, from equation 1, $\mathrm{OPERA} = \exp[\ln(1 + x)/A] = 1.07^{1.41} = 1.10$. Note that although number of children is protective, OPERA is >1 (refer to the definition above).
For a single common genetic marker (e.g., minor allele frequency of 0.3–0.5) that can take 3 values (so that s ≈ 0.5) and is weakly associated with risk (e.g., RR = 1.1 per allele), risk is increased by RR = 1.1-fold over A = 1/s = 2 standard deviations, so $\mathrm{OPERA} = 1.1^{0.5} = 1.05$.
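The two ordinal examples reduce to the same calculation, OPERA = RR ** s. The sketch below uses the Poisson approximation s = sqrt(mean) for number of births and takes s = 0.5 for the single marker exactly as stated in the text. The function name is illustrative.

```python
import math


def opera_count(rr_per_unit: float, mean: float) -> float:
    """OPERA for an approximately Poisson count, where s = sqrt(mean)."""
    s = math.sqrt(mean)
    return rr_per_unit ** s


# Number of births: mean of 2 children, risk changes 1.07-fold per child
print(f"births, mean 2: OPERA = {opera_count(1.07, 2.0):.2f}")

# Same per-child association in a population averaging 6 children per woman
print(f"births, mean 6: OPERA = {opera_count(1.07, 6.0):.2f}")

# Single common marker: RR = 1.1 per allele, s taken as 0.5 (value from the text)
marker_opera = 1.1 ** 0.5
print(f"single marker:  OPERA = {marker_opera:.2f}")
```

The second line illustrates why the same per-child association gives a larger OPERA in populations where women have more children on average.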
TAKING INTO ACCOUNT VARIATION IN RISK GRADIENTS WITH OTHER COVARIATES FOR BINARY OR ORDINAL RISK FACTORS
For unaffected women (controls), the probability of having a family history, p, or the average number of children, N, could depend on age and perhaps other measured factors (X1, X2, … , Xn). These associations can be fitted by using logistic or Poisson regression, respectively. The adjusted standard deviation for a given set of (X1, X2, … , Xn) is therefore $\{p(X_1, X_2, \ldots, X_n)[1 - p(X_1, X_2, \ldots, X_n)]\}^{0.5}$ or $N(X_1, X_2, \ldots, X_n)^{0.5}$, respectively. In these instances, OPERA is a function of X1, X2, … , and Xn. That is, the “strength” of a risk factor can be presented, appropriately, in terms of the age of the at-risk woman and her other measured characteristics.
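As a sketch of how OPERA becomes a function of the covariates, suppose the regression among controls has already been fitted with age as the only covariate. The intercept and slope below are hypothetical placeholders, not fitted values, and the function names are illustrative.

```python
import math


def logistic_prevalence(age: float, intercept: float, slope: float) -> float:
    """Prevalence p(age) implied by a logistic regression of a binary factor on age."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * age)))


def opera_binary_at(age: float, rr: float, intercept: float, slope: float) -> float:
    """OPERA for a binary factor at a given age: RR ** sqrt(p(age) * (1 - p(age)))."""
    p = logistic_prevalence(age, intercept, slope)
    return rr ** math.sqrt(p * (1.0 - p))


def opera_count_at(age: float, rr_per_unit: float, intercept: float, slope: float) -> float:
    """OPERA for a count at a given age, with Poisson mean N(age) = exp(intercept + slope * age)."""
    mean = math.exp(intercept + slope * age)
    return rr_per_unit ** math.sqrt(mean)


# Hypothetical logistic coefficients for family history among controls
INTERCEPT, SLOPE = -3.5, 0.03
for age in (40, 50, 60, 70):
    p = logistic_prevalence(age, INTERCEPT, SLOPE)
    print(f"age {age}: p = {p:.3f}, OPERA = {opera_binary_at(age, 2.0, INTERCEPT, SLOPE):.3f}")
```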
TAKING INTO ACCOUNT INTERACTIONS BETWEEN RISK FACTORS
The concept of OPERA can also be applied when there are interactions between 2 risk factors in the sense that the risk gradient for one factor (and therefore its OPERA estimate) depends on the levels of the other risk factor. This would require estimating the standard deviation of each risk factor as a function of the other risk factor, as well as other factors in the model and design. In this context, OPERA would be interpreted in terms of the levels of each risk factor.
PRACTICAL CONSIDERATIONS
In practice, one can perform model fitting using whatever scale one likes to derive the relative risk estimates on that scale, but to derive and therefore present the OPERA estimates, the scale needs to be adjusted as described above to take into account the other relevant covariates adjusted for by design or analysis to derive the adjusted standard deviation, s. As the interpretation of an estimate in a fitted model is conditional on all the other factors in the model and design, they should all be considered in deriving these adjusted standard deviations. In addition, care must be taken in deriving these adjusted standard deviations by checking the goodness of fit of the relationship, addressing outliers and influential points, testing model assumptions, and if necessary transforming data. Choice of an appropriate sample is critical as well.
Given that both presentations are informative, one could present the different estimates together, that is, the relative risk estimate on the original scale and $\mathrm{OPERA} = \mathrm{RR}^{s}$. Another approach is to first derive the adjusted covariates (i.e., each covariate adjusted for the other relevant covariates) and to standardize (i.e., to have unit standard deviation) and then fit these derived measures to give the independent OPERA estimates directly. The 2 approaches above are not guaranteed to give exactly the same results for a given data set, in part because of the different approaches taken to “adjustment,” but they are essentially addressing the same concept.
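The second approach, adjusting and then standardizing a covariate before fitting, can be sketched for the simplest case of one continuous risk factor and one adjusting covariate. The data are made-up values for illustration only. Ordinary least squares with a single covariate is an assumption made to keep the example short; in practice the adjustment would include all relevant covariates and the model checks described above.

```python
import statistics


def residualize(x0, x1):
    """Residuals of x0 after simple linear regression on a single covariate x1."""
    m0, m1 = statistics.fmean(x0), statistics.fmean(x1)
    sxx = sum((b - m1) ** 2 for b in x1)
    sxy = sum((a - m0) * (b - m1) for a, b in zip(x0, x1))
    slope = sxy / sxx
    return [(a - m0) - slope * (b - m1) for a, b in zip(x0, x1)]


def standardize(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    m = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - m) / sd for v in values]


# Made-up control data: age in years and a continuous risk factor
age = [40, 45, 50, 55, 60, 65, 70]
factor = [30.0, 28.0, 22.0, 25.0, 15.0, 16.0, 10.0]

residuals = residualize(factor, age)
# Sample SD (n - 1 divisor); a residual SD with an n - 2 divisor is an alternative.
adjusted_sd = statistics.stdev(residuals)
z = standardize(residuals)  # the derived measure to enter into the risk model

print(f"cross-sectional SD = {statistics.stdev(factor):.2f}")
print(f"adjusted SD        = {adjusted_sd:.2f}")
```

Fitting the risk model to `z` gives an odds ratio per adjusted standard deviation directly. Multiplying a log odds ratio per unit by `adjusted_sd` corresponds to the first approach.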
The data used to derive the adjusted standard deviation should be relevant to the population about whom the risk estimates are being made. Note that the OPERA estimates, like all other estimates, are strictly valid only for the population from which the study sample has been obtained. The issue of generalizability applies to all estimates, irrespective of the scale on which they are presented. The relative risk per unadjusted standard deviation, however, suffers from being also dependent on the sample characteristics, such as the age range, so it is less generalizable than OPERA.
PUTTING RISK FACTORS FOR A GIVEN DISEASE INTO PERSPECTIVE
Table 1 displays, for breast cancer, how the strengths of different risk factors can be compared with one another, and it highlights issues in the comments column. Clearly sex is paramount. Age could also be important, but it depends entirely on the age range and its distribution in the population and, hence, is deliberately left unstated. In their totality, familial factors rank highly, but the currently known “high-risk” genes and the established common markers of risk account for about half of this gradient in risk and less for early onset disease despite its having a stronger familial risk component. Within a Western population, the number of childbirths is not strong, but as the logarithm of OPERA increases in proportion to the square root of the average number of children per woman, this risk factor would be more important in some other populations. The currently known common genetic markers rank on a par with the current measures of mammographic density (adjusted for age, BMI, and other risk factors). The risk gradient with measured genetic risk factors will likely increase as new markers are discovered and better and more sophisticated risk prediction models are developed that take into account measured genetic factors, unmeasured major genes, and “polygenes” and use age at diagnosis and other risk-predicting features of family cancer histories. It will be interesting to see where new measures of risk, such as markers of methylation and novel approaches to extracting information on risk from mammograms and other screening modalities, fit into the picture.
Table 1.
Risk Factor | OPERA | Comment |
---|---|---|
Sex | 10 | |
Age | ? | Depends on age range and distribution |
All familial causes | 3 | Known and unknown factors |
Mammographic density | 1.4 | Likely to increase with new measures |
Known polygenic markers | 1.6 | Likely to increase with new studies |
Known gene mutations | 1.1–1.7 | Depends on age at diagnosis and ethnicity |
Family history | 1.2 | First-degree only; yes/no |
No. of childbirths | 1.1 | Greater for countries with more births |
Abbreviation: OPERA, odds per adjusted standard deviation.
a This paper is illustrative and about a (new) concept; the table is intended as a description of general results when applying the concept to breast cancer.
PUTTING A GIVEN RISK FACTOR INTO PERSPECTIVE ACROSS DISEASES, POPULATIONS, AND SETTINGS
With OPERA, an appropriate measure of the strength of a risk factor—adjusted for the other risk factors—can be compared across diseases, across subsets of a disease (e.g., based on age at onset or subtype), and across populations and different environmental settings.
Therefore, for any given risk factor, one can rank the diseases to which a risk factor predisposes on the basis of its strength as measured by OPERA. This would be important for determining how changes in a risk factor might impact on multiple diseases. It could be important in trying to work out for which disease(s) a particular intervention might have the most impact and put these impacts into perspective. By taking into account the benefits per disease, some of which might be negative, the overall impact of the intervention could be assessed.
DISCUSSION
First, given that risk estimates take into account other fitted and design-related factors, so should the presentation of their risk gradients. Second, presenting risk gradients as a function of the “adjusted” standard deviation (the standard deviation after adjusting for other relevant factors) is a unifying approach, because it can be applied irrespective of the distribution of the (ordered) risk factor (continuous, binary, ordered categorical, integer, etc.). This general concept could also be adapted and applied to hazard ratio estimates from cohort studies. OPERA scores across risk factors for a given disease are also, by definition, independent of one another. The strengths of risk factors—in the sense of their ability to discriminate cases from controls within a given population—can then be compared with one another and across diseases, categories of disease, populations, and even subpopulations.
ACKNOWLEDGMENTS
Author affiliation: Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Carlton, Victoria, Australia (John L. Hopper).
J.L.H. has received financial support from the National Health and Medical Research Council of Australia (grant APP102434), the US National Institutes of Health (grants UM1 CA164920, 5 R01 CA159868, UM1 CA167551-01A1, and R01 CA168893), and Seoul National University (Seoul, South Korea).
Conflict of interest: none declared.
APPENDIX
For a continuous or ordered categorical risk factor, X, it is convenient to be able to describe the risk gradient by 1 parameter when fitting a log-linear association. To do so requires finding the most appropriate scale for X. This can be done by applying the Box-Cox power transformation $f(X) = (X^{\lambda} - 1)/\lambda$ if λ ≠ 0, else $f(X) = \ln(X)$ if λ = 0, fitting risk as a function of the covariate f(X) across a range of values for λ and choosing the transformation that gives the maximum log-likelihood of the fitted models (9).
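A minimal sketch of the transformation and the grid search over λ is given below. The `log_likelihood` argument is a hypothetical callable, supplied by the analyst, that fits the risk model with f(X) as the covariate for a given λ and returns the maximized log-likelihood. The toy function used in the demonstration merely stands in for that fit.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Box-Cox power transformation: (x**lam - 1) / lam, or ln(x) when lam == 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def best_lambda(lambdas, log_likelihood):
    """Return the lambda in the grid whose fitted model has the largest log-likelihood."""
    return max(lambdas, key=log_likelihood)


grid = [-1.0, -0.5, 0.0, 0.5, 1.0]


def toy_log_likelihood(lam):
    """Hypothetical stand-in for refitting the risk model at each lambda."""
    return -(lam - 0.5) ** 2


chosen = best_lambda(grid, toy_log_likelihood)
print(f"chosen lambda = {chosen}, f(4) = {box_cox(4.0, chosen)}")
```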
REFERENCES
1. Antoniou A, Pharoah PD, Narod S, et al. Average risks of breast and ovarian cancer associated with BRCA1 or BRCA2 mutations detected in case series unselected for family history: a combined analysis of 22 studies. Am J Hum Genet. 2003;72(5):1117–1130.
2. Aalen OO. Modelling the influence of risk factors on familial aggregation of disease. Biometrics. 1991;47(3):933–945.
3. Hopper JL, Carlin JB. Familial aggregation of a disease consequent upon correlation between relatives in a risk factor measured on a continuous scale. Am J Epidemiol. 1992;136(9):1138–1147.
4. Mavaddat N, Pharoah PDP, Michailidou K, et al. Prediction of breast cancer risk based on profiling with common genetic variants. J Natl Cancer Inst. 2015;107(5):djv036.
5. Baglietto L, Krishnan K, Stone J, et al. Associations of mammographic dense and nondense areas and body mass index with risk of breast cancer. Am J Epidemiol. 2014;179(4):475–483.
6. Nguyen TL, Schmidt DF, Makalic E, et al. Explaining variance in the cumulus mammographic measures that predict breast cancer risk: a twins and sisters study. Cancer Epidemiol Biomarkers Prev. 2013;22(12):2395–2403.
7. Kim BK, Choi YH, Song YM, et al. Bone mineral density and the risk of breast cancer: a case-control study of Korean women. Ann Epidemiol. 2014;24(3):222–227.
8. Collaborative Group on Hormonal Factors in Breast Cancer. Breast cancer and breastfeeding: collaborative reanalysis of individual data from 47 epidemiological studies in 30 countries, including 50302 women with breast cancer and 96973 women without the disease. Lancet. 2002;360(9328):187–195.
9. Box GEP, Cox DR. An analysis of transformations. J R Stat Soc Series B. 1964;26(2):211–252.