Author manuscript; available in PMC: 2009 Sep 28.
Published in final edited form as: Genomics. 2008 Jul 21;93(1):10–12. doi: 10.1016/j.ygeno.2008.06.002

Less is more, except when less is less: Studying joint effects

CR Weinberg
PMCID: PMC2752945  NIHMSID: NIHMS142803  PMID: 18598750

Abstract

Most diseases are complex in that they are caused by the joint action of multiple factors, both genetic and environmental. Over the past few decades, the mathematical convenience of logistic regression has served to enshrine the multiplicative model, to the point where many epidemiologists believe that departure from additivity on a log scale implies that two factors interact in causing disease. Other terminology in epidemiology, where students are told that inequality of relative risks across levels of a second factor should be seen as “effect modification,” reinforces an uncritical acceptance of multiplicative joint effect as the biologically meaningful no-interaction null. Our first task, when studying joint effects, is to understand the limitations of our definitions for “interaction,” and recognize that what statisticians mean and what biologists might want to mean by interaction may not coincide.

Joint effects are notoriously hard to identify and characterize, even when asking a simple and unsatisfying question – like whether two effects are log-additive. The rule of thumb for such efforts is that a factor-of-four sample size is needed, compared to that needed to demonstrate main effects of either genes or exposures. So strategies have been devised that focus on the most informative individuals, either through risk-based sampling for a cohort, or case-control sampling, extreme phenotype sampling, pooling, two-stage sampling, exposed-only, or case-only designs. These designs gain efficiency, but at a cost of flexibility in models for joint effects.

A relatively new approach avoids population controls by genotyping case-parent triads. Because it requires parents, the method works best for diseases with onset early in life. With this design, the role of autosomal genetic variants is assessed by in effect treating the nontransmitted parental alleles as controls for affected offspring. Despite advantages for looking at genetic effects, the triad design faces limitations when examining joint effects of genetic and environmental factors. Because population-based controls are not included, main effects for exposures cannot be estimated, and consequently one only has access to inference related to a multiplicative null. We have proposed a hybrid approach, which offers the best features of both a case-parent and a case-control design. By genotyping parents of population-based controls and assuming Mendelian transmission, power is markedly enhanced. One can also estimate main effects for exposures and now flexibly assess models for joint effects.

Introduction – what is biological versus statistical interaction?

Most diseases are “complex” in that they are caused by the joint action of multiple genetic and environmental (including lifestyle) factors, playing out over age and time. The most interesting etiologic questions for complex diseases bear on possibly shared pathways, co-regulated families of genes and the mechanisms by which genetic and environmental factors modulate specific biologic processes. Methods to characterize joint effects on susceptibility to complex diseases should ideally illuminate the causal processes themselves, at mechanistic levels of understanding.

However, investigators have a natural preference for problems they can solve, and approaches to studying interactions have historically focused on specific statistical models. We begin by briefly considering the differences between how a biologist might think about interaction and how a statistician might operationally define it.

Colloquially, people refer to two factors as behaving in a synergistic way if their combined effect is greater than what would be expected based on their individual separate effects. A classic example is retardation due to phenylketonuria (PKU): dietary phenylalanine is innocuous in the absence of the metabolic PKU genetic defect, and the defect does not cause retardation in the absence of dietary phenylalanine. (Hence the importance of neonatal screening for this genetic variant.) Most synergistic interactions are less “pure” than the PKU paradigm, and typically both factors individually cause some increase in risk. Consequently, in order to say that a joint effect exceeds what would be expected based on individual effects, we first need a way to form a reasonable expectation for their joint effect, against which to define interaction, or synergism. Any definition of interaction logically relies on some specification of non-interactive effects.

Models for non-interactive joint effects

For assessing toxic effects of mixtures, toxicologists developed the notion of “simple independent action” (SIA) [1]. The idea there was probabilistic independence. If there is no background risk, i.e. no risk in the absence of either exposure A or B, then the probability of avoiding an outcome, say D, when exposed to both A and B should be the product of the two separate avoidance probabilities. Under simple independent action, the risk factors A and B thus behave in a way that is mutually oblivious. A paradigm for this would be two duck hunters separately shooting at the same duck [2], where the duck must avoid the bullets coming from both. Violations of SIA would then indicate non-independence, which would be seen as synergy if the violation is in the direction of increased risk and antagonism if in the direction of reduced risk. An independently acting background cause (or aggregation of causes not including A or B) can be allowed for as yet another independently operating factor, so that if Pr[D | A,B] is the probability of D in the presence of both exposures, simple independent action would require:

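$$\frac{\Pr[\bar{D} \mid A, B]}{\Pr[\bar{D} \mid \bar{A}, \bar{B}]} = \frac{\Pr[\bar{D} \mid A, \bar{B}]}{\Pr[\bar{D} \mid \bar{A}, \bar{B}]} \cdot \frac{\Pr[\bar{D} \mid \bar{A}, B]}{\Pr[\bar{D} \mid \bar{A}, \bar{B}]},$$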

where the overbar indicates nonoccurrence of the exposure or outcome. The idea again is that the probability of escaping the causes associated with A, with B, and the causes associated with the background factors is the product of the three separate escape probabilities if A and B act independently of each other and of all other unmeasured causal factors (background). Taking logarithms yields a model that is additive in the log-complement, adjusted by the background:

$$\{\ln(1 - \Pr[D \mid A, B]) - \ln(1 - \Pr[D \mid \bar{A}, \bar{B}])\} = \{\ln(1 - \Pr[D \mid A, \bar{B}]) - \ln(1 - \Pr[D \mid \bar{A}, \bar{B}])\} + \{\ln(1 - \Pr[D \mid \bar{A}, B]) - \ln(1 - \Pr[D \mid \bar{A}, \bar{B}])\}.$$

Suppose D is rare, with risk near 0. Then, because -ln(1-x) is approximately equal to x when x is close to 0, this condition is approximately the same as:

$$(\Pr[D \mid A, B] - \Pr[D \mid \bar{A}, \bar{B}]) = (\Pr[D \mid A, \bar{B}] - \Pr[D \mid \bar{A}, \bar{B}]) + (\Pr[D \mid \bar{A}, B] - \Pr[D \mid \bar{A}, \bar{B}]),$$

that is, additivity on the absolute risk scale. Even if D is not rare, if one is looking at hazards across instantaneous time, the event is rare in each small time interval. Thus if $\lambda_{AB}(t)$ is the instantaneous hazard rate at time t for those exposed to both A and B, then non-interaction would correspond to:

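$$(\lambda_{AB}(t) - \lambda_{\bar{A}\bar{B}}(t)) = (\lambda_{A\bar{B}}(t) - \lambda_{\bar{A}\bar{B}}(t)) + (\lambda_{\bar{A}B}(t) - \lambda_{\bar{A}\bar{B}}(t)),$$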

that is, additivity of hazards. This additivity then corresponds to a multiplicative model in cumulative survival. The notion here, biologically, might be that these two exposures act through completely disjoint biologic pathways.
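The rare-disease approximation can be checked numerically. The following is a minimal sketch, not taken from the original paper: the function names and the two sets of risks are illustrative assumptions. It computes the joint risk expected under simple independent action exactly, and compares it with the joint risk expected under additivity of absolute risks.

```python
def sia_joint_risk(r0, ra, rb):
    """Joint risk expected under simple independent action (SIA).

    r0: background risk (neither A nor B); ra: risk with A alone;
    rb: risk with B alone. Escape probabilities multiply once the
    background is factored out:
    (1 - r_ab) / (1 - r0) = [(1 - ra) / (1 - r0)] * [(1 - rb) / (1 - r0)]
    """
    return 1.0 - (1.0 - ra) * (1.0 - rb) / (1.0 - r0)


def additive_joint_risk(r0, ra, rb):
    """Joint risk expected under additivity on the absolute risk scale."""
    return r0 + (ra - r0) + (rb - r0)


# Hypothetical risks chosen for illustration only.
rare = (0.0001, 0.0002, 0.0003)   # rare outcome
common = (0.10, 0.20, 0.30)       # common outcome

gap_rare = abs(sia_joint_risk(*rare) - additive_joint_risk(*rare))
gap_common = abs(sia_joint_risk(*common) - additive_joint_risk(*common))

print(f"rare outcome:   SIA vs additive differ by {gap_rare:.2e}")
print(f"common outcome: SIA vs additive differ by {gap_common:.2e}")
```

Under these assumed inputs the two expectations are nearly indistinguishable for the rare outcome and clearly different for the common one, which is why the additive form is only an approximation to SIA.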

Epidemiologists, for reasons that are primarily historical and mathematical, have instead preferred a model for statistical independence that is based on additivity on a logarithmic scale applied to risk, rather than 1 minus risk. The multiplicative model for independence specifies that:

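$$\frac{\Pr[D \mid A, B]}{\Pr[D \mid \bar{A}, \bar{B}]} = \frac{\Pr[D \mid A, \bar{B}]}{\Pr[D \mid \bar{A}, \bar{B}]} \cdot \frac{\Pr[D \mid \bar{A}, B]}{\Pr[D \mid \bar{A}, \bar{B}]},$$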

i.e. the relative risk associated with the combined exposure is the product of the two relative risks. Another way to think of this is that the relative risk for A relative to nonA (or B relative to nonB) is the same across levels of B (or A). If the disease is rare in the population under study, then the above model is approximately equivalent to a multiplicative model for the odds ratios, which becomes additivity on the logistic scale. In an unfortunate use of terminology, epidemiologists refer to departures from this model as effect modification, a jargon that strongly implies that one factor modifies the effect of the other. (A recent advance is instead to use the phrase effect-measure modification [3].)

Thus, there are a number of different possible scales on which to define non-interaction; while the biologist may prefer the log-complement scale because of the natural interpretability of probabilistic independence, the epidemiologist has traditionally preferred the log-odds scale. In practice, this distinction may not be a subtle one. For example, suppose the background risk (i.e. the risk in those with neither A nor B) is 10 per 100,000, while the risk with A alone is 20 per 100,000 and the risk with B alone is 30 per 100,000. Then if the risk in those with both is 50 per 100,000 there is antagonism (competition?) between A and B under a log-odds null, but synergism (enhancement?) under a log-complement null, because the expectation under the former is 60 per 100,000, versus about 40 per 100,000 under the latter. Thus, under one model the two exposures mutually enhance each other and under the other they are seen as working against each other. So the direction of the finding can be different, depending on the selected model for non-interaction. Another consideration is that power for detecting “interaction” can be markedly greater if the null is specified additively. This enhancement of statistical power happens, not for some subtle reason, but simply because super-multiplicative joint action of two risk-enhancing factors is more distant from an additive null than from a multiplicative null.
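The arithmetic of this example can be laid out as a short sketch. The helper names are illustrative assumptions. Risks are expressed per 100,000, and the additive expectation stands in for the log-complement null, which it closely approximates when the outcome is this rare.

```python
def expected_joint(r0, ra, rb):
    """Return (multiplicative, additive) expectations for the joint risk.

    r0: background risk, ra: risk with A alone, rb: risk with B alone,
    all in the same units (here, cases per 100,000).
    """
    multiplicative = r0 * (ra / r0) * (rb / r0)   # product of relative risks
    additive = r0 + (ra - r0) + (rb - r0)         # sum of excess risks
    return multiplicative, additive


def classify(observed, expected):
    """Label the departure of the observed joint risk from a given null."""
    if observed > expected:
        return "synergism"
    if observed < expected:
        return "antagonism"
    return "no interaction"


# Values from the example in the text, per 100,000.
r0, ra, rb, observed = 10, 20, 30, 50
mult, add = expected_joint(r0, ra, rb)

print("multiplicative null:", mult, "->", classify(observed, mult))
print("additive null:      ", add, "->", classify(observed, add))
```

The same observed joint risk is labeled antagonism against the multiplicative expectation and synergism against the additive one.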

Even under the simple log-odds formulation, detection of interactions tends to be statistically challenging, and a rule of thumb has been that the sample size required is about four-fold what would be needed to detect a main effect of similar magnitude. Thus, a doubling of a relative risk from one stratum to another is about 4 times as hard to detect as a twofold relative risk.
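One common heuristic behind the four-fold rule can be sketched numerically. This is an illustration under stated assumptions, not a derivation given in the paper: the 2x2 counts are hypothetical, the second factor is assumed to split the subjects into two equal strata with the same cell proportions, and Woolf's large-sample variance is used for the log odds ratio.

```python
def var_log_or(a, b, c, d):
    """Woolf's large-sample variance of the log odds ratio for a 2x2 table."""
    return 1 / a + 1 / b + 1 / c + 1 / d


# Hypothetical counts: exposed cases, unexposed cases,
# exposed controls, unexposed controls.
table = (200, 300, 150, 350)
v_main = var_log_or(*table)

# Split the same subjects evenly across two levels of a second factor.
half = tuple(n / 2 for n in table)
v_stratum = var_log_or(*half)

# The interaction contrast is the difference of two independent
# stratum-specific log odds ratios, so their variances add.
v_interaction = 2 * v_stratum

ratio = v_interaction / v_main
print(f"variance ratio (interaction / main effect): {ratio:.1f}")
```

Because the sample size needed to detect an effect of fixed magnitude scales with the variance of its estimator, a four-fold variance translates into roughly a four-fold sample size under these assumptions.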

Less-is-more approaches

Epidemiologic designs have been developed to extract more information from fewer people, by judicious sampling strategies. Two-stage sampling, for example, was developed as a strategy for situations where a rare exposure is under study, and one can improve efficiency dramatically, particularly for studying interactions, by over-sampling exposed cases and exposed controls and adjusting for the covariate- and outcome-dependent sampling fractions in the analysis [4; 5]. Pooling of specimens is another strategy that has been proposed as a way to include many more study subjects by minimizing assay costs, when interaction is of interest and exposure is assessed by a biomarker [6].

The case-only design is an extreme form of a less-is-more strategy, because no controls are sampled at all [7]. In place of controls we make a strong assumption. If one can safely assume that factors A and B occur independently of each other in the source population (an assumption that may often hold with an environmental factor and a genetic variant), then one can test a multiplicatively-defined interaction using only cases. Reiterating the argument given by Piegorsch et al. [7], this can be seen as follows. Consider the multiplicative interaction parameter:

$$\psi = \frac{\Pr[D \mid A, B] \, \Pr[D \mid \bar{A}, \bar{B}]}{\Pr[D \mid \bar{A}, B] \, \Pr[D \mid A, \bar{B}]}$$

Under the null that the joint effect of A together with B is simply multiplicative, this parameter is 1. Algebraically, it can be reexpressed as follows:

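$$\frac{\;\dfrac{\Pr[A, B \mid D] \, \Pr[\bar{A}, \bar{B} \mid D]}{\Pr[\bar{A}, B \mid D] \, \Pr[A, \bar{B} \mid D]}\;}{\;\dfrac{\Pr[A, B] \, \Pr[\bar{A}, \bar{B}]}{\Pr[\bar{A}, B] \, \Pr[A, \bar{B}]}\;}$$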

Now note that under independence of A and B in the source population, the ratio in the denominator is 1 and what is left is the numerator ratio, i.e. the odds ratio relating A to B among cases only. The remarkable feature of this design is that its efficiency for detecting departures from multiplicative joint effects is as good as that of a study with an infinitely large control group, because in effect the controls are only serving to estimate that denominator “1” with increasing precision. However, the user should beware, because the limitations are drastic for the case-only design: even when the primary assumption holds, main effects cannot be studied at all and only multiplicative interactions can be assessed. Often a genetic and an environmental factor will not occur independently in the population, violating the method’s primary assumption, because attrition of risk factors can be jointly selective. These constraints handicap case-only inference related to joint effects.
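The case-only identity can be verified with a small calculation. This is a minimal sketch: the exposure prevalences are hypothetical, A and B are assumed independent in the source population, the function names are illustrative, and the risks reuse the per-100,000 example given earlier.

```python
def psi(risk):
    """Multiplicative interaction parameter from the four conditional risks.

    risk maps (a, b), with a and b in {0, 1}, to Pr[D | A=a, B=b].
    """
    return (risk[(1, 1)] * risk[(0, 0)]) / (risk[(0, 1)] * risk[(1, 0)])


def case_only_odds_ratio(p_a, p_b, risk):
    """A-B odds ratio among cases, assuming A and B are independent
    in the source population with prevalences p_a and p_b."""
    cells = [(a, b) for a in (0, 1) for b in (0, 1)]
    # Joint exposure distribution in the population under independence.
    joint = {(a, b): (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
             for a, b in cells}
    # Bayes: Pr[A=a, B=b | D] is proportional to Pr[D | a, b] * Pr[a, b].
    weight = {k: risk[k] * joint[k] for k in cells}
    total = sum(weight.values())
    among_cases = {k: w / total for k, w in weight.items()}
    return (among_cases[(1, 1)] * among_cases[(0, 0)]) / (
        among_cases[(0, 1)] * among_cases[(1, 0)])


# Risks from the earlier example; prevalences are hypothetical.
risk = {(0, 0): 10e-5, (1, 0): 20e-5, (0, 1): 30e-5, (1, 1): 50e-5}
psi_value = psi(risk)
or_cases = case_only_odds_ratio(p_a=0.3, p_b=0.1, risk=risk)

print(f"psi = {psi_value:.4f}, case-only odds ratio = {or_cases:.4f}")
```

The two quantities agree for any choice of prevalences as long as independence holds. If the joint exposure distribution is built with dependence between A and B, the agreement is lost.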

A relatively new design also avoids use of population-based controls, but uses the parents of cases to serve as genetic controls, who are ethnically matched to each case [8]. One studies triads consisting of affected individuals and their parents. The notion is that genetic variants that confer susceptibility will have been over-transmitted to people with disease (compared to Mendelian transmission), and the extent of that over-transmission allows us to estimate the genetic relative risks [9]. This approach offers robustness against the bias due to population stratification that can affect a case-control approach to studying genetic risk factors. It also allows detection of maternally-mediated genetic effects [10] and parent-of-origin effects [11]. Stratification on exposure also permits estimation of multiplicative interaction parameters, by testing whether the genetic relative risks are the same across levels of the exposure. Clearly, this design relies on the adequate availability of parents and is best if used for conditions with onset early in life, such as birth defects. Although genetic main effects can now be estimated, a limitation is that because one cannot estimate the main effects of exposures with this design, one can only work with a multiplicative formulation for joint effects, and consequently sacrifices flexibility in inference.
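The idea of over-transmission can be illustrated with the transmission test of [8], which is simpler than the log-linear approach of [9] and is used here only as a sketch. The counts are hypothetical and the function name is an assumption. Only heterozygous parents are informative: each either transmitted the variant allele to the affected offspring or did not.

```python
def transmission_test(transmitted, untransmitted):
    """McNemar-type transmission statistic from heterozygous parents.

    transmitted: heterozygous parents who passed the variant allele
                 to the affected offspring
    untransmitted: heterozygous parents who passed the other allele
    Returns the chi-square statistic (1 degree of freedom under
    Mendelian transmission) and the transmission ratio.
    """
    chi2 = (transmitted - untransmitted) ** 2 / (transmitted + untransmitted)
    ratio = transmitted / untransmitted
    return chi2, ratio


# Hypothetical counts for illustration only.
chi2, ratio = transmission_test(transmitted=120, untransmitted=80)
print(f"chi-square = {chi2:.1f}, transmission ratio = {ratio:.2f}")
```

Under Mendelian transmission the two counts are expected to be equal, so a ratio above 1 reflects over-transmission. Under a multiplicative allele model, the ratio can be read as a rough estimate of the relative risk per copy of the variant. Repeating the calculation within exposure strata and comparing the ratios mirrors the stratified test of multiplicative interaction described above.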

A promising extension of the case-parent design, originally proposed by Nagelkerke, incorporates some controls from the population, as a kind of hybrid between the case-control and the case-parent triad approach [12]. In a recent variant of this design [13], the exposures for controls are studied, but one genotypes the parents of controls and not the controls themselves. The inclusion of control parents allows one to capture information related to enrichment for causative alleles in the parents of cases compared to parents of controls, information that is sacrificed by triad methods, which condition on the parental genotypes. The inclusion of controls greatly enhances the power of the design for detecting genetic effects. There are also advantages for studying gene-by-environment interaction. Provided one can safely assume Mendelian transmission, nondifferential survival (vis a vis the gene) to the age at study, and absence of bias due to genetic population stratification, this hybrid approach now allows estimation of the main effects of exposures, as well as the main effects of autosomal genetic variants, and hence provides both improved power and enhanced flexibility for modeling joint effects of genetic and environmental factors.

Despite such advances, characterization of joint gene-gene effects and joint effects of genetic and environmental factors will continue to present a challenge. Methodologic research for studying the etiology of complex diseases remains in its infancy, both conceptually and practically.

ACKNOWLEDGMENTS

This work was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (Z01 ES 040006-11).

Footnotes

Proceedings from meeting at Harvard School of Public Health, October, 2007

References

  • 1. Finney DJ. Probit Analysis. New York: Cambridge University Press; 1971.
  • 2. Weinberg CR. Applicability of the simple independent action model to epidemiologic studies involving two factors and a dichotomous outcome. American Journal of Epidemiology. 1986;123:162–173. doi: 10.1093/oxfordjournals.aje.a114211.
  • 3. Rothman K, Greenland S, Lash T. Modern Epidemiology. Philadelphia: Lippincott Williams & Wilkins; 2008.
  • 4. White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115:119–128. doi: 10.1093/oxfordjournals.aje.a113266.
  • 5. Breslow N, Cain K. Logistic regression for two-stage case-control data. Biometrika. 1988;75:11–20.
  • 6. Weinberg C, Umbach D. Using pooled exposure assessment to improve efficiency in case-control studies. Biometrics. 1999;55:718–726. doi: 10.1111/j.0006-341x.1999.00718.x.
  • 7. Piegorsch W, Weinberg C, Taylor J. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Statistics in Medicine. 1994;13:153–162. doi: 10.1002/sim.4780130206.
  • 8. Spielman R, McGinnis R, Ewens W. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52:506–516.
  • 9. Weinberg CR, Wilcox AJ, Lie RT. A log-linear approach to case-parent triad data: Assessing effects of disease genes that act directly or through maternal effects, and may be subject to parental imprinting. Am J Hum Genet. 1998;62:969–978. doi: 10.1086/301802.
  • 10. Wilcox A, Weinberg C, Lie R. Distinguishing the effects of maternal and offspring genes through studies of "case-parent triads". Am J Epidemiol. 1998;148:893–901. doi: 10.1093/oxfordjournals.aje.a009715.
  • 11. Weinberg C. Methods for detection of parent-of-origin effects in genetic studies of case-parent triads. American Journal of Human Genetics. 1999;65:229–235. doi: 10.1086/302466.
  • 12. Nagelkerke N, Hoebee B, Teunis P, Kimman T. Combining the transmission disequilibrium test and case-control methodology using generalized logistic regression. Eur J Hum Genet. 2004;12:964–970. doi: 10.1038/sj.ejhg.5201255.
  • 13. Weinberg C, Umbach D. A hybrid design for studying genetic influences on risk of diseases with onset in early life. American Journal of Human Genetics. 2005;77:627–636. doi: 10.1086/496900.
