Abstract
Geneticists and epidemiologists often observe that certain hereditary disorders cooccur in individual patients significantly more (or significantly less) frequently than expected, suggesting there is a genetic variation that predisposes its bearer to multiple disorders, or that protects against some disorders while predisposing to others. We suggest that, by using a large number of phenotypic observations about multiple disorders and an appropriate statistical model, we can infer genetic overlaps between phenotypes. Our proof-of-concept analysis of 1.5 million patient records and 161 disorders indicates that disease phenotypes form a highly connected network of strong pairwise correlations. Our modeling approach, under appropriate assumptions, allows us to estimate from these correlations the size of putative genetic overlaps. For example, we suggest that autism, bipolar disorder, and schizophrenia share significant genetic overlaps. Our disease network hypothesis can be immediately exploited in the design of genetic mapping approaches that involve joint linkage or association analyses of multiple seemingly disparate phenotypes.
Keywords: autism, bipolar disorder, harmful genetic polymorphisms, schizophrenia, shared genes
In “simple” disorders with proven Mendelian inheritance, a single-nucleotide aberration in the genome can cause one disease while protecting against another; one nucleotide substitution can also manifest in multiple physiological systems. For example, a single-nucleotide substitution in a human β-globin gene (HBB) triggers in its bearer a drastic change of erythrocyte shape (sickle-cell anemia) but protects against invasion of the protozoan parasite (Plasmodium falciparum) that causes malaria. When designing a mathematical model that describes pairs of disease phenotypes, we can think of sickle-cell anemia and malaria as competing for the same nucleotide site in the human genome. In another example, a single-nucleotide polymorphism in the CFTR gene profoundly affects the bearer's digestive, reproductive, and respiratory systems and causes excessive loss of salt through sweating (a group of symptoms collectively known as cystic fibrosis). By analogy with our metaphor of phenotype competition for genes, we will say that these disparate phenotypic manifestations in cystic fibrosis cooperatively share the same nucleotide substitution (i.e., the substitution has a pleiotropic effect).
Does this logic of competitive or pleiotropic genetic polymorphisms extend to human disorders that have more complex (and largely unknown) genetics? We think it does. Here, we suggest a method for assessing such overlaps between complex phenotypes (a reframed comorbidity analysis), then demonstrate its application to a set of 161 disorders described in 1.5 million patient records from a clinical database at the Columbia University Medical Center.
We selected disorders that represent a broad spectrum of maladies, from common to rare, affecting diverse physiological systems, yet we also placed special emphasis on neurological phenotypes [Fig. 1 gives a complete account of phenotypes that we analyzed; we provide information on symptoms and patient statistics for each phenotype in supporting information (SI)]. Our choice of phenotypes reflects a view that the etiology of every human malady, even one as recently encountered and as clearly linked to an environmental cause as AIDS, includes a significant component of hereditary predisposition and/or resistance. For example, a series of recent studies showed that a significant proportion of people are partially or completely resistant to HIV infection, whereas other people have a predisposition to rapid AIDS progression once HIV infection has occurred (1).
Results and Discussion
Outline of Our Approach.
We developed a probabilistic model linking the unobserved genetic variation in human genomes to the observed succession of healthy and disease phenotypes in individual humans. Before formulating the model's assumptions and explaining details of its implementation, let us briefly outline its main components (Fig. 2B). In our description, when we consider a pair of disorders (D1 and D2), we model an individual's phenotype at or before a certain age (Fig. 2C). A person is born with (or without) a set of disease-predisposing variations (represented by random variables k1, k2, and k12 in Fig. 2B); these variations determine the probability that the person eventually will be diagnosed with the disease (age-integrated phenotype; see Fig. 2 B and C). Given that the age-integrated phenotype involves disease Di and the individual's life span, the individual can manifest symptoms of Di at any time during his/her lifetime with probability specified by the time-of-onset function specific to this disease and possibly by the individual's ethnicity and gender (see Fig. 2 B and C). We need to take into account the time course of each disorder to “subtract” all correlations between disorder pairs that are merely due to a commonality or a difference in their onset times. Without this age adjustment, an early-onset disease, such as autism, and a late-onset disease, such as Parkinson's disease, would misleadingly appear as negatively correlated.
To implement the model, we need a set of assumptions making our computation tractable.
Assumptions.
Environmental factors (such as physiological stress, diet, lifestyle, and exposure to pathogens) affect human phenotypes through the action of numerous molecules that are produced, distributed within cells and tissues, and used according to gene-encoded scripts. Therefore, our first assumption here is that, if the same environmental effect triggers two (or more) different maladies, it typically does so through molecular mechanisms that are common between these two maladies. Although this assumption is likely to be violated for some pairs of disorders, it is a reasonable starting point for the model. This assumption allows us to develop a probabilistic model linking the unobserved genetic variation to the observed phenotypes. We take into account environmental influence via a data-derived function that approximates the classical age-of-onset distribution, the cumulative probability that a disease manifests itself before or at any given age and given that the person will eventually get the disease (see Fig. 1). Note that these disease-specific functions are highly informative on their own and would merit careful examination. For example, allergic rhinitis, breast cancer, and especially viral infections have markedly different patterns of incidence in males and females; the function shapes are notably variable across disorders.
Our second assumption is that, for each phenotype pair (D1 and D2), the whole human genome can be divided into four disjoint sets of nucleotide sites (Fig. 2A). One set (S0) comprises nucleotide sites that can affect neither of the two phenotypes, regardless of each site's state. Sites within another set (S12) can affect both phenotypes simultaneously, either via a competitive or a cooperative mechanism. The remaining two site sets have the potential of predisposing the bearer exclusively to either phenotype D1 (set S1) or D2 (set S2) (see Fig. 2A). Depending on our choice of phenotypes, some of these four site sets may be empty.
Our third assumption involves a spectrum of hypothetical mechanisms that connect genetic variation within the four sets of nucleotide sites to the disease phenotype (genetic penetrance). All mechanisms we consider here have in common an intuitive property that, the larger the number of genetic aberrations within the disease-predisposing set of sites, the more likely the disease phenotype will manifest itself. We consider here two families of genetic-penetrance functions: sharp and soft threshold. With a sharp-threshold function, the bearer must have at least τ disease-related polymorphisms (where τ is a positive integer) if she/he is eventually to manifest symptoms of the disease. With τ = 1, we obtain the classic multilocus heterogeneity model, whereas larger values of the parameter represent more complex epistatic gene interaction models (2). With a soft-threshold function (see SI), the relationship between the number of deleterious polymorphisms and the time-integrated phenotype is more complicated: the likelihood that an individual's age-integrated phenotype will include a specific disorder at the end of the individual's life is nonzero for any number of the disease-related variations in her/his genome but grows quickly with an increase in the number of deleterious variations related to the disorder (see SI for details).
Competing Models.
In our data analysis, for every pair of disorders, we choose one of three competing models (hypotheses): (i) the disorders are uncorrelated (which we interpret as a lack of a genetic overlap), (ii) the disorders are significantly negatively correlated (which we interpret as a genetic overlap via competition), or (iii) the disorders are significantly positively correlated (genetic overlap via cooperation; see Methods). The cooperative model is slightly more general than the standard genetic pleiotropy model. The two models are identical when two phenotypes are caused by exactly the same set of genetic polymorphisms; however, unlike the pleiotropy model, the cooperation model allows each phenotype (in addition to the shared polymorphisms) to be associated with a pool of genetic polymorphisms that does not affect the other phenotype. The independence model is a special case of both overlap models, so we can represent results of our analysis with the two log-likelihood-ratio statistics (Λ), comparing both overlap models to the model of independence (Figs. 2 H–J and 3). Furthermore, our parametric model provides an estimate of the size of the hypothetical genetic overlap (Fig. 2E and SI).
Correlations and Overlaps.
Our analysis of genetic overlap reveals numerous correlations among disorders, many of which are well established (e.g., see refs. 3–8), whereas other correlations appear previously undescribed.
It is not surprising that autism is strongly correlated with pervasive developmental disorders and fragile X syndrome (Fig. 2H), because autism is included (along with several other disorders) in the formal definition of pervasive developmental disorders, and fragile X syndrome has autism as one of its manifestations. However, it is less obvious why autism, which typically manifests before the affected child is 3 years old, has a strong positive correlation with a number of neurological disorders, some of which have a late-age onset (ordered by decreasing statistical significance; see also Fig. 2H): attention deficit, epilepsy, cerebral palsy, depression, schizophrenia, bipolar disorder, neurofibromatosis, Parkinson's disease, and migraine. Our estimated significant overlap between autism and tuberculosis may indicate that both diseases are associated with genetic changes weakening the immune system.
Another group of phenotypes that overlaps with the most highly prevalent neurological disorders comprises various bacterial, viral, and protozoan infections. In the case of autism, the most strongly positively correlated phenotypes of this group include viral infections of the central nervous system (such as viral encephalitis), tuberculosis, viral infections of other systems, and staphylococcal and Helicobacter pylori infections (phenotypes are sorted here by decreasing significance of correlation). The third group of phenotypes, comorbid with autism and with many highly prevalent neurological disorders, includes allergies and autoimmune disorders, such as allergic rhinitis. Schizophrenia and bipolar disorder are positively correlated with many additional disorders of this group, including diabetes, rheumatoid arthritis, and psoriasis (see Figs. 2 I and J and 3). The fourth group of autism-correlated disorders includes both benign and malignant neoplasms. Autism is also comorbid with Kawasaki's disease [a relatively rare phenotype whose etiology is ill-understood and that probably relies on an unknown pathogen; similar to autism, it affects male individuals significantly more frequently than females (see SI)], acanthosis nigricans, and aberrations of carbohydrate metabolism. Similar groups of highly correlated phenotypes are visible in our analyses of schizophrenia and bipolar disorder (see Fig. 2 I and J), with the important addition that female breast cancer shows strong negative correlations with both schizophrenia and bipolar disorder (unlike other malignancies, including male breast cancer; see SI). This negative correlation is highly significant even when only female patients are analyzed (see SI). 
The negative correlation may indicate in the framework of our model a competition for genes in the cell cycle/cell death regulation: both schizophrenia and bipolar disorder under this explanation are associated with genetic polymorphisms that increase the probability of abnormal cell death in some tissues, whereas breast cancer is linked to (only partially known) genetic variation leading to an increased probability of abnormal cell proliferation. Although the competitive genetic overlap between bipolar disorder and female breast cancer has not been reported, there is recent indirect evidence that supports it: a well established breast cancer-treatment drug, tamoxifen, was recently discovered to be effective in treating symptoms of bipolar disorder (9).
We show a composite representation of correlations (interpreted as genetic overlaps) for autism, bipolar disorder, and schizophrenia (yellow spheres) and the remaining 158 disorders (blue spheres) in Fig. 3. All blue spheres have one, two, or three incoming arcs, indicating they correlate significantly with one, two, or all three of the yellow-sphere disorders. For example, acanthosis nigricans and cerebral palsy are positively correlated with every member of the yellow-sphere disease triplet. Female breast cancer is significantly negatively correlated with bipolar disorder and schizophrenia (blue arcs) but shows no significant correlation with autism. Neurofibromatosis is significantly positively correlated with autism and bipolar disorder (red arcs) but not with schizophrenia. Aortic aneurysm is negatively correlated with schizophrenia but is independent of autism and bipolar disorder.
Proportion of Autism-Predisposing Polymorphisms That Also Contribute to Schizophrenia or Bipolar Disorder.
Because our model estimates the mean number of disease-related polymorphisms (per randomly sampled human genome) both in disease-specific site sets and in the genetic overlap between disease pairs, we can use these estimates to assess the proportion of autism-predisposing variation that is shared with bipolar disorder and with schizophrenia (see Fig. 2 F and G).
Despite the fact that the absolute estimates of the expected number of disease-related polymorphisms are different under different models of genetic penetrance (see the model description and tables in SI), the proportion of the polymorphisms that autism shares with bipolar disorder and schizophrenia is consistent across different models (Fig. 2 F and G): we estimate that ≈20–60% of autism-predisposing variations also predispose the bearer to bipolar disorder, and 20–75% of autism-predisposing variations also predispose the bearer to schizophrenia. It is therefore extremely likely that there is a three-way positive correlation among autism, bipolar disorder, and schizophrenia, a correlation that probably arises from a genetic variation that predisposes to all three disorders.
Corollaries.
Our analysis suggests that, instead of following the familiar model of “unique malady–unique (disjoint with others) set of broken genes” applicable to most Mendelian disorders (Fig. 2D), most complex phenotypes are probably rooted in genetic variation that is significantly shared (in either a competitive or cooperative manner) by multiple disease phenotypes (Fig. 2E).
Phenotypes of non-Mendelian disorders are often defined with a considerable degree of fuzziness, especially those that are neurological: it is not uncommon to define a neuropsychiatric disease phenotype as comprising, for example, at least five of a list of 10 symptoms (4). This fuzziness arises because, in many cases, the observed disease is a heterogeneous collection of multiple maladies that have partially similar symptoms and potentially different genetic causes. However, these genetically heterogeneous maladies are combined because of the history of disease identification and the incompleteness of our knowledge about the disease causes.
Our interpretation of genetic overlap among pairs of disorders does not exclude the possibility that one disorder can cause the other. For example, it is possible that comorbidity of autism (or schizophrenia, or bipolar disorder) with infectious and autoimmune maladies indicates that the neurodevelopmental disorder can be triggered by different developmental insults, including viral or bacterial infection, or an autoimmune disease launched by a benign allergen. Another possibility is that the same molecular features that make a child more susceptible to infection or to autoimmune attack have a pleiotropic effect on brain development and function.
Our analysis has immediate practical implications for the design of gene-mapping studies that examine complex phenotypes. Imagine that we can study a set of families (pedigrees) whose members are affected by multiple disorders (for example, autism, bipolar disorder, schizophrenia, diabetes, and psoriasis). If we have reasons to believe that these disorders overlap in terms of disease-predisposing genetic variation, to extract maximum information from available data, we might be able to design genetic linkage or association strategies that analyze multiple complex disorders jointly. Furthermore, by selecting different sets of seemingly disparate disorders, we might be able to examine systematically the genetic background of a wide spectrum of complex phenotypes. In addition, we hope that the estimated disease overlaps will be useful in defining sharper (more specific) phenotypes that are also more genetically homogeneous.
Methods
Data.
Our input data comprise anonymized statistics about patients in the Columbia University Medical Center clinical database (1.5 million records). This database was designed for pragmatic purposes (such as billing) rather than for basic research; thus, in this study, we used a predefined data representation not specifically optimized for our purposes (see SI). With respect to the two diseases, D1 and D2, the ith patient (h in the notation h_i stands for human) is described with the following pentaplet of variables:
h_i = {t_i, g_i, e_i, T_{1,i}, T_{2,i}}, i = 1, …, N,
where N is the total number of patients in the database, t_i is the patient's age, g_i is the patient's gender, e_i is his/her ethnicity, and T_{1,i} and T_{2,i} are the patient's ages at the time she/he was first diagnosed with diseases D1 and D2, respectively. For the sake of encoding simplicity, we set T_{k,i} to infinity (∞) for patients who were never diagnosed with disease Dk.
The ethnicity, e_i, attributed to the ith patient in our data can have one of the following codes: A, B, D, E, H, I, M, N, L, O, P, U, W, or X. A table in SI provides the key to these codes.
Variable g_i takes values F (female), M (male), O (other, usually indicating an ambiguity/difficulty in gender assignment), and U (unknown, usually indicating missing data).
Models.
Let us focus on two human diseases, D1 and D2, each of which has a distinct hereditary component. We can divide the whole genome into four disjoint sets of nucleotide sites, S0, S1, S2, and S12 (see Fig. 2A). The first set, S0, comprises genomic sites that have no potential to contribute to either of the two diseases. The second and the third sets of sites, S1 and S2, include genomic loci that, when they harbor deleterious polymorphisms, predispose the polymorphisms' bearers to D1 and D2, respectively (see Fig. 2A). Finally, the fourth set of sites, S12, involves portions of the genome that predispose an individual who bears mutations in them to both D1 and D2 simultaneously. Although here we focus on point mutations, our approach can be extended to other types of genetic polymorphism, such as insertions, deletions, inversions, and translocations.
Phenotypes.
We define the following four phenotypes with respect to diseases D1 and D2: Φ1, Φ2, Φ12, and Φ0 correspond to “affected by disease D1 but not by disease D2,” “affected by disease D2 but not by disease D1,” “affected by both diseases D1 and D2,” and “affected by neither disease D1 nor D2,” respectively.
Genotypes: Probability of Gi = {ki,1, ki,2, ki,12}.
We denote the total number of deleterious polymorphisms that fall into S1, S2, and S12 for individual i with a triplet of random variables {ki,1, ki,2, ki,12}. In our model, these three variables completely describe the individual's genotype, Gi, with respect to diseases D1 and D2. We assume that the random variables ki,1, ki,2, and ki,12 independently follow Poisson distributions (10) with rates ρ1, ρ2, and ρ12, respectively. If a disease-related nucleotide site set Sk is small, as in the case of sickle-cell anemia (just two sites), we can assume that the observed number of disease-relevant polymorphisms per genome follows a binomial distribution instead of a Poisson distribution (see SI).
Probability of φ(∞) given Gi = {ki,1, ki,2, ki,12} (penetrance function).
We use the notation φ(∞) to denote an individual's phenotype with respect to diseases D1 and D2 at the end of his/her life (eventual or age-integrated phenotype; see Fig. 2B). We consider here two definitions of the penetrance function. The first definition postulates that disease D1 manifests itself only if the number of deleterious variations in S1 and S12, ki,1 + ki,12, is equal to or greater than a threshold, τ1 (similarly, D2 develops eventually if ki,2 + ki,12 ≥ τ2). The second definition postulates that the threshold value itself is a random variable, so that the probability of developing a disease gradually increases with the number of deleterious polymorphisms (see SI for details).
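The genotype and sharp-threshold penetrance definitions above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses the cooperation scenario (shared polymorphisms count toward both diseases), the function names `sample_poisson`, `eventual_phenotype`, and `simulate_population` are hypothetical, and the rates passed in are illustrative values, not estimates from this study.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method (fine for small rates)."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def eventual_phenotype(k1, k2, k12, tau1, tau2):
    """Sharp-threshold penetrance: D1 needs k1 + k12 >= tau1, D2 needs k2 + k12 >= tau2."""
    has_d1 = k1 + k12 >= tau1
    has_d2 = k2 + k12 >= tau2
    if has_d1 and has_d2:
        return "Phi12"
    if has_d1:
        return "Phi1"
    if has_d2:
        return "Phi2"
    return "Phi0"


def simulate_population(n, rho1, rho2, rho12, tau1=1, tau2=1, seed=0):
    """Return the simulated frequency of each eventual phenotype among n individuals."""
    rng = random.Random(seed)
    counts = {"Phi0": 0, "Phi1": 0, "Phi2": 0, "Phi12": 0}
    for _ in range(n):
        k1 = sample_poisson(rho1, rng)    # polymorphisms in S1
        k2 = sample_poisson(rho2, rng)    # polymorphisms in S2
        k12 = sample_poisson(rho12, rng)  # polymorphisms in the shared set S12
        counts[eventual_phenotype(k1, k2, k12, tau1, tau2)] += 1
    return {label: count / n for label, count in counts.items()}
```

With a nonzero shared rate, the simulated frequency of Φ12 should exceed the product of the two marginal disease frequencies, which is the positive correlation that the cooperation model is meant to capture.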
Probability of φ(t) given φ(∞).
We use the notation φ(t) to indicate an individual's phenotype at or before age t. Let T1 and T2 be the ages at onset (or first diagnosis) of diseases D1 and D2, respectively. φ(t) = Φ1 is then equivalent to {T1 ≤ t, T2 > t}. Thus, the likelihood of the two-disease phenotype status can be studied using the joint failure time model (11) for T1 and T2, based on the genetic factors and covariates such as age and gender. We then define the following conditional distributions for T1 and T2,
Fk(tk | e, g; Θ) = P(Tk > tk | Tk < ∞, e, g; Θ),
where k = 1, 2. Note that we can estimate Fk(tk | e, g; Θ) directly from our data (estimates of 1 − Fk(tk | e, g; Θ) are shown in Fig. 1). Finally, we define the probability of φ(t) given φ(∞) in terms of probabilities Fk(tk | e, g; Θ), as shown in Table 1.
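As a rough illustration of estimating such a function from data, the sketch below computes, among patients who were diagnosed with a disease, the fraction whose first diagnosis came after age t. This is an assumption-laden simplification and not the estimator used in this study: it ignores censoring (patients who will be diagnosed later than their current age), it does not stratify by ethnicity or gender, and the name `empirical_not_yet_diagnosed` is hypothetical.

```python
from bisect import bisect_right


def empirical_not_yet_diagnosed(onset_ages):
    """Build F(t): the fraction of diagnosed patients first diagnosed after age t."""
    ordered = sorted(onset_ages)
    total = len(ordered)
    if total == 0:
        raise ValueError("at least one age at first diagnosis is required")

    def fraction_after(t):
        diagnosed_by_t = bisect_right(ordered, t)  # patients with onset age <= t
        return 1.0 - diagnosed_by_t / total

    return fraction_after
```

In practice one such function would be built per disease and, if the data allow, per ethnicity and gender group, so that it can be plugged into the entries of Table 1.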
Table 1. P(φ(t)|φ(∞), e, g; Θ). Columns give the eventual phenotype φ(∞); rows give the phenotype φ(t) at age t. Fk abbreviates Fk(t|e, g; Θ).

| φ(t) \ φ(∞) | Φ0 | Φ1 | Φ2 | Φ12 |
|---|---|---|---|---|
| Φ0 | 1 | F1 | F2 | F1 F2 |
| Φ1 | 0 | 1 − F1 | 0 | (1 − F1) F2 |
| Φ2 | 0 | 0 | 1 − F2 | F1 (1 − F2) |
| Φ12 | 0 | 0 | 0 | (1 − F1)(1 − F2) |
| Total | 1 | 1 | 1 | 1 |
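Table 1 can be written as a small lookup function, which makes it easy to check that each column is a proper probability distribution. This is a minimal sketch: `f1` and `f2` stand for F1(t|e, g; Θ) and F2(t|e, g; Θ) already evaluated for one patient, and the function name and dictionary layout are hypothetical choices made for illustration.

```python
def phenotype_at_age_given_eventual(f1, f2):
    """Return table[observed][eventual] = P(phi(t) = observed | phi(inf) = eventual).

    f1 and f2 are the probabilities of not yet being diagnosed with D1 and D2
    by age t, given that the disease eventually develops.
    """
    return {
        "Phi0": {"Phi0": 1.0, "Phi1": f1, "Phi2": f2, "Phi12": f1 * f2},
        "Phi1": {"Phi0": 0.0, "Phi1": 1.0 - f1, "Phi2": 0.0, "Phi12": (1.0 - f1) * f2},
        "Phi2": {"Phi0": 0.0, "Phi1": 0.0, "Phi2": 1.0 - f2, "Phi12": f1 * (1.0 - f2)},
        "Phi12": {"Phi0": 0.0, "Phi1": 0.0, "Phi2": 0.0, "Phi12": (1.0 - f1) * (1.0 - f2)},
    }


def column_total(table, eventual):
    """Sum one column of the table; it should equal 1 for every eventual phenotype."""
    return sum(row[eventual] for row in table.values())
```

For any f1 and f2 between 0 and 1, `column_total` returns 1 (up to rounding) for each of the four eventual phenotypes, matching the "Total" row of the table.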
Two genetic overlap models: Cooperation and competition.
In the cooperation (generalized pleiotropy) model, the overlap genes can simultaneously contribute to both diseases, whereas in the competition model, the overlapped genes can contribute to only one of the diseases (the choice is made stochastically with probability specific to each pair of diseases; see SI).
Likelihood and Likelihood Ratio Test.
To compute a likelihood value for data representing the ith patient, we need to sum the probability of the observed phenotype [given ki,1, ki,2, ki,12, and φ(∞)] over all admissible values of ki,1, ki,2, ki,12 (ki,j = 0, 1, …, ∞), and φ(∞) (see SI for description of an efficient computation of this value). If we assume that the vector of parameters, Θ, is the same for all values of e (ethnicity) and g (gender), then the likelihood function is just a product (over all patients) of probabilities of the observed phenotypes given common parameter values (see SI). (Alternatively, we subdivide the data by ethnicity and gender and estimate a separate set of parameters for each data subset.)
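The summation over genotypes can be sketched by brute force for the sharp-threshold penetrance function under the cooperation scenario. The code below computes only the distribution of the eventual phenotype φ(∞); the per-patient likelihood would then weight these probabilities by the corresponding column entries of Table 1. It is an illustration under stated assumptions, not the efficient computation described in SI: the infinite sums are truncated at a hypothetical cutoff `k_max`, and the function names are invented for this example.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events for a Poisson variable with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def eventual_phenotype_distribution(rho1, rho2, rho12, tau1, tau2, k_max=25):
    """Sum genotype probabilities into P(phi(inf)) for Phi0, Phi1, Phi2, and Phi12.

    Assumes independent Poisson counts in S1, S2, and S12, and a sharp threshold
    in which shared polymorphisms count toward both diseases (cooperation).
    The sums are truncated at k_max, which is adequate only for small rates.
    """
    probs = {"Phi0": 0.0, "Phi1": 0.0, "Phi2": 0.0, "Phi12": 0.0}
    for k1 in range(k_max + 1):
        p1 = poisson_pmf(k1, rho1)
        for k2 in range(k_max + 1):
            p2 = poisson_pmf(k2, rho2)
            for k12 in range(k_max + 1):
                weight = p1 * p2 * poisson_pmf(k12, rho12)
                has_d1 = k1 + k12 >= tau1
                has_d2 = k2 + k12 >= tau2
                if has_d1 and has_d2:
                    probs["Phi12"] += weight
                elif has_d1:
                    probs["Phi1"] += weight
                elif has_d2:
                    probs["Phi2"] += weight
                else:
                    probs["Phi0"] += weight
    return probs
```

Setting `rho12` to zero makes the probability of Φ12 equal to the product of the two marginal disease probabilities, which is the independence model nested inside the overlap model.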
At its heart, our analysis is a model selection problem. First, we have two versions of the same model where two disorders have an arbitrarily large genetic overlap (via either a cooperation or competition scenario). Second, we have a simpler model that is nested in both former models, where the two disorders are genetically independent. Put differently, the model where two disorders, D1 and D2, are genetically independent (the genetic overlap, nucleotide set S12, is empty [ρ12 = 0]) is a special case of the two models where the same two disorders are either genetically overlapping or independent (S12 = ∅ [ρ12 = 0] or S12 ≠ ∅ [ρ12 ≠ 0]). Therefore, we can use a standard log-likelihood ratio statistic, Λ, for nested models. This Λ statistic asymptotically (as the sample size grows) follows a χ2 distribution with the number of degrees of freedom equal to the difference in the number of parameters between the two models (1, in our case) (12).
In the presence of a statistical signal, we can distinguish among the three models (independence, cooperation, and competition) by computing two statistics: Λcooperation and Λcompetition.
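The nested-model test can be sketched as follows. The code assumes the conventional definition of the statistic as twice the difference in maximized log-likelihoods (the exact definition used in this study is given in SI) and relies on the identity that the upper tail of a χ2 distribution with 1 degree of freedom equals erfc(√(x/2)). The function names are hypothetical.

```python
import math


def likelihood_ratio_statistic(loglik_overlap, loglik_independence):
    """Twice the log-likelihood gain of an overlap model over the nested independence model."""
    return 2.0 * (loglik_overlap - loglik_independence)


def chi2_one_df_p_value(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if statistic <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(statistic / 2.0))
```

For a given disease pair, one would compute the statistic once with the maximized cooperation log-likelihood and once with the maximized competition log-likelihood, each against the same independence log-likelihood, and compare the two resulting values.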
Availability.
Detailed information on estimated disease overlaps for all pairs of disorders mentioned in this study is available as SI.
Acknowledgments
We thank Murat Çokol, Lyn Dupré Oppenheim, Ivan Iossifov, Igor Feldman, Richard Friedman, George Hripcsak, Marianthi Markatou, Rita Rzhetsky, and Chani Weinreb for very helpful comments on the earlier version of this manuscript. This work was supported by National Institutes of Health Grants GM61372 and U54 CA121852–01A1 (to A.R.) and GM070789 (to T.Z.), National Science Foundation Grants 0438291 and 0121687 (to A.R.) and 0532231 (to T.Z.), the Cure Autism Now Foundation (A.R.), and Defense Advanced Research Projects Agency Grant FA8750-04-2-0123 (to A.R.).
Footnotes
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0704820104/DC1.
References
- 1. O'Brien SJ, Nelson GW. Nat Genet. 2004;36:565–574. doi: 10.1038/ng1369.
- 2. Risch N. Am J Hum Genet. 1990;46:222–228, 229–241.
- 3. Richardson AJ, Ross MA. Prostaglandins Leukot Essent Fatty Acids. 2000;63:1–9. doi: 10.1054/plef.2000.0184.
- 4. Sutker PB, Adams HE. Comprehensive Handbook of Psychopathology. 3rd Ed. New York: Kluwer/Plenum; 2001.
- 5. Wiznitzer M. J Child Neurol. 2004;19:675–679. doi: 10.1177/08830738040190090701.
- 6. Stahlberg O, Soderstrom H, Rastam M, Gillberg C. J Neural Transm. 2004;111:891–902. doi: 10.1007/s00702-004-0115-1.
- 7. Cohen D, Pichard N, Tordjman S, Baumann C, Burglen L, Excoffier E, Lazar G, Mazet P, Pinquier C, Verloes A, Heron D. J Autism Dev Disord. 2005;35:103–116. doi: 10.1007/s10803-004-1038-2.
- 8. Newcomer JW. J Clin Psychiatry. 2006;67(Suppl 9):25–30, discussion 36–42.
- 9. Kulkarni J, Garland KA, Scaffidi A, Headey B, Anderson R, de Castella A, Fitzgerald P, Davis SR. Psychoneuroendocrinology. 2006;31:543–547. doi: 10.1016/j.psyneuen.2005.11.001.
- 10. Sawyer SA, Hartl DL. Genetics. 1992;132:1161–1176. doi: 10.1093/genetics/132.4.1161.
- 11. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd Ed. Hoboken, NJ: Wiley; 2002.
- 12. Neyman J, Pearson ES. Biometrika. 1928;20A:175–240, 263–294.