Skip to main content
Proceedings of the National Academy of Sciences of the United States of America logoLink to Proceedings of the National Academy of Sciences of the United States of America
. 2004 Oct 7;101(42):15148–15153. doi: 10.1073/pnas.0404315101

Molecular triangulation: Bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease

Michael Krauthammer a,b,c, Charles A Kaufmann d, T Conrad Gilliam b,d,e, Andrey Rzhetsky a,b,f,g
PMCID: PMC523448  PMID: 15471992

Abstract

A major challenge in human genetics is identifying the molecular basis of common heritable disorders. In contrast to rare single-gene diseases, multifactorial disorders are thought to arise from the combined effect of multiple gene variants, such that any single variant may have only a modest effect on disease susceptibility. We present a method to identify genes that may harbor a significant proportion of the genetic variation that predisposes individuals to a given multifactorial disorder. First, we perform an automated literature analysis that predicts physical interactions (edges) among candidate disease genes (seed nodes, selected on the basis of prior information) and other molecular entities. We derive models of molecular networks from this analysis and map the seed nodes to them. We then compute the graph-theoretic distance (the minimum number of edges that must be traversed) between the seed nodes and all other nodes in the network. We assume that nodes that are found in close proximity to multiple seed nodes are the best disease-related candidate genes. To evaluate this approach, we selected four seed genes, each with a proven role in Alzheimer's disease (AD). The method performed well in predicting additional network nodes that match AD gene candidates identified manually by an expert. We also show that the method prioritizes among the seed nodes themselves, rejecting false-positive seeds that are derived from (noisy) whole-genome genetic-linkage scans. We propose that this strategy will provide a valuable means to bridge genetic and genomic knowledge in the search for genetic determinants of multifactorial disorders.


Canonical triangulation is the process of determining the absolute position of an object in Cartesian space, based on relative signals from reference points whose absolute position is known (1). We have attempted to extend this principle to help us predict unknown genetic determinants (i.e., gene variants) for common heritable disorders when prior information implicates several, or many, possible candidates. The idea is to chart the universe of all known molecular interactions and then to establish species-specific molecular networks. These networks can then be used to predict the subnetwork related to the disease of interest from its nodes' proximity to known, or implicated, disease genes, the latter being seed nodes. To demonstrate and evaluate the approach, we have targeted the study of a common neurodegenerative disorder, Alzheimer's disease (AD). Genetic variants of apolipoprotein E (APOE) gene are known to account for a substantial fraction of overall genetic susceptibility to AD. Three additional genes [amyloid precursor protein (APP), presenilin 1 (PSEN1), and presenilin 2 (PSEN2)] have been shown to harbor unambiguous disease-causing mutations in families manifesting rare autosomal dominant forms of AD. Moreover, all four genes may converge in a common physiological role: in the maintenance, storage, or removal of the aberrant form of APP from disease-related amyloid plaques. Because the combined effect of mutations in these four genes accounts for less than one-half of overall genetic susceptibility to AD, and because linkage studies of whole human genome suggest that several additional chromosomal loci harbor disease-related gene variants, we have sought to identify new AD candidate genes by combining the predictions of molecular-interaction data with those of whole-genome genetic-linkage studies.

To address this issue, we considered the following problem. Imagine a large molecular network in which a subset of nodes, as is pointed to by a prior evidence, is relevant to the disorder of interest. In addition, we know that our data are noisy; that is, some or all implicated genes are implicated mistakenly. Our task is to identify the correctly implicated genes and to predict additional genes that are likely to harbor genetic variation that either predisposes individuals to, or protects them against, the onset of disease.

We propose a solution to this problem using a method that we refer to as molecular triangulation. In this method, we use available data from multiple sources, such as genetic-linkage, genetic-association, or gene-expression data, to identify candidate genes implicated in the etiology of a given heritable disorder. We also identify other molecular entities (such as small molecules) cited in the literature that demonstrate physical or chemical interactions with the seed nodes or with one another. We then search the resulting molecular networks for subnetworks that may harbor disease-relevant genes by identifying genes that are graph-theoretically close to multiple seed nodes (Fig. 1). (In graph theory, one can define a distance between a pair of nodes in a graph by the minimum number of edges that we must traverse to get from one node to another. We treat a molecular network as a graph in which genes or other molecules are nodes, and physical interactions between molecules are edges.)

Fig. 1.

Fig. 1.

The molecular-triangulation method. The analysis starts with a possibly noisy set of seed genes that have been identified as bearing information about the molecular subsystem that malfunctions in the disorder of interest. If the seed genes, shown as red nodes in the plot, are randomly distributed through a large molecular pathway (A), they tend to be far from one another. Therefore, few pathway nodes (shown in blue) lie in the immediate neighborhood of more than one seed node. On the other hand, if the seed genes cluster in a compact molecular neighborhood, thus indicating the faulty subsystem (B), they tend to have overlapping neighborhoods; those gene nodes that are in the immediate network neighborhood of more than one seed node constitute candidates for future mutation analysis in affected and healthy humans.

The triangulation-like method that we propose is based on two assumptions. First, we assume that erroneously identified seed nodes are uniformly distributed within a large network. Second, we assume that the unknown subnetwork relevant to the disorder of interest is compact (that is, it is unlikely to comprise numerous disconnected islands within the large network) and small with respect to the whole network. Consequently, a subset of seed genes that tends to cluster within the large network is likely to indicate the relevant subnetwork, and the additional candidate genes are those network nodes that are neighbors of the clustered seed nodes.

To apply the molecular-triangulation method to real data, we need a comprehensive model of molecular networks that represents the current knowledge of the international research community. Most of that knowledge is locked in an astronomically large number of research publications. Several recent text-mining approaches have been suggested to harvest information from the literature automatically (2).h,i We have developed a computer system called geneways (3) that automatically extracts molecular-interaction information from the research literature. Using geneways to analyze a large number of full-text articles, we reconstructed species-specific molecular networks. The accuracy and completeness of our knowledge related to molecular networks are clearly important. If our network-based method for selection of candidate genes is to work well, we must have extensive knowledge of molecular networks.

We sought to test the molecular-triangulation approach with a few types of imperfect (noisy) data implicating several genes in the etiology of AD. First, we seeded the algorithm with a list of 60 AD candidate genes prepared manually by an expert in the field (C.A.K.). We studied the resulting triangulation ranks for low- and high-scoring seed genes, effectively prioritizing among the 60 expert-validated AD genes. Second, we seeded the algorithm with the four known AD genes (APP, APOE, PSEN1, and PSEN2) and then compared the triangulation-predicted candidate genes with the list of expert-validated AD genes. We determined the significance of these automated predictions by comparing them to multiple predictions generated from randomly selected seed sets of any four genes. Further, we tested the robustness of the method by adding noise (randomly selected genes) to the set of four AD genes and then measuring the decrease in the fraction of predicted candidate genes that matched the expert list. Third, we applied the method to data from a recent comprehensive whole-genome linkage scan of AD families (4). The latter study identified a number of linkage peaks between contiguous DNA markers and AD-affection status, thereby implicating numerous AD positional candidate genes. In our analysis, we first mapped these genes to nodes in our literature-derived molecular-interaction network and then assigned priorities to the positional candidate genes by our algorithm.

Methods

Text Mining to Capture Knowledge of Molecular Interactions. We used geneways to parse ≈124,000 articles in 25 scientific journals and thus compiled a comprehensive cross-species database that comprised >1,500,000 unique interactions among molecular entities. The entities (nodes in a molecular network) are proteins, genes, messenger RNAs, small molecules, and other biological substances that participate in molecular processes. We mapped entity names stored in the geneways Ver. 4.0 database to unique SwissProt (5) identifiers of human proteins,j excluding interactions that could not be linked to protein (or corresponding gene) sequences. Of the remaining interactions, we considered only those that describe direct relationships among molecular entities (such as bind and phosphorylate). (The indirect interactions, such as activate, inhibit, and regulate, can be implemented as a chain of numerous direct interactions and therefore can be misleading for identifying network neighbors.) The resulting direct-links-only human network contained 17,211 interactions among 3,111 network nodes corresponding to unique SwissProt identifiers.

AD Candidate Genes Identified from Genetic-Linkage Information. One useful source for prediction of AD candidate genes is whole-genome genetic-linkage data. Such studies identify chromosomal regions likely to harbor disease-related genetic variation, but the typically marginal statistical significance that correlates any single locus with disease status is reflected by low broad linkage peaks, each of which may encompass tens to 100 or more candidate genes, all with equal candidate stature. Thus, the vast majority of positional candidate genes are expected to be unrelated to AD.

We used the whole-genome-scan data from Blacker et al. (4), because theirs is one of the largest and most recent studies, and also because their findings are generally consistent with those reported by other independent groups. We arbitrarily selected the 10 linkage peaks that had multipoint local scores (MLSs) >1.5, and we combined the data for three AD subtypes (late, early/mixed, and total) into a single common phenotype, considering only the highest peak per region. The search for candidate genes was further limited to the region under the linkage peak that we defined by dropping 1 logarithm of odds (lod) unit interval from the linkage peak (Fig. 2). We used genetic maps and intermarker distances provided by the Center for Inherited Disease Research (CIDR). We downloaded the CIDR marker set and corresponding gender-averaged distances from the CIDR web site (www.cidr.jhmi.edu/download/CIDRmarkers.txt). We used resources from the National Center for Biotechnological Information map-viewer resource (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens) to calculate the coordinates between the CIDR marker positions (in centimorgans) and the absolute genomic positions (in base pairs). This step was necessary for locating positional candidate genes whose absolute genomic position is known. By this process, we identified 1,476 gene loci. To map the corresponding genes to our human interaction network, we annotated the loci with their respective SwissProt identifiers. This mapping turned out to be possible for only 657 of the 1,476 loci, 157 (23.9%) of which we were then able to map to nodes of our interaction network. We then prioritized the candidate genes with the molecular-triangulation methodology using MLS scores corresponding to the highest points of the respective linkage peaks.

Fig. 2.

Fig. 2.

Determination of the 1-lod unit interval. We imported the published MLS AD linkage information from Blacker et al.'s (4) study into a graphical package (Microsoft visio) so that we could draw the boundaries of the 1-lod interval accurately. In this example, we subtracted one MLS from the 70-cM peak of chromosome 19 to determine a 1-lod interval between 63 and 73 cM. [Linkage peak reproduced with permission ref. 4 (Copyright 2003, Oxford University Press).]

AD Expert Data. Our AD expert provided a manually curated list of high-probability AD candidate genes. The list consisted of 85 candidates, with their chromosomal location and predicted relationship to AD. We were able to map 60 of these genes (70.6%) to nodes of our molecular-interaction network. This list of the 60 genes included the four AD seed genes selected for this study: APP, APOE, PSEN1, and PSEN2. (The full list is shown in Table 4, which is published as supporting information on the PNAS web site.)

Prioritization of Gene Candidates with Molecular Triangulation. In applying the molecular-triangulation technique, we assume the presence of a large molecular network that encompasses multiple genes that harbor disease-related genetic variation and that is available in a computer-accessible form. Furthermore, we assume that we have prior evidence implicating a subset of disease-specific seed nodes in the network. We start by assigning a primary-evidence value to every seed node (for example, the MLS scores corresponding to the linkage peak associated with a set of positional candidate genes or an equal value to each of the four known AD genes); a higher primary-evidence value indicates a higher probability of being implicated in the disorder. For each network node (including the seed nodes), we next compute a secondary-evidence value, which combines a given node's multiple primary-evidence values into a single number. Using the analogy of an electric potential, we let each seed node project its evidence value to its immediate neighbor nodes, such that the secondary-evidence value decays with the distance from the seed node. More precisely, we compute the secondary-evidence value in the following way:

graphic file with name M1.gif

where E(u) is the secondary evidence for node u, Ep(v) is the primary evidence for seed node v, B is the set of all seed nodes, duv is the distance between nodes u and v,k and finally, f is a distance-dependent decay function. The function penalizes a large distance between nodes u and v, lowering the secondary evidence for node u. We determined that different shapes of the decay function (sigmoid, linear) lead to comparable results for the secondary evidence (data not shown). It is therefore sufficient to use the parameterless decay function

graphic file with name M2.gif

To understand the method, we can think of the seed nodes as emanating their primary evidence to their neighbor nodes, such that near neighbors are assigned a secondary-evidence value higher than that conferred on distant neighbors, the neighbor-liness being defined in graph-theoretic terms. Further, the secondary-evidence values generated by all seed nodes are summed for each network node, so the nodes that are close to more than one seed node receive secondary-evidence values higher than those of nodes that are close to only one seed node.

The next important question is: How do we distinguish a significant from an insignificant secondary-evidence value? Significance in this case is defined by the probability that a null model would generate the observed evidence values; if the observed secondary-evidence value of a node can be generated under a null model with probability α, the test has a P value of α. Our null model in this case corresponds to the assumption that we have selected the seed genes at random, using a uniform distribution over all network nodes.

We compute two types of P values that evaluate the raw and topology-subtracted significance of a secondary-evidence value. The two P values that we compute for each network node reflect two aspects of the hypothesis that the node is involved in a disease-related subnetwork.

A raw P value is the probability that any node in the network will attain or exceed its observed secondary-evidence value under the null model. (The null model corresponds to process of random selection of the seed nodes that have the same primary evidence as observed in the real seed nodes.) A network node with a significant (very small) raw P value can be seen as topologically close to the seed nodes, so that the secondary evidence value for this node is significantly higher than those achieved by any network node from a random equal-size set of seed nodes. A raw P value is topology-dependent: Some network nodes, as a result of their high connectedness in the network, may attain a significant raw P value under most or all random assignments of the seed nodes.

The topology-subtracted P value corresponds to the probability that a given node will obtain its observed secondary-evidence value by chance in a large number of random assignments of the seed nodes. (In other words, instead of pooling together secondary-evidence scores for different nodes in random seed-node assignments, as we do to compute raw P values, we compute the secondary-evidence value distribution for each node individually.) A network node receives a significant (small) topology-subtracted P value when there is a (compact) clustering of seed nodes around the network node in a way that is nonrandom. The calculation of topology-subtracted P values takes account of the connectedness of a network node: Highly connected nodes, which systematically achieve significant raw P values in random seed-node assignments, achieve significant topology-subtracted P values only in case of a true (compact) clustering of the seed nodes.

We therefore interpret the two P values as follows: The topology-subtracted P value indicates whether a network node participates in a specific disease-related subnetwork defined by the seed nodes, and the raw P value indicates whether the network node is central (topologically close) to this subnetwork.

In general, low values of both kinds of P values identify nodes that are good candidates for association with the disease process. Candidates with significant raw P value but insignificant topology-subtracted P value are most probably highly connected nodes that are central (significant raw P value) to multiple (insignificant topology-subtracted P value) subnetworks, and their malfunction is unlikely to be specific to a particular disorder.

Results and Discussion

We started our application of the molecular-triangulation algorithm by seeding with (i) the 60 expert-selected AD genes, (ii) the 4 AD-susceptibility genes (APOE, APP, PSEN1, and PSEN2), and (iii) 157 genes from the whole-genome linkage study. For the first and second seed sets, we assigned a uniform primary-evidence value of 1.0 to each seed node. For the third set, we assigned a primary-evidence value that corresponded to the MLS of the linkage peaks of the whole-genome AD study. The resulting secondary-evidence values, ranked according to their raw P values, and the corresponding raw and topology-subtracted P values are shown in Tables 1, 2, 3.

Table 1. Ranking of the top 50 network nodes after seeding with 60 expert AD genes.

Rank* Sec. ev. P valuer SP ID Symbol P valuets Rank Sec. ev. P valuer SP ID Symbol P valuets
1 19.92 0.0000 P12931 SRC 0.0006 26 18.12 0.0010 P98160 HSPG2 0.0026
2 19.87 0.0000 P20132 SDS 0.0020 27 18.10 0.0011 P02649 APOE 0.0000
3 19.82 0.0000 P03372 ESR1 0.0029 28 18.07 0.0011 P24941 CDK2 0.0026
4 19.72 0.0000 P05067 APP 0.0000 29 18.07 0.0011 P06241 FYN 0.0051
5 19.40 0.0000 P29323 EPHB2 0.0017 30 18.07 0.0012 P01375 TNF 0.1146
6 19.15 0.0001 Q9UMH0 TAU 0.0000 31 18.03 0.0012 P02593 CALM3 0.0269
7 18.87 0.0002 P04637 TP53 0.0486 32 18.03 0.0012 P00750 PLAT 0.0154
8 18.87 0.0002 P08047 SP1 0.0866 33 18.02 0.0013 P01266 TG 0.0000
9 18.65 0.0003 P01106 MYC 0.0743 34 17.98 0.0014 P06493 CDC2 0.0009
10 18.65 0.0003 Q09472 EP300 0.0023 35 17.95 0.0015 P16220 CREB1 0.0614
11 18.58 0.0004 P35568 IRS1 0.0000 36 17.93 0.0015 P12830 CDH1 0.0009
12 18.57 0.0004 P14210 HGF 0.0000 37 17.93 0.0015 P04629 NTRK1 0.0000
13 18.57 0.0004 P11498 PC 0.0011 38 17.87 0.0017 P00519 ABL1 0.0011
14 18.52 0.0004 P49768 PSEN1 0.0000 39 17.87 0.0017 P50391 PPYR1 0.0037
15 18.50 0.0004 P28482 MAPK1 0.0137 40 17.85 0.0018 Q13201 ECM 0.0009
16 18.48 0.0005 Q92793 CREBBP 0.0157 41 17.85 0.0018 P08100 RHO 0.0566
17 18.33 0.0007 P21912 SDHB 0.0243 42 17.82 0.0018 P02248 UBB 0.0849
18 18.33 0.0007 Q14155 P85 0.0280 43 17.82 0.0019 P20226 TBP 0.0549
19 18.32 0.0007 P40763 STAT3 0.0063 44 17.77 0.0021 P17080 RAN 0.0066
20 18.28 0.0007 P29353 SHC1 0.0086 45 17.72 0.0023 P01130 LDLR 0.0000
21 18.27 0.0008 P27361 MAPK3 0.0109 46 17.70 0.0023 O60674 JAK2 0.0066
22 18.23 0.0008 Q06124 PTPN11 0.0040 47 17.70 0.0023 P35610 SOAT1 0.0143
23 18.18 0.0009 P26599 PTBP1 0.0000 48 17.70 0.0023 P04049 RAF1 0.0511
24 18.17 0.0009 P01133 EGF 0.0794 49 17.70 0.0023 P06400 RB1 0.0831
25 18.15 0.0010 P43320 CRYBB2 0.0334 50 17.70 0.0023 P38936 CDKN1A 0.0983

Bold, nodes with primary evidence; italics, AD expert genes; P valuer, raw P value; P valuets, topology-subtracted P value; SP ID, SwissProt identifier; Sec. ev., secondary evidence value.

*

Sorted by raw P values.

Table 2. Ranking of the top 50 network nodes after seeding with 157 genes from whole genome AD linkage scan.

Rank* Sec. ev. P valuer SP ID Symbol P valuets MLS Rank Sec. ev. P valuer SP ID Symbol P valuets MLS
1 179.29 0.0002 P08047 SP1 0.1066 26 168.52 0.0031 P38936 CDKN1A 0.1343
2 179.24 0.0002 P03372 ESR1 0.1511 27 167.73 0.0036 Q09472 EP300 0.1786
3 178.27 0.0003 P12931 SRC 0.1340 28 167.53 0.0037 P42224 STAT1 0.1631
4 178.22 0.0003 P20132 SDS 0.1680 29 167.45 0.0038 P31749 AKT1 0.0671
5 176.20 0.0005 P29323 EPHB2 0.0706 30 167.31 0.0039 Q05397 PTK2 0.0554
6 174.31 0.0009 P01106 MYC 0.2409 31 166.92 0.0041 Q13510 ASAH1 0.0091
7 173.20 0.0012 P16220 CREB1 0.0237 32 166.91 0.0042 P29354 GRB2 0.0957
8 172.72 0.0013 P04637 TP53 0.4100 33 166.45 0.0045 P35568 IRS1 0.0743
9 172.31 0.0014 P43320 CRYBB2 0.0463 34 165.73 0.0051 P04049 RAF1 0.1737
10 172.20 0.0015 P00750 PLAT 0.0086 35 165.59 0.0052 P14222 PRF1 0.1111 1.8
11 171.98 0.0015 Q92793 CREBBP 0.0871 36 165.43 0.0054 P02248 UBB 0.3763
12 171.90 0.0016 P28482 MAPK1 0.0929 37 164.56 0.0062 P98160 HSPG2 0.1000
13 171.70 0.0016 P21912 SDHB 0.0746 38 164.42 0.0063 O00688 EGFR 0.3160
14 171.21 0.0018 P02593 CALM3 0.0397 7.7 39 164.21 0.0065 Q06124 PTPN11 0.1871
15 171.21 0.0018 P20226 TBP 0.0306 2.2 40 164.10 0.0067 P06400 RB1 0.3969
16 171.04 0.0019 Q14155 P85 0.1043 41 163.48 0.0074 P15498 VAV1 0.0437
17 170.79 0.0020 P08100 RHO 0.0523 42 162.77 0.0082 P05231 IL6 0.0466
18 170.31 0.0022 P01133 EGF 0.2280 43 162.68 0.0083 Q07869 PPARA 0.2077
19 169.74 0.0024 P17080 RAN 0.0020 44 162.64 0.0084 P42858 HD 0.0837
20 169.33 0.0026 P01375 TNF 0.3089 45 162.63 0.0084 P06744 GPI 0.0051
21 169.07 0.0028 P40763 STAT3 0.0766 46 162.48 0.0086 P06241 FYN 0.2217
22 169.04 0.0028 P29353 SHC1 0.0643 47 162.41 0.0087 Q92940 MADH3 0.0971
23 169.03 0.0028 P11498 PC 0.0243 48 162.23 0.0089 P05549 TFAP2A 0.1754
24 169.02 0.0028 P27361 MAPK3 0.0800 49 161.92 0.0093 O60674 JAK2 0.1151
25 168.62 0.0030 P12004 PCNA 0.2600 50 161.88 0.0094 P06401 PGR 0.0711

Bold, nodes with primary evidence; italics, AD expert genes; P valuer, raw P value; P valuets, topology-subtracted P value; SP ID, Swiss Prot identifier; Sec. ev., secondary evidence value.

*

Sorted by raw P values.

Table 3. Fifty-eight network nodes with significant raw and topology-substracted P values (≤0.05) after seeding with 157 genes from whole-genome AD linkage scan.

Rank* Sec. ev. P valuer SP ID Symbol P valuets MLS Rank Sec. ev. P valuer SP ID Symbol P valuets MLS
7 173.20 0.0012 P16220 CREB1 0.0237 124 155.81 0.0217 P45983 MAPK8 0.0109 1.8
9 172.31 0.0014 P43320 CRYBB2 0.0463 131 155.41 0.0229 P19174 PLCG1 0.0226
10 172.20 0.0015 P00750 PLAT 0.0086 136 154.74 0.0251 Q03405 PLAUR 0.0031 7.70
14 171.21 0.0018 P02593 CALM3 0.0397 7.7 137 154.73 0.0251 P08519 LPA 0.0291 2.20
15 171.21 0.0018 P20226 TBP 0.0306 2.2 143 154.29 0.0266 P06858 LPL 0.0049
19 169.74 0.0024 P17080 RAN 0.0020 147 154.06 0.0274 P10451 SPP1 0.0394
23 169.03 0.0028 P11498 PC 0.0243 151 153.91 0.0280 P45985 MAP2K4 0.0194
31 166.92 0.0041 Q13510 ASAH1 0.0091 154 153.41 0.0299 Q05586 GRIN1 0.0280
41 163.48 0.0074 P15498 VAV1 0.0437 156 153.14 0.0310 P04629 NTRK1 0.0414
42 162.77 0.0082 P05231 IL6 0.0466 162 153.03 0.0315 P26651 ZFP36 0.0091 7.7
45 162.63 0.0084 P06744 GPI 0.0051 166 152.79 0.0325 Q05682 CALD1 0.0046
53 161.26 0.0103 Q9NZ50 SR 0.0349 167 152.70 0.0329 P21359 NF1 0.0294
54 161.22 0.0103 P02649 APOE 0.0026 7.7 171 152.56 0.0335 P06729 CD2 0.0149
62 160.64 0.0112 P32119 PRDX2 0.0377 176 152.44 0.0340 P25098 ADRBK1 0.0406
66 160.35 0.0117 P29474 NOS3 0.0437 182 152.02 0.0359 P35520 CBS 0.0037
70 159.54 0.0131 Q14289 PTK2B 0.0314 184 151.98 0.0362 Q9NT12 ATP8A2 0.0177
75 158.61 0.0149 P01266 TG 0.0329 185 151.96 0.0362 P46734 MAP2K3 0.0049
78 158.56 0.0150 P08133 ANXA6 0.0157 191 151.71 0.0374 P19338 NCL 0.0477
94 157.72 0.0168 P10145 IL8 0.0057 195 151.63 0.0378 P11137 MAP2 0.0106
97 157.52 0.0173 Q00403 GTF2B 0.0497 1.6 199 151.61 0.0379 P08571 CD14 0.0111
100 157.42 0.0175 P11912 CD79A 0.0034 7.7 206 151.32 0.0394 Q9UIT9 MYLK 0.0163
109 156.76 0.0191 P07204 THBD 0.0191 210 151.05 0.0408 P09601 HMOX1 0.0400
110 156.67 0.0193 P16333 NCK1 0.0437 211 151.03 0.0409 P18206 VCL 0.0117 1.8
111 156.63 0.0195 P00749 PLAU 0.0026 1.8 220 150.73 0.0425 P42261 GRIA1 0.0100
112 156.62 0.0195 Q92934 BAD 0.0483 224 150.67 0.0429 P11166 SLC2A1 0.0203
115 156.38 0.0201 P47712 PLA2G4A 0.0214 236 150.36 0.0446 O95644 NFATC1 0.0451
117 156.25 0.0205 Q02750 MAP2K1 0.0449 245 149.93 0.0471 P00374 DHFR 0.0351
118 156.24 0.0205 P55290 CDH13 0.0394 248 149.86 0.0476 P02753 RBP4 0.0280
120 156.02 0.0211 P56945 BCAR1 0.0426 252 149.67 0.0487 P12270 TPR 0.0400

Bold, nodes with primary evidence; italics, AD expert genes; P valuer, raw P value; P valuets, topology-subtracted P value; SP ID, Swiss Prot identifier; Sec. ev., secondary evidence value.

*

Sorted by raw P values.

Table 1 shows the 50 top-scoring network nodes after seeding with set 1. The results are remarkable in two respects. First, the ranking of the expert nodes, which acted as seeds with equal primary evidence, may provide a means to assign priorities to genes within the seed set. For example, seed genes ESR1 and APP turned out to be the two highest-ranking expert genes at ranks 3 and 4, respectively, whereas seed genes CHRM1 and PRNP received the low ranks, 2,575 and 2,647, respectively. High-ranked genes (with both P values low) tend to cluster, whereas low-ranked expert genes tend to be outliers, strangers to the main cluster. Second, Table 1 shows many new genes (that is, genes that were not suggested by the expert) among the 100 top-scoring nodes. These genes receive high secondary-evidence values because of their network proximity to the seed nodes. The results indicate that these genes may be better AD candidates than low-scoring expert nodes. Table 1 also demonstrates the utility of using two complementary P values. For example, some nodes have a significant raw, but a nonsignificant topology-subtracted, P value. This coupling indicates nodes that are probably nonspecific to AD and that may contribute to multiple subnetworks regardless of disease relation. For example, node SP1, a transcription factor, achieves a triangulation rank of 8. According to the raw importance ranking, SP1 is central to the disease-specific subnetwork [SP1 has been described recently as playing an essential role in the processing of APP (6)]. However, the nonsignificant P value for topology-subtracted importance hints that SP1 affects multiple gene neighborhoods, thereby casting a degree of uncertainty on what specific role this transcription factor might play in the etiology of AD. By comparison, both the APP and the PSEN1 genes (ranks 4 and 14) have highly significant raw and topology-subtracted P values.

We then seeded the method with four AD-susceptibility genes and calculated secondary-evidence values for all other nodes of the network (Table 5, which is published as supporting information on the PNAS web site, shows the top 50 high-scoring network nodes.) We used the resulting ranking of network nodes to measure the sensitivity and specificity of the triangulation method in predicting expert-validated AD genes (56 genes, excluding the 4 seed genes) among the top-scoring network nodes. We evaluated the molecular-triangulation method using the same method as that commonly used for classification algorithms: We treated all expert-defined genes as true positives and all other nodes as true negatives. We measured the predictive strength of the method with the receiver operating characteristic (ROC) score, a value commonly used in computer-science research for evaluating classification algorithms. (An ROC score of 1 corresponds to a perfect method; an ROC score of 0.5 indicates a powerless classification method.) We found that seeding the method with the four AD-susceptibility genes resulted in significantly higher secondary-evidence values being assigned to unseeded expert-selected genes, compared with values assigned when we seeded the method with randomly chosen genes. Compared to the mean ROC score of 0.6257 for a random seed-node assignment,l our selected seed nodes provided a significantly higher ROC score of 0.6806 (P < 0.01).

We also wished to test whether our method is robust against noise in the linkage data. We repeated the experiment described in the preceding paragraph, this time adding an increasing number of random genes to the set of four known AD susceptibility genes. As Fig. 3 (upper intervals) shows, the results are stable (and remain significant) for the addition of up to three random genes.

Fig. 3.

Fig. 3.

Area under the ROC curve for predicting 56 expert genes. The upper intervals indicate results (mean area under the ROC curve with 95% confidence interval) of seeding the method with four known AD-susceptibility genes plus additional, believed to be irrelevant genes. The lower intervals indicate the results of seeding the method with an equivalent number of random genes.

We also conducted an experiment using 157 positional candidate genes from the whole-genome linkage study. After seeding our method with those candidates, we found among the top ranked genes mostly insignificant topology-subtracted P values. The 157 positional candidates do not belong to a well-defined network cluster, indicating a high noise level; Table 2 shows the 50 top-scoring nodes. We can use an elegant solution to weed out noisy data from these results. By selecting genes with significant raw and topology-subtracted P values (cutoff 0.05), we can identify genes that are central and specific to a network cluster. Only 58 genes (of 3,111 genes/nodes in our network) comply with this requirement (Table 3). This set contains a total of 11 seed genes from the whole-genome scan analysis; the remaining 147 seed genes are rejected (selecting 11 genes of 157 corresponds to a 93% reduction in candidates genes). In other words, despite the presence of linkage noise, the algorithm identified a total of 11 seed genes that are clustered in a way that is not random, a majority of those 11 genes have a demonstrated relationship to AD. Three of them are expert validated [APOE, mitogen-activated protein 8 (MAPK8), and urokinase plasminogen activator (PLAU)]. Apolipoprotein(a)'s (LPA) null phenotype has been shown recently to delay the age of onset of AD (7). Urokinase plasminogen activator receptor (PLAUR) has a clear functional relationship to (is the receptor of) PLAU. Calmodulin (CALM3) is central to the Ca2+/calmodulin-dependent protein kinase II (CaMKII), which has been implicated as being involved in the phosphorylation of tau protein (8). Given the ubiquitous nature of calmodulin, it is not surprising that the topology-subtracted P value associated with the protein is of borderline significance. Of the candidates mentioned, CALM3 and PLAUR are located close to the 19q13 (7.7 MLS) peak corresponding to the AD-susceptibility gene APOE. However, the candidate statures of CALM and PLAUR are not enhanced by their colocation with a major linkage peak, because it is clear that the linkage signal derives from the contribution of APOE. Both MAPK8 and PLAU map to the 10q22 (1.8 MLS) peak, whereas LPA maps to the 6q27 (2.2 MLS) peak. Among the remaining 47 genes that have no corresponding linkage peaks in Table 5, there are six additional expert-validated genes (IL6, NOS3, LPL, GRIN1, NTRK1, and HMOX1). In addition, many of the remaining genes have clearly established links to AD. Examples are the top-ranked genes CREB1 and PLAT (9, 10).

These results provide a strong argument for the usefulness of molecular triangulation in situations in which multiple data types are available for the same disorder, such as linkage information that can be combined with knowledge about molecular interactions.m Nevertheless, the results require further explanation. First, the expert's list of AD candidates cannot be regarded as a real gold standard, given that we still do not know which are the true AD disease genes. The results of the ROC score calculations might therefore be lower than expected. Another factor that may influence the performance of the algorithm is the quality of the underlying molecular network model. Here, we used a network that we compiled by mining the molecular literature (in principle, our method works with any type of molecular interaction network). We are aware that this network is far from being complete and error-free, given the many obstacles to harvesting information from the literature successfully (11). Possible errors include falsely identified or missed network edges resulting from automated parsing of complex sentences. Also, problems with automatically identifying molecular terms may result in the mapping of network connections to wrong nodes in the network. As a result, the triangulation algorithm may implicate or miss some candidate genes because of incorrect network edges. Another problem is that literature is biased toward positive research results or research results that deal with already known and popular topics, which may lead to over- or underrepresentation of certain network connections. Nevertheless, researchers are making fast progress in solving outstanding text-mining problems (see, for example, ref. 12), and their work may lead to the automated construction of high-quality data sets from literature sources.

Although there are several recent studies that used molecular network-related information for biological inference [for example, for predicting protein complexes (13, 14)], we are not aware of any prior work on combining genetic with molecular-pathway information to identify disease-related gene variants.

Supplementary Material

Supporting Tables

Acknowledgments

This study was supported by grants from the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Defense Advanced Research Projects Agency (to A.R.). C.A.K. is the recipient of a National Alliance for Research on Schizophrenia and Depression Distinguished Investigator Award.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: APP, amyloid precursor protein; PSENn, presenilin n; APOE, apolipoprotein E; AD, Alzheimer's disease; MLS, multipoint local score; CIDR, Center for Inherited Disease Research; lod, logarithm of odds; ROC, receiver operating characteristic.

Footnotes

h

Collier, N., Park, H. S., Ogata, N., Tateisi, Y., Nobota, C., Ohta, T., Sekimizu, T., Imai, H., Ibushi, K. & Tsujii, J. (1999) in EACL'99 (Bergen, Norway).

i

Pustejovsky, J., Castano, J., Sauri, R., Rumshisky, A., Zhang, J. & Luo, W. (2002) in ACL-02 (Philadelphia, PA).

j

The mapping step involved the resolution of spelling variants (p53 protein, p53-protein, and p53 proteins) as well as synonymy (amyloid β A4 protein, APP, A4, and AD1 are synonyms that denote the same entity). Interactions with ambiguous names (those that mapped to more than one SwissProt ID) were excluded.

k

The distance between nodes u and v is calculated by using Dijkstra's shortest-path algorithm.

l

The results of the random assignment are better than those of chance (which would correspond to an area under the ROC curve of 0.5), because the ranks of the specific nodes are stable regardless of the assignment of primary evidence. Such nodes are usually highly connected and thus affect multiple network neighborhoods. In our case, a portion of the expert nodes seem to fall into this category and are thus predictable even with random assignments of primary evidence.

m

Here we used data from a whole-genome linkage scan; the method is equally useful for similar noisy data from association studies or functional analysis.

References

  • 1.Hightower, J. & Boriello, G. (2001) IEEE Comput. 34, 57-66. [Google Scholar]
  • 2.Ng, S. K. & Wong, M. (1999) Genome Inform. Ser. Workshop Genome Inform. 10, 104-112. [PubMed] [Google Scholar]
  • 3.Rzhetsky, A., Iossifov, I., Koike, T., Krauthammer, M., Kra, P., Morris, M., Yu, H., Duboue, P. A., Weng, W., Wilbur, W. J., et al. (2004) J. Biomed. Inform. 37, 43-53. [DOI] [PubMed] [Google Scholar]
  • 4.Blacker, D., Bertram, L., Saunders, A. J., Moscarillo, T. J., Albert, M. S., Wiener, H., Perry, R. T., Collins, J. S., Harrell, L. E., Go, R. C., et al. (2003) Hum. Mol. Genet. 12, 23-32. [DOI] [PubMed] [Google Scholar]
  • 5.Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C., Estreicher, A., Gasteiger, E., Martin, M. J., Michoud, K., O'Donovan, C., Phan, I., et al. (2003) Nucleic Acids Res. 31, 365-370. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Christensen, M. A., Zhou, W., Qing, H., Lehman, A., Philipsen, S. & Song, W. (2004) Mol. Cell. Biol. 24, 865-874. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Emanuele, E., Peros, E., Tomaino, C., Feudatari, E., Bernardi, L., Binetti, G., Maletta, R., D'Angelo, A., Montagna, L., Bruni, A. C., et al. (2004) Neurosci. Lett. 357, 45-48. [DOI] [PubMed] [Google Scholar]
  • 8.Yoshimura, Y., Ichinose, T. & Yamauchi, T. (2003) Neurosci. Lett. 353, 185-188. [DOI] [PubMed] [Google Scholar]
  • 9.Melchor, J. P., Pawlak, R. & Strickland, S. (2003) J. Neurosci. 23, 8867-8871. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Vitolo, O. V., Sant'Angelo, A., Costanzo, V., Battaglia, F., Arancio, O. & Shelanski, M. (2002) Proc. Natl. Acad. Sci. USA 99, 13217-13221. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Blaschke, C. & Valencia, A. (2002) IEEE Intell. Sys. 17, 73-76. [Google Scholar]
  • 12.Krauthammer, M. & Nenadic, G. (2004) J. Biomed. Inform., in press. [DOI] [PubMed]
  • 13.Bader, G. D. & Hogue, C. W. (2003) BMC Bioinformatics 4, 2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Bader, J. S. (2003) Bioinformatics 19, 1869-1874. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Tables
pnas_101_42_15148__1.pdf (43.7KB, pdf)
pnas_101_42_15148__2.pdf (63.2KB, pdf)

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences

RESOURCES