Abstract
Link prediction is a fundamental problem in network analysis. In a complex network, links can be unreported and/or under detection limits due to heterogeneous sources of noise and technical challenges during data collection. Incomplete network data can lead to inaccurate inference in network-based data analysis. We propose a parametric link prediction model and consider latent links as misclassified binary outcomes. We develop new algorithms to optimize model parameters and yield robust predictions of unobserved links. Theoretical properties of the predictive model are also discussed. We apply the new method to partially observed social network data and incomplete brain network data. The results demonstrate that our method outperforms the existing latent-link prediction methods.
Keywords: brain network, link prediction, parametric model, outcome misclassification, social network
1 |. INTRODUCTION
Network data have become increasingly important to study complex problems, for example, to understand the role of contact patterns in epidemics through social network analysis and interactions between neuron populations in the human brain.1,2 A network can be represented by a graph, where a node denotes a study unit and an edge or link indicates interactions between a pair of nodes.3 In practice, network data are often incomplete because some links are unreported/undetectable due to various sources of noise and technical limitations during data collection. In brain network analysis, connections are constructed based on correlations and/or partial correlations between nodes (brain areas). A binarized brain network by thresholding and/or shrinkage4,5 is often used for the down-stream data analysis (eg, graph theoretical analysis). A binarized brain network, however, may miss true connectivity links while preventing false positive connections due to noise in recorded brain signals (Figure 1). Similarly, social networks estimated from survey data can be incomplete because many links are unobserved (false negatives). The observed social network data are subject to the recall bias as the chance of a subject recollecting all his/her contacts is very small, and the number of contacts that a subject can report may be restricted to a fixed number that is less than his/her total contacts.6 Therefore, the incomplete network data are common in practice. It has been well documented that inferences based on incomplete network data can be inaccurate.3 To mitigate this challenge, we focus on developing link prediction models to recover the full network based on incomplete network data.
FIGURE 1.
Graph-based analysis for brain connectivity networks: A, Define brain regions as nodes. B, Extract time series for nodes. C, Observe connectivity metrics between each pair of nodes as edges. D, Binarize brain connectivity matrix. E, Obtain functional brain networks. F, Utilize graph-based analysis for brain networks
The statistical analysis of network data (mainly social network analysis) has been an active area for decades.7 Markov random graphs by Frank and Strauss8 initially relaxed the dyadic independence of the p1 model.9 Strauss and Ikeda’s (1990) discussion10 made the Markov model and its general form, the p* model, computationally feasible through an approximation based on the logistic regression model, via a pseudo-likelihood method.11 Likelihood-based social network models based on the assumption of conditional independence have been well accepted, because they are flexible in adjusting for the multivariate features of each node as well as connections between nodes. Exponential random graph models (ERGM)12 and latent space approaches13 further improve upon the statistical properties of the parametric social network model by accounting for the dependence of edges and network structures. Efficient computational algorithms, for example, Monte Carlo maximum likelihood estimation, have also been developed using advanced techniques.14–16 Recently, these network models have also been successfully applied to complex brain connectome data analysis.2 However, these parametric network models are mainly built on completely observed network data (ie, without latent links) and thus may not be directly applicable to incomplete network data,17 and they are rarely constructed for the purpose of link prediction.18
Predicting latent links in incomplete networks can be considered as a binary outcome prediction problem. However, the commonly used binary outcome predictive models and machine learning methods (eg, gradient boosting and random forest) are limited for this purpose, because the outcome labels in the training data are misclassified. To address this issue, nonparametric statistical methods which take into account network topology and structure without the requirement of correct labeling of the training set, have been used for link prediction and are well suited for partially observed network data.3,6 Nonparametric models rely on similarity-based methods, which assume that nodes are more likely to have edges with similar nodes. Local, global, and quasi-local approaches can be used to describe the topology information included in computing the similarity of nodes.
In this article, we consider unreported/undetected links as misclassified binary outcomes in the framework of statistical network analysis. The binary outcome misclassification problem has been well studied in the statistical literature, using logistic regression for epidemiological research.19 Under nondifferential and differential outcome misclassification assumptions and when validation data sets are provided, the likelihood-based method can perform well to correct the estimation bias.20–22 However, a major challenge in modeling incomplete network data with misclassified outcome models is the lack of validation data. Neither sensitivity nor specificity can be estimated without a validation data set. As a result, the likelihood function cannot be correctly specified for parameter estimation.21 In addition, the misclassification mechanism of incomplete network data is different from existing binary misclassification models in logistic regression. The common assumption for a partially observed network is that the reported links are true links and not false positives, while some unreported links may be latent.3,23,24 The goal of the current research is to build a predictive model with high accuracy of latent link prediction, which is different from the goal of estimating the effects (regression coefficients) in the parametric model. To address these aims, we propose a new statistical network model that integrates misclassified outcome variables into a parametric model and provides efficient algorithms for accurate link prediction.
The article is organized as follows: in Section 2, we describe the proposed method and optimization algorithms, and provide theoretical results supporting the conclusion that the performance of our link prediction is robust and accurate when a validation data set is not available. In Sections 3 and 4, we present results of both simulations and two data examples demonstrating that the proposed approach outperforms existing methods in various settings.
2 |. METHODS
To begin, we denote a true link/contact between a pair of nodes in a network i, j ∈ {1, …, n}, i ≠ j by a binary variable Yij and let the realization yij = 1 indicate a connection and yij = 0 otherwise. The corresponding observed link in incomplete network data is wij. In the context of latent links (eg, epidemiological contacts in a social network), it seems sound to assume that all observed connected links are truly connected, whereas some unobserved connections are latent true links.3 Specifically, we assume that a proportion of true links are misclassified as latent links, so that Pr(Yij = 1|Wij = 0) > 0, whereas all reported links are considered true links, that is, Pr(Yij = 1|Wij = 1) = 1. Note that this outcome misclassification mechanism is different from conventional outcome misclassification models, which is reflected by our model specification in Section 2.2. Our goal is to predict/recover the latent contacts from observed incomplete network data: where w is the partially observed network and other auxiliary information such as spatial distance and coworking relationship between subjects are available in xij.
2.1 |. Background
2.1.1 |. Likelihood for fully observed network
Let Y be a random matrix for the complete network and let y be a realization of the adjacency matrix. xij is a p × 1 vector of covariates characterizing the auxiliary information between nodes i and j, for example, the spatial proximity and temporal cooccurrence.25 We use a parametric model to describe Y by XT = [x12 x13 … x(n−1)n]. Specifically, we have
$$\operatorname{logit}\{\Pr(Y_{ij} = 1 \mid x_{ij})\} = x_{ij}^{T}\beta, \tag{1}$$
where β is the corresponding p × 1 vector of parameters.
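As a minimal illustration of this setup, the Python sketch below stacks one covariate row per dyad (i, j) with i < j and maps the linear predictor to link probabilities. It assumes the logistic link that the text refers to later; the toy sizes, covariates, and coefficient values are hypothetical and are not taken from the article.

```python
import numpy as np

def link_probabilities(X, beta):
    """pi_ij = 1 / (1 + exp(-x_ij^T beta)), one value per dyad row of X.
    ASSUMPTION: logistic link."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

n, p = 5, 2                              # hypothetical toy sizes
rows, cols = np.triu_indices(n, k=1)     # dyads (i, j) with i < j
rng = np.random.default_rng(0)
X = rng.normal(size=(rows.size, p))      # hypothetical covariates x_ij
beta = np.array([0.5, -1.0])             # hypothetical coefficients
pi = link_probabilities(X, beta)

# Map the dyad vector back to a symmetric n x n matrix of probabilities.
P = np.zeros((n, n))
P[rows, cols] = pi
P = P + P.T
```

Working with the n(n − 1)/2 dyads as rows of a single design matrix is what lets an undirected network be handled with ordinary regression code.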
The parametric social network models (eg, ERGM) allow modeling Yij with both xij and the rest of the links in the network, y−ij, and thus account for forms of dependency structures.26 Therefore, we can extend our model as follows:
$$\operatorname{logit}\{\Pr(Y_{ij} = 1 \mid y_{-ij}, x_{ij})\} = x_{ij}^{T}\beta, \tag{2}$$
where xij is extended to include both node-pair auxiliary information and the change of graph summary statistics for the network.26
In general, the model parameters β in (2) are estimated by maximum likelihood estimation (MLE).14–16 It is known that the computational cost of MLE for correlated multivariate binary outcomes can be high. The pseudo-likelihood approach is often used instead of MLE, and the parameters estimated by maximum pseudo-likelihood estimation (MPLE) are equivalent to those estimated by MLE when the assumption of dyadic independence is valid. In the current research, our focus is on link prediction instead of parameter estimation. It has been shown that the prediction of outcome links using (2) is invariant to the choice of MPLE or MLE for parameter estimation, because only the relative rank of links is needed for link prediction and there is no need to calculate the normalizing terms.27 We validate this claim by sensitivity analysis in the Appendix. Therefore, we apply MPLE in our analysis for the purpose of link prediction.
The expression of likelihood and pseudo-likelihood for the network data requires the knowledge of accurately measured links yij. In our application, however, the true links yij are not available in the incomplete network data because many links are unreported/undetected and thus latent. Simply substituting the true links yij by observed links wij for the pseudo-likelihood function is invalid because wij is subject to the misclassification errors and the equation of conditional probability on the rest of networks does not hold anymore. To address this challenge, we propose a new approach to incorporate outcome misclassification into the parametric network model.
2.1.2 |. Misclassification model for incomplete networks
We assume that the observed link wij is subject to misclassification error from the true links yij. Specifically, we let
(3) |
As pointed out by Zhao et al. (2017),3 the above misclassification model can also be expressed by Bayes’ rule as follows:
$$\Pr(Y_{ij} = 1 \mid W_{ij} = 1) = 1, \qquad \Pr(Y_{ij} = 1 \mid W_{ij} = 0) = \theta. \tag{4}$$
Model (4) can be considered as a Berkson error model in contrast to the classical error model (3).22,28,29 We favor model (4) over (3) because yij is better suited for parametric network modeling than wij. As a result, our outcome misclassification specification in (4) is different from existing methods for outcome misclassification in logistic regression, where the conditional distribution is based on wij|yij.22,28 We propose a new computational method to overcome this difficulty.
In this article, we model the probability of true links conditional on the observed links. We assume nondifferential single directional misclassification error w.r.t. parameter θ.
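The single-directional mechanism can be illustrated with a short simulation. The Python sketch below hides each true link with a fixed probability and never reports a non-link as a link; the prevalence of 0.3, the hiding rate of 0.4, and the function name `hide_links` are hypothetical choices made only for illustration.

```python
import numpy as np

def hide_links(y, hide_rate, rng):
    """Single-directional misclassification: each true link (y = 1) is hidden
    (w = 0) with probability hide_rate; non-links are never reported as links."""
    keep = rng.random(y.shape) >= hide_rate
    return y * keep

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=20000)        # hypothetical true dyads
w = hide_links(y, hide_rate=0.4, rng=rng)   # partially observed dyads

# Every reported link is a true link: empirical Pr(Y = 1 | W = 1).
ppv = y[w == 1].mean()
# Share of unreported dyads that are latent true links: empirical Pr(Y = 1 | W = 0).
theta_hat = y[w == 0].mean()
```

In this toy setting Bayes' rule gives Pr(Y = 1 | W = 0) = 0.3 × 0.4 / (0.7 + 0.3 × 0.4), which is the quantity `theta_hat` approximates, while `ppv` equals 1 by construction.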
The lack of validation data for misclassified data. Another challenge in constructing the outcome misclassification model for incomplete network data arises from the lack of a validation data set. Traditionally, the likelihood function for binary outcome misclassification is ℓ(β; y, X) = ℓ(β; w, X, sensitivity, specificity) where sensitivity and specificity are estimated from the validation data. However, a validation data set is often absent for incomplete network data, which prevents the estimation of sensitivity and specificity. Thus, the likelihood function cannot be optimized using the existing misclassification models. As shown below, our misclassification model in (4) is well suited for the misclassification mechanism of incomplete network data. We propose a new algorithm to efficiently optimize the expected likelihood function for parameter estimation.
2.2 |. Expected likelihood for incomplete network data
In binary outcome misclassification models, the likelihood function can be specified based on the observed (misclassified) outcome variables (eg, via logistic regression19). However, with the absence of validation data, the parameters may not be properly specified.28 Thus, we calculate the expected value of the likelihood of y:30 Eθ(ℓ(β; y, X)|w) based on the misclassification model (4) since ℓ(β; w, X) cannot be directly specified. The expected log likelihood function for y conditional on observed network w is:
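$$E_{\theta}\{\ell(\beta; y, X) \mid w\} = \sum_{i<j} \Big\{ \big[w_{ij} + \theta(1 - w_{ij})\big] \log \pi_{ij} + (1 - \theta)(1 - w_{ij}) \log(1 - \pi_{ij}) \Big\}, \tag{5}$$
where πij denotes the link probability under model (2).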
It appears that the expectation-maximization (EM) algorithm can be applied to estimate θ and β from (5). However, when the validation data are not available, the direct maximization of (5) may not lead to meaningful estimates of θ. Since (5) is linear in θ, the direct estimates of θ would be either 0 or 1. To address this issue, we propose the objective function (6), subject to the constraint on θ, and use a two-step procedure that estimates θ and β iteratively to maximize latent contact prediction accuracy.
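The linearity in θ can be checked numerically. The Python sketch below assumes that the expected log-likelihood has the Bernoulli form with yij replaced by its conditional expectation w + θ(1 − w), which is the quantity that appears in the working response z of Step 1, and that the link is logistic. The data and coefficient values are hypothetical.

```python
import numpy as np

def expected_loglik(beta, theta, w, X):
    """Expected log-likelihood under the ASSUMED form:
    sum of E[y|w] * log(pi) + (1 - E[y|w]) * log(1 - pi),
    with E[y|w] = w + theta * (1 - w) and a logistic link."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    ey = w + theta * (1.0 - w)
    return float(np.sum(ey * np.log(pi) + (1.0 - ey) * np.log(1.0 - pi)))

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # hypothetical design
w = rng.binomial(1, 0.2, size=200).astype(float)           # hypothetical observed links
beta = np.array([-1.0, 0.5])                               # hypothetical fixed beta

# For fixed beta the value at theta = 0.5 is the average of the endpoint values,
# so the maximum over theta in [0, 1] is always attained at 0 or 1.
l0, l_half, l1 = (expected_loglik(beta, t, w, X) for t in (0.0, 0.5, 1.0))
```

Because the function is a straight line in θ for any fixed β, maximizing it jointly pushes θ to a boundary, which is why θ is treated as a tuning parameter instead.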
Step 1: We first perform maximum likelihood estimation (MLE) of the expected log likelihood based on a given value of θ, which is . Under mild regularity conditions,
The first derivative of the expected log-likelihood is similar to the mean score function for a log-likelihood. It has been known that the MLE based on observed data can be solved by the mean score function.30
To implement the Newton-Raphson algorithm for the estimation of β given a known value of θ, the first and second derivatives of the expected log-likelihood can be calculated.
Let πT = (π12, π13, … , π(n−1)n). We observe the matrix form of first- and second-order derivatives as
where η(⋅) is the half vectorization. We next update βold by
where z = Xβold + G−1[η(w) + θ(1 − η(w)) − π]. The estimated parameters will be updated iteratively until convergence is reached. We denote the above estimated parameters as and we can accordingly calculate based on these parameters.
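A minimal Python sketch of this iteration is given below. It assumes that G is the diagonal matrix with entries πij(1 − πij) and that the update is the usual weighted least squares step on the working response z; neither is spelled out in the surviving text, and the function name `fit_beta` and the simulated data are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_beta(X, w, theta, n_iter=50, tol=1e-8):
    """Newton-Raphson (IRLS) for beta at a fixed theta.
    ASSUMPTIONS: G = diag(pi * (1 - pi)); beta_new = (X'GX)^{-1} X'G z."""
    beta = np.zeros(X.shape[1])
    ey = w + theta * (1.0 - w)                      # eta(w) + theta * (1 - eta(w))
    for _ in range(n_iter):
        pi = sigmoid(X @ beta)
        g = np.clip(pi * (1.0 - pi), 1e-10, None)   # diagonal of G, kept away from 0
        z = X @ beta + (ey - pi) / g                # working response
        XtG = X.T * g                               # X' G without forming G
        beta_new = np.linalg.solve(XtG @ X, XtG @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])    # hypothetical dyad covariates
w = rng.binomial(1, sigmoid(X @ np.array([-1.0, 1.0]))).astype(float)
theta = 0.1                                                    # hypothetical fixed theta
beta_hat = fit_beta(X, w, theta)

# At convergence the mean score X'(E[y|w] - pi) should vanish.
score = X.T @ (w + theta * (1.0 - w) - sigmoid(X @ beta_hat))
```

Solving the weighted normal equations with `np.linalg.solve` avoids forming an explicit matrix inverse, which is the usual numerically safer choice.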
When the validation data are available and θ can be prespecified by transforming sensitivity and specificity, the parameters estimated by (5) are unbiased and consistent.22 We provide numerical validation of the unbiased and consistent estimation in Section 3.3. The estimated parameters with prespecified θ can be further refined by including the normalizing factor via the MCMC algorithm for MLE computation.26,31
Step 2: In the absence of validation data, we argue that link prediction is robust although the estimate of parameter θ may be biased. It is known that estimating θ is challenging without validation data due to the issue of identifiability.21 In step 2, we resort to an alternative pathway to optimize θ as a tuning parameter using the following objective function,
(6) |
where is thresholded with being estimated from (5) given . In our model, we consider observed links as true links and thus the observed links can serve as a semitraining data set. Since the number of links is a small proportion of , we tend to optimize our link prediction by balancing the true positive rate and the false discovery rate (FDR). The first term of (6) defines the true positive rate, the fraction of observed links (wij) that are correctly predicted by in the training data set. The second term defines the precision, which is 1 − FDR, that is, the proportion of observed links (wij) among predicted links . Hence, the objective function aims to recover a maximal number of missing links without introducing a large number of false positive predictions, in short, without increasing the FDR. This objective function is close to the F1 score, which is a balanced metric between precision and recall and is commonly used to evaluate the performance of predictive models for uneven class distributions (eg, a large proportion of links are truly absent in a social network). The Newton-Raphson algorithm in step one yields and with a given θ. Since the range of θ is constrained between 0 and 1, we perform a grid search to optimize with a small incremental step (eg, 0.001) from 0 to 1. During the grid search, we compute the value of the objective function in (6) using each value of θ. The optimal is selected when the maximum of the objective function (6) is achieved. Similarly, the optimal threshold r0 can be determined by (6).
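The grid search in Step 2 can be sketched in Python as follows. Two points are assumptions, since the exact form of (6) is not recoverable from the text here: the two terms (true positive rate and precision) are combined by a plain sum, and Step 1 is carried out by a logistic IRLS fit with weights π(1 − π). The coarse grids, the simulated data, and all function names are hypothetical, and the grid stops short of θ = 1, where the fit degenerates.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_beta(X, w, theta, n_iter=100, tol=1e-8):
    """Step 1 (ASSUMED form): IRLS for beta at a fixed theta."""
    beta = np.zeros(X.shape[1])
    ey = w + theta * (1.0 - w)
    for _ in range(n_iter):
        pi = sigmoid(X @ beta)
        g = np.clip(pi * (1.0 - pi), 1e-10, None)
        z = X @ beta + (ey - pi) / g
        new = np.linalg.solve((X.T * g) @ X, (X.T * g) @ z)
        done = np.max(np.abs(new - beta)) < tol
        beta = new
        if done:
            break
    return beta

def objective(w, y_hat):
    """True positive rate plus precision, both measured against observed links.
    ASSUMPTION: the two terms are summed."""
    tp = np.sum(w * y_hat)
    tpr = tp / max(w.sum(), 1.0)
    precision = tp / max(y_hat.sum(), 1.0)
    return tpr + precision

def grid_search(X, w, thetas, thresholds):
    """Step 2: evaluate the objective on a grid of (theta, threshold) pairs."""
    best_val, best_theta, best_r = -np.inf, None, None
    for theta in thetas:
        pi_hat = sigmoid(X @ fit_beta(X, w, theta))
        for r in thresholds:
            val = objective(w, (pi_hat >= r).astype(float))
            if val > best_val:
                best_val, best_theta, best_r = val, theta, r
    return best_val, best_theta, best_r

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(1000), rng.normal(size=1000)])   # hypothetical covariates
y = rng.binomial(1, sigmoid(X @ np.array([-1.0, 1.5])))       # hypothetical true links
w = (y * (rng.random(1000) >= 0.3)).astype(float)             # 30% of true links hidden
best_val, best_theta, best_r = grid_search(
    X, w, np.linspace(0.0, 0.9, 10), np.linspace(0.05, 0.95, 19))
```

Each grid point costs one Step 1 fit, so the total work grows linearly with the grid resolution; a finer step such as 0.001 only changes the `thetas` array.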
Optimizing (6) generally yields a good estimate that is located within the neighborhood of the true θ0 where θ0 ranges from 0 to 1. However, may be biased without the use of validation data which can lead to biased estimates .21 Here, we argue that the biased estimates have little impact on the performance of link prediction. For example, the area under the curve (AUC) of the ROC is a commonly used metric to evaluate the performance of latent link prediction, which is approximately equal to the c statistic using logistic regression.32,33 The c statistic is determined by , where I is an indicator function. We provide both a theoretical proof below and numerical validations in the Appendix and Sections 3 and 4 to support this claim.
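The rank-based nature of the c statistic can be seen in a small Python sketch. It uses the conventional definition (the proportion of concordant positive/negative pairs, with ties counted as one half), which is assumed here because the article's formula is not shown; the labels and scores are hypothetical.

```python
import numpy as np

def c_statistic(y, score):
    """Proportion of (positive, negative) pairs in which the positive case
    receives the higher score; ties count as 1/2."""
    pos = score[y == 1]
    neg = score[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

y = np.array([1, 0, 1, 0, 0])                  # hypothetical true links
score = np.array([0.9, 0.3, 0.4, 0.5, 0.1])    # hypothetical predicted probabilities

c_prob = c_statistic(y, score)                 # 5 concordant pairs out of 6
# A strictly increasing transform (here the logit) leaves every pairwise
# ordering, and therefore the c statistic, unchanged.
c_logit = c_statistic(y, np.log(score / (1.0 - score)))
```

Because only the ordering of the predicted probabilities enters the statistic, any estimate that preserves the ranking of the dyads yields the same AUC.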
Theorem 1.
In a network, assume true links given the rest of network are generated independently with Bernoulli(πij(β)), and . The auxiliary covariates Xij follow some distribution with (Xij, Yij) being exchangeable, and the conditional expectation exists and is linear in βXij for all . The observed network with contacts Wij are generated from model (3) where ϕij is observed according to Bayes’ rule with a fixed missing rate θ0. θ0 and represent the true and our estimated missing rate and in general . Then, as n → ∞,
with and Rij being ranks of and πij(β), respectively. is the maximizer of (5) over given missing rate .
Proof.
Following the results by Neuhaus,21 the misspecified outcome misclassification parameters (θ) can be considered as misspecified link function. Let Jij(β, θ; w, X) = Eθℓij(β|w) be the expected log-likelihood for edge (i, j). Under the above conditions, the theorem 2.1 in Li and Duan34 holds regarding the optimization of the expected version of objective function . Since the contacts in observed network are exchangeable, from the strong law of large numbers for exchangeable random variables,35
and both terms are convex in θ. Hence, the claim of theorem 5.1 in Li and Duan34 is still true if the unique maximizer of is an interior point of Ω. Thus, we conclude as n → ∞ for being the maximizer of the left term.
On the other hand, from the definition,
where . Since the logistic link in our method is monotonically increasing, we have
Hence,
and the claim is true.
Under dyadic independence and exogenous covariates, we could generate the network following the proposed models directly, while in other cases, the generation of such a network can be constructed via MCMC algorithms.26 Thus, according to the theorem, the deviation of from θ0 has little impact on the proportion of the concordant pairs (ie, ), and does not affect the c statistic of the logistic regression and the AUC of ROC. The performance of our link prediction is not impacted by the inaccuracy of . This conclusion may partially explain why the likelihood-based network model (with outcome misclassification adjustment) can outperform the popular nonparametric models and machine learning techniques.
In summary, the iterative two-step procedure above provides a viable solution of link/outcome prediction for the binary regression model with the misclassified outcome but no validation data set. The proposed computational strategy is efficient and scalable to very large networks. The complexity of the proposed method is bounded by O((p + 1)M), where p is the number of covariates and M is determined by the resolution of search grids for θ and dichotomization thresholds.
3 |. SIMULATION STUDIES
3.1 |. Synthetic data
We simulate the social and/or brain network data sets by using the following models:
where i, j ∈ {1, …, n} denote a pair of vertices in the network, and are varied distributions to generate networks with different network topological structures, and will be specified with details in the following paragraphs. A link yij follows a Bernoulli distribution with parameter πij. πij is determined by the characteristics of nodes (xij1) and network topological parameters (xij2). A proportion of links are latent and not reported due to recall bias/detection limit and thus the partially observed network is composed of {wij}. The single-directional misclassification mechanism is described by (4) which can be straightforwardly transformed to the sensitivity ϕ and specificity.
We simulate the network data using n = 150 nodes and three topological structures: “random graph,” “community/stochastic block” structure with a single major community, and “rich club.”36 Specifically, we let , and β′ = (2, 1) for the random graph structure . For the community structure , we assume Xij1 = 1 for within community links and 0 otherwise with β′ = (2.53, −2.94). In the “rich club” network data, Xij1 is also used to denote the topological structure and Xij1 = 1 for links between all “rich” nodes and between “rich” nodes and their “periphery” nodes and with β′ = (2.94, −2.94). for all three structures.
We repeat each setting 100 times with different levels of recall bias (by tuning ϕ levels). For the random graph structure , we use the covariates xij1, xij2 in model fitting, while for community structure and “rich club” , only summary statistics for the observed network are considered, including vertex degree, number of common neighbors between vertices, shortest path between vertices, the transitivity, and so on. By applying model selection techniques for structure and , the variables for vertex degree and number of common neighbors between pairs of nodes are incorporated. The latent links (wij = 0|yij = 1) are considered as the hold-out data and used to evaluate the performance of the link prediction by our method, the statistical network model with outcome misclassification (SNOM), and the existing methods including: the neighborhood smoothing (NS) method,37 the stochastic block model (SBM),38 and the full sum (FS) method.3 Note that SBM is developed for completely observed network data instead of partially observed data. We include this method as a reference to emphasize the importance of models addressing the latent links (misclassified outcomes) for the purpose of outcome prediction. Specifically, we denote predicted edges as , and the true positive rate (TPR) and false positive rate (FPR) are defined as:
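The conventional definitions, TPR = TP/(TP + FN) and FPR = FP/(FP + TN), can be evaluated on the unobserved dyads as in the following minimal Python sketch. Restricting the evaluation to dyads with wij = 0 is an interpretation of the hold-out description above, and the toy vectors and the 0.5 threshold are hypothetical.

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate for binary predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])    # hypothetical true links
w      = np.array([1, 0, 0, 0, 0, 0, 0, 0])    # hypothetical observed links
pi_hat = np.array([0.9, 0.7, 0.2, 0.6, 0.4, 0.1, 0.3, 0.2])  # hypothetical predictions

mask = w == 0                                  # evaluate on unobserved dyads only
y_pred = (pi_hat >= 0.5).astype(int)           # hypothetical threshold
tpr, fpr = tpr_fpr(y_true[mask], y_pred[mask])
```

Sweeping the threshold from 0 to 1 and recording each (FPR, TPR) pair traces the ROC curve whose area is reported in the results.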
3.2 |. Results
We summarize the performance of all methods using the averaged area under the curve (AUC) of the ROC curve in Table 1 and Figure 2 across 100 repetitions. We note that the largest difference comes from the network data with a random graph structure, where most information is included in the characteristics of nodes and links, instead of the similarity of nodes and links. In this case, the link prediction exclusively relies on the auxiliary information of the nodes. Our parametric model can conveniently include this information, while the competing methods (nonparametric models) mainly rely on learning the graph topological patterns from the observed network. Therefore, the SNOM method can outperform the competing methods under this scenario. For networks with highly structured topological patterns (ie, community and rich-club), all methods have similar AUCs when a small proportion of links are latent. SNOM seems to be more robust with a higher proportion of misreported links. The SBM method is less accurate, because it was developed for a fully observed network and did not account for misclassified outcomes. Therefore, the network models without outcome misclassification adjustment may not be applicable to partially observed network data.37 In summary, the simulation results demonstrate that the proposed likelihood-based network model, with outcome misclassification adjustment, can provide robust and accurate predictions of latent links for various network data.
TABLE 1.
The results of link prediction for all methods: the means and SDs across all simulated data sets of different settings
Structure | ϕ | SNOM | NS | SBM | FS
---|---|---|---|---|---
Random graph | 0.1 | 0.9148 (0.0059) | 0.4946 (0.0161) | 0.4992 (0.0207) | 0.4987 (0.0161)
Random graph | 0.2 | 0.9166 (0.0040) | 0.4959 (0.0116) | 0.4988 (0.0125) | 0.5011 (0.0103)
Random graph | 0.5 | 0.9161 (0.0020) | 0.4977 (0.0080) | 0.5018 (0.0090) | 0.4988 (0.0051)
Random graph | 0.8 | 0.9156 (0.0013) | 0.4984 (0.0075) | 0.5006 (0.0073) | 0.4999 (0.0004)
Community | 0.1 | 0.7779 (0.0289) | 0.7676 (0.0296) | 0.5044 (0.0309) | 0.7746 (0.0286)
Community | 0.2 | 0.7724 (0.0163) | 0.7645 (0.0175) | 0.4992 (0.0232) | 0.7704 (0.0154)
Community | 0.5 | 0.7505 (0.0082) | 0.7435 (0.0103) | 0.5001 (0.0149) | 0.5293 (0.0179)
Community | 0.8 | 0.6882 (0.0150) | 0.6175 (0.0149) | 0.4997 (0.0103) | 0.5750 (0.0165)
Rich club | 0.1 | 0.8363 (0.0126) | 0.8376 (0.0119) | 0.5016 (0.0279) | 0.8200 (0.0114)
Rich club | 0.2 | 0.8314 (0.0089) | 0.8327 (0.0082) | 0.5019 (0.0229) | 0.8136 (0.0088)
Rich club | 0.5 | 0.8225 (0.0070) | 0.8137 (0.0067) | 0.5016 (0.0219) | 0.7588 (0.0183)
Rich club | 0.8 | 0.7913 (0.0117) | 0.6620 (0.0158) | 0.4984 (0.0156) | 0.6694 (0.0168)
FIGURE 2.
Averaged ROC curves of social networks with the structures of random graph , community , and rich club for all methods across different settings
3.3 |. Parameter estimation with validation data or prespecified θ
In this section, we demonstrate that the proposed method can provide an accurate estimation of β when θ can be prespecified using the validation data. We let covariates Xij1 ~ Exponential(1) and Xij2 ~ Normal(0, 1) and β0 = 0.5, β1 = 1, β2 = 1.5. The results in Table 2 show that both point and interval estimates are accurate and reliable when θ is correctly prespecified. The results also indicate that our method with novel misclassified outcome specification (4) and expected likelihood (5) can provide an accurate estimation of the effects of exposure when validation data are provided. Therefore, the model can be readily applied when validation data are available and the parameter estimation procedure requires step one alone.
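The summary measures reported for this experiment (average estimate, mean squared error, and coverage of the 95% confidence interval) can be computed from replicated fits as in the Python sketch below. It assumes Wald-type intervals of the form estimate ± 1.96 × standard error, which the text does not state, and the four replicate values are hypothetical.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.96):
    """Average estimate, MSE, and empirical coverage of Wald intervals
    (ASSUMPTION: interval = estimate +/- z * standard error)."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    mean_est = estimates.mean()
    mse = np.mean((estimates - truth) ** 2)
    covered = np.abs(estimates - truth) <= z * std_errors
    return mean_est, mse, covered.mean()

# Hypothetical estimates of one coefficient across four replications.
mean_est, mse, coverage = summarize(
    estimates=[0.9, 1.1, 1.0, 1.3],
    std_errors=[0.1, 0.1, 0.1, 0.1],
    truth=1.0)
```

In this toy input three of the four intervals contain the true value, so the empirical coverage is 0.75.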
TABLE 2.
The estimation results of regression coefficients with correctly prespecified θ
θ | n | Parameter | Estimate | MSE | Coverage of 95% CI
---|---|---|---|---|---
0.05 | 50 | β0 | 0.4959 | 0.0121 | 0.9660
0.05 | 50 | β1 | 1.0073 | 0.0124 | 0.9650
0.05 | 50 | β2 | 1.5064 | 0.0093 | 0.9730
0.05 | 150 | β0 | 0.4997 | 0.0014 | 0.9480
0.05 | 150 | β1 | 1.0000 | 0.0014 | 0.9640
0.05 | 150 | β2 | 1.4993 | 0.0010 | 0.9620
0.1 | 50 | β0 | 0.5017 | 0.0107 | 0.9730
0.1 | 50 | β1 | 0.9985 | 0.0111 | 0.9700
0.1 | 50 | β2 | 1.4916 | 0.0078 | 0.9820
0.1 | 150 | β0 | 0.5020 | 0.0012 | 0.9580
0.1 | 150 | β1 | 0.9941 | 0.0013 | 0.9660
0.1 | 150 | β2 | 1.4863 | 0.0010 | 0.9610
0.15 | 50 | β0 | 0.5077 | 0.0093 | 0.9830
0.15 | 50 | β1 | 0.9874 | 0.0098 | 0.9700
0.15 | 50 | β2 | 1.4660 | 0.0074 | 0.9680
0.15 | 150 | β0 | 0.5095 | 0.0012 | 0.9700
0.15 | 150 | β1 | 0.9809 | 0.0015 | 0.9490
0.15 | 150 | β2 | 1.4599 | 0.0023 | 0.8510
4 |. DATA EXAMPLE
4.1 |. Example 1: Partially observed social network
We apply the proposed method to a partially observed social network from a student cohort study for influenza research at the University of Maryland. One goal of this study is to investigate how environmental factors and biomarkers impact disease transmission between subjects with close contacts. The first phase of this study (year 2017) includes 75 undergraduate students between the ages of 18 and 19 (55 females, 18 males, and 2 other). A social network survey was conducted at baseline, in which each study subject was asked to provide four close contacts, resulting in 189 recorded edges (case-contact connections). Clearly, many potential links are unreported, which may yield biased and inaccurate estimates. To better understand the disease transmission process, we are motivated to predict latent links and provide guidance to recover the latent links.
We perform the link prediction using various methods and compare the performance based on the hold-out links (we artificially hide a proportion of edges, aside from the latent links, from the observed data). The graph statistics used in our method are determined by a variable selection procedure. For each link-hiding proportion, we repeat the analysis and prediction 100 times. The performance of each predictive model, in terms of the AUCs of averaged ROC curves, is shown in Table 3 and Figure 3.
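The hold-out step can be sketched in Python as follows. The network size (75 nodes) and the number of recorded edges (189) are taken from the description above, but the edge positions are placed at random purely for illustration, and the function name `hide_edges` is hypothetical.

```python
import numpy as np

def hide_edges(w, proportion, rng):
    """Return a copy of w with `proportion` of its observed links set to 0,
    together with the indices of the hidden links (the hold-out set)."""
    ones = np.flatnonzero(w == 1)
    n_hide = int(round(proportion * ones.size))
    hidden = rng.choice(ones, size=n_hide, replace=False)
    w_train = w.copy()
    w_train[hidden] = 0
    return w_train, hidden

rng = np.random.default_rng(5)
n_dyads = 75 * 74 // 2                         # dyads among 75 nodes
w = np.zeros(n_dyads, dtype=int)
w[rng.choice(n_dyads, size=189, replace=False)] = 1   # random toy edge placement

w_train, hidden = hide_edges(w, proportion=0.2, rng=rng)
```

A model is then fitted to `w_train`, and its predictions at the `hidden` positions are scored against the known value 1; repeating this with fresh random draws gives the averaged ROC curves.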
TABLE 3.
Comparing AUCs of ROCs between methods for hold-out links in partially observed social networks
ϕ | SNOM | NS | SBM | FS |
---|---|---|---|---|
0.1 | 0.7842 (0.0665) | 0.7460 (0.0617) | 0.5559 (0.0596) | 0.6660 (0.0548) |
0.2 | 0.7437 (0.0559) | 0.7150 (0.0387) | 0.5461 (0.0462) | 0.6124 (0.0412) |
0.5 | 0.6262 (0.0333) | 0.6227 (0.0280) | 0.5383 (0.0235) | 0.5128 (0.0151) |
0.8 | 0.5336 (0.0361) | 0.5172 (0.0126) | 0.5239 (0.0142) | 0.5357 (0.0226) |
FIGURE 3.
Averaged ROC curves for methods with different rates of hidden links in partially observed social networks
Our method outperforms the competing methods when link hiding proportions are 0.1 and 0.2. Our methods are superior to the competing methods, in terms of both the sensitivity and specificity of the predicted outcomes. In practice, the proportion of all nonconnected links (ie, the false negative rate) is around 0.1 and 0.2. Therefore, the proposed approach can be useful for many partially observed network data sets in epidemiological studies. When the unreported link rate increases to 0.5, the performance of our method is comparable to NS and superior to the other two methods. Last, all models fail when the rate of unreported links is 0.8 because the remaining information would not support proper predictive modeling. The results of this data example are well aligned with the results in the simulation section. The difference between our methods and competing methods of the data example is smaller than the social network with random graph structure but greater than the social network with a very organized topological structure ( and ). This may be explained by the fact that the real-world social network is composed of both organized topological patterns and randomness.
In addition, we apply our link prediction model to all observed links in the data example. Interestingly, because our method can capture the network topological structures, it predicts that many latent links will be positive in the communities (Figure 4), although the community structure is unknown prior to model estimation. The prediction seems to be reasonable because a group of students in college tends to take courses, eat, and relax in shared space although the validation data are not available and the ground truth cannot be assessed. The predicted links will be used for further epidemiological investigation, for example, whether the dorm environmental factors can impact the transmission of infectious disease between contacted subjects.
FIGURE 4.
Predicted social network
4.2 |. Example 2: Incomplete brain connectivity network
It has been documented that the graph properties and structures of brain networks largely resemble social networks.39 In a brain network, each node represents a brain area and an edge/link represents the interactive relationship between a pair of brain areas.40 The brain network is often thresholded to turn negatively and weakly connected edges into nonconnections.2,41 Although thresholding can remove false positive edges, it inevitably introduces false negative findings by labeling links under a detection limit as nonconnections. Therefore, the thresholded brain network is considered as an incomplete network. Our goal is to recover the “true” network based on the incomplete brain network data, by identifying the false negative edges using our method. The brain network here is built based on resting state fMRI data from 44 healthy subjects (age 36 ± 11, 17 females and 27 males) from a previously published study. The brain network includes 246 nodes defined by brain regions.42 The strengths of edges are measured by the temporal coherence between averaged time series from all brain regions/nodes.43 The group-level (average) brain network characterizes the brain circuitry of healthy subjects at rest. Since the brain network is thresholded and many edges can be latent links due to the detection limit, we apply the proposed method to the incomplete brain network and perform latent link prediction. We evaluate the performance of the various link prediction methods based on the artificially held-out links. We generate 20 sets of hold-out links for each link-hiding proportion, and predict latent links based on the rest of the edges in the observed brain network. The results of each predictive model are displayed in Table 4 and Figure 5 in terms of AUCs of averaged ROC curves.
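The thresholding step that produces the incomplete network can be illustrated with a small Python sketch. The 4 × 4 connectivity matrix, the threshold of 0.5, and the function name `binarize` are hypothetical; the actual data use 246 nodes and a coherence-based edge strength.

```python
import numpy as np

def binarize(conn, threshold):
    """Turn a symmetric connectivity matrix into a 0/1 adjacency matrix:
    entries above the threshold become links, the diagonal is set to 0."""
    adj = (conn > threshold).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

conn = np.array([[ 1.0, 0.8,  0.1, -0.3],
                 [ 0.8, 1.0,  0.45, 0.2],
                 [ 0.1, 0.45, 1.0,  0.6],
                 [-0.3, 0.2,  0.6,  1.0]])     # hypothetical edge strengths

adj = binarize(conn, threshold=0.5)
w = adj[np.triu_indices(4, k=1)]               # observed links, one entry per dyad
```

The dyad with strength 0.45 falls just under the threshold and is recorded as a nonconnection, which is the kind of edge the method treats as a possible latent link.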
TABLE 4.
Comparing AUCs of ROCs between methods for hold-out links in incomplete brain network
ϕ | SNOM | NS | SBM | FS |
---|---|---|---|---|
0.1 | 0.9689 (0.0076) | 0.9529 (0.0104) | 0.5777 (0.0205) | 0.7073 (0.0196) |
0.2 | 0.9631 (0.0039) | 0.9405 (0.0079) | 0.5724 (0.0193) | 0.6638 (0.0122) |
0.5 | 0.8355 (0.0101) | 0.8719 (0.0098) | 0.5464 (0.0117) | 0.5126 (0.0119) |
0.8 | 0.6605 (0.0094) | 0.6568 (0.0150) | 0.5345 (0.0072) | 0.5523 (0.0134) |
FIGURE 5.
Averaged ROC curves for methods with different rates of hidden links in incomplete brain network
As with the social network example, our method SNOM outperforms the competing methods at low hiding rates (0.1 and 0.2). NS shows better performance than the SBM and the FS methods. For a higher hiding proportion of 0.5, our method and the NS method have similar performance (AUC 0.84 and 0.87, respectively) and both outperform the other two methods. When the hold-out rate increases to 0.8, no method can provide accurate prediction, though the NS method and our method still outperform the other two methods. We note that the link prediction performance based on the incomplete brain network (average AUC = 97%) is generally better than that based on the incomplete social network (average AUC = 78%). This may be because social network data are collected from subjective surveys/questionnaires, which may include more noise than the objectively observed neurophysiological signals of brain connectome network data.
5 |. DISCUSSION
We have developed a link prediction method for partially observed network data in the context of binary outcome misclassification. Our emphasis is on “outcome/link prediction,” which distinguishes our work from prior work on binary outcome misclassification that aims at estimating the effects of exposures.28 The predicted latent links provide a more accurate recovery of the full network data using only observed incomplete network data.
Our method complements existing parametric network models by allowing misclassified outcome data and unreported/latent links. Just as many parametric network models (eg, ERGM) can be viewed as incomplete-data generating processes,44 so can our joint model of misclassification and network structure. We extend the basis of existing network models from the likelihood to the expected likelihood to account for the misclassified outcome variables. This framework is flexible in incorporating node and edge characteristics through parametric model specification.
We also make several methodological contributions to the modeling of outcome misclassification for network analysis. Previous research has shown that validation data sets are critical for parameter estimation in binary outcome misclassification adjustment. When validation data are available, the proposed method provides a new statistical model to account for the new (single-directional) misclassification mechanism (latent links), which yields both accurate outcome prediction and parameter estimation. However, validation data for partially observed network data are rarely available. Our theoretical proof and numerical results demonstrate that, with a new optimization procedure, the lack of validation data has little impact on latent link prediction in partially observed network data. With further investigation and experiments, this finding on outcome prediction without validation data may be extended to general binary outcome regression models beyond statistical network models. The novel outcome misclassification specification for single-directional misclassification is well suited to the efficient computation of the expected likelihood.
We evaluate the proposed method using three network topological structures in a simulation study and two data examples from an undergraduate student cohort and a brain connectome study. Our method demonstrates performance that is better than or comparable to that of existing machine learning and nonparametric methods in all scenarios. In practice, complex network data, including both social and brain connectome networks, present combined patterns of organized topological structures and randomness. Our model provides accurate prediction of the latent links. In addition, our method is flexible and can incorporate information from link-related multivariate covariates and latent network topological structures into network models.
Our current link prediction model is only applicable to binary and undirected graphs. Thus, it is limited to binary brain network analysis. Future work may further develop the measurement error based network model for weighted brain network analysis. Many statistical brain network models1 and graph summary statistics (eg, resource efficiency and path transitivity) are also built on binary edges.40 In these applications, latent link prediction is critical because it more accurately reflects the underlying graph theoretical properties.
In summary, the proposed link prediction model can integrate information from the observed network and other covariates and effectively recover the full network based on incomplete network data.
6 |. SOFTWARE
Software in the form of Matlab code, together with a sample input data set and complete documentation, is available at https://github.com/qwu1221/LinkPred.
Funding information
NIH, Grant/Award Numbers: 1DP1DA048968-01, R01 MH115031, R01 MH094460
APPENDIX A
A1. Numerical evaluation of the impact of θ estimation on link prediction by Theorem 1
To further validate Theorem 1, we perform numerical experiments to examine the impact of biased θ estimation on link prediction. We generate a social network with 50 nodes and sample edges (yij, i, j = 1, …, 50) from a Bernoulli(πij) distribution, where πij is a linear combination of X1 ~ exp(1) and X2 ~ N(0, 1). The observed edges wij are sampled from the edges with yij = 1 using another Bernoulli distribution with probability ϕij(θ). We use θ0 = 0.1 to generate the data.
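The data-generating steps above can be sketched as follows. This is only an illustration under explicit assumptions: the logistic link that maps the linear combination of X1 and X2 to πij, the coefficient values in `beta`, and the constant observation probability `p_obs` (a placeholder for ϕij(θ), whose functional form is not restated here) are hypothetical choices, as are all function and variable names.

```python
import numpy as np

def simulate_partial_network(n=50, beta=(-1.0, 0.5, 0.5), p_obs=0.9, seed=0):
    """Simulate true links y and partially observed links w for n nodes.

    Assumptions (not taken from the paper): logistic link, the values in
    `beta`, and a constant P(w = 1 | y = 1) = p_obs standing in for phi_ij(theta).
    """
    rng = np.random.default_rng(seed)
    iu, ju = np.triu_indices(n, k=1)           # all n(n-1)/2 node pairs
    m = iu.size
    x1 = rng.exponential(scale=1.0, size=m)    # X1 ~ exp(1)
    x2 = rng.normal(loc=0.0, scale=1.0, size=m)  # X2 ~ N(0, 1)
    eta = beta[0] + beta[1] * x1 + beta[2] * x2
    pi = 1.0 / (1.0 + np.exp(-eta))            # assumed logistic link
    y = rng.binomial(1, pi)                    # true links
    # Only true links can be observed: misclassification is single-directional.
    w = y * rng.binomial(1, p_obs, size=m)
    return {"pairs": (iu, ju), "x1": x1, "x2": x2, "pi": pi, "y": y, "w": w}

sim = simulate_partial_network()
print("node pairs:", sim["y"].size)
print("true links:", int(sim["y"].sum()), "observed links:", int(sim["w"].sum()))
```

Because wij is generated only where yij = 1, every observed link is a true link, while some true links remain latent.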
We evaluate the impact of a biased estimate θ̂ on the c statistic by letting θ̂ take several values, including 0.15. Predictions are calculated based on the step one optimization for all 1225 (50 × (50 − 1)/2) possible links in the network. Next, the 749 700 (1225 × (1225 − 1)/2) pairs of links are used to count the concordant and discordant pairs. The AUC of the ROC curve (which approximately equals the c statistic) is determined by the proportion of concordant pairs. If the ordering of the two links in every pair is the same under θ̂ as under θ0, the proportion of concordant pairs is unchanged and the c statistic (AUC of the ROC curve) is not affected by the estimation bias. Therefore, we can evaluate the impact of biased estimation of θ on the performance of link prediction by comparing the signs of the pairwise differences in predictions under θ̂ and under θ0. The simulation results in Table A1 indicate that less than 0.1% of the concordant and discordant pairs are affected by the biased θ̂. Therefore, we conclude that a moderately biased θ̂ due to the lack of a validation data set has little impact on the prediction of latent links for our method.
TABLE A1.
The agreement of concordance and discordance metrics between
θ0 = 0.1 | |||
---|---|---|---|
384 581 (0.5130) | 284 (3.7882e–04) | ||
336 (4.4818e–04) | 364 499 (0.4862) | ||
384 787 (0.5133) | 111 (1.7340e–04) | ||
130 (1.4806e–04) | 364 672 (0.4864) | ||
384 812 (0.5133) | 105 (1.4006e–04) | ||
105 (1.4006e–04) | 364 678 (0.4864) | ||
384 648 (0.5131) | 281 (3.7482e–04) | ||
269 (3.5881e–04) | 364 502 (0.4862) |
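The pairwise comparison behind a 2 × 2 block of this kind can be sketched in Python. This is one possible reading of the procedure, not the authors' implementation: the function name and the two score vectors are hypothetical, and the example uses random scores rather than model predictions.

```python
import numpy as np

def pairwise_sign_table(scores_a, scores_b):
    """Cross-tabulate the sign of the pairwise score difference under two
    sets of predictions for the same links (rows: scores_a, columns: scores_b)."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    ia, ib = np.triu_indices(a.size, k=1)      # every unordered pair of links
    da = a[ia] - a[ib]
    db = b[ia] - b[ib]
    return np.array([
        [np.sum((da > 0) & (db > 0)), np.sum((da > 0) & (db < 0))],
        [np.sum((da < 0) & (db > 0)), np.sum((da < 0) & (db < 0))],
    ])

rng = np.random.default_rng(0)
scores_ref = rng.random(1225)                  # hypothetical predictions, one per link
scores_alt = scores_ref + rng.normal(scale=1e-4, size=1225)  # slightly perturbed

table = pairwise_sign_table(scores_ref, scores_alt)
n_pairs = 1225 * (1225 - 1) // 2               # 749 700 pairs of links
print(table)
print("proportion of pairs whose ordering changes:",
      (table[0, 1] + table[1, 0]) / n_pairs)
```

The off-diagonal cells count the pairs whose ordering changes between the two sets of predictions. When they are a negligible share of all pairs, the proportion of concordant pairs, and hence the c statistic, is essentially unchanged.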
Footnotes
DATA AVAILABILITY STATEMENT
Data subject to third-party restrictions.
REFERENCES
- 1. Simpson SL, Hayasaka S, Laurienti PJ. Exponential random graph modeling for complex brain networks. PLoS One. 2011;6(5):e20039.
- 2. Simpson SL, Bahrami M, Laurienti PJ. A mixed-modeling framework for analyzing multitask whole-brain network data. Netw Neurosci. 2019;3(2):307–324.
- 3. Zhao Y, Wu Y-J, Levina E, Zhu J. Link prediction for partially observed networks. J Comput Graph Stat. 2017;26(3):725–733.
- 4. Simpson SL, Laurienti PJ. A two-part mixed-effects modeling framework for analyzing whole-brain network data. NeuroImage. 2015;113:310–319.
- 5. Wang Y, Kang J, Kemmer PB, Guo Y. An efficient and reliable statistical method for estimating functional connectivity in large scale brain networks using partial correlation. Front Neurosci. 2016;10:123.
- 6. Zhang M, Chen Y. Link prediction based on graph neural networks. Advances in Neural Information Processing Systems, Montreal Canada; 2018:5165–5175.
- 7. Goldenberg A, Zheng AX, Fienberg SE, Airoldi EM. A survey of statistical network models. Found Trends Mach Learn. 2010;2(2):129–233.
- 8. Frank O, Strauss D. Markov graphs. J Am Stat Assoc. 1986;81(395):832–842.
- 9. Holland PW, Leinhardt S. An exponential family of probability distributions for directed graphs. J Am Stat Assoc. 1981;76(373):33–50.
- 10. Strauss D, Ikeda M. Pseudolikelihood estimation for social networks. J Am Stat Assoc. 1990;85(409):204–212.
- 11. Wasserman S, Pattison P. Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika. 1996;61(3):401–425.
- 12. Snijders TAB, Van Duijn MAJ. Conditional maximum likelihood estimation under various specifications of exponential random graph models. Contributions to Social Network Analysis, Information Theory, and Other Topics in Statistics: A Festschrift in Honour of Ove Frank on the Occasion of His 65th Birthday. 2002;117–134. https://books.google.com/books/about/Contributions_to_Social_Network_Analysis.html?id=LdawAAAACAAJ.
- 13. Hoff PD, Raftery AE, Handcock MS. Latent space approaches to social network analysis. J Am Stat Assoc. 2002;97(460):1090–1098.
- 14. O’Madadhain J, Hutchins J, Smyth P. Prediction and ranking algorithms for event-based network data. ACM SIGKDD Explor Newsl. 2005;7(2):23–30.
- 15. Al Hasan M, Chaoji V, Salem S, Zaki M. Link prediction using supervised learning. Paper presented at: Proceedings of the SDM06: Workshop on Link Analysis, Counter-Terrorism and Security, Bethesda, Maryland; 2006.
- 16. Miller K, Jordan MI, Griffiths TL. Nonparametric latent feature models for link prediction. Advances in Neural Information Processing Systems, Vancouver, Canada; 2009:1276–1284.
- 17. Martinez V, Berzal F, Cubero J-C. A survey of link prediction in complex networks. ACM Comput Surv (CSUR). 2017;49(4):69.
- 18. Potter GE, Smieszek T, Sailer K. Modeling workplace contact networks: the effects of organizational structure, architecture, and reporting errors on epidemic predictions. Netw Sci. 2015;3(3):298–325.
- 19. Lyles RH, Lin J. Sensitivity analysis for misclassification in logistic regression via likelihood methods and predictive value weighting. Stat Med. 2010;29(22):2297–2309.
- 20. Magder LS, Hughes JP. Logistic regression when the outcome is measured with uncertainty. Am J Epidemiol. 1997;146(2):195–203.
- 21. Neuhaus JM. Bias and efficiency loss due to misclassified responses in binary regression. Biometrika. 1999;86(4):843–855.
- 22. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. Boca Raton, FL: Chapman & Hall/CRC Press; 2006.
- 23. Clauset A, Moore C, Newman MEJ. Hierarchical structure and the prediction of missing links in networks. Nature. 2008;453(7191):98.
- 24. Rhodes CJ, Jones P. Inferring missing links in partially observed social networks. J Oper Res Soc. 2009;60(10):1373–1383.
- 25. Zhao Y, Weko C. Network inference from grouped observations using hub models. Stat Sin. 2019;29(1):225–244.
- 26. Hunter DR, Handcock MS, Butts CT, Goodreau SM, Morris M. ergm: a package to fit, simulate and diagnose exponential-family models for networks. J Stat Softw. 2008;24(3):nihpa54860.
- 27. Shojaie A. Link prediction in biological networks using multi-mode exponential random graph models. Paper presented at: Proceedings of the 11th Workshop on Mining and Learning with Graphs, Chicago, Illinois; 2013:987–991.
- 28. Lyles RH, Tang L, Superak HM, et al. Validation data-based adjustments for outcome misclassification in logistic regression: an illustration. Epidemiology. 2011;22(4):589.
- 29. Buonaccorsi JP. Measurement Error: Models, Methods, and Applications. Boca Raton, FL: CRC Press; 2010.
- 30. Reilly M, Pepe MS. A mean score method for missing and auxiliary covariate data in regression models. Biometrika. 1995;82(2):299–314.
- 31. Karwa V, Krivitsky PN, Slavković AB. Sharing social network data: differentially private estimation of exponential family random-graph models. J Royal Stat Soc Ser C (Appl Stat). 2017;66(3):481–500.
- 32. Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med. 2014;33(3):517–535.
- 33. Harrell FE Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. New York, NY: Springer; 2015.
- 34. Li K-C, Duan N. Regression analysis under link violation. Ann Stat. 1989;17(3):1009–1052.
- 35. Kingman JFC. Uses of exchangeability. Ann Probab. 1978;6(2):183–197.
- 36. Zhou S, Mondragón RJ. The rich-club phenomenon in the Internet topology. IEEE Commun Lett. 2004;8(3):180–182.
- 37. Zhang Y, Levina E, Zhu J. Estimating network edge probabilities by neighbourhood smoothing. Biometrika. 2017;104(4):771–783.
- 38. Airoldi EM, Costa TB, Chan SH. Stochastic blockmodel approximation of a Graphon: theory and consistent estimation. Adv Neural Inf Process Syst. 2013;26:692–700.
- 39. De Vico Fallani F, Richiardi J, Chavez M, Achard S. Graph analysis of functional brain networks: practical issues in translational neuroscience. Philos Trans Royal Soc B Biol Sci. 2014;369(1653):20130521.
- 40. Rubinov M, Sporns O. Complex network measures of brain connectivity: uses and interpretations. NeuroImage. 2010;52(3):1059–1069.
- 41. Van Wijk BCM, Stam CJ, Daffertshofer A. Comparing brain networks of different size and connectivity density using graph theory. PLoS One. 2010;5(10):e13701.
- 42. Fan L, Li H, Zhuo J, et al. The human brainnetome atlas: a new brain atlas based on connectional architecture. Cereb Cortex. 2016;26(8):3508–3526.
- 43. Chen S, Bowman FDB, Xing Y. Detecting and testing altered brain connectivity networks with k-partite network topology. Comput Stat Data Anal. 2020;141:109–122.
- 44. Schweinberger M, Krivitsky PN, Butts CT, Stewart J. Exponential-family models of random graphs: inference in finite-, super-, and infinite population scenarios; 2017. arXiv preprint arXiv:1707.04800.