Biostatistics (Oxford, England). 2017 Dec 16;20(1):97–110. doi: 10.1093/biostatistics/kxx066

Fixed choice design and augmented fixed choice design for network data with missing observations

Miles Q Ott 1, Matthew T Harrison 2, Krista J Gile 3, Nancy P Barnett 4, Joseph W Hogan 5
PMCID: PMC6296337  PMID: 29267874

Summary

The statistical analysis of social networks is increasingly used to understand social processes and patterns. The association between social relationships and individual behaviors is of particular interest to sociologists, psychologists, and public health researchers. Several recent network studies make use of the fixed choice design (FCD), which induces missing edges in the network data. Because of the complex dependence structure inherent in networks, missing data can pose very difficult problems for valid statistical inference. In this article, we introduce novel methods for accounting for the FCD censoring and introduce a new survey design, which we call the augmented fixed choice design (AFCD). The AFCD adds considerable information to analyses without unduly burdening the survey respondent, resulting in improvements over the FCD and other existing estimators. We demonstrate this new method through simulation studies and an analysis of alcohol use in a network of undergraduate students living in a residence hall.

Keywords: Augmented fixed choice design, Fixed choice design, Missing data, Right censoring by degree, Social network

1. Introduction

The statistical analysis of social networks is increasingly used to understand social processes and patterns (Knoke and Yang, 2008). Of particular interest to sociologists, psychologists, and public health researchers is the association between social relationships and individual behaviors. Social relationships are often measured through nominations: for example, in a friendship network, a nomination from one person to another indicates that the first person claims the second as a friend. When analyzing a network, there is often the possibility that nominations are missing (Kossinets, 2006). Because of the complex dependence structure inherent in networks, missing data can pose very difficult problems (Holland and Leinhardt, 1973; Marsden, 1990; Kossinets, 2006). Even when missingness is at random, it can induce bias in structural measures of the network, such as homophily and centrality (Smith and Moody, 2013). Of particular interest is how to handle missingness when analyzing the network for peer effects on behavior.

A considerable amount of missingness is a result of study design. Several recent studies on networks in school settings make use of the fixed choice design (FCD), which typically induces missing edges. In the FCD, the number of nominations that each person in the network can make is capped at a maximum, Inline graphic, inducing missing nominations (Holland and Leinhardt, 1973; Kossinets, 2006; Yan and Gregory, 2011). For example, studies have restricted the number of friends in a classroom to 4 when studying depression (Witvliet and others, 2010) and the number of best friends to 5 when studying smoking (Mercken and others, 2010), and a study on infectious disease transmission on social networks allowed participants to name up to 6 within-class contacts and up to 4 outside-of-class contacts (Conlan and others, 2010). The National Longitudinal Study of Adolescent Health (Add Health) allowed participants to nominate and rank up to 5 boys and 5 girls as friends (Resnick and others, 1997; Goodreau, 2007) and is the basis for significant methodological advancements in the analysis of social networks with missing data. For example, Goodreau and others (2009) approach the Add Health censoring mechanism by assuming that the true network of interest is the network consisting of the top five male and top five female friends of each respondent. Hipp and others (2015) investigated the consequences of missing observations in longitudinal network data and found that different methods of accounting for missingness can lead to vastly different results. Wang and others (2016) present an exponential random graph model (ERGM) method for the imputation of missing network data using the Add Health study as an application. Handcock and Gile (2007) present network modeling approaches for networks with missing data, with application to the Add Health study.

Holland and Leinhardt (1973) first introduced the problem of missing ties due to the FCD (also called limited choice design and right-censoring of degree). Kossinets (2006) subsequently showed how the FCD can lead to biases in estimates of structural measures of the network. Gommans and Cillessen (2015) compared analyses on the same populations of elementary school students, with FCD as well as a design without censoring, and found significant differences in the conclusions drawn from the two different data collection schemes.

Hoff and others (2013) have developed a likelihood based approach for fixed rank network data (where there is a maximum number of nominations that could be made, and those nominations are ranked) as well as for FCD, and then used these likelihoods in Bayesian estimation of latent variables which are assumed to govern the nominations and their ranks.

In a study on the transmission of influenza through household contacts, Mossong and others (2008) collected egocentric information on the number of household contacts an individual in the household has made in a given day, without collecting which specific household members the ego was in contact with. Potter and others (2011) used these data to model disease transmission between household members, showing that the number of contacts, or edges, can be useful information even when Inline graphic is capped at zero.

In this work, we introduce a novel approach that can improve inference for FCD data: collecting the true total number of nominations in addition to the standard FCD data. We develop a method that accounts for the missingness resulting from the FCD, given the true total nominations, and show how different censoring cut-offs affect parameter estimation using a dyad-independent ERGM. In Section 2, we introduce novel methods for accounting for the fixed choice censoring and introduce a new survey design. In Section 3, we present simulation studies in which we demonstrate our methods for handling fixed choice data and compare the effects of different censoring cut-offs on parameter estimation, as well as different estimation methods. In Section 4, we demonstrate our method in an analysis of relationships in the presence of alcohol use in a network of undergraduates living in a residence hall. A discussion is presented in Section 5.

2. Model formulation and inference

2.1. General framework

Given a family of probability models indexed by a parameter Inline graphic we use the notation Inline graphic to denote the probability mass function (pmf) of a discrete random variable Inline graphic under the model with parameter Inline graphic. Usually, the model will also involve fixed covariates, but this is suppressed for now in our notation. Our goal is to identify what Inline graphic is so that we can find the maximum likelihood estimates of Inline graphic given our data Inline graphic. If Inline graphic is a matrix, then we use Inline graphic to denote the Inline graphicth row of Inline graphic. If Inline graphic is a vector, then we use Inline graphic to denote the sum of Inline graphic.

Let Inline graphic be an Inline graphic sociomatrix, where Inline graphic if Inline graphic nominates Inline graphic and Inline graphic for all Inline graphic. Note that Inline graphic does not imply that Inline graphic, meaning that these relationships may be unreciprocated. We use Inline graphic to denote the row sums of Inline graphic, i.e., Inline graphic. We assume that Inline graphic is the sociomatrix that would be observed without any reporting constraints, and, hence, Inline graphic is the true total of nominations of the Inline graphicth subject. If Inline graphic could be observed, then, given a computationally tractable probability model Inline graphic, we could use standard likelihood-based methods to estimate Inline graphic. A simple model, which we assume here, is that all ties are independent Bernoulli random variables, namely,

graphic file with name M39.gif (2.1)

where Inline graphic is designed according to the problem at hand.

In a FCD that allows at most Inline graphic nominations per subject we do not observe Inline graphic, but instead observe a censored sociomatrix Inline graphic. We assume the following possible censoring mechanism: if Inline graphic, then there is no censoring and Inline graphic, otherwise, the subject reports exactly Inline graphic of the Inline graphic original nominations chosen uniformly at random. Consequently, the joint pmf of Inline graphic is

graphic file with name M49.gif (2.2)

with

graphic file with name M50.gif (2.3)

Equation (2.3) involves the term

graphic file with name M51.gif (2.4)

which is simply the probability that a sum of independent Bernoulli random variables is equal to Inline graphic and is easy to compute using discrete convolution.
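As an illustration of the discrete convolution mentioned here, the following minimal Python sketch builds the pmf of a sum of independent Bernoulli variables one variable at a time. The function name `bernoulli_sum_pmf` and the example probabilities are assumptions introduced for illustration and are not taken from the authors' code.

```python
def bernoulli_sum_pmf(probs):
    """Return [P(S = 0), ..., P(S = len(probs))] where S is the sum of
    independent Bernoulli variables with success probabilities `probs`."""
    pmf = [1.0]  # pmf of the empty sum: P(S = 0) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this tie is absent: total stays at k
            nxt[k + 1] += mass * p      # this tie is present: total becomes k + 1
        pmf = nxt
    return pmf


# Two fair coins: the number of heads is 0, 1, or 2.
example = bernoulli_sum_pmf([0.5, 0.5])
```

Each pass convolves the running pmf with a two-point distribution, so the cost for one row is quadratic in the number of potential ties.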

From (2.2) to (2.4), we see that Inline graphic is easily computable and can be combined with standard likelihood methods to generate estimates of Inline graphic from joint observations of Inline graphic and Inline graphic. Since Inline graphic is not usually observed in a FCD, we call this design an augmented FCD (AFCD). In a regular FCD, from (2.2) we also see that

graphic file with name M58.gif (2.5)

where we sum over all possible values of Inline graphic for rows Inline graphic with missing data:

graphic file with name M61.gif (2.6)

Note from (2.3) when we are finding the Inline graphic, we are in the fully observed data case when Inline graphic, which is in contrast to (2.6) when there are missing observations Inline graphic since we do not observe Inline graphic. This is because when Inline graphic is known, then we are able to discern whether we have complete data when Inline graphic. However when Inline graphic is censored, as in the FCD setting, we cannot be certain if Inline graphic or if Inline graphic.

As noted above, our goal for both the AFCD and FCD is to identify the function Inline graphic in the AFCD setting, or the function Inline graphic in the FCD setting so that we may find the maximum likelihood estimates of Inline graphic given our data. Now that Inline graphic and Inline graphic have been specified for the AFCD and FCD data cases, respectively, we employ one of the many available optimization techniques to find the maximum likelihood estimates of Inline graphic given the available data.
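To make the two observed-data likelihoods concrete, here is a hedged Python sketch of the log-likelihood contribution of a single row, derived only from the censoring mechanism stated above (independent Bernoulli ties, and a uniformly random subset of size equal to the cap reported when the true total exceeds the cap). The names `c_row`, `d`, `m`, and `probs` are notation assumed for this sketch, tie probabilities are assumed to lie strictly between 0 and 1, and the code should be checked against (2.2)-(2.6) before use.

```python
import math


def bernoulli_sum_pmf(probs):
    """pmf of a sum of independent Bernoulli variables (discrete convolution)."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf


def _split(c_row, probs):
    named = [p for c, p in zip(c_row, probs) if c == 1]
    rest = [p for c, p in zip(c_row, probs) if c == 0]
    return named, rest


def afcd_row_loglik(c_row, d, probs, m):
    """log P(reported row = c_row, true total = d) with cap m (AFCD data)."""
    k = sum(c_row)
    named, rest = _split(c_row, probs)
    if k != min(d, m) or d - k > len(rest):
        return float("-inf")  # impossible under the assumed mechanism
    tail = bernoulli_sum_pmf(rest)[d - k]  # unreported alters supply d - k ties
    if tail <= 0.0:
        return float("-inf")
    # Divide by the number of equally likely size-k subsets of the d true ties.
    return sum(math.log(p) for p in named) + math.log(tail) - math.log(math.comb(d, k))


def fcd_row_loglik(c_row, probs, m):
    """log P(reported row = c_row) with cap m when the total is not observed."""
    k = sum(c_row)
    if k > m:
        return float("-inf")
    named, rest = _split(c_row, probs)
    tail = bernoulli_sum_pmf(rest)
    log_named = sum(math.log(p) for p in named)
    if k < m:
        return log_named + math.log(tail[0])  # row is known to be uncensored
    # k == m: sum over every true total d = m, ..., m + len(rest).
    total = sum(tail[d - m] / math.comb(d, m) for d in range(m, m + len(rest) + 1))
    return log_named + math.log(total)
```

Summing these row contributions over all rows gives an objective that a general-purpose optimizer could maximize. With a cap of zero the FCD contribution is identically zero in this sketch, while the AFCD contribution still depends on the reported total.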

2.2. Variance estimation

The variance estimation for Inline graphic is non-trivial. Simply using the observed information will not incorporate the uncertainty of the censored values. To quantify the variance of the maximum likelihood estimates of the Inline graphic parameters, we use a parametric bootstrap with Inline graphic bootstrap samples, following these steps.

  1. Maximize the appropriate likelihood as described above [(2.3) for AFCD or (2.6) for FCD] to produce Inline graphic.

  2. Using Inline graphic and the same covariates used in the model in Step 1, generate sociomatrix Inline graphic.

  3. With uniform probability, delete Inline graphic edges for each row Inline graphic for Inline graphic in Inline graphic of Inline graphic to generate Inline graphic.

  4. Maximize the appropriate likelihood using Inline graphic to obtain Inline graphic.

  5. Repeat Steps 2–4 Inline graphic times to generate a distribution of Inline graphic, which can be used to obtain bootstrapped standard errors and confidence intervals.
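The steps above can be sketched in Python as follows. This is an illustrative outline rather than the authors' implementation: `tie_probs` (mapping a parameter value to a matrix of tie probabilities) and `fit` (mapping censored data to an estimate) are hypothetical callables that the analyst would supply, and Step 1 is assumed to have been carried out already to give `theta_hat`.

```python
import random
import statistics


def simulate_network(probs, rng):
    """Step 2: draw a directed 0/1 sociomatrix with independent ties."""
    n = len(probs)
    return [[1 if i != j and rng.random() < probs[i][j] else 0
             for j in range(n)] for i in range(n)]


def censor_uniform(y, m, rng):
    """Step 3: keep at most m ties per row, chosen uniformly at random.

    Returns the censored matrix and the true row totals."""
    censored, totals = [], []
    for row in y:
        ties = [j for j, tie in enumerate(row) if tie == 1]
        kept = set(ties) if len(ties) <= m else set(rng.sample(ties, m))
        censored.append([1 if j in kept else 0 for j in range(len(row))])
        totals.append(len(ties))
    return censored, totals


def parametric_bootstrap(theta_hat, tie_probs, fit, m, n_boot, seed=0):
    """Steps 2-5: refit the model on n_boot simulated, censored networks."""
    rng = random.Random(seed)
    probs = tie_probs(theta_hat)
    estimates = []
    for _ in range(n_boot):
        y_star = simulate_network(probs, rng)
        c_star, d_star = censor_uniform(y_star, m, rng)
        estimates.append(fit(c_star, d_star))  # an FCD fit would ignore d_star
    return estimates


# Toy usage: a single parameter (a common tie probability) and a simple
# estimator that uses only the reported totals. Both are hypothetical.
n = 8
boot = parametric_bootstrap(
    theta_hat=0.3,
    tie_probs=lambda p: [[p] * n for _ in range(n)],
    fit=lambda c, d: sum(d) / (n * (n - 1)),
    m=2,
    n_boot=200,
)
bootstrap_se = statistics.stdev(boot)
```

Keeping a uniformly random subset of size m is equivalent to deleting the excess edges uniformly at random, which is how Step 3 is phrased.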

3. Simulation studies

Using the general framework described above, we experiment with models of the form

graphic file with name M93.gif

where Inline graphic is an appropriate link function for binary data, the parameter Inline graphic is a column vector, and Inline graphic is a column vector of known covariates that may depend on both Inline graphic and Inline graphic. If Inline graphic were fully observed, so that we could use Inline graphic from (2.1) for inference, then this would be regression with edge-level covariates. Instead, for a FCD or an AFCD, we use Inline graphic or Inline graphic, respectively, as derived in the previous section to find maximum likelihood estimates. We proceed in the rest of this article to use the probit link function, though other link functions could also be implemented.

We do not directly observe edge-level covariates in this simulation, but rather create them from vertex-level covariates. For each vertex Inline graphic, let Inline graphic be a vector of known covariates. Define the edge-level covariates as some subset of

graphic file with name M105.gif

which has dimension Inline graphic. For example, if we look at age as the single variable of interest (Inline graphic), then we define:

graphic file with name M108.gif

for all Inline graphic and Inline graphic. If we were to use age and income as the two covariates of interest (Inline graphic), then we define:

graphic file with name M112.gif

for all Inline graphic and Inline graphic.

3.1. Simulation design

In order to contrast the AFCD, FCD, a naive analysis (where we assume that all unobserved values of Inline graphic are non-edges), and the censored binary (CB) estimator of Hoff and others (2013), we performed a simulation study.

For a simulated population of size Inline graphic, we first generated a continuous covariate Inline graphic for all Inline graphic members of the simulated population, which stays fixed for all simulations. In keeping with the previously defined notation, we next generated directed edges between members of the network such that the probability that individual Inline graphic has a directed relationship with individual Inline graphic, denoted by Inline graphic, is independent given Inline graphic:

graphic file with name M123.gif

where

graphic file with name M124.gif

and Inline graphic is a Inline graphic vector. Note that our formulation allows for row covariates (Inline graphic), column covariates (Inline graphic), and edge covariates (Inline graphic), and that here we do not use the column covariates. We next simulate the censoring processes of the AFCD, the FCD, and the CB (which has the same censoring process as the FCD) for maximum number of nominations Inline graphic. Under the AFCD, in row Inline graphic, given Inline graphic and Inline graphic, when Inline graphic, Inline graphic edges are censored, where each of the Inline graphic edges has an equal probability of being censored. Moreover, all non-edges in rows where Inline graphic are censored. This gives us the observed AFCD data Inline graphic. For the FCD, for a maximum Inline graphic in each row where Inline graphic, Inline graphic edges are censored, and all non-edges in these rows are censored, to give us the observed FCD data Inline graphic.
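A network of this kind could be generated along the following lines. This Python sketch assumes a linear predictor with an intercept, a row covariate, and the absolute covariate difference as the edge covariate, which is one reading of the description in this section. The slopes 0.02 and -0.025 are the values discussed in Section 3.2, while the intercept of -1.3, the uniform covariate distribution, and the seed are hypothetical choices made only so that the example runs.

```python
import math
import random


def probit_inverse(eta):
    """Standard normal CDF, the inverse of the probit link."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def tie_probabilities(x, beta):
    """P(i nominates j) from a row covariate x[i] and the edge covariate
    |x[i] - x[j]|; the diagonal is fixed at zero (no self-nominations)."""
    b0, b1, b2 = beta
    n = len(x)
    return [[0.0 if i == j
             else probit_inverse(b0 + b1 * x[i] + b2 * abs(x[i] - x[j]))
             for j in range(n)] for i in range(n)]


rng = random.Random(2017)                          # hypothetical seed
x = [rng.uniform(0.0, 10.0) for _ in range(100)]   # hypothetical covariate
beta = (-1.3, 0.02, -0.025)                        # intercept is hypothetical
probs = tie_probabilities(x, beta)

# Independent Bernoulli ties given the covariates.
y = [[1 if rng.random() < probs[i][j] else 0 for j in range(100)]
     for i in range(100)]
out_degree = [sum(row) for row in y]               # true total nominations
```

The censored AFCD and FCD data would then be obtained by applying the censoring process described above to `y`, keeping `out_degree` only in the AFCD case.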

We then obtain maximum likelihood estimators under the AFCD and FCD with varying Inline graphic values. We also find estimates for the CB using the posterior means of Inline graphic using the amen package in R (Hoff and others, 2015). Finally, we carry out the naive analysis by applying probit regression to the Inline graphic data where we treat all censored values as zero, which is the default current practice for analyzing FCD.

3.2. Simulation results

The distribution of the covariate value Inline graphic is displayed in Figure 1(a). We generated the network using Inline graphic which generates networks with a roughly normal distribution of edges with mean number of edges = 10; however, having a normal distribution of edges is not a requirement here. Here, beta Inline graphic implies that for Inline graphic where Inline graphic, the probit of Inline graphic nominating Inline graphic is equal to Inline graphic. Likewise, Inline graphic implies that for a fixed value in the absolute difference between Inline graphic and Inline graphic, as Inline graphic increases by one unit, then the probit of Inline graphic nominating Inline graphic increases by 0.02. Last, Inline graphic implies that for a fixed value of Inline graphic as the absolute difference between Inline graphic and Inline graphic increases by one unit, the probit of Inline graphic nominating Inline graphic decreases by 0.025. We simulated 100 different networks with 100 nodes using these covariate and Inline graphic values. The distribution of Inline graphic for all 100 simulations is plotted in Figure 1(b), where grey bars indicate Inline graphic values used to censor the simulated data Inline graphic. For the CB, which uses Bayesian inference, we used MCMC and generated 55 000 posterior draws. The first 5000 draws were discarded, and of the remaining 50 000 posterior draws, we used every 25th draw, for a total of 2000 posterior draws. We present means of these 2000 posterior draws as the estimates.

Fig. 1.

Simulation specifications: (a) distribution of covariates used to generate simulated networks and (b) distribution of the true total nominations made across 100 simulations. The grey bars denote values of Inline graphic used to impose censoring in 100 different simulations Inline graphic.

We present the mean squared error (MSE), empirical bias, and standard deviations (SDs) of the estimated Inline graphic’s from the 100 simulations in Table 1.
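For reference, these three summaries can be computed from a set of replicate estimates as in the short Python sketch below. The function name `summarize` is hypothetical, and the use of the population standard deviation (dividing by the number of replicates) is an assumption; with that choice the MSE decomposes exactly as squared bias plus variance.

```python
import statistics


def summarize(estimates, truth):
    """Empirical MSE, bias, and SD of replicate estimates of one parameter."""
    estimates = list(estimates)
    bias = statistics.fmean(estimates) - truth
    sd = statistics.pstdev(estimates)  # population SD over the replicates
    mse = statistics.fmean([(e - truth) ** 2 for e in estimates])
    return {"mse": mse, "bias": bias, "sd": sd}


# Three hypothetical replicate estimates of a parameter whose true value is 1.5.
summary = summarize([1.0, 2.0, 3.0], truth=1.5)
```

A low-variance estimator can therefore have a small MSE despite noticeable bias, and a nearly unbiased estimator can have a large MSE when its variance is high.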

Table 1.

Mean squared error, empirical bias, and standard deviation of estimated Inline graphic and Inline graphic from 100 simulations, varying the maximum number of nominations observed, Inline graphic

Rows are indexed by the maximum number of nominations (Max). A dash marks an entry with no value in the source table.

Inline graphic Inline graphic MSE Inline graphic Inline graphic Inline graphicBiasInline graphic Inline graphic SD Inline graphic
         MSE                              Bias                                SD
Max     AFCD     FCD   Naive      CB      AFCD     FCD   Naive      CB      AFCD     FCD   Naive      CB
0      34.68       -       -       -      4.43       -       -       -      3.90       -       -       -
2       0.34  153.08   73.84  377.03      0.01    7.85    8.59   18.84      0.59    9.61    0.26    4.72
4       0.27   32.00   28.31  501.85      0.02    2.46    5.32   18.12      0.52    5.12    0.20   13.24
6       0.26    1.10   10.42   77.28      0.03    0.09    3.22    7.20      0.51    1.05    0.23    5.07
8       0.25    0.40    3.26    4.95      0.03    0.06    1.79    1.77      0.50    0.64    0.28    1.36

Inline graphic Inline graphic MSE Inline graphic Inline graphic Inline graphicBiasInline graphic Inline graphic SD Inline graphic
         MSE                              Bias                                SD
Max     AFCD     FCD   Naive      CB      AFCD     FCD   Naive      CB      AFCD     FCD   Naive      CB
0       5.13       -       -       -     48.38       -       -       -      0.53       -       -       -
2       0.17   27.22    0.04   16.88      1.30   18.21    5.32   16.51      0.13    1.65    0.03    1.30
4       0.16   10.85    0.02  108.48      1.32   13.08    3.70   27.18      0.13    1.04    0.03    3.30
6       0.16    0.60    0.03   15.53      1.20    1.11    3.27   10.70      0.13    0.25    0.04    1.25
8       0.16    0.26    0.05    0.98      1.06    2.53    2.52    0.09      0.13    0.16    0.06    0.31

Inline graphic Inline graphic MSE Inline graphic Inline graphic Inline graphicBiasInline graphic Inline graphic SD Inline graphic
         MSE                              Bias                                SD
Max     AFCD     FCD   Naive      CB      AFCD     FCD   Naive      CB      AFCD     FCD   Naive      CB
0     936.23       -       -       -    651.87       -       -       -     71.89       -       -       -
2       1.30   45.91    5.95    3.54      3.45  134.78   71.59   50.70      3.61   16.74    2.89    3.13
4       0.58    6.14    2.97    1.48      2.32   39.16   50.54   31.84      2.40    6.82    2.06    2.18
6       0.37    0.43    1.54    0.63      1.10    2.09   35.04   17.30      1.93    2.07    1.77    1.84
8       0.31    0.33    0.78    0.34      0.29    0.88   22.23    6.23      1.76    1.84    1.70    1.74

For a given maximum Inline graphic, the AFCD had lower MSE than the FCD and CB for Inline graphic and Inline graphic. The AFCD had lower MSE than the naive estimator (in which the data are falsely assumed to be uncensored) for Inline graphic and Inline graphic, though the naive estimator had lower MSE than the AFCD for Inline graphic, due to its lower variance. For the Inline graphic parameter, having an AFCD with Inline graphic had a lower MSE than FCD with Inline graphic, naive with Inline graphic and CB with Inline graphic. For the Inline graphic parameter, having an AFCD with Inline graphic outperformed a FCD with Inline graphic and CB with Inline graphic in terms of MSE. For Inline graphic, the AFCD had lower MSE for all levels of Inline graphic simulated as compared to the other estimators. For higher values of Inline graphic, the relative improvement in MSE for AFCD over FCD tended to diminish. In other words, when the data collection design incurs higher levels of missingness (such as when the maximum number of nominations that can be made in the survey Inline graphic is low relative to Inline graphic, the true total nominations), the added benefit of knowing the true number of relationships can be quite large, which is why the AFCD will outperform the FCD and CB. When Inline graphic is high relative to the distribution of the true total nominations per individual in the network, there will be less missingness and a smaller benefit of the AFCD as compared to the FCD and CB. In general, the AFCD requires lower Inline graphic to achieve comparable MSE to the other estimators, indicating that it may be preferable to collect the total number of relationships Inline graphic rather than marginally increasing Inline graphic.

When comparing the MSE of the FCD to the CB, neither estimator proved uniformly better. While the FCD had substantially lower MSE than the CB when estimating Inline graphic regardless of Inline graphic, when estimating Inline graphic and Inline graphic the CB had lower MSE when Inline graphic was smaller, and the FCD had lower MSE when Inline graphic was larger.

The naive analyses, in which all censored edges are considered to be non-edges (as is common current practice), estimated Inline graphic poorly. This result is not surprising, as the naive estimator will necessarily underestimate the probability of any edge because it treats all censored values as non-edges. Notably, the naive estimator tended to have high bias and a low standard deviation.

4. Analysis of UrWeb study

We next apply the method to the UrWeb data set, in which data were collected from residents of a primarily freshman dormitory (Barnett and others, 2014). Each participant was 18 years old or older when the survey was administered and was asked to report the number of days in a month that they consumed alcohol. Central to our interests here, each participant was asked to nominate which of the other participants were important to them. This network is pictured in Figure 2(a). Among the 129 participants included in the sample, 507 nominations were made; 4 participants did not nominate anyone nor were they nominated.

Fig. 2.

(a) UrWeb network and (b) distribution of nominations made by UrWeb study participants.

The UrWeb data were collected under a FCD, with Inline graphic. In this data set, only one person endorsed the maximum number of nominations, which suggests that there is little design-induced missingness of nominations. We will proceed to show the utility of the methods introduced here by artificially inducing a Inline graphic in the UrWeb data set, estimating parameters, and comparing the estimated parameters when Inline graphic to the parameters estimated with probit regression using the full data set. We will artificially induce Inline graphic by deleting edges of individuals with more than Inline graphic in two different ways, first by randomly deleting edges with uniform probability (as above in the simulation study), and secondly by deleting edges with regard to the order that each individual made their nominations. Figure 2(b) displays the distribution of the number of nominations that were made by each participant in the UrWeb study. We will proceed assuming that the UrWeb data are fully observed, and will demonstrate how this method will work in practice, compare the information loss for different Inline graphic values, and contrast AFCD, FCD, CB, and naive analyses.
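The two ways of artificially imposing a cap can be sketched in Python as follows. This is an illustration only: nominations are assumed to be stored as a dictionary that maps each participant to the list of people they nominated, in the order the nominations were made, and the participant labels are made up.

```python
import random


def censor_random(nominations, m, rng):
    """Keep at most m nominations per participant, chosen uniformly at random."""
    censored = {}
    for ego, alters in nominations.items():
        kept = set(alters) if len(alters) <= m else set(rng.sample(alters, m))
        censored[ego] = [a for a in alters if a in kept]  # preserve the order
    return censored


def censor_by_order(nominations, m):
    """Keep the first m nominations, i.e. delete them in reverse order."""
    return {ego: list(alters[:m]) for ego, alters in nominations.items()}


# Hypothetical ordered nominations for three participants.
nominations = {"p1": ["p2", "p3", "p4"], "p2": ["p1"], "p3": []}
true_totals = {ego: len(alters) for ego, alters in nominations.items()}

random_version = censor_random(nominations, 2, random.Random(0))
ordered_version = censor_by_order(nominations, 2)
```

In both versions `true_totals` is computed before deletion, so it can play the role of the reported totals in an AFCD analysis.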

We use the number of days in a month that the subjects consume alcohol as the Inline graphic covariate in this model:

graphic file with name M238.gif

where Inline graphic indicates that participant Inline graphic nominated Inline graphic, and Inline graphic.

Having artificially induced Inline graphic 100 times, we estimated Inline graphic using the AFCD, FCD, CB, and naive methods. We present boxplots of Inline graphic in Figure 3. In each of these figures, we denote the Inline graphic estimates from the fully observed UrWeb data (where Inline graphic) with a solid horizontal line and the Inline graphic estimate from the full data Inline graphic the estimated standard error from the full data with dotted horizontal lines.

Fig. 3.

Boxplots of Inline graphic, Inline graphic, and Inline graphic, varying maximum number of nominations Inline graphic, and AFCD, FCD, naive analysis, and CB in the UrWeb data set. The black horizontal line is the estimated value of the Inline graphic parameter when using the full UrWeb data. The dotted lines are Inline graphic computed from the full UrWeb data for: (a) Inline graphic, (b) Inline graphic, and (c) Inline graphic, respectively.

When Inline graphic we are only able to obtain maximum likelihood estimates of Inline graphic for the AFCD. Because there is only one way to impose censoring when no nomination data is observed, Figure 3 presents a point estimate rather than a distribution of estimates when Inline graphic.

The analyses in this section differ from those in Section 3 in a few important ways. First, rather than simulating several networks, we are analyzing a single real network Inline graphic and repeatedly removing edges at random to form many different realizations of Inline graphic. In these analyses, we do not know the true Inline graphic values, and so we cannot evaluate the bias of the estimates. However, we can use these analyses to investigate information loss due to the design-induced censoring by comparing AFCD, FCD, naive, and CB estimates when Inline graphic to estimates when we observe Inline graphic. For example in Figure 3, excluding when Inline graphic the estimates of Inline graphic and Inline graphic from the AFCD are all within one standard error of the estimate when the full UrWeb data are observed (Inline graphic). This is in sharp contrast to the FCD, naive, and CB estimates of Inline graphic and Inline graphic when Inline graphic. In general, the AFCD seems to lose less information than the FCD, which in turn loses less information than the naive analysis and the CB.

In this analysis of the UrWeb network, the AFCD produces estimates of Inline graphic that are roughly centered on the full data estimate for Inline graphic. The FCD method produces estimates that diverge from the estimate when the data are fully observed, especially when Inline graphic is small. This result is in agreement with the simulation study, which also showed that the FCD on average produces somewhat biased estimates when Inline graphic is small.

Next, we deleted nominations in the reverse order in which they were made by the participants in the UrWeb study so that Inline graphic. We estimated Inline graphic using AFCD, FCD, CB, and naive methods. For the AFCD and FCD, we calculated standard error estimates from 500 bootstrap samples. For the naive analyses, we simply used the standard error from a regular probit regression model. For the CB, we used the standard deviation of the posterior draws of the Inline graphic terms to estimate the uncertainty of the estimates. These are presented in Figure 4. The UrWeb study was not explicitly designed to accommodate the AFCD, FCD, or CB design in that participants were not prompted to name a random sample of the people who were important to them. It is possible that study participants chose their nominations non-uniformly, for example nominating peers in the order of their importance. By deleting nominations in reverse order, we seek to investigate whether this could impact inference. We see very similar results when comparing Figure 3 in which nominations were deleted independently of order to Figure 4 in which nominations were deleted in reverse order. These results suggest that the order in which nominations were made in this data set did not greatly impact the inferences made in these analyses.

Fig. 4.

Plots of Inline graphic 1 bootstrap standard error for the AFCD and FCD, the probit regression standard error for the naive analysis, and the standard deviation of the posterior distribution estimates for the CB, deleting nominations in the reverse order in which they were made. The dotted lines are Inline graphic computed from the full UrWeb data for: (a) Inline graphic, (b) Inline graphic, and (c) Inline graphic, respectively.

5. Discussion

Collecting complete social network information in a closed population may be difficult because a full network survey can impose an unreasonable amount of respondent burden. The FCD seeks to ameliorate respondent burden by asking respondents to nominate up to Inline graphic individuals in the population with whom they have a particular relationship.

In our application, we demonstrate that estimating associations between behaviors and social relationships from social network data arising from a fixed choice survey design as though the social network was fully observed (as is the current standard practice) can result in severely biased estimates. We introduce observed data likelihoods for FCD data. We demonstrated that maximizing the observed data likelihood for the FCD may improve the MSE in comparison to estimates where the data are (falsely) assumed to be fully observed.

We also introduce the AFCD, a new network survey sampling design and method of analysis which collects information on the total number of relationships for each individual in the network, in addition to the data collected with the standard FCD. This novel study design can add considerable information to analyses without unduly burdening the survey respondent, resulting in improvements over the FCD and naive analyses. We demonstrate that the AFCD is superior to the FCD, the naive analyses, and the CB estimator of Hoff and others (2013) in terms of MSE. The improvement of the AFCD’s MSE relative to the FCD, CB, and naive analyses is particularly pronounced when Inline graphic is small relative to the number of true total nominations. Unsurprisingly, our simulations show that for every estimator, when Inline graphic is larger, variation and bias are smaller. While collecting all nominations from each survey respondent would be optimal in terms of minimizing variance and bias, the AFCD can provide a way to improve estimation while keeping respondent burden low.

Since the AFCD utilizes information on the true total nominations, the estimates of the intercepts are much better with the AFCD than the FCD or the naive analyses. This suggests that the AFCD should be implemented when edge prediction is a goal of the analyses.

This work has limitations. We assume that nominations are randomly censored, and violation of this assumption may lead to incorrect inference. This assumption warrants additional investigation, and further research into survey methodology for AFCD and FCD data is necessary. In this work, however, we find that the order in which respondents nominated their peers did not heavily influence inference.

The analyses and simulations we present use a dyad-independent ERGM. This model does not incorporate important network characteristics including reciprocity, transitivity, and clustering. Modeling network structural characteristics and allowing for complex dependencies is particularly important when the goal of the model is to impute missing edges or to provide a realistic network model. Alternative models that incorporate network characteristics and dependencies include the social relations model (Warner and others, 1979), the ERGM family of models (Frank and Strauss, 1986; Robins and others, 2004; Goodreau, 2007), and the latent space and factor models (Hoff and others, 2002; Hoff, 2009).

Hoff and others (2013) presented likelihoods for fixed rank and FCD data. They assume that there is an underlying parametric model for the network that generates the ranked or binary social relations data. Using a social relations model, they perform estimation in the Bayesian framework. A benefit of that approach is the ability to accommodate both ranked and binary nominations. However, that method relies upon an underlying parametric model, requiring more stringent assumptions. As we are concerned with binary and not ranked data, we have compared the performance of their CB estimator to the AFCD, FCD, and naive estimators and found that in simulations the AFCD had uniformly lower MSE than the CB, while the FCD often had lower MSE than the CB. It should be noted that Hoff and others (2013) found that their CB estimator performed comparably to their estimator that accounted for social rankings; hence, we would anticipate that the AFCD would also outperform the fixed rank estimator. We therefore suggest that, when collecting sociometric data, the total number of relationships should be collected whether or not relationship rankings are collected, so that censoring can be more readily accounted for.

Funding

NSF (SES-1230081), including support from the National Agricultural Statistics Service; NSF (DMS-1309004); Research Excellence Award from the Center for Alcohol and Addiction Studies, Brown University; and NIH (R01AA023522, R01CA183854, R01AI108441, P01AA019072, P30AI42853).

References

  1. Barnett N., Ott M. Q., Rogers M., Loxley M., Linkletter C. and Clark M. (2014). Peer associations for substance use and exercise in a college student social network. Health Psychology 33, 1134–1142.
  2. Conlan A. J. K., Eames K. T. D., Gage J. A., von Kirchbach J. C., Ross J. V., Saenz R. A. and Gog J. R. (2010). Measuring social networks in British primary schools through scientific engagement. Proceedings of the Royal Society Series B 278, 1467–1475.
  3. Frank O. and Strauss D. (1986). Markov graphs. Journal of the American Statistical Association 81, 832–842.
  4. Gommans R. and Cillessen A. H. N. (2015). Nominating under constraints: a systematic comparison of unlimited and limited peer nomination methodologies in elementary school. International Journal of Behavioral Development 39, 77–86.
  5. Goodreau S. M. (2007). Advances in exponential random graph (p*) models applied to a large social network. Social Networks 29, 231–248.
  6. Goodreau S. M., Kitts J. A. and Morris M. (2009). Birds of a feather, or friend of a friend? Using exponential random graph models to investigate adolescent social networks*. Demography 46, 103–125.
  7. Handcock M. S. and Gile K. (2007). Modeling Social Networks with Sampled or Missing Data. Working Paper no. 75, Center for Statistics and the Social Sciences, University of Washington.
  8. Hipp J. R., Wang C., Butts C. T., Jose R. and Lakon C. M. (2015). Research note: the consequences of different methods for handling missing network data in stochastic actor based models. Social Networks 41, 56–71.
  9. Hoff P. D. (2009). Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory 15, 261–272.
  10. Hoff P. D., Fosdick B., Volfovsky A. and Stovel K. (2013). Likelihoods for fixed rank nomination networks. Network Science 1, 253–277.
  11. Hoff P., Fosdick B., Volfovsky A. and He Y. (2015). amen: Additive and Multiplicative Effects Models for Networks and Relational Data. R package version 1.1.
  12. Hoff P. D., Raftery A. E. and Handcock M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association 97, 1090–1098.
  13. Holland P. W. and Leinhardt S. (1973). The structural implications of measurement error in sociometry. Journal of Mathematical Sociology 3, 85–111.
  14. Knoke D. and Yang S. (2008). Social Network Analysis. Quantitative Applications in the Social Sciences, 2nd edition, Volume 154. Thousand Oaks, CA: Sage Publications.
  15. Kossinets G. (2006). Effects of missing data in social networks. Social Networks 28, 247–268.
  16. Marsden P. V. (1990). Network data and measurement. Annual Review of Sociology 16, 435–463.
  17. Mercken L., Snijders T. A. B., Steglich C., Vartianen E. and de Vries H. (2010). Dynamics of adolescent friendship networks and smoking behavior. Social Networks 32, 72–81.
  18. Mossong J., Hens N., Jit M., Beutels M., Auranen P. and Mikolajczyk K. (2008). Social contacts and mixing patterns relevant to the spread of infectious diseases. PLoS Medicine 5, e74.
  19. Potter G. E., Handcock M. S., Longini I. M. Jr and Halloran M. E. (2011). Estimating within-household contact networks from egocentric data. The Annals of Applied Statistics 5, 1816–1838.
  20. Resnick M. D., Bearman P. S., Blum R. W., Bauman K. E., Harris K. M., Jones J., Tabor J., Beuhring T., Sieving R. E., Shew M., Ireland M., Bearinger L. H. and others (1997). Protecting adolescents from harm. Findings from the National Longitudinal Study on Adolescent Health. Journal of the American Medical Association 278, 823–832.
  21. Robins G., Pattison P. and Woolcock J. (2004). Missing data in networks: exponential random graph (p*) models for networks with non-respondents. Social Networks 26, 257–283.
  22. Smith J. A. and Moody J. (2013). Structural effects of network sampling coverage I: nodes missing at random. Social Networks 35, 652–668.
  23. Wang C., Butts C. T., Hipp J. R., Jose R. and Lakon C. M. (2016). Multiple imputation for missing edge data: a predictive evaluation method with application to add health. Social Networks 45, 89–98.
  24. Warner R. M., Kenney D. A. and Stoto M. (1979). A new round robin analysis of variance for social interaction data. Journal of Personality and Social Psychology 37, 1742–1757.
  25. Witvliet M., Brendgen M., van Lier P., Koot H. M. and Vitaro F. (2010). Early adolescent depressive symptoms: prediction from clique isolation, loneliness, and perceived social acceptance. Journal of Abnormal Child Psychology 38, 1045–1056.
  26. Yan B. and Gregory S. (2011). Finding missing edges and communities in incomplete networks. Journal of Physics A: Mathematical and Theoretical 44, 495102.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
