Journal of the Royal Society Interface. 2023 Aug 9;20(205):20220912. doi: 10.1098/rsif.2022.0912

Reconstructing multi-strain pathogen interactions from cross-sectional survey data via statistical network inference

Irene Man 1,2, Elisa Benincà 1, Mirjam E Kretzschmar 2, Johannes A Bogaards 1,3
PMCID: PMC10410213  PMID: 37553995

Abstract

Infectious diseases often involve multiple pathogen species or multiple strains of the same pathogen. As such, knowledge of how different pathogens interact is key to understanding and predicting the outcome of interventions targeting only a subset of species or strains involved in disease. Population-level data may be useful to infer pathogen strain interactions, but most previously used inference methods only consider uniform interactions between all strains or focus on marginal pairwise interactions. As such, these methods are prone to bias induced by indirect interactions through other strains. Here, we evaluated statistical network inference for reconstructing heterogeneous interactions from cross-sectional surveys detecting joint presence/absence patterns of pathogen strains within hosts. We applied various network models to simulated survey data, representing endemic infection states of multiple pathogen strains with potential interactions in acquisition or clearance of infection. Satisfactory performance was demonstrated by the estimators converging to the true interactions. Accurate reconstruction of interaction networks was achieved by regularization or penalization for sample size. Although performance deteriorated in the presence of host heterogeneity, this was overcome by correcting for individual-level risk factors. Our work demonstrates how statistical network inference could prove useful for detecting multi-strain pathogen interactions and may have applications beyond epidemiology.

Keywords: multi-strain, pathogen, interactions, network inference, cross-sectional data

1. Introduction

The rapid expansion of sequencing technologies over the last few decades has drastically increased our ability to detect within-host pathogen diversity. As a result, it is recognized that infectious diseases often involve multiple pathogen species. For example, diseases may arise from bacterial and viral co-infections [1,2], or co-infections of multiple strains of the same pathogen species, e.g. serotypes of Streptococcus pneumoniae [3], genotypes of the human papillomavirus [4], or clones of Plasmodium falciparum [5]. It is also increasingly understood that networks of microbial interactions may determine the outcome of preventive or therapeutic interventions (e.g. vaccination, antibiotics, probiotics) in unexpected ways [6–8]. Interactions between pathogens or pathogen strains can take various forms: infection by one strain can modify susceptibility to subsequent infection by others, and the simultaneous presence of multiple strains can affect the duration, infectiousness, as well as severity of infection [9,10]. Consequently, pathogen interactions may have far-reaching clinical, epidemiological, and eco-evolutionary implications [11–13].

The acknowledgement that pathogen–strain interactions shape the ecology of a wide number of infectious agents has spurred the development of epidemiological models where hosts can be infected by different pathogens or different strains of the same pathogen [14]. However, inference of interaction parameters in these models is notoriously challenging. The availability of large-scale cross-sectional surveys, detecting the presence of multiple infectious strains simultaneously, has given rise to a wide range of methods for detecting interactions from co-occurrence data [15–20]. These methods rely on identifying deviation from statistical independence, either in the number of pathogen strains occurring in individual hosts, pairwise co-occurrence between any two strains, or in the distribution of exact combinations detected within the host population.

Clearly, the validity of such statistical methods for detecting pathogen interactions hinges on the assumption that non-interaction implies statistical independence, but statistical associations can be driven by factors other than direct biological interactions [21]. This point has been made repeatedly by community ecologists, concerned with the possibility to detect interdependence among species from co-occurrence patterns across habitats [22], and has been reiterated by epidemiologists [23]. A common criticism of using statistical associations is that hosts or habitats are not identical, and this may favour the presence of one or another (or both) species irrespective of direct biological interactions. Opportunely, epidemiological studies on human pathogens typically offer rich meta-data, making it possible (in principle) to correct for host heterogeneity. Moreover, when studying different strains of the same pathogen, one might assume that the same hosts are more likely than others to become infected with all strains, enabling correction for host heterogeneity by host-specific covariates in statistical models [23].

So far, many existing methods for detecting multi-strain pathogen interactions have been restricted to uniform (or homogeneous) interactions across strains (i.e. all strains under consideration interact with one another in the same manner), or focused on marginal interactions between pairs of pathogen strains (i.e. apparent interactions between any two strains without consideration of other co-occurring strains) [17,24]. Marginal estimation of pairwise interactions does not necessarily control for indirect interactions through other strains and may therefore be prone to bias, especially when strains interact in a heterogeneous manner. Network inference methods rely on conditional dependencies between multiple co-occurring strains, and might be more suited for detecting heterogeneous pathogen interactions (e.g. interactions that differ between subgroups of strains). However, the performance of these methods might be affected by the presence of host heterogeneity, insofar as co-occurrence within hosts depends not only on strain interactions but also on factors that are relevant for individual infection risk and further pathogen spread.
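The difference between marginal and conditional pairwise associations can be illustrated with a small numerical sketch. It assumes a pairwise log-linear (Ising-type) distribution over 0/1 carriage states for three hypothetical strains, with illustrative thresholds of −2 and interactions of log 3 between strains 0–1 and 1–2 only; none of these values or function names come from the study.

```python
import itertools
import math

def ising_joint(thresholds, interactions):
    """Joint distribution over 0/1 states: log P(y) is, up to a constant,
    sum_i a_i*y_i + sum_{i<j} x_ij*y_i*y_j (assumed parametrization)."""
    n = len(thresholds)
    weights = {}
    for y in itertools.product((0, 1), repeat=n):
        energy = sum(a * yi for a, yi in zip(thresholds, y))
        energy += sum(x * y[i] * y[j] for (i, j), x in interactions.items())
        weights[y] = math.exp(energy)
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def odds_ratio(joint, i, j, given=None):
    """Odds ratio between strains i and j, optionally restricted to
    states that match `given` (a dict {strain index: 0 or 1})."""
    cell = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for y, p in joint.items():
        if given and any(y[k] != v for k, v in given.items()):
            continue
        cell[(y[i], y[j])] += p
    return cell[(1, 1)] * cell[(0, 0)] / (cell[(1, 0)] * cell[(0, 1)])

# Hypothetical example: strains 0 and 2 do not interact directly,
# but both interact with strain 1.
joint = ising_joint([-2.0, -2.0, -2.0],
                    {(0, 1): math.log(3), (1, 2): math.log(3)})
marginal_or = odds_ratio(joint, 0, 2)                  # ignores strain 1
cond_or_absent = odds_ratio(joint, 0, 2, given={1: 0}) # strain 1 absent
cond_or_present = odds_ratio(joint, 0, 2, given={1: 1})# strain 1 present
print(marginal_or, cond_or_absent, cond_or_present)
```

In this toy distribution the marginal odds ratio between strains 0 and 2 exceeds one although they have no direct interaction, whereas the odds ratio conditional on the state of strain 1 equals one, which is the kind of indirect effect that conditional (network) estimation is meant to remove.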

The purpose of this paper is twofold. First, we explore the feasibility of statistical network inference for detecting heterogeneous pathogen interactions from cross-sectional surveys that detect joint presence/absence patterns of multiple infectious strains within hosts. For this, we evaluate the performance of various methods in reconstructing pathogen strain interactions from simulated survey data for a set of epidemiological and network settings. Second, we investigate the robustness of these methods to settings that include infection risk heterogeneity in the host population, and suggest ways to correct for such host heterogeneity in statistical analysis. In this work, we focus on the reconstruction of interactions between multiple strains of the same pathogen, but the methodology is also applicable to potentially interacting pathogen species and may have applications beyond epidemiology.

2. Results

In order to evaluate the performance of various network inference methods, we simulated random interaction networks with up to 10 pathogen strains. These networks describe whether and to what extent multiple strains interact in an epidemiological model for pathogen transmission. Within each network, the presence of interaction between each pair of strains i and j was established with variable connection probability σ. In cases where interaction between a pair was present, its strength, indicated by xij, was drawn uniformly from the range (−θ, θ), where θ is a positive number that reflects the maximum strength of interaction. Values of xij in the range (−θ, 0) and (0, θ) correspond to competitive and mutualistic interactions, respectively, and the greater the value diverges from zero, the stronger the interaction.
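As a minimal sketch of the network generation step described above, the following draws each pairwise interaction with probability σ and, if present, a strength uniformly between −θ and θ. The function name, the dictionary representation and the random seed are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def random_interaction_network(n_strains, sigma, theta, rng):
    """Return {(i, j): x_ij} for all pairs i < j.

    Each pair interacts with probability sigma; if so, x_ij is drawn
    uniformly between -theta and theta, otherwise x_ij = 0.
    Negative values are competitive, positive values mutualistic.
    """
    network = {}
    for i in range(n_strains):
        for j in range(i + 1, n_strains):
            if rng.random() < sigma:
                network[(i, j)] = rng.uniform(-theta, theta)
            else:
                network[(i, j)] = 0.0
    return network

# Base-case-like settings: five strains, sigma = 0.25, theta = log 3.
rng = random.Random(1)  # arbitrary seed, for reproducibility only
net = random_interaction_network(5, 0.25, math.log(3), rng)
n_edges = sum(1 for x in net.values() if x != 0.0)
print(n_edges, "of", len(net), "pairs interact")
```

With five strains there are 10 candidate pairs, so at σ = 0.25 a generated network contains 2.5 interacting pairs on average.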

Each network generated in this way was then used to parametrize a multi-strain susceptible–infected–susceptible (SIS) epidemiological model [8]. In the model, exp(xij) represents the hazard ratio that determines how much the presence of strain i in a host affects the host's hazard to acquire or clear strain j, and vice versa. We refer to Material and methods for the full description of the epidemiological model. For each network, we simulated the epidemiological model until the steady state prevalence was obtained, which was then used to sample random cross-sectional datasets with the number of observations being 100, 1000, 10 000 or 100 000. These datasets were then used to reconstruct the interaction networks via statistical analysis. See electronic supplementary material, figure S1, for a schematic representation of the entire data generation process.
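The role of exp(xij) as a hazard ratio can be sketched as follows. The sketch assumes that interactions are symmetric and that the effects of several co-infecting strains combine multiplicatively; the full model is specified in Material and methods, and the function name and example values here are hypothetical.

```python
import math

def hazard_multiplier(target, present, x):
    """Factor by which the baseline hazard for strain `target` is multiplied,
    given the set of strains currently `present` in the host.

    Assumes symmetric interactions stored as {(i, j): x_ij} with i < j and
    multiplicative combination, i.e. the product of exp(x_ij) over present strains.
    """
    log_hr = sum(
        x.get((min(i, target), max(i, target)), 0.0)
        for i in present
        if i != target
    )
    return math.exp(log_hr)

# Hypothetical network: strain 0 doubles, and strain 2 halves, the hazard for strain 1.
x = {(0, 1): math.log(2), (1, 2): -math.log(2)}
hr_none = hazard_multiplier(1, set(), x)    # no other strain present
hr_zero = hazard_multiplier(1, {0}, x)      # only strain 0 present
hr_both = hazard_multiplier(1, {0, 2}, x)   # strains 0 and 2 present
print(hr_none, hr_zero, hr_both)
```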

The network inference methods that we considered in statistical analysis were the Ising model, graphical modelling approaches and generalized estimating equations (GEE) [25–27]. For the Ising model and GEE, model selection was done based on lasso (least absolute shrinkage and selection operator) [25] and Wald tests with false discovery rate control [28], respectively. Graphical model selection was performed in either a forward or a backward fashion (i.e. by iteratively growing or shrinking networks, respectively), using either Akaike's information criterion (AIC) or the Bayesian information criterion (BIC) as selection criterion.

In the base-case setting, we considered 500 random networks with five strains, each generated with connection probability σ = 0.25 and a range of interaction parameters of (−θ, θ) = (−log 3, log 3). In the corresponding epidemiological model, we considered interaction in acquisition only (i.e. no interaction in clearance), a population without host heterogeneity, and type-specific basic reproduction numbers R0,i in the range (1.5, 3).

In the base-case setting, visual inspection showed satisfactory concordance between the estimated networks and true networks. This is illustrated in figure 1, which also illustrates differences between some of the network inference methods considered.

Figure 1.

Examples of networks with n = 10 strains in the simulation study. Two random networks generated with connection probability σ = 0.25 and interaction strengths xij drawn from (−θ, θ) = (−log 3, log 3) (a,d). Estimated networks from co-occurrence in 100 000 observations according to the regularized Ising model (b,e); and according to a graphical model using backward model selection by BIC (c,f). Strength and type of interactions (mutualistic versus competitive) are indicated by the thickness and colour (blue versus red) of edges, respectively.

2.1. Trade-off between sensitivity and specificity

In order to compare the different methods systematically, we assessed various performance measures, including sensitivity and specificity for indicating the presence (or absence) of pairwise interactions, i.e. the proportions of truly present (or truly absent) interactions that are correctly identified in the reconstructed networks. As a measure for overall performance, we assessed the F1-score, which summarizes sensitivity and positive predictive value (PPV, i.e. the proportion of identified interactions that are truly present) in the form of their harmonic mean.
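The performance measures defined above can be computed from the sets of true and estimated edges. The following is a small sketch with hypothetical edge sets for a five-strain network; the function and variable names are illustrative.

```python
import itertools

def edge_metrics(true_edges, est_edges, all_pairs):
    """Sensitivity, specificity, PPV and F1-score for edge detection.

    true_edges, est_edges: sets of (i, j) pairs with i < j.
    all_pairs: set of all candidate pairs.
    """
    tp = len(true_edges & est_edges)   # present and detected
    fp = len(est_edges - true_edges)   # absent but detected
    fn = len(true_edges - est_edges)   # present but missed
    tn = len(all_pairs) - tp - fp - fn # absent and not detected
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    # F1 is the harmonic mean of sensitivity (recall) and PPV (precision).
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else float("nan")
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "f1": f1}

# Hypothetical example with five strains (10 candidate pairs).
pairs = set(itertools.combinations(range(5), 2))
truth = {(0, 1), (1, 2), (3, 4)}
estimate = {(0, 1), (1, 2), (0, 4)}
metrics = edge_metrics(truth, estimate, pairs)
print(metrics)
```

In this example two of three true edges are recovered and one of seven absent edges is falsely detected, giving a sensitivity of 2/3, a specificity of 6/7, a PPV of 2/3 and an F1-score of 2/3.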

Network reconstruction improved with the number of observations according to almost all assessed performance measures for the various methods considered (figure 2a–d and table 1). Interactions were recovered with almost 100% sensitivity at 100 000 observations (figure 2a). High specificity was already attained at small sample size by graphical modelling approaches with BIC-based selection, whereas specificity remained at a more or less constant suboptimal level of 65–85% in graphical modelling approaches with AIC-based selection (figure 2d). GEE with false discovery rate control performed well at small sample size, but displayed slightly deteriorating performance with increasing sample size (figure 2b,e). Overall, methods with lower specificity showed better sensitivity, showcasing a trade-off between sensitivity and specificity. Finally, the Ising model and the graphical modelling approaches with BIC-based selection were able to achieve high sensitivity and specificity at large sample size, which resulted in near 100% F1-scores (figure 2a,c).

Figure 2.

Performance measures of statistical network inference in the base-case analysis. Several measures have been calculated to assess the performance of the statistical network inference as a function of the sample size for the base-case analysis. (a) Sensitivity; (b) positive predictive value (PPV); (c) F1 score; (d) specificity; (e) coverage; (f) root-mean-square deviation. In the base-case analysis, interactions between strains were generated with connection probability σ = 0.25, interaction strengths xij indicating interaction in acquisition drawn from (−log 3, log 3), and basic reproduction numbers randomly drawn from (1.5, 3). All methods were evaluated at sample sizes of 100, 1000, 10 000 and 100 000 observations, but x-axis coordinates are slightly jittered to improve visualization. Abbreviations: bw.aic (dark blue): graphical model with backward AIC selection; bw.bic (yellow): graphical model with backward BIC selection; fw.aic (green): graphical model with forward AIC selection; fw.bic (light blue): graphical model with forward BIC selection; gee (orange): generalized estimating equations; Ising (black): Ising model.

Table 1.

Performance measures of statistical network inference in the base-case analysis. In the base-case analysis, interactions between strains were generated with connection probability σ = 0.25, interaction strengths xij indicating interaction in acquisition drawn from (−log 3, log 3), and basic reproduction numbers randomly drawn from (1.5, 3). All methods were evaluated at sample sizes of 100, 1000, 10 000 and 100 000 observations. GEE, generalized estimating equations; PPV, positive predictive value; RMSE, root-mean-square error.

| method | sample size (a) | specificity (%) (b) | sensitivity (%) (b) | PPV (%) (b) | F1-score (b,c) | coverage (%) (b,d) | RMSE (b) |
|---|---|---|---|---|---|---|---|
| Ising model | 100 000 | 99.9 (99.8; 100) | 93.7 (92.3; 95.1) | 99.6 (99.2; 99.9) | 0.966 (0.958; 0.974) | 94.2 (92.7; 95.6) | 0.038 (0.036; 0.041) |
| Ising model | 10 000 | 99.6 (99.4; 99.8) | 85.1 (82.8; 87.0) | 98.6 (97.8; 99.3) | 0.913 (0.899; 0.925) | 92.7 (91.1; 94.1) | 0.079 (0.075; 0.084) |
| Ising model | 1000 | 99.5 (99.3; 99.7) | 51.5 (48.5; 54.3) | 97.1 (95.6; 98.4) | 0.673 (0.646; 0.697) | 93.0 (91.1; 95.1) | 0.218 (0.208; 0.229) |
| Ising model | 100 | 99.3 (99.1; 99.6) | 7.9 (6.3; 9.7) | 78 (70.3; 85.4) | 0.144 (0.115; 0.172) | 59.3 (49.1; 68.1) | 0.606 (0.537; 0.679) |
| GEE | 100 000 | 90.4 (89.1; 91.7) | 96.3 (95.3; 97.4) | 75.6 (73.8; 77.7) | 0.847 (0.834; 0.861) | 80.3 (78.4; 82.1) | 0.055 (0.050; 0.060) |
| GEE | 10 000 | 95.8 (95.1; 96.6) | 89.9 (88.1; 91.8) | 86.9 (85.0; 88.8) | 0.884 (0.871; 0.895) | 90.7 (89.0; 92.4) | 0.085 (0.077; 0.093) |
| GEE | 1000 | 98.6 (98.2; 99.0) | 60.9 (58.2; 64.1) | 92.9 (91.1; 94.6) | 0.736 (0.714; 0.757) | 96.1 (94.3; 97.5) | 0.189 (0.177; 0.199) |
| GEE | 100 | 99.1 (98.8; 99.4) | 7.2 (5.6; 8.1) | 70.8 (62.0; 79.0) | 0.131 (0.103; 0.163) | 64.5 (52.0; 77.2) | 0.946 (0.851; 1.04) |
| graphical model with backward AIC selection | 100 000 | 76.2 (74.4; 78.1) | 97.6 (96.7; 98.4) | 55.9 (53.7; 58.4) | 0.711 (0.688; 0.730) | 83.8 (81.8; 85.7) | 0.033 (0.029; 0.037) |
| graphical model with backward AIC selection | 10 000 | 71.9 (70.1; 73.7) | 94.7 (93.2; 96.0) | 50.9 (48.5; 53.2) | 0.662 (0.643; 0.681) | 86.6 (84.9; 88.1) | 0.075 (0.067; 0.085) |
| graphical model with backward AIC selection | 1000 | 71.1 (68.8; 73.1) | 79.7 (77.4; 81.9) | 46 (43.7; 48.4) | 0.584 (0.562; 0.602) | 90.6 (89.2; 91.9) | 0.2 (0.190; 0.202) |
| graphical model with backward AIC selection | 100 | 67.1 (65.1; 69.1) | 50.6 (47.6; 53.7) | 32 (30.0; 34.3) | 0.392 (0.372; 0.413) | 92.5 (91.3; 93.8) | 0.656 (0.633; 0.678) |
| graphical model with backward BIC selection | 100 000 | 99.8 (99.6; 99.9) | 94.4 (93.1; 95.7) | 99.2 (98.7; 99.7) | 0.967 (0.960; 0.974) | 86.8 (84.1; 89.3) | 0.033 (0.027; 0.040) |
| graphical model with backward BIC selection | 10 000 | 99.8 (99.6; 99.9) | 85.9 (84.0; 87.8) | 99.2 (98.5; 99.7) | 0.921 (0.910; 0.932) | 92.3 (90.3; 94.0) | 0.066 (0.057; 0.077) |
| graphical model with backward BIC selection | 1000 | 99.2 (99.0; 99.5) | 58.7 (56.0; 61.4) | 95.9 (94.5; 97.3) | 0.729 (0.704; 0.751) | 92.4 (90.4; 94.2) | 0.178 (0.168; 0.187) |
| graphical model with backward BIC selection | 100 | 96 (95.2; 96.7) | 18.4 (16.1; 20.8) | 58.5 (53.2; 63.3) | 0.28 (0.252; 0.306) | 65.3 (60.8; 70.6) | 0.918 (0.864; 0.971) |
| graphical model with forward AIC selection | 100 000 | 86.2 (85.1; 87.2) | 97.2 (96.3; 98.1) | 68.4 (66.4; 70.6) | 0.803 (0.788; 0.819) | 81.9 (79.8; 84.1) | 0.034 (0.030; 0.040) |
| graphical model with forward AIC selection | 10 000 | 85.2 (84.1; 86.3) | 93.5 (92.0; 94.9) | 66.1 (64.0; 68.3) | 0.775 (0.756; 0.790) | 83.6 (81.5; 85.4) | 0.081 (0.071; 0.093) |
| graphical model with forward AIC selection | 1000 | 84.6 (83.3; 85.8) | 76.4 (73.8; 78.7) | 60.4 (58.0; 62.7) | 0.675 (0.653; 0.694) | 87.1 (85.1; 88.9) | 0.224 (0.212; 0.239) |
| graphical model with forward AIC selection | 100 | 83.3 (82.2; 84.5) | 39.6 (36.9; 42.2) | 42.2 (39.0; 45.0) | 0.408 (0.382; 0.432) | 87.4 (85.6; 89.5) | 0.806 (0.781; 0.830) |
| graphical model with forward BIC selection | 100 000 | 99.9 (99.8; 100) | 94.1 (92.8; 95.3) | 99.6 (99.1; 99.9) | 0.967 (0.960; 0.974) | 87.2 (84.5; 89.8) | 0.033 (0.027; 0.040) |
| graphical model with forward BIC selection | 10 000 | 99.8 (99.7; 100) | 85.8 (83.8; 87.6) | 99.4 (98.7; 99.9) | 0.921 (0.910; 0.932) | 92.2 (90.3; 94.0) | 0.066 (0.057; 0.077) |
| graphical model with forward BIC selection | 1000 | 99.2 (98.9; 99.5) | 58.7 (56.0; 61.4) | 95.8 (94.3; 97.2) | 0.728 (0.704; 0.751) | 92.3 (90.3; 94.1) | 0.179 (0.169; 0.189) |
| graphical model with forward BIC selection | 100 | 96.5 (96.0; 97.1) | 17.5 (15.5; 19.8) | 60.8 (55.7; 65.3) | 0.271 (0.241; 0.298) | 60.2 (55.0; 66.0) | 0.987 (0.938; 1.04) |

(a) Number of sampled individuals.

(b) 95% confidence intervals (bootstrap percentile estimates) in parentheses.

(c) F1-score denotes the harmonic mean between sensitivity (also termed recall) and PPV (also termed precision).

(d) Probability that the nominal 95% confidence interval of an estimated interaction strength contains the true interaction strength (non-zero estimates only).

2.2. Convergence of estimated interaction strengths at large sample size

In addition, we assessed the quantitative agreement between the estimated and the true interaction parameter xij, which indicates the strength of interaction, based on the coverage, which we defined as the proportion of generated interaction strengths contained within the 95% confidence intervals of the estimated interaction strengths, and root-mean-square deviation. Actual coverage agreed with nominal 95% coverage probability at 1000 observations for GEE, Ising and BIC-based model estimates, but only the Ising model was able to sustain this high coverage for 10 000 observations or more (figure 2e). Under the smallest sample size involving only 100 observations, fewer interactions were recovered and the estimated interaction strength deviated substantially from the true parameters, except for AIC-based model estimates (table 1). Moreover, even when coverage of estimated interaction strengths deteriorated above 1000 observations, accuracy continued to improve with increasing sample size for all methods, as verified by steadily decreasing root-mean-square deviations for all methods considered, suggesting asymptotically unbiased estimation (figure 2f).
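Coverage and root-mean-square deviation, as defined above, can be sketched as follows. The input values are hypothetical, and the sketch assumes that both measures are taken over whichever set of pairs is supplied (table 1 restricts coverage to non-zero estimates).

```python
import math

def coverage(true_vals, ci_lower, ci_upper):
    """Proportion of true interaction strengths inside their confidence interval."""
    hits = sum(lo <= t <= hi for t, lo, hi in zip(true_vals, ci_lower, ci_upper))
    return hits / len(true_vals)

def rmsd(true_vals, est_vals):
    """Root-mean-square deviation between estimated and true strengths."""
    squared = [(e - t) ** 2 for t, e in zip(true_vals, est_vals)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical true and estimated x_ij with 95% confidence limits.
true_x = [0.5, -0.3, 0.0, 0.8]
est_x = [0.4, -0.3, 0.2, 0.8]
lower = [0.25, -0.45, 0.05, 0.65]
upper = [0.55, -0.15, 0.35, 0.95]

cov = coverage(true_x, lower, upper)   # three of four intervals contain the truth
dev = rmsd(true_x, est_x)
print(cov, round(dev, 4))
```

The two measures can move in opposite directions: as sample size grows, intervals narrow, so a small residual bias can reduce coverage even while the deviation itself keeps shrinking.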

2.3. Performance under alternative epidemiological settings and network configurations

In a series of sensitivity analyses, we modified the base-case setting to alternative epidemiological settings or network configurations: inclusion of interaction in clearance, lower basic reproduction numbers R0,i ∈ (1, 2), stronger interactions xij ∈ (−log 10, log 10), higher connection probabilities σ = 0.5, larger networks by increasing the number of strains to 10 or larger networks by connecting two sub-networks of five strains.

Network inference was not affected by including interactions in clearance relative to the base-case setting. In this setting, F1-scores remained the same for all methods considered (compare yellow dots to black lines in figure 3). This implies that the combined effect of interactions in acquisition and clearance was well captured by the estimated interaction parameters. Specifically, network inference allowed the asymptotically unbiased estimation of the ratio of the interaction parameters in acquisition and clearance (see Material and methods for the definition of this ratio).

Figure 3.

F1-scores of statistical network inference methods in the sensitivity analyses. The performance of different network inference methods expressed by F1-score as a function of the sample size. (a) Ising model; (b) generalized estimating equations; and graphical models with (c) forward BIC selection, (d) forward AIC selection, (e) backward BIC selection, (f) backward AIC selection. The alternative settings considered in the sensitivity analyses were: acquisition and clearance (yellow): interaction strengths xij including both interactions in acquisition and clearance; reproduction number (orange): lower basic reproduction numbers from the range (1, 2) instead of (1.5, 3); strength (green): strong interaction strengths xij drawn from (−log 10, log 10) instead of (−log 3, log 3); network size (brown): larger networks with 10 strains; sub-networks (dark blue): larger networks with 10 strains created by collating sub-networks of strains; connection probability (light blue): higher connection probability σ being 0.5 instead of 0.25. Performance of the base-case analysis is given by black horizontal lines.

Looking across all alternative settings, regularized Ising, GEE estimation and BIC-based graphical modelling approaches maintained the high performance of the base-case analysis when the sample size was 10 000 or larger (figure 3a–c,e). For these methods, when the sample size was 1000 or smaller, performance stayed more or less the same with higher connection probability or lower reproduction numbers (compare light blue and orange dots to black lines), improved with stronger interactions (compare green dots to black lines) but slightly deteriorated when considering larger networks of 10 strains (compare dark blue and brown dots to black lines).

As for the AIC-based graphical modelling approaches, the performance diverged more from that of the base-case setting (figure 3d,f). The performance improved substantially under higher connection probabilities (compare light blue dots to black lines). However, it also suffered more when networks were larger (compare dark blue and brown dots to black lines). These patterns are linked to the high sensitivity and poor specificity of the AIC-based approaches (as shown in the base-case analysis).

2.4. Robustness to host heterogeneity

As the last sensitivity analysis, we tested the robustness of the network inference methods to host heterogeneity in contact rate relevant to pathogen spread. The host population was divided into two interconnected sub-populations: one with high and one with low contact rate. Network inference strongly deteriorated in the presence of host heterogeneity as compared to the base-case analysis, if heterogeneity was not corrected for (compare orange dots to black lines in figure 4; compare figure 5d to figure 5a). For all methods considered, F1-scores did not improve with increasing sample size. Similarly, coverage deteriorated with increasing sample size and root-mean-square deviation remained high at large sample size (electronic supplementary material, table S1). These findings show a consistent bias towards positive associations in uncorrected analyses (electronic supplementary material, figure S2).

Figure 4.

F1-scores of statistical network inference under host heterogeneity. The performance of the different inference methods evaluated under the setting with host heterogeneity as a function of the sample size. (a) Ising model; (b) generalized estimating equations; and graphical models with (c) forward BIC selection, (d) forward AIC selection, (e) backward BIC selection, (f) backward AIC selection. Similar epidemiological models are used as in figure 2, but with two sub-populations of hosts. Average contact rate is the same as in the base-case analysis, but 80% of hosts are assumed to have below-average contacts and 20% above-average contacts (coefficient of variation: 80%). Mixing between sub-populations occurred pseudo-assortatively (assortativity fraction: 50%). Performance is investigated in the following ways: uncorrected (orange): based on representatively sampled individuals from the total population without correction for contact rate; corrected (light blue): the same but with correction for contact rate; low-risk (green): (stratified) analysis on individuals sampled from sub-population with low contact rate only; high-risk (yellow): (stratified) analysis on individuals sampled from sub-population with high contact rate only.

Figure 5.

An example of network reconstruction under host heterogeneity. The true random network was generated with connection probability σ = 0.25, interaction strengths drawn from (−log 3, log 3), and basic reproduction numbers randomly drawn from (1.5, 2) (a). Estimated networks were obtained using the regularized Ising model from a dataset with 10 000 individuals in a stratified analysis among low-risk individuals only (b), high-risk individuals only (c), among a representative sample of the total population without correction (d), or with correction by augmenting the network with an extra node R, indicating membership to either sub-population (e). The filtered network (f) is obtained from (e) by omitting node R and the corresponding edges. Strength and type of interactions (mutualistic versus competitive) are indicated by the thickness and colour (blue versus red) of edges, respectively.

Network inference regained good performance if host heterogeneity was corrected for by performing stratified analyses based on the risk variable indicating membership to either sub-population (compare green and yellow dots to black lines in figure 4; compare figure 5b,c to figure 5a). At moderate sample size (100 and 1000 observations), performance was somewhat better in subgroup analyses on high-risk individuals, likely due to increased statistical power at higher prevalence of infection. As an alternative correction approach, we added the risk variable as an extra node to the network, representing elevated infection risk (see Material and methods), and this also performed well (compare light blue dots to orange dots in figure 4). Graphically, the risk variable can be represented as a central node equally connected to all other nodes representing carriage of strains (figure 5e). This dependency illustrates that elevated infection risk is associated with positivity for each strain. After filtering out the risk variable, the strain-specific interaction network is retained (figure 5f). Finally, we verified that the performance deteriorated when error was introduced to measurement of the risk variable used for correction (electronic supplementary material, figure S3). Because of remaining bias towards positive associations in imperfectly corrected analyses, the F1-score and especially the specificity decreased rapidly, even with a small degree of misclassification of the risk variable (electronic supplementary material, figure S3c,d). Yet, the false positive associations were often attributed a small interaction strength, as verified by decreasing root-mean-square deviation with increasing sample size (electronic supplementary material, figure S3f).
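The bias towards positive associations in uncorrected analyses, and its removal by stratification, can be reproduced with a small arithmetic sketch. It assumes two non-interacting strains whose carriage is independent within each sub-population; the 80%/20% split follows the simulation setting, whereas the prevalences of 5% and 30% are hypothetical values chosen for illustration.

```python
import math

def joint_independent(p_a, p_b):
    """Joint carriage distribution of two strains that are independent."""
    return {(1, 1): p_a * p_b, (1, 0): p_a * (1 - p_b),
            (0, 1): (1 - p_a) * p_b, (0, 0): (1 - p_a) * (1 - p_b)}

def pool(strata):
    """Mix stratum-specific joint distributions; strata = [(weight, joint), ...]."""
    pooled = {cell: 0.0 for cell in ((1, 1), (1, 0), (0, 1), (0, 0))}
    for weight, joint in strata:
        for cell, p in joint.items():
            pooled[cell] += weight * p
    return pooled

def odds_ratio(joint):
    return joint[(1, 1)] * joint[(0, 0)] / (joint[(1, 0)] * joint[(0, 1)])

# Hypothetical prevalences: 5% per strain in low-risk hosts, 30% in high-risk hosts.
low = joint_independent(0.05, 0.05)
high = joint_independent(0.30, 0.30)
pooled = pool([(0.8, low), (0.2, high)])

or_low, or_high, or_pooled = odds_ratio(low), odds_ratio(high), odds_ratio(pooled)
print(or_low, or_high, or_pooled)
```

Within each stratum the odds ratio equals one, as expected for non-interacting strains, but pooling the strata without correction yields an odds ratio of about 2.56, a spurious positive association driven purely by shared infection risk.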

3. Discussion

In this paper, we evaluated whether heterogeneous pathogen interactions can be recovered from cross-sectional surveys that detect joint presence/absence patterns of multiple strains within hosts. Using simulated data, we demonstrated convergence of estimators to true pairwise interactions between multiple infectious strains by means of statistical network inference. Performance of network inference was shown to be strongly influenced by host heterogeneity, but this could in principle be overcome by correcting for individual risk factors that are common to all pathogen strains.

So far, methods for recovering strain interactions from cross-sectional survey data either focused on marginal estimation of pairwise interactions or considered uniform interactions between all strains considered. Previous endeavours to infer competitive interactions from strain combination data found that they only performed satisfactorily in datasets where competitive interactions are particularly strong, and that accounting for host behavioural heterogeneity might be more important than adding additional information via the genotype combinations [20]. The importance of accounting for host heterogeneity is reiterated in our analysis, but we also demonstrate that network inference is well suited to detect small interactions at large enough sample size. The comparative efficiency of network inference methods stems from the conditional estimation of pairwise interactions, and from the natural incorporation of heterogeneous interactions in estimation. Whereas uniform interactions can be viewed as a special case in this framework, their nature and strength may well be heterogeneous, as multi-strain interactions are likely modulated by within-host niche differentiation. Such differentiation could arise from slight differences in host-cell tropism, as, for example, described for oncogenic HPV infection, where genotypes of α-7 and α-9 species have distinct preference for glandular and squamous epithelium, respectively [4,29]. Moreover, if multi-strain interactions are mediated by within-host immune responses, as hypothesized for different pneumococcal serotypes, strength of interaction may depend on antigenic overlap between strains [3].

Our results were obtained under a particular epidemiological model and under a steady state assumption, with between-strain interactions modelled in acquisition and clearance. We have previously shown that the odds ratio of co-occurrence in a two-strain version of this model is an exact estimator of the composite of interaction parameters defined accordingly [30]. The results of the present study should be viewed as an extension of that result to systems in the steady state with an arbitrary number of strains. While we have not considered estimation outside of steady states, previous studies on the potential of correlation-based methods for microbial network inference from cross-sectional data suggest that estimation of the interaction parameters may still be feasible when the system is subjected to process variability or perturbations close to the steady state [31].
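For two strains, the odds ratio of co-occurrence can be estimated directly from a 2 × 2 table of carriage counts. The sketch below uses hypothetical counts and the standard large-sample (Wald) interval on the log scale; it is not necessarily the interval construction used in the study.

```python
import math

def log_odds_ratio(n11, n10, n01, n00, z=1.96):
    """Log odds ratio of co-occurrence with a Wald confidence interval.

    n11: hosts carrying both strains; n10, n01: hosts carrying one strain only;
    n00: hosts carrying neither. All counts must be positive.
    """
    log_or = math.log(n11 * n00 / (n10 * n01))
    se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    return log_or, (log_or - z * se, log_or + z * se)

# Hypothetical survey of 1000 hosts.
log_or, ci = log_odds_ratio(n11=60, n10=140, n01=90, n00=710)
print(round(math.exp(log_or), 3), tuple(round(b, 3) for b in ci))
```

A log odds ratio above zero would be read as a mutualistic composite interaction and one below zero as a competitive one, on the same scale as the interaction parameters xij.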

Network models are well suited to capture conditional independence between multiple strains, providing a clear advantage over marginal estimation of pairwise interactions, as these may be confounded through indirect interactions with other strains. While we showed excellent performance of the network inference methods in this respect, this result does not necessarily extend to epidemiological models other than the one we considered. For instance, if host death is relevant in age groups susceptible to infection or if strains interact through natural cross-immunity that is long-lasting, infections with different strains may be positively associated (i.e. as given by odds ratios greater than one) and co-occurrence patterns of current infection are no longer informative of their interaction structure [21,30]. Similarly, the possibility of co-transmission in the same event may introduce a positive association between otherwise non-interacting pathogen strains [32]. In this study, we assumed that only a single strain could be transmitted per transmission event, which reduced the computational burden of our transmission model for many pathogen strains. We expect this assumption to be justifiable for pathogens transmitted through frequent but short-term direct contacts with low transmission probability per contact. In other settings, the use of statistical network models to reconstruct strain interactions will be biased unless the confounding effect of co-transmission is appropriately adjusted for. Furthermore, strains may interact through modification of viral or bacterial load during co-colonization, rather than through modification of acquisition or clearance. Although such interactions are better captured by records of viral or bacterial load, they are likely correlated with interactions in acquisition and clearance. Hence, network inference based on co-occurrence should retain good performance even in such situations.

Likewise, the performance of network inference may depend on the symmetry of interactions between multiple strains and on the way these combine during co-infection. Previously, we demonstrated that predictors of strain replacement in the wake of vaccination against a subset of pathogen strains perform best when interactions in acquisition and clearance are symmetrical and combine multiplicatively [8]. We envisage that reconstruction of heterogeneous strain interactions from co-occurrence data becomes more complicated if interactions are asymmetrical or combine in other ways than multiplicatively, i.e. if strength of interaction is not determined by the number of co-infecting strains [8,31]. It should be noted that estimates obtained under GEE or Ising models are also marginal in this sense, as they only consider pairwise interactions and leave possible higher-order interactions unspecified. Conversely, graphical log-linear models implicitly incorporate a three-way interaction whenever three strains are graphically connected, allowing pairwise interactions to be modulated by a third strain [26]. Suitability of either approach is determined by the likelihood of higher-order interactions being present in a particular system, and by the quality and resolution of the available data. In general, we expect more complex mechanisms of interactions to be better identifiable from individual- or population-level longitudinal studies than from cross-sectional surveys [33,34]. Alternatively, one might also consider the possibility of using age-dependent correlated frailty models to analyse multivariate serological data. Examples have been successfully applied to studies on associations between polyomavirus strains, and between the varicella zoster virus and a parvovirus strain [35,36]. More extensive simulations are needed to demonstrate the robustness of the network inference methods to variations in model specification and to data collected under different study designs.

All network inference approaches considered in this paper show consistent estimation of interaction parameters. Nonetheless, we found that different methods excel in different aspects of estimation, with a general trade-off between sensitivity and specificity. Network structure was only captured appropriately when applying some form of regularization (as in the Ising model with shrinkage of estimated interactions determined by lasso) or penalization for sample size (as in BIC-based log-linear model selection). In comparison, GEE or AIC-based model selection suffered from poor specificity, yielding comparatively low PPV and coverage at large sample size. In this respect, the trade-off between sensitivity and specificity across the various network inference methods is partly due to the application of different threshold schemes. In principle, it should be possible to obtain Ising model estimates with higher sensitivity (or GEE estimates with higher specificity) by using different forms of regularization or of false discovery rate control than have been applied in this study. Even so, all methods exhibited negligible root-mean-square deviation at large sample size, demonstrating that false discovery predominantly pertains to interactions estimated to be weak. Penalization ensures that very weak connections are set to zero; hence, the appropriateness of using AIC- or BIC-based model selection depends on the sparsity of the true interaction network being analysed. In our simulations, sparsity of the networks used to generate co-occurrence data was controlled by fixing the rate of true zeros in the network. Instead of assuming truly absent interactions between pathogen strains and assessing the performance of network inference methods under uniform effect size distributions, it might have been more realistic to model network interactions by tapering effect sizes; that is, several large effects, followed by more smaller effects, followed by many more even smaller effects etc. [37]. We await further investigation into the performance of various network inference methods under such a ground-truth representation. The extent to which approximate independence holds in reality should determine which class of network inference methods is to be preferred [37]. PPV also deteriorated as a result of bias towards positive associations in analyses with imperfect correction for common risk factors, suggesting that shrinkage towards zero may be desirable in itself. From a practical point of view, however, the important interactions were almost always correctly identified, even at limited sample size.

In addition, we show that a regression framework familiar to most epidemiologists, i.e. GEE, performs satisfactorily in estimating pairwise interactions. The GEE regression framework has the additional benefit that it offers a flexible way of separating individual risk factors common to all strains from interactions between any two strains [23]. This framework has been employed before to study clustering patterns of HPV genotypes across risk populations, with correction for known risk factors [18]. While correction for individual-level characteristics is also possible in graphical modelling approaches, as demonstrated in the present study, it relies on the inclusion of risk factors as binary nodes in the network. This clearly limits the practical ability to correct for various sources of host heterogeneity, which are more naturally accommodated in a regression framework [27,38]. Unlike the graphical modelling approaches, the GEE framework is not scalable to situations where the number of strains exceeds the number of observations. However, for most epidemiological studies this does not seem to pose a real problem.

A possible drawback of the network inference methods described herein might be that they rely on an implicit relation between the modelled strain-specific combinations and the underlying strain dynamics. We did not compare these methods to previously used approaches wherein the interaction parameters are estimated through explicit epidemiological modelling of observed proportions of the host population carrying different combinations of infectious strains [20,24,34]. However, we envisage that the comparatively good performance of statistical network inference methods would derive from their rigour and computational efficiency at the expense of tractability as regards underlying strain dynamics. Consistency of any method depends on whether the underlying interaction mechanisms are appropriately captured, but if approximations hold, statistical network inference provides a particularly powerful tool, as shown here.

To summarize, we have demonstrated how interactions between multiple pathogen strains might be estimated from cross-sectional surveys that detect the presence/absence patterns of multiple infectious strains at once. We illustrated this by applying statistical network inference methods that properly account for conditional independence between strains and are able to efficiently and consistently estimate heterogeneous interactions. Our work demonstrates how widely available multivariate methods may be used to identify between-strain interactions from forthcoming epidemiological survey data.

4. Material and methods

4.1. Multi-strain epidemiological model

The epidemiological model we used for simulation was previously introduced to assess the outcome of vaccination against a subset of pathogen strains present in a host population [8]. This model allows for an arbitrary number of infectious strains to circulate, assuming SIS dynamics with regard to each of the individual strains. It applies to pathogens for which naturally acquired immunity is limited so that reinfection is possible [39], as for instance could be the case for infection with human papillomavirus or with Streptococcus pneumoniae. For a pathogen with $n$ strains, the model state space $S$ consists of $2^n$ infection states, each denoted by the set of strains the host is infected with. For simplicity, we assume homogeneously mixing individuals and neglect birth and death processes on the time scale that is considered relevant to infection and transmission dynamics. See electronic supplementary material, figure S4, for a schematic of the possible infection states with $n = 3$. A system of ordinary differential equations describes how the proportions of individuals in each of the model states evolve over time. The equation for the change in prevalence $N_X$ of any state $X \in S$ is generally given by

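$$\frac{dN_X}{dt} = \sum_{i \in X} N_{X \setminus \{i\}}\, q_{X \setminus \{i\} \to X} - \sum_{i \in X} N_X\, q_{X \to X \setminus \{i\}} - \sum_{i \notin X} N_X\, q_{X \to X \cup \{i\}} + \sum_{i \notin X} N_{X \cup \{i\}}\, q_{X \cup \{i\} \to X}. \tag{4.1}$$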

The four groups of terms on the right-hand side, moving from left to right, correspond to flows of individuals into or out of state $X$ due to (1) acquisition of a strain $i \in X$, (2) clearance of a strain $i \in X$, (3) acquisition of a strain $i \notin X$ and (4) clearance of a strain $i \notin X$, respectively. Each term is a product of the proportion of the population in a particular state and the corresponding acquisition or clearance hazards (per capita rates) denoted by $q$. In an example with three strains and $X = \{1\}$, the corresponding terms are (1) $N_\emptyset\, q_{\emptyset \to \{1\}}$, (2) $N_{\{1\}}\, q_{\{1\} \to \emptyset}$, (3) $N_{\{1\}}\, q_{\{1\} \to \{1,2\}} + N_{\{1\}}\, q_{\{1\} \to \{1,3\}}$ and (4) $N_{\{1,2\}}\, q_{\{1,2\} \to \{1\}} + N_{\{1,3\}}\, q_{\{1,3\} \to \{1\}}$. Note that we do not consider co-transmission of multiple strains in the same event. This assumption is justifiable if the probability of co-transmission is negligible compared with the probability of single-strain transmission. In electronic supplementary material, appendix S1, we argue that our model might be viewed as the approximation of a model that allows for co-transmission in the same event, for settings with high contact rate and low per-contact transmissibility for a given basic reproduction number $R_0$. In the electronic supplementary material, appendix, we also demonstrate that non-negligible co-transmission leads to positive bias in the clustering of strains within individuals, and hence to deviation from the null-hypothesis of statistical independence. The baseline hazard of clearance for an individual only infected with strain $i$ is denoted by $q_{\{i\} \to \emptyset}$ and the baseline hazard for a completely susceptible individual to acquire strain $i$ is given by

$$q_{\emptyset \to \{i\}} = c \beta_i \sum_{X \in S_i} N_X. \tag{4.2}$$

Here, $c$ is the per capita rate at which hosts make contacts that are relevant for pathogen spread, $\beta_i$ is the probability of successfully acquiring strain $i$ given contact with an individual infected with strain $i$, and $S_i \subset S$ is the subset of states containing strain $i$.

Potential interactions between strains are modelled through modification of the baseline acquisition or clearance hazards. The model allows for different structures of interactions, but we only consider the pairwise-symmetric multiplicative structure (see [8] for the alternative structures). To be exact, we assume each strain that is carried to contribute multiplicatively to the hazard of acquiring (or clearing) the incoming (or outgoing) strain. Hence, the hazards of acquisition and clearance in the presence of other strains are

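$$q_{X \to X \cup \{i\}} = \Big( \prod_{j \in X} k_{ij} \Big)\, q_{\emptyset \to \{i\}}$$

and

$$q_{X \cup \{i\} \to X} = \Big( \prod_{j \in X} h_{ij} \Big)\, q_{\{i\} \to \emptyset}, \tag{4.3}$$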

where the involved interaction parameters are pairwise-symmetric, i.e. $k_{ji} = k_{ij}$ and $h_{ji} = h_{ij}$ for all $i$ and $j$. Defined as such, $k_{ij}$ and $h_{ij}$ are essentially hazard ratios that act on the baseline hazards. In particular, values equal to one imply no interaction and the deviation from one indicates that an interaction is present and how strong that interaction is. If strains only interact through acquisition (i.e. $h_{ij} = 1$), $k_{ij} < 1$ and $k_{ij} > 1$ indicate competitive and mutualistic interactions, respectively. Similarly, if strains only interact through clearance (i.e. $k_{ij} = 1$), $h_{ij} < 1$ and $h_{ij} > 1$ indicate mutualistic and competitive interactions, respectively. Previously, we have shown that when interaction is present in both modes, the overall interaction could be summarized by the ratio of the two interaction parameters $k_{ij}/h_{ij}$, with $k_{ij}/h_{ij} < 1$ and $k_{ij}/h_{ij} > 1$ indicating competitive and mutualistic interactions, respectively [30].

Finally, the basic reproduction number of each strain $i$ is defined by $R_{0,i} = c\beta_i / q_{\{i\} \to \emptyset}$. In the absence of interactions, only strains having $R_{0,i} > 1$ would be expected to survive, with marginal prevalence given by $\sum_{X \in S_i} N_X = 1 - 1/R_{0,i}$. Note, however, that $R_{0,i} > 1$ is neither sufficient nor required for survival in the presence of strain interactions, and that the ranking in marginal prevalence does not necessarily reflect the ranking in reproduction number. For simplicity, we fixed $c = 3$, set $q_{\{i\} \to \emptyset} = 1$ for all strains, and drew $\beta_i \in (0.5, 1)$ randomly to obtain $R_{0,i}$.
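As a rough illustration of equations (4.1) to (4.3), the following Python sketch integrates the model towards its steady state with a plain forward-Euler scheme. The function names, the step size, the initial seeding and the example parameter values are assumptions made for this illustration; they are not taken from the authors' simulation code.

```python
from itertools import combinations


def steady_state(c, beta, clear, k, h, dt=0.02, steps=10000):
    """Approximate the steady state of model (4.1) by forward Euler.

    beta[i] and clear[i] are the per-strain transmission probability and
    baseline clearance hazard; k and h are symmetric matrices of hazard
    ratios for acquisition and clearance (entries of 1 mean no interaction).
    Returns a dict mapping each infection state (a frozenset) to its proportion.
    """
    n = len(beta)
    states = [frozenset(s) for r in range(n + 1)
              for s in combinations(range(n), r)]
    # Assumed initial condition: 1% of hosts carry each strain on its own.
    N = dict.fromkeys(states, 0.0)
    for i in range(n):
        N[frozenset([i])] = 0.01
    N[frozenset()] = 1.0 - 0.01 * n
    for _ in range(steps):
        # Baseline acquisition hazard of each strain, equation (4.2).
        foi = [c * beta[i] * sum(p for X, p in N.items() if i in X)
               for i in range(n)]
        dN = dict.fromkeys(states, 0.0)
        for X in states:
            for i in range(n):
                if i in X:
                    continue
                acq, clr = foi[i], clear[i]
                for j in X:  # multiplicative pairwise modification, eq. (4.3)
                    acq *= k[i][j]
                    clr *= h[i][j]
                Xi = X | {i}
                # Net flow from X to X with strain i added.
                flow = N[X] * acq - N[Xi] * clr
                dN[X] -= flow
                dN[Xi] += flow
        for X in states:
            N[X] += dt * dN[X]
    return N


def marginal_prevalence(N, i):
    """Total proportion of hosts carrying strain i."""
    return sum(p for X, p in N.items() if i in X)


# Three non-interacting strains with c = 3 and unit clearance hazard.
ones = [[1.0] * 3 for _ in range(3)]
N = steady_state(c=3.0, beta=[0.6, 0.7, 0.8], clear=[1.0] * 3, k=ones, h=ones)
print([round(marginal_prevalence(N, i), 3) for i in range(3)])
```

In this non-interacting example the printed marginal prevalences should lie close to $1 - 1/R_{0,i}$ for $R_{0,i}$ = 1.8, 2.1 and 2.4. Euler integration is used only to keep the sketch free of dependencies; a dedicated ODE solver would be the more usual choice.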

4.2. Network construction

In the interaction networks considered in this paper, nodes represent pathogen strains, and edges represent the presence of a pairwise interaction between two strains. By connecting each pair of strains independently, with a fixed connection probability σ, the strain interaction networks resemble Erdős–Rényi random graphs that become more saturated with higher connection probability [40].

Binary networks only indicating the presence or absence of interactions were converted to weighted networks by attributing strengths of interaction $x_{ij}$, drawn uniformly from $(-\theta, \theta)$, with $\theta > 0$ denoting the maximum strength of interaction. In settings where we considered interaction in acquisition only, $\exp(x_{ij})$ was used as $k_{ij}$ in the epidemiological model. When we also considered interaction in clearance, aside from $x_{ij}$, which we drew uniformly from $(-\theta, \theta)$, we also drew an additional random number $\hat{x}_{ij}$ uniformly from $(-\theta/2, \theta/2)$. $\exp(\hat{x}_{ij})$ was used as $k_{ij}$ and $\exp(\hat{x}_{ij} - x_{ij})$ as $h_{ij}$. This was to ensure the same range for the overall strength of interaction $x_{ij}$ when considering only interaction in acquisition or also interaction in clearance.
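The construction described above can be sketched in a few lines of Python. The function name and its arguments are hypothetical, and the choice to leave $k_{ij} = h_{ij} = 1$ for unconnected pairs is an assumption of this sketch rather than something stated explicitly in the text.

```python
import math
import random


def draw_interaction_network(n, sigma, theta, clearance=False, rng=random):
    """Draw a random weighted interaction network between n strains.

    Each pair is connected independently with probability sigma. Connected
    pairs get an overall strength x_ij ~ U(-theta, theta). Returns the
    symmetric matrices (x, k, h) with log(k_ij / h_ij) = x_ij.
    """
    x = [[0.0] * n for _ in range(n)]
    k = [[1.0] * n for _ in range(n)]
    h = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() >= sigma:
                continue  # no edge: assumed to mean no interaction at all
            x_ij = rng.uniform(-theta, theta)
            if clearance:
                # Split the overall strength over acquisition and clearance.
                x_hat = rng.uniform(-theta / 2, theta / 2)
                k_ij, h_ij = math.exp(x_hat), math.exp(x_hat - x_ij)
            else:
                k_ij, h_ij = math.exp(x_ij), 1.0
            x[i][j] = x[j][i] = x_ij
            k[i][j] = k[j][i] = k_ij
            h[i][j] = h[j][i] = h_ij
    return x, k, h


# Example: 5 strains, connection probability 0.5, maximum strength 1.
x, k, h = draw_interaction_network(5, 0.5, 1.0, clearance=True,
                                   rng=random.Random(1))
```

Passing a seeded `random.Random` instance keeps the draw reproducible; in either setting the composite $k_{ij}/h_{ij}$ equals $\exp(x_{ij})$ by construction.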

4.3. Sampling cross-sectional datasets

We sampled random cross-sectional datasets $[Y_1, Y_2, \ldots, Y_m]^T$ from a multivariate binomial distribution that coincides with the steady state of the described epidemiological model, where $m$ is the number of i.i.d. observations, i.e. individuals. For each individual $l$, $Y_l = [Y_{1,l}, Y_{2,l}, \ldots, Y_{n,l}]$ denotes the sequence of Bernoulli random variables [41], with $Y_{i,l}$ denoting the presence of strain $i$ in individual $l$. It follows that $P(Y_{i,l} = 1\ \forall i \in X,\ Y_{j,l} = 0\ \forall j \notin X) = N_X$, where $N_X$ is the steady-state solution of model (4.1).
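A minimal sketch of this sampling step, assuming the steady-state proportions are held in a dictionary that maps each infection state (a frozenset of strain indices) to $N_X$; the function name and data layout are illustrative only.

```python
import random


def sample_survey(N, n, m, rng=random):
    """Draw m i.i.d. presence/absence vectors of length n.

    N maps each infection state (frozenset of strain indices) to its
    steady-state proportion; each host's state is drawn with probability N_X.
    """
    states = list(N)
    weights = [N[X] for X in states]
    draws = rng.choices(states, weights=weights, k=m)
    return [[1 if i in X else 0 for i in range(n)] for X in draws]


# Toy two-strain distribution (values chosen arbitrarily for illustration).
toy = {frozenset(): 0.4, frozenset([0]): 0.2,
       frozenset([1]): 0.2, frozenset([0, 1]): 0.2}
survey = sample_survey(toy, n=2, m=1000, rng=random.Random(0))
```

Each row of `survey` corresponds to one individual, and each column to one strain.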

4.4. Statistical network inference

In previous work, it has been shown that in a two-strain version of model (4.1), the prevalence of being simultaneously infected with two strains that do not interact corresponds to the product of marginal prevalence of both strains, provided that host death in age groups susceptible to infection is negligible [21], and there are no unobserved common risk factors [30]. Moreover, the odds ratio of co-occurrence in the two-strain model corresponds to the ratio of interaction parameters in acquisition and clearance $k_{12}/h_{12}$ [30]. When there are more than two strains, however, the correspondence between pairwise association measures and corresponding pairwise composite interaction parameters $k_{ij}/h_{ij}$ might no longer hold, and deviation from independence between two non-interacting strains could be induced by a third strain that interacts with both. The objective here is to capture these (possibly heterogeneous) pairwise interaction parameters from cross-sectional prevalence surveys for more than two strains. For this purpose, we make use of various network inference methods (electronic supplementary material, appendix S2).
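The two-strain correspondence can be sketched directly from the steady state of model (4.1). The lines below are an illustration added for clarity, not a reproduction of the argument in [30]; the shorthand $\lambda_i$ and $\mu_i$ is introduced only for this sketch. It assumes that, at steady state, each acquisition flow is balanced by the matching clearance flow, which can be checked to satisfy all four steady-state equations of the two-strain model.

```latex
% Shorthand assumed for this sketch:
%   \lambda_i = q_{\emptyset \to \{i\}} (steady-state baseline acquisition hazard),
%   \mu_i     = q_{\{i\} \to \emptyset} (baseline clearance hazard).
\begin{align*}
N_{\{1\}}\,\mu_1 &= N_{\emptyset}\,\lambda_1, &
N_{\{2\}}\,\mu_2 &= N_{\emptyset}\,\lambda_2, \\
N_{\{1,2\}}\,h_{12}\,\mu_2 &= N_{\{1\}}\,k_{12}\,\lambda_2, &
N_{\{1,2\}}\,h_{12}\,\mu_1 &= N_{\{2\}}\,k_{12}\,\lambda_1, \\
\Rightarrow\quad N_{\{1,2\}} &= N_{\emptyset}\,
  \frac{\lambda_1 \lambda_2}{\mu_1 \mu_2}\,\frac{k_{12}}{h_{12}}, &
\mathrm{OR}_{12} &= \frac{N_{\emptyset}\,N_{\{1,2\}}}{N_{\{1\}}\,N_{\{2\}}}
  = \frac{k_{12}}{h_{12}}.
\end{align*}
```

Both routes into the doubly infected state give the same expression for $N_{\{1,2\}}$, so the balance conditions are mutually consistent, and the baseline hazards cancel from the odds ratio.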

We applied several statistical network inference methods. Firstly, we applied graphical modelling approaches based on log-linear analysis [42]. This technique is generally used to examine the relationship between more than two categorical variables, here denoting presence–absence of each pathogen strain in a host population. To reconstruct networks of strain-specific interactions from co-occurrence data with up to 10 circulating strains, we searched through the subset of decomposable log-linear models, i.e. models whose dependence graph is triangulated [26]. Selection was applied in both a forward (focused on adding edges) and backward (focused on removing edges) fashion, using AIC or BIC as selection criterion. Strengths of interaction were quantified a posteriori by calculating odds ratios using conditional maximum-likelihood estimation from the contingency tables implied by the selected model.
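To give a feel for how the choice of selection criterion affects edge inclusion, the following toy Python sketch compares a two-strain model with an interaction (saturated 2 × 2 table, three free parameters) against the independence model (two free parameters) under AIC and BIC. It is a deliberately simplified stand-in, not the decomposable-model search performed with the R packages used in this study; the function name and the example counts are made up for illustration.

```python
import math


def edge_supported(n00, n01, n10, n11):
    """Return {"AIC": bool, "BIC": bool}: is the interaction edge retained?

    n_ab is the number of hosts with strain 1 status a and strain 2 status b.
    """
    counts = [n00, n01, n10, n11]
    m = sum(counts)
    p1 = (n10 + n11) / m  # marginal prevalence of strain 1
    p2 = (n01 + n11) / m  # marginal prevalence of strain 2
    saturated = [n / m for n in counts]
    independent = [(1 - p1) * (1 - p2), (1 - p1) * p2,
                   p1 * (1 - p2), p1 * p2]

    def loglik(probs):
        return sum(n * math.log(p) for n, p in zip(counts, probs) if n > 0)

    verdict = {}
    for name, penalty in (("AIC", 2.0), ("BIC", math.log(m))):
        with_edge = -2 * loglik(saturated) + penalty * 3
        without_edge = -2 * loglik(independent) + penalty * 2
        verdict[name] = with_edge < without_edge
    return verdict


# A weak association in 1000 hosts: retained by AIC, dropped by BIC.
print(edge_supported(265, 235, 235, 265))
```

Because the BIC penalty grows with the logarithm of the sample size, a weak association of this kind is kept under AIC but removed under BIC, which mirrors the sensitivity versus specificity trade-off between the two criteria.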

Secondly, we used the Ising model to reconstruct the simulated networks [25,43]. Here, the probability of a certain pathogen strain being present is modelled as a function of other strains being present at the same time. The Ising model can be shown to be equivalent to certain kinds of log-linear models, but interactions are at most pairwise and the dependence graph does not need to be triangulated. To estimate presence and strength of interactions, we made use of regularized logistic regressions: iteratively, one variable is regressed onto all others, with a penalty imposed on the regression coefficients to obtain a sparse network representation [44], and with model selection based on the extended BIC [25].

Lastly, we used GEE to estimate pairwise odds ratios between all strains under consideration. In modelling the associations between multiple strains concomitantly, we used the alternating logistic regression algorithm of the GENMOD procedure in SAS statistical software [27,38]. The regression framework facilitates assessment of strain-specific interactions on the basis of Wald tests. To correct for multiple hypothesis testing, we made use of the Benjamini–Hochberg procedure [28].
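For readers unfamiliar with the Benjamini–Hochberg step-up procedure [28], a minimal Python sketch follows. The function name, the example p-values and the default false discovery rate of 0.05 are assumptions made for illustration; the level used in the analyses is not restated here.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank r (1-based, p-values sorted in
    ascending order) with p_(r) <= alpha * r / m, then reject the r
    hypotheses with the smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical Wald-test p-values for four strain pairs.
print(benjamini_hochberg([0.02, 0.021, 0.03, 0.5]))
```

Note that the smallest p-value in this example (0.02) exceeds its own threshold of 0.0125 but is still rejected, because a larger p-value further down the ordering meets its threshold; this is what distinguishes the step-up rule from a simple per-test cut-off.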

4.5. Host heterogeneity

In the sensitivity analysis with host heterogeneity, we considered a simple extension of the described multi-strain epidemiological model with two sub-populations that differ in their contact rates relevant for pathogen spread. The proportions of hosts in the high- and low-contact sub-populations are denoted by $p_1$ = 20% and $p_2$ = 80%, and the corresponding contact rates by $c_1$ and $c_2$, respectively. The values of $c_1$ and $c_2$ were chosen to obtain a coefficient of variation of 80%. Contacts between the two sub-populations follow a classical pattern based on assortativity fraction $\phi$ within host types, which takes the value 0 when mixing between hosts is random and 1 when fully assortative [39]. Here, we fixed $\phi$ = 50%. The contact rate $c_{sr}$ between a ‘transmitting’ individual from sub-population $s$ and a ‘receiving’ individual from sub-population $r$ is given by (see electronic supplementary material, appendix S3, for derivation)

$$c_{sr} = \phi\, c_s\, \delta_{sr} + (1 - \phi)\, \frac{p_r c_r}{p_1 c_1 + p_2 c_2}, \quad \text{for } s, r \in \{1, 2\}. \tag{4.4}$$

Here, $\delta_{sr}$ is the Kronecker delta (with $\delta_{sr} = 1$ if $s = r$ and zero otherwise). Defined accordingly, the baseline hazard for a susceptible individual in sub-population $r$ to acquire strain $i$ is given by

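$$q_{r, \emptyset \to \{i\}} = c_{1r}\, \beta_i\, \frac{\sum_{X \in S_i} N_{1,X}}{p_1} + c_{2r}\, \beta_i\, \frac{\sum_{X \in S_i} N_{2,X}}{p_2}, \quad \text{for } r \in \{1, 2\}. \tag{4.5}$$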

Here, $N_{r,X}$ is the proportion of individuals in sub-population $r \in \{1, 2\}$ and infection state $X$, with $\sum_{X \in S} N_{r,X} = p_r$. Differential equations for $N_{r,X}$ were defined analogously to model (4.1) and the hazard for acquisition in the presence of infection with other strains as in equation (4.3), e.g.

$$q_{r, X \to X \cup \{i\}} = \Big( \prod_{j \in X} k_{ij} \Big)\, q_{r, \emptyset \to \{i\}}. \tag{4.6}$$

For network reconstruction in the stratified analysis, we performed analyses separately within either sub-population. For the alternative analysis with correction, we included an additional variable $R$ indicating to which sub-population each individual belongs. In the GEE regression framework, we included $R$ as an explanatory variable (irrespective of strain). In the other network approaches, we added $R$ to the sequence of variables denoting strain-specific presence, i.e. each $Y_l$ was extended to $[Y_l, R]$. In effect, the conditional independence structure inferred from the extended $[Y_1, Y_2, \ldots, Y_m]^T$ is augmented by one node representing elevated infection risk, next to nodes representing strains. To recover the estimated interaction network, we then removed the augmented node $R$ and associated edges.
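As a small worked example of how two contact rates can be chosen to match a target coefficient of variation, the Python sketch below solves for $c_1$ and $c_2$ given the proportion of high-contact hosts. It rests on two assumptions that are not stated in the text: that the coefficient of variation refers to the population-weighted distribution of contact rates, and that the population mean contact rate is kept at the value $c = 3$ used in the homogeneous model. The function name is hypothetical.

```python
import math


def two_group_contact_rates(mean_c, p1, cv):
    """Contact rates (c1, c2) of the high- and low-contact groups.

    For a two-point distribution with weights p1 and p2 = 1 - p1, the
    standard deviation equals sqrt(p1 * p2) * (c1 - c2), so the gap between
    the two rates follows directly from the target coefficient of variation.
    """
    p2 = 1.0 - p1
    gap = cv * mean_c / math.sqrt(p1 * p2)  # c1 - c2
    c1 = mean_c + p2 * gap
    c2 = mean_c - p1 * gap
    return c1, c2


# Assumed mean contact rate of 3, 20% high-contact hosts, CV of 80%.
c1, c2 = two_group_contact_rates(3.0, 0.2, 0.8)
print(round(c1, 3), round(c2, 3))
```

Under these assumptions the high-contact group would have a contact rate of about 7.8 and the low-contact group about 1.8; other conventions for the mean or the coefficient of variation would give different values.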

Ethics

This work did not require ethical approval from a human subject or animal welfare committee.

Data accessibility

The computer codes for the generation of simulation data are available from the GitHub repository: https://github.com/irene-man/Reconstruction-type-interaction-networks/. The statistical network inference analyses were performed with the following statistical software. Analyses of graphical models were performed with the packages gRbase (version 1.8.3), gRain (version 1.3.0) and gRim (version 0.2.0) developed for graphical modelling with R (version 3.4.1). Analyses of Ising model were performed using the packages IsingFit (version 0.3.1) and IsingSampler (version 0.2.1). Analyses of GEE were performed using the alternating logistic regression algorithm of the GENMOD procedure in SAS.

The data are provided in electronic supplementary material [45].

Declaration of AI use

We have not used AI-assisted technologies in creating this article.

Authors' contributions

I.M.: conceptualization, data curation, formal analysis, software, visualization, writing—original draft, writing—review and editing; E.B.: funding acquisition, validation, visualization, writing—review and editing; M.E.K.: supervision, validation, writing—review and editing; J.A.B.: conceptualization, data curation, formal analysis, funding acquisition, methodology, project administration, software, supervision, visualization, writing—original draft, writing—review and editing.

All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Conflict of interest declaration

The authors have declared that no competing interests exist.

Funding

J.A.B. and E.B. were supported by grant no. 645.001.002 (Ecology meets human health) through the NWO Complexity in Health and Nutrition programme from the Netherlands Organisation for Scientific Research (https://www.nwo.nl/en/find-funding); I.M. and J.A.B. by grant S/113005/01/PT (Prometheus project) through the Strategic Programme from the National Institute for Public Health and the Environment of The Netherlands (https://www.rivm.nl/).

References

  • 1. Budden KF, et al. 2019. Functional effects of the microbiota in chronic respiratory disease. Lancet Respir. Med. 7, 907-920. (doi:10.1016/S2213-2600(18)30510-1)
  • 2. Rowe HM, Meliopoulos VA, Iverson A, Bomme P, Schultz-Cherry S, Rosch JW. 2019. Direct interactions with influenza promote bacterial adherence during respiratory infections. Nat. Microbiol. 4, 1328-1336. (doi:10.1038/s41564-019-0447-0)
  • 3. Cobey S, Lipsitch M. 2012. Niche and neutral effects of acquired immunity permit coexistence of pneumococcal serotypes. Science 335, 1376-1380. (doi:10.1126/science.1215947)
  • 4. Egawa N, Egawa K, Griffin H, Doorbar J. 2015. Human papillomaviruses; epithelial tropisms, and the development of neoplasia. Viruses 7, 3863-3890. (doi:10.3390/v7072802)
  • 5. Lerch A, Koepfli C, Hofmann NE, Kattenberg JH, Rosanas-Urgell A, Betuela I, Mueller I, Felger I. 2019. Longitudinal tracking and quantification of individual Plasmodium falciparum clones in complex infections. Sci. Rep. 9, 3333. (doi:10.1038/s41598-019-39656-7)
  • 6. Coyte KZ, Schluter J, Foster KR. 2015. The ecology of the microbiome: networks, competition, and stability. Science 350, 663-666. (doi:10.1126/science.aad2602)
  • 7. Bucci V, et al. 2016. MDSINE: Microbial Dynamical Systems INference Engine for microbiome time-series analyses. Genome Biol. 17, 121. (doi:10.1186/s13059-016-0980-6)
  • 8. Man I, Auranen K, Wallinga J, Bogaards JA. 2019. Capturing multiple-type interactions into practical predictors of type replacement following human papillomavirus vaccination. Phil. Trans. R. Soc. B 374, 20180298. (doi:10.1098/rstb.2018.0298)
  • 9. Balmer O, Tanner M. 2011. Prevalence and implications of multiple-strain infections. Lancet Infect. Dis. 11, 868-878. (doi:10.1016/S1473-3099(11)70241-9)
  • 10. Griffiths EC, Pedersen AB, Fenton A, Petchey OL. 2011. The nature and consequences of coinfection in humans. J. Infect. 63, 200-206. (doi:10.1016/j.jinf.2011.06.005)
  • 11. Alizon S, De Roode JC, Michalakis Y. 2013. Multiple infections and the evolution of virulence. Ecol. Lett. 16, 556-567. (doi:10.1111/ele.12076)
  • 12. Van Dorp CH, Van Boven M, De Boer RJ. 2014. Immuno-epidemiological modeling of HIV-1 predicts high heritability of the set-point virus load, while selection for CTL escape dominates virulence evolution. PLoS Comput. Biol. 10, e1003899. (doi:10.1371/journal.pcbi.1003899)
  • 13. Susi H, Barrès B, Vale PF, Laine AL. 2015. Co-infection alters population dynamics of infectious disease. Nat. Commun. 6, 5975. (doi:10.1038/ncomms6975)
  • 14. Wikramaratna PS, Kucharski A, Gupta S, Andreasen V, Mclean AR, Gog JR. 2015. Five challenges in modelling interacting strain dynamics. Epidemics 10, 31-34. (doi:10.1016/j.epidem.2014.07.005)
  • 15. Howard SC, Donnell CA, Chan MS. 2001. Methods for estimation of associations between multiple species parasite infections. Parasitology 122(Pt 2), 233-251.
  • 16. Regev-Yochay G, Dagan R, Raz M, Carmeli Y, Shainberg B, Derazne E, Rahav G, Rubinstein E. 2004. Association between carriage of Streptococcus pneumoniae and Staphylococcus aureus in children. JAMA 292, 716-720. (doi:10.1001/jama.292.6.716)
  • 17. Chaturvedi AK, et al. 2011. Human papillomavirus infection with multiple types: pattern of coinfection and risk of cervical disease. J. Infect. Dis. 203, 910-920. (doi:10.1093/infdis/jiq139)
  • 18. Mollers M, et al. 2014. Population- and type-specific clustering of multiple HPV types across diverse risk populations in the Netherlands. Am. J. Epidemiol. 179, 1236-1246. (doi:10.1093/aje/kwu038)
  • 19. Wilber MQ, Johnson PT, Briggs CJ. 2017. When can we infer mechanism from parasite aggregation? A constraint-based approach to disease ecology. Ecology 98, 688-702. (doi:10.1002/ecy.1675)
  • 20. Alizon S, Murall CL, Saulnier E, Sofonea MT. 2019. Detecting within-host interactions from genotype combination prevalence data. Epidemics 29, 100349. (doi:10.1016/j.epidem.2019.100349)
  • 21. Hamelin FM, et al. 2019. Coinfections by noninteracting pathogens are not independent and require new tests of interaction. PLoS Biol. 17, e3000551. (doi:10.1371/journal.pbio.3000551)
  • 22. Hastings A. 1987. Can competition be detected using species co-occurrence data? Ecology 68, 117-123. (doi:10.2307/1938811)
  • 23. Plummer M, Vaccarella S, Franceschi S. 2011. Multiple human papillomavirus infections: the exception or the rule? J. Infect. Dis. 203, 891-893. (doi:10.1093/infdis/jiq146)
  • 24. Mejlhede N, Pedersen BV, Frisch M, Fomsgaard A. 2010. Multiple human papilloma virus types in cervical infections: competition or synergy? APMIS 118, 346-352. (doi:10.1111/j.1600-0463.2010.2602.x)
  • 25. Van Borkulo CD, Borsboom D, Epskamp S, Blanken TF, Boschloo L, Schoevers RA, Waldorp LJ. 2014. A new method for constructing networks from binary data. Sci. Rep. 4, 5918. (doi:10.1038/srep05918)
  • 26. Højsgaard S, Edwards D, Lauritzen S. 2012. Graphical models with R. New York, NY: Springer Science & Business Media.
  • 27. Carey V, Zeger SL, Diggle P. 1993. Modelling multivariate binary data with alternating logistic regressions. Biometrika 80, 517-526. (doi:10.1093/biomet/80.3.517)
  • 28. Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289-300. (doi:10.1111/j.2517-6161.1995.tb02031.x)
  • 29. De Villiers EM. 2013. Cross-roads in the classification of papillomaviruses. Virology 445, 2-10. (doi:10.1016/j.virol.2013.04.023)
  • 30. Man I, Wallinga J, Bogaards JA. 2018. Inferring pathogen type interactions using cross-sectional prevalence data: opportunities and pitfalls for predicting type replacement. Epidemiology 29, 666-674. (doi:10.1097/EDE.0000000000000870)
  • 31. Pinto S, Benincà E, Van Nes EH, Scheffer M, Bogaards JA. 2022. Species abundance correlations carry limited information about microbial network interactions. PLoS Comput. Biol. 18, e1010491. (doi:10.1371/journal.pcbi.1010491)
  • 32. Malagón T, Lemieux-Mellouki P, Laprise JF, Brisson M. 2016. Bias due to correlation between times-at-risk for infection in epidemiologic studies measuring biological interactions between sexually transmitted infections: a case study using human papillomavirus type interactions. Am. J. Epidemiol. 184, 873-883. (doi:10.1093/aje/kww152)
  • 33. Telfer S, Birtles R, Bennett M, Lambin X, Paterson S, Begon M. 2008. Parasite interactions in natural populations: insights from longitudinal data. Parasitology 135, 767-781. (doi:10.1017/S0031182008000395)
  • 34. Shrestha S, King AA, Rohani P. 2011. Statistical inference for multi-pathogen systems. PLoS Comput. Biol. 7, e1002135. (doi:10.1371/journal.pcbi.1002135)
  • 35. Abrams S, Hens N. 2015. Modeling individual heterogeneity in the acquisition of recurrent infections: an application to parvovirus B19. Biostatistics 16, 129-142. (doi:10.1093/biostatistics/kxu031)
  • 36. Farrington CP, Whitaker HJ, Unkel S, Pebody R. 2013. Correlated infections: quantifying individual heterogeneity in the spread of infectious diseases. Am. J. Epidemiol. 177, 474-486. (doi:10.1093/aje/kws260)
  • 37. Burnham KP, Anderson DR. 2002. Model selection and multimodel inference. A practical information-theoretic approach, 2nd edn. Berlin, Germany: Springer.
  • 38. Lipsitz SR, Fitzmaurice GM. 1996. Estimating equations for measures of association between repeated binary responses. Biometrics 52, 903-912. (doi:10.2307/2533051)
  • 39. Keeling M, Rohani P. 2008. Modeling infectious diseases in humans and animals. Princeton, NJ: Princeton University Press.
  • 40. Newman M. 2018. Networks. Oxford, UK: Oxford University Press.
  • 41. Teugels JL. 1990. Some representations of the multivariate Bernoulli and binomial distributions. J. Multivariate Anal. 32, 256-268. (doi:10.1016/0047-259X(90)90084-U)
  • 42. Agresti A. 2003. Categorical data analysis. Hoboken, NJ: John Wiley & Sons.
  • 43. Ising E. 1924. Beitrag zur theorie des ferro-und paramagnetismus. Hamburg, Germany: Grefe & Tiedemann.
  • 44. Ravikumar P, Wainwright MJ, Lafferty JD. 2010. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals Stat. 38, 1287-1319. (doi:10.1214/09-AOS691)
  • 45. Man I, Benincà E, Kretzschmar ME, Bogaards JA. 2023. Reconstructing multi-strain pathogen interactions from cross-sectional survey data via statistical network inference. Figshare. (doi:10.6084/m9.figshare.c.6760129)

Articles from Journal of the Royal Society Interface are provided here courtesy of The Royal Society
