Abstract
In health care settings, patients who are physically proximate to other patients (co-presence) for a meaningful amount of time may have differential health outcomes depending on who they are in contact with. How best to measure this co-presence, however, is an open question, and previous approaches have limitations that may make them inappropriate for complex health care settings. Here, we introduce a novel method, which we term “consistent co-presence”, that implicitly models the many complexities of patient scheduling and movement through a hospital by randomly perturbing the timing of patients’ entry into the health care system. This algorithm generates networks that can be employed in models of patient outcomes, such as 1-year mortality, and are preferred over previously established alternative algorithms from a model comparison perspective. These results indicate that consistent co-presence retains meaningful information about patient-patient interaction, which may affect outcomes relevant to health care practice. Furthermore, the generalizability of this approach allows it to be applied to a wide variety of complex systems.
Keywords: affiliation networks, temporal networks, spatial networks, hospital administrative data
I. Introduction
“Big Data” is frequently characterized by the volume of information observed in a system [1]. Less often, albeit just as important, the focus is on the velocity at which information arrives and can be deployed in analysis and decision making [2]. In health care settings, for instance, patient administrative information is often collected at a rate much faster than can be processed by physicians and other hospital staff [3]. As such, it is difficult to incorporate much of the information into a patient’s care while they are in the health care system. This can have negative effects on both the patient and the health care system. Developing methods to utilize rapidly-accumulating data to improve patient health in an actionable way is an important goal of Big Data analytics for health care.
Examples for such an endeavor include Electronic Medical Records (EMR) and Hospital Administrative Data (HAD) [4]. EMR comprise the data pertaining directly to a patient’s health, such as diagnoses, administered treatments, and procedures. HAD, on the other hand, are the data collected by the hospital for purposes that lie outside of medical decision making. These are generally complex relational databases of information about a patient’s location, movement throughout a hospital, billing, and contact information all shielded via encryption and access control policies.
Consistent monitoring of patients’ EMR during respective hospital stays has improved patient health [5], decreased costs [6], and improved understanding of patient outcomes [7]. HAD, on the other hand, are used much less frequently in research, largely because they do not include explicit health-related information (with the exception of diagnostic codes, such as ICD-10, for billing purposes [8]). However, because HAD captures information about how a patient moves in time and space through the hospital setting during any given visit, these datasets may provide insights into how such movement might be related to patient health.
For instance, while a patient is treated independently of other patients, their time in the hospital is often spent co-present with other people. Such co-presence can occur in the waiting room, while receiving chemotherapy, or when receiving visitors. Co-presence often leads to social interactions [9] that can carry health consequences, such as the transmission of infectious diseases, sharing of health information, or social influences on one’s behavior [4], [10], [11]. The cumulative effect of these interactions has not been well-studied because 1) the computing power necessary to do so in real-time has not existed at scale until recently, 2) the magnitude of the effect of one’s health status is often assumed to greatly overshadow the effects of patient-patient interaction, and 3) existing algorithms to estimate “meaningful” interactions do not adequately address the many underlying complexities of hospitals’ assignment of patients to spaces in time.
In this paper, we describe an algorithm for inferring “meaningful” patient-patient co-location encounters in hospitals, which we term “consistent co-presence” networks. These networks represent millions of observations that accumulate rapidly in the HAD and could potentially be deployed to analyze a host of emergent health-related issues in hospitals where patient-patient contact is implicated. Alternative patient-patient monitoring systems, such as genomic sequencing of microbes to infer linkages between patients who share common infections during nosocomial outbreaks [10], [12], [13], or link-tracing via traditional forensic epidemiology [14], [15], are costly, time consuming, and can only be deployed retrospectively.
A. Contributions and Findings
Our main contributions to the practical analysis of Big Data using HAD are:
Since patient-patient interactions may have important implications for patient health, we propose an algorithm to identify patient-patient co-presence networks. This algorithm uses each patient’s observed hospital entry times as a baseline, calculating Jaccard indices between that patient and all other patients in the observed data. Each patient’s hospital entry times are then randomly perturbed and the Jaccard indices recalculated, which yields a simulated distribution. The observed indices are compared to a percentile cut-off of this simulated distribution. Pairs whose observed index exceeds the cut-off are considered significantly co-present; these connections are expected to have an increased likelihood of affecting a patient’s health.
We describe a data set in which we apply this algorithm - namely, the health system in a county in the United Kingdom (UK) encompassing one million patients over 15 years. The constructed co-presence network encodes information about which patients are around other patients more than expected by chance conditional on their underlying usage of the health care system. Connections in this network help predict a number of patient health-related outcomes, indicating the importance of HAD and which patients interact with which other patients while in the hospital.
We describe how this algorithm can be extended to any setting outside of health care where the scheduling of people is sufficiently complex. Because of the flexibility of the algorithm, the complexity of any situation can potentially be captured in the observed occupancy times of a person, obviating the need to explicitly encode the rules and logic of any given system. In particular, this allows one to research systems which would otherwise be black boxes or too complex to explicitly encode all the governing rules.
II. Background research
We define meaningful co-presence between two patients to convey a qualitative amount of time two patients share a common space where social interactions—such as an exchange of information—would be likely to occur. Thus, we distinguish between incidental co-presence (where two patients share a trivial amount of time in the same location) and meaningful co-presence. While approaches to measuring meaningful co-presence between patients currently exist, this past work is limited insofar as it does not address the complexities of HAD and patient movement in space and time. That is, previous work largely operationalizes co-presence networks from arbitrary measures that fail to account for the dynamics of the underlying social system [16]. In this section, we review such approaches and explain why our new method—consistent co-presence—achieves greater theoretical mapping to real world health care settings.
One of the most basic measures of co-presence is to take the sum of the amount of time two individuals spent together in a given location. In clinical settings, this approach has a number of potential benefits: it scales linearly as two patients are co-present more and it takes advantage of the full continuous range of time. This method has been used, for instance, to describe social interactions unfolding in a “day in the life” of individuals [17]. However, many counts of co-presence calculated in this way would be quantitatively the same but qualitatively very different. For example, in the hospital setting, two patients each receiving dialysis for one hour a week every week over 52 weeks would have 52 hours of co-presence. By contrast, two patients who were admitted to the Emergency Room (ER) and remained there for 500 hours, but only overlapped for 52, would also have co-presence of 52 hours. In this case, the first example is much more likely to reflect a meaningful interaction between patients, and the sum of co-presence time is unable to account for the qualitative difference in meaningful co-presence.
To account for some of these differences, one could normalize these summed co-presence times by the potential for two individuals to overlap. In another approach, we could formalize meaningful co-presence via the Jaccard Index for each pair of patients (i,j):
$$J(i, j) = \frac{|H(i) \cap H(j)|}{|H(i) \cup H(j)|} \tag{1}$$
where H(i) is the set of hours patient i is present in a given location. The Jaccard Index has been shown to have ideal properties in settings such as co-citation networks [18] and to examine social networks in behavioral change interventions [19]. By normalizing over the potential for two patients to overlap, it distinguishes the two previously described patterns of co-presence. For the first case, if both patients are in the hospital for those 52 hours, and only those 52 hours, then their Jaccard index would be 1 (52/52). For the other pair of patients, if they are in the hospital for a total of 500 hours between them, then their Jaccard index is 0.104 (52/500). This would properly identify the first case as more likely to be meaningful.
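The two cases above can be checked with a few lines of Python. This is a minimal sketch, assuming hours are indexed as integers; the function name `jaccard` and the particular hour ranges are illustrative choices, picked only so that the ER pair overlaps for 52 hours and covers 500 distinct hours between them.

```python
def jaccard(hours_i, hours_j):
    """Jaccard index of two sets of occupied hours (Eq. 1)."""
    union = hours_i | hours_j
    if not union:
        return 0.0  # assumption: two empty stays count as no co-presence
    return len(hours_i & hours_j) / len(union)

# Dialysis pair: the same one-hour slot in each of 52 weeks.
HOURS_PER_WEEK = 7 * 24
dialysis_a = {week * HOURS_PER_WEEK for week in range(52)}
dialysis_b = set(dialysis_a)

# ER pair: 52 shared hours, 500 distinct hours between the two patients.
er_a = set(range(0, 276))    # hours 0..275
er_b = set(range(224, 500))  # hours 224..499, shared hours are 224..275

# Both pairs have the same summed co-presence time...
print(len(dialysis_a & dialysis_b), len(er_a & er_b))
# ...but very different Jaccard indices.
print(jaccard(dialysis_a, dialysis_b), jaccard(er_a, er_b))
```

The summed overlap is 52 hours in both cases, while the normalized index separates them.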
However, this method does not address all possible issues. Consider the following figure, where there is clear heaping at certain values of the Jaccard index (Figure 1). This heaping is an artifact due to the outpatient nature of the hospital in question; patients leave and enter only during business hours, and so cannot exist across the entire continuum of co-presence times. Furthermore, an inordinate number of patients are in the hospital for only an hour or two (reflecting a single visit instead of a series of treatment visits), and so can only have Jaccard index values of 0, 1, or 1/2.
Fig. 1.

Kernel density-smoothed function of Jaccard indices in a hospital. We observe heightened frequency of Jaccard index values at 1, 1/2, 1/3, 1/4, and 1/5. The peaks at 1 and 1/2 are due to the large numbers of patients occupying the hospital for only one or two hours, and overlapping with others for one or two of those hours. These and the other peaks indicate an underlying endogenous process influencing patient stays, and therefore overlap, that is not accounted for by the Jaccard index.
One potential solution to this heaping artifact is to consider a dichotomized co-presence index. Although dichotomizing variables at a specific cutoff is often poor statistical practice [20], setting cutoffs based on the data itself provides much more practical results. One such method is to set a cutoff when individuals are co-present more than expected by chance. This method has been effectively used in the past to look at interactions between school children at play [21] and to examine the social influence that cancer patients undergoing chemotherapy together exert on one another [11]. In this case, the researchers assumed random mixing, and so the interaction could be modeled as:
| (2) |
where N is the number of individuals under observation. In other words, the tie between two individuals is 1 if the Jaccard index is greater than expected under a null distribution having the subjects randomly interact. This dichotomizes the Jaccard index based on the underlying structure of the data. In doing so, one avoids the pitfalls of dichotomizing a variable based on an arbitrary cutoff. The main disadvantage of this approach is the assumption of random interaction - in a hospital setting (and indeed in many other settings of structured human organization) this assumption is clearly violated because patients are not randomly distributed throughout the hospital. Importantly, the ways in which it is violated are uncertain as a mixture of structure and random processes are both involved. As a result, simply incorporating assumptions into the model based on the hospital’s standard administrative process is not a practical approach. Therefore, we propose to base our algorithm on the approach of connecting patients who are co-present more than expected by chance and then implicitly modeling the uncertainty using random perturbations of the system.
III. Algorithm
We posit that two patients are consistently co-present if their respective stays in a hospital overlap more often than would be expected by chance. To derive this variable, we first define a cutoff in terms of time co-present. As previously stated, a patient i has a set of hours in the hospital: H(i). With any other patient, j, the Jaccard index, J(i, j) is written in Equation 1. Similar to Eq. 2, we then want to dichotomize these J(i, j) based on those which are greater than expected by chance.
Without the simplifying assumption of random mixing, we make the assumption that hospital visits would occur, but that the timing of the hospital visit is stochastic. For example, a patient with a broken toe may immediately go to the hospital after the precipitating incident, or may wait to see if it is only a sprain - in either case the patient goes to the hospital, but the timing differs. All patients who had a hospital visit within this stochastic window then represent the corresponding risk set of patients with whom the focal patient (i.e. patient i) could have overlapped.
$$R(i) = \{\, k \neq i : H(i) \cap H(k) \neq \emptyset \,\} \tag{3}$$
This states that the risk-set, R(i), is the set of all patients k that overlap in the health care system for any period of time as the focal patient.
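As a rough illustration of the risk set, the sketch below collects every other patient whose occupied hours intersect those of the focal patient. It follows the verbal definition given here; the dictionary layout and the name `risk_set` are assumptions made for the example, and a variant that pads the focal patient's hours by the stochastic window would be an equally plausible reading.

```python
def risk_set(focal, hours):
    """Patients other than `focal` who share at least one hour with `focal`.

    `hours` maps a patient id to the set of integer hours that patient
    spent in the health care system (a hypothetical data layout).
    """
    focal_hours = hours[focal]
    return {k for k, h in hours.items() if k != focal and focal_hours & h}

hours = {
    "p1": {0, 1, 2},
    "p2": {2, 3},      # shares hour 2 with p1
    "p3": {10, 11},    # never overlaps with anyone
}
print(risk_set("p1", hours))
print(risk_set("p3", hours))
```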
In the case of appointments, treatments are given according to a prescribed timeline [22]. Within this timeline, there is a degree of flexibility, but the number and approximate timing of appointments is set. As an example, chemotherapy has a strict treatment regimen, requiring treatments three or four consecutive days a week for a few months. Each given treatment can be moved within a given day, but if moved beyond that the pharmacokinetics are disrupted [22]. Therefore, in the case of hospital visits stemming from appointments, the timing of the visit is assumed to potentially fluctuate up to one day earlier or later than observed.
On the other hand, some hospital visits stem from unplanned emergencies. These too are assumed to always occur at a similar time and for the same duration, but the exact timing is stochastic. Although needing to visit the ER is not itself random, the timing of the ER visit itself once an ER visit becomes necessary can be considered variable. This specific timing depends not only on one’s underlying morbidity, but on the timing of the morbidity, one’s schedule, the perceived severity, etc. As a result of this, the exact time of hospital visit is similarly assumed to also vary up to one day earlier or later. For example, when one develops infectious symptoms, one might wait to utilize the health care system and rely on home remedies. Only once these fail might one determine that visiting the ER is necessary. However, this point could have been reached earlier or later, depending on a number of factors. This stochasticity is what we leverage in allowing these entry times to vary.
From the assumption of stochasticity of hospital entry time, the algorithm to determine consistent co-presence naturally follows. For each patient i, we first determine the observed set of Jaccard indices across all other patients j, denoted J.obs(i, j).
We randomly perturb the entry time of the k-th visit of patient i, entry.time(i, k), by a draw from a normal distribution P with mean zero and a standard deviation of 12 hours (so approximately 95% of perturbed values fall within one day of the observed time).
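$$\text{entry.time}^{*}(i, k) = \text{entry.time}(i, k) + \varepsilon_{i,k}, \qquad \varepsilon_{i,k} \sim P = N(0,\ \sigma = 12\ \text{hours}) \tag{4}$$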
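The perturbation step and the "95% within one day" statement can be illustrated as follows. This is a sketch under the stated assumption of a zero-mean normal shift with a 12-hour standard deviation; the function name, the example entry times, and the fixed seed are illustrative only.

```python
import random
from statistics import NormalDist

SD_HOURS = 12.0  # standard deviation of the shift, as stated in the text

def perturb_entry_times(entry_times, rng, sd=SD_HOURS):
    """Shift each observed entry time (in hours) by an independent N(0, sd) draw."""
    return [t + rng.gauss(0.0, sd) for t in entry_times]

# Probability that a shift lands within one day (24 h) of the observed time.
shift = NormalDist(mu=0.0, sigma=SD_HOURS)
within_one_day = shift.cdf(24.0) - shift.cdf(-24.0)
print(round(within_one_day, 4))  # 0.9545, i.e. two standard deviations

rng = random.Random(0)  # fixed seed so the example is reproducible
shifted = perturb_entry_times([100.0, 268.0], rng)
print(shifted)
```

Only the entry time is moved; the duration of each visit is left as observed.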
We then calculate the Jaccard index with all other patients in this new regime for patient i. We replicate this T = 1,000 times, building a distribution of Jaccard indices: J.sim(i). The choice of 1,000 replications is not canonical; it can be increased if more replications are needed to better explore the distribution. Finally, we compare the observed Jaccard indices of patient i to the simulated distribution, labeling any that fall above a specified quantile cutoff as consistent co-presence. Formally,
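$$A_{i,j} = \begin{cases} 1 & \text{if } J.\text{obs}(i, j) > \operatorname{quantile}_{Q}\big(J.\text{sim}(i)\big) \\ 0 & \text{otherwise} \end{cases} \tag{5}$$
where Q is the quantile cutpoint.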
An edge from patient i to j is indicated when A_{i,j} is one, leading to the co-presence network. We will refer to the edges in this network as an indicator of patients who are “consistently co-present”. Unlike the Jaccard-weighted person-hours, A_{i,j} = 0 does not imply that there was 0 overlap between i and j, only that we do not consider such overlap significant, that is, i and j are not consistently co-present. While this method results in a directed network (patient i’s overlap with patient j may be significant for patient i but not patient j), nearly all ties were mutual (> 99%). Given that empirically nearly all ties are mutual and theoretically consistent co-presence is symmetric, we treat the network as undirected. Thus, edges in the network have the convenient interpretation that two patients are connected when at least one of them was consistently co-present with the other.
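The conversion from a directed to an undirected network described here amounts to keeping a tie whenever either direction is present. A minimal sketch, assuming directed edges are stored as (i, j) tuples; the function names are illustrative:

```python
def symmetrize(directed_edges):
    """Undirected ties: {i, j} is kept if i->j or j->i is present."""
    return {frozenset((i, j)) for i, j in directed_edges if i != j}

def mutual_fraction(directed_edges):
    """Share of undirected ties for which both directions are present."""
    directed = set(directed_edges)
    ties = symmetrize(directed)
    mutual = 0
    for tie in ties:
        i, j = tuple(tie)
        if (i, j) in directed and (j, i) in directed:
            mutual += 1
    return mutual / len(ties) if ties else 0.0

directed = {("a", "b"), ("b", "a"), ("a", "c")}
print(symmetrize(directed))
print(mutual_fraction(directed))
```

A mutual fraction close to 1, as reported above, means that very few ties are added by this rule compared with requiring both directions.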
A. Tunable parameters
There are three tunable parameters in the algorithm described above: Q, the cutoff quantile; T, the number of simulated trials; and P, the prior distribution of stochastic changes to entry times. For the quantile cutoff, as one decreases the quantile, the resulting network approaches the network formed by the Union rule: if two patients are ever co-present, then we consider an edge to exist between them. As Q increases, the density of the resulting networks decreases, until density reaches 0 when Q = 1. Below, we explore how altering Q affects the results. Importantly, Q is not necessarily grounded in theory: whether the better cutoff is at the 80th or 90th percentile cannot necessarily be determined solely based on knowledge of the underlying process.
The distribution of stochastic entry times, however, can be so theorized. As described above, in the hospital co-presence case, an entry time of plus or minus one day is theoretically meaningful in a way that adding a month to a chemotherapy visit would not be. Through this parameter, one can encode information about the underlying system without needing to specify all the knowledge about the system. If one were to randomly perturb both a focal patient’s times, and the times of all other patients, using a uniform distribution across all times with Q = 0.50, the algorithm would simplify to the case in Eq. 2. If one assumes no stochasticity at all (i.e. the entry times are P ~ N(0, 0)), then the algorithm simply retains, for each patient, the overlaps whose Jaccard index exceeds the Q-th quantile of that patient’s observed indices. Depending on the knowledge of potential stochasticity in the underlying process, this distribution and its associated parameters can be altered to reflect specific situations.
In addition to the tunable parameters, there is the question of spatial resolution. At a low spatial resolution, people could be considered co-present if they are anywhere in the entire hospital at the same time. On the other end of the continuum, a high spatial resolution would only consider people co-present if they are physically touching. Generally, we are interested in spatial resolution between these extremes. The spatial resolution of interest, although not an explicit tunable parameter, must be determined by researchers.
B. Algorithmic Notation
Based on the theoretical motivation for consistent co-presence discussed above we now formally describe the algorithm used in our subsequent analysis. The pseudo-code for the algorithm is outlined below.
Inputs: Data R (the set of patient spells in a given space and time window), Q… (a quantile), P… (a prior probability reference distribution), N (a set of unique individuals), and T (the number of trials).
Procedure:
for i in N
    for j in N ∀j ≠ i
        Set J.obs(i, j) = J(i, j)
for t in 1 : T
    for i in N
        for k in R
            entry.time(i, k) = entry.time(i, k) + P…
        for j in N ∀j ≠ i
            J.sim(i) = concatenate(J.sim(i), J(i, j))
for i in N
    for j in N ∀j ≠ i
        if Q(J.obs(i,j)) > Q…(J.sim(i, j)) : Ai,j = 1
        else : Ai,j = 0
Output: Consistently co-present Adjacency matrix, A.
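One way to turn the pseudo-code into something runnable is sketched below for a single spatial unit. It is not the authors' implementation. Several choices are assumptions made for the example: spells are stored as (entry_hour, duration_hours) pairs, perturbed times are rounded to whole hours, each trial perturbs the observed entry times afresh rather than cumulatively, only the focal patient is perturbed, and the quantile is taken as a simple empirical order statistic.

```python
import math
import random

def hours_of(spells):
    """Integer hours covered by a list of (entry_hour, duration_hours) spells."""
    covered = set()
    for entry, duration in spells:
        start = round(entry)  # assumption: perturbed times snap to whole hours
        covered.update(range(start, start + duration))
    return covered

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def empirical_quantile(values, q):
    """Smallest observed value with at least a fraction q of values at or below it."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(q * len(ordered)) - 1)]

def consistent_copresence(spells, q=0.5, trials=1000, sd_hours=12.0, seed=0):
    """Return {(i, j): 0 or 1} for all ordered pairs of distinct patients."""
    rng = random.Random(seed)
    observed = {p: hours_of(s) for p, s in spells.items()}
    adjacency = {}
    for i in spells:
        others = [j for j in spells if j != i]
        simulated = []
        for _ in range(trials):
            # Perturb only the focal patient's entry times; durations are kept.
            shifted = [(entry + rng.gauss(0.0, sd_hours), duration)
                       for entry, duration in spells[i]]
            shifted_hours = hours_of(shifted)
            simulated.extend(jaccard(shifted_hours, observed[j]) for j in others)
        cutoff = empirical_quantile(simulated, q)
        for j in others:
            adjacency[(i, j)] = int(jaccard(observed[i], observed[j]) > cutoff)
    return adjacency

# Toy data: a and b share a weekly one-hour slot; c visits once, far away in time.
weekly = [(week * 168, 1) for week in range(10)]
spells = {"a": list(weekly), "b": list(weekly), "c": [(100_000, 1)]}
A = consistent_copresence(spells, q=0.5, trials=200)
print(A[("a", "b")], A[("a", "c")])  # prints: 1 0
```

In the toy data, patients a and b always coincide in the observed schedule but rarely do once a's times are shifted, so their observed index sits above the simulated cutoff and the pair is flagged; patient c never overlaps with anyone and stays unconnected.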
C. Computational complexity
Beyond being theoretically sound in determining potentially meaningful patient-patient interactions in the hospital, consistent co-presence should be relatively simple to calculate in near-real time. These connections can be monitored by health services staff, such as hospital epidemiologists, looking to improve patient health and monitor nosocomial outbreaks. The computational complexity of this algorithm is therefore important, and we discuss it here.
For perturbing a patient’s visit times, the complexity is O(V) where V is the number of hospital visits patients have on average. Calculating the overlap between the focal patient and all other patients is therefore O(V * N) where N is the number of patients. Because this needs to be done for each focal patient individually, the full complexity of this algorithm is O(V * N^2). However, it can be reduced to O(V * N * log(N)) by using sparse matrix multiplication methods, as most patients are not in the hospital most of the time. It can be further simplified by subsetting to only the risk set of overlap: if two patients are never in the same location, then the algorithm does not need to check this pair in every trial. We can therefore subset Eq. 3 to include only overlaps occurring in any of the specific locations (say, wards) that patient i ever occupies. This information can be stored a priori in a look-up table that the algorithm references prior to evaluation. By doing so, the run-time of the algorithm is reduced by approximately an order of magnitude, depending on the exact distribution of patients across wards.
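The look-up table mentioned above could take the form of a ward-to-patients index that is built once and then used to restrict which pairs are ever compared. The sketch below is one possible layout, not the authors' code; the record format (patient, ward) and the function names are assumptions for illustration.

```python
from collections import defaultdict

def build_ward_index(records):
    """Map each ward to the set of patients who ever occupied it."""
    index = defaultdict(set)
    for patient, ward in records:
        index[ward].add(patient)
    return index

def candidate_partners(records):
    """For each patient, the other patients who ever used one of the same wards."""
    ward_index = build_ward_index(records)
    wards_of = defaultdict(set)
    for patient, ward in records:
        wards_of[patient].add(ward)
    partners = {}
    for patient, wards in wards_of.items():
        same_ward = set()
        for ward in wards:
            same_ward |= ward_index[ward]
        partners[patient] = same_ward - {patient}
    return partners

records = [("a", "W1"), ("b", "W1"), ("c", "W2"), ("a", "W3"), ("d", "W3")]
partners = candidate_partners(records)
print(partners["a"], partners["c"])
```

Pairs that never share a ward, such as a and c here, are skipped in every trial, which is where the saving comes from.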
IV. Application
To demonstrate the efficacy of this algorithm, we apply it to a large data set of all patients in the entire health system of a county in the UK over the course of a year. This includes 250,004 patients across 15 hospitals and 157 wards who were in the system any time between 1 January, 2013 and 31 December, 2013. The dataset is quite complex and contains electronic health record information (demographics, diagnoses, clinical tests, etc.) as well as administrative records of patient admission, movement, and discharge. The number of potential edges in a theoretical network of all patient-patient co-presence in this system would exceed 62 billion, which is illustrative of the complexity of the problem. The demographics and summary statistics of these patients’ hospital visits are shown in Table I. The average age was 52 years, and 45% of patients were male. Patients had an average of 1.23 stays, where a hospital stay is defined as an entry to, and concomitant exit from, the health care system. Each stay is broken up into one or more spells, where one spell is an entry to and exit from a single ward. Patients had an average of 1.93 spells.
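The size of the pair space can be verified directly. The short calculation below assumes that the figure of 62 billion counts ordered (directed) pairs, which is consistent with the algorithm producing a directed adjacency matrix before symmetrization; the number of unordered pairs is half of that.

```python
from math import comb

n_patients = 250_004

ordered_pairs = n_patients * (n_patients - 1)  # directed i -> j pairs
unordered_pairs = comb(n_patients, 2)          # undirected {i, j} pairs

print(f"{ordered_pairs:,}")    # 62,501,750,012
print(f"{unordered_pairs:,}")  # 31,250,875,006
```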
TABLE I.
Demographics of the N=250,004 patients in the hospital from 1 January, 2013 to 31 December, 2013.
| Variable | N (%) or Mean (SD) |
|---|---|
| Male sex | 114,952 (45.98%) |
| Age | 52.63 (24.78) |
| Hospital stays | 1.15 (0.56) |
| Hospital spells | 1.85 (1.37) |
| Wards visited | 1.57 (0.90) |
| Time in hospital | 137.14 (643.85) |
A. Network construction
We applied the consistent co-presence algorithm to these data, using quantile cutoffs of 50%, 75%, 90%, and 99%. For spatial resolution, we are interested in co-presence at the ward level. Most wards in the hospitals in question comprise one or a few open rooms. In this way, much of the co-presence captured here has the potential to reflect meaningful interactions between patients. For each cutoff, we create a network. We do the same for two comparison methods that we hypothesize will be less meaningful than consistent co-presence: A) “any co-presence” and B) more co-presence than expected by chance assuming random mixing (Eq. 2). The descriptive network characteristics of all six of these networks can be seen in Table II.
TABLE II.
Network characteristics for constructing co-presence in six ways. All networks use the same N=250,004 patients.
| Network metric | Any co-presence | Random mixing | Consistent co-presence, Q=0.5 | Consistent co-presence, Q=0.75 | Consistent co-presence, Q=0.90 | Consistent co-presence, Q=0.99 |
|---|---|---|---|---|---|---|
| Count of isolates | 77 | 1,208 | 21,791 | 52,658 | 80,528 | 130,784 |
| Density | 0.0031 | 0.0030 | 0.0013 | 5e-04 | 2e-04 | 3e-06 |
| Transitivity | 0.43 | 0.44 | 0.41 | 0.38 | 0.33 | 0.14 |
| Mean degree | 86.93 | 84.09 | 36.19 | 15.06 | 4.58 | 0.1 |
The number of isolates (patients with no co-presence edges under a given model) ranges from 77 in the network creating an edge if two patients are ever co-present, to 130,784 in the consistent co-presence network using Q=0.99. The isolates in the first case are true isolates in the hospital: those 77 patients are never in a ward with another person. The isolates in the consistent co-presence network, however, are only isolates from a network-analytic perspective: they are co-present with others during their hospitalization, but not to a great enough extent that it is greater than expected by chance.
Among the other characteristics of the networks, density (the ratio of present ties to possible ties) decreases consistently as edges are increasingly constrained. Contrariwise, transitivity (a measure of the extent to which clustering is present in the form of triangles), decreases much more slowly as the cutoff increases; only once Q = 0.99 does it substantially drop from 0.33 to 0.14. Mean degree (the average number of connections per patient) begins low and decreases as the cutoff becomes more stringent.
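For readers who want to compute these descriptives on their own networks, the sketch below uses the textbook definitions for an undirected network without self-loops. Whether the software used for Table II follows exactly these definitions is not stated in the text, so the function and its conventions should be read as assumptions.

```python
from itertools import combinations

def describe_network(nodes, edges):
    """Isolates, density, transitivity and mean degree of an undirected network."""
    neighbors = {node: set() for node in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    n_edges = sum(len(nb) for nb in neighbors.values()) // 2
    # Connected triples centred on each node, and those closed into triangles.
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in neighbors.values())
    closed = sum(1 for nb in neighbors.values()
                 for u, v in combinations(nb, 2) if v in neighbors[u])
    return {
        "isolates": sum(1 for nb in neighbors.values() if not nb),
        "density": 2 * n_edges / (n * (n - 1)),
        "transitivity": closed / triples if triples else 0.0,
        "mean_degree": 2 * n_edges / n,
    }

# Toy network: a triangle a-b-c, a pendant node d attached to c, and an isolate e.
stats = describe_network("abcde", [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(stats)
```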
B. Outcome prediction
To simultaneously test the hypothesis that the consistent co-presence metric is the most meaningful with respect to health outcomes, and which quantile cutoff works best in this situation, we fit a series of logistic regressions using each patient’s degree in the network as a predictor and a patient’s 30-day readmission as an outcome. To control for potential confounding, we then fit a series of models additionally adjusting for the following variables: patient age, patient sex, total time spent in the hospital, and number of hospital stays. In total, 15,256 patients were readmitted to the hospital within 30 days following the study period. If our hypothesis holds, the models using the consistent co-presence metric will fit the data better as measured by BIC and the p-value of the co-presence term. The results from the regressions can be seen in Table III.
TABLE III.
Results of using network degree as a predictor of 30-day readmission in a logistic regression. Adjusted models in columns 4 and 5 are additionally controlled for: patient age, patient sex, total time spent in the hospital, and number of hospital stays.
| Network | Unadjusted p-value | Unadjusted BIC | Adjusted p-value | Adjusted BIC |
|---|---|---|---|---|
| Any overlap | 1.65e-12 | 9680.88 | 0.22 | 8926.81 |
| Random mixing | 4.30e-12 | 9682.41 | < 2e-16 | 8926.60 |
| Consistent Co-presence - Q=0.50 | < 2e-16 | 9726.55 | < 2e-16 | 8916.60 |
| Consistent Co-presence - Q=0.75 | 4.24e-06 | 9707.97 | < 2e-16 | 8927.88 |
| Consistent Co-presence - Q=0.90 | < 2e-16 | 9695.20 | < 2e-16 | 8927.96 |
| Consistent Co-presence - Q=0.99 | < 2e-16 | 9718.87 | < 2e-16 | 8927.21 |
In the unadjusted analyses, the degree of patients using the “any co-presence” rule performs the best, with a BIC of 9,681 and a p-value of 1.65e-12. Using a cutoff based on the random mixing assumption yields a similar result. Among the consistent co-presence networks, using Q = 0.90 performs the best, with a BIC of 9,695.
When the analyses are adjusted for control variables that are related to both co-presence and patient outcomes, the any co-presence indicator completely loses significance, with a p-value of 0.22, indicating that although it is a strong indicator in and of itself, it is highly confounded with a variety of other indicators of patient well-being. In its stead, the consistent co-presence network with Q = 0.50 is the best fitting, with a BIC of 8,917, compared to 8,927 for the “any co-presence” rule. In other words, after controlling for variables which may predict both the amount of co-presence and 30-day readmission, the association with readmission is greatest for the consistent co-presence network using Q = 0.50. This simultaneously indicates that consistent co-presence encodes more relevant patient-patient connections than the dichotomization rules of any overlap or assuming random mixing, that this relationship is less confounded with health outcomes, and that among the values of Q tested, Q = 0.50 was the most meaningful quantile cutoff.
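The model comparison rests on differences in BIC, where lower values indicate a better trade-off between fit and complexity. The sketch below states the standard definition and computes the gap between each adjusted model and the best one, using the adjusted BIC values reported in Table III; the function and variable names are illustrative.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Standard Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Adjusted BIC values as reported in Table III.
adjusted_bic = {
    "Any overlap": 8926.81,
    "Random mixing": 8926.60,
    "Consistent co-presence, Q=0.50": 8916.60,
    "Consistent co-presence, Q=0.75": 8927.88,
    "Consistent co-presence, Q=0.90": 8927.96,
    "Consistent co-presence, Q=0.99": 8927.21,
}

best = min(adjusted_bic, key=adjusted_bic.get)
delta = {name: round(value - adjusted_bic[best], 2)
         for name, value in adjusted_bic.items()}
print(best)
print(delta)
```

Every alternative lies roughly 10 to 11 BIC units above the Q = 0.50 network, while the alternatives differ from one another by little more than one unit.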
The primary limitation of this analysis is that we do not control for disease severity. It is likely that patients with more severe morbidities stay in the hospital longer, thus increasing their total co-presence with other patients, and that this is also positively associated with the likelihood of readmission. However, the strength of this confounding would depend on the different approaches used here to draw edges between patients. For example, under the “any co-presence” rule, as one is in the hospital longer, the total number of patients with whom one is co-present increases monotonically. By contrast, under the consistent co-presence model the stochasticity of visit times effectively controls for the total time in the hospital and the distribution of co-presence times. Therefore, as one’s time in the hospital increases, the threshold for consistent co-presence will also increase, reducing the rate at which additional co-presence ties are drawn as time increases. In this way, the confounding caused by morbidity severity is likely reduced for the consistent co-presence network relative to the “any co-presence” network, and if we were able to additionally adjust for morbidity severity, the BIC would be expected to decrease more for the consistent co-presence networks than for the “any co-presence” network.
A further limitation of this algorithm is that it does not have an underlying ground truth: there is no “gold standard” with which to compare our results. As a result, it is impossible to definitively show that what we call “consistent co-presence” is meaningful with respect to real patient-patient interactions. However, we address this with our internal validation models for predicting readmission rates. One approach which can often be used in these situations is simulation: generating synthetic data in which meaningful patient-patient interactions are known could determine how often our algorithm captures those in a blinded trial. However, the very complexity that makes this method useful also makes simulating representative data of patient hospital visit times impossible without a large number of simplifying assumptions that would substantially remove the synthetic data from accurately representing reality. In other words, if we could accurately simulate data of this kind, we would not need this method, as we could instead directly model the patient visits. This complexity is therefore both a feature that the algorithm can implicitly take advantage of and a limitation on the ability to check external validity; future work should evaluate how well the algorithm recapitulates known meaningful interactions under different data.
Finally, non-uniform sampling strategies have been shown to induce estimator biases in networks inferred from link-tracing algorithms when the sampling frame is unknown [23]. This is important because the consistent co-presence algorithm draws edges with unequal probabilities. Future work should evaluate how much the resulting network structure is biased by the choice of prior distribution.
V. Conclusions
In this paper we have described a solution to the problem of how to infer meaningful interaction in a complex system where the rules governing the system cannot all be explicitly modeled. Our solution involves a novel algorithm we term “consistent co-presence”. We show that patient readmission is best predicted by networks generated using our algorithm. The strength of consistent co-presence as an approach to this problem lies in allowing the observed data, and random perturbations therein, to implicitly model complex systems. Although we apply the approach to a health care system here, it is general enough to be applied to a wide variety of systems in which the quantity of interest is whether two entities overlap in meaningful ways.
Acknowledgments
This work was supported by the Intramural Research Program of the National Institutes of Health (ZIAHG200397 to LMK) and by a grant to the Oxford Martin School.
Contributor Information
Jeffrey Lienert, Department of General Medicine, Perelman School of Medicine, Philadelphia, USA.
Felix Reed-Tsochas, Saïd Business School, University of Oxford, Oxford, UK.
Laura Koehly, Social and Behavioral Research Branch, National Human Genome Research Institute, Bethesda, USA.
Christopher Steven Marcum, Social and Behavioral Research Branch, National Human Genome Research Institute, Bethesda, USA.
References
- [1] Salganik Matthew J. Bit-by-bit: Social research in the digital age. Princeton University Press, 2019.
- [2] Chardonnens Thibaud, Cudre-Mauroux Philippe, Grund Martin, and Perroud Benoit. Big data analytics on high velocity streams: A case study. In 2013 IEEE International Conference on Big Data, pages 784–787. IEEE, 2013.
- [3] Greene Jeremy A. and Lea Andrew S. Digital futures past the long arc of big data in medicine. New England Journal of Medicine, 381(5):480–485, 2019.
- [4] Lienert Jeffrey. The social and biological effects of patient-patient co-presence on health in hospitals using electronic medical records (PhD thesis). Oxford University Press, 2018.
- [5] Basch Ethan, Deal Allison M, Kris Mark G, Scher Howard I, Hudis Clifford A, Sabbatini Paul, Rogak Lauren, Bennett Antonia V, Dueck Amylou C, Atkinson Thomas M, et al. Symptom monitoring with patient-reported outcomes during routine cancer treatment: a randomized controlled trial. Journal of Clinical Oncology, 34(6):557, 2016.
- [6] Hillestad Richard, Bigelow James, Bower Anthony, Girosi Federico, Meili Robin, Scoville Richard, and Taylor Roger. Can electronic medical record systems transform health care? potential health benefits, savings, and costs. Health Affairs, 24(5):1103–1117, 2005.
- [7] Raghupathi Wullianallur and Raghupathi Viju. Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1):3, 2014.
- [8] Quan Hude, Li Bing, Saunders L Duncan, Parsons Gerry A, Nilsson Carolyn I, Alibhai Arif, Ghali William A, and IMECCHI investigators. Assessing validity of ICD-9-CM and ICD-10 administrative data in recording clinical conditions in a unique dually coded database. Health Services Research, 43(4):1424–1441, 2008.
- [9] De Jaegher Hanne, Di Paolo Ezequiel, and Gallagher Shaun. Can social interaction constitute social cognition? Trends in Cognitive Sciences, 14(10):441–447, 2010.
- [10] Snitkin Evan S, Zelazny Adrian M, Thomas Pamela J, Stock Frida, Henderson David K, Palmore Tara N, Segre Julia A, NISC Comparative Sequencing Program, et al. Tracking a hospital outbreak of carbapenem-resistant Klebsiella pneumoniae with whole-genome sequencing. Science Translational Medicine, 4(148):148ra116–148ra116, 2012.
- [11] Lienert Jeffrey, Marcum Christopher Steven, Finney John, Reed-Tsochas Felix, and Koehly Laura. Social influence on 5-year survival in a longitudinal chemotherapy ward co-presence network. Network Science, 5(3):308–327, 2017.
- [12] Sukhrie Faizel HA, Beersma Matthias FC, Wong Albert, van der Veer Bas, Vennema Harry, Bogerman Jolanda, and Koopmans Marion. Using molecular epidemiology to trace transmission of nosocomial norovirus infection. Journal of Clinical Microbiology, 49(2):602–606, 2011.
- [13] Qiu Shaofu, Li Peng, Liu Hongbo, Wang Yong, Liu Nan, Li Chengyi, Li Shenlong, Li Ming, Jiang Zhengjie, Sun Huandong, et al. Whole-genome sequencing for tracing the transmission link between two ARD outbreaks caused by a novel HAdV serotype 7 variant, China. Nature Scientific Reports, 5:13617, 2015.
- [14] Rothenberg Richard B, Woodhouse Donald E, Potterat John J, Muth Stephen Q, Darrow William W, and Klovdahl Alden S. Social networks in disease transmission: the Colorado Springs Study. NIDA Research Monograph, 151:3–19, 1995.
- [15] Chen Yee-Chun, Huang Li-Min, Chan Chang-Chuan, Su Chan-Ping, Chang Shan-Chwen, Chang Ying-Ying, Chen Mei-Ling, Hung Chien-Ching, Chen Wen-Jone, Lin Fang-Yue, et al. SARS in hospital emergency room. Emerging Infectious Diseases, 10(5):782, 2004.
- [16] O’Hara Keith J, Walker Daniel B, and Balch Tucker R. Physical path planning using a pervasive embedded network. IEEE Transactions on Robotics, 24(3):741–746, 2008.
- [17] Marcum Christopher Steven. Age differences in daily social activities. Research on Aging, 35(5):612–640, 2013.
- [18] Leydesdorff Loet. On the normalization and visualization of author co-citation data: Salton’s Cosine versus the Jaccard index. Journal of the American Society for Information Science and Technology, 59(1):77–85, 2008.
- [19] Hunter Ruth F, McAneney Helen, Davis Michael, Tully Mark A, Valente Thomas W, and Kee Frank. “Hidden” social networks in behavior change interventions. American Journal of Public Health, 105(3):513–516, 2015.
- [20] Royston Patrick, Altman Douglas G, and Sauerbrei Willi. Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine, 25(1):127–141, 2006.
- [21] Schaefer David R, Light John M, Fabes Richard A, Hanish Laura D, and Martin Carol Lynn. Fundamental principles of network formation among preschool children. Social Networks, 32(1):61–71, 2010.
- [22] Bonadonna Gianni and Valagussa Pinuccia. Dose-response effect of adjuvant chemotherapy in breast cancer. New England Journal of Medicine, 304(1):10–15, 1981.
- [23] Ott Miles Q and Gile Krista J. Unequal edge inclusion probabilities in link-tracing network sampling with implications for respondent-driven sampling. Electronic Journal of Statistics, 10(1):1109–1132, 2016.
