Summary
Electronic health records (EHRs) from type 2 diabetic (T2D) patients consist of longitudinally and sparsely measured health-markers at clinical encounters. Our goal is to use such data to learn latent patterns that can inform patients’ health status related to T2D while accounting for challenges in retrospectively collected EHRs. To handle challenges such as correlated longitudinal measurements, irregular and informative encounter times, and mixed marker types, we propose multivariate generalized linear models to learn latent patient subgroups. In our model, covariate effects were time-dependent and latent Gaussian processes were introduced to model between-marker correlations over time. Using inferred latent processes, we integrated the irregularly measured health markers of mixed types into composite scores and applied hierarchical clustering to learn latent subgroup structures among T2D patients. Application to an EHR dataset of T2D patients showed different trends of age, sex and race effects on hypertension/high blood pressure, total cholesterol, glycated hemoglobin, high-density lipoprotein, and medications. The associations among these markers varied over time during the study window. Clustering results revealed four subgroups, each with distinct health status. The same patterns were further confirmed using new EHR records of the same cohort. We developed a novel latent model to integrate longitudinal health markers in EHRs and characterize patient latent heterogeneities. Analysis indicated that there were distinct subgroups of T2D patients, suggesting that effective healthcare management for these patients should be performed separately for each subgroup.
Keywords: electronic health records, latent process, kernel smoothing, generalized linear models, type 2 diabetes
1 |. INTRODUCTION
In the modern era of precision medicine, one important source of patient’s health data is electronic health records (EHRs). EHR data consists of longitudinal medical records from thousands of patients in one or more electronic healthcare systems that digitally capture measurements of patients’ health status through normal medical practices1,2,3, including patient’s vital signs, laboratory measurements, disease diagnosis codes, procedure codes and medications. Benefits of EHRs include cost effectiveness, real time updates, and reflections on patients’ disease course and healthcare managements in realistic settings. Therefore, integrative analysis of this information over time provides a great opportunity to understand the heterogeneity of patient’s disease progression and susceptibility in real world settings, which is useful for monitoring disease prognosis and optimizing personalized healthcare management.
Due to the retrospective nature of EHRs, the analysis of EHRs is complicated by the following challenges: first, the health markers measured over time are multivariate and the measurements can be either continuous (e.g., lab measures), binary (e.g., disease diagnoses) or counts (e.g., number of medications); second, for each patient, the health marker data are collected at each clinical encounter so the measurement times can be irregular, sparse, and heterogeneous across patients; third, the measurement times are often informative to patients’ health status or health care processes.
This work is motivated by the analyses of EHRs of type 2 diabetes (T2D) patients obtained from the Ohio State University Wexner Medical Center Information Warehouse (OSU-WMCIW). The data collection spanned a time period of eight years (between 2011 and 2018) from a total of 58,490 patients. The data contained patients’ medical records of glycated hemoglobin, high-density lipoprotein, total cholesterol, hypertension, and all medications prescribed at each clinical encounter. Because these markers were of different types and were not measured at the same time across and within patients, directly combining the values from these markers is neither meaningful nor feasible. For example, Figure 1 gives a snapshot of the measurement time of several health markers from 20 randomly selected patients. Clearly, each marker was measured sparsely at irregular times for each patient, and the measurement time patterns vary significantly from patient to patient.
FIGURE 1.
Observation time patterns of five health markers for 20 randomly selected T2D patients in the EHRs data. Each mark represents an existing measurement at the corresponding time.
Joint models based on linear or generalized mixed effects models have been commonly used for analyzing multivariate longitudinal data4. In the joint models, various distribution families are used5,6,7, and subject-specific random effects are shared across all health markers to explain their dependence due to a finite number of latent variables. For example, Lambert and Vandenhende8 jointly analyzed three repeated measured longitudinal outcomes using copula models in a dose titration safety study; Gueorguieva and Sanacora9 proposed correlated probit models for joint analysis of repeated measurements with ordinal and continuous health markers. Some extensions allowed time-dependent effects10,11, but assumed constant between-marker dependence over time. However, assuming parametric patterns or attributing the dependence to a few time-invariant random effects is rather restrictive especially for modeling EHRs over a long period of time, since in EHRs, the trajectories of the health markers and their dependence may vary over time depending on the disease progression and medication usage for each patient. Moreover, it is computationally challenging to maximize a joint likelihood in the presence of a large number of patients and many health markers.
Machine learning approaches have also been proposed for EHR analysis, such as deep Poisson factor models12, tensor factorization and non-negative matrix factorization13, and deep exponential families14. These approaches, although more flexible than the aforementioned statistical models, are less interpretable and highly computationally intensive, requiring substantial work for data engineering and model tuning. More importantly, none of these approaches can account for irregular but informative measurement patterns as seen in EHRs.
Our work seeks to strike a balance between complex statistical modelling and flexible machine learning methods, while accounting for the unique challenges in EHRs. To conduct an integrative analysis of EHRs, we extend the multivariate generalized linear models (GLMs) by assuming appropriate distributions and link functions depending on the marker type. We allow the effects of covariates on health markers to be time-varying. Moreover, to account for the time-varying dependence among health markers, we introduce latent Gaussian processes into the models, where the covariance matrix is assumed to vary over time. For estimation, we adopt a kernel smoothing method to pool information across time points and patients and apply weights to account for the heterogeneous patterns of measurement times. The inferred latent processes represent patients’ underlying health status, so in order to integrate these mixed-type health markers, we use the inferred latent processes to calculate the distances between any two patients using the Mahalanobis distance15. Finally, we apply hierarchical clustering to identify patients’ health patterns and characterize between-group heterogeneities.
The remaining parts of this article are organized as follows. In section 2, we propose our models and describe main ideas. We then provide inferences on estimating model parameters and procedures to perform numerical computations. In section 3, we derive the asymptotic distributions of the estimators. We conduct simulation studies in section 4. In section 5, we apply our method to an integrative analysis on health markers for T2D patients using EHRs from the OSU-WMCIW.
2 |. METHODOLOGIES
2.1 |. Statistical Models for Integrative Analysis
Suppose EHR data are obtained from n patients. For the ith patient, let Xi be m-dimensional baseline covariates. Among p health markers, let Yik(t) denote the measurement of the kth health marker at time t. We suppose Yik(t) is measured at time points $t_{ik1} < t_{ik2} < \cdots < t_{ikn_{ik}}$, where nik is the total count of observations on the kth health marker for the ith patient. The total number of observations up to time t can be represented by a counting process $N_{ik}(t) = \sum_{j=1}^{n_{ik}} I(t_{ikj} \le t)$, where $I(\cdot)$ is the indicator function. Since the documentation times are patient’s clinical encounters in the EHR system, patterns of these documentation/measurement time points may carry information on patients’ health status. Thus, we model the intensity of Nik(t) as
$$E\{dN_{ik}(t) \mid X_i\} = \exp(\gamma_k^T X_i)\,\lambda_k(t)\,dt \qquad (1)$$
where λk(t) is a baseline intensity function, and γk is a vector of intensity parameters. By modeling the intensity of EHR measurement rates, one can adjust for the bias of informative measurement patterns and account for between-patient heterogeneity.
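We further assume Yik(t) follows a distribution in an exponential family model as follows:
$$f\{Y_{ik}(t); \theta_{ik}(t), \phi_k\} = \exp\left[\frac{Y_{ik}(t)\,\theta_{ik}(t) - b_k\{\theta_{ik}(t)\}}{a_k(\phi_k)} + c_k\{Y_{ik}(t), \phi_k\}\right] \qquad (2)$$
where $\theta_{ik}(t)$ is the canonical parameter, specific to each patient and each health marker, and $\phi_k$ is the dispersion parameter. $a_k(\cdot)$, $b_k(\cdot)$ and $c_k(\cdot)$ are known functions. Let $\theta_{ik}(t) = g_k\{\mu_{ik}(t)\}$, where $g_k(\cdot)$ is the canonical link function, and $\mu_{ik}(t)$ is the mean of $Y_{ik}(t)$. To capture the patient heterogeneity and dependence, we assume, at time t,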
$$g_k\{\mu_{ik}(t)\} = X_i^T \beta_k(t) + \epsilon_{ik}(t) \qquad (3)$$
where βk(t) is a vector of regression coefficients for covariates Xi, and $\epsilon_{ik}(t)$ is the kth element of the latent Gaussian process $\epsilon_i(t) = (\epsilon_{i1}(t), \ldots, \epsilon_{ip}(t))^T$. The process $\epsilon_i(t)$ is independent of Xi, and it follows a mean-zero multivariate Gaussian distribution with a covariance matrix Ω(t). Estimating the variance locally requires dense measurements from the same biomarker, which is not the case for EHRs. Moreover, in our empirical application the estimated variances do not vary much across time (section 6). Thus, to ensure numerical stability in subsequent analysis, we assume each latent process to have a constant variance and the constant is estimated using historical records. Hence, in Ω(t), only the correlations among health markers, i.e., the off-diagonal elements, need to be estimated.
Under the proposed models (2) and (3), each measurement Yik(t) can be uniquely represented by the latent process $\epsilon_{ik}(t)$. Since $\epsilon_{ik}(t)$ has the same scale for different k, one can integrate the latent processes as an alternative way to integrate the mixed-type health markers. The integration can use the Mahalanobis distance as follows,
| (4) |
Thus, there are several important advantages of using the proposed models to perform an integrative analysis of mixed-type health markers. First, although the health markers are irregularly measured and of mixed types, we can map them onto the same scale to align patients and characterize the between-patient heterogeneity. In addition, the dimension of the latent processes can be further reduced to a subspace of lower dimension than the number of health markers. Therefore, through the representation of latent processes, we achieve a dimension reduction.
2.2 |. Model Parameter Estimation
First, we use marker-specific Andersen-Gill intensity models16 to estimate γk in (1). With the estimator $\hat\gamma_k$, we normalize the counting process Nik(t) by letting . Thus, the normalized counting process is homogeneous across different patients and different health markers.
Next, to estimate βk(t) for any fixed time point t, we solve the following kernel-weighted local estimating equation
| (5) |
where Kh(z) = h−1K(z/h) with K(z) being a symmetric kernel function, and h1n is the bandwidth used in (5). Essentially, we assign weights to the observed measurements Yik(s) near t, and we pool them together across all patients to estimate the mean (first moment) of Yik(t). This pooling relies on kernel smoothing. Pooling information across nearby observations and across patients also overcomes the difficulty that some sparsely measured health markers do not have sufficient observations at some time points. Moreover, by using the normalized counting process instead of dNik(s), we remove the heterogeneity of informative measurement time points among patients, in a similar spirit to inverse probability weighting.
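For a Gaussian marker with the identity link, the latent process has mean zero, so the conditional mean of Yik(s) given Xi is linear in βk(t) and a kernel-weighted estimating equation of this type reduces to weighted least squares. The following is a minimal sketch under that assumption, not the authors' implementation. The function names, the toy data, and the optional weight `exp(-X @ gamma_hat)` (one possible form of the normalization step) are hypothetical.

```python
import numpy as np

def epanechnikov(z, h):
    """Scaled kernel K_h(z) = K(z/h)/h with K(u) = 0.75*(1 - u^2) on |u| <= 1."""
    u = np.asarray(z, dtype=float) / h
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0) / h

def local_beta_identity(t, times, X, y, h, gamma_hat=None):
    """Solve sum_j w_j X_j (y_j - X_j^T beta) = 0 at time t (identity link).

    times, y : one entry per observed record, pooled over all patients
    X        : covariate row of the patient who owns each record
    gamma_hat: optional intensity coefficients; records are re-weighted by
               exp(-X @ gamma_hat), an ASSUMED form of the normalization.
    """
    w = epanechnikov(times - t, h)
    if gamma_hat is not None:
        w = w * np.exp(-X @ gamma_hat)
    XtWX = X.T @ (w[:, None] * X)
    XtWy = X.T @ (w * y)
    return np.linalg.solve(XtWX, XtWy)

# Toy data: time-varying intercept sin(s/4) and a constant slope of 0.5.
rng = np.random.default_rng(0)
n_rec = 20000
times = rng.uniform(0, 12, n_rec)
X = np.column_stack([np.ones(n_rec), rng.uniform(-1, 1, n_rec)])
beta_true = lambda s: np.column_stack([np.sin(s / 4), 0.5 + 0 * s])
y = np.sum(X * beta_true(times), axis=1) + rng.normal(0, 1, n_rec)

beta_hat = local_beta_identity(6.0, times, X, y, h=0.5)
```

Only records within one bandwidth of t contribute, which is how sparse markers borrow strength from neighboring time points and from other patients. For non-identity links the equation is nonlinear in βk(t) and would need an iterative solver.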
Similarly, to estimate the correlation between two latent processes, σkl(t), we propose to solve the following kernel-weighted local estimating equation, for k ≠ l,
| (6) |
where is a bivariate kernel function with bandwidth h2n.
2.3 |. Numerical Computation
When the link functions in (3) take some simple forms, in (5) and in (6) can be explicitly computed. Specifically, for ,
and
When gk(z) takes a general form, we can compute the above expectations using the Gauss-Hermite quadrature method17.
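As an illustration of the quadrature step, the sketch below approximates the expectation of an inverse link evaluated at a linear predictor plus a mean-zero Gaussian term. The helper name `gh_expectation` and the values of `eta` and `c` are illustrative assumptions. For the log link the expectation has the closed form exp(eta + c/2), which gives a convenient check.

```python
import numpy as np

def gh_expectation(f, mean, var, n_nodes=30):
    """Approximate E[f(Z)] for Z ~ N(mean, var) by Gauss-Hermite quadrature."""
    # hermgauss returns nodes/weights for the weight function exp(-x^2).
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    z = mean + np.sqrt(2.0 * var) * nodes   # change of variables
    return np.sum(weights * f(z)) / np.sqrt(np.pi)

def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-z))

eta, c = 0.3, 1.0   # illustrative linear predictor and latent variance

mean_log = gh_expectation(np.exp, eta, c)     # count marker, log link
mean_logit = gh_expectation(expit, eta, c)    # binary marker, logit link
closed_form_log = np.exp(eta + c / 2.0)
```

The logit case has no closed form, which is where the quadrature is actually needed. The result is pulled toward 0.5 relative to `expit(eta)` because averaging over the latent term attenuates the marginal mean.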
Since Un,k(βk(t)) is only related to the parameter βk(t), we can solve (5) and obtain $\hat\beta_k(t)$ for each health marker k, separately. Similarly, plugging $\hat\beta_k(t)$ and $\hat\beta_l(t)$ into (6), we can solve the equation and obtain $\hat\sigma_{kl}(t)$ for each pair of health markers, separately. Therefore, even with many health markers, i.e., when p is moderate or large, our algorithm can efficiently handle the computational burden by solving the estimating equations separately. Finally, we apply the above procedures on a time grid t1, t2,…,tN to obtain the parameter estimators over the whole range of the follow-up.
A distance matrix D can be obtained by computing the Mahalanobis distance in (4) between each pair of patients. In particular, with the estimated latent processes, the distance is approximated by
| (7) |
and
| (8) |
where is the covariance matrix of . In particular,
The subsequent steps can be calculated using the Gauss-Hermite quadrature method as well, and the details are given in the supplementary material.
2.4 |. Data-adaptive Selection of Bandwidths
Our asymptotic results in the supplementary material suggest the bandwidths h1n and h2n can be chosen, respectively, on the order of n−1/3 and n−1/4. However, for practical applications, we consider a data-adaptive method for selecting the bandwidths18. The key idea is to use the observed data to obtain the empirical bias and variability of the estimators as functions of the bandwidths. We then search for the bandwidths that minimize the empirical mean squared error of the resulting estimators.
Specifically, to choose the optimal bandwidth h1n for estimating βk(t), we first consider a reasonable range of bandwidths. For a fixed bandwidth h and a fixed time point t, we denote by $\hat\beta_{kh}(t)$ the estimator for βk(t). To estimate the bias of $\hat\beta_{kh}(t)$, we fit a least squares regression by regressing $\hat\beta_{kh}(t)$ on $h^2$. We denote the regression coefficient of $h^2$ as $\hat C_k(t)$. Since the bias of $\hat\beta_{kh}(t)$ is on the order of $h^2$, as shown in the asymptotic result, $\hat C_k(t) h^2$ is an estimator for the bias of $\hat\beta_{kh}(t)$. Next we investigate the variability of $\hat\beta_{kh}(t)$. We randomly split the data into two equal parts. Using one half of the data, we obtain $\hat\beta_{kh}^{(1)}(t)$ as the estimator for βk(t). Similarly, using the other half, we obtain $\hat\beta_{kh}^{(2)}(t)$. Thus, $\{\hat\beta_{kh}^{(1)}(t) - \hat\beta_{kh}^{(2)}(t)\}^2/4$ can be used as an unbiased estimator of the variance of $\hat\beta_{kh}(t)$. Finally, given all the time points, we select the optimal bandwidth as the minimizer of the empirical mean squared error, where
| (9) |
We denote the optimal h1n as H1 and denote the corresponding estimators for βk(t) as . Next, given h1n = H1 and , we select the optimal h2n, the bandwidth for estimating σkl(t)’s, by minimizing the empirical mean squared error of the corresponding estimators, which is numerically calculated in the similar way to above.
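The bias and variance steps described above can be sketched as follows for a scalar estimator at a single time point. The procedure in the text aggregates over all time points, so this is a simplified illustration under stated assumptions. The function `select_bandwidth`, the toy kernel-mean estimator and the candidate grid are hypothetical.

```python
import numpy as np

def select_bandwidth(estimate, n, bandwidths, rng):
    """Pick h minimizing (estimated bias)^2 + (estimated variance).

    estimate(idx, h) returns a scalar estimate from subjects `idx` with bandwidth h.
    """
    bandwidths = np.asarray(bandwidths, dtype=float)
    all_idx = np.arange(n)
    full = np.array([estimate(all_idx, h) for h in bandwidths])

    # Bias: regress the full-sample estimates on h^2; bias(h) is taken as slope * h^2.
    design = np.column_stack([np.ones_like(bandwidths), bandwidths**2])
    slope = np.linalg.lstsq(design, full, rcond=None)[0][1]
    bias = slope * bandwidths**2

    # Variance: with two independent halves, (b1 - b2)^2 / 4 is unbiased for
    # the variance of the full-sample estimator.
    perm = rng.permutation(n)
    half1, half2 = perm[: n // 2], perm[n // 2:]
    var = np.array([(estimate(half1, h) - estimate(half2, h)) ** 2 / 4.0
                    for h in bandwidths])

    mse = bias**2 + var
    return bandwidths[np.argmin(mse)], mse

# Toy estimator: kernel-weighted mean of y at t0 = 0.5.
rng = np.random.default_rng(1)
n = 4000
t_obs = rng.uniform(0, 1, n)
y_obs = np.cos(3 * t_obs) + rng.normal(0, 0.5, n)

def kernel_mean(idx, h, t0=0.5):
    u = (t_obs[idx] - t0) / h
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    return np.sum(w * y_obs[idx]) / np.sum(w)

grid = [0.05, 0.1, 0.2, 0.3, 0.4]
h_opt, mse = select_bandwidth(kernel_mean, n, grid, rng)
```

A single random split makes the variance estimate noisy. Averaging over several splits or summing the criterion across time points, as in the text, stabilizes the choice.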
3 |. THEORETICAL RESULTS
We first state the following required conditions.
Condition 1. True parameters , and are continuously twice-differentiable for any , where k, l = 1, 2,…, p and k ≠ l. In addition, is strictly positive. Furthermore, second moments of and temporal covariances are continuously twice-differentiable.
Condition 2. The vector of baseline covariate X is bounded. If there exists a vector b such that XTb = 0, then b = 0.
Condition 3. . Furthermore, .
Condition 4. The kernel function K(z) is a symmetric density function satisfying ∫ z2K(z)dz < ∞. Similarly, is a symmetric bivariate density function with bounded fourth moments.
Condition 1 is used to give the asymptotic distribution for the parameter estimators in (1), and it assumes some smoothness properties of the time-varying coefficients and covariance matrices. From condition 3, the choice of h1n and h2n can be n−1/3 and n−1/4, respectively. A potential choice of the kernel satisfying condition 4 can be the Gaussian kernel or the Epanechnikov kernel. Theorem 1 states the asymptotic distribution of the estimators $\hat\beta_k(t)$. Theorem 2 establishes the asymptotic distribution of the estimators $\hat\sigma_{kl}(t)$ for k ≠ l.
Theorem 1 (Asymptotic distribution of $\hat\beta_k(t)$). Under conditions 1 to 4, for any fixed t,
| (10) |
where
and the asymptotic variance
where is a function of . Its definition and the proof of theorem 1 are given in the supplementary material.
Theorem 2 (Asymptotic distribution of $\hat\sigma_{kl}(t)$). Under conditions 1 to 4, for any fixed t,
| (11) |
where
is assumed to be non-singular, and the asymptotic variance
where is a function of and . Its definition and the proof of theorem 2 are given in the supplementary material.
Since the asymptotic variances in theorem 1 and theorem 2 do not have simple expressions, we use the bootstrap method to estimate the asymptotic variances in practice.
4 |. SIMULATION STUDIES
In the simulation studies, we simulated data of 6 health markers for 5,000 subjects. For the ith subject, we generated two covariates Xi1 ~ Uniform(−1, 1) and Xi2 ~ Bernoulli(0.5) − 0.5. Thus, Xi = (1, Xi1, Xi2)T was a 3-dimensional vector of baseline variables. The maximum observation time Ti for each subject was set to 12. The measured time points for simulated markers were generated from a Poisson process whose intensity function was . For the variances of latent processes, we assumed ck = 1, k = 1,2,…,6. Supposing there were Ni unique measured time points for all latent processes of subject i, we sampled from a mean-zero multivariate Gaussian distribution with a covariance matrix , where ,
and
where . Thus, at each measured time point, Ω(t) is constant and equals to Σ1, but there exist underlying dependences in the time intervals between these time points.
The values of simulated markers were generated according to (2) and (3). To assess the ability of our models in section 2.1 to handle mixed-type markers, we assumed Yi1(t) and Yi4(t) were Gaussian distributed. Yi2(t) was Poisson distributed. Yi3(t), Yi5(t), and Yi6(t) were Bernoulli distributed. Thus, , and . Furthermore, since the distributions of Yi1(t) and Yi4(t) have dispersion parameters, we set . The true values of βk(t) were assumed to be
The scaled Epanechnikov kernel was chosen as the kernel function in (5), i.e.,
| (12) |
Furthermore, the kernel function in (6) was set to the product of two scaled univariate Epanechnikov kernels, i.e.,
| (13) |
Since the data-adaptive method for selecting bandwidths was computationally intensive, we first conducted a preliminary study on the simulated data. We used the method in section 2.4 and selected the optimal bandwidths among h = cn−1/z, where n = 5000, c = {5,10,20,30}, and z = 1,2,…,10. Hence, the potential bandwidths ranged from 0.001 to 12.800. We found h1n = 5n−1/3 = 0.292 and h2n = 10n−1/3 = 0.585 were close to the optimal. This set of h1n and h2n was used in all subsequent simulations.
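The candidate grid and the selected values quoted above can be reproduced with a few lines of arithmetic. This is only a numerical check of the figures in the text. The variable names are illustrative.

```python
import numpy as np

n = 5000
c_values = np.array([5, 10, 20, 30], dtype=float)
z_values = np.arange(1, 11)

# grid[a, b] = c_values[a] * n ** (-1 / z_values[b]), 40 candidates in total
grid = c_values[:, None] * n ** (-1.0 / z_values[None, :])

h1n = 5 * n ** (-1 / 3)
h2n = 10 * n ** (-1 / 3)

print(round(grid.min(), 3), round(grid.max(), 3), round(h1n, 3), round(h2n, 3))
```

The smallest candidate is 5/n and the largest is 30n^(−1/10), which matches the stated range of 0.001 to 12.800.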
For time points t = 0,1,…,12, we solved (5) and (6), and we obtained $\hat\beta_k(t)$ and $\hat\sigma_{kl}(t)$. We evaluated the accuracy of the asymptotic approximations by calculating the average bias and the sample standard deviation of $\hat\beta_k(t)$ and $\hat\sigma_{kl}(t)$, respectively. In addition, using the bootstrap method, we calculated the bootstrap estimators for standard errors of $\hat\beta_k(t)$ and $\hat\sigma_{kl}(t)$. Specifically, for each dataset, we resampled 5000 observations with replacement from X to produce a bootstrap dataset X*1. We used X*1 to produce a new bootstrap estimator for βk(t), which we called $\hat\beta_k^{*1}(t)$. This procedure was repeated B times in order to produce B different bootstrap datasets, X*1, X*2,..., X*B, and B corresponding βk(t) estimators, $\hat\beta_k^{*1}(t), \ldots, \hat\beta_k^{*B}(t)$. Next we computed the sample variance of these bootstrap estimators and treated it as the estimated variance. Similar procedures were also applicable to $\hat\sigma_{kl}(t)$. Afterwards, 95% confidence intervals of each parameter were constructed. Finally, we counted how many times true parameters βk(t) and σkl(t) fell in their confidence intervals to obtain coverage probabilities.
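The resampling scheme can be sketched generically. In the snippet below the estimator is a plain sample mean standing in for the kernel estimators, and the normal-approximation interval is an assumption, since the text does not state how the intervals were formed. The function `bootstrap_se` and the toy data are hypothetical.

```python
import numpy as np

def bootstrap_se(estimator, n_subjects, B, rng):
    """Nonparametric bootstrap: resample subjects with replacement B times."""
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n_subjects, size=n_subjects)  # bootstrap sample indices
        reps[b] = estimator(idx)
    # Sample standard deviation of the B bootstrap estimates.
    return reps.std(ddof=1), reps

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 5000)          # stand-in data, one value per subject

se, reps = bootstrap_se(lambda idx: x[idx].mean(), len(x), B=100, rng=rng)
theta_hat = x.mean()

# Normal-approximation 95% interval (an assumed construction).
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
```

Resampling is done at the subject level so that all records of one patient stay together, which preserves the within-patient dependence.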
Table 1 and Table 2 summarize the main results over 100 simulations at t = 1. From Table 1 and Table 2, we can conclude that, at t = 1, our method yields estimators which are close to the true parameters. All the estimators $\hat\beta_k(t)$ deviate from the true parameters by less than 0.03 (Table 1). On the other hand, the absolute biases of $\hat\sigma_{kl}(t)$ are a little greater, but most of them are still less than 0.1 (Table 2). In addition, the bootstrap-based standard errors are reasonable estimators for the standard deviations of the estimators. Almost all the differences between SD and SE are smaller than 0.03, except for $\hat\sigma_{34}(t)$. Also, excluding $\hat\sigma_{14}(t)$, all the coverage probabilities are greater than or equal to 0.9, and the majority of them are around 0.95.
TABLE 1.
Summary statistics for the estimators of βk(t) at t = 1 based on 100 simulations.
| Marker | Parameter | True value | Bias | SD | SE | CP |
|---|---|---|---|---|---|---|
| Y1 (continuous) | β10 | −0.565 | 0.002 | 0.035 | 0.039 | 0.98 |
| | β11 | 0.933 | 0.001 | 0.059 | 0.067 | 0.98 |
| | β12 | 0.000 | −0.002 | 0.085 | 0.078 | 0.94 |
| Y2 (count) | β20 | −0.819 | 0.007 | 0.050 | 0.058 | 0.98 |
| | β21 | −1.030 | 0.026 | 0.112 | 0.112 | 0.95 |
| | β22 | 0.900 | −0.010 | 0.117 | 0.132 | 0.97 |
| Y3 (binary) | β30 | 0.450 | −0.006 | 0.074 | 0.077 | 0.94 |
| | β31 | −1.000 | 0.013 | 0.112 | 0.136 | 0.99 |
| | β32 | 0.900 | 0.011 | 0.157 | 0.151 | 0.93 |
| Y4 (continuous) | β40 | −1.260 | −0.006 | 0.038 | 0.039 | 0.93 |
| | β41 | 0.982 | −0.010 | 0.063 | 0.068 | 0.97 |
| | β42 | 0.765 | −0.005 | 0.078 | 0.077 | 0.93 |
| Y5 (binary) | β50 | 0.732 | 0.001 | 0.077 | 0.074 | 0.95 |
| | β51 | 0.470 | 0.001 | 0.149 | 0.134 | 0.92 |
| | β52 | 0.315 | −0.014 | 0.163 | 0.150 | 0.92 |
| Y6 (binary) | β60 | 0.331 | −0.018 | 0.085 | 0.077 | 0.90 |
| | β61 | 0.100 | 0.004 | 0.144 | 0.136 | 0.95 |
| | β62 | 1.924 | −0.021 | 0.163 | 0.156 | 0.94 |
Note. “Bias” is the bias of the average estimates; “SD” is the sample standard deviation of the estimates; “SE” is the average of the estimated standard errors based on 100 bootstrap samples; “CP” is the coverage probability of the 95% confidence intervals.
TABLE 2.
Summary statistics for σkl(t) at t = 1 based on 100 simulations.
| Parameter | True value | Bias | SD | SE | CP |
|---|---|---|---|---|---|
| σ12 | 0.342 | −0.061 | 0.149 | 0.136 | 0.91 |
| σ13 | 0.484 | −0.058 | 0.202 | 0.213 | 0.98 |
| σ14 | 0.578 | −0.086 | 0.127 | 0.121 | 0.87 |
| σ15 | 0.034 | 0.030 | 0.216 | 0.218 | 0.95 |
| σ16 | 0.047 | −0.009 | 0.210 | 0.207 | 0.96 |
| σ23 | 0.799 | −0.150 | 0.388 | 0.382 | 0.90 |
| σ24 | −0.493 | 0.065 | 0.232 | 0.233 | 0.95 |
| σ25 | −0.779 | 0.078 | 0.241 | 0.242 | 0.94 |
| σ26 | 0.796 | −0.143 | 0.371 | 0.366 | 0.91 |
| σ34 | −0.163 | 0.048 | 0.216 | 0.252 | 0.97 |
| σ35 | −0.363 | −0.024 | 0.252 | 0.261 | 0.97 |
| σ36 | 0.530 | −0.024 | 0.257 | 0.249 | 0.95 |
| σ45 | 0.802 | −0.076 | 0.212 | 0.219 | 0.94 |
| σ46 | −0.686 | 0.089 | 0.228 | 0.244 | 0.94 |
| σ56 | −0.846 | −0.019 | 0.160 | 0.181 | 0.97 |
Note. See Table 1.
After examining the estimators at a fixed time point, we also investigated the estimation performance as time changes. For instance, Figure 2 presents true parameters versus estimators across the 13 time points for β52(t) and σ34(t), respectively. From Figure 2, we can conclude that $\hat\beta_{52}(t)$ is very close to the true parameter at each time point, and it captures the underlying smooth function of β52(t) well across time. Although the bias of $\hat\sigma_{34}(t)$ is greater than that of $\hat\beta_{52}(t)$, all values of σ34(t) are within the interquartile range of $\hat\sigma_{34}(t)$. Thus, the estimators perform consistently and the deviations are reasonable.
FIGURE 2.
Top panel: true β52(t) versus $\hat\beta_{52}(t)$ across 13 time points based on 100 simulations. Bottom panel: true σ34(t) versus $\hat\sigma_{34}(t)$ across 13 time points based on 100 simulations. Red triangles: true values of the parameter. Blue triangles: average estimators of the parameter. Red curve: the true function of the parameter.
5 |. REAL DATA APPLICATION
5.1 |. Data preprocessing
We applied the proposed method to analyze EHRs of T2D patients from the OSU-WMCIW. In our application, we included three baseline variables Xi: baseline age, race (1: white; 0: non-white), and sex (1: male; 0: female). In addition, there were five health markers Yik(t) related to T2D: hypertension/high blood pressure (HBP), total cholesterol (TC), glycated hemoglobin (HbA1c), high-density lipoprotein (HDL), and medications prescribed at each clinical encounter. Here, we dichotomized HBP as HBP=1 if a patient’s systolic blood pressure was higher than 140 mmHg and 0 otherwise. The medications served as one strong indicator of a patient’s comorbidity and they could be T2D related or not. Thus, the health markers in the analysis consisted of three continuous markers (TC, HbA1c, HDL), one binary marker (HBP) and one count marker (number of medications).
For analysis, we split the data into three parts for different purposes. The first part consisted of the records collected between 2011 and 2012 and was used to estimate the variances of individual latent processes by fitting univariate generalized linear mixed models. The second part included the records from 24,975 patients between 2013 and 2017 who had at least one marker measurement. This part of the data was used for training our models and learning latent groups among the patients. The third part was the data collected in 2018 and was used for validation. The flow-chart for this work is illustrated in Figure 3.
FIGURE 3.
Flow-charts of the proposed method.
In our model fitting using the second part of the data, after checking normal ranges for the health markers 19,20,21, we removed extreme records such as TC ≤ 0 or ≥ 500 mg/dL, HbA1c ≤ 3 or ≥ 20%, and HDL ≤ 0 or ≥ 120 mg/dL. This led to a deletion of 1% of the data and a total number of 24,655 patients for analysis. Among these patients, 52.08% were female, 63.42% were white, and their ages in years ranged from 18.30 to 97.67 with a mean of 56.06. All of them had at least one observation for at least one health marker in the five years, but not necessarily for other health markers. Specifically, the average numbers of records for HBP, TC, HbA1c, HDL, and the number of medications per patient during these five years were 17.50, 4.01, 5.95, 3.64, and 53.21, respectively. In order to minimize the influence of different scales on the numeric stability, we normalized all continuous variables before identifying patient subgroups. Each of them has zero mean and unit variance.
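The filtering, dichotomization and normalization rules described in this subsection can be sketched on a handful of made-up records. The arrays and helper names below are hypothetical, and marking an extreme record as missing is an assumed way of removing it. The thresholds are the ones quoted in the text.

```python
import numpy as np

# Hypothetical records; NaN marks a marker not measured at that encounter.
tc = np.array([180.0, 520.0, np.nan, 210.0])     # total cholesterol, mg/dL
hba1c = np.array([7.1, 6.5, 2.5, np.nan])        # glycated hemoglobin, %
hdl = np.array([45.0, np.nan, 50.0, 130.0])      # HDL, mg/dL
sbp = np.array([150.0, 120.0, np.nan, 141.0])    # systolic blood pressure, mmHg

def drop_extreme(values, low, high):
    """Mark records with value <= low or >= high as missing."""
    out = values.copy()
    out[(out <= low) | (out >= high)] = np.nan
    return out

def standardize(values):
    """Scale the observed values to zero mean and unit variance."""
    return (values - np.nanmean(values)) / np.nanstd(values)

tc_clean = drop_extreme(tc, 0, 500)
hba1c_clean = drop_extreme(hba1c, 3, 20)
hdl_clean = drop_extreme(hdl, 0, 120)

# HBP = 1 if systolic blood pressure exceeds 140 mmHg, 0 otherwise.
hbp = np.where(np.isnan(sbp), np.nan, (sbp > 140).astype(float))

tc_z = standardize(tc_clean)
```

Only the continuous markers are standardized. The binary and count markers keep their original scale because their link functions already determine the scale of the linear predictor.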
5.2 |. Results
Table 3 shows the effect of each demographic variable on the pattern of the measurement times for each marker. From Table 3, we conclude that older patients tended to have more observations for all health markers, and females appeared to have more observations for HBP, HbA1c, and the number of medications, while males tended to have more TC measurements. Finally, whites had significantly fewer observations for HBP, HbA1c, and the number of medications than non-whites.
TABLE 3.
Demographic Effects on Frequency of Health Marker Measurements
| Marker | Demographic | Est | HR | SE | Z | P-value |
|---|---|---|---|---|---|---|
| HBP | age | 0.065 | 1.067 | 0.006 | 10.552 | < 0.001 |
| | sex | 0.064 | 1.066 | 0.013 | 4.813 | < 0.001 |
| | race | −0.129 | 0.879 | 0.014 | −9.425 | < 0.001 |
| TC | age | 0.035 | 1.035 | 0.006 | 6.238 | < 0.001 |
| | sex | −0.035 | 0.965 | 0.012 | −3.000 | 0.003 |
| | race | −0.012 | 0.988 | 0.013 | −0.968 | 0.333 |
| HbA1c | age | 0.008 | 1.008 | 0.005 | 1.721 | 0.085 |
| | sex | 0.034 | 1.034 | 0.009 | 3.678 | < 0.001 |
| | race | −0.044 | 0.957 | 0.009 | −4.650 | < 0.001 |
| HDL | age | 0.047 | 1.048 | 0.005 | 10.090 | < 0.001 |
| | sex | −0.010 | 0.990 | 0.010 | −1.007 | 0.314 |
| | race | −0.007 | 0.993 | 0.010 | −0.715 | 0.475 |
| Medications | age | 0.042 | 1.043 | 0.006 | 7.262 | < 0.001 |
| | sex | 0.086 | 1.090 | 0.012 | 7.069 | < 0.001 |
| | race | −0.113 | 0.893 | 0.013 | −8.988 | < 0.001 |
Note: “Est” is the regression coefficient estimator; “HR” is the hazard ratio; “SE” is the standard error of the coefficient estimator; “Z” is the statistic for a z-test; “P-value” is the p-value for the z-test.
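The columns of Table 3 are linked by standard Wald-type formulas: the HR is the exponentiated coefficient, and the z statistic is the coefficient divided by its standard error. The sketch below recomputes them from the rounded table entries. The helper `wald_summary` is hypothetical, and the two-sided normal p-value is an assumed convention.

```python
import math

def wald_summary(est, se):
    """Return (hazard ratio, z statistic, two-sided normal p-value)."""
    z = est / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return math.exp(est), z, p

# HBP rows of Table 3, using the rounded Est and SE values.
hr_age, z_age, p_age = wald_summary(0.065, 0.006)
hr_race, z_race, p_race = wald_summary(-0.129, 0.014)
```

Because Est and SE are rounded to three decimals in the table, z statistics recomputed this way differ slightly from the tabulated ones, while the hazard ratios agree to three decimals.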
To estimate the parameters in the joint models, we first implemented the adaptive method of bandwidth selection described in section 2.4, and the results are shown in supplementary Figure 1. We chose h1n = 564.112 days and h2n = 494.687 days as the optimal bandwidths. Using the optimal bandwidths, we estimated βk(t) and σkl(t) at 61 time points. The results are presented in Figure 4 and Figure 5, respectively. The salmon-colored ribbons in these two figures are 95% confidence intervals for the parameters based on 100 bootstrap datasets.
FIGURE 4.
Estimated regression coefficients across 61 time points using h1n = 564.112 days and h2n = 494.687 days. Salmon-colored ribbons: 95% confidence intervals for the estimators.
FIGURE 5.
Estimated correlations across 61 time points using h1n = 564.112 days and h2n = 494.687 days. Salmon-colored ribbons: 95% confidence intervals for the estimators.
Figure 4 presents the relationships between each pair of health markers and covariates. In general, all health markers exhibit changes over time. Mean HbA1c decreases during the first 1.5 years and has an increasing trend afterwards, which may suggest the difficulty of achieving long-term control of glycemic levels in a chronically ill patient population. Mean HDL shows a similar quadratic pattern over time, suggesting difficulty of long-term cholesterol control. The estimated regression coefficients for covariates, i.e., the estimated effects of covariates on health markers, do not show any pattern of drastic changes over time. Instead, the estimated values across time fluctuate around mean values. However, we can observe decreasing trends in the estimated intercepts for cholesterol and the number of medications, suggesting that as time increases, the expected means of cholesterol and the number of medications decrease. The estimated age effects on HBP and HDL are positive across time, while those on cholesterol and HbA1c are negative. The estimated age effect on the number of medications is negative but close to 0. Hence, the estimators suggest that older subjects on average have higher HBP and HDL, but they have lower cholesterol and HbA1c. There is no apparent difference in the average number of medications between older subjects and younger subjects. Similarly, the estimators of the sex effect suggest that compared to men, women tend to have higher expected means of cholesterol and HDL, but they have lower values of HBP and the number of medications. Although women have slightly lower expected means of HbA1c than men, the difference is not apparent. For race, the estimators indicate that white people have lower or equal expected means than non-white people in almost all five health markers.
Figure 5 presents the correlations between each pair of health markers. The results suggest the concurrent correlations between HBP and cholesterol, HBP and medications, cholesterol and HbA1c, and cholesterol and HDL are positive and moderate. Moreover, there exist negative and observable concurrent correlations between HbA1c and HDL, and between HDL and medications. The correlation between HbA1c and HDL decreases as time increases. In contrast, the positive correlation between cholesterol and HDL decreases at the beginning, but increases after about one year. The positive correlation between cholesterol and HbA1c has a similar pattern, as it decreases at first and increases after 1000 days. The correlations of HBP and cholesterol, and of HbA1c and number of medications, increase in the first 500 days, but they start to decrease during 500 to 1000 days, and bounce back afterwards. The correlations of HBP and HbA1c, HBP and HDL, and HDL and number of medications decrease in the first 500 days, and then increase, but decrease again after 1000 days.
One interesting observation from Figure 5 is that the estimated correlation between the number of medications and HBP is as high as 0.6, but its correlations with cholesterol and HDL are both negative, fluctuating around −0.30. However, there does not appear to be a strong association between the number of medications and HbA1c over time. This may suggest that the patients in this cohort were most likely to take medications that aimed to control the levels of cholesterol and HDL, but not necessarily the level of HbA1c. The latter is consistent with the fact that over 90% of the drugs recorded in this database are non-diabetic drugs. One possible interpretation of the observed time-dependent correlation pattern is that there might exist another unobserved disease biomarker that influences the two observed markers over time. Thus, the estimated correlation pattern could be useful for identifying such a “common cause” biomarker so as to better understand the mechanism of disease progression.
Finally, we computed the similarity between each pair of patients using the distance defined in (7). To compute as (8), we substituted with the nearest neighbor observation of time t for patient i. Using the between-patient similarity matrix, we performed a cluster analysis on the 24,655 patients, and the results are given in Figure 6. We observe 4 clusters within which patients had similar health marker profiles.
FIGURE 6.
Dendrogram of Mahalanobis distances for 24,655 patients. Group index numbers are assigned according to group sizes.
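The clustering step described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the input `scores` (one row of composite scores per patient), the function name `cluster_patients`, the use of the pooled sample covariance in the Mahalanobis distance, and the average-linkage rule are all hypothetical choices, since the distance in (7) and the linkage method are not restated in this section. Group indices are assigned by decreasing group size, as in Figure 6.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_patients(scores, n_groups=4):
    """Hierarchical clustering of patients on Mahalanobis distances.

    scores: (n_patients, n_features) array of hypothetical composite scores.
    Returns (labels, tree), where labels run from 1 (largest group) upward.
    """
    scores = np.asarray(scores, dtype=float)
    # Assumption: the pooled sample covariance defines the Mahalanobis metric.
    inv_cov = np.linalg.pinv(np.cov(scores, rowvar=False))
    dists = pdist(scores, metric="mahalanobis", VI=inv_cov)
    # Assumption: average linkage; the paper's linkage rule may differ.
    tree = linkage(dists, method="average")
    raw = fcluster(tree, t=n_groups, criterion="maxclust")
    # Relabel so that group 1 is the largest group, group 2 the next, etc.
    ids, counts = np.unique(raw, return_counts=True)
    order = ids[np.argsort(-counts, kind="stable")]
    relabel = {old: new for new, old in enumerate(order, start=1)}
    labels = np.array([relabel[r] for r in raw])
    return labels, tree
```

The returned `tree` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw a figure of the same kind as Figure 6.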
To better understand the health patterns of patients in each subgroup, we calculated the average of normalized measurements for each health marker in each group, as shown in Figure 7. In the top panel of Figure 7, the value in each cell is averaged over all patients and all clinical encounters between January 1, 2013 and December 31, 2017. We compare these values to the average of each health marker in the entire study sample. A higher value of HDL and lower values of HBP, cholesterol, and HbA1c represent healthier T2D status. The number of medications prescribed at each clinical encounter does not directly reflect the disease status, but a lower count usually indicates a less severe state. Group 4 contains 2163 patients, whose cholesterol was slightly higher than the overall average. Their HBP, HbA1c, and number of medications were lower than the overall averages. In addition, they had the highest HDL, which was substantially higher than the overall average. Thus, group 4 is the relatively healthy group, in which patients did not take many medications. Group 1 contains 10,705 patients who were less healthy, since they had lower-than-average HDL, although their other health markers were favorable or roughly neutral. The cholesterol of the 6930 patients in group 2 was higher than the overall average, while their other health markers were lower than or around the averages. We conclude that group 2 is a moderately ill group. The 4857 patients in group 3 had cholesterol levels slightly lower than the overall average; however, they had the highest HbA1c, and their other markers also indicated poor health status. Therefore, group 3 patients were in the most severe state of T2D.
FIGURE 7.
Averages of normalized measurements by health markers and patient subgroups. (a): using data from 1/1/2013 to 12/31/2017. (b): using data after 1/1/2018. Red: more severe status than the overall sample average in terms of a health marker; Blue: healthier status than the overall sample average in terms of a health marker; White: overall sample average status in terms of a health marker.
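The group-by-marker averages displayed in Figure 7 can be illustrated with a short sketch. It assumes, hypothetically, that "normalized" means a z-score of each marker over all available encounters and that unmeasured markers are stored as NaN; the function name `group_profiles` and its inputs are illustrative and do not come from the paper.

```python
import numpy as np


def group_profiles(measurements, groups):
    """Average normalized marker values within each patient subgroup.

    measurements: (n_encounters, n_markers) array, NaN where a marker
                  was not measured at that encounter.
    groups:       (n_encounters,) subgroup label of the patient seen
                  at each encounter.
    Returns (labels, profiles), with profiles[g, k] the mean z-score of
    marker k in subgroup labels[g]; 0 corresponds to the overall average.
    """
    x = np.asarray(measurements, dtype=float)
    g = np.asarray(groups)
    # Assumption: z-score each marker using all observed encounters.
    z = (x - np.nanmean(x, axis=0)) / np.nanstd(x, axis=0)
    labels = np.unique(g)
    profiles = np.vstack([np.nanmean(z[g == k], axis=0) for k in labels])
    return labels, profiles
```

Positive entries indicate values above the overall sample average and negative entries values below it; whether that is "healthier" depends on the marker (for example, higher HDL is favorable, while higher HbA1c is not).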
To examine whether the subgroups inferred by the clustering truly represent patients’ health profiles, we validated the detected patterns using the third data split, which consisted of the EHR data collected after January 1, 2018. These data were not used in any other analysis of this application. The average values of normalized measurements for each health marker in each group are shown in the bottom panel of Figure 7. We conclude that the patients’ health patterns identified prior to 2018 are consistent with the patterns observed afterwards. Therefore, the patient groups are not only meaningful but also represent some true underlying patient patterns over time. This robustness is particularly important for the long-term health management of T2D patients.
6 |. DISCUSSION
In this work, we proposed a latent temporal process model to integrate health markers in EHRs and characterize patient heterogeneities. The proposed method is capable of handling unbalanced records and informative visits, i.e., patients can have missing health markers at some encounters or visit times that depend on their health status. Additionally, our model can fit different types of health markers, capture the dependence structures among them, and take into account the informative pattern of visit times via the intensity function of health markers. The real data application shows the capability of the proposed method in addressing the data challenges of EHRs, integrating different types of health markers, and identifying meaningful and robust patient subgroups. Therefore, the proposed method may shed light on the detection of patient homogeneities and heterogeneities, and serve as a step towards applications of personalized medicine.
In the parameter estimation process, we assumed that the variances of the latent variables were fixed, and they were estimated using the EHR data of 2011 and 2012. To study whether the constant variance assumption was reasonable, we estimated the variances from 6 different two-year windows as well as from the whole five-year data. The results, shown in supplementary Tables 1 and 2, indicate that the estimates varied little. Thus, the constant variance assumption seems to be reasonable for our application. In addition, we re-estimated βk(t) and σkl(t) using the five-year variance estimates in our approach. Supplementary Figure 2 and supplementary Figure 3 reveal slight changes in the estimated coefficients. In fact, the absolute percentage changes between the two sets of coefficients are less than 1%, except for two coefficients, which have changes of up to 3%. Therefore, we conclude that the estimation results are robust to the constant variance estimates. Moreover, to investigate the effect of the bandwidth selection, as suggested by a reviewer, we report βk(t) using two suboptimal bandwidths close to the optimal bandwidth used in the paper. Supplementary Figure 4 shows that the suboptimal estimators preserve a pattern similar to that of the optimal estimators. The Canberra distances22 between the optimal and suboptimal estimators across time, calculated as
(14)
where one vector collects the estimates obtained with the optimal bandwidth H1 and the other collects the corresponding estimates obtained with a suboptimal bandwidth, are all smaller than 0.05, as given in supplementary Table 3. The same conclusions can be drawn for estimating σkl(t) (cf. supplementary Figure 5 and supplementary Table 4).
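As a reminder of how such a comparison can be computed, the sketch below implements the standard Canberra distance, the sum over coordinates of |u_j − v_j| / (|u_j| + |v_j|). It is an assumption that this unnormalized form is the one intended; the paper's expression (14) may differ, for example by averaging over the time grid. The function name `canberra_distance` and its arguments are illustrative.

```python
import numpy as np


def canberra_distance(u, v):
    """Standard Canberra distance between two equal-length vectors.

    u, v: e.g. coefficient estimates on a common time grid under two
          different bandwidths (hypothetical inputs).
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    num = np.abs(u - v)
    den = np.abs(u) + np.abs(v)
    # Convention: coordinates where both entries are zero contribute 0.
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(terms.sum())
```

Because each term lies in [0, 1], the distance is insensitive to the overall scale of the coefficients, which makes it convenient for comparing curves of different magnitudes.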
In our models, we assumed that the intensity function of the counting process depended only on the baseline covariates. This assumption can be violated if the intensity also depends on the historical marker values. Incorporating time-dependent marker values, which are missing at most time points, is challenging. To examine how this assumption may affect our results, we therefore included an ad hoc marker value, defined as the mean value of HbA1c in the past 12 months, in the intensity model (1). From supplementary Table 5, the effects of the historical HbA1c level on the frequencies of HBP, TC, and HbA1c are significant, while the historical HbA1c level has a smaller impact on the frequencies of HDL and the number of medications. Supplementary Figure 6 and supplementary Figure 7 reflect the same phenomenon: there are only slight differences between the two versions of the estimators for HDL and the number of medications. Although the differences between the two versions of the estimators for HBP and TC are moderate, the new estimators still lie within or around the 95% bootstrapped confidence intervals for the original estimators. For HbA1c, however, the differences cannot be ignored, and the estimated curves present some unusual shapes. Therefore, further investigation is needed regarding which time-dependent marker values should be used and how missing data issues should be addressed. We will investigate this in future work.
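The ad hoc historical marker described above could be computed along the following lines. This is a minimal sketch, assuming that times are measured in days, that "the past 12 months" means the 365 days strictly before the current time, and that the value is undefined when no HbA1c was recorded in that window; the function name `trailing_mean` and these conventions are hypothetical and not taken from the paper.

```python
import numpy as np


def trailing_mean(times, values, t, window=365.0):
    """Mean of a marker over the `window` days strictly before time t.

    times:  measurement times (in days) for one patient.
    values: marker values (e.g. HbA1c) at those times, NaN if missing.
    Returns NaN when the window contains no observed value.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    # Assumption: use observations in [t - window, t), excluding time t itself.
    in_window = (times >= t - window) & (times < t) & ~np.isnan(values)
    if not in_window.any():
        return float("nan")
    return float(values[in_window].mean())
```

How patients with an empty window are handled (here, NaN) is exactly the kind of missing-data choice that the paragraph above flags as needing further investigation.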
Since the estimation of both the regression coefficients and the correlations among latent processes relies on only one or two health markers at a time, our method can be easily extended to handle a large number of health markers, and the computation can be parallelized to save computing time and cost. Inferences can be made based on subsampling subsets of the data. Other extensions of the proposed method include estimating all β’s simultaneously by incorporating the covariance matrix into the estimating equations for the β’s, or allowing marker-specific and time-sensitive bandwidth selection during parameter estimation (especially when the smoothness of biomarker trajectories is expected to differ substantially). Another possible extension is to explicitly model the temporal dependence within the same health marker, as well as across health markers. Despite the increased computational burden, an advantage of this extension is the potential to obtain a more precise assessment of the latent process given the entire history of health markers.
As stated in Section 1, the latent processes can also be viewed as projections of the health markers onto a lower dimensional space. Therefore, our method can be used for identifying latent clusters among patients, as illustrated in our application, and at the same time can play a role in learning personalized disease prognosis and personalized disease management. For example, the summary of latent processes can be used to improve the understanding of treatment propensity scores in EHRs when learning individualized treatment rules. Lastly, the latent processes can be included in disease outcome models as prognostic or predictive health markers.
Supplementary Material
ACKNOWLEDGMENTS
This research is supported by U.S. NIH grants GM124104, NS073671, and MH117458. The code to implement these methods is available from the authors upon request.
Footnotes
Conflict of interest
The authors declare no potential conflicts of interest.
References
1. Gunter TD, Terry NP. The emergence of national electronic health record architectures in the United States and Australia: Models, costs, and questions. J Med Internet Res. 2005; 7(1): e3.
2. Cebul RD, Love TE, Jain AK, Hebert CJ. Electronic health records and quality of diabetes care. N Engl J Med. 2011; 365(9): 825–833.
3. Herrin J, Graca B, Nicewander D, et al. The effectiveness of implementing an electronic health record on diabetes care and outcomes. Health Serv Res. 2012; 47(4): 1522–1540.
4. Verbeke G, Fieuws S, Molenberghs G, Davidian M. The analysis of multivariate longitudinal data: A review. Stat Methods Med Res. 2014; 23(1): 42–59.
5. Davidian M, Giltinan DM. Nonlinear Models for Repeated Measurement Data: An Overview and Update. Journal of Agricultural, Biological, and Environmental Statistics. 2003; 8(4): 387–419.
6. Verbeke G, Molenberghs G. Linear mixed models for longitudinal data. New York: Springer. 2000.
7. Molenberghs G, Verbeke G. Models for discrete longitudinal data. New York: Springer. 2005.
8. Lambert P, Vandenhende F. A copula-based model for multivariate non-normal longitudinal data: Analysis of a dose titration safety study on a new antidepressant. Stat Med. 2002; 21(21): 3197–3217.
9. Gueorguieva RV, Sanacora G. Joint analysis of repeatedly observed continuous and ordinal measures of disease severity. Stat Med. 2006; 25(8): 1307–1322.
10. Huang JZ, Wu CO, Zhou L. Varying-Coefficient Models and Basis Function Approximations for the Analysis of Repeated Measurements. Biometrika. 2002; 89(1): 111–128.
11. Fan J, Zhang W. Statistical methods with varying coefficient models. Stat Interface. 2008; 1(1): 179–195.
12. Henao R, Lu JT, Lucas JE, Ferranti J, Carin L. Electronic health record analysis via deep poisson factor models. The Journal of Machine Learning Research. 2016; 17(1): 6422–6453.
13. Ho JC, Ghosh J, Steinhubl SR, et al. Limestone: High-throughput candidate phenotype generation via tensor factorization. J Biomed Inform. 2014; 52: 199–211.
14. Miscouridou X, Perotte A, Elhadad N, Ranganath R. Deep Survival Analysis: Nonparametrics and Missingness. Proceedings of the 3rd Machine Learning for Healthcare Conference, in PMLR. 2018; 85: 244–256.
15. DeMaesschalck R, Jouan-Rimbaud D, Massart DL. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems. 2000; 50: 1–18.
16. Andersen PK, Gill RD. Cox’s Regression Model for Counting Processes: A Large Sample Study. The Annals of Statistics. 1982; 10(4): 1100–1120.
17. Abramowitz M. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. District of Columbia: U.S. Govt. 1964.
18. Cao H, Zeng D, Fine JP. Regression analysis of sparse asynchronous longitudinal data. J R Stat Soc Series B Stat Methodol. 2015; 77(4): 755–776.
19. Stone NJ, Robinson JG, Lichtenstein AH, et al. 2013 ACC/AHA guideline on the treatment of blood cholesterol to reduce atherosclerotic cardiovascular risk in adults: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. J Am Coll Cardiol. 2014; 63(25 Pt B): 2889–2934.
20. Whelton PK, Carey RM, Aronow WS, et al. 2017 ACC/AHA/AAPA/ABC/ACPM/AGS/APhA/ASH/ASPC/NMA/PCNA Guideline for the Prevention, Detection, Evaluation, and Management of High Blood Pressure in Adults: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. J Am Coll Cardiol. 2018; 71(19): e127–e248.
21. ADA. 8. Pharmacologic Approaches to Glycemic Treatment: Standards of Medical Care in Diabetes-2018. Diabetes Care. 2018; 41(Suppl 1): S73–S85.
22. Lance GN, Williams WT. Computer Programs for Hierarchical Polythetic Classification (“Similarity Analyses”). The Computer Journal. 1966; 9(1): 60–64.