Methods for Estimating Kidney Disease Stage Transition Probabilities Using Electronic Medical Records

Lola Luo; Dylan Small; Walter F Stewart; Jason A Roy

doi:10.13063/2327-9214.1040

2013 Dec 18;1(3):1040. doi: 10.13063/2327-9214.1040

PMCID: PMC4371506 PMID: 25848580

Abstract

Chronic diseases are often described by stages of severity. Clinical decisions are influenced by the stage, by whether a patient is progressing, and by the rate of progression. For chronic kidney disease (CKD), relatively little is known about the transition rates between stages. To address this, we used electronic health record (EHR) data on a large primary care population, which should offer both sufficient follow-up time and sample size to reliably estimate transition rates for CKD. However, EHR data have some features that threaten the validity of any analysis. In particular, the timing and frequency of laboratory values and clinical measurements are not determined a priori by research investigators, but rather depend on many factors, including the current health of the patient. We developed an approach for estimating CKD stage transition rates using hidden Markov models (HMMs) when the amount of information and the observation times vary among individuals. To estimate the HMMs in a computationally manageable way, we used a “discretization” method to transform daily data into intervals of 30 days, 90 days, or 180 days. We assessed the accuracy and computation time of this method via simulation studies. We also used simulations to study the effect of informative observation times on the estimated transition rates. Our simulation results showed good performance of the method, even when missing data are non-ignorable. We applied the methods to EHR data from over 60,000 primary care patients who have chronic kidney disease (stage 2 and above). We estimated transition rates between six underlying disease states. The results were similar for men and women.

Keywords: disease progression, chronic kidney disease, hidden Markov model, transition probability, missing at random, missing not at random, EM algorithm

Introduction

The severity of many chronic diseases, including cancer and chronic kidney disease (CKD), is characterized, at least in part, by stages. The stage of disease and the rate of progression or regression are important for deciding whether to treat, how to treat, and how often to monitor a patient. Moreover, knowledge about transition rates between stages helps patients understand what to expect and helps policymakers plan.

One approach for analyzing disease stage data is hidden Markov models (HMMs) (MacDonald and Zucchini 1997, 2009). Unlike ordinary Markov models, HMMs account for the fact that the observed disease stages sometimes differ from the underlying disease stages as a result of measurement error. Recently, researchers have used continuous-time HMMs to analyze data in a variety of clinical areas, such as hepatocellular cancer (Kay 1986), HIV progression (Satten and Longini 1996), and aortic aneurysms (Jackson 2003). However, a continuous-time model is computationally costly and may be infeasible if the sample size is large, which is typically the case with electronic health record (EHR) data. Further, for many studies there would be no benefit to having finer information about the timing of a measurement than the calendar date. Discrete-time HMMs are a useful alternative, and have been developed and applied to a variety of health problems (Shirley et al. 2010; Rabiner 1986; Jackson and Sharples 2002; Scott 1999; Scott 2002; Scott et al. 2005; Gentleman et al. 1994; Bureau et al. 2000). While discrete-time HMMs have many desirable features, the estimation of transition rates typically requires large observational studies with long follow-up times, because transitions usually occur over years. Such studies are often prohibitively costly and time consuming. Longitudinal EHR data from large primary care practices offer an alternative means of assembling the longitudinal health experience of a population. Such data have the advantage of having both sufficient follow-up time and sample size to reliably and accurately estimate the rates of these rare transitions.
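To make the distinction between hidden and observed stages concrete, the following is a minimal sketch of the forward recursion for the likelihood of one patient's sequence under a discrete-time HMM. It is not the authors' implementation. The function name, the argument layout, and the toy numbers are hypothetical, and the sketch assumes that an interval without a measurement is ignorable and contributes a factor of 1 to the likelihood.

```python
def hmm_likelihood(init, trans, emit, obs):
    """Likelihood of one observed stage sequence under a discrete-time HMM.

    init[i]     : probability of starting in hidden state i
    trans[i][j] : probability of moving from hidden state i to j in one interval
    emit[i][k]  : probability of observing stage k when the hidden state is i
    obs         : observed stages (0-based), with None where nothing was measured
    """
    n = len(init)

    def weight(state, y):
        # Assumption: a missing observation is ignorable and contributes 1.
        return 1.0 if y is None else emit[state][y]

    # Forward probabilities for the first interval.
    alpha = [init[i] * weight(i, obs[0]) for i in range(n)]
    # Propagate through the remaining intervals.
    for y in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * weight(j, y)
            for j in range(n)
        ]
    return sum(alpha)


# Toy two-state example; the numbers are illustrative only.
init = [0.8, 0.2]
trans = [[0.9, 0.1], [0.0, 1.0]]
emit = [[0.95, 0.05], [0.10, 0.90]]
print(hmm_likelihood(init, trans, emit, [0, None, 1]))
```

For long sequences the forward probabilities shrink toward zero, so a practical version would rescale them at each step or work on the log scale.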

In this paper we address challenges with using estimated glomerular filtration rate (eGFR) to study transition rates for chronic kidney disease (CKD). While large populations with years of longitudinal EHR data seem well suited for estimating CKD transition rates, two problems arise. First, unlike planned observational studies, digital patient records vary substantially in when (e.g., a patient seeks care for a problem) and why (i.e., a physician decides what to measure) a measurement is obtained, including measuring in relation to the severity of the underlying disease state. While eGFR is routinely measured on patients, the reason for measurement is also related to health status. Relatedly, measurement frequency varies substantially among patients and is often sporadic, leading to inferential challenges for handling these diverse types of missing data. Second, the size of the data set makes it challenging to fit complex models that involve computationally expensive optimization.

The objective of this paper is to test HMM-based methods that can address the challenges of estimating transition rates from large EHR data sets with irregular and potentially informative observation times. We deal with the size of the data and the irregularity of the observation times by developing a discretization method that transforms daily data (with a high degree of missingness) into data on wider time intervals. We use simulation studies to explore the impact of discretization assumptions on bias and variability, as well as on computing time.
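The discretization idea can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the paper: the function name and record layout are hypothetical, and the rule for combining several measurements that fall in the same interval (keep the latest) is an assumption, since this section does not specify one.

```python
from datetime import date


def discretize(records, start, width_days):
    """Collapse dated stage observations into consecutive fixed-width intervals.

    records    : list of (date, stage) pairs for one patient
    start      : first day of the first interval (for example, the baseline date)
    width_days : interval width in days (30, 90, or 180 in the paper)
    Returns one entry per interval; None marks an interval with no measurement.
    """
    kept = [(d, s) for d, s in records if d >= start]  # ignore pre-baseline records
    if not kept:
        return []
    last_day = max(d for d, _ in kept)
    n_intervals = (last_day - start).days // width_days + 1
    series = [None] * n_intervals
    for d, stage in sorted(kept):
        idx = (d - start).days // width_days
        # Assumption: when several measurements share an interval, keep the latest.
        series[idx] = stage
    return series


records = [(date(2004, 1, 5), 2), (date(2004, 1, 20), 3), (date(2004, 7, 1), 3)]
print(discretize(records, date(2004, 1, 5), 90))
print(discretize(records, date(2004, 1, 5), 30))
```

Wider intervals give shorter series with fewer missing entries, at the cost of discarding within-interval changes; that is the trade-off the simulation studies examine.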

To ensure that the simulation results are relevant to CKD, we first conducted a preliminary analysis of the CKD data. In the simulation studies, we simulated data from models whose parameters were similar to those from the CKD analysis. To address concerns about potentially informative observation times (i.e., the decision to obtain or not obtain eGFR on a given date might depend on the observed health state), we conducted simulation studies in which we applied our method to simulated data with informative observation times. We found that informative observation times did not have a significant impact on inference. We also demonstrate the feasibility of using this method on large EHR data, and present results from the CKD data as an illustration.

The rest of the paper is organized as follows. Section 2 describes the CKD study. Section 3 gives a brief introduction to HMMs and discusses in detail the HMM we propose for the CKD data. Section 4 describes the simulation study and provides the results. The results of the CKD analysis are presented in Section 5. Finally, Section 6 discusses the findings, their implications, and directions for future research.

Background and Data

The study was approved by the Institutional Review Boards (IRBs) of Geisinger Health System and the University of Pennsylvania. Methods on CKD stages, access to EHR data, and HMM are described herein.

Chronic Kidney Disease

The National Kidney Foundation Kidney Disease Outcomes Quality Initiative (NKF KDOQI) classifies a patient’s CKD into one of five stages, defined by the level of the patient’s estimated glomerular filtration rate (eGFR, in mL/min/1.73 m²) (Levy et al. 1999): kidney impairment with normal kidney function (stage 1, eGFR ≥ 90), kidney impairment with mildly decreased kidney function (stage 2, eGFR 60-89), moderately decreased kidney function (stage 3, eGFR 30-59), severely decreased kidney function (stage 4, eGFR 15-29), and kidney failure (stage 5, eGFR < 15). Many patients who have CKD progress through these stages.
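As an illustration, the staging rule can be written as a small lookup using the standard KDOQI cut points of 90, 60, 30, and 15 mL/min/1.73 m². The function name is hypothetical, and the sketch maps eGFR alone; stages 1 and 2 also require evidence of kidney impairment, which an eGFR value does not establish.

```python
def ckd_stage(egfr):
    """Map an eGFR value (mL/min/1.73 m^2) to a CKD stage from 1 to 5."""
    if egfr < 0:
        raise ValueError("eGFR cannot be negative")
    if egfr >= 90:
        return 1  # normal function; CKD only if kidney impairment is present
    if egfr >= 60:
        return 2  # mildly decreased
    if egfr >= 30:
        return 3  # moderately decreased
    if egfr >= 15:
        return 4  # severely decreased
    return 5      # kidney failure


print([ckd_stage(v) for v in (95, 72, 45, 20, 9)])
```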

Data Description

All data for this study were derived from the Geisinger Health System (GHS), an integrated delivery system offering health care services to residents of 31 of Pennsylvania’s 67 counties, with a significant presence in central and northeastern Pennsylvania. GHS includes the Geisinger Health Plan (GHP), an insurance plan, and the Geisinger Clinic (GC), two major independent business entities with overlapping populations, as well as a host of other provider facilities (e.g., hospitals and addiction centers). GC primary care physicians manage approximately 400,000 patients annually. Adult (i.e., 18+ years of age) primary care patients were the source population for this study. These patients were similar to the general population of the region and were predominantly Caucasian.

For this study, a database was created from EHR data of GC primary care patients, regardless of whether they were insured by GHP. All health information was integrated, including laboratory orders and results, medication orders, and inpatient (since 2007) and outpatient encounters. Longitudinal data were available for the period from July 30, 2003 to December 31, 2009. Patients’ disease stages were evaluated according to eGFR values. Data were obtained from the National Kidney Registry and the Social Security Death Index to determine the dates on which patients had dialysis, had a kidney transplant, or died. Demographic variables routinely collected as part of patient care, such as age and gender, were also available.

Subjects were included in the study if they were between 30 and 75 years of age, had stage 2 or higher CKD at the time of their first eGFR, and had at least two valid values of disease stage (eGFR, dialysis, kidney transplant, or death). A total of 66,633 patients satisfied these criteria. Table 1 shows the baseline demographic information of our sample, where we define baseline as the date of the first observed eGFR. The percentages of females and males were similar among patients who started with stage 2 CKD, but there were significantly more females than males among those who started with later stages of CKD. The median age was 55 years in both male and female patients. The younger median age for stages 4 and 5 relative to stage 3 reflects the selection inherent to a prevalent sample: older patients are more common in more severe CKD stages, and the risk of death among older patients is higher. There were 2,610 patients recorded with dialysis, kidney transplant, or death as the outcome at the end of the study.

Table 1. Baseline demographic characteristics

	Female	Male	Total
Count, n (% of stage total)
Stages 2–5	37,507 (56%)	29,126 (44%)	66,633
Stage 2	33,105 (55%)	26,722 (45%)	59,827
Stage 3	4,215 (65%)	2,283 (35%)	6,498
Stage 4	168 (60%)	113 (40%)	281
Stage 5	19 (70%)	8 (30%)	27
Age in years, median (IQR)
All stages	55 (21)	55 (20)	55 (20)
Stage 2	54 (20)	54 (19)	54 (20)
Stage 3	67 (13)	66 (13)	66 (13)
Stage 4	64 (14)	62 (15)	63 (15)
Stage 5	62 (13)	61 (12)	62 (13)

Missing Mechanism	Definition
MAR	Pr(W_it = 1 | y_ij = 1)
MNAR1	Pr(W_it = 1 | y_it−1 = 1)
MNAR2	Pr(W_it = 1 | h_it = 1)

Scheme	State 1	State 2	State 3	State 4

1	0.993 (143)	0.989 (91)	0.954 (22)	0.909 (11)
2	0.990 (100)	0.987 (77)	0.984 (62)	0.980 (50)
3	0.995 (200)	0.950 (20)	0.750 (4)	0.550 (2)
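The table does not label its two numbers here, so the following reading is an assumption: if the first value is the daily probability that no measurement is taken in a given state, and days are independent, then the expected number of days between measurements is 1/(1 − p). That quantity matches the parenthesized values in the table after rounding. The function name is hypothetical.

```python
def expected_gap_days(p_missing):
    """Mean days between measurements when each day is missing with probability p_missing.

    Assumption: days are independent, so the gap follows a geometric distribution.
    """
    if not 0.0 <= p_missing < 1.0:
        raise ValueError("p_missing must be in [0, 1)")
    return 1.0 / (1.0 - p_missing)


# Scheme 1 values copied from the table above (state -> daily missingness probability).
scheme_1 = {1: 0.993, 2: 0.989, 3: 0.954, 4: 0.909}
for state, p in scheme_1.items():
    print(state, f"{expected_gap_days(p):.1f}")
```

Under this reading, Scheme 3 makes measurement frequency depend most strongly on the state, with a gap of about 200 days in state 1 and about 2 days in state 4.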

	True value	30 days, θ̂ (ESD)	90 days, θ̂ (ESD)	180 days, θ̂ (ESD)

Initial Prob.

π_A	0.80	0.799 (0.007)	0.794 (0.006)	0.775 (0.006)
π_B	0.10	0.103 (0.005)	0.109 (0.005)	0.127 (0.006)
π_C	0.07	0.069 (0.004)	0.068 (0.004)	0.068 (0.004)
π_D	0.03	0.029 (0.002)	0.030 (0.002)	0.030 (0.002)

	Four hidden states	Five hidden states	Six hidden states
Men
AIC	108,895	106,224	105,222
BIC	109,009	106,378	105,424
Women
AIC	158,880	155,665	154,025
BIC	158,994	155,819	154,227
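The comparison in this table amounts to choosing the number of hidden states with the smallest information criterion. A minimal sketch follows, with the values copied from the table; the variable and function names are hypothetical.

```python
# AIC and BIC by number of hidden states, copied from the table above.
criteria = {
    "Men": {
        "AIC": {4: 108895, 5: 106224, 6: 105222},
        "BIC": {4: 109009, 5: 106378, 6: 105424},
    },
    "Women": {
        "AIC": {4: 158880, 5: 155665, 6: 154025},
        "BIC": {4: 158994, 5: 155819, 6: 154227},
    },
}


def best_model(scores):
    """Return the number of hidden states with the smallest criterion value."""
    return min(scores, key=scores.get)


for group, by_criterion in criteria.items():
    for name, scores in by_criterion.items():
        print(group, name, best_model(scores))
```

Both criteria favor six hidden states for men and for women among the three candidates shown; the table does not report models with more than six states.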

	Women, estimate (SE)	Men, estimate (SE)

Initial Prob.
π_A	0.799 (0.0080)	0.862 (0.0032)
π_B	0.122 (0.0066)	0.081 (0.0033)
π_C	0.074 (0.0025)	0.053 (0.0021)
π_D	0.005 (0.0045)	0.004 (0.0004)
π_E	0.000 (0.0001)	0.000 (0.0001)

	Women, estimate (SE)	Men, estimate (SE)

State-dep. Prob.

p_A1	0.980 (0.0010)	0.986 (0.0007)
p_A2	0.020 (0.0010)	0.014 (0.0007)

p_B1	0.583 (0.0194)	0.584 (0.0165)
p_B2	0.416 (0.0193)	0.414 (0.0164)
p_B3	0.001 (0.0003)	0.002 (0.0003)

p_C1	0.025 (0.0026)	0.020 (0.0033)
p_C2	0.964 (0.0021)	0.968 (0.0031)
p_C3	0.011 (0.0013)	0.012 (0.0013)

p_D2	0.143 (0.0177)	0.130 (0.0192)
p_D3	0.847 (0.0172)	0.861 (0.0187)
p_D4	0.010 (0.0019)	0.009 (0.0033)

p_E3	0.174 (0.0565)	0.050 (0.0541)
p_E4	0.826 (0.0565)	0.950 (0.0541)

	Women, estimate (SE)	Men, estimate (SE)

Transition Prob.

γ_AA	0.987 (0.0004)	0.989 (0.0004)
γ_AB	0.011 (0.0004)	0.009 (0.0004)
γ_AF	0.002 (0.0001)	0.002 (0.0001)

γ_BA	0.029 (0.0015)	0.025 (0.0025)
γ_BB	0.932 (0.0024)	0.928 (0.0037)
γ_BC	0.036 (0.0019)	0.038 (0.0020)
γ_BF	0.003 (0.0004)	0.009 (0.0007)

γ_CB	0.014 (0.0008)	0.016 (0.0018)
γ_CC	0.973 (0.0009)	0.962 (0.0019)
γ_CD	0.007 (0.0005)	0.011 (0.0006)
γ_CF	0.006 (0.0004)	0.011 (0.0007)

γ_DC	0.031 (0.0043)	0.029 (0.0046)
γ_DD	0.918 (0.0051)	0.896 (0.0057)
γ_DE	0.016 (0.0019)	0.026 (0.0036)
γ_DF	0.035 (0.0033)	0.049 (0.0052)

γ_ED	0.000 (0.0000)	0.033 (0.0152)
γ_EE	0.839 (0.0329)	0.704 (0.0303)
γ_EF	0.161 (0.0329)	0.263 (0.0311)
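One way to read these one-interval estimates is to raise the transition matrix to a power, which gives transition probabilities over several intervals. The sketch below uses the women's estimates from the table. It makes two assumptions not stated in the table: transitions that are not listed have probability 0, and state F is absorbing, so its row is added by hand. Function names are hypothetical, and the interval length is whatever was used to fit the model.

```python
STATES = ["A", "B", "C", "D", "E", "F"]

# Women's one-interval transition estimates; unlisted entries are taken as 0.
# Assumption: F is absorbing, so the last row is supplied by hand.
P_WOMEN = [
    [0.987, 0.011, 0.000, 0.000, 0.000, 0.002],
    [0.029, 0.932, 0.036, 0.000, 0.000, 0.003],
    [0.000, 0.014, 0.973, 0.007, 0.000, 0.006],
    [0.000, 0.000, 0.031, 0.918, 0.016, 0.035],
    [0.000, 0.000, 0.000, 0.000, 0.839, 0.161],
    [0.000, 0.000, 0.000, 0.000, 0.000, 1.000],
]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, steps):
    """Return p raised to the given non-negative integer power."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result


def prob_reach_f(p, start, steps):
    """Probability of being in F after `steps` intervals, starting from `start`."""
    return mat_pow(p, steps)[STATES.index(start)][STATES.index("F")]


for start in STATES[:-1]:
    print(start, round(prob_reach_f(P_WOMEN, start, 12), 3))
```

Because F is treated as absorbing, the probability of being in F after a given number of intervals equals the probability of having reached F by then.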

	True value	30 days, θ̂ (ESD)	90 days, θ̂ (ESD)	180 days, θ̂ (ESD)

State-dep. Prob.

p_A1	0.90	0.904 (0.0018)	0.904 (0.0019)	0.905 (0.0024)
p_A2	0.10	0.096 (0.0018)	0.096 (0.0019)	0.095 (0.0024)

p_B1	0.10	0.104 (0.0021)	0.102 (0.0031)	0.103 (0.0044)
p_B2	0.80	0.805 (0.0026)	0.809 (0.0034)	0.815 (0.0050)
p_B3	0.10	0.091 (0.0020)	0.089 (0.0026)	0.082 (0.0032)

p_C2	0.10	0.097 (0.0019)	0.084 (0.0033)	0.073 (0.0047)
p_C3	0.80	0.813 (0.0023)	0.838 (0.0038)	0.854 (0.0054)
p_C4	0.10	0.090 (0.0019)	0.078 (0.0029)	0.073 (0.0046)

p_D3	0.10	0.101 (0.0023)	0.099 (0.0049)	0.090 (0.0072)
p_D4	0.90	0.899 (0.0023)	0.901 (0.0049)	0.910 (0.0072)

Missing Mechanism	30 Day	90 Day	180 Day

MAR	9592	2821	2373
MNAR1	8602	4106	1513
MNAR2	9394	3790	2602
