Rapid Report on Estimating Incidence from Cross-Sectional Data

Justin B DeMonte; Anne M Neilan; Matthew S Loop; Andrea A Ciaranello; Michael G Hudgens

doi:10.1016/j.annepidem.2020.06.005

Author manuscript; available in PMC: 2022 Jan 1.

Published in final edited form as: Ann Epidemiol. 2020 Oct 20;53:106–108.e1. doi: 10.1016/j.annepidem.2020.06.005

PMCID: PMC7736050 NIHMSID: NIHMS1613660 PMID: 32979470

Introduction

Incidence is a fundamental quantity in epidemiology. The incidence rate (IR) is typically defined as the number of new cases of disease divided by the person-time over a given period in the population of interest [1]. IR estimates are often obtained using data from prospective cohort studies, in which participants are followed over time and considered at risk while on-study and yet to experience an event (e.g., disease diagnosis). The crude, or unadjusted, IR estimator typically used in analysis of cohort data entails dividing the observed number of events by the observed total person-time at risk [1]. More specifically, for participant i, let T_i and C_i denote time to event and time to censoring (e.g., due to loss-to-follow-up), respectively. Let (X₁, D₁), …, (X_n, D_n) be independent random variable pairs where X_i = min(T_i, C_i), D_i = 1 if T_i ≤ C_i, and D_i = 0 otherwise. Then the crude IR estimator is

\hat{IR} = \frac{\sum_{i = 1}^{n} D_{i}}{\sum_{i = 1}^{n} X_{i}} .

(1)

This estimator is statistically valid in the sense that if event times are exponentially distributed with constant incidence (hazard) rate λ, and right censoring is independent of the event of interest, then $\hat{IR}$ is the maximum likelihood estimator (MLE) of λ [2]. Therefore, $\hat{IR}$ is a consistent estimator of the population IR, i.e., $\hat{IR}$ should closely approximate the true IR in large prospective cohort studies.
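As a concrete illustration of estimator (1), the following minimal Python sketch computes the crude IR from simulated cohort data. The function names, the exponential censoring distribution, the censoring rate of 0.2, and the seed are illustrative assumptions and are not taken from the report.

```python
import random

def crude_ir_cohort(x, d):
    """Crude incidence rate (1): total events divided by total person-time at risk."""
    return sum(d) / sum(x)

def simulate_cohort(n, lam, censor_rate, rng):
    """Draw (X_i, D_i) pairs; exponential censoring is an illustrative assumption."""
    x, d = [], []
    for _ in range(n):
        t = rng.expovariate(lam)          # time to event T_i
        c = rng.expovariate(censor_rate)  # time to censoring C_i
        x.append(min(t, c))               # X_i = min(T_i, C_i)
        d.append(1 if t <= c else 0)      # D_i = 1 if the event is observed
    return x, d

rng = random.Random(2020)
x, d = simulate_cohort(n=20000, lam=0.3, censor_rate=0.2, rng=rng)
ir_hat = crude_ir_cohort(x, d)
print(f"crude IR from cohort data: {ir_hat:.3f} (true rate 0.3)")
```

With exact follow-up times available, the estimate should land close to the true rate, which is the behavior the consistency statement above describes.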

We consider a different setting where only cross-sectional data are available, i.e., at a single time point participants are evaluated to identify whether they have developed some disease. As a motivating example, in a study of youth living with HIV engaged in care in the United States [3], medical records were abstracted to assess STI incidence. For each participant, it was determined whether any new STI diagnosis occurred in the 12 months prior to study enrollment, but the exact date of diagnosis was not recorded. That is, instead of (X_i, D_i), independent random variable pairs (C_i, D_i) were observed, where C_i is the observation time for participant i. In the motivating example, C_i is constant (12 months) for all participants; this special case is addressed below.

Methods

Unlike the prospective cohort data case, for cross-sectional data the crude IR estimator is not the MLE of λ. In this setting the crude IR estimator might be defined as

\tilde{IR} = \frac{\sum_{i = 1}^{n} D_{i}}{\sum_{i = 1}^{n} C_{i}},

(2)

i.e., the total number of events divided by the total observation time. Intuitively, one would expect this estimator to be negatively biased since X_i < C_i for participants who have an event, such that the denominator in (2) will be larger than the denominator in (1). The simulation described below empirically demonstrates this estimator can indeed be negatively biased.
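The intuition above can be made explicit with a short large-sample argument. The following is a sketch that is not part of the original derivation; it assumes the C_i are independent and identically distributed, independent of T_i, with finite positive mean, so that the law of large numbers applies to the numerator and denominator of (2).

```latex
% Sketch under the stated assumptions (not taken from the original report).
\begin{align*}
\tilde{IR}
  &= \frac{n^{-1}\sum_{i=1}^{n} D_i}{n^{-1}\sum_{i=1}^{n} C_i}
  \;\longrightarrow\; \frac{E(D_i)}{E(C_i)}
  = \frac{E\{1 - \exp(-\lambda C_i)\}}{E(C_i)} \\
  &< \frac{E(\lambda C_i)}{E(C_i)} = \lambda ,
  \qquad \text{since } 1 - \exp(-x) < x \text{ for } x > 0 .
\end{align*}
```

In the special case C_i = c for all i, the limit reduces to {1 − exp(−λc)}/c, which is strictly less than λ.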

For cross-sectional data, the MLE does not in general have a closed form, but is straightforward to calculate. Specifically, the likelihood function is proportional to

\prod_{i = 1}^{n} \{1 - \exp (- λ C_{i})\}^{D_{i}} \{\exp (- λ C_{i})\}^{1 - D_{i}} .

(3)

Thus, the score function for λ is

U (λ) = \sum_{i = 1}^{n} \left[ \frac{C_{i} D_{i} \exp (- λ C_{i})}{1 - \exp (- λ C_{i})} - C_{i} (1 - D_{i}) \right] .

While there does not appear to be a closed form solution for λ to the equation U(λ) = 0, the MLE can easily be obtained via standard statistical software, as described below. There are 3 special cases where the MLE has either a closed form solution or approximation.
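The score equation can also be solved directly. Each event term C_i exp(−λC_i)/{1 − exp(−λC_i)} equals C_i/{exp(λC_i) − 1}, which decreases in λ, so U(λ) is decreasing and has a single root whenever the data contain at least one event and at least one non-event. The sketch below uses bisection on that basis; it is an illustrative alternative to the software approach in the appendix, and the function names and toy data are hypothetical.

```python
import math

def score(lam, c, d):
    """Score function U(lambda) for cross-sectional (C_i, D_i) data."""
    total = 0.0
    for ci, di in zip(c, d):
        if di == 1:
            # C e^{-lam C} / (1 - e^{-lam C}) rewritten as C / (e^{lam C} - 1)
            total += ci / math.expm1(lam * ci)
        else:
            total -= ci
    return total

def mle_rate(c, d, tol=1e-10):
    """Solve U(lambda) = 0 by bisection (U is decreasing in lambda)."""
    n_events = sum(d)
    if n_events == 0:
        return 0.0
    if n_events == len(d):
        raise ValueError("the MLE is not finite when every participant has an event")
    lo, hi = 0.0, 1.0
    while score(hi, c, d) > 0:      # expand the bracket until U(hi) <= 0
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, c, d) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical toy data: observation times in years and event indicators.
c = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
d = [0, 0, 1, 0, 1, 1]
lam_hat = mle_rate(c, d)
print(f"MLE of the incidence rate: {lam_hat:.4f} per year")
```

Bisection is chosen here only because monotonicity makes it simple to reason about; Newton's method or a general-purpose optimizer would serve equally well.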

First, in some studies, as in the motivating example, observation times are identical for all participants, i.e., C_i = c for all i. Solving U(λ) = 0 for λ yields

\hat{λ} = \frac{\ln \{ n / \sum_{i = 1}^{n} (1 - D_{i}) \}}{c} .

(4)

Note that (4) can be viewed as a simple plug-in type estimator. In particular, let S denote the survival function of the exponential distribution with hazard λ. By substituting $\hat{S} (c) = \sum_{i = 1}^{n} (1 - D_{i}) / n$ for S(c) into the identity λ = − ln(S(c))/c, one obtains (4). Using the observed information to estimate the variance of $\hat{λ}$, a large sample 100(1 − α)% confidence interval (CI) for ln(λ) is

\ln (\hat{λ}) \pm z_{1 - α / 2} \left[ \frac{\exp (\hat{λ} c) - 1}{\hat{λ} c \sqrt{\exp (\hat{λ} c) \sum_{i = 1}^{n} D_{i}}} \right]

Here $z_{q}$ denotes the qth quantile of the standard normal distribution. The endpoints of this interval can be exponentiated to construct a CI for λ = IR.
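For identical observation times, the closed-form estimate (4) and the interval above need only the number of events, the sample size, and c. The following sketch evaluates both; the function name and the counts (100 events among 1,000 participants observed for 1 year) are hypothetical and chosen only for illustration.

```python
import math

def mle_constant_c(n_events, n, c, z=1.959964):
    """Closed-form MLE (4) with a large-sample CI built on the log scale.

    z is the standard normal quantile for the desired confidence level
    (the default corresponds to a 95% interval).
    """
    if not 0 < n_events < n:
        raise ValueError("requires at least one event and at least one non-event")
    lam = math.log(n / (n - n_events)) / c
    # Standard error of ln(lambda-hat), from the observed information
    se_log = math.expm1(lam * c) / (lam * c * math.sqrt(math.exp(lam * c) * n_events))
    # Exponentiate the endpoints of the interval for ln(lambda)
    return lam, lam * math.exp(-z * se_log), lam * math.exp(z * se_log)

lam_hat, lower, upper = mle_constant_c(n_events=100, n=1000, c=1.0)
print(f"lambda-hat = {lam_hat:.4f}, 95% CI ({lower:.4f}, {upper:.4f})")
```

Because the interval is built for ln(λ) and then exponentiated, its lower endpoint is always positive.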

Second, suppose the event of interest is rare. For λ near zero, exp(−λC_i) ≈ 1 − λC_i. Therefore, it follows from the score function that $U (λ) \approx \sum_{i = 1}^{n} [D_{i} (1 - λ C_{i}) / λ - C_{i} (1 - D_{i})]$. Setting this approximation to zero and solving for λ gives $\hat{λ} \approx \sum_{i = 1}^{n} D_{i} / \sum_{i = 1}^{n} C_{i} = \tilde{IR}$. Thus, for rare events, the crude IR estimator approximates the MLE. Nonetheless, the crude IR estimator may exhibit substantial bias relative to the magnitude of λ, as demonstrated empirically below.
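To see how quickly the rare-event approximation degrades, consider the identical-observation-time case, where (2) gives a crude estimate of p/c and (4) gives an MLE of −ln(1 − p)/c, with p the observed proportion of participants with an event. The sketch below compares the two; the function name and the values of p are illustrative assumptions.

```python
import math

def crude_vs_mle(p, c=1.0):
    """Crude estimate, MLE, and relative gap when all C_i equal c.

    p is the observed proportion of participants with an event.
    """
    crude = p / c
    mle = -math.log1p(-p) / c          # -ln(1 - p) / c, as in (4)
    return crude, mle, (crude - mle) / mle

for p in (0.01, 0.05, 0.20, 0.40):
    crude, mle, gap = crude_vs_mle(p)
    print(f"p = {p:.2f}: crude = {crude:.4f}, MLE = {mle:.4f}, relative gap = {gap:.3f}")
```

The gap is negligible when events are very rare and widens steadily as the event proportion grows.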

Third, if there are no events, the MLE and crude estimators are equal, with $\tilde{IR} = \hat{IR} = \hat{λ} = 0$. In this case, an exact one-sided 100(1 − α/2)% CI for λ is $(0, 0.5 χ_{2}^{2} (1 - α / 2) / \sum_{i = 1}^{n} C_{i})$, where $χ_{v}^{2} (q)$ is the qth quantile of a chi-square distribution with v degrees of freedom [4].
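The zero-event bound can be evaluated without statistical tables. A chi-square distribution with 2 degrees of freedom is an exponential distribution with mean 2, so its qth quantile is −2 ln(1 − q), and 0.5 χ²₂(1 − α/2) simplifies to −ln(α/2). The sketch below uses that identity; the function name and the total of 500 person-years are hypothetical.

```python
import math

def zero_event_upper_bound(total_time, alpha=0.05):
    """Upper limit of the exact one-sided CI for lambda when no events occur.

    Uses 0.5 * chi2_2(1 - alpha/2) = -ln(alpha/2), which holds because a
    chi-square distribution with 2 degrees of freedom is exponential with mean 2.
    """
    return -math.log(alpha / 2) / total_time

upper = zero_event_upper_bound(total_time=500.0)  # hypothetical: 500 person-years
print(f"upper confidence limit: {upper:.5f} per person-year")
```

Equivalently, the bound is the value of λ at which the probability of observing no events over the total observation time, exp(−λ ΣC_i), equals α/2.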

In general, the MLE can be calculated using software that handles interval censored data [5], [6] (see appendix). By fitting an intercept-only accelerated failure time model, $\hat{λ}$ can be calculated by exponentiating the negative of the intercept estimate. A large sample 95% CI estimate for λ = IR is given by $\hat{λ} \exp (\pm 1.96 \, SE)$, where SE is the estimated standard error of the intercept estimate.
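The conversion from the fitted intercept to the incidence rate is a one-line transformation. The sketch below assumes the usual log-time parameterization of an exponential accelerated failure time model, in which the hazard equals exp(−intercept); the function name and the numeric inputs are hypothetical and do not come from any fitted model in the report.

```python
import math

def rate_from_aft_intercept(intercept, se, z=1.96):
    """Convert an intercept-only exponential AFT fit to an incidence rate and CI.

    intercept: estimated intercept on the log-time scale
    se: estimated standard error of the intercept
    """
    lam = math.exp(-intercept)                 # lambda-hat = exp(-intercept)
    return lam, lam * math.exp(-z * se), lam * math.exp(z * se)

# Hypothetical software output: intercept 2.30 with standard error 0.10
lam_hat, lower, upper = rate_from_aft_intercept(intercept=2.30, se=0.10)
print(f"lambda-hat = {lam_hat:.4f}, 95% CI ({lower:.4f}, {upper:.4f})")
```

The sign change does not affect the width of the interval, because ln(λ) is the negative of the intercept and therefore has the same standard error.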

Results

To compare the bias of the MLE $\hat{λ}$ and the crude estimator $\tilde{IR}$ under cross-sectional data with exact event times unknown, a simulation was conducted under 2 scenarios. In the first scenario, 1,000 data sets, each containing n = 1,000 observations, were generated to simulate the special case in which observation times were identical for all participants. For each observation, T_i was drawn from an exponential distribution with hazard λ, and C_i was set to 1 year. The second scenario was identical to the first except C_i was drawn from an exponential distribution with hazard 0.2. Simulations were repeated under each scenario for 6 true incidence rates λ = 0.5, 0.4, 0.3, 0.2, 0.1, 0.02. For each true λ value, empirical bias was calculated for $\hat{λ}$ and $\tilde{IR}$ .
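The simulation design can be sketched in a few lines of Python. This is not the authors' code: it uses 200 replicates rather than 1,000 to keep the run time short, considers only λ = 0.1, and the function names, seed, and bisection solver are illustrative assumptions.

```python
import math
import random

def mle_constant_c(d, c):
    """Closed-form MLE (4) when every observation time equals c."""
    n = len(d)
    return math.log(n / (n - sum(d))) / c

def mle_general(c, d, tol=1e-8):
    """Solve the score equation U(lambda) = 0 by bisection."""
    event_times = [ci for ci, di in zip(c, d) if di == 1]
    no_event_time = sum(ci for ci, di in zip(c, d) if di == 0)

    def score(lam):
        return sum(ci / math.expm1(lam * ci) for ci in event_times) - no_event_time

    lo, hi = 0.0, 1.0
    while score(hi) > 0:               # expand the bracket until U(hi) <= 0
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate(lam, scenario, n=1000, reps=200, seed=1):
    """Return the empirical bias of the MLE and of the crude estimator."""
    rng = random.Random(seed)
    mle_sum = crude_sum = 0.0
    for _ in range(reps):
        if scenario == 1:
            c = [1.0] * n                                 # C_i fixed at 1 year
        else:
            c = [rng.expovariate(0.2) for _ in range(n)]  # C_i exponential, hazard 0.2
        # D_i = 1 if the event time T_i falls at or before the observation time
        d = [1 if rng.expovariate(lam) <= ci else 0 for ci in c]
        crude_sum += sum(d) / sum(c)
        mle_sum += mle_constant_c(d, 1.0) if scenario == 1 else mle_general(c, d)
    return mle_sum / reps - lam, crude_sum / reps - lam

lam = 0.1
results = {s: simulate(lam, s) for s in (1, 2)}
for s, (bias_mle, bias_crude) in results.items():
    print(f"scenario {s}: relative bias MLE = {bias_mle / lam:+.3f}, "
          f"crude = {bias_crude / lam:+.3f}")
```

Only (C_i, D_i) are passed to the estimators, mirroring the cross-sectional setting in which the event times T_i are never observed.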

Results are summarized in Table 1. As expected, the crude estimator $\tilde{IR}$ underestimated the true incidence, whereas the MLE was approximately unbiased. In general, the bias of the crude estimator tended to be roughly one to two orders of magnitude larger (in absolute value) than that of the MLE. Note that while the bias of the crude estimator decreases as λ gets smaller, the relative bias can still be substantial even when the incidence is low. For example, in the second scenario the relative bias of the crude estimator is −0.33 when λ = 0.1, i.e., using the crude estimator results in underestimating the true incidence by 33%. By contrast, the relative bias of the MLE $\hat{λ}$ was no greater than 1% for all λ values in both scenarios. Whether the bias of the crude estimator is meaningful in practice will of course depend on the circumstances. Certainly in some settings, especially in public health and policy contexts, underestimating incidence by 33% could be considered consequential.

Table 1.

Empirical bias and relative bias for the MLE $\hat{λ}$ and the crude estimator $\tilde{IR}$

Scenario	Incidence rate, 100λ	Proportion censored	Bias, MLE: $100 (\hat{λ} - λ)$	Bias, crude: $100 (\tilde{IR} - λ)$	Relative bias, MLE: $(\hat{λ} - λ) / λ$	Relative bias, crude: $(\tilde{IR} - λ) / λ$
1	50	0.61	0.15	−10.6	0.00	−0.21
1	40	0.67	0.14	−6.95	0.00	−0.17
1	30	0.74	−0.03	−4.12	0.00	−0.14
1	20	0.82	−0.07	−1.94	0.00	−0.10
1	10	0.90	0.05	−0.44	0.01	−0.04
1	2	0.98	−0.01	−0.03	0.00	−0.01
2	50	0.29	0.03	−35.7	0.00	−0.71
2	40	0.33	−0.03	−26.7	0.00	−0.67
2	30	0.40	0.03	−18.0	0.00	−0.60
2	20	0.50	−0.01	−10.0	0.00	−0.50
2	10	0.67	0.01	−3.34	0.00	−0.33
2	2	0.91	0.00	−0.19	0.00	−0.09
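The relative bias of the crude estimator in Table 1 can be cross-checked against a large-sample calculation. The sketch below is not from the original report; it assumes that, for large n, the crude estimator settles near E(D_i)/E(C_i). With C_i fixed at c this ratio is {1 − exp(−λc)}/c, and with C_i exponential with hazard μ, independent of T_i, it is {λ/(λ + μ)}/(1/μ). Function names are hypothetical.

```python
import math

def limiting_rel_bias_fixed(lam, c=1.0):
    """Large-sample relative bias of the crude estimator when C_i = c."""
    return (1.0 - math.exp(-lam * c)) / (lam * c) - 1.0

def limiting_rel_bias_exponential(lam, mu=0.2):
    """Large-sample relative bias when C_i is exponential with hazard mu.

    E(D_i) = lam / (lam + mu) and E(C_i) = 1 / mu, so the limiting crude
    estimate is lam * mu / (lam + mu).
    """
    return mu / (lam + mu) - 1.0

for lam in (0.5, 0.4, 0.3, 0.2, 0.1, 0.02):
    print(f"lambda = {lam:.2f}: scenario 1 = {limiting_rel_bias_fixed(lam):+.2f}, "
          f"scenario 2 = {limiting_rel_bias_exponential(lam):+.2f}")
```

Under these assumptions the limiting values fall close to the simulated relative bias in the crude column of Table 1, which suggests that the bias is structural rather than a small-sample effect.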


Conclusion

Under cross-sectional data with exact event times unknown, the MLE of the IR is straightforward to calculate, more accurate than the crude IR estimator, and statistically valid (i.e., consistent) provided the hazard is constant.

Highlights.

In cross-sectional studies, participants may be evaluated at a single time point to identify whether they have previously experienced some event of interest.
The usual (crude) incidence rate estimator, number of events divided by total person-time of follow-up, is in general biased when applied to cross-sectional data.
An alternative is to use a maximum likelihood estimator (MLE), which is statistically valid (i.e., consistent) under certain assumptions.
The MLE does not in general have a closed form, but it is easy to compute using standard statistical software.
Simulation results are presented, demonstrating that the crude incidence rate estimator is biased whereas the MLE is approximately unbiased for the true incidence rate given cross-sectional data.

Acknowledgment

This work was supported by grant U24HD089880 from the National Institutes of Health.

Abbreviations and Acronyms

IR: incidence rate
MLE: maximum likelihood estimator
CI: confidence interval

Appendix

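# R
library(survival)
library(dplyr)
# prepare data: ds has columns event (1 = event, 0 = no event) and obs_time
# event = 1 is left censored at obs_time; event = 0 is right censored at obs_time
ds <- ds %>%
  mutate(Left = ifelse(event == 1, NA, obs_time)) %>%
  mutate(Right = ifelse(event == 1, obs_time, NA))
surv_object <- Surv(time = ds$Left, time2 = ds$Right, type = "interval2")
# fit intercept only AFT model
fit <- survreg(surv_object ~ 1, dist = "exponential")
# MLE of the incidence rate: exponentiate the negative of the intercept estimate
lambda_hat <- exp(-coef(fit))
# large sample 95% CI for the incidence rate
se <- sqrt(diag(vcov(fit)))[1]
lambda_hat * exp(c(-1.96, 1.96) * se)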

* SAS;
* prepare data: event=1 is left censored, event=0 is right censored;
data ds;
  set ds;
  if event=1 then
    do; Left=.; Right=obs_time; end;
  else
    do; Left=obs_time; Right=.; end;
run;
* fit intercept only AFT model;
proc lifereg data=ds;
  model (Left, Right) = / dist=exponential;
run;

Footnotes

Conflict of Interest: None declared.

References

1. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008.
2. Collett D. Modelling survival data in medical research. 3rd ed. Boca Raton, FL: Chapman & Hall/CRC; 2015.
3. Data and Specimen Hub, Eunice Kennedy Shriver National Institute of Child Health and Human Development. Network-Wide Assessment of Current Health Status and Behavioral Risk Factors: An Expanded Study for New Sites in ATN III (ATN 106). Rockville, MD: Eunice Kennedy Shriver National Institute of Child Health and Human Development; https://dash.nichd.nih.gov/study/13894/. Accessed October 17, 2019.
4. Fleiss JL, Levin BA, Paik MC. Statistical Methods for Rates and Proportions. 3rd ed. Hoboken, NJ: Wiley-Interscience; 2003.
5. Allison P. Survival Analysis Using SAS: A Practical Guide. 2nd ed. Cary, NC: SAS Institute Inc.; 2010.
6. Therneau T. A Package for Survival Analysis in S, version 2.38. Vienna, AT: R Foundation for Statistical Computing; https://CRAN.R-project.org/package=survival. Accessed October 17, 2019.
7. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, AT: R Foundation for Statistical Computing; 2018. https://www.R-project.org. Accessed October 17, 2019.
