PLOS Computational Biology. 2025 Sep 5;21(9):e1013438. doi: 10.1371/journal.pcbi.1013438

A history-dependent approach for accurate initial condition estimation in epidemic models

Dongju Lim 1,2,#, Kyeong Tae Ko 3,#, Hyukpyo Hong 4, Hyojung Lee 3, Boseung Choi 2,5,6, Won Chang 7,8, Sunhwa Choi 9,*, Jae Kyoung Kim 1,2,10,*
Editor: Nik J Cunniffe11
PMCID: PMC12445537  PMID: 40911645

Abstract

Mathematical modeling is a powerful tool for understanding and predicting complex dynamical systems, ranging from gene regulatory networks to population-level dynamics. However, model predictions are highly sensitive to initial conditions, which are often unknown. In infectious disease models, for instance, the initial number of exposed individuals (E) at the time the model simulation starts is frequently unknown. This initial condition has often been estimated using an unrealistic, history-independent assumption for simplicity: the chance that an exposed individual becomes infectious is the same regardless of the timing of their exposure (i.e., exposure history). Here, we show that this history-independent method can yield serious bias in the estimation of the initial condition. To address this, we developed a history-dependent initial condition estimation method derived from a master equation expressing the time-varying likelihood of becoming infectious during a latent period. Our method consistently outperformed the history-independent method across various scenarios, including those with measurement errors and abrupt shifts in epidemics, for example, due to vaccination. In particular, our method reduced estimation error by 55% compared to the previous method in real-world COVID-19 data from Seoul, Republic of Korea, which includes likely infection dates, allowing us to obtain the true initial condition. This advancement of initial condition estimation enhances the precision of epidemic modeling, ultimately supporting more effective public health policies. We also provide a user-friendly package, Hist-D, to facilitate the use of this history-dependent initial condition estimation method.

Author summary

Accurately predicting infectious disease spread requires knowing the initial number of individuals in the exposed compartment at the start of the simulation (E(t0)), but this number is usually unknown. A common method to estimate E(t0) assumes that the chance of an exposed individual becoming infectious is the same, regardless of when they were exposed. However, this unrealistic assumption can lead to serious errors in the estimation of E(t0). To solve this problem, we developed a method that considers exposure timing. Our method successfully estimated E(t0) even with measurement errors or sudden changes in outbreak conditions. In particular, our approach accurately estimated E(t0) for COVID-19 data from Seoul that includes likely infection dates, which allowed us to obtain the true initial condition. This advance in initial condition estimation will help improve epidemic predictions and public health strategies. Our method can also be applied to estimate initial conditions in systems where timing or history matters, such as protein maturation or cell degradation pathways. To facilitate the broad adoption of our method, we have also developed and released Hist-D, a user-friendly software package.

Introduction

Epidemic dynamics have been successfully explained by harnessing mathematical models such as the Susceptible–Exposed–Infectious–Removed (SEIR) model [1–3]. These models predict the future exposed or infectious population over time, allowing predictions of disease spread and the formulation of appropriate public health policies [4–6]. However, these predictions from mathematical models, particularly those based on ordinary differential equations (ODEs), are highly sensitive to the initial condition used, such as the initial number of exposed (E) and infectious (I) individuals. Variations in these initial conditions lead to differences in the simulation of epidemic dynamics [7], ultimately affecting the subsequent estimations of epidemiological parameters such as the reproduction number.

Despite the importance of accurate initial conditions for the predictive power of the model, the initial values for some compartments of the model are usually unknown. In particular, the initial condition of the exposed compartment (E) is generally unknown, as determining how many people are actually exposed requires extensive contact tracing, whose complexity increases exponentially with the number of contacts [8]. As a result, previous studies have often subjectively determined the initial condition [9,10]. Some studies minimized this subjectivity by treating initial conditions as free parameters and estimating them [11,12], or by using various potential values for the initial conditions and selecting the one whose subsequent simulation best fits the data [13,14]. However, this approach is computationally intensive. An alternative approach estimates the initial condition of E using the known daily incidence of becoming infectious [15]. This approach is consistent with the fundamental assumption of a standard SEIR model: the daily number of new infectious people is the product of (i) the population in the exposed compartment and (ii) the rate of progression to the infectious stage, which is the reciprocal of the mean latent period. Under this assumption, the initial condition of E can be estimated by multiplying the daily number of newly infectious individuals by the average latent period [15]. However, this method does not account for the different timing of exposure among people in compartment E, instead assuming the same likelihood of transitioning to the infectious stage for all individuals regardless of when they were exposed (i.e., exposure history). Thus, we refer to this method as the History-Independent estimation (Hist-I) throughout this study.

Relaxing the unrealistic history-independent assumption of a standard SEIR model requires two components: (i) a model reflecting the changing likelihood of transitioning to the infectious stage, and (ii) accurate initial conditions for such a model. The first component has been extensively studied through approaches such as the method of stages [16], linear chain tricks [17,18], and delay differential equation (DDE) models [19]. Applying these models, particularly the DDE-based model, enabled more accurate estimation of epidemic parameters by incorporating individual exposure history [19]. However, despite this advantage, a method for accurately estimating their initial conditions, particularly the number of exposed individuals, has been lacking.

Here, we developed a history-dependent method for estimating the initial condition of E, Hist-D, that considers the exposure history. Specifically, we estimated the initial condition of E by finding the solution of the formula expressing the time-varying likelihood of becoming infectious during a latent period. When applying this approach to simulation data mimicking the latent period of COVID-19, Hist-D outperformed Hist-I under various conditions, including scenarios without measurement errors, with measurement errors, and with abrupt changes in the epidemic phase. Furthermore, when we applied Hist-D to real-world COVID-19 data from Seoul, South Korea, the error in initial condition estimation was reduced by 55% compared to Hist-I. As our approach provides a more accurate estimation of the initial condition of E, it will lead to a more precise understanding of epidemic dynamics, ultimately enabling more effective public health policies. To facilitate the application of Hist-D, we developed a user-friendly package.

Results

The history-independent method is inaccurate when the latent period is non-exponential

Mathematical models, even with identical parameter values, can yield different simulation results depending on their initial conditions. Consequently, using different initial conditions to fit the same model to identical data can also yield different estimates of key parameters, such as the transmission rate (β) or the reproduction number in the SEIR model (Fig 1a inset). For example, if the initial condition for the exposed population is reduced to 25% of the original value, this leads to a 33.9% relative error in estimating the reproduction number under the parameter condition mimicking the COVID-19 dynamics, highlighting the importance of setting accurate initial conditions (S1a Fig). How the bias in initial conditions evolves in the estimation of the reproduction number is illustrated in S2 Text and S1b Fig.

Fig 1. Estimating initial conditions for the SEIR model.

(a) Schematic of the SEIR mathematical model, including the susceptible (S), exposed (E), infectious (I), and removed (R) individuals, which effectively explains epidemic dynamics. Fitting the SEIR model with observed time series (white dots) from t0 enables the estimation of crucial parameters in the epidemic such as the reproduction number. This estimation strongly depends on the initial condition at t0 (e.g., E(t0)), the starting point of the model simulation (See Supplementary Information for more details). (b) The initial condition of E (E(t0) in red) can be determined by summing up the daily change of E (ΔE) from the beginning of the disease (0) up to t0 (red arrow). However, this requires daily incidence of exposure (fSE) and daily incidence of becoming infectious (fEI) data before t0, which are often unknown. This highlights the need for a method to estimate the initial condition of E using only the available daily data on infectious individuals from time t0 onward (green arrow). (c) To address this limitation, previous studies estimated the initial condition of E (E(t0) in red) by multiplying fEI at t0 and the mean latent period (𝔼[τL]) (green arrow). (d) However, while this History-Independent estimation (Hist-I) method provides an accurate estimation (red dots) if the latent period follows the exponential distribution (left), it becomes less reliable for the gamma distribution (right) observed in many infectious diseases, whereby an individual is more likely to transition from exposed to infectious the longer their time since exposure.

To calculate this initial condition accurately, it is first necessary to precisely determine the beginning time of the disease (time = 0) and estimate how the epidemic dynamics have changed from that point up to the start of the SEIR model simulation (time = t0) (Fig 1b). For example, to determine the initial condition for the exposed population (E), it is essential to track how E has changed from the beginning of the disease to the start of the simulation. Tracking these daily changes requires knowing how many people became exposed each day (fSE) and how many transitioned to the infectious stage, leaving the E compartment (fEI). This information can be derived from exposure history (the date of exposure) and the timing of infectiousness among exposed individuals. However, collecting such data, particularly examining the exposure history of each exposed individual, becomes increasingly challenging as the disease progresses because it requires labor-intensive contact tracing. Consequently, the only available data for determining the initial condition is typically the fEI after the data collection starts (Fig 1b) [20], making it challenging to set accurate initial conditions.

To overcome these limitations and set the initial conditions using the available data, the assumption underlying the standard SEIR model can be used (Fig 1c; See S2 Text for more details). Specifically, fEI(t0) is the product of exposed populations (E(t0)) and the rate of becoming infectious (κ), which is the inverse of the average latent period (𝔼[τL]=1/κ). This assumption naturally leads to the equation E(t0)=fEI(t0)×𝔼[τL]: the initial condition of E can be estimated by multiplying the average latent period (𝔼[τL]) by the fEI at the initial time point (fEI(t0)) [15].
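As a minimal illustration of the Hist-I relation E(t0) = fEI(t0) × 𝔼[τL], the sketch below computes the estimate for a single day. The function name and the incidence value of 100 are hypothetical; the mean latent period is the product of the gamma shape (4.06) and scale (1.35) used in the simulation study of this paper.

```python
def hist_i_estimate(f_ei_t0: float, mean_latent_period: float) -> float:
    """History-independent estimate: E(t0) = fEI(t0) * E[tau_L]."""
    if f_ei_t0 < 0 or mean_latent_period <= 0:
        raise ValueError("incidence must be >= 0 and mean latent period > 0")
    return f_ei_t0 * mean_latent_period


# Mean of a gamma distribution = shape * scale (about 5.48 days here).
mean_latent = 4.06 * 1.35

# Hypothetical daily incidence of becoming infectious at t0.
estimate = hist_i_estimate(100.0, mean_latent)
print(f"Hist-I estimate of E(t0): {estimate:.1f}")
```

Only the single value fEI(t0) enters the estimate, which is why the method cannot reflect how long the currently exposed individuals have already been in compartment E.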

This method, Hist-I, follows the core assumption of the standard SEIR model that does not consider the time when individuals are exposed (i.e., exposure history), assuming everyone experiences the same chance of becoming infectious regardless of their exposure history (Fig 1d). This history-independent (or memoryless) assumption is well-suited for the scenario where the latent period follows an exponential distribution. Conversely, when the latent period follows a non-exponential distribution (e.g., a gamma distribution), the memoryless property is lost, leading to inaccuracies in the Hist-I method (Fig 1d). However, most infectious diseases exhibit a gamma-distributed latent period [21–27] (Fig 1d), meaning that the longer the time since an individual was exposed, the more likely they will transition to becoming infectious (i.e., the history-independent assumption does not hold in reality). This highlights the need for a new method that accounts for this variability in the chance of becoming infectious, depending on the individual’s exposure history.
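The loss of the memoryless property can be made concrete by comparing hazard rates, i.e., the instantaneous chance of becoming infectious given that an individual is still in the latent period. The sketch below is illustrative only: it assumes an integer gamma shape of 4 (an Erlang distribution, chosen so that the survival function has a closed form) rather than the 4.06 used elsewhere in this study, and the function names are hypothetical.

```python
import math


def exponential_hazard(t: float, mean: float) -> float:
    """Hazard of an exponential latent period: constant, independent of t."""
    return 1.0 / mean


def erlang_hazard(t: float, shape: int, scale: float) -> float:
    """Hazard (pdf / survival) of a gamma distribution with integer shape."""
    x = t / scale
    pdf = x ** (shape - 1) * math.exp(-x) / (math.factorial(shape - 1) * scale)
    survival = math.exp(-x) * sum(x ** n / math.factorial(n) for n in range(shape))
    return pdf / survival


days = [1, 3, 6, 12]  # time since exposure, in days
exp_hazards = [exponential_hazard(t, mean=5.4) for t in days]
gamma_hazards = [erlang_hazard(t, shape=4, scale=1.35) for t in days]

for t, h_exp, h_gam in zip(days, exp_hazards, gamma_hazards):
    print(f"day {t:2d}: exponential hazard {h_exp:.3f}, gamma hazard {h_gam:.3f}")
```

Under the exponential assumption the hazard is identical on every day, whereas under the gamma assumption it rises with the time since exposure, which is the history dependence that Hist-I ignores.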

A framework for estimating initial condition in history-dependent manner

To determine the initial conditions in a history-dependent manner (Fig 2a), we first utilized an equation that represents the relationship between the number of individuals newly exposed each day (fSE(t) in Fig 2b) and the number of individuals leaving the compartment E and becoming infectious (fEI(t) in Fig 2b), which is given by the data (Fig 2b (i)). This equation uses convolution to express the fact that after being exposed, each individual becomes infectious and leaves compartment E after a latent period (τL) following a specific probability distribution (gτL; e.g., Gamma) (Fig 2b (i)). In this way, it directly accounts for different exposure histories among exposed individuals.
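A discretized version of this convolution can be sketched as follows. This is an illustration under stated assumptions, not the Hist-D implementation: the function name, the daily latent period probabilities, and the exposure series are all hypothetical, and exposures before the first index are ignored here.

```python
def daily_infectious_incidence(f_se, latent_pmf):
    """Discrete convolution: f_ei[t] = sum over s of f_se[t - s] * latent_pmf[s].

    latent_pmf[s] is the probability that the latent period lasts s days.
    Exposures before index 0 are not represented in this sketch.
    """
    f_ei = []
    for t in range(len(f_se)):
        total = 0.0
        for s, p in enumerate(latent_pmf):
            if t - s >= 0:
                total += f_se[t - s] * p
        f_ei.append(total)
    return f_ei


# Hypothetical example: 10 people exposed on day 0, none afterwards.
f_se = [10, 0, 0, 0, 0]
latent_pmf = [0.0, 0.2, 0.5, 0.3]  # latent period of 1, 2 or 3 days
f_ei = daily_infectious_incidence(f_se, latent_pmf)
print(f_ei)
```

The single cohort exposed on day 0 leaves compartment E spread over days 1 to 3 according to the latent period distribution, so each day's fEI depends on when individuals were exposed.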

Fig 2. Schematic figure for deriving the loss function to estimate the initial condition.

(a) To address the limitation of the history-independent method (left), we developed a novel history-dependent method (right). (b) (i) We established the connection between the known data, fEI, and the unknown fSE by treating the fEI as a convolutional output of fSE and the probability density function of the latent period, gτL. (ii) By discretizing this relationship and (iii) assuming fSE remains constant before t0, we can express the known fEI as a linear combination of unknown E(t0−1) and unknown fSE with known coefficients P_{k−j} and Q_{k+1}. P_{k−j} represents the probability of an individual having a latent period of exactly k−j days, while Q_{k+1} represents the probability of the latent period being longer than or equal to k+1 days. P_{k−j} and Q_{k+1} can be obtained by integrating the convolution of gτL and 1[0,1], where 1[0,1] represents the characteristic function supported on [0,1] (See Methods for more details). (c) Extending the linear combination expression to the whole data (i.e., fEI(t) for t = t0+1, …), we can construct a matrix that describes the relationship between known data and unknown parameters. (d) We utilized this matrix equation that E(t0−1) must satisfy to establish the data loss function, then sought to minimize this data loss by finding optimal values for unknown parameters, including E(t0−1). However, as the number of unknown parameters (n+2) exceeds the number of equations (n+1), the parameters cannot be determined solely from the data loss. This leads us to incorporate the regularization loss for the fSE parameters, which aims to smooth the fSE parameters by minimizing their second-order derivatives. Consequently, by finding the parameters that minimize the total loss function (L), which includes both the data loss and the regularization loss, we can estimate E(t0−1). By adding the difference between the daily incidence of exposure (fSE(t0)) and the daily incidence of becoming infectious at t0 (fEI(t0)) to E(t0−1), we finally get the initial condition of E.

After discretizing this equation (Fig 2b (ii)), and assuming that individuals were being exposed at a constant rate before time t0 (Fig 2b (iii)), we were able to express the given data (fEI(t0+k), k = 0, …, n) as a linear combination of E(t0−1) and the daily incidence of exposure after time t0 (fSE(t), t = t0, t0+1, …, t0+n) (See Methods for more details). The coefficient corresponding to E(t0−1) in this linear combination (Q_{k+1} in Fig 2b (iii)) represents the probability that the latent period is longer than or equal to k+1 days. This expresses that individuals exposed before t0−1 must go through a latent period longer than or equal to k+1 days to become infectious at time t0+k. Conversely, the coefficient corresponding to fSE(t0+j) in the linear combination (P_{k−j} in Fig 2b (iii)) represents the probability that the latent period is k−j days. This reflects that an individual exposed at t0+j must go through a latent period of k−j days to become infectious at time t0+k.

We can write these relationships for all given data (fEI(t0), …, fEI(t0+n)), and by combining them, we can express the equations in the form of a matrix (Fig 2c). By finding the value of E(t0−1) that satisfies this matrix equation, we can estimate E(t0−1). To achieve this, we created a data loss function that becomes minimal when both sides of the matrix equation are equal (Fig 2d), then sought to minimize this data loss by finding optimal values for the unknown parameters (E(t0−1), fSE(t0), …, fSE(t0+n)). However, as the number of parameters to be estimated is n+2, which is one more than the number of data points, n+1, there are infinitely many parameter combinations that minimize the data loss. To identify a single parameter combination that is close to the true values among these infinitely many combinations, we added a regularization loss term (Fig 2d). This regularization term penalizes the second derivative of fSE, ensuring that the estimated fSE does not exhibit abrupt changes. Consequently, by finding the combination of E(t0−1), fSE(t0), …, fSE(t0+n) that minimizes the loss function, which includes both the data loss and the regularization loss, we can estimate E(t0−1). By adding the change in the number of exposed people at time t0 (fSE(t0) − fEI(t0)) to E(t0−1), we finally estimated the initial condition of E (E(t0)) (Fig 2d).
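The relationships described in this section can be written compactly as follows. This is a sketch based on a literal reading of the text, not the exact expressions of the Methods: the squared-error form of the data loss, the discrete second difference used for the second derivative, and the weight λ are assumptions introduced here for illustration, and the precise definitions of the coefficients P and Q are those given in Methods.

```latex
% Linear relation between data and unknowns, for k = 0, ..., n
f_{EI}(t_0+k) \;=\; Q_{k+1}\,E(t_0-1) \;+\; \sum_{j=0}^{k} P_{k-j}\, f_{SE}(t_0+j)

% Total loss = data loss + regularization loss
% (\lambda > 0 is an assumed weight, not specified in the text)
L \;=\; \sum_{k=0}^{n} \Big( f_{EI}(t_0+k) - Q_{k+1}\,E(t_0-1)
        - \sum_{j=0}^{k} P_{k-j}\, f_{SE}(t_0+j) \Big)^{2}
   \;+\; \lambda \sum_{j=1}^{n-1} \big( f_{SE}(t_0+j+1) - 2 f_{SE}(t_0+j)
        + f_{SE}(t_0+j-1) \big)^{2}

% Final step: advance the estimate from t_0 - 1 to t_0
E(t_0) \;=\; E(t_0-1) + f_{SE}(t_0) - f_{EI}(t_0)
```

The first line gives n+1 equations in n+2 unknowns, which is why the data loss alone does not identify a unique solution and a smoothness penalty on fSE is added.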

The new history-dependent method outperforms the history-independent method

We evaluated whether our new history-dependent method, unlike Hist-I, can provide accurate estimates of the initial condition of E when the latent period follows a gamma distribution. To do this, we simulated an SEIR model whose latent period follows the gamma distribution with shape 4.06 and scale 1.35 [24], from t=0 to t=200 (Fig 3a). We then extracted the value of E and the number of people transitioning from E to I (fEI) at each time point t (see Methods for more details).

Fig 3. Hist-D outperforms Hist-I, regardless of the phase transition of epidemic dynamics and noise.

(a) The trajectory of E and the daily incidence of becoming infectious (fEI) were simulated through the SEIR model whose latent period follows the gamma distribution with shape 4.06 and scale 1.35 (See Methods for more details). (b) Simulated fEI was then utilized to estimate the E(t0) and compare History-Independent estimation (Hist-I) and History-Dependent estimation (Hist-D). Hist-I utilizes data from only a single day, t0, while Hist-D uses data from 2×𝔼[τL] consecutive days after t0, where 𝔼[τL] is the mean latent period. (c) The graph comparing the true E(t0) (light gray-colored bars) and the estimated E(t0) (Ê(t0)). (d) The scatter plot displaying the error (Ê(t0) − E(t0)) across different levels of true E(t0). Estimation from Hist-D (green squares) has a much lower error compared to Hist-I (red triangles). (e) The graph showing the root mean squared error (RMSE) (bars) and the mean absolute percentage error (MAPE) (line) of Hist-I and Hist-D. When Hist-D was utilized, RMSE and MAPE were reduced by 86% and 85%, respectively, compared to Hist-I. (f) To better reflect the real-world situation with observation noise in given data, we applied multiplicative noise (e^{U(−σ,σ)}), where U(−σ,σ) is the uniform distribution on (−σ,σ), to the simulated fEI data used in (c-e) and compared the accuracy of Hist-I and Hist-D. (g) The scatter plots displaying the estimation error at the noise level σ=0.1. The error of both Hist-I and Hist-D increased proportionally to the level of true E(t0), and this was especially pronounced for Hist-I (top). In addition, compared to the zero-noise level case (i.e., the case in (c-e)), the error increment of Hist-D was lower than that of Hist-I (bottom). (h) The graph showing the RMSE (bars) and MAPE (line) of Hist-I and Hist-D across the different noise levels (σ = 0, 0.1, 0.2, 0.3). Hist-D achieved a lower RMSE and MAPE than Hist-I across all noise levels. (i) To mimic a transition in epidemic dynamics, we abruptly changed the transmission rate, β, from β0 to β1 at a single point (top), and simulated fEI data (middle), which were then used to investigate the accuracy of Hist-I and Hist-D. (j) The scatter plot showing the error of Hist-I and Hist-D when the transmission rate has been doubled. Hist-D outperformed Hist-I. (k) The graph showing the RMSE (bars) and MAPE (line) of Hist-I and Hist-D across the different fold changes (β1 / β0 = 1/3, 1/2, 1, 2, 3). Hist-D consistently outperformed Hist-I across all fold changes. In particular, when β was reduced to 1/3, the absolute increase in RMSE and MAPE for Hist-D was 22% and 19% that of Hist-I, respectively, demonstrating the robustness of Hist-D to sudden changes in β.

With this data, we estimated the initial condition E(t0) from the given fEI data using the history-independent method (Hist-I) and the history-dependent method (Hist-D). First, the Hist-I method estimates the E(t0) by multiplying the mean latent period (i.e., 4.06 × 1.35 = 5.48) by the value of fEI at t0 (Fig 3b). For example, we estimated E(80), E(81), E(82), …, E(190) by multiplying 5.48 by the respective values of fEI at t = 80, 81, 82, …, 190. The second method, History-Dependent estimation (Hist-D), estimated the E(t0) value that minimized the loss function (Fig 2d) with the data of fEI for 2 × mean latent periods ≈ 10 days after t0 (Fig 3b). For example, when estimating E(80), we used fEI data from t=80 to t=90, and for estimating E(190), we used fEI data from t=190 to t=200. Note that while 2 × mean latent periods are used in this study, the length of data can be adjusted by users. Both Hist-I and Hist-D assume that fEI data only exists after the time point t0, which is the start of the SEIR model simulation.

Using these methods (Hist-I and Hist-D), we estimated the E(t0) for t0 = 80, 81, …, 190 and compared them with their true values (Fig 3c). As a result, Hist-D was much more accurate than Hist-I (Fig 3d), reducing the root mean squared error (RMSE) and mean absolute percentage error (MAPE) by 86% and 85%, respectively (Fig 3e). Similar improvements were also observed during the earlier phase (t0 = 10, …, 80) of epidemic growth (S3 Fig). This superiority of Hist-D persisted under various parameter conditions (See S1 Table) and even after modifications were made to the Hist-I method by summing up the future daily incidence of becoming infectious, as done in a previous study [28] (see S3 Text and S2 Fig). Consequently, we focused our analysis on the Hist-I method rather than its modified version.
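For reference, the two error metrics used throughout can be computed as in the minimal sketch below, using the standard definitions of RMSE and MAPE. The function names and the example values are hypothetical and are not taken from the study data.

```python
import math


def rmse(true_vals, est_vals):
    """Root mean squared error between true and estimated E(t0)."""
    n = len(true_vals)
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(true_vals, est_vals)) / n)


def mape(true_vals, est_vals):
    """Mean absolute percentage error (in %); true values must be non-zero."""
    n = len(true_vals)
    return 100.0 * sum(abs(e - t) / abs(t) for t, e in zip(true_vals, est_vals)) / n


# Hypothetical true and estimated values of E(t0) on two days.
true_e = [100.0, 200.0]
est_e = [110.0, 180.0]
print(f"RMSE = {rmse(true_e, est_e):.2f}, MAPE = {mape(true_e, est_e):.1f}%")
```

RMSE is dominated by days with large E(t0), whereas MAPE weights every day by its relative error, so reporting both guards against either metric being driven by the epidemic peak alone.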

Hist-D demonstrated superior accuracy compared to the Hist-I method under ideal conditions without measurement errors. However, real-world situations differ from simulations, as measurement errors are always present. To simulate a scenario with observation errors, we introduced multiplicative noise to the given data (fEI) and used this data to estimate E(t0) with Hist-I and Hist-D (Fig 3f). When the noise level was 0.1 (σ=0.1 in Fig 3f), due to the effect of the multiplicative noise, the error increased as E(t0) grew larger for both methods (Fig 3g, top). However, Hist-D still maintained smaller errors compared to Hist-I (Fig 3g, top). Furthermore, Hist-D demonstrated greater resilience to increasing noise levels compared to Hist-I, exhibiting a smaller error amplification as the noise intensified from 0 to 0.3 (Fig 3g, bottom). This higher accuracy persisted across all tested noise levels, ranging from 0 to 0.3 in 0.1 intervals (Fig 3h). These results suggest that Hist-D is well suited to real-world applications with measurement errors in observed data.
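The multiplicative noise described here, in which each data point is multiplied by the exponential of a draw from the uniform distribution on (−σ, σ), can be sketched as follows. The function name, the seed, and the incidence values are hypothetical.

```python
import math
import random


def add_multiplicative_noise(f_ei, sigma, rng):
    """Multiply each value by exp(u), with u drawn uniformly from (-sigma, sigma)."""
    return [y * math.exp(rng.uniform(-sigma, sigma)) for y in f_ei]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
f_ei = [50.0, 80.0, 120.0, 90.0]  # hypothetical noise-free daily incidence
noisy = add_multiplicative_noise(f_ei, sigma=0.1, rng=rng)
print([round(v, 1) for v in noisy])
```

Because the perturbation is multiplicative, its absolute size scales with the incidence itself, which is consistent with the errors of both methods growing as E(t0) becomes larger.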

Beyond the measurement error, real-world epidemics present additional complexities such as sudden changes in the epidemic phase due to social distancing, vaccination, or large-scale outbreaks of COVID-19. To reflect these changes in the simulation, we regenerated simulation data by changing the transmission rate, β, at a specific point (i.e., when E reaches its peak) and used this new simulation data to compare the accuracy of Hist-I and Hist-D (Fig 3i). When we doubled β and investigated the errors for the 20 time points before and after the changing point, both Hist-I and Hist-D showed increased errors around the point of the second peak (time = 150 – 160 in Fig 3i; the rightmost part of the graph in Fig 3j). However, the Hist-D method produced smaller errors than the Hist-I method (Fig 3j). This superior performance of Hist-D persisted across various β changes (1/3, 1/2, 2, and 3-fold), consistently achieving smaller RMSE and MAPE compared to Hist-I (Fig 3k). In particular, when β was reduced to 1/3, the absolute increase in RMSE and MAPE for Hist-D was less than half that of Hist-I (Fig 3k), demonstrating the robustness of Hist-D to sudden changes in β. Taken together, these results highlight the promising potential of Hist-D for estimating E(t0) in dynamic, real-world scenarios.

Hist-D outperforms Hist-I for real-world COVID-19 data

The results from the simulation data demonstrated the strong potential of Hist-D for accurately estimating the initial condition of E in real-world scenarios. To test this, we applied Hist-I and Hist-D to COVID-19 data from Seoul, Republic of Korea, spanning August 13th to November 25th, 2020. This data included the contact dates and symptom onset dates for people in Seoul, allowing us to empirically derive the number of people moving from S to E (fSE) and from E to I (fEI), as well as the distribution of the incubation period (i.e., the time between contact and symptom onset date) (See Methods for more details) (Fig 4a). With this information, we calculated the daily change in E (fSE − fEI) and accumulated these changes starting from the date of the first recorded case of international transmission in Korea, to compute the daily E(t0) for 2020. Then, this real E(t0) was compared with E(t0) estimated by applying Hist-I and Hist-D to the fEI data and the empirical distribution of incubation period (Fig 4b). In particular, Hist-D utilized 8 days of fEI data, approximately twice the mean incubation period, as in the case of the simulation study (Fig 3b). For the last few days, when 8 days of fEI data were unavailable, Hist-D utilized fEI data from t0 to the last available date.
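The accumulation of daily changes used to obtain the reference E(t0) amounts to a running sum of fSE − fEI. The following is a minimal sketch with hypothetical daily counts; the function name and the starting value of zero are assumptions for illustration.

```python
from itertools import accumulate


def exposed_trajectory(f_se, f_ei, e_start=0.0):
    """E after each day, obtained by accumulating the daily change fSE - fEI."""
    daily_change = [s - i for s, i in zip(f_se, f_ei)]
    # accumulate(..., initial=...) yields the starting value first; drop it.
    return list(accumulate(daily_change, initial=e_start))[1:]


# Hypothetical daily counts derived from contact and symptom onset dates.
f_se = [5, 8, 6, 2]  # newly exposed per day
f_ei = [0, 1, 4, 6]  # leaving compartment E per day
print(exposed_trajectory(f_se, f_ei))
```

This reconstruction is only possible when exposure dates are available for the whole history, which is the situation Hist-I and Hist-D are designed to work without.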

Fig 4. Hist-D provides more accurate estimates of the initial condition of E compared to Hist-I for real COVID-19 data in Seoul, Republic of Korea.

(a) We compared Hist-I and Hist-D to estimate the initial conditions of E for COVID-19 data in Seoul, Republic of Korea, from August 13 to November 25, 2020. From this data, fEI data and the distribution of the incubation period (light blue histogram) were extracted (see Methods for more details) and then used to estimate the initial condition of E with Hist-I and Hist-D. (b) The graph comparing the true E(t0) (light gray bars) and estimated E(t0) (Hist-I: red triangles, Hist-D: green squares). While both methods capture the long-term trend, Hist-I exhibits more pronounced fluctuations. (c) The scatter plot comparing the true E(t0) and estimated E(t0) (Ê(t0)). Estimation from Hist-D is closer to the perfect estimation (i.e., the black cross line, where E(t0) = Ê(t0)) than Hist-I. (d) The scatter plot displaying the error (Ê(t0) − E(t0)) across different levels of true E(t0). The error of Hist-I increased proportionally to the E(t0), while such a pattern was not manifested in Hist-D. (e) The graph showing the RMSE (bars) and the mean absolute percentage error (MAPE) (line) of Hist-I and Hist-D. Hist-D achieved a 55% lower RMSE (8.44) and a 55% lower MAPE (18.9%) than Hist-I (RMSE: 18.76, MAPE: 42.2%), demonstrating the superior performance of Hist-D in real-world epidemic data. (f) 95% credible interval and empirical coverage of our estimated values. The upper and lower horizontal lines of each box represent the upper and lower bounds of the credible interval, corresponding to the 97.5% and 2.5% quantiles, respectively. 91.3% of true values were included in the 95% credible interval of Hist-D.

While both methods effectively captured the long-term trend of E(t0), Hist-D exhibited less fluctuation compared to Hist-I (Fig 4b). Notably, Hist-D provided more accurate estimates than Hist-I during abrupt changes in E(t0) such as near Oct 31 (Fig 4b), consistent with Fig 3k. As a result, Hist-D consistently demonstrated higher accuracy than Hist-I (Fig 4c, 4d), whose error increased as the magnitude of E(t0) grew (Fig 4d). In particular, Hist-D reduced the RMSE by 55% compared to Hist-I (Fig 4e). Similarly, it decreased the MAPE by 55% (Fig 4e). These results indicate that in highly volatile real-world scenarios, Hist-D provides more accurate and reliable estimates of initial conditions than the Hist-I method.
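The 55% reductions follow directly from the RMSE and MAPE values reported in the caption of Fig 4e (Hist-I: RMSE 18.76, MAPE 42.2%; Hist-D: RMSE 8.44, MAPE 18.9%). The short check below uses a hypothetical helper function to make the arithmetic explicit.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction of `improved` relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Values reported in the caption of Fig 4e.
rmse_reduction = relative_reduction(18.76, 8.44)
mape_reduction = relative_reduction(42.2, 18.9)
print(f"RMSE reduced by {rmse_reduction:.1f}%, MAPE reduced by {mape_reduction:.1f}%")
```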

Despite the promising results, Hist-D did not achieve perfect estimations. Therefore, we checked whether the true values fell within the 95% credible interval when using Hist-D (Fig 4f; see Methods for more details). As a result, 91.3% of the true values were included within the credible interval for the Hist-D method (Fig 4f). Taken together, Hist-D demonstrates robust capabilities in precisely determining the initial condition of E, which is likely to result in a more accurate estimation of epidemic dynamics.
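Empirical coverage of this kind is the fraction of days on which the true value lies between the lower and upper bounds of the interval. A minimal sketch follows, with a hypothetical function name and hypothetical interval bounds; it does not reproduce how the credible intervals themselves are obtained, which is described in Methods.

```python
def empirical_coverage(true_vals, lower, upper):
    """Fraction of true values that fall inside [lower, upper] on each day."""
    inside = sum(lo <= t <= hi for t, lo, hi in zip(true_vals, lower, upper))
    return inside / len(true_vals)


# Hypothetical true E(t0) and 2.5% / 97.5% interval bounds on four days.
true_e = [10.0, 20.0, 30.0, 40.0]
lower = [8.0, 21.0, 25.0, 35.0]
upper = [12.0, 25.0, 35.0, 45.0]
print(f"coverage = {empirical_coverage(true_e, lower, upper):.2f}")
```

A coverage close to the nominal 95% indicates that the reported uncertainty is well calibrated; a markedly lower value would indicate intervals that are too narrow.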

Discussion

While accurate initial conditions are crucial for the SEIR model, the initial condition value of the exposed population (E) is often unknown. Thus, the initial condition of E has often been estimated with the Hist-I method. However, Hist-I does not consider the timing of exposure of the individuals in the exposed compartment (i.e., exposure history). As a result, this method yields biased estimation (Fig 1d). To resolve this problem, in this study, we developed a new history-dependent method, Hist-D (Figs 2 and 3a-3b). For the simulated data, Hist-D estimated the initial condition of E much more accurately than Hist-I (Fig 3c-3e), even with measurement errors (Fig 3f-3h) or sudden changes in epidemic phases (Fig 3i-3k). Importantly, Hist-D successfully estimated the initial condition of E in real-world COVID-19 data from Seoul, Korea, reducing estimation error by 55% compared to Hist-I (Fig 4). These findings demonstrate that Hist-D can more accurately estimate the unknown initial conditions in the SEIR model using relatively accessible data.

Although this study focused on the SEIR model, Hist-D can be applied to any compartmental model where the distribution of the transition time between two compartments is known and inflow data for the downstream compartment is available. Thus, Hist-D can be used when the daily incidence of becoming infectious and the latent period distribution are known (Fig 3), or when the daily incidence of symptom onset and the incubation period distribution are known (Fig 4). This flexibility allows Hist-D to be applied to other infectious disease models, such as SEIR-Vaccinated (SEIRV) [29,30], SEI-Quarantined-R (SEIQR) [31,32], or SE-Presymptomatic-IR (SEPIR) [33,34]. For more complex models [16] with additional substages in the exposed (E) or infectious (I) compartments, the same approach used in Hist-D can be adapted by modifying the left and right sides of the equations derived in this study (see Fig 2b and the ‘Derivation of loss function’ section of the Methods section for more details). These modifications can also be easily implemented in our Hist-D package by altering a single function (see [7] of the S1 Text for further guidance). Therefore, Hist-D offers a flexible framework that can be readily extended, and applying it to a broader class of infectious disease models represents a promising direction for future research.

Epidemic dynamics, in particular the transition from exposure to infectiousness, are inherently history-dependent (i.e., the likelihood of the transition varies with the time since exposure). However, this has been overlooked in previous studies, which employed a simple ODE model that assumes a constant chance of becoming infectious. While this history-independent representation simplifies the inference of crucial epidemiological parameters such as the reproduction number, our previous work revealed that it introduces significant bias [19]. We therefore addressed this bias by utilizing a model that describes the history-dependent dynamics [19]. Nonetheless, the advantage of using history-dependent models relies heavily on accurate initial conditions (S4 Text and S4 Fig), as these values significantly affect the model predictions. Previous methods for determining initial conditions were based on a history-independent assumption, which is inconsistent with the dynamics in history-dependent models and results in a considerable bias in the initial conditions (Fig 3) and in the subsequent estimation of the reproduction number (S4 Fig). Here, we addressed this by developing a history-dependent method for estimating the initial condition (S4 Fig). Combined with history-dependent models, this method provides the first framework that completely describes the history-dependent dynamics of infectious disease.

Hist-D employs a master equation (Fig 2b (i)), which represents the daily infectious population as a convolution of daily exposed individuals and the latent period distribution. In this study, we modified this master equation to derive the total loss function (Fig 2b-2d). In contrast, previous studies have applied this master equation without direct modification [35–37]. For example, Abbott et al. utilized a similar master equation to develop an algorithm that can estimate the sometimes-unknown daily infectious population from typically available daily confirmed cases [37]. This suggests that the applicability of our approach could be extended by combining it with the approach of Abbott et al. Specifically, our framework currently requires data on the daily incidence of becoming infectious, which are sometimes unavailable. In such cases, the daily infectious cases can be estimated from the daily confirmed cases, which are typically easier to obtain, by using the approach of Abbott et al.

Beyond infectious disease studies, other biological systems have also been studied using mathematical models incorporating delay [38–46]. These models simplify complex biological processes involving many intermediate stages by representing them as a single pathway with a time delay. For example, the complex maturation process of proteins has been replaced with a single protein production process with time delay [41,43], and the complex degradation pathway of damaged cells has been replaced with a single degradation process with delay [46]. This approach is similar to the SEIR model used in this study, where the detailed process from exposure to infectiousness has been simplified to a single process with delay (i.e., the latent period). Considering this, Hist-D can be generalized to other models incorporating delay. For instance, when modeling the levels of immature and mature proteins, Hist-D could estimate the initial condition of the immature proteins, which is often difficult to measure experimentally, by using the known data of mature proteins.

Despite the novelty of Hist-D, several limitations should be noted. First, our methods are derived from ODE-based infectious disease models, though stochastic compartmental models and network-based models are also used to better capture transmission uncertainty and detailed processes of disease progression, respectively [47,48]. Whether Hist-D can be extended to these models remains an open question, and exploring this would be a promising direction for future research. In addition, Hist-D assumes a constant daily incidence of exposure before the initial time point (see equation (8) in the Method section). Relaxing this assumption to accommodate scenarios such as superspreading events [49] or exponential growth (or decay) represents another important direction for future work.

Another limitation is that our method has been primarily validated using COVID-19 data. In addition to real data from Seoul, Korea, we used simulation data that mimics the latent period of COVID-19 with a Gamma distribution. However, the latent periods of other infectious diseases may not follow the Gamma distribution. Nonetheless, even in such cases, Hist-D can be readily adapted by simply adjusting the latent period distribution, as our derivation of the loss function does not depend on any specific distributional assumption. Therefore, we hypothesize that Hist-D can still estimate the initial condition of E with reasonably high accuracy across various latent period distributions.

Lastly, the credible intervals for Hist-D were relatively wide, indicating a high degree of uncertainty in the estimates. As such uncertainty can hinder the precise determination of initial conditions, future work should focus on reducing this uncertainty. Additionally, the empirical coverage of the 95% credible intervals was below the expected level (i.e., 95%). This discrepancy may arise from a mismatch between the real-world data generation process and the model assumptions, such as that constant exposure occurs before the initial point (equation (8)), underlying the Bayesian approach. Reducing this model deficiency through advanced statistical techniques [50] could improve empirical coverage and enhance the reliability of the estimates.

Method

Derivation of the total loss function used in Hist-D

We established the total loss function to estimate the initial conditions of exposed individuals. The derivation of this loss function starts from the master equation that characterizes the history-dependent rate of becoming infectious by incorporating a non-exponentially distributed latent period. While this can be modeled through the method of stages [16], which introduces multiple substages in the exposed compartment, it requires specifying the number of substages by fitting an Erlang distribution to the empirical distribution of the latent period, which may increase the computational cost of the optimization process. More importantly, the method of stages constrains the latent period to an Erlang distribution. To avoid this, we adopted an alternative approach, following a previous study [19], which allowed us to incorporate an arbitrary latent period distribution. In particular, the instantaneous rate of individuals becoming infectious is a convolution of the instantaneous rate of individuals exposed and the probability density function of the latent period, as follows (Fig 2b (i)):

$$\tilde{f}_{EI}(t)=\int_{0}^{\infty}\tilde{f}_{SE}(t-s)\,g_{\tau_L}(s)\,ds \tag{1}$$

where $\tilde{f}_{EI}(t)$ represents the instantaneous rate of the number of individuals becoming infectious at time $t$, $\tilde{f}_{SE}(t)$ denotes the instantaneous rate of the number of individuals exposed at time $t$, and $g_{\tau_L}(s)$ is the probability density function of the latent period. Here, we modified this equation to explicitly account for the effect of the initial condition of E on the daily infectious individuals. We first integrated the equation (1) to express the number of daily infectious individuals ($f_{EI}(t_0+k)$).

$$f_{EI}(t_0+k)=\int_{t_0+k-1}^{t_0+k}\tilde{f}_{EI}(t)\,dt=\int_{t_0+k-1}^{t_0+k}\int_{0}^{\infty}\tilde{f}_{SE}(t-s)\,g_{\tau_L}(s)\,ds\,dt \tag{2}$$

Then, we discretized the marginal number of individuals exposed in terms of the number of daily incidence of exposure, as follows (Fig 2b (ii)):

$$\tilde{f}_{SE}(t)=f_{SE}(t_0+j)\quad\text{if}\quad t_0+j-1<t\le t_0+j \tag{3}$$

This can be rewritten as follows:

$$\tilde{f}_{SE}(t)=\sum_{j=-\infty}^{\infty}f_{SE}(t_0+j)\,\mathbf{1}_{[t_0+j-1,\,t_0+j]}(t) \tag{4}$$

where $\mathbf{1}_{[t_0+j-1,\,t_0+j]}$ denotes the characteristic function supported on the interval $[t_0+j-1,\,t_0+j]$. By plugging in equation (4) to the equation (2), we derived the new equation:

$$\begin{aligned}
f_{EI}(t_0+k)&=\int_{t_0+k-1}^{t_0+k}\int_{0}^{\infty}\sum_{j=-\infty}^{\infty}f_{SE}(t_0+j)\,\mathbf{1}_{[t_0+j-1,\,t_0+j]}(t-s)\,g_{\tau_L}(s)\,ds\,dt\\
&=\sum_{j=-\infty}^{\infty}f_{SE}(t_0+j)\int_{t_0+k-1}^{t_0+k}\int_{0}^{\infty}\mathbf{1}_{[t_0+j-1,\,t_0+j]}(t-s)\,g_{\tau_L}(s)\,ds\,dt
\end{aligned} \tag{5}$$

By changing the variable from $t$ to $x$ by $t=t_0+j-1+x$, we obtain

$$\begin{aligned}
f_{EI}(t_0+k)&=\sum_{j=-\infty}^{\infty}f_{SE}(t_0+j)\int_{k-j}^{k-j+1}\int_{0}^{\infty}\mathbf{1}_{[t_0+j-1,\,t_0+j]}(t_0+j-1+x-s)\,g_{\tau_L}(s)\,ds\,dx\\
&=\sum_{j=-\infty}^{\infty}f_{SE}(t_0+j)\int_{k-j}^{k-j+1}\int_{0}^{\infty}\mathbf{1}_{[0,1]}(x-s)\,g_{\tau_L}(s)\,ds\,dx\\
&=\sum_{j=-\infty}^{\infty}f_{SE}(t_0+j)\int_{k-j}^{k-j+1}(g_{\tau_L}*\mathbf{1}_{[0,1]})(x)\,dx
\end{aligned} \tag{6}$$

where * symbol denotes the convolution. Considering that $(g_{\tau_L}*\mathbf{1}_{[0,1]})(x)=0$ for $x<0$, we finally obtain

$$f_{EI}(t_0+k)=\sum_{j=-\infty}^{k}f_{SE}(t_0+j)\int_{k-j}^{k-j+1}(g_{\tau_L}*\mathbf{1}_{[0,1]})(x)\,dx \tag{7}$$

However, this equation includes infinitely many unknown parameters ($f_{SE}(t_0+j)$, where $j$ is an integer smaller or equal to $k$), making parameter estimation challenging. To overcome this, we approximated the equation (7) by assuming individuals were exposed at a constant rate, $r$, before time $t_0$ (Fig 2b (iii)):

$$f_{SE}(t_0+j)\approx r\quad\text{for}\quad j<0 \tag{8}$$

Here, we found that this constant rate $r$ is closely related to the $E(t_0-1)$. Specifically, we can express $E(t_0-1)$ as a function of $r$:

$$E(t_0-1)=\int_{0}^{\infty}\tilde{f}_{SE}(t_0-1-x)\int_{x}^{\infty}g_{\tau_L}(y)\,dy\,dx\approx r\cdot\int_{0}^{\infty}\int_{x}^{\infty}g_{\tau_L}(y)\,dy\,dx \tag{9}$$

This equation arises from the fact that the people exposed at time $t_0-1-x$ can remain in the exposed (E) compartment at time $t_0-1$ only if their latent period ($\tau_L$) is greater than $x$. By applying Fubini’s theorem to this equation, we can further simplify the equation as follows:

$$E(t_0-1)\approx r\cdot\int_{0}^{\infty}\int_{x}^{\infty}g_{\tau_L}(y)\,dy\,dx=r\cdot\int_{0}^{\infty}\int_{0}^{y}g_{\tau_L}(y)\,dx\,dy=r\cdot\int_{0}^{\infty}y\,g_{\tau_L}(y)\,dy=r\cdot\mathbb{E}[\tau_L] \tag{10}$$

where $\mathbb{E}[\tau_L]$ is the mean of the latent period. This equation suggests that the constant rate $r$ is proportional to the $E(t_0-1)$:

$$r\approx\frac{E(t_0-1)}{\mathbb{E}[\tau_L]} \tag{11}$$
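
The identity behind equations (10) and (11), namely that the integral of the survival function of the latent period equals its mean, can be checked numerically. The following is a minimal sketch, assuming a Gamma latent period with the shape and scale quoted later in this section; the function names, the grid spacing, the integration horizon, and the rate `r` are illustrative choices and are not taken from the Hist-D package.

```python
import math

def gamma_pdf(x, shape, scale):
    """Density of a Gamma(shape, scale) distribution; zero for x <= 0."""
    if x <= 0.0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

def mean_from_survival(shape, scale, dx=0.01, horizon=100.0):
    """Approximate int_0^inf P(tau_L > x) dx with the trapezoidal rule."""
    n_steps = int(round(horizon / dx))
    cdf = 0.0
    prev_pdf = gamma_pdf(0.0, shape, scale)
    prev_survival = 1.0
    integral = 0.0
    for i in range(1, n_steps + 1):
        pdf = gamma_pdf(i * dx, shape, scale)
        cdf += 0.5 * (prev_pdf + pdf) * dx           # running CDF
        survival = 1.0 - cdf                         # P(tau_L > x)
        integral += 0.5 * (prev_survival + survival) * dx
        prev_pdf, prev_survival = pdf, survival
    return integral

shape, scale = 4.06, 1.35      # Gamma latent period used in the simulations
mean_latent = mean_from_survival(shape, scale)
r = 10.0                       # hypothetical constant exposure rate before t0
E_prev = r * mean_latent       # equation (10): E(t0 - 1) ~ r * E[tau_L]
print(mean_latent, shape * scale, E_prev)
```

The numerically integrated survival function should agree closely with the analytical Gamma mean (shape times scale), which is what allows $r$ to be replaced by $E(t_0-1)/\mathbb{E}[\tau_L]$.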

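Plugging in equations (8) and (11) to equation (7), we can derive the final equation:

$$f_{EI}(t_0+k)\approx\frac{E(t_0-1)}{\mathbb{E}[\tau_L]}\int_{k+1}^{\infty}(g_{\tau_L}*\mathbf{1}_{[0,1]})(s)\,ds+\sum_{j=0}^{k}f_{SE}(t_0+j)\int_{k-j}^{k-j+1}(g_{\tau_L}*\mathbf{1}_{[0,1]})(s)\,ds \tag{12}$$

This final equation holds for every given data ($f_{EI}(t_0),\ldots,f_{EI}(t_0+n)$), and this system of $n+1$ equations can be written in a matrix form (Fig 2c):

$$\begin{pmatrix}f_{EI}(t_0)\\ \vdots\\ f_{EI}(t_0+n)\end{pmatrix}=\frac{E(t_0-1)}{\mathbb{E}[\tau_L]}\begin{pmatrix}Q_1\\ \vdots\\ Q_{n+1}\end{pmatrix}+\begin{pmatrix}P_0&\cdots&0\\ \vdots&\ddots&\vdots\\ P_n&\cdots&P_0\end{pmatrix}\begin{pmatrix}f_{SE}(t_0)\\ \vdots\\ f_{SE}(t_0+n)\end{pmatrix} \tag{13}$$

where $Q_{k+1}=\int_{k+1}^{\infty}(g_{\tau_L}*\mathbf{1}_{[0,1]})(s)\,ds$ and $P_{k-j}=\int_{k-j}^{k-j+1}(g_{\tau_L}*\mathbf{1}_{[0,1]})(s)\,ds$. From this matrix equation, we established the data loss function which is minimal when the left and right sides of the equation (13) are similar.

$$\text{data loss}=\sum_{k=0}^{n}\left(f_{EI}(t_0+k)-\frac{E(t_0-1)}{\mathbb{E}[\tau_L]}Q_{k+1}-\sum_{j=0}^{k}f_{SE}(t_0+j)P_{k-j}\right)^{2} \tag{14}$$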
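
The weights $Q_{k+1}$ and $P_{k-j}$ that enter equations (13) and (14) depend only on the latent period distribution, so they can be precomputed. The sketch below is one possible way to do this, assuming a Gamma latent period and using the fact that $(g_{\tau_L}*\mathbf{1}_{[0,1]})(x)=\int_{x-1}^{x}g_{\tau_L}(s)\,ds=G(x)-G(x-1)$, where $G$ is the cumulative distribution function. The function names, the grid resolution, and the truncation of the infinite upper limit (`tail_days`) are assumptions made for illustration, not the implementation in the Hist-D package.

```python
import math

def gamma_pdf(x, shape, scale):
    """Density of a Gamma(shape, scale) distribution; zero for x <= 0."""
    if x <= 0.0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

def kernel_weights(n, shape, scale, steps_per_day=100, tail_days=60):
    """Return (P, Q) with P[m] = P_m and Q[k] = Q_{k+1} for m, k = 0..n."""
    spd = steps_per_day
    dx = 1.0 / spd
    m_total = (n + 1 + tail_days) * spd   # truncation of the infinite integral
    # Cumulative distribution function G on the grid (trapezoidal rule).
    cdf = [0.0] * (m_total + 1)
    prev = 0.0
    for i in range(1, m_total + 1):
        cur = gamma_pdf(i * dx, shape, scale)
        cdf[i] = cdf[i - 1] + 0.5 * (prev + cur) * dx
        prev = cur
    # h(x) = (g * 1_[0,1])(x) = G(x) - G(x - 1), with G = 0 for negative arguments.
    h = [cdf[i] - (cdf[i - spd] if i >= spd else 0.0) for i in range(m_total + 1)]

    def integrate(lo, hi):
        """Trapezoidal integral of h between grid indices lo and hi."""
        return sum(0.5 * (h[i] + h[i + 1]) * dx for i in range(lo, hi))

    P = [integrate(m * spd, (m + 1) * spd) for m in range(n + 1)]
    Q = [integrate((k + 1) * spd, m_total) for k in range(n + 1)]
    return P, Q

P, Q = kernel_weights(n=30, shape=4.06, scale=1.35)
print(P[:3], Q[:3])
```

Two sanity checks follow from the definitions: for each $k$, $Q_{k+1}+\sum_{m=0}^{k}P_m$ should be close to 1, and the sum of all $Q_{k+1}$ should be close to the mean latent period, so that the individuals in E at $t_0-1$ eventually all become infectious.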

We aimed to find the unknown parameters ($E(t_0-1),f_{SE}(t_0),\ldots,f_{SE}(t_0+n)$) by minimizing the data loss. However, as the number of unknown parameters ($n+2$) exceeds the number of equations ($n+1$), the parameters cannot be determined solely from the data loss. This leads us to incorporate an additional regularization loss for the $f_{SE}$ parameters. For this regularization, we employed the second-order difference of the daily incidence of exposure, because typically one day is insufficient to make a drastic increase or decrease in the daily change of the exposed population. As a result, we derived the final total loss function.

$$\begin{aligned}
\text{total loss}\,\big(E(t_0-1),f_{SE}(t_0),\ldots,f_{SE}(t_0+n)\big)&=\text{data loss}+\text{regularization loss}\\
&=\sum_{k=0}^{n}\left(f_{EI}(t_0+k)-\frac{E(t_0-1)}{\mathbb{E}[\tau_L]}Q_{k+1}-\sum_{j=0}^{k}f_{SE}(t_0+j)P_{k-j}\right)^{2}\\
&\quad+\sum_{k=1}^{n-1}\Big(f_{SE}(t_0+k+1)-f_{SE}(t_0+k)-\big(f_{SE}(t_0+k)-f_{SE}(t_0+k-1)\big)\Big)^{2}
\end{aligned} \tag{15}$$

Finally, we calculated the initial condition of E ($E(t_0)$) from the estimated parameters ($E(t_0-1)$, $f_{SE}(t_0)$) and available information ($f_{EI}(t_0)$) by using the following formula:

$$E(t_0)=E(t_0-1)-f_{EI}(t_0)+f_{SE}(t_0)$$
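
The total loss in equation (15) and the final update for $E(t_0)$ can be written down directly. The following is a minimal sketch; the function names are illustrative, and the kernel weights `P` and `Q` are small toy values chosen so that the arithmetic is easy to follow (they are not derived from the COVID-19 latent period).

```python
def total_loss(E_prev, f_SE, f_EI, P, Q, mean_latent):
    """Equation (15): squared data residuals plus squared second differences.

    P[m] stands for P_m and Q[k] for Q_{k+1}; E_prev stands for E(t0 - 1).
    """
    n = len(f_EI) - 1
    data_loss = 0.0
    for k in range(n + 1):
        predicted = E_prev / mean_latent * Q[k]
        predicted += sum(f_SE[j] * P[k - j] for j in range(k + 1))
        data_loss += (f_EI[k] - predicted) ** 2
    reg_loss = sum((f_SE[k + 1] - 2.0 * f_SE[k] + f_SE[k - 1]) ** 2
                   for k in range(1, n))
    return data_loss + reg_loss

def initial_exposed(E_prev, f_SE0, f_EI0):
    """E(t0) = E(t0 - 1) - f_EI(t0) + f_SE(t0)."""
    return E_prev - f_EI0 + f_SE0

# Toy kernel: a latent period of at most 4 days (illustrative values only).
P = [0.1, 0.3, 0.4, 0.2, 0.0, 0.0]
Q = [0.9, 0.6, 0.2, 0.0, 0.0, 0.0]   # Q[k] = 1 - (P[0] + ... + P[k])
mean_latent = 1.7                    # equals sum(Q) for this toy kernel

# A steady state with 20 exposures per day satisfies equation (12) exactly.
r = 20.0
f_SE = [r] * 6
f_EI = [r] * 6
E_prev = r * mean_latent

loss_at_truth = total_loss(E_prev, f_SE, f_EI, P, Q, mean_latent)
loss_if_biased = total_loss(2.0 * E_prev, f_SE, f_EI, P, Q, mean_latent)
E0 = initial_exposed(E_prev, f_SE[0], f_EI[0])
print(loss_at_truth, loss_if_biased, E0)
```

In this steady-state toy case the loss vanishes at the true parameters and becomes positive as soon as $E(t_0-1)$ is biased, which is the behaviour the minimization relies on.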

Parameter estimation from the loss function

To find the values of $E(t_0-1),f_{SE}(t_0),\ldots,f_{SE}(t_0+n)$ minimizing the total loss function (i.e., equation (15)), we utilized the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method. This algorithm is a gradient-based quasi-Newton approach designed for solving large-scale optimization problems. The optimization process incorporates boundary conditions to ensure biological plausibility. These constraints guarantee that the estimated values remain non-negative and do not exceed the maximum population size S, preserving the physical meaning of the parameters.
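
To illustrate how the box constraints interact with the minimization, the sketch below minimizes the total loss with plain projected gradient descent. This is deliberately not L-BFGS: it is a simpler stand-in that needs no external library, and it converges more slowly. The toy kernel, the synthetic data, the upper bound, and all function names are assumptions made for illustration.

```python
def build_system(f_EI, P, Q, mean_latent):
    """Write the total loss as sum((A x - b)^2) with x = [E(t0-1), f_SE(t0..t0+n)]."""
    n = len(f_EI) - 1
    rows, rhs = [], []
    for k in range(n + 1):                      # data residuals
        row = [0.0] * (n + 2)
        row[0] = Q[k] / mean_latent             # Q[k] stands for Q_{k+1}
        for j in range(k + 1):
            row[1 + j] = P[k - j]
        rows.append(row)
        rhs.append(f_EI[k])
    for k in range(1, n):                       # second-difference regularization
        row = [0.0] * (n + 2)
        row[k], row[k + 1], row[k + 2] = 1.0, -2.0, 1.0
        rows.append(row)
        rhs.append(0.0)
    return rows, rhs

def loss(rows, rhs, x):
    return sum((sum(a * v for a, v in zip(row, x)) - b) ** 2
               for row, b in zip(rows, rhs))

def projected_gradient_descent(rows, rhs, x0, upper, n_iter=2000):
    """Gradient step followed by clipping to [0, upper]."""
    # Step 1/L, where L = 2 * ||A||_F^2 bounds the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * sum(a * a for row in rows for a in row))
    x = list(x0)
    for _ in range(n_iter):
        residuals = [sum(a * v for a, v in zip(row, x)) - b
                     for row, b in zip(rows, rhs)]
        grad = [2.0 * sum(row[i] * res for row, res in zip(rows, residuals))
                for i in range(len(x))]
        x = [min(max(v - step * g, 0.0), upper) for v, g in zip(x, grad)]
    return x

# Toy kernel and synthetic data generated from known parameters.
P = [0.1, 0.3, 0.4, 0.2, 0.0, 0.0]
Q = [0.9, 0.6, 0.2, 0.0, 0.0, 0.0]
mean_latent = 1.7
truth = [34.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0]
data_rows, _ = build_system([0.0] * 6, P, Q, mean_latent)
f_EI = [sum(a * v for a, v in zip(row, truth)) for row in data_rows[:6]]

rows, rhs = build_system(f_EI, P, Q, mean_latent)
upper = 9558153.0                    # population size used as the upper bound
x0 = [0.0] + f_EI                    # feasible starting point
initial_loss = loss(rows, rhs, x0)
x_hat = projected_gradient_descent(rows, rhs, x0, upper)
final_loss = loss(rows, rhs, x_hat)
print(initial_loss, final_loss, x_hat[0])
```

Because the total loss is a linear least-squares objective, any bound-constrained solver can be substituted for the loop above; a quasi-Newton method such as L-BFGS with bounds would typically reach the minimum in far fewer iterations.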

Construction and simulation of the SEIR model with Gamma-distributed latent and infectious periods

We compared the accuracy of Hist-I and Hist-D by using simulation data from the SEIR model mimicking the latent period of COVID-19. For this, we constructed the SEIR model following the previous study by Hong and Eom et al. [19]:

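$$\begin{aligned}
\frac{dS(t)}{dt}&=-\frac{\beta S(t)I(t)}{N}\\
\frac{dE(t)}{dt}&=\frac{\beta S(t)I(t)}{N}-\mathrm{Flow}_{EI}(t)\\
\frac{dI(t)}{dt}&=\mathrm{Flow}_{EI}(t)-\mathrm{Flow}_{IR}(t)\\
\frac{dR(t)}{dt}&=\mathrm{Flow}_{IR}(t)
\end{aligned} \tag{16}$$

where $\beta$ is the transmission rate and $N$ is the size of the entire population (i.e., $N=S(0)+E(0)+I(0)+R(0)$). $\mathrm{Flow}_{EI}(t)$ and $\mathrm{Flow}_{IR}(t)$ indicate the history-dependent rates of transition from compartment E to I and from I to R, respectively. These rates were calculated as follows:

$$\begin{aligned}
\mathrm{Flow}_{EI}(t)&=\int_{0}^{t}\frac{\beta S(u)I(u)}{N}\,g_1(t-u)\,du+E(0)\,\tilde{g}_1(t)\\
\mathrm{Flow}_{IR}(t)&=\int_{0}^{t}\frac{\beta S(u)I(u)}{N}\,(g_1*g_2)(t-u)\,du+E(0)\,(\tilde{g}_1*g_2)(t)+I(0)\,\tilde{g}_2(t)\\
\tilde{g}_i(t)&=\frac{1}{\mathbb{E}[g_i]}\int_{t}^{\infty}g_i(v)\,dv,\quad i=1,2
\end{aligned} \tag{17}$$

where $g_1(t)$ and $g_2(t)$ represent the probability density functions of the distribution of the latent period and infectious period, respectively, and $\tilde{g}_1(t)$ and $\tilde{g}_2(t)$ represent the probability density functions of sojourn times of individuals initially in compartments E and I, respectively [19].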
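
Equations (16) and (17) can be discretized in several ways. The sketch below uses a forward Euler step with Riemann-sum convolutions, which is simpler and less accurate than the Heun scheme used for the results in this study. It also rewrites $\mathrm{Flow}_{IR}$ as the convolution of $\mathrm{Flow}_{EI}$ with $g_2$ plus the $I(0)$ term, which follows from the associativity of convolution. The step size, the simulated horizon, and the function names are assumptions made for illustration.

```python
import math

def gamma_pdf(x, shape, scale):
    """Density of a Gamma(shape, scale) distribution; zero for x <= 0."""
    if x <= 0.0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

def residual_density(g, mean, dt):
    """g~(t) = P(duration > t) / mean, evaluated on the time grid."""
    survival = 1.0
    out = [1.0 / mean]
    for i in range(1, len(g)):
        survival -= 0.5 * (g[i - 1] + g[i]) * dt
        out.append(max(survival, 0.0) / mean)
    return out

def simulate_seir(beta, S0, E0, I0, R0, days, dt, latent, infectious):
    """Forward-Euler sketch of the history-dependent SEIR model."""
    N = S0 + E0 + I0 + R0
    steps = int(round(days / dt))
    grid = [i * dt for i in range(steps + 1)]
    g1 = [gamma_pdf(t, *latent) for t in grid]
    g2 = [gamma_pdf(t, *infectious) for t in grid]
    g1_res = residual_density(g1, latent[0] * latent[1], dt)
    g2_res = residual_density(g2, infectious[0] * infectious[1], dt)
    S, E, I, R = [S0], [E0], [I0], [R0]
    inflow_E, flow_EI = [], []
    for i in range(steps):
        inflow_E.append(beta * S[i] * I[i] / N)
        # Flow_EI: new exposures convolved with g1, plus those initially in E.
        f_ei = sum(inflow_E[u] * g1[i - u] for u in range(i + 1)) * dt
        f_ei += E0 * g1_res[i]
        flow_EI.append(f_ei)
        # Flow_IR: Flow_EI convolved with g2, plus those initially in I.
        f_ir = sum(flow_EI[u] * g2[i - u] for u in range(i + 1)) * dt
        f_ir += I0 * g2_res[i]
        S.append(S[i] - dt * inflow_E[i])
        E.append(E[i] + dt * (inflow_E[i] - f_ei))
        I.append(I[i] + dt * (f_ei - f_ir))
        R.append(R[i] + dt * f_ir)
    return S, E, I, R

S, E, I, R = simulate_seir(beta=0.4, S0=9558153.0, E0=0.0, I0=1.0, R0=0.0,
                           days=30, dt=0.1,
                           latent=(4.06, 1.35), infectious=(30.0, 0.2))
print(S[-1], E[-1], I[-1], R[-1])
```

Each flow is subtracted from one compartment and added to the next, so the total population is conserved by construction regardless of the step size.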

We set the initial conditions as $S(0)=9558153$, $E(0)=0$, $I(0)=1$, and $R(0)=0$ to mimic the initial stage of COVID-19 in Seoul, where the whole population of Seoul is susceptible except for one infectious individual. Additionally, we assumed that the latent period follows the Gamma distribution with shape 4.06 and scale 1.35, based on a previous study that fitted a Gamma distribution to observed data on the COVID-19 latent period [24]. The infectious period was assumed to follow the Gamma distribution with shape 30 and scale 0.2, as estimated in a previous study using the same COVID-19 data employed in this study [19]. Simulating this model using Heun’s method, we obtained daily numbers of S(t), E(t), I(t), and R(t), which were then converted to the daily incidence of exposure ($f_{SE}$) data and the daily incidence of becoming infectious ($f_{EI}$) data as follows:

$$\begin{aligned}
f_{SE}(t)&=-\Delta S(t)=-\big(S(t)-S(t-1)\big)\\
f_{EI}(t)&=-\Delta E(t)+f_{SE}(t)=-\big(E(t)-E(t-1)\big)-\big(S(t)-S(t-1)\big)
\end{aligned} \tag{18}$$
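
The conversion in equation (18) amounts to taking first differences of the simulated compartments. A minimal sketch with made-up daily values (the function name and the numbers are illustrative only):

```python
def daily_incidence(S, E):
    """Convert daily S(t) and E(t) into f_SE and f_EI following equation (18).

    The returned lists start at the second day, because the first day has
    no preceding value to difference against.
    """
    f_SE = [-(S[t] - S[t - 1]) for t in range(1, len(S))]
    f_EI = [-(E[t] - E[t - 1]) + f_SE[t - 1] for t in range(1, len(E))]
    return f_SE, f_EI

# Hypothetical daily values of the susceptible and exposed compartments.
S = [100.0, 96.0, 90.0, 85.0]
E = [0.0, 3.0, 7.0, 8.0]
f_SE, f_EI = daily_incidence(S, E)
print(f_SE, f_EI)
```

For example, on the second day four individuals leave S and E grows by three, so exactly one individual must have moved from E to I.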

These daily exposed and infectious people datasets were employed to compare the performances between Hist-I and Hist-D (Fig 3c-3e). Then, to further consider the possible measurement errors in the real-world, we applied multiplicative noise ($e^{U(-\sigma,\sigma)}$) to the given data (i.e., daily incidence of becoming infectious, $f_{EI}$) (Fig 3f) and compared the estimation accuracy of Hist-I and Hist-D (Fig 3g). Additionally, we incrementally increased the noise level ($\sigma$) by 0.1 to assess whether Hist-D maintained its superiority under higher noise conditions (Fig 3h). To ensure the reliability and stability of our findings, this process was iterated 10 times, and their average RMSE and MAPE were reported (Fig 3h). Lastly, we simulated a sudden shift in epidemic dynamics by adjusting $\beta$ during the simulation (Fig 3i), changing its initial value of 0.4 by multiplying it by factors of 1/3, 1/2, 1, 2, 3 at $t=140$.
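
The noise model and the two error metrics can be sketched as follows. The RMSE and MAPE functions use their standard definitions, which are assumed here to match those used for Fig 3; the function names, the random seed, and the example values are illustrative.

```python
import math
import random

def add_multiplicative_noise(f_EI, sigma, rng):
    """Multiply each value by exp(u) with u drawn uniformly from (-sigma, sigma)."""
    return [value * math.exp(rng.uniform(-sigma, sigma)) for value in f_EI]

def rmse(estimates, truths):
    """Root mean squared error."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths))
                     / len(truths))

def mape(estimates, truths):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(e - t) / abs(t)
                       for e, t in zip(estimates, truths)) / len(truths)

rng = random.Random(0)                 # fixed seed for reproducibility
clean = [10.0, 20.0, 40.0, 80.0]       # hypothetical daily incidence
noisy = add_multiplicative_noise(clean, sigma=0.1, rng=rng)
print(noisy, rmse(noisy, clean), mape(noisy, clean))
```

With $\sigma=0.1$ every noisy value stays within a factor of $e^{\pm 0.1}$ of the clean value, that is, within roughly 10% in either direction.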

Data collection and preprocessing

For the real-world data analysis, we utilized contact tracing data of COVID-19 in Seoul, Republic of Korea from January 20th, 2020 to November 25th, 2020. From this data, the period from August 13th, 2020 to November 25th, 2020 was chosen as the testing set for Hist-D and Hist-I as it includes both the increasing phase and decreasing phase of exposed individuals. The dataset contains individual contact dates, symptom onset dates, and confirmation dates of COVID-19 cases. While confirmation dates were complete, only 35% and 63% of contact dates and symptom onset dates were available, respectively, leading us to use 21% of the total data that had complete information on contact dates, symptom onset dates, and confirmation dates.

From these data, we extracted the number of daily incidence of exposure (fSE) and daily incidence of becoming infectious (fEI), by assuming that individuals become “exposed” at their contact dates and “infectious” at their symptom onset dates. Then, we calculated the population in the E compartment by setting E(t0)=0 on January 20th, 2020, the date of the first officially confirmed COVID-19 case in Seoul [19], and cumulatively summing the difference of fSE and fEI. The resulting values were used as true values of E(t0). Additionally, the distribution of the incubation period was obtained empirically, by calculating the time difference between the date of symptom onset and the contact date for each case: the probability of an incubation period of 5 days is the ratio of cases whose difference between symptom onset date and contact date is 5 days.
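
The two preprocessing steps described above, the empirical incubation period distribution and the cumulative reconstruction of the E compartment, can be sketched with a few fabricated example records. The dates and counts below are invented purely for illustration and are not taken from the Seoul data; the function names are likewise hypothetical.

```python
from collections import Counter
from datetime import date

def empirical_incubation_pmf(cases):
    """Probability of each incubation period (days) from (contact, onset) pairs."""
    delays = [(onset - contact).days for contact, onset in cases]
    counts = Counter(delays)
    total = len(delays)
    return {d: counts[d] / total for d in sorted(counts)}

def exposed_trajectory(f_SE, f_EI, E_start=0.0):
    """Cumulatively sum f_SE - f_EI, starting from E_start on the preceding day."""
    trajectory = []
    current = E_start
    for new_exposed, new_infectious in zip(f_SE, f_EI):
        current += new_exposed - new_infectious
        trajectory.append(current)
    return trajectory

# Invented example records: (contact date, symptom onset date).
cases = [
    (date(2020, 8, 1), date(2020, 8, 6)),
    (date(2020, 8, 2), date(2020, 8, 7)),
    (date(2020, 8, 3), date(2020, 8, 6)),
    (date(2020, 8, 1), date(2020, 8, 8)),
]
pmf = empirical_incubation_pmf(cases)
E_true = exposed_trajectory(f_SE=[3, 4, 2], f_EI=[0, 1, 3])
print(pmf, E_true)
```

Here two of the four example cases have a 5-day gap between contact and onset, so the empirical probability of a 5-day incubation period is 0.5.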

These data are protected and are not available due to data privacy laws. The Korea Public Institutional Review Board Designated by Ministry of Health and Welfare waived the need for ethical approval for the collection and analysis of the real-world data since the data was anonymized and none of the individuals were identifiable (reference number: P01-202404-01-016).

Uncertainty quantification using the Markov chain Monte Carlo (MCMC) method

We quantified the parametric and prediction uncertainties of Hist-D using Bayesian inference. We set the likelihood of given observed daily incidence of becoming infectious, fEI, as follows:

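$$f_{EI}(t_0+k)\sim\mathrm{Poisson}\left(\frac{E(t_0-1)}{\mathbb{E}[\tau_L]}Q_{k+1}+\sum_{j=0}^{k}f_{SE}(t_0+j)P_{k-j}\right) \tag{19}$$

where $P_{k-j}$ and $Q_{k+1}$ are the known constants introduced in equation (13). The prior distributions of parameters were assumed as follows:

$$\begin{aligned}
E(t_0-1)&\sim\mathrm{LogNormal}\big(\hat{E}(t_0-1),\,\sigma_E^{2}\big)\\
f_{SE}(t_0+j)&\sim\mathrm{LogNormal}\big(\hat{f}_{SE}(t_0+j),\,k\big),\quad j=0,\ldots,n\\
k&\sim\mathrm{LogNormal}\big(\mu_k,\,\sigma_k^{2}\big)
\end{aligned} \tag{20}$$

where $\mathrm{LogNormal}(x,y)$ denotes the lognormal distribution with mean $x$ and variance $y$, and $\hat{E}(t_0-1),\hat{f}_{SE}(t_0),\ldots,\hat{f}_{SE}(t_0+n)$ denote the point estimates of $E(t_0-1),f_{SE}(t_0),\ldots,f_{SE}(t_0+n)$ obtained by utilizing Hist-D. We used $\sigma_E=\sigma_k=100$ and $\mu_k=20$.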
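
The Poisson likelihood in equation (19) can be evaluated on the log scale as sketched below. Only the likelihood is shown; the lognormal priors in equation (20) are omitted. The toy kernel weights, the observed counts, and the function name are assumptions made for illustration and do not come from the Stan model in the Hist-D package.

```python
import math

def poisson_log_likelihood(f_EI, E_prev, f_SE, P, Q, mean_latent):
    """Log-likelihood of observed f_EI under equation (19).

    P[m] stands for P_m, Q[k] for Q_{k+1}, and E_prev for E(t0 - 1).
    """
    total = 0.0
    for k, observed in enumerate(f_EI):
        rate = E_prev / mean_latent * Q[k]
        rate += sum(f_SE[j] * P[k - j] for j in range(k + 1))
        # log of the Poisson probability mass function
        total += observed * math.log(rate) - rate - math.lgamma(observed + 1)
    return total

# Toy kernel with a latent period of at most 4 days (illustrative values only).
P = [0.1, 0.3, 0.4, 0.2]
Q = [0.9, 0.6, 0.2, 0.0]
mean_latent = 1.7

f_EI = [20, 18, 25, 20]          # hypothetical observed counts
f_SE = [20.0] * 4                # steady exposure: expected f_EI is 20 per day
log_lik = poisson_log_likelihood(f_EI, 34.0, f_SE, P, Q, mean_latent)
print(log_lik)
```

A sampler such as NUTS would combine this term with the log-density of the priors to form the log-posterior.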

We used an MCMC method to sample the parameters ($E(t_0-1),f_{SE}(t_0),\ldots,f_{SE}(t_0+n)$) from their posterior distribution defined by the likelihood in equation (19) and the priors in equation (20). To be more specific, we performed 100,000 iterations of sampling $E(t_0-1),f_{SE}(t_0),\ldots,f_{SE}(t_0+n)$ from their posterior distributions by using a Hamiltonian Monte Carlo (HMC) algorithm with the No-U-Turn Sampler (NUTS). We generated the posterior samples of $E(t_0)$ by cumulatively summing up the difference between the posterior samples of the daily incidence of exposure and the given data: $E(t_0)=E(t_0-1)+f_{SE}(t_0)-f_{EI}(t_0)$. Then, the credible interval for Hist-D was calculated as the range between the 0.025 and 0.975 quantiles of the posterior samples of $E(t_0)$.
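
Given posterior draws, the credible interval for $E(t_0)$ reduces to a quantile computation. The sketch below uses a deterministic stand-in for the posterior samples so that the result is easy to verify by hand; in practice the draws would come from the sampler. The function names and the linear-interpolation quantile rule are illustrative choices.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction

def credible_interval_E0(E_prev_samples, f_SE0_samples, f_EI0, level=0.95):
    """Interval for E(t0) = E(t0-1) + f_SE(t0) - f_EI(t0) from posterior draws."""
    E0_samples = sorted(e + f - f_EI0
                        for e, f in zip(E_prev_samples, f_SE0_samples))
    tail = (1.0 - level) / 2.0
    return quantile(E0_samples, tail), quantile(E0_samples, 1.0 - tail)

# Stand-in "posterior draws": E(t0-1) takes the values 0, 1, ..., 100.
E_prev_samples = [float(v) for v in range(101)]
f_SE0_samples = [10.0] * 101
f_EI0 = 10.0                      # observed daily incidence at t0
lower_bound, upper_bound = credible_interval_E0(E_prev_samples, f_SE0_samples, f_EI0)
print(lower_bound, upper_bound)
```

With these stand-in draws the samples of $E(t_0)$ are the integers 0 to 100, so the 95% interval runs from 2.5 to 97.5.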

Quantification and statistical analysis

In this study, functions to estimate the initial condition and simulate all scenarios were developed by the authors in R (version 4.3.2) and Stan (version 2.32.2), using RStan (version 2.32.6).

Supporting information

S1 File. Hist-D.

Computational package for Hist-D.

(ZIP)

pcbi.1013438.s001.zip (3.3MB, zip)
S1 Text. Computational package for Hist-D.

(DOCX)

pcbi.1013438.s002.docx (26.2KB, docx)
S2 Text. Estimation of the reproduction number heavily depends on the initial conditions.

(DOCX)

pcbi.1013438.s003.docx (25.3KB, docx)
S3 Text. Hist-D is more accurate than Hist-I, even after Hist-I was modified.

(DOCX)

pcbi.1013438.s004.docx (25.7KB, docx)
S4 Text. Hist-D enhances the accuracy of reproduction number estimation.

(DOCX)

pcbi.1013438.s005.docx (25.9KB, docx)
S1 Fig. Estimation of reproduction number is heavily dependent on the initial condition of E.

(a) The plot displaying the fold error between the estimated reproduction number and the true reproduction number across various bias levels in E(0). E(0) was initially E0=50, and it was changed to E1. When the true reproduction number is 2 (red curve), E1/E0=22.5 led to a 37.9% relative error and E1/E0=22.5 led to a 19.4% relative error. When the true reproduction number is 4 (blue curve), E1/E0=22.5 led to a 42.5% relative error and E1/E0=22.5 led to a 34.1% relative error. (b) The plot showing the estimated time-varying reproduction number under varying levels of bias in the initial condition of E: E(0)/4, E(0)/2, E(0), 2E(0), and 4E(0), where E(0) is the original initial condition value. The influence of bias in E(0) continues noticeably well after the time point where the initial condition was estimated, and decreases over time, leading to convergence in the estimated reproduction numbers.

(EPS)

pcbi.1013438.s006.eps (1.3MB, eps)
S2 Fig. The superiority of Hist-D was preserved even after the modification of Hist-I as done in Rauch et al. (a) The graph comparing the true E(t0) (light gray-colored bars) and the estimated E(t0) (E^(t0)).

The modified Hist-I underestimates the E(t0). (b) The scatter plot displaying the error between estimated E(t0) (E^(t0)) and true E(t0) across different levels of true E(t0). Estimation from Hist-D (green squares) shows smaller errors compared to the modified Hist-I (red triangles). (c) The bar plot showing the root mean squared error (RMSE; bars) and mean absolute percentage error (MAPE; line) of the modified Hist-I and Hist-D. RMSE and MAPE were reduced by 80% and 53%, respectively, when Hist-D was utilized, compared to the modified Hist-I.

(EPS)

pcbi.1013438.s007.eps (1.1MB, eps)
S3 Fig. Hist-D outperforms both Hist-I and the modified Hist-I even in the early phase of the epidemic dynamics.

(a) The graph comparing the true E(t0) (light gray-colored bars) and the estimated E(t0) (E^(t0)) for time t0=1080. (b) The scatter plot displaying the error between estimated E(t0) (E^(t0)) and true E(t0) across different levels of true E(t0). Estimation from Hist-D (green squares) shows smaller errors compared to both Hist-I (red triangles) and the modified Hist-I (blue circles). (c) The bar plot showing the root mean squared error (RMSE; bars) and mean absolute percentage error (MAPE; line) of Hist-I, the modified Hist-I, and Hist-D. Hist-D reduced both RMSE and MAPE by 81% compared to Hist-I, whereas the modified Hist-I reduced both by 47%.

(EPS)

pcbi.1013438.s008.eps (1.2MB, eps)
S4 Fig. Hist-D enhances the accuracy of reproduction number estimation.

Boxplots of the posterior samples of the reproduction number obtained from IONISE with initial conditions estimated by Hist-D (green) and Hist-I (red). IONISE combined with Hist-D accurately estimated the reproduction number, while using Hist-I introduced considerable bias. Here, the posterior samples were normalized by the true value of the reproduction number employed for generating the simulation data depicted in Fig 3c.

(EPS)

pcbi.1013438.s009.eps (1.1MB, eps)
S1 Table. Hist-D is more accurate than Hist-I under various parameter conditions.

The table shows the reduction in RMSE (former) and MAPE (latter) achieved by Hist-D relative to Hist-I under the same conditions as Fig 3c–3e, but with varied latent and infectious period parameters.

(DOCX)

pcbi.1013438.s010.docx (23.2KB, docx)

Data Availability

The simulated data on daily exposed and infectious people generated in this study is provided as a CSV file ‘SimulData_cde.csv’ within the Github repository (https://github.com/Mathbiomed/Hist-D), which is publicly available, and a permanent reference to the version of the code used in this study is provided at the Zenodo repository (https://doi.org/10.5281/zenodo.16891923). The real-world confirmed cases and contact tracing data were collected with informed consent and were provided by the Seoul Metro Infectious Disease Research Center. These data are protected and are not available due to data privacy laws. Specific academic requests for access to these data should be directed to the Citizens’ Health Bureau, Infectious Disease Control Division, Seoul Metropolitan Government (Tel: 82-02-2133-9480, E-mail: pr77889@seoul.go.kr) or the Institute for Basic Science (E-mail: webmaster@ibs.re.kr).

Funding Statement

This work was supported by the National Research Foundation of Korea (NRF) (grant no. RS-202300245056, B.C., D.L., RS-2024-00340139, S.C., NRF-2022R1A5A1033624, H.L., RS- 2022-NR068758, J.K.K., RS-2025-00523567, W.C.), the Institute for Basic Science (grant no. IBS-R029-C3, J.K.K.), the Samsung Science and Technology Foundation (grant no. SSTF-BA1902-01, J.K.K.), a grant of the project The Government-wide R&D to Advance Infectious Disease Prevention and Control (grant no. HG23C1629, B.C., S.C., and H.L.), the New Faculty Startup Fund from Seoul National University (grant no: 326-20240027, W.C.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No authors received a salary from any of the funders listed above specifically for this work.

References

1. Tang B, Wang X, Li Q, Bragazzi NL, Tang S, Xiao Y, et al. Estimation of the Transmission Risk of the 2019-nCoV and Its Implication for Public Health Interventions. J Clin Med. 2020;9(2):462. doi: 10.3390/jcm9020462
2. He S, Peng Y, Sun K. SEIR modeling of the COVID-19 and its dynamics. Nonlinear Dyn. 2020;101(3):1667–80. doi: 10.1007/s11071-020-05743-y
3. Hao X, Cheng S, Wu D, Wu T, Lin X, Wang C. Reconstruction of the full transmission dynamics of COVID-19 in Wuhan. Nature. 2020;584(7821):420–4. doi: 10.1038/s41586-020-2554-8
4. Radulescu A, Williams C, Cavanagh K. Management strategies in a SEIR-type model of COVID 19 community spread. Sci Rep. 2020;10(1):21256.
5. Hong H, Noh JY, Lee H, Choi S, Choi B, Kim JK, et al. Modeling incorporating the severity-reducing long-term immunity: higher viral transmission paradoxically reduces severe COVID-19 during endemic transition. Immune Netw. 2022;22(3):e23. doi: 10.4110/in.2022.22.e23
6. Lopez L, Rodo X. A modified SEIR model to predict the COVID-19 outbreak in Spain and Italy: simulating control scenarios and multi-scale epidemics. Results Phys. 2021;21:103746.
7. Carcione JM, Santos JE, Bagaini C, Ba J. A simulation of a COVID-19 epidemic based on a deterministic SEIR model. Front Public Health. 2020;8:230. doi: 10.3389/fpubh.2020.00230
8. Wong V, Cooney D, Bar-Yam Y. Beyond Contact Tracing: Community-Based Early Detection for Ebola Response. PLoS Curr. 2016;8. doi: 10.1371/currents.outbreaks.322427f4c3cc2b9c1a5b3395e7d20894
9. Viana J, van Dorp CH, Nunes A, Gomes MC, van Boven M, Kretzschmar ME, et al. Controlling the pandemic during the SARS-CoV-2 vaccination rollout. Nat Commun. 2021;12(1):3674. doi: 10.1038/s41467-021-23938-8
10. Li X, Ghadami A, Drake JM, Rohani P, Epureanu BI. Mathematical model of the feedback between global supply chain disruption and COVID-19 dynamics. Sci Rep. 2021;11(1):15450. doi: 10.1038/s41598-021-94619-1
11. Goldberg EE, Lin Q, Romero-Severson EO, Ke R. Swift and extensive Omicron outbreak in China after sudden exit from “zero-COVID” policy. Nat Commun. 2023;14(1):3888. doi: 10.1038/s41467-023-39638-4
12. Drake JM, Handel A, Marty É, O’Dea EB, O’Sullivan T, Righi G, et al. A data-driven semi-parametric model of SARS-CoV-2 transmission in the United States. PLoS Comput Biol. 2023;19(11):e1011610. doi: 10.1371/journal.pcbi.1011610
13. Cooper I, Mondal A, Antonopoulos CG. A SIR model assumption for the spread of COVID-19 in different communities. Chaos Solitons Fractals. 2020;139:110057. doi: 10.1016/j.chaos.2020.110057
14. Gozzi N, Chinazzi M, Dean NE, Longini IM Jr, Halloran ME, Perra N, et al. Estimating the impact of COVID-19 vaccine inequities: a modeling study. Nat Commun. 2023;14(1):3272. doi: 10.1038/s41467-023-39098-w
15. Girardi P, Gaetan C. An SEIR model with time-varying coefficients for analyzing the SARS-CoV-2 epidemic. Risk Anal. 2023;43(1):144–55.
16. Wearing HJ, Rohani P, Keeling MJ. Appropriate models for the management of infectious diseases. PLoS Med. 2005;2(7):e174. doi: 10.1371/journal.pmed.0020174
17. Hurtado PJ, Kirosingh AS. Generalizations of the “Linear Chain Trick”: incorporating more flexible dwell time distributions into mean field ODE models. J Math Biol. 2019;79(5):1831–83. doi: 10.1007/s00285-019-01412-w
18. Hurtado PJ, Richards C. Building mean field ODE models using the generalized linear chain trick & Markov chain theory. J Biol Dyn. 2021;15(sup1):S248–72.
19. Hong H, Eom E, Lee H, Choi S, Choi B, Kim JK. Overcoming bias in estimating epidemiological parameters with realistic history-dependent disease spread dynamics. Nat Commun. 2024;15(1):8734. doi: 10.1038/s41467-024-53095-7
20. De Salazar PM, Lu F, Hay JA, Gómez-Barroso D, Fernández-Navarro P, Martínez EV, et al. Near real-time surveillance of the SARS-CoV-2 epidemic with incomplete data. PLoS Comput Biol. 2022;18(3):e1009964. doi: 10.1371/journal.pcbi.1009964
21. Nishiura H, Inaba H. Estimation of the incubation period of influenza A (H1N1-2009) among imported cases: addressing censoring using outbreak data at the origin of importation. J Theor Biol. 2011;272(1):123–30. doi: 10.1016/j.jtbi.2010.12.017
22. Saito MM, Hirotsu N, Hamada H, Takei M, Honda K, Baba T, et al. Reconstructing the household transmission of influenza in the suburbs of Tokyo based on clinical cases. Theor Biol Med Model. 2021;18(1):7. doi: 10.1186/s12976-021-00138-x
23. Miura F, van Ewijk CE, Backer JA, Xiridou M, Franz E, Op de Coul E. Estimated incubation period for monkeypox cases confirmed in the Netherlands, May 2022. Euro Surveill. 2022;27(24).
24. Xin H, Li Y, Wu P, Li Z, Lau EHY, Qin Y. Estimating the latent period of coronavirus disease 2019 (COVID-19). Clin Infect Dis. 2022;74(9):1678–81.
25. Huang S, Li J, Dai C, Tie Z, Xu J, Xiong X, et al. Incubation period of coronavirus disease 2019: new implications for intervention and control. Int J Environ Health Res. 2022;32(8):1707–15.
26. Men K, Li Y, Wang X, Zhang G, Hu J, Gao Y, et al. Estimate the incubation period of coronavirus 2019 (COVID-19). Comput Biol Med. 2023;158:106794. doi: 10.1016/j.compbiomed.2023.106794
27. Li Y, Jiang X, Qiu Y, Gao F, Xin H, Li D, et al. Latent and incubation periods of Delta, BA.1, and BA.2 variant cases and associated factors: a cross-sectional study in China. BMC Infect Dis. 2024;24(1):294. doi: 10.1186/s12879-024-09158-7
28. Rauch W, Schenk H, Rauch N, Harders M, Oberacher H, Insam H, et al. Estimating actual SARS-CoV-2 infections from secondary data. Sci Rep. 2024;14(1):6732. doi: 10.1038/s41598-024-57238-0
29. Ringa N, Bauch CT. Dynamics and control of foot-and-mouth disease in endemic countries: a pair approximation model. J Theor Biol. 2014;357:150–9. doi: 10.1016/j.jtbi.2014.05.010
30. Meng X, Cai Z, Si S, Duan D. Analysis of epidemic vaccination strategies on heterogeneous networks: Based on SEIRV model and evolutionary game. Appl Math Comput. 2021;403:126172. doi: 10.1016/j.amc.2021.126172
31. Prabakaran R, Jemimah S, Rawat P, Sharma D, Gromiha MM. A novel hybrid SEIQR model incorporating the effect of quarantine and lockdown regulations for COVID-19. Sci Rep. 2021;11(1):24073. doi: 10.1038/s41598-021-03436-z
32. Tiwari S, Vyasarayani CP, Chatterjee A. Data suggest COVID-19 affected numbers greatly exceeded detected numbers, in four European countries, as per a delayed SEIQR model. Sci Rep. 2021;11(1):8106. doi: 10.1038/s41598-021-87630-z
33. Hilton J, Riley H, Pellis L, Aziza R, Brand SPC, K Kombe I, et al. A computational framework for modelling infectious disease policy based on age and household structure with applications to the COVID-19 pandemic. PLoS Comput Biol. 2022;18(9):e1010390. doi: 10.1371/journal.pcbi.1010390
34. Thompson RN, Gilligan CA, Cunniffe NJ. Detecting presymptomatic infection is necessary to forecast major epidemics in the earliest stages of infectious disease outbreaks. PLoS Comput Biol. 2016;12(4):e1004836.
35. Gostic KM, McGough L, Baskerville EB, Abbott S, Joshi K, Tedijanto C, et al. Practical considerations for measuring the effective reproductive number, Rt. PLoS Comput Biol. 2020;16(12):e1008409. doi: 10.1371/journal.pcbi.1008409
36. Cori A, Ferguson NM, Fraser C, Cauchemez S. A new framework and software to estimate time-varying reproduction numbers during epidemics. Am J Epidemiol. 2013;178(9):1505–12.
37. Abbott S, Hellewell J, Thompson RN, Sherratt K, Gibbs HP, Bosse NI, et al. Estimating the time-varying reproduction number of SARS-CoV-2 using national and subnational case counts. Wellcome Open Research. 2020;5(112):112.
38. Josić K, López JM, Ott W, Shiau L, Bennett MR. Stochastic delay accelerates signaling in gene networks. PLoS Comput Biol. 2011;7(11):e1002264. doi: 10.1371/journal.pcbi.1002264
39. Hong H, Cortez MJ, Cheng YY, Kim HJ, Choi B, Josic K, et al. Inferring delays in partially observed gene regulation processes. Bioinformatics. 2023;39(11).
  • 40.Cortez MJ, Hong H, Choi B, Kim JK, Josić K. Hierarchical Bayesian models of transcriptional and translational regulation processes with delays. Bioinformatics. 2021;38(1):187–95. doi: 10.1093/bioinformatics/btab618 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Kim DW, Hong H, Kim JK. Systematic inference identifies a major source of heterogeneity in cell signaling dynamics: The rate-limiting step number. Sci Adv. 2022;8(11):eabl4598. doi: 10.1126/sciadv.abl4598 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Szavits-Nossan J, Grima R. Uncovering the effect of RNA polymerase steric interactions on gene expression noise: Analytical distributions of nascent and mature RNA numbers. Phys Rev E. 2023;108(3–1):034405. [DOI] [PubMed] [Google Scholar]
  • 43.Jo H, Hong H, Hwang HJ, Chang W, Kim JK. Density physics-informed neural networks reveal sources of cell heterogeneity in signal transduction. Patterns (N Y). 2023;5(2):100899. doi: 10.1016/j.patter.2023.100899 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Song YM, Campbell S, Shiau L, Kim JK, Ott W. Noisy Delay Denoises Biochemical Oscillators. Phys Rev Lett. 2024;132(7):078402. doi: 10.1103/PhysRevLett.132.078402 [DOI] [PubMed] [Google Scholar]
  • 45.Choi B, Cheng Y-Y, Cinar S, Ott W, Bennett MR, Josić K, et al. Bayesian inference of distributed time delay in transcriptional and translational regulation. Bioinformatics. 2020;36(2):586–93. doi: 10.1093/bioinformatics/btz574 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Byun JH, Roh Y, Yoon I-S, Kim KS, Jung IH. Fractional transit compartment model for describing drug delayed response to tumors using Mittag-Leffler distribution on age-structured PKPD model. PLoS One. 2022;17(11):e0276654. doi: 10.1371/journal.pone.0276654 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Calleri F, Nastasi G, Romano V. Continuous-time stochastic processes for the spread of COVID-19 disease simulated via a Monte Carlo approach and comparison with deterministic models. J Math Biol. 2021;83(4):34. doi: 10.1007/s00285-021-01657-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Stehlé J, Voirin N, Barrat A, Cattuto C, Colizza V, Isella L, et al. Simulation of an SEIR infectious disease model on the dynamic contact network of conference attendees. BMC Med. 2011;9:87. doi: 10.1186/1741-7015-9-87 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Nielsen BF, Sneppen K, Simonsen L. The counterintuitive implications of superspreading diseases. Nat Commun. 2023;14(1):6954. doi: 10.1038/s41467-023-42612-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Shafer G, Vovk V. A tutorial on conformal prediction. Journal of Machine Learning Research. 2008;9(3). [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013438.r001

Decision Letter 0

Roger Dimitri Kouyos, Nik J Cunniffe

5 May 2025

PCOMPBIOL-D-25-00204

A History-Dependent Approach for Accurate Initial Condition Estimation in Epidemic Models

PLOS Computational Biology

Dear Dr. Kim,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 30 days Jul 05 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Nik J. Cunniffe

Academic Editor

PLOS Computational Biology

Roger Kouyos

Section Editor

PLOS Computational Biology

Additional Editor Comments:

Thank you for sending this very nice paper to PLOS Computational Biology. I was fortunate to receive comments from three reviewers, all of whom have engaged with the paper and provided useful and relevant comments. All three agree with my own view that this is an interesting and potentially important contribution, considering how the initial number of latently infected individuals can be better estimated from data, going beyond a broad brush assumption based on the initial infected population and the expected latent period. However, in different ways all reviewers comment on how the work could be slightly better motivated and/or generalised: I was struck by R3's comment about transmission before symptoms, which is common for pathogens of a range of host taxa, and this should at least be discussed. I was also struck by R1's comments around making the case more strongly that the so-called Hist-I assumption is, in fact, "conventional"/widely used, and agree that no consideration of the method of stages is surprising. While these issues can, probably, be handled by adding more text, some of the other more detailed issues raised by R3 in their review might require additional computational work/reworking of some figures. I will also be very interested to read the authors' response to R3's comment around the potential for error introduced by assuming a constant rate of exposure for an epidemic that is growing exponentially, and it might be that some further tests with synthetic data might be helpful here. Both R1 and R3 raise minor issues with the supplied code or its documentation, and for this type of article I think these are important to address in full. Nevertheless, this appears at this stage to be a very nice article, and I look forward to seeing a revision, which at this stage I intend to send to (at least) R1 again (perhaps also R2 and R3, depending on what is said in the response to reviewers letter).

Journal Requirements:

1) Please ensure that the CRediT author contributions listed for every co-author are completed accurately and in full.

At this stage, the following Authors/Authors require contributions: Dongju Lim, Kyeong Tae Ko, Hyojung Lee, Boseung Choi, Won Chang, Sunhwa Choi, and Jae Kyoung Kim. Please ensure that the full contributions of each author are acknowledged in the "Add/Edit/Remove Authors" section of our submission form.

The list of CRediT author contributions may be found here: https://journals.plos.org/ploscompbiol/s/authorship#loc-author-contributions

2) Please upload all main figures as separate Figure files in .tif or .eps format. For more information about how to convert and format your figure files please see our guidelines: 

https://journals.plos.org/ploscompbiol/s/figures

3) We have noticed that you have uploaded Supporting Information files, but you have not included a complete list of legends. Please add a full list of legends for your Supporting Information files (HIstD.zip) after the references list.

4) In the online submission form you indicate that your data is not available for proprietary reasons and have provided a contact point for accessing this data. Please note that your current contact point is a co-author on this manuscript. According to our Data Policy, the contact point must not be an author on the manuscript and must be an institutional contact, ideally not an individual. Please revise your data statement to a non-author institutional point of contact, such as a data access or ethics committee, and send this to us via return email. Please also include contact information for the third party organization, and please include the full citation of where the data can be found.

5) Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

1) State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Authors:

Please note that two reviews are uploaded as attachments.

Reviewer #1: Please see attachment.

Reviewer #2: The review is uploaded as an attachment.

Reviewer #3: In this article, the authors address the inference of the size of the latent exposed class in SEIR-like epidemic models. They contrast the bias in estimates of the size of the latent class when making the (common) assumption that the size of the initial latent class is determined by the initial size of the infected population and the expected latent period, with their preferred set of assumptions, which allow a distribution of time since infection in the exposed class, assuming that prior to the initial time point the risk of exposure (moving from S to E) is constant. The authors construct a loss function with regularisation (penalising rapid changes in the rate of exposure) and generate posterior estimates for $E(t_0)$ using Hamiltonian MCMC implemented in Stan, arguing that this approach is computationally efficient and results in improved estimates of the epidemic size (including latent exposures). They use both synthetic and detailed case data -- that includes information on observed E(t) -- for SARS-CoV-2 in South Korea to validate these results, and explore the impact of noise of different magnitude on their ability to recover the size of the exposed class.

This paper is a commendable attempt to consider the consequences of an oft used approximation for epidemic initial conditions. The detailed epidemiological case data, including exposure and onset dates, used to validate their approach is a significant strength. I have some suggestions that may help clarify the approach and the consequences of the results.

* The adopted model structure can accommodate general distributions for the latent and infectious periods, but given that, for SARS-CoV-2 and other pathogens, transmission prior to symptom onset is common, and capturing the generation time appropriately is important for accurate estimates of the initial and time-varying reproduction numbers, can you briefly clarify (perhaps repeating information on Hong et al. [28]?) how $g_1$ and $g_2$ are chosen to ensure an appropriate generation time in this model?

* I presume the main focus of the inference is E(t) and I(t), and can see that $g_{1,2}$ are fixed, but it would be good to be clear(er) about whether particular $\beta$ (and its evolution in Fig 3i) are being inferred, and if appropriate showing some of the bivariate posterior distributions.

* Figure 1 of the supplementary material demonstrates the bias introduced to estimates of $R_0$ generated by bias in E(t). Apologies if I have missed this in the discussion, but it would be nice to understand how this bias evolves in estimates of $R(t)$ in this modelling framework - i.e. if/when do these biases in IC become irrelevant for predicting the current model state?

* When discussing the limitations of the model, is it also possible that the discrepancy in coverage of CIs is due to the assumption of constant exposure prior to $t_0$? E.g. often epidemics may begin with 'super-spreading' events that the model isn't capturing (e.g. https://www.nature.com/articles/s41467-023-42612-9)

* Consider adding the version of Stan used to run the code; there was a syntax error in the data block for the current version (though easy to fix).

* The inference did not run for me when using the input data file 'SimulData_cde.csv' due to an incorrect date format. It did run for the file specified in the code, though with many divergent transitions. Please consider checking the instructions for running the code, and reporting on the convergence diagnostics for your MCMC chains.
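
The contrast summarised in Reviewer #3's opening paragraph can be illustrated with a minimal Python sketch. It is not the authors' Hist-I or Hist-D implementation: the Erlang latent period, the exponentially growing (and assumed known) exposure incidence, and all function names are assumptions made for illustration. The history-independent proxy multiplies the current outflow from E by the mean latent period, while the history-dependent count weights each past exposure by the probability that it is still latent at $t_0$.

```python
import math


def erlang_pdf(age, shape, mean):
    """Density of an Erlang latent period with integer shape and given mean."""
    rate = shape / mean
    return rate**shape * age**(shape - 1) * math.exp(-rate * age) / math.factorial(shape - 1)


def erlang_survival(age, shape, mean):
    """Probability of still being latent `age` time units after exposure."""
    rate = shape / mean
    return math.exp(-rate * age) * sum((rate * age)**n / math.factorial(n) for n in range(shape))


def exposed_estimates(incidence, shape, mean, dt=0.01, horizon=200.0):
    """Return (history_dependent_E, history_independent_E) at t0 = 0.

    incidence(s) is the (assumed known) rate of new exposures at time s <= 0.
    """
    still_latent = 0.0   # people exposed in the past who remain in E at t0
    outflow = 0.0        # rate at which people leave E (become infectious) at t0
    for i in range(int(horizon / dt)):
        age = (i + 0.5) * dt          # midpoint rule over time since exposure
        new_exposed = incidence(-age)
        still_latent += new_exposed * erlang_survival(age, shape, mean) * dt
        outflow += new_exposed * erlang_pdf(age, shape, mean) * dt
    # History-independent proxy: E = (mean latent period) x (current outflow).
    return still_latent, mean * outflow


def growing(s, r=0.1):
    """Hypothetical exposure incidence growing exponentially at rate r."""
    return math.exp(r * s)


if __name__ == "__main__":
    for shape in (1, 4):
        dep, indep = exposed_estimates(growing, shape, mean=5.0)
        print(f"shape={shape}: history-dependent={dep:.3f}, history-independent={indep:.3f}")
```

With shape 1 (an exponential latent period) the two quantities agree, because the exponential distribution is memoryless. With shape 4 and a growing epidemic the proxy falls below the history-dependent count, which is the kind of bias the manuscript discusses.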

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Attachment

Submitted filename: review.pdf

pcbi.1013438.s011.pdf (80KB, pdf)
Attachment

Submitted filename: PLoS Comp Biol_review_3-25.pdf

pcbi.1013438.s012.pdf (56.9KB, pdf)
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013438.r003

Decision Letter 1

Roger Dimitri Kouyos, Nik J Cunniffe

5 Aug 2025

PCOMPBIOL-D-25-00204R1

A History-Dependent Approach for Accurate Initial Condition Estimation in Epidemic Models

PLOS Computational Biology

Dear Dr. Kim,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 30 days Oct 05 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

We look forward to receiving your revised manuscript.

Kind regards,

Roger Dimitri Kouyos

Section Editor

PLOS Computational Biology

Additional Editor Comments:

Thank you for engaging so positively with the reviewers' comments. As you will see, the reviewers who had substantive comments are now almost entirely satisfied, as am I. I will leave it up to you to decide whether - or not - to address the remaining comments from Reviewer #1; despite what is said below about "only permit corrections to spelling, formatting or significant scientific errors" you should be able to address these - if you wish - as part of the process of moving to publication.

Journal Requirements:

1) Please ensure that the funders and grant numbers match between the Financial Disclosure field and the Funding Information tab in your submission form. Note that the funders must be provided in the same order in both places as well. Currently, the order of the grants differs between the two places, especially for "NIMS-B24810000" and "SSTF-BA1902-01".

Note: If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Reviewers' comments:

Reviewer's Responses to Questions

Reviewer #1: I thank the authors for their detailed responses to my comments and those of the other reviewers, and am mostly satisfied with the responses. I have just a few remaining minor comments:

1. In the abstract, please modify “this history-independent method yields serious bias in the estimation of the initial condition” to “this history-independent method can yield serious bias in the estimation of the initial condition when the latent period is not exponentially distributed”. Similarly, in the author summary, change “this unrealistic assumption leads to serious errors” to “this unrealistic assumption can lead to serious errors”.

2. RE my initial comment 4, I agree that the method of stages is (at least slightly) less flexible than the approach here, although it’s not quite true that the number of stages must be treated as a tuning parameter – if a parametric or empirical latent period distribution is available (as is assumed in this study), the number of stages and transition rate(s) between them could be obtained by fitting the model’s (implicit) Erlang-distributed latent period distribution to best match the “true” distribution. Please revise the relevant part of the discussion accordingly.

3. I found this sentence in the Discussion a little weird: “First, our methods are derived from ODE-based infectious disease models, though PDE-based models are also used to account for spatial effects in disease spread”, since PDE models aren’t particularly common in the (applied) modelling literature. It would be nice instead to discuss the extension of this approach to other types of models more generally (for example, stochastic compartmental models and network-based models).
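
Reviewer #1's second point can be made concrete with a small sketch. Assuming only the mean and standard deviation of the latent period are available, one simple option is to choose the number of Erlang stages by moment matching. This is an illustrative choice, not the fitting procedure used by the authors or prescribed by the reviewer, and the function name and example values are hypothetical.

```python
import math


def erlang_stages_from_moments(mean, sd):
    """Choose the number of stages k and per-stage rate by moment matching.

    An Erlang(k, rate) distribution has mean k / rate and variance k / rate**2,
    so (mean / sd)**2 equals k; it is rounded to the nearest integer >= 1.
    """
    if mean <= 0 or sd <= 0:
        raise ValueError("mean and sd must be positive")
    stages = max(1, round((mean / sd) ** 2))
    rate = stages / mean                    # keeps the mean exact
    implied_sd = mean / math.sqrt(stages)   # sd actually reproduced by the chain
    return stages, rate, implied_sd


if __name__ == "__main__":
    # Hypothetical latent period: mean 5 days, sd 2.5 days.
    print(erlang_stages_from_moments(5.0, 2.5))
```

Because the number of stages must be an integer, the implied standard deviation generally differs from the target, which is the slight loss of flexibility the reviewer alludes to.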

Reviewer #2: The authors have done a nice job with the revision and have addressed all of my concerns.

Reviewer #3: Thank you for thoroughly responding to my comments, I am happy to recommend acceptance.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013438.r005

Decision Letter 2

Roger Dimitri Kouyos, Nik J Cunniffe

14 Aug 2025

Dear Professor Kim,

We are pleased to inform you that your manuscript 'A History-Dependent Approach for Accurate Initial Condition Estimation in Epidemic Models' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Nik J. Cunniffe

Academic Editor

PLOS Computational Biology

Roger Kouyos

Section Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013438.r006

Acceptance letter

Roger Dimitri Kouyos, Nik J Cunniffe

PCOMPBIOL-D-25-00204R2

A History-Dependent Approach for Accurate Initial Condition Estimation in Epidemic Models

Dear Dr Kim,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Narmatha Raju, M.Sc

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Hist-D.

    Computational package for Hist-D.

    (ZIP)

    pcbi.1013438.s001.zip (3.3MB, zip)
    S1 Text. Computational package for Hist-D.

    (DOCX)

    pcbi.1013438.s002.docx (26.2KB, docx)
    S2 Text. Estimation of the reproduction number heavily depends on the initial conditions.

    (DOCX)

    pcbi.1013438.s003.docx (25.3KB, docx)
    S3 Text. Hist-D is more accurate than Hist-I, even after Hist-I was modified.

    (DOCX)

    pcbi.1013438.s004.docx (25.7KB, docx)
    S4 Text. Hist-D enhances the accuracy of reproduction number estimation.

    (DOCX)

    pcbi.1013438.s005.docx (25.9KB, docx)
    S1 Fig. Estimation of reproduction number is heavily dependent on the initial condition of E.

    (a) The plot displaying the fold error between the estimated reproduction number (R^) and the true reproduction number (R) across various bias levels in E(0). E(0) was initially E0=50, and it was changed to E1. When R=2 (red curve), E1/E0=22.5 led to a 37.9% relative error and E1/E0=22.5 led to a 19.4% relative error. When R=4 (blue curve), E1/E0=22.5 led to a 42.5% relative error and E1/E0=22.5 led to a 34.1% relative error. (b) The plot showing the estimated time-varying reproduction number (R^(t)) under varying levels of bias in the initial condition of E: E(0)/4, E(0)/2, E(0), 2E(0), and 4E(0), where E(0) is the original initial condition value. The influence of bias in E(0) continues noticeably well after the time point where the initial condition was estimated, and decreases over time, leading to convergence in the estimated reproduction numbers.

    (EPS)

    pcbi.1013438.s006.eps (1.3MB, eps)
    S2 Fig. The superiority of Hist-D was preserved even after the modification of Hist-I as done in Rauch et al.

    (a) The graph comparing the true E(t0) (light gray-colored bars) and the estimated E(t0) (E^(t0)). The modified Hist-I underestimates E(t0). (b) The scatter plot displaying the error between estimated E(t0) (E^(t0)) and true E(t0) across different levels of true E(t0). Estimation from Hist-D (green squares) shows smaller errors compared to the modified Hist-I (red triangles). (c) The bar plot showing the root mean squared error (RMSE; bars) and mean absolute percentage error (MAPE; line) of the modified Hist-I and Hist-D. RMSE and MAPE were reduced by 80% and 53%, respectively, when Hist-D was utilized, compared to the modified Hist-I.

    (EPS)

    pcbi.1013438.s007.eps (1.1MB, eps)
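
    The RMSE and MAPE comparisons reported in these captions can be illustrated with a minimal sketch. It assumes the standard textbook definitions of both metrics (the article's exact formulas are not given in this section), and the function names and numbers below are hypothetical illustrations, not part of the Hist-D package or the study data.

```python
import math


def rmse(true_vals, est_vals):
    """Root mean squared error between true and estimated values."""
    n = len(true_vals)
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(true_vals, est_vals)) / n)


def mape(true_vals, est_vals):
    """Mean absolute percentage error in percent (true values must be nonzero)."""
    n = len(true_vals)
    return 100.0 * sum(abs((e - t) / t) for t, e in zip(true_vals, est_vals)) / n


def percent_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Made-up illustrative values, NOT data from the study.
true_E = [100.0, 200.0, 400.0]
est_a = [50.0, 100.0, 200.0]   # hypothetical estimator A: 50% too low everywhere
est_b = [90.0, 180.0, 360.0]   # hypothetical estimator B: 10% too low everywhere

rmse_reduction = percent_reduction(rmse(true_E, est_a), rmse(true_E, est_b))
mape_reduction = percent_reduction(mape(true_E, est_a), mape(true_E, est_b))
print(f"RMSE reduced by {rmse_reduction:.1f}%, MAPE reduced by {mape_reduction:.1f}%")
```

    In this toy case both metrics fall by the same percentage because each estimator has a constant relative error. With real estimates the two reductions generally differ, since RMSE weights large absolute errors more heavily than MAPE does.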
    S3 Fig. Hist-D outperforms both Hist-I and the modified Hist-I even in the early phase of the epidemic dynamics.

    (a) The graph comparing the true $E(t_0)$ (light gray bars) and the estimated $E(t_0)$ ($\hat{E}(t_0)$) for time $t_0=1080$. (b) The scatter plot displaying the error between the estimated $E(t_0)$ ($\hat{E}(t_0)$) and the true $E(t_0)$ across different levels of the true $E(t_0)$. Estimates from Hist-D (green squares) show smaller errors than those from both Hist-I (red triangles) and the modified Hist-I (blue circles). (c) The bar plot showing the root mean squared error (RMSE; bars) and mean absolute percentage error (MAPE; line) of Hist-I, the modified Hist-I, and Hist-D. Hist-D reduced both RMSE and MAPE by 81% compared to Hist-I, whereas the modified Hist-I reduced both by 47%.

    (EPS)

    pcbi.1013438.s008.eps (1.2MB, eps)
    S4 Fig. Hist-D enhances the accuracy of reproduction number estimation.

    Boxplots of the posterior samples of the reproduction number ($\hat{\mathcal{R}}$) obtained from IONISE with initial conditions estimated by Hist-D (green) and Hist-I (red). IONISE combined with Hist-D accurately estimated the reproduction number, while using Hist-I introduced considerable bias. Here, the posterior samples were normalized by the true value ($\mathcal{R}$) used to generate the simulation data depicted in Fig 3c.

    (EPS)

    pcbi.1013438.s009.eps (1.1MB, eps)
    S1 Table. Hist-D is more accurate than Hist-I under various parameter conditions.

    The table shows the reduction in RMSE (former) and MAPE (latter) achieved by Hist-D relative to Hist-I under the same conditions as Fig 3c–3e, but with varied latent and infectious period parameters.

    (DOCX)

    pcbi.1013438.s010.docx (23.2KB, docx)
    Attachment

    Submitted filename: review.pdf

    pcbi.1013438.s011.pdf (80KB, pdf)
    Attachment

    Submitted filename: PLoS Comp Biol_review_3-25.pdf

    pcbi.1013438.s012.pdf (56.9KB, pdf)
    Attachment

    Submitted filename: Response to reviewers.docx

    pcbi.1013438.s014.docx (1.7MB, docx)
    Attachment

    Submitted filename: Response to reviewers_QC.docx

    pcbi.1013438.s015.docx (33KB, docx)

    Data Availability Statement

    The simulated data on daily exposed and infectious people generated in this study are provided as a CSV file, ‘SimulData_cde.csv’, in the publicly available GitHub repository (https://github.com/Mathbiomed/Hist-D). A permanent reference to the version of the code used in this study is provided at the Zenodo repository (https://doi.org/10.5281/zenodo.16891923). The real-world confirmed cases and contact tracing data were collected with informed consent and were provided by the Seoul Metro Infectious Disease Research Center. These data are protected and are not available due to data privacy laws. Specific academic requests for access to these data should be directed to the Citizens’ Health Bureau, Infectious Disease Control Division, Seoul Metropolitan Government (Tel: 82-02-2133-9480, E-mail: pr77889@seoul.go.kr) or the Institute for Basic Science (E-mail: webmaster@ibs.re.kr).


    Articles from PLOS Computational Biology are provided here courtesy of PLOS
