Abstract
COVID-19 testing has become a standard approach for estimating prevalence, which then assists in public health decision making to contain and mitigate the spread of the disease. The sampling designs used are often biased in that they do not reflect the true underlying populations. For instance, individuals with strong symptoms are more likely to be tested than those with no symptoms. This results in prevalence estimates that are biased upward. Typical post-sampling corrections are not always possible. Here we present a simple bias correction methodology derived and adapted from a correction for publication bias in meta-analysis studies. The methodology is general enough to allow a wide variety of customizations, making it more useful in practice. Implementation is easily done using already collected information. Via a simulation and two real datasets, we show that the bias corrections can provide dramatic reductions in estimation error.
Keywords: Estimation of prevalence, Symptoms, Outbreak, Epidemic, Entropy
1. Introduction
There is an urgent need to better understand the spread of COVID-19 in populations, both to identify important changes in infection dynamics and to understand the effectiveness of control and mitigation strategies. Testing studies of many kinds have therefore been undertaken all over the world. Serological surveys (sometimes together with nucleic acid amplification testing) have become a widespread tool to estimate SARS-CoV-2 prevalence and to assess the extent to which the infection has spread in the population. On the other hand, the results of nucleic acid amplification testing are usually used for diagnosis or detection of SARS-CoV-2 infection and for estimating the incidence of infections. With few exceptions, however, most of these studies have relied on biased samples. These can be convenience samples, which can lead to over-representation of symptomatic sampled units (Alleva et al., 2020). This in turn can lead to over-estimation of disease prevalence. Random sampling protocols can be inefficient due to the lower degree of infection amongst asymptomatic individuals. Thus there is a great need for effective corrections that can reduce bias in prevalence estimation.
Much attention has been focused on the issue of correcting for imperfect tests (Diggle, 2011; Greenland, 1996), but less attention has been paid to correcting for biased sampling. One notable exception is Alleva et al. (2020), who proposed a snowball sampling approach in conjunction with contact tracing in order to set up a better disease surveillance system. However, this is not how the vast majority of studies are conducted today. In this paper, we address biased sampling from an entirely different and somewhat unexpected viewpoint.
We note that biased samples also occur when doing meta analyses due to publication bias (Andrews and Kazy, 2019). That is, papers which favor a null hypothesis of no treatment effect are less likely to be published and hence meta-analytic estimates of treatment effect can be over-estimated. If the null-favoring censoring mechanism can be modeled, then interesting corrections can be made. Based on Andrews and Kazy (2019), we derive and adapt a version of their model for correcting sampling bias in the current COVID-19 pandemic.
We develop the main idea taking into account several categories of symptoms. However, some readers might find it of interest to consider just two categories (symptomatic and asymptomatic). In this case, a simple summary of the theory can be found in SubSection 3.5 and a real-life example for this situation can be found in Section 6.
2. The model
Consider a population P of size N. P has a partition into subsets of P each with proportions given by the vector , where
In other words, belongs to the standard -simplex. We define a r.v. taking values in the set that, conditioned on , selects an element of the partition according to a categorical distribution in the interval :
where represents symptomatology (i.e., number or degree of symptoms), and , prevalence: represents infected, while represents non infected. Think of as the set of indexes of the elements in the partition . Thus, with a slight abuse of notation, we will refer to s either as a category or as the subset of with that particular index. In the most common scenario there are just two categories (asymptomatic and symptomatic), so M = 2. However, some studies have considered more than two degrees of symptoms (see Sudre et al., 2020).
The proportion of people with symptoms is given by
(1) |
Here represents the probability of being in the category s and being infected, whereas represents the probability of being in the category s while non infected.
From this notation, the overall probability of being infected is
Also, is the conditional probability of being infected given that we are considering the category s of symptoms.
We assume a Bernoulli r.v. T, which will be 1 with probability . Let’s consider an independent sequence , distributed as T. If the individual j belongs to the group , which happens with probability , will tell us that the individual j is tested (sampled). The sample size is given by .
In summary, up to this point, for , we have that:
- is the probability of being in the category m and being tested.
- is the probability of testing for an individual in category m who is infected.
- is the conditional probability of being infected given m.
- is the real proportion of people with the symptoms.
Let’s assume without loss of generality, an ordering for the partition of P by increasing severity and/or number of symptoms. Then we obtain the four following orderings:
(2) |
(3) |
(4) |
(5) |
The intuition behind these is that the higher the degree and/or number of symptoms, the higher the probability of being tested (2), the higher the probability of testing infected people (3), the higher the probability of being infected inside that group (4), and the lower the real proportion of people with the symptoms (5).
From the conditional distribution of , we observe i.i.d. draws of , whose density, because of Bayes theorem, is
(6) |
Assume there is no error in testing. Then we know exactly the proportion of infected people in the sample. Moreover, we know the category s under which each person is tested. Therefore, for all s, we can derive:
(7) |
Thus, for each s, we obtained in (6) the biased estimate of the proportion of people tested and in (7) the biased estimate of prevalence. The total biased estimator of tested people is
(8) |
and the total biased estimator of prevalence is
(9) |
3. Bias correction
The bias correction is easily determined. It amounts to multiplying the quantities on the LHS of (6) and (7) by the inverse of the quotient on the RHS of each respective equation. Specifically, the bias correction is given by , where
(10) |
Replacing x by s and will give us the bias correction for testing and prevalence, respectively, for each s. Summing over gives the sampling bias-corrected estimate of disease prevalence. Now, the numerator at the RHS of (10) can be estimated as , where is the number of people tested, and N is the census population. However, the denominator is unknown, but we can still say some things, depending on the number of symptoms M we are considering.
3.1. Big s
The bias problem of testing and prevalence is not in the last values of , but in considering only the last values of for testing. This is the main thing to correct. We already learned that the overall proportion of people tested is . We also know that most of the people tested are symptomatic; therefore, when s approaches is a good estimator of the real proportion of the last value. Finally, since we know that most of the symptomatic people will be infected, and most of them will get tested, then for large s.
3.2. Small s
When s decreases to 1, the situation is different. According to (2), in the absence of symptoms, the probability of being tested is small. Nonetheless, having few tests for small values of s does not mean that the proportion of people infected/uninfected is close to 0. I.e., might be small, but is large, which according to (6) is introducing a heavy bias. Another way to think of this is that for an unbiased sample, would have a clear representation.
Now, notice that, since , it is possible to give a lower bound for :
(11) |
Giving its lower possible value makes the corrected estimate . However, this implies that for all , the estimated probability of being tested is 0, which cannot be true.
In order to solve this, let’s divide into three groups: , where and are the subsets of low and high values of symptoms, respectively; and , the symptoms in between.
According to Section 3.1, we have already estimated the probability of the elements in . Thus making , we propose the following solution for values in :
1. Take the space , which has probability .
2. Letting be the cardinality of the set , assign equal probabilities to all . I.e., .
The assignation of equal probabilities is justified by our total ignorance of the real proportions, resorting thus to Bernoulli’s Principle of Insufficient Reason (PoIR, see, e.g., Dembski and Marks, 2009), which says that in the absence of further knowledge we must assign equal probabilities to events. Or, more strongly, by the maximum entropy principle, since, according to Jaynes, “in making inferences on the basis of partial information we must use that probability distribution which has maximum entropy subject to whatever is known. This is the only unbiased assignment we can make; to use any other would amount to arbitrary assumption of information which by hypothesis we do not have” (Jaynes, 1957) (see also Díaz-Pachón and Marks, 2020). In other words, if we are going to consider some other distribution than equiprobability, we must justify the reduction in entropy that is inserting bias.
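The equal-probability assignment described above can be sketched in a few lines of Python. This is only an illustration: the function name `maxent_low_shares` and its arguments are hypothetical, and the sketch assumes that the high-symptom group has already received an estimated proportion as in SubSection 3.1 and that the remaining probability mass is split among the low-symptom categories only.

```python
def maxent_low_shares(p_high_hat, n_low):
    """Split the mass not assigned to the high-symptom group equally
    among n_low low-symptom categories (maximum entropy assignment)."""
    if not 0.0 <= p_high_hat <= 1.0:
        raise ValueError("p_high_hat must lie in [0, 1]")
    if n_low < 1:
        raise ValueError("need at least one low-symptom category")
    remaining = 1.0 - p_high_hat  # mass left for the low-symptom categories
    return [remaining / n_low] * n_low


# Illustrative values only (not taken from the paper's examples).
shares = maxent_low_shares(p_high_hat=0.10, n_low=3)
print(shares)  # three equal shares, each close to 0.30
```

Any non-uniform split would have to be justified by additional knowledge, which is the point of the maximum entropy argument.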
3.2.1. Prevalence
We know that in the asymptomatic population, a non-negligible portion of individuals is infected. However, studies vary as to the proportion of asymptomatic individuals who are infected. Some of those studies maintain that most of the asymptomatic people are already infected. In such case, , as we explained in SubSection 3.2. Others lean on the side that, for small s, we have closer to 0 than to 1, but not necessarily approaching 0. We propose the following algorithm:
1. If we don’t have any information about the asymptomatic category with the disease, generate a random number u according to a uniform distribution in the interval .
2. If we have some information about the asymptomatic category with the disease from the biased sample, make u the proportion of prevalence inside the particular category of interest provided by the biased sample (i.e., ).
3. Make .
Remark 1
We have randomized u in , assuming total ignorance regarding . However, as we mentioned before the algorithm, other options are possible. For instance, in case of closer to 0 than to 1, but not necessarily approaching 0, it makes sense to randomize u in .
Remark 2
It is also worth noticing that, because of step 2 in the previous algorithm, the importance of u decreases as the information in the sample (particularly information about asymptomatic individuals and groups with lesser symptoms) increases.
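The choice of u in the algorithm above can be sketched as follows. The function names and the `upper` parameter are hypothetical. The sketch also assumes that the final step multiplies u by the estimated proportion of the category, which is what the product rule P(infected, s) = P(infected | s) P(s) would give; the source formula for that step should be consulted before relying on this.

```python
import random


def choose_u(sample_positives=None, sample_size=None, upper=1.0, rng=random):
    """Within-category prevalence u for a low-symptom category.

    Without sample information, draw u uniformly on (0, upper).
    With sample information, use the sample prevalence in that category.
    """
    if sample_positives is None or not sample_size:
        return rng.uniform(0.0, upper)
    return sample_positives / sample_size


def infected_share(u, p_s_hat):
    # Assumed form of the last step: joint share = P(infected | s) * P(s).
    return u * p_s_hat


# With sample information (50 positives out of 500 tested): u = 0.1.
u_known = choose_u(sample_positives=50, sample_size=500)

# Without information, u is random; `upper` can shorten the interval in the
# spirit of Remark 1 (the value 0.5 is only an illustrative choice).
u_random = choose_u(upper=0.5, rng=random.Random(1))
```

As Remark 2 notes, the random branch matters less and less as the sample carries more information about the low-symptom groups.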
3.3. Middle values of s
Notice that in the previous algorithm we are not asking to use for the correction in (10) of values in (although it might be done with some care). This is because for these values we have little to no knowledge. Our recommendation is to collapse to . For this case, we explain the estimated values in SubSection 3.5.
3.4. Some model caveats
The model is presented in some generality on purpose. It naturally permits customization to different testing settings. For instance, symptomatology may be extended to reflect different subpopulations (e.g., racial/ethnic subgroups, age groups, risk groups, environments). The model can also be generalized to index more than 1 type of test (not shown here).
We also make the caveat that this is a correction of the bias, not a total elimination of it. This is a consequence of the fact that the correction factor in (10) cannot take arbitrary values, but is restricted as explained in the derivation of Eq. (11). Nonetheless, as we will see in the toy example and the two real-life scenarios in the last sections, the proposed method achieves substantial reductions in sampling bias, particularly for the overall prevalence, without reducing it to zero.
3.5.
As suggested before, an important simplification occurs when we consider only two groups of symptomatology: asymptomatic () and symptomatic (). In this case, the analysis is conveniently reduced to:
4. Estimated variance
Let be iid multinomial . Then . Let . By the MCLT, we know that
(12) |
where AN stands for asymptotically normal, and .
Now define
and let be iid , so that . Applying the Delta method to , we obtain that , where , and is
(13) |
Notice that the correction in (10) applies to , which leads back to . Therefore, asymptotically, we obtain again the distribution in (12).
In practice, we start with the biased iid , which are , whose sum is . Letting , we have again by the MCLT that . Making , where is a vector with components as in (10), we can apply again the Delta method to h, obtaining
(14) |
However, since we are not using , but , an M-vector whose last component is and with the first components being all . Then, the variance-covariance matrix becomes
where , and is the variance of . From this, the estimated variance of the total prevalence estimate can be calculated.
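As a generic numerical illustration of the Delta method for multinomial proportions, the following sketch approximates the variance of a smooth function of the estimated cell proportions by a quadratic form in its gradient. It does not reproduce the specific functions or covariance matrices of this section; the function names, the example proportions and the linear function used are all assumptions made for illustration.

```python
def multinomial_cov(p):
    """Covariance matrix of a single multinomial(1, p) draw: diag(p) - p p^T."""
    k = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(k)]
            for i in range(k)]


def delta_method_variance(grad, cov, n):
    """Approximate Var[h(p_hat)] by grad^T cov grad / n."""
    k = len(grad)
    quad = sum(grad[i] * cov[i][j] * grad[j]
               for i in range(k) for j in range(k))
    return quad / n


p_hat = [0.7, 0.2, 0.1]    # illustrative cell proportions, not from the paper
weights = [0.0, 1.0, 1.0]  # h(p) = p[1] + p[2]; its gradient is constant
var = delta_method_variance(weights, multinomial_cov(p_hat), n=1000)
print(var)  # close to 0.3 * 0.7 / 1000, the binomial variance of the sum
```

For a linear function the approximation is exact, which gives a convenient sanity check against the binomial variance.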
5. Toy example
Let’s consider a population of size one million where individuals are labeled as asymptomatic (), few symptoms (), mild symptoms (), and all symptoms (). According to our previous definitions let’s consider the following probabilities:
None of these values is known to the researcher. The first row is the proportion of people in each group; it adds to 1. The second and third rows are the population proportions of non-infected and infected individuals, respectively. Notice that the total prevalence is . The fourth is the proportion of tested people without the disease in each group. The fifth is the proportion of infected tested people in each group. The sixth is the resulting proportion of people tested within each group; it is derived using all the other rows. The proportion of tested people is
(15) |
The researcher does not know this probability, but knows the total number of people tested (which can also be derived easily to be ), and the overall census population , with which she can estimate it. In addition, each individual’s symptomatology status and test result is known. Thus the proportion of people tested in each group is:
Notice that the last term is grossly overestimated as (against the real 0.1), and the first two terms are grossly underestimated (against the real 0.5, 0.25); the third term is very well approximated, but in practice we do not know it. However, since we know that most of is made of the last group, becomes a very good estimator of its size, and is also 0.0899. According to our proposal for the small values of S, take , and by maxent, distribute it equally among the three remaining groups. In this way, we obtain a probability of 0.306 for each of the first three groups. According to SubsubSection 3.2.1, using the information we have from the sample, we know that 50/500 were positive in the first group, 500/2500 were positive in the second group, and 7500/15,000 were positive in the third group. In this way, , and are the corrected prevalence estimators for each of the remaining groups. From this, we obtain a total corrected prevalence of . That is, we have obtained a big correction for the original estimator. The naïve estimator of prevalence produces .
This corrects greatly for and , but is harmful for , which is now overestimated, since its original naïve value was the right one, and the maxent value of 0.306 is doubling its real proportion. Moreover, since inside this third category the real number of infected is 75,000, and in the fourth category it is 90,000 (i.e., 7.5% and 9% of the overall population, respectively), any little change in the sampling scheme is likely to produce . Since our correction averages on unknowns, this very situation is observed. Thus the toy example illustrates the method very well, even highlighting the comments in SubSection 3.3 about the difficulty of dealing with middle values. Nonetheless, the fact remains that the overall correction of prevalence is a very good one.
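The within-group sample prevalences and the true infected shares quoted in the toy example can be checked with a short computation. The variable names are hypothetical, and only counts stated explicitly in the text are used.

```python
# (positives, tested) in the sample for the first three groups, as quoted.
sample = {1: (50, 500), 2: (500, 2500), 3: (7500, 15000)}
u = {s: pos / tested for s, (pos, tested) in sample.items()}
print(u)  # within-group sample prevalences: about 0.1, 0.2 and 0.5

# True numbers of infected in the third and fourth groups, as quoted.
population = 1_000_000
true_infected = {3: 75_000, 4: 90_000}
true_share = {s: k / population for s, k in true_infected.items()}
print(true_share)  # about 0.075 and 0.09 of the overall population
```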
6. The Diamond Princess COVID-19 outbreak: a real data example
A challenge in applying our bias correction to real data is that one typically does not know the population prevalence (i.e., the true answer). We will therefore consider the testing data from the Diamond Princess cruise (Mizumoto et al., 2020), in which essentially all people on the ship were tested, thus providing a “population” prevalence value as the gold standard. This example also has its problems, however, in that it involves a mostly elderly population. As background, a COVID-19 outbreak emerged on board the Diamond Princess cruise ship in early 2020. This was traced back to a former passenger who tested positive for the virus after disembarking in Hong Kong. After arriving in Yokohama, Japan, the ship was placed under quarantine and over a two-week period, essentially the entire population on the ship was tested by laboratory-based PCR for the virus. A total of 3063 unique tests were done out of 3711 total individuals. Some individuals were permitted to disembark at various points in time, but the status of those individuals not tested is assumed unknown.
At the end of the testing period, out of 3063 tests conducted, there were 634 confirmed positive cases (positive for the virus). Very importantly for our purpose, these were further categorized as 306 symptomatic and 328 asymptomatic at the time of testing. We will assume that all PCR negative individuals were asymptomatic. Additional demographic information is provided on country of residence, age and gender distributions for the cases (Mizumoto et al., 2020). The population value of prevalence found with PCR is thus 634/3063 = 0.206. We make the caveat that the accuracy of the PCR testing heavily depends on the delay between the time of infection and the time of sample collection; therefore, more probably than not, 0.206 does not correspond to the real prevalence. However, for the purposes of our illustration this is not an issue, and we will proceed considering it as the prevalence in the ship.
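The reference quantities for this example follow directly from the counts above. The short sketch below recomputes them; the variable names are hypothetical. The ratio 634/3063 is 0.20699 to five decimals, which corresponds to the 0.206 used in the text when truncated to three decimals.

```python
tests = 3063
symptomatic_pos = 306
asymptomatic_pos = 328

positives = symptomatic_pos + asymptomatic_pos      # 634 confirmed cases
prevalence = positives / tests                      # about 0.207
asymptomatic_share = asymptomatic_pos / positives   # about 0.517 of the cases
# PCR negatives, all treated as asymptomatic under the stated assumption.
negatives = tests - positives

print(positives, negatives, round(prevalence, 4), round(asymptomatic_share, 3))
```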
In order to demonstrate our bias correction, we will analyze samples from the tested ship population considering four possible sampling protocols: 1) total bias in which we sample only the symptomatic positive individuals; 2) partial bias in which we draw a sample with representation from symptomatic individuals; 3) a balanced sample of symptomatic and asymptomatic individuals; 4) a truly random sample from the ship’s population based on population symptom frequencies.
Sampling Protocol 1: The sample consists only of the 306 symptomatic positive individuals. The naïve estimate of prevalence is then 1.0, which is grossly overestimated. The bias-corrected estimate of is . This is also our bias-corrected estimate of . Furthermore, and, taking the average of u, we obtain . Thus the bias-corrected total prevalence estimate is .
Sampling Protocol 2: The sample consists of 306 symptomatic positive individuals and 101 asymptomatic ones. Thus the sample has 75% representation of symptomatic individuals. We assume the number of positive asymptomatics in the sample is . Thus the naïve sample estimate of prevalence is . Here , which is also our estimate for . Then and . Setting represents knowledge we have, which is the sample estimate of the prevalence for the asymptomatic group. Thus the bias-corrected estimate of total prevalence is which is very close to the true population value of 0.206.
Sampling Protocol 3: The sample consists of 306 symptomatic positive individuals and 306 asymptomatic ones. This means that all symptomatic individuals were still in the sample. We will assume positive cases among the asymptomatic individuals. Thus the naïve estimate of prevalence is . Once again , which is also our estimate for . Then and . Thus total corrected prevalence is 0.196, which is very close to the naïve estimate as we predicted.
Sampling Protocol 4: This is a random sample from the population. Suppose we take . Of these, are symptomatic positive individuals and thus 450 are asymptomatic individuals. Among the asymptomatic individuals, we assume are positive for the virus. Thus the naïve sample estimate of prevalence is which is quite close to the true value (save rounding errors). Thus, we anticipate, the bias correction will do little. Specifically, and this is also our estimate for . Then and . Thus the corrected prevalence estimate is . Again the true population value is 0.206.
Remark: One could also do brute force simulated bias estimation for each of the four sampling protocols above. This would mean drawing repeated samples of a given size from a “ship population” characterized using particular protocol-specific probabilities for symptoms and then conditional on symptom status, population probabilities for being virus positive. Corrected estimates of prevalence would be estimated for each sample and empirical biases estimated given that the true population prevalence is known. The analyses conducted above can be thought of as based on an idealized representative sample for each protocol across all draws.
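A minimal sketch of such a brute-force simulation is given below. It assumes a "ship population" built from the counts reported above, with every PCR-negative individual treated as asymptomatic, and it evaluates only the naïve estimator. The function names are hypothetical; a corrected estimator could be passed in place of `naive_prevalence` once its exact form is fixed.

```python
import random

# Counts quoted in the text; PCR negatives are treated as asymptomatic.
SYMPT_POS, ASYMPT_POS, TESTED = 306, 328, 3063
ASYMPT_NEG = TESTED - SYMPT_POS - ASYMPT_POS
TRUE_PREV = (SYMPT_POS + ASYMPT_POS) / TESTED
P_POS_GIVEN_ASYMPT = ASYMPT_POS / (ASYMPT_POS + ASYMPT_NEG)


def draw_sample(n, share_symptomatic, rng):
    """Return n (symptomatic, infected) pairs for one simulated sample."""
    sample = []
    for _ in range(n):
        symptomatic = rng.random() < share_symptomatic
        # Under the assumption above, every symptomatic individual is positive.
        infected = symptomatic or rng.random() < P_POS_GIVEN_ASYMPT
        sample.append((symptomatic, infected))
    return sample


def naive_prevalence(sample):
    """Share of positives in the sample, ignoring how it was drawn."""
    return sum(infected for _, infected in sample) / len(sample)


def empirical_bias(estimator, n, share_symptomatic, draws=300, seed=0):
    """Average estimate over repeated samples minus the known prevalence."""
    rng = random.Random(seed)
    estimates = [estimator(draw_sample(n, share_symptomatic, rng))
                 for _ in range(draws)]
    return sum(estimates) / draws - TRUE_PREV


# Symptomatic shares loosely mirroring the four protocols.
for share in (1.0, 0.75, 0.5, SYMPT_POS / TESTED):
    bias = empirical_bias(naive_prevalence, n=200, share_symptomatic=share)
    print(f"symptomatic share {share:.2f}: naive bias {bias:+.3f}")
```

Passing the estimator as an argument keeps the sampling mechanism separate from the correction, so the same draws can be used to compare estimators.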
Remark: It would be of interest to do similar analyses broken out by age and gender. However, the Diamond Princess data only provides aggregate information at these levels without a known mapping to symptomatology and viral presence status. It’s also important to note that all estimates and inferences are limited to considering the ship passengers as a population and should not be generalized further.
7. Lombardy – Italy
Since the share of symptomatic individuals on the Diamond Princess is controversial and the on-board population was highly skewed towards older ages, we add this new example in which the probability of developing symptoms is much lower (closer to the actual values estimated for COVID-19). This example will corroborate that the maxent principle (or Bernoulli’s PoIR) works in a more representative scenario of the COVID-19 pandemic. The actual example comes from a recent preprint by Poletti et al., where the authors calculated the probability of symptoms and critical disease after SARS-CoV-2 infection in Lombardy, Italy (Poletti et al., 2020).
In a sample of 5824 individuals it was possible to identify 932 infections through PCR testing. Moreover, besides these 932 infected individuals, they also detected 1892 infections using serological assays. Thus, the total of infected individuals was 2824. Among the total of infected, 876 were symptomatic (876/2824 ≈ 31%). Since for our purposes we are only interested in detection by PCR and not through antibodies, we will not count the 1892 infections detected through serological assays. However, we will use the fact that about 31% of the cases were symptomatic, assuming that the same percentage holds for the 932 infections detected by PCR. Thus, in our case we will have a prevalence of 932/5824 ≈ 0.16; and among the infected, about 289 individuals (31% of 932) will be symptomatic. The remaining 932 − 289 = 643 will be infected and asymptomatic.
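The arithmetic behind these figures can be reproduced as follows; the variable names are hypothetical and only the counts quoted in the paragraph above are used.

```python
sample_size = 5824
pcr_infections = 932
sero_infections = 1892
symptomatic_total = 876

total_infections = pcr_infections + sero_infections          # 2824
symptomatic_share = symptomatic_total / total_infections     # about 0.31
prevalence = pcr_infections / sample_size                    # about 0.16

# Apply the same symptomatic share to the PCR-detected infections only.
symptomatic_pcr = round(symptomatic_share * pcr_infections)  # 289
asymptomatic_pcr = pcr_infections - symptomatic_pcr          # 643

print(total_infections, round(symptomatic_share, 2), round(prevalence, 2),
      symptomatic_pcr, asymptomatic_pcr)
```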
Sampling Protocol 1: The sample consists of the 289 infected and symptomatic individuals. In this case, the naïve estimate of prevalence is 1. The bias corrected estimate will be . And this will also be the correction for . Then , and taking the mean of u we obtain . Therefore, the total corrected prevalence is estimated as , which is still high but heavily corrects the effects of a very bad sample.
Sampling Protocol 2: The sample consists of 384 individuals. Among these, 289 () are infected and symptomatic, and 95 () are asymptomatic. We assume the sample has asymptomatic positive for the virus. So our naïve estimate of prevalence is . In this case, . Now, ; and setting u as . Therefore the total prevalence is corrected to , which is very close to the real 0.16.
Sampling Protocol 3: In this scenario we have 289 symptomatic positive and 289 asymptomatic. We assume the sample has asymptomatic and infected individuals. The naïve estimate is . However, ; and . In this case, ., where . Therefore, is the corrected estimate of prevalence.
Sampling Protocol 4: This sample is truly random. Say . Among these, are symptomatic and positive. Therefore, 570 are asymptomatic. Among the asymptomatic group, we are going to assume infected individuals. The naïve sample estimate is thus , which of course is the same as the real prevalence. In this case, the correction will work like this: . Then , and . Therefore, the total prevalence is estimated as . In this case, the correction is not bad, but of course does not do as well as the truly random sample.
8. The need for further bias correcting and discussion
The model assumes tests with no errors (i.e. false positives or false negatives). Clearly this is not the situation in practice. Specificities and sensitivities can often be less than ideal. Sampling bias-corrected estimates of prevalence can be further corrected in a second stage using the methods of Diggle (2011) or Greenland (1996) which account for using imperfect tests.
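As an illustration of what such a second-stage adjustment can look like, the sketch below applies the classical Rogan-Gladen correction, a standard adjustment of an apparent prevalence for known sensitivity and specificity. It is used here as a stand-in and is not necessarily the exact procedure of the references cited above; the function name and the numerical values are assumptions.

```python
def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    """Adjust an apparent prevalence for imperfect sensitivity/specificity."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("test must be better than chance (Se + Sp > 1)")
    adjusted = (apparent_prevalence + specificity - 1.0) / denom
    return min(1.0, max(0.0, adjusted))  # clip to the valid range [0, 1]


# Illustrative values only: a sampling-bias-corrected estimate of 0.20
# obtained with a test of 90% sensitivity and 98% specificity.
print(round(rogan_gladen(0.20, sensitivity=0.90, specificity=0.98), 4))
```

With a perfect test (sensitivity and specificity both equal to 1) the adjustment leaves the estimate unchanged, which matches the no-error assumption of the model.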
Our study demonstrates that under biased sampling designs that are often difficult to avoid in testing studies for COVID-19, the resulting biased estimates of prevalence can be corrected using simple methodology derived and adapted from corrections for publication bias used in meta-analysis studies. Further research is needed in order to correct the prevalence of “middle” groups, as stated in SubSection 3.3 and seen with the overestimation of in the toy example. However, the correction detailed in our toy example, while extreme, shows the effectiveness of the correction for the total prevalence. The corrections can be used directly in practice using the data collected even though many of the underlying quantities in the population may be unknown to the researcher.
CRediT authorship contribution statement
Daniel Andrés Díaz-Pachón: Conceptualization, Methodology, Validation, Formal analysis, Writing - original draft. J. Sunil Rao: Conceptualization, Methodology, Validation, Formal analysis, Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
JSR was partially supported by NSF grant DMS-1915976 and NIH grants U54 MD010722 and UL1 TR000460. We would like to thank the two referees for their helpful input which improved the final version of the paper.
References
- Alleva, G., Arbia, G., Falorsi, P.D., Zuliani, A., 2020. A sample approach to the estimation of the critical parameters of the SARS-CoV-2 epidemics: an operational design with a focus on the Italian health system (Technical report). University of Sapienza.
- Andrews, I., Kazy, M., 2019. Identification of and correction for publication bias. Am. Econ. Rev. 109 (8), 2766–2794.
- Dembski, W.A., Marks, R.J., II, 2009. Bernoulli’s principle of insufficient reason and conservation of information in computer search. In: Proc. of the 2009 IEEE International Conference on Systems, Man, and Cybernetics. San Antonio, TX. pp. 2647–2652.
- Díaz-Pachón, D.A., Marks, R.J., II, 2020. Generalized active information: extension to unbounded domains. BIO-Complexity 2020 (3), 1–6. doi: 10.5048/BIO-C.2020.3. URL: https://bio-complexity.org/ojs/index.php/main/article/view/BIO-C.2020.3.
- Diggle, P.J., 2011. Estimating prevalence using an imperfect test. Epidemiol. Res. Int. 608719.
- Greenland, S., 1996. Basic methods for sensitivity analysis of biases. Int. J. Epidemiol. 25 (6), 1107–1116.
- Jaynes, E.T., 1957. Information theory and statistical mechanics. Phys. Rev. 106 (4), 620–630.
- Mizumoto, K., Kagaya, K., Zarebski, A., Chowell, G., 2020. Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship, Yokohama, Japan, 2020. Euro Surveill. 25 (10), 2000180. doi: 10.2807/1560-7917.ES.2020.25.10.2000180.
- Poletti, P., Tirani, M., Cereda, D., Trentini, F., Guzzetta, G., Sabatino, G., Marziano, V., Castrofino, A., Grosso, F., del Castillo, G., Piccarreta, R., ATS Lombardy COVID-19 Task Force, Andreassi, A., Melegaro, A., Gramegna, M., Ajelli, M., Merler, S., 2020. Probability of symptoms and critical disease after SARS-CoV-2 infection. Preprint. URL: https://arxiv.org/abs/2006.08471.
- Sudre, C.H., Lee, K., Ni Lochlainn, M., Varsavsky, T., Murray, B., Graham, M.S., Menni, C., Modat, M., Bowyer, R.C.E., Nguyen, L.H., Drew, D.A., Joshi, A.D., Ma, W., Guo, C., Lo, C.H., Ganesh, S., Buwe, A., Capdevila Pujol, J., Lavigne du Cadet, J., Visconti, A., Freydin, M., El Sayed Moustafa, J.S., Falchi, M., Davies, R., Gomez, M.F., Fall, T., Cardoso, M.J., Wolf, J., Franks, P.W., Chan, A.T., Spector, T.D., Steves, C.J., Ourselin, S., 2020. Symptom clusters in Covid19: A potential clinical prediction tool from the COVID Symptom study app. medRxiv. https://doi.org/10.1101/2020.06.12.20129056.