Skip to main content
PLOS Computational Biology logoLink to PLOS Computational Biology
. 2021 Mar 4;17(3):e1008726. doi: 10.1371/journal.pcbi.1008726

Group testing as a strategy for COVID-19 epidemiological monitoring and community surveillance

Vincent Brault 1,#, Bastien Mallein 2,#, Jean-François Rupprecht 3,*,#
Editor: Jingyi Jessica Li4
PMCID: PMC7932094  PMID: 33661887

Abstract

We propose an analysis and applications of sample pooling to the epidemiologic monitoring of COVID-19. We first introduce a model of the RT-qPCR process used to test for the presence of virus in a sample and construct a statistical model for the viral load in a typical infected individual inspired by large-scale clinical datasets. We present an application of group testing for the prevention of epidemic outbreak in closed connected communities. We then propose a method for the measure of the prevalence in a population taking into account the increased number of false negatives associated with the group testing method.

Author summary

Sample pooling consists in combining samples from multiple individuals into a single pool that is then tested using a unique test-kit. A positive test means that at least one individual within the pool is infected. Sample pooling could provide the means for rapid and massive testing for the presence of SARS-CoV2 among asymptomatic individuals. Here, we do not address any diagnostic problems—e.g. how to use a minimal number of tests to obtain an individual diagnostic—but rather focus on population-scale application of pooling. We first quantify the reduction of test sensitivity due to sample dilution and quantify the efficiency of large pools in (i) obtaining precise estimates of the proportion of infected individuals in the general population at reduced costs and (ii) implementing regular large-scale screenings beneficial in the early detection of epidemic outbreaks within communities (e.g. nursing homes or university campuses).

Introduction

Testing aims at revealing the presence of viral load of SARS-CoV-2 within infected individuals [1, 2]. At date, the most standard mean to reveal such viral load remains the reverse transcription quantitative polymerase chain reaction (RT-qPCR) tests [3]. Bottlenecks in the production of reactants used in RT-qPCR diagnostic testing [4, 5] contributed to the development of alternative techniques that provide a more rapid diagnostic, e.g. lateral-flow antigen and RT-LAMP tests, yet at the expense of a reduced sensitivity compared to RT-qPCR tests.

In the wake of a COVID-19 second wave in Europe and at the current date, several countries have been implementing or are actively considering the implementation of massive testing, e.g. Slovakia, whereby repeated nation-wide screenings based on antigen tests occured on week-ends in November 2020 [6]; the Duchy of Luxembourg, whereby a nation-wide screening based on RT-qPCR tests is scheduled for Spring 2021 [7]; the city of Liverpool (United Kingdom) with operation Moonshot consisting in repeated city-wide screenings using a combination of testing strategies.

As COVID-19 infected individuals may be contagious without showing symptoms, tracing is particularly challenging; while individuals showing no symptoms throughout the infection appear to account for only 15% of infections [810], pre-symptomatic individuals appear to cause around 50% of infections approximatively [1114].

Large-scale testing programs aim at addressing such challenge by allowing an earlier identification of asymptomatic and pre-symptomatic carriers [15]. In China, city-wide testing programs were reported in several cities including Wuhan (May 2020) [16] and Qingdao (October 2020) [17]. These cities relied on a technique called sample pooling, equivalently called group testing. The principle of group testing consists in combining samples from multiple individuals into a single pool that is then tested using a single test—which, in the COVID-19 context, amounts to using a single RT-PCR well and reactive kit. The pool sample is considered to be positive if and only if at least one individual in the group is infected.

Group testing has a long history that dates back to the seminal work by R. Dorfman in 1943 [18] in the context of syphilis detection, see [19] for a review.

Several teams across the world have developed group testing protocols for SARS-CoV-2 infected individuals using RT-qPCR tests. As early as February 2020, pools of 10 have been used over 2740 patients to detect 2 positive patients over the San Francisco Bay in California [20]. Late April, a report from Saarland University, Germany, indicated that positive sample with a relatively mild viral load from asymptomatic patients could still be detected within pools of 30 [21]. Further works suggest that RT-qPCR viral detection can been achieved in pools with a number of samples ranging from 5 to 64 [2236].

In parallel, the theoretical literature on group testing for SARS-CoV-2 diagnostic is growing at a fast pace [4, 3742]. Most of the emphasis has been put on the binary (positive or negative) outcome of tests, with little or no regard on the viral load quantification [3]. Moreover, if the possibility of false negatives is sometimes considered, the increase in the rate of false negatives with dilution of samples due to group testing is not often taken into account [43].

In this article, we do not address any diagnostic problems, such as the question of determining optimal strategies to provide individual positive diagnostic using a minimal number of tests as solved by hypercube methods [23, 26, 4345] and the P-Best algorithm [24]. We rather propose to evaluate pooling strategies in the non-diagnostic contexts of screening and surveillance, as defined by the Centers of Disease Control (CDC, USA) terminology [46].

In Section II, we propose a group testing protocol that aims at the early detection of an epidemic outbreak in a closed community, such as nursing homes or universities. In the context surveillance, the size of pools is mainly determined by the maximal tolerated sensibility loss induced by the sample pooling process; the optimal pool size predicted according to a diagnostic criteria addressing the minimizing number of diagnostic is ill-defined in such context, and does not provide an adapted answer to the question of determining whether a disease is present or not absent in a community.

In Section III, we provide a mathematical formula to estimate the viral prevalence (i.e. the fraction of positive individuals among the tested population) based on pool results. Indeed, the laboratory in charge of the screening program may not have access to the subsequent diagnosis results from positive pools; within the American CDC surveillance protocol, non-clinical (e.g. veterinary) laboratories are allowed to perform pooling of samples for surveillance screenings but individual diagnosis is to be performed by a clinical laboratory abiding by the Clinical Laboratory Improvement Amendments (CLIA) rules. Some individuals may also refuse to comply to the diagnosis tests. In addition, diagnostic tests performed on positive pools may also turn negative [47]; we see at least 3 possible reasons for such discrepancies: (i) an inherent false-negative risk in diagnostic tests, (ii) a possible time delay between the screening and diagnosis tests (e.g. that could result in positive individuals in the pool turning negative in the diagnostic test) or (iii) the fact that the screening and diagnostic tests may not rely on the same sample collection—with screenings relying on self-collected nasal swabs or saliva collection, while diagnosis tests are most often performed on nasopharyngeal swabs samples with an arguably higher level of sensitivity.

We find that large pools are extremely efficient at estimating the prevalence. Such estimates could serve as a metric to scale prevention measures within a predefined graded response scheme—e.g. in a college university campus, the decision to switch to remote teaching could be triggered once a critical measured prevalence is reached. Surveillance protocols based on sample pooling have indeed been implemented in several universities across the United States, including Duke University [48] and the State University of New York [49]. Similar protocols have been defined for regular surveillance at Liège University (Belgium), as well as at Nottingham and Cambridge universities (United Kingdom) [50]; in the latter, samples are pooled by dormitories; if a pool turns positive, all individuals are requested to undergo isolation as potential case contact; a second diagnostic test is then performed to find the infected individuals [50].

Both Section II and Section III rely on a realistic models for the risk of false negatives induced by sample pooling. Estimating such risk is the objective of Section I, whereby we provide a short description of the RT-qPCR and a statistical model for its study. In Section I.2, we analyse the distribution of viral loads among a series of clinical datasets to estimate the averaged false-negative rates induced by the sample pooling process, assuming a linear dilution and a fixed positivity cycle threshold.

Results

I Models for sample pooling in RT-qPCR test

We present a mathematical model of the RT-qPCR test as well as a new censored-Gaussian method to fit distributions of viral load in the population. We apply our results to the estimation of the increased risk of false-negatives due to dilution.

I.1 Statistical model for the cycle threshold value (fixed a given viral concentration in the sample)

The RT-qPCR technique is a routine laboratory technique used to estimate the concentration of viral material in samples [51]. A RT-qPCR machine typically returns a Ct value, which corresponds to −log2 of the initial number of DNA copies in the sample, up to an additive constant and measure error. It is measured as an estimated number of cycles needed for the intensity of the fluorescence of the sample to reach a target value (see Fig 1). The main test in the realization of that test are recalled in Box 1.

Fig 1.

Fig 1

(A) Sketch of an RT-qPCR fluorescence intensity signal for a positive patient without pooling (solid red line) a single positive patient in a pool of 64 patients (dashed red curve) and for a negative sample representing the response of an artefact (dotted magenta curve); as pooling dilutes the initial concentration, the pooled response (dashed red curve) is expected to be close to the translation xx + 6 from that of a single patient (solid red line). (B) Sketch of the distribution of threshold values for RT-qPCR individual tests (solid blue line) or in pools 64 (dashed red curve); part of the distribution crosses the limit of detection of the test (figured as the grey area) at the detection threshold dcens.

Box 1: A brief description of RT-qPCR tests

We very briefly review some of the steps implemented during an RT-qPCR diagnostic procedure [3]:

  1. The sample is treated so that a target RNA sequence (characteristic of the virus) is transcribed into DNA (reverse transcription);

  2. The sample is placed in a RT-qPCR machine, which can measure the concentration of DNA of interest in the sample by making it fluorescent;

  3. A reactive is added which approximatively doubles the number of DNA of interest at every cycle, driven by temperature changes;

  4. The time series of the concentration in DNA over time is recorded; on a linear regression of of the logarithm of the fluorescent signal over time, one deduces an estimate of the viral concentration in the sample from the linear regression value at the origin.

Combined measures of two viral RNA strands are also recommended [3]. Here we focus on a single RNA strand detection and we do not model here the possible errors at the reverse transcription stage, which could lead to some biased measure of the viral load distribution. Depending on the RT-qPCR device, the Ct value of the sample can shift by an additive constant; such constant can be estimated by measuring the Ct value of a standard solution of viral DNA to tare the measure. In any cases, some RT-qPCR device might allow the detection of lower viral loads than others.

I.1.1 Model of the cycle threshold values for an individual sample

RT-qPCR tests are prone to amplify non-specific DNA sequences [51, 52] that can trigger an onset of fluorescence in a samples with no viral SARS-CoV-2 load. The fact that such spurious onset of fluorescence typically occurs beyond a relatively large critical number of cycles imposes the following condition on the diagnosis to minimize the risk of false positives: a reliable positive result can only be made if the Ct value is lower than a critical value, denoted dcens. Here, the onset of fluorescence from virus-free samples will be modelled as if triggered by a vanishingly small artificial concentration, denoted ϵ1.

We propose to model the number of cycles threshold value Ct as a random variable, denoted by Y, that depends on the viral load c in the measured sample as

Y=-log2(c+ϵ1)+ϵ2, (1)

where we assume that (i) the risk of non-specific amplification (false-positive) ϵ1 as log-normal distribution with parameters (ν, τ2); and that (ii) the intrinsic variability in Ct measurement ϵ2 is a centered Gaussian random variable with variance ρ2.

As mentioned above, tests are considered to be reliably positive when Ydcens. To minimize the risk of false positives, the threshold dcens (with cens for censoring) is chosen such that P(ϵ1>2-dcens)1. Thus, using that as long as a and b are of different orders of magnitude, we have log(a + b) ≈ log(max(a, b)), we deduce that

Ymin(-log2(c),dcens)+ϵ2, (2)

which obeys the law of a Gaussian random variable with variance ρ2 and mean −log2(c), censored at dcens.

In the no false-positive risk limit (ϵ1 → 0), the RT-qPCR threshold intensity of a negative patient (c = 0) would never be reached (Y → ∞ as well as dcens = ∞).

I.1.2 Model of the cycle threshold values for pooled samples

We now consider what happens when constructing a pooled sample of N samples. For each iN, we write Zi = 1 if the sample i contains a viral RNA load with concentration Ci > 0, and Zi = Ci = 0 otherwise. In the rest of the paper, we assume that, in a combined sample created from a homogeneous mixing of the individual samples, the viral concentration reads:

C(N)=1Nj=1NZjCj. (3)

This assumption relies on the fact that infected individuals should have a sufficiently high number of viral copies per sample, so that taking a portion 1/N of a virus bearing sample brings a fraction 1/N of its viral charge. The result of the RT-qPCR measure of a grouped test with N individuals is then given by Eq 1, with c = C(N), hence reads

Y(N)=min[log2N-log2(j=1NZjCj),dcens]+ϵ2. (4)

where (Zi, iN) are i.i.d. Bernoulli random variables whose parameter is the prevalence of the disease in the population; (Ci, iN) are i.i.d. random variables corresponding to the law of the viral concentration within samples taken from a typical infected individual in the overall population.

Our model Eq 4 is consistent with the experimental result of [22] as well as [33], whereby linear relations are found between the logarithm of the pool size and the measured Ct that are sufficiently distant from the identified detection threshold.

Remark I.1. If it were possible to combine samples without dilution (e.g. following the protocol of [34], whereby the exact same volume of each sample is added to the buffer solution as if the sample were being tested individually), 4 would then be replaced by

Y(N)=min[-log2(j=1NZjCj),dcens]+ϵ2, (5)

in which case, theoretically, pool testing would never loose precision when the pool size increases. However, if the dilution effect occurs for pool sizes exceeding a threshold size K, Eq 4 would be replaced by

Y(N)=min[log2(NK)+-log2(j=1NZjCj),dcens]+ϵ2; (6)

where log2(N/K)+ = 0 if N < K and log2(N/K)) otherwise; the analysis would then be similar to what is presented in the rest of the paper, yet with a lower false negative rate.

Remark I.2. We expect the RT-qPCR result to correspond to the sample with the highest viral load, up to a dilution-induced drift log2(N), under the model hypothesis of Sec I (cf. Fig 1). Indeed, since the viral concentration in randomly selected infected individuals spans several order of magnitudes, we expect that

log2(1Ni=1,,jCi)log2(maxi=1,,jCi)-log2(N), (7)

for j positive samples with concentration Cj diluted in a pool of N. In contrast with [53], we find, based Eq 7, that the measured value of the pooled sample viral concentration cannot be used to estimate the number of infected individual within the pool. However, we point out that the RT-qPCR viral load measure could be used to improve efficiency and cross testing of smart pooling type diagnostic methods, which are beyond the scope of this paper. We plan to investigate this aspect in future work.

In order to determine the statistics of the measured cycle Y(N) in a group test of N individuals, we need a distribution for the value of Cj, the viral distribution of infected individuals in the population; this is the objective of the next section.

I.2 Statistical analysis of the population-level viral load

In this section, we model the Ct distributions extracted from a set of clinical datasets.

I.2.1 Clinical datasets

Here we considered four studies in our analysis:

  1. The ImpactSaliva dataset [54] providing raw Ct measures from saliva samples (Yale University, USA) with the N1 gene as a target. A set of raw Ct data from N = 180 individuals is provided, including 45 Cts values beyond the positivity threshold set at Ct = 38.

  2. A dataset constructed on a histogram from Lennon et al. [55] based on 2179 nasopharyngeal samples from residents and staff members of nursing homes during the April to May 2020 period; the N1 gene was used as a target (Massachusetts,USA). Up to date and to the best of our knowledge, Lennon et al. [55] is the largest to study date to present separated Ct histograms between symptomatic (N = 739) and asymptomatic (N = 1440) individuals at the time of testing.

  3. A dataset constructed on a histogram from Jones et al. [56] based on N = 3598 nasopharyngeal swabs samples from individuals with various age at La Charité hospital (Berlin, Germany) during the March to early April 2020 period; two target genes (E gene and RdRp) are mentionned in [1]).

  4. A dataset constructed on a histogram from Cabrera et al. [28] based on N = 852 infected nasopharyngeal swabs samples from residents and staff members of nursing homes in Galicia (Spain) during the March to May 2020 period; the Open Reading Frame 1b (ORF1b) was used as a target gene.

As the precise distribution of data points within each bars of the histogram are unknown in the datasets 2,3 and 4, we assume that points were distributed uniformly in their histogram bar class. We have verified the robustness of our fit estimator for several distribution of points which lead to consistent values for our model parameters (see Section I.1. in S1 Text).

I.2.2 Censored Gaussian model fits

As we expect the measure error ϵ2 of the RT-qPCR to be small with respect to the width of the histogram classes, we set ρ = 0 in the rest of the section.

Mixture model. The shape of the histograms in Fig 2 suggest that the law of the viral load should be distributed according to a mixture of three or more Gaussian distributions. We performed fitting using standard Gaussian distributions models which we refer to as the naive model.

Fig 2.

Fig 2

(A) Representation of the density for the classical mixing Gaussian model (dashed lines) and the partially censored model (solid lines) each composed as a sum of 3 components for the Gaussian model (orange/green/red dashed lines) and the partially censored model (orange/green/red solid lines); (purple vertical line) location of the threshold datt ≈ 35.6. Data based on the histogram presented in [56]. (B) Focus on the false negative region, with the estimated false negative probability in the partially censored model (solid line) due to the defect of detection above the threshold datt (red color filled area). (C) Mixing Gaussian model on the ImpactSaliva dataset presented in [54]. (D-E) Mixing Gaussian model on the ImpactSaliva dataset presented in [55] for (D) asymptomatic and (E) symptomatic individuals at the moment of the test. Raw data available in S1 Data.

However, as the dataset histograms usually exhibit a sudden drop in the number of detected cases around a Ct value denoted datt (e.g. datt = 35.6 for the Jones et al. dataset), that we refer to as the attenuation threshold. We explain these drops by a loss of sensibility of the measure for samples with Ct value between datt and dcens (the limit of detection). We model this loss of sensibility by a fixed probability q of detection above level datt.

Censored models. To model a partial lack of detection of low viral load (Ct higher than a threshold datt), we introduce the partially censored Gaussian variable as a building block for the representation of the density of the viral load in infected patients.

We assume that if the sampled Ct value is lower than the attenuation threshold datt; if the value is higher than datt, the sample will be detected with probability q, and its measure will be registered. Otherwise, it will be discarded as a (false) negative, with probability 1 − q. The parameter q represents the probability of detection of a viral load that falls below the detection threshold of some RT-qPCR measures.

The assumption that the probability of detection only depends on whether the Ct value is higher than a fixed threshold is of course an important simplification, as one would expect lower viral loads to be more difficult to detect than higher ones. However, the simplicity of this model allows us to study it as a three parameters statistical model, and to construct simple estimators for these parameters. Additionally, it fits rather well the available data, and fitting a more complicate censorship model would require a lot of measures of Ct values close to the detection threshold datt. The quantity datt is fixed based on the observed distribution of Ct values in datasets.

To avoid the problem of modelling of the partial censorship, a solution that we implement here as a comparison tool, is to forget the values after the threshold and we perform the fit on the completely censored model (i.e. with q = 0) to the remaining data. See the Method section for further discussion.

Application to the Jones et al. dataset [56]. We apply here the statistical analysis described in the previous section to simulated data based on the values for the viral load distribution found in [56] with a mixture model and a censoring threshold datt ≈ 35.6 (so the two rightmost bars in the histogram of Fig 2, that appear much smaller than the nearby values, are supposed to be censored). It is reasonable to assume that the censoring threshold has the same value for each sub-population, as it depends on the test methodology rather than on the tested individuals. In Fig 2, we represent the histogram with the density for the mixture.

We observe that the separation in sub-populations and the resulting densities are very close to the ones obtained in the naive classical Gaussian mixture model, constructed without taking into account the detection threshold. The principal difference between the naive and censored models consists, for the later, in a larger variance that extends above the threshold. To a lesser extent, the sub-population with a median concentration can also exceed the threshold. It is worth mentioning that as expected, the probability of detection below the threshold value is sensibly the same for all three clusters (around 20%).

As a result, using the computed estimates (see Table A in S1 Text) and the model, we can calculate a theoretical false negative rate, see Eq C in S1 Text: in this case, the value is approximately 3.8% (represented by the red area on the Fig 2B); it mostly belongs to the third cluster. Such false-negative estimate remains to be treated with caution.

To validate the censored model, we can verify that if one (i) erases the data to the right of a certain value and (ii) uses the totally censored model on the remaining data, a similar estimate should be obtained for the parameters. We refer to Fig G in S1 Text for the density obtained using the censored mixture estimation with datt ≈ 35.6, 34.4 and 33.2 (removing the first two, the third, then the fourth rightmost bars in the histogram). We observe that the first and second components are globally unchanged. The mean and standard deviation of the last component are almost the same for datt ≈ 34.4 and datt ≈ 35.6 (see Table B in S1 Text); only the proportions naturally decrease with the threshold. On the other hand, the mean moves slightly to the left for datt ≈ 33.2; this is due to the fact that we loose the information of the largest bars of this component. It might also be caused by our ignorance of the exact distribution of Ct values within classes of the histogram (we recall that we assume that it is a uniform distribution).

Note that if we were to set the threshold at datt ≈ 34.4 as threshold for the partially censored model without erasing data, the optimization procedure nlm would not converge. This is further indication that a detection drop happens in the neighbourhood of 35.6.

Application to the other datasets. We applied a similar statistical analysis to the other datasets listed in SecI.2.1 and consistently found either two to three sub-populations using our algorithm. The estimation obtained for the Gaussian fit of the Ct distribution they obtained is given in Table A in S1 Text.

In datasets of smaller size [22, 57], the statistical resolution does not allow us to distinguish between several sub-populations; we rather found that the distribution of Ct corresponds to a single Gaussian with standard deviation σ in the 5 to 6 range.

I.2.3 Interpretation of the Gaussian mixture model

We propose an interpretation of the observed Gaussian decomposition of the log-viral load of individuals based on the viral load temporal evolution within individuals.

A first model for the individual viral load evolution. Following [58], we consider a piece-wise linear model for the temporal evolution of the mean detected Ct as a function of time after infection t > 0,

E[Ct(t)]={tto(incubation),dcens-ΔCmaxtp-to(t-to)to<ttp,(growth),dcens-ΔCmax+ΔCmaxtr-tp(t-tp)tp<ttr(decay),t>tf,(recovered) (8)

In our first model, we only consider asymptomatic individuals, for which we set Δtdecay(asymp)=tr(asymp)-tp7 days [58]. All other parameters are indicated in Table 1.

Table 1. Table of values used in Fig 3 for our viral load evolution models.
Symbol Meaning Date/Value
ΔCmin Ct difference with threshold of the long time plateau 2
ΔCmax Ct difference with threshold of the peak plateau 13
t0 Incubation time Day 2
tp,tp1,tp,2 Peak time Day 5, 5, 7
tr Model 1—Decay time Day 11
tr(symp) Model 2—Decay time (symptomatic) Day 11
tr(asymp) Model 2—Decay time (asymptomatic) Day 14
tf(symp) Model 2—End of infection time (symptomatic) Day 16
tf(asymp) Model 2—End of infection time (asymptomatic) Day 20
σ Noise on the measured Ct 2
G(t) Distribution of testing times after infection N.A.
τ Decay time in the rate of new infections 0 or 10 days

Distribution of testing times. For individuals that remain asymptomatic throughout the infection, we expect the testing time distribution to be random, which means the viral load distribution should depend strongly on the rate of new infections.

We denote by G(t) the distribution of testing times t after infection; for asymptomatic individuals, we consider an exponential distribution G(t) ∝ exp(−t/τ); one may assume that τ > 0 can be approximated by the observed decrease rate of the incidence.

We simulate the viral load measured in a population of N = 4, 000 infected individuals samples at random times ti, 1 ≤ iN, distributed according to a model distribution of testing times. We further assumed that the measured viral load law follows a Gaussian distribution Ci=Ndcens(E[Ct(ti)],σ), 1 ≤ iN, where E[Ct(ti)] is given by Eq 8, and Ndcens is a Gaussian variable conditioned to values inferior to a model limit of detection threshold (set to dcens = 35); σ is a constant noise amplitude that models an intrinsic dispersion of the viral load among infected individuals.

Considering the model viral load evolution of Eq 8 (referred to as Model 1 in Fig 3A), exponential G(t) distribution will fail to account the two Gaussian peaks distribution observed in the asymptomatic dataset reported in [55], see Fig 3B.

Fig 3.

Fig 3

(A) Model 1 for the evolution of the viral load post-infection. (B) Modelled distribution in the infection age at the moment of the test for (red) symptomatic individuals as a Gamma function; (blue-cyan) fully asymptomatic (i.e. throughout the infection) individuals either as (blue) a constant if the new infections rate is a constant with time or (cyan) as an exponential if the rate of new infections decays exponentially with time (characteristic decay time τ = 10 days). (C) In the Model 1 context, the distribution of the viral load in asymptomatic individuals is relatively uniform. (D) Model 2 for the evolution of the viral load post-infection distinguishing between symptomatic and asymptomatic (combining [58] and [59]). Parameters estimate are provided in Table H in S1 Text. (E) In the model 2 context, the distribution of the viral load in asymptomatic individuals shows 2 peaks at high and low viral loads. (F) The distribution of the viral load in symptomatic individuals is less bimodal than the observed asymptomatic distribution.

Second model for the individual viral load evolution. To account for the observed peaks in the viral load distribution, we propose the existence of two flat Ct phase that would correspond to the behaviour of the viral (i) near the peak of viral excretion (ii) during a relatively long late infectious phase. Our piece-wise model then reads:

E[Ct(t)]={tto(incubation),dcens-ΔCmaxtp1-to(t-to)to<ttp1,(growth),dcens-ΔCmaxtp1<ttp2,(peak),dcens-ΔCmax+ΔCmaxtr-tp2(t-tp2)tp2<ttr(decay),dcens-ΔCmintr<ttf,(late),t>tf,(recovered) (9)

Following [58], we consider different estimates for the decay durations between symptomatic (Δtdecay(symp)=tr(symp-tp10 days) and asymptomatic (Δtdecay(asymp)=tr(asymp)-tp7 days) individuals. We consider the same scaling for the duration of the late infectious phase as the one for the decay time (see parameters in Table 1). The piece-wise model considered in [59] is also nearly flat during the late infectious phase at large time; yet in contrast to Eq 9, [59] considers an instantaneous change of slope at the viral load peak.

Results. Based on Eq 9, we find that the viral load distribution for asymptomatic individuals displays two peaks at high and low viral loads, see Fig 3E—in agreement with our analysis of the dataset Lennon et al. [55], see Fig 2D. An exponential decrease in the number of new cases favors the proportion of individual at high Cts, see Fig 3E. The distribution of the viral load in symptomatic individuals is less bimodal than the observed asymptomatic distribution, in agreement with our analysis of the symptomatic dataset from Lennon et al. [55], see Fig 2E.

In [59], a similar results was obtained; a decrease in the incidence rate is shown to be associated to an increase in the proportion of individuals with high Ct value.

Regarding symptomatic individuals, we assume the distribution of testing time to be modelled as a Gamma distribution G(t) = Γα,β(t) with parameters α = 2 and β = 3 day−1, see Fig 3; for simplicity, we consider such distribution to be independent of the epidemic status, although a realistic model could include an additional time delay in getting a test during high incidence phases, [60]. Based on both models Eqs 9 and 8, we observe the relatively equally distributed viral load, see Fig 3F; such distribution is in qualitative agreement with the behaviour observed in Fig 2E.

Multi-Gaussian expression. Here we interpret the observed multi-Gaussian distribution of the viral load in terms of a simple analytical model. In the absence of noise σ = 0, the distribution of viral loads corresponding to Eq 9 reads, for any xdcens

fnn(x)=(tp2tp1G(t)dt)δ(x-ΔCmax)+(trtfG(t)dt)δ(x-ΔCmin) (10)
+[G(t1(x))tp1-t0+G(t2(x))tp-tp2]1(x>Cmin)1(x<Cmax), (11)

where δ is a Dirac delta-function; t1(x) and t2(x) are the two dates such that Ct(t) = x, when applicable. In the presence of a noise of amplitude σ(x), the measured viral load density reads:

f(z)=NCminCmaxdx-dyexp(-y22σ(x)2)fnn(x-y)δ(z-x-y), (12)

with N a normalization constant. We therefore expect to obtain a multi-Gaussian distribution for the distribution of viral loads, with two weights being proportional to the time spent at the peak viral phase and at minimal elimination phase for the smallest and largest Ct means, respectively.

Generality. Our results are robust to reasonable variations of parameters. We expect our interpretation to be robust to a large class of viral load models that exhibit a sharp viral load rise, plateau at a high level and long decay time.

I.3 Estimation of the population-averaged false negative rate induced by pooling

One positive individual within the pool. The distribution of the viral load of a single positive sample within a pool of several negative samples appears as shifted towards higher Ct-values, see Fig 1. A pooled sample returns positive if the average concentration is smaller than datt with probability 1, or if the average concentration is between datt and dcens with probability q; thus using the observation of Sec I.2.2, infection will be detected in a group of N individuals typically if at least one individual in the group has a viral load larger than N2-dcens. Therefore, there is a risk that low viral load samples (that would have been tested positive using individual tests) would no longer be positive in pool tests. Similarly to 2, we express the increased rate of false negative due to pooling as

P(-log2(C)+ϵ2dcens-log2(N))+(1-q)P(datt-log2(N)-log2(C)+ϵ2dcens-log2(N)),

where log2(C) is the viral concentration of the positive individual. For simplicity we neglect the measurement error of the RT-qPCR, i.e. considering that ρ = 0, thus an expression for the increased rate of false negatives reads (1-Φ(dcens(N)))+(1-q)(Φ(dcens(N))-Φ(datt(N)), where

Φ(z)=P[-log2(C)z], (13)

and dcens(N)=dcens-log2(N), datt(N)=datt-log2(N). It is worth noting that a simple upper bound is obtained by setting q = 0, i.e. considering that the test is systematically negative when the Ct value is larger than datt (or in other words, by setting dcens = datt). This is the choice made when using the completely censored model, and the formula we will use in the rest of the article for the false negative probability is 1-Φ(dcens(N)).

Discussion. The naive and censored fitting models predict lead to two different cumulative distribution expressions Φ. This impact the estimation of the relative false-negative risks defined as 1-Φ(dcens(N))/(1-Φ(dcens(N)))

  • For the Watkins et al. [54] dataset, Fig 4A, our estimate of the relative increase in the false negative is consistent with the quantification performed on pools of a single positive samples in pools of 5, 10 and 20.

  • For the Lennon et al. [55] dataset we find that false-negative rate is higher for asymptomatic individuals than for symptomatic ones, see Fig 4, in agreement with a higher proportion of individuals at a low viral load in the former category. In the uncensored model, we make the assumption that the histogram obtained was not subject to any attenuation, while in the censored mode, we consider the sharp drop around Ct = 35.6 are being caused by false negative results.

  • For the Jones et al. dataset [56], we find that, when estimated by the censored model, the false negative risk function Φ(dcens(N)) grows quicker as the pool size increases than in the uncensored model, see Fig 4. This is mainly partly due to the fact that the censored model makes the assumption that dcens ≈ 35.6, whereas in the uncensored model, the assumption made is that dcens ≈ 37.3.

Fig 4.

Fig 4

(A-C) Relative increase in the false negative risk (1-Φ(dcens(N)))/(1-Φ(dcens(1))) in pools of size N including a single infected individual whose viral distribution is estimated using the naive (solid line), partially censored (circle) or fully censored (crossed line) fitting method of the following datasets from (A) Watkins et al. [54], (B) Jones et al. [56] and (B) Lennon et al. [55]. In (A), we superpose the clinical estimation of the risk of false negative provided in [54] (red crosses). Here, in contrast to [26], we do not change the threshold level of positivity compared to the individual test.

Multiple positive individuals within the pool. The case of a pool of N samples that contains k > 1 positive individual is particularly relevant as pooling may be achieved on individuals living in the same household, as in [47], or students sharing the same residence hall, as mentionned in [61]; in these cases, the fact that one individual is infected increases the probability that more individuals in the pool are infected as well. Such correlation has been clinically found to lower the risk of false negative risk, an effect coined hitchhiking in [47]. In Sec II in S1 Text, we provide estimation for the false-negative rate in pools, we expect to depend on the prevalence among the tested individuals.

Choosing a correct statistical model for the distribution of Ct values has a critical impact on the estimation of the false negative risk, but a less critical one on the estimation of the efficiency of screening strategies, as discussed in the next Result section.

II Group testing and epidemic outbreak surveillance

We now consider some applications of group testing to the early detection of an epidemic outbreak within a community (with a total number of individuals denoted A) that is interconnected and reasonably closed to the outside (e.g. schools, nursing homes, detention centres).

Here, we focus on the false negative risk. False positives are also a concern in low prevalence settings whereby positive predictive value might be low. However, positives appear very rare in RT-qPCR tests—with an estimated higher bound at 0.01% [62].

II.1 Risk mitigation from a single pre-symptomatic individual

We first consider the impact of group testing strategy, consisting in k group test with pools of N individuals, on the early time of the outbreak t ≪ λ−1. With a unique infected individual in the population, the detection probability reads

P[+|ktests]=kNΦ0(dcens(N))/A,withkNA, (14)

where Φ0(dcens(N)) is defined according to 13, with the difference that the assumed viral load of the patient 0, corresponding to that measured at early times, may need not be equal to the distribution estimated in 21 based on clinical data. For simplicity, we will assume in the following that Φ0 is the cumulative distribution of a log-normal viral load distribution logN(μ0, σ0) of mean μ0 and variance σ0.

We first consider of a patient 0 with a weak viral load (μ0 ≈ 30), see Fig 5A. Such low viral load can model the case of a presymptomatic individual, e.g. with a testing time distribution G(t) distributed in the t = t0 to t = tp − 2 days. In Fig 5, we represent the evolution of the probability to detect the patient 0 as a function of the total number of sampled individuals in a population of size A = 120. We observe that if μ0 is close enough to dcens, i.e. if the viral load of the patient 0 is close to being undetectable, then there will exist an optimal size for the pools. When N becomes too large the risk of false negative overcomes the potential benefits of testing larger portions of the community (see Fig 5A). In contrast, if the viral load of patient 0 is slightly higher, the detection probability becomes a monotonic function of the pool size N, indicating that larger pools are always beneficial. Additionally, if using multiple tests increases the detection probability when the viral load is close to the detection threshold, using multiple tests has a smaller impact when the viral load gets easier to detect.

Fig 5.

Fig 5

Detection probability within a community of 120 as a function of the total number of sampled individuals M = k × N, where k is the total number of tests used and N the number of samples pooled together in a test (A) Case of a single patient 0 with low viral load; k = 5 (red dotted line); k = 4 (orange line with arrow), k = 3 (purple line with circles); k = 2 (dashed green line); k = 1 (solid blue line) for several values in the parameters describing the viral load of the patient 0 at the onset of contagiosity, expressed in terms of a normal distribution in Ct (the number of RT-qPCR amplification cycles) with a standard deviation σ0 and a mean μ0 and a threshold at a value denoted dcens satisfying: μ0 = dcens − 1 (top row), modelling a patient 0 with a very low viral concentration, μ0 = dcens − 3 (middle row), μ0 = dcens − 5 (bottom row); σ0 = 2 (left column); σ0 = 6 (right column). (B) Case of a single patient 0 with a viral load distributed datasets(left) for the three fitting methods used to describe the asymptomatic dataset corresponding to Lennon et al. [55], for k = 1 (blue) and k = 5 (red)and (right) comparing the datasets of Lennon et al. [55] and Jones et al. [56] for the naive fitting method (upper curve k = 5, lower curve k = 1).

Here we first considered the case of a patient 0 with a weak viral load; however, [63] indicates that a large fraction of presymptomatic individuals detected in a nursing homes had relatively high viral loads (with a mean Ct in the μ0 ≈ 20 − 25 range), which tends to indicate that screening methods based on pooling would be even more efficient than suggested in Fig 5A. We then considered the case of single individual with a viral load distributed according to the fits of the datasets Lennon et al. [55] and Jones et al. [56]; in these instances too, we find that no optimum exists for these two viral load distributions and that large pool sizes are always optimal, see Fig 5B. The parameters used in these computations are recalled in Table 2.

Table 2. Table with standard parameter values (with std. the abbreviation of standard deviation).

Symbol Meaning Value
dcens Maximal cycle number Table D-E and Table G-I in S1 Text
μi, σi, pi Viral load (in Ct) distribution fits Table D-E and Table G-I in S1 Text
ρ RT-qPCR measurement error (std.) 0
A Total number in the community 120 or 4000
N Pool size 1–128
Q Threshold number of infected individuals 20
μ0; σ0 Ct-load in patient 0 (mean, std.) 30–35
k Number of tests used per day 1–5

II.2 Risk mitigation from a cluster of infected individuals

We now consider an epidemic outbreak involving a number Q of infected individuals within a community campus of size A = 4, 000 at the day T of a screening program. Our objective is to find an estimate of the pool size and testing cost that will ensure detection of at least one individual within the cluster within a maximal tolerated number of days denoted D. The probability to detect the outbreak using k pooled tests of size N simply reads:

P(1-daydetection)[1-(A-QNk)/(ANk)]Φ0(dcens(N)). (15)

In turn, the probability of detection between Day T and Day T + D then reads:

P(D-daydetection)1-(1-P(1-daydetection))D. (16)

For simplicity, here we considered that Q remains a constant; we do not model the spread of the infection between the day T and the day T + D; such spread would only increase the probability of detection, making Eq 16 a lower bound estimate. A more elaborated model exploring the question of the optimal testing frequency the presence of an epidemic spread is discussed in Sec IV in S1 Text.

For the surveillance program to be efficient, detection should be highly probable within a time window should be smaller than the typical time scale of apparition of first symptoms within the forming cluster.

We now consider a reasonable order of magnitude estimation of detection probability Eq 16 with D = 3 days and Φ0(dcens(N)) estimated using the Jones et al. dataset. With A = 4, 000 and setting a threshold of Q = 20 infected individuals, we find that with k = 16 pools of N = 16 individuals—i.e. a total number of N × k = 256 sampled individuals per day—the one-day detection probability Eq 15 reaches 72%. The 3-day success probability, as defined through Eq 16 then reaches 99%. With k = 4 pools of N = 16 individuals, corresponding to 64 sampled individuals per day, the one-day detection probability of Eq 15 is only at 27%; the 3-day success probability, as defined through Eq 16, reaches 62%; yet the 3-day detection probability reaches 85% if the threshold is raised to Q = 40 infected individuals.

In the next section, we intend to build an estimator for the prevalence based on the currently available results of pools.

III Measuring the prevalence using group testing

We investigate in this section the measure of the prevalence of the disease in a population using a group testing strategy. We first consider the assumption of perfect tests, i.e. with no risks of false negative nor false positive.

III.1 Measuring the prevalence in the absence of false-negatives

We assume that we have n pool tests of size N which allow us to sample, at random, nN individuals within a population. Each of these pools is then tested using the perfect tests. For all in, we write Xi(N)=1 if the ith test is positive (i.e. if and only if at least one of the N individuals in the ith pool is infected), and Xi(N)=0 otherwise. We denote by p the (unknown) proportion of infected individuals in the population. then (Xi(N),in) forms an independent and identically distributed (i.i.d.) sequence of Bernoulli random variables with parameter 1 − (1 − p)N.

Writing X¯n(N)=1nj=1NXj(N), the quantity 1-(1-X¯n(N))1/N is a strongly consistent and asymptotically normal estimator of p. Following the seminal derivation proposed in [64] (reproduced in Sec III in S1 Text), one finds that the confidence interval of asymptotic level 1 − α reads

CI1-α(p)=[1-(1-X¯n(N))1/N±qα(1-X¯n(N))1/N-1X¯n(N)(1-X¯n(N))nN], (17)

where qα is the quantile of order 1 − α/2 of the standard Gaussian random variable.

The precision of the measure of prevalence decays as n−1/2, with a prefactor that depend on the prevalence p and the number N of individual per pool. There exists an optimal choice of N that minimizes the value of this prefactor, largely improving the precision of the measure. Again following [64], one shows that the prefactor in Eq S12 is minimal when the number of mixed samples per pool is equal to:

Nopt(perf)=-clog(1-p)(1-p)Nopt(perf)0.20, (18)

where c = 2 + W(−2e−2) ≈ 1.59 and W is the Lambert W function. Specifically, the size of the pools is optimal when approximately 80% of the tests made on the groups turn positive, in sharp contrast with the diagnostic Dorfman criterion [18].

In a recent guideline [65], the European Center of Disease Control presents a seemingly different expression for the prevalence confidence intervals; however, we point out that these estimators for the confidence intervals width become asymptotically equivalent in the limit of a large number of individuals N.

If we measure the prevalence of the population using group testing, choosing N=Nopt(perf) for the size of the groups, then measuring with a given precision the prevalence will require significantly less tests than if we were to use one test per sampled individual (i.e. if N = 1). On the other hand, using this group testing method increases the total number of individuals needed to be sampled, which also has a cost to be considered. However, one can observe that the bottom of the valley of the (red) functions plotted in Fig 6, that represent the number of tests needed as a function of the size of the pool, is rather wide and flat. There is therefore a large variety of quasi-optimal pool sizes that can be chosen with minimal diminution of the precision in the measure of the prevalence.

Fig 6.

Fig 6

(A,B) Total number of tests (red) and total number of sampled individuals (blue) in order to estimate a prevalence of p = 1% with a ±0.2% precision with 95% confidence interval as a function of the pool size N for the perfect case (dashed lines) with no false negative versus the case with false negatives (solid lines) estimated according to the Lennon et al. asymptomatic dataset [55]. In (A) N ranges from 0 to 25; in (B) N ranges from 0 to 128. The optimal pool size Nopt(perf) is beyond the N-axis limit.

In Fig 6, we consider the case of a prevalence at p = 1%, in which case the the optimal pool size N=Nopt(perf) is larger than 255. Choosing a pool size of N = 20 requires almost a 100% increase in the total number Nn of sampled individuals but more than a 10 fold decrease in the total number of required tests, see Fig 6.

In Table 3, for illustrative purposes, we consider another example of a high prevalence setting (p = 3%) in which expect we expect the optimal pool size (minimizing the number of tests needed) to be in the clinically achievable range Nopt(perf)50 [22]. At optimality, the number of tests needed is divided by 20 as compared to individual testing. In this case, the total number of individuals that need to be sampled is more than doubled compared to individual testing (N = 1), see Fig 6. Choosing instead a pool size of N = 20 requires almost the same number of tests, yet at a cost of almost a 30% increase in the total number Nn of sampled individuals. The same observation holds for different values of the prevalence, see Fig M in S1 Text.

Table 3. Table of the pool size as a function of the number of tests for a prevalence of 3% measured with a precision of 0.2% at a 95% confidence interval, for both perfect tests (with no false negatives, see Sec III.1) and imperfect tests (with false negatives estimated using the Jones et al. dataset; model parameters defined in Table 2); computed using Eqs 17 and 19.

Pool size N Perfect tests Imperfect tests
Number of tests n Sample size nN Number of tests n Sample size nN
1 29100 29100 29464 29464
2 14775 29550 15069 30138
3 10003 30009 10261 30783
5 6191 30955 6411 32055
10 3350 33500 3530 35300
20 1973 39460 2130 42600
30 1561 46830 1716 51480
50 1349 67450 1525 76250
100 1884 188400 2235 223500
200 10378 2075600 13105 2621000

III.2 Measuring the prevalence including false negatives

As discussed in 4, we model the concentration of the pooled sample as the average of the individual sample loads; and we assume that viral concentration becomes undetectable below a given threshold. Therefore, creating groups has the effect of increasing the false negative rate, which has to be quantified. We then use this estimation to un-bias the estimator of the prevalence in the overall population based on group testing, and study its impact on the optimal choice of group sizes.

Assuming a false negative rate of 1-Φ(dcens(N)) in pool testing with groups of size N, we observe that 1-(1-X¯n(N))1/N (as defined using the notation of Section III) is a consistent estimator of pΦ(dcens(N)). As a result, the confidence interval constructed for the prevalence p now reads

CI1-α(p)=[1-(1-X¯n(N))1/NΦ(dcens(N))±qαn(1-X¯n(N))1/N-1X¯n(N)(1-X¯n(N))NΦ(dcens(N))]. (19)

For the numerical applications presented in Fig 6 and Table 3, we consider a viral load C that is distributed according to 21 using the Jones et al. parameter fits. As expected, due to false negatives, we find that the number of tests needed to reach a given precision on the prevalence is increased; however we this increase is moderate.

In particular, the optimal pool size value, Nopt(imper), that minimizes the number of tests needed to reach a given precision level, is close to the value Nopt(perf), defined in 18.

Similarly, one can observe that using a different distribution with similar mean and variance for −log2 C as 21 would lead to moderate changes of the values estimated in Table 3. While modelling of the viral load of an infected individual is crucial to un-bias the estimator of the prevalence via group testing, the practical implementation of such group testing strategy, i.e. the choice of the group size N and the number n of tests to use, is relatively independent of the precise statistical properties of the viral load distribution. We therefore obtained similar results as for the optimal pool size for the prevalence measurement using the viral distribution extracted from the Lennon et al. and ImpactSaliva datasets.

Based on 19, in Box 2 we propose an iterative method to estimate p, which, during a survey, allows for on-the-fly adaptations of the pool size.

Box 2: A protocol of prevalence determination

We propose the following procedure for the measure of prevalence via group testing:

  1. Start from an a priori estimate for the prevalence (p^0).

  2. Based on the value of p^0, estimate the number N of individuals in the pool that minimizes the total number of tests needed to achieve the estimation of the prevalence p at the targeted precision and confidence interval.

  3. Construct a number of n pools containing each N individuals selected at random in the general population, with n the number of tests available for the measure.

  4. Count the number of positive tests and compute the average X¯n(N).

  5. An improved estimate of the prevalence then reads: p^1=1-(1-X¯n(N))1/N (cf Lemma III.1).

Note that this method can easily be adapted into a Bayesian algorithm, with the number N of individuals tested modified at each iteration of the procedure.

III.3 Group testing and Bayesian inference of the prevalence in sub-categories of the population

The viral prevalence may vary significantly among specific categories within the overall population. In particular, a prevalence reaching 5% was measured among the health care workers population in a hospital [66], which we expect to be significantly higher than the estimate prevalence within the general population.

Here we show that we do not specifically need pool samples from individuals from homogeneous categories in order to recover the distribution of prevalence within these categories.

The protocol described in Box 2 can be adapted to study different prevalences in specific sub-populations, provided that the number of individuals of each subpopulation is known for every grouped sample. In Fig N in S1 Text, we evaluate, as function of the number of tests, the credibility intervals on the prevalence within two categories of the population: one at p1 = 5% representing 20% of the total population (a value inspired by [66]), the other being at p2 = 0.5% (a value inspired by [67]). More information on this adaptative protocol can be found in the S1 Text.

Remark III.1. Note that once a difference in prevalence is noted from the epidemiological study of the general population, testing can be adapted to construct groups containing only members of one subpopulation to attain similar levels of precision for the prevalence of the sub-populations. The prevalence in the general population can then be recovered by averaging the estimators of the sub-populations. The advantage of these adaptative settings is that the existence of a difference of prevalence in populations can be tested before deployment of resources needed to measure them specifically.

Discussion

We consider the effect of sample dilution in RT-qPCR grouped tests and we propose a model to describe the risk of false negatives as a function of the pool size. We present a procedure to analyse experimental datasets for the viral load of patients. Inspired by the clinical study [56], we expect the statistics of the number of amplification cycles to be well described as a mixture of 2 to 3 Gaussian variables censored at the RT-qPCR sensibility limit. We interpret this decomposition in terms of a simple model for the evolution in the viral load from samples of infected individuals.

We then considered the interest of group testing methods for large-scale screenings in communities. We have used a minimal set of parameters in order for analytical calculations to be tractable. Including more parameters (e.g. considering a time-dependent infection rate or viral load for patients after their infection, graph of relationship within the community) would be needed in order to obtain conclusive results to be used as healthcare guidelines. In this direction, based on stochastic simulations encompassing a large set of parameters, [68] also concludes on the efficiency of group testing in preventing epidemic outbreaks in health care structures.

Several recent papers indicate RT-qPCR tests based on saliva samples are highly-sensitive [6974]. Saliva collection appears well accepted [75] while decreasing the cost and risks of sample collection. In this context, saliva sample pooling, which demonstrates reduced loss of sensitivity even in large pool sizes [54] and has been massively used in the State University of New York, appears as a promising solution for regular large-scale surveillance programs.

Group testing could provide the means for regular and massive screenings allowing the early detection of asymptomatic and pre-symptomatic individuals—a particularly crucial task to succeed in the containment of the epidemic [14, 58, 76]. We expect group testing for SARS-CoV2 to remain relevant throughout the upcoming vaccination era, in particular as a tool to track the evolution of viral variants.

Method

Here we clarify the method used to fit the viral load distribution datasets. We define the partially censored Gaussian model, denoted by CNdatt(μ,σ,q), with μ and σ the mean and standard deviation of the Gaussian variable before censorship and q the detection probability above the threshold. If we denote by X the random variable, fμ,σ (resp. Fμ,σ) the density (resp. the cumulative distribution function) of a Gaussian law N(μ,σ) then the density of X is defined for every xR by:

fX(x)=fμ,σ(x)q+(1-q)Fμ,σ(datt)×{1ifxdatt,qotherwise. (20)

We also define the fully censored Gaussian model, written CNdatt(μ,σ)=CNdatt(μ,σ,0), such that the fitting density is defined for every xR by

fX(x)=fμ,σ(x)Fμ,σ(datt)1{xdatt}, (21)

where 1{xdatt} is the indicator function equal to 1 if xdatt, and 0 otherwise. This analysis allows to test several values of datt, the fact that estimates of μ and σ remain stable for different values of datt justifies the validity of the censored Gaussian mixture model.

Remark III.2. In the absence of censorship (i.e. in the limits q → 1 or datt → + ∞), we check that Eq 20 converges to a Gaussian density distribution.

Due to the presence of the cumulative distribution function of a Gaussian law in the denominator in the normalization constant, it is not possible to obtain analytical forms of the parameter estimators. Nevertheless, we can estimate the parameters using an optimization algorithm like the R function nlm (available in [77] and in the S1 Code) which implements a Newton-type algorithm. In S1 Text, we provide the proof of a theorem that guarantees the quality of our maximal likelihood estimators.

Supporting information

S1 Text. Supporting analysis and proofs document.

(PDF)

S1 Code. Codes used in the paper.

(ZIP)

S1 Data. Excel sheet with numerical values that were used to generate viral load histograms.

(XLSX)

Acknowledgments

We wish to thank the members of the GROUPOOL & MODCOV19 initiatives, particularly Françoise Praz and Florence Debarre who gave us numerous helpful comments. We thank Philippe Hupé for his critical reading; Marie-Claude Potier, Marc Sanson, Agnès Delaunay-Moisan, Jean-Yves Thuret, Catherine Hill and members of the FranceTest collective for insightful discussions on RT-qPCR tests, the interest of saliva samples, and epidemiological surveillance.

Data Availability

All relevant data are within the manuscript and its Supporting information files.

Funding Statement

The authors received no specific funding for this work.

References

  • 1. Wölfel R, Corman VM, Guggemos W, Seilmaier M, Zange S, Müller MA, et al. Virological assessment of hospitalized patients with COVID-2019. Nature. 2020;581(7809):465–469. 10.1038/s41586-020-2196-x [DOI] [PubMed] [Google Scholar]
  • 2.World Health Organization. Laboratory testing for 2019 novel coronavirus (2019-nCoV) in suspected human cases. WHO—Interim guidance. 2020;2019(January):1–7.
  • 3. Corman VM, Landt O, Kaiser M, Molenkamp R, Meijer A, Chu DKW, et al. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020;25(3):1–8. 10.2807/1560-7917.ES.2020.25.3.2000045 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Gollier C, Gossner O. Group Testing against Covid-19. Covid Economics. 2020; p. 32–42.
  • 5.Pouwels KB, Roope LSJ, Barnett A, Hunter DJ, Nolan TM, Clarke PM. Group Testing for SARS-CoV-2: Forward to the Past? PharmacoEconomics—Open. 2020; p. 2–5. [DOI] [PMC free article] [PubMed]
  • 6. Holt E. Slovakia to test all adults for SARS-CoV-2. The Lancet. 2020;396(10260):1386–1387. 10.1016/S0140-6736(20)32261-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Health Ministry of the Grand Duchy of Luxembourg. Large Scale Testing; 2020. Available from: https://covid19.public.lu/dam-assets/covid-19/documents/strat{é}gie-lst/strategie-LST-Fr.pdf.
  • 8. Byambasuren O, Cardona M, Bell K, Clark J, McLaws ML, Glasziou P. Estimating the extent of asymptomatic COVID-19 and its potential for community transmission: Systematic review and meta-analysis. Official Journal of the Association of Medical Microbiology and Infectious Disease Canada. 2020;COVID-19:Accepted versio. 10.3138/jammi-2020-0030 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Johansson MA, Quandelacy TM, Kada S, Prasad PV, Steele M, Brooks JT, et al. SARS-CoV-2 Transmission From People Without COVID-19 Symptoms. JAMA Network Open. 2021;4(1):e2035057. 10.1001/jamanetworkopen.2020.35057 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Buitrago-Garcia D, Egli-Gany D, Counotte MJ, Hossmann S, Imeri H, Ipekci AM, et al. Occurrence and transmission potential of asymptomatic and presymptomatic SARS-CoV-2 infections: A living systematic review and meta-analysis. PLOS Medicine. 2020;17(9):e1003346. 10.1371/journal.pmed.1003346 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Mizumoto K, Kagaya K, Zarebski A, Chowell G. Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship, Yokohama, Japan, 2020. Eurosurveillance. 2020;25(10):1–5. 10.2807/1560-7917.ES.2020.25.10.2000180 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Bi Q, Wu Y, Mei S, Ye C, Zou X, Zhang Z, et al. Epidemiology and transmission of COVID-19 in 391 cases and 1286 of their close contacts in Shenzhen, China: a retrospective cohort study. The Lancet Infectious Diseases. 2020;3099(20):1–9. 10.1016/S1473-3099(20)30287-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Bai Y, Yao L, Wei T, Tian F, Jin DY, Chen L, et al. Presumed Asymptomatic Carrier Transmission of COVID-19. JAMA. 2020;323(14):1406. 10.1001/jama.2020.2565 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Ferretti L, Wymant C, Kendall M, Zhao L, Nurtay A, Abeler-Dörner L, et al. Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing. Science. 2020;368(6491):6936. 10.1126/science.abb6936 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Lavezzo E, Franchin E, Ciavarella C, Cuomo-Dannenburg G, Barzon L, Vecchio CD, et al. Suppression of COVID-19 outbreak in the municipality of Vo’, Italy. medRxiv. 2020. [Google Scholar]
  • 16. Cao S, Gan Y, Wang C, Bachmann M, Wei S, Gong J, et al. Post-lockdown SARS-CoV-2 nucleic acid screening in nearly ten million residents of Wuhan, China. Nature Communications. 2020;11(1). 10.1038/s41467-020-19802-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Xing Y, Wong GWK, Ni W, Hu X, Xing Q. Rapid Response to an Outbreak in Qingdao, China. New England Journal of Medicine. 2020;383(23):e129. 10.1056/NEJMc2032361 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Dorfman R. The Detection of Defective Members of Large Populations. The Annals of Mathematical Statistics. 1943;14(4):436–440. 10.1214/aoms/1177731363 [DOI] [Google Scholar]
  • 19. Aldridge M, Johnson O, Scarlett J. Group Testing: An Information Theory Perspective. Foundations and Trends® in Communications and Information Theory. 2019;15(3-4):196–392. 10.1561/0100000099 [DOI] [Google Scholar]
  • 20. Hogan CA, Sahoo MK, Pinsky BA. Sample Pooling as a Strategy to Detect Community Transmission of SARS-CoV-2. JAMA. 2020;323(19):1967. 10.1001/jama.2020.5445 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Lohse S, Pfuhl T, Berkó-Göttel B, Rissland J, Geißler T, Gärtner B, et al. Pooling of samples for testing for SARS-CoV-2 in asymptomatic people. The Lancet Infectious Diseases. 2020;20(11):1231–1232. 10.1016/S1473-3099(20)30362-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Yelin I, Aharony N, Tamar ES, Argoetti A, Messer E, Berenbaum D, et al. Evaluation of COVID-19 RT-qPCR Test in Multi sample Pools. Clinical Infectious Diseases. 2020;71(16):2073–2078. 10.1093/cid/ciaa531 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Ben-Ami R, Klochendler A, Seidel M, Sido T, Gurel-Gurevich O, Yassour M, et al. Large-scale implementation of pooled RNA-extraction and RT-PCR for SARS-CoV-2 detection. medRxiv. 2020. 10.1016/j.cmi.2020.06.009 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Shental N, Levy S, Wuvshet V, Skorniakov S, Shalem B, Ottolenghi A, et al. Efficient high-throughput SARS-CoV-2 testing to detect asymptomatic carriers. Science Advances. 2020;6(37):eabc5961. 10.1126/sciadv.abc5961 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Schmidt M, Hoehl S, Berger A, Zeichhardt H, Hourfar K, Ciesek S, et al. FACT- Frankfurt adjusted COVID-19 testing- a novel method enables high-throughput SARS-CoV-2 screening without loss of sensitivity. medRxiv. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Mutesa L, Ndishimye P, Butera Y, Souopgui J, Uwineza A, Rutayisire R, et al. A pooled testing strategy for identifying SARS-CoV-2 at low prevalence. Nature. 2020;589(7841):276–280. 10.1038/s41586-020-2885-5 [DOI] [PubMed] [Google Scholar]
  • 27. Torres I, Albert E, Navarro D. Pooling of nasopharyngeal swab specimens for SARS-CoV-2 detection by RT-PCR. Journal of Medical Virology. 2020;92(11):2306–2307. 10.1002/jmv.25971 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Alvargonzalez JJC, Cao SR, Castro SP, Lamas LM, Calvo OC, Piñon JT, et al. Pooling for SARS-CoV-2 control in care institutions. BMC Infectious Diseases. 2020;20(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Wacharapluesadee S, Kaewpom T, Ampoot W, Ghai S, Khamhang W, Worachotsueptrakun K, et al. Evaluating the efficiency of specimen pooling for PCR-based detection of COVID-19. Journal of Medical Virology. 2020;92(10):2193–2199. 10.1002/jmv.26005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Martin A, Storto A, Andre B, Mallory A, Dangla R, Visseaux B, et al. High-sensitivity COVID-19 group testing by digital PCR; 2020. [DOI] [PMC free article] [PubMed]
  • 31. Khodare A, Padhi A, Gupta E, Agarwal R, Dubey S, Sarin SK. Optimal size of Sample Pooling for RNA Pool Testing: An Avant-Garde for Scaling up Severe Acute Respiratory Syndrome Coronavirus-2 Testing. Indian Journal of Medical Microbiology. 2020;38(1):18–23. 10.4103/ijmm.IJMM_20_260 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Wernike K, Keller M, Conraths FJ, Mettenleiter TC, Groschup MH, Beer M. Pitfalls in SARS-CoV-2 PCR diagnostics. Transboundary and Emerging Diseases. 2020. 10.1111/tbed.13684 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Gan Y, Du L, Faleti OD, Huang J, Xiao G, Lyu X. Sample Pooling as a Strategy of SARS-COV-2 Nucleic Acid Screening Increases the False-negative Rate; 2020. Available from: 10.1101/2020.05.18.20106138. [DOI]
  • 34.Griesemer SB, Slyke GV, George KS. Assessment of sample pooling for clinical SARS-CoV-2 testing; 2020. Available from: 10.1101/2020.05.26.118133. [DOI] [PMC free article] [PubMed]
  • 35. Chong BSW, Tran T, Druce J, Ballard SA, Simpson JA, Catton M. Sample pooling is a viable strategy for SARS-CoV-2 detection in low-prevalence settings. Pathology. 2020;52(7):796–800. 10.1016/j.pathol.2020.09.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Christoff AP, Cruz GNF, Sereia AFR, Boberg DR, de Bastiani DC, Yamanaka LE, et al. Swab pooling for large-scale RT-qPCR screening of SARS-CoV-2. medRxiv. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Hanel R, Thurner S. Boosting test-efficiency by pooled testing for SARS-CoV-2—Formula for optimal pool size. PLOS ONE. 2020;15(11):e0240652. 10.1371/journal.pone.0240652 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Deckert A, Bärnighausen T, Kyei NN. Simulation of pooled-sample analysis strategies for COVID-19 mass testing. Bulletin of the World Health Organization. 2020;98(9):590–598. 10.2471/BLT.20.257188 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Sinnott-Armstrong N, Klein DL, Hickey B. Evaluation of Group Testing for SARS-CoV-2 RNA. medRxiv. 2020. [Google Scholar]
  • 40. Verdun CM, Fuchs T, Harar P, Elbrächter D, Fischer DS, Berner J, et al. Group testing for SARS-CoV-2 allows for up to 10-fold efficiency increase across realistic scenarios and testing strategies. medRxiv. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Narayanan KR, Frost I, Heidarzadeh A, Tseng KK, Banerjee S, John J, et al. Pooling RT-PCR or NGS samples has the potential to cost-effectively generate estimates of COVID-19 prevalence in resource limited environments. medRxiv. 2020. [Google Scholar]
  • 42. Bilder CR, Iwen PC, Abdalhamid B, Tebbs JM, McMahan CS. Tests in short supply? Try group testing. Significance. 2020;17(3):15–16. 10.1111/1740-9713.01399 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Furon T. The illusion of group testing. [Research Report] RR-9164, Inria Rennes Bretagne Atlantique. 2018; p. 1–19.
  • 44. Barillot E, Lacroix B, Cohen D. Theoretical analysis of library screening using a N-dimensional pooling strategy. Nucleic Acids Research. 1991;19(22):6241–6247. 10.1093/nar/19.22.6241 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Thierry-Mieg N. Pooling in systems biology becomes smart. Nature Methods. 2006;3(3):161–162. 10.1038/nmeth0306-161 [DOI] [PubMed] [Google Scholar]
  • 46.Centers for Disease Control and Prevention. Interim Guidance for Use of Pooling Procedures in SARS-CoV-2 Diagnostic, Screening, and Surveillance Testing; 2020. Available from: https://www.cdc.gov/coronavirus/2019-ncov/lab/pooling-procedures.html.
  • 47. Barak N, Ben-Ami R, Sido T, Perri A, Shtoyer A, Rivkin M, et al. Lessons from applied large-scale pooling of 133, 816 SARS-CoV-2 RT-PCR tests. medRxiv. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Denny TN, Andrews L, Bonsignori M, Cavanaugh K, Datto MB, Deckard A, et al. Implementation of a Pooled Surveillance Testing Program for Asymptomatic SARS-CoV-2 Infections on a College Campus—Duke University, Durham, North Carolina, August 2–October 11, 2020. MMWR Morbidity and Mortality Weekly Report. 2020;69(46):1743–1747. 10.15585/mmwr.mm6946e1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Mulder JT. How Upstate Medical University used spit and grit to make’game-changer’ coronavirus test; 2021. Available from: https://tinyurl.com/y6yk9vqr.
  • 50. Fogarty A, Joseph A, Shaw D. Pooled saliva samples for COVID-19 surveillance programme. The Lancet Respiratory Medicine. 2020;8(11):1078–1080. 10.1016/S2213-2600(20)30444-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Forootan A, Sjöback R, Björkman J, Sjögreen B, Linz L, Kubista M. Methods to determine limit of detection and limit of quantification in quantitative real-time PCR (qPCR). Biomolecular Detection and Quantification. 2017;12:1–6. 10.1016/j.bdq.2017.04.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Ruiz-Villalba A, van Pelt-Verkuil E, Gunst QD, Ruijter JM, van den Hoff MJ. Amplification of nonspecific products in quantitative polymerase chain reactions (qPCR). Biomolecular Detection and Quantification. 2017;14:7–18. 10.1016/j.bdq.2017.10.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Cleary B, Hay JA, Blumenstiel B, Harden M, Cipicchio M, Bezney J, et al. Using viral load and epidemic dynamics to optimize pooled testing in resource constrained settings. medRxiv. 2020. 10.1101/2020.05.01.20086801 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Watkins AE, Fenichel EP, Weinberger DM, Vogels CBF, Brackney DE, Casanovas-Massana A, et al. Pooling saliva to increase SARS-CoV-2 testing capacity. medRxiv. 2020. 10.1101/2020.09.02.20183830 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Lennon NJ, Bhattacharyya RP, Mina MJ, Rehm HL, Hung DT, Smole S, et al. Comparison of viral levels in individuals with or without symptoms at time of COVID-19 testing among 32, 480 residents and staff of nursing homes and assisted living facilities in Massachusetts. medRxiv. 2020. [Google Scholar]
  • 56.Jones TC, Mühlemann B, Talitha V, Marta Z, Hofmann J, Stein A, et al. An analysis of SARS-CoV-2 viral load by patient age. Preprint Charité Hospital. 2020.
  • 57. Liu Y, Yan LM, Wan L, Xiang TX, Le A, Liu JM, et al. Viral dynamics in mild and severe cases of COVID-19. The Lancet Infectious Diseases. 2020;20(6):656–657. 10.1016/S1473-3099(20)30232-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Kissler SM, Tedijanto C, Goldstein E, Grad YH, Lipsitch M. Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period. Science. 2020;368(6493):860–868. 10.1126/science.abb5793 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Hay JA, Kennedy-Shaffer L, Kanjilal S, Lipsitch M, Mina MJ. Estimating epidemiologic dynamics from single cross-sectional viral load distributions. medRxiv. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Pullano G, Domenico LD, Sabbatini CE, Valdano E, Turbelin C, Debin M, et al. Underdetection of COVID-19 cases in France in the exit phase following lockdown. medRxiv. 2020. [Google Scholar]
  • 61. Mahase E. Covid-19: Universities roll out pooled testing of students in bid to keep campuses open. BMJ. 2020; p. m3789. [DOI] [PubMed] [Google Scholar]
  • 62. Hogan CA, Gombar S, Wang H, Röltgen K, Shi RZ, Holubar M, et al. Large-Scale Testing of Asymptomatic Healthcare Personnel for Severe Acute Respiratory Syndrome Coronavirus 2. Emerging Infectious Diseases. 2021;27(1):250–254. 10.3201/eid2701.203892 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Arons MM, Hatfield KM, Reddy SC, Kimball A, James A, Jacobs JR, et al. Presymptomatic SARS-CoV-2 Infections and Transmission in a Skilled Nursing Facility. New England Journal of Medicine. 2020;382(22):2081–2090. 10.1056/NEJMoa2008457 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Thompson KH. Estimation of the Proportion of Vectors in a Natural Population of Insects. Biometrics. 1962;18(4):568. 10.2307/2527902 [DOI] [Google Scholar]
  • 65.Prevention ECfD. Methodology for estimating point prevalence of SARS-CoV-2 infection by pooled RT-PCR testing. Stockholm: ECDC. 2020.
  • 66. Barrett ES, Horton DB, Roy J, Gennaro ML, Brooks A, Tischfield J, et al. Prevalence of SARS-CoV-2 infection in previously undiagnosed health care workers at the onset of the U.S. COVID-19 epidemic. medRxiv. 2020. 10.1101/2020.04.20.20072470 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Gudbjartsson DF, Helgason A, Jonsson H, Magnusson OT, Melsted P, Norddahl GL, et al. Spread of SARS-CoV-2 in the Icelandic Population. New England Journal of Medicine. 2020;382(24):2302–2315. 10.1056/NEJMoa2006100 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Smith DRM, Duval A, Pouwels KB, Guillemot D, Fernandes J, et al. Optimizing COVID-19 surveillance in long-term care facilities: a modelling study. BMC Medicine. 2020;18(1). 10.1186/s12916-020-01866-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Wyllie AL, Fournier J, Casanovas-Massana A, Campbell M, Tokuyama M, Vijayakumar P, et al. Saliva or Nasopharyngeal Swab Specimens for Detection of SARS-CoV-2. New England Journal of Medicine. 2020;383(13):1283–1286. 10.1056/NEJMc2016359 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Azzi L, Carcano G, Gianfagna F, Grossi P, Gasperina DD, Genoni A, et al. Saliva is a reliable tool to detect SARS-CoV-2. Journal of Infection. 2020;81(1):e45–e50. 10.1016/j.jinf.2020.04.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. To KKW, Tsang OTY, Leung WS, Tam AR, Wu TC, Lung DC, et al. Temporal profiles of viral load in posterior oropharyngeal saliva samples and serum antibody responses during infection by SARS-CoV-2: an observational cohort study. The Lancet Infectious Diseases. 2020;20(5):565–574. 10.1016/S1473-3099(20)30196-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Williams E, Bond K, Zhang B, Putland M, Williamson DA. Saliva as a Noninvasive Specimen for Detection of SARS-CoV-2. Journal of Clinical Microbiology. 2020;58(8). 10.1128/JCM.00776-20 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Khurshid Z, Zohaib S, Joshi C, Moin SF, Zafar MS, Speicher DJ. Saliva as a non-invasive sample for the detection of SARS-CoV-2: a systematic review. medRxiv. 2020. [Google Scholar]
  • 74. Hanson KE, Barker AP, Hillyard DR, Gilmore N, Barrett JW, Orlandi RR, et al. Self-Collected Anterior Nasal and Saliva Specimens versus Health Care Worker-Collected Nasopharyngeal Swabs for the Molecular Detection of SARS-CoV-2. Journal of Clinical Microbiology. 2020;58(11). 10.1128/JCM.01824-20 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Siegler AJ, Hall E, Luisi N, Zlotorzynska M, Wilde G, Sanchez T, et al. Willingness to Seek Diagnostic Testing for SARS-CoV-2 With Home, Drive-through, and Clinic-Based Specimen Collection Locations. Open Forum Infectious Diseases. 2020;7(7). 10.1093/ofid/ofaa269 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76. Treibel TA, Manisty C, Burton M, McKnight Á, Lambourne J, Augusto JB, et al. COVID-19: PCR screening of asymptomatic health-care workers at London hospital. The Lancet. 2020;395(10237):1608–1610. 10.1016/S0140-6736(20)31100-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.R Core Team. R: A Language and Environment for Statistical Computing; 2020. Available from: https://www.R-project.org/.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008726.r001

Decision Letter 0

Jingyi Jessica Li, Tom Britton

25 Oct 2020

Dear Dr. Rupprecht,

Thank you very much for submitting your manuscript "Group testing as a strategy for COVID-19 epidemiological monitoring and community surveillance" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Jingyi Jessica Li

Guest Editor

PLOS Computational Biology

Tom Britton

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Brault and colleagues present a computational framework for pooling samples prior to Sars-Cov-2 testing as a tool to estimate the prevalence of infection within a population. Pooling samples has often been used in the past to detect low-prevalence targets in an efficient manner by conserving tests, and therefore there has been great interest in adapting this to Covid19 testing. The unique insight here is that, in contrast to the usual approach, the authors do not examine pooled testing as a prelude to the testing of individual samples (i.e. for efficient eventual identification of infected individuals) but rather as an end in itself, in order to efficiently provide population surveillance. To do so, they develop a model of test-positivity grounded in apparently reasonable assumptions about test performance and viral load, estimate a simulated distribution of viral loads using a partially-published dataset, and use this to describe how their computational approaches can be adapted as a tool to monitor the appearance and tracking of prevalence within a defined population.

Rigourous assessments of the computational approaches I will delegate to my co-reviewers, and focus instead on the conceptual framework and applicability of findings.

The major weakness of the paper, as the authors implicitly acknowledge, is the use of a simulated dataset of viral densities represented in Figure 3; this was estimated based upon inspection of a pre-print article reporting many thousands of results from Germany, but the exact data were not available for this report, and furthermore that source paper provides little information about how viral positivity and density are correlated with important clinical and demographic variables, such as age, symptomaticity, etc. This has important implications for the applicability of their suggested testing protocols later in the report. Furthermore, what is the authors' hypothesis as to why these values are distributed as they are?

It would be helpful in the Introduction for the Authors to more clearly differentiate their goals from those of others' in the increasingly crowded covid-sample-pooling space. Namely that mentioned above, which is that the goal here is not to provide a tool to optimize pooled "cascade" testing and efficiently find the few infected individuals, but rather for routine population surveillance , presumably to provide a metric by which to deploy population-level controls (i.e. distancing, etc). Describing a roadmap for how these data might be ultimately actionable would sharpen this focus and clarify the distinct aims of this effort.

Very minor commments: "Contaminated" is an odd choice and I suggest infected in its place. Also viral "charge" is non-standard to me, and "load" would be a more common usage.

Reviewer #2: The authors present a modeling study of pooled testing for SARS-CoV-2. The article is not framed well in infectious disease epidemiology and I recommend the authors work with an infectious disease epidemiologist before resubmitting their paper for review. Many of the ideas that the authors present have been discussed at great length in the field of infectious disease epidemiology, particularly around infectious disease surveillance.

1. The framing of the article is odd and fails to account for understanding of infectious disease epidemiology. Although it is true that pooled testing could be used to estimate the prevalence of COVID-19, prevalence of an infectious disease with such high transmission potential is an odd metric to use in surveillance. As soon as the prevalence is estimated, it would be outdated. I recommend reframing the article around a realistic use of pooled testing, either to increase throughput of incident cases or as screening in attempts to find SARS-CoV-2 infections and break chains of transmission. The authors might consider bringing on an infectious disease epidemiologist to help in this regard, as the revisions needed are extensive. The following instances need to be reframed:

a. Lines 2-6. Regularly monitoring the prevalence of the disease does not prevent the onset of an epidemic wave. Prevalence means nothing without incidence, and monitoring does nothing to prevent increasing transmission. Also prevalence is not the metric that would be used to assess the effectiveness of interventions.

b. Lines 20-23. Although it is true that repeated random samples would give a measure of the prevalence of the pathogen, prevalence is not used to monitor infectious disease unless the carriage is long (such as tuberculosis, malaria, or HIV). With SARS-CoV-2 having about a two-week infectious interval, prevalence becomes basically worthless. Mass testing instead, should be used as an intervention to screen and find infections. Pooled testing is also useful to increase throughput for diagnostic testing.

c. In section 1 the authors presume a point prevalence of 3%. A point prevalence of 3% for SARS-CoV-2 would be alarmingly high and would signify that the transmission rate is out of control. I would recommend at the highest using 1%, and if really trying to

2. Another example highlighting the lack of infectious disease epidemiology understanding is in lines 10-13. This metric is called test positivity, and corresponds to the proportion of tests that are positive among treatment-seeking individuals. The term apparent prevalence is not widely used, and should not be continued. Although the principles discussed in this paragraph are correct, the terminology is wrong and confusing. I suggest reframing this paragraph to discuss how test positivity among treatment seeking individuals will overestimate the prevalence of the pathogen in the population.

3. The authors’ focus on the application of group testing to solve the problem of estimating the prevalence is strange. The problems of estimating the prevalence of a pathogen are rooted in who gets tested and how to draw the sample rather than in how the test is conducted. Pooled testing increases throughput, which then can operationalize testing to screen for more infections.

4. There is great value in estimating the false negativity rate among pooled testing. I would have liked to see that particular problem highlighted and improved, with modeling taking into account an estimate of actual prevalence at the time of testing.

Reviewer #3: Review of “Group testing as a strategy for COVID-19 epidemiological

monitoring and community surveillance “

The paper analyses the use of group or pooled testing for detecting

COVID-19 prevalence in a population. It assumes a limited number of

available tests, and looks at the optimal number of individuals to pool

together into each test to achieve tightest confidence interval around

the estimate of prevalence.

This is a solid and interesting paper, and it addresses all the major

problems associated with pooled testing. I have a few minor comments.

1. Line 90, the paper cites a “classical computation”, giving a 2020

reference. If it is a classical computation, I'd cite an earlier source.

The source cited, does not, as far as I can tell, use the same

computation. Instead it uses c_{*}=1. Please give a correct reference,

and even better an actual derivation of the result. This result seems

indeed valid for reducing the confidence interval, though there are

other possible choices for error reduction, such as mean square error,

or, my personal favorite, maximum information. As the paper points out,

the minimum is very flat and as such these are very similar.

2. The paper derives a distribution of C_{t} values, the viral load.

What is not taken into account or mentioned is that this distribution is

not a constant. Each individual goes through a time course of viral

load, and therefore the distribution depends on the time course of the

disease in the population. I don't see a need to use this fact in the

estimation, but it would be nice to mention this fact, and maybe address

how results would differ, if they would. In my opinion all of section 3

detracts from the main message of the paper. I think it should go in a

supplement, just giving the main result of mixture of Gaussian+censure

in the paper for further analysis.

3. Section 4.4.2, optimization of regularity of test, stands somewhat

apart of the rest of the paper. It is a very interesting question, but

the paper only addresses it via simulation. There is also no discussion

of false positives, which are relevant for timing of outbreak.

4. The paper uses “contaminated individual” for “infected individual”.

Usually, I would use contaminated for a false positive. There are also

some other small problems such as “law of the artefact” on line 138 -

I'm not sure what this means, “exemple” on 105. “But to reach similar

level of precision than in single testing“ 107. “the measure is always

made for samples detected as positive “ 240. “We now show how the

previous analysis of the tests used to measure the viral load in

patients can be used to precise the epidemiological monitoring of the

disease in the general population. “ 330. I think these are errors, but

since I'm not a native English speaker, I'm not totally sure. Careful

editing would be good.

---

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008726.r003

Decision Letter 1

Jingyi Jessica Li, Tom Britton

9 Jan 2021

Dear Dr. Rupprecht,

Thank you very much for submitting your manuscript "Group testing as a strategy for COVID-19 epidemiological monitoring and community surveillance" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Jingyi Jessica Li

Guest Editor

PLOS Computational Biology

Tom Britton

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: As with my initial review, I've limited my focus to the conceptual approach, interpretation, and potential implementation of the work for disease surveillance. I find acceptable and agreeable the author's edits made in response to my suggestions, and furthermore agree with those made in response to the other reviewers as well.

Reviewer #2: The authors have done commendable work revising the article and responding to my edits. I really enjoyed the introduction, and again incorporating false negatives into the modeling is extremely well done. I am satisfied with their response I include a few more revisions that stood out, all minor.

1. Lines 28-30 could be revised for clarity. Completely asymptomatic infections only occur at most in 20% of the population, whereas pre-symptomatic transmission is quite common. We don’t know yet how much asymptomatic carriers actually transmit coronavirus, but we do know that pre-symptomatic transmission is common. I suggest striking this sentence beginning, “particularly challenging…” and having the previous sentence start the new paragraph with China as the example. There are two systematic reviews for proportion of cases that are truly asymptomatic that would be better references – both found about 15%. One’s now a living systematic review at PLOS Medicine: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7508369/ and the other is at the Journal of the Association of Medical Microbiology and Infectious Disease Canada: https://jammi.utpjournals.press/doi/10.3138/jammi-2020-0030

2. Line 38 should read “seminal” rather than “semina”

3. Line 87 – This local newspaper article is probably the best write-up of how SUNY is doing their massive testing and could be cited to give some credit to Dr. Middleton’s heroic efforts: https://www.syracuse.com/coronavirus/2020/11/how-upstate-medical-university-used-spit-and-grit-to-make-game-changer-coronavirus-test.html

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008726.r005

Decision Letter 2

Jingyi Jessica Li, Tom Britton

20 Jan 2021

Dear Dr. Rupprecht,

We are pleased to inform you that your manuscript 'Group testing as a strategy for COVID-19 epidemiological monitoring and community surveillance' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Jingyi Jessica Li

Guest Editor

PLOS Computational Biology

Tom Britton

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008726.r006

Acceptance letter

Jingyi Jessica Li, Tom Britton

11 Feb 2021

PCOMPBIOL-D-20-01303R2

Group testing as a strategy for COVID-19 epidemiological monitoring and community surveillance

Dear Dr Rupprecht,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Alice Ellingham

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Supporting analysis and proofs document.

    (PDF)

    S1 Code. Codes used in the paper.

    (ZIP)

    S1 Data. Excel sheet with numerical values that were used to generate viral load histograms.

    (XLSX)

    Attachment

    Submitted filename: ReviewersLetterPlosComp_11122020.pdf

    Attachment

    Submitted filename: Reviewers_120121.pdf

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting information files.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS

    RESOURCES