Journal of Applied Statistics. 2024 Apr 2;51(15):3039–3058. doi: 10.1080/02664763.2024.2335569

Estimation of world seroprevalence of SARS-CoV-2 antibodies

Kwangmin Lee a, Seongmin Kim b, Seongil Jo c (corresponding author), Jaeyong Lee b
PMCID: PMC11536633  PMID: 39507207

Abstract

In this paper, we estimate the seroprevalence of antibodies against COVID-19 by country and derive the world seroprevalence. To estimate seroprevalence among adults, we use serological surveys (also called serosurveys) conducted within each country. When serosurveys are used to estimate world seroprevalence, two issues arise. First, there are countries in which no serological survey has been conducted. Second, the sample collection dates differ from country to country. We tackle these problems using vaccination data, confirmed cases data, and national statistics. We construct Bayesian models to estimate separately the numbers of people who have antibodies produced by infection and by vaccination. For the number of people with antibodies due to infection, we develop a hierarchical model that combines the information in both confirmed cases data and national statistics. At the same time, we propose regression models to estimate missing values in the vaccination data. Using the proposed methods, we obtain [35.5%, 56.8%] as the 95% credible interval of the world seroprevalence as of 31 July 2021.

Keywords: Bayesian model, hierarchical model, SARS-CoV-2 antibodies, vaccination, world seroprevalence

1. Introduction

At the beginning of December 2019, the first patient with coronavirus disease 2019 (abbreviated COVID-19), which is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was identified in Wuhan, China [12]. In the following weeks, the disease spread rapidly across China and to other countries, causing worldwide damage, and it is still widespread. According to the official statement, COVID-19 has so far caused more than 317 million infections and 5.5 million deaths globally.

Vaccines are a critical tool for protecting people because they induce antibodies against infectious diseases. Every country in the world is struggling to block the spread of the virus and treat patients. As part of that effort, countries are administering COVID-19 vaccines, and the majority of people in many countries have been vaccinated. A variety of COVID-19 vaccines are available, e.g. AstraZeneca, Johnson & Johnson, Moderna, Novavax, and Pfizer-BioNTech, and further candidates are currently in Phase III clinical trials [7].

Seroprevalence is the proportion of people in a population who have antibodies to a particular virus, produced by previous infection or by vaccination. In this paper, we study the seroprevalence of SARS-CoV-2 antibodies in people all over the world, particularly in adults, using information officially reported by countries. The available information includes confirmed cases, the number of people vaccinated, types of vaccines, and serosurvey data.

Recently, various approaches have been proposed for estimating the seroprevalence of antibodies to SARS-CoV-2. For example, [6] proposed a Bayesian method with a user-specific likelihood function that can incorporate the variability of the specificity and sensitivity of the antibody tests, [14] utilized a Bayesian logistic regression model with random effects for age and sex, and [10] developed a Bayesian multilevel poststratification approach with multiple diagnostic tests. Lee et al. [11] presented a Bayesian binomial model with an informative prior distribution based on clinical trial data of the plaque reduction neutralization test (PRNT), a kind of serology test. While these approaches have been developed for the populations of certain regions rather than for the global population, [2] considered global seroprevalence by proposing a meta-regression method. In this paper, we propose a new Bayesian method for estimating the world seroprevalence of SARS-CoV-2 antibodies and verify the proposed method via numerical studies. This method estimates the percentage of people who have developed antibodies due to viral infection or vaccination in each country and combines these estimates using a hierarchical Bayesian model. Additionally, the method utilizes informative priors constructed from external information to enhance the accuracy of the estimates. By doing so, we can provide global seroprevalence estimates that better reflect the available information and its uncertainty. We assess the accuracy of the proposed method via a simulation study and leave-one-out cross-validation on the real data.

The rest of the paper is organized as follows. In the next section, we introduce the serology test and vaccination datasets for the SARS-CoV-2 and briefly review the model proposed in [11] for constructing an informative prior. In Section 3, we propose a new Bayesian approach to estimate the world seroprevalence of SARS-CoV-2. Section 4 presents the results of empirical analysis using real data. Finally, the conclusion is given in Section 5.

2. Materials

2.1. Vaccine data

In this subsection, we introduce the notation used in the rest of the paper and describe datasets for estimating the number of effectively vaccinated people by country. The datasets include the vaccinations, delivery amount of vaccines, and observational studies for vaccine effectiveness.

2.1.1. Vaccination data by country

We utilize the vaccination data given in [13], which are collected from official public reports on vaccinations against COVID-19 by country. The dataset contains the cumulative vaccine doses administered, the cumulative number of fully vaccinated people, the report dates, and information on the vaccine manufacturers. As of 31 July 2021, the number of reporting countries is 182.

We denote the $j$th report date of the $i$th country by $d_{i,j}$, where $i=1,2,\ldots,182$ and $j=1,2,\ldots,J_i$. The cumulative doses administered and the cumulative number of fully vaccinated people up to the date $d_{i,j}$ are denoted by $X_{i,j}$ and $Y_{i,j}$, respectively, for the $j$th report of the $i$th country. Note that $X_{i,j}$ is observed for all $i$ and $j$, while $Y_{i,j}$ is not observed in some reports. Specifically, $Y_{i,j}$ is not observed at all in two countries, Cote d'Ivoire and Ethiopia, and is partially observed in 113 countries. We denote the set of vaccine manufacturers used at the corresponding date by $V_{i,j}$. For example, if only the vaccines produced by AstraZeneca and Pfizer-BioNTech are used at the $j$th report date of the $i$th country, then $V_{i,j}=\{\text{AstraZeneca},\text{Pfizer-BioNTech}\}$.

We define $X_{i,j,k}$ as the cumulative doses of the vaccine from the $k$th manufacturer for $k=1,\ldots,K$, where $K$ is the number of vaccine manufacturers in the whole vaccination data. With this definition, we have $X_{i,j}=\sum_{k=1}^{K}X_{i,j,k}$. In the vaccination data we consider, $X_{i,j,k}$ are observed in 32 countries.

2.1.2. Delivery amount of vaccines

As of the 31st July, [16] presents the delivery data, which refer to the amounts of doses that a country has received. The delivery data consist of publicly reported delivered vaccine amounts, including bilateral agreement, COVAX shipment, and donations. Among 182 countries providing vaccination reports (Section 2.1.1), the delivery data are available for 140 countries. We use these data for the estimation of missing values of Xi,j,k.

Let $D$ be the set of country indexes having the delivery amount data, and let $\tilde{s}_{i,k}$ be the delivery amount of the $k$th vaccine in the $i$th country, $i\in D$. We define $s_{i,k}$ as

$$s_{i,k}=\begin{cases}\tilde{s}_{i,k}\Big/\sum_{k'=1}^{K}\tilde{s}_{i,k'} & \text{if } i\in D,\\[4pt] \sum_{i'\in D}\tilde{s}_{i',k}\Big/\sum_{k'=1}^{K}\sum_{i'\in D}\tilde{s}_{i',k'} & \text{if } i\notin D,\end{cases}$$

which denotes the proportion of the $k$th vaccine delivered in the $i$th country. Note that for the case $i\notin D$, this definition is based on the assumption that the delivery amount of the $k$th vaccine in a country is affected by the total supply of this vaccine.
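The two cases of the definition of $s_{i,k}$ can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout and the toy delivery amounts are assumptions made here, not part of the paper's code, and every country with delivery data is assumed to have a positive total.

```python
def delivery_proportions(deliveries, countries, n_vaccines):
    """Return s[country][k], the delivery share of vaccine k.

    deliveries: dict mapping country -> list of delivered doses per vaccine;
                only countries with delivery data (the set D) appear as keys.
    """
    # Pooled supply of each vaccine over all countries with delivery data.
    pooled = [sum(d[k] for d in deliveries.values()) for k in range(n_vaccines)]
    pooled_total = sum(pooled)
    shares = {}
    for c in countries:
        if c in deliveries:            # i in D: the country's own shares
            total = sum(deliveries[c])
            shares[c] = [x / total for x in deliveries[c]]
        else:                          # i not in D: shares of the pooled supply
            shares[c] = [x / pooled_total for x in pooled]
    return shares


# Made-up delivery amounts for two vaccines; country "C" has no delivery data.
toy = {"A": [30.0, 10.0], "B": [10.0, 50.0]}
shares = delivery_proportions(toy, ["A", "B", "C"], 2)
```

In this toy input, country "C" falls back to the pooled shares of the two vaccines, which is the assumption stated for $i\notin D$.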

2.1.3. Observational studies for vaccine effectiveness

In the vaccination data introduced in Section 2.1.1, twelve kinds of vaccines are used. We identify the vaccines by the name of manufacturers, which are listed in Table 1. We categorize these vaccines into three groups called type 1, 2, and 3 vaccines. The numbering of the type represents the required doses for one person to be fully vaccinated.

Table 1.

The list of vaccine manufacturers in the vaccination data [13].

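| type | manufacturer | interval (days) | number of studies (fully/partially) |
|---|---|---|---|
| 1 | Janssen | | 8/0 |
| 1 | CanSino | | 0/0 |
| 2 | AstraZeneca (AZ) | 84 | 17/15 |
| 2 | Pfizer | 21 | 80/46 |
| 2 | Sinopharm | 21 | 1/0 |
| 2 | Sputnik V | 21 | 1/0 |
| 2 | Sinovac | 14 | 2/1 |
| 2 | Moderna | 28 | 40/26 |
| 2 | Covaxin | 28 | 0/0 |
| 2 | QazVac | 21 | 0/0 |
| 2 | EpiVacCorona | 21 | 0/0 |
| 3 | RBD-Dimer | 56 | 0/0 |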

Note: In the third column are the recommended intervals between the first and last doses of each vaccine, which are obtained from [15]. In the fourth column, the number of studies on vaccine effectiveness is presented. The studies targeting fully vaccinated and partially vaccinated people are distinguished.

A vaccine is evaluated by its efficacy or effectiveness, which measures how well vaccination protects people against infection, symptomatic illness, hospitalization, or death. While efficacy is based on controlled clinical trials, effectiveness is based on real-world observational studies. In this paper, we consider effectiveness since we analyze real-world vaccination data.

Higdon et al. [9] conducted a systematic review of COVID-19 vaccine effectiveness studies. They collected 107 effectiveness studies, categorized into four groups: effectiveness studies against death, severe disease, symptomatic disease, and any infection. We use the effectiveness studies against any infection for seroprevalence estimation. This gives 69 studies for 7 vaccines: Pfizer, Moderna, AstraZeneca (AZ), Sputnik V, Janssen, Sinovac, and Sinopharm. We summarize the 69 observational studies in Table 1. Observational studies for type 2 vaccines report the effectiveness for fully vaccinated people and the effectiveness for partially vaccinated people.

2.2. Serological survey data

We collect the serological survey data from SeroTracker, a knowledge hub of COVID-19 serosurveillance [1]. We exclude survey data that has a risk of a biased sample. Specifically, we consider the following two exclusion conditions:

  • The sample is collected from a sub-population.

  • The seroprevalence is lower than the COVID confirmed population rates.

The survey data collected by SeroTracker include surveys of specific sub-populations, such as a particular region, age group, or healthcare workers. We exclude surveys targeting these sub-populations to ensure our analysis represents the national population, leaving us with 126 serological surveys after applying the first exclusion criterion. The second exclusion criterion pertains to the conceptual distinction between individuals with confirmed COVID-19 and those with detectable SARS-CoV-2 antibodies; surveys that do not adhere to this distinction are also excluded. After applying this second criterion, we have 99 serological surveys from 45 countries as of 31 July 2021. Each survey is characterized by its own sampling period. The histogram of the last dates in the sampling periods is shown in Figure 1.

Figure 1.

The histogram of the last dates in the sampling periods for 99 nationwide serological surveys.

3. A Bayesian method for the seroprevalence estimation

We present a Bayesian method to estimate the seroprevalence. Specifically, the method estimates the seroprevalence from two parts: the proportions of the effectively vaccinated and of the infected, which are denoted by $\theta^{(V)}$ and $\theta^{(I)}$, respectively.

Recall that the effectively vaccinated are people with antibodies produced by vaccines and that the infected are those who have acquired antibodies by infection. We define the seroprevalence $\theta_i(t)$ at date $t$ in the $i$th country as

$$\theta_i(t)=\theta_i^{(V)}(t)+\theta_i^{(I)}(t)-\theta_i^{(V)}(t)\,\theta_i^{(I)}(t),$$

where the product term $\theta_i^{(V)}(t)\,\theta_i^{(I)}(t)$ is subtracted so that people who are both infected and effectively vaccinated are not counted twice; it represents the cases in which the infected are vaccinated without the knowledge of infection. We provide Bayesian models to estimate $\theta_i^{(V)}(t)$ and $\theta_i^{(I)}(t)$ in Sections 3.1 and 3.2, respectively, for each country and date.
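As a small numerical illustration of this combination rule, the following Python sketch evaluates it for made-up proportions; the function name and the inputs are hypothetical. The same value is obtained from $1-(1-\theta^{(V)})(1-\theta^{(I)})$.

```python
def combined_seroprevalence(theta_v, theta_i):
    """theta = theta_V + theta_I - theta_V * theta_I."""
    for p in (theta_v, theta_i):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return theta_v + theta_i - theta_v * theta_i


# Made-up inputs: 30% effectively vaccinated, 20% infected.
example = combined_seroprevalence(0.30, 0.20)
```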

3.1. Models for vaccine induced seroprevalence

For the estimation of $\theta_i^{(V)}(t)$, we propose a Bayesian model to estimate the number of effectively vaccinated people. Let $M_{i,j}$ denote the number of effectively vaccinated people at the $j$th report date in the $i$th country. Note that the index $j$ in $M_{i,j}$ indicates the report index of the vaccination data (Section 2.1.1), and vaccination reports are not given every day. If $M_{i,j}$ for $j\in[J_i]=\{1,\ldots,J_i\}$ are given, we can obtain $\theta_i^{(V)}(t)$ as

$$\theta_i^{(V)}(t)=\begin{cases}0 & \text{if } \{j\in[J_i]: d_{i,j}\le t\}=\emptyset,\\ M_{i,\tilde{j}}/P_i & \text{otherwise},\end{cases}\tag{1}$$

where $\tilde{j}=\max\{j\in[J_i]: d_{i,j}\le t\}$, and $P_i$ is the population of the $i$th country. When there is no report on date $t$, we use the most recent report before date $t$. Thus, we focus on the estimation of $M_{i,j}$ for the estimation of $\theta_i^{(V)}(t)$.
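Equation (1) is a step function that carries the latest report forward in time. A minimal Python sketch, assuming report dates are stored as sorted integer day numbers (the function and variable names are chosen here for illustration only):

```python
from bisect import bisect_right


def theta_v_at(t, report_dates, effective_counts, population):
    """Equation (1): use the latest report dated on or before day t."""
    # report_dates must be sorted in ascending order.
    pos = bisect_right(report_dates, t)   # number of reports with d_ij <= t
    if pos == 0:                          # no report yet: proportion is zero
        return 0.0
    return effective_counts[pos - 1] / population


# Made-up reports on days 10, 17 and 31 for a population of 1000.
dates = [10, 17, 31]
counts = [100, 250, 400]
```

For example, a query on day 30 returns the value of the day-17 report, because no later report exists on or before that day.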

Let $Y_{i,j,k}$ be the number of people fully vaccinated with the $k$th vaccine at the $j$th report date in the $i$th country, and let $E_k^{(f)}$ and $E_k^{(p)}\in[0,1]$ be the effectiveness of the $k$th vaccine for the fully vaccinated people and for those who have had at least one dose but have not finished the required doses, respectively. We assume that the distribution of $M_{i,j}$ is

$$M_{i,j}\sim\sum_{k}\left(\mathrm{Binom}\left(Y_{i,j,k},E_k^{(f)}\right)+\mathrm{Binom}\left(\frac{2}{d(k)}\left(X_{i,j,k}-d(k)Y_{i,j,k}\right),E_k^{(p)}\right)\right),\tag{2}$$

where $d(k)$ denotes the required doses of the $k$th vaccine.

The term $2(d(k))^{-1}(X_{i,j,k}-d(k)Y_{i,j,k})$ in (2) represents the partially vaccinated people of the $k$th vaccine. If $d(k)=1$, this term is zero since $X_{i,j,k}=Y_{i,j,k}$ by definition. If $d(k)=2$, $2(d(k))^{-1}(X_{i,j,k}-d(k)Y_{i,j,k})=X_{i,j,k}-2Y_{i,j,k}$, which is the number of people who have received only one dose. If $d(k)=3$, $X_{i,j,k}-d(k)Y_{i,j,k}$ is the sum of the number of people vaccinated with one dose and twice the number of people vaccinated with two doses. Under the assumption that the numbers of people vaccinated once and twice are the same, $2(X_{i,j,k}-3Y_{i,j,k})/3$ is equal to the number of people who have had at least one dose but have not finished the required number of doses. We are aware that this assumption is not warranted, but since the vaccine requiring 3 doses is used in only one country, Uzbekistan, we believe that the effect of the assumption is not critical.
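The partially vaccinated count and one draw of the summand in (2) for a single vaccine can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names and inputs are made up, the binomial draws are simulated as sums of Bernoulli trials to stay within the standard library, and a non-integer partial count is rounded (a choice made here).

```python
import random


def partial_count(x, y, d):
    """2/d * (x - d*y): approximate number of partially vaccinated people."""
    return 2 * (x - d * y) / d


def draw_effective(y_full, n_partial, e_full, e_partial, rng):
    """One draw of Binom(y_full, e_full) + Binom(n_partial, e_partial)."""
    full = sum(rng.random() < e_full for _ in range(y_full))
    part = sum(rng.random() < e_partial for _ in range(int(round(n_partial))))
    return full + part


# Made-up example: 100 doses of a two-dose vaccine, 30 people fully vaccinated.
n_partial = partial_count(100, 30, 2)
draw = draw_effective(30, n_partial, 0.9, 0.6, random.Random(0))
```

Summing such draws over all vaccines $k$ gives one draw of $M_{i,j}$.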

Since some of $X_{i,j,k}$, $Y_{i,j,k}$, $E_k^{(f)}$ and $E_k^{(p)}$ are not observed, we need statistical models for these variables. In Sections 3.1.1–3.1.3, we suggest a method to specify $X_{i,j,k}$ and $Y_{i,j,k}$. In Section 3.1.4, we suggest a method to specify $E_k^{(f)}$ and $E_k^{(p)}$.

3.1.1. Model for $X_{i,j,k}$

We consider a multinomial regression model for $X_{i,j,k}$ given $X_{i,j}$ and $s_{i,k}$, which are defined in Sections 2.1.1 and 2.1.2, respectively. Let $\mathbf{X}_{i,j}=(X_{i,j,1},X_{i,j,2},\ldots,X_{i,j,K})\in\mathbb{R}^K$ be the response vector and $\mathbf{w}_{i,j}=(w_{i,j,1},\ldots,w_{i,j,K})\in\mathbb{R}^K$ be a covariate vector, which is to be defined with $s_{i,k}$ and $X_{i,j'}$ for $j'\in[j]$, where $[n]:=\{1,2,\ldots,n\}$ for a positive integer $n$. We assume

$$\mathbf{X}_{i,j}\sim\mathrm{Multinom}(X_{i,j},\mathbf{p}_{i,j}),\quad \mathbf{p}_{i,j}=(p_{i,j,1},\ldots,p_{i,j,K})\propto\left[\exp\{\beta^{(V1)}\log(w_{i,j,1})\},\ldots,\exp\{\beta^{(V1)}\log(w_{i,j,K})\}\right],\tag{3}$$

where $\beta^{(V1)}\in\mathbb{R}$ is the regression coefficient. Model (3) implies that

$$\log(p_{i,j,x}/p_{i,j,y})=\beta^{(V1)}\log(w_{i,j,x}/w_{i,j,y}),\tag{4}$$

for all $x,y\in[K]$. Equation (4) means that the ratio of the usage probability of the $x$th vaccine to that of the $y$th vaccine, $p_{i,j,x}/p_{i,j,y}$, is proportional to the ratio of $w_{i,j,x}$ to $w_{i,j,y}$ after logarithmic transformation. This assumption is examined via visualization after the definition of $\mathbf{w}_{i,j}$.
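The link between (3) and (4) can be checked numerically. The sketch below normalizes $w_k^{\beta}$ into usage probabilities and verifies the log-ratio identity on made-up weights. Treating a vaccine with $w_k=0$ as having probability zero is an assumption made here for the sketch; the function name is hypothetical.

```python
import math


def usage_probabilities(w, beta):
    """p_k proportional to exp(beta * log(w_k)) = w_k ** beta."""
    # Assumption of this sketch: a zero weight gives a zero probability.
    weights = [wk ** beta if wk > 0 else 0.0 for wk in w]
    total = sum(weights)
    return [v / total for v in weights]


# Made-up covariates for three vaccines; the third one is not in use.
p = usage_probabilities([4.0, 1.0, 0.0], beta=0.5)
log_ratio = math.log(p[0] / p[1])   # equals beta * log(4 / 1) by equation (4)
```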

We now define $\mathbf{w}_{i,j}$ using the variables for the delivery amount $s_{i,k}$ and the numbers of doses administered $X_{i,j'}$ for $j'\in[j]$. In the definition of $\mathbf{w}_{i,j}$, we reflect the idea that $w_{i,j,k}$ is positively dependent both on the delivery amount of the $k$th vaccine in the $i$th country and on the period during which the $k$th vaccine is used. First, let

$$d\mathbf{w}_{i,j}=(dw_{i,j,1},\ldots,dw_{i,j,K}),\quad dw_{i,j,k}=s_{i,k}\,dX_{i,j}\,I(v(k)\in V_{i,j}),\quad\text{for } k=1,\ldots,K,\tag{5}$$

where $v(k)$ is the $k$th vaccine, $dX_{i,j}=X_{i,j}-X_{i,j-1}$ and $X_{i,0}=0$. The variable $dw_{i,j,k}$ is defined by multiplying the number of doses administered at the date of the $j$th report, $dX_{i,j}$, by the delivery amount of the $k$th vaccine in the $i$th country if the $k$th vaccine is used at this date. Otherwise, we set $dw_{i,j,k}$ to zero. Then, we define $\mathbf{w}_{i,j}:=\sum_{j'\le j}d\mathbf{w}_{i,j'}$. Figure 2 is the scatter plot of the points in the set $\{(\log(X_{i,j,x}/X_{i,j,y}),\log(w_{i,j,x}/w_{i,j,y})):\text{both of } X_{i,j,x} \text{ and } X_{i,j,y} \text{ are observed}\}$, and shows that the linearity assumption in (4) is reasonable.
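The construction of $\mathbf{w}_{i,j}$ from the increments in (5) amounts to a running sum over reports. A minimal Python sketch for one country, with hypothetical names, 0-based vaccine indexes and made-up inputs:

```python
def cumulative_weights(s, cum_doses, used):
    """Return w[j][k], the sum over j' <= j of s[k] * dX[j'] * 1(k used at j').

    s:         delivery proportions per vaccine for one country
    cum_doses: cumulative doses X_1, ..., X_J (X_0 = 0 is implied)
    used:      list of sets; used[j] holds the vaccine indexes in use at report j
    """
    n_vaccines = len(s)
    w, running, prev = [], [0.0] * n_vaccines, 0
    for x, active in zip(cum_doses, used):
        dx = x - prev                     # doses administered since the last report
        prev = x
        for k in range(n_vaccines):
            if k in active:               # indicator I(v(k) in V_ij)
                running[k] += s[k] * dx
        w.append(list(running))
    return w


# Two vaccines with equal delivery shares; the second is introduced at report 2.
w = cumulative_weights([0.5, 0.5], [100, 300], [{0}, {0, 1}])
```

In this toy input the vaccine that has been in use longer ends up with the larger weight, which is the intended dependence on the period of use.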

Figure 2.

The scatter plot of the points in the set $\{(\log(X_{i,j,x}/X_{i,j,y}),\log(w_{i,j,x}/w_{i,j,y})):\text{both of } X_{i,j,x} \text{ and } X_{i,j,y} \text{ are observed}\}$.

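We assign a non-informative prior distribution for $\beta^{(V1)}$:

$$\pi(\beta^{(V1)})\propto 1.$$

Theorem 3.1 shows that the posterior distribution under the flat prior is proper. The proof is given in the supplementary material.

Theorem 3.1

Suppose $\mathbf{X}_{i,j}$ follows the distribution (3) for $j=1,2,\ldots,J_i$ and $i=1,2,\ldots,N_1$. Let $U_{i,j}=\{k\in[K]: w_{i,j,k}>0\}$. If there exists $(i,j)$ such that $|\{k\in U_{i,j}: X_{i,j,k}>0\}|\ge 2$, then

$$\int\prod_{i,j}p\left(\mathbf{X}_{i,j}\mid\mathbf{p}_{i,j}(\beta^{(V1)})\right)d\beta^{(V1)}<\infty,$$

where $\mathbf{p}_{i,j}(\beta^{(V1)})$ is $\mathbf{p}_{i,j}$ constructed from $\beta^{(V1)}$, and $p(\mathbf{X}_{i,j}\mid\mathbf{p}_{i,j}(\beta^{(V1)}))$ is the density function with parameter $\mathbf{p}_{i,j}(\beta^{(V1)})$ and observation $\mathbf{X}_{i,j}$.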

3.1.2. Model for $Y_{i,j}$

There are missing values in $Y_{i,j}$ (the cumulative number of fully vaccinated people at the $j$th report date of the $i$th country), and we propose a distribution for the missing values. To do this, we first present methods for three simple cases in which only one type of vaccine is used in country $i$ up to the report date $d_{i,j}$, and then extend them to a method for the general case in which mixed types of vaccines are used in country $i$ up to the report date $d_{i,j}$.

In Case 1, in which only type 1 vaccines are used, $Y_{i,j}$ is easily derived from $X_{i,j}$ since the vaccination is completed with only one dose. Thus, we have

$$2(X_{i,j}-Y_{i,j})=0.\tag{6}$$

In Case 2, in which only type 2 vaccines are used, we apply the Poisson distribution to the random variable $X_{i,j}-2Y_{i,j}$. Note that $X_{i,j}-2Y_{i,j}$ is the number of doses administered to people who have received one dose but have not finished vaccination as of the $j$th report date of the $i$th country. We assume that the longer the interval between the first and the last doses is, the larger $X_{i,j}-2Y_{i,j}$ is. We also assume that the more doses have been administered recently, the larger $X_{i,j}-2Y_{i,j}$ is.

To specify the doses recently administered, we address the relation between the report index $j$ and the corresponding report date. For each report index $j$, $d_{i,j}$ is defined as the report date, and $d_{i,j}$ satisfies $d_{i,1}<d_{i,2}<\cdots<d_{i,J_i}$. In the vaccination data, there exists an index $j$ such that $d_{i,j}-d_{i,j-1}>1$, i.e. reports are not given every day. When we need vaccination data for a date $d$ with no report, i.e. $d\notin\{d_{i,j}: j=1,\ldots,J_i\}$, we use the data from the closest report. Specifically, we define $j^*(j,\delta;i)$, which indicates the report index closest to the date $d_{i,j}-\delta$, as

$$j^*(j,\delta;i)=\min\Big\{\operatorname*{argmin}_{j'\le j-1}\,|d_{i,j}-d_{i,j'}-\delta|\Big\},$$

for country index $i$, report index $j$ and positive integer $\delta$. According to the definition of $j^*(j,\delta;i)$, when there is more than one minimizer in $\operatorname*{argmin}_{j'\le j-1}|d_{i,j}-d_{i,j'}-\delta|$, we use the smallest index. In this paper, we set $\delta=21$, and if there is no confusion, we let $j^*$ denote $j^*(j,\delta;i)$. Using the definition of $j^*$, we define $Z_{i,j}=(X_{i,j}-X_{i,j^*})/(d_{i,j}-d_{i,j^*})$, representing the average daily doses recently administered, and we define $W_{i,j}=Z_{i,j}T$, approximating the doses administered in the most recent $T$ days, where $T$ is the required interval between the first and last doses.
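The search for the closest earlier report and the resulting quantities $Z_{i,j}$ and $W_{i,j}$ can be sketched as below. The function names and toy data are hypothetical, list indexes are 0-based (unlike the 1-based report index in the text), and the first report is rejected because it has no earlier report to compare with.

```python
def closest_earlier_report(dates, j, delta=21):
    """Index j* < j whose date is closest to dates[j] - delta (smallest on ties)."""
    if j == 0:
        raise ValueError("the first report has no earlier report")
    target = dates[j] - delta
    # min() returns the first minimizer, i.e. the smallest index on ties.
    return min(range(j), key=lambda jp: abs(dates[jp] - target))


def recent_doses(dates, cum_doses, j, interval, delta=21):
    """W = Z * T, with Z the average daily doses between reports j* and j."""
    js = closest_earlier_report(dates, j, delta)
    z = (cum_doses[j] - cum_doses[js]) / (dates[j] - dates[js])
    return z * interval


# Made-up report days and cumulative doses for one country.
dates = [0, 10, 20, 30, 44]
doses = [0, 100, 300, 600, 1300]
w_last = recent_doses(dates, doses, 4, interval=21)
```

For the last report (day 44) the target date is day 23, so the report on day 20 is selected and $Z$ is the average of 1000 doses over 24 days.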

Supposing only one kind of type 2 vaccine is used, we propose the regression model

$$(X_{i,j}-2Y_{i,j})\sim\mathrm{Pois}\left(\exp\left(\beta_0^{(V2)}+\beta_1^{(V2)}\log(W_{i,j})\right)\right).\tag{7}$$

This model reflects the assumption that $X_{i,j}-2Y_{i,j}$ is positively related to the doses administered in the most recent $T$ days. Recall that $X_{i,j}-2Y_{i,j}$ is the number of doses administered to people who have received one dose but have not finished vaccination as of the $j$th report date of the $i$th country. We suppose that people who have received only one dose had their first dose within the most recent $T$ days, based on the required interval.

The model (7) can be used only when one kind of type 2 vaccine is used. We extend (7) to the case in which $K$ kinds of type 2 vaccines are possibly used, where $K$ is a positive integer larger than 1. We replace $T$ in $W_{i,j}$ with the weighted mean of the intervals, $\sum_{k=1}^{K}\tilde{w}_{i,j,k}T_k$. Here $T_k$ is the required interval between the first and last doses of the $k$th vaccine. We define $\tilde{w}_{i,j,k}$ as

$$\tilde{w}_{i,j,k}=\frac{\sum_{j'=j^*}^{j}dw_{i,j',k}}{\sum_{k'=1}^{K}\sum_{j'=j^*}^{j}dw_{i,j',k'}}.\tag{8}$$

Recall the definition of $dw_{i,j,k}$ in (5). The variable $dw_{i,j,k}$ is zero when the $k$th vaccine is not used at the $j$th report date of the $i$th country; otherwise, this variable represents the delivery amount of the $k$th vaccine in the $i$th country multiplied by the doses administered at the corresponding date. Thus, $\tilde{w}_{i,j,k}$ is constructed from three factors: the delivery amount, the doses administered during the most recent $d_{i,j}-d_{i,j^*}$ days, and whether the $k$th vaccine is used. Using the weighted mean of the intervals, we define $W_{i,j}^{(2)}=Z_{i,j}\sum_{k\in V^{(2)}}\tilde{w}_{i,j,k}T_k$ to replace $W_{i,j}$ in (7). We suggest the distribution for Case 2 as

$$(2X_{i,j}-2Y_{i,j})\sim X_{i,j}+\mathrm{Pois}\left(\exp\left(\beta_0^{(V2)}+\beta_1^{(V2)}\log(W_{i,j}^{(2)})\right)\right),\tag{9}$$

where $V^{(2)}$ is the index set of type 2 vaccines.

Next, we propose a model for Case 3, in which only type 3 vaccines are used, using a similar idea as in Case 2. To do this, we use the random variable $X_{i,j}-3Y_{i,j}$ instead of $X_{i,j}-2Y_{i,j}$. Here the variable $X_{i,j}-3Y_{i,j}$ represents the doses administered to people who have not finished vaccination. Then we consider the Poisson model

$$(X_{i,j}-3Y_{i,j})\sim\mathrm{Pois}\left(\exp\left(\beta_0^{(V2)}+\beta_1^{(V2)}\log\Big(Z_{i,j}\sum_{k\in V^{(3)}}\tilde{w}_{i,j,k}T_k\Big)\right)\right),$$

where $W_{i,j}^{(3)}=Z_{i,j}\sum_{k\in V^{(3)}}\tilde{w}_{i,j,k}T_k$, and $V^{(3)}$ is the index set of type 3 vaccines. We can re-express this distribution as

$$(2X_{i,j}-2Y_{i,j})\sim\frac{4}{3}X_{i,j}+\frac{2}{3}\mathrm{Pois}\left(\exp\left(\beta_0^{(V2)}+\beta_1^{(V2)}\log(W_{i,j}^{(3)})\right)\right).\tag{10}$$

Finally, we combine the models (6), (9) and (10) to construct the model for the general case. Let $q_l$ be the weight of type $l$ vaccines for $l=1,2,3$ with $q_1+q_2+q_3=1$, defined as $q_l=\sum_{k\in V^{(l)}}\tilde{w}_{i,j,k}$ for $l=1,2,3$. By combining (6), (9) and (10), we propose the generalized model

$$2(X_{i,j}-Y_{i,j})\sim\mathrm{Pois}\left(q_2\left(X_{i,j}+\exp\left(\beta_0^{(V2)}+\beta_1^{(V2)}\log(W_{i,j}^{(2)})\right)\right)+q_3\left(\frac{4}{3}X_{i,j}+\frac{2}{3}\exp\left(\beta_0^{(V2)}+\beta_1^{(V2)}\log(W_{i,j}^{(3)})\right)\right)\right).\tag{11}$$
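To see how the general model reduces to the three special cases, the Poisson mean in (11) can be evaluated directly. The sketch below is illustrative only: the function names, coefficient values and inputs are made up, and solving the mean for $Y$ gives a plug-in value rather than a posterior draw.

```python
import math


def poisson_mean_general(x, q2, q3, w2, w3, beta0, beta1):
    """Poisson mean of 2 * (X - Y) in the general model (11)."""
    mean = 0.0
    # A zero weight switches its term off, so W of an unused type is never read.
    if q2 > 0:
        mean += q2 * (x + math.exp(beta0 + beta1 * math.log(w2)))
    if q3 > 0:
        mean += q3 * (4 * x / 3 + (2 / 3) * math.exp(beta0 + beta1 * math.log(w3)))
    return mean


def implied_fully_vaccinated(x, mean_2x_minus_2y):
    """Solve E[2(X - Y)] = m for Y, which gives Y = X - m / 2."""
    return x - mean_2x_minus_2y / 2


# Only type 2 vaccines (q2 = 1), with made-up coefficients beta0 = 0, beta1 = 1.
m_type2 = poisson_mean_general(1000, 1.0, 0.0, 200.0, 1.0, 0.0, 1.0)
y_type2 = implied_fully_vaccinated(1000, m_type2)
```

With $q_2=q_3=0$ the mean is zero, which recovers $Y_{i,j}=X_{i,j}$ of Case 1; with $q_2=1$ the implied $X-2Y$ equals $\exp(\beta_0^{(V2)})\,(W^{(2)})^{\beta_1^{(V2)}}$, as in Case 2.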

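We choose the flat prior distribution on $\beta_0^{(V2)}$ and $\beta_1^{(V2)}$,

$$\pi(\beta_0^{(V2)},\beta_1^{(V2)})\propto 1.$$

The following theorem shows that this prior induces a proper posterior distribution. The proof of this theorem is given in the supplementary material.

Theorem 3.2

Let $n$ be a positive integer with $n\ge 2$, and let $x_1,x_2,\ldots,x_n\in\mathbb{R}$ and $y_1,y_2,\ldots,y_n\in\mathbb{N}$. If there exists a pair of indexes $i$ and $j$ such that $x_i\ne x_j$, then

$$\int\!\!\int\prod_{i=1}^{n}\lambda_i^{y_i}\exp(-\lambda_i)\,d\beta_0^{(V2)}\,d\beta_1^{(V2)}<\infty,$$

where $\lambda_i=\exp(\beta_0^{(V2)}+\beta_1^{(V2)}x_i)$.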

3.1.3. Distributional assumption for $Y_{i,j,k}$

In this subsection, we provide a distribution for $Y_{i,j,k}$ given $Y_{i,j}$ and $X_{i,j'}$ for $j'\in[j]$. This distribution is based on the following three premises:

  1. $\sum_{k=1}^{K}Y_{i,j,k}=Y_{i,j}$

  2. $Y_{i,j,k}=X_{i,j,k}$ for $k\in V^{(1)}$

  3. $Y_{i,j,k}$ is positively dependent on $X_{i,j^*(j,T_k;i),k}$ for $k\notin V^{(1)}$

The first premise is obvious from the definitions of $Y_{i,j,k}$ and $Y_{i,j}$, and the second premise is based on the definitions of $X_{i,j,k}$ and $Y_{i,j,k}$. When a type 1 vaccine is considered, the number of fully vaccinated people $Y_{i,j,k}$ coincides with the number of doses $X_{i,j,k}$ since only one dose is required for this type of vaccine. Next, we address the third premise. Recall that $T_k$ is the interval between the first and last doses of the $k$th manufacturer's vaccine, and $j^*(j,T_k;i)$ is defined so that $d_{i,j^*(j,T_k;i)}\approx d_{i,j}-T_k$. Those who have received the first dose of the $k$th vaccine by the $j^*(j,T_k;i)$th report date are expected to be fully vaccinated by the $j$th report date. Thus, we assume that $Y_{i,j,k}$ is positively dependent on $X_{i,j^*(j,T_k;i),k}$.

Using the premises, we suggest a distribution for $Y_{i,j,k}$ for $k\notin V^{(1)}$. We let $\tilde{Y}_{i,j}=(Y_{i,j,k(1)},Y_{i,j,k(2)},\ldots,Y_{i,j,k(\tilde{K})})$, which is the vector comprised of the $Y_{i,j,k}$ excluding the type 1 vaccines. Likewise, we let $\tilde{X}_{i,j}=(X_{i,j^*(j,T_{k(1)};i),k(1)},X_{i,j^*(j,T_{k(2)};i),k(2)},\ldots,X_{i,j^*(j,T_{k(\tilde{K})};i),k(\tilde{K})})$. Given $\tilde{X}_{i,j}$, $Y_{i,j}$ and $X_{i,j,k}$, we suggest the distribution for $\tilde{Y}_{i,j}$ as

$$\tilde{Y}_{i,j}\sim\mathrm{Multinom}\left(Y_{i,j}-\sum_{k\in V^{(1)}}X_{i,j,k},\ \tilde{X}_{i,j}\Big/\sum_{l=1}^{\tilde{K}}X_{i,j^*(j,T_{k(l)};i),k(l)}\right).$$
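The expected allocation under this multinomial split is simply proportional to the lagged dose counts. A minimal sketch with hypothetical names and made-up numbers (it returns the multinomial means, not a random draw):

```python
def expected_full_by_vaccine(y_total, x_type1, x_lagged):
    """Expected Y_k under the multinomial split, for vaccines outside type 1.

    y_total:  all fully vaccinated people Y_ij
    x_type1:  doses of the single-dose vaccines, which equal their Y_ijk
    x_lagged: doses of each multi-dose vaccine at its lagged report j*(j, T_k; i)
    """
    n = y_total - sum(x_type1)      # people fully vaccinated by multi-dose vaccines
    total = sum(x_lagged)
    return [n * x / total for x in x_lagged]


# 1000 fully vaccinated people, of whom 100 received a single-dose vaccine.
expected = expected_full_by_vaccine(1000, [100], [600, 300])
```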

3.1.4. Model for the estimation of the vaccine effectiveness parameters

We propose a hierarchical model to analyze the vaccine effectiveness studies introduced in Section 2.1.3. The hierarchical model extends the random effect model for meta-analysis proposed in [3]. The model in [3] is designed for the meta-analysis of one vaccine. We suggest the hierarchical model to consider more than one vaccine.

We first review the model in [3]. Suppose we have $n$ studies of the effectiveness of one vaccine. Bodnar et al. [3] define the effect size as the log risk ratio, $\log(1-VE)$, where $VE\in[0,1]$ is the vaccine effectiveness. The $i$th study gives the effect size $y_i$ with the standard error $\sigma_i$. The random effect model supposes that $(y_1,\sigma_1),\ldots,(y_n,\sigma_n)$ are generated from

$$y_i\mid\theta_i,\sigma_i\sim N(\theta_i,\sigma_i^2),\quad\theta_i\mid\bar{\theta},\omega^2\sim N(\bar{\theta},\omega^2),\tag{12}$$

where $\theta_i$ represents the true effect size of the $i$th study and $\bar{\theta}$ represents the overall mean of the true effect size. The parameter $\omega$ is the heterogeneity parameter, which represents environmental differences among the studies. For the Bayesian inference of model (12), [4] and [3] suggest the Berger and Bernardo reference prior

$$\pi(\bar{\theta},\omega)\propto\omega\sqrt{\sum_{i=1}^{n}(\sigma_i^2+\omega^2)^{-2}}.$$
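The effect-size transformation and the reference prior can be written out as below. This is a sketch under stated assumptions: the function names are hypothetical, and the prior is evaluated in the form $\omega\sqrt{\sum_i(\sigma_i^2+\omega^2)^{-2}}$, which is the form assumed here for the reference prior of [3], only up to a normalizing constant.

```python
import math


def effect_size(ve):
    """Log risk ratio log(1 - VE) for a vaccine effectiveness VE in [0, 1)."""
    return math.log(1.0 - ve)


def effectiveness(theta):
    """Inverse map: VE = 1 - exp(theta)."""
    return 1.0 - math.exp(theta)


def reference_prior_unnormalized(omega, sigmas):
    """omega * sqrt(sum_i (sigma_i^2 + omega^2)^(-2)), up to a constant."""
    return omega * math.sqrt(sum((s * s + omega * omega) ** -2 for s in sigmas))


theta_90 = effect_size(0.9)    # an effectiveness of 90% maps to log(0.1)
```

A larger effectiveness corresponds to a more negative effect size, which is why the order constraints in the multi-vaccine model are stated on the effect-size scale.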

We extend model (12) to analyze the observational studies of more than one vaccine. Suppose we have observational studies for $K$ vaccines. Let $(y_{k,i}^{(0)},\sigma_{k,i}^{(0)})$ and $(y_{k,i}^{(1)},\sigma_{k,i}^{(1)})$ denote the results of the $i$th study for the partially and fully vaccinated of the $k$th vaccine, respectively, $i=1,\ldots,n_k$ and $k=1,\ldots,K$, where $n_k$ is the number of observational studies for the $k$th vaccine. The hierarchical model assumes that $y_{k,i}^{(0)}$ and $y_{k,i}^{(1)}$ are generated from the following distribution:

$$y_{k,i}^{(d)}\sim N\left(\theta_{k,i}^{(d)},(\sigma_{k,i}^{(d)})^2\right),\quad\theta_{k,i}^{(d)}\sim N\left(\bar{\theta}_k^{(d)},\omega^2\right),\quad\bar{\theta}_k^{(d)}\sim N\left(\mu_0^{(d)},(\kappa_0^{(d)})^2\right),\tag{13}$$

where $\theta_{k,i}^{(0)}$ and $\theta_{k,i}^{(1)}$ represent the true effect sizes of the $i$th study for the partially and fully vaccinated of the $k$th vaccine, respectively. The parameters $\bar{\theta}_k^{(0)}$ and $\bar{\theta}_k^{(1)}$ represent the overall effect sizes for the partially and fully vaccinated of the $k$th vaccine, respectively, and $\mu_0^{(0)}$ and $\mu_0^{(1)}$ represent the overall effect sizes across all vaccines for the partially and fully vaccinated, respectively. We append constraints on $\theta_{k,i}^{(d)}$, $\bar{\theta}_k^{(d)}$ and $\mu_0^{(d)}$ in the hierarchical model as follows:

$$I\left(\theta_{k,i}^{(0)}\ge\theta_{k,i}^{(1)}\right),\quad I\left(\bar{\theta}_k^{(0)}\ge\bar{\theta}_k^{(1)}\right),\quad I\left(\mu_0^{(0)}\ge\mu_0^{(1)}\right).$$

Since the effect size is $\log(1-VE)$, a smaller effect size corresponds to a larger effectiveness, so the constraints imply that the effectiveness increases as the number of doses increases.

We suggest the following prior distribution:

$$\pi(\omega)\propto\omega\sqrt{\sum_{k,i,d}\left((\sigma_{k,i}^{(d)})^2+\omega^2\right)^{-2}},\quad\pi(\mu_0^{(d)})\propto 1,\quad\pi(\kappa_0^{(d)})\propto 1.$$

For the parameter $\omega$, we employ the prior distribution used for the random effect model (12). We assign flat priors to the parameters $\mu_0^{(d)}$ and $\kappa_0^{(d)}$, motivated by Gelman et al. [8]. We estimate the effectiveness of the $k$th vaccine from the posterior distribution of $\bar{\theta}_k^{(d)}$. If no observational studies of the $k$th vaccine exist, i.e. $n_k=0$, the posterior distribution of $\bar{\theta}_k^{(d)}$ is derived from the overall mean parameter $\mu_0^{(d)}$ in (13).

3.2. Models for infection induced seroprevalence

In this section, we propose a method to estimate $\theta_i^{(I)}(t)$ using a hierarchical model, an extension of the model (14) proposed by Lee et al. [11],

$$X\sim\mathrm{Binom}\left(N,\ p^{+}\theta+(1-p^{-})(1-\theta)\right),\tag{14}$$

where $N$ is the number of subjects in a serosurvey, $X$ is the number of subjects who test positive, $p^{+}$ and $p^{-}$ are the sensitivity and specificity of the serology test, respectively, and $\theta$ is the seroprevalence. While model (14) is used for the analysis of one serosurvey in a country, we suggest the hierarchical model to analyze the serosurvey data over countries given in Section 2.2.
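The success probability in (14) mixes true positives and false positives. The sketch below computes it and, for contrast with the Bayesian treatment used in this paper, also inverts it into the classical Rogan-Gladen point estimate; the function names and numbers are illustrative assumptions, and the inversion is not the authors' estimator.

```python
def positive_test_probability(theta, sensitivity, specificity):
    """P(test positive) = p+ * theta + (1 - p-) * (1 - theta)."""
    return sensitivity * theta + (1.0 - specificity) * (1.0 - theta)


def prevalence_from_positive_rate(rate, sensitivity, specificity):
    """Invert the relation above (Rogan-Gladen correction), clipped to [0, 1]."""
    theta = (rate + specificity - 1.0) / (sensitivity + specificity - 1.0)
    return min(1.0, max(0.0, theta))


# Made-up test characteristics: sensitivity 90%, specificity 98%.
rate = positive_test_probability(0.20, 0.90, 0.98)
theta_back = prevalence_from_positive_rate(rate, 0.90, 0.98)
```

With a true seroprevalence of 20%, the expected positive rate is slightly lower than 20% because missed positives outnumber false positives in this toy setting.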

First, we introduce a reparameterized form of model (14) in Section 3.2.1, and we propose the hierarchical model in Section 3.2.2 using the reparameterized model. We introduce the notation for this section. We use the 99 serosurveys introduced in Section 2.2, and let $N_l$ and $X_l$ denote the numbers of survey samples and test-positive samples in the $l$th serosurvey, respectively, $l=1,\ldots,99$. The index $i_l$ represents the country in which the $l$th serosurvey is conducted, and the index $t_l$ indicates the last date in the sampling period of the $l$th serosurvey.

3.2.1. Reparameterization of model for one serosurvey

We reparameterize model (14) since we are interested in the seroprevalence by infection $\theta_{i_l}^{(I)}(t_l)$. The reparameterized model is

$$X_l\sim\mathrm{Binom}\left(N_l,\ p_l^{+}\theta_{i_l}(t_l)+(1-p_l^{-})(1-\theta_{i_l}(t_l))\right),\quad\theta_{i_l}(t_l)=\theta_{i_l}^{(I)}(t_l)+\theta_{i_l}^{(V)}(t_l)-\theta_{i_l}^{(I)}(t_l)\,\theta_{i_l}^{(V)}(t_l),\tag{15}$$

$l=1,\ldots,99$, where $p_l^{+}$ and $p_l^{-}$ are the sensitivity and specificity of the serology test used in the $l$th survey, respectively. Recall that $\theta_{i_l}^{(I)}(t_l)$ and $\theta_{i_l}^{(V)}(t_l)$ denote the seroprevalence by infection and the proportion of the effectively vaccinated, respectively, in the $i_l$th country at date $t_l$. If a serosurvey is conducted before vaccination, then $\theta_{i_l}(t_l)=\theta_{i_l}^{(I)}(t_l)$. Note that among the 99 serosurveys, 80 were conducted before vaccination.

We construct a prior distribution on $\theta_{i_l}^{(V)}(t_l)$ from the number of effectively vaccinated in (2), divided by the population. Recall that the distribution for the number of effectively vaccinated is derived only for dates on which a vaccination report is provided. If there is no vaccination report for the $i_l$th country on date $t_l$, we use the most recent report before that date. Given the prior on $\theta_{i_l}^{(V)}(t_l)$, we propose a Bayesian method to estimate $\theta_{i_l}^{(I)}(t_l)$ in the following section.

3.2.2. Model for serosurvey data over countries

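We propose a hierarchical model to analyze the serosurvey data over countries. Let $\theta_{i_l}^{(C)}(t_l)$ denote the proportion of cumulative confirmed cases, referred to as the confirmed ratio, in the $i_l$th country at date $t_l$. We assume that the random variable $\log(\theta_{i_l}^{(I)}(t_l)/\theta_{i_l}^{(C)}(t_l))$ is explained by a country-specific random effect and country statistics: the adult population density and GDP per capita of the corresponding country. Note that the random variable $\theta_{i_l}^{(I)}(t_l)/\theta_{i_l}^{(C)}(t_l)$ represents the ratio of the number of infected to that of confirmed. We represent this assumption as

$$\log\left(\theta_{i_l}^{(I)}(t_l)/\theta_{i_l}^{(C)}(t_l)\right)\sim TN_{(0,\,-\log(\theta_{i_l}^{(C)}(t_l)))}\left(\beta_{i_l}+\beta_1^{(I)}PD_{i_l}+\beta_2^{(I)}G_{i_l},\ \tau^2\right),\quad\beta_{i_l}\sim N(\mu_0,\sigma^2),\tag{16}$$

where $PD_{i_l}$ and $G_{i_l}$ are the standardized log adult population density and log GDP per capita, and $TN_{(a,b)}(\mu,\sigma^2)$ is the normal distribution with mean $\mu$ and variance $\sigma^2$ truncated to the range $(a,b)$. The truncation range ensures that $\theta_{i_l}^{(C)}(t_l)\le\theta_{i_l}^{(I)}(t_l)\le 1$. Combining (15) and (16), we construct the hierarchical model as

$$\begin{aligned}X_l&\sim\mathrm{Binom}\left(N_l,\ p_l^{+}\theta_{i_l}(t_l)+(1-p_l^{-})(1-\theta_{i_l}(t_l))\right),\\ \theta_{i_l}(t_l)&=\theta_{i_l}^{(I)}(t_l)+\theta_{i_l}^{(V)}(t_l)-\theta_{i_l}^{(I)}(t_l)\,\theta_{i_l}^{(V)}(t_l),\\ \log(\theta_{i_l}^{(I)}(t_l))-\log(\theta_{i_l}^{(C)}(t_l))&\sim TN_{(0,\,-\log(\theta_{i_l}^{(C)}(t_l)))}\left(\beta_{i_l}+\beta_1^{(I)}PD_{i_l}+\beta_2^{(I)}G_{i_l},\ \tau^2\right),\\ \beta_{i_l}&\sim N(\mu_0,\sigma^2).\end{aligned}\tag{17}$$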
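The role of the truncation range can be illustrated by drawing from the truncated normal layer of the model with simple rejection sampling. This is a sketch with made-up parameter values and hypothetical names; it is not the sampler used for inference in the paper.

```python
import math
import random


def draw_infection_proportion(theta_c, mean, tau, rng, max_tries=100000):
    """Draw theta_I = theta_C * exp(r), r ~ N(mean, tau^2) truncated to (0, -log(theta_C))."""
    upper = -math.log(theta_c)
    for _ in range(max_tries):
        r = rng.gauss(mean, tau)
        if 0.0 < r < upper:           # keep only draws inside the truncation range
            return theta_c * math.exp(r)
    raise RuntimeError("rejection sampling failed: truncation range has low probability")


# Made-up values: 2% confirmed ratio, linear predictor 2.0, tau = 0.5.
rng = random.Random(1)
draws = [draw_infection_proportion(0.02, 2.0, 0.5, rng) for _ in range(1000)]
```

Every draw lies between the confirmed ratio and one, which is what the truncation to $(0,-\log\theta^{(C)})$ guarantees.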

Next, we describe the prior distributions on $\theta_{i_l}^{(V)}(t_l)$, $\tau$, $\mu_0$, $\sigma$, $\beta_1^{(I)}$, $\beta_2^{(I)}$, $p_l^{+}$ and $p_l^{-}$. As suggested in Section 3.2.1, we use the distribution (2) for the prior on $\theta_{i_l}^{(V)}(t_l)$. Gelman et al. [8] suggested the flat prior for the standard deviation $\sigma$ in hierarchical models, and they also showed that this prior gives a proper posterior distribution when flat priors are given for the other parameters, which are $\mu_0$, $\tau$, $\beta_1^{(I)}$ and $\beta_2^{(I)}$ in our model. For $p_l^{+}$ and $p_l^{-}$, we construct prior distributions based on the method in Section 4 of [11]. We give the details in the supplementary material.

4. Results

In this section, we give the results of the Bayesian inference for the regression models and the hierarchical models in Section 3, and we give the results of world seroprevalence estimation. We use NIMBLE [5] for the Bayesian inference of these models. In each inference, we generate 4000 posterior samples, including 2000 burn-in samples, for 4 chains. All code is available at https://github.com/klee564/worldsero.

In Section 4.1, we give the simulation study for the models proposed in Section 3. In Section 4.2, we give the posterior distributions of the regression coefficients and the vaccine effectiveness parameters. In Section 4.3, we derive the posterior predictive distributions of $\theta^{(V)}$ and $\theta^{(I)}$ for each date and country and summarize the posterior distributions to obtain the world seroprevalence.

4.1. Simulation study

We conduct a simulation study to evaluate the models proposed in Section 3. We set the parameters β(V1) in (4), (β0(V2),β1(V2)) in (11), (μ0(0),μ0(1),κ0(0),κ0(1),ω) and σk,i(d) in (13) as the random values from the following distributions:

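$$
\begin{aligned}
&\beta^{(V1)} \sim U(0.5, 1.5),\quad \beta_0^{(V2)} \sim U(-1, 0),\quad \beta_1^{(V2)} \sim U(0, 1),\\
&\mu_0^{(1)} \sim U(-4, -2),\quad \mu_0^{(0)} \mid \mu_0^{(1)} \sim \mu_0^{(1)} + U(0, 2),\\
&\kappa_0^{(0)},\ \kappa_0^{(1)},\ \omega \sim U(0, 1),\quad \sigma_{k,i}^{(d)} \sim U(0, 1).
\end{aligned}
$$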

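We draw the parameters $\beta_1^{(I)}$, $\beta_2^{(I)}$, $\tau$, $\mu_0$ and $\sigma$ in (17) at random from the following distributions:

$$
\beta_1^{(I)} \sim U(-0.5, 0.5),\quad \beta_2^{(I)} \sim U(-1, 0),\quad \tau \sim U(0, 1),\quad \mu_0 \sim U(1.5, 2.5),\quad \sigma \sim U(0, 1).
$$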

We set the covariates and the number of observations to their values in the real data and generate the simulation data with 100 repetitions.

We estimate the parameters of the models by the Bayesian method, which gives credible intervals for the parameters. Since we are interested in interval estimation, we evaluate the accuracy of the Bayesian method by the coverage probability of the credible intervals, i.e. the proportion of repetitions in which the interval contains the true parameter. Table 2 gives the coverage probabilities of the 95% credible intervals for the regression coefficients and vaccine effectiveness parameters. The coverage probabilities range from 93.4% to 97% and are thus close to the nominal level of 95%.

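Table 2. Coverage probabilities for the parameters $\beta^{(V1)}$ in (3), $(\beta_0^{(V2)},\beta_1^{(V2)})$ in (11) and $\bar{\theta}_k^{(d)}$ in (13).

| Parameter | $\beta^{(V1)}$ | $\beta_0^{(V2)}$ | $\beta_1^{(V2)}$ | $\bar{\theta}_k^{(d)}$ | $\beta_1^{(I)}$ | $\beta_2^{(I)}$ |
| --- | --- | --- | --- | --- | --- | --- |
| Coverage probability | 94% | 96% | 97% | 93.4% | 95% | 95% |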
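The coverage check can be illustrated on a much simpler model than those of Section 3. The following Python sketch is a toy under stated assumptions: it replaces the paper's models with a conjugate Beta-Binomial model with a flat prior, draws the true parameter at random in each repetition, and counts how often the 95% equal-tailed credible interval contains it. All names and settings (sample size, number of posterior draws, seed) are hypothetical.

```python
import random


def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws (empirical quantiles)."""
    ordered = sorted(draws)
    n = len(ordered)
    lower_index = int((1.0 - level) / 2.0 * (n - 1))
    upper_index = int((1.0 + level) / 2.0 * (n - 1))
    return ordered[lower_index], ordered[upper_index]


def coverage_probability(n_rep=100, n_obs=200, n_draws=2000, seed=1):
    """Fraction of repetitions whose 95% interval contains the true parameter."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        # True parameter drawn at random, as in the simulation design.
        true_p = rng.uniform(0.05, 0.95)
        # One Binomial(n_obs, true_p) observation.
        x = sum(rng.random() < true_p for _ in range(n_obs))
        # A flat Beta(1, 1) prior gives a Beta(1 + x, 1 + n_obs - x) posterior.
        draws = [rng.betavariate(1 + x, 1 + n_obs - x) for _ in range(n_draws)]
        lower, upper = credible_interval(draws)
        hits += lower <= true_p <= upper
    return hits / n_rep


coverage = coverage_probability()
print(f"empirical coverage over 100 repetitions: {coverage:.2f}")
```

With only 100 repetitions the empirical coverage has a Monte Carlo standard error of roughly two percentage points, so values a few points away from 95% are compatible with nominal coverage.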

4.2. Posterior distributions for regression coefficients and vaccine effectiveness

First, we present the posterior distributions of $\beta^{(V1)}$, $\beta_0^{(V2)}$ and $\beta_1^{(V2)}$ in models (3) and (11). The posterior distributions are shown in Figure 3.

Figure 3. The posterior samples of $\beta^{(V1)}$, $\beta_0^{(V2)}$ and $\beta_1^{(V2)}$ in models (3) and (11).

The posterior distribution of $\beta^{(V1)}$ is concentrated around 1. Note that, by (4), $\beta^{(V1)}$ represents the relation between the usage rate by vaccine and the ratio of vaccine delivery amounts. The posterior means of $\beta_0^{(V2)}$ and $\beta_1^{(V2)}$ are 0.935 and 1.15, respectively. For ease of interpretation, we interpret $\beta_0^{(V2)}$ and $\beta_1^{(V2)}$ via model (7), a simplified version for the case in which only a type 2 vaccine is used. According to (7), we have

$$
E(X_{i,j} - 2Y_{i,j}) = \exp\big(\beta_0^{(V2)}\big)\, W_{i,j}^{\beta_1^{(V2)}}. \tag{18}
$$

The regression coefficients explain the relation between $X_{i,j}-2Y_{i,j}$ and $W_{i,j}$ via (18). Recall that the random variable $X_{i,j}-2Y_{i,j}$ represents the number of doses administered to people who have received one dose but have not completed vaccination, and $W_{i,j}$ approximates the number of doses administered over the most recent $T$ days, where $T$ is the required interval of the vaccine.
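Because (18) is a power law in $W_{i,j}$, the coefficient $\beta_1^{(V2)}$ can be read as an elasticity. The short Python sketch below evaluates the right-hand side of (18) and shows that doubling $W_{i,j}$ multiplies the expected value by $2^{\beta_1^{(V2)}}$, whatever the intercept. The function name and the values of $W_{i,j}$ are hypothetical, and the intercept is set to zero only because it cancels in the ratio.

```python
import math


def expected_partial_doses(w, beta0, beta1):
    """Right-hand side of (18): exp(beta0) * w ** beta1."""
    if w <= 0:
        raise ValueError("w must be positive")
    return math.exp(beta0) * w ** beta1


beta1_mean = 1.15   # posterior mean of beta_1^(V2) reported in the text
placeholder_beta0 = 0.0  # cancels in the ratio below

# Ratio of expected values when the recent doses W double (hypothetical W values).
doubling_ratio = (expected_partial_doses(2.0e5, placeholder_beta0, beta1_mean)
                  / expected_partial_doses(1.0e5, placeholder_beta0, beta1_mean))
print(round(doubling_ratio, 3))
```

A value of $\beta_1^{(V2)}$ above 1 therefore means that the expected value in (18) grows slightly faster than proportionally with the recent doses.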

We give the posterior distributions of β1(I) and β2(I) in model (17). The posterior samples are summarized in Figure 4.

Figure 4. The posterior samples of $\beta_1^{(I)}$ and $\beta_2^{(I)}$ in model (17).

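Recall that the regression coefficients $\beta_1^{(I)}$ and $\beta_2^{(I)}$ appear in the following distribution:

$$
\log\big(\theta_{i_l}^{(I)}(t_l)/\theta_{i_l}^{(C)}(t_l)\big) \sim TN\big(\beta_{i_l} + \beta_1^{(I)} PD_{i_l} + \beta_2^{(I)} G_{i_l},\; \tau^2\big).
$$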

The left-hand side is the log ratio of the seroprevalence by infection to the confirmed ratio, and $\beta_1^{(I)}$ and $\beta_2^{(I)}$ are the regression coefficients for the population density and the GDP, respectively. The posterior mean and the 95% credible interval of $\beta_1^{(I)}$ are 0.105 and [−0.199, 0.413], respectively. For $\beta_2^{(I)}$, the posterior mean and the 95% credible interval are −0.751 and [−1.015, −0.436], respectively.
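One way to read the GDP coefficient is on the ratio scale. The Python sketch below exponentiates the posterior mean and interval end points of $\beta_2^{(I)}$, taking the posterior mean as −0.751, the value consistent with the reported interval [−1.015, −0.436]. Under the simplifying assumption that the truncation can be ignored, $\exp(\beta_2^{(I)})$ approximates the multiplicative change in the ratio of infection seroprevalence to confirmed ratio for a one-unit increase in the covariate $G$. This is an illustrative reading, not a computation from the paper.

```python
import math

beta2_mean = -0.751                 # posterior mean of beta_2^(I)
beta2_interval = (-1.015, -0.436)   # reported 95% credible interval

# exp() is monotone, so interval end points map to interval end points.
# Note that exp(posterior mean) is not the posterior mean of exp(beta).
ratio_at_mean = math.exp(beta2_mean)
ratio_interval = tuple(math.exp(b) for b in beta2_interval)

print(f"{ratio_at_mean:.3f}", tuple(round(r, 3) for r in ratio_interval))
```

Since the whole transformed interval lies below 1, this reading suggests that the gap between infections and confirmed cases tends to be smaller in countries with higher GDP per capita.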

Next, we give the posterior distributions of the vaccine effectiveness parameter Ek in Figure 5.

Figure 5. The box plot represents the posterior sample of each vaccine and vaccination status, where vaccination status means whether the subject is partially or fully vaccinated. The x-axis represents the vaccine, and the type in the legend represents the vaccination status.

We find that the vaccine effectiveness for people fully vaccinated with the Pfizer and Moderna vaccines is 84.2% and 87.7%, respectively, which is high compared with the other vaccines.

4.3. Estimation of world seroprevalence

We derive the predictive posterior distributions of $\theta_i^{(V)}(t)$ and $\theta_i^{(I)}(t)$ for the $i$th country at date $t$. Recall that $\theta_i^{(V)}(t)$ and $\theta_i^{(I)}(t)$ denote the proportion of the effectively vaccinated population and the seroprevalence by infection of the $i$th country at date $t$, respectively. We also define the seroprevalence of the $i$th country at date $t$ as

$$
\theta_i(t) = \theta_i^{(V)}(t) + \theta_i^{(I)}(t) - \theta_i^{(V)}(t)\,\theta_i^{(I)}(t).
$$

The predictive posterior distribution of $\theta_i^{(V)}(t)$ is derived from the effectively vaccinated population, $M_{i,j}$ in (2), divided by the population $P_i$. Recall that the index $j$ in $M_{i,j}$ is the report index and that reports are not issued every day. When there is no report on date $t$, we use the most recent report before date $t$. The predictive posterior distribution of $\theta_i^{(I)}(t)$ is derived from the distribution

$$
\log\big(\theta_i^{(I)}(t)/\theta_i^{(C)}(t)\big) \sim TN\big(\beta_i + \beta_1^{(I)} PD_i + \beta_2^{(I)} G_i,\; \tau^2\big)
$$

in (16), given θi(C)(t), PDi and Gi.
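A single predictive draw of the seroprevalence by infection can be sketched as follows. The Python code draws the log ratio from a truncated normal distribution by inverse-CDF sampling and maps it back to the seroprevalence scale. It assumes the truncation range $(0, -\log\theta^{(C)})$, which keeps the seroprevalence by infection between the confirmed ratio and 1; the function name, the location and scale values, and the confirmed ratio are hypothetical, and in practice the location would come from posterior draws of the regression parameters.

```python
import math
import random
from statistics import NormalDist


def sample_theta_infection(theta_c, location, tau, rng):
    """Draw theta_I = theta_C * exp(r), with r truncated normal on (0, -log(theta_C))."""
    if not 0.0 < theta_c < 1.0:
        raise ValueError("theta_c must lie strictly between 0 and 1")
    base = NormalDist(mu=location, sigma=tau)
    upper = -math.log(theta_c)
    # Inverse-CDF sampling restricted to the truncation range.
    u = rng.uniform(base.cdf(0.0), base.cdf(upper))
    u = min(max(u, 1e-12), 1.0 - 1e-12)        # keep inv_cdf away from 0 and 1
    r = min(max(base.inv_cdf(u), 0.0), upper)  # guard against rounding error
    return min(theta_c * math.exp(r), 1.0)


rng = random.Random(0)
# Hypothetical inputs: 2% confirmed ratio, location 1.5, scale 0.5.
draws = [sample_theta_infection(0.02, 1.5, 0.5, rng) for _ in range(1000)]
print(min(draws) >= 0.02, max(draws) <= 1.0)
```

Every draw respects the two constraints implied by the truncation: it is at least the confirmed ratio and at most 1.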

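Next, we define the trend of world seroprevalences using $\theta_i^{(I)}(t)$, $\theta_i^{(V)}(t)$ and $\theta_i(t)$. We define $\theta_t^{(I)}$, $\theta_t^{(V)}$ and $\theta_t$ as

$$
\theta_t^{(V)} = \sum_i P_i\,\theta_i^{(V)}(t)/P,\qquad
\theta_t^{(I)} = \sum_i P_i\,\theta_i^{(I)}(t)/P,\qquad
\theta_t = \sum_i P_i\,\theta_i(t)/P,
$$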

where $P$ is the total population of all countries considered. The variables $\theta_t^{(I)}$, $\theta_t^{(V)}$ and $\theta_t$ describe the trends of the world seroprevalence by infection, the proportion of effectively vaccinated people in the world and the world seroprevalence, respectively, and they are shown in Figure 6.
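The population-weighted averages above can be computed draw by draw, which carries the country-level uncertainty through to the world level. The Python sketch below does this for three hypothetical countries with made-up populations and synthetic Beta-distributed draws standing in for posterior samples; it assumes that the draws of different countries are aligned by sampling iteration. None of the names or numbers come from the paper.

```python
import random


def world_seroprevalence_draws(populations, country_draws):
    """Population-weighted average of per-country draws, one value per draw index."""
    total = sum(populations)
    n_draws = len(country_draws[0])
    if any(len(d) != n_draws for d in country_draws):
        raise ValueError("every country needs the same number of draws")
    return [
        sum(p * d[s] for p, d in zip(populations, country_draws)) / total
        for s in range(n_draws)
    ]


def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from draws (empirical quantiles)."""
    ordered = sorted(draws)
    n = len(ordered)
    return (ordered[int((1.0 - level) / 2.0 * (n - 1))],
            ordered[int((1.0 + level) / 2.0 * (n - 1))])


rng = random.Random(0)
populations = [50.0, 30.0, 20.0]  # hypothetical populations, in millions
# Synthetic stand-ins for posterior draws of theta_i(t) in each country.
country_draws = [[rng.betavariate(a, b) for _ in range(1000)]
                 for a, b in [(40, 60), (55, 45), (70, 30)]]

world = world_seroprevalence_draws(populations, country_draws)
lower, upper = credible_interval(world)
print(f"95% interval: [{lower:.3f}, {upper:.3f}]")
```

The posterior mean and the credible interval at each date are then summaries of the resulting world-level draws.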

Figure 6. The trends of $\theta_t^{(V)}$, $\theta_t^{(I)}$ and $\theta_t$ from the beginning of January 2021 to the end of July 2021. The gray area denotes the 95% credible interval. The black line represents the posterior mean. The left, center and right graphs represent the trends of $\theta_t^{(V)}$, $\theta_t^{(I)}$ and $\theta_t$, respectively.

As of 31st July 2021, the 95% credible intervals of $\theta^{(V)}$, $\theta^{(I)}$ and $\theta$ are [22.3%, 25.2%], [15.2%, 40.0%] and [35.5%, 56.8%], respectively. We compare our estimate with the result of Bergeri et al. [2], who estimate the world seroprevalence in July 2021 as [40.7%, 49.8%]. The centers of our interval and of the interval in [2] are 46.15% and 45.25%, respectively, a difference of 0.9 percentage points. Our interval also contains the interval of Bergeri et al. [2]. Thus, our result is consistent with the result of [2].
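The comparison above is simple interval arithmetic, reproduced in the following Python sketch. The interval end points are those quoted in the text; the helper names are illustrative.

```python
def center(interval):
    """Midpoint of an interval given as (lower, upper)."""
    lower, upper = interval
    return (lower + upper) / 2.0


def covers(outer, inner):
    """True if the outer interval contains the inner interval."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


ours = (35.5, 56.8)     # 95% credible interval for the world seroprevalence, in percent
bergeri = (40.7, 49.8)  # interval reported by Bergeri et al. [2], in percent

center_difference = center(ours) - center(bergeri)
print(round(center(ours), 2), round(center(bergeri), 2))
print(round(center_difference, 2), covers(ours, bergeri))
```

The two midpoints differ by less than one percentage point, while the interval obtained here is roughly twice as wide as the one reported in [2].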

We present a treemap in Figure 7, which shows the posterior means of the seroprevalences by country on 31st July 2021. The seroprevalences of China and India are 47% and 55%, respectively, which are similar to the world seroprevalence on this date. France and the UK attain seroprevalences of over 70% on this date.

Figure 7. The treemap presents the posterior means of seroprevalence by country on 31st July 2021. Each tile represents a country, and its area is proportional to the corresponding population. The color and the value in each tile represent the seroprevalence $\theta_i(t)$ when $t$ is 31st July 2021.

We evaluate the accuracy of the credible intervals for the seroprevalence via leave-one-out cross-validation. Each observation of the data is a serosurvey for a particular date in a particular country. In each round, we hold out one serosurvey as the test observation and use the remaining serosurveys as training data. We fit the proposed model to the training data and then derive the predictive posterior distribution for the test observation. Out of 99 test observations, 91 fall within the credible interval of the corresponding predictive posterior distribution, i.e. we get a coverage probability of 92%.

5. Discussion

We have proposed a novel Bayesian approach to estimate the seroprevalence of COVID-19 antibodies among the global adult population. This approach begins by estimating the seroprevalences due to infection and vaccination for each country, and then employs a Bayesian hierarchical model to combine these estimates into a global seroprevalence figure. Additionally, we constructed informative priors by utilizing external sources, such as data from clinical trials.

There are many studies on the estimation of seroprevalence in a population. However, these studies focus on estimating the seroprevalence on the date and in the country in which the sample is collected, and hence they do not directly yield an estimate of the world seroprevalence. Furthermore, previous work on vaccination data has mainly concerned the cumulative doses administered and the fully vaccinated population, whereas the method proposed in this paper predicts the effectively vaccinated population using information on the efficacies of the vaccines.

The methods used in this paper could be improved. Firstly, in the hierarchical model for the seroprevalence by infection, additional covariates could be explored and included. The covariates used in this study are national statistics that do not depend on the date. Thus, we expect that the explanatory power could be improved by adding date-dependent covariates, such as the daily number of COVID tests in a country. Secondly, the model could be refined by considering the whole sampling period, as the current analysis only uses the last day of the sampling period and may therefore not fully capture the trends in infection rates over time. Lastly, the current study has some limitations: it is based on data up to July 2021, and it does not account for the decline in neutralizing antibodies among vaccinated individuals, potential underreporting of COVID-19 cases, changes in social isolation regulations, population adherence, or test availability. To improve the accuracy of the results, it may be necessary to update the data and to account for these factors in the model.

Supplementary Material

Supplemental Material
CJAS_A_2335569_SM1677.pdf (188.1KB, pdf)

Acknowledgments

The first and second authors contributed equally to this work.

Funding Statement

Kwangmin Lee was supported by Chonnam National University (Grant number: 2023-0482) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00211979). Seongil Jo was supported by INHA UNIVERSITY Research grant and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2022R1A5A7033499 and RS-2023-00209229). Jaeyong Lee was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A4A1018207 and NRF-2023R1A2C1003050).

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  1. Arora R.K., Joseph A., Van Wyk J., Rocco S., Atmaja A., May E., Yan T., Bobrovitz N., Chevrier J., Cheng M.P., Williamson T., and Buckeridge D.L., SeroTracker: a global SARS-CoV-2 seroprevalence dashboard, Lancet Infect. Dis. 21 (2021), pp. e75–e76.
  2. Bergeri I., Whelan M., Ware H., Subissi L., Nardone A., Lewis H.C., Li Z., Ma X., Valenciano M., Cheng B., Ariqi L.A., Rashidian A., Okeibunor J., Azim T., Wijesinghe P., Le L.-V., Vaughan A., Pebody R., Vicari A., Yan T., Yanes-Lane M., Cao C., Clifton D.A., Cheng M.P., Papenburg J., Buckeridge D., Bobrovitz N., Arora R.K., and Van Kerkhove M.D., Unity Studies Collaborator Group, Global epidemiology of SARS-CoV-2 infection: A systematic review and meta-analysis of standardized population-based seroprevalence studies, Jan 2020-Oct 2021, MedRxiv (2021), pp. 2021–12.
  3. Bodnar O., Link A., Arendacká B., Possolo A., and Elster C., Bayesian estimation in random effects meta-analysis using a non-informative prior, Stat. Med. 36 (2017), pp. 378–399.
  4. Bodnar O., Link A., and Elster C., Objective Bayesian inference for a generalized marginal random effects model, Bayesian Anal. 11 (2016), pp. 25–45.
  5. de Valpine P., Turek D., Paciorek C.J., Anderson-Bergman C., Lang D.T., and Bodik R., Programming with models: Writing statistical algorithms for general model structures with NIMBLE, J. Comput. Graph. Stat. 26 (2017), pp. 403–413.
  6. Dong Q. and Gao X., Bayesian estimation of the seroprevalence of antibodies to SARS-CoV-2, JAMIA Open 3 (2020), pp. 496–499.
  7. Forni G. and Mantovani A., on behalf of the COVID-19 Commission of Accademia Nazionale dei Lincei, Rome, COVID-19 vaccines: Where we stand and challenges ahead, Cell Death Differ. 28 (2021), pp. 626–639.
  8. Gelman A., Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper), Bayesian Anal. 1 (2006), pp. 515–534.
  9. Higdon M.M., Wahl B., Jones C.B., Rosen J.G., Truelove S.A., Baidya A., Nande A.A., ShamaeiZadeh P.A., Walter K.K., Feikin D.R., Patel M.K., Knoll M.D., and Hill A.L., A systematic review of COVID-19 vaccine efficacy and effectiveness against SARS-CoV-2 infection and disease, MedRxiv (2021), pp. 9.
  10. Kline D., Li Z., Chu Y., Wakefield W.C., Miller J., Turner A.N., and Clark S.J., Estimating seroprevalence of SARS-CoV-2 in Ohio: A Bayesian multilevel poststratification approach with multiple diagnostic tests, preprint (2020), arXiv:2011.09033.
  11. Lee K., Jo S., and Lee J., Seroprevalence of SARS-CoV-2 antibodies in South Korea, J. Korean Stat. Soc. 50(3) (2021), pp. 891–904.
  12. Lu H., Stratton C.W., and Tang Y.W., Outbreak of pneumonia of unknown etiology in Wuhan, China: The mystery and the miracle, J. Med. Virol. 92 (2020), pp. 401–402.
  13. Mathieu E., Ritchie H., Ortiz-Ospina E., Roser M., Hasell J., Appel C., Giattino C., and Rodés-Guirao L., A global database of COVID-19 vaccinations, Nat. Hum. Behav. 5(7) (2021), pp. 947–953.
  14. Stringhini S., Wisniak A., Piumatti G., Azman A.S., Lauer S.A., Baysson H., De Ridder D., Petrovic D., Schrempft S., Marcus K., Yerly S., Vernez I.A., Keiser O., Hurst S., Posfay-Barbe K.M., Trono D., Pittet D., Gétaz L., Chappuis F., Eckerle I., Vuilleumier N., Meyer B., Flahault A., Kaiser L., and Guessous I., Seroprevalence of anti-SARS-CoV-2 IgG antibodies in Geneva, Switzerland (SEROCoV-POP): A population-based study, The Lancet 396 (2020), pp. 313–319.
  15. The New York Times, Coronavirus vaccine tracker, 2021. Available at https://www.nytimes.com/interactive/2020/science/coronavirus-vaccine-tracker.html. Accessed July 31, 2021.
  16. UNICEF, COVID-19 vaccine market dashboard, 2021. Available at https://www.unicef.org/supply/covid-19-vaccine-market-dashboard/. Accessed July 31, 2021.


Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
