This document outlines an estimation strategy developed to estimate the Seroprevalence of COVID19 in King County in early to mid August of 2020. The estimation makes use of the survey weights developed by Marketing Systems Group. The stratified cluster survey was carried out with two strata, a random address-based sample and a convenience sample. The survey weights have been raked to King county population characteristics (race, Hispanic status, sex, age, and income), assuming equal sampling weights for sampling units (households) within each stratum. This required imputation of missing data on the raking variables. We carry out estimation using Survey R package (Lumley 2020)
There are various methodologic challenges to account for when calculating the estimate and corresponding standard errors.
We present the final estimate based on three adjustments; one accounting for missingness, one accounting for raking, and one accounting for the sensitivity and specificity of the testing procedure.
Additional analyses were carried out to account for other possible sources of bias and variance.
Note that all data presented here (before the appendix) have already been cleaned and processed. This includes:
## Reading in data
surv_df <- read.csv("data/cleaned_survey_data.csv")
ps_obj <- list(
age_sex = read.csv("data/age_sex_distr.csv"),
hisp_sex = read.csv("data/hisp_sex_distr.csv"),
race_sex = read.csv("data/race_sex_distr.csv"),
inc_sex = read.csv("data/inc_sex_distr.csv"),
inc_hhs = read.csv("data/inc_hhs_distr.csv")
)
The dataset contains all individuals who provided blood samples for testing for the presence of COVID19 antibody. However, some (79) individuals did not have sufficient quantities of blood to run tests. To account for this missingness, we assume the data are missing at random, and use a two phase model to approach.
The weights provided for this analysis were calculated using raking. When weighted estimates are based on raking the corresponding standard errors can be reduced (see, for example chapter 7 of Lumley 2010). This is accounted for in our analysis using the R survey package.
The estimate (and corresponding standard error) resulting from the data and the above adjustments is for the proportion of individuals who tested positive on both tests. A prevalence estimate can be calculated from the estimate of proportion positive if the sensitivity and specificity of the combined test are known using the following formula:
\[ \text{Prevalence} = \frac{\text{Proportion Positive} + \text{Specificity} - 1 }{\text{Sensitivity} + \text{Specificity} - 1} \]
Covariate values used for weighting that were missing were imputed using hot-deck imputation. See the last last section for a report on the missingness of the data. It is worth noting that all individuals with missing Hispanic status were assigned to be non-Hispanic. Additionally, Hispanic status was used for the imputation of the race variable. Hispanic status was not used for the imputation of any other variable.
All individuals who did not initially provide a sufficient quantity of blood were asked to provide a second specimen for testing. The subset of these individuals who did provide a second specimen had their value in the dataset for their test replaced by the test results for the second specimen. Thus, individuals who required two blood samples to have tests completed are treated identically to those who required only one blood sample. The 79 individuals who did not provide a second specimine were retained in the dataset and assigned a missing value for their test result.
A small number (8) of the individuals who provided a second sample of blood were then included in the dataset twice. These individuals had two entries in the dataset used to construct the weights, this analysis. While these duplicates created small inaccuracies in the estimated weights that are used, we think these inaccuracies are ignorable.
Two sequential tests were used to confirm seropositivity for each individual. To treat both tests as a single test, calculations were carried out to determine the sensitivity and specificity of the sequential testing procedure. This combination assumes that the tests are independent conditional on each individuals true status. A tool for this calculation and the formula used can be found here.
The convenience sample had a higher proportion of individuals who thought they had previously been infected with COVID, and a higher proportion who had tested positive for COVID prior to this survey because of a suspected COVID illness.
make_sum_tabl <- function(var, valnames = c("Negative", "Positive", "Missing")) {
outputFormat <- knitr::opts_knit$get("rmarkdown.pandoc.to")
variable <- enexpr(var)
sum_df <- surv_df %>% group_by(randsamp) %>%
summarise(Count = n(),
n_neg = sum(!!variable %in% 0),
n_pos = sum(!!variable %in% 1),
n_mis = sum(is.na(!!variable)),
percent_neg = round(100 * n_neg / Count, 2),
percent_pos = round(100 * n_pos / Count, 2),
percent_mis = round(100 * n_mis / Count, 2),
pnm_neg = round(100 * n_neg / (n_neg + n_pos), 2),
pnm_pos = round(100 * n_pos /(n_neg + n_pos), 2))
sum_df$randsamp <- recode(sum_df$randsamp,
"n" = "Convenience",
"y" = "Address-Based")
names(sum_df)[1] <- "Sample Type"
if (length(outputFormat) == 0) outputFormat <- "html"
if (outputFormat %in% c('latex', 'html')) {
sum_df %>% gt::gt() %>% gt::tab_spanner(
label = "Number",
columns = vars(n_neg, n_pos, n_mis)
) %>% gt::tab_spanner(
label = "Percentage",
columns = vars(percent_neg, percent_pos, percent_mis)
) %>%
gt::tab_spanner(
label = "Percentage of Non-Missing",
columns = vars(pnm_neg, pnm_pos)
) %>% gt::cols_label(
n_neg = valnames[1], n_pos = valnames[2], n_mis = valnames[3],
percent_neg = valnames[1], percent_pos = valnames[2],
percent_mis = valnames[3], pnm_neg = valnames[1],
pnm_pos = valnames[2],
)
}else{
sum_df %>% knitr::kable()
}
}
make_sum_tabl(covid_illness, c("No", "Yes", "Missing"))
Sample Type | Count | Number | Percentage | Percentage of Non-Missing | |||||
---|---|---|---|---|---|---|---|---|---|
No | Yes | Missing | No | Yes | Missing | No | Yes | ||
Convenience | 504 | 295 | 90 | 119 | 58.53 | 17.86 | 23.61 | 76.62 | 23.38 |
Address-Based | 860 | 615 | 110 | 135 | 71.51 | 12.79 | 15.70 | 84.83 | 15.17 |
make_sum_tabl(const_test_res)
Sample Type | Count | Number | Percentage | Percentage of Non-Missing | |||||
---|---|---|---|---|---|---|---|---|---|
Negative | Positive | Missing | Negative | Positive | Missing | Negative | Positive | ||
Convenience | 504 | 355 | 3 | 146 | 70.44 | 0.60 | 28.97 | 99.16 | 0.84 |
Address-Based | 860 | 683 | 2 | 175 | 79.42 | 0.23 | 20.35 | 99.71 | 0.29 |
This issue is studied in greater detail in the senisitivy analysis section
In the first stage, we construct a two-phase object that will account for the missing tests data due to “quantity not sufficient” of some individuals.
## This object is created for comparisons
surv_obj <- survey::svydesign(
ids = ~HOUSEHOLD,
weights = ~weight,
strata = ~randsamp,
data = surv_df %>% filter(!is.na(both_pos))
)
two_phase_surv <- survey::twophase(
id = list(~HOUSEHOLD, ~individual),
strata = list(~randsamp, ~randsamp),
weights = list(~weight, ~wt2),
subset = ~I(qs),
data = surv_df, method = "approx"
)
make_pretty_out <- function(surv_obj) {
sv_mean <- survey::svymean(~both_pos, surv_obj, na.rm = TRUE)
est <- round(100 * as.numeric(sv_mean), 2)
std_err <- round(100 * as.numeric(sv_mean %>% survey::SE()), 2)
data.frame("Estimate" = est, "Standard Error" = std_err)
}
make_table <- function(surv_objs, names) {
outputFormat <- knitr::opts_knit$get("rmarkdown.pandoc.to")
if ("survey.design" %in% class(surv_objs)) {
fin_df <- make_pretty_out(surv_objs)
} else {
fin_df <- NULL
for (sv_idx in seq_along(names)) {
fin_df <- bind_rows(
fin_df,
bind_cols(
"Survey Object" = names[sv_idx],
make_pretty_out(surv_objs[[sv_idx]])
)
)
}
}
if (length(outputFormat) > 0) {
if (outputFormat %in% c('latex', 'html') ) {
fin_df %>% gt::gt() %>% gt::cols_label(
"Standard.Error" = "Standard Error"
) %>% gt::tab_spanner(
label = "Percent Positive (on both Tests)",
columns = c("Estimate", "Standard.Error")
)
}else{
fin_df %>% knitr::kable()
}
}else{
fin_df %>% gt::gt()
}
}
make_table(list(surv_obj, two_phase_surv),
c("Single-Phase Object (Dropping Missing Values)",
"Using Two-Phase Sampling"))
Survey Object | Percent Positive (on both Tests) | |
---|---|---|
Estimate | Standard Error | |
Single-Phase Object (Dropping Missing Values) | 3.3 | 0.83 |
Using Two-Phase Sampling | 3.3 | 0.82 |
Above, note that accounting for the missing values reduces the standard errors, but does not change the estimate.
The above code assumed that individuals were missing completely at random. However, we want to make a weaker assumption (missing at random). Thus, we call calibrate on the two-phase object to calibrate the second phase to the population. For more information (see section 9.2 of Lumley 2010):
post_cal_obj <- survey::calibrate(two_phase_surv, phase = 2,
calfun = "raking", stage = 0,
~AGECAT+HISPANIC+sex+RACECAT+INCOME)
make_table(list(two_phase_surv, post_cal_obj),
c("Pre-Calibration", "Post-Calibration"))
Survey Object | Percent Positive (on both Tests) | |
---|---|---|
Estimate | Standard Error | |
Pre-Calibration | 3.30 | 0.82 |
Post-Calibration | 3.22 | 0.81 |
Note: This changes both the estimate and standard error
Next we wish to perform raking on our weights from the first phase of our survey. Currently this is not supported with the two-phase object, so instead a new single-phase object is created with weights that approximately account for both of phases of the two-phase design.
It is worth noting that nothing methodological is happening in this step, we are only creating a new object in R that will satisfy all of our needs.
new_one_phase <- survey::svydesign(
ids = ~HOUSEHOLD,
weights = weights(post_cal_obj),
strata = ~randsamp,
data = subset(surv_df, qs == 1)
)
make_table(list(surv_obj, two_phase_surv$phase1$full,
two_phase_surv$phase1$sample,
post_cal_obj, new_one_phase),
c("Original single-phase object",
"Phase one object from two-phase object",
"Phase one object from two-phase object (ignoring missingness)",
"Previous two-phase object",
"New single-phase object"))
Survey Object | Percent Positive (on both Tests) | |
---|---|---|
Estimate | Standard Error | |
Original single-phase object | 3.30 | 0.83 |
Phase one object from two-phase object | 3.30 | 0.83 |
Phase one object from two-phase object (ignoring missingness) | 3.30 | 0.83 |
Previous two-phase object | 3.22 | 0.81 |
New single-phase object | 3.22 | 0.80 |
Now that we have created a single phase object we can rake the survey object so that the fact the weights were calculated using raking will be accounted for in the estimation of standard errors. As a reminder, when weights are calculated using raking, standard errors of estimates using these weights are reduced because we are able to take into account properly the population information used for raking (Lumley 2010).
post_rake_obj <- survey::rake(
new_one_phase,
sample.margins = list(
~AGECAT + sex,
~HISPANIC + sex,
~RACECAT + sex,
~INCOME + sex
# ~INCOME + hhs
),
population.margins = list(
ps_obj$age_sex[, c("AGECAT", "sex", "Freq")],
ps_obj$hisp_sex[, c("HISPANIC", "sex", "Freq")],
ps_obj$race_sex[, c("RACECAT", "sex", "Freq")],
ps_obj$inc_sex[, c("INCOME", "sex", "Freq")]
# ps_obj$inc_hhs[, c("INCOME", "hhs", "Freq")]
)
)
my_summarise <- function(data, group_var) {
data %>%
group_by({{ group_var }}) %>%
summarise(mean = mean(mass))
}
make_table(list(new_one_phase, post_rake_obj),
c("Before Raking", "After Raking"))
Survey Object | Percent Positive (on both Tests) | |
---|---|---|
Estimate | Standard Error | |
Before Raking | 3.22 | 0.80 |
After Raking | 3.24 | 0.73 |
Note above that the standard error has been reduced. In normal circumstances, we would expect there to be no change in the estimate, but since we have eliminated 8 observations that were duplicates since the original weights were calculated, the estimate changes by a small amount.
While all of the above estimates were providing estimates for the proportion of individuals who tested positive on both tests, we are actually interested in estimating prevalence or the proportion of individuals who have been infected with COVID19.
One complication for this correction is that sero-positivity was determined using a sequence of two tests. A first test was conducted, and individuals who tested positive on the first test were given a second test. Only individuals who tested positive on both tests were counted as positive. This combination assumes that the tests are independent conditional on each individuals true status, and allows us to treat this sequential testing procedure as a single test. The calculations used for this combination can be found on the epitools website “Epitools - Sensitivity and Specificity of Two Tests Used ...” (n.d.). Note that the sensitivity and specificity of these tests are derived using positive and negative predictive agreement respectively with other benchmark tests. For more details on the comparators used for the Ab and ElisaG tests, see “Platelia SARS-CoV-2 Total Ab” (n.d.) and “Anti-SARS-CoV-2 ELISA (IgG) Instruction for Use” (n.d.) respectively.
combine_tests <- function(t1_spec, t1_sens, t2_spec, t2_sens) {
return(c(
"spec" = 1 - (1 - t1_spec) * (1 - t2_spec),
"sens" = t1_sens * t2_sens
)
)
}
## First (AB) test
## Combined results
## Positive Percent Agreement: 92.16% (47/51); 95% CI: (81.5% - 96.91%)
## Negative Percent Agreement: 99.56% (684/687); 95% CI: (98.72 – 99.85%)
## Second Test: Anti-SARS-CoV-2 ELISA (IgG)
#IgG Sensitivity (PPA) 90% (27/30) (74.4%; 96.5%) (midpoint: 85.45)
#IgG Specificity (NPA) 100% (80/80) (95.4%; 100%) (midpoint: 97.7)
testing_accuracy <- data.frame(
"Test" = c("Bio-Rad Ab test", "Euroimmun elisaG (IgG)"),
"Specificity" = c(0.9956, 1),
"Sensitivity" = c(0.9216, 0.9)
)
comb_sens_spec <- combine_tests(
t1_spec = testing_accuracy$Specificity[1],
t1_sens = testing_accuracy$Sensitivity[1],
t2_spec = testing_accuracy$Specificity[2],
t2_sens = testing_accuracy$Sensitivity[2]
)
tabl_dff <- testing_accuracy %>% mutate(
Specificity = paste0(100 * Specificity, "%"),
Sensitivity = paste0(100 * Sensitivity, "%"),
) %>% bind_rows(
c("Test" = "Sequential Test",
"Specificity" = paste0(round(100 * comb_sens_spec["spec"], 2), "%"),
"Sensitivity" = paste0(round(100 * comb_sens_spec["sens"], 2), "%"))
)
colnames(tabl_dff)[2:3] <- c("Negative Percent agreement",
"Positive percent agreement")
gt::gt(
tabl_dff
) %>% gt::cols_align(
align = "center",
columns = 2:3
)
Test | Negative Percent agreement | Positive percent agreement |
---|---|---|
Bio-Rad Ab test | 99.56% | 92.16% |
Euroimmun elisaG (IgG) | 100% | 90% |
Sequential Test | 100% | 82.94% |
The estimates of sensitivity and specificity used come from reports of the tests’ performance. The source for the Ab test can be found below the last table of page 15 of “Platelia SARS-CoV-2 Total Ab” (n.d.) (link to document). The source for the IgG test can be found in the last table of page 12 of “Anti-SARS-CoV-2 ELISA (IgG) Instruction for Use” (n.d.)(link to document).
Next, recall the formula that relates the proportion of individuals with positive tests to the proportion of individuals who have been infected with COVID19:
\[ \text{Prevalence} = \frac{\text{Proportion Positive} + \text{Specificity} - 1 }{\text{Sensitivity} + \text{Specificity} - 1} \]
While there is a single agreed upon method for adjusting the observed proportion of positive tests to obtain an estimate of prevalence, there is no such analog for the standard error.
Here, we draw from a suggestion by REICZIGEL, FÖLDI, and ÓZSVÁRI (2010) for how to adjust our confidence limits in the presence of a test with known sensitivity and specificity.
The adjustment described by REICZIGEL, FÖLDI, and ÓZSVÁRI (2010) is based on inverting a hypothesis test. At a high level, this is done using the following method:
For a proposed number of positive individuals \(x\), it is possible to estimate the probability that exactly this many individuals were truly positive given the observed number of positive tests and the sensitivity and specificity of the test.
Once this value is calculated for every possible \(x\), a distribution can be constructed for all possible values of the prevalence (by dividing \(x\) by the sample size).
Next, an interval is selected such that 95% of the distributions mass is contained within this interval (and contains the estimate near the middle).
Note that this has been implemented in an R function posted on the authors’ website. The code for this function can be seen by opening the “Code to adjust confidence limits” tab below.
ci4prev=function(poz,n,se=1,sp=1,level=.95,dec=3,method="bl"){
# calculates exact confidence limits for the prevalence of disease
# adjusted for sensitivity and specificity of the diagnostic test
# written by Jenő Reiczigel, 2009 (reiczigel.jeno@aotk.szie.hu)
# if using please cite
# Reiczigel, Földi and Ózsvári (2010) Exact confidence limits for
# prevalence of a disease with an imperfect diagnostic test,
# Epidemiology and infection, 138: 1674-1678.
# poz - number of test positives
# n - sample size
# se - test sensitivity
# sp - test specificity
# level - prescribed confidence level
# dec - required number of decimals for the result
# method - "bl" for Blaker, "st" for Sterne,
# "cp" for Clopper-Pearson, "wi" for Wilson (for n>500)
# ci4prev calls the following functions:
# blakerci() for Blaker's interval (see below),
# sterne.int() for Sterne's interval (see below),
# binconf() from library(Hmisc) for Wilson & Clopper-Pearson
if(method=="cp") cl=binconf(poz,n,method="e",alpha=1-level)[2:3]
else if (method=="st") cl=sterne.int(poz,n,alpha=1-level)
else if (method=="bl") cl=blakerci(poz,n,level=level)
else if (method=="wi") cl=binconf(poz,n,method="w",alpha=1-level)[2:3]
else stop('valid methods are "bl", "st", "cp", and "wi"')
adj.cl=(cl+sp-1)/(se+sp-1)
adj.cl=pmax(adj.cl,c(0,0))
adj.cl=pmin(adj.cl,c(1,1))
names(adj.cl)=NULL
return(round(adj.cl,digits=dec))
}
require(Hmisc)
# -----------------------------------
# Blaker's interval (by Helge Blaker)
# -----------------------------------
blakerci <- function(x,n,level=.95,tolerance=1e-04){
lower = 0
upper = 1
if (x!=0){lower = qbeta((1-level)/2, x, n-x+1)
while (acceptbin(x, n, lower + tolerance) < (1 - level))
lower = lower+tolerance
}
if (x!=n){upper = qbeta(1 - (1-level)/2, x+1, n-x)
while (acceptbin(x, n, upper - tolerance) < (1 - level))
upper = upper-tolerance
}
c(lower,upper)
}
# Computes the Blaker exact ci (Canadian J. Stat 2000)
# for a binomial success probability
# for x successes out of n trials with
# confidence coefficient = level; uses acceptbin function
acceptbin = function(x, n, p){
#computes the Blaker acceptability of p when x is observed
# and X is bin(n, p)
p1 = 1 - pbinom(x - 1, n, p)
p2 = pbinom(x, n, p)
a1 = p1 + pbinom(qbinom(p1, n, p) - 1, n, p)
a2 = p2 + 1 - pbinom(qbinom(1 - p2, n, p), n, p)
return(min(a1,a2))
}
######################################################################
#
# EXACT CONFIDENCE BOUNDS FOR A BINOMIAL PARAMETER p
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Calculate exact Sterne confidence bounds for a binomial parameter
# by the method described in Duembgen (2004): Exact confidence
# bounds in discrete models - algorithmic aspects of Sternes method.
# Preprint, available on www.stat.unibe.ch/~duembgen
#
# Kaspar Rufibach, August 2004
#
######################################################################
# define function sterne.int (find example at the bottom)
sterne.int <- function(x,n,alpha=0.05,del=10^-5){
logit <- function(p){log(p/(1-p))}
invlogit <- function(y){exp(y)/(1+exp(y))}
theta <- function(k,x,n){(lchoose(n,x)-lchoose(n,k))/(k-x)}
Feta <- function(x,eta){pbinom(x,n,invlogit(eta))}
##############################################################
# The function pi_eta(X,eta) automatically accounts for the
# fact that if k_alpha(X)=min(J) then a_alpha^st(X)=a_alpha(X)
piXeta <- function(x,eta){
if (invlogit(eta)>=1){f <- 0} else {
J <- c(0:(x-1),(x+1):n)
# on (-infty,theta_0]
t1 <- theta(0,x,n)
if (is.na(t1)!=1 && eta<=t1){f <- 1-Feta(x-1,eta)}
# on [theta_0,mode]
k1 <- J[J<(x-1)]
if (length(k1)>0){
the1 <- theta(k1,x,n)
the2 <- theta(k1+1,x,n)
pos <- (the1<=eta)*(eta<the2)
if (sum(pos)>0){f <- 1-Feta(x-1,eta)+Feta(max(k1*pos),eta)}}
# mode
the1 <- theta(x-1,x,n)
the2 <- theta(x+1,x,n)
if (eta>=the1 && eta<=the2){f <- 1}}
# on [mode,theta_n]
k2 <- J[J>(x+1)]
if (length(k2)>0){
the1 <- theta(k2-1,x,n)
the2 <- theta(k2,x,n)
kre <- sum(k2*(the1<eta)*(eta<=the2))
if (kre>0){f <- 1-Feta(kre-1,eta)+Feta(x,eta)}}
# on [theta_n,infty)
t2 <- theta(n,x,n)
if (is.na(t2)!=1 && eta>=t2){f <- Feta(x,eta)}
f}
#####################################
# lower bound a_alpha^st(X)
if (x==0){pu <- 0} else {
J <- c(0:(x-1),(x+1):n)
k1 <- min(J)
pi1 <- piXeta(x,theta(k1,x,n))
# calculation of k_alpha(X)
if (pi1>=alpha){kal <- k1} else {
k <- x-1
while (k1<k-1){
k2 <- floor((k+k1)/2)
pi2 <- piXeta(x,theta(k2,x,n))
if (pi2>=alpha){k <- k2} else {k1 <- k2}
}
kal <- k
}
# calculation of a_alpha^st(X)
b1 <- theta(kal,x,n)
pi1 <- 1-Feta(x-1,b1)+Feta(kal-1,b1)
if (pi1<=alpha){b <- b1} else {
b <- max(theta(kal-1,x,n),logit(del))
pi <- 1-Feta(x-1,b)+Feta(kal-1,b)
while (b1-b>del || pi1-pi>del){
b2 <- (b+b1)/2
pi2 <- 1-Feta(x-1,b2)+Feta(kal-1,b2)
if (pi2>alpha){
b1 <- b2
pi1 <- pi2} else {
b <- b2
pi <- pi2}}}
pu <- invlogit(b)}
######################################
# upper bound b_alpha^st(X)
if (x==n){po <- 1} else {
J <- c(0:(x-1),(x+1):n)
k1 <- max(J)
pi1 <- piXeta(x,theta(k1,x,n))
# calculation of k_alpha(X)
if (pi1>=alpha){kau <- k1} else {
k <- x+1
pi <- 1
while (k1>k+1){
k2 <- floor((k+k1)/2)
pi2 <- piXeta(x,theta(k2,x,n))
if (pi2>=alpha){k <- k2} else {k1 <- k2}
}
kau <- k
}
# calculation of b_alpha^st(X)
b1 <- theta(kau,x,n)
pi1 <- 1-Feta(kau,b1)+Feta(x,b1)
if (pi1<=alpha){
b <- b1
po <- pi1} else {
b <- min(theta(kau+1,x,n),b1+n)
pi <- 1-Feta(kau,b)+Feta(x,b)
while (b-b1>del || pi1-pi>del){
b2 <- (b+b1)/2
pi2 <- 1-Feta(kau,b2)+Feta(x,b2)
if (pi2>alpha){
b1 <- b2
pi1 <- pi2} else {
b <- b2
pi <- pi2}}}
po <- invlogit(b)}
c('a_alpha^St'=pu,'b_alpha^St'=po)
}
adj_for_sesp <- function(AP, Sp, Se) { (AP + Sp - 1) / (Se + Sp - 1) }
post_adj_est <- function(survey_obj, se = comb_sens_spec["sens"],
sp = comb_sens_spec["spec"], digs = 2){
srvy_mean <-survey::svymean(~both_pos, survey_obj,
deff = "replace")
prop_pos <- as.numeric(srvy_mean)
if (is.null(survey_obj$variables)) {
survey_obj$variables <- survey_obj$phase1$sample$variables
}
iid_n <- nrow(survey_obj$variables) / survey::deff(srvy_mean)
prev_est <- round(100 * adj_for_sesp(prop_pos, sp, se), digs)
num_pos <- round(iid_n * prop_pos)
ci <- ci4prev(num_pos, round(iid_n), se = se, sp = sp, dec = 2 + digs)
chrs <- paste0("%.", digs, "f")
tibble(
# "Effective Sample Size" = iid_n,
# "Number Tested Positive" = num_pos,
"Prevalence Estimate (95% CI)" = sprintf(paste0(chrs, " (",
chrs, ", ",chrs, ")"),
prev_est, 100 * ci[1],
100 * ci[2]))
}
prop_pos_by <- function(survey_obj, form, revalkey = NULL,
se = comb_sens_spec["sens"],
sp = comb_sens_spec["spec"],
digs = 4, return_type = "tab",
oth_col_name = NULL){
prop_pos <- survey::svyby(~both_pos, form,
survey_obj, survey::svymean,
deff = "replacem")
vname <- colnames(prop_pos)[1]
prop_pos[, 1] <- as.character(prop_pos[, 1])
adj_df <- prop_pos %>% mutate(across(.cols = c("both_pos", "se"),
.fns = function(x) round(x * 100, 2)))
adj_df$prev_est <- round(adj_for_sesp(adj_df$both_pos,
Sp = sp, Se = se), digs)
counts <- surv_obj$variables %>%
group_by(eval(parse(text = vname))) %>%
count()
colnames(counts)[1] <- vname
counts[[vname]] <- as.character(counts[[vname]])
efss <- left_join(adj_df, counts, by = vname) %>%
mutate(efs = n / DEff.both_pos,
num_pos = round(efs * prev_est / 100),
efs = round(efs))
efss$ub <- efss$lb <- NA
for (bound_idx in 1:nrow(efss)) {
if (!is.na(efss$num_pos[bound_idx])) {
efss[bound_idx, c("lb", "ub")] <-
round(100 * ci4prev(efss$num_pos[bound_idx],
efss$efs[bound_idx], se = se,
sp = sp, dec = 2 + digs), digs)
}else{
efss[bound_idx, c("prev_est", "lb", "ub")] <- NA
}
}
if (return_type == "tab") {
adj_df <- efss %>% mutate(
est_ci = sprintf("%.2f (%.2f, %.2f)",
prev_est, lb, ub)
) %>% select(1, both_pos, se, est_ci)
if (!is.null(revalkey)) {
adj_df[, 1] <- dplyr::recode(
adj_df[, 1], !!!revalkey
)
}
if (!is.null(oth_col_name)) {
colnames(adj_df)[1] <- oth_col_name
}
adj_df %>% gt::gt() %>% gt::tab_spanner(
label = "Percent Positive (on both Tests)",
columns = c("both_pos", "se")
) %>% gt::tab_spanner(
label = "Prevalence",
columns = c("est_ci")
) %>%
gt::cols_label(
"both_pos" = "Estimate",
"se" = "Standard Error",
"est_ci" = "Estimate (95% CI)"
) %>% gt::cols_align("center")
}else{
efss
}
}
source("bootstrap_ci.R")
desn_eff <- survey::svymean(~both_pos, post_rake_obj, deff = "replace") %>% survey::deff()
fin_reslt <- post_adj_est(survey_obj = post_rake_obj, digs = 2) %>% gt::gt() %>%
gt::cols_align("center") %>% gt::tab_footnote(
footnote = paste0("The effective sample size is ",
round(nrow(post_rake_obj$variables) / desn_eff),
" for these calculations."),
locations = gt::cells_column_labels(
columns = vars(`Prevalence Estimate (95% CI)`))
)
fin_reslt
Prevalence Estimate (95% CI)1 |
---|
3.91 (2.35, 5.96) |
1
The effective sample size is 589 for these calculations.
|
All subgroup specific estimates are calculated using the all the adjustments described above.
revalkey <- c("1" = "White", "2" = "Black", "3" = "Asian", "4" = "Other")
prop_pos_by(post_rake_obj, ~RACECAT, revalkey, oth_col_name = "Race")
Race | Percent Positive (on both Tests) | Prevalence | |
---|---|---|---|
Estimate | Standard Error | Estimate (95% CI) | |
White | 1.44 | 0.47 | 1.74 (1.02, 3.57) |
Black | 9.77 | 3.41 | 11.78 (7.25, 24.77) |
Asian | 1.95 | 1.11 | 2.35 (1.05, 7.41) |
Other | 10.75 | 4.77 | 12.96 (7.53, 33.10) |
plot_df <- prop_pos_by(post_rake_obj, ~RACECAT,
revalkey, return_type = "data")
plot_df$RACECAT <- recode(plot_df$RACECAT, !!!revalkey)
ggplot(plot_df, aes(x = RACECAT, y = prev_est / 100,
ymin = lb / 100, ymax = ub/100)) +
geom_point(size = 1) + geom_pointrange(size = 1) +
ylim(0, NA) +
ylab("Estimated Prevalence") +
labs(title = "Estimated Prevalence Across Race Group",
subtitle = "In King County in early to mid August 2020",
caption = "Points indicate estimates and lines indicate 95% CI") +
# scale_y_log10(labels = scales::percent) +
scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
coord_flip() + theme(
axis.title.y = element_blank(),
axis.title.x = element_blank()
)
sexkey <- c("1" = "Male", "2" = "Female")
prop_pos_by(post_rake_obj, ~sex, sexkey, oth_col_name = "Sex")
Sex | Percent Positive (on both Tests) | Prevalence | |
---|---|---|---|
Estimate | Standard Error | Estimate (95% CI) | |
Male | 3.64 | 1.03 | 4.39 (2.99, 8.50) |
Female | 2.84 | 0.91 | 3.42 (2.37, 7.38) |
plot_df_sex <- prop_pos_by(post_rake_obj, ~sex,
revalkey, return_type = "data")
plot_df_sex$sex <- recode(plot_df_sex$sex, !!!sexkey)
ggplot(plot_df_sex, aes(x = sex, y = prev_est / 100,
ymin = lb / 100, ymax = ub / 100)) +
geom_pointrange(size = 1) +
ylim(0, NA) +
ylab("Estimated Prevalence") +
labs(title = "Estimated Prevalence Across Sex",
subtitle = "In King County in early to mid August 2020",
caption = "Points indicate estimates and lines indicate 95% CI") +
# scale_y_log10(labels = scales::percent) +
scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
coord_flip() + theme(
axis.title.y = element_blank(),
axis.title.x = element_blank()
)
prop_pos_by(post_rake_obj, ~binary_income, oth_col_name = "Income") %>%
gt::tab_footnote(
footnote = "Median household income in King County is $95,000",
locations = gt::cells_column_labels(
columns = vars(Income)))
Income1 | Percent Positive (on both Tests) | Prevalence | |
---|---|---|---|
Estimate | Standard Error | Estimate (95% CI) | |
Above 100,000 | 1.30 | 0.47 | 1.57 (0.94, 3.48) |
At or below 100,000 | 6.09 | 1.66 | 7.34 (4.96, 13.99) |
1
Median household income in King County is $95,000
|
inc_df <- prop_pos_by(post_rake_obj, ~binary_income, return_type = "data")
ggplot(inc_df, aes(x = binary_income, y = prev_est / 100,
ymin = lb / 100, ymax = ub / 100)) +
geom_pointrange(size = 1) +
ylab("Estimated Prevalence") +
labs(title = "Estimated Prevalence Across Income Categories",
subtitle = "In King County in early to mid August 2020",
caption = "Points indicate estimates and lines indicate 95% CI") +
# scale_y_log10(labels = scales::percent) +
scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
coord_flip() + theme(
axis.title.y = element_blank(),
axis.title.x = element_blank()
)
prop_pos_by(post_rake_obj, ~agegrp2, oth_col_name = "Age Group")
Age Group | Percent Positive (on both Tests) | Prevalence | |
---|---|---|---|
Estimate | Standard Error | Estimate (95% CI) | |
0≤age<16 | 2.10 | 1.19 | 2.53 (1.12, 7.98) |
16≤age<25 | 9.05 | 4.85 | 10.91 (4.81, 31.97) |
25≤age<45 | 1.83 | 0.71 | 2.21 (1.19, 5.25) |
45≤age<65 | 4.04 | 1.47 | 4.87 (3.03, 11.14) |
65≤age<100 | 2.50 | 1.40 | 3.01 (1.32, 9.31) |
agrp_df <- prop_pos_by(post_rake_obj, ~agegrp2, return_type = "data",
oth_col_name = "Age Group")
ggplot(agrp_df, aes(x = agegrp2, y = prev_est / 100,
ymin = lb / 100, ymax = ub / 100)) +
geom_pointrange(size = 1) +
ylab("Estimated Prevalence") +
labs(title = "Estimated Prevalence Across Age Group",
subtitle = "In King County in early to mid August 2020",
caption = "Points indicate estimates and lines indicate 95% CI") +
# scale_y_log10(labels = scales::percent) +
scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
coord_flip() + theme(
axis.title.y = element_blank(),
axis.title.x = element_blank()
)
In this section, we consider testing for between group difference across the previously considered categorical variables (race, sex, income, and age group). For these comparisons, we consider pairwise comparisons between different levels of a categorical variable as part of a single question. However we treat hypotheses concerned with different categorical variables as distinct from one another. As an example we consider the comparison between Blacks and whites and the comparison between Blacks and Asians as part of a single question, but consider the comparison between Blacks and Asians separate from the comparison between males and females. Thus, we adjust for multiple comparisons between levels within categories, but not for the multiple categorical variables. We control the family-wise error rate (FWER) using a Bonferroni adjustment for pairwise comparisons between different groups within each category, but do not adjust for the number of categorical variables for which we are making between-category comparisons (in our case four).
Each table below displays the Bonferroni-corrected p-value for each possible pairwise comparison.
make_resl <- function(var, pval_adj = TRUE) {
bval <- unique(new_one_phase$variables[[var]])
bval <- bval[!is.na(bval)]
if (!is.numeric(bval)) {
bval <- paste0("'", bval, "'")
repl_str <- ","
}else{
repl_str <- ",|\\ "
}
all_res <- NULL
for (bv_idx in seq_along(bval)) {
sub_res <- survey::svyglm(
as.formula(paste0("both_pos ~ relevel(as.factor(",
var, "), ", bval[bv_idx], ")")),
family = "binomial",
new_one_phase) %>% summary() %>% coef()
levels <- map(str_split(rownames(sub_res), "\\)"),
function(x) {
y <- x[length(x) - c(1, 0)]
gsub(repl_str, "", y)
})[-1]
levels <- do.call(rbind, levels)
sub_res <- data.frame(
level1 = levels[, 1], level2 = levels[, 2],
odds_ratio = exp(sub_res[-1, 1]),
inv_confint = paste0(
"(", round(1 / exp(sub_res[-1, 1] + qnorm(0.975) * sub_res[-1, 2]), 1),
", ", round(1 / exp(sub_res[-1, 1] - qnorm(0.975) * sub_res[-1, 2]), 1),
")"),
pval = sub_res[-1, ncol(sub_res)]
)
rownames(sub_res) <- NULL
all_res <- bind_rows(sub_res, all_res)
}
if (pval_adj) {
return(all_res %>% mutate(
pval = as.character(round(pmin(1, pval * nrow(all_res) / 2), 6))
) )
} else {
return(all_res)
}
}
table_pairwise <- function(pval_tab, rename_vec) {
pval_tab$odds_ratio <- pval_tab$inv_confint <- NULL
pval_tab$level1 <- gsub("\\\"|\\ ", "", pval_tab$level1)
if (!is.null(rename_vec)) {
rfrmt <- pval_tab %>% mutate(
level1 = recode(level1, !!!rename_vec),
level2 = recode(level2, !!!rename_vec))
}else{
rfrmt <- pval_tab
}
rfrmt <- rfrmt %>% arrange(level1, level2)
rfrmt <- rfrmt %>%
pivot_wider(names_from = level2, values_from = pval,
values_fill = "") %>%
arrange(level1)
rfrmt <- rfrmt[, c(1, 1 + order(colnames(rfrmt)[-1]))]
rfrmt %>% gt::gt(rowname_col = "level1") %>%
gt::tab_spanner("Comparison Group", 2:ncol(rfrmt)) %>%
gt::cols_align("center") %>% gt::tab_stubhead(label = "Baseline Group")
}
race_rename <- c("1" = "White", "2" = "Black",
"3" = "Asian", "4" = "Other")
table_pairwise(make_resl("RACECAT"), race_rename)
Baseline Group | Comparison Group | |||
---|---|---|---|---|
Asian | Black | Other | White | |
Asian | 0.101053 | 0.132775 | 1 | |
Black | 0.101053 | 1 | 0.000978 | |
Other | 0.132775 | 1 | 0.003113 | |
White | 1 | 0.000978 | 0.003113 |
sex_rename <- c("1" = "Male", "2" = "Female")
table_pairwise(make_resl("sex"), sex_rename)
Baseline Group | Comparison Group | |
---|---|---|
Female | Male | |
Female | 0.550988 | |
Male | 0.550988 |
table_pairwise(make_resl("binary_income"), NULL)
Baseline Group | Comparison Group | |
---|---|---|
Above 100000 | At or below 100000 | |
Above100000 | 0.000984 | |
Atorbelow100000 | 0.000984 |
table_pairwise(make_resl("agegrp2"), NULL)
Baseline Group | Comparison Group | ||||
---|---|---|---|---|---|
0≤age<16 | 16≤age<25 | 25≤age<45 | 45≤age<65 | 65≤age<100 | |
0≤age<16 | 0.767777 | 1 | 1 | 1 | |
16≤age<25 | 0.767777 | 0.156947 | 1 | 1 | |
25≤age<45 | 1 | 0.156947 | 1 | 1 | |
45≤age<65 | 1 | 1 | 1 | 1 | |
65≤age<100 | 1 | 1 | 1 | 1 |
After considering all pairwise differences between groups, the following groups were found to be significantly different at the \(0.05\) level:
signf <- bind_rows(
make_resl("RACECAT") %>% mutate(
level1 = recode(level1, !!!race_rename),
level2 = recode(level2, !!!race_rename),
Category = "Race"),
make_resl("sex") %>% mutate(
level1 = recode(level1, !!!sex_rename),
level2 = recode(level2, !!!sex_rename),
Category = "Sex"),
make_resl("binary_income") %>% mutate(Category = "Income"),
make_resl("agegrp2") %>% mutate(Category = "Age Group")) %>%
filter(pval < 0.05) %>% group_by(pval) %>%
filter(row_number() == 1) %>% ungroup()
signf$level1 <- gsub("\\\"", "", signf$level1)
signf <- signf %>% mutate(pval = round(as.numeric(pval), 6))
signf$odds_ratio <- round(1 / signf$odds_ratio, 1)
gt::gt(signf[, c(2, 1, 3, 4, 5, 6)] %>% group_by(Category) ) %>%
gt::cols_label(level2 = gt::html("Baseline <br> Group"),
level1 = gt::html("Comparison <br> Group"),
odds_ratio = gt::html("Estimated <br> Odds Ratio"),
inv_confint = gt::html("Confidence <br> Interval"),
pval = gt::html("Bonferroni-Adjusted <br> P-value")) %>%
gt::cols_align(columns = 1:5, align = "c")
Baseline Group |
Comparison Group |
Estimated Odds Ratio |
Confidence Interval |
Bonferroni-Adjusted P-value |
---|---|---|---|---|
Race | ||||
White | Other | 8.2 | (2.5, 26.9) | 0.003113 |
White | Black | 7.3 | (2.6, 20.6) | 0.000978 |
Income | ||||
Above 100000 | At or below 100000 | 5.0 | (1.9, 12.8) | 0.000984 |
## checking Banjamini-Hochberg method
all_res <- list(make_resl("RACECAT", pval_adj = FALSE),
make_resl("sex", pval_adj = FALSE),
make_resl("binary_income", pval_adj = FALSE),
make_resl("agegrp2", pval_adj = FALSE))
# install.packages("sgof")
make_pval_mat <- function(x){
look <- all_res[[x]] %>%
mutate(pval = round(pval, 9)) %>%
group_by(pval) %>%
filter(row_number() == 1)
fin_df <- round(cbind(pvalue = sort(look$pval),
cutoff_Bonf = 0.005 / length(look$pval),
cutoff_BH = 0.005 * seq(look$pval) / length(look$pval),
cutoff_BY = 0.005 * seq(look$pval) / (length(look$pval) *
sum(1 / seq(look$pval)))), 5)
cbind(fin_df,
"rej_Bonf" = fin_df[, 1] < fin_df[, 2],
"rej_BH" = fin_df[, 1] < fin_df[, 3],
"rej_BY" = fin_df[, 1] < fin_df[, 4])
}
# map(1:4, make_pval_mat)
As mentioned earlier, the convenience sample is likely to be biased towards having more COVID19 positive individuals. One solution to account for this would be to adjust weights based on those in the sample who reported testing positive on a previous COVID19 test. Unfortunately, the openly available data from King County do not match the collected data in the following ways:
As an alternative to using King County data, we treat the address-based sample and the given survey weights to construct a distribution of those with positive tests. We then use this distribution to adjust the convenience sample weights such that the weighted distribution of positive tests in both strata (address-based and convenience sample) match. This could have been used in our main analysis, but there were multiple issues with this approach:
More statistically principled solutions to account for bias in convenience based samples exist (see Valliant 2020), and would be worth considering given more time. These alternative approaches would likely require re-calculation of the weights for the entire sample.
## Post-stratify on the observed number of positive tests
## in the address-based sample
make_ps_est <- function(svy_df, des_obj, type = "none") {
surv_df <- svy_df
props <- surv_df %>% group_by(randsamp) %>%
summarise(prop_felt_sick = mean(covid_illness, na.rm = TRUE),
prop_test_pos = mean(const_test_res, na.rm = TRUE))
w_mis_il_conv <-
which(surv_df$randsamp == "n" & is.na(surv_df$covid_illness))
surv_df[w_mis_il_conv, "covid_illness"] <-
rbinom(length(w_mis_il_conv), size = 1,
prob = props %>% filter(randsamp == "n") %>%
pull(prop_felt_sick))
w_mis_il_abs <-
which(surv_df$randsamp == "y" & is.na(surv_df$covid_illness))
surv_df[w_mis_il_abs, "covid_illness"] <-
rbinom(length(w_mis_il_abs), size = 1,
prob = props %>% filter(randsamp == "y") %>%
pull(prop_test_pos))
w_mis_pt_conv <-
which(surv_df$randsamp == "n" & is.na(surv_df$const_test_res))
surv_df[w_mis_pt_conv, "const_test_res"] <-
rbinom(length(w_mis_pt_conv), size = 1,
prob = props %>% filter(randsamp == "n") %>%
pull(prop_test_pos))
w_mis_pt_abs <-
which(surv_df$randsamp == "y" & is.na(surv_df$const_test_res))
surv_df[w_mis_pt_abs, "const_test_res"] <-
rbinom(length(w_mis_pt_abs), size = 1,
prob = props %>% filter(randsamp == "y") %>%
pull(prop_test_pos))
ps_one_phase <- survey::svydesign(
ids = ~HOUSEHOLD,
weights = weights(post_cal_obj),
strata = ~randsamp,
data = subset(surv_df, qs == 1)
)
if (type == "pos_test") {
pos_tests <-
surv_df %>% filter(randsamp == "y") %>%
group_by(const_test_res) %>% count(name = "Freq")
post_strat_obj <- survey::postStratify(
ps_one_phase, ~const_test_res, pos_tests
)
}else if (type == "cov_ill") {
## Post-stratify on the number of individuals reporting
## having felt sick with COVID19 previously.
think_had_covid <-
surv_df %>% filter(randsamp == "y") %>%
group_by(covid_illness) %>% count() %>%
## Modify counts so weights still reflect
## King county population totals
ungroup() %>%
mutate(Freq = round(n * sum(ps_obj$hisp_sex$Freq) /
sum(n))) %>%
select(-n)
post_strat_obj <- survey::postStratify(
ps_one_phase, ~covid_illness, think_had_covid
)
}else{
post_strat_obj <- ps_one_phase
}
second_phase_one <- survey::svydesign(
ids = ~HOUSEHOLD,
weights = weights(post_strat_obj),
strata = ~randsamp,
data = subset(surv_df, qs == 1)
)
post_rake_obj2 <- survey::rake(
second_phase_one,
sample.margins = list(
~AGECAT + sex,
~HISPANIC + sex,
~RACECAT + sex,
~INCOME + sex
# ~INCOME + hhs
),
population.margins = list(
ps_obj$age_sex[, c("AGECAT", "sex", "Freq")],
ps_obj$hisp_sex[, c("HISPANIC", "sex", "Freq")],
ps_obj$race_sex[, c("RACECAT", "sex", "Freq")],
ps_obj$inc_sex[, c("INCOME", "sex", "Freq")]
# ps_obj$inc_hhs[, c("INCOME", "hhs", "Freq")]
)
)
return(post_rake_obj2)
}
post_test <- make_ps_est(surv_df, post_cal_obj, type = "pos_test")
post_ill <- make_ps_est(surv_df, post_cal_obj, type = "cov_ill")
just_rake <- make_ps_est(surv_df, post_cal_obj, type = "none")
make_table(list(#post_cal_obj,
just_rake, post_test, post_ill),
c(#"Post-calibration two-phase object",
"Raked Object (main analysis)",
"Raked and balanced on positive tests",
"Raked and balanced on reporting illness"))
Survey Object | Percent Positive (on both Tests) | |
---|---|---|
Estimate | Standard Error | |
Raked Object (main analysis) | 3.24 | 0.73 |
Raked and balanced on positive tests | 2.88 | 0.63 |
Raked and balanced on reporting illness | 3.02 | 0.67 |
comp_prev_ests <- bind_rows(
# post_adj_est(post_cal_obj),
post_adj_est(just_rake),
post_adj_est(post_test),
post_adj_est(post_ill)
)
ess_cmpr <- sapply(list(just_rake, post_test, post_ill),
FUN = function(x) {
round(
nrow(x$variables) /
survey::svymean(~both_pos,x, deff = "replace") %>%
survey::deff())
})
deffs_cmpr <- sapply(list(just_rake, post_test, post_ill),
FUN = function(x) {
survey::svymean(~both_pos,x, deff = "replace") %>%
survey::deff() %>% round(2)
})
comp_prev_ests$`Adjustment Strategy` <-
c(#"Post-calibration two-phase object",
"Raked Object (main analysis)",
"Raked and balanced on positive tests",
"Raked and balanced on reporting illness")
comp_prev_ests$`Design Effect` <- deffs_cmpr
comp_prev_ests$`Effective Sample Size` <- ess_cmpr
comp_prev_ests[, c(2, 3, 4, 1)] %>% gt::gt() %>%
gt::cols_align("center")
Adjustment Strategy | Design Effect | Effective Sample Size | Prevalence Estimate (95% CI) |
---|---|---|---|
Raked Object (main analysis) | 2.18 | 589 | 3.91 (2.35, 5.96) |
Raked and balanced on positive tests | 1.84 | 698 | 3.47 (2.17, 5.28) |
Raked and balanced on reporting illness | 1.99 | 644 | 3.65 (2.15, 5.45) |
When the above sensitivity analysis was conducted, individuals with missing values for answers to the COVID survey questions had their values imputed assuming missingness at random (conditional on strata). To study how this imputation effected the point estimates and standard error, here we carry out multiple imputations and study the distribution of the resulting estimate. Here balancing is carried out based on the distribution of individuals who reported feeling ill.
if (file.exists("mult_imp_sens.rds")) {
all_res <- readRDS("mult_imp_sens.rds")
} else {
all_res <- data.frame("est" = rep(NA, 1000), "se" = NA)
for (res_idx in 1:nrow(all_res)) {
cat(res_idx, " ")
subobj <- make_ps_est(surv_df, post_cal_obj, type = "cov_ill")
survey_res <- survey::svymean(~both_pos, subobj)
all_res[res_idx, ] <- c("est" = as.numeric(survey_res),
"se" = survey_res %>% survey::SE())
}
saveRDS(all_res, file = "mult_imp_sens.rds")
}
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000
hist(100 * all_res$est, xlab = "Percent positive estimate", main = "")
Here we see that while these imputations to introduce some variability in the percent positive estimate, this variability never results in an estimate that is much more than a tenth of a percent away from the original estimate.
hist(100 * all_res$se, xlab = "Standad error", main = "")
Again we see that while these imputations have little effect on the estimated standard error of the percent positive estimate. Here the range of standard errors is less than five hundredths of a percent.
While the diagnostic tests do have reported sensitivities, and specificities, there is some uncertainty associated with these values. Here we seek to understand how much variability in our point estimate would be possible due to these factors. This is done using two approaches:
Considering the four test accuracy measures (sens and spec for both tests), we choose values (either 5% or 95% quantiles) that would give the highest estimate of prevalence. Thus for both tests the 95% quantile for specificity is chosen (fewer individuals who tested positive are actually negative) and the 5% quantile for sensitivity (more individuals who tested negative are actually positive). These values are then used to calculate the sensitivity and specificity of the combined test, which is subsequently used to estimate the prevalence. An analogous method is used to obtain the low estimate of prevalence, but the small values for specificity and large values for sensitivity are used.
The second approach uses a bootstrap to draw (independently) the four test performance measures and then uses the measures to calculate the performance of the combined test and the prevalence implied from this performance and the estimate of percentage of positive tests. These bootstraps use a normal distribution for sampling these measures. Taking the 5%, 50%, and 95% quantiles of this bootstrap distribution, we calculate the estimate and confidence intervals for these scenarios.
shft <- qnorm(0.95) / qnorm(0.975)
lower_est <- function(x) {x[1] * (0.5 + shft) + x[2] * (0.5 - shft)}
upper_est <- function(x) {x[1] * (0.5 - shft) + x[2] * (0.5 + shft)}
lower_prev_scen <- lapply(test_info, FUN = function(x) {
list("spec" = pmin(1, lower_est(x$spec)),
"sens" = pmin(1, upper_est(x$sens)))
})
upper_prev_scen <- lapply(test_info, FUN = function(x) {
list("spec" = pmin(1, upper_est(x$spec)),
"sens" = pmin(1, lower_est(x$sens)))
} )
lw_p <- combine_tests(t1_spec = lower_prev_scen$first_test$spec,
t1_sens = lower_prev_scen$first_test$sens,
t2_spec = lower_prev_scen$second_test$spec,
t2_sens = lower_prev_scen$second_test$sens)
up_p <- combine_tests(t1_spec = upper_prev_scen$first_test$spec,
t1_sens = upper_prev_scen$first_test$sens,
t2_spec = upper_prev_scen$second_test$spec,
t2_sens = upper_prev_scen$second_test$sens)
look_three_scen <- function(low, mid, high,
scen_names = c(
"Low Specificity, High Sensitivity",
"Best Guess Test Performance",
"High Specificity, Low Sensitivity"
)){
mult_res <- bind_rows(
bind_cols(
Scenario = scen_names[1],
post_adj_est(post_rake_obj, se = low["sens"], sp = low["spec"])),
bind_cols(Scenario = scen_names[2],
post_adj_est(post_rake_obj, se = mid["sens"], sp = mid["spec"])),
bind_cols(Scenario = scen_names[3],
post_adj_est(post_rake_obj,
se = high["sens"], sp = high["spec"])),
)
desn_eff <- survey::svymean(~both_pos, post_rake_obj, deff = "replace") %>%
survey::deff()
mult_res %>% gt::gt() %>% gt::cols_align("center") %>% gt::tab_footnote(
footnote = paste0("The effective sample size is ",
round(nrow(post_rake_obj$variables) / desn_eff),
" for these calculations."),
locations = gt::cells_column_labels(
columns = vars(`Prevalence Estimate (95% CI)`))
)
}
look_three_scen(lw_p, comb_sens_spec, up_p)
Scenario | Prevalence Estimate (95% CI)1 |
---|---|
Low Specificity, High Sensitivity | 3.14 (1.85, 4.84) |
Best Guess Test Performance | 3.91 (2.35, 5.96) |
High Specificity, Low Sensitivity | 5.46 (3.29, 8.31) |
1
The effective sample size is 589 for these calculations.
|
# apply(many_sens_spec, 2, mean)
make_int <- function(test_positivity, sens_spec_df) {
all_res <- matrix(NA, nrow = nrow(sens_spec_df), ncol = 3)
for (b_idx in 1:nrow(all_res)) {
all_res[b_idx, ] <- c(adj_for_sesp(
AP = test_positivity,
Sp = sens_spec_df[b_idx, "spec"],
Se = sens_spec_df[b_idx, "sens"]),
sens_spec_df[b_idx, "spec"],
sens_spec_df[b_idx, "sens"]
)
}
colnames(all_res) <- c("prev", "spec", "sens")
return(all_res)
}
interval <- make_int(as.numeric(survey::svymean(~both_pos, post_rake_obj)),
many_sens_spec)
# hist(100 * interval[, 1], freq = FALSE, main = "Histogram of possible prevalence point estimates", xlab = "Prevalence Estimate (percentage)")
quants <- quantile(x = interval[, 1], c(0.05, 0.5, 0.95), type = 1)
new_df <- interval[match(quants, interval[, 1]), ]
look_three_scen(new_df[1, 2:3], new_df[2, 2:3], new_df[3, 2:3],
c("Lower Prevalence", "Median Guess", "Higher Prevalence"))
Scenario | Prevalence Estimate (95% CI)1 |
---|---|
Lower Prevalence | 3.78 (2.27, 5.78) |
Median Guess | 4.00 (2.40, 6.11) |
Higher Prevalence | 4.24 (2.54, 6.47) |
1
The effective sample size is 589 for these calculations.
|
Above we see that the estimates vary less when considering the range of the combined test than when choosing extreme values for all four test accuracy measures.
After the many adjustments made, the estimated Seroprevalence estimate is:
fin_reslt
Prevalence Estimate (95% CI)1 |
---|
3.91 (2.35, 5.96) |
1
The effective sample size is 589 for these calculations.
|
While the above estimate provides a best guess at the prevalence in King county there are multiple limitations:
The sample was constructed assuming that both a convenience sample and an address-based sample contribute equally to the analysis. This assumption is likely to have resulted in an estimate that is higher than the true prevalence. This is because individuals from the convenience sample were more likely to report being ill with COVID-like illness and more likely to have tested positive due to a COVID-like illness compared to the individuals in the address-based sample. A more careful construction of weights is possible using methods described in Valliant (2020), but would require discarding the weights that were already calculated.
In the calculations of weights, imputations were carried for individuals with missing values for any demographic covariates used to reweight the sample (age, sex, race, Hispanic status, and income). A hot-deck imputation method was used, except for Hispanic status in which all individuals with a missing value were assigned non-Hispanic. These weights were treated as fixed throughout the analysis. If the variability introduced by these imputations were included (by using a multiple imputation method for example) the confidence intervals for prevalence would likely be slightly wider.
Two tests were used to confirm seropositivity for each individual. This analysis treated the two tests as a single test, by assuming that the tests performances were independent conditional on an individuals status. While there is little reason to think this assumption incorrect, if this assumption is incorrect the estimate of prevalence could be biased. Unfortunately checking this assumption is impossible with the current data because all individuals who tested negative on the initial test did not receive a second test.
# initial_df <- readxl::read_excel("../02_initial_resp_dat_preimp/initial_resp_dat.xlsx")
# imputed_df <- readxl::read_excel(
# "../03_from_mansour_postimp/KC_Out_1.Weights from Mansour.xlsx",
# sheet = 2)
# sub_inital <- match(unique(initial_df$hhid), initial_df$hhid)
# initial_df <- initial_df[sub_inital, ]
#
# # here are all caps
# imputed_df$hhid <- paste0(imputed_df$HOUSEHOLD, "-", imputed_df$INDIVIDUAL)
# sub_imputed <- match(unique(imputed_df$hhid), imputed_df$hhid)
# imputed_df <- imputed_df[sub_imputed, ]
#
# cmbd_dfs <- left_join(imputed_df, initial_df, by = "hhid")
#
# cmbd_dfs$pre_imp_agecat <- agrp_nms[cut(cmbd_dfs$ageyrs, agrp_lvs,
# include.lowest = TRUE, labels = FALSE)]
# cmbd_dfs$INCOME <- cmbd_dfs$INCOME - 1
Below are given crosstabs of the values pre- and post-imputation for each variable used for the construction of sampling weights. Rows correspond to the post-imputation values and columns corresponded to the pre-imputation values.
## Compare Sex
surv_df$sex <- c("M", "F")[surv_df$sex]
knitr::kable(table(surv_df[, c("sex", "sex")], useNA = "al"))
F | M | NA | |
---|---|---|---|
F | 714 | 0 | 0 |
M | 0 | 650 | 0 |
NA | 0 | 0 | 0 |
knitr::kable(table(surv_df[, c("RACECAT", "race1")], useNA = "al"))
A | AW | B | N | NW | P | R | U | W | NA | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 2 | 0 | 0 | 1 | 0 | 77 | 31 | 840 | 2 |
2 | 0 | 0 | 76 | 0 | 0 | 0 | 5 | 1 | 0 | 0 |
3 | 254 | 0 | 0 | 0 | 0 | 0 | 15 | 9 | 0 | 1 |
4 | 0 | 0 | 0 | 23 | 0 | 18 | 8 | 1 | 0 | 0 |
NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
level_key <- c("1" = "White", "2" = "Black", "3" = "Asian", "4" = "Other")
surv_df$RACECAT_with_NA <- recode(
surv_df$race1, .default = NA_real_, .missing = NA_real_,
A = 3, AW = 1, B = 2, N = 4, P = 4, NW = 1,
W = 1, R = NA_real_, U = NA_real_,
)
surv_df$RACECAT <- recode(surv_df$RACECAT, !!!level_key)
surv_df$RACECAT_with_NA <- recode(surv_df$RACECAT_with_NA, !!!level_key)
knitr::kable(table(surv_df[, c("RACECAT", "RACECAT_with_NA")], useNA = "al"))
Asian | Black | Other | White | NA | |
---|---|---|---|---|---|
Asian | 254 | 0 | 0 | 0 | 25 |
Black | 0 | 76 | 0 | 0 | 6 |
Other | 0 | 0 | 41 | 0 | 9 |
White | 0 | 0 | 0 | 843 | 110 |
NA | 0 | 0 | 0 | 0 | 0 |
inc_key <- c("< $20,000", "$20,000 - $40,000", "$40,001 - $60,000",
"$60,001 - $80,000", "$80,001 - $100,000",
"$100,001 - $120,000", "$120,001+")
names(inc_key) <- 1:7
surv_df$INCOME <- factor(x = surv_df$INCOME, levels = 1:7,
labels = inc_key)
surv_df$INCOME_with_NA <- surv_df$INCOME
surv_df$INCOME_with_NA[which(is.na(surv_df$binary_income_with_NA))] <- NA
knitr::kable(table(surv_df[, c("INCOME", "INCOME_with_NA")], useNA = "al"))
< $20,000 | $20,000 - $40,000 | $40,001 - $60,000 | $60,001 - $80,000 | $80,001 - $100,000 | $100,001 - $120,000 | $120,001+ | NA | |
---|---|---|---|---|---|---|---|---|
< $20,000 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
$20,000 - $40,000 | 0 | 101 | 0 | 0 | 0 | 0 | 0 | 10 |
$40,001 - $60,000 | 0 | 0 | 113 | 0 | 0 | 0 | 0 | 8 |
$60,001 - $80,000 | 0 | 0 | 0 | 135 | 0 | 0 | 0 | 18 |
$80,001 - $100,000 | 0 | 0 | 0 | 0 | 129 | 0 | 0 | 15 |
$100,001 - $120,000 | 0 | 0 | 0 | 0 | 0 | 141 | 0 | 16 |
$120,001+ | 0 | 0 | 0 | 0 | 0 | 0 | 571 | 51 |
NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
agrp_lvs <- c(0, 12, 17, 24, 34, 44, 64,100)
agrp_nms <- c("0≤age≤12",
paste0(agrp_lvs[-c(1, length(agrp_lvs))],
"<age≤",
agrp_lvs[-(1:2)]))
surv_df$AGECAT <- agrp_nms[surv_df$AGECAT]
surv_df$pre_imp_agecat <- surv_df$AGECAT
surv_df$pre_imp_agecat[which(is.na(surv_df$agegrp2_with_NA))] <- NA
knitr::kable(table(surv_df[, c("AGECAT", "pre_imp_agecat")], useNA = "al"))
0≤age≤12 | 12<age≤17 | 17<age≤24 | 24<age≤34 | 34<age≤44 | 44<age≤64 | 64<age≤100 | NA | |
---|---|---|---|---|---|---|---|---|
0≤age≤12 | 127 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
12<age≤17 | 0 | 62 | 0 | 0 | 0 | 0 | 0 | 4 |
17<age≤24 | 0 | 0 | 77 | 0 | 0 | 0 | 0 | 2 |
24<age≤34 | 0 | 0 | 0 | 213 | 0 | 0 | 0 | 6 |
34<age≤44 | 0 | 0 | 0 | 0 | 263 | 0 | 0 | 6 |
44<age≤64 | 0 | 0 | 0 | 0 | 0 | 381 | 0 | 10 |
64<age≤100 | 0 | 0 | 0 | 0 | 0 | 0 | 198 | 8 |
NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
knitr::kable(table(surv_df[, c("HISPANIC", "HISPANIC_with_NA")], useNA = "al"))
1 | 2 | NA | |
---|---|---|---|
1 | 98 | 0 | 0 |
2 | 0 | 645 | 621 |
NA | 0 | 0 | 0 |
prc_ms <- function(x) round(mean(is.na(x)) * 100, 2)
percent_missing <- apply(
surv_df[, c("sex", "pre_imp_agecat", "INCOME_with_NA", "RACECAT_with_NA",
"HISPANIC_with_NA")],
2, prc_ms)
fin_df <- tibble("Variable" = c("Sex", "Age Category", "Income",
"Race", "Ethnicity"),
"Percent Imputed" = percent_missing)
gt::gt(fin_df)
Variable | Percent Imputed |
---|---|
Sex | 0.00 |
Age Category | 3.15 |
Income | 9.09 |
Race | 11.00 |
Ethnicity | 45.53 |
make_table <- function(impt_col, orig_col, my_df = surv_df, cnm = NULL){
RACECAT <- enexpr(impt_col)
col_race <- enexpr(orig_col)
race_sum <- my_df %>% group_by(!!RACECAT) %>%
summarise("n" = n(),
"Count Imputed" = sum(is.na(!!col_race)),
"Percent Imputed" = prc_ms(!!col_race)) %>%
ungroup() %>%
mutate("Percentage of Sample" = round(100 * n/sum(n), 2)) %>%
rename("Count All" = n)
tots <- race_sum %>% summarise(
name = "Total",
"Count All" = sum(`Count All`),
"Count Imputed" = sum(`Count Imputed`),
"Percent Imputed" = round(
100 * sum(`Count Imputed`) / sum(`Count All`), 2),
"Percentage of Sample" = 100)
colnames(tots)[1] <- colnames(race_sum)[1]
all_inf <- rbind(race_sum, tots)
all_inf <- all_inf[, c(1, 2, 5, 3, 4)]
if (!is.null(cnm)) {colnames(all_inf)[1] <- cnm}
gt::gt(all_inf) %>%
gt::tab_style(
style = list(
gt::cell_text(weight = "bold")
),
locations = gt::cells_body(
rows = nrow(all_inf)
)
)
}
These tables show the per-covariate proportion of values that were imputed. As an example, if 90 individuals in the original sample indicated being between 17 and 24 years old, but after imputation there are 100 individuals who are classified as being between 17 and 24 years old, then the percentage of final values that were imputed would be \((100 - 90) / 100 = 10\%\).
make_table(sex, sex, cnm = "Sex")
Sex | Count All | Percentage of Sample | Count Imputed | Percent Imputed |
---|---|---|---|---|
F | 714 | 52.35 | 0 | 0 |
M | 650 | 47.65 | 0 | 0 |
Total | 1364 | 100.00 | 0 | 0 |
make_table(RACECAT, RACECAT_with_NA, cnm = "Race")
Race | Count All | Percentage of Sample | Count Imputed | Percent Imputed |
---|---|---|---|---|
Asian | 279 | 20.45 | 25 | 8.96 |
Black | 82 | 6.01 | 6 | 7.32 |
Other | 50 | 3.67 | 9 | 18.00 |
White | 953 | 69.87 | 110 | 11.54 |
Total | 1364 | 100.00 | 150 | 11.00 |
These are the income levels used to create weights for the analysis.
make_table(INCOME, INCOME_with_NA, cnm = "Income")
Income | Count All | Percentage of Sample | Count Imputed | Percent Imputed |
---|---|---|---|---|
< $20,000 | 56 | 4.11 | 6 | 10.71 |
$20,000 - $40,000 | 111 | 8.14 | 10 | 9.01 |
$40,001 - $60,000 | 121 | 8.87 | 8 | 6.61 |
$60,001 - $80,000 | 153 | 11.22 | 18 | 11.76 |
$80,001 - $100,000 | 144 | 10.56 | 15 | 10.42 |
$100,001 - $120,000 | 157 | 11.51 | 16 | 10.19 |
$120,001+ | 622 | 45.60 | 51 | 8.20 |
Total | 1364 | 100.00 | 124 | 9.09 |
These are the income levels used for the subgroup analysis.
make_table(binary_income, binary_income_with_NA, surv_df, "Binary Income")
Binary Income | Count All | Percentage of Sample | Count Imputed | Percent Imputed |
---|---|---|---|---|
Above 100,000 | 779 | 57.11 | 67 | 8.60 |
At or below 100,000 | 585 | 42.89 | 57 | 9.74 |
Total | 1364 | 100.00 | 124 | 9.09 |
These are the age groups used to create weights for the analysis.
make_table(AGECAT, pre_imp_agecat, cnm = "Age Group")
Age Group | Count All | Percentage of Sample | Count Imputed | Percent Imputed |
---|---|---|---|---|
0≤age≤12 | 134 | 9.82 | 7 | 5.22 |
12<age≤17 | 66 | 4.84 | 4 | 6.06 |
17<age≤24 | 79 | 5.79 | 2 | 2.53 |
24<age≤34 | 219 | 16.06 | 6 | 2.74 |
34<age≤44 | 269 | 19.72 | 6 | 2.23 |
44<age≤64 | 391 | 28.67 | 10 | 2.56 |
64<age≤100 | 206 | 15.10 | 8 | 3.88 |
Total | 1364 | 100.00 | 43 | 3.15 |
These are the age groups used for the subgroup analysis.
make_table(agegrp2, agegrp2_with_NA, surv_df, "Age Group")
Age Group | Count All | Percentage of Sample | Count Imputed | Percent Imputed |
---|---|---|---|---|
0≤age<16 | 171 | 12.54 | 9 | 5.26 |
16≤age<25 | 108 | 7.92 | 4 | 3.70 |
25≤age<45 | 488 | 35.78 | 12 | 2.46 |
45≤age<65 | 391 | 28.67 | 10 | 2.56 |
65≤age<100 | 206 | 15.10 | 8 | 3.88 |
Total | 1364 | 100.00 | 43 | 3.15 |
make_table(HISPANIC, HISPANIC_with_NA, cnm = "Ethnicity")
Ethnicity | Count All | Percentage of Sample | Count Imputed | Percent Imputed |
---|---|---|---|---|
1 | 98 | 7.18 | 0 | 0.00 |
2 | 1266 | 92.82 | 621 | 49.05 |
Total | 1364 | 100.00 | 621 | 45.53 |