---
title: "CESIR - simulation study"
output:
  html_document:
    highlight: haddock
    theme: readable
---

* 10,000 observations (5,000 cases, 5,000 controls)
    + CESIR: ~50,000 observations
    + chosen to be smaller than CESIR so that there are fewer variables (e.g. when doing boxplots)
* 200 observations per variable => 50 variables
    + CESIR: 217 obs. per variable

```{r observations}
n <- 10000    # total sample size (cases + controls)
p <- n/200    # number of candidate risk factors (50)
```

* Prevalences will range from 3% (1 in 33.3) to 0.005% (1 in 20,000)
    + CESIR: similar prevalences

```{r prevalences}
# We choose prevalences from 0.005% to 3% evenly on a log scale (which corresponds to the lognormal distribution in CESIR):
prevalence.sim <- exp(seq(from = log(0.03), to = log(0.00005), length.out = p))
prevalence.sim <- round(prevalence.sim, 5)
hist(x = prevalence.sim, breaks = 5, las = 1, xlab = "Prevalence",
     main = "Distribution of \n risk factor prevalences")
# Stack one point per variable on top of its histogram bar; the relevant variables get a larger, filled triangle
points(x = prevalence.sim[50:1],
       y = c(rep(36, 36), rep(5, 5), rep(3, 3), rep(2, 2), rep(2, 2), rep(2, 2)),
       pch = c(2, 17)[1 + 1:50 %in% c(1, 13, 26, 38, 50)],
       cex = c(1, 2)[1 + 1:50 %in% c(1, 13, 26, 38, 50)])
legend("topright", pch = c(17, 2), legend = c("relevant variable", "irrelevant variable"),
       bty = "n", pt.cex = c(2, 1))
```

* 10% of risk factors are relevant
    + In the CESIR data, about 10% of risk factors had a p-value less than 0.05 (in a logistic model)

```{r rel.index}
# From the 50 potential risk factors, we choose 5 relevant variables, evenly spaced from most frequent to most sparse
rel.index <- round(seq(from = 1, to = p, length.out = p*0.10))
rel.index
```

* low correlation of risk factors
    + In the CESIR data, only a few drugs were correlated, so the assumption is reasonable

```{r risk.factors, cache = TRUE}
# The matrix of common probabilities:
mat <- outer(X = prevalence.sim, Y = prevalence.sim)
mat[lower.tri(mat)] <- NA # blank the lower triangle; it is mirrored from the upper triangle below
# set.seed(1) # the seed should be set in run_simulation.R
# Perturb the joint probabilities by a random factor 2^N(0, 0.25) to induce weak correlations
mat <- mat * 2 ^ rnorm(n = length(mat), mean = 0, sd = 0.25)
mat[lower.tri(mat)] <- t(mat)[lower.tri(mat)] # to make matrix symmetric
diag(mat) <- prevalence.sim
library(bindata)
# We first generate a large population (i.e. the whole country's population) of people in a road traffic accident
n.large <- n*2
X.large <- rmvbin(n = n.large, commonprob = mat)
```

* Baseline risk (intercept) for the population will be 0, i.e. a baseline probability of 0.5
    + this was also observed in the CESIR database (Avalos et al. Epidem 2012)
    + of all the drivers in a road traffic accident with another person, half of them will be responsible
    + if the accident involves only one driver, it's also fair to assume that the driver is responsible half the time
* The risk factors will have an absolute effect size of 1 (on the log-odds scale)
    + CESIR: most effect sizes range from -1 to 1
* of the 5 relevant risk factors, 3 will be negative and 2 will be positive
    + CESIR: of the 23 significant risk factors, 17 have a negative coefficient and 6 have a positive coefficient
* unobserved confounder z with prevalence 1%
    + Assumption by Avalos et al. (Stat Med 2012)
    + effect size chosen to give a signal-to-noise ratio of 3 (Avalos et al., Stat Med 2012)

```{r outcome}
beta <- c(-1, 1, -1, 1, -1)
# Variance of the linear predictor contributed by the relevant risk factors (covariances ignored)
Var.Xb <- sum(apply(X.large[, rel.index], 2, var) * beta ^ 2)
q <- 2 # number of unobserved confounders
library(bindata)
# q binary confounders, each drawn with probability 0.5 (note: the text above states a prevalence of 1%)
z.large <- matrix(rbinom(n = n.large*q, size = 1, prob = 0.5), nrow = n.large, ncol = q)
# s.n.ratio <- Var.Xb/Var.Z
s.n.ratio <- 3
# Solve Var.Xb/Var.Z = s.n.ratio for the common absolute confounder effect
gamma <- sqrt(Var.Xb/sum(apply(z.large, 2, var) * s.n.ratio))
gamma <- rep(x = gamma * c(1, -1), length.out = q) # alternate the signs
Var.Z <- sum(apply(z.large, 2, var) * gamma ^ 2)
eta <- X.large[, rel.index] %*% beta + z.large %*% gamma # linear predictor, intercept 0
y.large <- rbinom(n = n.large, size = 1, prob = binomial()$linkinv(eta)) # inverse logit
table(y.large)
```

```{r generate.cohort}
# From the large population, we now create an index of the first 5000 cases (index.1) and the first 5000 controls (index.0):
stopifnot(sum(y.large == 1) >= n/2, sum(y.large == 0) >= n/2) # otherwise the indices below would contain NA
index.1 <- which(y.large == 1)[1:(n/2)]
index.0 <- which(y.large == 0)[1:(n/2)]
X1 <- X.large[index.1, ]
X0 <- X.large[index.0, ]
X <- rbind(X1, X0)
dim(X)
# z.large has q columns, so stack the rows (as for X); c() would flatten the matrix and misalign z with y
z <- rbind(z.large[index.1, , drop = FALSE], z.large[index.0, , drop = FALSE])
y <- c(rep(x = 1, times = n/2), rep(x = 0, times = n/2))
table(y)
```

```{r housekeeping}
rm(X.large, X0, X1, index.0, index.1, n.large, y.large, z.large)
```
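
The confounder effect in the `outcome` chunk is obtained by solving `Var.Xb / Var.Z = s.n.ratio` for a common absolute effect, which gives `gamma = sqrt(Var.Xb / (s.n.ratio * sum(var(z_j))))`. The block below is a minimal, self-contained sketch of that calculation on toy data, not part of the simulation itself. Every name ending in `.demo`, the seed, the sample size and the five prevalences are assumptions made for illustration, and the risk factors are drawn independently with `rbinom` rather than with `rmvbin`. It is shown as a plain (non-evaluated) block so that it does not touch the random number stream of the study.

```r
# Toy check that the gamma formula reproduces the target signal-to-noise ratio.
# All *.demo objects are hypothetical and independent of the study objects.
set.seed(42)                                      # illustrative seed only
n.demo      <- 2000
beta.demo   <- c(-1, 1, -1, 1, -1)                # same sign pattern as beta
prev.demo   <- c(0.03, 0.01, 0.005, 0.002, 0.001) # assumed prevalences
target.demo <- 3                                  # desired Var(Xb) / Var(Zg)

# Independent binary risk factors, one column per prevalence
X.demo <- sapply(prev.demo, function(pr) rbinom(n = n.demo, size = 1, prob = pr))
# Two binary confounders drawn with probability 0.5, as in the outcome chunk
z.demo <- matrix(rbinom(n = n.demo * 2, size = 1, prob = 0.5), ncol = 2)

# Signal variance, ignoring covariances between columns (as in the outcome chunk)
var.xb.demo <- sum(apply(X.demo, 2, var) * beta.demo ^ 2)

# Common absolute confounder effect, then alternating signs
gamma.abs.demo <- sqrt(var.xb.demo / (target.demo * sum(apply(z.demo, 2, var))))
gamma.demo     <- gamma.abs.demo * c(1, -1)

# Noise variance and the resulting ratio
var.z.demo <- sum(apply(z.demo, 2, var) * gamma.demo ^ 2)
ratio.demo <- var.xb.demo / var.z.demo
ratio.demo
```

Because the squared effect cancels the sign, `ratio.demo` equals `target.demo` up to floating-point error whatever the sampled data are, as long as at least one risk factor and one confounder column have non-zero variance.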