--- title: "Simulations of Mixed-Type Clinical Data" author: "Cat Coombes" date: "`r Sys.Date()`" output: html_document: theme: yeti highlight: kate toc: true --- ## Introduction This document contains the code used to simulate 32,400 diverse, clinically realistic data sets for assessing clustering algorithms. The core functionality employs the Umpire R-package, a tool developed to simulate "correlated blocks" (i.e. clusters) of continuous data.[(Zhang 2012)[1] Although originally designed for gene expression data, the fundamental concept can be more generally applied: Controlled generation of continuous data with known cluster identities and complex noise with multivariate distribution, with the possibility of generating survival data. The functions employed in this document to adapt Umpire to meaningfully simulate clinical data have been incorporated into a user-friendly re-release of the Umpire package (Version 2.0 and later) and are freely available on CRAN.[2] ```{r} library(Umpire) ``` ## Functions ### Cancer Model and Cancer Engine: The Base Unit of Simulated Clustered Data in Umpire To create clustered structure within our data, we simulate cluster identities by building a _cancer model_ with a given number of subtypes (i.e. clusters). _nPossible_ sets the number of features we believe to be relevant to defining cluster identities, which we arbitrarily assume to be 1/3 of the features. _nPattern_ defines the number of _k_ clusters. Altering the _prevalence_ parameter allows us to create clusters of unequal size. Using _BlockHyperParameters_, we create blocks of correlated data to mimic correlations between features in real clinical data for more realistic modeling of clusters. Umpire implements a "hit function" (Zhang 2012) to generate correlated blocks with individual and population heterogeneity, which we adapt for small feature spaces. A generalizable hit function has been implemented in Umpire 2.0. #### Prevalence and Cluster Size We wish the prevalence of clusters to vary, with two basic types orientations: simulations where clusters are of approximately equal size and simulations where clusters are of distinctly unequal size. We create a function to sample prevalence from a Dirichlet distribution. More in-depth tests are conducted in ClusterSizes-Prevalence.Rmd. ```{r Prevalence} Prevalence <- function(k){ if(k<=8){ a = 10 prev <- rdirichlet(1, rep(a, k)) as.numeric(prev) } else if (k>=16){ my.min=-1 w = 5 while(my.min<0.01){ alphas <- c(rep(w*1,k/4), rep(w*2,k/4), rep(w*4,k/4), rep(w*8,k/4)) prev <- rdirichlet(1, c(alphas)) my.min <- min(prev) } # end while loop as.numeric(prev) } # end if/else cluster size for unequal weights } # end function ``` #### Cancer Engine The final product of the _CancerModel_ and associated block correlations is a _CancerEngine_. ```{r EngineBuilder} EngineBuilder <- function(p, f, k){ if(f == 9){ hitfn = function(n) 2 nextra = 0 npos = 9 mebs = 1 mibs = 1 } else if(f == 27) { hitfn = function(n) 2 nextra = 0 npos = 9 mebs = 3 mibs = 3 } else if(f == 81) { hitfn = function(n) 5 nextra = 7 npos = 20 mebs = 3 mibs = 3 } else if(f == 243) { hitfn = function(n) 5 nextra = 7 npos = 20 mebs = 9 mibs = 9 } mod <- CancerModel(name="Cluster Simulation Model", nPossible=npos, #number of possible 'hits' based on multi-hit theory of cancer, where smaller numbers are well separated nPattern=k, #number of subtypes HIT = hitfn, prevalence = Prevalence(k) ) bHyp <- BlockHyperParameters(nExtraBlocks=nextra, #how many features do you want that are not related to the clusters? meanBlockSize=mebs, sigmaBlockSize= 0, minBlockSize= mibs, mu0=6, sigma0=1.5, rate=28.11, shape=44.25, p.cor=0.6, wt.cor=5 ) eng <- makeBlockStructure(mod, bHyp) } ``` The _Cancer_Engine_ can be used to simulate a matrix of numeric data. All data are noisy, and clinical data are no exception. Umpire allows us to generate noisy data with additive and multiplicative components based on supplied parameters. We create a ClinicalNoiseModel to apply clinically meaningful noise to our simulations. ```{r ClinicalNoiseModel} ClinicalNoiseModel <- function(shape=1.02, scale=0.05/shape) { new("NoiseModel", additiveOffset=0, additiveScale=rgamma(1, shape=shape, scale=scale), multiplicativeScale=0 )} Noisy <- function(sim){ mod <- ClinicalNoiseModel() noi <- blur(mod, sim) } ``` ### Binning Binary Data from Continuous Simulated Data We use the _BimodalIndex_ package to identify cutpoints when binning. The standalone _bimodalIndex_ is "a continuous measure of the extent to which a set of (inivariate) data fits a two-component mixture model," returning a larger score for components that are balanced in size or widely separated. _bimodalIndex_ calculates row-wise, so we transpose to calculate the index and perform the associated transformations. _BimodalIndex_ returns various statistics for each feature in a bimodal distribution, with means _mu1_ and _mu2_, standardized distance _delta_ between them, and the percentage _pi_ of samples in group 1. With at least 50 samples, a sample with _BI_ greater than 1.1 can be said to be bimodal. To convert continuous data to binary, first we randomly sample _b_ vectors which we wish to convert to binary. For these with _BI_ > 1.1, we split the vector into values _0_ and _1_ at the midpoint between _mu1_ and _mu2_. For those with _BI_ < 1.1, we randomly sample a percentile between 65% and 95%, cutting across the smaller of the two distributions, to set the cutoffs. In both cases, we assign 0 or 1 arbitrarily, selecting one at random and assigning it to the lower bound, and the other accordingly. ```{r MakeBinary} library(BimodalIndex) MakeBinary <- function(bi, bin, ftNum){ #Randomly order 1 and 0 samp <- sample(c(0,1), replace=FALSE) lower <- samp[1] upper <- samp[2] if(bi[ftNum,"BI"] > my.bi){ bound <- (bi[ftNum,"mu1"]+bi[ftNum,"mu2"])/2 bin[ftNum,] <- ifelse(bin[ftNum,] my.bi) if(sum(bin[ftNum,]) == 0){ pickTwenty <- sample(1:length(bin[ftNum,]), length(bin[ftNum,])*0.2) bin[ftNum, pickTwenty] <- 1 } if(sum(bin[ftNum,]) == length(bin[ftNum,])){ pickTwenty <- sample(1:length(bin[ftNum,]), length(bin[ftNum,])*0.2) bin[ftNum, pickTwenty] <- 0 } bin } # end function LabelBinary <- function(bin, ftNum, Type){ per1 <- sum(bin[ftNum,])/length(bin[ftNum,]) Type <- ifelse(per1>0.9 | per1<0.1, "asymm", "symm") Type } ``` ## Binning Categorical Data from Continuous Data ```{r MakeCategories} # Determine if the feature is nominal or ordinal, and store that information in a vector. catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal MakeCategories <- function(bin, ftNum, nPts, catType){ # Sample a number of categories between 3 and 9. cc <- sample(3:9, 1, replace=TRUE) # Sample bin sizes from the Dirichlet distribution a <- 20 r <- rdirichlet(1, rep(a, cc)) # Construct a vector of percentage cutoffs cuts <- round(c(0, nPts*cumsum(r))) # create ids if(catType == 1){ id <- 1:cc } else if (catType == 2){ id <- sample(cc) } # bin it M <- cut(rank(bin[ftNum,]), breaks=cuts, labels=FALSE) bin[ftNum,] <- id[M] bin }#close the function LabelCategories <- function(catType){ if(catType == 1){#ordinal Type <- "ordinal" } else if(catType == 2){#nominal Type <- "nominal" } Type } ``` ## A Function to Allocate Features into Groups by Data Type ```{r Allocator} Allocator <- function(cont, bina, cat, nFeats){ #percentages or fractions of each data type numCont <- nFeats*cont numBin <- nFeats*bina numCat <- nFeats*cat feats <- sample(1:nFeats, nFeats, replace=FALSE) contFeats <- feats[1:numCont] binFeats <- feats[(numCont+1):(numCont+numBin)] catFeats <- feats[(numCont+numBin+1):(numCont+numBin+numCat)] x <- cbind(contFeats, binFeats, catFeats) x <- as.data.frame(x) }#close function ``` ## Simulations ### Setup and Parameters ```{r seed} set.seed(194718) ``` We set the cutoff for the binary index to 1.1, following _bimodalIndex_ package description. We use the following parameters: ```{r Parameters} nPts <- c(200, 800, 3200) nFeats = c(9, 27, 81, 243) kClust = c(2, 6, 16) DataTypes = c("continuousControl", "binaryControl", "categoricalMixed", "nominal", "ordinal", "balanced", "binaryMostly", "categoricalMostly", "continuousMostly") nReps <- 1:3 my.bi <-1.1 nSims = length(nPts) * length(nFeats) * length(kClust) * length(DataTypes) * max(nReps) nSims ``` We make a matrix of all possible combinations to store information and track progress. ```{r Test-Combinations} require(utils) tt <- expand.grid(nPts, nFeats, kClust, DataTypes, nReps, keep.out.attrs=FALSE, stringsAsFactors=FALSE) tt$keep.out.attrs <- NULL colnames(tt) <- c("nPts", "nFeats", "kClust", "DataTypes", "nReps") nSims <- length(nPts) * length(nFeats) * length(kClust) * length(DataTypes) * length(nReps) tt <- cbind(tt, FileName = as.character(rep(NA), nSims), stringsAsFactors=FALSE) head(tt) ``` Define a location for saving. ```{r SavePath} #simloc <- "file path" ``` ### Simulation Generation ```{r Simulations, eval=FALSE} for(p in 1:length(nPts)){ for(f in 1:length(nFeats)){ for(k in 1:length(kClust)){ for(d in 1:length(DataTypes)){ for(r in 1:length(nReps)){ location <- tt$nPts==nPts[p] & tt$nFeats==nFeats[f] & tt$kClust==kClust[k] & tt$DataTypes==DataTypes[d] & tt$nReps==nReps[r] #### Check for the file fn <- paste("dat", nPts[p], nFeats[f], kClust[k], DataTypes[d], nReps[r], sep="_") fn <- paste(fn, "rda", sep=".") fullfn <- file.path(simloc, fn) ################################################################## ####Raw Data #Build an Engine eng <- EngineBuilder(p=nPts[p], f=nFeats[f], k=kClust[k]) #Simulate Data sim <- rand(eng,nPts[p]) #Add noise noi <- Noisy(sim$data) ################################################################# ####Binned Data #Set up for binning bin <- noi bi <- bimodalIndex(bin, verbose=TRUE) #Calculate the Bimodal Index Type <- as.character(rep("continuous", nFeats[f])) #Bin the data according to the supplied DataType if(DataTypes[d] == "binaryControl"){ for(j in 1:nFeats[f]){ bin <- MakeBinary(bi, bin, j) Type[j] <- LabelBinary(bin, j, Type) } } else if(DataTypes[d] == "ordinal") { for(j in 1:nFeats[f]){ catType <- 1 #1 is ordinal, 2 is nominal bin <- MakeCategories(bin=bin, ftNum=j, nPts=nPts[p], catType) Type[j] <- LabelCategories(catType) } } else if(DataTypes[d] == "nominal") { for(j in 1:nFeats[f]){ catType <- 2 #1 is ordinal, 2 is nominal bin <- MakeCategories(bin=bin, ftNum=j, nPts=nPts[p], catType) Type[j] <- LabelCategories(catType) } } else if(DataTypes[d] == "categoricalMixed") { for(j in 1:nFeats[f]){ catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal bin <- MakeCategories(bin=bin, ftNum=j, nPts=nPts[p], catType) Type[j] <- LabelCategories(catType) } } else if(DataTypes[d] == "balanced"){ asn <- Allocator(cont=1/3, bina=1/3, cat=1/3, nFeats[f]) for(b in 1:length(asn$binFeats)){ bin <- MakeBinary(bi=bi, bin=bin, ftNum=asn$binFeats[b]) Type[asn$binFeats[b]] <- LabelBinary(bin, b, Type) } for(ca in 1:length(asn$catFeats)){ catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal bin <- MakeCategories(bin=bin, ftNum=asn$catFeats[ca], nPts=nPts[p], catType) Type[asn$catFeats[ca]] <- LabelCategories(catType) } } else if(DataTypes[d] == "continuousMostly"){ asn <- Allocator(cont=7/9, bina=1/9, cat=1/9, nFeats[f]) for(b in 1:length(asn$binFeats)){ bin <- MakeBinary(bi=bi, bin=bin, ftNum=asn$binFeats[b]) Type[asn$binFeats[b]] <- LabelBinary(bin, b, Type) } for(ca in 1:length(asn$catFeats)){ catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal bin <- MakeCategories(bin=bin, ftNum=asn$catFeats[ca], nPts=nPts[p], catType) Type[asn$catFeats[ca]] <- LabelCategories(catType) } } else if(DataTypes[d] == "binaryMostly"){ asn <- Allocator(cont=1/9, bina=7/9, cat=1/9, nFeats[f]) for(b in 1:length(asn$binFeats)){ bin <- MakeBinary(bi=bi, bin=bin, ftNum=asn$binFeats[b]) Type[asn$binFeats[b]] <- LabelBinary(bin, b, Type) } for(ca in 1:length(asn$catFeats)){ catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal bin <- MakeCategories(bin=bin, ftNum=asn$catFeats[ca], nPts=nPts[p], catType) Type[asn$catFeats[ca]] <- LabelCategories(catType) } } else if(DataTypes[d] == "categoricalMostly"){ asn <- Allocator(cont=1/9, bina=1/9, cat=7/9, nFeats[f]) for(b in 1:length(asn$binFeats)){ bin <- MakeBinary(bi=bi, bin=bin, ftNum=asn$binFeats[b]) Type[asn$binFeats[b]] <- LabelBinary(bin, b, Type) } for(ca in 1:length(asn$catFeats)){ catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal bin <- MakeCategories(bin=bin, ftNum=asn$catFeats[ca], nPts=nPts[p], catType) Type[asn$catFeats[ca]] <- LabelCategories(catType) } } #end if/else binning statement #Save the data as a list facts <- c(nPts=nPts[p], nFeats=nFeats[f], kClust=kClust[k], DataTypes=DataTypes[d], Repeat=nReps[r]) fullsim <- list(facts=facts, FileName = fn, rawsim = sim$data, clustID = sim$clinical, noisesim = noi, bin = bin, datType = Type) save(fullsim, file=fullfn) tt$FileName[location] <- fn #Save the tracker table save(tt, file=file.path(simloc, "trackerTable.rda")) print(fn) #tracking output }}}}} #end opening for statements (p, f, k, d, r) ``` ## References 1. Zhang, Roebuck, Coombes. _Simulating Gene Expression Data To Estimate Sample Size For Class and Biomarker Discovery_. International Journal on Advanced in Life Sciences, vol 4, no 1&2, 2012. 2. Coombes 2020. Umpire paper. 3. Zhang, Coombes. _Sources of variation in false discovery rate estimation include sample size, correlation, and inherent differences between groups._ BMC Bioinformatics 2012, 13(Suppl 13):S1.