---
title: "Simulations of Mixed-Type Clinical Data"
author: "Cat Coombes"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: yeti
    highlight: kate
    toc: true
---

## Introduction
This document contains the code used to simulate 32,400 diverse, clinically realistic data sets for assessing clustering algorithms. The core functionality employs the Umpire R-package, a tool developed to simulate "correlated blocks" (i.e. clusters) of continuous data.[(Zhang 2012)[1] Although originally designed for gene expression data, the fundamental concept can be more generally applied: Controlled generation of continuous data with known cluster identities and complex noise with multivariate distribution, with the possibility of generating survival data. The functions employed in this document to adapt Umpire to meaningfully simulate clinical data have been incorporated into a user-friendly re-release of the Umpire package (Version 2.0 and later) and are freely available on CRAN.[2] 

```{r}
library(Umpire)
```

## Functions
### Cancer Model and Cancer Engine: The Base Unit of Simulated Clustered Data in Umpire
To create clustered structure within our data, we simulate cluster identities by building a _cancer model_ with a given number of subtypes (i.e. clusters). _nPossible_ sets the number of features we believe to be relevant to defining cluster identities, which we arbitrarily assume to be 1/3 of the features. _nPattern_ defines the number of _k_ clusters. Altering the _prevalence_ parameter allows us to create clusters of unequal size.

Using _BlockHyperParameters_, we create blocks of correlated data to mimic correlations between features in real clinical data for more realistic modeling of clusters. Umpire implements a "hit function" (Zhang 2012) to generate correlated blocks with individual and population heterogeneity, which we adapt for small feature spaces. A generalizable hit function has been implemented in Umpire 2.0.

#### Prevalence and Cluster Size
We wish the prevalence of clusters to vary, with two basic types orientations: simulations where clusters are of approximately equal size and simulations where clusters are of distinctly unequal size. We create a function to sample prevalence from a Dirichlet distribution. More in-depth tests are conducted in ClusterSizes-Prevalence.Rmd.

```{r Prevalence}
Prevalence <- function(k){
    if(k<=8){
      a = 10
      prev <- rdirichlet(1, rep(a, k))
      as.numeric(prev)
    } else if (k>=16){
      my.min=-1
      w = 5
      while(my.min<0.01){
      alphas <- c(rep(w*1,k/4), rep(w*2,k/4), rep(w*4,k/4), rep(w*8,k/4))
      prev <- rdirichlet(1, c(alphas))
      my.min <- min(prev)
      } # end while loop
      as.numeric(prev)
    } # end if/else cluster size for unequal weights
} # end function
```

#### Cancer Engine
The final product of the _CancerModel_ and associated block correlations  is a _CancerEngine_.
```{r EngineBuilder}
EngineBuilder <- function(p, f, k){
          if(f == 9){
            hitfn = function(n) 2
            nextra = 0
            npos = 9
            mebs = 1
            mibs = 1
          } else if(f == 27) {
            hitfn = function(n) 2
            nextra = 0
            npos = 9
            mebs = 3
            mibs = 3
          } else if(f == 81) {
            hitfn = function(n) 5
            nextra = 7
            npos = 20
            mebs = 3
            mibs = 3
          } else if(f == 243) {
            hitfn = function(n) 5
            nextra = 7
            npos = 20
            mebs = 9
            mibs = 9
          }
          mod <- CancerModel(name="Cluster Simulation Model",
                   nPossible=npos, #number of possible 'hits' based on multi-hit theory of cancer, where smaller numbers are well separated
                   nPattern=k, #number of subtypes
                   HIT = hitfn,
                   prevalence = Prevalence(k)
                   )
          bHyp <- BlockHyperParameters(nExtraBlocks=nextra, #how many features do you want that are not related to the clusters? 
                  meanBlockSize=mebs,
                  sigmaBlockSize= 0,
                  minBlockSize= mibs,
                  mu0=6,
                  sigma0=1.5,
                  rate=28.11,
                  shape=44.25,
                  p.cor=0.6,
                  wt.cor=5
                  )
          eng <- makeBlockStructure(mod, bHyp)
}
```

The _Cancer_Engine_ can be used to simulate a matrix of numeric data.

All data are noisy, and clinical data are no exception. Umpire allows us to generate noisy data with additive and multiplicative components based on supplied parameters. We create a ClinicalNoiseModel to apply clinically meaningful noise to our simulations.
```{r ClinicalNoiseModel}
ClinicalNoiseModel <- function(shape=1.02, scale=0.05/shape) {
  new("NoiseModel",
      additiveOffset=0,
      additiveScale=rgamma(1, shape=shape, scale=scale),
      multiplicativeScale=0
)}
Noisy <- function(sim){
          mod <- ClinicalNoiseModel()
          noi <- blur(mod, sim)
}
```

### Binning Binary Data from Continuous Simulated Data
We use the _BimodalIndex_ package to identify cutpoints when binning. The standalone _bimodalIndex_ is "a continuous measure of the extent to which a set of (inivariate) data fits a two-component mixture model," returning a larger score for components that are balanced in size or widely separated. _bimodalIndex_ calculates row-wise, so we transpose to calculate the index and perform the associated transformations.

_BimodalIndex_ returns various statistics for each feature in a bimodal distribution, with means _mu1_ and _mu2_, standardized distance _delta_ between them, and the percentage _pi_ of samples in group 1. With at least 50 samples, a sample with _BI_ greater than 1.1 can be said to be bimodal. 

To convert continuous data to binary, first we randomly sample _b_ vectors which we wish to convert to binary. For these with _BI_ > 1.1, we split the vector into values _0_ and _1_ at the midpoint between _mu1_ and _mu2_. For those with _BI_ < 1.1, we randomly sample a percentile between 65% and 95%, cutting across the smaller of the two distributions, to set the cutoffs. In both cases, we assign 0 or 1 arbitrarily, selecting one at random and assigning it to the lower bound, and the other accordingly.
```{r MakeBinary}
library(BimodalIndex)

MakeBinary <- function(bi, bin, ftNum){
               #Randomly order 1 and 0
               samp <- sample(c(0,1), replace=FALSE)
               lower <- samp[1]
               upper <- samp[2]
               
               
              if(bi[ftNum,"BI"] > my.bi){
                bound <- (bi[ftNum,"mu1"]+bi[ftNum,"mu2"])/2
                bin[ftNum,] <- ifelse(bin[ftNum,]<bound, lower, upper)
              } else {
                bound <- sample(5:35, 1)
                bound <- bound*0.01
                low.cutoff <- quantile(bin[ftNum,], bound)
                high.cutoff <- quantile(bin[ftNum,], 1-bound)
             
                if(bi[ftNum,"pi"]<0.5){
                  bin[ftNum,] <- ifelse(bin[ftNum,]<low.cutoff, lower, upper)
                } else {
                  bin[ftNum,] <- ifelse(bin[ftNum,]<high.cutoff, lower, upper)
                } # end inner if/else (bi[ftNum, "pi"]<0.5)
              } # end outer if/else (bi[ftNum, "BI] > my.bi)
              
               
              if(sum(bin[ftNum,]) == 0){
                pickTwenty <- sample(1:length(bin[ftNum,]), length(bin[ftNum,])*0.2)
                bin[ftNum, pickTwenty] <- 1
              }
              if(sum(bin[ftNum,]) == length(bin[ftNum,])){
                pickTwenty <- sample(1:length(bin[ftNum,]), length(bin[ftNum,])*0.2)
                bin[ftNum, pickTwenty] <- 0              
              }
               
              bin
} # end function

LabelBinary <- function(bin, ftNum, Type){
              per1 <- sum(bin[ftNum,])/length(bin[ftNum,])
              Type <- ifelse(per1>0.9 | per1<0.1, "asymm", "symm")
              Type
}
```

## Binning Categorical Data from Continuous Data
```{r MakeCategories}
# Determine if the feature is nominal or ordinal, and store that information in a vector.
catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal


MakeCategories <- function(bin, ftNum, nPts, catType){

# Sample a number of categories between 3 and 9.
  cc <- sample(3:9, 1, replace=TRUE)

# Sample bin sizes from the Dirichlet distribution
  a <- 20
  r <- rdirichlet(1, rep(a, cc))

# Construct a vector of percentage cutoffs
  cuts <- round(c(0, nPts*cumsum(r)))
  
# create ids
  if(catType == 1){
    id <- 1:cc
  } else if (catType == 2){
    id <- sample(cc)
  }

# bin it
  M <- cut(rank(bin[ftNum,]), breaks=cuts, labels=FALSE)
  bin[ftNum,] <- id[M]

bin
}#close the function   


LabelCategories <- function(catType){
  if(catType == 1){#ordinal
    Type <- "ordinal"
  } else if(catType == 2){#nominal
    Type <- "nominal"
  }
  Type
}
```

## A Function to Allocate Features into Groups by Data Type
```{r Allocator}
Allocator <- function(cont, bina, cat, nFeats){ #percentages or fractions of each data type
  numCont <- nFeats*cont
  numBin <- nFeats*bina
  numCat <- nFeats*cat
  feats <- sample(1:nFeats, nFeats, replace=FALSE)
  contFeats <- feats[1:numCont]
  binFeats <- feats[(numCont+1):(numCont+numBin)]
  catFeats <- feats[(numCont+numBin+1):(numCont+numBin+numCat)]
  x <- cbind(contFeats, binFeats, catFeats)
  x <- as.data.frame(x)
}#close function
```

## Simulations
### Setup and Parameters
```{r seed}
set.seed(194718)
```

We set the cutoff for the binary index to 1.1, following _bimodalIndex_ package description.

We use the following parameters:
```{r Parameters}
nPts <- c(200, 800, 3200)
nFeats = c(9, 27, 81, 243)
kClust = c(2, 6, 16)
DataTypes = c("continuousControl", "binaryControl", "categoricalMixed", "nominal", "ordinal", 
              "balanced", "binaryMostly", "categoricalMostly", "continuousMostly")
nReps <- 1:3

my.bi <-1.1

nSims = length(nPts) * length(nFeats) * length(kClust) * length(DataTypes) * max(nReps)
nSims
```

We make a matrix of all possible combinations to store information and track progress.
```{r Test-Combinations}
require(utils)
tt <- expand.grid(nPts, nFeats, kClust, DataTypes, nReps,
            keep.out.attrs=FALSE, stringsAsFactors=FALSE)
tt$keep.out.attrs <- NULL
colnames(tt) <- c("nPts", "nFeats", "kClust", "DataTypes", "nReps")
nSims <- length(nPts) * length(nFeats) * length(kClust) * length(DataTypes) * length(nReps)
tt <- cbind(tt,
            FileName = as.character(rep(NA), nSims),
            stringsAsFactors=FALSE)
head(tt)
```

Define a location for saving.
```{r SavePath}
#simloc <- "file path"
```

### Simulation Generation
```{r Simulations, eval=FALSE}
for(p in 1:length(nPts)){
  for(f in 1:length(nFeats)){
    for(k in 1:length(kClust)){
      for(d in 1:length(DataTypes)){
         for(r in 1:length(nReps)){
          
location <- tt$nPts==nPts[p] & 
            tt$nFeats==nFeats[f] & 
            tt$kClust==kClust[k] & 
            tt$DataTypes==DataTypes[d] & 
            tt$nReps==nReps[r]

#### Check for the file
fn <- paste("dat", nPts[p], nFeats[f], kClust[k], DataTypes[d], nReps[r], sep="_")
fn <- paste(fn, "rda", sep=".")
fullfn <- file.path(simloc, fn)

##################################################################
####Raw Data

#Build an Engine
eng <- EngineBuilder(p=nPts[p], f=nFeats[f], k=kClust[k])

#Simulate Data
sim <- rand(eng,nPts[p])

#Add noise
noi <- Noisy(sim$data)

#################################################################
####Binned Data

#Set up for binning
bin <- noi
bi <- bimodalIndex(bin, verbose=TRUE) #Calculate the Bimodal Index
Type <- as.character(rep("continuous", nFeats[f]))

#Bin the data according to the supplied DataType
      if(DataTypes[d] == "binaryControl"){
            for(j in 1:nFeats[f]){
            bin <- MakeBinary(bi, bin, j)
            Type[j] <- LabelBinary(bin, j, Type)
            }
      } else if(DataTypes[d] == "ordinal") {
            for(j in 1:nFeats[f]){
            catType <- 1 #1 is ordinal, 2 is nominal
            bin <- MakeCategories(bin=bin, ftNum=j, nPts=nPts[p], catType)
            Type[j] <- LabelCategories(catType)
            }
      } else if(DataTypes[d] == "nominal") {
            for(j in 1:nFeats[f]){
            catType <- 2 #1 is ordinal, 2 is nominal
            bin <- MakeCategories(bin=bin, ftNum=j, nPts=nPts[p], catType)
            Type[j] <- LabelCategories(catType)
            }
      } else if(DataTypes[d] == "categoricalMixed") {
            for(j in 1:nFeats[f]){
            catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal
            bin <- MakeCategories(bin=bin, ftNum=j, nPts=nPts[p], catType)
            Type[j] <- LabelCategories(catType)
            }
      } else if(DataTypes[d] == "balanced"){
            asn <- Allocator(cont=1/3, bina=1/3, cat=1/3, nFeats[f])
            for(b in 1:length(asn$binFeats)){
              bin <- MakeBinary(bi=bi, bin=bin, ftNum=asn$binFeats[b])
              Type[asn$binFeats[b]] <- LabelBinary(bin, b, Type)
            }
            for(ca in 1:length(asn$catFeats)){
              catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal
              bin <- MakeCategories(bin=bin, ftNum=asn$catFeats[ca], nPts=nPts[p], catType)
              Type[asn$catFeats[ca]] <- LabelCategories(catType)
            }
      } else if(DataTypes[d] == "continuousMostly"){
            asn <- Allocator(cont=7/9, bina=1/9, cat=1/9, nFeats[f])
            for(b in 1:length(asn$binFeats)){
              bin <- MakeBinary(bi=bi, bin=bin, ftNum=asn$binFeats[b])
              Type[asn$binFeats[b]] <- LabelBinary(bin, b, Type)
            }
            for(ca in 1:length(asn$catFeats)){
              catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal
              bin <- MakeCategories(bin=bin, ftNum=asn$catFeats[ca], nPts=nPts[p], catType)
              Type[asn$catFeats[ca]] <- LabelCategories(catType)
            }
      } else if(DataTypes[d] == "binaryMostly"){
            asn <- Allocator(cont=1/9, bina=7/9, cat=1/9, nFeats[f])
            for(b in 1:length(asn$binFeats)){
              bin <- MakeBinary(bi=bi, bin=bin, ftNum=asn$binFeats[b])
              Type[asn$binFeats[b]] <- LabelBinary(bin, b, Type)
            }
            for(ca in 1:length(asn$catFeats)){
              catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal
              bin <- MakeCategories(bin=bin, ftNum=asn$catFeats[ca], nPts=nPts[p], catType)
              Type[asn$catFeats[ca]] <- LabelCategories(catType)
            }
      } else if(DataTypes[d] == "categoricalMostly"){
            asn <- Allocator(cont=1/9, bina=1/9, cat=7/9, nFeats[f])
            for(b in 1:length(asn$binFeats)){
              bin <- MakeBinary(bi=bi, bin=bin, ftNum=asn$binFeats[b])
              Type[asn$binFeats[b]] <- LabelBinary(bin, b, Type)
            }
            for(ca in 1:length(asn$catFeats)){
              catType <- sample(c(1,2), 1, replace=TRUE) #1 is ordinal, 2 is nominal
              bin <- MakeCategories(bin=bin, ftNum=asn$catFeats[ca], nPts=nPts[p], catType)
              Type[asn$catFeats[ca]] <- LabelCategories(catType)
            }
      } #end if/else binning statement
      
#Save the data as a list
facts <- c(nPts=nPts[p], nFeats=nFeats[f], kClust=kClust[k], DataTypes=DataTypes[d], Repeat=nReps[r])
fullsim <- list(facts=facts, FileName = fn, rawsim = sim$data, clustID = sim$clinical, noisesim = noi, bin = bin, datType = Type)
save(fullsim, file=fullfn)

tt$FileName[location] <- fn

#Save the tracker table
save(tt, file=file.path(simloc, "trackerTable.rda"))

print(fn) #tracking output

}}}}} #end opening for statements (p, f, k, d, r)
```

## References
1. Zhang, Roebuck, Coombes. _Simulating Gene Expression Data To Estimate Sample Size For Class and Biomarker Discovery_. International Journal on Advanced in Life Sciences, vol 4, no 1&2, 2012.

2. Coombes 2020. Umpire paper.

3. Zhang, Coombes. _Sources of variation in false discovery rate estimation include sample size, correlation, and inherent differences between groups._ BMC Bioinformatics 2012, 13(Suppl 13):S1.