Author manuscript; available in PMC: 2021 May 1.
Published in final edited form as: J Biomed Inform. 2020 Mar 12;105:103408. doi: 10.1016/j.jbi.2020.103408

Empirically-Derived Synthetic Populations to Mitigate Small Sample Sizes

Erin E Fowler a, Anders Berglund b, Michael J Schell b, Thomas A Sellers, Steven Eschrich b, John Heine a,*
PMCID: PMC7839232  NIHMSID: NIHMS1587157  PMID: 32173502

Abstract

Limited sample sizes can lead to spurious modeling findings in biomedical research. The objective of this work is to present a new method to generate synthetic populations (SPs) from limited samples using matched case-control data (n = 180 pairs), considered as two separate limited samples. SPs were generated with multivariate kernel density estimations (KDEs) with unconstrained bandwidth matrices. We included four continuous variables and one categorical variable for each individual. Bandwidth matrices were determined with Differential Evolution (DE) optimization by covariance comparisons. Four synthetic samples (n = 180) were derived from their respective SPs. Observed and synthetic samples were compared under the assumption that their empirical probability density functions (EPDFs) were similar. EPDFs were compared with the maximum mean discrepancy (MMD) test statistic based on the Kernel Two-Sample Test. To evaluate similarity within a modeling context, EPDFs derived from the Principal Component Analysis (PCA) scores and residuals were summarized with the distance to the model in X-space (DModX) as additional comparisons.

Four SPs were generated from each sample. The probability of selecting a replicate when randomly constructing synthetic samples (n = 180) was infinitesimally small. MMD tests indicated that the observed sample EPDFs were similar to the respective synthetic EPDFs. For the samples, PCA scores and residuals did not deviate significantly when compared with their respective synthetic samples.

The feasibility of this approach was demonstrated by producing synthetic data at the individual level, statistically similar to the observed samples. The methodology coupled KDE with DE optimization and deployed novel similarity metrics derived from PCA. This approach could be used to generate larger-sized synthetic samples. To develop this approach into a research tool for data exploration purposes, additional evaluation with increased dimensionality is required. Moreover, given a fully specified population, the degree to which individuals can be discarded while synthesizing the respective population accurately will be investigated. When these objectives are addressed, comparisons with other techniques such as bootstrapping will be required for a complete evaluation.

Keywords: synthetic data generation, kernel density estimation, overfitting, Differential Evolution, Principal Component Analysis, DModX

1. Background

Data are often limited in biomedical research. There are numerous reasons for limited sample sizes including low disease-incidence rates, rare diseases [1, 2], underserved/underrepresented subpopulations [3], the inability to share data across institutions due to privacy concerns [1], large-dimensionality relative to the number of patient samples [as is the case in omics research] [4, 5], and limited timeframes for data collection, often due to grant funding cycles coupled with low disease incidence. These represent major barriers that can hinder adoption of decision models in clinical practice. Generating synthetic data for healthcare related studies offers a solution to some of these barriers [6–9]. Our work differs from this related work in healthcare as it specifically addresses the inadequately sampled data problem (i.e. limited sample size). Describing a sample as limited is a subjective determination by researchers where the number of observations in a given sample does not provide a sufficiently rich set of realizations such that resampling methods can be used to provide robust simulations.

One application for generating synthetic populations (SPs) includes data modeling. Before applying a decision model in the clinical setting, potential models are typically explored during a discovery phase and then validated. The discovery phase includes analyzing samples from the respective populations to find measures (patient variables) related to the desired clinical endpoint, creating a model, and optimizing the model. Once the model parameters are fixed, the model can be evaluated with new data without further modeling adjustments. Both stages of development are dependent upon having access to adequately sized samples, which is often the major bottleneck. Inadequate sample sizes can lead to either false positive discoveries due to overfitting [10] or to false negative results due to reduced power. Automated variable selection, pretesting potential predictor variables, and dichotomizing continuous variables are all contributing factors to spurious findings [11]. The extent to which triaged variables and analyses are discussed in the literature is limited. Harrell [12] proposed a rough rule of thumb to decrease the risk of overfitting when modeling data. Briefly, for binary data, the number of variables (more aptly degrees of freedom) considered when model building should be in the L/10 range at most, where L is the sample size of the smaller group, and it includes all variables examined even if they did not make it into the specified model [13]. Others have suggested a lower threshold of L/5 may be sufficient [14]. It is important to note that these guidelines may be necessary but may not be generally sufficient for a given model to generalize with independent data. If the sample does not represent the respective population, model generalization may not follow.

There are several techniques currently available to address limited sample conditions when modeling assuming that the sample is representative of its respective population. Bootstrapping [15] is a highly successful statistical resampling method often used when access to multiple samples is limited, but the method has noted deficiencies when applied to small sample sizes [16]. When using the bootstrap for modeling, one sample can be used to specify the model parameters, and other samples can be used to evaluate the model’s performance. A major concern is that a bootstrap sample only uses values seen in the sample. Cross-validation is another technique that can be used to evaluate a model’s performance. This technique tests the model’s performance on sample(s) that were not used for training. Leave-one-out cross-validation produces a stringent test that underestimates the performance but has a drawback; in some instances removing one sample from the dataset will not produce enough variation [10]. More generally, cross-validation can be highly variable [17]. Cross-validation depends on repeatedly re-training to estimate the error rates. As such, it does not result in a specific set of model parameters, but rather parameters for each step of cross-validation. Another technique includes splitting the sample into training and validation subsets. Splitting the sample provides for independent replication but is dependent upon a dataset that is large enough to separate. As a potential alternative approach, we will borrow from SP generation methods [18] and explore generating synthetic patient data.

Synthetic population generation methods are frequently used in policy planning studies [19], such as land use and transportation research [20–22] and for healthcare applications [6–9]. In the planning studies, often the objective is to construct the probability density function (PDF) for the population of interest (target) given the aggregated data [marginal empirical PDF (EPDF) or tabulated data] and a sample of disaggregated EPDF with specified constraints. Established approaches [18, 19, 23] include deterministic re-weighting techniques and methods that use some form of stochastic component in the synthesis. Iterative proportional fitting (IPF) is a deterministic re-weighting method that is a well-tried approach [24] based on iteratively adjusting contingency tables relative to the given constraints as detailed by Lomax and Norman [25]. Although IPF extends to any number of dimensions [26], there are noted efficiency problems for large dimensionality [27]. The limitations with IPF are discussed in detail elsewhere [22, 25, 27]. The conditional probability approach uses the observed sample to estimate the target population by randomly sampling one variable at a time. This approach becomes cumbersome for a larger number of variables, and the outcome is dependent upon how the constraints are introduced (i.e. ordered) into the stochastic PDF reconstruction [23]. Healthcare applications include generating samples from large populations [6–9]. These approaches include hidden Markov models [8], techniques that reconstruct time series data coupled with sampling the EPDF of the relevant variables [6] and methods that estimate the PDF from the data, not accounting for variable correlation [9].

In this report, a new methodology for generating SPs is presented with initial findings to demonstrate proof-of-concept. We have related longer-term objectives for developing this method that complement the limited sample problem: (1) investigating the lower sample size bound for synthesizing a population with a given set of covariates from a limited sample. That is, given a fully specified population, an estimation of the minimum sample size (determined by systematically discarding individuals) that permits synthesizing an accurate population should be understood; and (2) evaluating whether it is possible to use SPs to both develop a model [including variable selection and model selection and specification] and then validate the respective model. Note, the first objective is our primary research endeavor because understanding the synthesis limit takes precedence over the second objective. Our intended applications for SP generation include exploring various models for risk prediction and disease classification, and investigating relationships between covariates. For example, both neural network (NN) and kernel based partial least squares regression modeling can require extensive training when there are many adjustable (free) parameters as illustrated in our previous work in breast and lung cancer [28–30]. Applying such models with many parameters to a limited sample may not be productive without either using a resampling technique or enhancing the sample size synthetically. The notion of a limited sample is relative. One sample size may be considered limited in one situation, such as the case for NNs with many degrees of freedom, while being sufficient for a given parametric model. To address such modeling problems, the lower sample size bound that permits accurate population synthesis should be understood.

2. Methods

2.1. Overview

A brief outline of our approach is provided, followed by an in-depth description of its interconnected components. We define an SP as a population of conceptually unlimited individuals (realizations) that is inferred from an observed sample. The case and control groups are samples from distinct populations (i.e. case and control populations) with 180 individuals in each sample. Synthetic samples are defined as 180 realizations selected from respective case or control SP. Histograms (univariate and multivariate) are referenced as EPDFs in this report. The boundary conditions differ from some of the approaches discussed above. We start with an observed sample, which is generally limited (with respect to the hypothetical population), with the objective of constructing the SP from which synthetic samples can be constructed. We use limited to imply a given EPDF has many missing variable combinations (i.e. missing function values), representing missing individuals (not a missing variable for a given individual). A given sample is used to synthesize the respective PDF corresponding to the SP. The PDF is scaled and normalized to create the respective SP. This synthesis is based on the assumption that this PDF is a smooth continuous hypersurface. Synthetic samples were constructed randomly from the SP. Multivariate kernel density estimation with an unconstrained bandwidth matrix was used to synthesize these PDFs and their respective SPs. The kernel has parameters that are arranged in a bandwidth matrix within its argument. These parameters control the amount of smoothing in the PDF estimation and are determined with an optimization procedure. Differential Evolution (DE) optimization [31] was used to determine the bandwidth parameters for the kernel density estimation by comparing the covariance matrix elements of the observed samples with that of the synthetic samples. 
The optimization process operates on another population of parameter vectors that remains fixed in size (defined below). The components of these parameter vectors are the bandwidth matrix elements arranged in vector form. The optimization takes place over generations where the parameter vector population evolves, becoming more fit with respect to the objective function (i.e. covariance comparisons), but remaining fixed in size. The optimization mechanism induces random mutations to occur within the parameter vector population. Within a given generation, parameter vector pairs compete to move to the next generation as shown in Figure 1. For a given competition, an SP is generated for each parameter vector and the respective synthetic samples are generated; a covariance matrix is generated for each synthetic sample and compared with the sample’s covariance matrix; the more-fit vector moves to the next generation as illustrated in Figure 1. Parameter vectors move into the optimization process from one generation (left side of Figure 1) and possibly exit the optimization (right side of Figure 1) to the next generation. Multiple forms of analyses were used to thoroughly evaluate the similarity between the observed sample and synthetic samples after the optimization process terminated: (1) the maximum mean discrepancy (MMD) metric with the Kernel Two-Sample Test [32] was used to compare the EPDFs; (2) covariance matrices were compared; and (3) to evaluate this approach within a modeling context, novel similarity metrics derived from Principal Component Analysis (PCA) were employed to compare the multivariate structure.

Figure 1.

Synthetic Sample Generation Flow Diagram: This shows the steps for generating a synthetic sample. Differential evolution (DE) optimization is a cyclic process that determines the optimal H. The parameter vectors are outputted by DE (left) and the more fit vector moves to the next DE generation (right).

2.2. Sample and Synthetic Data

Observed samples were from a matched case-control study (n = 180 pairs) comprising women who underwent mammography at Moffitt Cancer Center. Cases and controls were considered as two populations and the observed samples (case-sample and control-sample) were treated as two samples from two different PDFs. Women with first time unilateral breast cancer defined the case-sample. The control-sample included women without a history of breast cancer. Controls were individually matched to breast cancer cases by age (±2 years), and other factors not relevant to this work. The data accrual and nuances of the matching were discussed previously [33–35]. Although this matched case-control data was constructed and investigated for epidemiologic purposes previously, we used it to explore this SP generation methodology in a more realistic situation, where the epidemiologic context is irrelevant for this report. For feasibility, we selected these variables for each individual: (1) age measured in years [yr] ranging from 30–90; (2) height measured in inches [in] ranging from 54–78; (3) mass [kg] ranging from 39–136; (4) breast density measure (percentage), referred to as PD, ranging from 0–80; and (5) menopausal status [MS]. Measures 1–4 are integer variables and measure 5, corresponding to MS, is a binary categorical variable with MS = 0 for menopausal women and MS = 1 otherwise; in the expressions below, λ was used for the respective MS index for brevity. The dynamic range for the integer variables is defined as $n_i$ = maximum value − minimum value + 1, giving $[n_1, n_2, n_3, n_4] = [61, 25, 98, 81]$, respectively. The maximum number of unique variable combinations in a given SP is given by $R_T = n_1 \times n_2 \times n_3 \times n_4 = 12{,}105{,}450$. In the case and control samples, 142 and 139 women were menopausal, respectively.
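The dynamic-range arithmetic above can be checked with a short sketch. The dictionary keys and the `dynamic_range` helper are illustrative names, not part of the study. The height range is taken as 54–78 in, which is the range consistent with the stated $n_2 = 25$ and with the stated product.

```python
from math import prod

# Inclusive (min, max) ranges of the four integer variables.
# Height is taken as 54-78 in, the range consistent with n2 = 25.
ranges = {
    "age_yr": (30, 90),
    "height_in": (54, 78),
    "mass_kg": (39, 136),
    "pd_percent": (0, 80),
}


def dynamic_range(lo, hi):
    """Number of integer values between lo and hi, both included."""
    return hi - lo + 1


n = [dynamic_range(lo, hi) for lo, hi in ranges.values()]
r_t = prod(n)  # maximum number of unique variable combinations

print(n)    # dynamic ranges n1..n4
print(r_t)  # R_T
```

Only a small fraction of these combinations can appear in a sample of 180 individuals, which is the sense in which the observed EPDF has many missing function values.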

The variables selected for this report are important for breast cancer risk prediction modeling studies [33, 36]. In particular, age and breast density are established breast cancer risk factors; an increase in either quantity implies greater risk. Body mass index (BMI) is another strong risk factor derived from dividing an individual’s mass measured in kilograms by their height squared, where height is measured in meters. We separated BMI into its factors because they provide a more precise description of an individual. Moreover, these variables have interactions that must be accounted for in risk prediction modeling schemes. For example, as age increases breast density generally decreases. BMI is also negatively correlated with breast density, and MS is related to age. These relationships are discussed in detail elsewhere [37–39]. We selected MS because it is a categorical variable, making the variable set diverse. Categorical variables are often encountered in epidemiologic studies. Synthetic population generation methods should be able to accommodate variables of this type. We used inch as the height unit because larger units (meters or feet) were too coarse, whereas a smaller unit, such as cm, was too resolved.

Synthetic samples were generated in a manner similar to the way the observed samples were established. Because controls were matched to cases (not selected randomly), we generated two SPs from which synthetic samples could be constructed. For a given SP, $s_{0\mu}(\mathbf{x})$, we considered the four integer variables as components of a vector referred to as $\mathbf{x}$, and generated the conditional PDF based on MS: $s_{0\mu}(\mathbf{x} \mid \lambda)$. In this expression the index, μ, defined the case (μ = 1) and control (μ = 0) status, and the subscript, 0, indicated the population before scaling and normalization (discussed in the Synthetic Sample Construction section) were applied to form a given SP. The general PDF expression for both populations can be expressed explicitly in one equation by considering the two related conditional PDFs and the Law of total probability with the respective probability weights expressed as

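$$s_{0\mu}(\mathbf{x}) = s_{0\mu}(\mathbf{x} \mid 0) \times a_{\mu 0} + s_{0\mu}(\mathbf{x} \mid 1) \times a_{\mu 1}, \qquad \text{Eq. [1]}$$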

where $a_{\mu\lambda}$ are the respective weights derived for the corresponding samples by considering the proportions derived from the samples: $a_{00} = 0.773$ and $a_{01} = 1.0 - a_{00}$, derived from the control-sample proportions, and $a_{10} = 0.788$ and $a_{11} = 1.0 - a_{10}$, derived from the case-sample proportions. Expressing the total PDF in two components avoids the problem of estimating the covariance between continuous and binary variables in the optimization procedure. A synthetic sample was produced by drawing 180 realizations at random from the respective SP using the methods discussed below. We note that in the modeling situation, a given synthetic-control realization must be matched to its case. This can be achieved by drawing a random realization from a restricted region of $s_{00}$ corresponding to the matching variables. In this report, synthetic case and control samples were generated and evaluated separately, forgoing the matching necessity. As indicated in Eq. [1], there are two subsamples for each observed sample based on MS that correspond with two hypothetical subpopulations. The number of individuals for the respective subsample is defined by $N = 180 \times a_{\mu\lambda}$. These subsamples were used to generate their respective synthetic subpopulations, referred to as components below.
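As a minimal sketch of how the weights in Eq. [1] relate to the reported counts, the snippet below derives them from the numbers of menopausal women (MS = 0) stated earlier and draws MS labels with those probabilities. The function names and the random seed are assumptions made for illustration; the computed proportions agree with the reported 0.773 and 0.788 to within about 0.001.

```python
import numpy as np

N_SAMPLE = 180
# Numbers of menopausal (MS = 0) women reported for each observed sample.
menopausal_counts = {"control": 139, "case": 142}


def ms_weights(n_menopausal, n_total=N_SAMPLE):
    """Return (a_mu0, a_mu1): the MS = 0 proportion and its complement."""
    a0 = n_menopausal / n_total
    return a0, 1.0 - a0


def draw_components(a0, size, rng):
    """Draw MS labels; label 0 is selected with probability a0."""
    return (rng.random(size) >= a0).astype(int)


rng = np.random.default_rng(0)  # arbitrary seed for a repeatable illustration
for group, count in menopausal_counts.items():
    a0, a1 = ms_weights(count)
    labels = draw_components(a0, N_SAMPLE, rng)
    print(group, round(a0, 3), round(a1, 3), int((labels == 0).sum()))
```

Selecting the component first and then drawing from the conditional PDF is one way to realize the weighted sum in Eq. [1] without adding the two components explicitly.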

2.3. Synthetic Population Density Estimation

The SP generation starts with using a multivariate kernel to estimate the underlying synthetic PDF for each sample. In this estimation, we have assumed that there is an underlying smooth hypothetical PDF (and respective population). We defined an m-component column vector for the ith individual (an element from a given sample), $\mathbf{x}_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im}) = [x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im}]^T$. In this report, m = 4. The components of $\mathbf{x}_i$ are $x_{ij}$ and are sometimes referred to as features or predictor variables. We defined the prospective m-component column vector as $\mathbf{x}$ with components $x_j$, which is a synthetic entity. Note a specific $\mathbf{x}$ or $\mathbf{x}_i$ references a specific set of four variables. A given sample assumes the role of the training dataset for the respective SP generation. For motivational purposes, we first illustrate the simpler formalism. It may be shown [40] that the estimated PDF for a sampled population with N individuals for the constrained bandwidth estimation is given by this expression

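$$g(\mathbf{x}) = \frac{1}{c_x N} \sum_{i=1}^{N} \exp\left[-\sum_{j=1}^{m} \frac{(x_j - x_{ij})^2}{2\sigma_j^2}\right] = \frac{1}{N} \sum_{i=1}^{N} k(\mathbf{x}), \qquad \text{Eq. [2]}$$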

where $k(\mathbf{x})$ is the kernel, we have assumed independent realizations, $c_x$ is a multivariate normalization constant, and $\sigma_j$ are the bandwidth parameters. We used a Gaussian kernel because it generates the smoothest PDF estimate [41], and its form lends itself to include variable interactions easily, although there are many valid kernel choices [42, 43]. This operation is performed for each $\mathbf{x}$. Equation [2] is too restrictive for our application because it lacks variable interaction. The above expression can be generalized to capture the relevant covariance structure [41] giving the unconstrained bandwidth expression

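$$g(\mathbf{x}) = \frac{1}{c_x N} \sum_{i=1}^{N} \exp\left(-\frac{1}{2} [\mathbf{x} - \mathbf{x}_i]^T \mathbf{H}^{-1} [\mathbf{x} - \mathbf{x}_i]\right) = \frac{1}{N} \sum_{i=1}^{N} k_H(\mathbf{x}), \qquad \text{Eq. [3]}$$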

where $k_H(\mathbf{x})$ is the full bandwidth kernel. We have included the normalization constants in Eqs. [2 & 3] for completeness, although they were dropped in the analyses. $\mathbf{H}^{-1}$ is the bandwidth matrix, which is symmetric and positive definite. These conditions on $\mathbf{H}^{-1}$ ensure each term in the sum provides a valid probability density increment. We note that the performance of the density estimation depends heavily upon the bandwidth choice, whereas the kernel choice is not crucial [44]. If $\mathbf{H}^{-1}$ is diagonal, indicating the lack of correlation between the variables, the above expression reduces to Eq. [2]. Initial experimentation showed we could drop the 1/2 scale factor in the argument of Eq. [3] to obtain a crude estimate of $\mathbf{H}$. Using Eq. [3] with Eq. [1], the population PDF expression that includes the subpopulations for both cases and controls is expressed compactly as

$$s_{0\mu}(\mathbf{x}) = g_{\mu 0}(\mathbf{x}) \times a_{\mu 0} + g_{\mu 1}(\mathbf{x}) \times a_{\mu 1}. \qquad \text{Eq. [4]}$$

Thus, the optimization problem requires two estimations per synthetic population due to the MS splitting.
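The full-bandwidth estimate of Eq. [3] can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the authors' code: the function name, the toy data, and the bandwidth values are hypothetical, and the normalization constant is dropped, as the text notes was done in the analyses.

```python
import numpy as np


def full_bandwidth_kde(x, sample, h_inv):
    """Unnormalized Gaussian kernel estimate of Eq. [3].

    x      : (P, m) points at which the estimate is evaluated
    sample : (N, m) observed individuals x_i
    h_inv  : (m, m) symmetric positive-definite matrix H^{-1}
    """
    diff = x[:, None, :] - sample[None, :, :]               # shape (P, N, m)
    quad = np.einsum("pni,ij,pnj->pn", diff, h_inv, diff)   # quadratic forms
    return np.exp(-0.5 * quad).mean(axis=1)                 # average over the N individuals


# Toy two-variable illustration (not the study data).
sample = np.array([[0.0, 0.0], [1.0, 2.0]])
x = np.array([[0.5, 1.0]])
sigma = np.array([1.0, 2.0])

# A diagonal H^{-1} reproduces the constrained form of Eq. [2].
diagonal = full_bandwidth_kde(x, sample, np.diag(1.0 / sigma**2))
eq2 = np.mean(np.exp(-np.sum((x[0] - sample) ** 2 / (2.0 * sigma**2), axis=1)))
print(diagonal[0], eq2)

# Off-diagonal terms introduce the variable interaction that Eq. [2] lacks.
h_inv_full = np.array([[1.0, -0.3], [-0.3, 0.25]])
print(full_bandwidth_kde(x, sample, h_inv_full)[0])
```

Evaluating the estimate on every integer combination of the four variables would use the same function with `x` set to the full grid of combinations.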

2.4. Bandwidth Matrix Determination

The bandwidth parameters affect both the degree of smoothing and orientation of the PDF. Selecting the optimal bandwidth parameters for unconstrained $\mathbf{H}^{-1}$ is critical and non-trivial. In particular, bandwidth selection for the unconstrained multivariate problem is the subject of ongoing research [45, 46]. DE was used as the optimization method based on covariance comparisons for each subpopulation. Within a given generation, parameter vector pairs compete to go to the next DE generation (Figure 1). The following procedure was used to construct each synthetic subpopulation separately from the respective subsamples as depicted in Figure 1 (see central portion). The N × m design matrix for a given subsample was defined as $\mathbf{X}_m$ with elements $x_{ij}$, where the rows of $\mathbf{X}_m$ contain $\mathbf{x}_i^T$. Mean centering the columns of $\mathbf{X}_m$ gives $\mathbf{X}$. The covariance matrix for a given subsample is expressed as

$$\mathbf{C} = \frac{1}{N-1} \mathbf{X}^T \mathbf{X}, \qquad \text{Eq. [5]}$$

where $\mathbf{C}$ is an m × m matrix and N is the number of individuals in a given subsample. The covariance matrix of a random sample of N individuals drawn from a given subpopulation is defined as $\mathbf{C}^{syn}$, using the same operation in Eq. [5]. For the optimization procedure, we used the normalized absolute difference matrix $\mathbf{E}$, with elements $e_{ij} = \left| \frac{c_{ij} - c_{ij}^{syn}}{c_{ij}} \right| \times 100$, where $\mathbf{E}$ is an m × m matrix. $\mathbf{H}$ is specified by minimizing the difference

$$\varepsilon = \sum_{i=1}^{m} \sum_{j=1}^{i} e_{ij}. \qquad \text{Eq. [6]}$$

The above expression was used as the fitness function for the DE optimization. When two parameter vectors compete, the vector producing the smaller ε moves to the next generation, illustrated in Figure 2.
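A minimal sketch of the covariance comparison in Eqs. [5] and [6] follows. The function names and the random stand-in data are assumptions for illustration; the sketch also assumes that no element of the observed covariance matrix is zero, since each difference is divided by $c_{ij}$.

```python
import numpy as np


def covariance(xm):
    """Eq. [5]: covariance of an N x m design matrix after mean centering."""
    x = xm - xm.mean(axis=0)
    return x.T @ x / (xm.shape[0] - 1)


def fitness(c_obs, c_syn):
    """Eq. [6]: percent absolute differences summed over i >= j.

    Assumes every element of c_obs is nonzero (each term divides by c_ij).
    """
    e = np.abs((c_obs - c_syn) / c_obs) * 100.0
    return float(np.tril(e).sum())  # lower triangle including the diagonal


rng = np.random.default_rng(0)
observed = rng.normal(size=(139, 4))   # stand-in data, not the study subsample
synthetic = rng.normal(size=(139, 4))  # stand-in for one synthetic subsample
print(fitness(covariance(observed), covariance(synthetic)))
```

Because the covariance matrix is symmetric, summing over the lower triangle counts each unique element once, which gives m(m+1)/2 = 10 terms for m = 4.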

Figure 2.

Differential Evolution Trial Competition: This illustrates the basic competition for one trial within a given generation. There are $N_p$ trial competitions per generation. The ith vector from the current generation, $\mathbf{w}_{ig}$, competes with a trial vector, $\mathbf{u}_{ig}$. The trial vector is a mutation of $\mathbf{w}_{ig}$ with attributes derived from the population, $\mathbf{w}_g$, shown in Figure 3. A synthetic population (SP) is generated with each vector. Synthetic samples are constructed from each SP; these are used to make the H comparisons.

A brief description of DE is provided, aided by the schema shown in Figure 1 that feeds to Figure 2. This is an easily implemented stochastic technique, used in our previous work [28, 30, 36], that has advantages in comparison with other evolutionary strategies as outlined in a DE survey [47]. Where possible, we followed the notation and conventions established by the DE founders [31] unless otherwise noted. As such, the variable labeling and indexing introduced for this development are restricted to this section unless noted. This is a global optimization technique that constructs a solution via competition. It starts with a randomly initialized parameter vector population (zeroth generation), where the components of a given parameter vector are indexed by j; the components correspond to the elements of the bandwidth matrix, $\mathbf{H}^{-1}$, defined previously. A uniform random variable (rv) was used to generate these components for the zeroth generation, constrained to ±0.10 × the respective estimated covariance from a given subsample. This DE-population comprised $N_p$ vectors (i.e. the DE parameter vector population size) with $D_n$ components; the population size remains the same for all generations. We refer to the entire DE-population as $\mathbf{w}_g$. The subscript g is the generation index, and its members are labeled as $\mathbf{w}_{ig}$, where i is the member index. For this development, $D_n$ is the number of unique elements in $\mathbf{H}^{-1}$. For a given g, we cycle through i = 1 to $N_p$ ($N_p$ is defined below and is the DE population size equal to the number of vectors) and select $\mathbf{w}_{ig}$ at each increment as illustrated in Figure 2. A mutant vector is constructed by adding a weighted difference between two vectors to a third vector, all randomly selected from $\mathbf{w}_g$, expressed as

$$\mathbf{v}_{ig} = \mathbf{w}_{qg} + F \times (\mathbf{w}_{rg} - \mathbf{w}_{sg}). \qquad \text{Eq. [7]}$$

This construction in Eq. [7] is subject to this condition: q ≠ r ≠ s ≠ i, where q, r, and s are uniformly distributed, [1, $N_p$], integer rvs; this step induces diversity and is depicted in Figure 3. The weight, F, is referred to as the amplification factor and controls the mutation size or effect. The mutant vector is used to construct a trial vector defined as $\mathbf{u}_{ig}$ that competes with $\mathbf{w}_{ig}$ as shown in Figure 2. A trial vector is constructed by permitting the possibility (at random) of crossing components of this mutant vector with the components of the target vector dictated by a crossover rate (CR) shown in Figure 3. The crossover operation occurs at the vector component level and dictates the mutation rate. There are two avenues for the mutation to occur at each component: (1) if a uniformly distributed, [0,1], rv < CR or (2) if a uniformly distributed, [1, $D_n$], integer rv = j. These random quantities are defined as $c_j$ and $d_j$, respectively. The second avenue is to enhance the possibility that the trial vector has at least one component from the mutation process shown in Figure 3. A given $\mathbf{u}_{ig}$ represents a mutation (most of the time) of $\mathbf{w}_{ig}$ derived from Eq. [7]. Target and trial vectors (pairwise) compete to move to the next generation by comparing their respective ε calculated with Eq. [6] after generating an SP for each vector. A synthetic sample is constructed for each SP and their respective covariance matrices are compared with that of the sample as shown in Figure 1. The vector that produces the smaller ε moves to the next generation; $N_p$ comparisons of $\mathbf{w}_{ig}$ with $\mathbf{u}_{ig}$ produce the next generation population defined as $\mathbf{w}_{g+1}$. Thus, successive generations of $\mathbf{w}_g$ become incrementally more fit (i.e. decreasing ε). This process is performed generation by generation, illustrated in Figure 2, until convergence. Although $\mathbf{H}^{-1}$ has $m^2$ components, we force symmetry on the solution and used m × (m+1)/2 components; thus, the vectors in the parameter population have $D_n$ = m × (m+1)/2 components.
For the DE parameter vector population size, we use $N_p = 10 \times D_n = 10 \times 10 = 100$ (i.e., the number of vectors in the DE parameter vector population). We also force the positive definite property on the solution by checking $\mathbf{H}$ for positive eigenvalues; note, when $\mathbf{H}$ is positive definite so is its inverse, $\mathbf{H}^{-1}$. When a possible solution is not positive definite, we assign a large error term to the competition rather than using Eq. [6]. In the event that both vectors produce a non-positive definite $\mathbf{H}$, the target vector moves to the next generation by definition. Thus, as the generations unfold, the possibility of forming a non-positive definite potential solution tends to zero.
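One DE generation, as described above, can be sketched as follows. This is an illustrative outline rather than the authors' implementation: the function names, the values F = 0.5 and CR = 0.9, the penalty size, and the toy error function are all assumptions. The crossover follows the two avenues read literally (a per-component uniform draw below CR, or a per-component integer draw equal to j). The positive-definite check is applied to the symmetric matrix built from a parameter vector, which is equivalent to checking its inverse.

```python
import numpy as np


def vec_to_sym(w, m):
    """Arrange m*(m+1)/2 parameters into a symmetric m x m matrix."""
    lower = np.zeros((m, m))
    lower[np.tril_indices(m)] = w
    return lower + np.tril(lower, -1).T


def is_positive_definite(h):
    """A symmetric matrix is positive definite when all eigenvalues are > 0."""
    return bool(np.all(np.linalg.eigvalsh(h) > 0.0))


def penalized(error_fn, m, penalty=1e12):
    """Replace the error with a large term when the matrix is not positive definite."""
    def objective(w):
        return error_fn(w) if is_positive_definite(vec_to_sym(w, m)) else penalty
    return objective


def de_generation(pop, objective, f=0.5, cr=0.9, rng=None):
    """Advance the (Np, Dn) parameter-vector population by one generation."""
    rng = np.random.default_rng() if rng is None else rng
    n_p, d_n = pop.shape
    nxt = pop.copy()
    for i in range(n_p):
        others = [k for k in range(n_p) if k != i]
        q, r, s = rng.choice(others, size=3, replace=False)  # distinct, none equal to i
        mutant = pop[q] + f * (pop[r] - pop[s])              # Eq. [7]
        c = rng.random(d_n)                                  # c_j, uniform on [0, 1)
        d = rng.integers(1, d_n + 1, size=d_n)               # d_j, uniform integer on [1, Dn]
        take_mutant = (c < cr) | (d == np.arange(1, d_n + 1))
        trial = np.where(take_mutant, mutant, pop[i])
        if objective(trial) < objective(pop[i]):             # ties keep the target vector
            nxt[i] = trial
    return nxt


# Toy usage: the error is the squared distance to a known parameter vector.
m = 2
target = np.array([2.0, 0.5, 1.0])
objective = penalized(lambda w: float(np.sum((w - target) ** 2)), m)
rng = np.random.default_rng(1)
population = target + rng.normal(scale=0.3, size=(30, 3))
for _ in range(20):
    population = de_generation(population, objective, rng=rng)
print(min(objective(w) for w in population))
```

In the procedure described in the text, the error function would generate an SP and a synthetic sample for each vector and return ε from Eq. [6], so the objective is itself stochastic; the deterministic toy error above only exercises the mutation, crossover, and selection steps.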

Figure 3.

Trial Vector Construction: The trial vector, $\mathbf{u}_{ig}$, is constructed with components from $\mathbf{w}_{ig}$ and the mutant vector $\mathbf{v}_{ig}$. The $\mathbf{v}_{ig}$ construction is shown in Eq. [7]. The vectors $\mathbf{u}_{ig}$ and $\mathbf{w}_{ig}$ compete as shown in Figure 2.

2.5. Synthetic Sample Construction

Synthetic samples of N realizations were constructed during the DE optimization procedure and after its termination corresponding to the four subpopulations (see Figure 1). To sample a given SP, we adapted the standard method of mapping a uniformly distributed random variable to a random variable with a specified (target) PDF, sometimes referred to as inverse sampling. For completeness, we outline this standard transform method to put our approach in context. We define two random variables: (1) u, over the range [0, 1] with a PDF given by $p_u(u) = 1$, i.e. u is a uniformly distributed random variable; and (2) t, over the range (−∞, ∞) with the target PDF labeled as $p_t(t)$, where $p_t(t)$ will be determined. We then determine the function q(u) = t, such that t is distributed as $p_t(t)$. The solution conserves probability using the cumulative functions

$$\int_{0}^{u} du = \int_{-\infty}^{t} p_t(z)\,dz \;\Rightarrow\; u = P_t(t), \qquad \text{Eq. [8]}$$

where $P_t(t)$ is the cumulative probability function for t. In passing, the above shows that mapping any random variable with its own cumulative function produces a uniform random variable. Although we used Eq. [8], we show the standard and equivalent solution expressed as

$$t = q(u) = P_t^{-1}(u). \qquad \text{Eq. [9]}$$
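The transform in Eqs. [8] and [9] can be illustrated with a target PDF whose cumulative function inverts in closed form. The exponential distribution used here is only a convenient stand-in chosen for this sketch and does not appear in the study; the function names and the rate value are likewise assumptions.

```python
import numpy as np


def exponential_cdf(t, rate):
    """P_t(t) for an exponential target PDF with the given rate."""
    return 1.0 - np.exp(-rate * t)


def exponential_inverse_cdf(u, rate):
    """Eq. [9]: t = P_t^{-1}(u) for the exponential target."""
    return -np.log1p(-u) / rate


rng = np.random.default_rng(0)
u = rng.random(100_000)                    # uniform realizations on [0, 1)
t = exponential_inverse_cdf(u, rate=2.0)   # realizations distributed as p_t(t)
u_back = exponential_cdf(t, rate=2.0)      # Eq. [8]: mapping t through P_t recovers u

print(t.mean())                      # close to 1 / rate
print(np.abs(u_back - u).max())      # round-trip error
```

The round trip shows the remark made in passing above: passing a random variable through its own cumulative function yields a uniform random variable.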

To construct a given SP, we first scaled and normalized its g(x) components defined in Eq. [3] with reasonable approximations. For simplicity, the case-control and MS indices have been suppressed. We assumed that there were approximately W women in each subpopulation (W ~ million) and scaled g(x): h1(𝐱) = W × g(𝐱). Then, for all x meeting the condition h1(𝐱) < 1, we set h1(𝐱) = 0, which gives h(𝐱). This normalization permits only whole synthetic individuals and removes certain combinations of variables from the SP (i.e. reducing RT). Incorporating the indexing, the final expression for both SPs is given by

$$s_{\mu}(\mathbf{x}) = h_{\mu 0}(\mathbf{x}) \times a_{\mu 0} + h_{\mu 1}(\mathbf{x}) \times a_{\mu 1}, \qquad \text{Eq. [10]}$$

which follows from Eq. [4].

As an alternative to adding the components, Eq. [10] was realized stochastically in our experiments. To construct a given SP, a component was first selected at random dependent upon $a_{\mu\lambda}$ to comply with Eq. [10]. To sample a given component, we let $u_1$ be a uniformly distributed, [1, w], integer random variable, where w is the area under the specific $h_{\mu\lambda}(\mathbf{x})$ component, and is equivalent to the number of women in the component after the scaling and normalization. We assembled all of the remaining unique combinations of the coordinates $\mathbf{x} = (x_1, \ldots, x_m)$ into a one-dimensional variable, $z_i$, where $z_1$ is combination 1, $z_2$ is combination 2, $z_i$ is the ith combination, and continued the ordering to the last combination labeled as $z_R$, where R is the number of remaining unique combinations (after scaling and normalizing, $R < R_T$). We ordered the corresponding number of individuals with each combination into another one-dimensional variable, $f_i$, also with R elements. It is important to recognize that combination ordering and labels are not important as long as the correspondence between $z_i$ and $f_i$ is maintained, which can be expressed as $f_i = f_i(z_i)$. The cumulative function for $f_i$ is given by

$F_j = \sum_{i=1}^{j} f_i$, Eq. [11]

for j = 1 to R (i.e., FR = w). This is the discrete and scaled analog of Pt(t) expressed in Eq. [8]. A given synthetic individual was selected by generating a realization of the uniformly distributed random variable u1 defined above and determining the index j that satisfies $F_{j-1} < u_1 \le F_j$ (with $F_0 = 0$). The difference $F_j - F_{j-1}$ gives fj, and fj points to the x combination indexed by zj, producing one synthetic individual. We performed this operation repeatedly (i.e. selecting the hμλ(𝐱) component and then selecting the individual with u1) to generate a synthetic sample of n individuals from a given SP. This solution is analogous to solving for a particular value of t (equating with x) given one realization of u expressed in Eq. [9].
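The selection of synthetic individuals from one component can be sketched as a discrete inverse-cumulative lookup. The Python code below is a minimal illustration under stated assumptions: the counts fi are whole numbers, only a single component is sampled (the random choice of component according to aμλ is omitted), and the function name `sample_individuals` and the toy arrays `z` and `f` are hypothetical.

```python
import numpy as np

def sample_individuals(z, f, n, rng):
    """Draw n synthetic individuals from combinations z with whole-number counts f."""
    F = np.cumsum(f)                          # F_j = sum of f_i for i <= j, so F[-1] = w
    u1 = rng.integers(1, F[-1] + 1, size=n)   # uniform integers on [1, w]
    j = np.searchsorted(F, u1, side="left")   # index j with F_{j-1} < u1 <= F_j
    return z[j]                               # each row is one synthetic individual

# Toy component with R = 3 remaining combinations (values are illustrative only).
z = np.array([[58, 64, 20, 64],
              [59, 65, 30, 70],
              [60, 66, 10, 80]])
f = np.array([3, 1, 6])                       # individuals per combination, w = 10

rng = np.random.default_rng(0)
sample = sample_individuals(z, f, n=5, rng=rng)
```

Because the lookup depends only on the pairing of `z` and `f`, reordering both arrays together leaves the sampling distribution unchanged, consistent with the remark that ordering and labels are not important.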

2.6. Evaluation Methods

Multiple methods were used to compare the observed sample with synthetic samples when the optimization was completed. The Kernel Two-Sample Test (two-sample test) based on the MMD metric developed by Gretton et al [32] was used to compare the EPDF from the sample with that of the respective synthetic sample. Counting experiments were investigated to understand the degree to which synthetic samples included individual replicates from its respective observed sample. Additionally, covariance matrices were compared. To evaluate the similarity within the modelling context, two PCA-based methods were employed. In all comparisons, four synthetic samples were randomly constructed from the respective SP and compared with their respective observed sample. These methods are described below.

A brief description of the MMD metric and the related test is provided using the same notation as Gretton et al [32], where possible. The comparison is based on a distribution-free test to determine if two EPDFs are the same, noting it was designed to accept the null hypothesis. The MMD metric has multiple forms. When both EPDFs have the same number of elements (n), the squared MMD metric takes the following form, which is used in this report

$\mathrm{MMD}_u^2 = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \ne i}^{n} \left[ k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i) \right]$, Eq. [12]

where subscript u indicates an unbiased estimate, xi are the vectors for the individuals from a given sample and yj are the vectors for the individuals from the respective synthetic sample, and the kernel terms refer to the forms expressed in Eqs. [2 & 3]. The null hypothesis test at significance level α (i.e. the EPDFs are the same) has the acceptance region expressed as

$\mathrm{MMD}_u^2 < \left[ \frac{4K}{\sqrt{n}} \right] \sqrt{\log(\alpha^{-1})}$. Eq. [13]

The right-hand side of Eq. [13] is referred to as the test-threshold below and was generated with α = 0.05. The Type II error probability of this test approaches zero as n becomes large [32]. The parameter K is determined from the bounds of the respective sample derived from the following inequality

$0 \le k(x_i, x_j) \le K$. Eq. [14]

This test was performed twice for each observed sample and synthetic sample comparison; the test was performed with the kernel, k, expressed in Eq. [2] and with the kernel, kH, expressed in Eq. [3]. Because both the acceptance region and sample bounds scale with K, the quantities in Eq. [12] and Eq. [14] were generated with the non-normalized forms of k and kH to keep the quantities from becoming too small. Each observed sample was compared with its respective synthetic samples using Eq. [12] with Eqs. [13 & 14]. The bandwidth matrices were derived from the covariance of the respective samples calculated without considering MS for these comparisons; when using diagonal H, variances determined from the respective sample were used to estimate the diagonal terms.
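The statistic in Eq. [12] and the test-threshold in Eq. [13] can be sketched in Python as follows. This is an illustration rather than the implementation used in the study: a non-normalized Gaussian kernel with bandwidth matrix H is assumed for k, the data are random placeholders, and the function names (`gaussian_kernel`, `mmd2_unbiased`, `acceptance_threshold`) are hypothetical. The value K = 0.994 is taken from the control-sample results reported with the full bandwidth kernel.

```python
import numpy as np

def gaussian_kernel(A, B, H):
    """Non-normalized Gaussian kernel exp(-0.5 * d^T H^-1 d) for all row pairs of A and B."""
    Hinv = np.linalg.inv(H)
    d = A[:, None, :] - B[None, :, :]                       # pairwise differences
    return np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", d, Hinv, d))

def mmd2_unbiased(X, Y, H):
    """Unbiased squared MMD for two samples with the same number of rows (Eq. [12])."""
    n = len(X)
    Kxy = gaussian_kernel(X, Y, H)
    # h[i, j] = k(xi, xj) + k(yi, yj) - k(xi, yj) - k(xj, yi)
    h = gaussian_kernel(X, X, H) + gaussian_kernel(Y, Y, H) - Kxy - Kxy.T
    np.fill_diagonal(h, 0.0)                                # exclude the i == j terms
    return h.sum() / (n * (n - 1))

def acceptance_threshold(K, n, alpha=0.05):
    """Right-hand side of Eq. [13]."""
    return 4.0 * K / np.sqrt(n) * np.sqrt(np.log(1.0 / alpha))

# Placeholder data: two samples of n = 180 drawn from the same distribution.
rng = np.random.default_rng(1)
X = rng.normal(size=(180, 4))
Y = rng.normal(size=(180, 4))
H = np.cov(X, rowvar=False)                                 # full bandwidth matrix

stat = mmd2_unbiased(X, Y, H)
thr = acceptance_threshold(K=0.994, n=180)                  # approximately 0.513
accept_null = stat < thr
```

With K = 0.994, n = 180, and α = 0.05, the threshold evaluates to approximately 0.513, in agreement with the value reported for the control-sample.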

Additional experiments were performed to complement the comparison analysis. These were performed to evaluate the degree to which a given synthetic sample included individual replicates from its respective observed sample. This was implemented by creating 1000 synthetic control-samples with n = 180 from its SP and counting replicates. Due to the way the optimization process was controlled, covariance matrices were also compared between the observed samples and synthetic samples by evaluating the related matrix elements with 95% confidence intervals (CIs) separated by MS. To generate CIs for the sample covariance elements, 1000 bootstrap trials were performed. The covariance matrices were compared with the corresponding realizations of the respective synthetic covariance matrices. For the synthetic sample covariance elements, CIs were estimated by evaluating 1000 synthetic samples (each with n individuals) constructed from the respective SP.
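The bootstrap step for the sample covariance elements can be sketched as shown below. The construction of the intervals is not specified in the text, so a percentile interval is assumed here; the function name `bootstrap_cov_ci` and the placeholder data are hypothetical.

```python
import numpy as np

def bootstrap_cov_ci(X, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap intervals for every element of the covariance matrix of X."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    covs = np.empty((n_boot, m, m))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample individuals with replacement
        covs[b] = np.cov(X[idx], rowvar=False)
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(covs, [tail, 100.0 - tail], axis=0)
    return np.cov(X, rowvar=False), lower, upper

# Placeholder sample: n = 180 individuals and four continuous variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(180, 4))
cov, lower, upper = bootstrap_cov_ci(X)
```

For the synthetic side of the comparison, the same summary would be applied to covariance matrices computed from 1000 synthetic samples drawn from the SP instead of bootstrap resamples.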

The other objective was to assess the similarity within a modeling context to ensure that the multivariate structure between the samples and respective synthetic samples was similar. In this situation, we used a given observed sample as the reference and examined the similarity between respective synthetic samples with this reference. We applied two unique forms of analysis based on principal component analyses (PCA). First, synthetic-samples were compared to the respective sample using PCA modeling; a PCA model derived from the sample was applied to the respective synthetic samples and predicted scores (expansion coefficients) were compared. This was evaluated by comparing the EPDFs (for the scores) along each individual principal component. Strong outliers were observable in the score plots, whereas the detection of moderate outliers required another approach. In the second approach, the distance to the model in X-space (DModX) statistic was used to evaluate differences between the sample and synthetic samples, where the residuals were evaluated. In short, the residuals should not deviate significantly. We note the PCA expansion and residuals are orthogonal complements. The PCA model was calculated (trained) on the sample and the variables were expanded with square and interaction-terms of the original variables. Each variable was centered and scaled to unit variance. The derived PCA models were then applied to the respective synthetic sample to predict the scores and generate DModX quantities. Residuals between the observed sample and synthetic samples were compared with a t-test. PCA models were calculated using Evince (version 2.7.9, Prediktera AB, Umea, Sweden). Box Plots with Violin lines were generated in MATLAB (R2017b, MathWorks Inc. Natick MA, USA) using the GRAMM toolbox [48]. The SP and DE algorithms were developed in the IDL version 8.6 (Exelis Visual Information Solutions, Boulder, Colorado) programming environment.
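The PCA-based comparison can be outlined with NumPy alone. The sketch below is an approximation under stated assumptions: the PCA is computed with a singular value decomposition rather than Evince, and the per-individual residual distance is a simple root-mean-square residual, which is only a stand-in for the normalized DModX statistic reported by that software. All function names and the random data are hypothetical.

```python
import numpy as np
from itertools import combinations

def expand(X):
    """Append squares and pairwise interaction terms to the original variables."""
    pairs = combinations(range(X.shape[1]), 2)
    inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    return np.hstack([X, X ** 2, inter])

def fit_pca(X_ref, n_components=2):
    """Train a PCA model on the reference (observed) sample."""
    Z = expand(X_ref)
    mean, std = Z.mean(axis=0), Z.std(axis=0, ddof=1)
    Zs = (Z - mean) / std                          # center and scale to unit variance
    _, _, Vt = np.linalg.svd(Zs, full_matrices=False)
    return {"mean": mean, "std": std, "P": Vt[:n_components].T}

def apply_pca(model, X):
    """Predict scores and a residual distance for each individual in X."""
    Zs = (expand(X) - model["mean"]) / model["std"]
    T = Zs @ model["P"]                            # predicted scores
    E = Zs - T @ model["P"].T                      # residuals, orthogonal to the scores
    dist = np.sqrt((E ** 2).mean(axis=1))          # simplified DModX-like distance
    return T, dist

# Placeholder data standing in for an observed sample and one synthetic sample.
rng = np.random.default_rng(3)
X_obs = rng.normal(size=(180, 4))
X_syn = rng.normal(size=(180, 4))

model = fit_pca(X_obs)
T_obs, d_obs = apply_pca(model, X_obs)
T_syn, d_syn = apply_pca(model, X_syn)
```

The score arrays (`T_obs`, `T_syn`) would feed the score-plot comparison, and the residual distances (`d_obs`, `d_syn`) would be compared with a two-sample t-test.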

3. Results

3.1. Sample Sparsity and Synthetic Replications

The two observed samples (i.e. cases and controls) were sparse representations of their respective populations (here we use sparsity to define missing variable combinations in a given EPDF). When considering the case and control samples as one group, all individuals were unique (i.e. no xi vectors were the same). The sparse characteristic of the control-sample (n = 180) is illustrated in Figure 4. In this example, various projections of the control-sample EPDF onto the mass-PD plane are shown with color coding. The light-blue points show the projection for age = 58 years and height = 64 inches (i.e. two individuals). Blue and purple points show the sparsity for age = 58 years with height ranging from 65 – 66 and 67 – 68 inches. Red and gray points show the sparsity for age ranging from 59 – 60 years with height = 64 inches. Black points show the sparsity for age = 60 years with height ranging from 65 – 66 inches. The most populated projection (age ranging from 58 – 60 years and height ranging from 64 – 68 inches) includes all points (representing 15 individuals or 8% of the control-sample).

Figure 4.

Control-sample sparsity illustration: This illustration shows multiple projections of the sample’s EPDF onto the mass-PD plane. Mass (kg) is on the vertical axis and PD (breast density measure) is on the horizontal axis. For example, light-blue (two individuals) includes individuals with age = 58 years and height = 64 inches (sparse projection). Each point represents one individual.

The SP generation methods produced a dense reconstruction of the sample’s PDF, illustrated in Figure 5. This shows a two-dimensional slice through the four-dimensional SP (pane A) at age = 58 years and height = 64 inches, corresponding with the light-blue points in Figure 4. Note the dense nature of this conditional PDF (not renormalized) in comparison with Figure 4. Taking a profile along the horizontal direction at mass = 64 kg gives the associated conditional PDF (not normalized) for PD (breast density metric) shown in pane B (before scaling and normalization). This smooth continuous curve illustrates the properties of the kernel reconstruction. When combining MS, there were R = 936,412 unique variable combinations for the controls and R = 1,156,433 for the cases after scaling and normalization.

Figure 5.

Synthetic control population reconstruction illustration. This shows a two-dimensional slice through the synthetic population in pane-A for women with age = 58 years and height = 64 inches. This corresponds to the sparsest sample (n = 2) in Figure 4 (light-blue). The dashed line marks a profile for women with mass = 64 kg. The corresponding conditional PDF for PD (breast density measure) is shown in pane-B before scaling and normalization were applied.

We next applied the SP generation method to create 1,000 synthetic control-samples. We examined the frequency of generating a synthetic sample containing an individual that was present in the observed sample (replicate). The replicate evaluation revealed that, on average, 18 synthetic samples (each with n = 180 individuals) were generated before a replicate individual from the observed sample was included. In 97.2% of the samples with replicates, there was only one replicate, and in 2.8% of these samples 2–3 replicates were observed. Based on these findings, we conclude that the similarities in the comparisons that follow do not arise from simple replicates of the observed sample.
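The replicate count for one synthetic sample can be sketched as an exact match on the full variable vector. The Python snippet below is a minimal illustration, assuming that each individual is stored as a row of discrete values; the function name `count_replicates` and the toy arrays are hypothetical.

```python
import numpy as np

def count_replicates(observed, synthetic):
    """Count synthetic individuals whose full variable vector matches an observed individual."""
    observed_set = {tuple(row) for row in observed}
    return sum(tuple(row) in observed_set for row in synthetic)

# Toy rows (illustrative values only): one synthetic row repeats an observed row.
observed = np.array([[58, 64, 20, 64],
                     [60, 66, 35, 70]])
synthetic = np.array([[58, 64, 20, 64],
                      [59, 65, 22, 61],
                      [60, 66, 35, 71]])

n_replicates = count_replicates(observed, synthetic)
```

Applying this count to each of the 1,000 synthetic samples in turn yields the replicate frequencies summarized above.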

3.2. Synthetic Population Generation Maintains EPDF Similarity

We compared four generated synthetic samples to the corresponding observed sample to determine if their EPDFs differed by using the two-sample kernel test. The findings from these two-sample tests using kH, the full bandwidth kernel, are shown in Table 1. In the comparisons between the control-sample with the four synthetic samples, the null hypothesis was accepted for each synthetic sample, indicating the synthetic samples have similar EPDFs to the observed sample. The findings comparing the case-sample with four synthetic samples were similar for kH as well. Table 2 shows the tests based on k, the diagonal bandwidth kernel. The findings are similar to those shown in Table 1; in all comparisons, the null hypothesis was not rejected, suggesting that the SP generation method created synthetic samples that are distributed similarly to the observed sample. For reference, the bandwidth matrices determined with DE are provided in Table 3: (A) menopausal controls; (B) menopausal cases; (C) non-menopausal controls; and (D) non-menopausal cases.

Table 1.

Kernel Two-sample Test based on the full bandwidth kernel. Synthetic samples were compared to their respective case and control samples. Both K and the test threshold are provided. All quantities were generated with the full bandwidth kernel, kH expressed in Eq. [3] with Eq. [12].

| Synthetic sample | Control-sample $\mathrm{MMD}_u^2$ (K = 9.94E-01, threshold = 5.13E-01) | Case-sample $\mathrm{MMD}_u^2$ (K = 9.97E-01, threshold = 5.15E-01) |
| --- | --- | --- |
| 1 | −3.32E-03 | −4.45E-03 |
| 2 | −2.37E-03 | −1.96E-03 |
| 3 | −1.66E-03 | 2.52E-04 |
| 4 | 3.53E-05 | −9.03E-04 |

Table 2.

A Kernel Two-sample Test based on the diagonal bandwidth kernel. Synthetic samples were compared to their respective case and control samples. Both K and the test threshold are provided. All quantities were generated with a diagonal bandwidth matrix kernel, k, expressed in Eq. [2] with Eq. [12].

| Synthetic sample | Control-sample $\mathrm{MMD}_u^2$ (K = 9.93E-01, threshold = 5.13E-01) | Case-sample $\mathrm{MMD}_u^2$ (K = 9.98E-01, threshold = 5.15E-01) |
| --- | --- | --- |
| 1 | −3.32E-03 | −3.87E-03 |
| 2 | −2.44E-03 | −1.93E-03 |
| 3 | −1.47E-03 | −2.57E-04 |
| 4 | −2.96E-04 | −1.10E-07 |

Table 3.

Bandwidth Matrix Elements: The bandwidth matrix, H, generated with DE is provided for each subgroup: (A) menopausal controls; (B) menopausal cases; (C) non-menopausal controls; and (D) non-menopausal cases. These were used to generate the synthetic populations.

A

Menopausal Controls n = 132 (34 Generations)

| | Height (inch) | Age | PD | Weight (kg) |
| --- | --- | --- | --- | --- |
| Height (inch) | 5.16 | −2.08 | 2.93 | 5.20 |
| Age | −2.08 | 62.0 | −14.4 | −5.33 |
| PD | 2.93 | −14.4 | 185.6 | −48.1 |
| Weight (kg) | 5.20 | −5.33 | −48.1 | 141.3 |

B

Menopausal Cases n = 142 (29 Generations)

| | Height (inch) | Age | PD | Weight (kg) |
| --- | --- | --- | --- | --- |
| Height (inch) | 6.27 | −4.94 | −2.28 | 10.6 |
| Age | −4.94 | 66.9 | −3.69 | −10.1 |
| PD | −2.28 | −3.69 | 188.3 | −59.4 |
| Weight (kg) | 10.6 | −10.1 | −59.4 | 164.8 |

C

Non-Menopausal Controls n = 48 (147 Generations)

| | Height (inch) | Age | PD | Weight (kg) |
| --- | --- | --- | --- | --- |
| Height (inch) | 6.68 | 2.14 | 11.9 | 2.23 |
| Age | 2.14 | 31.2 | −24.8 | −6.17 |
| PD | 11.9 | −24.8 | 242.0 | −65.3 |
| Weight (kg) | 2.23 | −6.17 | −65.3 | 73.3 |

D

Non-Menopausal Cases n = 38 (191 Generations)

| | Height (inch) | Age | PD | Weight (kg) |
| --- | --- | --- | --- | --- |
| Height (inch) | 11 | −0.41 | −10.8 | 9.02 |
| Age | −0.41 | 19.7 | −10.4 | −2.85 |
| PD | −10.8 | −10.4 | 314.8 | −130.5 |
| Weight (kg) | 9.02 | −2.85 | −130.5 | 199.3 |

3.3. Covariance Comparisons

Covariance matrix elements of the synthetic samples were pairwise compared to the corresponding matrix of observed samples. Comparisons of the matrix elements are enabled visually by considering the CIs on the synthetic quantities. When the central value of the sample quantity is spanned by the interval of the respective synthetic entity, the variation was not considered significant. The control comparison is shown in Table 4, where the sample is shown in 4A and synthetic sample in 4B. The case comparison is shown in Table 5, where the sample is shown in 5A and synthetic sample in 5B. Synthetic quantities did not deviate significantly from the respective sample quantities when considering intra-group pairwise comparisons.

Table 4.

Covariance Matrices for Controls: These tables show the covariance matrix for the control-sample in pane-A and a synthetic control-sample (one realization) in pane-B. Confidence intervals are provided parenthetically.

A

Control-sample n = 180

| | height (in) | age (yr) | PD | mass (kg) |
| --- | --- | --- | --- | --- |
| height (in) | 6.3 (5.1, 7.4) | −3.9 (−7.9, 0.0) | 6.6 (1.6, 11.6) | 3.9 (0.2, 7.7) |
| age (yr) | −3.9 (−7.9, 0.0) | 108.2 (88.7, 127.7) | −41.4 (−65.4, −17.4) | −2.8 (−16.6, 10.9) |
| PD | 6.6 (1.6, 11.6) | −41.4 (−65.4, −17.4) | 212.9 (167.3, 258.6) | −52.5 (−73.5, −31.5) |
| mass (kg) | 3.9 (0.2, 7.7) | −2.8 (−16.6, 10.9) | −52.5 (−73.5, −31.5) | 116.0 (84.1, 148.0) |

B

Synthetic Control-sample n = 180

| | height (in) | age (yr) | PD | mass (kg) |
| --- | --- | --- | --- | --- |
| height (in) | 6.5 (5.4, 7.6) | −3.9 (−6.6, −1.3) | 11.4 (7.7, 15.1) | 6.3 (2.7, 9.8) |
| age (yr) | −3.9 (−6.6, −1.3) | 105.8 (94.7, 116.9) | −40.5 (−52.8, −28.1) | −3.4 (−14.0, 7.2) |
| PD | 11.4 (7.7, 15.1) | −40.5 (−52.8, −28.1) | 185.0 (159.3, 210.8) | −35.9 (−53.3, −18.5) |
| mass (kg) | 6.3 (2.7, 9.8) | −3.4 (−14.0, 7.2) | −35.9 (−53.3, −18.5) | 128.4 (103.7, 153.0) |

Table 5.

Covariance Matrices for the Cases: These tables show the covariance matrix for the case-sample in pane-A and synthetic case-sample (one realization) in pane-B. Confidence intervals are provided parenthetically.

A

Case-sample n = 180

| | height (in) | age (yr) | PD | mass (kg) |
| --- | --- | --- | --- | --- |
| height (in) | 7.1 (5.6, 8.5) | −7.4 (−11.8, −3.0) | −1.6 (−8.3, 5.0) | 9.6 (4.8, 14.4) |
| age (yr) | −7.4 (−11.8, −3.0) | 109.3 (90.5, 128.1) | −43.3 (−68.5, −18.1) | −2.9 (−21.4, 15.6) |
| PD | −1.6 (−8.3, 5.0) | −43.3 (−68.5, −18.1) | 245.7 (198.4, 292.9) | −81.3 (−109.4, −53.2) |
| mass (kg) | 9.6 (4.8, 14.4) | −2.9 (−21.4, 15.6) | −81.3 (−109.4, −53.2) | 165.6 (122.2, 209.0) |

B

Synthetic Case-sample n = 180

| | height (in) | age (yr) | PD | mass (kg) |
| --- | --- | --- | --- | --- |
| height (in) | 8.5 (6.9, 10.1) | −13.3 (−17.4, −9.1) | 0.2 (−6.3, 6.7) | 8.2 (3.4, 13.0) |
| age (yr) | −13.3 (−17.4, −9.1) | 104.8 (87.5, 122.0) | −30.1 (−53.9, −6.3) | −1.2 (−19.9, 17.6) |
| PD | 0.2 (−6.3, 6.7) | −30.1 (−53.9, −6.3) | 199.1 (148.8, 249.3) | −58.3 (−87.0, −29.6) |
| mass (kg) | 8.2 (3.4, 13.0) | −1.2 (−19.9, 17.6) | −58.3 (−87.0, −29.6) | 129.8 (98.8, 160.8) |

3.4. PCA Comparisons

Characteristics of the synthetic-samples were also evaluated using PCA model comparisons with the corresponding observed sample. The associated findings are shown in Figure 6. Figure 6A (top left) shows the PCA scores for PC1 and PC2 (first two principal components) for the case-sample. The first principal component explains 26% of the variation and the second explains 18.2%, giving 44.2% in total for this sample. These principal components are plotted together with the predictions of the four different synthetic-samples; these predictions were based on the trained PCA models for the case-sample and control-sample (two separate models were trained). We note that none of the synthetic samples contained replicates from the respective sample. The score plots show a distinct separation in all datasets due to MS (menopausal status). Approximately 99% of the individuals in the right group in Figure 6A had MS = 1 (about 1% had MS = 0), while the compact subgroup to the left consisted of individuals with MS = 0 (except one individual or approximately 0.14%). Interestingly, this artifact was a characteristic of the observed sample that is also present in the synthetic sample structure. The PCA score plot in Figure 6C shows similar behavior in that there are few differences between the observed sample and the synthetic-samples. Note that the separation due to MS was also captured. The second dimension (y-axis) is in the opposite direction for the control-sample compared to the case-sample. This is a common artifact in PCA because the signs are arbitrary. This is usually accounted for by multiplication by −1 (i.e. PCA is invariant under a sign change).

Figure 6.

PCA models for the Observed and Synthetic Samples: The first two principal components derived from the case-sample are shown in pane-A together with the predictions of the synthetic data. There is no visible difference between the observed and the four different synthetic datasets. The residuals, DModX, are shown in pane-B as a boxplot with violin density lines. The differences in the residuals between the observed and synthetic data were not significant. Similar results were noted for the control-sample in pane-C and pane-D.

Figure 6A illustrates that there is no visible deviation between the observed sample and the four synthetic samples, as represented by the PCA model elements PC1 and PC2. The variation not explained by the PCA model is shown in the DModX plot in Figure 6B for the case-sample. The DModX for each sample (observed sample and 4 synthetic samples) is represented by a boxplot and the EPDFs are shown as violin plots. In each plot, the p-value from the comparison with the sample residual is provided. The difference in each comparison was not significant. Because the DModX values across the 5 analyses were statistically similar, the sample and synthetic samples were essentially indistinguishable. The same analysis was also performed on the control-sample shown in Figure 6D, which produced similar findings.

4. Discussion

We presented a novel methodology to generate SPs given observed samples with limited size. The approach is based on an unconstrained multivariate kernel density estimation driven by an optimization procedure. The similarity of samples from the SP versus the observed sample was demonstrated with multiple evaluation methods. The two-sample tests indicated that the observed samples and equally-sized synthetic samples were similar for all comparisons. We also compared the covariance structure, which showed that the observed samples and synthetic samples did not deviate significantly. Within the modeling context, the feasibility of our approach was evident based on the PCA model comparisons. A benefit of the PCA approach is that all variables under consideration are evaluated simultaneously and repeatedly because each PCA dimension is a linear combination of the input variables. For either cases or controls, the respective synthetic samples exhibited similar behavior under the PCA transform. We presented this approach to generate SP samples for model building purposes. The method also lends itself to generating sampling requirements for multivariate investigations given a population sample.

Although our study addressed many technical challenges, there are several limitations and nuances worth noting. We considered five variables. This dimensionality was sufficient to develop a general framework and provide initial feasibility. To make the approach into a model building tool, the dimensionality of the input space must be increased, necessitating more powerful data processing capabilities. As discussed in the Background Section, our ultimate objective is to determine the lower bound for synthesizing a population given a set of covariates and fully specified population. This is the more encompassing endpoint, as resampling techniques for modeling purposes are effective when the limited sample is an adequate representation of the population from which it was derived; at this time, the status of our samples is unknown. We considered four continuous variables and one dichotomous variable. Often, the input space could include a wider mix of continuous, ordinal, and nominal variables. We used a conditional probability approach to address the limited mixture of variables in this work. Additional methods of estimating the covariance between mixed variable types [49] are required to render the problem suitable for kernel processing and to reduce the number of conditional probability branches. Otherwise, extensive branching becomes a limitation for datasets with few samples. The process was driven by considering the covariance directly. It is not clear at this time if this is generally optimal in wider settings. Alternatively, the optimization could be driven by the PCA similarity metrics or other summary metrics such as the two-sample test used in this evaluation. Additional experimentation is required to determine the optimal method of driving the optimization process.

When this approach is more fully developed, it could produce synthetic samples for model building exercises, provided that the underlying data structure of the sample was captured in the synthetic population. For example, in epidemiologic cancer studies the disease incidence is often low, limiting the number of individual samples with cancer. Once a model is optimized, it can be initially validated with the sample population, producing a fully specified model suitable for independent validation. Moreover, studies are planned to determine the possibility of performing independent validation with SPs. In this report, we presented our methodology and demonstrated its feasibility. Because the technique is not yet fully developed, we do not provide comparisons with other techniques such as bootstrapping or cross-validation at this time. First, our methodology will require scaling to accommodate higher dimensionality. Second, its ability to synthesize a population accurately must be understood, which is the antecedent problem to the model building exercise with limited data. After addressing these objectives, comparison with other approaches will be investigated.

5. Conclusions

The work presented in this report contributes to both kernel density estimation and synthetic population research. The optimization feedback loop can be modified to incorporate other endpoints. The work also introduced two related methods of analyzing structure similarity with PCA. These similarity metrics have applications well beyond those presented in this report. Synthetic samples may be useful to mitigate limited sample sizes for initial model building exercises, provided that our technique is evaluated under more general conditions. Planned research includes addressing the limitations discussed above.

Highlights.

Kernel density estimation was used to create synthetic individuals from a limited sample

The full bandwidth matrix was specified with differential evolution optimization

Synthetic samples were statistically similar to the limited sample

Limited sample size problem can be mitigated with synthetic patients

Acknowledgments

Funding: This work was supported by Moffitt Cancer Center grant #17032001 (Miles for Moffitt) and National Institutes of Health Grant #R01CA114491.

Footnotes

Declarations

Ethics and consent to participate: This study was approved by the Institutional Review Board (IRB), University of South Florida, Tampa, FL (IRB# Ame13_104715).

Consent for publication: The work does not contain personal identifiers.

Availability of data: The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests: Dr. Sellers receives royalty income from copyright from published work. The other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • [1].Mascalzoni D, Paradiso A, Hansson M. Rare disease research: Breaking the privacy barrier. Appl Transl Genom. 2014;3:23–9.doi: 10.1016/j.atg.2014.04.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Darquy S, Moutel G, Lapointe A-S, D’Audiffret D, Champagnat J, Guerroui S, et al. Patient/family views on data sharing in rare diseases: study in the European LeukoTreat project. Eur J Hum Genet. 2016;24:338–43.doi: 10.1038/ejhg.2015.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Erves JC, Mayo-Gamble TL, Malin-Fair A, Boyer A, Joosten Y, Vaughn YC, et al. Needs, Priorities, and Recommendations for Engaging Underrepresented Populations in Clinical Research: A Community Perspective. J Community Health. 2017;42:472–80.doi: 10.1007/s10900-016-0279-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Lay JO Jr, Borgmann S, Liyanage R, Wilkins CL. Problems with the “omics”. Trends in Analytical Chemistry. 2006;25.doi: 10.1016/j.trac.2006.10.007. [DOI] [Google Scholar]
  • [5].Micheel CM, Nass SJ, Omenn GS. Evolution of Translational Omics: Lessons Learned and the Path Forward: National Academies Press; 2012 [PubMed] [Google Scholar]
  • [6].Buczak AL, Babin S, Moniz L. Data-driven approach for creating synthetic electronic medical records. Bmc Med Inform Decis. 2010;10 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [7].Chen JQ, Chun D, Patel M, Chiang E, James J. The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures. Bmc Med Inform Decis. 2019;19.doi: 10.1186/s12911-019-0793-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Dahmen J, Cook D. A Synthetic Data Generation System for Healthcare Applications. Sensors (Basel). 2019;19.doi: 10.3390/s19051181. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Goncalves AR, Sales AP, Ray P, Soper B. NCI Pilot 3 - Synthetic Data Generation Report In: Energy USDo, editor.2018.doi: 10.2172/1430997. [DOI] [Google Scholar]
  • [10].Hawkins DM. The problem of overfitting. Journal of chemical information and computer sciences. 2004;44:1–12 [DOI] [PubMed] [Google Scholar]
  • [11].Babyak MA. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic medicine. 2004;66:41121.doi: 10.1097/01.psy.0000127692.23278.a9. [DOI] [PubMed] [Google Scholar]
  • [12].Harrell FE Jr., Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in medicine. 1996;15:361–87.doi:. [DOI] [PubMed] [Google Scholar]
  • [13].Harrell Jr F, E. Regression Modeling and Validation Strategies. 1997 [Google Scholar]
  • [14].Vittinghoff E, McCulloch CE. Relaxing the rule of ten events per variable in logistic and Cox regression. American journal of epidemiology. 2007;165:710–8.doi: 10.1093/aje/kwk052. [DOI] [PubMed] [Google Scholar]
  • [15].Efron B, Gong G. A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician. 1983;37:36–48.doi: 10.1080/00031305.1983.10483087. [DOI] [Google Scholar]
  • [16].Isaksson A, Wallman M, Göransson H, Gustafsson MG. Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recognition Letters. 2008;29:1960–5.doi: 10.1016/j.patrec.2008.06.018. [DOI] [Google Scholar]
  • [17].Efron B, Tibshirani R. Improvements on cross-validation: the 632+ bootstrap method. Journal of the American Statistical Association. 1997;92:548–60.doi: 10.1080/01621459.1997.10474007. [DOI] [Google Scholar]
  • [18].Heppenstall A, Harland K, Smith D, Birkin M. Creating realistic synthetic populations at varying spatial scales: a comparative critique of population synthesis techniques. Geocomputation 2011 Conference Proceedings, UCL, London2011. p. 1–8.doi: 10.18564/jasss.1909. [DOI] [Google Scholar]
  • [19].Smith DM, Clarke GP, Harland K. Improving the synthetic data generation process in spatial microsimulation models. Environment and Planning: Economy and Space. 2009;41:1251–68.doi: 10.1068/a4147. [DOI] [Google Scholar]
  • [20].Barthelemy J, Toint PL. Synthetic population generation without a sample. Transportation Science. 2013;47:266–79.doi: 10.1287/trsc.1120.0408. [DOI] [Google Scholar]
  • [21].Müller K, Axhausen KW. Preparing the Swiss Public-Use Sample for generating a synthetic population of Switzerland: Eidgenössische Technische HochschuleZürich, IVT, Institute for Transport Planning and Systems; 2012.doi: 10.3929/ethz-a-007340012. [DOI] [Google Scholar]
  • [22].Zhu Y, Ferreira J. Synthetic population generation at disaggregated spatial scales for land use and transportation microsimulation. Transportation Research Record: Journal of the Transportation Research Board. 2014;2429:168–77.doi: 10.3141/2429-18. [DOI] [Google Scholar]
  • [23].Harland K, Heppenstall A, Smith D, Birkin M. Creating realistic synthetic populations at varying spatial scales: a comparative critique of population synthesis techniques. Journal of Artificial Societies and Social Simulation. 2012;15:1–15.doi: 10.18564/jasss.1909. [DOI] [Google Scholar]
  • [24].Ryan J, Maoh H, Kanaroglou P. Population synthesis: Comparing the major techniques using a small, complete population of firms. Geographical Analysis. 2009;41:181–203.doi: 10.1111/j.1538-4632.2009.00750.x. [DOI] [Google Scholar]
  • [25].Lomax N, Norman P. Estimating population attribute values in a table:“get me started in” iterative proportional fitting. The Professional Geographer. 2016;68:451–61.doi: 10.1080/00330124.2015.1099449. [DOI] [Google Scholar]
  • [26].Simpson L, Tranmer M. Combining Sample and Census Data in Small Area Estimates: Iterative Proportional Fitting with Standard Software. The professional Geographer. 2005;57:222–34.doi: 10.1111/j.0033-0124.2005.00474.x. [DOI] [Google Scholar]
  • [27].Ma L, Srinivasan S. Synthetic Population Generation with Multilevel Controls: A Fitness-Based Synthesis Approach and Validations Computer-Aided Civil and Infrastructure Engineering 2015;30:135–50.doi: 10.1111/mice.12085. [DOI] [Google Scholar]
  • [28]. Land WH Jr, Margolis D, Kallergi M, Heine JJ. A Kernel Approach for Ensemble Decision Combinations with two-view Mammography Applications. International Journal of Functional Genomics and Personalised Medicine. 2010;3:157–82. doi: 10.1504/IJFIPM.2010.037152.
  • [29]. Land WH, Heine J, Embrechts M, Smith T, Choma R, Wong L. New approach to breast cancer CAD using partial least squares and kernel-partial least squares. In: Fitzpatrick JM, Reinhardt JM, editors. Proc SPIE International Symposium Medical Imaging 2005: Image Processing. San Diego, California, United States; 2005. doi: 10.1117/12.593112.
  • [30]. Behera M, Fowler EE, Owonikoko TK, Land WH, Mayfield W, Chen Z, et al. Statistical learning methods as a preprocessing step for survival analysis: evaluation of concept using lung cancer data. Biomedical Engineering Online. 2011;10:97. doi: 10.1186/1475-925X-10-97.
  • [31]. Price KV, Storn RM, Lampinen JA. Differential evolution: a practical approach to global optimization. Berlin; New York: Springer; 2005. doi: 10.1007/3-540-31306-0.
  • [32]. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A. A kernel two-sample test. Journal of Machine Learning Research. 2012;13:723–73. doi: 10.5555/2188385.2188410.
  • [33]. Heine JJ, Fowler EE, Flowers CI. Full field digital mammography and breast density: comparison of calibrated and noncalibrated measurements. Acad Radiol. 2011;18:1430–6. doi: 10.1016/j.acra.2011.07.011.
  • [34]. Heine JJ, Cao K, Rollison DE, Tiffenberg G, Thomas JA. A Quantitative Description of the Percentage of Breast Density Measurement Using Full-field Digital Mammography. Acad Radiol. 2011;18:556–64. doi: 10.1016/j.acra.2010.12.015.
  • [35]. Heine JJ, Cao K, Rollison DE. Calibrated measures for breast density estimation. Acad Radiol. 2011;18:547–55. doi: 10.1016/j.acra.2010.12.007.
  • [36]. Fowler EE, Sellers TA, Lu B, Heine JJ. Breast Imaging Reporting and Data System (BI-RADS) breast composition descriptors: automated measurement development for full field digital mammography. Medical Physics. 2013;40:113502. doi: 10.1118/1.4824319.
  • [37]. Boyd NF, Martin LJ, Sun L, Guo H, Chiarelli A, Hislop G, et al. Body size, mammographic density, and breast cancer risk. Cancer Epidemiology, Biomarkers & Prevention: a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology. 2006;15:2086–92. doi: 10.1158/1055-9965.EPI-06-0345.
  • [38]. Heine JJ, Malhotra P. Mammographic tissue, breast cancer risk, serial image analysis, and digital mammography. Part 2. Serial breast tissue change and related temporal influences. Acad Radiol. 2002;9:317–35. doi: 10.1016/s1076-6332(03)80374-4.
  • [39]. Heine JJ, Malhotra P. Mammographic tissue, breast cancer risk, serial image analysis, and digital mammography. Part 1. Tissue and related risk factors. Acad Radiol. 2002;9:298–316. doi: 10.1016/s1076-6332(03)80373-2.
  • [40]. Cacoullos T. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics. 1966;18:179–89. doi: 10.1007/BF02869528.
  • [41]. Gramacki A. Nonparametric kernel density estimation and its computational aspects. Cham, Switzerland: Springer International Publishing AG; 2018. doi: 10.1007/978-3-319-71688-6.
  • [42]. Powell MJD. The Theory of Radial Basis Function Approximation in 1990. In: Light W, editor. Wavelets, Subdivision, Algorithms, and Radial Basis Functions. Oxford: Oxford University Press; 1992. p. 105–210.
  • [43]. Scott DW. Multivariate density estimation: theory, practice, and visualization. New York: Wiley; 1992.
  • [44]. Duong T. ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R. Journal of Statistical Software. 2007;21:1–16. doi: 10.18637/jss.v021.i07.
  • [45]. Gramacki A, Gramacki J. FFT-Based Fast Bandwidth Selector for Multivariate Kernel Density Estimation. Computational Statistics & Data Analysis. 2017;106:27–45. doi: 10.1016/j.csda.2016.09.001.
  • [46]. Duong T, Hazelton ML. Cross-validation Bandwidth Matrices for Multivariate Kernel Density Estimation. Scandinavian Journal of Statistics. 2005;32:485–506.
  • [47]. Das S, Suganthan PN. Differential evolution: A survey of the state-of-the-art. IEEE Transactions on Evolutionary Computation. 2011;15:4–31. doi: 10.1109/TEVC.2010.2059031.
  • [48]. Morel P. Gramm: grammar of graphics plotting in Matlab. Journal of Open Source Software. 2018;3. doi: 10.21105/joss.00568.
  • [49]. Vernizzi G, Nakai M. A Geometrical Framework for Covariance Matrices of Continuous and Categorical Variables. Sociol Method Res. 2015;44:48–79. doi: 10.1177/0049124114543243.
