MiSDEED: a synthetic data engine for microbiome study power analysis and study design

Philippe Chlenski; Melody Hsu; Itsik Pe’er

2022 Jun 16;2(1):vbac043. doi: 10.1093/bioadv/vbac043


Editor: Sofia Forslund

PMCID: PMC9710642 PMID: 36699411

Abstract

Summary

MiSDEED (Microbial Synthetic Data Engine for Experimental Design) is a command-line tool for generating synthetic longitudinal multinode data from simulated microbial environments. It generates relative-abundance timecourses under perturbations for an arbitrary number of time points, samples, locations and data types. All simulation parameters are exposed to the user to facilitate rapid power analysis and aid in study design. Users who want additional flexibility may also use MiSDEED as a Python package.

Availability and implementation

MiSDEED is written in Python and is freely available at https://github.com/pchlenski/misdeed.

1 Introduction

The behavior of the microbiome, partially elucidated by improvements in genome sequencing and data analysis, is generating considerable research interest. For instance, the Human Microbiome Project (Turnbaugh et al., 2007) endeavors to collect data on a mass scale to investigate the role of the microbiome in the context of human health and disease. However, despite improvements in sequencing, sample collection itself still incurs significant overhead and many niches remain understudied. Furthermore, microbial relative-abundance data, the most typical form of data collected in such studies, have a number of properties that make classical statistical analysis challenging: they are longitudinal, compositional, noisy and stochastic (Bodein et al., 2019). Thus, investigators who wish to study specific ecosystems or develop new tools for inference on such datasets must commit to the costly process of gathering real data at scale, generate synthetic data from scratch or accept the potentially inappropriate assumptions of conventional power analysis.

The genetic power calculator (Purcell et al., 2003) streamlined research in statistical genetics by facilitating closed-form power analysis of hypothetical studies. Similarly, several tools help design microbiome studies: Web-GLV (Kuntal et al., 2019) enables researchers to visualize the dynamics of microbial systems using assumed ecological parameters, and Mattiello et al. (2016) provide a power calculator for case-control studies on microbial ecosystems near equilibrium. The steady-state assumptions underlying the closed-form estimation of statistical power here can be inappropriate in trajectory-dependent contexts such as dynamical systems with chaotic responses to noise or studies examining the dynamics of a system’s transition between steady states. Moreover, for data-dependent applications such as machine learning model development, closed-form estimates of statistical power are unhelpful, whereas direct access to the actual underlying simulation is useful. The generalized Lotka-Volterra (gLV) modeling assumptions underlying the steady-state characterization are already used to generate synthetic data when designing inference methods for longitudinal microbial relative-abundance data (Joseph et al., 2020a).

Here, we present MiSDEED: the Microbial Synthetic Data Engine for Experimental Design, a flexible tool for generating synthetic longitudinal data from dynamic simulated ecosystems. Synthetic data generated by MiSDEED can be used in study design to simulate the analysis of real data collected under varying regimes, or in machine learning for model design and transfer learning.

2 Overview

2.1 Generative model

MiSDEED’s synthetic data generator (drawn in Fig. 1) samples reads from probability distributions governed by gLV dynamics over a discrete set of time points T. Each generator has a set I of nodes, which may represent different data types (e.g. metagenomics and metabolomics measurements of the same system) or two interacting ecosystems with the same data type.

Each node $i \in I$ is initialized with a fixed dimensionality $d_i$, a vector of growth rates $\vec{g_i}$, and an initial abundance vector $z_{i,0}$. A generator also has up to $|I|^{2}$ pairwise directed interactions between nodes. An interaction $M_{i,i'}$ between some nodes $i$ and $i'$ is a matrix of dimension $d_i \times d_{i'}$. The matrix $M_{i,i'}$ describes the elementwise effects of each element in node $i$ on each element in node $i'$. Finally, the generator has a set $J$ of interventions which may be applied to any node $i \in I$ such that each intervention $j$ has a vector $u_j$ of intervention magnitudes and another vector $b_j$ of responses to the intervention. If intervention $j$ is applied to node $i$, then $u_j$ should have $|T|$ dimensions and $b_j$ should have $d_i$ dimensions.

Once generator parameters have been set, synthetic data can be produced in one of three ways: as a single timecourse, as multiple timecourses from varying initial conditions or as multiple timecourses following a case-control split. In each case, the generator numerically solves the following equations with a biological noise term ε:

$$
\begin{aligned}
\varepsilon &\sim \mathcal{N}(\vec{0}, \sigma I) \\
\frac{d z_{i,t}}{dt} &= z_{i,t-1} \odot \Big( g_i + \sum_{i' \in I} M_{i,i'} z_{i',t} + \sum_{j \in J} u_{j,t} b_j + \varepsilon \Big)
\end{aligned}
\tag{1}
$$
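As a rough illustration of how Equation 1 can be solved numerically, the following sketch integrates a single node with a forward-Euler step. It is not MiSDEED's own code: the function name `simulate_glv`, its arguments, the clipping of abundances at zero, and the choice to evaluate the right-hand side at the previous time step are all assumptions made for this example. The noise variance σ is passed as `noise_var`.

```python
import numpy as np

def simulate_glv(z0, g, M, u, b, n_steps, dt=0.01, noise_var=0.0, rng=None):
    """Forward-Euler sketch of Equation 1 for one node (hypothetical helper).

    z0: (d,) initial abundances      g: (d,) growth rates
    M:  (d, d) interaction matrix    u: (n_steps,) intervention magnitudes
    b:  (d,) intervention responses
    Returns Z with shape (n_steps + 1, d); row t holds z_t.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = len(z0)
    Z = np.empty((n_steps + 1, d))
    Z[0] = z0
    for t in range(1, n_steps + 1):
        # Biological noise term: epsilon ~ N(0, noise_var * I)
        eps = rng.normal(0.0, np.sqrt(noise_var), size=d)
        # Per-element growth rate from growth, interactions, intervention, noise
        rate = g + M @ Z[t - 1] + u[t - 1] * b + eps
        # Euler step; clipping at zero is an assumption to keep abundances valid
        Z[t] = np.maximum(Z[t - 1] + dt * Z[t - 1] * rate, 0.0)
    return Z

# Two non-interacting, self-limiting taxa with no intervention
g = np.array([0.5, 0.3])
Z = simulate_glv(np.array([0.1, 0.1]), g, -np.eye(2),
                 u=np.zeros(2000), b=np.zeros(2), n_steps=2000)
```

With a purely diagonal interaction matrix each taxon follows logistic growth, which makes this configuration a convenient sanity check before adding cross-taxon interactions or noise.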

Each timecourse contains three derived matrices of synthetic data: Z (latent absolute abundances), X (latent relative abundances/probabilities) and Y (observed relative abundances sampled from X). To simulate read sampling, at each time point t a fixed number of reads R is drawn according to a multinomial distribution parameterized by the relative abundances X_t, i.e. the t-th row of X:

$$
Y_t \sim \mathrm{Multinomial}(R, X_t)
\tag{2}
$$
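The read-sampling step in Equation 2 can be sketched as follows. The helper name `sample_reads` is hypothetical, and the sketch assumes every row of Z has a positive total so that the normalization to X is well defined.

```python
import numpy as np

def sample_reads(Z, n_reads, rng=None):
    """Draw R reads per time point from latent abundances (hypothetical helper).

    Z: (T, d) latent absolute abundances with positive row sums.
    Returns X (latent relative abundances) and integer read counts,
    both of shape (T, d).
    """
    rng = np.random.default_rng() if rng is None else rng
    # X_t: normalize each row of Z so that it sums to 1
    X = Z / Z.sum(axis=1, keepdims=True)
    # Equation 2: one multinomial draw of n_reads reads per time point
    counts = np.vstack([rng.multinomial(n_reads, x_t) for x_t in X])
    return X, counts

Z = np.array([[1.0, 2.0, 1.0],
              [0.5, 0.5, 3.0]])
X, counts = sample_reads(Z, n_reads=1000, rng=np.random.default_rng(0))
Y = counts / counts.sum(axis=1, keepdims=True)  # observed relative abundances
```

Dividing the counts by R gives observed relative abundances whose sampling noise shrinks as the read depth grows.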

This generative process is outlined graphically in Figure 1.

Each gLV parameter (the growth rates $g_i$, initial abundances $z_{i,0}$ and interaction matrices $M_{i,i'}$) can be specified by the user. Guidelines for convenient inference of gLV parameters can be found in the ‘parameter inference’ section. In practice, users may not have good estimates of gLV parameters. In these circumstances, the generator defines the gLV parameters according to the following distributions:

$$
\begin{aligned}
g_i &\sim U(-1, 1) \\
z_{i,0} &\sim \mathrm{Lognormal}(0, 1)
\end{aligned}
\tag{3}
$$

Interaction matrices are generated using the method described in Allesina and Tang (2012): first a square D × D matrix M is constructed, where

$$
D = \sum_{i \in I} d_i
\tag{4}
$$

is the sum of all of the node dimensions in the generator. Then, each pair of symmetric off-diagonal entries $M_{i,j}$ and $M_{j,i}$, where $i \neq j$, is populated by a draw from the following multivariate normal distribution:

$$
\langle M_{i,j}, M_{j,i} \rangle \sim \mathcal{N}\left( \langle 0, 0 \rangle, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right)
\tag{5}
$$

The entries of M are subsequently sparsified according to the connectivity parameter C such that each entry is set to 0 with probability $1 - C$:

$$
\begin{aligned}
I_{ij} &\sim \mathrm{Bernoulli}(C) \\
M_{ij} &= M_{ij} \cdot I_{ij}
\end{aligned}
\tag{6}
$$

The diagonal entries $M_{i, i}$ of M are set to d, the self-interaction penalty. Finally, the matrix M is divided up into node–node interaction matrices according to the dimensions of each node: the square submatrices of size $d_{i} \times d_{i}$ on the diagonals become self-interactions, whereas off-diagonal submatrices become cross-node interactions. The parameters for generating the interactions can be set by the user, as noted in Table 1.
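The construction in Equations 4 to 6 can be sketched in a few lines of NumPy. The function name `random_interactions`, its default values, and the dictionary of blocks it returns are assumptions for illustration and are not taken from the MiSDEED package. The diagonal is set to whatever value of `d` the caller passes; a negative value makes each element self-limiting.

```python
import numpy as np

def random_interactions(dims, C=0.5, d=-5.0, rho=0.0, rng=None):
    """Random block interaction matrix (hypothetical helper).

    dims: node dimensions d_i; C: connectivity; d: diagonal value;
    rho: correlation between M[i, j] and M[j, i].
    """
    rng = np.random.default_rng() if rng is None else rng
    D = sum(dims)                                  # Equation 4
    M = np.zeros((D, D))
    iu, ju = np.triu_indices(D, k=1)               # index pairs with i < j
    cov = [[1.0, rho], [rho, 1.0]]
    pairs = rng.multivariate_normal([0.0, 0.0], cov, size=len(iu))
    M[iu, ju] = pairs[:, 0]                        # Equation 5, entry (i, j)
    M[ju, iu] = pairs[:, 1]                        # Equation 5, entry (j, i)
    M *= rng.random((D, D)) < C                    # Equation 6: Bernoulli(C) mask
    np.fill_diagonal(M, d)                         # self-interaction entries
    # Split M into node-node blocks according to the node dimensions
    edges = np.cumsum([0] + list(dims))
    blocks = {(a, b): M[edges[a]:edges[a + 1], edges[b]:edges[b + 1]]
              for a in range(len(dims)) for b in range(len(dims))}
    return M, blocks

M, blocks = random_interactions(dims=(2, 3), C=0.3, rho=-0.5,
                                rng=np.random.default_rng(0))
```

Here `blocks[(0, 0)]` and `blocks[(1, 1)]` are the square self-interaction matrices and `blocks[(0, 1)]`, `blocks[(1, 0)]` are the cross-node interactions.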

Table 1. Variable parameters in MiSDEED

Category	Parameters
Generator	Number of time points, number of nodes, node names, node dimensions, time to first sample
Random interaction matrices	C (connectivity), d (negative self-interaction size), σ (multivariate normal variance), ρ (multivariate normal correlation)
Custom gLV parameters	Interaction matrices, growth rates, initial abundances, interventions and intervention responses
Synthetic data generation	Biological noise variance, number of reads, time step size, downsampling rate
Multiple samples	Number of individuals, probability of 0-valued initial abundances
Case-control	Case-control ratio, intervention node, intervention effect size


2.2 Generalized generative model

The previous section described the generation of a single timecourse, which would correspond to a single community. Users may wish to generate multiple timecourses at once, for instance to simulate an entire cohort of patients. MiSDEED allows partial or complete sharing of gLV parameters and sampling parameters across individuals. By default, all gLV parameters are kept constant with the exception of initial abundances $z_{i,0}$, which are randomly drawn from the same distribution for each individual.

MiSDEED also provides rudimentary case-control functionality: the case condition is modeled as an always-on intervention with a random response vector parameterized by the effect size s:

$$
\begin{aligned}
u_c &= \vec{1} \\
b_c &\sim U(-s/2, s/2)
\end{aligned}
\tag{7}
$$
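A minimal sketch of Equation 7 follows. The function name, the `case_fraction` argument, the decision to share one response vector across all cases, and the use of an all-zero magnitude vector for controls are assumptions made for this example; the source only specifies the always-on magnitude and the uniform response distribution.

```python
import numpy as np

def case_control_interventions(n_individuals, n_timepoints, d, effect_size,
                               case_fraction=0.5, rng=None):
    """Build case/control labels and intervention vectors (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    n_cases = int(round(case_fraction * n_individuals))
    is_case = np.arange(n_individuals) < n_cases
    # u_c = 1 at every time point for cases; 0 for controls (assumption)
    u = np.repeat(is_case[:, None].astype(float), n_timepoints, axis=1)
    # b_c ~ U(-s/2, s/2): one response per element of the intervened node
    b_c = rng.uniform(-effect_size / 2, effect_size / 2, size=d)
    return is_case, u, b_c

is_case, u, b_c = case_control_interventions(
    n_individuals=10, n_timepoints=50, d=4, effect_size=0.2,
    rng=np.random.default_rng(0))
```

Each row of `u` and the shared `b_c` can then be supplied as the intervention magnitudes and responses of Equation 1 for the corresponding individual.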

2.3 Usage

MiSDEED is designed to be used as a standalone command-line tool. The MiSDEED repository also contains the Python package underlying MiSDEED, a handful of utility scripts to support data visualization and learning gLV parameters and a set of Jupyter notebooks showing common uses of the MiSDEED Python package. MiSDEED can produce, save and plot large amounts of synthetic data with varying initial conditions and model assumptions. To support power analysis, many variables can freely be changed by the user. These are listed in Table 1.

2.4 Visualization

The MiSDEED command-line tool and Python package contain two functions for data visualization. One function allows the visualization of individual timecourses as stacked bar plots, as seen in Figure 2, Subplot A. This is one of the most common visualizations used for microbial relative-abundance data, and it allows inspection of changes in relative abundance over time or in response to interventions. When stacked bar plots are generated for multiple systems at once, they can provide some intuition into the convergence time and existence of stable attractive states for a particular system of equations. A second function allows the visualization of many trajectories at once by projecting $d_i$-dimensional trajectories down to their top two principal components, similar to the plot in Figure 2, Subplot B. Principal component analysis (PCA) visualizations facilitate easy inspection of trajectory convergence and change of states under interventions. For example, in Figure 2, Subplot B, it becomes evident that the case and control trajectories diverge in PCA space, corroborating the intuition that their differences should be detectable by an appropriate choice of statistical test and experimental conditions.

Fig. 2. (A) Simulated metagenomic (top) and metabolomic (bottom) relative-abundance timecourses with an intervention at t = 20 (blue line). This intervention affects metabolite abundances directly and propagates into the metagenomics node gradually via metabolomics–metagenomics interactions. (B) Twenty noiseless PCA-transformed case and control metagenomic trajectories show how interventions induce convergence to distinct fixed points.
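The projection behind a plot such as Figure 2B can be sketched with a singular value decomposition. This is an illustrative helper, not MiSDEED's plotting function; the name `pca_project` and the choice to fit a single PCA on all individuals and time points pooled together are assumptions.

```python
import numpy as np

def pca_project(trajectories, n_components=2):
    """Project trajectories onto their top principal components.

    trajectories: array of shape (n_individuals, n_timepoints, d).
    Returns an array of shape (n_individuals, n_timepoints, n_components).
    """
    n, T, d = trajectories.shape
    flat = trajectories.reshape(n * T, d)          # pool all observations
    centered = flat - flat.mean(axis=0)            # PCA requires centered data
    # Rows of Vt are principal axes, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ Vt[:n_components].T
    return scores.reshape(n, T, n_components)

rng = np.random.default_rng(0)
projected = pca_project(rng.random((20, 50, 6)))
```

Plotting `projected[k, :, 0]` against `projected[k, :, 1]` for each individual `k`, colored by case or control status, gives one curve per trajectory.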

2.5 Power analysis example

As an example use case, one may use a community matrix and growth rates learned from a pilot dataset and initialize ‘metagenomics’ and ‘metabolomics’ nodes such that the latter has no intrinsic growth rates or self-interactions, but interactions with the ‘metagenomics’ node according to some a priori assumptions. Perturbing metabolite abundances directly, the user may investigate how many patients must be enrolled in order to distinguish reliably between samples with and without this perturbation applied.

In our example, we demonstrate that for a fixed number of study participants, increasing the read depth has a marked effect on the probability of observing a statistically significant difference between samples. For a study with 30 participants, it is sufficient to use a read depth of 300 to distinguish between case and control samples with high probability.

A Jupyter notebook demonstrating a detailed approach to power analysis using MiSDEED generators is included in the MiSDEED GitHub repository.
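The general idea of simulation-based power analysis can be sketched independently of the notebook. The example below is deliberately simplified and does not reproduce the authors' analysis: it assumes fixed, hypothetical latent relative-abundance vectors for cases and controls with no between-individual variation, tests a single taxon with a permutation test, and estimates power as the fraction of simulated studies with a significant result. All function and argument names are illustrative.

```python
import numpy as np

def permutation_pvalue(values, is_case, n_perm, rng):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(values[is_case].mean() - values[~is_case].mean())
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(is_case)            # shuffle group labels
        hits += abs(values[perm].mean() - values[~perm].mean()) >= observed
    return (hits + 1) / (n_perm + 1)

def estimate_power(x_control, x_case, n_participants, n_reads, taxon=0,
                   n_studies=100, n_perm=199, alpha=0.05, seed=0):
    """Fraction of simulated studies in which the taxon differs significantly."""
    rng = np.random.default_rng(seed)
    is_case = np.arange(n_participants) < n_participants // 2
    significant = 0
    for _ in range(n_studies):
        # Draw one multinomial read sample per participant from its group
        counts = np.where(
            is_case[:, None],
            rng.multinomial(n_reads, x_case, size=n_participants),
            rng.multinomial(n_reads, x_control, size=n_participants))
        observed_abundance = counts[:, taxon] / n_reads
        p = permutation_pvalue(observed_abundance, is_case, n_perm, rng)
        significant += p < alpha
    return significant / n_studies

x_control = np.array([0.5, 0.3, 0.2])   # hypothetical latent abundances
x_case = np.array([0.4, 0.4, 0.2])
for depth in (10, 100, 1000):
    print(depth, estimate_power(x_control, x_case, n_participants=30,
                                n_reads=depth))
```

Sweeping `n_reads` or `n_participants` in this way traces out a power curve; a fuller analysis would replace the fixed abundance vectors with simulated gLV trajectories and add individual-level variation.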

2.6 Parameter inference

Many users may have a pilot dataset they wish to use as a basis for their simulations, but no existing gLV parameters. To facilitate the use of MiSDEED in this common situation, an implementation of the method for gLV trajectory inference given in Stein et al. (2013) is provided as part of the MiSDEED Python package. In the notation introduced earlier in this paper, the Stein et al. (2013) method consists of computing the matrix

$$
(M_{i,i}, g_i, b_i) = F Y^{T} (Y Y^{T} + D_{\lambda})^{-1},
\tag{8}
$$

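where F is the matrix of time-scaled changes in $z_t$ between successive time points, $D_\lambda$ is a diagonal matrix of regularization weights, and Y is a row-by-row concatenation of the abundances $z_t$, a vector $\vec{1}$ of 1s and all time-varying intervention indicator vectors $u_j$. Note that Y here denotes this design matrix rather than the observed relative abundances of Equation 2.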
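Equation 8 amounts to a ridge regression, which can be sketched as below. This is not the implementation shipped with MiSDEED. The sketch assumes that F holds differences of log-abundances divided by the time step, uses a single regularization weight `lam` for every parameter, and omits interventions, so only the interaction matrix and growth rates are estimated. The name `infer_glv` is hypothetical.

```python
import numpy as np

def infer_glv(Z_list, dt, lam=1e-3):
    """Ridge-regression estimate of gLV parameters (hypothetical helper).

    Z_list: list of (T, d) absolute-abundance trajectories, strictly positive.
    Returns M_hat with shape (d, d) and g_hat with shape (d,).
    """
    # F: time-scaled log-abundance changes, one column per time step
    F = np.hstack([(np.diff(np.log(Z), axis=0) / dt).T for Z in Z_list])
    # Y: abundances z_t stacked on top of a row of ones (intercept for g)
    Y = np.hstack([np.vstack([Z[:-1].T, np.ones(len(Z) - 1)]) for Z in Z_list])
    D_lam = lam * np.eye(Y.shape[0])
    theta = F @ Y.T @ np.linalg.inv(Y @ Y.T + D_lam)   # Equation 8
    d = F.shape[0]
    return theta[:, :d], theta[:, d]

# Noiseless toy trajectories generated with a matching discretization
rng = np.random.default_rng(0)
M_true = np.array([[-1.0, 0.2], [0.1, -1.0]])
g_true = np.array([0.5, 0.8])
Z_list = []
for _ in range(4):
    Z = [rng.lognormal(0.0, 1.0, size=2)]
    for _ in range(30):
        Z.append(Z[-1] * np.exp(0.1 * (g_true + M_true @ Z[-1])))
    Z_list.append(np.array(Z))
M_hat, g_hat = infer_glv(Z_list, dt=0.1, lam=1e-8)
```

Pooling several trajectories with different initial conditions helps keep $Y Y^{T}$ well conditioned, since a single trajectory near steady state carries little information about the interactions.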

It is worth noting that this method is sensitive to noise, demands a relatively large sample size in order to be accurate and presumes that absolute abundances are observed. Users who wish to do inference on relative-abundance trajectories should consider looking into the compositional Lotka-Volterra method laid out in Joseph et al. (2020b).

2.7 Data realism

Researchers considering the use of MiSDEED for power analysis may naturally be curious about benchmarking the realism of the MiSDEED-synthesized data according to some metrics of interest or altering the generative model underlying MiSDEED to more closely match an alternative set of assumptions about the way that their data should look.

We provide a Jupyter notebook demonstrating how all probability distributions can be overridden in the MiSDEED package, as well as a general outline of how data realism can be compared across synthetic and empirical datasets. The empirical data from Stein et al. (2013) are used for comparisons. Specifically, datasets are compared in terms of the distribution of alpha-diversities across samples, differential abundance between real and simulated datasets, and sparsity of various samples. Since the realism of MiSDEED’s simulations can be quite sensitive to the quality of the gLV parameter inference described earlier, we perform the comparisons to empirical data side by side with comparisons to gLV parameters inferred from MiSDEED simulations.

3 Discussion

MiSDEED dispenses with the steady-state assumptions of classical microbiome power analysis, instead offering a flexible framework for rapidly generating large amounts of realistic microbial trajectory data which can be used for study design, transfer learning and algorithm development. MiSDEED relies on gLV, which fails to model ecosystems with a high degree of mutualism and therefore limits some of its uses; however, its modular design makes it possible to use other models. Future development will focus on expanding code-free interfaces to MiSDEED; more flexible modeling assumptions for broader use cases, including nonuniform time points, individual variation in interaction matrices and growth rates, and population clusters; alternatives to gLV-based modeling for dynamics like mutualism; methods for modeling spatial ecological dynamics; phylogenetically related dimensions on nodes; and investigation into the value of MiSDEED-generated data for transfer learning and algorithm development.

Funding

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship to P.C. under grant no. DGE-2036197, NIH/NCI Grant No. U54CA209997 Driving Biological Projects, and Columbia University’s 2020/2021 Data Science Institute Seed Grant.

Conflict of Interest: none declared.

Data availability

All data used in this study are available at github.com/pchlenski/misdeed.

Contributor Information

Philippe Chlenski, Department of Computer Science, Columbia University, New York, NY 10027, USA.

Melody Hsu, Department of Computer Science, Columbia University, New York, NY 10027, USA.

Itsik Pe’er, Department of Computer Science, Columbia University, New York, NY 10027, USA; Department of Systems Biology, Columbia University, New York, NY 10027, USA; Data Science Institute, Columbia University, New York, NY 10027, USA.

References

Allesina S., Tang S. (2012) Stability criteria for complex ecosystems. Nature, 483, 205–208.
Bodein A. et al. (2019) A generic multivariate framework for the integration of microbiome longitudinal studies with other data types. Front. Genet., 10, 963.
Joseph T. et al. (2020a) Efficient and accurate inference of mixed microbial population trajectories from longitudinal count data. Cell Syst., 10, 463–469.
Joseph T. et al. (2020b) Compositional Lotka-Volterra describes microbial dynamics in the simplex. PLoS Comput. Biol., 16, e1007917.
Kuntal B. et al. (2019) Web-gLV: a web based platform for Lotka-Volterra based modeling and simulation of microbial populations. Front. Microbiol., 10, 288.
Mattiello F. et al. (2016) A web application for sample size and power calculation in case-control microbiome studies. Bioinformatics, 13, 2038–2040.
Purcell S. et al. (2003) Genetic power calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics, 19, 149–150.
Stein R. et al. (2013) Ecological modeling from time-series inference: insight into dynamics and stability of intestinal microbiota. PLoS Comput. Biol., 9, e1003388.
Turnbaugh P. et al. (2007) The human microbiome project. Nature, 449, 804–810.

