Bioinformatics. 2020 Jan 8;36(8):2623–2625. doi: 10.1093/bioinformatics/btz971

Gapsplit: efficient random sampling for non-convex constraint-based models

Thomas C Keaty, Paul A Jensen
Editor: Jonathan Wren
PMCID: PMC7178416  PMID: 31913465

Abstract

Summary

Gapsplit generates random samples from convex and non-convex constraint-based models by targeting under-sampled regions of the solution space. Gapsplit provides uniform coverage of linear, mixed-integer and general non-linear models.

Availability and implementation

Python and Matlab source code are freely available at http://jensenlab.net/tools.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Constraint-based models allow systems-level interrogation of biochemical networks with biochemical parameters. Constraint-based metabolic models are represented by an underdetermined system of equations that produce an infinite number of solutions. Rather than focus on any single solution, modelers can use ensembles of randomly sampled solutions to analyze network properties (Schellenberger and Palsson, 2009).

The leading algorithms for sampling constraint-based models use the ‘hit-and-run’ (HR) framework (Smith, 1984). HR sampling walks through a model’s solution space by randomly selecting directions based on a set of warmup points. HR’s efficiency is tied to the convexity of the solution space. Since a convex combination of any number of existing solutions is also a solution, HR algorithms can quickly generate new solutions without re-solving the model. When applied to constraint-based models, current HR samplers [ACHR (Kaufman and Smith, 1998), OptGP (Megchelenbrink et al., 2014) and CHRR (Haraldsdóttir et al., 2017)] quickly generate a series of samples that converge to a stable distribution. However, reactions with fixed bounds can drastically reduce the fraction of the total sample space covered by the HR samplers [see Binns et al. (2015) and data below]. One random sampler with improved coverage uses a ‘poling’ method to push the random walk of the HR sampler away from previous samples (Binns et al., 2015). While the poling method improves coverage, the resulting optimization problems are non-linear and require orders of magnitude more computation time.
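The convexity property that HR samplers rely on can be illustrated with a minimal Python sketch. The three-reaction network, its bounds and the helper names `is_feasible` and `convex_combination` are hypothetical and are not taken from any of the cited samplers. The sketch only shows that a weighted average of feasible flux vectors satisfies the same linear constraints, so no solver call is needed to obtain the new point.

```python
import random


def is_feasible(v, lb, ub, S, tol=1e-9):
    """Check S v = 0 and lb <= v <= ub for a toy flux vector v."""
    in_bounds = all(l - tol <= x <= u + tol for x, l, u in zip(v, lb, ub))
    balanced = all(abs(sum(s * x for s, x in zip(row, v))) <= tol for row in S)
    return in_bounds and balanced


def convex_combination(points, weights):
    """Weighted average of points; weights are non-negative and sum to 1."""
    n = len(points[0])
    return [sum(w * p[i] for w, p in zip(weights, points)) for i in range(n)]


# Hypothetical toy network: uptake v0 splits into v1 and v2 (v0 - v1 - v2 = 0).
S = [[1.0, -1.0, -1.0]]
lb = [0.0, 0.0, 0.0]
ub = [10.0, 10.0, 10.0]

# Three feasible "warmup" points chosen by hand for this illustration.
warmup = [[10.0, 10.0, 0.0], [10.0, 0.0, 10.0], [0.0, 0.0, 0.0]]

# Random non-negative weights, normalized to sum to 1.
rng = random.Random(0)
raw = [rng.random() for _ in warmup]
weights = [r / sum(raw) for r in raw]

new_point = convex_combination(warmup, weights)
print(is_feasible(new_point, lb, ub, S))
```

The same argument fails once integer or binary variables are present, because an average of two integer-feasible points need not be integer-feasible.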

Adding transcriptional regulation or enzymatic complexes to a model requires discrete variables, making the model non-convex. HR samplers cannot directly sample non-convex models. As a workaround, the ll-ACHR sampler uses a boxing approximation to sample models with (non-convex) loopless flux constraints (Saa and Nielsen, 2016). The box constraints enclose the non-convex solution space with a convex hull, and the convex hull, not the original solution space, is sampled. Care must be taken to reject any infeasible samples that lie outside the solution space but inside the convex hull. The efficiency of boxed models also decreases with additional discrete variables or non-convex constraints (Kiatsupaibul et al., 2011).
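The cost of sampling an enclosing convex region and rejecting infeasible points can be sketched in Python. This is an illustration only, under loud assumptions: the L-shaped region is a hypothetical stand-in for a non-convex solution space, the enclosing region is a simple bounding box, and proposals are drawn uniformly from the box rather than by the HR walk that ll-ACHR actually uses.

```python
import random


def in_solution_space(x, y):
    """Hypothetical non-convex region: an L-shape made of two thin rectangles."""
    horizontal = 0.0 <= x <= 10.0 and 0.0 <= y <= 1.0
    vertical = 0.0 <= x <= 1.0 and 0.0 <= y <= 10.0
    return horizontal or vertical


def sample_with_rejection(n_proposals, seed=0):
    """Propose points in the enclosing box [0, 10]^2 and keep the feasible ones."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        x, y = rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)
        if in_solution_space(x, y):
            accepted.append((x, y))
    return accepted


n_proposals = 10_000
accepted = sample_with_rejection(n_proposals)
acceptance_rate = len(accepted) / n_proposals
print(f"accepted {len(accepted)} of {n_proposals} proposals")
```

In this toy case the L-shape occupies 19% of the box area, so roughly four of every five proposals are discarded. The wasted fraction grows as the enclosing region becomes a looser fit to the true solution space.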

We present a new class of random sampler for constraint-based models. Our algorithm—called Gapsplit—uses mathematical programming to find solutions in the underexplored areas of the model’s solution space. Unlike HR algorithms, Gapsplit samples convex and non-convex models directly. Samples identified by Gapsplit uniformly cover a model’s solution space. Gapsplit yields better coverage than HR samplers for tightly constrained and non-convex models.

2 The Gapsplit sampler

Gapsplit is designed to find sample points that uniformly cover the entire solution space. Gapsplit’s objective is to minimize the size of each variable’s max gap, the largest interval between two adjacent sample points (Fig. 1A). Given a set of samples, Gapsplit selects a single variable and identifies its max gap. Gapsplit adds a constraint requiring the next solution be in the center of the max gap (the target; see Fig. 1A). The model is solved to find such a solution, the constraint is removed and the process is repeated with a different variable. To speed up sampling, Gapsplit also attempts to simultaneously split the max gaps of k randomly selected variables. Gapsplit uses a quadratic objective function to minimize the distance between the next solution and the centers of the max gaps for the k other reactions. (A complete description of the design and performance of Gapsplit is presented in the Supplementary Material.) The Gapsplit algorithm can be applied to any mathematical program including models with binary or integer constraints.
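The max gap and its target can be sketched in a few lines of Python. This is a minimal illustration and not the released Gapsplit code: the function names are hypothetical, the feasible bounds are assumed to count as gap endpoints, and the solver call is replaced by a stand-in that returns the target exactly, which is only valid for a single unconstrained variable.

```python
def max_gap_and_target(samples, lower, upper):
    """Return (width, target) for the largest gap of one variable.

    Assumption: the feasible bounds `lower` and `upper` act as gap endpoints.
    The target is the midpoint of the widest gap; ties go to the leftmost gap.
    """
    points = sorted([lower, upper] + list(samples))
    width, target = 0.0, lower
    for left, right in zip(points, points[1:]):
        if right - left > width:
            width = right - left
            target = (left + right) / 2.0
    return width, target


def split_repeatedly(lower, upper, n_samples):
    """Repeatedly split the max gap of a single variable."""
    samples = []
    for _ in range(n_samples):
        _, target = max_gap_and_target(samples, lower, upper)
        # Stand-in for "add the constraint x = target and solve the model".
        samples.append(target)
    return samples


print(max_gap_and_target([1.0, 2.0, 7.0], 0.0, 10.0))  # gap between 2 and 7
print(split_repeatedly(0.0, 8.0, 3))
```

In a real model the constrained problem is solved over all variables at once, so each new solution also adds a sample point for every variable that was not targeted.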

Fig. 1.

(A) Gapsplit finds random samples for each variable (orange circles) within the variable’s feasible range. If a variable is selected for targeting, the next solution will be at a target that splits the maximum remaining gap in half. (B) Gapsplit yields better coverage with fewer samples than HR samplers. The mean (solid line) and 95% confidence intervals (color shading) are shown for 100 runs of each sampler on three metabolic models: the bacteria Streptococcus mutans (iSMU v1.0; Jijakli and Jensen, 2019) and Pseudomonas aeruginosa (iMO1056; Oberhardt et al., 2008), and the yeast Saccharomyces cerevisiae (iND750; Duarte et al., 2004). Bounds were fixed to glucose minimal media as specified in the original publications. (C) The coverage of HR samplers improves if the bounds on exchange reactions are relaxed to arbitrarily large values. However, such conditions are not physiologically reasonable. (D) Gapsplit is slower than the CHRR sampler but faster than the ACHR sampler when models have fixed bounds. Mean time per 1000 samples is shown for 25 independent runs of each algorithm. Error bars show the standard deviation. Opening the bounds on the model slows all three algorithms. (E) Gapsplit can sample non-convex models. Adding binary constraints for gene associations (met + genes, blue) or logical constraints for transcriptional regulation (met + TRN, orange) does not affect sampling of the yeast metabolic model (met, black). Each line represents the mean coverage for 50 independent simulations.

3 Results

The fraction of the solution space sampled by an algorithm is the coverage (Binns et al., 2015), defined as

$$\text{coverage} = 1 - \operatorname{mean}\{\text{relative max gap}(x_i)\}$$

where $x_i$ is a variable in the model and the relative max gap is the max gap of $x_i$ divided by the feasible range of $x_i$. Coverage ranges from 0 (all points at the edges of the feasible range) to 1 (uniform coverage by an infinite number of points). A coverage of 0.9 indicates that the model’s variables, on average, have a maximum relative gap of 10%.
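The coverage definition can be written out as a short Python sketch. The function names and the list-of-lists sample layout are hypothetical choices for this illustration. The sketch assumes that each variable has a non-zero feasible range and that the feasible bounds count as gap endpoints, so a variable with no interior samples has a relative max gap of 1.

```python
def relative_max_gap(samples, lower, upper):
    """Largest gap between adjacent points, divided by the feasible range.

    Assumes upper > lower and treats the bounds as gap endpoints.
    """
    points = sorted([lower, upper] + list(samples))
    widest = max(right - left for left, right in zip(points, points[1:]))
    return widest / (upper - lower)


def coverage(sample_matrix, lower_bounds, upper_bounds):
    """coverage = 1 - mean(relative max gap) over all variables.

    sample_matrix is a list of samples, each a list with one value per variable.
    """
    n_vars = len(lower_bounds)
    gaps = []
    for j in range(n_vars):
        column = [row[j] for row in sample_matrix]
        gaps.append(relative_max_gap(column, lower_bounds[j], upper_bounds[j]))
    return 1.0 - sum(gaps) / n_vars


# Two variables on [0, 10]: the first is sampled only at its edges
# (relative gap 1.0), the second only at its center (relative gap 0.5).
samples = [[0.0, 5.0], [10.0, 5.0]]
print(coverage(samples, [0.0, 0.0], [10.0, 10.0]))
```

For this toy input the mean relative max gap is 0.75, so the coverage is 0.25. Nine evenly spaced samples of a single variable on [0, 10] leave a relative max gap of 10%, which corresponds to the coverage of 0.9 discussed in the text.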

Gapsplit achieves better overall coverage with fewer samples than HR algorithms. We generated samples from three genome-scale metabolic models using Gapsplit, ACHR and CHRR (Fig. 1B). Both ACHR and CHRR plateaued within a few hundred samples at a coverage of 0.2, meaning each variable, on average, had an unsampled gap that covered 80% of the feasible space. Gapsplit samples quickly covered the solution space, reaching a coverage of 0.8 within 500 samples and plateauing with coverage over 0.9 after 2000 samples. The models were sampled as published with default bounds corresponding to glucose minimal media. We hypothesized that the fixed bounds prevented the HR samplers from covering a larger fraction of the sample space. HR samplers use a constrained random walk where the step size and direction choice become limited as points approach the boundary of the solution space. The step size is truncated in directions where a full step would violate the model’s constraints. Thus the speed of the random walk is slow along ‘narrow’ dimensions of the solution space, i.e. when the model is highly constrained.
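The effect of step truncation can be sketched with a basic hit-and-run walk in Python. This is a simplified illustration, not ACHR or CHRR: the solution space is assumed to be a plain box with no steady-state constraints, directions are drawn uniformly at random instead of from warmup points, and all function names are hypothetical.

```python
import math
import random


def max_step(point, direction, lower, upper):
    """Largest t >= 0 such that point + t * direction stays inside the box."""
    t_max = math.inf
    for x, d, lo, hi in zip(point, direction, lower, upper):
        if d > 0:
            t_max = min(t_max, (hi - x) / d)
        elif d < 0:
            t_max = min(t_max, (lo - x) / d)
    return t_max


def hit_and_run_step(point, lower, upper, rng):
    """One step: random unit direction, then a uniform point on the feasible chord."""
    direction = [rng.gauss(0.0, 1.0) for _ in point]
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]
    t_fwd = max_step(point, direction, lower, upper)
    t_back = max_step(point, [-d for d in direction], lower, upper)
    t = rng.uniform(-t_back, t_fwd)  # the step is truncated at the walls
    moved = [x + t * d for x, d in zip(point, direction)]
    # Clamp to guard against floating-point overshoot at the boundary.
    return [min(max(v, lo), hi) for v, lo, hi in zip(moved, lower, upper)]


def mean_progress(lower, upper, n_steps=500, seed=0):
    """Mean absolute displacement along the first coordinate per step."""
    rng = random.Random(seed)
    point = [(lo + hi) / 2.0 for lo, hi in zip(lower, upper)]
    total = 0.0
    for _ in range(n_steps):
        new_point = hit_and_run_step(point, lower, upper, rng)
        total += abs(new_point[0] - point[0])
        point = new_point
    return total / n_steps


wide = mean_progress([0.0, 0.0], [10.0, 10.0])
narrow = mean_progress([0.0, 0.0], [10.0, 0.01])
print(f"wide box: {wide:.3f}  narrow box: {narrow:.3f}")
```

Both boxes have the same range along the first coordinate, but in the narrow box most directions hit a wall almost immediately, so the walk advances far less per step along the long dimension.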

Indeed, opening all bounds to arbitrarily large values ($-1000 \le x_i \le 1000$ for all exchange reactions $x_i$) improved the coverage of the ACHR and CHRR samplers (Fig. 1C). Gapsplit generated the best coverage for all models, but the HR samplers also achieved coverage of at least 0.8. The Gapsplit algorithm does not use a random walk. Instead, it jumps from one solution to the largest unexplored region of the model. The relative spacing of the samples is unaffected by the shape of the solution space, which allows Gapsplit to outperform HR samplers on models with constrained solution spaces.

For models with fixed bounds, Gapsplit is slower than CHRR and faster than ACHR on a per sample basis (Fig. 1D). However, Gapsplit is more efficient than either HR algorithm in the time required to reach a specific level of coverage since it yields better coverage per sample. Gapsplit is slower than either HR sampler for models with arbitrarily open bounds. However, we note that such models are not physiologically realistic since flux balance analysis requires at least one fixed constraint to limit nutrient uptake (Orth et al., 2010).

We can indirectly compare Gapsplit and the poling-based method (Binns et al., 2015) since both studies benchmark against the same ACHR implementation. Poling and Gapsplit achieve similar coverage on a genome-scale model. However, the poling-based method is 10–100 times slower than ACHR, while Gapsplit is only 2–10 times slower than ACHR. Besides efficiency, we believe that not requiring a non-linear solver is an advantage of Gapsplit over poling. Poling-based sampling of a mixed-integer model would require solving sequential mixed-integer non-linear programs, which are notoriously difficult and require specialized software. Gapsplit requires only the original solver used for the model, widening its applicability.

An advantage of Gapsplit is its ability to sample non-convex models, including those with discrete variables. We tested Gapsplit on two models of the yeast Saccharomyces cerevisiae with binary variables: a metabolic model (Duarte et al., 2004) with gene–protein–reaction rules encoded as Boolean constraints (Jensen et al., 2011; Shlomi et al., 2007) and a combined metabolic/regulatory model (Herrgård et al., 2006). The model with gene–protein–reaction rules has the same solution space as the original model, but adding the transcriptional regulatory network creates a non-convex solution space (Jensen et al., 2011). Gapsplit’s performance was unaffected by either the binary variables or the non-convexity (Fig. 1E).

Gapsplit’s performance is tuned by changing only a single parameter: the number of secondary variables to target at each iteration. Changing this parameter (expressed as a fraction of the model’s total number of variables) can affect Gapsplit’s performance (Supplementary Fig. S1). However, a single value (5%) was chosen as the default setting and worked well for all the experiments in this study. We do not expect users will need to tune this parameter for other models although they can change the parameter if needed.
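How the secondary-target fraction might translate into a set of targeted variables can be sketched in Python. Everything here is an assumption made for illustration: the function names are hypothetical, the count is rounded down, secondary variables are drawn uniformly without replacement, and the objective is an unweighted sum of squared distances. The released Gapsplit code may differ on each of these points.

```python
import math
import random


def choose_secondary(n_vars, primary, fraction=0.05, rng=None):
    """Pick secondary variables at random, excluding the primary variable.

    Assumption: the count is floor(fraction * n_vars).
    """
    rng = rng or random.Random()
    k = math.floor(fraction * n_vars)
    candidates = [j for j in range(n_vars) if j != primary]
    return rng.sample(candidates, min(k, len(candidates)))


def secondary_objective(solution, targets):
    """Sum of squared distances between a solution and the secondary targets.

    `targets` maps a variable index to the center of that variable's max gap.
    """
    return sum((solution[j] - t) ** 2 for j, t in targets.items())


secondary = choose_secondary(200, primary=3, rng=random.Random(1))
print(len(secondary))  # 5% of 200 variables
print(secondary_objective([1.0, 2.0, 3.0], {0: 1.0, 2: 5.0}))
```

In this sketch a solver would minimize `secondary_objective` subject to the model’s constraints plus the constraint that fixes the primary variable at its target, which keeps the problem a quadratic program (or a mixed-integer quadratic program when discrete variables are present).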

4 Conclusions

Gapsplit is a new class of random sampler for constraint-based models. It samples convex and non-convex models and gives better coverage than HR samplers on models with fixed bounds. Gapsplit is available in Matlab and Python and is compatible with models from the COBRA Toolbox (Heirendt et al., 2019), TIGER (Jensen et al., 2011) and cobrapy (Ebrahim et al., 2013). We believe Gapsplit opens new possibilities for exploring non-convex models including models with transcriptional regulation. Using Gapsplit, researchers can develop mixed-integer algorithms that incorporate random sampling.

Funding

This work was supported by the National Institutes of Health [EB027396].

Conflict of Interest: none declared.

Supplementary Material

btz971_Supplementary_Data

References

  1. Binns M. et al. (2015) Sampling with poling-based flux balance analysis: optimal versus sub-optimal flux space analysis of Actinobacillus succinogenes. BMC Bioinformatics, 16, 49.
  2. Duarte N.C. et al. (2004) Reconstruction and validation of Saccharomyces cerevisiae iND750, a fully compartmentalized genome-scale metabolic model. Genome Res., 14, 1298–1309.
  3. Ebrahim A. et al. (2013) COBRApy: COnstraints-Based Reconstruction and Analysis for python. BMC Syst. Biol., 7, 74.
  4. Haraldsdóttir H.S. et al. (2017) CHRR: coordinate hit-and-run with rounding for uniform sampling of constraint-based models. Bioinformatics, 33, 1741–1743.
  5. Heirendt L. et al. (2019) Creation and analysis of biochemical constraint-based models using the COBRA toolbox v.3.0. Nat. Protoc., 14, 639–702.
  6. Herrgård M.J. et al. (2006) Integrated analysis of regulatory and metabolic networks reveals novel regulatory mechanisms in Saccharomyces cerevisiae. Genome Res., 16, 627–635.
  7. Jensen P.A. et al. (2011) TIGER: toolbox for integrating genome-scale metabolic models, expression data, and transcriptional regulatory networks. BMC Syst. Biol., 5, 147.
  8. Jijakli K., Jensen P.A. (2019) Metabolic modeling of Streptococcus mutans reveals complex nutrient requirements of an oral pathogen. mSystems, 4, e00529-19.
  9. Kaufman D.E., Smith R.L. (1998) Direction choice for accelerated convergence in hit-and-run sampling. Oper. Res., 46, 84–95.
  10. Kiatsupaibul S. et al. (2011) An analysis of a variation of hit-and-run for uniform sampling from general regions. ACM Trans. Model. Comput. Simul., 21, 16:1–16:11.
  11. Megchelenbrink W. et al. (2014) optGpSampler: an improved tool for uniformly sampling the solution-space of genome-scale metabolic networks. PLoS One, 9, e86587.
  12. Oberhardt M.A. et al. (2008) Genome-scale metabolic network analysis of the opportunistic pathogen Pseudomonas aeruginosa PAO1. J. Bacteriol., 190, 2790–2803.
  13. Orth J.D. et al. (2010) What is flux balance analysis? Nat. Biotechnol., 28, 245–248.
  14. Saa P.A., Nielsen L.K. (2016) ll-ACHRB: a scalable algorithm for sampling the feasible solution space of metabolic networks. Bioinformatics, 32, 2330–2337.
  15. Schellenberger J., Palsson B.O. (2009) Use of randomized sampling for analysis of metabolic networks. J. Biol. Chem., 284, 5457–5461.
  16. Shlomi T. et al. (2007) A genome-scale computational study of the interplay between transcriptional regulation and metabolism. Mol. Syst. Biol., 3, 101.
  17. Smith R.L. (1984) Efficient Monte Carlo procedures for generating points uniformly distributed over bounded regions. Oper. Res., 32, 1296–1308.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
