Nunchaku: optimally partitioning data into piece-wise contiguous segments

Yu Huo; Hongpei Li; Xiao Wang; Xiaochen Du; Peter S Swain

doi:10.1093/bioinformatics/btad688

Bioinformatics. 2023 Nov 15;39(12):btad688. doi: 10.1093/bioinformatics/btad688

Editor: Pier Luigi Martelli

PMCID: PMC10697733 PMID: 37966918

Abstract

Motivation

When analyzing 1D time series, scientists are often interested in identifying regions where one variable depends linearly on the other. Typically, they use an ad hoc and therefore often subjective method to do so.

Results

Here, we develop a statistically rigorous, Bayesian approach to infer the optimal partitioning of a dataset not only into contiguous piece-wise linear segments, but also into contiguous segments described by linear combinations of arbitrary basis functions. We therefore present a general solution to the problem of identifying discontinuous change points. Focusing on microbial growth, we use the algorithm to find the range of optical density where this density is linearly proportional to the number of cells and to automatically find the regions of exponential growth for both Escherichia coli and Saccharomyces cerevisiae. For budding yeast, we consequently are able to infer the Monod constant for growth on fructose. Our algorithm lends itself to automation and high throughput studies, increases reproducibility, and should facilitate data analyses for a broad range of scientists.

Availability and implementation

The corresponding Python package, entitled Nunchaku, is available at PyPI: https://pypi.org/project/nunchaku.

1 Introduction

A common scientific problem is understanding the relationship between two variables. When the dependent variable, or some transformation of it, depends linearly on the independent variable, the underlying system linking the two often behaves more simply than it does in general. As a consequence, scientists commonly focus their efforts on identifying and understanding this linear regime.

A well-known example is the growth of a population of cells. In log phase, when the logarithm of the number of cells increases linearly with time, the total mass of every intracellular component grows exponentially and the mass per cell is approximately constant. Such steady-state conditions regularize growth; metabolic fluxes are balanced; and physiology simplifies, generating behaviours controlled by only a handful of variables (Scott and Hwa 2023).

Biologists therefore often wish to determine when growth is in log phase. Historically the approach has been to plot the logarithm of a variable correlating with the number of cells, such as optical density (OD), against time and to identify a linear region by eye (Monod 1949). Today this subjective technique is still used, with one scientist’s linear region not necessarily the same as another’s.

A challenge to developing objective approaches is identifying a suitable nonlinear model with which to compare the linear one. There is no general way to describe all relationships that we may observe. With a mechanistic understanding, we might generate a nonlinear description, but such an understanding is often lacking and, anyhow, may obviate the need to find linear regimes.

Here, we circumvent this problem by inferring the piece-wise linear description that best approximates an entire 1D time series. By doing so, we reframe the task to one of detecting change points—time points where the process generating the time series changes, a well-studied problem (Stephens 1994) with an established frequentist solution (Baranowski et al. 2019). We use a Bayesian approach, complementing others (Hutter 2007, Papastamoulis et al. 2019), and generalize by allowing each segment of data to be described by a linear combination of arbitrary basis functions, with straight lines being but one example. For a given set of basis functions, we compare the evidence for every possible piece-wise linear combination, found by marginalizing over all possible fits to all possible contiguous subdivisions of the data. For linear segments and for the optimal choice of segments, we provide statistics for each segment, allowing users to select straightforwardly the segment or segments of most interest. To illustrate our algorithm, we primarily discuss two examples: determining the range of OD of a liquid culture where the OD depends linearly on the number of cells and finding the exponential phases of microbial growth curves.

2 Materials and methods

2.1 Inferring contiguous regions using model comparison

Given 1D time-series data and a set of K basis functions, we wish to divide the data into the group of contiguous segments that is best characterized by piece-wise linear combinations of the basis functions. Irrespective of the data’s behaviour, we will always find such a group. Our approach answers two questions: how many piece-wise contiguous segments best describe the data given the basis functions and where the optimal segment boundaries lie.

Let us assume that we have observations, $(x_{j}, y_{j}^{(r)})$ , where j runs from 1 to N and the x_j are in ascending order; r indexes the N_r replicates if any. We denote these observations collectively as D.

First, we consider whether we should divide the data into M or $M'$ segments, using Bayesian model comparison (MacKay 2003). Assuming equal prior probabilities, $P (M) = P (M')$ , we write the Bayes’ factor as:

\frac{P (M | D)}{P (M' | D)} = \frac{P (D | M) P (M)}{P (D | M') P (M')} = \frac{P (D | M)}{P (D | M')},

(1)

and therefore we should determine the evidence $P (D | M)$ for each M.
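As a minimal sketch of this comparison, assume that we already hold the log evidence for each candidate M. The numbers below are invented for illustration and the function name is hypothetical. With equal priors, we can normalize the evidences into posterior probabilities, subtracting the largest log evidence first to avoid underflow:

```python
import math

def posterior_over_M(log_evidence):
    """Turn {M: log P(D|M)} into {M: P(M|D)} assuming equal priors P(M)."""
    shift = max(log_evidence.values())  # subtract the maximum to avoid underflow
    weights = {M: math.exp(v - shift) for M, v in log_evidence.items()}
    total = sum(weights.values())
    return {M: w / total for M, w in weights.items()}

# Hypothetical log evidences for M = 1..4 segments (illustrative values only)
log_evidence = {1: -250.0, 2: -180.0, 3: -176.5, 4: -179.0}
post = posterior_over_M(log_evidence)
best_M = max(post, key=post.get)
# Log of the ratio in Equation (1) for M = 3 against M' = 2
log_ratio_3_vs_2 = log_evidence[3] - log_evidence[2]
print(best_M, log_ratio_3_vs_2)
```

Working with logarithms matters in practice because the evidences themselves are typically far too small to represent as floating-point numbers.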

The evidence is a marginal likelihood. For M contiguous segments, there are M–1 unknown boundary points, which we denote as $n \equiv (n_{1}, \dots, n_{M - 1})$ with $n_{i} < n_{i + 1}$ . These points are integers and index an x_j. The two remaining boundaries are the indices for the first and last x values: 1 and N. We assume that each segment contains a minimal number of data points $ℓ_{min}$ , so that $n_{i + 1} \geq n_{i} + ℓ_{min}$ . The choice of $ℓ_{min}$ depends on the type and number of basis functions: in general, $ℓ_{min} \geq K$ .

The evidence is a sum over all potential n:

\begin{matrix} P (D | M) = \sum_{n} P (D | n, M) P (n | M) \\ = f (N, M, ℓ_{min}) \sum_{n} P (D | n, M) \end{matrix}

(2)

where, because every permissible n is equally likely, we write the prior $P (n | M)$ as a function of N, M, and $ℓ_{min}$ . Specifically, this bounded uniform prior is the reciprocal of the number of possible n, which satisfy

n_{1} \geq ℓ_{min}, n_{2} \geq n_{1} + ℓ_{min}, \dots, n_{M - 1} \leq N - ℓ_{min}

(3)

for a given M and $ℓ_{min}$ . We therefore have:

\begin{matrix} P (n | M) = [\sum_{n_{1} = ℓ_{min}}^{N - (M - 1) ℓ_{min}} \times \sum_{n_{2} = n_{1} + ℓ_{min}}^{N - (M - 2) ℓ_{min}} \times \dots \\ \times \sum_{n_{M - 1} = n_{M - 2} + ℓ_{min}}^{N - ℓ_{min}} 1]^{- 1} \\ = f (N, M, ℓ_{min}) . \end{matrix}

(4)
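To make the prior concrete, the sketch below counts the admissible boundary vectors in two ways: by direct enumeration and with a closed form. The closed form is a standard stars-and-bars identity (M segment lengths, each at least $ℓ_{min}$ , summing to N) and is our addition rather than something stated in the text; the function names are hypothetical.

```python
from itertools import combinations
from math import comb

def count_boundaries_brute(N, M, lmin):
    """Count boundary vectors n = (n_1, ..., n_{M-1}) allowed by Equation (3)."""
    count = 0
    for n in combinations(range(1, N), M - 1):
        edges = (0,) + n + (N,)
        if all(b - a >= lmin for a, b in zip(edges, edges[1:])):
            count += 1
    return count

def count_boundaries_closed(N, M, lmin):
    """Stars-and-bars count of M segment lengths, each >= lmin, summing to N."""
    spare = N - M * lmin  # points left after giving every segment lmin points
    return comb(spare + M - 1, M - 1) if spare >= 0 else 0

N, M, lmin = 20, 3, 3
n_configs = count_boundaries_closed(N, M, lmin)
prior = 1 / n_configs  # f(N, M, lmin) in Equation (4)
print(n_configs, prior)
```

Enumeration costs grow rapidly with M, so the closed form, or the variable elimination used in the text, is preferable for realistic N.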

Second, for a given M and n, we fit the data to M different linear combinations of the basis functions, one for each segment, with each combination independent of the others. The linear combination for the segment bounded by n_i and $n_{i + 1}$ depends only on the data with indices from $n_{i} + 1$ to $n_{i + 1}$ inclusive, denoted D_i, and these data do not determine any other linear combination. Therefore, mathematically,

\begin{matrix} P (D | n, M) = P (D_{1} | 1, n_{1}) \times P (D_{2} | n_{1} + 1, n_{2}) \times \dots \\ \times P (D_{M} | n_{M - 1} + 1, N) \end{matrix}

(5)

where $P (D_{i} | n_{i} + 1, n_{i + 1})$ is the likelihood of a linear combination of the basis functions describing the data indexed by $n_{i} + 1$ to $n_{i + 1}$ .

2.1.1 Finding $P (D | n, M)$

For each segment of the data, we consider the K basis functions, each individually denoted $ϕ_{k} (x)$ and collectively $ϕ (x)$ , and correspondingly K coefficients, each denoted m_k. If fitting lines, we have two basis functions: $ϕ_{1} = 1$ and $ϕ_{2} = x$ , and two m_k where m₁ determines the line’s y-intercept and m₂ its gradient. We then set $ℓ_{min} = 3$ so that there are sufficient data points in each segment to define a line.

We let $P (y_{j} | x_{j}, m)$ describe how a data point y_j at x_j deviates from the linear combination of basis functions and assume that this deviation is independent of the deviations of other data points.

For the ith segment, we then have

\begin{matrix} P (D_{i} | n_{i} + 1, n_{i + 1}) = \int d m P (m) \prod_{r = 1}^{N_{r}} \prod_{j = n_{i} + 1}^{n_{i + 1}} P (y_{j}^{(r)} | x_{j}, m) \\ = P (m) \int d m \prod_{r = 1}^{N_{r}} \prod_{j = n_{i} + 1}^{n_{i + 1}} P (y_{j}^{(r)} | x_{j}, m) \end{matrix}

(6)

assuming the prior $P (m)$ is a constant, with each m_k independently and uniformly distributed in some bounded region so that

P (m) = \begin{cases} \frac{1}{(m_{1}^{max} - m_{1}^{min}) \dots (m_{K}^{max} - m_{K}^{min})} & \text{for } m_{k} \in [m_{k}^{min}, m_{k}^{max}] \\ 0 & \text{otherwise} \end{cases}

(7)

for fixed $m_{k}^{min}$ and $m_{k}^{max}$ for all k.

2.1.2 Marginalizing $P (D | n, M)$

Using Equation (5), we factorize the sum in Equation (2):

\begin{matrix} \sum_{n} P (D | n, M) = \sum_{n_{1} = ℓ_{min}}^{N - (M - 1) ℓ_{min}} P (D_{1} | 1, n_{1}) \\ \times \sum_{n_{2} = n_{1} + ℓ_{min}}^{N - (M - 2) ℓ_{min}} P (D_{2} | n_{1} + 1, n_{2}) \times \dots \\ \times \sum_{n_{M - 2} = n_{M - 3} + ℓ_{min}}^{N - 2 ℓ_{min}} P (D_{M - 2} | n_{M - 3} + 1, n_{M - 2}) \\ \times \sum_{n_{M - 1} = n_{M - 2} + ℓ_{min}}^{N - ℓ_{min}} P (D_{M - 1} | n_{M - 2} + 1, n_{M - 1}) \\ \times P (D_{M} | n_{M - 1} + 1, N) \end{matrix}

(8)

and use the method of variable elimination (Zhang and Poole 1996) to evaluate these sums. First we perform the rightmost one, over $n_{M - 1}$ , to generate a function of $n_{M - 2}$ . We then perform the next rightmost sum, over $n_{M - 2}$ , of this function and the next term in Equation (8), which generates a function of $n_{M - 3}$ . We repeat this process until we reach the leftmost sum over n₁, requiring $O (M N^{2})$ operations in total instead of $O (N^{M})$ . We evaluate Equation (4) similarly.
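The elimination order described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the table `seg[a][b]` is a hypothetical stand-in for $P (D_{i} | a, b)$ and is filled with random numbers, and both function names are ours. The brute-force version enumerates every n and serves only as a check.

```python
import itertools
import math
import random

def brute_force_sum(seg, N, M, lmin):
    """Sum the product in Equation (5) over every admissible n: O(N^M)."""
    total = 0.0
    for n in itertools.combinations(range(1, N), M - 1):
        edges = (0,) + n + (N,)
        pairs = list(zip(edges, edges[1:]))
        if all(b - a >= lmin for a, b in pairs):
            total += math.prod(seg[a + 1][b] for a, b in pairs)
    return total

def eliminated_sum(seg, N, M, lmin):
    """Evaluate the same sum right to left, as in Equation (8): O(M N^2)."""
    # right[s] = summed likelihood of points s..N split into the remaining segments
    right = [0.0] * (N + 2)
    for s in range(1, N - lmin + 2):
        right[s] = seg[s][N]  # the last segment runs from s to N
    for _ in range(M - 1):
        new = [0.0] * (N + 2)
        for s in range(1, N + 1):
            # e is the last index of the segment that starts at s
            new[s] = sum(seg[s][e] * right[e + 1] for e in range(s + lmin - 1, N))
        right = new
    return right[1]

# Hypothetical table: seg[a][b] stands in for P(D_i | a, b), here random numbers
random.seed(1)
N, M, lmin = 14, 4, 3
seg = [[random.random() for _ in range(N + 1)] for _ in range(N + 1)]
print(brute_force_sum(seg, N, M, lmin), eliminated_sum(seg, N, M, lmin))
```

Infeasible starting points keep a value of zero in `right`, which enforces the minimal segment length without explicit upper limits on each sum. A practical implementation would work with logarithms to avoid underflow.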

All that remains is to determine $P (D_{i} | n_{i} + 1, n_{i + 1})$ so that we can find $P (D | M)$ via Equation (2) and Equation (8).

2.1.3 Finding $P (D_{i} | n_{i} + 1, n_{i + 1})$ for known measurement error

To proceed, we assume that $P (y_{j} | x_{j}, m)$ is a normal distribution with mean $ϕ {(x_{j})}^{T} m$ , or equivalently $\sum_{k} m_{k} ϕ_{k} (x_{j})$ , and a standard deviation σ_j. If we know the σ_j, e.g. by approximating each by the corresponding measurement error, then Equation (6), the likelihood of a linear combination describing the data indexed by $n_{i} + 1$ to $n_{i + 1}$ , becomes

\begin{matrix} P (D_{i} | n_{i} + 1, n_{i + 1}, σ) = P (m) \prod_{j = n_{i} + 1}^{n_{i + 1}} {(\sqrt{2 π} σ_{j})}^{- N_{r}} \\ \times \int d m exp [- \sum_{r = 1}^{N_{r}} \sum_{j = n_{i} + 1}^{n_{i + 1}} \frac{{[y_{j}^{(r)} - ϕ {(x_{j})}^{T} m]}^{2}}{2 σ_{j}^{2}}] . \end{matrix}

(9)

To evaluate the integral, we extend it to infinite range for all m_k—a suitable approximation because we expect the integrand to be strongly peaked at the most likely values of each m_k (MacKay 2003). We can then perform the integration analytically.

Consider data with a single replicate. Define $ℓ_{i} = n_{i + 1} - n_{i}$ to be the number of x values in the ith segment and $z^{(i)}$ to be a vector with components $y_{j} / σ_{j}$ , with the superscript i used to denote the ith segment. Let $Φ (X)$ be the $K \times ℓ_{i}$ matrix with components $Φ_{k j} = ϕ_{k} (x_{j}) / σ_{j}$ , and further define

\begin{matrix} A^{(i)} = Φ Φ^{T} & ; & {\bar{m}}^{(i)} = {(A^{(i)})}^{- 1} Φ z^{(i)} \end{matrix}

(10)

so that $A_{k k'}^{(i)} = \sum_{j} ϕ_{k} (x_{j}) ϕ_{k'} (x_{j}) / σ_{j}^{2}$ . The matrix $A^{(i)}$ is a symmetric K × K matrix, which is invertible when the basis functions $ϕ_{k}$ are linearly independent and when $ℓ_{i} \geq K$ . Then standard algebra gives

\sum_{j = n_{i} + 1}^{n_{i + 1}} \frac{{[y_{j} - ϕ {(x_{j})}^{T} m]}^{2}}{2 σ_{j}^{2}} = \frac{1}{2} {(m - {\bar{m}}^{(i)})}^{T} A^{(i)} (m - {\bar{m}}^{(i)}) + U^{(i)}

(11)

where

2 U^{(i)} = {(z^{(i)})}^{T} z^{(i)} - {({\bar{m}}^{(i)})}^{T} A^{(i)} {\bar{m}}^{(i)} .

(12)

Using Equation (11) and the results for integrating multivariate Gaussian distributions (MacKay 2003), we have that

\begin{matrix} \int d m exp [- \sum_{j = n_{i} + 1}^{n_{i + 1}} \frac{{[y_{j} - ϕ {(x_{j})}^{T} m]}^{2}}{2 σ_{j}^{2}}] = {(2 π)}^{\frac{K}{2}} {(det A^{(i)})}^{- \frac{1}{2}} \\ \times e^{- U^{(i)}} . \end{matrix}

(13)

If we are fitting straight lines with K = 2 and $ϕ_{1} = 1$ and $ϕ_{2} = x$ , then it is useful to define (Hinrichsen et al. 2017)

\begin{matrix} T_{1} & = & \sum_{j} \frac{y_{j}^{2}}{2 σ_{j}^{2}} & ; & T_{2} & = & \sum_{j} \frac{x_{j}^{2}}{2 σ_{j}^{2}} \\ T_{3} & = & \sum_{j} \frac{1}{2 σ_{j}^{2}} & ; & T_{4} & = & \sum_{j} \frac{y_{j}}{σ_{j}^{2}} \\ T_{5} & = & \sum_{j} \frac{x_{j} y_{j}}{σ_{j}^{2}} & ; & T_{6} & = & \sum_{j} \frac{x_{j}}{σ_{j}^{2}} \end{matrix}

(14)

with j running from $n_{i} + 1$ to $n_{i + 1}$ . Using these definitions,

\begin{matrix} \begin{matrix} A^{(i)} = (\begin{matrix} 2 T_{3} & T_{6} \\ T_{6} & 2 T_{2} \end{matrix}) & ; & {\bar{m}}^{(i)} = (\begin{matrix} \frac{2 T_{2} T_{4} - T_{5} T_{6}}{4 T_{2} T_{3} - T_{6}^{2}} \\ \frac{2 T_{3} T_{5} - T_{4} T_{6}}{4 T_{2} T_{3} - T_{6}^{2}} \end{matrix}) \end{matrix} \\ U^{(i)} = T_{1} - \frac{T_{2} T_{4}^{2} + T_{3} T_{5}^{2} - T_{4} T_{5} T_{6}}{4 T_{2} T_{3} - T_{6}^{2}} \end{matrix}

(15)

and the integral becomes $(2 π) {(4 T_{2} T_{3} - T_{6}^{2})}^{- \frac{1}{2}} e^{- U^{(i)}}$ .
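As a numerical check of these expressions, the sketch below evaluates $\bar{m}^{(i)}$ , $U^{(i)}$ , and the logarithm of the integral for one segment in two ways: with the general matrix form of Equations (10), (12), and (13), and with the sums of Equations (14) and (15). The function names and the synthetic data are our own assumptions.

```python
import numpy as np

def segment_general(x, y, sigma, basis):
    """Equations (10), (12), and (13) for arbitrary basis functions."""
    Phi = np.array([phi(x) / sigma for phi in basis])  # K x l_i matrix
    z = y / sigma
    A = Phi @ Phi.T
    m_bar = np.linalg.solve(A, Phi @ z)
    U = 0.5 * (z @ z - m_bar @ A @ m_bar)
    K = len(basis)
    log_integral = 0.5 * K * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(A)[1] - U
    return m_bar, U, log_integral

def segment_line(x, y, sigma):
    """Equations (14) and (15) for straight lines."""
    w = 1.0 / sigma**2
    T1, T2, T3 = np.sum(w * y**2) / 2, np.sum(w * x**2) / 2, np.sum(w) / 2
    T4, T5, T6 = np.sum(w * y), np.sum(w * x * y), np.sum(w * x)
    det = 4 * T2 * T3 - T6**2
    m_bar = np.array([2 * T2 * T4 - T5 * T6, 2 * T3 * T5 - T4 * T6]) / det
    U = T1 - (T2 * T4**2 + T3 * T5**2 - T4 * T5 * T6) / det
    log_integral = np.log(2 * np.pi) - 0.5 * np.log(det) - U
    return m_bar, U, log_integral

# Invented single-replicate data: intercept 1, gradient 2, known sigma_j = 0.3
rng = np.random.default_rng(0)
x = np.arange(8, dtype=float)
sigma = np.full(8, 0.3)
y = 1.0 + 2.0 * x + rng.normal(0, sigma)
basis = [np.ones_like, lambda t: t]  # phi_1 = 1 and phi_2 = x
print(segment_general(x, y, sigma, basis))
print(segment_line(x, y, sigma))
```

The two routes should agree to numerical precision; the sums in `segment_line` are convenient because they can be accumulated incrementally as a segment grows.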

With more than one replicate, z runs over all y in all replicates, with the replicates arranged contiguously, and is of length $N_{r} ℓ_{i}$ ; $Φ$ has rows of length $N_{r} ℓ_{i}$ with $x_{n_{i} + 1}$ to $x_{n_{i + 1}}$ repeated N_r times in each row to match the corresponding y values. For the linear case, the sums in Equation (14) are over both j and the number of replicates, so that T₁, e.g., becomes $\sum_{j, r} \frac{{(y_{j}^{(r)})}^{2}}{2 σ_{j}^{2}}$ .

Returning to Equation (9), we find

\begin{matrix} P (D_{i} | n_{i} + 1, n_{i + 1}, σ) = P (m) (\prod_{j = n_{i} + 1}^{n_{i + 1}} {(\sqrt{2 π} σ_{j})}^{- N_{r}}) \\ \times {(2 π)}^{\frac{K}{2}} {(det A^{(i)})}^{- \frac{1}{2}} e^{- U^{(i)}} \end{matrix}

(16)

with the help of Equation (13). For this approximation to be valid, we require that the strongly peaked region in m space is within the a priori range for m. The region over which the integrand in Equation (13) is appreciable has a volume proportional to the inverse square root of $det A^{(i)}$ , and the prior range of m must be large enough to contain this region. Using Equation (7), we need

{(det A^{(i)})}^{- \frac{1}{2}} \times P (m) ≪ 1.

(17)

2.1.4 Finding the boundary points

After determining the optimal number of segments into which to divide the data from Equation (1), we next find their boundary points. Using Bayes’ theorem, the posterior for n is

P (n | D, M, σ) = \frac{P (D | n, M, σ) P (n | M)}{P (D | M, σ)}

(18)

which we evaluate using Equations (2, 4, and 5). We use the mean posterior value of n_i to estimate the optimal n_i:

\begin{matrix} E [n_{i}] = \sum_{n} n_{i} P (n | D, M, σ) \\ = \frac{P (n | M)}{P (D | M, σ)} \sum_{n} n_{i} P (D_{1} | 1, n_{1}, σ) \dots P (D_{M} | n_{M - 1} + 1, N, σ) \end{matrix}

(19)

which we sum following Equation (8). The posterior variance, $Var [n_{i}]$ , determines the error in this estimate, which we find similarly.
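For M = 2 there is a single boundary, so the posterior mean and variance can be found by direct enumeration. The sketch below does this for invented data with a known, constant σ and a break after the tenth point; it drops factors common to every n, which cancel on normalization. The helper name and the data are assumptions made for illustration.

```python
import numpy as np

def log_segment(x, y, sigma):
    """-0.5 log det A - U for a straight-line fit; factors common to all n dropped."""
    Phi = np.vstack([np.ones_like(x), x]) / sigma
    z = y / sigma
    A = Phi @ Phi.T
    m_bar = np.linalg.solve(A, Phi @ z)
    U = 0.5 * (z @ z - m_bar @ A @ m_bar)
    return -0.5 * np.linalg.slogdet(A)[1] - U

# Synthetic data with a break after the 10th point (true n_1 = 10)
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
sigma = 0.1
y = np.where(x < 10, x, 25.0 - x) + rng.normal(0, sigma, size=x.size)

lmin, N = 3, x.size
candidates = np.arange(lmin, N - lmin + 1)  # admissible n_1 for M = 2
log_w = np.array([log_segment(x[:n], y[:n], sigma) + log_segment(x[n:], y[n:], sigma)
                  for n in candidates])
post = np.exp(log_w - log_w.max())
post /= post.sum()  # Equation (18); the uniform prior P(n | M) cancels
mean_n1 = np.sum(candidates * post)
var_n1 = np.sum((candidates - mean_n1) ** 2 * post)
print(mean_n1, var_n1)
```

For larger M, enumeration becomes impractical and the sums are instead evaluated by variable elimination, as in Equation (8).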

2.1.5 Finding $P (D | M)$ for unknown measurement error

If the σ_j are unknown, we assume the same constant σ for all j with a uniform prior probability on the interval $[σ_{min}, σ_{max}]$ (Gelman 2006). Equation (2) then becomes

\begin{matrix} P (D | M) = f (N, M, ℓ_{min}) \sum_{n} P (D | n, M) \\ = f (N, M, ℓ_{min}) P (σ) \sum_{n} \int_{σ_{min}}^{σ_{max}} d σ P (D | n, M, σ) . \end{matrix}

(20)

The constant $P (σ) = 1 / (σ_{max} - σ_{min})$ will cancel in Equation (1) when we compare the evidence for different M.

Using the equivalent of Equations (9 and 13), we find that

\begin{matrix} P (D_{i} | n_{i} + 1, n_{i + 1}, σ) = P (m) {(\sqrt{2 π} σ)}^{- N_{r} ℓ_{i} + K} \\ \times {(det A^{(i)})}^{- \frac{1}{2}} exp [- \frac{U^{(i)}}{σ^{2}}] \end{matrix}

(21)

where we now explicitly follow σ and so set the σ_j in Equation (10) to unity, making z_j = y_j and $Φ_{k j} = ϕ_{k} (x_{j})$ . Similarly for the linear case, the σ_j become unity in Equation (14).

Consequently,

\begin{matrix} P (D | n, M, σ) = P (D_{1} | 1, n_{1}, σ) \times P (D_{2} | n_{1} + 1, n_{2}, σ) \times \dots \\ \times P (D_{M} | n_{M - 1} + 1, N, σ) \\ = P {(m)}^{M} {(\sqrt{2 π} σ)}^{- N_{r} N + M K} \prod_{i = 1}^{M} {(det A^{(i)})}^{- \frac{1}{2}} \\ \times exp (- \frac{\sum_{i = 1}^{M} U^{(i)}}{σ^{2}}) . \end{matrix}

(22)

Although with Equation (22) it is possible to approximate analytically the integral over σ in Equation (20) by extending the range of the integrand to $(0, \infty)$ , the resulting expression prevents us from summing over n using variable elimination. Instead, we swap the sum and the integral to write

P (D | M) = f (N, M, ℓ_{min}) P (σ) \int_{σ_{min}}^{σ_{max}} d σ \sum_{n} P (D | n, M, σ)

(23)

and numerically evaluate, using variable elimination to sum over n in Equation (23) for each σ chosen by the integration algorithm.

We find the expected boundary points via Equation (19), again numerically integrating over σ.

Performing the integration: To stabilize the numerical integration, we scale the integrand of Equation (23) by its value at the most likely value of σ, making the integrand nearly always less than one and preventing overflow. We use expectation-maximization (EM) to estimate the most likely σ for a given M. The EM algorithm finds the σ that maximizes $P (D | M, σ)$ (Bishop 2006). We guess a value of σ, σ_o say, and find $P (n | D, σ_{o}, M)$ from Equation (18). To update σ_o, we maximize $Q (σ, σ_{o})$ with respect to σ, where

\begin{matrix} Q (σ, σ_{o}) = \sum_{n} P (n | D, M, σ_{o}) log P (D, n | M, σ) \\ = E [log P (D | n, M, σ) + log P (n | M, σ)] \\ = E [log P (D | n, M, σ) + log f (N, M, ℓ_{min})] \end{matrix}

(24)

with the expectations taken over $P (n | D, M, σ_{o})$ . Expanding Equation (24) using Equation (22), there are only two terms that depend on σ, and we can differentiate to find the updated $σ = σ_{n}$ :

σ_{n}^{2} = \frac{2}{N_{r} N - M K} \sum_{i = 1}^{M} E [U^{(i)}] .

(25)

We use the equivalent of Equation (19) with $σ = σ_{o}$ to evaluate these expectations and iterate until the value of σ converges.
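The scaling and the update of Equation (25) are easiest to see for a single segment, M = 1, where there are no boundaries and so the expectation is trivial and no iteration is needed. In the sketch below, the data, the grid standing in for $[σ_{min}, σ_{max}]$ , and the trapezoidal rule are all our assumptions; the text does not specify which numerical integrator is used.

```python
import numpy as np

def U_line(x, y):
    """U of Equation (12) for a straight line with the sigma_j set to unity."""
    Phi = np.vstack([np.ones_like(x), x])
    A = Phi @ Phi.T
    m_bar = np.linalg.solve(A, Phi @ y)
    return 0.5 * (y @ y - m_bar @ A @ m_bar)

# Invented data: one replicate of a noisy line
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 30)
y = 0.5 + 1.5 * x + rng.normal(0, 0.5, size=x.size)

N, K, M, N_r = x.size, 2, 1, 1
U = U_line(x, y)
# With M = 1 we have E[U] = U, so Equation (25) gives sigma directly
sigma_hat = np.sqrt(2 * U / (N_r * N - M * K))

def log_integrand(sigma):
    """Sigma-dependent part of the log of Equation (22)."""
    return (-N_r * N + M * K) * np.log(sigma) - U / sigma**2

grid = np.linspace(0.05, 5.0, 2000)  # assumed [sigma_min, sigma_max]
scaled = np.exp(log_integrand(grid) - log_integrand(sigma_hat))  # at most 1
# Trapezoidal rule on the scaled integrand; the scale is restored in log space
integral = np.sum(0.5 * (scaled[1:] + scaled[:-1]) * np.diff(grid))
log_integral = np.log(integral) + log_integrand(sigma_hat)
print(sigma_hat, log_integral)
```

Because the σ-dependent part of the integrand has a single maximum at the value given by Equation (25), dividing by its value there bounds the scaled integrand by one, which is what prevents overflow.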

2.1.6 Implementation

For basis functions that generate lines, we compare the different linear segments by calculating the gradient, intercept, and the coefficient of determination R² of the line maximizing the likelihood for each segment. The user can then select a desired segment, such as the one with the largest gradient.
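A minimal sketch of these per-segment statistics follows, assuming a single replicate and a constant measurement error, so that the maximum-likelihood line is the ordinary least-squares line. The function name, the data, and the segment indices are hypothetical.

```python
import numpy as np

def line_stats(x, y):
    """Gradient, intercept, and R^2 of the least-squares line through (x, y)."""
    gradient, intercept = np.polyfit(x, y, 1)
    residual = y - (gradient * x + intercept)
    r2 = 1.0 - np.sum(residual**2) / np.sum((y - y.mean()) ** 2)
    return gradient, intercept, r2

# Hypothetical segments given as (first, last) indices into x and y
x = np.arange(12, dtype=float)
y = np.concatenate([0.2 * x[:6], 1.0 + 2.0 * (x[6:] - 6)])
segments = [(0, 5), (6, 11)]
stats = [line_stats(x[a:b + 1], y[a:b + 1]) for a, b in segments]
# Select the segment whose best-fit line has the largest gradient
steepest = max(range(len(segments)), key=lambda i: stats[i][0])
print(segments[steepest], stats[steepest])
```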

The algorithm requires the a priori bounded region of m in Equation (7). Again specializing to straight lines, the prior specifies the range of the intercept m₁ and the gradient m₂: $[m_{1}^{min}, m_{1}^{max}]$ and $[m_{2}^{min}, m_{2}^{max}]$ . The user can either provide both ranges or only the range of m₂ or give the maximal range of y possible in the experiment, $[y_{min}, y_{max}]$ . If the user provides only the range of m₂, we estimate $m_{1}^{min}$ as $\min (- m_{2}^{max} x_{max}, m_{2}^{min} x_{min})$ and $m_{1}^{max}$ as $\max (- m_{2}^{min} x_{max}, m_{2}^{max} x_{min})$ . If the user provides the range of y, we estimate the range of m₂ as $[- g_{max}, g_{max}]$ , with $g_{max} = (y_{max} - y_{min}) / Δ x_{min}$ and $Δ x_{min}$ being the smallest difference between two neighbouring x values.
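The rules above translate directly into code. The sketch below implements them exactly as stated in the text, assuming distinct x values; the function names are hypothetical and are not taken from the package.

```python
def gradient_range_from_y(x, y_min, y_max):
    """Range of the gradient m_2 from the maximal range of y, as described in the text."""
    xs = sorted(x)
    dx_min = min(b - a for a, b in zip(xs, xs[1:]))  # smallest neighbouring gap
    g_max = (y_max - y_min) / dx_min
    return -g_max, g_max

def intercept_range_from_gradient(x, m2_min, m2_max):
    """Range of the intercept m_1 from the range of m_2, using the stated min/max rule."""
    x_min, x_max = min(x), max(x)
    m1_min = min(-m2_max * x_max, m2_min * x_min)
    m1_max = max(-m2_min * x_max, m2_max * x_min)
    return m1_min, m1_max

# Invented x values and y range
x = [0.0, 0.5, 1.5, 3.0]
m2_range = gradient_range_from_y(x, y_min=0.0, y_max=2.0)
m1_range = intercept_range_from_gradient(x, *m2_range)
print(m2_range, m1_range)
```

Estimating the gradient from the smallest spacing gives a deliberately generous range, which helps to keep the peak of the likelihood inside the prior's support.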

2.1.7 Availability

We coded the algorithm as a Python package available at https://pypi.org/project/nunchaku and via pip. We have also embedded nunchaku into our omniplate software for analyzing plate-reader data (Montaño-Gutierrez et al. 2022).

2.1.8 Generating and testing with synthetic data

To test our method, we generated a piece-wise linear function f(x) with $1 \leq M \leq 10$ continuous linear segments, each having between 10 and 50 data points and with a unit distance, $Δ x = 1$ , between data points. We sampled θ, the angle between each segment and the x-axis, from a uniform distribution on the interval $[- {tan}^{- 1} (20), {tan}^{- 1} (20)]$ , so that the gradient, $tan θ$ , lies in $[- 20, 20]$ . Furthermore, we ensured that the difference in θ between neighbouring segments is larger than a fixed minimum, θ₀. We added Gaussian noise, $ϵ \sim Normal (0, σ^{2})$ , to give three replicates of $y = f (x) + ϵ$ . We generated 3600 synthetic datasets in total, a combination of 200 different piece-wise linear functions f(x), three values of θ₀, and six values of σ. In Figs 1 and 2, $θ_{0} = 10^{\circ}$ .
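One way to generate such data is sketched below. Two details are our assumptions, because the text does not specify them: we enforce the minimum angle difference by rejecting candidate angles that are too close to the previous segment's, and we make f(x) continuous by starting each segment where the previous one ended. The function name is hypothetical.

```python
import numpy as np

def make_synthetic(M, sigma, theta0_deg=10.0, n_rep=3, seed=0):
    """Piece-wise linear f(x) with M continuous segments plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    theta_max = np.arctan(20.0)
    theta0 = np.deg2rad(theta0_deg)
    lengths = rng.integers(10, 51, size=M)  # 10 to 50 points per segment
    thetas = []
    while len(thetas) < M:
        theta = rng.uniform(-theta_max, theta_max)
        # Reject angles too close to that of the previous segment
        if not thetas or abs(theta - thetas[-1]) > theta0:
            thetas.append(theta)
    x = np.arange(lengths.sum(), dtype=float)  # unit spacing
    step = np.repeat(np.tan(thetas), lengths)  # gradient applying at each point
    f = np.concatenate([[0.0], np.cumsum(step[1:])])  # continuous by construction
    y = f + rng.normal(0.0, sigma, size=(n_rep, x.size))
    return x, f, y, lengths

x, f, y, lengths = make_synthetic(M=4, sigma=0.25)
print(x.size, y.shape, lengths)
```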

Figure 1. — The nunchaku algorithm correctly predicts the number of linear segments in synthetic data when the measurement noise is not too high. (A) Example synthetic datasets with the ground truth in blue (small circles) and the triplicate data in light grey. The large red circles are the predicted boundaries of each linear segment with the best-fit line in red. Left: with a measurement error of 0.25, the predictions overlap the data; Right: with a measurement error of 8, the predictions miss some segments, which the noise obscures. As a prior, we specify only that the gradient of each line lies between $[- 25, 25]$ . For this data, a measurement error of 0.25 is 0.5% of the mean of y and an error of 8 is almost 15%. (B) The algorithm underestimates the number of linear segments only once the magnitude of the measurement noise becomes sufficiently high. The actual number of segments is M; the estimated number is $\hat{M}$ .

Figure 2. — Nunchaku performs as well as or better than the NOT algorithm (Baranowski *et al.* 2019). This algorithm only supports input of one y value for each x value: we therefore input either one replicate or the mean of three replicates. The data are generated similarly to that in Fig. 1 (Section 2). As a prior for nunchaku, we specify that the gradient of each line lies between $[- 25, 25]$ . (A) The root mean squared error (RMSE) between the ground truth and the best-fit lines. (B) The difference between the predicted number of segments $\hat{M}$ and the ground truth M (left) and the percentage of correct predictions of M with $\hat{M} = M$ (right).

2.2 Experimental methods

We used a prototrophic strain of Saccharomyces cerevisiae (FY4), precultured in synthetic complete (SC) medium with 2% (w/v) sodium pyruvate in a 30°C shaking incubator at 180 rpm for two days. Before the experiment, we diluted the cells 6-fold and let them grow for six hours. After washing the cells twice with fresh minimal media (Verduyn et al. 1992), we inoculated them into minimal media with different concentrations of fructose on a 96-well microplate. The liquid volume of each well was 200 μl.

For Escherichia coli, we precultured cells in 3 ml liquid Luria broth (LB) with one colony from a fresh plate and grew aerobically to log phase (6 h) at 37°C with 250 rpm shaking. We then inoculated 3 μl culture into 147 μl fresh LB medium per well on a 96-well microplate.

We used either a Tecan Infinite M200 Pro or F200 plate reader at 30°C for S.cerevisiae and 37°C for E.coli with linear shaking at amplitude 6 mm. Measurements of absorbance at 600 nm, OD₆₀₀, were taken every 10 min.

Data were analyzed using the omniplate software (Montaño-Gutierrez et al. 2022).

2.3 Fitting Monod’s equation

After estimating the specific growth rate λ at each concentration of fructose s, we have a dataset $D \equiv \{(λ_{i}, s_{i})\}$ with 38 data points. We use Bayesian inference to estimate the constants $λ_{max}$ and K_M of Monod’s equation. Assuming a Gaussian measurement error in each λ_i with a standard deviation σ and independent measurements, the likelihood is

\begin{matrix} P (D | λ_{max}, K_{M}, σ) = {(\sqrt{2 π} σ)}^{- N} \\ \times \prod_{i = 1}^{N} exp [- \frac{{(λ_{i} - λ_{max} \frac{s_{i}}{K_{M} + s_{i}})}^{2}}{2 σ^{2}}] . \end{matrix}

(26)

To marginalize over σ, we assume $P (σ) \propto 1 / σ$ , so that

\begin{matrix} P (D | λ_{max}, K_{M}) \propto \int_{0}^{\infty} d σ P (D | λ_{max}, K_{M}, σ) P (σ) \\ \propto {[\sum_{i = 1}^{N} {(λ_{i} - λ_{max} \frac{s_{i}}{K_{M} + s_{i}})}^{2}]}^{- \frac{N}{2}} . \end{matrix}

(27)

We further assume that the prior $P (λ_{max}, K_{M})$ is uniform, and so the posterior probability of $λ_{max}$ and K_M is proportional to the likelihood, Equation (27). We therefore maximize the likelihood with respect to $λ_{max}$ and K_M using the BFGS algorithm. We estimate the errors in these inferences using the diagonal elements of the Hessian matrix $- \nabla \nabla log P (D | λ_{max}, K_{M})$ evaluated at the maximum of the likelihood (MacKay 2003).
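Maximizing Equation (27) is equivalent to minimizing the sum of squared residuals. The sketch below illustrates this with a deliberately simpler optimizer than the BFGS algorithm used in the text: for each K_M on a grid, the best $λ_{max}$ follows in closed form from linear least squares, and we keep the pair with the smallest sum of squares. The data are invented, with parameters set near the values reported in Fig. 3C, and the function names are hypothetical.

```python
import numpy as np

def monod(s, lam_max, K_M):
    """Monod's equation for the specific growth rate at nutrient concentration s."""
    return lam_max * s / (K_M + s)

def fit_monod_grid(s, lam, K_grid):
    """Maximize Equation (27) by minimizing the sum of squares over a grid of K_M."""
    best = None
    for K_M in K_grid:
        g = s / (K_M + s)
        lam_max = np.dot(lam, g) / np.dot(g, g)  # least-squares optimum for this K_M
        S = np.sum((lam - lam_max * g) ** 2)
        if best is None or S < best[2]:
            best = (lam_max, K_M, S)
    return best

# Hypothetical data: 38 concentrations (% w/v) with invented noise
rng = np.random.default_rng(3)
s = np.geomspace(0.002, 2.0, 38)
lam = monod(s, 0.42, 0.026) + rng.normal(0.0, 0.005, size=s.size)
lam_max_hat, K_M_hat, S_min = fit_monod_grid(s, lam, np.linspace(0.001, 0.2, 2000))
log_posterior_max = -0.5 * s.size * np.log(S_min)  # Equation (27), up to a constant
print(lam_max_hat, K_M_hat, log_posterior_max)
```

A gradient-based optimizer such as BFGS reaches the same maximum more efficiently and also yields curvature information for the error estimates; the grid is used here only to keep the example dependency-free beyond NumPy.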

3 Results

3.1 Approximating data with a piece-wise linear model

Although our goal is to allow scientists to choose objectively the segment of their data that is ‘most’ linear, we adopt a general methodology and allow the data to be described by linear combinations of arbitrary basis functions. For straight lines, there are two basis functions, $ϕ_{1} (x) = 1$ and $ϕ_{2} (x) = x$ , but datasets may require higher order polynomials or even Gaussian or sigmoid functions (Bishop 2006).

For a 1D time series and a given set of basis functions, we will infer the optimal piece-wise description—the number of contiguous segments into which we should divide the data, where the boundaries of each of those segments should be, and the best-fit linear combination of basis functions for each segment. Deciding which of these segments is then most appropriate for the task in hand is unavoidably subjective. It is straightforward, however, to compare different segments by comparing properties of their best-fit linear combinations. For lines, these properties include their gradients and R² value—how much of the variance of the dependent variable is explained by the independent one (Moses 2017).

We use a Bayesian approach to infer the best piece-wise description and assume only that the data of each segment is normally distributed around a linear combination of the basis functions (Section 2). To proceed analytically we marginalize over all coefficients constituting the linear combination for each segment using a mild approximation and choose the optimal number of segments by comparing marginal likelihoods. The data points bounding each segment are then estimated by the means of their posterior distribution. We consider the case with known measurement error separately from an unknown one and call our algorithm nunchaku.

3.2 Verifying our approach

To verify our methodology (Section 2), we first focused on identifying linear regions. We generated synthetic data using piece-wise linear functions, where we know the number of segments and their gradients, added Gaussian noise, and then inferred from this data the optimal number of segments and the gradients of the best-fit lines, assuming that we know the magnitude of the measurement noise (Fig. 1A).

The algorithm predicts correctly the number of segments when the noise in the data is sufficiently low (Fig. 1B and Supplementary Fig. S1), but underestimates this number when the noise is larger. Such noise tends to blur two neighbouring segments so they seem one, rather than cause a single segment to appear as two or more. Similarly, if we decrease the angle between neighbouring segments, the noise is more likely to make two neighbouring segments appear as one, and the algorithm’s accuracy falls (Supplementary Fig. S1).

We confirmed that the algorithm also correctly predicts the underlying piece-wise linear functions, and hence the gradient of the lines generating the data in the segments (Supplementary Fig. S1). As expected, this accuracy falls too with more noisy data.

When the measurement error is unknown, the results are similar (Supplementary Fig. S1), but the algorithm is slower because we numerically integrate over all possible magnitudes of this measurement error. We also confirmed that the algorithm’s performance is robust to broad choices of the prior distribution (Supplementary Fig. S2).

We next compared our methodology to the Narrowest-Over-Threshold (NOT) algorithm (Baranowski et al. 2019), a state-of-the-art frequentist approach. Whether we consider the root mean square error between the best-fit lines and the ground truth (Fig. 2A) or the predicted number of segments (Fig. 2B), our algorithm consistently performs as well as or better (see also Supplementary Fig. S3). This greater accuracy however comes at the expense of speed: the NOT algorithm is faster than our implementation of nunchaku.

Finally, we demonstrated that nunchaku works with other basis functions, including constant functions, third-order polynomials, and sines (Supplementary Fig. S4).

3.3 Application 1: finding the range of OD that increases linearly with cell number

The OD of a microbial culture increases linearly with the number of cells only for sufficiently small ODs. At higher ODs, the light from the spectrophotometer may scatter off multiple cells, and the relationship between OD and the number of cells becomes nonlinear (Stevenson et al. 2016). To calibrate OD measurements, researchers often serially dilute a dense culture of microbes and measure the relationship between the OD and the dilution factor (Warringer and Blomberg 2003, Stevenson et al. 2016) (Fig. 3A). Interpolating this curve, we can convert an OD measurement to the corresponding dilution factor and so correct for any nonlinearity between the OD and cell numbers.

Figure 3. — The nunchaku algorithm gives intuitive results when applied to biological data. (A) The calibration curve for plate-reader measurements of the OD of *S.cerevisiae*, found by diluting an overnight culture in 2% fructose, is nonlinear (blue dots). There are three replicate measurements for each dilution factor. Our algorithm identifies two linear segments (boundaries marked as circles). Lighter orange circles bound the segment with the highest R². We specify the likely maximal range of OD as our prior: $[0, 2]$ . Inset: the logarithm of the model evidence for the number of segments. (B) Identifying contiguous linear segments in the logarithm of the OD of growing *E.coli* cells as a function of time allows us to identify automatically the region of exponential growth. We show the mean of four replicate measurements (blue) with twice their standard deviation shaded. Circles denote the boundaries of linear segments; orange circles bound the segment with the best-fit line with highest gradient and so highest specific growth rate. The average specific growth rate over this segment is 1.5 h^–1. Inset: the logarithm of the model evidence for the number of segments. (C) With our algorithm, we can automatically identify the region of exponential growth in multiple datasets, here 38, to reveal growth laws such as Monod’s equation. We plot the specific growth rate in log phase for *S.cerevisiae* as a function of the concentration of fructose, with the solid line a fit of Monod’s equation: $λ_{max} = 0.422 \pm 0.006$ h^–1 and $K_{M} = 0.026 \pm 0.002$ % (w/v). The shaded area shows the 95% confidence interval. Inset: three example growth curves with dots marking the region of exponential growth, identified as the segment with the highest gradient. For panels (B) and (C), we specify a prior on the range of the gradient: $[0, 5]$ h^–1.

Dilution factors, however, are not intuitive units, and it is useful to identify the range of ODs over which there is a linear relationship with cell numbers. Not only is this range itself important, but by using the ratio of the maximum of the range to the corresponding dilution factor, we can re-scale the dilution factors back into ODs.

We used the nunchaku algorithm to identify the linear range, using basis functions that generate straight lines and an unknown measurement error. Two linear segments are optimal, and the one of interest, where OD is proportional to the number of cells, is the segment beginning at the smallest OD. This segment also has the highest coefficient of determination R². Its maximal OD is 0.66 for a relative cell number of 0.25 (Fig. 3A), and we should therefore multiply the dilution factors by 0.66/0.25, or 2.6, to convert back to ODs.
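The re-scaling described above is a single multiplication. As a minimal sketch, with a function name, argument names, and an example dilution series that are illustrative assumptions rather than part of the nunchaku package, it could be written as:

```python
def dilution_to_od(relative_cell_numbers,
                   od_at_boundary=0.66,
                   relative_number_at_boundary=0.25):
    """Convert relative cell numbers (dilution factors) into OD units.

    The defaults are the values quoted in the text for Fig. 3A: the linear
    segment ends at an OD of 0.66, reached at a relative cell number of 0.25.
    """
    # Conversion factor: 0.66 / 0.25 = 2.64, quoted as 2.6 in the text.
    scale = od_at_boundary / relative_number_at_boundary
    return [scale * r for r in relative_cell_numbers]


# Hypothetical dilution series, for illustration only.
rescaled = dilution_to_od([0.0625, 0.125, 0.25, 0.5, 1.0])
```

Above the boundary of the linear segment, the re-scaled values estimate the OD that would be measured if the linear relationship continued to hold, not the OD actually reported by the plate reader.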

3.4 Application 2: identifying the log phase of microbial growth

Microbes are most often studied when growing exponentially, with the $\log(\mathrm{OD})$ of the culture increasing linearly with time (Monod 1949). Researchers identify this log-phase growth from microbial growth curves.

To detect log phase automatically, we applied nunchaku, again with basis functions generating lines, to OD measurements of *E.coli* (Fig. 3B). Partitioning the data into six segments is optimal, and the segment whose best-fit line has the highest gradient, and so the greatest specific growth rate, corresponds to exponential growth.
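Once the segment boundaries are known, picking out log phase reduces to fitting a line to each segment and keeping the steepest. The sketch below illustrates only that selection step. It assumes the boundaries have already been inferred, uses ordinary least squares for the best-fit line, and runs on a synthetic growth curve; the function names and the format of `boundaries` are hypothetical and do not reflect the package's interface.

```python
def fit_line(x, y):
    """Ordinary least-squares gradient and intercept of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    gradient = sxy / sxx
    return gradient, mean_y - gradient * mean_x


def steepest_segment(t, log_od, boundaries):
    """Return (start, end, gradient) for the segment with the highest gradient.

    `boundaries` lists the index of the first data point of each segment,
    followed by len(t), so consecutive pairs delimit the segments.
    """
    best = None
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        gradient, _ = fit_line(t[start:end], log_od[start:end])
        if best is None or gradient > best[2]:
            best = (start, end, gradient)
    return best


# Synthetic curve (illustrative only): lag phase, exponential growth at
# 1.5 per hour, then stationary phase, sampled every 0.25 h.
t = [0.25 * i for i in range(30)]
plateau = -4.0 + 1.5 * (t[19] - t[10])
log_od = ([-4.0] * 10
          + [-4.0 + 1.5 * (ti - t[10]) for ti in t[10:20]]
          + [plateau] * 10)

start, end, rate = steepest_segment(t, log_od, [0, 10, 20, 30])
```

Here `rate` is the gradient of the natural logarithm of the OD against time, which is the specific growth rate over the selected segment.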

Monod noticed an empirical relationship between the nutrient concentration and the specific growth rate of microbes in log phase (Monod 1949). Denoting this growth rate as $λ$, the maximal specific growth rate as $λ_{max}$, and the nutrient concentration as $s$, his equation becomes

$$λ = λ_{max} \frac{s}{K_{M} + s} \qquad (28)$$

where $K_{M}$ is now called the Monod constant. To estimate $λ_{max}$ and $K_{M}$, researchers systematically vary the concentration of the carbon source and identify the log phase and the corresponding gradient for each growth curve.

Here, we use the nunchaku algorithm to select data to estimate $λ_{max}$ and $K_{M}$ for *S.cerevisiae* growing on fructose (Section 2), from 38 growth curves measured with plate readers (Fig. 3C). Each biological replicate has two technical replicates.
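To illustrate how growth rates at several nutrient concentrations determine $λ_{max}$ and $K_{M}$, the sketch below fits Monod's equation through the Hanes-Woolf linearization, $s/λ = s/λ_{max} + K_{M}/λ_{max}$. This is a deliberately simple stand-in, not the fitting procedure used for Fig. 3C. The function names and the concentration grid are hypothetical, and the data are noise-free values generated from the parameters quoted in the figure.

```python
def monod(s, lam_max, k_m):
    """Specific growth rate predicted by Monod's equation (Eq. 28)."""
    return lam_max * s / (k_m + s)


def fit_monod_hanes_woolf(s, lam):
    """Estimate (lam_max, k_m) by regressing s / lam on s.

    Rearranging Eq. 28 gives s / lam = s / lam_max + k_m / lam_max, so the
    slope is 1 / lam_max and the intercept is k_m / lam_max.
    Requires strictly positive concentrations and growth rates.
    """
    ratio = [si / li for si, li in zip(s, lam)]
    n = len(s)
    mean_s = sum(s) / n
    mean_r = sum(ratio) / n
    slope = (sum((si - mean_s) * (ri - mean_r) for si, ri in zip(s, ratio))
             / sum((si - mean_s) ** 2 for si in s))
    intercept = mean_r - slope * mean_s
    lam_max = 1.0 / slope
    return lam_max, intercept * lam_max


# Hypothetical fructose concentrations in % (w/v); rates are synthetic and
# generated from the values reported in Fig. 3C (0.422 per hour, 0.026 %).
concentrations = [0.005, 0.01, 0.02, 0.05, 0.1, 0.5, 1.0, 2.0]
rates = [monod(s, 0.422, 0.026) for s in concentrations]

lam_max_hat, k_m_hat = fit_monod_hanes_woolf(concentrations, rates)
```

With noisy measurements, a linearization like this distorts the errors, so a direct nonlinear least-squares fit of Eq. 28 would usually be preferable.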

4 Discussion

Determining where data are best described by a line is a problem familiar to most scientists. We present a statistically rigorous solution, which we generalize by considering linear combinations of arbitrary basis functions. Our methodology is Bayesian and similar in approach to earlier work that focused on piece-wise constant functions (Hutter 2007).

Like all Bayesian inference, our algorithm depends on prior information: the bounds on the coefficients constituting the linear combination of basis functions. For basis functions generating lines, these bounds describe the range of the gradients and intercepts of all possible lines within a segment. The optimal number of segments will depend on this prior if the amount of data is sufficiently small, as it should (MacKay 2003). In practice, however, users interested in lines need specify only one prior range with the other inferred (Section 2), and we see that although a wide prior favours fewer segments, a single segment is robustly assigned to sections of the data that appear linear.

Our method makes two assumptions about how the data deviate from a linear combination of basis functions. We assume these deviations are independent and we assume that each deviation obeys a normal distribution. For some data, a distribution with a purely nonnegative support, such as a log normal, may be more appropriate. Although we can use such a distribution in principle, in practice some of the steps that we performed analytically would have to become numerical. Further, if nothing is known a priori about these deviations, we assume that their standard deviation is identical for all time points. Our algorithm would work too if the standard deviations vary but are proportional to a known function of $x_{j}$ and $y_{j}$.

Our work adds to existing algorithms for detecting change points in time series, including those aimed at analyzing microbial growth (Papastamoulis et al. 2019). We have simplified this problem by considering change points to occur only at data points and by imposing no continuity between the functions underlying the data for each segment. These simplifications are not restrictive for our task of finding one particular segment of interest. Identifying change points more generally typically requires Markov chain Monte Carlo methods (Stephens 1994, Papastamoulis et al. 2019).

Because it uses enumeration, the nunchaku algorithm is robust and lends itself to automation, facilitating high-throughput studies. It should both ease data analyses and increase their reproducibility for a wide range of scientists.


Acknowledgements

We thank Ramon Grima and Edward WJ Wallace for helpful comments and the Biotechnology and Biological Sciences Research Council (P.S.S. and Y.H.) and the Darwin Trust (Y.H. and X.D.) for funding.

Contributor Information

Yu Huo, Centre for Engineering Biology, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom; School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom.

Hongpei Li, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom.

Xiao Wang, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom.

Xiaochen Du, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom.

Peter S Swain, Centre for Engineering Biology, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom; School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom.

Supplementary data

Supplementary data are available at datashare.ed.ac.uk/handle/10283/2002.

Conflict of interest

None declared.

Funding

This research was funded in whole, or in part, by the Biotechnology and Biological Sciences Research Council [BB/W006545/1]. For the purpose of open access, the authors have applied a creative commons attribution (CC BY) licence to any author accepted manuscript version arising.

Data availability

The data underlying this article are available in Edinburgh DataShare at https://doi.org/10.7488/ds/7548.

References

Baranowski R, Chen Y, Fryzlewicz P. Narrowest-over-threshold detection of multiple change points and change-point-like features. J R Stat Soc Series B Stat Methodol 2019;81:649–72.
Bishop CM. Pattern Recognition and Machine Learning. Berlin, Germany: Springer, 2006.
Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal 2006;1:515–34.
Hinrichsen M, Lenz M, Edwards JM et al. A new method for post-translationally labeling proteins in live cells for fluorescence imaging and tracking. Protein Eng Des Sel 2017;30:771–80.
Hutter M. Exact Bayesian regression of piecewise constant functions. Bayesian Anal 2007;2:635–64.
MacKay DJ. Information Theory, Inference and Learning Algorithms. Cambridge, UK: Cambridge University Press, 2003.
Monod J. The growth of bacterial cultures. Annu Rev Microbiol 1949;3:371–94.
Montaño-Gutierrez LF, Moreno NM, Farquhar IL et al. Analysing and meta-analysing time-series data of microbial growth and gene expression from plate readers. PLoS Comput Biol 2022;18:e1010138.
Moses AM. Statistical Modelling and Machine Learning for Molecular Biology. Boca Raton, FL: CRC Press, 2017.
Papastamoulis P, Furukawa T, Van Rhijn N et al. Bayesian detection of piecewise linear trends in replicated time-series with application to growth data modelling. Int J Biostat 2019;16:20180052.
Scott M, Hwa T. Shaping bacterial gene expression by physiological and proteome allocation constraints. Nat Rev Microbiol 2023;21:327–42.
Stephens DA. Bayesian retrospective multiple-changepoint identification. J R Stat Soc Ser C Appl Stat 1994;43:159–78.
Stevenson K, McVey AF, Clark IB et al. General calibration of microbial growth in microplate readers. Sci Rep 2016;6:38828.
Verduyn C, Postma E, Scheffers WA et al. Effect of benzoic acid on metabolic fluxes in yeasts. Yeast 1992;8:501–17.
Warringer J, Blomberg A. Automated screening in environmental arrays allows analysis of quantitative phenotypic profiles in Saccharomyces cerevisiae. Yeast 2003;20:53–67.
Zhang NL, Poole D. Exploiting causal independence in Bayesian network inference. JAIR 1996;5:301–28.

