Author manuscript; available in PMC: 2023 Jul 21.
Published in final edited form as: Ann Appl Stat. 2023 Jan 24;17(1):357–377. doi: 10.1214/22-aoas1631

MODELING CELL POPULATIONS MEASURED BY FLOW CYTOMETRY WITH COVARIATES USING SPARSE MIXTURE OF REGRESSIONS

By Sangwon Hyun 1,a, Mattias Rolf Cape 2, Francois Ribalet 2, Jacob Bien 1
PMCID: PMC10360992  NIHMSID: NIHMS1917146  PMID: 37485300

Abstract

The ocean is filled with microscopic microalgae, called phytoplankton, which together are responsible for as much photosynthesis as all plants on land combined. Our ability to predict their response to the warming ocean relies on understanding how the dynamics of phytoplankton populations is influenced by changes in environmental conditions. One powerful technique to study the dynamics of phytoplankton is flow cytometry which measures the optical properties of thousands of individual cells per second. Today, oceanographers are able to collect flow cytometry data in real time onboard a moving ship, providing them with fine-scale resolution of the distribution of phytoplankton across thousands of kilometers. One of the current challenges is to understand how these small- and large-scale variations relate to environmental conditions, such as nutrient availability, temperature, light and ocean currents. In this paper we propose a novel sparse mixture of multivariate regressions model to estimate the time-varying phytoplankton subpopulations while simultaneously identifying the specific environmental covariates that are predictive of the observed changes to these subpopulations. We demonstrate the usefulness and interpretability of the approach using both synthetic data and real observations collected on an oceanographic cruise conducted in the northeast Pacific in the spring of 2017.

Key words and phrases. Mixture of regressions, expectation-maximization, flow cytometry, sparse regression, ocean, microbiome, phytoplankton, clustering, gating, alternating direction method of multipliers

1. Introduction.

Marine phytoplankton are responsible for as much photosynthesis as all plants on land combined, making them a crucial part of the earth’s biogeochemical cycle and climate (Field et al. (1998)). A better understanding of the ecology of marine phytoplankton species and their relationship with the ocean environment is, therefore, important both to basic biology and to shedding light on their role in carbon dioxide uptake. In order to study these single cell organisms in the ocean, flow cytometry has been instrumental for the past three decades (Sosik, Olson and Armbrust (2010)).

Flow cytometry measures light scatter and fluorescence emission of individual cells at rates of up to thousands of cells per second. Light scattering is proportional to cell size, and fluorescence is unique to the emission spectra of pigments; these parameters can be used to identify populations of phytoplankton with similar optical properties. Over the past two decades, automated environmental flow cytometers such as CytoBuoy (Dubelaar et al. (1999)), FlowCytoBot (Olson, Shalapyonok and Sosik (2003)) and SeaFlow (Swalwell, Ribalet and Armbrust (2011)) have provided an unprecedented view of the dynamics of phytoplankton across large temporal and spatial scales.

Automated in-situ flow cytometry data can be represented as a scatterplot-valued time series, $y^{(1)}, \ldots, y^{(T)}$, where an $n_t \times d$ matrix $y^{(t)}$ whose rows are vectors $\{y_i^{(t)} \in \mathbb{R}^d : i = 1, \ldots, n_t\}$ is called a cytogram and can be thought of as a $d$-dimensional scatterplot representing $n_t$ particles observed during time interval $t$. The $d$ dimensions of the scatterplot represent $d$ optical properties that are useful in distinguishing different cell types from each other. Figure 1 shows an example of three cytograms collected by SeaFlow in June 2017 during a two-week cruise conducted in the northeast Pacific. With SeaFlow, cytograms are of dimension $d = 3$.

Fig. 1.


A schematic showing the data setup. (Top) This figure shows the trajectory of the Gradients 2 cruise which moves north and then south along a trajectory starting at Hawaii. (Middle) The individual three-dimensional particles are measured rapidly and continuously. From this we form $T$ cytograms $y^{(t)}, t = 1, \ldots, T$ at an hourly time resolution. The three data dimensions have simplified labels Red, Orange and Diameter; the first two represent fluorescence emission, and the last measures cell diameter. (Bottom panel) At each time $t = 1, \ldots, T$, environmental covariates $X^{(t)} \in \mathbb{R}^p$ are also available through remote sensing and on-board measurements. Only a few of the 30+ normalized covariates are highlighted here. Our proposed model identifies subpopulations by modeling them as Gaussian clusters whose means and probabilities are driven by environmental covariates.

As apparent in the figure, the points within the cytograms display clear clustering structure. These different clusters correspond to cell populations of different types of phytoplankton. As the environmental conditions change, the populations change over time. In particular, in optical space, two noticeable phenomena over time are:

  1. The number of cells in a given population can increase or decrease with populations sometimes even appearing or disappearing entirely.

  2. The centers of the cell populations are not fixed but rather move over time.

Using expert knowledge and close manual inspection, oceanographers have been able to explain how some of these phenomena can be attributed to specific changes in environmental factors (e.g., oscillations in cell size due to sunlight and cell division) (Vaulot and Marie (1999), Sosik et al. (2003), Ribalet et al. (2015)).

Our goal is to develop a statistical approach for identifying how environmental factors can be predictive of changes to the cytograms. The promise of such a tool would be to discover new relationships between cell populations and environmental factors beyond those that may be known, or visible to the human eye.

Based on these observations and with this goal in mind, our statistical model for time-varying cytograms postulates a finite mixture model in which both the cluster probabilities and centers are allowed to vary over time. Changes to the cluster probabilities over time can capture the growing/shrinking and appearing/disappearing described above, while changes to the centers over time can capture the drifting/oscillating.

To be clear, our method does not explicitly incorporate the time (or space) aspect of the data. Instead, in our model these cluster probabilities and centers are controlled by $p$ time-varying covariates $X^{(t)} \in \mathbb{R}^p$. While our model can accommodate features that are purely functions of time (e.g., $\sin t$, $t^2$, spline basis functions, etc.), our focus here is on environmental covariates. Our analysis uses biological, physical and chemical variables, shown in the bottom panel of Figure 1, that were retrieved from the Simons Collaborative Marine Atlas Project (CMAP) database (https://simonscmap.com) which is a public database compiling various oceanographic data over space and time.

One key strength of our method is the variable selection property, allowing the analyst to identify the subset of covariates that are the strongest predictors of each cluster’s mean and probability movement over time. For instance, in Figure 2 the estimated coefficients reveal that higher sea surface temperature and lower phosphate can predict a decrease in probability of cluster E located in the lower-left corner, and time-lagged sunlight and nitrate predict the horizontal and vertical movement of cluster E’s center well.

Fig. 2.


Our method produces estimates of cluster centers (shown as disks) and cluster probabilities (represented by the size of the disk) for every time point. The covariance of each mixture component (represented by an ellipse) is assumed to be constant over time. Blue and red show parameter estimates at two different time points. In the background, particles from only one time point are shown in (partially transparent) dark blue with the size of a point proportional to the particle’s biomass. The right figure takes a closer look at a subregion of the cytogram, shown in the lower left corner of the left figure, focusing on cluster E which is a Prochlorococcus population. The change in the probability of cluster E is well predicted by sea surface temperature and phosphate, and the horizontal and vertical movement of cluster E’s center are each predicted by time-lagged sunlight and nitrate. Note, we are showing only five of the 10 clusters used for estimation.

Our framework represents a substantial improvement in the detail and richness of how this data can be modeled and analyzed. Flow cytometry data are traditionally analyzed by a technique called gating, which counts the number of cells falling into certain fixed, expert-drawn polygonal regions of $\mathbb{R}^d$ corresponding to each cell population (Verschoor et al. (2015)), reducing each scatterplot into several counts (giving the number of cells in each gated region) (Hyrkas et al. (2015)). Subjectivity in manual gating has been shown to be an obstacle to reproducibility (Hahne et al. (2009)). Furthermore, the presence of overlapping cell communities suggests that hard assignments to fixed disjoint regions may not be advisable. These and other shortcomings have led multiple authors to develop mixture model based approaches, as discussed in Aghaeepour et al. (2013). While such models are an improvement over traditional gating, they do not naturally extend to oceanography in which we have a time series of cytograms. Naively, one might think one could get away with fitting a separate mixture model to each individual cytogram. However, doing so leaves one with the problem of matching clusters from distinct clusterings, a task made particularly challenging since these clusters can move, change in size and appear/disappear. Our approach fits a single mixture model jointly across the entire time series while integrating information from the covariates. By using all data sources in a single mixture model, our method is able to estimate the distinct components, even in cases where two populations’ centers may be nearby or a cluster may sometimes vanish.

In the statistics literature the term finite mixture of regressions is used to refer to mixture models in which (univariate) means are modeled as functions of covariates (see, e.g., McLachlan and Peel (2006)). Early works, such as Wang et al. (1996), use information criteria and exhaustive search while more modern approaches have used penalized sparse models (Khalili and Chen (2007), Städler, Bühlmann and van de Geer (2010)). Our methodology differs from these methods in three respects: first, our means are multivariate (d-dimensional); second, the mixture weights are also modeled as functions of the covariates; third, the model coefficients are penalized. Of these, the first two aspects are shared by Grün and Leisch (2008) but without penalization. The idea of allowing the mixture weights to be functions of the features is more common in the machine learning literature, where such models are called mixtures of experts (Jordan and Jacobs (1993)).

To the best of our knowledge, this is the first attempt to extend mixture modeling of flow cytometry data by directly linking mixture model parameters with environmental covariates via sparse multivariate regression models. In Section 2 we describe our proposed model in detail. In Section 3 we use our proposed model to draw rich new insights from a marine data source. We also conduct two realistic numerical simulations based on some pseudo-real ocean flow cytometry data. We provide an R package called flowmix that can be run both on a single machine and also on remote high performance servers that use a parallel computing environment. While our focus is on time-varying flow cytometry in the ocean, our method can be applied more broadly to any collection of cytograms with associated covariates. For example, in biomedical applications each cytogram could correspond to a blood sample from a different person, and person-specific covariates could model the variability in cytograms.

2. Methodology.

2.1. Likelihood function of cytogram.

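We model the $n_t$ particles $\{y_i^{(t)}\}_{i=1}^{n_t}$, measured at time $t$, as i.i.d. draws from a probabilistic mixture of $K$ different $d$-variate Gaussian distributions, conditional on the covariate vector $X^{(t)} \in \mathbb{R}^p$. The latent variable $Z_i^{(t)}$ determines the cluster membership,

$$P\left(Z_i^{(t)} = k \mid X^{(t)}\right) = \pi_{kt}, \quad k = 1, \ldots, K, \tag{1}$$

and the data is drawn from the $k$th Gaussian distribution,

$$y_i^{(t)} \mid X^{(t)}, Z_i^{(t)} = k \sim \mathcal{N}_d\left(\mu_{kt}, \Sigma_k\right),$$

where the cluster center $\mu_{kt} \in \mathbb{R}^d$ and cluster probability $\pi_{kt}$ at time $t$ are modeled as functions of $X^{(t)}$,

$$\mu_{kt}(\beta) = \beta_{0k} + \beta_k^T X^{(t)}, \qquad \pi_{kt}(\alpha) = \frac{\exp\left(\alpha_{0k} + (X^{(t)})^T \alpha_k\right)}{\sum_{l=1}^{K} \exp\left(\alpha_{0l} + (X^{(t)})^T \alpha_l\right)}$$

for regression coefficients $\beta_{0k} \in \mathbb{R}^d$, $\beta_k \in \mathbb{R}^{p \times d}$, $\alpha_k \in \mathbb{R}^p$ and $\alpha_{0k} \in \mathbb{R}$; throughout, we use $\alpha$, $\beta$ and $\Sigma$ to denote the collections of coefficients $\{\alpha_{0k}, \alpha_k\}_{k=1}^{K}$, $\{\beta_{0k}, \beta_k\}_{k=1}^{K}$ and $\{\Sigma_k\}_{k=1}^{K}$ for brevity. Since all random variables are conditional on the covariates $X^{(t)}$, we will omit it hereon for brevity. Denoting the density of the $k$th Gaussian component of data at time $t$ as $\phi(\cdot; \mu_{kt}, \Sigma_k)$, the log-likelihood function is

$$\log \mathcal{L}\left(\alpha, \beta, \Sigma; \{y_i^{(t)}\}_{i,t}\right) = \sum_{t=1}^{T} \sum_{i=1}^{n_t} \log\left(\sum_{k=1}^{K} \pi_{kt}(\alpha)\, \phi\left(y_i^{(t)}; \mu_{kt}(\beta), \Sigma_k\right)\right). \tag{2}$$

By modeling the Gaussian means $\{\mu_{kt}\}_{k,t}$ and mixture probabilities $\{\pi_{kt}\}_{k,t}$ as regression functions of $X^{(t)}$ at each time point $t = 1, \ldots, T$, our model directly allows environmental covariates to predict the two main kinds of cell population changes over time: movement in optical space and change in relative population abundance. Furthermore, the signs and magnitudes of the entries of $\alpha$ and $\beta$ directly quantify the contribution of environmental covariates to each population’s abundance and direction of movement in cytogram space.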
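To make the model in this section concrete, the following is a minimal sketch of the cluster means, cluster probabilities and log-likelihood (2) for the one-dimensional case $d = 1$. It is not the flowmix implementation; the function names (`softmax`, `cluster_params`, `log_likelihood`) and the nested-list data layout are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def normal_pdf(y, mean, var):
    """Univariate Gaussian density (the d = 1 case of phi)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cluster_params(x_t, alpha0, alpha, beta0, beta):
    """Return (pi_t, mu_t): cluster probabilities and means at one time point.

    x_t    : list of p covariate values X^(t)
    alpha0 : K intercepts; alpha : K lists of p coefficients
    beta0  : K intercepts; beta  : K lists of p coefficients (d = 1)
    """
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    pi_t = softmax([a0 + dot(x_t, a) for a0, a in zip(alpha0, alpha)])
    mu_t = [b0 + dot(x_t, b) for b0, b in zip(beta0, beta)]
    return pi_t, mu_t

def log_likelihood(ys, xs, alpha0, alpha, beta0, beta, var):
    """Log-likelihood (2) for d = 1; ys[t] holds the particles at time t,
    xs[t] the covariates at time t, var[k] the variance of cluster k."""
    total = 0.0
    for y_t, x_t in zip(ys, xs):
        pi_t, mu_t = cluster_params(x_t, alpha0, alpha, beta0, beta)
        for y in y_t:
            mix = sum(p * normal_pdf(y, m, v) for p, m, v in zip(pi_t, mu_t, var))
            total += math.log(mix)
    return total
```

With all $\alpha$ coefficients equal to zero the cluster probabilities reduce to $1/K$, which gives a quick sanity check of the softmax parameterization.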

2.2. Penalties and constraints.

In practice, there are a large number of environmental covariates that may, in principle, be predictive of a cytogram. Also, the number of regression parameters is (p+1)(d+1)K which can be large relative to the number of cytograms T. Furthermore, we would prefer models in which only a small number of parameters is nonzero. Therefore, we penalize the log-likelihood with lasso penalties (Tibshirani (1996)) on α and β.

In our application each cell population has a limited range in optical properties, due to biological constraints. We incorporate this domain knowledge into the model by constraining the range of $\mu_{k1}, \ldots, \mu_{kT}$ over time. Since $\beta_k^T X^{(t)} = \mu_{kt} - \beta_{0k}$, limiting the size of $\beta_k^T X^{(t)}$ is equivalent to limiting the deviation of the $k$th cluster mean at all times $t = 1, \ldots, T$ away from the overall center $\beta_{0k}$. Motivated by this, we add a hard constraint so that $\|\beta_k^T X^{(t)}\|_2 \le r$ for some fixed radius value $r > 0$.

The choice of r should be specific to the data application. For one-dimensional cytograms of cell diameter measurements, used in the analysis in Section 4.0.1, the size of r holds the intuitive meaning of not allowing the average optical properties of a particular cell population to deviate more than a multiplicative upper and lower bound over time compared to an overall average.

The constraint also plays an important role for model interpretability. We wish for the kth mixture component to correspond to the same cell population over all time. When a cell population vanishes, we would like πkt to go to zero rather than for μkt to move to an entirely different place in cytogram space.

Our estimator is thus a solution to the following optimization problem:

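$$\underset{\alpha, \beta, \Sigma}{\text{minimize}} \; -\frac{1}{N} \log \mathcal{L}\left(\alpha, \beta, \Sigma; \{y_i^{(t)}\}_{i,t}\right) + \lambda_\alpha \sum_{k=1}^{K} \|\alpha_k\|_1 + \lambda_\beta \sum_{k=1}^{K} \|\beta_k\|_1, \quad \text{subject to } \|\beta_k^T X^{(t)}\|_2 \le r \;\; \forall t = 1, \ldots, T, \; \forall k = 1, \ldots, K. \tag{3}$$

We divide the log-likelihood term by $N := \sum_{t=1}^{T} n_t$ to make the scale consistent with that of a single particle.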
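The structure of objective (3) can be sketched as follows: the scaled negative log-likelihood plus two lasso penalties, with the radius constraint treated as an infinite cost when violated. This is only an illustrative sketch; the helper names and the nested-list layout (each `beta_k` stored as a $p \times d$ list of rows) are assumptions, and the scaled negative log-likelihood is passed in as a precomputed number.

```python
import math

def l1_norm(matrix):
    """Sum of absolute values of a nested (row-by-row) coefficient list."""
    return sum(abs(c) for row in matrix for c in row)

def mean_deviation(beta_k, x_t):
    """beta_k^T X^(t): deviation of cluster k's mean from beta_0k.

    beta_k is a p-by-d nested list; x_t is a list of p covariates.
    """
    d = len(beta_k[0])
    return [sum(beta_k[j][c] * x_t[j] for j in range(len(x_t))) for c in range(d)]

def satisfies_radius(beta_k, xs, r):
    """True if ||beta_k^T X^(t)||_2 <= r at every time point in xs."""
    for x_t in xs:
        dev = mean_deviation(beta_k, x_t)
        if math.sqrt(sum(v * v for v in dev)) > r:
            return False
    return True

def penalized_objective(scaled_neg_loglik, alphas, betas, xs, lam_alpha, lam_beta, r):
    """Objective (3); infinite when any cluster violates the radius constraint.

    alphas : K lists of p coefficients; betas : K p-by-d nested lists.
    """
    if not all(satisfies_radius(b, xs, r) for b in betas):
        return math.inf
    penalty = lam_alpha * sum(sum(abs(a) for a in a_k) for a_k in alphas)
    penalty += lam_beta * sum(l1_norm(b) for b in betas)
    return scaled_neg_loglik + penalty
```

Intercepts $\alpha_{0k}$ and $\beta_{0k}$ do not enter the penalty, matching the sums in (3).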

2.3. Multiplicity generalization.

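Cytogram datasets can be extremely large, and cell populations can have highly imbalanced probabilities. To overcome the computational and methodological difficulties posed by these issues, we generalize the model to assign to particle $y_i^{(t)}$ a multiplicity factor $C_i^{(t)}$ (which defaults to 1). The log-likelihood in (2) becomes

$$\log \mathcal{L}_{(n_1, \ldots, n_T)}\left(\alpha, \beta, \Sigma; y^{(1)}, \ldots, y^{(T)}, C^{(1)}, \ldots, C^{(T)}\right) = \sum_{t=1}^{T} \sum_{i=1}^{n_t} C_i^{(t)} \log\left(\sum_{k=1}^{K} \pi_{kt}(\alpha)\, \phi\left(y_i^{(t)}; \mu_{kt}(\beta), \Sigma_k\right)\right), \tag{4}$$

where $y^{(t)} \in \mathbb{R}^{n_t \times d}$ and $C^{(t)} \in \mathbb{R}^{n_t}$. Furthermore, the scaling by $N$ in the optimization objective (3) is generalized to $N := \sum_{t=1}^{T} \sum_{i=1}^{n_t} C_i^{(t)}$, the overall sum of the multiplicities.

The multiplicity generalization is useful for an approximate data representation by placing particles in bins and dealing with bin counts. We discretize cytogram space along a lattice of $B = D^d$ $d$-dimensional cubes $\{E_b\}_{b=1}^{B}$ whose centers $\tilde{y}_b \in \mathbb{R}^d$ can be arranged as the rows of a matrix $\tilde{y} \in \mathbb{R}^{B \times d}$. This coarsened data representation involves counts $\{C_b^{(t)}\}_{b,t}$ of the number of particles in each fixed bin $E_b$:

$$C_b^{(t)} = \sum_{i=1}^{n_t} \mathbb{1}\left\{y_i^{(t)} \in E_b\right\},$$

whose collection is $C^{(t)} \in \mathbb{R}^{B}$. Using $C_b^{(t)}$ and $\tilde{y}_b^{(t)} := \tilde{y}_b$ to replace $C_i^{(t)}$ and $y_i^{(t)}$ in (4), we obtain the log-likelihood of the binned data,

$$\log \mathcal{L}_{(B, \ldots, B)}\left(\alpha, \beta, \Sigma; (\tilde{y}, \ldots, \tilde{y}), C^{(1)}, \ldots, C^{(T)}\right) = \sum_{t=1}^{T} \sum_{b=1}^{B} C_b^{(t)} \log\left(\sum_{k=1}^{K} \pi_{kt}(\alpha)\, \phi\left(\tilde{y}_b; \mu_{kt}(\beta), \Sigma_k\right)\right). \tag{5}$$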

Whereas before each cytogram required its own set of $n_t$ particle locations, in the binned data representation the same set of locations is shared across all $t$, which is indicated by the notation $(\tilde{y}, \ldots, \tilde{y})$.

This binned likelihood is identical to the original log-likelihood (2) after replacing each particle by its bin center. The computational savings are apparent from noticing that $\sum_{b=1}^{B} \mathbb{1}\{C_b^{(t)} \neq 0\} \ll n_t$ since, typically, only a small subset of the bins $E_b$ contain any particles. Additionally, the number of Gaussian density calculations is reduced by a factor of $T$ since the particles $\tilde{y}_b$ do not depend on $t$.

There is no finite value of $B$ for which the binned log-likelihood in (5) is equal to the log-likelihood calculated on the original data, due to the nonzero distance between bin centers $\tilde{y}_b$ and data $y_i^{(t)}$, even for very large $B$. However, the following Proposition 1 establishes that parameter estimation from the binned data is asymptotically equivalent to parameter estimation from the original data, as the number of bins $B$ grows to $\infty$. The proof is provided in Supplement A (Hyun et al. (2023)). As for what occurs for finite values of $B$, a simulation study in Supplement F suggests that even using a relatively small number of bins can achieve similar predictive performance as using the original data.
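The binning step described above can be sketched as follows. This is an illustrative sketch rather than the flowmix routine: the function names, the use of a dictionary keyed by bin index, and the clamping of points that fall on the upper boundary are all assumptions. Passing per-particle weights gives the multiplicity-weighted version (for example, summing a per-particle quantity within each bin instead of counting).

```python
from collections import Counter

def bin_cytogram(particles, lower, upper, D, weights=None):
    """Coarsen one cytogram onto a lattice with D bins per dimension.

    particles    : list of d-dimensional points (lists of floats)
    lower, upper : per-dimension range covered by the lattice
    weights      : optional per-particle multiplicities; defaults to 1 each
    Returns {bin_index_tuple: total multiplicity}, nonempty bins only.
    """
    if weights is None:
        weights = [1.0] * len(particles)
    counts = Counter()
    for y, w in zip(particles, weights):
        idx = []
        for v, lo, hi in zip(y, lower, upper):
            j = int((v - lo) / (hi - lo) * D)
            idx.append(min(max(j, 0), D - 1))  # clamp points on the boundary
        counts[tuple(idx)] += w
    return dict(counts)

def bin_center(idx, lower, upper, D):
    """Center of the lattice cube with the given index tuple."""
    return [lo + (j + 0.5) * (hi - lo) / D for j, lo, hi in zip(idx, lower, upper)]
```

Storing only nonempty bins reflects the observation that few of the $B = D^d$ cubes typically contain particles.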

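Proposition 1. Let

$$\tilde{\Theta}_B := \underset{(\alpha, \beta, \Sigma) \in \Theta}{\operatorname{argmin}} \; -\frac{1}{N} \log \mathcal{L}_{(B, \ldots, B)}\left(\alpha, \beta, \Sigma; (\tilde{y}, \ldots, \tilde{y}), C^{(1)}, \ldots, C^{(T)}\right) + g(\alpha, \beta) \tag{6}$$

be the set of minimizers of the penalized negative log-likelihood of the binned data, and let

$$\hat{\Theta} := \underset{(\alpha, \beta, \Sigma) \in \Theta}{\operatorname{argmin}} \; -\frac{1}{N} \log \mathcal{L}\left(\alpha, \beta, \Sigma; \{y_i^{(t)}\}_{i,t}\right) + g(\alpha, \beta) \tag{7}$$

be that of the original data. The term $g(\alpha, \beta)$ encapsulates the penalties on $\alpha$ and $\beta$ and the constraint on $\beta$ in (3). Assume the following:

  1. The parameter space $\Theta$ of $(\alpha, \beta, \Sigma)$ is compact, and $\{\lambda_{\min}(\Sigma_k) < c\} \cap \Theta = \emptyset$ for all $k = 1, \ldots, K$ for some constant $c > 0$.

  2. The data belongs to a compact set $\mathcal{Y}$ with $\max_{y, y' \in \mathcal{Y}} \|y - y'\| \le R$ for some positive constant $R < \infty$.

  3. The log-likelihood $\log \mathcal{L}\left(\alpha, \beta, \Sigma; \{y_i^{(t)}\}_{i,t}\right) < \infty$ for all $(\alpha, \beta, \Sigma) \in \Theta$.

Then, given any sequence $\tilde{\theta}_B \in \tilde{\Theta}_B$ $(B = 1, 2, \ldots)$ of minimizers of the penalized negative log-likelihood of the binned data, a sequence $s_B$ exists such that the subsequence $\tilde{\theta}_{s_B}$ converges to an element in $\hat{\Theta}$,

$$\lim_{B \to \infty} \tilde{\theta}_{s_B} \in \hat{\Theta}. \tag{8}$$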

This generalization to a binned data representation can be thought of as trading off some data resolution for significant computational savings in practice. To illustrate, the entire set of 3D particles, collected during the Gradients 2 cruise, divides into about $T = 300$ hourly cytograms containing $n_t \approx 100{,}000$ particles each (as visualized in Figure 3). This occupies $d \sum_{t=1}^{T} n_t$ doubles, or 800 Mb in memory for $d = 3$. Equally burdensome is the size of the responsibilities $\{\gamma_{itk}\}_{i,t,k}$ (to be defined shortly in Section 2.4) and densities of each particle with respect to all $K$ clusters, which are each $\sum_t n_t K d$ doubles or 2.5 Gb in memory for $K = 10$. By contrast, when binned with $D = 40$, this becomes 40 Mb in memory.

Fig. 3.


Original particles (left) and binned counts with $D = 40$ (middle) and binned biomass (right). In the middle and right plots, the size of the points is proportional to the multiplicity. The left-hand-side original cytogram contains one hour’s worth of particles, for a total of $n_t = 36{,}757$ points, occupying a total of 0.86 Mb of memory. The binned cytogram in the middle occupies about 1/8th the memory. The right-hand side shows binned biomass data which has lesser imbalance in cluster distribution than the binned count data in the middle.

The biomass representation of data uses carbon quotas (the amount of carbon in each particle, in pgC per cell), $C_i^{(t)} := \mathrm{Biomass}(y_i^{(t)})$, instead of repeated particle counts as multiplicities, and the binned biomass representation of data aggregates the total carbon biomass in each bin as $C_b^{(t)} = \sum_{i : i \in A_b^{(t)}} \mathrm{Biomass}(y_i^{(t)})$. The data analysis in our paper uses the binned biomass representation.

From a modeling viewpoint the carbon biomass representation is an attractive alternative to the particle count representation because our cytograms have highly imbalanced particle clusterings, a setting in which mixture models generally perform poorly (Xu and Jordan (1996)). From a biogeochemical standpoint, biomass distributions are meaningful since cell count is usually inversely proportional to particle size: small cells tend to numerically dominate the ocean due to their smaller size and lesser expenditure of biochemical resources (Marañón (2015)).

However, representing data with biomass is not without complication. Using biomass as multiplicities requires an additional assumption that carbon atoms can be treated in the same way we have treated particles. However, we know that carbon atoms arrive in bundles (according to particle sizes) and, therefore, treating them as independent is an unrealistic assumption. That said, in practice, we see that this simplifying assumption still produces useful and interpretable estimated models.

2.4. Penalized expectation-maximization algorithm.

Directly maximizing the penalized log-likelihood (3), generalized with multiplicities, is difficult due to its nonconvexity. We outline a penalized EM algorithm (Pan and Shen (2007)) for indirectly maximizing the objective.

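Recall from (1) that the latent variable $Z_i^{(t)}$ encodes the particle’s cluster membership:

$$Z_i^{(t)} \in \{1, \ldots, K\}.$$

Also, define the joint log-likelihood of the data and the latent variables to be

$$\log \mathcal{L}_c\left(\alpha, \beta, \Sigma; \{y_i^{(t)}\}_{i,t}, \{Z_i^{(t)}\}_{i,t}, \{C_i^{(t)}\}_{i,t}\right) = \sum_{t=1}^{T} \sum_{i=1}^{n_t} C_i^{(t)} \sum_{k=1}^{K} \mathbb{1}\left\{Z_i^{(t)} = k\right\} \log\left(\pi_{kt}(\alpha)\, \phi\left(y_i^{(t)}; \mu_{kt}(\beta), \Sigma_k\right)\right). \tag{9}$$

Now, denote the conditional probability of membership as

$$\gamma_{itk}(\alpha, \beta, \Sigma) = P_{\alpha, \beta, \Sigma}\left(Z_i^{(t)} = k \mid y^{(t)}, X^{(t)}\right),$$

sometimes called responsibilities in the literature.

Given some latest estimates of the parameters $(\hat{\alpha}, \hat{\beta}, \hat{\Sigma})$, we make use of the surrogate objective $Q(\alpha, \beta, \Sigma \mid \hat{\alpha}, \hat{\beta}, \hat{\Sigma})$, defined as the penalized conditional expectation, with respect to the conditional distribution of $Z^{(t)} \mid y^{(t)}, X^{(t)}$, of the joint log-likelihood (9),

$$Q(\alpha, \beta, \Sigma \mid \hat{\alpha}, \hat{\beta}, \hat{\Sigma}) = \frac{1}{N} \sum_{t=1}^{T} \sum_{i=1}^{n_t} C_i^{(t)} \sum_{k=1}^{K} \gamma_{itk}(\hat{\alpha}, \hat{\beta}, \hat{\Sigma}) \log\left(\pi_{kt}(\alpha)\, \phi\left(y_i^{(t)}; \mu_{kt}(\beta), \Sigma_k\right)\right) - \lambda_\alpha \sum_{k=1}^{K} \|\alpha_k\|_1 - \lambda_\beta \sum_{k=1}^{K} \|\beta_k\|_1 - \sum_{k=1}^{K} \sum_{t=1}^{T} \mathbb{1}_\infty\left\{\|\beta_k^T X^{(t)}\|_2 \le r\right\}. \tag{10}$$

Here $\mathbb{1}_\infty\{\cdot\}$ encodes the constraint in (3): it equals $0$ when the condition holds and $\infty$ otherwise. The algorithm alternates between estimating the conditional membership probabilities $\gamma_{itk}$ and updating the latest parameter estimates $(\hat{\alpha}, \hat{\beta}, \hat{\Sigma})$ by the maximizer of the penalized $Q$ function in (10):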

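  1. E-step. Given $(\hat{\alpha}, \hat{\beta}, \hat{\Sigma})$, estimate the conditional membership probabilities as

$$\gamma_{itk}(\hat{\alpha}, \hat{\beta}, \hat{\Sigma}) = \frac{\phi\left(y_i^{(t)}; \mu_{kt}(\hat{\beta}), \hat{\Sigma}_k\right) \pi_{kt}(\hat{\alpha})}{\sum_{l=1}^{K} \phi\left(y_i^{(t)}; \mu_{lt}(\hat{\beta}), \hat{\Sigma}_l\right) \pi_{lt}(\hat{\alpha})}, \tag{11}$$

for $k = 1, \ldots, K$; $t = 1, \ldots, T$; $i = 1, \ldots, n_t$. For the first iteration, choose some initial values for means $\mu_{kt} \leftarrow \mu_k^{\text{init}}$, probabilities $\pi_{kt} \leftarrow 1/K$ and $\hat{\Sigma}_k \leftarrow \Sigma_k^{\text{init}} = g I_d$ for some constant $g > 0$.

  2. M-step. Using $\gamma_{itk} = \gamma_{itk}(\hat{\alpha}, \hat{\beta}, \hat{\Sigma})$, maximize (10) with respect to each parameter $\alpha$, $\beta$ and $\Sigma$:

    a. Update $\hat{\alpha}$: The maximizer of (10) with respect to $\alpha$ is

$$\hat{\alpha} \leftarrow \underset{\{\alpha_{0k}\}_{k=1}^{K}, \{\alpha_k\}_{k=1}^{K}}{\operatorname{argmax}} \; \frac{1}{N} \sum_{t=1}^{T} \left(\sum_{k=1}^{K} \gamma_{tk}\left(\alpha_{0k} + (X^{(t)})^T \alpha_k\right) - n_t \log \sum_{l=1}^{K} \exp\left(\alpha_{0l} + (X^{(t)})^T \alpha_l\right)\right) - \lambda_\alpha \sum_{k=1}^{K} \|\alpha_k\|_1$$

for sums $\gamma_{tk} = \sum_{i=1}^{n_t} C_i^{(t)} \gamma_{itk}$.

    b. Update $\hat{\beta}$: Update $\beta$ according to the ADMM algorithm described in Section 2.5 and Supplement B. Since the problem decouples across clusters, we solve separately for each $k$,

$$\hat{\beta}_{0k}, \hat{\beta}_k \leftarrow \underset{\beta_{0k}, \beta_k}{\operatorname{argmin}} \; \frac{1}{2N} \sum_{t=1}^{T} \sum_{i=1}^{n_t} C_i^{(t)} \gamma_{itk} \left(y_i^{(t)} - \beta_{0k} - \beta_k^T X^{(t)}\right)^T \hat{\Sigma}_k^{-1} \left(y_i^{(t)} - \beta_{0k} - \beta_k^T X^{(t)}\right) + \lambda_\beta \|\beta_k\|_1 \quad \text{subject to } \|\beta_k^T X^{(t)}\|_2 \le r \;\; \forall t = 1, \ldots, T.$$

    c. Update $\hat{\Sigma}$: The maximizer of (10) with respect to $\Sigma_k$ for each $k = 1, \ldots, K$ is

$$\hat{\Sigma}_k \leftarrow \frac{\sum_{t=1}^{T} \sum_{i=1}^{n_t} C_i^{(t)} \gamma_{itk}\, r_{itk} r_{itk}^T}{\sum_{t=1}^{T} \sum_{i=1}^{n_t} C_i^{(t)} \gamma_{itk}}$$

for $r_{itk} = y_i^{(t)} - \hat{\beta}_{0k} - \hat{\beta}_k^T X^{(t)}$.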

Note, the M-step breaks into a convex problem over α (step 2a) and a nonconvex problem over (β,Σ) (step 2b and 2c). For the latter part of the M-step, instead of jointly optimizing over (β,Σ), we perform two successive partial optimizations—first, with respect to β, and next, with respect to Σ.
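Two of the simpler updates, the E-step responsibilities and the covariance update, can be sketched for one-dimensional data as follows. This is a hedged illustration, not the flowmix code: the function names and the nested-list layout are assumptions, and the $\alpha$ and $\beta$ updates (which need a penalized multinomial solver and the ADMM of Section 2.5) are deliberately left out.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate Gaussian density (the d = 1 case of phi)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(y_t, pi_t, mu_t, var):
    """Responsibilities for the particles of one cytogram (d = 1).

    Returns one row per particle; each row has K entries that sum to 1.
    """
    resp = []
    for y in y_t:
        num = [p * normal_pdf(y, m, v) for p, m, v in zip(pi_t, mu_t, var)]
        total = sum(num)
        resp.append([n / total for n in num])
    return resp

def update_variance(ys, mults, resps, mus, k):
    """Weighted residual variance of cluster k (the d = 1 covariance update).

    ys[t], mults[t], resps[t] hold the particles, multiplicities C and
    responsibilities at time t; mus[t][k] is the fitted mean of cluster k.
    """
    num = 0.0
    den = 0.0
    for y_t, c_t, g_t, mu_t in zip(ys, mults, resps, mus):
        for y, c, g in zip(y_t, c_t, g_t):
            w = c * g[k]
            num += w * (y - mu_t[k]) ** 2
            den += w
    return num / den
```

Each particle's contribution is weighted by its multiplicity times its responsibility, so a binned particle with multiplicity 3 counts exactly as three identical particles would.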

This algorithm is terminated when the penalized log-likelihood has a negligible relative improvement. In practice, we run the EM algorithm multiple times and retain the run with the highest final log-likelihood for a better chance at achieving the true optimum. For $\mu_k^{\text{init}}$ we randomly choose $K$ out of all $\sum_{t=1}^{T} n_t$ cytogram particles. Initial covariances $\{\Sigma_k^{\text{init}}\}_{k=1}^{K}$ are set to have diagonal entries $g$ equal to $1/K$ times the cytogram range in each dimension. The $\alpha$ part of the M-step is solved using glmnet with family set to “multinomial” (Friedman, Hastie and Tibshirani (2010)). The $\beta$ part of the M-step requires a custom alternating direction method of multipliers (ADMM) solver, outlined in the next section.

2.5. ADMM algorithm in M-step for β.

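The $\beta$ M-step (step 2b) is very slow if computed using a noncustomized solver, for instance CVX (Grant and Boyd (2014)); it is the slowest component of the EM algorithm by a factor of 10 or more. To improve performance, we devise a customized alternating direction method of multipliers (ADMM) algorithm (Boyd et al. (2011)). We start by observing that this optimization problem decouples across $k$. Since each $k \in \{1, \ldots, K\}$ can be solved separately, we will drop the subscript $k$, hereon, and write the variables $\beta_{0k}$ and $\beta_k$ as $\beta_0$ and $\beta$, $\gamma_{itk}$ as $\gamma_{it}$, and $\hat{\Sigma}_k$ as $\hat{\Sigma}$ for notational simplicity.

Consider the minimization problem in step 2b of the M-step of the penalized EM algorithm. The objective to minimize can be written as

$$f(\beta_0, \beta) = \frac{1}{2N} \sum_{i,t} C_i^{(t)} \gamma_{it} \left(y_i^{(t)} - \beta_0 - \beta^T X^{(t)}\right)^T \hat{\Sigma}^{-1} \left(y_i^{(t)} - \beta_0 - \beta^T X^{(t)}\right) + \lambda_\beta \|\beta\|_1 + \mathbb{1}_\infty\left\{\|\beta^T X^{(t)}\|_2 \le r\right\}.$$

We can obtain the overall minimizer via partial minimization with respect to $\beta_0$; writing $\hat{\beta}_0(\beta) := \operatorname{argmin}_{\beta_0} f(\beta_0, \beta)$ for this partial minimizer, setting the gradient to 0 yields the closed form expression

$$\hat{\beta}_0(\beta) = \frac{\sum_{i,t} C_i^{(t)} \gamma_{it} \left(y_i^{(t)} - \beta^T X^{(t)}\right)}{\sum_{i,t} C_i^{(t)} \gamma_{it}}.$$

The objective to minimize with respect to $\beta$ then becomes

$$f(\hat{\beta}_0(\beta), \beta) = \frac{1}{2N} \sum_{i,t} C_i^{(t)} \gamma_{it} \left(\tilde{y}_i^{(t)} - \beta^T \tilde{X}^{(t)}\right)^T \hat{\Sigma}^{-1} \left(\tilde{y}_i^{(t)} - \beta^T \tilde{X}^{(t)}\right) + \lambda_\beta \|\beta\|_1 + \mathbb{1}_\infty\left\{\|\beta^T X^{(t)}\|_2 \le r\right\},$$

where $\tilde{y}_i^{(t)} := y_i^{(t)} - \bar{y}$ and $\tilde{X}^{(t)} := X^{(t)} - \bar{X}$ are data centered by the weighted averages $\bar{y} := \sum_{i,t} C_i^{(t)} \gamma_{it} y_i^{(t)} / \sum_{i,t} C_i^{(t)} \gamma_{it}$ and $\bar{X} := \sum_{i,t} C_i^{(t)} \gamma_{it} X^{(t)} / \sum_{i,t} C_i^{(t)} \gamma_{it}$. Now, introducing augmented variables $Z \in \mathbb{R}^{T \times d}$ and $W \in \mathbb{R}^{p \times d}$, we can rewrite $\min_\beta f(\hat{\beta}_0(\beta), \beta)$ as

$$\underset{\beta, Z, W}{\text{minimize}} \; \frac{1}{2N} \sum_{i,t} C_i^{(t)} \gamma_{it} \left(\tilde{y}_i^{(t)} - \beta^T \tilde{X}^{(t)}\right)^T \hat{\Sigma}^{-1} \left(\tilde{y}_i^{(t)} - \beta^T \tilde{X}^{(t)}\right) + \lambda_\beta \|W\|_1, \quad \text{subject to } \|Z^{(t)}\|_2 \le r, \;\; \begin{pmatrix} X \\ I \end{pmatrix} \beta = \begin{pmatrix} Z \\ W \end{pmatrix}, \tag{12}$$

which can be solved using an ADMM whose full details are deferred to Supplement B. All steps are computationally simple, consisting of least squares reduced to rapidly solvable Sylvester equations, $\ell_2$ ball projection and soft thresholding. The implementation in the flowmix R package is highly optimized and faster than any other component of the EM algorithm.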
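The full ADMM is deferred to Supplement B, but two of the elementary operations named above, soft thresholding and projection onto an $\ell_2$ ball, are standard and can be sketched on their own. The function names below are assumptions for illustration, and how these pieces are combined with the least-squares step inside the solver is not shown here.

```python
import math

def soft_threshold(w, tau):
    """Elementwise soft thresholding: the proximal operator of tau * ||.||_1.

    Each entry is shrunk toward zero by tau and set to zero if |entry| <= tau.
    """
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in w]

def project_l2_ball(z, r):
    """Euclidean projection of vector z onto the ball {z : ||z||_2 <= r}."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= r:
        return list(z)  # already feasible, returned unchanged
    return [v * r / norm for v in z]
```

Soft thresholding is what produces exact zeros in the lasso-penalized coefficients, while the projection enforces a radius constraint of the form $\|z\|_2 \le r$ on a single vector.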

2.6. Cross-validation for selection of $\lambda_\alpha, \lambda_\beta$.

We choose the regularization parameter values $\lambda_\alpha, \lambda_\beta$ using five-fold cross-validation over a discrete two-dimensional grid of candidate values $L_\alpha \times L_\beta$, in which $L_\alpha$ and $L_\beta$ each contain logarithmically-spaced positive real numbers. Each of the five folds consists of every fifth time block, where a block contains 20 consecutive time points. Denote these five test folds’ time points as sets $\{I_o\}_{o=1}^{5}$ so that $I_1 = \{1, \ldots, 20, 101, \ldots, 120, \ldots\}$, $I_2 = \{21, \ldots, 40, 121, \ldots, 140, \ldots\}$ and so forth. Writing $I_{-o} = \{1, \ldots, T\} \setminus I_o$, the test datasets consist of the subsetted data $\{X_o, y_o, C_o\}_{o=1}^{5}$ for $X_o := \{X^{(t)} : t \in I_o\}$, $y_o := \{y^{(t)} : t \in I_o\}$ and $C_o := \{C^{(t)} : t \in I_o\}$, and the corresponding training datasets consist of $\{X_{-o}, y_{-o}, C_{-o}\}_{o=1}^{5}$.

The five-fold cross-validation score is calculated as the average of the out-of-sample negative log-likelihood in (4),

$$S(\lambda_\alpha, \lambda_\beta) = \frac{1}{5} \sum_{o=1}^{5} -\log \mathcal{L}_{(n_t : t \in I_o)}\left(\hat{\alpha}_{-o}, \hat{\beta}_{-o}, \hat{\Sigma}_{-o}; y_o, X_o, C_o\right),$$

where $\hat{\alpha}_{-o}$, $\hat{\beta}_{-o}$ and $\hat{\Sigma}_{-o}$ are the estimated coefficients from the training dataset $(X_{-o}, y_{-o}, C_{-o})$. (We include $X_o$ in $\mathcal{L}(\cdot)$ to emphasize which subset of the covariates the log-likelihood is based on.) The cross-validated regularization parameter values $\hat{\lambda}_\alpha$ and $\hat{\lambda}_\beta$ are the minimizer of the cross-validation score,

$$(\hat{\lambda}_\alpha, \hat{\lambda}_\beta) = \underset{\lambda_\alpha \in L_\alpha, \lambda_\beta \in L_\beta}{\operatorname{argmin}} \; S(\lambda_\alpha, \lambda_\beta).$$

A real data example of cross-validation scores in action is shown in Figures 12 and 13 in the Supplementary Material. Our scheme of training/test splits places a strong emphasis on even temporal coverage of the test data. Since our data are in hourly resolution (equivalent to 20 kilometers in space) and cross-validation folds are made of 20-hour-long time blocks, the temporal closeness of the test time points $I_o$ and the training time points $I_{-o}$ is negligible. For data with finer time resolution, our recommendation is to form a time barrier between the training and test time points or to form larger time blocks for test folds. Also, in this work we do not discuss how to select the number of clusters $K$ based on data. In simulation, we demonstrate that slightly overspecifying the number of clusters yields predictive performance comparable to that of the true number of clusters; see Section 3.1.2 for details.
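The blocked fold construction described in this section can be sketched as follows. The function names are assumptions for illustration; time points are numbered from 1 to match the text, and the final block is simply shorter when $T$ is not a multiple of the block length.

```python
def blocked_folds(T, n_folds=5, block_len=20):
    """Assign time points 1..T to test folds in consecutive blocks.

    Block 1 (times 1..block_len) goes to fold 1, block 2 to fold 2, and so
    on, cycling so that each fold is made of every n_folds-th block.
    Returns a list of n_folds sorted lists of time points.
    """
    folds = [[] for _ in range(n_folds)]
    for t in range(1, T + 1):
        block = (t - 1) // block_len
        folds[block % n_folds].append(t)
    return folds

def train_times(T, test_fold):
    """Training time points: the complement of a test fold within 1..T."""
    held_out = set(test_fold)
    return [t for t in range(1, T + 1) if t not in held_out]
```

For example, `blocked_folds(296)[0]` starts with time points 1 to 20 followed by 101 to 120, matching the description of the first test fold.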

3. Numerical results.

3.1. Simulated data.

In order to examine the numerical properties of our proposed method, we apply our model to simulated data whose setup is closely related to our main flow cytometry datasets.

3.1.1. Noisy covariates.

The main source of noise in our data is in the environmental covariates, which come from a variety of sources: in-situ and remote-sensing measurements and oceanographic model-derived products (Boyer et al. (2013)), each with different temporal and spatial resolution and varying amounts of uncertainty. In order to investigate the effect of uncertainty in the covariates, we conduct a simulation in which synthetic cytograms are generated from a true model and underlying covariates, and then our model is estimated with access to only artificially obscured covariates.

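We generate synthetic data with $T = 100$ time points, $K = 2$ clusters and $p = 10$ covariates $\{X_i \in \mathbb{R}^T\}_{i=1}^{10}$, as shown in Figure 4: one sunlight variable $X_1$, one changepoint variable $X_2$ and eight spurious covariates $\{X_i\}_{i=3}^{10}$. From these covariates, $T$ one-dimensional cytograms are generated from the generative model in Section 2.1 with the true underlying coefficient values,

$$\alpha_{0,1} = 0, \;\; \alpha_{0,2} = 0, \;\; \alpha_1 = (0, 0, \ldots, 0)^T, \;\; \alpha_2 = (0, 8.61, 0, \ldots, 0)^T, \;\; \beta_{0,1} = 0, \;\; \beta_{0,2} = 3, \;\; \beta_1 = (0.3, 0, \ldots, 0)^T, \;\; \beta_2 = (0.3, 0, \ldots, 0)^T. \tag{13}$$

Fig. 4.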


(Left) The thick black line shows the first covariate $X_1 \in \mathbb{R}^T$ which is a smoothed and standardized version of the par (sunlight) covariate from Section 4.0.1. The three thin lines show the obscured sunlight variables for three different noise levels $\sigma_{\text{add}}$. The next covariate is a changepoint variable $X_2 \in \mathbb{R}^T$, shown as a thick red line. The remaining eight spurious covariates $\{X_i\}_{i=3}^{10}$ are generated as $T$ i.i.d. entries from $\mathcal{N}(0, 1 + \sigma_{\text{add}}^2)$; these are not shown here. (Right) An example of a generated dataset, whose particles are shown as grey points in the background. The two true cluster means are plotted as colored lines whose thickness is proportional to the cluster probabilities. Particles for both clusters are generated as $\mathcal{N}(0, 1)$ around the cluster means. Cluster 1 is only present in the second half and has one quarter of the number of particles in cluster 2 in those time points. A thin dashed line is shown in the first half where the cluster probability is zero.

Both clusters’ means follow the sunlight $X_1$. Cluster 1 has $n_t = 200$ particles for all time points $t = 1, \ldots, 100$. Cluster 2 overlaps with cluster 1, is present only in the second half of the time range $t = 51, \ldots, 100$ and is 1/4th as populous as cluster 1 at those time points. Both cluster variances are equal to 1 so that particles from each cluster are generated from $\mathcal{N}(0, 1)$ around their respective means, and the spurious covariates play no role in data generation; that is, all other coefficients not specified in (13) are zero.

On each new synthetic dataset, we estimate a cross-validated two-cluster model using radius $r = 1.5$, but instead of the sunlight covariate $X_1$, we use the obscured $X_1^{\text{noisy}} = X_1 + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma_{\text{add}}^2 I_T)$ for estimation. Also, the eight spurious covariates $\{X_i\}_{i=3}^{10}$ are each generated as $\mathcal{N}(0, 1 + \sigma_{\text{add}}^2)$ to match the magnitude of $X_1^{\text{noisy}}$. We consider a certain range of additive noise $\sigma_{\text{add}} \in \{0, 0.3, 0.6, \ldots, 2.7\}$ and 100 synthetic datasets for each value $\sigma_{\text{add}}$.
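The covariate-obscuring step can be sketched as follows. This is an illustration under stated assumptions: the function name, the seeding scheme and the use of Python's `random` module are choices made here, not taken from the paper's simulation code. Note that the second parameter of a Gaussian is a variance in the text, so the standard deviation passed to the sampler is its square root.

```python
import random

def obscure_covariates(x1, sigma_add, n_spurious=8, seed=0):
    """Return (x1_noisy, spurious) for one synthetic dataset.

    x1_noisy adds independent N(0, sigma_add^2) noise to each entry of x1.
    Each spurious covariate has i.i.d. N(0, 1 + sigma_add^2) entries and the
    same length as x1.
    """
    rng = random.Random(seed)
    T = len(x1)
    x1_noisy = [v + rng.gauss(0.0, sigma_add) for v in x1]
    sd = (1.0 + sigma_add ** 2) ** 0.5  # standard deviation, not variance
    spurious = [[rng.gauss(0.0, sd) for _ in range(T)] for _ in range(n_spurious)]
    return x1_noisy, spurious

# Noise levels considered in the text: 0, 0.3, 0.6, ..., 2.7
noise_levels = [round(0.3 * i, 1) for i in range(10)]
```

Setting `sigma_add` to 0 leaves the sunlight covariate unchanged, which corresponds to the noiseless end of the simulation grid.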

The left plot of Figure 5 shows the out-of-sample prediction performance of the 100 estimated models for each noise level $\sigma_{\text{add}}$, measured as the negative log-likelihood evaluated on a large independent test dataset. As expected, out-of-sample prediction gradually worsens with increasing covariate noise $\sigma_{\text{add}}$, then plateaus at about $\sigma_{\text{add}}=2.7$.
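As an illustration of this performance measure, the following sketch evaluates the negative log-likelihood of held-out one-dimensional particles under a time-varying Gaussian mixture. The function names are hypothetical, and averaging per particle (rather than summing) is an assumption made here for readability; the paper's exact normalization is not stated in this section.

```python
import math

def gaussian_pdf(y, mean, sd):
    """Density of a univariate normal distribution at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_nll(particles, probs, means, sds):
    """Average negative log-likelihood of 1-D particles under a
    time-varying Gaussian mixture.

    particles[t] : list of particle values at time t
    probs[t][k]  : probability of cluster k at time t
    means[t][k]  : mean of cluster k at time t
    sds[k]       : standard deviation of cluster k (constant over time)
    """
    total, count = 0.0, 0
    for t, ys in enumerate(particles):
        for y in ys:
            density = sum(p * gaussian_pdf(y, m, s)
                          for p, m, s in zip(probs[t], means[t], sds))
            total -= math.log(density)
            count += 1
    return total / count
```

Lower values indicate better out-of-sample prediction; a model whose estimated means drift away from the test particles is penalized through a smaller mixture density at each particle.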

Fig. 5.

(Left) Out-of-sample prediction performance using covariates obscured by Gaussian noise with variance $\sigma_{\text{add}}^2$, for the simulation setup described in Section 3.1.1. (Right) The probability of the sunlight covariate (the only relevant covariate for cluster means) being estimated as nonzero is shown in black lines. The corresponding probabilities for the eight spurious covariates are shown in red lines (thin red lines are individual covariates, and the thick red line is the average). The solid and dashed lines show results from cluster 1 and cluster 2, respectively. In both clusters the sunlight variable is more likely to be selected than the spurious variables. This advantage is more pronounced for cluster 1 than for cluster 2, which only has data in the second half of the time range.

The right plot of Figure 5 demonstrates the variable selection property of our method, focusing on the $\beta$ coefficients. The sunlight variable, the only true predictor of mean movement, is more likely to be selected than the spurious covariates and is less likely to be selected as $\sigma_{\text{add}}$ increases. Additionally, we see that selecting sunlight is possible even when $\sigma_{\text{add}}$ is high, provided the cluster has higher relative probability and has nonzero probability over a longer time range.

3.1.2. Cluster number misspecification.

In addition to covariate noise, we explore the effect of misspecifying the number of clusters $K$ in the model. We first form a ground truth model by taking the five-cluster estimated model from the one-dimensional $T=296$ data in Section 4.0.1 and Figure 7 and zero-thresholding the smaller estimated coefficients. We then generate new data 30 times from this underlying true model and estimate a $K$-cluster cross-validated model for $K \in \{2,3,4,5,6,7,8\}$. Figure 6 shows out-of-sample prediction performance, measured as the negative log-likelihood on a large independent test set generated from the true model. We see that models estimated with $K<5$ clusters have sharply deteriorating out-of-sample prediction. On the other hand, models estimated with $K>5$ clusters have average out-of-sample prediction performance in the same range as that of $K=5$ cluster models. A closer examination of the estimated models reveals that, out of the $K>5$ clusters, five clusters are usually estimated accurately, and the remaining $K-5$ clusters are estimated with near-zero probability. These results suggest that one can slightly overspecify the number of clusters for estimation with little harm to prediction performance. Automatic approaches to choosing $K$ are an interesting area of future work.

Fig. 7.

(Top) The one-dimensional cell diameter biomass cytograms (log transformed) at an hourly time resolution are shown here. In the background, the one-dimensional biomass distribution of binned cell diameter data is shown in greyscale. (Bottom) The estimated five-cluster model is overlaid on the same plot; the five solid lines are the five estimated cluster means, whose thickness shows the values of the $K=5$ cluster probabilities $\{\pi_{kt}\}_{k=1}^{K}$ over time $t=1,\ldots,296$ (individual hours). The shaded regions around the solid lines are the estimated $\pm 2$ standard deviations around the cluster means.

Fig. 6.

Out-of-sample prediction performance for K-cluster models estimated from five-cluster pseudo-real datasets (which were each generated from a simplified version of a model estimated from real one-dimensional data, in Section 4.0.1). Models estimated with fewer than five clusters have sharply worse out-of-sample prediction performance. On the other hand, estimated models with 5 clusters or more have similar out-of-sample prediction performance, because the extra clusters are estimated to have zero probability, and play no role in the prediction.

4. Application to Seaflow cruise.

In this section we apply our model to data collected on a research cruise in the North Pacific Ocean and from the Simons CMAP database (https://simonscmap.com/). The MGL1704 cruise traversed two oceanographic regions over the course of about two weeks, between dates 2017-05-28 and 2017-06-13. As seen in Figure 1, the cruise started in the North Pacific Subtropical Gyre (low latitude, dominated by warm, saltier water), traveled north to the Subpolar Gyre (high latitude, low-temperature, low-salt, nutrient-rich water) and returned south. We first describe the data and model setup, then discuss the results.

Environmental covariates.

A total of 33 environmental covariates (see Table 1 and Figure 11 of the Supplementary Material) were colocalized with cytometric data by averaging the environmental data measurements within a rectangle around every discrete point of the cruise trajectory in space and time, aggregated to an hourly resolution. These data were processed and downloaded from the Simons CMAP database (Ashkezari et al. (2021)) and accessed through the CMAP4R R package (Hyun et al. (2020)). In addition to these covariates, we created four new covariates by lagging the sunlight covariate in time by 3, 6, 9 and 12 hours. This was motivated by scientific evidence showing that the peak of phytoplankton cell division is out of phase with sunlight (Ribalet et al. (2015)). We also created two new changepoint variables demarcating the two crossing events of the cruise through a biological transition line at latitude 37. These derived covariates allow a more flexible conditional representation of the cytograms, using information from the covariates. All covariates, except for the two changepoint variables, were centered and scaled to have sample variance of 1. Altogether, we formed a covariate matrix $X \in \mathbb{R}^{(308-12)\times 39}$. (The first 12 time points are deleted due to the lagging of the sunlight variable.)
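The construction of the derived covariates can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the helper names (`lag`, `standardize`, `build_design`) are hypothetical, the changepoint variables are assumed to be indicator columns supplied by the caller, and sunlight is assumed to be one of the 33 original covariates.

```python
import math

def lag(series, hours):
    """Shift an hourly series forward in time; the first `hours` entries are undefined."""
    return [None] * hours + series[:len(series) - hours]

def standardize(column):
    """Center a column and scale it to sample variance 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def build_design(sunlight, other_covariates, changepoints, lags=(3, 6, 9, 12)):
    """Return the columns of the design matrix after lagging, trimming and scaling."""
    drop = max(lags)
    lagged = [lag(sunlight, h) for h in lags]
    continuous = [sunlight] + other_covariates + lagged
    # Drop the first max(lags) time points, where lagged values do not exist.
    continuous = [standardize(col[drop:]) for col in continuous]
    # Changepoint columns are trimmed but left on their original scale.
    indicators = [col[drop:] for col in changepoints]
    return continuous + indicators
```

With 308 hourly time points, sunlight plus 32 other covariates, four lags and two changepoint columns, this yields 39 columns of length 296, matching the dimensions of $X$ stated above.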

Response data (cytograms).

The response data (cytograms) were collected onboard using a continuous-time flow cytometer, called SeaFlow, which continuously analyzes sea water through a small opening and measures the optical properties of individual microscopic particles (Swalwell, Ribalet and Armbrust (2011)). The data consist of measurements of light scatter and fluorescence emissions of individual particles. Data are organized into files recorded every three minutes, where each file contains measurements of the cytometric characteristics of between 1000 and 100,000 particles ranging from 0.5 to 5 microns in diameter. The size of data in any given file depends on the cell abundance of phytoplankton within the sampled region. Each particle is characterized by two measures of fluorescence emission (chlorophyll and phycoerythrin), its diameter (estimated from light scatter measurements by the application of Mie theory for spherical particles), its carbon content (cell volume is converted to carbon content) and its label (identified based on a combination of manual gating and a semisupervised clustering method), as described in Ribalet et al. (2019). Note that we use the particle labels only for comparison to our approach in Section 4.1. Particles were aggregated by hour, resulting in T=296 cytograms for the duration of the cruise, with matching time points as rows of X.

Lastly, the cytogram data $\{y_i^{(t)} \in \mathbb{R}^3 : i=1,\ldots,n_t\}_t$ were log transformed due to skewness of the original distributions, augmented with biomass multiplicity $\{C_i^{(t)} : i=1,\ldots,n_t\}_t$ and binned using $D=40$ equally sized bins in each dimension, as described in Section 2.3. In the analyses to follow in Sections 4.0.1–4.0.2, we consider two data representations: a $d=1$ case with only the binned cell diameter biomass cytograms, and the full $d=3$ dimensional binned biomass cytograms.
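For the $d=1$ case, the log transform and biomass-weighted binning can be illustrated with a short sketch. The function name `bin_biomass`, the choice of bin range, and the clamping of values at the edges are assumptions made for illustration; the binning procedure actually used is the one described in Section 2.3.

```python
import math

def bin_biomass(diameters, biomass, lower, upper, D=40):
    """Bin log-transformed diameters into D equal-width bins on [lower, upper]
    (log scale) and sum the biomass that falls in each bin."""
    width = (upper - lower) / D
    centers = [lower + (j + 0.5) * width for j in range(D)]
    totals = [0.0] * D
    for diam, c in zip(diameters, biomass):
        y = math.log(diam)
        j = int((y - lower) / width)
        j = min(max(j, 0), D - 1)  # clamp values on or beyond the edges
        totals[j] += c
    return centers, totals
```

Each hourly cytogram is then represented by at most $D$ bin centers with associated biomass totals instead of every individual particle, which is what keeps estimation tractable at this data scale.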

Practicalities.

The regularization parameters $\lambda_\alpha, \lambda_\beta$ were chosen using five-fold cross-validation, as described in Section 2.6. Every application of the EM algorithm was repeated five times (for three-dimensional data) or 10 times (for the one-dimensional example). The model means were restricted using a ball constraint of radius $r$, as described in Section 2.2. In the one-dimensional data analysis in Section 4.0.1, the radius reflects the underlying assumption that carbon quotas should, at most, double or halve, peaking during the day due to carbon fixation via photosynthesis by the cell, and halving due to cell division (i.e., the mother cell divides into two equal daughter cells). Assuming spherical particles, this would correspond to a log scale day-night cell diameter difference of $\log(2)/3 \approx 0.231$, halved to obtain $r=0.1153$. The three-dimensional data analysis in Section 4.0.2 first shifts and scales the log cell diameter to be in the same range as the other axes and uses $r=0.5$, which is similar in scale to the radius used in the one-dimensional analysis.
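The arithmetic behind the one-dimensional radius can be spelled out as a short worked derivation. It assumes, for illustration only, that carbon quota is proportional to the volume $V$ of a spherical cell of diameter $d$; the symbols $V_{\text{day}}$, $V_{\text{night}}$, $d_{\text{day}}$ and $d_{\text{night}}$ are introduced here and do not appear in the original text.

```latex
\begin{aligned}
V = \tfrac{\pi}{6}\, d^{3}, \qquad
\frac{V_{\text{day}}}{V_{\text{night}}} = 2
\;&\Longrightarrow\;
\frac{d_{\text{day}}}{d_{\text{night}}} = 2^{1/3}, \\
\log d_{\text{day}} - \log d_{\text{night}}
&= \tfrac{1}{3}\log 2 \approx 0.231, \\
\tfrac{1}{2}\cdot\tfrac{1}{3}\log 2 &\approx 0.116 .
\end{aligned}
```

Under this simplified assumption, half of the day-night log-diameter range is roughly 0.116, which is close to the reported $r=0.1153$; the ball constraint therefore lets a cluster mean deviate from its center by about half of the full diel swing in either direction.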

4.0.1. Application to one-dimensional cell diameter data.

In this section we apply our model to one-dimensional cytograms at the hourly time resolution. The one-dimensional setting is useful for visualization because single plots can display the entire data and fitted model parameters, displaying cluster means $\{\mu_{k\cdot} \in \mathbb{R}^T\}_1^K$ as lines and cluster probabilities $\{\pi_{k,\cdot} \in \mathbb{R}^T\}_1^K$ as line thickness, as well as shaded approximate 95% conditional density intervals from $\phi(\cdot; \mu_{kt}, \Sigma_k)$. The estimated means and probabilities are shown in Figure 7, and the estimated coefficients can be seen in Table 2 of the Supplementary Material.

Overall, the estimated model effectively captures the visual patterns in the cytogram data. Clusters 3 and 5 correspond to two well-known populations, called Synechococcus and Prochlorococcus, respectively. The most prominent phenomenon is the daily fluctuation of the mean of cluster 5, which is clearly predicted using a combination of time-lagged sunlight and ocean altimetry. Also notable is the change in probability of cluster 3, which is predicted well by physical and chemical covariates, such as sea surface temperature and phosphate. The two overlapping clusters 3 and 4 are also accurately captured as separate clusters.

As we will see shortly in the three-dimensional analysis, introducing the other two axes of the cytograms (i.e., moving from one-dimensional to three-dimensional cytograms) clearly helps further distinguish between clusters and identify finer-grain cluster mean movement. Furthermore, cluster 4, which has a large variance and serves as a catch-all background cluster, does not appear to represent a specific cell population and rather exists to improve the other clusters’ model fits.

We also estimated the stability of β coefficients of this model by calculating the nonzero proportion of each of the estimated coefficients produced from subsampled datasets. These nonzero proportions are displayed alongside the original coefficient estimates in Tables 7 and 8 in the Supplementary Material, and the entire procedure is detailed in Supplement D. The stability estimates seem quite sensible; they show high nonzero probability of sunlight variables for Prochlorococcus (cluster 5) as well as overall low nonzero probabilities for the covariates of cluster 4, the background cluster.
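The nonzero proportions described here amount to a simple tally over refits. The sketch below assumes the coefficient vectors estimated on each subsampled dataset have already been collected; the function name `nonzero_proportions` and the tolerance argument are hypothetical, and the subsampling procedure itself (Supplement D) is not reproduced.

```python
def nonzero_proportions(coefficient_draws, tol=0.0):
    """Fraction of subsampled fits in which each coefficient is nonzero.

    coefficient_draws: list of equal-length lists, one list of estimated
    coefficients per subsampled dataset.
    """
    n_draws = len(coefficient_draws)
    n_coef = len(coefficient_draws[0])
    return [sum(1 for draw in coefficient_draws if abs(draw[j]) > tol) / n_draws
            for j in range(n_coef)]
```

A covariate that is selected in most subsamples receives a proportion near 1, while covariates of a background cluster would be expected to show uniformly low proportions.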

4.0.2. Application to three-dimensional data.

In this section we apply our model to the full three-dimensional data. First, in Figure 8 we display one dimension (cell diameter) of the estimated 10-cluster three-dimensional model, as a direct comparison to the one-dimensional cell diameter analysis in Section 4.0.1. Cluster 10 is recognized by domain experts to correspond to Prochlorococcus. The separation of the two heavily overlapping clusters 9 and 10 and their independent means’ movement is visually not apparent in the cell diameter data alone; indeed, the estimated one-dimensional model in Figure 7 only captures a single Prochlorococcus cluster 5.

Fig. 8.

A one-dimensional slice of the estimated model of the full three-dimensional data, showing only the cell diameter axis. This figure is directly comparable to Figure 7 using only one-dimensional cell diameter data. The colored solid lines track the 10 estimated cluster means over time, and the line thickness shows the cluster probabilities over time. (The shaded 95% probability regions were omitted for clarity of presentation.) This model on three-dimensional data suggests finer movement of a larger number of cell populations that is not detectable using only the one-dimensional data. In particular, a clean separation of the heavily overlapping clusters 9 and 10 was not possible in the one-dimensional model but is clear in the three-dimensional model (see also Figure 9, where this separation is made apparent by the additional red axis).

The full three-dimensional data and estimated model are challenging to display in print. A better medium than flat images is a video of $t=1,\ldots,T$ images over time, which we show in https://youtu.be/jSxgVvT2wr4. Figure 14 of the Supplementary Material shows one frame from this video (corresponding to one $t$), which overlays several plots: three two-dimensional projections of the cytogram, two different angles of the three-dimensional cytograms, the cruise location on a map, the covariates over time and the cluster probabilities at each time and as a time series. The first four panels of this snapshot are shown in Figure 9 in higher resolution. The mean fluctuations and cluster probability dynamics over time are clearly captured in the full video and are explained next, in the context of covariates.

Fig. 9.

The estimated three-dimensional 10-cluster model, described in Section 4.0.2, at one time point. The size of the blue points represents the biomass in each of the $40^3$ bins. The panels show various views of the cytograms: three 2D scatterplots and our estimated parameters (means, probabilities and covariances). The red dots mark the cluster centers at this time point, and the size (radius) of these red dots is proportional to the cluster probabilities. The red ellipses in dashed lines show the estimated 95% probability region of the data formed from the estimated Gaussian covariance of each cluster. The 10 estimated model clusters’ mean fluctuations and cluster probability dynamics over time can be seen in the full video at https://youtu.be/jSxgVvT2wr4; a single frame of this video is shown in Figure 9.

The estimated mean movement and the $\beta$ coefficients shown in Tables 4–6 in the Supplementary Material reveal interesting scientific insights. The cell diameter of Prochlorococcus seems to be well predicted by sunlight and lagged variants of sunlight. To elaborate, the estimated entries of $\beta_5$, corresponding to the covariates p1, p2 and p3 and the cell diameter axis, were estimated as 0.008, 0.010 and 0.013, meaning that the mean cell diameters of Prochlorococcus are predicted to increase by these amounts with a unit increase in each covariate value. This supports biochemical intuition about the cell size being directly driven by sunlight. Indeed, important physiological processes of phytoplankton cells, including growth, division and fluorescence (particularly of the pigment chlorophyll-A), are known to undergo diel variability, that is, to be timed with the day-night or light cycle.

Estimated cluster probabilities and the coefficients $\alpha$, shown in Table 3, are also quite interpretable. A higher positive estimated entry of $\alpha_k$ means that a unit increase of that covariate corresponds to a larger increase of the relative probability of the $k$th cluster. The probability of cluster 8 (which occupies a region in the orange fluorescence axis that clearly corresponds to the Synechococcus population) is associated with primary productivity (coefficient value of 0.19), oxygen (0.46) and nitrate (−0.35). Rapid increases in the abundance and biomass of Synechococcus, associated with high productivity, have previously been observed over narrow regions of the Pacific at the boundary between the Subtropical and Subpolar Gyres (Gradoville et al. (2020)). High productivity in the ocean is often linked to high oxygen saturation, a result of oxygen production during photosynthesis, and low nitrate, as a result of consumption of this nutrient required for Synechococcus’s cell growth (Moore et al. (2002)). Linkages to such biochemical factors unique to this specific Synechococcus cluster are otherwise difficult to identify but are clearly identified in our model. In contrast, for cluster 10 (Prochlorococcus) the largest $\alpha$ coefficients correspond to sea surface temperature (0.87) and phosphate (−0.94). These results reflect this organism’s observed distribution in the Pacific Ocean, namely, its Subtropical Gyre, where high surface temperatures and low concentrations of phosphate tend to favor small-celled Prochlorococcus, leading to higher cluster probabilities. Interestingly, nitrate was not detected by the model as a relevant covariate, which is in good agreement with the physiology of Prochlorococcus, which often lacks the genes necessary for nitrate assimilation (Berube et al. (2015)).
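To make the interpretation of the $\alpha$ coefficients concrete, the sketch below computes cluster probabilities from covariates through a multinomial logit (softmax) link. The softmax form is assumed here to match the model definition given earlier in the paper, which is not reproduced in this section; the function name `cluster_probabilities` and the two-cluster toy inputs are hypothetical.

```python
import math

def cluster_probabilities(x, intercepts, alphas):
    """Softmax cluster probabilities for one time point.

    x          : covariate values at this time point
    intercepts : one intercept per cluster
    alphas     : one coefficient list per cluster, same length as x
    """
    scores = [a0 + sum(a * v for a, v in zip(alpha, x))
              for a0, alpha in zip(intercepts, alphas)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example: one covariate, coefficient 0.87 for the second cluster only.
p_before = cluster_probabilities([0.0], [0.0, 0.0], [[0.0], [0.87]])
p_after = cluster_probabilities([1.0], [0.0, 0.0], [[0.0], [0.87]])
```

In this toy setting, a unit increase in the covariate multiplies the odds of the second cluster relative to the first by $\exp(0.87)$, which is the sense in which a larger positive coefficient corresponds to a larger increase in relative probability.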

On the other hand, the large positive $\alpha$ coefficient for cluster 2 (Picoeukaryotes), associated with phosphate (0.35), reflects its more northerly distribution in the North Pacific Subpolar Gyre, a region of the ocean distinguished by higher surface concentration of nutrients, including phosphate, which allow for greater growth of these relatively larger phytoplankton.

Finally, cluster 3 is particularly interesting, as it captures the calibration beads injected by the instrument as an internal standard. The location of this cluster is much more apparent in the full three-dimensional representation in Figure 9. This is the only population whose origin and location are known a priori and thus serves as a negative control, which the model is expected to capture. Indeed, in our estimated 10-cluster three-dimensional model, this bead is clearly captured as a separate population whose mean movement is minimal over time. Interestingly, three-dimensional models with fewer than 10 clusters fail to capture the calibration bead as a separate population.

4.1. Comparison to gating.

In Figure 10 we compare the relative biomass of Prochlorococcus, measured in two ways. The dark grey line shows the relative biomass of Prochlorococcus, gated in Ribalet et al. (2019) using the flowDensity Bioconductor package (Malek et al. (2015)), applied semiautomatically to individual three-dimensional cytograms recorded roughly every three minutes, then aggregated to an hourly level. The purple line shows the estimated cluster probability of cluster 10 from our three-dimensional model in Section 4.0.2. There is a noticeable discrepancy between the two methods on June 8th and 9th. The dark grey line abruptly rises from near 0 to about 0.5, while the purple line follows a gradual increase from June 8th onward. The reasons for this discrepancy are apparent from visual examination of the gated cytograms. First, the gating results have no continuity or smoothness over time, having been applied to individual cytograms. More importantly, while our model consistently tracks the Prochlorococcus cluster as a single ellipsoidal cluster 10, the semiautomatic gating function erroneously includes external particles, many from our model’s cluster 9, which domain experts would not consider to be Prochlorococcus.

Fig. 10.

This figure shows the relative biomass of Prochlorococcus, measured in two ways: using traditional gating (black line) and using the estimated cluster probability of cluster 10 (purple) in the three-dimensional data in Section 4.0.2 and Figure 9. One noticeable discrepancy is on June 8th and 9th. The gating (black line) abruptly jumps from 0 to 0.5, due to flaws in automatic gating, while our model (purple) suggests a gradual increase on June 8th and onward. Visual inspection and expert annotation of this cluster in the cytogram suggest that our model cluster 10 is correctly tracking Prochlorococcus.

5. Conclusion.

In this work we propose a novel sparse mixture of multivariate regressions model for modeling flow cytometry data. We devise a penalized expectation-maximization algorithm with parameter constraints and implement a specific ADMM solver which is called in the M-step. Our simulations and application results in Sections 3 and 4 demonstrate that our proposed model can reveal interpretable insights from flow cytometry data and help scientists identify how environmental conditions influence the dynamics of phytoplankton populations.

Our method provides scientists with a rich description of the association between environmental factors and phytoplankton cell populations. It leverages covariates and all cytograms to identify cell populations. This means two cell populations that might be indistinguishable in a single cytogram could be differentiated if their dynamics (i.e., dependence on covariates) are distinct from each other. Thus, even when one is not interested in the covariates themselves but only the estimation of cell populations (as in gating), this method still may be the best choice. In applying the method, we recovered some known associations, such as Prochlorococcus and light (positive controls), did not identify some known nonassociations (negative controls), and also produced some new associations that can be studied. Also, in investigating a discrepancy between our method and a preexisting gating approach, we uncovered some undesirable behavior of the preexisting approach and showcased our method’s ability to perform the difficult task of automatic and consistent gating of overlapping clusters in cytograms over time.

While the motivation for this methodology comes from oceanography, the flow cytometry technology is important to many other areas, including biomarker detection (Gedye et al. (2014)), diagnosis of human diseases such as tumors (Brown and Wittwer (2000)) and ecological studies (Props et al. (2016)). For instance, in a biomedical application, covariates can be patient attributes, and the response can be cytograms obtained from patient blood samples. In fact, the statistical methodology developed here can be applied to any context in which modeling cytograms in terms of features is reasonable; the time ordering of the data is not required for application. We, therefore, expect it to be valuable in a wide range of fields.

Our model diagnostics in Supplement E indicate some leftover time dependence in the data residuals from our model. To remedy this within the framework of our model, one might add time-lagged versions of the covariates or even summaries from cytograms $y^{(t)}$ to directly incorporate time-space autocorrelation in our model. Alternatively, one could also extend the $d$-by-$d$ cluster covariance $\Sigma_k \in \mathbb{R}^{d\times d}$ to be a time-varying matrix $\Sigma_{kt}$. This covariance matrix can take time structure that is not driven by covariates $X^{(t)}$ but has dependence (e.g., time autocorrelation) or smoothness that is learned directly from the data. However, a time-series extension also complicates our existing cross-validation strategy for tuning $\lambda_\alpha$ and $\lambda_\beta$ and constitutes a significant departure from our current proposed model. We view a time-series extension of our model to be an excellent methodology direction to pursue next.

The methodology has several exciting directions for future work. Our mixture model methodology would greatly benefit from a principled, automatic choice of the number of clusters $K$ based on the data. It would also be interesting to see how relaxing the Gaussian cluster assumption to different distributions, for example, skewed, multivariate t distributions, helps improve the flexibility of our approach. A model with feature-dependent covariances $\{\Sigma_k\}_{k=1,\ldots,K}$ could enable more flexible prediction as well. Also promising are the extension and comparison to more nonparametric approaches to the conditional distribution of cytograms or to the entire joint model of cytograms and environmental covariates.

On the application side it would be interesting to compare estimated models on data from other oceanographic cruises traversing the same trajectory or different areas and to see to what extent the estimated relationship between cytograms and environmental covariates can be replicated.

Acknowledgments.

The authors would like to express gratitude to the referees and editors for their constructive feedback and suggestions. The authors acknowledge the Center for Advanced Research Computing (CARC) at the University of Southern California for providing computing resources that have contributed to the research results reported within this publication. https://carc.usc.edu.

Funding.

This work was supported by grants from the Simons Collaboration on Computational Biogeochemical Modeling of Marine Ecosystems/CBIOMES (Grant ID: 549939 to JB, Microbial Oceanography Project Award ID 574495 to FR). Dr. Jacob Bien was also supported, in part, by NIH Grant R01GM123993 and NSF CAREER Award DMS-1653017. We thank Dr. E. Virginia Armbrust for supporting SeaFlow deployment on the cruise in the North Pacific funded by the Simons Foundation grant (SCOPE Award ID 426570SP to EVA). We also thank Chris Berthiaume and Dr. Annette Hynes for their help in processing and curating SeaFlow data.

Footnotes

SUPPLEMENTARY MATERIAL

Supplement to “Modeling cell populations measured by flow cytometry with covariates using sparse mixture of regressions” (DOI: 10.1214/22-AOAS1631SUPP; .pdf). Supplement A: Proof of Proposition 1. This section contains the full proof of a theoretical result stating that using a relatively small number of bins can achieve similar predictive performance as using the original data. Supplement B: ADMM details. This section describes the full details of the ADMM optimization algorithm. Supplement C: Additional data analysis results. This section contains additional figures and tables for the data analysis from Section 4. Supplement D: Variable Selection Stability. This section describes how to quantify uncertainty of model estimates, using the subsampling bootstrap for β coefficients in a 1-dimensional model. Supplement E: Probabilistic gating and residual analysis. This section describes a probabilistic classification procedure that is used to analyze residuals and leftover time information. Supplement F: Effect of binning on estimation. This section describes a simulation study used to quantify how much model estimation accuracy suffers as a result of binning the data. Supplement G: Model performance with non-Gaussian data. This section describes an investigation of the model performance in two scenarios in which data deviate from mixtures of Gaussians.

REFERENCES

  1. Aghaeepour N, Finak G, Consortium F, Consortium DREAM, Hoos H, Mosmann TR, Brinkman R, Gottardo R and Scheuermann RH (2013). Critical assessment of automated flow cytometry data analysis techniques. Nat. Methods 10 228–238. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Ashkezari MD, Hagen NR, Denholtz M, Neang A, Burns TC, Morales RL, Lee CP, Hill CN and Armbrust EV (2021). Simons collaborative marine atlas project (Simons CMAP): An open-source portal to share, visualize and analyze ocean data. BioRxiv [Google Scholar]
  3. Berube PM, Biller SJ, Kent AG, Berta-Thompson JW, Roggensack SE, Roache-Johnson KH, Ackerman M, Moore LR, Meisel JD et al. (2015). Physiology and evolution of nitrate acquisition in prochlorococcus. ISME J 9 1195–1207. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Boyd S, Parikh N, Chu E, Peleato B and Eckstein J (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn 3 1–122. [Google Scholar]
  5. Boyer TP, Antonov JI, Baranova OK, Garcia HE, Johnson DR, Mishonov AV, O’Brien TD, Seidov D, Smolyar II et al. (2013). World ocean database 2013
  6. Brown M and Wittwer C (2000). Flow cytometry: Principles and clinical applications in hematology. Clin. Chem 46 1221–1229. [PubMed] [Google Scholar]
  7. Dubelaar GBJ, Gerritzen PL, Beeker AER, Jonker RR and Tangen K (1999). Design and first results of CytoBuoy: A wireless flow cytometer for in situ analysis of marine and fresh waters. Cytometry 37 247–254. [PubMed] [Google Scholar]
  8. Field CB, Behrenfeld MJ, Randerson JT and Falkowski P (1998). Primary production of the biosphere: Integrating terrestrial and oceanic components. Science 281 237–240. [DOI] [PubMed] [Google Scholar]
  9. Friedman J, Hastie T and Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw 33 1–22. [PMC free article] [PubMed] [Google Scholar]
  10. Gedye CA, Hussain A, Paterson J, Smrke A, Saini H, Sirskyj D, Pereira K, Lobo N, Stewart J et al. (2014). Cell surface profiling using high-throughput flow cytometry: A platform for biomarker discovery and analysis of cellular heterogeneity. PLoS ONE 9 e105602. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Gradoville MR, Farnelid H, White AE, Turk-Kubo KA, Stewart B, Ribalet F, Ferrón S, Pinedo-Gonzalez P, Armbrust EV et al. (2020). Latitudinal constraints on the abundance and activity of the cyanobacterium UCYN-A and other marine diazotrophs in the North Pacific. Limnol. Oceanogr 65 1858–1875. [Google Scholar]
  12. Grant M and Boyd S (2014). CVX: Matlab software for disciplined convex programming, version 2.1 http://cvxr.com/cvx. [Google Scholar]
  13. Grün B and Leisch F (2008). FlexMix version 2: Finite mixtures with concomitant variables and varying and constant parameters. J. Stat. Softw 28 1–35.27774042 [Google Scholar]
  14. Hahne F, LeMeur N, Brinkman RR, Ellis B, Haaland P, Sarkar D, Spidlen J, Strain E and Gentleman R (2009). FlowCore: A bioconductor package for high throughput flow cytometry. BMC Bioinform 10 106. 10.1186/1471-2105-10-106 [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Hyrkas J, Clayton S, Ribalet F, Halperin D, Virginia Armbrust E and Howe B (2015). Scalable clustering algorithms for continuous environmental flow cytometry. Bioinformatics 32 417–423. [DOI] [PubMed] [Google Scholar]
  16. Hyun S, Mishra A, Müller C and Bien J (2020). R package for CMAP Access [Online; accessed 19-Dec-2019].
  17. Hyun S, Rolf Cape M, Ribalet F and Bien J (2023). Supplement to “Modeling Cell Populations Measured By Flow Cytometry With Covariates Using Sparse Mixture of Regressions” 10.1214/22-AOAS1631SUPP [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Jordan MI and Jacobs RA (1993). Hierarchical mixtures of experts and the EM algorithm. In Proceedings of 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan) [Google Scholar]
  19. Khalili A and Chen J (2007). Variable selection in finite mixture of regression models. J. Amer. Statist. Assoc 102 1025–1038. MR2411662 10.1198/016214507000000590 [DOI] [Google Scholar]
  20. Malek M, Taghiyar MJ, Chong L, Finak G, Gottardo R and Brinkman RR (2015). FlowDensity: Reproducing manual gating of flow cytometry data by automated density-based cell population identification. Bioinformatics 31 606–607. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Marañón E (2015). Cell size as a key determinant of phytoplankton metabolism and community structure. Annu. Rev. Mar. Sci 7 241–264. [DOI] [PubMed] [Google Scholar]
  22. McLachlan GJ and Peel D (2006). Finite Mixture Models Wiley, New York. [Google Scholar]
  23. Moore LR, Post AF, Rocap G and Chisholm SW (2002). Utilization of different nitrogen sources by the marine cyanobacteria Prochlorococcus and Synechococcus. Limnol. Oceanogr 47 989–996. [Google Scholar]
  24. Olson RJ, Shalapyonok A and Sosik HM (2003). An automated submersible flow cytometer for analyzing pico- and nanophytoplankton: FlowCytobot. Deep-Sea Res., Part 1, Oceanogr. Res. Pap 50 301–315. [Google Scholar]
  25. Pan W and Shen X (2007). Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res 8 1145–1164. [Google Scholar]
  26. Props R, Monsieurs P, Mysara M, Clement L and Boon N (2016). Measuring the biodiversity of microbial communities by flow cytometry. Methods Ecol. Evol. 7 1376–1385.
  27. Ribalet F, Swalwell J, Clayton S, Jiménez V, Sudek S, Lin Y, Johnson ZI, Worden AZ and Armbrust EV (2015). Light-driven synchrony of Prochlorococcus growth and mortality in the subtropical Pacific gyre. Proc. Natl. Acad. Sci. USA 112 8008–8012.
  28. Ribalet F, Berthiaume C, Hynes A, Swalwell J, Carlson M, Clayton S, Hennon G, Poirier C, Shimabukuro E et al. (2019). SeaFlow data v1, high-resolution abundance, size and biomass of small phytoplankton in the North Pacific. Sci. Data 6 277. 10.1038/s41597-019-0292-2
  29. Sosik HM, Olson RJ and Armbrust EV (2010). Flow cytometry in phytoplankton research. In Chlorophyll a Fluorescence in Aquatic Sciences: Methods and Applications 171–185. Springer, Berlin.
  30. Sosik HM, Olson RJ, Neubert MG, Shalapyonok A and Solow AR (2003). Growth rates of coastal phytoplankton from time-series measurements with a submersible flow cytometer. Limnol. Oceanogr. 48 1756–1765.
  31. Städler N, Bühlmann P and van de Geer S (2010). ℓ1-penalization for mixture regression models. TEST 19 209–256. MR2677722 10.1007/s11749-010-0197-z
  32. Swalwell JE, Ribalet F and Armbrust EV (2011). SeaFlow: A novel underway flow-cytometer for continuous observations of phytoplankton in the ocean. Limnol. Oceanogr., Methods 9 466–477.
  33. Tibshirani R (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
  34. Vaulot D and Marie D (1999). Diel variability of photosynthetic picoplankton in the equatorial Pacific. J. Geophys. Res., Oceans 104 3297–3310.
  35. Verschoor CP, Lelic A, Bramson JL and Bowdish DME (2015). An introduction to automated flow cytometry gating tools and their implementation. Front. Immunol. 6 380.
  36. Wang P, Puterman ML, Cockburn I and Le N (1996). Mixed Poisson regression models with covariate dependent rates. Biometrics 52 381–400.
  37. Xu L and Jordan MI (1996). On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput. 8 129–151.
