Briefings in Bioinformatics. 2025 Apr 16;26(2):bbaf167. doi: 10.1093/bib/bbaf167

MINE: a new way to design genetics experiments for discovery

Isaac Torres 1, Shufan Zhang 2, Amanda Bouffier 3, Michael Skaro 4, Yue Wu 5, Lauren Stupp 6, Jonathan Arnold 7, Y Anny Chung 8, H-Bernd Schuttler 9
PMCID: PMC12001805  PMID: 40237762

Abstract

The Maximally Informative Next Experiment or MINE is a new experimental design approach for experiments, such as those in omics, in which the number of effects or parameters p greatly exceeds the number of samples n (p > n). Classical experimental design presumes n > p for inference about parameters and its application to p > n can lead to over-fitting. To overcome p > n, MINE is an ensemble method, which makes predictions about future experiments from an existing ensemble of models consistent with available data in order to select the most informative next experiment. Its advantages are in exploration of the data for new relationships with n < p and being able to integrate smaller and more tractable experiments to replace adaptively one large classic experiment as discoveries are made. Thus, using MINE is model-guided and adaptive over time in a large omics study. Here, MINE is illustrated in two distinct multiyear experiments, one involving genetic networks in Neurospora crassa and a second one involving a genome-wide association study in Sorghum bicolor as a comparison to classic experimental design in an agricultural setting.

Keywords: ensemble methods, MINE, mixed linear models, genetic networks, GWAS

Introduction

The classic approach to experimental design was developed by R. A. Fisher for linear models in 1935 and had a profound effect on all of science [1, 2]. Growing out of his work at the Rothamsted Experiment Station, he introduced widely the notion of precision of an experiment, randomization, ways of controlling heterogeneity through blocking and by the use of covariates, and the vast subject of experimental design in the context of linear models [3].

The focus of all of these efforts was not on discovery per se. Rather, the end goal was the precision of estimates and the power to test effects in a controlled experiment with the proper randomization and blocking practices in place. The number of replicates was such that the number of observations (n) was typically much greater than the number of effects (p) being estimated in the model. Unfortunately, this is no longer the typical situation of an omics experiment [4], such as a genome-wide association study (or GWAS) as an example. Instead, there may be only n = 1943 samples of sorghum accessions but over p = 400 000 potential effects of single nucleotide polymorphisms (or SNPs) on the complex trait of interest for an agronomic crop [5]. The methods of classic experimental design are limited to the situation of n > p for inference about model parameters and are not designed for data exploration for hypothesis generation.

While the original goal was the precision of estimates [2], the new goal is discovery: the aim of such studies has shifted to finding relationships in the data. Precision of effects can only be addressed in follow-up studies, once the relevant variables in the experiment have been identified and related. We desire to discover the appropriate nonlinear kinetic models that underlie the biological clock at the molecular level [6] as we carry out a very expensive sequence of transcriptomic experiments. We wish to discover the relation of plant functional traits to SNPs in the nuclear genome or the assemblage of fungal symbionts in the microbiome most beneficial for plant growth [7–9] using models drawn from systems and population ecology [10]. How can the classic linear models [11] and newer models of systems biology [6] guide a discovery process, in which some GWAS studies have millions of SNPs, such as those on human height [12]?

A new approach to the design of large genomics experiments is introduced, one in which model-guided discovery is used adaptively [13] over time to find the variables that matter, considering a system in which the number of potential effects in the system (p) far exceeds the number of observations (n) on the system. The methodological approach utilizes ensemble methods [14] drawn from statistical physics [15] and, ultimately, Boltzmann’s 19th-century work [16]. The particular ensemble method explored here for model-guided discovery is called MINE, which stands for Maximally Informative Next Experiment [17].

Ensemble methods

Ensemble methods were developed in the 19th century by Boltzmann to describe the motion of an ideal gas [16]. In this situation, there is an Avogadro number (A) of particles in a 1 L box, but only three measurements are made: temperature, pressure, and volume. How can the motion of A particles, with 6A degrees of freedom in total, be described with only three measurements?

With so little data, the data did not strongly support just one model. Boltzmann’s solution was to give up on identifying one best model and instead to make predictions from an ensemble of models [15]. Omics experiments face exactly the same problem [14], but the paucity of data with respect to the complexity of the model is not as severe as in the problem Boltzmann faced. The interest may be in identifying the dynamics of genes and their products in carbon metabolism [14] or the biological clock [18], but genetics dictates that only a limited number of samples at different time points can be made to identify the system, while there are many parameters required to describe the system. For example, the number of measurements at different time points on the biological clock may be on the order of 60 000 measurements (n), but there are over 90 000 rate constants and initial conditions (p) in the model that must be estimated [19, 20]. Much as for an ideal gas, averaging over the ensemble allows for detailed predictions about complex biological systems, such as the clock.

A simple example is used to illustrate the approach of ensemble methods. The first step is writing down the model specification for the measurements. There are n measurements Inline graphic) drawn from an unknown distribution parameterized by p parameters in Inline graphic and some of these parameters are ancillary parameters, such as Inline graphic below, in which there is less interest. In addition, the variables Inline graphic) describe the experimental conditions. For example, the list U might specify the SNPs used in a GWAS field trial. Then the model specification would take the form:

graphic file with name DmEquation1.gif

where C is a normalization constant chosen to make the integral over the data Y equal to 1. The quantity Inline graphic is known as the Hamiltonian. Ideally, the distribution would be observed directly, but, in practice, what is available are sample moments of the data Y.

Since the goal is to identify a model Inline graphic supported by the data Y, a change of viewpoint is needed. As in the method of maximum likelihood [21], the model specification Inline graphic is viewed as a function of the model parameters Inline graphic, and the data Y and experimental conditions U are taken as fixed:

graphic file with name DmEquation2.gif (1)

where Inline graphic is a normalization constant chosen to make the integral over all parameters Inline graphic in the parameter space equal to 1. This normalization constant is only a function of the data Y. The magnitude of the ensemble Inline graphic, or Inline graphic for short, is larger when the model Inline graphic is better supported by the data Inline graphic. It may be useful to think of the ensemble Inline graphic as a posterior distribution corresponding to the model specification Inline graphic, with the two functions, Inline graphic and Inline graphic, connected by Bayes’ theorem [22].

The ensemble Inline graphic, or Inline graphic for short, is the collection of models Inline graphic consistent with the available data Y. Model-averaging with respect to the model ensemble Inline graphic allows predictions about the system’s behavior. Instead of identifying one model Inline graphic, a distribution of models Inline graphic is identified. With the number of parameters p being vastly greater than the number of data points n, predictions can still be made and tested with respect to averages computed from the ensemble Inline graphic.

Monte Carlo methods are used to identify the ensemble Inline graphic [15, 23] because the model specifications are complicated [18, 24]. A simple example will illustrate how this is done. Take the Hamiltonian viewed as a function of Inline graphic as having the following simple form:

graphic file with name DmEquation46.gif

The model parameter Inline graphic is the one we are truly interested in, and the remaining parameters Inline graphic and Inline graphic are ancillary. A graph of the ensemble Inline graphic is shown in Fig. 1d. There are two maxima in the ensemble or equivalently, two minima in the Hamiltonian. The goal is to reconstruct the ensemble Inline graphic by Monte Carlo for prediction.

Figure 1.

Convergence of an ensemble to the target distribution under a Monte Carlo experiment: (a) ensemble after 20 moves; (b) ensemble after 100 moves; (c) ensemble after 1000 moves; (d) true ensemble or “target distribution.” The starting guess at the parameter Inline graphic was 3. After each of 20, 100, and 1000 moves, 10 000 samples of Inline graphic from the resulting distribution were drawn to characterize the ensemble. The ancillary parameters were Inline graphic and Inline graphic. The plots were created in MATLAB R2018b (https://www.mathworks.com/products/matlab.html).

In this example, we are in the perfect world in which the ensemble, or equivalently the Hamiltonian, is observed from 10 000 values after each move in the Monte Carlo experiment. To reconstruct the ensemble Inline graphic by Monte Carlo, at each move a new model parameter Inline graphic is drawn from the ensemble Inline graphic when the current proposal is the model parameter Inline graphic. The goal is to move into a region of the parameter space that is well supported by the ensemble Inline graphic in the equilibration phase. Once equilibrated, many thousands or tens of thousands of well-supported models are accumulated to reconstruct the ensemble from the sample histogram of these Inline graphic-values [18]. The question remains how to choose the well-supported Inline graphic-values.

One greedy approach to moving in the parameter space is to draw a model parameter Inline graphic and proceed uphill using some procedure like steepest ascent to climb the hill(s) in the ensemble. As shown in Fig. 1, this might lead to a local maximum. In fact, in Fig. 1, there are two such maxima. To avoid becoming trapped at a local maximum, a model parameter Inline graphic is drawn randomly from the ensemble Inline graphic. The move is accepted greedily when there is an improvement in the ensemble probability, i.e. Inline graphic or equivalently Inline graphic, but occasionally, when QInline graphic, a downhill move is accepted anyway. The occasional downhill move may allow escape from a local maximum. In practice in systems biology, it may be more appropriate to think of the ensemble surface as gently rolling hills, as on a golf course, because the data are limited (Inline graphic).

Metropolis and colleagues [25] developed a stochastic search procedure in statistical physics for this and now many other optimization problems [23]. The probability of a move is:

graphic file with name DmEquation3.gif

A move from Inline graphic occurs with probability 1 if the proposed move takes us uphill, but if the proposed move takes us downhill, then the probability of the move decreases with the size of the drop from Inline graphic. The sequence of moves is made 10 000 or more times to move into a region of the parameter space well supported by the data during the equilibration phase.

In the equilibration phase, the inferred ensemble converges to the true ensemble known as the target distribution (Fig. 1). The Monte Carlo search for this simple model is successful in the reconstruction in <1000 moves and is surprisingly quick, equilibrating in <20 moves. A video displays the reconstruction process (Video S1). Once equilibration is achieved, another sequence of Monte Carlo moves called the accumulation phase is used to build the target distribution. In this simple example, only 1000 moves are needed to carry the ensemble identification into the accumulation phase.

In practice, a sweep is defined as the number of moves needed to visit each model parameter once on average. A standard equilibration run and an accumulation run is 40 000 sweeps, which will vary in practice with the complexity of the model [18, 19].
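The equilibration and accumulation phases described above can be sketched in a few lines. This is a minimal illustration only: the double-well Hamiltonian, the symmetric random-walk proposal, the step size, and all function names are assumptions made for the sketch and are not the Hamiltonian or settings used for Fig. 1. The starting value of 3 follows the figure caption.

```python
import math
import random

def hamiltonian(theta):
    # Hypothetical double-well Hamiltonian with minima at theta = -1 and +1
    # (an assumption for illustration, not the Hamiltonian behind Fig. 1).
    return 2.0 * (theta ** 2 - 1.0) ** 2

def metropolis(n_equil, n_accum, theta0=3.0, step=1.5, seed=0):
    """Random-walk Metropolis sampling of Q(theta) proportional to exp(-H(theta))."""
    rng = random.Random(seed)
    theta, h = theta0, hamiltonian(theta0)
    samples = []
    for move in range(n_equil + n_accum):
        proposal = theta + rng.uniform(-step, step)
        h_new = hamiltonian(proposal)
        # Uphill in Q (lower H): always accept.
        # Downhill in Q (higher H): accept with probability exp(-(H_new - H_old)).
        if h_new <= h or rng.random() < math.exp(h - h_new):
            theta, h = proposal, h_new
        if move >= n_equil:
            samples.append(theta)  # accumulation phase: record the current model
    return samples

samples = metropolis(n_equil=1000, n_accum=20000)
left = sum(1 for s in samples if s < 0) / len(samples)
print(f"fraction of accumulated samples in the left well: {left:.2f}")
```

Because downhill moves are occasionally accepted, the chain crosses the barrier between the two wells, and a histogram of `samples` approximates both maxima of the ensemble rather than only the one nearest the starting guess.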

MINE

Once an ensemble method produces a collection of models supported by the data, it is possible to make predictions from the ensemble distribution about the next experiment. By averaging some variable of interest over the models in the ensemble distribution Inline graphic, a prediction can be made, given the current data Y and the experimental conditions U. For example, Y might be plant biomasses measured in Year 1 of a 5-year GWAS experiment to identify SNPs to predict biomass in sorghum with a certain collection of SNPs from the Bioenergy Accession Panel (BAP) [26]. The question is which accessions are to be used in Year 2 to specify design U. In Year 1, 79 accessions are measured in the GWAS and can be used to inform the SNP choice in Year 2.

One way to make this choice is to select experimental conditions permitting us to distinguish the models in the ensemble Inline graphic identified from Year 1. Two models randomly chosen from the model ensemble are best distinguished experimentally when the predictions Inline graphic of each model (Inline graphic) are orthogonal, as shown in Fig. 2. For experiment 1 on the left, the predictions of the two selected models Inline graphic and Inline graphic are correlated and are harder to distinguish under experimental conditions U1. The same two models under experimental conditions U2 are easier to distinguish; model Inline graphic is easily tested against model Inline graphic. The goal of a MINE criterion is then to make “the angle” between the two predictions of a random pair in the ensemble as large as possible on average in Year 2, as a function of the experimental conditions U and the current ensemble Inline graphic identified from the data in Year 1.

Figure 2.

Two models can be better distinguished by their predictions in the next experiment if their predictions are less correlated. The predictions of model Inline graphic and Inline graphic under experimental condition U1 are the expectations Inline graphic and Inline graphic, respectively. If the two models Inline graphic and Inline graphic are chosen independently from the model ensemble Inline graphic, the expectations are calculated with respect to the product density Inline graphic, where U = U1 or U2.

There are two standard ways to measure the associations between the predictions [6]. One is by the covariances between the predictions of the data Y (MINE by Covariance Ellipsoid Volume); the other is by the correlations between the predictions of the data Y (MINE by Correlation Ellipsoid Volume). There are a variety of reasons for advocating the use of MINE by Correlation Ellipsoid Volume [6]. One of the main reasons is that when there are a large number (p ≫ n) of almost linearly dependent observations as found in practice, it would be highly desirable to emphasize the new directions in the data Y as done by Correlation Ellipsoid Volume. The new directions in Y in the next year depend on the choice of design matrix X. Denote by E the correlation matrix between the components of Y. The MINE Correlation Ellipsoid Volume is then a determinant (det):

$$V(U) = \det\big(E(U)\big)$$

When the predictions are on average highly correlated (Fig. 2a), the determinant is nearly zero. When the predictions are nearly orthogonal (going in new directions) (Fig. 2b), the determinant is nearly 1.
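The behavior of the determinant at these two extremes can be checked on a toy ensemble. The sketch below is illustrative and rests on assumptions: each ensemble member is reduced to a row of predicted components of Y, E is computed as the sample correlation matrix of those predictions across ensemble members, and the function names and the two synthetic "designs" are hypothetical.

```python
import random

def correlation_matrix(predictions):
    """predictions[m][k] is the prediction of component k of Y by ensemble member m."""
    n_models, n_obs = len(predictions), len(predictions[0])
    means = [sum(row[k] for row in predictions) / n_models for k in range(n_obs)]
    centered = [[row[k] - means[k] for k in range(n_obs)] for row in predictions]
    cov = [[sum(c[i] * c[j] for c in centered) / n_models for j in range(n_obs)]
           for i in range(n_obs)]
    return [[cov[i][j] / (cov[i][i] * cov[j][j]) ** 0.5 for j in range(n_obs)]
            for i in range(n_obs)]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def mine_criterion(predictions):
    """Correlation ellipsoid volume: determinant of the prediction correlation matrix."""
    return determinant(correlation_matrix(predictions))

rng = random.Random(1)
# Toy ensemble of 2000 models, each predicting two components of Y.
shared = [rng.gauss(0, 1) for _ in range(2000)]
design_u1 = [[s, s + 0.1 * rng.gauss(0, 1)] for s in shared]      # highly correlated
design_u2 = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in shared]  # nearly uncorrelated
v1, v2 = mine_criterion(design_u1), mine_criterion(design_u2)
print(f"V(U1) = {v1:.3f}, V(U2) = {v2:.3f}")
```

Under this criterion the second design would be preferred, since its predictions point in nearly independent directions and the determinant is close to 1.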

A microscope analogy [11] provides insights into how MINE works (Fig. 3). MINE is highly analogous to a microscope and its optics. The object in the microscope field described by the data Y is the observed system. MINE, like the optics of the microscope, picks up each component of Y through the prediction Inline graphic about the system. For example, Inline graphic could be the list of predictions of plant biomasses in a GWAS study. The optics (Inline graphic) and, likewise, MINE then magnify the predictions to create the image or model of the system (Fig. 3).

Figure 3.

MINE is analogous in function to the optics on a microscope. The data Y are the objects in the field of view. The models Inline graphic are in the image. The MINE criterion with the predictions Inline graphic is the optics. The uncertainty volume in the image is the magnification measured by the MINE criterion, V(U) = det(E(U)). From [44].

The microscope has a field of view of the object, which we refer to as the Uncertainty Volume of the new experiment Y. The uncertainty in the observations on the field of view comes from our uncertainty about the optics controlled by Inline graphic and in the measurements Y on the object. The optics (predictions) then translate the Uncertainty Volume V(U) in the sample space into an image, the Uncertainty Volume in the parameter space. The result is that an Uncertainty Volume in the sample space (object) is mapped by the optics Inline graphic to the Uncertainty Volume in the parameter space (image).

The magnification applied to the object is adjusted to reduce the Uncertainty Volume in the parameter space (image). Another interpretation of the image quality is given by the determinant det(E(U)). The determinant is a measure of the volume of a parallelepiped defined by the Uncertainty Volume in the sample space [27]. The determinant is also a measure of the Uncertainty Volume in the parameter space (inside the ensemble). As the magnification knob is twiddled, the clarity of the image (model parameters) is increased and uncertainty is reduced (Fig. 3). If the parallelepiped is squashed in the parameter space, less detail from the observations Y in the sample space is being retained in imaging (i.e. model fitting). MINE is doing the focusing and representing the object in higher clarity in the image constructed by the observer using MINE.

A simple model for predicting hyphal extension colonization by arbuscular mycorrhizal fungi in plant roots to illustrate MINE

Mixture experiments are used in population genetics [28] and science and engineering in general [29]. Mixture experiments are examples of linear models that are the focus of experimental design [2]. In these designs, there is a mixture of treatments in different proportions affecting some dependent variables of interest. Mixture experiments can be used to study how different arbuscular mycorrhizal fungi (AMF) affect the health of the plant through colonization of the root system. The assembly of the AMF biome in plant roots is a product of choices imposed by the plant genotype [30, 31], competition between AMF, ecological drift [9], historical contingency [32], abiotic factors such as phosphorus (P) and nitrogen (N) in the soil [33, 34], and other factors. Consider three AMF species, S1, S2, and S3, competing for colonization area in the plant roots of sorghum [9] of ‘one’ plant genotype. These AMF are potential partners with the plant in one of the oldest symbioses on the planet [35]. Potentially the plant provides carbon, and, in return, potentially, the AMF hyphal network provides P and N, like an extended root system. The success of this partnership is measured in part by AMF hyphal extension in the roots and the resulting biomass of the plant host [36]. To study this symbiosis, the experimenter inoculates sorghum with a mixed population of at least 10% of the coenocyte cells being S1, at least 15% being S2, and at least 5% of the coenocyte cells being S3. The mixed coenocyte inoculum is a coculture in the plant root cells. Denoting the respective spore proportions by u1, u2, and u3, these proportions are thus constrained by lower limits,

$$u_1 \ge 0.10, \quad u_2 \ge 0.15, \quad u_3 \ge 0.05$$ (2)

and by the normalization condition

$$u_1 + u_2 + u_3 = 1$$ (3)

Given eq. (3), only two of the three species fraction values can be freely chosen. In the following, we will use proportions u1 and u2 as those two free variables, with u3 then being determined via eq. (3). Furthermore, the proportions u1 and u2 are then subject to upper and lower bounds, resulting from Eqs. (2) and (3). When referring, below, to experimenters freely choosing (u1,  u2,  u3), it should be understood that these choices must be within the constraints imposed by conditions (2) and (3).
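The feasible region implied by the lower limits and the normalization condition can be checked numerically. The sketch below assumes the mixture is expressed as fractions that sum to 1, with lower limits of 0.10, 0.15, and 0.05 taken from the 10%, 15%, and 5% minimums stated in the text; the function name and the 5% grid are illustrative choices only.

```python
LOWER = (0.10, 0.15, 0.05)  # lower limits on u1, u2, u3 (fractions of the inoculum)

def complete_mixture(u1, u2):
    """Return (u1, u2, u3) if the design satisfies all constraints, else None."""
    u3 = 1.0 - u1 - u2      # normalization fixes the third fraction
    eps = 1e-9              # tolerance for floating-point round-off
    mixture = (u1, u2, u3)
    if all(u >= low - eps for u, low in zip(mixture, LOWER)):
        return mixture
    return None

# Enumerate feasible designs on a 5% grid of the two free variables u1 and u2.
grid = [i / 20 for i in range(21)]
feasible = []
for u1 in grid:
    for u2 in grid:
        mixture = complete_mixture(u1, u2)
        if mixture is not None:
            feasible.append(mixture)

max_u1 = max(m[0] for m in feasible)
max_u2 = max(m[1] for m in feasible)
print(len(feasible), max_u1, max_u2)
```

The upper bounds follow directly from the constraints: u1 can be at most 1 - 0.15 - 0.05 = 0.80 and u2 at most 1 - 0.10 - 0.05 = 0.85, which is what the enumeration recovers.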

Assume that, by setting appropriate experimental conditions, the experimenter can construct an inoculum with a constant total spore population size, Nc, and constant species fractions, u1, u2, and u3. Assume also that, subject to the foregoing constraints (2) and (3), the experimenter can precisely set the values of u1, u2, u3, and Nc.

Each of the three AMF taxa can increase its rate of occupancy of the root space in the plant, denoted by Inline graphic, Inline graphic and Inline graphic, for species, S1, S2, and S3, respectively. The experimenter wishes to determine, or at least impose constraints on, the values of these rates in percent area increase, Inline graphic, Inline graphic and Inline graphic, by performing a sequence of time-series experiments wherein the linear filament extension in a root image, denoted by y(t), is measured as a function of time, t, at certain time points,

$$t_1, t_2, \ldots, t_K$$

Here, K is the total number of experimental observation time points. Each experiment thus produces a series of observed filament extension amounts, y(tk) for k = 1,2,…, K, denoted by

$$y_k = y(t_k)$$

That is, yk is the value of y(t) observed at time tk, with k = 1,2, …, K labeling the different observation time points. Each of these experiments is to be performed on a spore population begun with a different combination, (u1,  u2,  u3), of AMF inoculation fractions. For simplicity, assume, however, that the values of the rates of hyphal extension, Inline graphic, Inline graphic, and Inline graphic, remain the same throughout all these experiments, i.e. assume that the hyphal extension rates, Inline graphic, Inline graphic, and Inline graphic, do not change when the experimenter changes the population composition (u1,  u2,  u3) from one experiment to the next as in a race tube experiment [6, 37]. For simplicity, we will refer to Inline graphic, Inline graphic, and Inline graphic as the rates of colonization success.

The extraction of any information about the success rate in root colonization, Inline graphic, Inline graphic, and Inline graphic, from the experimental time series data, yk, requires a ‘mathematical model’ that treats the rates Inline graphic, Inline graphic, and Inline graphic, as well as the known ‘experimental control parameters’, u1,  u2, and u3, as input parameters. The model must then use these input parameters to provide a ‘predicted value’ for each experimental observation, yk, the hyphal extension colonized in a plant root. For a given experimental data point, yk, we denote the corresponding value predicted by the model by fk. Obviously, whatever the model predicts depends on the model input parameters, Inline graphic, Inline graphic, Inline graphic, u1,  u2, and u3, that were used to make the prediction. We will therefore often write fk as a ‘function’ of these input parameters, i.e. as

graphic file with name DmEquation9.gif

to make it explicit that fk is dependent on the assumed values of the rate parameters Inline graphic, Inline graphic, and Inline graphic, and on the given values of the control parameters, u1,  u2, and u3, set by the experimenter.

For the scenario assumed here, i.e. for a mixed population of spores from three AMF species jointly producing percent root colonization, X, at constant rates per spore cell type, a simple mathematical model for fk is easy to construct. Assume that the mixed cell population is established, and starts producing colonization, at time Inline graphic, with no initial colonization length X being present at that time. Then, the total percent colonization X produced by the entire AMF population in the roots, by observation time Inline graphic, is given by:

graphic file with name DmEquation46a.gif (4)

The percent colonization X can be measured in roots by bright field microscopy [38–41].

To understand this linear model, which is linear in the model parameters Inline graphic, recall here that Inline graphic is the total number of AMF in the inoculum, and hence Inline graphic is the number of S1-cells in the inoculum. Hence, Inline graphic is the combined rate at which all S1-cells contribute to the percentage root area colonized, X. Each spore produces a hyphopodium by which to colonize the root cortex. Recall now that

graphic file with name DmEquation14.gif (5)

Hence, the length colonized, by all AMF S1-cells combined, by time Inline graphic, is Inline graphic. Likewise, the length colonized of the roots produced by all AMF S2-cells and by all S3-cells, by time Inline graphic, are Inline graphic and Inline graphic, respectively. We then obtain fk, i.e. the predicted total amount of hyphal extension colonized X produced by all cells until time Inline graphic, by simply adding up the foregoing three X-contributions from all three AMF species. The result is eq. (4).
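The additive construction described above can be written as a short function. This is a sketch under stated assumptions: the per-spore rates are given the hypothetical names `rates` (one value per species), colonization is assumed to start at time 0 with no initial colonization, and the numerical values are invented for illustration.

```python
def predicted_colonization(rates, fractions, n_c, times):
    """Predicted colonization f_k at each observation time t_k.

    Each species contributes (number of its cells) x (rate per cell) x (elapsed time),
    and the three contributions are summed, so the prediction is linear in the rates.
    """
    pooled_rate = n_c * sum(r * u for r, u in zip(rates, fractions))
    return [pooled_rate * t for t in times]

rates = (0.002, 0.001, 0.004)   # hypothetical per-spore rates for S1, S2, S3
fractions = (0.5, 0.3, 0.2)     # inoculation fractions u1, u2, u3
f = predicted_colonization(rates, fractions, n_c=100, times=[1, 2, 3, 4])
print(f)
```

Doubling every rate doubles every prediction, which is the sense in which the model is linear in its parameters.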

Suppose we have performed multiple experiments, to be labeled by an “experiment index” Inline graphic, where L is the total number of experiments. In each experiment, a different AMF species composition (u1,  u2, u3) was used. To distinguish these u1,  u2, and u3, used in the different experiments, we therefore have to label ‘them’ with the additional index Inline graphic, as Inline graphic, for Inline graphic. Consequently, a different time series of X-data, y1, y2, …, yK, was observed in each experiment, and we therefore also have to label the observed data, y1, y2, …, yK with the additional index Inline graphic, as Inline graphic, for Inline graphic. Also assume that each data point, Inline graphic, has been measured with some experimental uncertainty, quantified by an experimental standard deviation Inline graphic. The Inline graphic-function (or by another name, the Hamiltonian) is then given by:

graphic file with name DmEquation15.gif (6)

To simplify and compactify the notation, we have introduced here the following abbreviations:

graphic file with name DmEquation18.gif (7)
graphic file with name DmEquation19.gif (8)
graphic file with name DmEquation21.gif (9)

That is, Inline graphic (without subscript) is shorthand for a vector that comprises the rates of colonization Inline graphic, and Inline graphic. The Inline graphic (without subscript) denotes the vector of the three AMF species inoculation fractions used in experiment number Inline graphic, and U is the vector comprising the species fractions from ‘all’ experiments combined. Note that Inline graphic does not have an Inline graphic-superscript here because Inline graphic, and Inline graphic are assumed to have the same values in all experiments.

Note that Inline graphic is the model prediction of hyphal extension, from eq. (4), for Inline graphic, i.e. for the kth time series data point for percent root area colonized observed in the Inline graphicth experiment. The square of the so-called ‘residual’, on the right-hand side of eq. (6),

graphic file with name DmEquation22.gif (10)

thus measures the deviation of the model prediction Inline graphic from the experimental observation of hyphal extension Inline graphic: the larger Inline graphic, the worse, i.e. greater, is the deviation of the model prediction, Inline graphic, from the observed data point, Inline graphic. By taking the sum of all squared residuals, the Inline graphic-function in eq. (6) thus provides a composite measure of the overall deviation of the model predictions from the data, for ‘all’ data points on hyphal length colonized combined. In the least-squares fitting approach, the “best possible” choice of model parameters is then obtained by finding a parameter combination, Inline graphic, which minimizes this deviation, i.e. by minimizing Inline graphic(Inline graphic, U) with respect to Inline graphic, and Inline graphic. In the following, let Inline graphic denote the best possible parameter combination that minimizes Inline graphic(Inline graphic, U).

Note, in passing, that the squared residuals entering into Inline graphic(Inline graphic, U) in eq. (6) are weighted by the reciprocals of the variances, Inline graphic. This means that experimental data points with larger experimental uncertainties carry less weight and have less of an effect on the choice of the optimal, “best match” parameter combination, Inline graphic, than data points with smaller experimental uncertainties. In that sense, Inline graphic can be regarded as a “weighted compromise” between all data points, Inline graphic.
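The weighted least-squares fit described above can be sketched for simulated data. Several things here are assumptions for illustration: the linear model form (prediction = total spore number x sum of rate x fraction x time), the three inoculation designs, the noise level, and all names. Because the model is linear in the rates, the minimizer is obtained from the normal equations rather than by iterative search.

```python
import random

def chi_square(rates, experiments, n_c):
    """Sum over experiments and time points of squared residuals divided by sigma^2."""
    total = 0.0
    for fractions, times, ys, sigmas in experiments:
        pooled = n_c * sum(r * u for r, u in zip(rates, fractions))
        for t, y, s in zip(times, ys, sigmas):
            total += ((y - pooled * t) / s) ** 2
    return total

def solve_linear(a, b):
    """Solve a small dense system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares_rates(experiments, n_c):
    """Minimize chi-square via the weighted normal equations."""
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for fractions, times, ys, sigmas in experiments:
        for t, y, s in zip(times, ys, sigmas):
            x = [n_c * u * t for u in fractions]  # derivative of the prediction w.r.t. each rate
            for i in range(3):
                b[i] += x[i] * y / s ** 2
                for j in range(3):
                    a[i][j] += x[i] * x[j] / s ** 2
    return solve_linear(a, b)

rng = random.Random(2)
true_rates, n_c, times, sigma = (0.002, 0.001, 0.004), 100, [1, 2, 3, 4, 5], 0.02
designs = [(0.80, 0.15, 0.05), (0.10, 0.85, 0.05), (0.10, 0.15, 0.75)]
experiments = []
for u in designs:
    pooled = n_c * sum(r * x for r, x in zip(true_rates, u))
    ys = [pooled * t + rng.gauss(0, sigma) for t in times]  # simulated noisy time series
    experiments.append((u, times, ys, [sigma] * len(times)))

best = least_squares_rates(experiments, n_c)
print(best, chi_square(best, experiments, n_c))
```

At least three experiments with linearly independent inoculation fractions are needed here, since each time series only constrains one weighted combination of the three rates.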

While there are, in principle, many different ways to define an ensemble probability distribution function having these general characteristics, an obvious, simple choice for eq. (1), supported by statistical theory [11], is given by:

graphic file with name DmEquation23.gif (11)

The Inline graphic-factor in eq. (11) is a normalization factor, chosen to ensure that the ensemble probability density function or PDF integrates to a probability of 1. That is, for our model for a mixture experiment with Inline graphic, the Inline graphic is chosen such that

graphic file with name DmEquation24.gif (12)

Here, Inline graphic and Inline graphic denote, respectively, a reasonable lower and upper limit imposed on Inline graphic, and Inline graphic. Eq. (11) is then to be understood to hold only when Inline graphic, and Inline graphic each falls within the interval between Inline graphic and Inline graphic; if Inline graphic, Inline graphic, or Inline graphic lies outside of this interval, we set Inline graphic.

Notice that Inline graphic in eq. (11) has the desired general characteristics: For very large values of Inline graphic, the exponential function Inline graphic, and hence Inline graphic becomes very small; for smaller values of Inline graphic, Inline graphic becomes larger. Hence, Inline graphic-choices whose model predictions agree poorly with the experimental data will have a low probability of being drawn from Inline graphic; Inline graphic-choices whose model predictions agree well with the experimental data will have a higher probability of being drawn from Inline graphic.

Given Inline graphic, we can now calculate, for example, expectation values, variances, and histograms of any observable quantity, Inline graphic, which the model allows us to predict as a function of Inline graphic. Specifically for the expectation value, Inline graphic, and variance, Inline graphic, of such an “observable” Inline graphic, we need to calculate:

graphic file with name DmEquation25.gif (13)

with Inline graphic and Inline graphic for short, and then

graphic file with name DmEquation26.gif (14)

Here, Inline graphic is obtained, analogous to Inline graphic, with Inline graphic in eq. (13) replaced by Inline graphic.

Within the ensemble approach, Inline graphic can serve as a prediction of a representative value of Inline graphic, given the experimental control parameters U and prior experimental data, Inline graphic for all Inline graphic and all k. However, the ensemble approach also allows us to evaluate the ‘uncertainty’ of that prediction, by way of Inline graphic. Furthermore, with similar expectation value calculations, we can also analyze in more detail the random distribution of Inline graphic by way of histograms of all possible A-values. This would tell us, for example, if the values of Inline graphic have a uni- or a multimodal distribution, for random Inline graphics drawn from the ensemble Inline graphic.

These are just a few examples of what kinds of data analyses and model predictions the ensemble approach itself allows us to implement. In the context of the MINE approach of experiment design, we will have to evaluate certain correlations between pairs of observables, AInline graphic and BInline graphic, say. This will require the calculation of expectation values of the general form Inline graphic, with Inline graphic in eq. (13) replaced by the product Inline graphic.

The evaluation of all the foregoing expectation values usually requires numerical techniques to carry out the Inline graphic-integrations as in eq. (13). In general, the Inline graphic-space is very high-dimensional, with a dimension far greater than the Inline graphic-dimension of Inline graphic = 3 in our simple model here. Markov chain Monte Carlo methods are then the only approach available to perform the required expectation value calculations efficiently for omics experiments and field studies [15] (see Ensemble Methods section). Fig. 4a shows a simulation of the mixture experiment with the ensemble method applied to the simulated data. The ensemble method converges quite well to the true colonization rates Inline graphic.
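The Monte Carlo procedure can be sketched with a plain Metropolis sampler. The example below is a minimal sketch, not the authors' code: it assumes an ensemble weight of the form exp(−χ²/2) within fixed bounds, uses a hypothetical linear model with three rates, and borrows the 3000 equilibration and 1000 accumulation sweeps quoted in the Fig. 4 caption. The step size and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

theta_true = np.array([0.5, 1.0, 1.5])      # hypothetical true rates
U = np.array([[0.2, 0.3, 0.5],
              [0.6, 0.1, 0.3],
              [0.1, 0.8, 0.1],
              [0.4, 0.4, 0.2]])             # inoculation proportions of prior experiments
sigma = 0.05
y = U @ theta_true + sigma * rng.normal(size=len(U))
lo, hi = 0.0, 3.0                           # assumed parameter bounds

def chi2(theta):
    return float(np.sum((U @ theta - y) ** 2) / sigma**2)

def sweep(theta, step=0.05):
    """One sweep: propose an update to each of the three rates once."""
    for i in range(theta.size):
        proposal = theta.copy()
        proposal[i] += step * rng.normal()
        if not (lo <= proposal[i] <= hi):
            continue                        # weight is zero outside the bounds: reject
        delta = chi2(proposal) - chi2(theta)
        if delta <= 0 or rng.random() < np.exp(-0.5 * delta):
            theta = proposal                # Metropolis acceptance rule
    return theta

theta = rng.uniform(lo, hi, size=3)
for _ in range(3000):                       # equilibration phase
    theta = sweep(theta)

samples = []
for _ in range(1000):                       # accumulation phase
    theta = sweep(theta)
    samples.append(theta.copy())
samples = np.array(samples)

theta_hat = samples.mean(axis=0)            # ensemble estimate of the rates
print(theta_hat, samples.std(axis=0))
```

The accumulated `samples` array is the ensemble; any expectation value in eq. (13) becomes an average over its rows.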

Figure 4.


An ensemble method to identify the mixture experiment’s hyphal extension rates Inline graphic is carried out on simulated data from the mixture experiment, and MINE is used to choose the next mixture experiment with inoculation proportions u1 and u2. (a) illustrates the ensemble method on simulated data for a mixture experiment of AMF colonizers. The orange lines are the true colonization rates Inline graphic. In the Monte Carlo experiment, the estimated rates are plotted as a function of sweep, where one sweep visits each of the three rates Inline graphic once on average. In the first 3000 sweeps, the Monte Carlo experiment is equilibrated to reach the neighborhood of parameters Inline graphic that fit the simulated data. In the accumulation phase (last 1000 sweeps), the estimates of Inline graphic are accumulated to form the ensemble estimate. (b) presents the next MINE mixture experiment recommended. The contour plot is of the MINE criterion det(E) as a function of the mixture of spore inoculum proportions u1 and u2.

Assume L prior experiments have already been performed, with experimental control parameter vectors Inline graphic, as defined in eq. (8), and observed values Inline graphic, for Inline graphic and Inline graphic. The experimental data points, Inline graphic, combined with the corresponding model predictions, Inline graphic from eq. (4), define an ensemble PDF, Inline graphic, via eqs. (6) and (13). The Inline graphic, in turn, will determine Inline graphic, the predicted uncertainty volume of the observables to be measured in the new experiment(s), as follows:

As the simplest case, assume that we want to design just ‘one’ new experiment, with a new experimental control parameter vector Inline graphic. The MINE objective is then to choose the input inoculation proportions Inline graphic so as to maximize the information content of the new experiment about the rates of colonization by hyphal extension (X), by maximizing the predicted uncertainty volume of the observables to be measured. In our simple mixture experiment example, there are K such observables in any experiment: the hyphal extension X-amounts to be measured at times tk, for Inline graphic. The predicted values for these observed hyphal extensions X are then Inline graphic, as defined by the model eq. (4), for given Inline graphic and Inline graphic. The predicted values for these K observations can be thought of as the components of a vector in a K-dimensional space, the so-called ‘observation space’ (Fig. 3). For a given Inline graphic and Inline graphic, this vector of ‘predicted observations’ is denoted in the following by Inline graphic, and given by

graphic file with name DmEquation27.gif (15)

We can now use the ensemble PDF, Inline graphic, to define, in some way, a volume of likely Inline graphics in Inline graphic-space. If we let Inline graphic sweep over that finite volume then, by eq. (15), Inline graphic will sweep over some corresponding finite volume (or hyper-surface) in the observation space: the ‘uncertainty volume’ of the predicted observation vector, to be denoted by Inline graphic, for given Inline graphic, and illustrated in Fig. 3.

There is, of course, no precise prescription of how to define a volume of likely-Inline graphic in Inline graphic-space, or a corresponding uncertainty volume, Inline graphic, in observation space. That definition is not unique: it requires some arbitrary but reasonable choices to be made. In the following, two specific possible choices for Inline graphic will be discussed.

MINE by Covariance Ellipsoid Volume

In the covariance matrix approach, we define Inline graphic in terms of the uncertainty ellipsoid, constructed from the covariances of the K observable predictions, Inline graphic, subject to the ensemble PDF Inline graphic. Let Inline graphic denote those covariance matrix elements, i.e. for Inline graphic, let

graphic file with name DmEquation33.gif (16)

with expectation values E[…] defined as in eq. (13). On general mathematical grounds, the corresponding Inline graphic covariance matrix, Inline graphic is symmetric and positive semidefinite. Therefore, D has K real, non-negative eigenvalues, Inline graphic; and it has an orthonormal basis of corresponding K-dimensional eigenvectors, Inline graphic with Inline graphic. That is, for Inline graphic, we have:

graphic file with name DmEquation34.gif (17)
graphic file with name DmEquation35.gif (18)
graphic file with name DmEquation36.gif (19)
graphic file with name DmEquation37.gif (20)

The eigenvalues and eigenvectors are, of course, dependent upon u, but, for notational simplicity, we have suppressed that functional dependence, i.e. Inline graphic and Inline graphic in eqs. (17–20).

By eq. (19), the eigenvectors, Inline graphic, are orthogonal, i.e. pairwise perpendicular to each other. The eigenvalues, Inline graphic, are the variances of the predicted observation vector, Inline graphic, along the corresponding eigenvector directions. That is, if we take the projection of the vector Inline graphic onto Inline graphic, i.e. let Inline graphic denote that projection, with

graphic file with name DmEquation38.gif (21)

then Inline graphic is the variance of that projected Inline graphic–vector:

graphic file with name DmEquation39.gif (22)

where Inline graphic is defined as in eq. (14).

The eigenvalues and eigenvectors of Inline graphic define the so-called “covariance ellipsoid” or “error ellipsoid” of the predicted observation vector, Inline graphic, in the K-dimensional observation space: the eigenvectors, Inline graphic, can be thought of as the orientations of the principal axes of the ellipsoid; the standard deviations of the projections Inline graphic, i.e. Inline graphic, are the lengths of the principal semi-axes along the Inline graphic-directions. This ellipsoid serves as our uncertainty volume, and Inline graphic is given by the product of the semi-axis lengths,

graphic file with name DmEquation40.gif (23)

where Inline graphic is an unimportant geometrical prefactor,

graphic file with name DmEquation41.gif (24)

with Inline graphic denoting Euler’s gamma function. Eq. (23) can also be written in terms of the determinant of the D-matrix:

graphic file with name DmEquation42.gif (25)
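The covariance ellipsoid construction can be checked numerically. The sketch below assumes the ensemble is available as a set of sampled prediction vectors (one row per ensemble member) and assumes the geometrical prefactor is the volume of the unit ball in K dimensions, π^(K/2)/Γ(K/2 + 1); the sample data and function name are hypothetical. It confirms that the product of semi-axis lengths and the determinant form give the same volume.

```python
import numpy as np
from math import gamma, pi

def covariance_ellipsoid_volume(F):
    """F has shape (M, K): one predicted observation vector per ensemble member."""
    K = F.shape[1]
    D = np.cov(F, rowvar=False, bias=True)       # K x K covariance matrix, as in eq. (16)
    lam, W = np.linalg.eigh(D)                   # real eigenvalues, orthonormal eigenvectors
    lam = np.clip(lam, 0.0, None)                # guard against tiny negative round-off
    C_K = pi ** (K / 2) / gamma(K / 2 + 1)       # assumed prefactor: unit K-ball volume
    vol_from_axes = C_K * np.prod(np.sqrt(lam))  # product of semi-axis lengths
    vol_from_det = C_K * np.sqrt(max(np.linalg.det(D), 0.0))
    return D, lam, W, vol_from_axes, vol_from_det

rng = np.random.default_rng(2)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.7]])
F = rng.normal(size=(5000, 3)) @ mixing          # hypothetical correlated predictions

D, lam, W, vol_axes, vol_det = covariance_ellipsoid_volume(F)

# Variance of the projection onto the first eigenvector equals its eigenvalue.
projection = (F - F.mean(axis=0)) @ W[:, 0]
print(vol_axes, vol_det, projection.var(), lam[0])
```

In practice the determinant form is cheaper, since it avoids the eigendecomposition when only the volume is needed.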

MINE by Correlation Ellipsoid Volume

In the correlation matrix approach, we define Inline graphic in terms of an uncertainty ellipsoid constructed from the Pearson correlations of the K observable predictions, Inline graphic, subject to the ensemble PDF Inline graphic. The Pearson correlation matrix elements, denoted by Inline graphic, are related to the covariance matrix elements, Inline graphic, from eq. (16), by

graphic file with name DmEquation43.gif (26)

Note that Inline graphic can also be written as the covariance matrix of the predicted observations, Inline graphic, normalized by their standard deviations:

graphic file with name DmEquation44.gif (27)

where

graphic file with name DmEquation45.gif (28)

Therefore, the correlation matrix E has the same mathematical properties of symmetry and positive semidefiniteness as the covariance matrix D. Analogous to the covariance ellipsoid constructed from D, we can therefore construct a correlation ellipsoid from the eigenvalues and orthonormal eigenvectors of the correlation matrix E. Using the volume of the correlation ellipsoid as the uncertainty volume of the predicted observables, we then have, analogous to eq. (23),

graphic file with name DmEquation46b.gif (29)

where Inline graphic are the eigenvalues of the correlation matrix E. Analogous to eq. (25), we can also write this as

graphic file with name DmEquation47.gif (30)

The correlation ellipsoid volume has several advantages. First, V(u) measures the linear dependence among the observables, and the greatest gain in information from an experiment is likely to come from structuring the experiment to increase the linear independence of the observables. In fact, det(E) is 1 when the observables are mutually uncorrelated and is 0 when there is an exact linear dependence among them. Second, V(u) has well-known statistical properties when the ensemble takes the form of eq. (11) [42].
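The limiting values of the correlation determinant are easy to verify on synthetic data. The following sketch uses hypothetical sampled predictions and shows that det(E) is close to 1 for uncorrelated observables, close to 0 when one observable is a linear combination of the others, and unchanged when observables are rescaled (a property of correlation matrices that the covariance determinant does not share).

```python
import numpy as np

def correlation_det(F):
    """det(E) for sampled predictions F of shape (M, K)."""
    D = np.cov(F, rowvar=False, bias=True)
    s = np.sqrt(np.diag(D))
    E = D / np.outer(s, s)                 # Pearson correlation matrix from covariances
    return float(np.linalg.det(E))

rng = np.random.default_rng(3)
z = rng.normal(size=(20000, 3))

det_uncorrelated = correlation_det(z)      # three independent observables

# Third observable is an exact linear combination of the first two.
dependent = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + z[:, 1]])
det_dependent = correlation_det(dependent)

# Rescaling the observables (e.g. changing units) leaves det(E) unchanged.
det_rescaled = correlation_det(z * np.array([1.0, 10.0, 100.0]))

print(det_uncorrelated, det_dependent, det_rescaled)
```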

The surface of the MINE criterion in a contour plot is shown for the next mixture experiment (Fig. 4b). The MINE experiment involves using ~0.25 of AMF1 in the inoculum and ~0.40 of AMF2 in the inoculum to characterize the rates of hyphal extension in the next experiment. As a final note, some theorems about the properties of MINE have been established for the class of linear models, such as the mixture experiments [11]. Code for the ensemble methods is available on GitHub [43, 44].
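Choosing the next experiment amounts to scanning the criterion over the admissible control settings. The sketch below is an illustration under strong assumptions: the ensemble is a hypothetical cloud of parameter samples, and the prediction model is a made-up saturation curve rather than the article's eq. (4). It evaluates det(E) on a grid of inoculum proportions (u1, u2) with u3 = 1 − u1 − u2 and reports the maximizing mixture.

```python
import numpy as np

rng = np.random.default_rng(4)

t = np.array([1.0, 3.0, 9.0])                      # K = 3 measurement times (assumed)
# Hypothetical ensemble of rate parameters (M = 2000 members, 3 rates each).
thetas = rng.normal([0.3, 0.6, 1.0], 0.1, size=(2000, 3)).clip(0.01)

def predictions(u):
    """Toy model: mixture of saturating curves, one per colonizer."""
    curves = 1.0 - np.exp(-thetas[:, :, None] * t[None, None, :])   # (M, 3, K)
    return np.einsum("i,mik->mk", u, curves)                         # (M, K)

def det_E(u):
    """MINE criterion: determinant of the correlation matrix of the predictions."""
    E = np.corrcoef(predictions(u), rowvar=False)
    return float(np.linalg.det(E))

grid = np.linspace(0.0, 1.0, 21)
candidates = [(u1, u2) for u1 in grid for u2 in grid if u1 + u2 <= 1.0 + 1e-12]
scores = [det_E(np.array([u1, u2, 1.0 - u1 - u2])) for u1, u2 in candidates]

best = int(np.argmax(scores))
u1_best, u2_best = candidates[best]
print(u1_best, u2_best, scores[best])
```

The array of `scores` over the (u1, u2) grid is what a contour plot such as Fig. 4b would display; the numbers produced here come from the toy model and are not comparable to the article's values.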

Application of MINE to RNA profiling experiments to discover the mechanism of biological clocks

MINE was developed specifically for this kind of transcriptomics problem and used to close the loop in the computing life cycle (Fig. 5) proposed by Hood and Aebersold [45]. Here, the application was to transcriptomic experiments to discover the mechanism underlying circadian rhythms, one of the central problems of systems biology [46]. Transcriptomic experiments have a limited number of time points n but have many thousands of genes [and hence parameters (p)] to be identified in the process [6]. While both MINE criteria, the Covariance Ellipsoid Volume and the Correlation Ellipsoid Volume, were developed, only the Correlation Ellipsoid Volume was ultimately used in designing the experiments [6].

Figure 5.


The MINE experiment is a 90% knockdown of the wc-1 gene. The MINE criterion displayed is the correlation ellipsoid volume det(E(U)), which is graphed as a function of the remaining activity of the three clock mechanism genes. The predictions F are of the log base 10 concentrations of frq, wc-1, and wc-2 mRNAs over time from the RNA profiling experiments. The mRNA levels were measured at 14 time points over an 8-h window. The drawing is taken from [6].

Transcriptomics was to be used to explore the mechanism of the biological clock in one of the most well-studied model systems [47], the filamentous fungus, Neurospora crassa. Three major components of the clock mechanism were: (i) frequency (frq), the gene encoding the oscillator of the system, and a negative regulator [48]; (ii) white-collar-1 (wc-1), the gene encoding the light response element and a positive transcriptional activator for the system [49]; and (iii) white-collar-2 (wc-2), a second positive transcriptional activator for the system. Together wc-1 and wc-2 encode WC-1 and WC-2 proteins that act as positive elements in the clock through the dimeric complex WCC = WC-1/WC-2, while frq encodes a protein FRQ, which acts as the negative regulator for the system. The FRQ protein provides negative feedback to wc-1 and wc-2. All three of these elements appear in a single copy in the N. crassa genome, but they have homologs in fly and mammalian systems [49, 50].

Now consider a series of RNA profiling experiments that were conducted, guided by MINE to choose an informative sequence of experiments. The last in a series of three adaptive experiments guided by MINE involved a choice of whether or not to do a knockdown or overexpression experiment on: (i) frq; (ii) wc-1; or (iii) wc-2. The conventional wisdom was to mutate the oscillator gene frq [51].

The first step in the MINE application is to make predictions for various mutations in the clock mechanism genes using an available ensemble. The RNA profiles of all 11 000 genes were measured at each of 14 time points. A subset of these RNA measurements included a total of 14 time points on each of the three clock genes so that the prediction vector F had 42 components. Unlike previous models so far considered, the clock model is a nonlinear model (in the parameters describing the model), which specifies a genetic network of nonlinear ordinary differential equations describing the time course of the genes, their cognate RNAs, and proteins [18]. The model ensemble for this clock network was used to predict RNA profiles of frq, wc-1, and wc-2 and their correlations under different possible experiments as shown in Fig. 2.

With the correlation matrix in hand for the predictions, the MINE criterion based on the correlation ellipsoid volume in eq. (30) was calculated as a function of the degree of knockdown of the three clock genes (Fig. 5). The result was surprising. A knockdown of wc-1 was selected as the MINE experiment and used to identify 2323 genes responding to the knockdown [6]. The second surprise from this model-guided discovery process was that ribosome biogenesis was under clock control. This was later confirmed in mammalian systems [52].

A better way to do this MINE calculation would have been to use the transcriptomic data on all 11 000 N. crassa genes instead of just the three clock genes driving the system. Ensemble methods now exist for the whole genome-scale network with all of its thousands of genes, and ensembles now exist for the entire clock network [19, 20, 53]. Using graphics processing units (GPUs), ensemble methods, such as MINE, can now be implemented on a genomic scale with an unknown network structure [19, 53].

Application of MINE to genome-wide association study field studies for AMF/sorghum project

Consider a GWAS study underway to understand the genetic basis of biomass and AMF colonization in Sorghum bicolor using the BAP Panel [26] of 343 plant accessions of varying biomass. There are 232 303 SNPs to characterize each member of the panel after filtering for minor allele frequencies [54]. The focus is on dry weight as a measure of biomass, and AMF colonization is measured using a convolutional neural network from imaging AMF in roots [41, 55]. At Time 0, dry weight data were available on each accession in the panel to construct an ensemble [26]. The GWAS study has been running for 5 years [43, 44], and, each year, MINE is being used to select 79 BAP plant accessions for study. The 79 plant seedlings will be assigned randomly to 79 rows with each row consisting of 9 seedlings of identical genotype. The 79 × 9 block was replicated three times in the field. The goal is to discover the relation between plant biomass and SNPs.

In the field study, there are actually a number of plant features that are being measured during the sequence of MINE experiments [44]. These include plant genotype, plant expression Quantitative Trait Loci or eQTLs, the microbiome, tissue total phosphorus (P), nitrogen (N), time of harvest, and other variables relevant to plant health as measured by biomass (Fig. 6). These variables used to predict biomass (as well as AMF colonization and AMF community composition) are held together in a causal diagram representing a structural equation model [56]. The standard model for GWAS experiments is the mixed linear model, which is a special case of the structural equation model in which some of the independent variables are random with mean 0 [57]. For purposes of illustration, a mixed linear model is presented below for an adaptive GWAS experiment underway at Wellbrook Farm, Athens, GA.

Figure 6.


A sequence of MINE experiments is being used in a 5-year GWAS experiment to examine the relation between biomass and SNPs in S. bicolor using the BAP accessions [26]. MINE is used to select the BAP accessions to be used each year in order to map AMF colonization and biomass to the sorghum genetic map in a GWAS study. Multiscale structural equation model (SEM) for the project (center boxes and arrows). Lotka–Volterra community models are nested within the SEM and predict associations that affect biomass. The dependent variable is biomass, and the arrows in the diagram denote causal relationships between independent variables in the SEM. In this model, sorghum genotype is the primary independent variable that correlates with the remaining variables. This conceptual model will evolve continuously using the model-guided discovery process of maximally informative next experiment (MINE; outer ring) [6].

There are two sets of measured inputs to the GWAS experiment in Year 1 at Wellbrook Farm, Athens, GA: (i) fixed variables in the design matrix X, such as block number, harvest time of each plant, N level, P level; (ii) random intercepts and slope effects Z for the genotype of each BAP accession. The additive genetic variation in plant genotype as it affects the dependent variable of log dry weight (i.e. biomass) was captured by binning the number of alleles in a given genomic region different from the reference genome using the sum method [58]. The number of SNPs in each chromosomal region was adjusted to ensure that each genomic region was at least 50 kb in size. Since linkage disequilibrium is reported between markers separated by 3.5–35 kb [59], the 50 kb size was chosen to reduce linkage disequilibrium between bins. The result was that the number of bins was 2748 in the 750 Mb sorghum genome with each bin typically having 10–12 genes in sorghum. The number of such alleles in a bin is treated as a continuous random variable with mean 0 and variance component Inline graphic for the ith accession, and summed over regions to obtain a random effect for an accession. The fact that each random effect is the sum of 2748 small random effects of chromosomal regions makes it plausible that the random effect of accession is normally distributed. The fixed effects of block number and BAP genotype, for example, are denoted by the vector Inline graphic and the random effects, by u.
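The binning step can be illustrated with a small example. The sketch below is a simplification under stated assumptions: it uses fixed-width 50 kb bins on a single chromosome and takes the "sum method" to mean summing the non-reference allele counts of all SNPs falling in a bin. The article adjusts region boundaries so that each region is at least 50 kb, which this toy version does not reproduce; the function name and data are hypothetical.

```python
import numpy as np

def bin_allele_counts(positions, alt_counts, bin_size=50_000):
    """Sum non-reference allele counts per fixed-width genomic bin.

    positions:  (S,) base-pair coordinates of SNPs on one chromosome
    alt_counts: (A, S) non-reference allele counts (0, 1, or 2) per accession and SNP
    returns:    (A, B) summed counts per accession and bin
    """
    bin_index = positions // bin_size
    n_bins = int(bin_index.max()) + 1
    binned = np.zeros((alt_counts.shape[0], n_bins))
    for b in range(n_bins):
        binned[:, b] = alt_counts[:, bin_index == b].sum(axis=1)
    return binned

# Five SNPs on a toy chromosome, two accessions.
positions = np.array([1_000, 20_000, 49_999, 50_000, 120_000])
alt_counts = np.array([[0, 1, 2, 0, 1],
                       [2, 2, 0, 1, 0]])

binned = bin_allele_counts(positions, alt_counts)
centered = binned - binned.mean(axis=0)     # mean-zero version of each bin variable
print(binned)
```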

These fixed and random effects are used to predict some measure of biomass, log dry weight. In the experiment, there are a total of n ~ 606 plants (2–3 plants × 3 blocks × 79 rows) in the field to infer the fixed and random effects, which is less than p = 2748 fixed effects + 79 variance components. The measurements to be predicted are summarized in the n × 1 column vector Y. The last component of the model is the error in biomass, denoted by Inline graphic for the ith plant in the experiment.

In Year 1 of the adaptive GWAS experiment, there was no evidence of block effects, and N and P applied were not varied. A total of 79 BAP accessions were planted in a randomized complete block design with 3 blocks, 79 plots in each block, 1 genotype randomly assigned per plot (i.e. a row), and 12 replicates per plot. The mixed linear model for this experiment is reduced to:

graphic file with name DmEquation48.gif (31)

where Y is an n × 1 vector of observations on biomass. The n × p matrix X contains the values of the p independent variables (describing each bin) with n observations on each of the chromosomal bins. The p × 1 vector Inline graphic contains the fixed effects of each chromosomal region. The n × r matrix Z is the r = 2748 normalized number of alleles in each accession on n plants in the field. The r × 1 vector Inline graphic is a vector of random effects of each accession on biomass. The errors in the dependent variable, biomass Y, are captured in the n × 1 vector Inline graphic. Three of the assumptions of the model are that: (i) the random effects u are independent of the biomass errors Inline graphic; (ii) the errors Inline graphic are normally distributed with mean 0 and variance Inline graphic; (iii) the random effects u are normally distributed with mean 0 and variance Inline graphic for plant i with accession j(i). That is, the assumptions are that the random effects Inline graphic and errors Inline graphic are independent and normally distributed with mean 0 and variance–covariance matrix Inline graphic and I Inline graphic, respectively. In the application of eq. (31), taking Z = X implies that both intercepts and slopes are random as in [41].

Under this model, the prediction of biomass is:

graphic file with name DmEquation49.gif

The variance components and heritability are used to calculate the variance–covariance matrix V of the biomass measurements Y:

graphic file with name DmEquation50.gif

where Inline graphic. That is, Inline graphic and Inline graphic are the ith row vectors of Z and X, respectively. Each observation Inline graphic has such a 1 × n row vector Inline graphic to describe the genetics of its accession. Each term Inline graphic is an n × n block. The variance–covariance matrix is diagonal with p blocks each with the same diagonal elements Inline graphic. The index j(i) is a lookup that returns the variance component of the ith observation as determined from accession j. Plant i has an assigned accession j.
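For readers who want to see the structure of such a model, the sketch below simulates a small mixed linear model in its textbook form, Y = Xβ + Zu + ε, and builds the implied variance–covariance matrix of Y. Several choices here are assumptions made for illustration and may differ from the article's equations: Z is taken to be an accession indicator matrix, each accession j has its own variance component, and the symbols `beta`, `sigma2_u`, and `sigma2_e` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(5)

n_acc, reps, p = 4, 3, 2                     # toy sizes: 4 accessions, 3 plants each
n = n_acc * reps
accession = np.repeat(np.arange(n_acc), reps)   # lookup j(i): accession of plant i

X = rng.normal(size=(n, p))                  # fixed-effect design matrix
Z = np.eye(n_acc)[accession]                 # n x r indicator of each plant's accession
beta = np.array([1.0, -0.5])                 # fixed effects
sigma2_u = np.array([0.4, 0.1, 0.9, 0.2])    # one variance component per accession
sigma2_e = 0.05                              # error variance

# Random effects and errors are independent, normal, with mean 0.
u = rng.normal(scale=np.sqrt(sigma2_u))
eps = rng.normal(scale=np.sqrt(sigma2_e), size=n)
Y = X @ beta + Z @ u + eps

# Implied variance-covariance matrix of Y: plants sharing an accession covary.
V = Z @ np.diag(sigma2_u) @ Z.T + sigma2_e * np.eye(n)
print(np.round(V[:4, :4], 2))
```

With this layout, V is block diagonal: plants of the same accession share the covariance σ²_j, and plants of different accessions are uncorrelated.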

With the assumptions above for the mixed linear model, the ensemble Q can be written down as multivariate normal with the Inline graphic-vector consisting of the fixed effects Inline graphic and the variance components:

graphic file with name DmEquation51.gif

In the first use of MINE in a field experiment, no fertilizers were applied to the field in 2021 at Wellbrook Farm, Athens, GA [44]. The model was reduced to a fixed effects model with the number of alleles in a bin as the set of independent variables using the sum method [58].

The ensemble method was used to fit the models to the published log dry weight data from 3 years from 2013 to 2015 averaged in Florence, SC [26], to make predictions in the use of MINE [44]. Typically, in an omics experiment, there are prior published data available, and this should be used when available [6] to initialize the MINE sequence. A total of 1000 equilibration sweeps were done, and then 1000 sets of model parameters were accumulated, each model parameter set separated by 100 decorrelation sweeps. The chi-square per data point was 6.12 with n = 606 dry weight measurements. As a control, the ensemble run was repeated with the only change being 1000 decorrelation steps. The fitted ensemble from all the data simultaneously provides a unified framework for feature selection to avoid overfitting [43, 44].

MINE was then applied to the fitted ensemble from Florence, SC, to select 80 accessions for use in 2022 for planting at Wellbrook Farm [44]. The MINE method used was the covariance ellipsoid in eq. (25). The MINE criterion V(u) conceptually involved evaluating det(D) over all possible Inline graphic samples from the BAP panel, which is computationally intractable. Instead, as an approximation, the MINE criterion was optimized by evaluating det(D) on all Inline graphic triples drawn from 343 BAP accessions deposited at USDA GRIN in Griffin, GA. The result is shown graphically (Fig. 7). Details of calculating det(D) in the MINE section begin with calculating the covariance matrix of the predictions for next year’s experiment with eq. (16). The eigenvalues are calculated for the covariance matrix using eqs. (17–20). The eigenvalues, in turn, determine det(D) in eq. (23). Code for the ensemble methods and the MINE calculations is available on GitHub [44].
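The triple-based approximation can be sketched on a toy panel. In the example below the genotype matrix, the ensemble of effect vectors, and the panel size are all hypothetical, and the final step (taking distinct accessions from the top-ranked triples in rank order) is an assumption about how the selection might be carried out, not a description of the authors' code. For the full panel of 343 accessions there are 6 666 891 triples.

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(6)

n_acc, n_bins, M = 12, 5, 400                  # toy sizes; the BAP panel has 343 accessions
G = rng.integers(0, 3, size=(n_acc, n_bins)).astype(float)   # hypothetical binned genotypes
betas = rng.normal(size=(M, n_bins))           # hypothetical ensemble of bin effects

# Predicted log dry weight of every accession under every ensemble member.
pred = betas @ G.T                             # shape (M, n_acc)
D_full = np.cov(pred, rowvar=False, bias=True) # covariance of predictions across the ensemble

# Score every triple of accessions by det(D) of its 3 x 3 covariance submatrix.
scores = {}
for triple in combinations(range(n_acc), 3):
    idx = np.array(triple)
    scores[triple] = float(np.linalg.det(D_full[np.ix_(idx, idx)]))

ranked = sorted(scores, key=scores.get, reverse=True)

# Collect distinct accessions from the top-ranked triples (assumed selection rule).
selected = []
for triple in ranked[:20]:
    for a in triple:
        if a not in selected:
            selected.append(a)

print(len(scores), comb(343, 3), selected)
```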

Figure 7.


The MINE criterion log(det(D)) was used to select 80 accessions for use in a GWAS experiment at Wellbrook Farm, GA, in 2022. The top 200 selected triples of accessions are ranked by det(D). From these top 200 triples, 80 distinct accessions were selected.

A detailed comparative analysis of MINE in GWAS with classic design approaches is available [43, 44]. The use of MINE in a series of three smaller yearly adaptive experiments is compared directly with a large classic design on the BAP for log dry weight [26]. In this example, the MINE experiments made the GWAS feasible with ~8 project participants per year in the field experiment, whereas a larger classic design using all BAP accessions in 1 year involved over 20 project participants.

As a final note, in the first pass at developing the ensemble methods, such as MINE, for GWAS, the random effects were assigned to accessions rather than individual genes to track the replicates across blocks in the field design each year and to keep the model simpler by limiting the number of variance components to the number of accessions rather than the number of bins. The hypothesis to be tested was simply whether or not BAP accessions had an effect on AMF colonization. Future work will include more variance components for each chromosomal bin to fit these more complicated ensembles of models [43].

Conclusion

There are other potential domains of application of MINE. One example is computer vision models for large image data sets, such as the 15 Terabyte (Tb) data set of plant root images with AMF structures in the root cells recently reported [60]. As such data sets grow, there are choices of structural categories to annotate for ‘balance’ to build a good classification and segmentation model. Then, the ratios of training, validation, and testing sets need to be selected as design parameters, the usual ratios being 8:1:1. Once a criterion for information is decided upon, such as accuracy of classification or F1 or precision as well as for segmentation [i.e. Intersection over Union (IoU)] [41], then MINE could be applied. Classification problems are potentially approachable by MINE in the same way as MINE has been applied to phylogenetic trees in taxon sampling [61, 62]. The goal would be to select promising categories for further annotation to improve classification and segmentation adaptively as the data resource is expanded.

Other kinds of large-Tb data sets are encountered in the natural language processing of genomic sequences arising from protein and DNA sequence languages in microbial genomes and microbiomes [63]. The goal is to use natural language models to help predict the functional categories of sequences, most of which are unannotated. Again, the same kind of design questions arise. What are the most informative functional categories to explore with annotation? How do we go about training, validating, and testing natural language models for genomic sequences? MINE would presumably help to select functional categories for discovery potential.

MINE is a discovery tool designed specifically for very large genetic data sets and is illustrated in a variety of problems within genetics here. MINE is built upon other ensemble methods [18] that have been developed for fitting models with n ≪ p. These ensemble methods of model identification coupled with MINE complete a discovery cycle (Fig. 6) for exploring problems in genetics. This discovery cycle has been called computing life [45]. MINE as a discovery tool completes the cycle in both analyzing and designing future costly omics experiments arising in genetics and allows an adaptive approach to solving problems in genetics.

Key Points

  • New model-guided and adaptive experimental design for omics experiments called the Maximally Informative Next Experiment or MINE is reviewed for genetics.

  • Ensemble methods are reviewed to fit models when the number of parameters p greatly exceeds the number of data points n (p ≫ n).

  • MINE uses a fitted model ensemble for model-guided discovery when p ≫ n to select the next MINE, thereby better distinguishing models within a fitted model ensemble.

  • MINE is illustrated in nonlinear models of the clock in Neurospora crassa in multiyear transcriptomics experiments.

  • MINE is also illustrated in linear models for a genome-wide association study in Sorghum bicolor for multiyear experiments to find the genes underlying biomass.

Supplementary Material

video1_bbaf167 (video file, 5.8 MB, mp4)

Contributor Information

Isaac Torres, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.

Shufan Zhang, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.

Amanda Bouffier, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.

Michael Skaro, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.

Yue Wu, Department of Genetics, Stanford University, Stanford, CA 94309, USA.

Lauren Stupp, Genetics Department, University of Georgia, Athens, GA 30602, USA.

Jonathan Arnold, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.

Y Anny Chung, Plant Biology and Plant Pathology, University of Georgia, Athens, GA 30602, USA.

H-Bernd Schuttler, Physics and Astronomy, University of Georgia, Athens, GA 30602, USA.

Funding

This work was supported by DOE DE-SC0021386 and NSF MCB-2041546.

Data availability

The data underlying this article will be shared on reasonable request to the corresponding author.

References

  • 1. Fisher RA. The Design of Experiments. Edinburgh: Oliver and Boyd, 1935.
  • 2. Fisher RA. The Design of Experiments. 6th edn. Edinburgh: Oliver and Boyd, 1951.
  • 3. John PWM. Statistical Design and Analysis of Experiments. London: Macmillan, 1971.
  • 4. Lasky JR, Upadhyaya HD, Ramu P. et al. Genome-environment associations in sorghum landraces predict adaptive traits. Sci Adv 2015;1:e1400218. doi: 10.1126/sciadv.1400218
  • 5. Hu Z, Olatoye MO, Marla S. et al. An integrated genotyping-by-sequencing polymorphism map for over 10,000 sorghum genotypes. Plant Genome 2019;12:180044. doi: 10.3835/plantgenome2018.06.0044
  • 6. Dong W, Tang X, Yu Y. et al. Systems biology of the clock in Neurospora crassa. PLoS One 2008;3:e3105. doi: 10.1371/journal.pone.0003105
  • 7. Johnson NC, Wilson GWT, Wilson JA. et al. Mycorrhizal phenotypes and the law of the minimum. New Phytol 2015;205:1473–84. doi: 10.1111/nph.13172
  • 8. Johnson NC, Wilson GWT, Bowker MA. et al. Resource limitation is a driver of local adaptation in mycorrhizal symbioses. Proc Natl Acad Sci 2010;107:2093. doi: 10.1073/pnas.0906710107
  • 9. Gao C, Montoya L, Xu L. et al. Strong succession in arbuscular mycorrhizal fungal communities. ISME J 2019;13:214–26. doi: 10.1038/s41396-018-0264-0
  • 10. Johnson NC, Hoeksema JD, Bever JD. et al. From Lilliput to Brobdingnag: Extending models of mycorrhizal function across scales. Bioscience 2006;56:889–900. doi: 10.1641/0006-3568(2006)56[889:FLTBEM]2.0.CO;2
  • 11. Bouffier AM, Arnold J, Schüttler HB. A MINE alternative to D-optimal designs for the linear model. PLoS One 2014;9:e110234. doi: 10.1371/journal.pone.0110234
  • 12. Yengo L, Vedantam S, Marouli E. et al. A saturated map of common genetic variants associated with human height. Nature 2022;610:704–12. doi: 10.1038/s41586-022-05275-y
  • 13. Gouveia GJ. et al. Long-term metabolomics reference material. Anal Chem 2021;93:9193–9.
  • 14. Battogtokh D, Asch DK, Case ME. et al. An ensemble method for identifying regulatory circuits with special reference to the qa gene cluster of Neurospora crassa. Proc Natl Acad Sci USA 2002;99:16904–9.
  • 15. Landau DP, Binder K. Monte Carlo simulations at the periphery of physics and beyond. In: Landau DP, Binder K (eds.), A Guide to Monte Carlo Simulations in Statistical Physics. New York, NY, USA: Cambridge University Press, 2014, 13–22.
  • 16. Guggenheim EA. Boltzmann's Distribution Law. New York: North-Holland Pub. Co.; Interscience Publishers, 1955.
  • 17. McGee RL, Buzzard GT. Maximally informative next experiments for nonlinear models. Math Biosci 2018;302:1–8. doi: 10.1016/j.mbs.2018.04.007
  • 18. Yu Y, Dong W, Altimus C. et al. A genetic network for the clock of Neurospora crassa. Proc Natl Acad Sci USA 2007;104:2809–14. doi: 10.1073/pnas.0611005104
  • 19. Al-Omari A, Griffith J, Caranica C. et al. Discovering regulators in post-transcriptional control of the biological clock of Neurospora crassa using variable topology ensemble methods on GPUs. IEEE Access 2018;6:54582–94. doi: 10.1109/ACCESS.2018.2871876
  • 20. Al-Omari AM, Griffith J, Scruse A. et al. Ensemble methods for identifying RNA operons and regulons in the clock network of Neurospora crassa. IEEE Access 2022;10:32510–24. doi: 10.1109/ACCESS.2022.3160481
  • 21. Fisher RA, Russell EJ. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 1922;222:309–68. doi: 10.1098/rsta.1922.0009
  • 22. Savage LJ. The Foundations of Statistics, 2nd rev. edn. Garden City, NY: Dover Publications, 1972.
  • 23. Landau DP, Binder K. A Guide to Monte Carlo Simulations in Statistical Physics. Cambridge, England: Cambridge University Press, 2009.
  • 24. Antoninka AJ, Ritchie ME, Johnson NC. The hidden Serengeti—Mycorrhizal fungi respond to environmental gradients. Pedobiologia 2015;58:165–76. doi: 10.1016/j.pedobi.2015.08.001
  • 25. Metropolis N, Rosenbluth AW, Rosenbluth MN. et al. Equation of state calculations by fast computing machines. J Chem Phys 1953;21:1087–92. doi: 10.1063/1.1699114
  • 26. Brenton ZW, Cooper EA, Myers MT. et al. A genomic resource for the development, improvement, and exploitation of sorghum for bioenergy. Genetics 2016;204:21–33. doi: 10.1534/genetics.115.183947
  • 27. Mostow GD, Sampson JH. Linear Algebra. NY: McGraw-Hill, 1969.
  • 28. Anderson WW, Arnold J, Sammons SA. et al. Frequency-dependent viabilities of Drosophila pseudoobscura karyotypes. Heredity 1986;56:7–17. doi: 10.1038/hdy.1986.2
  • 29. Cornell JA. Experiments with mixtures: A review. Technometrics 1973;15:437–55. doi: 10.1080/00401706.1973.10489071
  • 30. Cobb AB, Wilson GWT, Goad CL. et al. The role of arbuscular mycorrhizal fungi in grain production and nutrition of sorghum genotypes: Enhancing sustainability through plant-microbial partnership. Agric Ecosyst Environ 2016;233:432–40. doi: 10.1016/j.agee.2016.09.024
  • 31. Watts-Williams SJ, Emmett BD, Levesque-Tremblay V. et al. Diverse Sorghum bicolor accessions show marked variation in growth and transcriptional responses to arbuscular mycorrhizal fungi. Plant Cell Environ 2019;42:1758–74. doi: 10.1111/pce.13509
  • 32. Chappell CR, Fukami T. Nectar yeasts: A natural microcosm for ecology. Yeast 2018;35:417–23. doi: 10.1002/yea.3311
  • 33. Liu Y, Johnson NC, Mao L. et al. Phylogenetic structure of arbuscular mycorrhizal community shifts in response to increasing soil fertility. Soil Biol Biochem 2015;89:196–205. doi: 10.1016/j.soilbio.2015.07.007
  • 34. Jiang S, Liu Y, Luo J. et al. Dynamics of arbuscular mycorrhizal fungal community structure and functioning along a nitrogen enrichment gradient in an alpine meadow ecosystem. New Phytol 2018;220:1222–35. doi: 10.1111/nph.15112
  • 35. Revillini D, Gehring CA, Johnson NC. The role of locally adapted mycorrhizas and rhizobacteria in plant–soil feedback systems. Funct Ecol 2016;30:1086–98. doi: 10.1111/1365-2435.12668
  • 36. Oyarte Galvez L, Bisot C, Bourrianne P. et al. A travelling-wave strategy for plant–fungal trade. Nature 2025;639:172–80.
  • 37. Dharmananda S. Studies of the Circadian Clock of Neurospora crassa: Light-Induced Phase Shifting. Santa Cruz, CA: University of California, 1980.
  • 38. McGonigle TP, Miller MH, Evans DG. et al. A new method which gives an objective measure of colonization of roots by vesicular-arbuscular mycorrhizal fungi. New Phytol 1990;115:495–501. doi: 10.1111/j.1469-8137.1990.tb00476.x
  • 39. Plouznikoff K, Asins MJ, de Boulois HD. et al. Genetic analysis of tomato root colonization by arbuscular mycorrhizal fungi. Ann Bot 2019;124:933–46. doi: 10.1093/aob/mcy240
  • 40. De Vita P, Avio L, Sbrana C. et al. Genetic markers associated to arbuscular mycorrhizal colonization in durum wheat. Sci Rep 2018;8:10612. doi: 10.1038/s41598-018-29020-6
  • 41. Zhang S, Wu Y, Skaro M. et al. Computer vision models enable mixed linear modeling to predict arbuscular mycorrhizal fungal colonization using fungal morphology. Sci Rep 2024;14:10866. doi: 10.1038/s41598-024-61181-5
  • 42. Muirhead RJ. Aspects of Multivariate Statistical Theory. NY: John Wiley & Sons, 2009.
  • 43. Torres I. MINE: Maximally Informative Next Experiment - Genetics Application and Novel Computational Methodology. PhD Dissertation in Bioinformatics. University of Georgia, Athens, GA, 2024.
  • 44. Torres I, Bouffier A, Schuttler H-B et al. MINE: Maximally Informative Next Experiment - towards a new GWAS experimental design and methodology. Genes|Genomes|Genetics (G3), in revision, 2025.
  • 45. Ideker T, Thorsson V, Ranish JA. et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 2001;292:929–34. doi: 10.1126/science.292.5518.929
  • 46. Kitano H. Systems biology: A brief overview. Science 2002;295:1662–4.
  • 47. Dunlap JC. Molecular bases for circadian clocks. Cell 1999;96:271–90.
  • 48. Aronson BD, Johnson KA, Loros JJ. et al. Negative feedback defining a circadian clock: Autoregulation of the clock gene frequency. Science 1994;263:1578–84.
  • 49. Crosthwaite SK, Dunlap JC, Loros JJ. Neurospora wc-1 and wc-2: Transcription, photoresponses, and the origins of circadian rhythmicity. Science 1997;276:763–9.
  • 50. McClung CR, Fox BA, Dunlap JC. The Neurospora clock gene frequency shares a sequence element with the Drosophila clock gene period. Nature 1989;339:558–62.
  • 51. McDonald MJ, Rosbash M. Microarray analysis and organization of circadian gene expression in Drosophila. Cell 2001;107:567–78.
  • 52. Jouffe C, Cretenet G, Symul L. et al. The circadian clock coordinates ribosome biogenesis. PLoS Biol 2013;11:e1001455.
  • 53. Al-Omari A, Griffith J, Judge M. et al. Discovering regulatory network topologies using ensemble methods on GPGPUs with special reference to the biological clock of Neurospora crassa. IEEE Access 2015;3:27–42. doi: 10.1109/ACCESS.2015.2399854
  • 54. Brenton ZW, Juengst BT, Cooper EA. et al. Species-specific duplication event associated with elevated levels of nonstructural carbohydrates in Sorghum bicolor. G3 Genes|Genomes|Genetics 2020;10:1511–20. doi: 10.1534/g3.119.400921
  • 55. Evangelisti E, Turner C, McDowell A. et al. Deep learning-based quantification of arbuscular mycorrhizal fungi in plant roots. New Phytol 2021;232:2207–19. doi: 10.1111/nph.17697
  • 56. Kendall MG, Stuart A, Ord JK. et al. Kendall's Advanced Theory of Statistics, 6th edn. London: Edward Arnold; Halsted Press, 1994.
  • 57. Zhang Z, Ersoz E, Lai CQ. et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 2010;42:355–60. doi: 10.1038/ng.546
  • 58. Romagnoni A, Jégou S, van Steen K. et al. Comparative performances of machine learning methods for classifying Crohn disease patients using genome-wide genotyping data. Sci Rep 2019;9:10351. doi: 10.1038/s41598-019-46649-z
  • 59. Wang Y-H, Upadhyaya HD, Burrell AM. et al. Genetic structure and linkage disequilibrium in a diverse, representative collection of the C4 model plant, Sorghum bicolor. G3 Genes|Genomes|Genetics 2013;3:783–93. doi: 10.1534/g3.112.004861
  • 60. Zhang S, Bourlai T, Arnold J. MycorrhisEE: A high-resolution image dataset for deep learning based quantification of Arbuscular Mycorrhizal fungi. Proceedings of IEEE “Big Data” 2024;P285. doi: 10.1109/BigData62323.2024.10825578
  • 61. Townsend JP, Leuenberger C. Taxon sampling and the optimal rates of evolution for phylogenetic inference. Syst Biol 2011;60:358–65. doi: 10.1093/sysbio/syq097
  • 62. Townsend JP, Lopez-Giraldez F. Optimal selection of gene and ingroup taxon sampling for resolving phylogenetic relationships. Syst Biol 2010;59:446–57. doi: 10.1093/sysbio/syq025
  • 63. Miller D, Stern A, Burstein D. Deciphering microbial gene function using natural language processing. Nat Commun 2022;13:5731. doi: 10.1038/s41467-022-33397-4

Supplementary Materials

video1_bbaf167 (video file, 5.8 MB, mp4)


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press