Abstract
The Maximally Informative Next Experiment (MINE) is a new experimental design approach for experiments, such as those in omics, in which the number of effects or parameters p greatly exceeds the number of samples n (p > n). Classical experimental design presumes n > p for inference about parameters, and its application to p > n can lead to over-fitting. To overcome p > n, MINE uses an ensemble method: it makes predictions about future experiments from an existing ensemble of models consistent with the available data in order to select the most informative next experiment. Its advantages are the exploration of the data for new relationships when n < p and the ability to replace one large classic experiment adaptively with smaller, more tractable experiments as discoveries are made. Thus, MINE is model-guided and adaptive over time in a large omics study. Here, MINE is illustrated in two distinct multiyear experiments, one involving genetic networks in Neurospora crassa and a second involving a genome-wide association study in Sorghum bicolor, as a comparison to classic experimental design in an agricultural setting.
Keywords: ensemble methods, MINE, mixed linear models, genetic networks, GWAS
Introduction
The classic approach to experimental design was developed by R. A. Fisher for linear models in 1935 and had a profound effect on all of science [1, 2]. Growing out of his work at the Rothamsted Experiment Station, he introduced widely the notion of precision of an experiment, randomization, ways of controlling heterogeneity through blocking and by the use of covariates, and the vast subject of experimental design in the context of linear models [3].
The focus of all of these efforts was not on discovery per se. Rather, the end goal was the precision of estimates and the power to test effects in a controlled experiment with the proper randomization and blocking practices in place. The number of replicates was such that the number of observations (n) was typically much greater than the number of effects (p) being estimated in the model. Unfortunately, this is no longer the typical situation of an omics experiment [4], such as a genome-wide association study (GWAS). Instead, there may be only n = 1943 samples of sorghum accessions but over p = 400 000 potential effects of single nucleotide polymorphisms (SNPs) on the complex trait of interest for an agronomic crop [5]. The methods of classic experimental design are limited to the situation of n > p for inference about model parameters and are not designed for data exploration for hypothesis generation.
While the original goal was the precision of estimates [2], the new goal is discovery. Precision is less important because the goal of such studies has shifted to the discovery of relationships in the data. The precision of effects can only be addressed in follow-up studies, once the relevant variables in the experiment have been identified and related. We desire to discover the appropriate nonlinear kinetic models that underlie the biological clock at the molecular level [6] as we carry out a very expensive sequence of transcriptomic experiments. We wish to discover the relation of plant functional traits to SNPs in the nuclear genome or the assemblage of fungal symbionts in the microbiome most beneficial for plant growth [7–9] using models drawn from systems and population ecology [10]. How can the classic linear models [11] and newer models of systems biology [6] guide a discovery process, in which some GWAS studies have millions of SNPs, such as those on human height [12]?
A new approach to the design of large genomics experiments is introduced, one in which model-guided discovery is used adaptively [13] over time to find the variables that matter, considering a system in which the number of potential effects in the system (p) far exceeds the number of observations (n) on the system. The methodological approach utilizes ensemble methods [14] drawn from statistical physics [15] and, ultimately, Boltzmann’s 19th-century work [16]. The particular ensemble method explored here for model-guided discovery is called MINE, which stands for Maximally Informative Next Experiment [17].
Ensemble methods
Ensemble methods were developed in the 19th century by Boltzmann to describe the motion of an ideal gas [16]. In this situation, there is an Avogadro number (A) of particles in a 1 L box, but only three measurements are made: temperature, pressure, and volume. How can the motion of A particles, with 6A degrees of freedom, be described with only three measurements?
With so little data, the data did not strongly support just one model. Boltzmann’s solution was to give up on identifying one best model and instead to make predictions from an ensemble of models [15]. Omics experiments face exactly the same problem [14], but the paucity of data with respect to the complexity of the model is not as severe as in the problem Boltzmann faced. The interest may be in identifying the dynamics of genes and their products in carbon metabolism [14] or the biological clock [18], but genetics dictates that only a limited number of samples at different time points can be made to identify the system, while there are many parameters required to describe the system. For example, the number of measurements at different time points on the biological clock may be on the order of 60 000 measurements (n), but there are over 90 000 rate constants and initial conditions (p) in the model that must be estimated [19, 20]. Much as for an ideal gas, averaging over the ensemble allows for detailed predictions about complex biological systems, such as the clock.
A simple example is used to illustrate the approach of ensemble methods. The first step is writing down the model specification for the measurements. There are n measurements $Y = (Y_1, \ldots, Y_n)$ drawn from an unknown distribution parameterized by p parameters in $\theta = (\theta_1, \ldots, \theta_p)$, and some of these parameters are ancillary parameters, such as those in the example below, in which there is less interest. In addition, the variables $U = (U_1, U_2, \ldots)$ describe the experimental conditions. For example, the list U might specify the SNPs used in a GWAS field trial. Then the model specification would take the form:

$$P(Y \mid \theta, U) = C \exp[-H(Y; \theta, U)]$$

where C is a normalization constant chosen to make the integral over the data Y equal to 1. The quantity $H(Y; \theta, U)$ is known as the Hamiltonian. Ideally, the distribution would be observed directly, but, in practice, what is available are sample moments of the data Y.
Since the goal is to identify a model $\theta$ supported by the data Y, a change of viewpoint is needed. As in the method of maximum likelihood [21], the model specification $P(Y \mid \theta, U)$ is viewed as a function of the model parameters $\theta$, and the data Y and experimental conditions U are taken as fixed:

$$Q(\theta \mid Y, U) = \Omega^{-1} \exp[-H(\theta; Y, U)] \tag{1}$$

where $\Omega^{-1}$ is a normalization constant chosen to make the integral over all parameters $\theta$ in the parameter space equal to 1. This normalization constant is only a function of the data Y. The magnitude of the ensemble $Q(\theta \mid Y, U)$, or $Q(\theta)$ for short, is larger when the model $\theta$ is more supported by the data Y. It may be useful to think of the ensemble $Q(\theta \mid Y, U)$ as a posterior distribution to the model specification $P(Y \mid \theta, U)$, with the two functions, $P(Y \mid \theta, U)$ and $Q(\theta \mid Y, U)$, connected by Bayes theorem [22].
The ensemble $Q(\theta \mid Y, U)$, or $Q(\theta)$ for short, is the collection of models $\theta$ consistent with the available data Y. Model-averaging with respect to the model ensemble $Q(\theta)$ allows predictions about the system’s behavior. Instead of identifying one model $\theta$, a distribution of models $Q(\theta)$ is identified. With the number of parameters p being vastly greater than the number of data points n, predictions can still be made and tested with respect to averages computed from the ensemble $Q(\theta)$.
Monte Carlo methods are used to identify the ensemble $Q(\theta)$ [15, 23] because the model specifications are complicated [18, 24]. A simple example will illustrate how this is done. Take the Hamiltonian viewed as a function of $\theta$ as having the following simple form:
![]() |
The model parameter
is the one we are truly interested in, and the remaining parameters
and
are ancillary. A graph of the ensemble $Q(\theta)$ is shown in Fig. 1d. There are two maxima in the ensemble or, equivalently, two minima in the Hamiltonian. The goal is to reconstruct the ensemble $Q(\theta)$ by Monte Carlo for prediction.
Figure 1.
Convergence of an ensemble to the target distribution: (a) ensemble after 20 moves; (b) ensemble after 100 moves; (c) ensemble after 1000 moves; (d) true ensemble or “target distribution.” (d) shows the target distribution as the ensemble converges to the true target distribution under a Monte Carlo experiment (a–c). The starting guess at the parameter
was 3. After each of 20, 100, and 1000 moves, 10 000 samples of
from the resulting distribution were drawn to characterize the ensemble. The ancillary parameters were
and
The plots were created in MATLAB_R2018B (https://www.mathworks.com/products/matlab.html).
In this example, we are in the perfect world in which the ensemble, or equivalently the Hamiltonian, is observed from 10 000 values after each move in the Monte Carlo experiment. To reconstruct the ensemble $Q(\theta)$ by Monte Carlo, at each move a new model parameter $\theta'$ is drawn from the ensemble $Q(\theta)$ when the current proposal is the model parameter $\theta$. The goal is to move into a region of the parameter space that is well supported by the ensemble $Q(\theta)$ in the equilibration phase. Once equilibrated, many thousands or tens of thousands of well-supported models are accumulated to reconstruct the ensemble from the sample histogram of these $\theta$-values [18]. The question remains how to choose the well-supported $\theta$-values.
One greedy approach to moving in the parameter space is to draw a model parameter $\theta'$ and proceed uphill using some procedure like steepest ascent to climb the hill(s) in the ensemble. As shown in Fig. 1, this might lead to a local maximum. In fact, in Fig. 1, there are two such maxima. To avoid local maxima, a model parameter $\theta'$ is drawn randomly from the ensemble $Q(\theta)$, being greedy when there is an improvement in the ensemble probability, i.e. $Q(\theta') > Q(\theta)$ or equivalently $H(\theta') < H(\theta)$, but occasionally, when $Q(\theta') < Q(\theta)$, moving downhill anyway. The occasional downhill move may allow escape from a local maximum. In practice in systems biology, it may be more appropriate to think of the ensemble surface as gently rolling hills, as on a golf course, because the data are limited (p > n).
Metropolis and colleagues [25] developed a stochastic search procedure in statistical physics for this and now many other optimization problems [23]. The probability of a move is:

$$p(\theta \to \theta') = \min\left\{1, \frac{Q(\theta')}{Q(\theta)}\right\} = \min\left\{1, \exp\left[-\left(H(\theta') - H(\theta)\right)\right]\right\}$$

A move from $\theta$ to $\theta'$ occurs with probability 1 if the proposed move takes us uphill, but if the proposed move takes us downhill, then the probability of a move downhill decreases with the amount of drop from $Q(\theta)$ to $Q(\theta')$. The sequence of moves is made 10 000 or more times to move into a region of the parameter space well supported by the data during the equilibration phase.
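The Metropolis search described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the double-well `hamiltonian`, the Gaussian random-walk proposal, the step size, and the number of moves are all assumptions chosen only so that the ensemble has two maxima, as in Fig. 1.

```python
import numpy as np

def hamiltonian(theta):
    # Hypothetical double-well Hamiltonian (an assumption for illustration):
    # two minima at theta = -1 and theta = +1, so the ensemble has two maxima.
    return 4.0 * (theta**2 - 1.0) ** 2

def metropolis(h, theta0, n_moves, step=0.5, seed=0):
    """Run n_moves Metropolis moves from theta0 and return the visited values."""
    rng = np.random.default_rng(seed)
    theta = theta0
    chain = np.empty(n_moves)
    for i in range(n_moves):
        proposal = theta + rng.normal(0.0, step)   # symmetric random-walk proposal
        delta_h = h(proposal) - h(theta)
        # Uphill in the ensemble (delta_h <= 0): always move.
        # Downhill: move with probability exp(-delta_h).
        if delta_h <= 0.0 or rng.random() < np.exp(-delta_h):
            theta = proposal
        chain[i] = theta
    return chain

chain = metropolis(hamiltonian, theta0=3.0, n_moves=20_000)
accumulated = chain[5_000:]    # drop the equilibration phase, keep the accumulation phase
hist, edges = np.histogram(accumulated, bins=40, range=(-2, 2), density=True)
```

The histogram of the accumulated values plays the role of the reconstructed ensemble; the occasional accepted downhill move is what lets the chain cross between the two maxima.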
In the equilibration phase, the inferred ensemble converges to the true ensemble known as the target distribution (Fig. 1). The Monte Carlo search for this simple model is successful in the reconstruction in <1000 moves and is surprisingly quick, equilibrating in <20 moves. A video displays the reconstruction process (Video S1). Once equilibration is achieved, another sequence of Monte Carlo moves called the accumulation phase is used to build the target distribution. In this simple example, only 1000 moves are needed to carry the ensemble identification into the accumulation phase.
In practice, a sweep is introduced to describe the number of moves taken to visit each model parameter on average once. A standard equilibration run and an accumulation run is 40 000 sweeps, which will vary in practice with the complexity of the model [18, 19].
MINE
Once an ensemble method produces a collection of models supported by the data, it is possible to make predictions from the ensemble distribution about the next experiment. By averaging some variable of interest over the models in the ensemble distribution $Q(\theta \mid Y, U)$, a prediction can be made, given the current data Y and the experimental conditions U. For example, Y might be plant biomasses measured in Year 1 of a 5-year GWAS experiment to identify SNPs to predict biomass in sorghum with a certain collection of SNPs from the Bioenergy Accession Panel (BAP) [26]. The question is what accessions are to be used in Year 2 to specify design U. In Year 1, 79 accessions are measured in the GWAS and can be used to inform the SNP choice in Year 2.
One way to make this choice is to select experimental conditions permitting us to distinguish the models in the ensemble $Q(\theta)$ identified from Year 1. The best way to distinguish experimentally two models randomly chosen from the model ensemble is if the predictions $E(Y \mid \theta, U)$ of each model ($\theta$) are orthogonal, as shown in Fig. 2. For experiment 1 on the left, the predictions of the two selected models $\theta$ and $\theta'$ are correlated and are harder to distinguish under experimental conditions U1. The same two models under experimental conditions U2 are easier to distinguish: model $\theta$ is easily tested against model $\theta'$. The goal of a MINE criterion is then to make “the angle” between the two predictions of a random pair in the ensemble as large as possible on average in Year 2 as a function of the experimental conditions U and the current ensemble $Q(\theta)$ identified from the data in Year 1.
Figure 2.
Two models can be better distinguished by their predictions in the next experiment if their predictions are less correlated. The predictions of models $\theta$ and $\theta'$ under experimental condition U1 are the expectations $E(Y \mid \theta, U_1)$ and $E(Y \mid \theta', U_1)$, respectively. If the two models $\theta$ and $\theta'$ are chosen independently from the model ensemble $Q(\theta)$, the expectations are calculated with respect to the product density $Q(\theta)\,Q(\theta')$, where U = U1 or U2.
There are two standard ways to measure the associations between the predictions [6]. One is by the covariances between the predictions of the data Y (MINE by Covariance Ellipsoid Volume); the other is by the correlations between the predictions of the data Y (MINE by Correlation Ellipsoid Volume). There are a variety of reasons for advocating the use of MINE by Correlation Ellipsoid Volume [6]. One of the main reasons is that when there are a large number (p ≫ n) of almost linearly dependent observations, as found in practice, it is highly desirable to emphasize the new directions in the data Y, as done by Correlation Ellipsoid Volume. The new directions in Y in the next year depend on the choice of design matrix X. Denote by E the correlation matrix between the components of Y. The MINE Correlation Ellipsoid Volume is then a determinant (det):

$$V(U) = \det(E(U))$$
When the predictions are on average highly correlated (Fig. 2a), the determinant is nearly zero. When the predictions are nearly orthogonal (going in new directions) (Fig. 2b), the determinant is nearly 1.
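The behavior of the determinant at these two extremes can be checked numerically. The sketch below is only an illustration under stated assumptions: the function name `mine_correlation_volume` is hypothetical, and the two sets of "predictions" are synthetic random numbers standing in for model predictions under two candidate designs, with one row per model drawn from the ensemble and one column per predicted observable.

```python
import numpy as np

def mine_correlation_volume(predictions):
    """Determinant of the correlation matrix of predictions (rows = ensemble members)."""
    corr = np.corrcoef(predictions, rowvar=False)
    return float(np.linalg.det(corr))

rng = np.random.default_rng(1)
base = rng.normal(size=(5000, 1))
# Hypothetical design U1: two predicted observables that move almost together.
correlated = np.hstack([base, base + 0.05 * rng.normal(size=(5000, 1))])
# Hypothetical design U2: two predicted observables that vary independently.
independent = rng.normal(size=(5000, 2))

v1 = mine_correlation_volume(correlated)    # close to 0
v2 = mine_correlation_volume(independent)   # close to 1
```

Under a criterion of this kind, the design with the larger determinant (here the second one) would be preferred for the next experiment.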
A microscope analogy [11] provides insights into how MINE works (Fig. 3). MINE is highly analogous to a microscope and its optics. The object in the microscope field described by the data Y is the observed system. MINE, like the optics of the microscope, picks up each component of Y through the prediction $E(Y \mid \theta, U)$ about the system. For example, $E(Y \mid \theta, U)$ could be the list of predictions of plant biomasses in a GWAS study. The optics ($E(Y \mid \theta, U)$) and, likewise, MINE then magnify the predictions to create the image or model of the system (Fig. 3).
Figure 3.
MINE is analogous in function to the optics on a microscope. The data Y are the objects in the field of view. The models $\theta$ are in the image. The MINE criterion with the predictions $E(Y \mid \theta, U)$ is the optics. The uncertainty volume in the image is the magnification measured by the MINE criterion, V(U) = det(E(U)). From [44].
The microscope has a field of view of the object, which we refer to as the Uncertainty Volume of the new experiment Y. The uncertainty in the observations on the field of view comes from our uncertainty about the optics controlled by $\theta$ and in the measurements Y on the object. The optics (predictions) then translate the Uncertainty Volume V(U) in the sample space into an image, the Uncertainty Volume in the parameter space. The result is that an Uncertainty Volume in the sample space (object) is mapped by the optics $E(Y \mid \theta, U)$ to the Uncertainty Volume in the parameter space (image).
The magnification applied to the object is adjusted to reduce the Uncertainty Volume in the parameter space (image). Another interpretation of the image quality is given by the determinant det(E(U)). The determinant is a measure of the volume of a parallelepiped defined by the Uncertainty Volume in the sample space [27]. The determinant is also a measure of the Uncertainty Volume in the parameter space (inside the ensemble). As the magnification knob is twiddled, the clarity of the image (model parameters) is increased and uncertainty is reduced (Fig. 3). If the parallelepiped is squashed in the parameter space, fewer details from the observations Y in the sample space are being retained in imaging (i.e. model fitting). MINE is doing the focusing and representing the object in higher clarity in the image constructed by the observer using MINE.
A simple model for predicting hyphal extension colonization by arbuscular mycorrhizal fungi in plant roots to illustrate MINE
Mixture experiments are used in population genetics [28] and science and engineering in general [29]. Mixture experiments are examples of linear models that are the focus of experimental design [2]. In these designs, there is a mixture of treatments in different proportions affecting some dependent variables of interest. Mixture experiments can be used to study how different arbuscular mycorrhizal fungi (AMF) affect the health of the plant through colonization of the root system. The assembly of the AMF biome in plant roots is a product of choices imposed by the plant genotype [30, 31], competition between AMF, ecological drift [9], historical contingency [32], abiotic factors such as phosphorus (P) and nitrogen (N) in the soil [33, 34], and other factors. Consider three AMF species, S1, S2, and S3, competing for colonization area in the plant roots of sorghum [9] of ‘one’ plant genotype. These AMF are potential partners with the plant in one of the oldest symbioses on the planet [35]. Potentially the plant provides carbon, and, in return, potentially, the AMF hyphal network provides P and N, like an extended root system. The success of this partnership is measured in part by AMF hyphal extension in the roots and the resulting biomass of the plant host [36]. To study this symbiosis, the experimenter inoculates sorghum with a mixed population of at least 10% of the coenocyte cells being S1, at least 15% being S2, and at least 5% of the coenocyte cells being S3. The mixed coenocyte inoculum is a coculture in the plant root cells. Denoting the respective spore fractions by u1, u2, and u3, these fractions are thus constrained by lower limits,

$$u_1 \ge 0.10, \quad u_2 \ge 0.15, \quad u_3 \ge 0.05 \tag{2}$$

and by the normalization condition

$$u_1 + u_2 + u_3 = 1 \tag{3}$$

Given eq. (3), only two of the three species fraction values can be freely chosen. In the following, we will use proportions u1 and u2 as those two free variables, with u3 then being determined via eq. (3). Furthermore, the proportions u1 and u2 are then subject to upper and lower bounds, resulting from Eqs. (2) and (3). When referring, below, to experimenters freely choosing (u1, u2, u3), it should be understood that these choices must be within the constraints imposed by conditions (2) and (3).
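The constraints on the inoculum can be made concrete with a short check. The sketch below assumes the fractions are expressed as proportions summing to 1 and uses the lower limits stated in the text (10%, 15%, and 5%); the function names `complete_mixture` and `implied_bounds` are hypothetical.

```python
LOWER = (0.10, 0.15, 0.05)   # lower limits on (u1, u2, u3) from the text

def complete_mixture(u1, u2, tol=1e-12):
    """Return (u1, u2, u3) if the design satisfies both conditions, else None."""
    u3 = 1.0 - u1 - u2                    # normalization fixes the third fraction
    u = (u1, u2, u3)
    if all(ui >= lo - tol for ui, lo in zip(u, LOWER)):
        return u
    return None

def implied_bounds():
    """Lower and upper bound on each fraction implied by the two conditions."""
    total_lower = sum(LOWER)
    # Each fraction can grow until the other two sit at their lower limits.
    return [(lo, 1.0 - (total_lower - lo)) for lo in LOWER]

feasible = complete_mixture(0.50, 0.30)     # u3 = 0.20, allowed
infeasible = complete_mixture(0.60, 0.38)   # u3 = 0.02, below the 5% limit
bounds = implied_bounds()                   # u1 in [0.10, 0.80], u2 in [0.15, 0.85]
```

The two free proportions therefore live in a triangle: u1 between 0.10 and 0.80, u2 between 0.15 and 0.85, with u1 + u2 no larger than 0.95.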
Assume that, by setting appropriate experimental conditions, the experimenter can construct an inoculum with a constant total spore population size, Nc, and constant species fractions, u1, u2, and u3. Assume also that, subject to the foregoing constraints (2) and (3), the experimenter can precisely set the values of u1, u2, u3, and Nc.
Each of the three AMF taxa can increase its rate of occupancy of the root space in the plant, denoted by $\theta_1$, $\theta_2$, and $\theta_3$, for species S1, S2, and S3, respectively. The experimenter wishes to determine, or at least impose constraints on, the values of these rates in percent area increase, $\theta_1$, $\theta_2$, and $\theta_3$, by performing a sequence of time-series experiments wherein the linear filament extension in a root image, denoted by y(t), is measured as a function of time, t, at certain time points,

$$t_1, t_2, \ldots, t_K$$

Here, K is the total number of experimental observation time points. Each experiment thus produces a series of observed filament extension amounts, y(tk) for k = 1,2,…, K, denoted by

$$y_k = y(t_k)$$

That is, yk is the value of y(t) observed at time tk, with k = 1,2, …, K labeling the different observation time points. Each of these experiments is to be performed on a spore population begun with a different combination, (u1, u2, u3), of AMF inoculation fractions. For simplicity, assume, however, that the values of the rates of hyphal extension, $\theta_1$, $\theta_2$, and $\theta_3$, remain the same throughout all these experiments, i.e. assume that the hyphal extension rates, $\theta_1$, $\theta_2$, and $\theta_3$, do not change when the experimenter changes the population composition (u1, u2, u3) from one experiment to the next, as in a race tube experiment [6, 37]. For simplicity, we will refer to $\theta_1$, $\theta_2$, and $\theta_3$ as the rates of colonization success.
The extraction of any information about the success rates in root colonization, $\theta_1$, $\theta_2$, and $\theta_3$, from the experimental time series data, yk, requires a ‘mathematical model’ that treats the rates $\theta_1$, $\theta_2$, and $\theta_3$, as well as the known ‘experimental control parameters’, u1, u2, and u3, as input parameters. The model must then use these input parameters to provide a ‘predicted value’ for each experimental observation, yk, the hyphal extension colonized in a plant root. For a given experimental data point, yk, we denote the corresponding value predicted by the model by fk. Obviously, whatever the model predicts depends on the model input parameters, $\theta_1$, $\theta_2$, $\theta_3$, u1, u2, and u3, that were used to make the prediction. We will therefore often write fk as a ‘function’ of these input parameters, i.e. as

$$f_k = f_k(\theta_1, \theta_2, \theta_3, u_1, u_2, u_3)$$

to make it explicit that fk is dependent on the assumed values of the rate parameters $\theta_1$, $\theta_2$, and $\theta_3$, and on the given values of the control parameters, u1, u2, and u3, set by the experimenter.
For the scenario assumed here, i.e. for a mixed population of spores from three AMF species jointly producing percent root colonization, X, at constant rates per spore cell type, a simple mathematical model for fk is easy to construct. Assume that the mixed cell population is established, and starts producing colonization, at time $t = 0$, with no initial colonization length X being present at that time. Then, the total percent colonization X produced by the entire AMF population in the roots, by observation time $t_k$, is given by:

$$f_k(\theta_1, \theta_2, \theta_3, u_1, u_2, u_3) = N_c \left(u_1 \theta_1 + u_2 \theta_2 + u_3 \theta_3\right) t_k \tag{4}$$

The percent colonization X can be measured in roots by bright field microscopy [38–41].
To understand this linear model, which is linear in the model parameters $\theta_1$, $\theta_2$, and $\theta_3$, recall here that $N_c$ is the total number of AMF in the inoculum, and hence $N_c u_1$ is the number of S1-cells in the inoculum. Hence, $N_c u_1 \theta_1$ is the rate of increase by all S1-cells combined producing percentage root area X-contribution. Each spore produces a hyphopodium by which to colonize the root cortex. Recall now that

$$\text{amount colonized} = \text{rate} \times \text{time} \tag{5}$$

Hence, the length colonized, by all AMF S1-cells combined, by time $t_k$, is $N_c u_1 \theta_1 t_k$. Likewise, the lengths colonized of the roots produced by all AMF S2-cells and by all S3-cells, by time $t_k$, are $N_c u_2 \theta_2 t_k$ and $N_c u_3 \theta_3 t_k$, respectively. We then obtain fk, i.e. the predicted total amount of hyphal extension colonized X produced by all cells until time $t_k$, by simply adding up the foregoing three X-contributions from all three AMF species. The result is eq. (4).
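As a minimal sketch of this additive model, the function below adds up the three species contributions, each taken as (number of cells of that species) × (rate per cell) × (elapsed time), with colonization starting at time zero. The function name, the argument names (`theta` for the three rates, `n_c` for the total inoculum size), and all numerical values are assumptions made for illustration only.

```python
import numpy as np

def predict_colonization(theta, u, times, n_c):
    """Predicted colonization at each time: n_c * (u1*r1 + u2*r2 + u3*r3) * t_k."""
    theta = np.asarray(theta, dtype=float)   # per-cell rates for S1, S2, S3
    u = np.asarray(u, dtype=float)           # inoculation fractions (sum to 1)
    return n_c * np.dot(u, theta) * np.asarray(times, dtype=float)

theta_example = [0.02, 0.05, 0.01]   # hypothetical rates
u_example = [0.5, 0.3, 0.2]          # hypothetical design satisfying the constraints
times = [1.0, 2.0, 4.0]              # hypothetical observation times
f = predict_colonization(theta_example, u_example, times, n_c=100)
```

Because every prediction is proportional to the single combination u1·rate1 + u2·rate2 + u3·rate3, one experiment alone cannot separate the three rates; several experiments with different inoculum compositions are needed.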
Suppose we have performed multiple experiments, to be labeled by an “experiment index” $l = 1, 2, \ldots, L$, where L is the total number of experiments. In each experiment, a different AMF species composition (u1, u2, u3) was used. To distinguish these u1, u2, and u3, used in the different experiments, we therefore have to label ‘them’ with the additional index $l$, as $u_1^{(l)}, u_2^{(l)}, u_3^{(l)}$, for $l = 1, 2, \ldots, L$. Consequently, a different time series of X-data, y1, y2, …, yK, was observed in each experiment, and we therefore also have to label the observed data, y1, y2, …, yK with the additional index $l$, as $y_1^{(l)}, y_2^{(l)}, \ldots, y_K^{(l)}$, for $l = 1, 2, \ldots, L$. Also assume that each data point, $y_k^{(l)}$, has been measured with some experimental uncertainty, quantified by an experimental standard deviation $\sigma_k^{(l)}$. The $\chi^2$-function (or by another name, the Hamiltonian) is then given by:

$$\chi^2(\theta, U) = \sum_{l=1}^{L} \sum_{k=1}^{K} \frac{\left[y_k^{(l)} - f_k(\theta, u^{(l)})\right]^2}{\left(\sigma_k^{(l)}\right)^2} \tag{6}$$
To simplify and compactify the notation, we have introduced here the following abbreviations:

$$\theta = (\theta_1, \theta_2, \theta_3) \tag{7}$$

$$u^{(l)} = \left(u_1^{(l)}, u_2^{(l)}, u_3^{(l)}\right) \tag{8}$$

$$U = \left(u^{(1)}, u^{(2)}, \ldots, u^{(L)}\right) \tag{9}$$

That is, $\theta$ (without subscript) is shorthand for a vector that comprises the rates of colonization $\theta_1$, $\theta_2$, and $\theta_3$. The $u^{(l)}$ (without subscript) denotes the vector of the three AMF species inoculation fractions used in experiment number $l$, and U is the vector comprising the species fractions from ‘all’ experiments combined. Note that $\theta$ does not have an $l$-superscript here because $\theta_1$, $\theta_2$, and $\theta_3$ are assumed to have the same values in all experiments.
Note that $f_k(\theta, u^{(l)})$ is the model prediction of hyphal extension, from eq. (4), for $y_k^{(l)}$, i.e. for the kth time series data point for percent root area colonized observed in the $l$th experiment. The square of the so-called ‘residual’, on the right-hand side of eq. (6),

$$\left[y_k^{(l)} - f_k(\theta, u^{(l)})\right]^2 \tag{10}$$

thus measures the deviation of the model prediction $f_k(\theta, u^{(l)})$ from the experimental observation of hyphal extension $y_k^{(l)}$: the larger this squared residual, the worse, i.e. greater, is the deviation of the model prediction, $f_k(\theta, u^{(l)})$, from the observed data point, $y_k^{(l)}$. By taking the sum of all squared residuals, the $\chi^2$-function in eq. (6) thus provides a composite measure of the overall deviation of the model predictions from the data, for ‘all’ data points on hyphal length colonized combined. In the least-squares fitting approach, the “best possible” choice of model parameters is then obtained by finding a parameter combination, $\theta$, which minimizes this deviation, i.e. by minimizing $\chi^2(\theta, U)$ with respect to $\theta_1$, $\theta_2$, and $\theta_3$. In the following, let $\hat{\theta}$ denote the best possible parameter combination that minimizes $\chi^2(\theta, U)$.
Note, in passing, that the squared residuals entering into $\chi^2(\theta, U)$ in eq. (6) are weighted by the reciprocals of the variances, $(\sigma_k^{(l)})^2$. This means that experimental data points with larger experimental uncertainties carry less weight and have less of an effect on the choice of the optimal, “best match” parameter combination, $\hat{\theta}$, than data points with smaller experimental uncertainties. In that sense, $\hat{\theta}$ can be regarded as a “weighted compromise” between all data points, $y_k^{(l)}$.
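The weighted least-squares fit described above can be sketched as follows. Since the mixture model is linear in the three rates, the minimizer of the weighted sum of squared residuals can be found with an ordinary linear solver after dividing each data point and its model coefficients by its standard deviation. Everything specific here is an assumption for illustration: the function names, the three designs, the observation times, the rates used to simulate noise-free data, and the inoculum size `n_c`.

```python
import numpy as np

def chi_square(theta, designs, times, y, sigma, n_c):
    """Sum over experiments l and times k of ((y - prediction) / sigma)^2."""
    pred = n_c * (designs @ theta)[:, None] * times[None, :]   # shape (L, K)
    return float(np.sum(((y - pred) / sigma) ** 2))

def best_fit(designs, times, y, sigma, n_c):
    """Weighted linear least squares for the rates (the model is linear in them)."""
    # One row per data point (l, k): coefficients n_c * t_k * u_i^(l), scaled by 1/sigma.
    a = n_c * (designs[:, None, :] * times[None, :, None]) / sigma[:, :, None]
    a = a.reshape(-1, designs.shape[1])
    b = (y / sigma).reshape(-1)
    theta_hat, *_ = np.linalg.lstsq(a, b, rcond=None)
    return theta_hat

# Hypothetical designs (rows are u^(l)), times, and rates.
designs = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.2, 0.5]])
times = np.array([1.0, 2.0, 4.0])
theta_sim = np.array([0.02, 0.05, 0.01])
y = 100 * (designs @ theta_sim)[:, None] * times[None, :]      # noise-free simulated data
sigma = np.full_like(y, 0.1)
theta_hat = best_fit(designs, times, y, sigma, n_c=100)
```

With noise-free data and three linearly independent designs the fit recovers the simulated rates exactly; with fewer independent designs the minimizer would not be unique, which is one motivation for working with an ensemble rather than a single best fit.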
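While there are, in principle, many different ways to define an ensemble probability distribution function having these general characteristics, an obvious, simple choice for eq. (1), supported by statistical theory [11], is given by:

$$Q(\theta) = \Omega^{-1} \exp\left[-\tfrac{1}{2} \chi^2(\theta, U)\right] \tag{11}$$

The $\Omega^{-1}$-factor in eq. (11) is a normalization factor, chosen to ensure that the ensemble probability density function or PDF integrates to a probability of 1. That is, for our model for a mixture experiment with $\theta = (\theta_1, \theta_2, \theta_3)$, the $\Omega$ is chosen such that

$$\int_{\theta_{\min}}^{\theta_{\max}} d\theta_1 \int_{\theta_{\min}}^{\theta_{\max}} d\theta_2 \int_{\theta_{\min}}^{\theta_{\max}} d\theta_3 \; Q(\theta) = 1 \tag{12}$$

Here, $\theta_{\min}$ and $\theta_{\max}$ denote, respectively, a reasonable lower and upper limit imposed on $\theta_1$, $\theta_2$, and $\theta_3$. Eq. (11) is then to be understood to hold only when $\theta_1$, $\theta_2$, and $\theta_3$ each falls within the interval between $\theta_{\min}$ and $\theta_{\max}$; if $\theta_1$, $\theta_2$, or $\theta_3$ lies outside of this interval, we set $Q(\theta) = 0$.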
Notice that $Q(\theta)$ in eq. (11) has the desired general characteristics: for very large values of $\chi^2(\theta, U)$, the exponential function in eq. (11), and hence $Q(\theta)$, becomes very small; for smaller values of $\chi^2(\theta, U)$, $Q(\theta)$ becomes larger. Hence, $\theta$-choices whose model predictions agree poorly with the experimental data will have a low probability of being drawn from $Q(\theta)$; $\theta$-choices whose model predictions agree well with the experimental data will have a higher probability of being drawn from $Q(\theta)$.
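Given $Q(\theta)$, we can now calculate, for example, expectation values, variances, and histograms of any observable quantity, $A(\theta)$, which the model allows us to predict as a function of $\theta$. Specifically for the expectation value, $E[A]$, and variance, $\mathrm{Var}[A]$, of such an “observable” $A(\theta)$, we need to calculate:

$$E[A] = \int d\theta_1 \int d\theta_2 \int d\theta_3 \; Q(\theta)\, A(\theta) \tag{13}$$

with $A \equiv A(\theta)$ and $Q \equiv Q(\theta)$ for short, and then

$$\mathrm{Var}[A] = E[A^2] - \left(E[A]\right)^2 \tag{14}$$

Here, $E[A^2]$ is obtained, analogous to $E[A]$, with $A(\theta)$ in eq. (13) replaced by $A(\theta)^2$.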
Within the ensemble approach, $E[A]$ can serve as a prediction of a representative value of $A(\theta)$, given the experimental control parameters U and prior experimental data, $y_k^{(l)}$ for all $l$ and all k. However, the ensemble approach also allows us to evaluate the ‘uncertainty’ of that prediction, by way of $\mathrm{Var}[A]$. Furthermore, with similar expectation value calculations, we can also analyze in more detail the random distribution of $A(\theta)$ by way of histograms of all possible A-values. This would tell us, for example, if the values of $A(\theta)$ have a uni- or a multimodal distribution, for random $\theta$s drawn from the ensemble $Q(\theta)$.
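Ensemble expectations and variances of this kind can be approximated by averaging over models sampled from the ensemble. The sketch below is an illustration under explicit assumptions, not the authors' code: the ensemble is taken to be proportional to exp(-chi-square / 2) inside fixed parameter limits and zero outside, the sampler is a plain coordinate-wise Metropolis scheme, and the designs, times, noise level, limits, step size, and simulated rates are all hypothetical. The split into 3000 equilibration sweeps and 1000 accumulation sweeps mirrors the numbers quoted for Fig. 4a.

```python
import numpy as np

def chi_square(theta, designs, times, y, sigma, n_c):
    """Weighted sum of squared residuals over all experiments and time points."""
    pred = n_c * (designs @ theta)[:, None] * times[None, :]
    return float(np.sum(((y - pred) / sigma) ** 2))

def sample_ensemble(log_q, theta0, n_sweeps, step, limits, seed=0):
    """Metropolis sampler; one sweep proposes a move for each parameter once."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    current = log_q(theta)
    samples = np.empty((n_sweeps, theta.size))
    for s in range(n_sweeps):
        for i in range(theta.size):
            proposal = theta.copy()
            proposal[i] += rng.normal(0.0, step)
            if not (limits[0] <= proposal[i] <= limits[1]):
                continue                  # ensemble is zero outside the limits
            new = log_q(proposal)
            if np.log(rng.random()) < new - current:
                theta, current = proposal, new
        samples[s] = theta
    return samples

# Hypothetical simulated mixture experiments.
n_c = 100.0
designs = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.2, 0.5]])
times = np.array([1.0, 2.0, 4.0])
theta_sim = np.array([0.02, 0.05, 0.01])
sigma = np.full((3, 3), 0.2)
noise = np.random.default_rng(42).normal(0.0, 0.2, size=(3, 3))
y = n_c * (designs @ theta_sim)[:, None] * times[None, :] + noise

def log_q(theta):
    # Assumed ensemble form: Q proportional to exp(-chi^2 / 2).
    return -0.5 * chi_square(theta, designs, times, y, sigma, n_c)

samples = sample_ensemble(log_q, [0.05, 0.05, 0.05], 4000, 0.002, (0.0, 0.2))
accumulated = samples[3000:]               # accumulation phase only
expectation = accumulated.mean(axis=0)     # ensemble average of each rate
variance = accumulated.var(axis=0)         # ensemble variance of each rate
```

Here the observable is simply each rate itself; any other function of the sampled parameters can be averaged the same way, and a histogram of its values over `accumulated` shows whether its distribution is uni- or multimodal.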
These are just a few examples of what kinds of data analyses and model predictions the ensemble approach itself allows us to implement. In the context of the MINE approach of experiment design, we will have to evaluate certain correlations between pairs of observables, $A(\theta)$ and $B(\theta)$, say. This will require the calculation of expectation values of the general form $E[AB]$, with $A(\theta)$ in eq. (13) replaced by the product $A(\theta)\,B(\theta)$.
The evaluation of all the foregoing expectation values usually requires numerical techniques to carry out the $\theta$-integrations as in eq. (13). In general, the $\theta$-space is very high-dimensional, far greater than the $\theta$-dimension of p = 3 in our simple model here. Markov chain Monte Carlo methods are then the only approach available to perform the required expectation value calculations efficiently for omics experiments and field studies [15] (see Ensemble Methods section). Figure 4a shows a simulation of the mixture experiment with the application of the ensemble method to the simulated data. The ensemble method converges quite well to the true colonization rates $\theta_1$, $\theta_2$, and $\theta_3$.
Figure 4.
An ensemble method to identify the mixture experiment’s hyphal extension rates $\theta_1$, $\theta_2$, and $\theta_3$ is carried out on simulated data from the mixture experiment, and MINE is used to choose the next mixture experiment with inoculation proportions u1 and u2. (a) illustrates the ensemble method on simulated data for a mixture experiment of AMF colonizers. The orange lines are the true colonization rates $\theta_1$, $\theta_2$, and $\theta_3$. In the Monte Carlo experiment, the estimated rates are plotted as a function of sweep, a visit on average once to each of the three rates $\theta_1$, $\theta_2$, and $\theta_3$. In the first 3000 sweeps, the Monte Carlo experiment is equilibrated to get in the neighborhood of parameters $\theta$ that fit the simulated data. In the accumulation phase (last 1000 sweeps), the estimates of $\theta$ are accumulated to form the ensemble estimate. (b) presents the next MINE mixture experiment recommended. The contour plot is of the MINE criterion det(E) as a function of the mixture of spore inoculum proportions u1 and u2.
Assume L prior experiments have already been performed, with experimental control parameter vectors $u^{(l)}$, as defined in eq. (8), and observed values $y_k^{(l)}$, for $l = 1, \ldots, L$ and $k = 1, \ldots, K$. The experimental data points, $y_k^{(l)}$, combined with the corresponding model predictions, $f_k(\theta, u^{(l)})$ from eq. (4), define an ensemble PDF, $Q(\theta)$, via eqs. (6) and (11). The $Q(\theta)$, in turn, will determine $V(u)$, the predicted uncertainty volume of the observables, to be measured in the new experiment(s), as follows:
As the simplest case, assume that we want to design just ‘one’ new experiment, with a new experimental control parameter vector $u = (u_1, u_2, u_3)$. The MINE objective is then to choose the input inoculation proportions $u$ so as to maximize the information content of the new experiment about the rates of production of colonization by hyphal extension (X), by maximizing the predicted uncertainty volume of the observables to be measured. In our simple mixture experiment example, there are K such observables in any experiment: the hyphal extension X-amounts to be measured at times tk, for $k = 1, \ldots, K$. The predicted values for these observed hyphal extensions X are then $f_k(\theta, u)$, as defined by the model eq. (4), for given $\theta$ and $u$. These predicted values for these K observations can be thought of as the components of a vector in a K-dimensional space, the so-called ‘observation space’ (Fig. 3). For a given $\theta$ and $u$, this vector of ‘predicted observations’ is in the following denoted by $f(\theta, u)$, and given by

$$f(\theta, u) = \left(f_1(\theta, u), f_2(\theta, u), \ldots, f_K(\theta, u)\right) \tag{15}$$
We can now use the ensemble PDF, $Q(\theta)$, to define, in some way, a volume of likely parameter vectors $\theta$ in $\theta$-space. If we let $\theta$ sweep over that finite volume then, by eq. (15), the predicted observation vector $\mathbf{F}(\theta, u)$ will sweep over some corresponding finite volume (or hyper-surface) in the observation space: the ‘uncertainty volume’ of the predicted observation vector, to be denoted by V(u), for given u, and illustrated in Fig. 3.
There is, of course, no precise prescription of how to define a volume of likely $\theta$ in $\theta$-space, or a corresponding uncertainty volume, V(u), in observation space. That definition is not unique: it requires some arbitrary but reasonable choices to be made. In the following, two specific possible choices for V(u) will be discussed.
MINE by Covariance Ellipsoid Volume
In the covariance matrix approach, we define V(u) in terms of the uncertainty ellipsoid, constructed from the covariances of the K observable predictions, $F_k(\theta, u)$, subject to the ensemble PDF $Q(\theta)$. Let $D_{kl}(u)$ denote those covariance matrix elements, i.e. for $k, l = 1, \ldots, K$, let

$$D_{kl}(u) = E[F_k F_l] - E[F_k]\,E[F_l] \tag{16}$$
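with expectation values E[…] defined as in eq. (13). On general mathematical grounds, the corresponding K × K covariance matrix, D, is symmetric and positive semidefinite. Therefore, D has K real, non-negative eigenvalues, $\lambda_1, \ldots, \lambda_K$; and it has an orthonormal basis of corresponding K-dimensional eigenvectors, $\mathbf{e}_1, \ldots, \mathbf{e}_K$. That is, for $k, l = 1, \ldots, K$ with $k \neq l$, we have:

$$D\,\mathbf{e}_k = \lambda_k\,\mathbf{e}_k \tag{17}$$

$$\lambda_k \geq 0 \tag{18}$$

$$\mathbf{e}_k \cdot \mathbf{e}_l = 0 \tag{19}$$

$$\mathbf{e}_k \cdot \mathbf{e}_k = 1 \tag{20}$$

The eigenvalues and eigenvectors are, of course, dependent upon u, but, for notational simplicity, we have suppressed that functional dependence, i.e. $\lambda_k \equiv \lambda_k(u)$ and $\mathbf{e}_k \equiv \mathbf{e}_k(u)$ in eqs. (17–20).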
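By eq. (19), the eigenvectors, $\mathbf{e}_k$, are orthogonal, i.e. pairwise perpendicular to each other. The eigenvalues, $\lambda_k$, are the variances of the predicted observation vector, $\mathbf{F}(\theta, u)$, along the corresponding eigenvector directions. That is, if we take the projection of the vector $\mathbf{F}(\theta, u)$ onto $\mathbf{e}_k$, i.e. let $f_k$ denote that projection, with

$$f_k = \mathbf{e}_k \cdot \mathbf{F}(\theta, u) \tag{21}$$

then $\lambda_k$ is the variance of that projected $\mathbf{F}$-vector:

$$\lambda_k = \mathrm{Var}[f_k] = E\bigl[(f_k - E[f_k])^2\bigr] \tag{22}$$

where the variance Var[…] is defined as in eq. (14).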
The eigenvalues $\lambda_k$ and eigenvectors $\mathbf{e}_k$ of D define the so-called “covariance ellipsoid” or “error ellipsoid” of the predicted observation vector, $\mathbf{F}(\theta, u)$, in the K-dimensional observation space: the eigenvectors, $\mathbf{e}_k$, can be thought of as the orientations of the principal axes of the ellipsoid; the standard deviations of the projections of $\mathbf{F}$ onto $\mathbf{e}_k$, i.e. $\sqrt{\lambda_k}$, are the lengths of the principal semi-axes along the $\mathbf{e}_k$-direction. This ellipsoid serves as our uncertainty volume, and V(u) is given by the product of the semi-axis lengths,

$$V(u) = C_K \prod_{k=1}^{K} \sqrt{\lambda_k} \tag{23}$$

where $C_K$ is an unimportant geometrical prefactor,

$$C_K = \frac{\pi^{K/2}}{\Gamma\!\left(\frac{K}{2} + 1\right)} \tag{24}$$

with $\Gamma$ denoting Euler’s gamma function. Eq. (23) can also be written in terms of the determinant of the D-matrix:

$$V(u) = C_K \sqrt{\det D(u)} \tag{25}$$
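The covariance ellipsoid volume can be illustrated numerically. The following is a minimal sketch, assuming the ensemble is available as a table of predictions with one row per ensemble member and one column per observable; the function name and the array layout are illustrative choices, not taken from the article or its GitHub code.

```python
import math
import numpy as np

def covariance_ellipsoid_volume(predictions):
    """Volume of the covariance ellipsoid of ensemble predictions.

    predictions: (M, K) array, one row per ensemble member and one
    column per predicted observable (layout assumed for illustration).
    """
    preds = np.asarray(predictions, dtype=float)
    K = preds.shape[1]
    # Covariance matrix D of the K predicted observables over the ensemble
    D = np.atleast_2d(np.cov(preds, rowvar=False, bias=True))
    # D is symmetric positive semidefinite, so eigvalsh applies;
    # clip tiny negative round-off values to zero
    eigvals = np.clip(np.linalg.eigvalsh(D), 0.0, None)
    # Geometrical prefactor: volume of the unit ball in K dimensions
    c_k = math.pi ** (K / 2) / math.gamma(K / 2 + 1)
    # Product of the semi-axis lengths sqrt(lambda_k)
    return c_k * math.sqrt(float(np.prod(eigvals)))

# Toy ensemble with variances 0.5 and 2 and zero covariance
toy = [(1, 0), (-1, 0), (0, 2), (0, -2)]
volume = covariance_ellipsoid_volume(toy)
```

In this toy case det(D) = 1 and K = 2, so the volume reduces to the prefactor for K = 2, which is the area of the unit circle.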
MINE by Correlation Ellipsoid Volume
In the correlation matrix approach, we define V(u) in terms of an uncertainty ellipsoid constructed from the Pearson correlations of the K observable predictions, $F_k(\theta, u)$, subject to the ensemble PDF $Q(\theta)$. The Pearson correlation matrix elements, denoted by $E_{kl}(u)$, are related to the covariance matrix elements, $D_{kl}(u)$, from eq. (16), by

$$E_{kl}(u) = \frac{D_{kl}(u)}{\sqrt{D_{kk}(u)\,D_{ll}(u)}} \tag{26}$$
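Note that $E_{kl}(u)$ can also be written as the covariance matrix of the predicted observations, $F_k(\theta, u)$, normalized by their standard deviations:

$$E_{kl}(u) = E[\tilde{F}_k \tilde{F}_l] - E[\tilde{F}_k]\,E[\tilde{F}_l] \tag{27}$$

where

$$\tilde{F}_k = \frac{F_k(\theta, u)}{\sqrt{D_{kk}(u)}} \tag{28}$$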
Therefore, the correlation matrix E has the same mathematical properties of symmetry and positive semidefiniteness as the covariance matrix D. Analogous to the covariance ellipsoid constructed from D, we can therefore construct a correlation ellipsoid from the eigenvalues and orthonormal eigenvectors of the correlation matrix E. Using the volume of the correlation ellipsoid as the uncertainty volume of the predicted observables, we then have, analogous to eq. (23),

$$V(u) = C_K \prod_{k=1}^{K} \sqrt{\mu_k} \tag{29}$$

where $C_K$ is the geometrical prefactor of eq. (24) and $\mu_1, \ldots, \mu_K$ are the eigenvalues of the correlation matrix E. Analogous to eq. (25), we can also write this as

$$V(u) = C_K \sqrt{\det E(u)} \tag{30}$$
The advantages of the correlation ellipsoid volume are several. One, V(u) is a measure of linear dependence among the observables, and the greatest gain in information from an experiment is likely to come from structuring the experiment to increase the linear independence of the observables. In fact, det(E) is 1 when the observables are mutually uncorrelated and is 0 when there is an exact linear dependence among the observables. Two, V(u) has well-known statistical properties when the ensemble takes the form of eq. (11) [42].
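The two limiting values of det(E) can be checked on small examples. This is a minimal sketch, assuming the ensemble predictions are stored with one row per ensemble member and one column per observable; the function name is illustrative and not part of the article's code.

```python
import numpy as np

def correlation_determinant(predictions):
    """det(E) for the Pearson correlation matrix E of ensemble predictions.

    predictions: (M, K) array, rows are ensemble members (assumed layout).
    Each column needs non-zero variance, otherwise E is undefined.
    """
    preds = np.asarray(predictions, dtype=float)
    E = np.corrcoef(preds, rowvar=False)
    return float(np.linalg.det(E))

# Two uncorrelated observables: det(E) = 1
uncorrelated = correlation_determinant([(1, 1), (-1, 1), (1, -1), (-1, -1)])

# Second observable is an exact linear function of the first: det(E) = 0
dependent = correlation_determinant([(1, 3), (2, 5), (3, 7), (4, 9)])
```

Intermediate values between 0 and 1 indicate partial correlation among the predicted observables.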
The surface of the MINE criterion is shown as a contour plot for the next mixture experiment (Fig. 4b). The MINE experiment uses an inoculum with a proportion of ~0.25 of AMF1 and ~0.40 of AMF2 to characterize the rates of hyphal extension in the next experiment. As a final note, some theorems about the properties of MINE have been established for the class of linear models, such as the mixture experiments [11]. Code for the ensemble methods is available on GitHub [43, 44].
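The search for the next mixture experiment can be sketched as a grid scan of det(E) over the inoculation proportions. The example below is only an illustration under stated assumptions: the saturating-growth model in `predicted_observations` is a hypothetical stand-in and is not the article's eq. (4), the ensemble `theta` is drawn at random rather than fitted to data, and all function names are invented for this sketch.

```python
import numpy as np

def predicted_observations(theta, u1, u2, times):
    """Hypothetical toy model (NOT the article's eq. (4)).

    theta: (M, 3) ensemble of colonization rates; returns (M, K) predictions.
    """
    weights = np.array([u1, u2, 1.0 - u1 - u2])
    # Saturating growth 1 - exp(-rate * t) per colonizer: shape (M, 3, K)
    growth = 1.0 - np.exp(-theta[:, :, None] * times[None, None, :])
    # Mixture prediction: proportion-weighted sum over the three colonizers
    return np.einsum("i,mik->mk", weights, growth)

def mine_criterion(theta, u1, u2, times):
    """det(E) of the correlation matrix of the K predicted observables."""
    preds = predicted_observations(theta, u1, u2, times)
    return float(np.linalg.det(np.corrcoef(preds, rowvar=False)))

def next_experiment(theta, times, n_grid=21):
    """Scan the simplex u1 + u2 <= 1 and return the maximizer of det(E)."""
    best_u, best_value = None, -np.inf
    for u1 in np.linspace(0.0, 1.0, n_grid):
        for u2 in np.linspace(0.0, 1.0, n_grid):
            if u1 + u2 > 1.0 + 1e-12:
                continue
            value = mine_criterion(theta, u1, u2, times)
            if value > best_value:
                best_u, best_value = (float(u1), float(u2)), value
    return best_u, best_value

rng = np.random.default_rng(0)
theta = rng.uniform(0.1, 2.0, size=(500, 3))  # stand-in ensemble, not fitted
times = np.array([0.5, 1.0, 2.0])             # K = 3 measurement times
best_u, best_value = next_experiment(theta, times)
```

The values returned here depend entirely on the toy model and the random ensemble, so they are not expected to reproduce the proportions reported for Fig. 4b.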
Application of MINE to RNA profiling experiments to discover the mechanism of biological clocks
MINE was developed specifically for this kind of transcriptomics problem and used to close the loop in the computing life cycle (Fig. 5) proposed by Hood and Aebersold [45]. Here, the application was to transcriptomic experiments to discover the mechanism underlying circadian rhythms, one of the central problems of systems biology [46]. Transcriptomic experiments have a limited number of time points n but many thousands of genes [and hence parameters (p)] to be identified in the process [6]. While both MINE criteria, the covariance ellipsoid volume and the correlation ellipsoid volume, were developed, only the correlation ellipsoid volume was used in the end in designing the experiments [6].
Figure 5.
The MINE experiment is a 90% knockdown of the wc-1 gene. The MINE criterion displayed is the correlation ellipsoid volume det(E(U)), which is graphed as a function of the remaining activity of the three clock mechanism genes. The predictions F are of the log base 10 concentrations of frq, wc-1, and wc-2 mRNAs over time from the RNA profiling experiments. The mRNA levels were measured at 14 time points over an 8-h window. The drawing is taken from [6].
Transcriptomics was to be used to explore the mechanism of the biological clock in one of the most well-studied model systems [47], the filamentous fungus, Neurospora crassa. Three major components of the clock mechanism are: (i) frequency (frq), the gene encoding the oscillator of the system, and a negative regulator [48]; (ii) white-collar-1 (wc-1), the gene encoding the light response element and a positive transcriptional activator for the system [49]; and (iii) white-collar-2 (wc-2), a second positive transcriptional activator for the system. Together wc-1 and wc-2 encode WC-1 and WC-2 proteins that act as positive elements in the clock through the dimeric complex WCC = WC-1/WC-2, while frq encodes a protein FRQ, which acts as the negative regulator for the system. The FRQ protein provides negative feedback to wc-1 and wc-2. All three of these elements appear in a single copy in the N. crassa genome, but they have homologs in fly and mammalian systems [49, 50].
Now consider a series of RNA profiling experiments that were conducted, guided by MINE to choose an informative sequence of experiments. The last in a series of three adaptive experiments guided by MINE involved a choice of whether or not to do a knockdown or overexpression experiment on: (i) frq; (ii) wc-1; or (iii) wc-2. The conventional wisdom was to mutate the oscillator gene frq [51].
The first step in the MINE application is to make predictions for various mutations in the clock mechanism genes using an available ensemble. The RNA profiles of all 11 000 genes were measured at each of 14 time points. A subset of these RNA measurements included a total of 14 time points on each of the three clock genes so that the prediction vector F had 42 components. Unlike previous models so far considered, the clock model is a nonlinear model (in the parameters describing the model), which specifies a genetic network of nonlinear ordinary differential equations describing the time course of the genes, their cognate RNAs, and proteins [18]. The model ensemble for this clock network was used to predict RNA profiles of frq, wc-1, and wc-2 and their correlations under different possible experiments as shown in Fig. 2.
With the correlation matrix in hand for the predictions, the MINE criterion based on the correlation volume ellipsoid in eq. (30) was calculated as a function of the degree of knockdown of the three clock genes (Fig. 5). The result was surprising. A knockdown of wc-1 was selected as the MINE experiment and used to identify 2323 genes responding to the knockdown [6]. The second surprise from this model-guided discovery process was that ribosome biogenesis was under clock control. This was later confirmed in mammalian systems [52].
A better way to do this MINE calculation would have been to use the transcriptomic data on all 11 000 N. crassa genes instead of just the three clock genes driving the system. Ensemble methods now exist for the whole genome-scale network with all of its thousands of genes, and ensembles now exist for the entire clock network [19, 20, 53]. Using graphics processing units (GPUs), ensemble methods such as MINE can now be implemented on a genomic scale with an unknown network structure [19, 53].
Application of MINE to genome-wide association study field studies for AMF/sorghum project
Consider a GWAS underway to understand the genetic basis of biomass and AMF colonization in Sorghum bicolor using the BAP Panel [26] of 343 plant accessions of varying biomass. After filtering for minor allele frequencies, 232 303 SNPs characterize each member of the panel [54]. The focus is on dry weight as a measure of biomass, and AMF colonization is measured using a convolutional neural network applied to images of AMF in roots [41, 55]. At Time 0, dry weight data were available on each accession in the panel to construct an ensemble [26]. The GWAS has been running for 5 years [43, 44], and each year MINE is used to select 79 BAP plant accessions for study. The 79 accessions are assigned randomly to 79 rows, with each row consisting of 9 seedlings of identical genotype. The 79 × 9 block is replicated three times in the field. The goal is to discover the relation between plant biomass and SNPs.
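The random assignment of accessions to rows within each replicate block can be sketched as follows. This is a minimal illustration assuming an independent shuffle per block; the accession labels, the function name, and the seed are placeholders and do not describe the actual field layout.

```python
import random

def randomized_complete_block_design(accessions, n_blocks=3, seed=0):
    """Assign every accession to one randomly chosen row in each block.

    Returns a list of (block, row, accession) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    layout = []
    for block in range(1, n_blocks + 1):
        order = list(accessions)
        rng.shuffle(order)  # independent randomization within each block
        for row, accession in enumerate(order, start=1):
            layout.append((block, row, accession))
    return layout

# Placeholder labels for the 79 selected accessions
accessions = [f"BAP{i:03d}" for i in range(1, 80)]
layout = randomized_complete_block_design(accessions)
```

Each accession then appears exactly once per block, which allows block effects to be separated from accession effects.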
In the field study, there are actually a number of plant features that are being measured during the sequence of MINE experiments [44]. These include plant genotype, plant expression Quantitative Trait Loci or eQTLs, the microbiome, tissue total phosphorus (P), nitrogen (N), time of harvest, and other variables relevant to plant health as measured by biomass (Fig. 6). These variables used to predict biomass (as well as AMF colonization and AMF community composition) are held together in a causal diagram representing a structural equation model [56]. The standard model for GWAS experiments is the mixed linear model, which is a special case of the structural equation model in which some of the independent variables are random with mean 0 [57]. For purposes of illustration, a mixed linear model is presented below for an adaptive GWAS experiment underway at Wellbrook Farm, Athens, GA.
Figure 6.
A sequence of MINE experiments is being used in a 5-year GWAS experiment to examine the relation between biomass and SNPs in S. bicolor using the BAP accessions [26]. MINE is used to select the BAP accessions to be used each year in order to map AMF colonization and biomass to the sorghum genetic map in a GWAS study. Multiscale structural equation model (SEM) for the project (center boxes and arrows). Lotka–Volterra community models are nested within the SEM and predict associations that affect biomass. The dependent variable is biomass, and the arrows in the diagram denote causal relationships between independent variables in the SEM. In this model, sorghum genotype is the primary independent variable that correlates with the remaining variables. This conceptual model will evolve continuously using the model-guided discovery process of maximally informative next experiment (MINE; outer ring) [6].
There are two sets of measured inputs to the GWAS experiment in Year 1 at Wellbrook Farm, Athens, GA: (i) fixed variables in the design matrix X, such as block number, harvest time of each plant, N level, and P level; and (ii) random intercept and slope effects Z for the genotype of each BAP accession. The additive genetic variation in plant genotype as it affects the dependent variable of log dry weight (a measure of biomass) was captured by binning the number of alleles in a given genomic region that differ from the reference genome using the sum method [58]. The number of SNPs in each chromosomal region was adjusted to ensure that each genomic region was at least 50 kb in size. Since linkage disequilibrium is reported between markers separated by 3.5–35 kb [59], the 50 kb size was chosen to reduce linkage disequilibrium between bins. The result was 2748 bins in the 750 Mb sorghum genome, with each bin typically containing 10–12 genes. The number of such alleles in a bin is treated as a continuous random variable with mean 0 and variance component
for the ith accession, and summed over regions to obtain a random effect for an accession. The fact that each random effect is the sum of 2748 small random effects of chromosomal regions makes it plausible that the random effect of accession is normally distributed. The fixed effects of block number and BAP genotype, for example, are denoted by the vector $\beta$
and the random effects, by u.
These fixed and random effects are used to predict a measure of biomass, log dry weight. In the experiment, there are a total of n ~ 606 plants (2–3 plants × 3 blocks × 79 rows) in the field to infer the fixed and random effects, which is less than p = 2748 fixed effects + 79 variance components. The biomass measurements to be predicted are summarized in the n × 1 column vector Y. The last component of the model is the error in biomass, denoted by $\varepsilon_i$ for the ith plant in the experiment.
In Year 1 of the adaptive GWAS experiment, there was no evidence of block effects, and N and P applied were not varied. A total of 79 BAP accessions were planted in a randomized complete block design with 3 blocks, 79 plots in each block, 1 genotype randomly assigned per plot (i.e. a row), and 12 replicates per plot. The mixed linear model for this experiment is reduced to:
$$Y = X\beta + Zu + \varepsilon \tag{31}$$

where Y is an n × 1 vector of observations on biomass. The n × p matrix X contains the values of the p independent variables (describing each bin), with n observations on each of the chromosomal bins. The p × 1 vector $\beta$ contains the fixed effects of each chromosomal region. The n × r matrix Z is the r = 2748 normalized number of alleles in each accession on n plants in the field. The r × 1 vector u is a vector of random effects of each accession on biomass. The errors in the dependent variable, biomass Y, are captured in the n × 1 vector $\varepsilon$. Three of the assumptions of the model are that: (i) the random effects u are independent of the biomass errors $\varepsilon$
; (ii) the errors $\varepsilon$ are normally distributed with mean 0 and variance $\sigma^2$; (iii) the random effects u are normally distributed with mean 0 and variance $\sigma^2_{j(i)}$ for plant i with accession j(i). That is, the assumptions are that the random effects u and errors $\varepsilon$ are independent and normally distributed with mean 0 and variance–covariance matrices $G$ and $I\sigma^2$, respectively. In the application of eq. (31), taking Z = X implies that both intercepts and slopes are random, as in [41].
Under this model, the prediction of biomass is:
![]() |
The variance components and heritability are used to calculate the variance–covariance matrix V of the biomass measurements Y:
![]() |
where
. That is,
and
are the ith row vectors of Z and X, respectively. Each observation
has such a 1 × n row vector
to describe the genetics of its accession. Each term
is an n × n block. The variance–covariance matrix is diagonal with p blocks each with the same diagonal elements
. The index j(i) is a lookup that returns the variance component of the ith observation as determined from its accession j. That is, plant i has an assigned accession j(i).
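For a mixed linear model with independent random effects and errors, the standard marginal covariance of the observations is Z G Zᵀ + R, where G and R are the covariance matrices of the random effects and of the errors. The sketch below builds this matrix under a simplifying assumption that is not taken from the article: Z is an indicator matrix linking each plant to its accession through the lookup j(i), whereas the article works with normalized allele counts. The function and argument names are illustrative.

```python
import numpy as np

def marginal_covariance(accession_of_plant, variance_components, error_variance):
    """Covariance of Y for Y = X beta + Z u + eps with independent u and eps.

    accession_of_plant: the lookup j(i), one accession index per plant.
    variance_components: one variance per accession (diagonal of G).
    error_variance: common variance of the errors eps.
    """
    j = np.asarray(accession_of_plant)
    n, r = j.size, len(variance_components)
    # Indicator design (assumption): plant i loads on accession j(i) only
    Z = np.zeros((n, r))
    Z[np.arange(n), j] = 1.0
    G = np.diag(variance_components)   # Var(u)
    R = error_variance * np.eye(n)     # Var(eps)
    return Z @ G @ Z.T + R

# Four plants, two accessions, variance components 2 and 3, error variance 1
V = marginal_covariance([0, 0, 1, 1], [2.0, 3.0], 1.0)
```

With this indicator design, plants of the same accession share a covariance equal to that accession's variance component, and plants of different accessions are uncorrelated.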
With the assumptions above for the mixed linear model, the ensemble Q can be written down as multivariate normal with the $\theta$-vector consisting of the fixed effects $\beta$ and the variance components:
![]() |
In the first use of MINE in a field experiment, no fertilizers were applied to the field in 2021 at Wellbrook Farm, Athens, GA [44]. The model was reduced to a fixed effects model with the number of alleles in a bin as the set of independent variables using the sum method [58].
The ensemble method was used to fit the models to the published log dry weight data from Florence, SC, averaged over the 3 years from 2013 to 2015 [26], to make predictions for use in MINE [44]. Typically, in an omics experiment, prior published data are available, and these should be used to initialize the MINE sequence [6]. A total of 1000 equilibration sweeps were done, and then 1000 sets of model parameters were accumulated, with successive parameter sets separated by 100 decorrelation sweeps. The chi-square per data point was 6.12 with n = 606 dry weight measurements. As a control, the ensemble run was repeated with the only change being 1000 decorrelation steps. The ensemble fitted to all the data simultaneously provides a unified framework for feature selection to avoid overfitting [43, 44].
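The equilibration, accumulation, and decorrelation phases can be illustrated with a small Metropolis sampler. This is a minimal sketch under assumptions of this example only: a two-parameter straight-line model with toy data, an ensemble weight proportional to exp(−χ²/2), and sweeps that visit each parameter in turn. It is not the authors' GitHub implementation, and the sweep counts are scaled down from those reported.

```python
import math
import random

def chi_square(params, xs, ys, sigma):
    """Chi-square misfit of the toy model y = a + b * x."""
    a, b = params
    return sum(((y - (a + b * x)) / sigma) ** 2 for x, y in zip(xs, ys))

def metropolis_sweep(params, xs, ys, sigma, step, rng):
    """One sweep: propose one update to each parameter in turn."""
    current = chi_square(params, xs, ys, sigma)
    for i in range(len(params)):
        trial = list(params)
        trial[i] += rng.uniform(-step, step)
        proposed = chi_square(trial, xs, ys, sigma)
        # Accept with probability min(1, exp(-(chi2_new - chi2_old) / 2))
        if proposed <= current or rng.random() < math.exp(-0.5 * (proposed - current)):
            params, current = trial, proposed
    return params

def sample_ensemble(xs, ys, sigma, n_equil, n_samples, n_decorr, step=0.3, seed=1):
    rng = random.Random(seed)
    params = [0.0, 0.0]
    # Equilibration phase: move into the region that fits the data
    for _ in range(n_equil):
        params = metropolis_sweep(params, xs, ys, sigma, step, rng)
    # Accumulation phase: keep one parameter set every n_decorr sweeps
    ensemble = []
    for _ in range(n_samples):
        for _ in range(n_decorr):
            params = metropolis_sweep(params, xs, ys, sigma, step, rng)
        ensemble.append(tuple(params))
    return ensemble

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x for x in xs]  # toy data generated with a = 1, b = 2
ensemble = sample_ensemble(xs, ys, sigma=0.5, n_equil=300, n_samples=200, n_decorr=5)
mean_a = sum(p[0] for p in ensemble) / len(ensemble)
mean_b = sum(p[1] for p in ensemble) / len(ensemble)
```

Separating the accumulated parameter sets by decorrelation sweeps reduces the correlation between successive ensemble members, which is why a control run with a different decorrelation length is a useful consistency check.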
MINE was then applied to the fitted ensemble from Florence, SC, to select 80 accessions for planting at Wellbrook Farm in 2022 [44]. The MINE method used was the covariance ellipsoid in eq. (25). The MINE criterion V(u) conceptually involved evaluating det(D) over all possible samples of 80 accessions from the BAP panel, which is computationally intractable. Instead, as an approximation, the MINE criterion was optimized by evaluating det(D) on all $\binom{343}{3}$ = 6 666 891 triples drawn from the 343 BAP accessions deposited at USDA GRIN in Griffin, GA. The result is shown graphically (Fig. 7). As detailed in the MINE section, the calculation of det(D) begins with the covariance matrix of the predictions for next year’s experiment, eq. (16). The eigenvalues of the covariance matrix are calculated using eqs. (17–20). The eigenvalues, in turn, determine det(D), as in eqs. (23) and (25). Code for the ensemble methods and the MINE calculations is available on GitHub [44].
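The triple-based approximation can be sketched on a small stand-in panel. The code below assumes the ensemble predictions are arranged with one row per ensemble member and one column per accession, and it reads "selecting distinct accessions from the top triples" as walking down the ranked list until the target count is reached; both the layout and that reading are assumptions of this example, as are the function names.

```python
import math
from itertools import combinations
import numpy as np

def rank_triples(predictions):
    """Rank all accession triples by det(D) of their predicted values.

    predictions: (M, A) array, rows are ensemble members and columns are
    accessions (assumed layout). Returns [(det_D, triple), ...], best first.
    """
    preds = np.asarray(predictions, dtype=float)
    # Covariance of predictions between every pair of accessions (A x A)
    D_full = np.cov(preds, rowvar=False, bias=True)
    scored = []
    for triple in combinations(range(preds.shape[1]), 3):
        idx = np.array(triple)
        sub = D_full[np.ix_(idx, idx)]  # 3 x 3 covariance for this triple
        scored.append((float(np.linalg.det(sub)), triple))
    scored.sort(reverse=True)
    return scored

def select_accessions(scored, n_top, n_select):
    """Collect distinct accessions from the n_top best triples."""
    chosen = []
    for _, triple in scored[:n_top]:
        for accession in triple:
            if accession not in chosen:
                chosen.append(accession)
            if len(chosen) == n_select:
                return chosen
    return chosen

n_triples_full_panel = math.comb(343, 3)  # number of triples in the full panel

rng = np.random.default_rng(0)
toy_predictions = rng.normal(size=(200, 6))  # stand-in: 6 accessions
scored = rank_triples(toy_predictions)
chosen = select_accessions(scored, n_top=5, n_select=4)
```

Computing the full A × A covariance once and slicing 3 × 3 blocks keeps the cost per triple small, which matters when several million triples are scored.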
Figure 7.

The MINE criterion log(det(D)) was used to select 80 accessions for use in a GWAS experiment at Wellbrook Farm, GA, in 2022. The top 200 selected triples of accessions are ranked by det(D). From these top 200 triples, 80 distinct accessions were selected.
A detailed comparative analysis of MINE in GWAS with classic design approaches is available [43, 44]. The use of MINE in a series of three smaller yearly adaptive experiments is compared directly with a large classic design on the BAP for log dry weight [26]. In this example, the MINE experiments made the GWAS feasible with ~8 project participants per year in the field experiment, whereas a larger classic design using all BAP accessions in 1 year involved over 20 project participants.
As a final note, in the first pass at developing ensemble methods, such as MINE, for GWAS, the random effects were assigned to accessions rather than individual genes to track the replicates across blocks in the field design each year and to keep the model simpler by limiting the number of variance components to the number of accessions rather than the number of bins. The hypothesis to be tested was simply whether or not BAP accessions had an effect on AMF colonization. Future work will include more variance components for each chromosomal bin to fit these more complicated ensembles of models [43].
Conclusion
There are other potential domains of application of MINE. One example is computer vision models for large image data sets, such as the 15 terabyte (TB) data set of plant root images with AMF structures in the root cells recently reported [60]. As such data sets grow, there are choices of structural categories to annotate for ‘balance’ to build a good classification and segmentation model. Then, the ratios of training, validation, and testing sets need to be selected as design parameters, the usual ratios being 8:1:1. Once a criterion for information is decided upon, such as accuracy, F1, or precision for classification, or Intersection over Union (IoU) for segmentation [41], MINE could be applied. Classification problems are potentially approachable by MINE in the same way as MINE has been applied to phylogenetic trees in taxon sampling [61, 62]. The goal would be to select promising categories for further annotation to improve classification and segmentation adaptively as the data resource is expanded.
Other terabyte-scale data sets are encountered in the natural language processing of genomic sequences arising from protein and DNA sequence languages in microbial genomes and microbiomes [63]. The goal is to use natural language models to help predict the functional categories of sequences, most of which are unannotated. Again, the same kinds of design questions arise. What are the most informative functional categories to explore with annotation? How do we go about training, validating, and testing natural language models for genomic sequences? MINE would presumably help to select functional categories for discovery potential.
MINE is a discovery tool designed specifically for very large genetic data sets and is illustrated in a variety of problems within genetics here. MINE is built upon other ensemble methods [18] that have been developed for fitting models with n ≪ p. These ensemble methods of model identification coupled with MINE complete a discovery cycle (Fig. 6) for exploring problems in genetics. This discovery cycle has been called computing life [45]. MINE as a discovery tool completes the cycle in both analyzing and designing future costly omics experiments arising in genetics and allows an adaptive approach to solving problems in genetics.
Key Points
New model-guided and adaptive experimental design for omics experiments called the Maximally Informative Next Experiment or MINE is reviewed for genetics.
Ensemble methods are reviewed to fit models when the number of parameters p greatly exceeds the number of data points n (p ≫ n).
MINE uses a fitted model ensemble for model-guided discovery when p ≫ n to select the next experiment, thereby better distinguishing models within a fitted model ensemble.
MINE is illustrated in nonlinear models of the clock in Neurospora crassa in multiyear transcriptomics experiments.
MINE is also illustrated in linear models for a genome-wide association study in Sorghum bicolor for multiyear experiments to find the genes underlying biomass.
Supplementary Material
Contributor Information
Isaac Torres, Institute of Bioinformatics, University of Georgia, Athens, GA 30602 USA.
Shufan Zhang, Institute of Bioinformatics, University of Georgia, Athens, GA 30602 USA.
Amanda Bouffier, Institute of Bioinformatics, University of Georgia, Athens, GA 30602 USA.
Michael Skaro, Institute of Bioinformatics, University of Georgia, Athens, GA 30602 USA.
Yue Wu, Department of Genetics, Stanford University, Stanford, CA 94309 USA.
Lauren Stupp, Genetics Department, University of Georgia, Athens, GA 30602 USA.
Jonathan Arnold, Institute of Bioinformatics, University of Georgia, Athens, GA 30602 USA.
Y Anny Chung, Plant Biology and Plant Pathology, University of Georgia, Athens, GA 30602 USA.
H-Bernd Schuttler, Physics and Astronomy, University of Georgia, Athens, GA 30602 USA.
Funding
This work was supported by DOE DE-SC0021386 and NSF MCB-2041546.
Data availability
The data underlying this article will be shared on reasonable request to the corresponding author.
References
- 1. Fisher RA. The Design of Experiments. Edinburgh: Oliver and Boyd, 1935.
- 2. Fisher RA. The Design of Experiments. 6th edn. Edinburgh: Oliver and Boyd, 1951.
- 3. John PWM. Statistical Design and Analysis of Experiments. London: Macmillan, 1971.
- 4. Lasky JR, Upadhyaya HD, Ramu P. et al. Genome-environment associations in sorghum landraces predict adaptive traits. Sci Adv 2015;1:e1400218. 10.1126/sciadv.1400218
- 5. Hu Z, Olatoye MO, Marla S. et al. An integrated genotyping-by-sequencing polymorphism map for over 10,000 sorghum genotypes. Plant Genome 2019;12:180044. 10.3835/plantgenome2018.06.0044
- 6. Dong W, Tang X, Yu Y. et al. Systems biology of the clock in Neurospora crassa. PLoS One 2008;3:e3105. 10.1371/journal.pone.0003105
- 7. Johnson NC, Wilson GWT, Wilson JA. et al. Mycorrhizal phenotypes and the law of the minimum. New Phytol 2015;205:1473–84. 10.1111/nph.13172
- 8. Johnson NC, Wilson GWT, Bowker MA. et al. Resource limitation is a driver of local adaptation in mycorrhizal symbioses. Proc Natl Acad Sci 2010;107:2093. 10.1073/pnas.0906710107
- 9. Gao C, Montoya L, Xu L. et al. Strong succession in arbuscular mycorrhizal fungal communities. ISME J 2019;13:214–26. 10.1038/s41396-018-0264-0
- 10. Johnson NC, Hoeksema JD, Bever JD. et al. From Lilliput to Brobdingnag: Extending models of mycorrhizal function across scales. Bioscience 2006;56:889–900. 10.1641/0006-3568(2006)56[889:FLTBEM]2.0.CO;2
- 11. Bouffier AM, Arnold J, Schüttler HB. A MINE alternative to D-optimal designs for the linear model. PLoS One 2014;9:e110234. 10.1371/journal.pone.0110234
- 12. Yengo L, Vedantam S, Marouli E. et al. A saturated map of common genetic variants associated with human height. Nature 2022;610:704–12. 10.1038/s41586-022-05275-y
- 13. Gouveia GJ. et al. Long-term metabolomics reference material. Anal Chem 2021;93:9193–9.
- 14. Battogtokh D, Asch DK, Case ME. et al. An ensemble method for identifying regulatory circuits with special reference to the qa gene cluster of Neurospora crassa. Proc Natl Acad Sci USA 2002;99:16904–9.
- 15. Landau DP, Binder K. Monte Carlo simulations at the periphery of physics and beyond. In: Landau DP and Binder K. (eds.) A Guide to Monte Carlo Simulations in Statistical Physics. New York, NY, USA: Cambridge University Press, 2014. 13–22.
- 16. Guggenheim EA. Boltzmann's Distribution Law. New York: North-Holland Pub. Co.; Interscience Publishers, 1955.
- 17. McGee RL, Buzzard GT. Maximally informative next experiments for nonlinear models. Math Biosci 2018;302:1–8. 10.1016/j.mbs.2018.04.007
- 18. Yu Y, Dong W, Altimus C. et al. A genetic network for the clock of Neurospora crassa. Proc Natl Acad Sci USA 2007;104:2809–14. 10.1073/pnas.0611005104
- 19. Al-Omari A, Griffith J, Caranica C. et al. Discovering regulators in post-transcriptional control of the biological clock of Neurospora crassa using variable topology ensemble methods on GPUs. IEEE Access 2018;6:54582–94. 10.1109/ACCESS.2018.2871876
- 20. Al-Omari AM, Griffith J, Scruse A. et al. Ensemble methods for identifying RNA operons and regulons in the clock network of Neurospora crassa. IEEE Access 2022;10:32510–24. 10.1109/ACCESS.2022.3160481
- 21. Fisher RA, Russell EJ. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 1922;222:309–68. 10.1098/rsta.1922.0009
- 22. Savage LJ. The Foundations of Statistics. 2nd rev. edn. Garden City, NY: Dover Publications, 1972.
- 23. Landau DP, Binder K. A Guide to Monte Carlo Simulations in Statistical Physics. Cambridge, England: Cambridge University Press, 2009.
- 24. Antoninka AJ, Ritchie ME, Johnson NC. The hidden Serengeti—Mycorrhizal fungi respond to environmental gradients. Pedobiologia 2015;58:165–76. 10.1016/j.pedobi.2015.08.001
- 25. Metropolis N, Rosenbluth AW, Rosenbluth MN. et al. Equation of state calculations by fast computing machines. J Chem Phys 1953;21:1087–92. 10.1063/1.1699114
- 26. Brenton ZW, Cooper EA, Myers MT. et al. A genomic resource for the development, improvement, and exploitation of sorghum for bioenergy. Genetics 2016;204:21–33. 10.1534/genetics.115.183947
- 27. Mostow GD, Sampson JH. Linear Algebra. NY: McGraw-Hill, 1969.
- 28. Anderson WW, Arnold J, Sammons SA. et al. Frequency-dependent viabilities of Drosophila pseudoobscura karyotypes. Heredity 1986;56:7–17. 10.1038/hdy.1986.2
- 29. Cornell JA. Experiments with mixtures: A review. Technometrics 1973;15:437–55. 10.1080/00401706.1973.10489071
- 30. Cobb AB, Wilson GWT, Goad CL. et al. The role of arbuscular mycorrhizal fungi in grain production and nutrition of sorghum genotypes: Enhancing sustainability through plant-microbial partnership. Agric Ecosyst Environ 2016;233:432–40. 10.1016/j.agee.2016.09.024
- 31. Watts-Williams SJ, Emmett BD, Levesque-Tremblay V. et al. Diverse Sorghum bicolor accessions show marked variation in growth and transcriptional responses to arbuscular mycorrhizal fungi. Plant Cell Environ 2019;42:1758–74. 10.1111/pce.13509
- 32. Chappell CR, Fukami T. Nectar yeasts: A natural microcosm for ecology. Yeast 2018;35:417–23. 10.1002/yea.3311
- 33. Liu Y, Johnson NC, Mao L. et al. Phylogenetic structure of arbuscular mycorrhizal community shifts in response to increasing soil fertility. Soil Biol Biochem 2015;89:196–205. 10.1016/j.soilbio.2015.07.007
- 34. Jiang S, Liu Y, Luo J. et al. Dynamics of arbuscular mycorrhizal fungal community structure and functioning along a nitrogen enrichment gradient in an alpine meadow ecosystem. New Phytol 2018;220:1222–35. 10.1111/nph.15112
- 35. Revillini D, Gehring CA, Johnson NC. The role of locally adapted mycorrhizas and rhizobacteria in plant–soil feedback systems. Funct Ecol 2016;30:1086–98. 10.1111/1365-2435.12668
- 36. Oyarte Galvez L, Bisot C, Bourrianne P. et al. A travelling-wave strategy for plant–fungal trade. Nature 2025;639:172–80.
- 37. Dharmananda S. Studies of the Circadian Clock of Neurospora Crassa: Light-Induced Phase Shifting. Santa Cruz, CA: University of California, 1980.
- 38. McGonigle TP, Miller MH, Evans DG. et al. A new method which gives an objective measure of colonization of roots by vesicular-arbuscular mycorrhizal fungi. New Phytol 1990;115:495–501. 10.1111/j.1469-8137.1990.tb00476.x
- 39. Plouznikoff K, Asins MJ, de Boulois HD. et al. Genetic analysis of tomato root colonization by arbuscular mycorrhizal fungi. Ann Bot 2019;124:933–46. 10.1093/aob/mcy240
- 40. De Vita P, Avio L, Sbrana C. et al. Genetic markers associated to arbuscular mycorrhizal colonization in durum wheat. Sci Rep 2018;8:10612. 10.1038/s41598-018-29020-6
- 41. Zhang S, Wu Y, Skaro M. et al. Computer vision models enable mixed linear modeling to predict arbuscular mycorrhizal fungal colonization using fungal morphology. Sci Rep 2024;14:10866. 10.1038/s41598-024-61181-5
- 42. Muirhead RJ. Aspects of Multivariate Statistical Theory. NY: John Wiley & Sons, 2009.
- 43. Torres I. MINE: Maximally Informative Next Experiment - Genetics Application and Novel Computational Methodology. PhD Dissertation in Bioinformatics, University of Georgia, Athens, GA, 2024.
- 44. Torres I, Bouffier A, Schuttler H-B et al. MINE: Maximally Informative Next Experiment - towards a new GWAS experimental design and methodology. Genes|Genomes|Genetics (G3), in revision. 2025.
- 45. Ideker T, Thorsson V, Ranish JA. et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 2001;292:929–34. 10.1126/science.292.5518.929
- 46. Kitano H. Systems biology: A brief overview. Science 2002;295:1662–4.
- 47. Dunlap JC. Molecular bases for circadian clocks. Cell 1999;96:271–90.
- 48. Aronson BD, Johnson KA, Loros JJ. et al. Negative feedback defining a circadian clock: Autoregulation of the clock gene frequency. Science 1994;263:1578–84.
- 49. Crosthwaite SK, Dunlap JC, Loros JJ. Neurospora wc-1 and wc-2: Transcription, photoresponses, and the origins of circadian rhythmicity. Science 1997;276:763–9.
- 50. McClung CR, Fox BA, Dunlap JC. The Neurospora clock gene frequency shares a sequence element with the Drosophila clock gene period. Nature 1989;339:558–62.
- 51. McDonald MJ, Rosbash M. Microarray analysis and organization of circadian gene expression in Drosophila. Cell 2001;107:567–78.
- 52. Jouffe C, Cretenet G, Symul L. et al. The circadian clock coordinates ribosome biogenesis. PLoS Biol 2013;11:e1001455.
- 53. Al-Omari A, Griffith J, Judge M. et al. Discovering regulatory network topologies using ensemble methods on GPGPUs with special reference to the biological clock of Neurospora crassa. IEEE Access 2015;3:27–42. 10.1109/ACCESS.2015.2399854 [DOI] [Google Scholar]
- 54. Brenton ZW, Juengst BT, Cooper EA. et al. Species-specific duplication event associated with elevated levels of nonstructural carbohydrates in Sorghum bicolor. G3 Genes|Genomes|Genetics 2020;10:1511–20. 10.1534/g3.119.400921 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Evangelisti E, Turner C, McDowell A. et al. Deep learning-based quantification of arbuscular mycorrhizal fungi in plant roots. New Phytol 2021;232:2207–19. 10.1111/nph.17697 [DOI] [PubMed] [Google Scholar]
- 56. Kendall MG, Stuart A, Ord JK. et al. Kendall's Advanced Theory of Statistics 6th edn. Edward Arnold, London; Halsted Press, 1994. [Google Scholar]
- 57. Zhang Z, Ersoz E, Lai CQ. et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 2010;42:355–60. 10.1038/ng.546 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58. Romagnoni A, Jégou S, van Steen K. et al. Comparative performances of machine learning methods for classifying Crohn disease patients using genome-wide genotyping data. Sci Rep 2019;9:10351. 10.1038/s41598-019-46649-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59. Wang Y-H, Upadhyaya HD, Burrell AM. et al. Genetic structure and linkage disequilibrium in a diverse, representative collection of the C4 model plant, Sorghum bicolor. G3 Genes|Genomes|Genetics 2013;3:783–93. 10.1534/g3.112.004861 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60. Zhang S, Bourlai T, Arnold J. MycorrhisEE: A high-resolution image dataset for deep learning based quantification of Arbuscular Mycorrhizal fungi. Proceedings of IEEE “Big Data” 2024;P285. 10.1109/BigData62323.2024.10825578 [DOI] [Google Scholar]
- 61. Townsend JP, Leuenberger C. Taxon sampling and the optimal rates of evolution for phylogenetic inference. Syst Biol 2011;60:358–65. 10.1093/sysbio/syq097 [DOI] [PubMed] [Google Scholar]
- 62. Townsend JP, Lopez-Giraldez F. Optimal selection of gene and Ingroup taxon sampling for resolving phylogenetic relationships. Syst Biol 2010;59:446–57. 10.1093/sysbio/syq025 [DOI] [PubMed] [Google Scholar]
- 63. Miller D, Stern A, Burstein D. Deciphering microbial gene function using natural language processing. Nat Commun 2022;13:5731. 10.1038/s41467-022-33397-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
Data Availability Statement
The data underlying this article will be shared on reasonable request to the corresponding author.