Abstract
A first and second order progress rate analysis was conducted for the intermediate multi-recombinative Evolution Strategy (μ/μI, λ)-ES with isotropic scale-invariant mutations on the highly multimodal Rastrigin test function. Closed-form analytic solutions for the progress rates are obtained in the limit of large dimensionality and large populations. The first order results are able to model the one-generation progress including local attraction phenomena. Furthermore, a second order progress rate is derived yielding additional correction terms and further improving the progress model. The obtained results are compared to simulations and show good agreement, even for moderately large populations and dimensionality. The progress rates are applied within a dynamical systems approach, which models the evolution using difference equations. The obtained dynamics are compared to real averaged optimization runs and yield good agreement. The results improve further when dimensionality and population size are increased. Local and global convergence are investigated within the given model, showing that large mutations are needed to maximize the probability of global convergence, which comes at the expense of efficiency. An outlook regarding future research goals is provided.
Keywords: Evolution strategy, Progress rate, Global optimization, Rastrigin function
1. Introduction
The theoretical analysis of the performance of Evolution Strategies (ES) [8] optimizing functions f (y) in real-valued N-dimensional search spaces y ∈ ℝ^N is a challenge. This is due to the probabilistic nature of these algorithms, which up to now has allowed a dynamic progress analysis only on simple test functions such as the sphere model [2,5], the ridge function class [3,14], and the ellipsoid model [7]. These test functions are simple w.r.t. their optimization landscape (also referred to as fitness landscape) in that they have at most one optimizer (i.e., the location y of the optimum). Analyzing the dynamical behavior of ES on more complex and multimodal test functions appears to be even more demanding. However, ES and other evolutionary algorithms are designed especially to optimize such problems. There is empirical evidence that ES are able to globally optimize highly multimodal optimization problems [11] with a number of local optima that is exponential in N. The question arises how and when these ES are able to locate the global optimizer. It is the long term goal to find conditions the ES must fulfill to not get trapped in the vast number of local optimizers. Ideally, a theoretical analysis should provide the answers regarding the success probability PS (of locating the global optimum) depending on the ES parameters such as the population size λ and the test function to be optimized. Furthermore, one is interested in the computational complexity of the optimization process.
One approach successfully applied to the analysis of the ES-performance on simple unimodal test functions mentioned above is the dynamical systems approach [5] which is based on progress rate analysis. The progress rate is a measure of expected positional change in search space between two generations depending on location, strategy and test function parameters. The idea of investigating global search behavior from expected local progress was successfully applied, among others, in [3,7]. It will be shown in this paper that this approach can be extended to the highly multimodal Rastrigin test function
$$f(\mathbf{y}) = \sum_{i=1}^{N} f_i(y_i), \tag{1}$$
where y ∈ ℝ^N, with oscillation amplitude A and frequency parameter α. The i-th fitness component in Eq. (1) is defined as
$$f_i(y_i) = y_i^2 + A - A\cos(\alpha y_i). \tag{2}$$
Depending on A and α a finite number of local minima M can be observed for each component i. Therefore, the overall number of local minima scales as M^N, posing a highly multimodal minimization problem with the global optimizer located at ŷ = 0. An exemplary optimization landscape of the Rastrigin function is shown in Fig. 1.
Fig. 1.
The heat map shows the optimization landscape for A = 1, α = 2π, and N = 2. The global minimizer located at the origin (dark blue) is surrounded by multiple local minima. On the right side the same parameter set is shown for N = 1. For increasing y the oscillation contribution is decreasing. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)
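As a minimal illustration of Eqs. (1) and (2), the following Python sketch evaluates the Rastrigin function and its component-wise derivative, assuming the component form f_i(y_i) = y_i² + A − A cos(αy_i). The function names `rastrigin` and `rastrigin_gradient` and the default values (A = 1, α = 2π, as in Fig. 1) are choices made for this sketch only.

```python
import math


def rastrigin(y, A=1.0, alpha=2.0 * math.pi):
    """Sum of components f_i(y_i) = y_i^2 + A - A*cos(alpha*y_i) (assumed form)."""
    return sum(yi * yi + A - A * math.cos(alpha * yi) for yi in y)


def rastrigin_gradient(y, A=1.0, alpha=2.0 * math.pi):
    """Component-wise derivative 2*y_i + alpha*A*sin(alpha*y_i)."""
    return [2.0 * yi + alpha * A * math.sin(alpha * yi) for yi in y]


if __name__ == "__main__":
    # Fitness at the global optimizer and at a point near local minima.
    print(rastrigin([0.0, 0.0]))
    print(rastrigin([1.0, -1.0]))
    print(rastrigin_gradient([0.25, 0.0]))
```

The derivative splits into a quadratic part 2y_i and an oscillatory part αA sin(αy_i); the local minima are the points where the two cancel with positive curvature.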
The remarkable observation is that ES – unlike classical nonlinear optimization algorithms (e.g. BFGS) – do not follow the local gradient or Hessian ending in one of the M^N − 1 local optimizers. That is, ES perform a rather global search. A deeper understanding of this behavior is still missing. Recently, attempts have been made to analyze the problem from the viewpoint of relaxation using kernel smoothing [15]. However, the sampling process needed to transform the original problem into a convex optimization problem is still lacking a link to the ES.
In this paper a simplified and scale-invariant (μ/μI, λ)-ES, see Algorithm 1, is analyzed with step-size control defined in Eq. (4). Starting from the so-called parental centroid vector y(g), a population of λ offspring is generated by adding isotropic Gaussian mutations x ~ σ𝒩(0, 1) with mutation strength σ in Lines 6 and 7. Thereafter, the fitness is evaluated in Line 8. Selection of the μ best individuals is done in Line 10. It is performed for a given selection (truncation) ratio defined as
$$\vartheta := \mu/\lambda \tag{3}$$
with ϑ ∈ (0, 1). It will be an essential quantity for the progress rate results in the limit of large population sizes. Using intermediate recombination with equal weights the best m = 1, …, μ individuals are recombined in Line 11 and the new parental centroid y(g+1) is obtained. In the following, the subscript “m; λ” can be read as the m-th best solution out of λ candidate solutions. In Line 12 the simplified step-size adaptation is performed. To this end, a constant normalized mutation strength σ* using the spherical normalization with ‖y(g)‖ = R(g) is defined as
$$\sigma^* := \frac{\sigma^{(g)} N}{R^{(g)}}. \tag{4}$$
This property ensures scale invariance and therefore global convergence of the algorithm, as the mutation strength σ(g) decreases if and only if the residual distance R(g) decreases. The quantity σ* is unknown during black-box optimization, but it is very useful for theoretical investigations to obtain scale-invariant mutation strengths.
Algorithm 1. (μ/μI, λ)-ES with constant σ*.
1: g ← 0
2: y(0) ← y(init)
3: σ(0) ← σ*‖y(0)‖/N
4: repeat
5: for l ← 1, …, λ do
6: x̃l ← σ(g) 𝒩l(0, 1)
7: ỹl ← y(g) + x̃l
8: f̃l ← f (ỹl)
9: end for
10: (ỹ1;λ,…, ỹμ;λ) ← sort (ỹ w.r.t. ascending f̃)
11: y(g+1) ← (1/μ) Σm=1..μ ỹm;λ
12: σ(g+1) ← σ*‖y(g+1)‖/N
13: g ← g + 1
14: until termination criterion
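The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the mutation, evaluation, and recombination steps follow the description of Lines 6 to 11 in the text, the termination criterion is replaced by a fixed number of generations, and all names (`es_constant_sigma_star`, `residuals`, `seed`) are hypothetical.

```python
import math
import random


def rastrigin(y, A, alpha):
    """Assumed Rastrigin form: sum of y_i^2 + A - A*cos(alpha*y_i)."""
    return sum(yi * yi + A - A * math.cos(alpha * yi) for yi in y)


def norm(y):
    return math.sqrt(sum(yi * yi for yi in y))


def es_constant_sigma_star(y_init, mu, lam, sigma_star, A, alpha,
                           generations, seed=0):
    """(mu/mu_I, lambda)-ES with sigma = sigma_star * ||y|| / N (sketch)."""
    rng = random.Random(seed)
    n = len(y_init)
    y = list(y_init)
    residuals = [norm(y)]
    for _ in range(generations):
        sigma = sigma_star * norm(y) / n               # Lines 3 and 12
        offspring = []
        for _ in range(lam):                           # Lines 5 to 9
            child = [yi + sigma * rng.gauss(0.0, 1.0) for yi in y]
            offspring.append((rastrigin(child, A, alpha), child))
        offspring.sort(key=lambda pair: pair[0])       # Line 10, ascending fitness
        selected = [child for _, child in offspring[:mu]]
        y = [sum(column) / mu for column in zip(*selected)]  # Line 11
        residuals.append(norm(y))
    return y, residuals


if __name__ == "__main__":
    _, residuals = es_constant_sigma_star(
        [3.0] * 10, mu=20, lam=80, sigma_star=3.0,
        A=1.0, alpha=2.0 * math.pi, generations=50)
    print(residuals[0], residuals[-1])
```

Setting A = 0 reduces the fitness to the sphere model, which is a convenient sanity check because the residual distance should then shrink steadily for moderate σ*.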
The remainder of this paper is organized as follows. In the next section the local performance measures will be introduced, being the basis for both the progress rate analysis and the dynamical systems approach. Section 3 is devoted to the determination and evaluation of the first order progress rate. Section 4 describes the derivation of the second order progress rate, which will rely on first order progress rate results. Section 5 uses the local performance measures to establish the evolution equations that govern the dynamical behavior of the ES. Experiments will be presented to show the usefulness of the approach. In the final Section 6 conclusions will be drawn and, based on open problems, directions for further research will be outlined.
2. Local performance measures and quality gain distribution
The performance of an ES between two generations can be evaluated in both fitness and search space. The quality gain Qy(x) of fitness f at a position y(g) due to an isotropic mutation x ~ σ𝒩(0, 1) is defined as
$$Q_{\mathbf{y}}(\mathbf{x}) := f(\mathbf{y} + \mathbf{x}) - f(\mathbf{y}) \tag{5}$$
and yields in the case of fitness improvement (minimization considered) a negative value Qy < 0. The definition (5) measures the fitness change before selection and will be needed for the evaluation of the two progress rates (7) and (8). The quality gain components are decomposed using fi from Eq. (2) as Qi := fi(yi + xi) − fi(yi), such that
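$$Q_{\mathbf{y}}(\mathbf{x}) = \sum_{i=1}^{N} Q_i(x_i) = \sum_{i=1}^{N} \left[ f_i(y_i + x_i) - f_i(y_i) \right]. \tag{6}$$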
That is, the quality gain corresponds to the difference between fitness values before and after the mutation application. A probabilistic model for the distribution of quality values will be presented below. It will be important for the subsequent progress rate derivations, as selection is based on fitness values.
Analyzing the progress towards the optimizer in search space, the first order progress rate on the Rastrigin function has already been investigated in [17] as a first approach. In this paper, a new approach is presented which significantly improves the prediction quality.
The first order progress rate between two generations for the parental component yi is defined as
$$\varphi_i := \mathrm{E}\!\left[ y_i^{(g)} - y_i^{(g+1)} \,\middle|\, \mathbf{y}^{(g)}, \sigma^{(g)} \right] \tag{7}$$
given parental position y(g) and mutation strength σ(g) at generation g. It is a measure of expected positional difference in search space. Positive expected progress φi > 0 means that the component decreases in expectation, i.e., $\mathrm{E}[y_i^{(g+1)}] < y_i^{(g)}$. For a positive component $y_i^{(g)} > 0$ the distance to the optimizer ŷi = 0 is then reduced in expectation. This interpretation is only valid as long as the sign of $y_i$ does not change, i.e., for small mutations compared to the residual distance. Therefore φi has limited applicability when studying the convergence behavior in the vicinity of the optimizer. As has been shown in [7] regarding the performance analysis on the ellipsoid model, a second order progress rate is needed. It is defined as
$$\varphi_i^{\mathrm{II}} := \mathrm{E}\!\left[ \big(y_i^{(g)}\big)^2 - \big(y_i^{(g+1)}\big)^2 \,\middle|\, \mathbf{y}^{(g)}, \sigma^{(g)} \right]. \tag{8}$$
Squaring the positions yields $\varphi_i^{\mathrm{II}} > 0$, independent of the sign, if the distance to ŷi = 0 decreases in expectation. Additionally, the derivation will yield expressions containing a progress gain and loss part, which is necessary for a more accurate model of convergence. Both progress rates will be expressed using integral equations for the expected values and approximations will be necessary to find closed-form solutions. In a second step the progress rates can be applied within difference equations to model the expected dynamics over many generations in order to investigate the global convergence behavior.
The selection of individuals is based on the attained fitness values. The quality gain measures the fitness change before selection according to (5). When the progress rate of an ES is modeled, the cumulative distribution function (CDF) PQ (q) of the quality gain and its probability density function (PDF) pQ (q) are needed as a function of y and σ. Obtaining an exact CDF for Qy(x) is not feasible at this point. Since $Q_{\mathbf{y}}(\mathbf{x}) = \sum_{i=1}^{N} Q_i$ is a sum of independent random variables Qi, the application of the Central Limit Theorem seems appropriate to show that the distribution is asymptotically normal.¹ However, proving its validity rigorously seems hard or even impossible for arbitrary y. Therefore, we resort to normality as an approximation for the quality gain distribution. This is backed up by experimental results in Fig. 2, where sampled Qy(x)-values are compared to the normal approximation. A standard Anderson-Darling test was performed to check whether the sampled data was drawn from a normal distribution with known mean and variance according to (9). The hypothesis test fails to reject the normality assumption at p-values p = 0.48 (left) and p = 0.53 (right), where rejection is usually defined for p < 0.05. Even at relatively small N = 10 the results agree well. Good experimental agreement is also observed for the variation of the location y and mutation strength σ (not shown). Therefore, the normality assumption does not pose a strong restriction on the overall prediction quality of the progress rates in the subsequent sections, such that we approximate
$$Q_{\mathbf{y}}(\mathbf{x}) \sim \mathcal{N}\!\left(E_Q, D_Q^2\right). \tag{9}$$
Fig. 2.
The histograms show sampled values of Qy(x) from (5) with fixed y by applying random mutations xk ~ σ𝒩(0, 1) (σ = 1 with k = 1, …, 10^4 samples) at N = 10 (left) and N = 100 (right) with A = 10. The y-values were initialized randomly at ‖y‖ = 10 where local attraction is significant. The red envelope curves show the respective normal approximation (9) using mean value (30) and variance (31). The p-values of the Anderson-Darling test for normality are p = 0.48 (left) and p = 0.53 (right).
Furthermore, the following abbreviations are introduced
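$$E_Q := \mathrm{E}\left[Q_{\mathbf{y}}(\mathbf{x})\right] = \sum_{i=1}^{N} \mathrm{E}[Q_i] \tag{10}$$
$$D_Q^2 := \mathrm{Var}\left[Q_{\mathbf{y}}(\mathbf{x})\right] = \sum_{i=1}^{N} \mathrm{Var}[Q_i]. \tag{11}$$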
At this point an additional assumption for the coordinates y = (y1, …, yN) has to be made to justify subsequent variance approximations (13) and (14). Given the search vector y = (y1, …, yN) and residual distance R² = ‖y‖² it is assumed that the components contribute approximately equally (in expectation) to the residual distance, i.e., there is no dominating component, such that
| (12) |
Property (12) will also be referred to as component equipartition. The concept was introduced in [6] and proven for the noisy ellipsoid in [12]. Its applicability to the Rastrigin function was shown in [19]. The equipartition assumption is necessary in order to justify certain approximation steps and to provide a closed-form solution for the progress rate. Furthermore, it will be a reasonable assumption to obtain a model of the algorithm’s progress and dynamics in expectation. This assumption also justifies a linear scaling of the variance with dimensionality N provided that the components are contributing equally to the overall variance, such that
| (13) |
Additionally, for large N an important approximation will be used for the variance to significantly simplify the obtained lengthy results. If no single i-th component is dominating the sum, i.e., Var [Qi] / Σj≠i Var [Qj] → 0 (for any i in the limit N → ∞), the contribution of a single term is negligible for N → ∞. Therefore, the two sums over N and N − 1 terms, respectively, are asymptotically equal with
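$$D_Q^2 = \sum_{j=1}^{N} \mathrm{Var}[Q_j] \simeq \sum_{j \neq i} \mathrm{Var}[Q_j] = D_i^2. \tag{14}$$
Note that quantity $D_i^2$ is formally introduced in (20). Returning to Eq. (9), the expression is rewritten using a standardized random variate Z as
$$Q_{\mathbf{y}}(\mathbf{x}) \simeq E_Q + D_Q Z, \qquad Z \sim \mathcal{N}(0, 1). \tag{15}$$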
Approximation 1 (Quality gain distribution)
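The local quality gain at position y due to random mutation vector x ~ 𝒩(0, σ²1) is approximately normally distributed. Therefore, PQ (q) and pQ (q) can be approximated as
$$P_Q(q) \simeq \Phi\!\left( \frac{q - E_Q}{D_Q} \right) \tag{16}$$
$$p_Q(q) \simeq \frac{1}{\sqrt{2\pi}\, D_Q} \exp\!\left[ -\frac{1}{2} \left( \frac{q - E_Q}{D_Q} \right)^2 \right]. \tag{17}$$
Within the normal approximation (16) the inverse for a given probability p can be easily obtained by using the quantile function Φ⁻¹(p) of the normal distribution. This relation will be used later to obtain a quality gain for some given probability p using
$$q = P_Q^{-1}(p) \simeq E_Q + D_Q\, \Phi^{-1}(p). \tag{18}$$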
For the derivation of the i-th component progress rate the conditional distribution function PQ (q|xi) of the quality gain is needed for a given component xi. In this case expected value and variance are given by
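$$E_{Q|x_i} := \mathrm{E}[Q \,|\, x_i] = Q_i(x_i) + \sum_{j \neq i} \mathrm{E}[Q_j] \tag{19}$$
$$D_i^2 := \mathrm{Var}[Q \,|\, x_i] = \sum_{j \neq i} \mathrm{Var}[Q_j], \tag{20}$$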
where the sum j ≠ i is taken for fixed i over the remaining N − 1 components. Therefore, a normal approximation for the conditional CDF is introduced using (19) and (20).
Approximation 2 (Quality gain distribution given xi)
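The quality gain distribution at position y given fixed mutation component xi and random mutation vector (x)j≠i ~ (𝒩(0, σ²1))j≠i is approximately normally distributed. Therefore, PQ (q|xi) and pQ (q|xi) can be approximated as
$$P_Q(q \,|\, x_i) \simeq \Phi\!\left( \frac{q - E_{Q|x_i}}{D_i} \right) \tag{21}$$
$$p_Q(q \,|\, x_i) \simeq \frac{1}{\sqrt{2\pi}\, D_i} \exp\!\left[ -\frac{1}{2} \left( \frac{q - E_{Q|x_i}}{D_i} \right)^2 \right]. \tag{22}$$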
Having derived approximations of the quality gain distribution functions, the quantities E [Qi] and Var [Qi] remain to be determined. As the components are independent, it is sufficient to consider a single component and then perform the summation. Starting from definition (6), one can evaluate the quality gain of a single component Qi(xi). After applying trigonometric identity cos (α(yi + xi)) = cos (αyi) cos (αxi) − sin (αyi) sin (αxi), one gets
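$$Q_i(x_i) = f_i(y_i + x_i) - f_i(y_i) = x_i^2 + 2 y_i x_i + A\cos(\alpha y_i) - A\cos(\alpha (y_i + x_i)) \tag{23}$$
$$\phantom{Q_i(x_i)} = x_i^2 + 2 y_i x_i + A\cos(\alpha y_i)\left[ 1 - \cos(\alpha x_i) \right] + A\sin(\alpha y_i)\sin(\alpha x_i), \tag{24}$$
of which E [Qi] and Var [Qi] need to be evaluated. The results will be expressed as expected values containing trigonometric functions. As a remark, terms containing moments of xi ~ 𝒩(0, σ²), i.e., $\mathrm{E}[x_i^k]$ with k ≥ 1, are silently evaluated as they are assumed to be widely known. Starting with E[Qi] one has
$$\mathrm{E}[Q_i] = \sigma^2 + A\cos(\alpha y_i)\left( 1 - \mathrm{E}[\cos(\alpha x_i)] \right), \tag{25}$$
where odd powers of $x_i$ vanish in expectation, which also yields E[sin (αxi)] = 0. Evaluating Var [Qi] yields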
| (26) |
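Expectations of the form $\mathrm{E}[x^k \cos(\alpha x)]$ and $\mathrm{E}[x^k \sin(\alpha x)]$ for k ≥ 0 can be obtained by using the definition of the characteristic function χ of a random variate x ~ 𝒩(μ, σ²) and its known result [1]
$$\chi(\alpha) := \mathrm{E}\!\left[ \mathrm{e}^{\mathrm{i}\alpha x} \right] = \mathrm{E}\left[ \cos(\alpha x) + \mathrm{i}\sin(\alpha x) \right] = \exp\!\left[ \mathrm{i}\mu\alpha - \tfrac{1}{2}\sigma^2\alpha^2 \right] \tag{27}$$
with the imaginary unit denoted by $\mathrm{i}$ in (27) and (28). Now the k-th derivatives with respect to α can be applied to both sides
$$\mathrm{E}\!\left[ (\mathrm{i}x)^k \left( \cos(\alpha x) + \mathrm{i}\sin(\alpha x) \right) \right] \overset{!}{=} \frac{\mathrm{d}^k}{\mathrm{d}\alpha^k} \exp\!\left[ \mathrm{i}\mu\alpha - \tfrac{1}{2}\sigma^2\alpha^2 \right] \tag{28}$$
such that corresponding real and imaginary parts can be identified by comparing both sides (denoted by $\overset{!}{=}$) of Eq. (28). Given μ = 0 for k = {0, 1, 2} the required expectations of trigonometric terms can be derived. Additionally, trigonometric identities cos²(x) = 1/2 + cos(2x)/2 and sin²(x) = 1/2 − cos(2x)/2 are used. The results are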
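$$\begin{aligned}
\mathrm{E}[\cos(\alpha x)] &= \mathrm{e}^{-\frac{1}{2}(\alpha\sigma)^2}, & \mathrm{E}[\sin(\alpha x)] &= 0, \\
\mathrm{E}[x\sin(\alpha x)] &= \alpha\sigma^2\, \mathrm{e}^{-\frac{1}{2}(\alpha\sigma)^2}, & \mathrm{E}[x\cos(\alpha x)] &= 0, \\
\mathrm{E}[x^2\cos(\alpha x)] &= \sigma^2\left( 1 - \alpha^2\sigma^2 \right) \mathrm{e}^{-\frac{1}{2}(\alpha\sigma)^2}, & \mathrm{E}[x^2\sin(\alpha x)] &= 0, \\
\mathrm{E}[\cos^2(\alpha x)] &= \tfrac{1}{2}\left[ 1 + \mathrm{e}^{-2(\alpha\sigma)^2} \right], & \mathrm{E}[\sin^2(\alpha x)] &= \tfrac{1}{2}\left[ 1 - \mathrm{e}^{-2(\alpha\sigma)^2} \right].
\end{aligned} \tag{29}$$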
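The expectations that follow from the characteristic function with μ = 0 can be checked numerically. The sketch below compares a plain trapezoidal quadrature over the Gaussian density with the closed forms E[cos(αx)] = exp(−(ασ)²/2), E[x sin(αx)] = ασ² exp(−(ασ)²/2), E[x² cos(αx)] = σ²(1 − α²σ²) exp(−(ασ)²/2), and E[cos²(αx)] = (1 + exp(−2(ασ)²))/2. The helper `gaussian_expectation` and the test values α = 1.5, σ = 0.8 are assumptions of this sketch.

```python
import math


def gaussian_expectation(g, sigma, half_width=10.0, steps=20000):
    """Trapezoidal estimate of E[g(x)] for x ~ N(0, sigma^2)."""
    lo = -half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        pdf = math.exp(-0.5 * (x / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)
        total += weight * g(x) * pdf
    return total * dx


alpha, sigma = 1.5, 0.8
decay = math.exp(-0.5 * (alpha * sigma) ** 2)

# Each entry pairs a numerical estimate with the corresponding closed form.
checks = {
    "E[cos(ax)]": (gaussian_expectation(lambda x: math.cos(alpha * x), sigma),
                   decay),
    "E[x sin(ax)]": (gaussian_expectation(lambda x: x * math.sin(alpha * x), sigma),
                     alpha * sigma ** 2 * decay),
    "E[x^2 cos(ax)]": (gaussian_expectation(lambda x: x * x * math.cos(alpha * x), sigma),
                       sigma ** 2 * (1.0 - (alpha * sigma) ** 2) * decay),
    "E[cos^2(ax)]": (gaussian_expectation(lambda x: math.cos(alpha * x) ** 2, sigma),
                     0.5 * (1.0 + math.exp(-2.0 * (alpha * sigma) ** 2))),
}

for name, (numeric, closed) in checks.items():
    print(f"{name}: numeric={numeric:.8f} closed={closed:.8f}")
```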
Inserting relations (29) into (25) and (26), summing over all N components and collecting the resulting terms one obtains the expected value
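$$E_Q = N\sigma^2 + A\left( 1 - \mathrm{e}^{-\frac{1}{2}(\alpha\sigma)^2} \right) \sum_{i=1}^{N} \cos(\alpha y_i). \tag{30}$$
Analogously, the variance of the Rastrigin quality gain yields
$$D_Q^2 = \sum_{i=1}^{N} \Big\{ 2\sigma^4 + 4 y_i^2 \sigma^2 + \frac{A^2}{2}\left[ 1 - \mathrm{e}^{-(\alpha\sigma)^2} \right]\left[ 1 - \cos(2\alpha y_i)\, \mathrm{e}^{-(\alpha\sigma)^2} \right] + 2A\alpha\sigma^2\, \mathrm{e}^{-\frac{1}{2}(\alpha\sigma)^2} \left[ \alpha\sigma^2 \cos(\alpha y_i) + 2 y_i \sin(\alpha y_i) \right] \Big\}. \tag{31}$$
The quantities EQ|xi from (19) and $D_i^2$ from (20) are given analogously by summing over N − 1 components. Expressions EQ and DQ could be inserted into (16), and EQ|xi with Qi(xi) and Di into (21). However, this is omitted at this point for better readability.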
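The closed-form mean and variance of the quality gain can be compared against sampled values of Q_y(x). The following sketch assumes the component form f_i(y_i) = y_i² + A − A cos(αy_i) and evaluates the mean and variance that follow from it for Gaussian mutations; the function names, the example position, and the sample size are illustrative choices, not taken from the paper.

```python
import math
import random


def quality_gain_moments(y, sigma, A, alpha):
    """Closed-form mean E_Q and variance D_Q^2 of the Rastrigin quality gain."""
    a2 = (alpha * sigma) ** 2
    e_half = math.exp(-0.5 * a2)
    e_full = math.exp(-a2)
    mean, var = 0.0, 0.0
    for yi in y:
        c, s = math.cos(alpha * yi), math.sin(alpha * yi)
        mean += sigma ** 2 + A * c * (1.0 - e_half)
        var += (2.0 * sigma ** 4 + 4.0 * yi ** 2 * sigma ** 2
                + 0.5 * A ** 2 * (1.0 - e_full)
                * (1.0 - math.cos(2.0 * alpha * yi) * e_full)
                + 2.0 * A * alpha * sigma ** 2 * e_half
                * (alpha * sigma ** 2 * c + 2.0 * yi * s))
    return mean, var


def sample_quality_gain(y, sigma, A, alpha, samples, seed=1):
    """Monte Carlo samples of Q = f(y + x) - f(y) with x ~ sigma * N(0, 1)."""
    rng = random.Random(seed)

    def f(v):
        return sum(vi * vi + A - A * math.cos(alpha * vi) for vi in v)

    f_parent = f(y)
    return [f([yi + sigma * rng.gauss(0.0, 1.0) for yi in y]) - f_parent
            for _ in range(samples)]


if __name__ == "__main__":
    y = [1.3, -0.4, 2.1, 0.7, -1.8]
    mean, var = quality_gain_moments(y, 0.3, 3.0, 2.0 * math.pi)
    q = sample_quality_gain(y, 0.3, 3.0, 2.0 * math.pi, samples=20000)
    q_mean = sum(q) / len(q)
    q_var = sum((v - q_mean) ** 2 for v in q) / (len(q) - 1)
    print(mean, q_mean)
    print(var, q_var)
```

For A = 0 the expressions reduce to the sphere model values Nσ² and 2Nσ⁴ + 4σ²R², which gives a quick deterministic check.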
As an important remark, expression (23) can be linearized w.r.t. mutation xi to obtain analytically solvable progress rate integrals, see also discussion after Eq. (51). Taylor-expanding fi around yi for small xi gives $f_i(y_i + x_i) = f_i(y_i) + f_i'(y_i)\, x_i + O(x_i^2)$, such that after setting $f_i' := \partial f_i / \partial y_i$ and evaluating the derivative one has
$$Q_i(x_i) \simeq f_i'\, x_i = (k_i + d_i)\, x_i \tag{32}$$
with following definitions applied to (32)
$$k_i := 2 y_i, \qquad d_i := \alpha A \sin(\alpha y_i), \qquad f_i' = k_i + d_i. \tag{33}$$
Component ki is the derivative of the quadratic term $y_i^2$, cf. Eq. (2), which follows the global quadratic structure of the function. Conversely, derivative di follows the local oscillation, such that it will be very important for the model of local attraction during the progress rate derivations in Secs. 3 and 4.
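To illustrate how quickly the linearization degrades, the sketch below compares the exact component quality gain with the linear model (k_i + d_i)x_i, assuming k_i = 2y_i and d_i = αA sin(αy_i). The function names and the chosen values of y_i and x_i are hypothetical.

```python
import math


def q_component(yi, xi, A, alpha):
    """Exact component quality gain f_i(y_i + x_i) - f_i(y_i)."""
    def f(v):
        return v * v + A - A * math.cos(alpha * v)
    return f(yi + xi) - f(yi)


def q_linearized(yi, xi, A, alpha):
    """Linear model (k_i + d_i) * x_i with k_i = 2*y_i, d_i = alpha*A*sin(alpha*y_i)."""
    k = 2.0 * yi
    d = alpha * A * math.sin(alpha * yi)
    return (k + d) * xi


if __name__ == "__main__":
    yi, A, alpha = 1.2, 1.0, 2.0 * math.pi
    for xi in (0.001, 0.01, 0.1, 0.5):
        exact = q_component(yi, xi, A, alpha)
        linear = q_linearized(yi, xi, A, alpha)
        print(f"x_i={xi}: exact={exact:.6f} linear={linear:.6f} "
              f"error={abs(exact - linear):.6f}")
```

The error grows roughly quadratically for small x_i and becomes comparable to the quality gain itself once x_i spans a sizable fraction of an oscillation period 2π/α, which is the regime where local attraction matters.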
3. First order progress rate
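While the first order progress rate (7) does not suffice to completely describe the convergence behavior of the ES on Rastrigin, see Sec. 5, it is a necessary step in the calculation of the second order progress rate in Sec. 4. Given definition (7) and the parental location y(g), one has to find the expected value over the i-component location $y_i^{(g+1)}$. The positional update y(g) → y(g+1) performed by the ES is realized by consecutively applying mutation, selection, and recombination (see Algorithm 1), such that one can write
$$\mathbf{y}^{(g+1)} = \frac{1}{\mu} \sum_{m=1}^{\mu} \left( \mathbf{y}^{(g)} + \mathbf{x}_{m;\lambda} \right) = \mathbf{y}^{(g)} + \frac{1}{\mu} \sum_{m=1}^{\mu} \mathbf{x}_{m;\lambda}, \tag{34}$$
where xm;λ denotes the mutation vector of the m-th best offspring after selection. Considering the i-th component of Eq. (34), abbreviating the mutation component as xm;λ := (xm;λ)i, and taking the expected value thereof yields
$$\mathrm{E}\!\left[ y_i^{(g+1)} \,\middle|\, \mathbf{y}^{(g)}, \sigma^{(g)} \right] = y_i^{(g)} + \frac{1}{\mu} \sum_{m=1}^{\mu} \mathrm{E}\!\left[ x_{m;\lambda} \,\middle|\, \mathbf{y}^{(g)}, \sigma^{(g)} \right]. \tag{35}$$
The progress rate can therefore be evaluated by inserting (35) into (7) giving
$$\varphi_i = -\frac{1}{\mu} \sum_{m=1}^{\mu} \mathrm{E}\!\left[ x_{m;\lambda} \,\middle|\, \mathbf{y}^{(g)}, \sigma^{(g)} \right]. \tag{36}$$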
Before starting the derivation of (36), the important large population theorem is stated which will be used during the derivation of both first and second order progress rate. Its application also yields the so-called asymptotic generalized progress coefficients presented in Eq. (45).
Theorem 1
Let λ > μ + 1 and μ > a with a ≥ 1 and ϑ = μ/λ with 0 < ϑ < 1, such that t^(λ−μ−1)(1 − t)^(μ−a) exhibits its maximum on (0, 1) and vanishes at t ∈ {0, 1}. Let fx(t) be a function defined for constant x ∈ ℝ, such that fx : [0, 1] → [0, 1] with bounded derivatives on [0, 1] and let B denote the beta function. Furthermore, let px denote the PDF of a normally distributed variate and let pn(x) denote a polynomial of degree n in x. For infinitely large μ, λ → ∞ and constant ϑ = μ/λ the following limit holds
| (37) |
Proof
The dominated convergence theorem is applied. First, the following sequence is defined for μ = 1, 2, …, with λ(μ) = μ/ϑ and constant ϑ
| (38) |
Note that gμ is measured over the density of the normal distribution. In [18] it was shown that gμ(x) converges for any x according to
| (39) |
An upper bound of gμ can be estimated using 0 ≤ fx ≤ 1 and the definition of the beta function as
| (40) |
A lower bound for the denominator of (40) can be given as
| (41) |
Inequality (41) can be shown easily by setting μ = a + k with integers a ≥ 1 and k ≥ 1 (ensuring μ > a). This yields
| (42) |
which is fulfilled for any a ≥ 1 and k ≥ 1. Using (41) in (40) one gets
| (43) |
As there is a constant upper bound of |gμ(x)|, it remains to show that
| (44) |
which is finite due to normal density px(x). Hence, the limit in Eq. (37) can be exchanged with the integral over x. Using the limit of (39) the desired result is obtained.
The limit (39) is readily used in [16] to define the so-called asymptotic generalized progress coefficients for integers a ≥ 1, b ≥ 0, and truncation ratio 0 < ϑ < 1 as
| (45) |
These are characteristic coefficients describing the progress in the limit μ, λ → ∞ with constant ϑ = μ/λ, and are related to the generalized progress coefficients [5, Eq. (5.112)]. They will reappear during the derivation of both φi and $\varphi_i^{\mathrm{II}}$. The derivation of φi is presented now.
Proposition 1
Let μ,λ ∈ ℕ with μ ≥ 1 and μ < λ and let px denote the PDF of the random mutation x ~ 𝒩 (0,σ²). Let xm;λ denote the m-th best value (out of λ) of the i-th mutation component (xm;λ)i. Furthermore, let PQ and $P_Q^{-1}$ denote the quality gain CDF and its inverse, respectively, with B denoting the beta function. Then, the first order component-wise progress rate is given by
| (46) |
Proof
From now on the conditional dependency on y(g) and σ(g) will be implicitly assumed as given for better readability of the equations. The expected value of the i-th mutation component xm;λ after selection can be expressed as an integral over the order statistic density pm;λ(xi) of the m-th best individual, such that (36) is rewritten as
| (47) |
The subsequent task will be to derive the density pm;λ as a function of mutation and quality gain distributions. Mutations are distributed normally with zero mean and variance σ2 according to the normal density
| (48) |
Given mutation xi (and implicitly position y), a random quality gain value Q is distributed according to a conditional probability density pQ (q|xi). Given that the m-th best individual attains a quality gain within [q, q + dq], there must be m − 1 better individuals having a smaller quality value with probability [Pr{Q ≤ q}]^(m−1) = [PQ (q)]^(m−1), and λ − m individuals having a larger value with [Pr{Q > q}]^(λ−m) = [1 − PQ (q)]^(λ−m). To account for all relevant combinations one has the factor λ!/((m − 1)!(λ − m)!), where 1/(m − 1)! and 1/(λ − m)! exclude the irrelevant combinations among the two groups of better and worse individuals, respectively. The conditional density for the m-th individual as a function of the quality gain q yields
| (49) |
By integrating (49) over all attainable quality gain values q ∈ [ql, qu], one arrives at the density
| (50) |
Inserting the order statistic density from (50) into the progress rate (47), one obtains the intermediate result
| (51) |
A few important remarks can be made regarding Eq. (51). A closed-form analytic solution cannot be obtained without applying further approximations. It can be approached in an analogous way to the φi-derivation of the Ellipsoid in [13] to obtain a solution in terms of the well-known progress coefficient cμ/μ,λ [5, p. 216]. However, a closed-form solution with this approach requires a linear relation of Qi w.r.t. xi, see relation (32). The effect of a linearized quality gain on the progress rate of the Rastrigin function was already studied in [17] and showed that the progress due to local attraction is not modeled correctly, as the oscillation terms have to be either dropped or linearized for small xi.
Therefore a different approach is followed here assuming the infinite population limit, an approach which was applied within the analysis of functions with noise-induced multi-modality [9]. The approach will yield correction terms including the effects of the trigonometric terms from (24), in contrast to only taking linearized terms from (32). Starting from Eq. (51) and moving the sum including the m-dependent prefactors into the innermost integral yields
| (52) |
Now a transformation can be applied for the sum Σm(·) yielding an expression as a function of the regularized incomplete beta function [5, p. 147]. One has
| (53) |
Furthermore, one can rewrite the resulting population-dependent factor as follows
| (54) |
where we have used the property of the gamma function Γ(n) = (n − 1)! (for any integer n > 0) and the known relation between gamma and beta functions B(a, b) = Γ(a)Γ(b)/Γ(a + b). These replacements will be useful later. After replacing the sum and refactoring we arrive at the following progress rate integral
| (55) |
Now the integration order of t and q is exchanged. In Eq. (55) one has the bounds
| (56) |
Defining the inverse transformation and integrating over t first, one obtains the new ranges
| (57) |
The progress rate yields
| (58) |
Now the innermost integral can be solved using pQ (q|xi) = dPQ (q|xi)/dq
| (59) |
where the probability PQ (ql|xi) = Pr(Q ≤ ql|xi) = 0 for any lower bound value ql. Inserting (59) into (58), we arrive at the progress rate integral (46).
Unfortunately a closed-form solution of (46) after inserting Approximation 1 and Approximation 2 for the quality gain CDF is not possible due to the underlying structure of the integrand. Hence, asymptotic approximations will be introduced assuming large populations and large dimensionality to successively simplify the integral in a way that closed-form solutions can be provided. First, the large population theorem will be applied and then the quality gain CDF is inserted. Thereafter, the normal CDF is Taylor-expanded with the first two terms yielding analytically solvable results and higher order terms vanishing as O(1/N). The results are further simplified in the end assuming component equipartition (12), which finally gives the progress rate result in (96).
Theorem 2
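Let px denote the PDF of the random mutation x ~ 𝒩 (0, σ²). Let PQ denote the quality gain CDF with its quantile function given by $P_Q^{-1}$. For a truncation ratio ϑ = μ/λ with 0 < ϑ < 1 the component-wise progress rate for large populations yields
$$\varphi_i \simeq -\frac{1}{\vartheta} \int_{-\infty}^{\infty} x_i\, p_x(x_i)\, P_Q\!\left( P_Q^{-1}(\vartheta) \,\middle|\, x_i \right) \mathrm{d}x_i. \tag{60}$$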
Proof
Starting from Eq. (46) and applying the infinite population size limit, the result of Theorem 1 can be applied with a = 1, pn(xi) = xi, and . Evaluating fx(t) at t = 1 − ϑ gives
| (61) |
which yields the result (60).
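The next step requires the use of Approximation 1 and Approximation 2 for the quality gain distributions in Eq. (60). To this end, one uses the conditional normal distribution function $P_Q(q \,|\, x_i) \simeq \Phi\big( (q - E_{Q|x_i}) / D_i \big)$, see (21), and the inverse transformation q = EQ + DQ Φ⁻¹(p) evaluated at p = ϑ, see (18). One obtains
$$P_Q\!\left( P_Q^{-1}(\vartheta) \,\middle|\, x_i \right) \simeq \Phi\!\left( \frac{E_Q + D_Q \Phi^{-1}(\vartheta) - E_{Q|x_i}}{D_i} \right). \tag{62}$$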
Given the normal approximation (62), an expression for EQ|xi is needed. Using definition (19) with Qi-result (24) the (conditional) expected value is written as
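$$E_{Q|x_i} = Q_i(x_i) + \sum_{j \neq i} \mathrm{E}[Q_j] = k_i x_i + \delta(x_i) + E_i. \tag{63}$$
In (63) the following definitions are introduced as abbreviations
$$E_i := \sum_{j \neq i} \mathrm{E}[Q_j], \qquad \delta(x_i) := x_i^2 + A\cos(\alpha y_i)\left[ 1 - \cos(\alpha x_i) \right] + A\sin(\alpha y_i)\sin(\alpha x_i), \tag{64}$$
with $k_i = 2 y_i$ as in (33). Given Eq. (63), quantity δ(xi) includes all non-linear terms in xi. This will be important when the normal CDF is expanded and analytically solved. Inserting relation (63) into (62) and the result into (60) yields
$$\varphi_i \simeq -\frac{1}{\vartheta} \int_{-\infty}^{\infty} x_i\, p_x(x_i)\, \Phi\!\left( \frac{E_Q - E_i + D_Q \Phi^{-1}(\vartheta) - k_i x_i - \delta(x_i)}{D_i} \right) \mathrm{d}x_i. \tag{65}$$
A closed-form solution of (65) cannot be obtained because the argument of Φ contains the non-linear terms δ(xi). However, a solution in terms of a Taylor expansion can be provided by introducing the decomposition Φ(g(xi) + h(xi)) with g(xi) being a linear function, and h(xi) being a small non-linear perturbation according to
$$g(x_i) := \frac{E_{Q_i} + D_Q \Phi^{-1}(\vartheta) - k_i x_i}{D_i} \tag{66}$$
$$h(x_i) := -\frac{\delta(x_i)}{D_i}. \tag{67}$$
In (66), the abbreviation EQi = EQ − Ei = E[Qi], cf. Eq. (10), is used to denote the expected value of the i-th summand of the quality gain (6). Using functions g(xi) and h(xi) Eq. (65) becomes
$$\varphi_i \simeq -\frac{1}{\vartheta} \int_{-\infty}^{\infty} x_i\, p_x(x_i)\, \Phi\big( g(x_i) + h(x_i) \big)\, \mathrm{d}x_i. \tag{68}$$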
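Although the large-population integral has no closed form, it can be evaluated numerically and compared with the large-N result derived later in this section. The sketch below is a rough check under stated assumptions: it takes the integrand to be −(1/ϑ) x p_x(x) Φ((E[Q_i] + D_Q Φ⁻¹(ϑ) − Q_i(x))/D_i), uses the component mean and variance of the Rastrigin quality gain, and assumes the closed form c_ϑσ²(k_i + exp(−(ασ)²/2) d_i)/D_Q with c_ϑ = exp(−[Φ⁻¹(ϑ)]²/2)/(√(2π)ϑ). All function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def component_moments(yi, sigma, A, alpha):
    """Mean and variance of a single quality gain component Q_i."""
    a2 = (alpha * sigma) ** 2
    e_half, e_full = math.exp(-0.5 * a2), math.exp(-a2)
    c, s = math.cos(alpha * yi), math.sin(alpha * yi)
    mean = sigma ** 2 + A * c * (1.0 - e_half)
    var = (2.0 * sigma ** 4 + 4.0 * yi ** 2 * sigma ** 2
           + 0.5 * A ** 2 * (1.0 - e_full) * (1.0 - math.cos(2.0 * alpha * yi) * e_full)
           + 2.0 * A * alpha * sigma ** 2 * e_half * (alpha * sigma ** 2 * c + 2.0 * yi * s))
    return mean, var


def phi_quadrature(i, y, sigma, A, alpha, theta, half_width=8.0, steps=4000):
    """Trapezoidal evaluation of the large-population progress rate integral."""
    moments = [component_moments(v, sigma, A, alpha) for v in y]
    d_q = math.sqrt(sum(var for _, var in moments))
    d_i = math.sqrt(sum(var for j, (_, var) in enumerate(moments) if j != i))
    shift = moments[i][0] + d_q * STD.inv_cdf(theta)  # E[Q_i] + D_Q * Phi^{-1}(theta)

    def f(v):
        return v * v + A - A * math.cos(alpha * v)

    dx = 2.0 * half_width * sigma / steps
    total = 0.0
    for k in range(steps + 1):
        x = -half_width * sigma + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        pdf = math.exp(-0.5 * (x / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)
        q_i = f(y[i] + x) - f(y[i])
        total += weight * x * pdf * STD.cdf((shift - q_i) / d_i)
    return -total * dx / theta


def phi_closed_form(i, y, sigma, A, alpha, theta):
    """Assumed large-N closed form of the first order progress rate."""
    d_q = math.sqrt(sum(component_moments(v, sigma, A, alpha)[1] for v in y))
    z = STD.inv_cdf(theta)
    c_theta = math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * theta)
    k_i = 2.0 * y[i]
    d_i = alpha * A * math.sin(alpha * y[i])
    return c_theta * sigma ** 2 * (k_i + math.exp(-0.5 * (alpha * sigma) ** 2) * d_i) / d_q


if __name__ == "__main__":
    y = [1.2] * 100
    print(phi_quadrature(0, y, 0.2, 1.0, 2.0 * math.pi, 0.25))
    print(phi_closed_form(0, y, 0.2, 1.0, 2.0 * math.pi, 0.25))
```

The two values are expected to approach each other as N grows, since the neglected terms scale with the ratio of single-component contributions to D_Q.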
Approximation 3 (Truncated cumulative distribution function series)
Under the assumption of a normally distributed quality gain, see Approximation 1 and Approximation 2, and a quality gain variance scaling with N according to Eq. (13), the CDF of the normal distribution is expanded at g(xi) in the limit of N → ∞ as
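$$\varphi_i = -\frac{1}{\vartheta} \int_{-\infty}^{\infty} x_i\, p_x(x_i)\, \Phi\big( g(x_i) \big)\, \mathrm{d}x_i - \frac{1}{\vartheta} \int_{-\infty}^{\infty} x_i\, p_x(x_i)\, \phi\big( g(x_i) \big)\, h(x_i)\, \mathrm{d}x_i + O\!\left( \frac{1}{N} \right). \tag{69}$$
Relation (69) is derived now. Starting from (68), the Taylor-expansion of Φ(·) up to first order with the remainder denoted by r yields
$$\Phi(g + h) = \Phi(g) + \phi(g)\, h + r. \tag{70}$$
Note that all derivatives of the normal distribution exist as $\frac{\mathrm{d}^{n+1}\Phi(x)}{\mathrm{d}x^{n+1}} = (-1)^n \mathrm{He}_n(x)\, \phi(x)$ with Hen (x) denoting the n-th order probabilist’s Hermite polynomials. In the following the scaling properties of the remainder as a function of N are investigated. It will be shown that r = O(1/N). To this end, (70) is rewritten as
$$r = \Phi(g + h) - \Phi(g) - \phi(g)\, h. \tag{71}$$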
For the further analysis of r(N) the equipartition of components is assumed as introduced in Eqs. (12), (13), and (14). Hence, the variance Di can be written as a function of N as
| (72) |
where the prefactor s ≠ s(N) depends on A, α, y, and σ. With these assumptions the functions g and h are written as (using E := EQi, , dropping the subscript i for brevity and using Di ≃ DQ)
| (73) |
As h → 0 for N → ∞, the remainder (71) vanishes accordingly. Therefore, in order to show r(N) = O(1/N), limN→∞ r(N)N is investigated applying l’Hôpital’s rule
| (74) |
To evaluate (74) the derivative of r from (71) w.r.t. N is evaluated as
| (75) |
The term of (75) is expanded up to first order discarding higher orders
| (76) |
The derivatives of g and h from Eq. (73) are
| (77) |
Inserting (73) and (77) into (76) yields after refactoring
| (78) |
Taking the limit (74) of (78) therefore yields
| (79) |
such that the remainder r(N) can be given as
| (80) |
which concludes the derivation of (69).
Both integrals of (69) are analytically solvable.2 The zeroth order term yields a closed form solution due to g(xi) being linear w.r.t. xi and gives progress contributions due to the sphere function, i.e., the linear part of the quality gain (63). The first order term can be solved by applying quadratic completion to the Gaussian product px(xi)ϕ(g(xi)) yielding an expected value over a normal density. The expected value over h(xi) can be regarded as a perturbation of the sphere containing A and α dependencies.
The determination of φi via (69) was done in [18] by evaluating both integrals. As the derivation and the final result for φi are very lengthy and therefore not practical for further analytic treatment, the obtained expression for φi was simplified as a last step assuming large dimensionality N. However, the same result as in [18] can be obtained in a quicker way by simplifying the integrands of (69) under the same assumptions before the integration, instead of simplifying the result afterwards. This will enable a more concise derivation of the final progress rate result.
First the functions g and h from (66) and (67), respectively, are simplified. For large N, the quality gain variance Di ≃ DQ using (14). As EQi is just the quality gain expectation of a single component, it can be neglected compared to DQ scaling as $\sqrt{N}$ using (13). Hence, one has
$$g(x_i) \simeq -\frac{k_i}{D_Q}\, x_i + \Phi^{-1}(\vartheta) \tag{81}$$
$$h(x_i) \simeq -\frac{\delta(x_i)}{D_Q}. \tag{82}$$
Another approximation is introduced regarding the density px(xi)ϕ(g(xi)) for the second term of (69). By completing the square one can derive a resulting normal density with mean m and variance ς2 by demanding
| (83) |
Simple calculations yield
| (84) |
Noting that and neglecting contributions of single components for N → ∞, i.e., , , the quantities m and ς2 from (84) yield the asymptotic results
| (85) |
such that the density of the first order term yields
| (86) |
Using the results from Eqs. (81), (82), and (86), the progress rate integral (69) is further simplified. The prefactors of the resulting integral yield the asymptotic progress coefficient (45)
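$$c_\vartheta := e_\vartheta^{1,0} = \frac{1}{\sqrt{2\pi}\, \vartheta} \exp\!\left[ -\frac{1}{2} \left( \Phi^{-1}(\vartheta) \right)^2 \right]. \tag{87}$$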
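The asymptotic progress coefficient can be evaluated directly from the normal quantile function. The sketch below assumes the form c_ϑ = exp(−[Φ⁻¹(ϑ)]²/2)/(√(2π)ϑ) and compares it with a Monte Carlo estimate of the finite-population coefficient, taken here as the expected average of the μ largest out of λ standard normal variates. The function names and sample sizes are illustrative.

```python
import math
import random
from statistics import NormalDist


def c_theta(theta):
    """Asymptotic progress coefficient for truncation ratio theta (assumed form)."""
    z = NormalDist().inv_cdf(theta)
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * theta)


def c_mu_lambda_monte_carlo(mu, lam, trials, seed=0):
    """Average of the mu largest of lam standard normal samples, over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        z = sorted((rng.gauss(0.0, 1.0) for _ in range(lam)), reverse=True)
        total += sum(z[:mu]) / mu
    return total / trials


if __name__ == "__main__":
    for mu, lam in ((5, 20), (50, 200), (250, 1000)):
        estimate = c_mu_lambda_monte_carlo(mu, lam, trials=200)
        print(mu, lam, estimate, c_theta(mu / lam))
```

With constant ϑ = μ/λ, the Monte Carlo estimate should move towards c_ϑ as the population grows, which indicates how large μ and λ must be for the asymptotic coefficient to be a useful substitute.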
Approximation 4 (Progress rate integral for large dimensionality)
Based on the result of Approximation 3 only the first two terms are considered. Furthermore, the integrands of (69) are approximated and simplified assuming large dimensionality using Eqs. (81), (82), (86), and (87). Hence, one obtains
| (88) |
| (89) |
| (90) |
Calculating from (89) by inserting mutation density px(xi) from (48) and applying the substitution z = xi/σ, one gets
| (91) |
The following integral identity [5, Eq. (A.12)] can be applied
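$$\int_{-\infty}^{\infty} z\, \mathrm{e}^{-\frac{1}{2}z^2}\, \Phi(a z + b)\, \mathrm{d}z = \frac{a}{\sqrt{1 + a^2}} \exp\!\left[ -\frac{1}{2} \frac{b^2}{1 + a^2} \right]. \tag{92}$$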
Evaluating (92) with a = −kiσ/DQ and b = Φ−1(ϑ) yields for the right-hand side of (92)
| (93) |
Again assuming , expression (93) simplifies and the result for (89) is obtained with (87) as
| (94) |
Now is solved. One notices that see (64), is integrated over density px with zero mean. Therefore, all odd functions of xi yield no contribution and only the term xi sin (αxi) needs to be evaluated. One gets
| (95) |
In the second line of (95) the expected value definition is used. From second to third line the expected value of xi sin (αxi) is evaluated using (29). In the last line the derivative di = αA sin (αyi) from (33) is recovered. Using the results from (94) and (95) the first order progress rate approximation for large N and μ can finally be given.
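The expected value used in this step can be verified numerically. For x ~ 𝒩(0, σ²) one has E[x sin(αx)] = ασ² exp(−(ασ)²/2), which is the source of the exponentially decaying factor in front of di. The following sketch compares a simple quadrature with this closed form; the function names and the chosen σ are illustrative assumptions.

```python
import math

def expect_x_sin(alpha, sigma, n=40000, width=10.0):
    """Midpoint-rule estimate of E[x sin(alpha x)] for x ~ N(0, sigma^2)."""
    lo = -width * sigma
    h = 2.0 * width * sigma / n
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for j in range(n):
        x = lo + (j + 0.5) * h
        total += x * math.sin(alpha * x) * norm * math.exp(-x * x / (2.0 * sigma**2))
    return total * h

def closed_form(alpha, sigma):
    """alpha * sigma^2 * exp(-(alpha sigma)^2 / 2)."""
    return alpha * sigma**2 * math.exp(-0.5 * (alpha * sigma) ** 2)

alpha, sigma = 2.0 * math.pi, 0.3
numeric, analytic = expect_x_sin(alpha, sigma), closed_form(alpha, sigma)
```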
First order progress rate
The first order component-wise progress rate on the Rastrigin function in the asymptotic limits of infinitely large population size μ (constant ϑ = μ/λ) and infinitely large dimensionality N yields
| (96) |
The expressions for from (45) and DQ from (31) were not inserted to improve readability. Result (96) shows very interesting properties compared to [17, Eq. (26)], where a linearized quality gain approximation resulted in
| (97) |
First note that the progress coefficient was replaced by its asymptotic form cμ/μ,λ ≃ cϑ. The difference for the variance terms in the denominators of (96) and (97) is negligible for large N with , see also (14). However, the most notable difference lies between the derivative term , see definition (33), and the newly obtained term . It contains an unchanged sphere-dependent term ki and an exponentially decaying Rastrigin-specific term di. This characteristic form will be discussed in the subsequent part. The result (96) will be essential for the determination of the second order .
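The structure described above can be made concrete with a small sketch. It assumes that (96) has the form φi = cϑ σ²/DQ · (ki + exp(−(ασ)²/2) di) with ki = 2yi and di = αA sin(αyi), and that the asymptotic progress coefficient is cϑ = exp(−[Φ⁻¹(ϑ)]²/2)/(√(2π) ϑ). Both forms are reconstructed from the description in the text and should be checked against (96) and (45). The standard deviation DQ from (31) is not reproduced in this section, so it is passed in as a plain number; the value 2.0 used in the example is hypothetical.

```python
import math
from statistics import NormalDist

def c_theta(theta):
    """Assumed asymptotic progress coefficient exp(-q^2/2) / (sqrt(2 pi) theta), q = Phi^{-1}(theta)."""
    q = NormalDist().inv_cdf(theta)
    return math.exp(-0.5 * q * q) / (math.sqrt(2.0 * math.pi) * theta)

def phi_first_order(y_i, sigma, D_Q, theta, A=1.0, alpha=2.0 * math.pi):
    """Return (total, sphere_part, local_part) of the assumed first order progress rate."""
    k_i = 2.0 * y_i                              # derivative of the quadratic part
    d_i = alpha * A * math.sin(alpha * y_i)      # derivative of the oscillating part
    pref = c_theta(theta) * sigma**2 / D_Q
    sphere_part = pref * k_i
    local_part = pref * math.exp(-0.5 * (alpha * sigma) ** 2) * d_i
    return sphere_part + local_part, sphere_part, local_part

# The two exemplary components of Fig. 3 (A = 10, alpha = 2 pi); D_Q = 2.0 is a placeholder.
left = phi_first_order(1.16, 0.2, 2.0, 0.25, A=10.0)
right = phi_first_order(0.78, 0.2, 2.0, 0.25, A=10.0)
```

Under these assumptions the local part is positive for yi = 1.16 and negative for yi = 0.78, in line with the enhanced and reduced progress discussed for the two components, and it vanishes rapidly once ασ exceeds a few units.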
At this point one-generation experiments can be performed and compared to the progress rate (96) to investigate its accuracy. To this end, a random position vector y is initialized isotropically with ‖y‖ = R for a given residual distance R. Then, repeated simulations are performed and quantity (7) is averaged over 10⁶ trials. The issue with the choice of R is that the “interesting” region with a high density of local minima scales with N, such that a relation R(N) is needed. The following argument can be given. Assume w.l.o.g. y > 0 and that all components of the parental position are at some given local minimum denoted by ŷ(j). Index j identifies the local attractor along the half-axis, e.g. j ∈ {1, 2, 3} in Fig. 1 on the right side. For N = 1 one has y = [ŷ(j)] and therefore R² = (ŷ(j))². Having N components at the same j-th local minimum yields y = [ŷ(j), ŷ(j), …, ŷ(j)], such that R² = N(ŷ(j))². A scaling R ∝ √N is therefore needed to stay within a certain region of local attractors when N is increased.
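A one-generation experiment of this kind can be sketched as follows. The code assumes that quantity (7) is the expected change yi(g) − yi(g+1), that the Rastrigin function has the form Σ(yi² + A − A cos(αyi)), and that the ES selects the μ best of λ offspring and recombines them by averaging. All function names are illustrative, and the number of trials is kept far below the 10⁶ used for the figures.

```python
import math
import random

def rastrigin(y, A, alpha):
    """Assumed Rastrigin form: sum of y_i^2 + A - A cos(alpha y_i)."""
    return sum(v * v + A - A * math.cos(alpha * v) for v in y)

def random_position(N, R, rng):
    """Isotropically distributed position with norm R."""
    v = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = math.sqrt(sum(c * c for c in v))
    return [R * c / norm for c in v]

def one_generation_progress(y, sigma, mu, lam, A, alpha, trials, rng):
    """Monte Carlo estimate of y_i^(g) - y_i^(g+1) for each component i."""
    N = len(y)
    acc = [0.0] * N
    for _ in range(trials):
        offspring = []
        for _ in range(lam):
            x = [rng.gauss(0.0, sigma) for _ in range(N)]
            f = rastrigin([yi + xi for yi, xi in zip(y, x)], A, alpha)
            offspring.append((f, x))
        offspring.sort(key=lambda t: t[0])       # minimization: smallest f first
        for i in range(N):
            mean_step = sum(x[i] for _, x in offspring[:mu]) / mu
            acc[i] -= mean_step                  # progress is the negative mean step
    return [a / trials for a in acc]
```

A call such as `one_generation_progress(random_position(20, 4.5, random.Random(1)), 0.2, 10, 40, 10.0, 2 * math.pi, 1000, random.Random(2))` returns one estimate per component, which can then be compared against the analytic progress rate for the same y.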
The progress rates of two exemplary components for a single experiment are shown in Fig. 3. For both plots σ ∈ [0, 1] was chosen in order to investigate the effects of the oscillation as α = 2π. On the left, one observes enhanced progress for moderate σ-values due to local attraction, as both local and global attractor are aligned along the same direction. On the right, there is negative progress for moderate σ, as the local attractor is driving the ES away from the global attractor. For larger σ, the overall spherical shape is dominating and both exhibit positive progress. A decomposition of the progress rate in terms of φi = φi(di, ki)|ki=0 + φi(di, ki)|di=0 is displayed in Fig. 3. It shows the large-scale behavior of the ki-term, dashed cyan, and limited range of the di-term, dotted green. As , its progress term models the global quadratic structure of Rastrigin, see derivative definitions (33). The second term models the Rastrigin-specific local oscillation having limited range depending on the mutation strength σ (or α). By defining scale-invariant mutations using (4) with σ = σ*R/N, the oscillations vanish via for large residual distance R, where the sphere function is recovered. This model significantly improves the progress rate formula (97) from [17].
Fig. 3.
One-generation experiments with (10/10, 40)-ES for N = 20, A = 10, α = 2π at a randomly chosen position y. The results for φi of Eq. (96) are shown for the exemplary components i = 2 with yi = 1.16 (left) and i = 12 with yi = 0.78 (right) to illustrate the effect of local attraction on the progress rate. The plots show additionally Eq. (96) with φi(ki) = φi(di, ki)|di=0 [cyan, dashed] and φi(di) = φi(di, ki)|ki=0 [green, dotted], respectively.
As a note, changing one of the fitness parameters A or α directly affects Fig. 3. The change of amplitude A rescales both the (local) peak and dip heights accordingly, increasing the effects of local attraction for larger A. Increasing frequency α has mostly short-range effects as the overall range is reduced due to suppression via of (96). In the subsequent parts, the progress rate is investigated for A = 1 and α = 2π as an example.
In Figs. 4 and 5 the progress rate is evaluated over scale-invariant σ* for two different N-values and population sizes. One can see that the approximation quality improves for larger N and μ, as expected from the applied approximations. The overall agreement between simulation and approximation is good for larger and smaller residual distances R, see left and right plots, respectively. The σ*-range was chosen large enough, such that the progress rate of the corresponding sphere function [5, Eq. (6.54)] reaches negative values due to mutations being too large. This boundary directly translates to Rastrigin, as the global structure is the same. However, due to φi being first order, no negative progress occurs even for large σ*. Therefore the second order progress rate needs to be derived in Sec. 4, where loss terms will provide additional correction terms.
Fig. 4.
Progress rate φi as a function of the normalized mutation σ* for (10/10, 40)-ES with N = 20, A = 1, α = 2π, at two residual distances with yi = 11.6 (left) and with yi = 0.116 (right). As in Fig. 3, black dots depict the simulation, while the red dash-dotted line shows result (96). The error bars are very small and therefore not visible.
Fig. 5.
Progress rate φi as a function of the normalized mutation σ* for (100/100, 200)-ES with N = 100, A = 1, α = 2π, at two residual distances with yi = 11.9 (left) and with yi = 0.119 (right). The approximation quality improves compared to Fig. 4 and shows very good agreement.
4. Second order progress rate
The second order progress rate (8) requires the evaluation of the expected squared position E[(yi(g+1))²]. Starting with intermediate result (34) and referring to the i-th component, the expression yields after squaring
| (98) |
Squaring the last term can be evaluated by separating the sum into equal and unequal indices
| (99) |
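The separation into equal and unequal indices is the elementary identity (Σm xm)² = Σm xm² + Σk≠l xk xl, where the mixed part can equivalently be written as twice the sum over k < l. A minimal numerical check, with arbitrary hypothetical values standing in for the selected mutation components:

```python
import random

rng = random.Random(0)
mu = 6
x = [rng.gauss(0.0, 1.0) for _ in range(mu)]   # placeholder values for the mu selected components

square_of_sum = sum(x) ** 2
equal = sum(v * v for v in x)                                               # mu terms
unequal = sum(x[k] * x[l] for k in range(mu) for l in range(mu) if k != l)  # mu*(mu-1) terms
ordered = 2.0 * sum(x[k] * x[l] for k in range(mu) for l in range(k + 1, mu))

n_equal = mu
n_unequal = mu * (mu - 1)
```

After division by μ², as required by the averaging of intermediate recombination, the mixed part consists of μ(μ − 1) terms, which is consistent with the factor (μ − 1)/μ appearing later in the derivation.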
Inserting (99) into (98) and taking the expected value (conditional variables y(g) and σ(g) are implicitly assumed to be given) yields
| (100) |
Noting that see Eq. (36), and using (100) in -definition (8) yields the second order i-th component progress rate
| (101) |
for which the two following expected values need to be determined
| (102) |
| (103) |
In the subsequent parts the solutions to Eqs. (102) and (103) will be derived. Starting with (102), the solution requires order statistic density (50) for the m-th individual, large population identity (37), and the expansion of the normal CDF (69) up to first order. The resulting two integrals can then be solved analytically for large N and the results will simplify significantly.
Proposition 2
Let μ,λ ∈ ℕ with μ ≥ 1 and μ < λ and let px denote the PDF of the random mutation x ~ 𝒩 (0, σ2). Let xm;λ denote the m-th best value (out of λ) of the i-th mutation component (xm;λ)i. Furthermore, let PQ and denote the quality gain CDF (and its inverse), respectively, with B denoting the beta function. Then, the second order expected value reads
| (104) |
Proof
Starting from (102) and rewriting the expected value as an integral over order statistic density pm;λ(xi) yields
| (105) |
Both (47) and (105) have the same structure after inserting pm;λ(xi) from (50) and the integration over the squared mutation component is performed as the last step. The same steps as presented in the proof of Proposition 1 can therefore be applied with squared quantity , which directly gives the result (104).
Analogously to the derivation of the first order progress rate in Sec. 3, a closed-form solution for (104) can only be provided by first applying the limit of large populations and then introducing approximations assuming large dimensionality N.
Theorem 3
Let px denote the PDF of the random mutation x ~ 𝒩 (0, σ2) and let xm;λ denote the m-th best value (out of λ) of the i-th mutation component (xm;λ)i. Let PQ denote the quality gain CDF with its quantile function given by . For a truncation ratio ϑ the limit of the second order expected value reads
| (106) |
Proof
Starting from Eq. (104) and applying the infinite population size limit, the result of Theorem 1 can be applied with a = 1, which yields the result (106).
Given result (106), approximations are again applied to provide closed-form solutions. Inserting quality gain Approximation 1 and Approximation 2 via Eq. (62) into (106) leads (again) to an analytically not solvable integral due to non-linear terms in xi within Φ(·). Therefore, the CDF is expanded using Approximation 3 neglecting higher order terms O(1/N). Finally, the integrands are simplified assuming large dimensionality using Approximation 4. The result is therefore given after inserting g(xi) and h(xi) from (81) and (82) as
| (107) |
| (108) |
| (109) |
The two integrals abbreviated as and are evaluated now. For , the substitution z = xi/σ is introduced
| (110) |
The following integral identity [16] is applied for real parameters a and b
| (111) |
Evaluating (111) with a = −kiσ/DQ, b = Φ−1(ϑ) from (108) yields for the right-hand side of (111)
| (112) |
Assuming for large N further simplifies (112) and one obtains the result
| (113) |
For (113) the asymptotic generalized progress coefficient definition from (45) can be applied with parameters a = 1 and b = 1
| (114) |
This leads to the following result for the first integral
| (115) |
The second integral from (109) is expressed using expected values over the normal density px. With δ(xi) given in Eq. (64) one gets
| (116) |
One has and . Using results from (29) the remaining expected values read
| (117) |
Therefore, one gets
| (118) |
Collecting the results (115) and (118) with ki = 2yi and inserting them back into (107) the expected value finally reads
| (119) |
The solution of the second expected value from (103) is presented now. First an exact integral is derived. Then, approximations are applied to give closed-form solutions.
Proposition 3
Let μ, λ ∈ ℕ with μ ≥ 1 and μ < λ and let px denote the PDF of the random mutation x ~ 𝒩 (0, σ2). Let xk;λ denote the k-th best value (out of λ) of the i-th mutation component (xk;λ)i. Furthermore, let PQ and denote the quality gain CDF (and its inverse), respectively, with B denoting the beta function. Then, the second order expected value reads
| (120) |
Proof
First, a joint order statistic density has to be derived for the expected value. Then, the double sum is converted into a single integral using a known identity. The resulting five-fold integration is restructured by exchanging bounds and then successively solved.
Starting with (103), the double sum includes mixed contributions from the k-th and l-th best elements of the i-th mutation component. To avoid confusion with the summation indices k and l, the integration variables associated with k-th element will be denoted as x1 (mutation) and q1 (quality), while the l-th element is integrated over x2 and q2. The ordering 1 ≤ k < l ≤ λ is assumed with k yielding a smaller (better) quality value q1 < q2. Additionally, the joint probability density pk,l;λ(x1, x2) is needed, such that the expected value can be formulated as
| (121) |
The mutation densities are independent and denoted by px(x1) and px(x2), respectively. Given mutation components x1 and x2, the conditional density obtaining the quality values q1 and q2 is pQ (q1|x1) and pQ (q2|x2), respectively. Given q1 and q2, one has k − 1 values smaller than q1, l − k − 1 values between q1 and q2 and λ − l values larger than q2 with probabilities
| (122) |
and PQ (q) denoting the quality gain CDF. The joint probability density can therefore be written as
| (123) |
with integration ranges qmin ≤ q1 < ∞ and q1 < q2 < ∞ as k < l. Lower bound qmin denotes the smallest possible quality value, which is resolved later. The factorials exclude the irrelevant combinations among the three groups given in (122). Plugging (123) into (121) and moving the sum into the innermost integral gives
| (124) |
The double sum of (124) over the PQ -values will be expressed by an integral. This can be done using an identity from [4, p. 113]. Setting ν = 2 and identifying the indices as i1 = l and i2 = k, the identity yields
| (125) |
for real values Q1 and Q2, with integers ν ≤ μ < λ. Now the substitution Q1 = 1 − PQ (q2), Q2 = 1 − PQ (q1) can be performed and the double sum of (124) can be recognized by comparing with (125). Applying the identity therefore yields
| (126) |
Hence, Eq. (124) is expressed as
| (127) |
The prefactor of Eq. (127) can be evaluated as
| (128) |
Now the integration order will be exchanged twice in (127). First the order between t and q2 is exchanged. Then the order between t and q1 is exchanged, such that both q-integrations are performed before the t-integration enabling the application of the large population identity (37). Starting with integration bounds
| (129) |
and using the inverse function with the exchanged bounds between t and q2 are
| (130) |
Using factor (128) and exchanged bounds (130), the expression (127) is reformulated as
| (131) |
Now the integration order between t and q1 is exchanged starting from
| (132) |
yielding exchanged bounds
| (133) |
Therefore, one arrives at the following integral to be solved (beta function has been moved inside as it will be evaluated during the t-integration)
| (134) |
Now the integrals in (134) will be successively solved. Starting with integral {·} over q2 one has
| (135) |
The q1-integration within [·] using (135) yields
| (136) |
| (137) |
| (138) |
The first integral (137) is easily evaluated, as the conditional density is integrated over its support giving
| (139) |
with PQ (qmin|x1) = Pr{Q ≤ qmin|x1} = 0. Note that the resulting factors are equal up to the conditional variables x1 and x2.
The second integral (138) will be simplified using integration by parts. Thereafter, one can exchange the x1 and x2 variables to find a simpler expression for the original integral. Integration by parts yields
| (140) |
Equation (140) inserted into (134) has to be integrated over x1 and x2, of which the order can be exchanged. For the following step the t-integration and the prefactors of (134) have no influence, such that they are dropped for better readability. Integrating both sides of (140) yields
| (141) |
where in the last line the integration order of x1 and x2 was exchanged, such that an expression equivalent to the left-hand side of (141) is obtained with given arguments for pQ and PQ. Collecting the terms, Eq. (141) can be formulated as
| (142) |
Noting that the right-hand side of result (142) is one half of the first integration result (139) after x-integration and noting the minus sign in (138), one gets for (136) the expression
| (143) |
Inserting the results of (143) back into [·] of (134) and including all prefactors, the five-fold integral simplifies providing the desired result of Eq. (120).
Theorem 4
Let px denote the density of the i-th component mutation x ~ 𝒩 (0, σ2) and let xk;λ denote the k-th best value (out of λ) of the i-th mutation component (xk;λ)i. Let PQ denote the quality gain CDF with its quantile function given by For a truncation ratio ϑ the limit of the second order expected value reads
| (144) |
Proof
Starting from Eq. (120) the μ-dependent prefactor was rearranged in a way that the factor (μ − 1)/μ in (120) is retained in the final result. Formally one could include (μ − 1)/μ in the sequence (38) and take the limit. However, it is desirable to keep the factor in the progress rate as a correction for finite μ-values. As a next step, one can define As 0 ≤ fx(t) ≤ 1 the same bound estimation as in (43) holds. Furthermore, both mutation integrals over density px are finite, see also (44). Therefore, the limit is evaluated with and a = 2 as
| (145) |
with xi re-introduced in the last line to denote the i-th mutation component, which gives Eq. (144).
In [·] of result (144), one can identify the first order progress rate −φi within the large population limit derived in Eq. (60). Refactoring (144) to obtain one can insert the φi-approximation from (96). Noting that via (45), one gets
| (146) |
Finally, inserting the results from (119) and (146) into (101), one obtains the second order progress rate
| (147) |
which serves as an approximation in the asymptotic limit of infinitely large dimensionality and population size. However, experimental investigations will also show good agreement for finite N, μ, and λ.
For future investigations of the convergence and step-size adaptation properties of the (μ/μI,λ)-ES, a simpler expression than (147) is needed. To this end, the N-dependency of the terms within {·} of (147) is investigated. It will be shown that for N → ∞ and μ = o (N) only the term −σ2/μ yields relevant contributions. The relevant terms in {·} of Eq. (147) are abbreviated according to their respective factors as , cϑ/DQ and . In order to maximize the absolute value of the individual terms a lower bound for is needed. Given the form of from Eq. (31), no useful lower bound for the variance could be established satisfying > 0 for any yi due to the trigonometric terms. Therefore, we will restrict the analysis to the sphere limit case A → 0. This assumption might seem crude. However, the most important characteristics are already contained in the first φi-dependent term of (147) referred to as the gain term in sphere model theory [5]. On the other hand, the loss terms in {·} are mostly dominated by the first term −σ2/μ. Experiments will affirm this assumption.
As the -approximation shall be valid for a constant σ* given any R-value, the mutation strength is re-normalized using (4)
| (148) |
Setting A = 0, σ = σ*R/N, and in (31), one obtains the sphere variance for constant normalized mutation strength as
| (149) |
In the limit N → ∞ the second term of (149) is negligible for constant σ* giving
| (150) |
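The size of the neglected term can be quantified with a short sketch. It assumes the sphere quality gain variance DQ² = 4R²σ² + 2Nσ⁴, obtained from the variance of Σ(2yi xi + xi²) for independent normal xi; the exact layout of (149) is not visible in this section. With σ = σ*R/N, the ratio of the exact standard deviation to the asymptotic form 2R²σ*/N is √(1 + σ*²/(2N)).

```python
import math

def sphere_DQ_exact(R, sigma_norm, N):
    """sqrt(4 R^2 sigma^2 + 2 N sigma^4) with sigma = sigma_norm * R / N (assumed sphere variance)."""
    sigma = sigma_norm * R / N
    return math.sqrt(4.0 * R * R * sigma**2 + 2.0 * N * sigma**4)

def sphere_DQ_asymptotic(R, sigma_norm, N):
    """Leading term 2 R^2 sigma_norm / N for N -> infinity at constant sigma_norm."""
    return 2.0 * R * R * sigma_norm / N

def ratio(R, sigma_norm, N):
    return sphere_DQ_exact(R, sigma_norm, N) / sphere_DQ_asymptotic(R, sigma_norm, N)

# The ratio is independent of R and tends to 1 as N grows at fixed sigma_norm.
ratios = [ratio(10.0, 5.0, N) for N in (20, 100, 1000, 10000)]
```

For σ* = 5 the correction is about 27% at N = 20 but below 0.1% at N = 10000, which illustrates how slowly the asymptotic form is approached when σ* is not small compared to √N.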
Having obtained the sphere variance asymptotic in (150), the terms within {·} of (147) are evaluated. The term with prefactor yields with σ = σ*R/N and using (150)
| (151) |
It was used in (151) that a single component contributes in expectation 1/N to the residual distance see also (12). The second term with prefactor cϑ/DQ using DQ ≃ 2R2σ*/N with A = 0 as
| (152) |
The last term with prefactor yields with A = 0 and using (150)
| (153) |
In (153) the notation μ(N) was introduced to emphasize that the population size is usually chosen depending on the dimensionality of the search space. Finally, inserting the results of the loss term investigation for the three terms (151), (152), and (153) back into progress rate (147), one gets for the loss term in {·} of (147)
| (154) |
Provided that the population size μ = o (N), i.e., increasing sub-linearly with N, all terms except “1” in {·} can be neglected for N → ∞. Theoretical results concerning population sizing, i.e., choosing the necessary μ(N) to achieve high global convergence probability (success probability), are not available at this point. It is one of the main future goals of the current research project. Note that treating μ as a constant is also not satisfactory, since for large N an increase of μ is necessary to maintain a high success rate on a highly multimodal problem. However, experimental investigations on the Rastrigin function including step-size adaptation suggest a sub-linear relation, which validates the approximation. Finally, the lengthy result (147) is simplified using the loss term asymptotic of (154) and the second order progress rate approximation is obtained.
Second order progress rate
The second order component-wise progress rate on the Rastrigin function in the asymptotic limits of infinitely large population size μ (constant ϑ = μ/λ) and infinitely large dimensionality N with μ = o (N) yields
| (155) |
| (156) |
The expressions for from (45) and DQ from (31) were not inserted to improve readability. The first line (155) emphasizes the dependence of and can be thought of as a more general formula provided that φi is known and the loss term behaves similarly to the sphere function loss term −σ2/μ. The second line (156) shows the explicit results for the Rastrigin function. The results (155) and (156) can be mapped to the Evolutionary Progress Principle [5] as the expressions contain a progress gain and loss term, respectively. Here, the gain part scales with cϑ and it is a yi-dependent expression. Hence, depending on the sign of yi sin (α yi) it may also yield negative contributions due to local attraction moving the ES away from the global optimizer, cf. Fig. 3. The loss term −σ2/μ is characteristic for intermediate recombination. It introduces significant loss for large σ, but can be decreased using a larger μ due to recombination effects.
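The gain and loss structure can be written down as a short sketch. It assumes that (155) reads 2yi φi − σ²/μ and that (156) is the same expression with the first order rate cϑ σ²/DQ · (2yi + exp(−(ασ)²/2) αA sin(αyi)) inserted; these forms are reconstructed from the description above and should be checked against the equations. DQ is again passed in as a number, and the value 2.0 below is hypothetical.

```python
import math
from statistics import NormalDist

def c_theta(theta):
    """Assumed asymptotic progress coefficient exp(-q^2/2) / (sqrt(2 pi) theta), q = Phi^{-1}(theta)."""
    q = NormalDist().inv_cdf(theta)
    return math.exp(-0.5 * q * q) / (math.sqrt(2.0 * math.pi) * theta)

def phi_first(y_i, sigma, D_Q, theta, A, alpha):
    """Assumed first order rate: c * sigma^2 / D_Q * (k_i + exp(-(alpha sigma)^2/2) * d_i)."""
    damp = math.exp(-0.5 * (alpha * sigma) ** 2)
    return c_theta(theta) * sigma**2 / D_Q * (2.0 * y_i + damp * alpha * A * math.sin(alpha * y_i))

def phi_second_general(y_i, phi_i, sigma, mu):
    """General form: gain 2 y_i phi_i minus the recombination loss sigma^2 / mu."""
    return 2.0 * y_i * phi_i - sigma**2 / mu

def phi_second_rastrigin(y_i, sigma, D_Q, theta, mu, A, alpha):
    """Explicit Rastrigin form with the gain term written out."""
    damp = math.exp(-0.5 * (alpha * sigma) ** 2)
    gain = c_theta(theta) * sigma**2 / D_Q * (
        4.0 * y_i**2 + damp * 2.0 * alpha * A * y_i * math.sin(alpha * y_i))
    return gain - sigma**2 / mu

y_i, sigma, D_Q, theta, mu, A, alpha = 0.78, 0.3, 2.0, 0.25, 10, 1.0, 2.0 * math.pi
general = phi_second_general(y_i, phi_first(y_i, sigma, D_Q, theta, A, alpha), sigma, mu)
explicit = phi_second_rastrigin(y_i, sigma, D_Q, theta, mu, A, alpha)
```

At yi = 0 the gain vanishes and only the loss −σ²/μ remains, so increasing μ directly reduces the loss in this sketch.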
Results of one-generation experiments are presented in Figs. 6 and 7 by evaluating (8) over 10⁶ trials (black dots with vanishing error bars) and comparing with the obtained approximations. The red dash-dotted line shows the simplified result (156), while the blue dashed line shows (147). The positions y were initialized randomly (given R) and kept constant over all repetitions. Fig. 6 shows a smaller dimensionality N = 20 and truncation ratio ϑ = 1/4, while Fig. 7 shows larger values N = 100 with ϑ = 1/2. This was done to investigate the results for different parameter sets by way of example.
Fig. 6.
Second order progress rate as a function of σ* for (10/10,40)-ES with N = 20, A = 1, α = 2π, at two residual distances with yi = 11.6 (left) and with yi = 0.116 (right). The dashed blue curves show Eq. (147) and the dash-dotted red curves Eq. (156).
Fig. 7.
Second order progress rate as a function of σ* for (100/100, 200)-ES with N = 100, A = 1, α = 2π, at two residual distances with yi = 11.9 (left) and with yi = 0.119 (right). The dashed blue curves show Eq. (147) and the dash-dotted red curves Eq. (156).
The first thing to note is that the loss term allows negative progress for large σ*, which was not the case for φi. The approximation quality is good for different R-values (see left and right plots, respectively) and improves for larger N and μ in Fig. 7, as expected. The simplified expression (156) [red, dash-dotted] yields good results compared to (147) [blue, dashed], with (147) giving slightly better results for smaller σ* and (156) better results at larger σ*. This indicates that additional terms of the Taylor expansion (70) would be needed to further improve the results of (147). However, this would make the expression more involved, which is not desired. Furthermore, the results of Fig. 6 are relatively good considering that a rather small population (10/10, 40)-ES was used at low dimensionality N = 20. One can conclude that (156) yields very good results considering its “simplicity”. It will therefore be used in Sec. 5 to investigate the dynamical behavior of the ES. It should be noted that at this point there is no aggregated progress measure over all N components, such as the R-dependent sphere progress rate. Given some y(g) one can evaluate the second order progress rate for all i = 1, …, N and obtain a progress vector, but the overall effect on R(g) → R(g+1) is not known. This will be part of future research. However, the cumulative effect of all N progress rates can be evaluated within a dynamical systems model, which is presented in the next section.
5. Evolution equations
In the previous sections one-generation experiments were conducted and compared against progress rate results (96), (147), and (156). In order to have an aggregated measure over all components and many generations, φi and the second order progress rate will be used within the evolution equations and compared to real optimization runs of Algorithm 1. Using this method the (mean) global convergence behavior can be investigated.
Given definitions for first and second order progress (7) and (8), the expressions can be reformulated as stochastic iterative mappings between two generations g → g + 1 according to
| (157) |
| (158) |
The two terms ϵ(1) and ϵ(2) can be interpreted as fluctuations w.r.t. the expected values (provided by φi and ). Thus, it holds E[ϵ(1)] = 0 = E[ϵ(2)]. However, the exact transition densities for g → g + 1 are not known at this point. In principle, they could be approximated using a finite number of higher order moments (or cumulants) to model the fluctuations [5, Ch. 7]. However, for a first study of the progress rate results on the dynamics, the fluctuations are neglected by setting ϵ(1) = 0 = ϵ(2). Therefore, one arrives at the (deterministic) equations describing the mean-value dynamics of the parental position coordinates
| (159) |
| (160) |
with constant normalized mutation strength σ* from Eq. (4) giving
| (161) |
Two important issues need to be discussed. Firstly, the positional iterations are defined for a single component i. For large N however, it is not feasible to display each component individually. While the components will be iterated separately, the dynamics will be presented as a function of the residual distance R = ‖y(g)‖. Secondly, for the evaluation of being a function of y(g), the square root of the components has to be taken after iteration giving two solutions . As the corresponding terms of and are even in , both solutions are equivalent.
In the following, the deterministic iterations (159) and (160) using mutation strength rescaling (161) are compared to real optimization runs. For the initialization, y(0) is chosen randomly such that ‖y(0)‖ = R(0) for a given R(0). The starting position is kept constant for consecutive runs of the same experiment. For the magnitude of R(0) it is ensured that the strategy starts far enough away from the local minima landscape. Given Fig. 1 with A = 1, the farthermost local minimizer is at yi ≈ 3 with resulting for N-components, such that is chosen.
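The deterministic iteration of the squared components can be sketched as follows, under several assumptions that should be checked against the equations of the paper: the second order rate has the form cϑ σ²/DQ · (4yi² + exp(−(ασ)²/2) 2αA yi sin(αyi)) − σ²/μ, the mutation strength is rescaled as σ = σ*R/N in every generation, and the function `DQ` below is a directly calculated variance of the Rastrigin quality gain that stands in for Eq. (31), which is not reproduced in this section. Clamping the squared components at zero is an additional choice of this sketch.

```python
import math
from statistics import NormalDist

def c_theta(theta):
    """Assumed asymptotic progress coefficient."""
    q = NormalDist().inv_cdf(theta)
    return math.exp(-0.5 * q * q) / (math.sqrt(2.0 * math.pi) * theta)

def DQ(y, sigma, A, alpha):
    """Standard deviation of the Rastrigin quality gain (stand-in for Eq. (31), derived for this sketch)."""
    u = math.exp(-(alpha * sigma) ** 2)
    total = 0.0
    for v in y:
        total += 4.0 * sigma**2 * v * v + 2.0 * sigma**4
        total += 0.5 * A * A * (1.0 - u) * (1.0 - math.cos(2.0 * alpha * v) * u)
        total += 2.0 * A * alpha * sigma**2 * math.sqrt(u) * (
            alpha * sigma**2 * math.cos(alpha * v) + 2.0 * v * math.sin(alpha * v))
    return math.sqrt(max(total, 0.0))

def iterate(y0, sigma_norm, mu, lam, A, alpha, generations):
    """Iterate the squared components and return the residual distances R(0), R(1), ..."""
    N = len(y0)
    c = c_theta(mu / lam)
    y2 = [v * v for v in y0]
    history = [math.sqrt(sum(y2))]
    for _ in range(generations):
        R = history[-1]
        sigma = sigma_norm * R / N               # scale-invariant mutation strength
        y = [math.sqrt(v) for v in y2]           # positive root; the rate is even in y_i
        dq = DQ(y, sigma, A, alpha)
        if R == 0.0 or dq == 0.0:
            break
        damp = math.exp(-0.5 * (alpha * sigma) ** 2)
        new_y2 = []
        for v in y:
            gain = c * sigma**2 / dq * (4.0 * v * v + damp * 2.0 * alpha * A * v * math.sin(alpha * v))
            new_y2.append(max(v * v - gain + sigma**2 / mu, 0.0))  # clamp is an assumption
        y2 = new_y2
        history.append(math.sqrt(sum(y2)))
    return history

# Sphere limit A = 0 with equal components: R^2 shrinks by a constant factor per generation.
sphere_run = iterate([10.0 / math.sqrt(20.0)] * 20, 5.0, 10, 40, 0.0, 2.0 * math.pi, 10)
```

In the sphere limit the sketch reproduces linear convergence, i.e. a constant reduction factor of R per generation. For A > 0 the same function can be called with a randomly initialized y(0) to study how the choice of σ* affects the predicted R(g)-dynamics.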
Considering the choice of σ* one observes in experiments that larger mutation strengths (compared to a sphere-optimal σ*) increase the success probability PS of individual trials to converge to the global optimizer. This is due to the fact that large steps tend to overcome local attraction more easily. However, this comes at the expense of efficiency, since large steps are often overshooting the global optimizer. Therefore in Fig. 8, σ* is chosen larger than the sphere-optimal value which can be obtained numerically from [5, Eq. (6.54)], but small enough to prevent negative progress. The aim was to obtain PS ≈ 1.
Fig. 8.
Comparison of real optimization runs with mean value dynamics using progress rates φi via (157) [dashed blue] and via (158) [dash-dotted red]. Gray lines show all 100 successful runs of Algorithm 1 and the black line shows the median thereof. The left plot shows (10/10, 40)-ES for N = 20 with and the right one (100/100, 200)-ES for N = 100 with . For both experiments A = 1, and α = 2π are chosen. The resulting success probability PS = 1.
In order to aggregate the R(g)-data of multiple dynamic experiments, the median has proven to be a suitable measure of central tendency. The main issue is that, due to fluctuations, the R(g)-values of distinct ES runs may differ by orders of magnitude, such that the mean is dominated by the few runs with the largest values of a skewed distribution. The median is more robust in this case.
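The difference between the two aggregation methods can be illustrated with a small sketch. The run data below are hypothetical and only mimic the situation in which one run lags behind the others by several orders of magnitude.

```python
from statistics import mean, median

def aggregate(runs):
    """Per-generation median and mean over runs, each run being a list of R(g) values."""
    med = [median(col) for col in zip(*runs)]
    avg = [mean(col) for col in zip(*runs)]
    return med, avg

# Hypothetical residual distances of three runs over three generations.
runs = [
    [10.0, 1.0, 1e-3],
    [10.0, 0.5, 1e-6],
    [10.0, 5.0, 4.0],   # a run that lags behind
]
med, avg = aggregate(runs)
```

In the last generation the mean is dominated by the lagging run, whereas the median stays at the value of the middle run.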
In Fig. 8 one can observe three phases within the dynamics. First, linear convergence is observed for large R(g)-values, where the sphere function dominates. Then, a slowdown is observed due to increasing effects of local attraction. For small R(g)-values, the ES descends into the global attractor basin and linear convergence can be observed again. One can see that the φi-iteration (blue) shows far too much progress compared to the second order iteration. This is due to the first order model, which does not include loss terms and overestimates the progress significantly, see also the discussion of result (96). The iteration via the second order progress rate (red) shows good results compared to the median curve, especially for larger μ and N (right plot). Better agreement for large populations is also due to reduced fluctuation effects, which were neglected at the beginning of Sec. 5.
In Fig. 9 the effect of reduced σ* is investigated, which increases the probability of local convergence. The left plot shows σ* = 5 with no globally converging runs, as the mutation strength is too low. Technically, for constant σ* there is no local convergence as the algorithm never stops if R is not decreasing. Still, the experiments are stopped after some g-threshold is reached. The stagnating behavior of the ES around some R(g) can be illustrated using Fig. 3. For σ = 0.2 one has σ* = σ N/R ≈ 0.9, which is small compared to . Both left and right progress components of Fig. 3 are significantly influenced by the local attraction region at σ = 0.2. While some components may be improved (positive value left), others are worsened (negative value right) resulting in a cumulative effect of R(g)-stagnation. One way out can be increasing σ (or equivalently σ*). However, the local minima landscape changes with changing R and arbitrary σ*-increase is not possible. Stagnation may appear at different σ* and R(g)-values depending on fitness and strategy parameters. For an active step-size adaptation, changing σ appropriately – without converging locally – poses a major challenge.
Fig. 9.
Variation of σ* for (100/100, 200)-ES for N = 100, A = 1, and α = 2π. From left to right σ* = {5, 18.3, 25}, with , and success rate PS = {0, 0.45, 0.97}. The experiment with σ* = 30 (PS = 1) was already shown in Fig. 8. Globally converging trials are shown in gray, and non-converging runs in light-orange. The median is taken over the globally converging runs, except for the left plot where none exist, in which the median over all unsuccessful runs is taken.
In the central plot of Fig. 9 roughly half of the runs are globally converging at increased . In this case the deterministic iteration follows a single converging path, as no fluctuations are modeled. The residual distance of the locally converging runs is reduced compared to ES-runs with σ* = 5. Note that the convergence speed is faster (steeper negative slope) for the globally converging runs compared to σ* = 30 of Fig. 8 due to sphere-optimal . However, this comes with the disadvantage of a lower PS, as more trials are converging locally. The right plot with σ* = 25 is similar to σ* = 30 of Fig. 8, but with several non-converging runs. Again, the ES convergence speed is faster, if σ* is chosen closer to , but shows a slightly reduced PS -value. The overall prediction quality of the iterative mapping (160) is good and the results affirm the expectation, that relatively large mutations are favorable to maximize PS on the Rastrigin function.
To confirm the expectation that the approximation quality increases further for larger μ and N, experiments are shown in Fig. 10. The first thing to notice is that positional fluctuations of the ES trials decrease further, such that nearly all runs show similar R-dynamics. This is related to the intermediate recombination, see Eq. (34), as position y(g+1) is obtained by averaging over a large number of individuals. One can see good agreement, but for the left plot there is still some room for improvement. This is related to truncation ratio ϑ = 1/4, such that the Taylor expansion point in Eq. (70) via function g(xi) is shifted by Φ−1(ϑ). For ϑ = 1/2 and even larger N and μ (right plot), very good agreement is observed.
Fig. 10.
The left plot shows (1000/1000, 4000)-ES with σ* = 110 for N = 1000, A = 1, and α = 2π. The right plot shows (10000/10000, 20000)-ES with σ* = 400 for N = 10000 (same α and A), evaluated for 50 trials due to CPU resource restrictions.
6. Conclusion and outlook
In this paper the full first and second order progress rate analysis of the (μ/μI, λ)-ES has been presented. In order to obtain closed-form expressions for φi and the second order progress rate, it was necessary to assume large dimensionality and large populations. While the latter does not present a serious issue because large populations are needed to ensure global convergence, it was the key prerequisite to solve and simplify the expected value integrals. As the experiments have shown, the approximation quality of the progress rate expressions is rather good even for N as small as 20 and comparably small populations of μ = 10. For larger N and μ the approximation quality improves further, as expected. The first order progress rate result is able to model the local attraction effects on the Rastrigin function. This is a very important step, as all subsequent investigations in this paper are based on the φi-results. The second order progress rate derivation was needed to obtain additional loss terms that complete the progress model, which matter especially for larger mutation strengths and close to the global optimizer.
Using the progress rate expressions, the dynamics of the evolution process have been investigated. There is a good agreement between the iterations and real ES-runs using median aggregation of the residual distance R to the global optimizer. As has been shown, depending on the choice of the normalized mutation strength, one can model global as well as local convergence behavior. Additionally, one observes a trade-off between efficiency and success rate, as relatively large mutations have to be chosen to maximize the success probability.
The conducted experiments assume scale-invariance, i.e., the mutation strength is controlled by the residual distance R. This is in contrast to the fully self-adaptive ES, where σ evolves during the ES run either by mutative self-adaptation (SA), cumulative step-size adaptation (CSA), or Meta-ES. The incorporation of the self-adaptation process will be the next step toward completing the analysis of the (μ/μI, λ)-ES on Rastrigin. To this end, the self-adaptation response (SAR) function must be derived. Combining the N progress rates with the SAR function yields N + 1 evolution equations. In order to get manageable expressions that allow for analytic population sizing and expected runtime investigations, additional aggregation is needed. One possible approach would be the aggregation of the individual parental yi components into the parental distance R, modeling the expected progress as a function of the residual distance. This would reduce the number of evolution equations to two and make further analytic treatment more accessible. A first step in this direction has been taken in [19].
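To make the scale-invariance assumption concrete, the following is a minimal sketch of a (μ/μI, λ)-ES whose mutation strength is set from the residual distance in every generation. It assumes the usual normalization σ* = σN/R and the common Rastrigin form f(y) = Σi [yi² + A(1 − cos(α yi))]; the function names, the initial distance `R0`, and the seed handling are hypothetical choices for illustration and are not taken from the experiments reported here.

```python
import numpy as np


def rastrigin(y, A=1.0, alpha=2.0 * np.pi):
    """Rastrigin fitness summed over the last axis (assumed form, minimizer y = 0)."""
    return np.sum(y**2 + A * (1.0 - np.cos(alpha * y)), axis=-1)


def run_scale_invariant_es(mu, lam, N, sigma_norm, generations,
                           A=1.0, alpha=2.0 * np.pi, R0=10.0, seed=0):
    """Return the residual distances R^(g) of one scale-invariant ES run.

    sigma_norm plays the role of sigma*; R0 is a hypothetical start distance.
    """
    rng = np.random.default_rng(seed)
    # Random start point at distance R0 from the global optimizer
    y = rng.standard_normal(N)
    y *= R0 / np.linalg.norm(y)
    R_hist = [np.linalg.norm(y)]
    for _ in range(generations):
        R = np.linalg.norm(y)
        # Scale-invariant mutation strength: sigma = sigma* * R / N
        sigma = sigma_norm * R / N
        # Isotropic Gaussian mutations of the recombinant
        offspring = y + sigma * rng.standard_normal((lam, N))
        fitness = rastrigin(offspring, A, alpha)
        # Truncation selection of the mu best (minimization)
        best = np.argsort(fitness)[:mu]
        # Intermediate recombination: centroid of the selected offspring
        y = offspring[best].mean(axis=0)
        R_hist.append(np.linalg.norm(y))
    return np.array(R_hist)


if __name__ == "__main__":
    R = run_scale_invariant_es(mu=100, lam=200, N=100, sigma_norm=30.0,
                               generations=200)
    print(R[0], R[-1])
```

Because σ is computed from R, which requires knowing the location of the optimizer, this scheme is an analysis tool rather than a practical step-size rule. Replacing the line that sets `sigma` by an adaptation mechanism such as SA or CSA is the extension described above.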
Finally, the presented approach to model the ES-dynamics is based on mean value considerations. That is, fluctuations are not considered so far. Whether the approach presented can be extended to allow for the calculation of the global attractor convergence probability as a function of strategy and fitness parameters remains an open question.
Acknowledgements
This work was supported by the Austrian Science Fund (FWF) under grant P33702-N.
Footnotes
Actually, using the result from (80) one could even calculate a closed-form second-order approximation for (69). However, the resulting formula would be rather complex.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability
No data was used for the research described in the article.
References
- [1] Abramowitz M, Stegun IA. Pocketbook of Mathematical Functions. Verlag Harri Deutsch; Thun: 1984.
- [2] Agapie A, Solomon O, Bădin L. Theory of (1+1) ES on SPHERE revisited. IEEE Trans Evol Comput. 2022:938–948.
- [3] Agapie A, Solomon O, Giuclea M. Theory of (1 + 1) ES on the RIDGE. IEEE Trans Evol Comput. 2022;26(3):501–511.
- [4] Arnold DV. Noisy Optimization with Evolution Strategies. Kluwer Academic Publishers; Dordrecht: 2002.
- [5] Beyer H-G. The Theory of Evolution Strategies. Springer; Heidelberg: 2001. (Natural Computing Series).
- [6] Beyer H-G, Arnold DV, Meyer-Nieberg S. A new approach for predicting the final outcome of evolution strategy optimization under noise. Genet Program Evol Mach. 2005;6(1):7–24.
- [7] Beyer H-G, Melkozerov A. The dynamics of self-adaptive multi-recombinant evolution strategies on the general ellipsoid model. IEEE Trans Evol Comput. 2014;18(5):764–778. doi: 10.1109/TEVC.2013.2283968.
- [8] Beyer H-G, Schwefel H-P. Evolution strategies: a comprehensive introduction. Nat Comput. 2002;1(1):3–52.
- [9] Beyer H-G, Sendhoff B. Toward a steady-state analysis of an evolution strategy on a robust optimization problem with noise-induced multi-modality. IEEE Trans Evol Comput. 2017;21(4):629–643. doi: 10.1109/TEVC.2017.2668068.
- [10] Billingsley P. Probability and Measure. Wiley; 1995. (Wiley Series in Probability and Statistics).
- [11] Hansen N, Kern S. In: Yao X, et al., editors. Evaluating the CMA evolution strategy on multimodal test functions; Parallel Problem Solving from Nature 8; Berlin. 2004. pp. 282–291.
- [12] Hellwig M, Beyer H-G. On the steady state analysis of covariance matrix self-adaptation evolution strategies on the noisy ellipsoid model. Theor Comput Sci. 2018. doi: 10.1016/j.tcs.2018.05.016.
- [13] Melkozerov A, Beyer H-G. In: Branke J, et al., editors. On the analysis of self-adaptive evolution strategies on elliptic model: first results; GECCO'10: Proceedings of the Genetic and Evolutionary Computation Conference; New York. 2010. pp. 369–376.
- [14] Meyer-Nieberg S. Self-Adaptation in Evolution Strategies. PhD thesis, University of Dortmund, CS Department; Dortmund, Germany: 2007.
- [15] Müller N, Glasmachers T. Non-local optimization: imposing structure on optimization problems by relaxation; Foundations of Genetic Algorithms; 2021. pp. 1–10.
- [16] Omeradzic A, Beyer H-G. Progress Analysis of a Multi-Recombinative Evolution Strategy on the Highly Multimodal Rastrigin Function, Report. Vorarlberg University of Applied Sciences; 2022. https://opus.fhv.at/frontdoor/index/index/docId/4722
- [17] Omeradzic A, Beyer H-G. In: Rudolph G, Kononova AV, Aguirre H, Kerschke P, Ochoa G, Tušar T, editors. Progress rate analysis of evolution strategies on the Rastrigin function: first results; Parallel Problem Solving from Nature – PPSN XVII; 2022. pp. 499–511.
- [18] Omeradzic A, Beyer H-G. Rastrigin Function: Quality Gain and Progress Rate for (µ/µI, λ)-ES, Report. Vorarlberg University of Applied Sciences; 2023. https://opus.fhv.at/frontdoor/index/index/docId/5151
- [19] Omeradzic A, Beyer H-G. Convergence properties of the (µ/µI, λ)-ES on the Rastrigin function; Proceedings of the 17th ACM/SIGEVO Conference on Foundations of Genetic Algorithms, FOGA '23; New York, NY, USA. 2023. pp. 117–128.










