Published in final edited form as: Parallel Probl Solving Nat. 2022 Aug 15;13399:499–511. doi: 10.1007/978-3-031-14721-0_35

Progress Rate Analysis of Evolution Strategies on the Rastrigin Function: First Results

Amir Omeradzic 1, Hans-Georg Beyer 1
PMCID: PMC7615766  EMSID: EMS189095  PMID: 38532780

Abstract

A first order progress rate is derived for the intermediate multi-recombinative Evolution Strategy (μ/μI, λ)-ES on the highly multimodal Rastrigin test function. The progress is derived within a linearized model applying the method of so-called noisy order statistics. To this end, the mutation-induced variance of the Rastrigin function is determined. The obtained progress approximation is compared to simulations and shows strengths and limitations depending on the mutation strength and the distance to the optimizer. Furthermore, the progress is iterated using the dynamical systems approach and compared to averaged optimization runs. The property of global convergence within the given approximation is discussed. As an outlook, the need for an improved first order progress rate as well as for an extension to higher order progress including positional fluctuations is explained.

Keywords: Evolution Strategies, Rastrigin function, Progress rate analysis, Global optimization

1. Introduction

Evolution Strategies (ES) [12,13] are well-recognized Evolutionary Algorithms suited for real-valued non-linear optimization. State-of-the-art ES such as the CMA-ES [8] or its simplification [5] are also well-suited for locating global optimizers in highly multimodal fitness landscapes. While the CMA-ES was originally intended mainly for non-differentiable optimization problems and regarded as a locally acting strategy, it was already observed in [7] that using a large population size can make the ES a strategy that is able to locate the global optimizer among a huge number of local optima. This is a surprising observation when considering the ES as a strategy that acts mainly locally in the search space, following some kind of gradient or natural gradient [3,6,11]. As one can easily check using standard (highly) multimodal test functions such as Rastrigin, Ackley, and Griewank, to name a few, this ES property is not intimately related to the covariance matrix adaptation (CMA) ES, which generates non-isotropic correlated mutations, but can also be found in the (μ/μI, λ)-ES with isotropic mutations. Therefore, if one wants to understand the underlying working principles by which the ES locates the global optimizer, the analysis of the (μ/μI, λ)-ES should be the starting point.

The question why and when optimization algorithms originally designed for local search are able to locate global optima has gained attention in the last few years. A recurring idea comes from relaxation procedures that transform the original multimodal optimization problem into a convex one, known as Gaussian continuation [9]. Gaussian continuation is nothing other than a convolution of the original optimization problem with a Gaussian kernel. As has been shown in [10], using the right Gaussian, Rastrigin-like functions can be transformed into convex optimization problems, thus making them accessible to gradient following strategies. However, this raises the question of how to perform the convolution efficiently. One road, followed in [14], uses high-order Gauss-Hermite integration in conjunction with a gradient descent strategy, yielding surprisingly good results. The other road coming to mind is approximating the convolution by Gaussian sampling. This resembles the procedure ES follow: starting from a parental state, offspring are generated by Gaussian mutations. The problem is, however, that in order to obtain a reliable gradient from the convolution, a huge number of samples, i.e. offspring in the ES, must be generated. This number seems much larger than the offspring population size needed in the ES experiments conducted in [7], which showed an approximately linear relation between problem dimension N and population size for the Rastrigin function. Therefore, understanding the ES performance from the viewpoint of Gaussian relaxation does not seem to help much.

The approach followed in this paper will incorporate two main concepts, namely a progress rate analysis as well as its application within the so-called evolution equations modeling the transition dynamics of the ES [2]. The progress rate measure yields the expected positional change in search space between two generations depending on location, strategy and test function parameters. Aiming to investigate and understand the dynamics of globally converging ES runs, the progress rate is an essential quantity to model the expected evolution dynamics over many generations.

This paper provides first results of a scientific program that aims at an analysis of the performance of the (μ/μI, λ)-ES on Rastrigin’s test function based on a first order progress rate. After a short introduction of the (μ/μI, λ)-ES, the N-dimensional first order progress will be defined and an approximation will be derived, resulting in a closed form expression. Its predictive power and its limitations will be checked by one-generation experiments. The progress rate will then be used to simulate the ES dynamics on Rastrigin using difference equations. This simulation will be compared with real runs of the (μ/μI, λ)-ES. In the concluding section a summary of the results and an outlook on future research will be given.

2. Rastrigin Function and Local Quality Change

The real-valued minimization problem is defined for an N-dimensional search vector y = (y1, …, yN) on the Rastrigin test function f given by

f(\mathbf{y}) = \sum_{i=1}^{N} f_i(y_i) = \sum_{i=1}^{N} \left( y_i^2 + A - A \cos(\alpha y_i) \right), (1)

with A denoting the oscillation amplitude and α = 2π the corresponding frequency. The quadratic term with superimposed oscillations yields a finite number of local minima M for each dimension i, such that the overall number of minima scales exponentially as M^N, posing a highly multimodal minimization problem. The global optimizer is located at ŷ = 0.

For the progress rate analysis in Sect. 4 the local quality function Qy(x) at y due to mutation vector x = (x1, …, xN) is needed. In order to reuse results from the noisy progress rate theory, it will be formulated for the maximization case of F(y) = −f(y) with Fi(yi) = −fi(yi), such that the local quality change reads

Q_{\mathbf{y}}(\mathbf{x}) = F(\mathbf{y} + \mathbf{x}) - F(\mathbf{y}) = f(\mathbf{y}) - f(\mathbf{y} + \mathbf{x}). (2)

Qy(x) can be evaluated for each component i independently giving

Q_{\mathbf{y}}(\mathbf{x}) = \sum_{i=1}^{N} Q_i(x_i) = \sum_{i=1}^{N} \left[ f_i(y_i) - f_i(y_i + x_i) \right] (3)
= -\sum_{i=1}^{N} \left( x_i^2 + 2 y_i x_i + A \cos(\alpha y_i) \left( 1 - \cos(\alpha x_i) \right) + A \sin(\alpha y_i) \sin(\alpha x_i) \right). (4)
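
As a quick numerical check of (1)-(4), the following Python sketch (not part of the original study's code; all names and parameter values are illustrative) evaluates the Rastrigin function, the quality change of (2), and its component-wise form (4), confirming that both expressions coincide.

```python
import numpy as np

A, alpha = 10.0, 2.0 * np.pi   # oscillation amplitude and frequency, cf. Eq. (1)

def f(y):
    """Rastrigin function, Eq. (1)."""
    return np.sum(y**2 + A - A * np.cos(alpha * y))

def Q(y, x):
    """Local quality change Q_y(x) = f(y) - f(y + x), Eq. (2)."""
    return f(y) - f(y + x)

def Q_componentwise(y, x):
    """Component-wise form of the quality change, Eq. (4)."""
    return -np.sum(x**2 + 2.0 * y * x
                   + A * np.cos(alpha * y) * (1.0 - np.cos(alpha * x))
                   + A * np.sin(alpha * y) * np.sin(alpha * x))

rng = np.random.default_rng(0)
y = rng.normal(size=10)
x = 0.3 * rng.normal(size=10)
print(Q(y, x), Q_componentwise(y, x))   # both values agree up to rounding
```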

A closed form solution of the progress rate appears to be obtainable only for a linearized expression of Qi(xi). A first approach taken in this paper is based on a Taylor expansion in the mutation xi, discarding higher order terms

Q_i(x_i) = F_i(y_i + x_i) - F_i(y_i) = \frac{\partial F_i}{\partial y_i} x_i + O(x_i^2) (5)
\simeq \left( -2 y_i - \alpha A \sin(\alpha y_i) \right) x_i =: f_i' x_i, (6)

using the following derivative terms

k_i = -2 y_i \quad \text{and} \quad d_i = -\alpha A \sin(\alpha y_i), \quad \text{such that} \quad -\frac{\partial f_i}{\partial y_i} = f_i' = k_i + d_i. (7)

A second approach is to consider only the linear term of Eq. (4) and neglect all non-linear terms denoted by δ(xi) according to

Q_i(x_i) = -2 y_i x_i - x_i^2 - A \cos(\alpha y_i) \left( 1 - \cos(\alpha x_i) \right) - A \sin(\alpha y_i) \sin(\alpha x_i) (8)
= -2 y_i x_i + \delta(x_i) \approx -2 y_i x_i = k_i x_i. (9)

The linearization using fi′ is a local approximation of the function incorporating the oscillation parameters A and α. Using only ki (setting di = 0) discards the oscillations by approximating the quadratic term via ki = ∂(−yi^2)/∂yi = −2yi, with the negative sign due to maximization. Both approximations will be evaluated later.

3. The (μ/μI, λ)-ES with Normalized Mutations

The Evolution Strategy under investigation consists of a population of μ parents and λ offspring (μ < λ) per generation g. Algorithm 1 is presented below and offspring variables are denoted with overset “∼”.

Population variation is achieved by applying an isotropic normally distributed mutation x ∼ σ𝒩(0, 1) with strength σ to the parental recombinant in Lines 6 and 7. The recombinant is obtained by intermediate recombination of all μ parents, equally weighted, in Line 11. Selection of the m = 1, …, μ best search vectors ỹm;λ (out of λ) according to their fitness is performed in Line 10.

Note that the ES in Algorithm 1 operates under constant normalized mutation σ* in Lines 3 and 12 using the spherical normalization

\sigma^{*} = \frac{\sigma^{(g)} N}{\lVert \mathbf{y}^{(g)} \rVert} = \frac{\sigma^{(g)} N}{R^{(g)}}. (10)

This property ensures global convergence of the algorithm as the mutation strength σ(g) decreases if and only if the residual distance ║y(g)║ = R(g) decreases. While σ* is not known during black-box optimizations, it is used here to investigate the dynamical behavior of the ES using the first order progress rate approach to be developed in this paper. Incorporating self-adaptation of σ or cumulative step-size adaptation remains for future research.

Algorithm 1. (μ/μI, λ)-ES with constant σ*.

1: g ← 0
2: y(0) ← y(init)
3: σ(0) ← σ* ║y(0)║ / N
4: repeat
5:     for l = 1, …, λ do
6:         x̃l ← σ(g) 𝒩l(0, 1)
7:         ỹl ← y(g) + x̃l
8:         f̃l ← f(ỹl)
9:     end for
10:    (ỹ1;λ, …, ỹμ;λ) ← sort(ỹ w.r.t. ascending f̃)
11:    y(g+1) ← (1/μ) Σ_{m=1}^{μ} ỹm;λ
12:    σ(g+1) ← σ* ║y(g+1)║ / N
13:    g ← g + 1
14: until termination criterion
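
For concreteness, a minimal NumPy sketch of Algorithm 1 follows. The function and parameter names as well as the stopping test on the residual distance are illustrative assumptions; the paper itself leaves the termination criterion unspecified.

```python
import numpy as np

def mu_mu_lambda_es(f, y_init, mu, lam, sigma_star, g_max=2000, r_stop=1e-3, seed=0):
    """(mu/mu_I, lambda)-ES with constant normalized mutation strength, cf. Algorithm 1."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y_init, dtype=float)
    N = y.size
    sigma = sigma_star * np.linalg.norm(y) / N            # Line 3
    for g in range(g_max):
        x = sigma * rng.standard_normal((lam, N))         # Line 6: isotropic mutations
        offspring = y + x                                 # Line 7
        fitness = np.array([f(o) for o in offspring])     # Line 8
        best = np.argsort(fitness)[:mu]                   # Line 10: mu best (ascending f)
        y = offspring[best].mean(axis=0)                  # Line 11: intermediate recombination
        sigma = sigma_star * np.linalg.norm(y) / N        # Line 12
        if np.linalg.norm(y) < r_stop:                    # assumed termination criterion
            break
    return y, g

# Example usage on the Rastrigin function (1) with A = 1, N = 100, R(0) = 10:
A, alpha = 1.0, 2.0 * np.pi
rastrigin = lambda y: np.sum(y**2 + A - A * np.cos(alpha * y))
N = 100
y0 = np.full(N, 10.0 / np.sqrt(N))
y_end, gens = mu_mu_lambda_es(rastrigin, y0, mu=150, lam=300, sigma_star=30.0)
print(gens, np.linalg.norm(y_end))
```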

4. Progress Rate

4.1. Definition

Having introduced the Evolution Strategy, we are interested in the expected one-generation progress of the optimization on the Rastrigin function (1) before investigating the dynamics over multiple generations.

A first order progress rate φi for the i-th component between two generations g → g + 1 can be defined as the expectation value over the positional difference of the parental components

\varphi_i = \mathrm{E}\left[ y_i^{(g)} - y_i^{(g+1)} \,\middle|\, \sigma^{(g)}, \mathbf{y}^{(g)} \right] = y_i^{(g)} - \mathrm{E}\left[ y_i^{(g+1)} \,\middle|\, \sigma^{(g)}, \mathbf{y}^{(g)} \right], (11)

given the mutation strength σ(g) and the position y(g). First, an expression for y(g+1) is needed, see Algorithm 1, Line 11. It is the result of mutation, selection and recombination of the m = 1, …, μ offspring vectors yielding the highest fitness, such that y(g+1) = (1/μ) Σ_{m=1}^{μ} ỹm;λ = (1/μ) Σ_{m=1}^{μ} (y(g) + x)m;λ. Considering the i-th component, noting that y(g) is the same for all offspring and denoting the i-th component of the m-th selected mutation vector by xm;λ, one has

y_i^{(g+1)} = \frac{1}{\mu} \sum_{m=1}^{\mu} \left( y_i^{(g)} + x_{m;\lambda} \right) = y_i^{(g)} + \frac{1}{\mu} \sum_{m=1}^{\mu} x_{m;\lambda}. (12)

Taking the expectation E[yi(g+1)], setting x = σz with z ∼ 𝒩(0, 1), and inserting the expression back into (11) yields

\varphi_i = -\frac{1}{\mu} \, \mathrm{E}\left[ \sum_{m=1}^{\mu} x_{m;\lambda} \,\middle|\, \sigma^{(g)}, \mathbf{y}^{(g)} \right] = -\frac{\sigma}{\mu} \, \mathrm{E}\left[ \sum_{m=1}^{\mu} z_{m;\lambda} \,\middle|\, \sigma^{(g)}, \mathbf{y}^{(g)} \right]. (13)

Therefore progress can be evaluated by averaging over the expectations of μ selected mutation contributions. In principle this task can be solved by deriving the induced order statistic density pm;λ for the m-th best individual and subsequently solving the integration over the i-th component

\varphi_i = -\frac{1}{\mu} \sum_{m=1}^{\mu} \int_{-\infty}^{\infty} x_i \, p_{m;\lambda}\!\left( x_i \,\middle|\, \sigma^{(g)}, \mathbf{y}^{(g)} \right) \mathrm{d}x_i. (14)

However, the task of computing expectations of sums of order statistics under noise disturbance has already been discussed and solved by Arnold in [1]. Therefore the problem of Eq. (13) will be reformulated in order to apply the solutions provided by Arnold.
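
Definition (11) can also be estimated directly by Monte Carlo simulation of a single generation, which is how the one-generation experiments of Sect. 4.3 are obtained. A sketch of such an estimator (an illustration under the Rastrigin setup of Sect. 2; names and default values are assumptions) could look as follows.

```python
import numpy as np

def progress_rate_mc(y, sigma, mu, lam, i, trials=10**4, A=10.0, alpha=2.0*np.pi, seed=0):
    """Monte Carlo estimate of the first order progress rate phi_i, Eq. (11)."""
    rng = np.random.default_rng(seed)
    f = lambda Y: np.sum(Y**2 + A - A * np.cos(alpha * Y), axis=-1)
    phi = 0.0
    for _ in range(trials):
        x = sigma * rng.standard_normal((lam, y.size))   # lambda offspring mutations
        offspring = y + x
        best = np.argsort(f(offspring))[:mu]             # select the mu best (minimization)
        y_next_i = offspring[best, i].mean()             # i-th component of the recombinant
        phi += y[i] - y_next_i                           # one realization of Eq. (11)
    return phi / trials

# Example: y chosen randomly on the sphere surface R = 10, N = 100 (as in Fig. 1)
rng = np.random.default_rng(1)
y = rng.standard_normal(100)
y *= 10.0 / np.linalg.norm(y)
print(progress_rate_mc(y, sigma=0.5, mu=150, lam=300, i=1))
```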

4.2. Expectations of Sums of Noisy Order Statistics

Let z be a random variate with density pz(z) and zero mean. The density is expanded into a Gram-Charlier series by means of its cumulants κi (i ≥ 1) according to [1, p. 138, D.15]

p_z(z) = \frac{1}{\sqrt{2\pi\kappa_2}} \, \mathrm{e}^{-\frac{z^2}{2\kappa_2}} \left( 1 + \frac{\gamma_1}{6} \mathrm{He}_3\!\left( \frac{z}{\sqrt{\kappa_2}} \right) + \frac{\gamma_2}{24} \mathrm{He}_4\!\left( \frac{z}{\sqrt{\kappa_2}} \right) + \cdots \right), (15)

with expectation κ1 = 0, variance κ2, skewness γ1 = κ3/κ2^(3/2), excess γ2 = κ4/κ2^2 (higher order terms not shown) and He_k denoting the k-th order probabilist’s Hermite polynomial. For the problem at hand, see Eq. (13), the mutation variate z ∼ 𝒩(0, 1) with κ2 = 1 and κi = 0 for i ≠ 2, yielding a standard normal density.

Furthermore, let ϵ ∼ 𝒩(0, σϵ^2) model an additive noise disturbance, such that the resulting observed values are v = z + ϵ. Selection of the m-th largest out of λ values yields

v_{m;\lambda} = \left( z + \mathcal{N}(0, \sigma_\epsilon^2) \right)_{m;\lambda}, (16)

and the distribution of selected source terms zm;λ follows a noisy order statistic with density pm;λ. Given this definition and a linear relation between zm;λ and vm;λ the method of Arnold is applicable.

In our case the i-th mutation component xm;λ of Eq. (13) is related to selection via the quality change defined in Eq. (3). Maximizing the fitness Fi(yi + xi) conforms to maximizing quality Qi(xi) with Fi(yi) being a constant offset.

Aiming at an expression of the form (16) and starting with (3), we first isolate the component Qi from the remaining N − 1 components, denoted by Σ_{j≠i} Qj. Then, approximations are applied to both terms, yielding

Q_{\mathbf{y}}(\mathbf{x}) = Q_i(x_i) + \sum_{j \neq i} Q_j(x_j) (17)
\simeq f_i' x_i + \mathcal{N}(E_i, D_i^2), (18)

with linearization (6) applied to Qi(xi). Additionally, Σ_{j≠i} Qj ∼ 𝒩(Ei, Di^2), as the sum of independent random variables asymptotically approaches a normal distribution in the limit N → ∞ due to the Central Limit Theorem. This is ensured by Lyapunov’s condition, provided that there are no dominating components within the sum due to largely different values of yj. The corresponding Rastrigin quality variance Di^2 = Var[Σ_{j≠i} Qj(xj)] is calculated in the supplementary material (https://github.com/omam-evo/paper/blob/main/ppsn22/PPSN22_OB22.pdf). As the expectation Ei = E[Σ_{j≠i} Qj(xj)] is only an offset to Qy(x), it has no influence on the selection and its calculation can be dropped.

Using xi = σzi and fi′ = sgn(fi′)|fi′|, expression (18) is reformulated as

Q_{\mathbf{y}}(\mathbf{x}) = \mathrm{sgn}(f_i') \, |f_i'| \, \sigma z_i + E_i + \mathcal{N}(0, D_i^2) (19)
\frac{Q_{\mathbf{y}}(\mathbf{x}) - E_i}{|f_i'| \, \sigma} = \mathrm{sgn}(f_i') \, z_i + \mathcal{N}\!\left( 0, \frac{D_i^2}{(f_i' \sigma)^2} \right). (20)

The decomposition using sign function and absolute value is needed for correct ordering of selected values w.r.t. zi in (20).

Given result (20), one can define the linearly transformed quality measure vi := (Qy(x) − Ei)/(|fi′|σ) and the noise variance σϵ^2 := (Di/(fi′σ))^2, such that the selection of the mutation component sgn(fi′)zi is disturbed by a noise term due to the remaining N − 1 components. A relation of the form (16) is obtained up to the sign function.

In [1] Arnold calculated the expected value of arbitrary sums SP of products of noisy ordered variates containing ν factors per summand

S_P = \sum_{\{n_1, \ldots, n_\nu\}} z_{n_1;\lambda}^{p_1} \cdots z_{n_\nu;\lambda}^{p_\nu}, (21)

with the random variate z introduced in Eqs. (15) and (16). The vector P = (p1, …, pν) denotes the positive exponents and the distinct summation indices are denoted by the set {n1, …, nν}. The generic result for the expectation of (21) is provided in [1, p. 142, D.28] and was adapted to account for the sign difference between (16) and (20), which results in a possibly exchanged ordering. Performing simple substitutions in Arnold’s calculations in [1] and recalling that in our case γ1 = γ2 = 0, the expected value yields

\mathrm{E}[S_P] = \left( \mathrm{sgn}(f_i') \sqrt{\kappa_2} \right)^{\lVert \mathbf{P} \rVert_1} \frac{\mu!}{(\mu - \nu)!} \sum_{n=0}^{\nu} \sum_{k \geq 0} \zeta_{n,0}^{(\mathbf{P})}(k) \, h_{\mu,\lambda}^{\nu - n, k}. (22)

Note that expression (22) deviates from Arnold’s formula only in the sign in front of κ2. The coefficients ζn,0(P)(k) are defined in terms of a noise coefficient a according to

a = \sqrt{\frac{\kappa_2}{\kappa_2 + \sigma_\epsilon^2}} \quad \text{with} \quad \zeta_{n,0}^{(\mathbf{P})}(k) = \mathrm{Polynomial}(a), (23)

for which tabulated results are presented in [1, p. 141]. The coefficients hμ,λi,k are numerically obtainable solving

hμ,λi,k=λμ2π(λμ)Hek(x)e12x2[ϕ(x)]i[Φ(x)]λμ1[1Φ(x)]μidx. (24)

Now we are in a position to calculate expectation (13) using (22). Since z ∼ 𝒩(0, 1), it holds that κ2 = 1. Identifying P = (1), ║P║1 = 1 and ν = 1 yields

\mathrm{E}\left[ \sum_{m=1}^{\mu} z_{m;\lambda} \right] = \mathrm{sgn}(f_i') \, \frac{\mu!}{(\mu - 1)!} \sum_{n=0}^{1} \sum_{k \geq 0} \zeta_{n,0}^{(1)}(k) \, h_{\mu,\lambda}^{1-n,k} = \mathrm{sgn}(f_i') \, \mu \, \zeta_{0,0}^{(1)}(0) \, h_{\mu,\lambda}^{1,0} = \mathrm{sgn}(f_i') \, \mu \, a \, c_{\mu/\mu,\lambda}, (25)

with ζ1,0^(1)(k) = 0 for any k, and ζ0,0^(1)(k) ≠ 0 only for k = 0, yielding a. The expression hμ,λ^(1,0) is equivalent to the progress coefficient definition cμ/μ,λ [2, p. 216]. Inserting (25) back into (13), using a = 1/√(1 + (Di/(fi′σ))^2) = |fi′|σ/√((fi′σ)^2 + Di^2) with the requirement a > 0, and noting that fi′ = sgn(fi′)|fi′|, one finally obtains the first order progress rate for the i-th component

\varphi_i(\sigma, \mathbf{y}) = -\, c_{\mu/\mu,\lambda} \, \frac{f_i'(y_i) \, \sigma^2}{\sqrt{ \left( f_i'(y_i) \, \sigma \right)^2 + D_i^2\!\left( \sigma, (\mathbf{y})_{j \neq i} \right) }}. (26)

The population dependency is given by the progress coefficient cμ/μ,λ. The fitness-dependent parameters are contained in fi′, see (7), and in Di^2, calculated in the supplementary material (https://github.com/omam-evo/paper/blob/main/ppsn22/PPSN22_OB22.pdf). For better readability, the derivative fi′ and the variance Di^2 are not inserted into (26). An exemplary evaluation of Di^2 as a function of the residual distance R using normalization (10) is also shown in the supplementary material.
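
A numerical sketch of evaluating (26) is given below. Since the closed-form expression for Di^2 is only derived in the supplementary material, the variance of Σ_{j≠i} Qj(xj) is estimated here by sampling Eq. (4), and cμ/μ,λ is estimated as the expected average of the μ largest of λ standard normal variates; both are stand-ins for the exact expressions, and all names and defaults are illustrative assumptions.

```python
import numpy as np

A, alpha = 10.0, 2.0 * np.pi

def c_mu_lambda(mu, lam, samples=10**4, seed=0):
    """Monte Carlo estimate of the progress coefficient c_{mu/mu,lambda}."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((samples, lam))
    return np.sort(z, axis=1)[:, -mu:].mean()            # mean of the mu largest of lambda

def f_prime(y_i):
    """Derivative term f_i' = k_i + d_i, Eq. (7)."""
    return -2.0 * y_i - alpha * A * np.sin(alpha * y_i)

def D2_sampled(y, sigma, i, samples=10**4, seed=1):
    """Sampled variance of sum_{j != i} Q_j(x_j); stand-in for the closed-form Di^2."""
    rng = np.random.default_rng(seed)
    yj = np.delete(y, i)
    x = sigma * rng.standard_normal((samples, yj.size))
    Qj = -(x**2 + 2.0 * yj * x
           + A * np.cos(alpha * yj) * (1.0 - np.cos(alpha * x))
           + A * np.sin(alpha * yj) * np.sin(alpha * x))  # per-component quality, Eq. (4)
    return Qj.sum(axis=1).var()

def phi_i(y, sigma, i, mu, lam):
    """First order progress rate approximation, Eq. (26)."""
    fp = f_prime(y[i])
    return -c_mu_lambda(mu, lam) * fp * sigma**2 / np.sqrt((fp * sigma)**2 + D2_sampled(y, sigma, i))

# Example: the setting of Fig. 1 (N = 100, R = 10, A = 10)
rng = np.random.default_rng(2)
y = rng.standard_normal(100)
y *= 10.0 / np.linalg.norm(y)
print(phi_i(y, sigma=0.5, i=1, mu=150, lam=300))
```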

4.3. Comparison of Simulation and Approximation

Figure 1 shows an experimentally obtained progress rate compared to the result of (26). Due to the large N, one exemplary φi-graph is shown on the left, and the corresponding errors for i = 1, …, N are shown on the right.

Fig. 1. One-generation experiments with the (150/150, 300)-ES, N = 100, A = 10 are performed and quantity (11) is measured by averaging over 10^5 runs. Left: φi over σ for i = 2 at position y2 ≈ 1.19, where y was chosen randomly such that ║y║ = R = 10. Right: error measure φi − φi,sim between (26) and simulation for i = 1, …, N, evaluated at σ = {0.1, 1}. The colors are set according to the legend. (Color figure online)

The left plot shows the progress rate over the σ-range [0, 1]. This magnitude was chosen in order to study the oscillation, since the frequency is α = 2π. The initial position was chosen randomly on the sphere surface with R = 10.

The red dashed curve uses fi′ as the linearization, while the blue dash-dotted curve assumes fi′ = ki (with di = 0), see also (7). As fi′ approximates the quality change only locally, agreement with the measured progress is given only for very small mutations σ. For larger σ very large deviations may occur, depending on the local derivative.

The blue curve φi(ki) neglects the oscillation (di = 0) and therefore follows the progress of the quadratic function f(y) = Σi yi^2 for large σ with very good agreement. Due to the linearized form of Qi(xi) in (6), neither approximation can reproduce the oscillation for moderately large σ.

To verify the approximation quality, the error between (26) and the simulation is displayed on the right side of Fig. 1 for all i = 1, …, N. This was done for small σ = 0.1 and large σ = 1. The deviations are very similar in magnitude for all i, given randomly chosen yi. Note that for σ = 1 the red points show very large errors compared to the blue ones, which was expected.

Figure 2 shows the progress rate φi over σ*, for i = 2 as in Fig. 1, with y chosen randomly on sphere surfaces of radii R = {100, 10, 1, 0.1}. Using σ*, the mutation strength σ is normalized by the residual distance R with the spherical normalization (10). Far from the origin, with R = {100, 10}, the quadratic terms dominate, giving better results for φi(ki). Reaching R = 1, local minima become more relevant and mixed results are obtained, with φi(fi′) better for smaller σ* and φi(ki) better for larger σ*. Within the global attractor, R = 0.1, the local structure dominates and φi(fi′) yields better results. These observations will be relevant when analyzing the dynamics in Fig. 3, where both approximations show strengths and weaknesses.

Fig. 2. One-generation progress φi (i = 2) over normalized mutation σ* for the (150/150, 300)-ES, N = 100, A = 1 and R = {100, 10, 1, 0.1}. Simulations are averaged over 10^5 runs. These experiments are preliminary investigations related to the dynamics shown in Fig. 3 with σ* = 30. Given a constant σ*, the approximation quality varies over different magnitudes of R.

Fig. 3. Comparison of the average of 100 optimization runs of Algorithm 1 (black, solid) with the iterated dynamics from Eq. (27) under constant σ* = 30 for A = 1 and N = 100. Large population sizes are chosen to ensure global convergence (left: μ = 150; right: μ = 1500; constant μ/λ = 0.5). The iteration using progress (26) is performed for both fi′ = ki + di (red/orange dashed) and fi′(di = 0) = ki (blue dash-dotted) using Equations (27) and (28). The orange dashed iteration was initialized with R(0) = 0.1 and translated to the corresponding position of the simulation for easier comparison. The evaluation of the quality variance Di^2(R) is shown in the supplementary material (https://github.com/omam-evo/paper/blob/main/ppsn22/PPSN22_OB22.pdf). (Color figure online)

5. Evolution Dynamics

As we are interested in the dynamical behavior of the ES, averaged real optimization runs of Algorithm 1 will be compared to the iterated dynamics using the progress result (26) by applying the dynamical systems approach [2]. Neglecting fluctuations, i.e., setting yi(g+1) = E[yi(g+1) | σ(g), y(g)], the mean value dynamics of the mapping yi(g) → yi(g+1) immediately follows from (11), giving

y_i^{(g+1)} = y_i^{(g)} - \varphi_i\!\left( \sigma^{(g)}, \mathbf{y}^{(g)} \right). (27)

The control scheme of σ(g) was introduced in Eq. (10) and yields simply

\sigma^{(g)} = \sigma^{*} \lVert \mathbf{y}^{(g)} \rVert / N. (28)

Equations (27) and (28) describe a deterministic iteration in search space and a rescaling of the mutation strength according to the residual distance. For a convergence analysis, we are interested in the dynamics of R(g) = ║y(g)║ rather than the actual position values y(g). Hence, Fig. 3 shows the R(g)-dynamics of the conducted experiments.
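
A deterministic iteration of (27) and (28) can be sketched as follows; here phi denotes any evaluation of the progress rate (26), for instance the sampled version above, and the stopping threshold is an illustrative assumption. Plotting the returned residual distances over g yields R(g)-curves of the kind compared in Fig. 3.

```python
import numpy as np

def iterate_dynamics(phi, y0, sigma_star, g_max=500, r_stop=1e-2):
    """Mean value dynamics: y_i <- y_i - phi_i per Eq. (27), sigma per Eq. (28)."""
    y = np.asarray(y0, dtype=float)
    N = y.size
    R_history = [np.linalg.norm(y)]
    for g in range(g_max):
        sigma = sigma_star * np.linalg.norm(y) / N                  # Eq. (28)
        y = y - np.array([phi(y, sigma, i) for i in range(N)])      # Eq. (27)
        R_history.append(np.linalg.norm(y))
        if R_history[-1] < r_stop:                                  # illustrative stop
            break
    return np.array(R_history)

# Example (using phi_i from the previous sketch, sigma* = 30 as in Fig. 3):
# R = iterate_dynamics(lambda y, s, i: phi_i(y, s, i, mu=150, lam=300),
#                      y0=np.full(100, 10.0), sigma_star=30.0)
```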

In Fig. 3, all runs of Algorithm 1 exhibit global convergence, with the black line showing the average. The left and right plots differ by population size. Iteration φi(ki), the blue dash-dotted curve, also converges globally, though very slowly, and is therefore not shown entirely. The convergence behavior of iteration φi(fi′), the red and orange dashed curves, strongly depends on the initialization and is discussed below.

Three phases can be observed for the simulation. It shows linear convergence at first, followed by a slow-down due to local attractors. After reaching the global attractor, the convergence speed increases again. Iteration φi(ki) is able to model the first two phases to some degree. Within the global attractor the slope information di is missing, such that the progress is largely underestimated.

Iteration φi(fi′) converges at first, but reaches a stationary state with Rst ≈ 20 when the progress φi becomes dominated by the derivative term di. Starting from R(0) = 10^2, the stationary values yi^st are either fixed or alternate between coordinates, depending on σ, Di, ki, and di. This effect is due to the attraction of local minima and due to the deterministic iteration disregarding fluctuations. It also occurs for varying initial positions. Initialized at R(0) = 10^-1, the orange iteration φi(fi′) converges globally.

It turns out that the splitting point of the two approximations in Fig. 3 occurs at a distance R to the global optimizer where the ES approaches the attractor region of the “first” local minima. For the model parameters considered in the experiment this is at about R ≈ 28.2, the distance of the farthest local minimizer from the global optimizer (obtained by numerical analysis).

The plots in Fig. 3 differ by population size. The convergence speeds, i.e. the slopes, show better agreement for large populations, which can be attributed to the fluctuations neglected in (27). Investigations on the unimodal sphere [2] and ellipsoid [4] functions have shown that progress is decreased by fluctuations due to a loss term scaling with 1/μ, which agrees with Fig. 3. On the left the iterated progress is faster due to neglected but present fluctuations, while on the right better agreement is observed because the fluctuations are insignificant. These observations will be investigated in future research.

6. Summary and Outlook

A first order progress rate φi on the Rastrigin function (1) was derived for the (μ/μI, λ)-ES in (26) by means of noisy order statistics. To this end, the mutation-induced variance of the quality change Di^2 is needed. Starting from (4), a derivation yielding Di^2 has been presented in the supplementary material. Furthermore, the approximation quality of φi was investigated using the Rastrigin and quadratic derivatives fi′ and ki, respectively, by comparing with one-generation experiments.

Linearization fi′ shows good agreement for small-scale mutations, but very large deviations for large mutations. Conversely, linearization ki yields significantly better results for large mutations, as the quadratic fitness term dominates there. A progress rate modeling the transition between the regimes is yet to be determined. First numerical investigations of (14) including all terms of (4) indicate that nonlinear terms are needed for a better progress rate model, which is an open challenge and part of future research.

The obtained progress rate was used to investigate the dynamics by iterating (27) using (28) and comparing with ES runs. Iteration via fi′ converges globally only if initialized close to the optimizer, since local attraction strongly dominates. The dynamics via ki converges globally independent of the initialization, but the observed rate matches only in the initial phase and for very large populations. This confirms the need for a higher order progress rate modeling the effect of fluctuations, especially when function evaluations are expensive and small populations must be used. Additionally, an advanced progress rate formula is needed combining the effects of global and local attraction to model all three phases of the dynamics correctly.

The investigations done so far are a first step towards a full dynamical analysis of the ES on the multimodal Rastrigin function. Future investigations must also include the complete dynamical modeling of the mutation strength control. One aim is the tuning of mutation control parameters such that the global convergence probability is increased while still maintaining search efficiency. Our final goal will be the theoretical analysis of the full evolutionary process yielding also recommendations regarding the choice of the minimal population size needed to converge to the global optimizer with high probability.

Acknowledgments

This work was supported by the Austrian Science Fund (FWF) under grant P33702-N. Special thanks goes to Lisa Schönenberger for providing valuable feedback and helpful discussions.

References

  • 1. Arnold D. Noisy Optimization with Evolution Strategies. Kluwer Academic Publishers; Dordrecht: 2002.
  • 2. Beyer HG. The Theory of Evolution Strategies. Natural Computing Series. Springer; Heidelberg: 2001.
  • 3. Beyer HG. Convergence analysis of evolutionary algorithms that are based on the paradigm of information geometry. Evol Comput. 2014;22(4):679–709. doi: 10.1162/EVCO_a_00132.
  • 4. Beyer HG, Melkozerov A. The dynamics of self-adaptive multi-recombinant evolution strategies on the general ellipsoid model. IEEE Trans Evol Comput. 2014;18(5):764–778. doi: 10.1109/TEVC.2013.2283968.
  • 5. Beyer HG, Sendhoff B. Simplify your covariance matrix adaptation evolution strategy. IEEE Trans Evol Comput. 2017;21(5):746–759. doi: 10.1109/TEVC.2017.2680320.
  • 6. Glasmachers T, Schaul T, Sun Y, Wierstra D, Schmidhuber J. Exponential natural evolution strategies. In: Branke J, et al., editors. GECCO 2010: Proceedings of the Genetic and Evolutionary Computation Conference; New York. 2010. pp. 393–400.
  • 7. Hansen N, Kern S. Evaluating the CMA evolution strategy on multimodal test functions. In: Yao X, et al., editors. PPSN 2004. LNCS, Vol. 3242. Springer; Heidelberg: 2004. pp. 282–291.
  • 8. Hansen N, Müller S, Koumoutsakos P. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evol Comput. 2003;11(1):1–18. doi: 10.1162/106365603321828970.
  • 9. Mobahi H, Fisher J. A theoretical analysis of optimization by Gaussian continuation. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI Press; 2015. pp. 1205–1211.
  • 10. Müller N, Glasmachers T. Non-local optimization: imposing structure on optimization problems by relaxation. In: Foundations of Genetic Algorithms. Vol. 16. ACM; 2021. pp. 1–10.
  • 11. Ollivier Y, Arnold L, Auger A, Hansen N. Information-geometric optimization algorithms: a unifying picture via invariance principles. J Mach Learn Res. 2017;18(18):1–65.
  • 12. Rechenberg I. Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog Verlag; Stuttgart: 1973.
  • 13. Schwefel HP. Numerical Optimization of Computer Models. Wiley; Chichester: 1981.
  • 14. Zhang J, Bi S, Zhang G. A directional Gaussian smoothing optimization method for computational inverse design in nanophotonics. Mater Des. 2021;197:109213. doi: 10.1016/j.matdes.2020.109213.
