Published in final edited form as: Genet Evol Comput Conf. 2023 Jul 12;2023:848–855. doi: 10.1145/3583131.3590451

On a Population Sizing Model for Evolution Strategies Optimizing the Highly Multimodal Rastrigin Function

Lisa Schönenberger 1, Hans-Georg Beyer 1

Abstract

A model is presented that allows for the calculation of the success probability with which a vanilla Evolution Strategy converges to the global optimizer of the Rastrigin test function. From this model, a population size scaling formula is derived that estimates the population size needed to ensure high convergence reliability depending on the search space dimensionality.

CCS Concepts: Theory of computation→Random search heuristics, Mathematics of computing→Bio-inspired optimization

Keywords: Evolution Strategies, global optimization, multi-modal objective function, global convergence, population sizing

1. Introduction

Finding globally optimal solutions in highly multimodal real-valued fitness landscapes by means of Evolution Strategies (ES) [8] depends on the choice of algorithm-specific parameters. For highly multimodal test functions such as Rastrigin, Ackley, Fletcher-Powell, and Bohachevsky, to name a few, the probability that the ES locates the global optimizer is strongly influenced by the choice of the population size. This observation was already made in [10] regarding the CMA-ES [11]; however, it also holds for simple (μ/μI, λ)-ES using isotropic mutations in conjunction with σ self-adaptation (σSA) or cumulative stepsize adaptation (CSA) for mutation strength control. Consider the minimization problem ŷ := argmin_y F(y), y ∈ ℝ^N, where ŷ is the global minimizer of F and N is the search space dimensionality. In order to reach ŷ with high probability, it seems intuitively clear that sufficiently large population sizes are needed for both the number of parents μ and the number of offspring λ. Furthermore, if the number of local minima increases with the search space dimensionality N, it seems plausible that the population sizes μ and λ should increase as well. Since the number of local minima increases exponentially with N, one could expect the population size to increase in a similar manner, as would be the case for the number of multi-starts in classical non-linear numerical optimization strategies. Therefore, the empirical findings in [10] came as a big surprise: in most of the cases considered, the population sizes did not scale exponentially with N, but seemingly in-between 𝒪(N) and 𝒪(N²) (with the Griewank test function as an exception, where the population sizes even decreased with N).

What can be learned from these experimental observations? First of all, the ES does not perform some kind of gradient-following strategy to locate the global optimizer, as is sometimes claimed [16, p. 75f]. This raises the question of how the ES does locate a global optimizer among a huge number of local optima. Furthermore, from the viewpoint of algorithmic efficiency, the question of computational complexity would be of interest here. However, this question is intimately connected to the choice of the population size: if the population size is chosen too small, the probability of reaching the global optimizer will be very small, while a population size chosen too large wastes computing resources. Therefore, this paper is devoted to the derivation of a population sizing equation.

The theoretical analysis of the behavior of ES on highly multimodal test functions is still in its infancy. There are first attempts to extend the progress rate analysis [6] to the Rastrigin function [15]. While this approach is able to take into account many specific details of the ES algorithms and also the influence of the population size parameters μ and λ, it is still restricted to the derivation of mean value dynamics. Another approach models the ES mutation process as a kind of convolution. As has been shown in [14], the convolution of Rastrigin-like functions with a Gaussian kernel can transform the original non-convex minimization problem into a convex one, depending on the kernel parameter (being the mutation strength σ). However, the convolution is an N-fold integration that is performed only approximately by the ES mutation sampling process. That is, the question of how many samples are needed to get a reliably convex result cannot be easily answered. Therefore, the question of how to choose the population size still remains.

It is the goal of this paper to develop a model that describes the convergence behavior of the (μ/μI, λ)-ES to the global optimizer of the Rastrigin function. As a result, a population sizing equation will be obtained that scales like 𝒪(√N ln N). That is, for the Rastrigin test function the population size scales even sublinearly with the search space dimensionality N. The remainder of this paper is organized as follows. First, the ES algorithms to be considered are briefly reviewed. In Section 3 the Rastrigin test function is introduced. In Section 4 the convergence model is developed and the success probability is derived. Section 5 is devoted to the derivation of the population sizing equation. In the concluding Section 6 a summary is given and an outlook regarding future research is presented.

2. ES-Algorithms

It is assumed that the reader is acquainted with the basic (μ/μI, λ)-ES algorithms and the order statistics notation "m; λ" used. The strength σ of the isotropic Gaussian mutations is controlled either by σ self-adaptation (σSA), see Alg. 1, or by cumulative stepsize adaptation (CSA), see Alg. 2. The performance of the algorithms depends on the choice of the learning parameter τ and of the cumulation time constant 1/c and the damping D, respectively, where D = 1/c has been chosen. The standard choice of the learning parameter τ = 1/√(2N) [12] guarantees optimal performance on the sphere model. As for the choice of c in the CSA-ES, 1/N to 1/√N defines an admissible range [2, 9], where the latter results in a faster convergence rate at the price of a lower global success probability Ps on the Rastrigin function (1).

Algorithm 1. The (μ/μI, λ)-σSA Evolution Strategy.

1:  Initialize (y^(0), σ^(0), σ_stop, g = 0)
2:  repeat
3:      for l = 1 to λ do
4:          σ̃_l = σ^(g) · exp(τ · 𝒩(0,1))                 ▹ mutate parental σ
5:          ỹ_l = y^(g) + σ̃_l · (𝒩(0,1), …, 𝒩(0,1))        ▹ mutate y
6:          F̃_l = F(ỹ_l)                                   ▹ evaluate offspring
7:      end for
8:      Sort individuals in ascending order w.r.t. fitness F̃
9:      g = g + 1
10:     y^(g) = (1/μ) Σ_{m=1}^{μ} ỹ_{m;λ}                  ▹ recombine the μ best
11:     σ^(g) = (1/μ) Σ_{m=1}^{μ} σ̃_{m;λ}                  ▹ recombine the μ best σ̃
12: until σ^(g) < σ_stop

Algorithm 2. The (μ/μI, λ)-CSA Evolution Strategy.

1:  Initialize (y^(0), σ^(0), σ_stop, s = 1, g = 0)
2:  repeat
3:      for l = 1 to λ do
4:          z̃_l = (𝒩(0,1), …, 𝒩(0,1))                      ▹ generate search direction
5:          ỹ_l = y^(g) + σ^(g) · z̃_l                       ▹ mutate y
6:          F̃_l = F(ỹ_l)                                    ▹ evaluate offspring
7:      end for
8:      Sort individuals in ascending order w.r.t. fitness F̃
9:      g = g + 1
10:     y^(g) = (1/μ) Σ_{m=1}^{μ} ỹ_{m;λ}                   ▹ recombine the μ best
11:     s = (1 − c)·s + √(μc(2 − c)) · (1/μ) Σ_{m=1}^{μ} z̃_{m;λ}     ▹ update s-path
12:     σ^(g) = σ^(g−1) · exp((‖s‖² − N)/(2DN))             ▹ update σ, see [2, p.13]
13: until σ^(g) < σ_stop
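To make the listings concrete, the following minimal Python sketch implements Algorithm 1, the (μ/μI, λ)-σSA-ES, on the Rastrigin function (1). It is an illustrative re-implementation rather than the code used for the experiments; the default values (N = 100, μ = 100, λ = 200, initial σ, σ_stop, initial distance R(0) = 100) are assumptions chosen to roughly mimic the setting of Fig. 3. The CSA-ES of Algorithm 2 differs only in how σ is updated (its lines 11–12).

import numpy as np

def rastrigin(y, A=1.0, alpha=2.0 * np.pi):
    # Rastrigin function, Eq. (1); works for a single vector or a batch of vectors
    return np.sum(y**2 + A * (1.0 - np.cos(alpha * y)), axis=-1)

def sigma_sa_es(N=100, mu=100, lam=200, sigma=1.0, sigma_stop=1e-8,
                R0=100.0, seed=None):
    # minimal (mu/mu_I, lambda)-sigma-SA-ES, cf. Algorithm 1
    rng = np.random.default_rng(seed)
    tau = 1.0 / np.sqrt(2.0 * N)              # standard learning parameter
    y = rng.standard_normal(N)
    y *= R0 / np.linalg.norm(y)               # random start at distance R0
    g = 0
    while sigma >= sigma_stop:
        # lines 3-7: mutate sigma and y for each of the lambda offspring
        sigmas = sigma * np.exp(tau * rng.standard_normal(lam))
        offspring = y + sigmas[:, None] * rng.standard_normal((lam, N))
        fitness = rastrigin(offspring)
        # lines 8-11: select the mu best and recombine (intermediate recombination)
        best = np.argsort(fitness)[:mu]
        y = offspring[best].mean(axis=0)
        sigma = sigmas[best].mean()
        g += 1
    return y, np.linalg.norm(y), g

if __name__ == "__main__":
    y_end, R_end, gens = sigma_sa_es(seed=1)
    print(f"generations: {gens}, residual distance R = {R_end:.3e}")

Repeating such runs and counting those whose final residual distance ends up near zero (below, say, Δ0) gives an empirical estimate of the success probability Ps in the spirit of Fig. 3.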

3. The Rastrigin Function

The Rastrigin test function F for an N-dimensional search vector y = (y1,...,yN) is given by

F(\mathbf{y}) = \sum_{i=1}^{N} \left[ y_i^2 + A\,(1 - \cos(\alpha y_i)) \right] \quad (1)

where the parameter A > 0 denotes the oscillation amplitude and α denotes the frequency. Unless otherwise stated, the parameters A = 1 and α = 2π are used in all experiments. The global optimizer located at ŷ = 0 is surrounded by κ^N − 1 local minima (e.g., for α = 2π, A = 1: κ = 7 and for α = 2π, A = 10: κ = 63). Figure 1 shows an example 3D-plot of the Rastrigin function. Looking at the contour map in Fig. 2 that includes the global optimizer at ŷ = 0, one sees a square domain (bounded by green lines) in which the negative gradient flow (indicated by small arrows) is directed towards the global minimizer. That is, a gradient strategy initialized in this global attractor region would converge to the global optimizer. The global attractor region is defined by the hypercube

\mathcal{A}_0 := [-\Delta_0, \Delta_0]^N, \quad (2)

where Δ0 is the distance from the global optimizer 0 (the star) to the nearest stationary point(s) (the small filled squares in Fig. 2). The value of Δ0 is determined by a non-linear equation. For Aα² ≫ 2 one finds asymptotically (see Appendix A)

\Delta_0 \approx \frac{A \alpha \pi}{A \alpha^2 - 2}. \quad (3)
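Equation (3) can be checked numerically: along a coordinate axis, the stationary points of y² + A(1 − cos(αy)) solve 2y + Aα sin(αy) = 0, so Δ0 is the smallest positive root of this equation. The following sketch, which assumes SciPy's brentq root finder is available, compares that root with the approximation (3); for A = 1 and α = 2π both give Δ0 ≈ 0.527, the value quoted in Fig. 4.

import numpy as np
from scipy.optimize import brentq

def delta0_numeric(A, alpha):
    # nearest stationary point: first positive root of 2y + A*alpha*sin(alpha*y) = 0;
    # for A*alpha**2 sufficiently large it lies between pi/alpha and 1.5*pi/alpha
    f = lambda y: 2.0 * y + A * alpha * np.sin(alpha * y)
    return brentq(f, np.pi / alpha, 1.5 * np.pi / alpha)

def delta0_asymptotic(A, alpha):
    # asymptotic approximation, Eq. (3), valid for A*alpha**2 >> 2
    return A * alpha * np.pi / (A * alpha**2 - 2.0)

if __name__ == "__main__":
    for A in (1.0, 3.0, 10.0):
        alpha = 2.0 * np.pi
        print(f"A = {A:4.1f}: numeric {delta0_numeric(A, alpha):.4f}, "
              f"Eq. (3) {delta0_asymptotic(A, alpha):.4f}")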

Figure 1. 3D-plot of an N = 2 dimensional Rastrigin function (1) with α = 2π and A = 3.

Figure 2. Global attractor region of the Rastrigin function for N = 2 (green dashed square), α = 2π and A = 1. The star shows the global optimizer, squares the nearest stationary points, and circles the farthest stationary points. Arrows show the negative gradient flow.

Unlike for gradient strategies, it cannot be guaranteed that the (μ/μI, λ)-ES converges globally if the parental centroid y is in 𝒜0. In particular, parents y located in the vicinity of the corners of 𝒜0 will produce better offspring only with a probability of about 2^{−N}, thus requiring an exponentially large population size for improvements. On the other hand, parents in the vicinity of the stationary points can even be located outside 𝒜0 and still produce better offspring, allowing for convergence to the global optimizer. That is, for fixed values of A and α, the global attractor domain of an ES, denoted by 𝒜ES, depends on the strategy-specific parameters such as the truncation ratio 𝜗 := μ/λ, the actual mutation strength σ, the learning parameter τ and the time constant 1/c, respectively. Simplifications are needed to get a manageable model of 𝒜ES. It turns out that

\mathcal{A}_{ES} = [-(\Delta_0 + \varepsilon),\, \Delta_0 + \varepsilon]^N, \quad (4)

can serve as such a model. ε is a small correction term that varies depending on the specific strategy.

In Fig. 3 the dynamics of the distance of the parental centroid to the global optimizer, i.e., R(g) := ‖y(g)‖, is displayed for 200 independent runs of the (100/100I, 200)-σSA-ES, Alg. 1, on the Rastrigin function. One observes a certain percentage of runs getting trapped in local minima. The other runs converge to the global optimizer. Similar graphs can be obtained when running the CSA-ES, Alg. 2. Determining the success probability Ps by which the ES approaches the global optimizer will be the task of the following section. Having a closer look at the dynamics, one sees that there are basically three phases in the evolution process. The first phase can be observed when the initial parental centroid is initialized far enough from the global optimizer. In that case, the ES "sees" basically a sphere model. The influence of the cosine terms in (1) can be neglected and one observes linear convergence behavior. Getting closer to the global optimizer, the yi² parts become comparable in magnitude to the cosine terms, which defines phase II. The influence of the local attractors becomes dominant, slowing down the speed by which the global optimizer is approached. This slow-down can also be seen in the mean value dynamics displayed in Fig. 4. There, the averaged dynamics of the successful individual ES runs are displayed, symbolized by angular brackets. At the end of phase II, the ES is either confined in a local attractor or it has hit the global attractor 𝒜ES. This defines the beginning of phase III, where one again observes linear convergence.

Figure 3. Residual distance dynamics for 200 (100/100I, 200)-σSA-ES runs with τ = 1/√(2N) for N = 100. For each run, the ES was initialized randomly at an expected residual distance R(0) = 100. The success probability is Ps = 0.88.

Figure 4. Mean value dynamics of the (100/100I, 200)-σSA-ES with τ = 1/√(2N) for N = 100 derived from the successful runs displayed in Fig. 3. In addition to the R dynamics, the mutation strength σ and its normalization σ* are displayed. The distance Δ0 ≈ 0.527, cf. Eq. (3), (dashed line) indicates the R = ‖y‖ below which every parental component satisfies yi ∈ [−Δ0, Δ0]. The remaining curves regarding σES, cf. Eq. (30), and √Var[C] are discussed in Sect. 4.1. The asymptote A√(N/2) indicates the maximum of the σES curve.

In addition to the mean value dynamics of R, Fig. 4 shows also the dynamics of the mutation strength and its normalization

\sigma^* := \sigma N / R. \quad (5)

As one can see, the initial σ = 10 was (intentionally) chosen too small. Therefore, self-adaptation increased the normalized σ* to reach typical sphere model values. Entering phase II, one observes a certain decrease of σ*. This reflects the tendency of getting trapped in local attractors. This phase ends at about generation g = 125. At g = 150, R < Δ0 already holds and the ES evolves safely in the global attractor. The central question to be answered in the next section concerns the conditions under which the global attractor is reached with success probability Ps.

4. The Success Probability Model

4.1. The Frozen Noise Model

The Rastrigin function (1) can be divided into two parts. The first is the sphere function R² := Σ_{i=1}^{N} yi², indicating the squared distance to the global optimizer. The second

C(\mathbf{y}) := NA - A \sum_{i=1}^{N} \cos(\alpha y_i) \quad (6)

is called the cosine part. It describes the oscillations of the Rastrigin function, where C(y) ∈ [0, 2NA].

In the case where the distance R to the global optimizer is very large, i.e. R² ≫ NA (phase I), the perturbations caused by the cosine parts are relatively small compared to the sphere model part R². In this case the behavior of the ES on Rastrigin is similar to that on the sphere model. In real runs of the σSA-ES (cf. Fig. 4) and the CSA-ES one observes σ* values, Eq. (5), that are in the order of magnitude of the asymptotically optimal sphere model value σ* = μcμ/μ,λ [4, 13]. cμ/μ,λ is the progress coefficient [6], which is roughly in the range [0.8, 1.2] for truncation ratios 𝜗 ∈ [1/4, 1/2] and sufficiently large λ. Due to (5) it holds σ = σ*R/N = μcμ/μ,λR/N. Since cos(αyi) is periodic, the minimum distance between two of its maxima is 2π/α. This determines the extent of the local attractor regions. As long as σ ⪆ 2π/α, there will be a high probability to jump over those regions. This yields the condition μcμ/μ,λR/N ⪆ 2π/α, which is fulfilled for sufficiently large R, i.e., if one is far away from the global optimizer.

If the ES gets closer to the global optimizer (phase II), the influence of the cosine parts becomes more pronounced compared to the R²-part in (1). The ripples caused by the cosine parts (6) can be interpreted as frozen noise. Thus, the evolution process can be modeled as optimizing a noisy sphere model

F(\mathbf{y}) = R^2 + NA + \sigma_{ES}(R)\,\mathcal{N}(0,1) \quad \text{with} \quad R = \lVert \mathbf{y} \rVert \quad (7)

where σES is the noise strength depending on the distance R to the global optimizer. This needs further justification: Under the assumption of a sufficiently large mutation strength σ, the ES performs a restricted random walk that can be interpreted as exploration. As described in [5], the exploitation step towards the global optimizer is only of order 1/N compared to the exploration step, i.e., the step perpendicular to the direction to the optimizer. This is a random sampling process of global kind (i.e., it is not confined to a local attractor region provided that σ is sufficiently large). The assumption of 𝒩(0,1) Gaussian noise in (7) can be justified by considering the cosine part C(y), Eq. (6), as a sum of independent random variables cos(αyi), for which the central limit theorem of statistics holds. The standard deviation σES(R) of this noise produced by the offspring

\sigma_{ES} = \sqrt{\operatorname{Var}[C]} = A \sqrt{\operatorname{Var}\!\left[ \sum_{i=1}^{N} \cos(\alpha \tilde{y}_i) \right]} \quad (8)

will be derived in Appendix B; it is also displayed in Fig. 4, both experimentally as √Var[C] and by a theoretical estimate σES (for further discussion, see below).
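Since Appendix B is not reproduced here, a simple Monte-Carlo stand-in can illustrate the behavior of the frozen-noise strength: assuming the sampled points have i.i.d. components distributed as 𝒩(0, R²/N), cf. Eq. (10), the standard deviation of the cosine part (6) saturates at A√(N/2) for large R and drops towards zero as R → 0, which is the qualitative shape of the σES curve in Fig. 4. The parameter values in the sketch are illustrative assumptions.

import numpy as np

A, alpha, N = 1.0, 2.0 * np.pi, 100
rng = np.random.default_rng(0)

def sigma_es_mc(R, n_samples=20000):
    # Monte-Carlo estimate of std[C] when the components are ~ N(0, R^2/N)
    y = rng.normal(0.0, R / np.sqrt(N), size=(n_samples, N))
    C = N * A - A * np.cos(alpha * y).sum(axis=1)   # cosine part, Eq. (6)
    return C.std()

if __name__ == "__main__":
    print(f"saturation value A*sqrt(N/2) = {A * np.sqrt(N / 2.0):.3f}")
    for R in (10.0, 3.0, 1.0, 0.5, 0.2):
        print(f"R = {R:5.1f}:  sigma_ES estimate = {sigma_es_mc(R):6.3f}")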

Accepting the noise model (7), converging to the global optimizer of Rastrigin is equivalent to optimizing a noisy sphere model. It is important to note that in the case of constant noise strength σES, an ES optimizing a noisy sphere reaches a steady state R-distribution with Rst := E[R] ≠ 0 and

R_{st} \simeq \sqrt{ \frac{\sigma_{ES}\, N}{4 \mu c_{\mu/\mu,\lambda}} }, \quad (9)

see [3, 12]. That is, the parental centroid y calculated in Line 10 of Alg. 1 and 2 has the expected distance Rst to the global optimizer. Furthermore, each component of y is normally distributed [7]

y_i = (\mathbf{y})_i \sim \mathcal{N}(0, R_{st}^2 / N). \quad (10)

This also holds approximately for the parental distribution of the ES on Rastrigin as long as the ES is not trapped in one of the local attractors. Figure 5 shows two examples of the distribution of a single parent component at two different distances R to the global optimizer. The histogram on the rhs has been obtained for an R in the critical range where the ES has a higher probability of getting trapped in one of the local attractors.

Figure 5. Histogram of all individual components yi at distance R to the optimum for R = 3 (left graph) and R = 2.1 (right graph) for σSA-ES runs with N = 100, μ = 50, 𝜗 = 0.5. The blue line is the pdf of the 𝒩(0, R²/N) variate.

With model (7), the evolution of the ES on Rastrigin can be analyzed as a noisy minimization problem where the ES reaches the vicinity of the global optimizer up to a distance Rst. If this distance is sufficiently small, the ES has reached the global attractor region 𝒜ES and can converge successfully (phase III). Since σES is bounded (see Fig. 4) one can infer from Eq. (9) that global convergence mainly depends on the choice of a sufficiently large μ (assuming 𝜗 = const.).

4.2. Estimating the Success Probability

In order to have convergence to the global optimizer, the parental centroid has to be in the global attractor region, i.e., y ∈ 𝒜ES, Eq. (4). Using (4), the success probability Ps is therefore

P_s = \Pr[\mathbf{y} \in \mathcal{A}_{ES}] = \Pr[(-\Delta_0 - \varepsilon \le y_1 \le \Delta_0 + \varepsilon) \wedge \ldots \wedge (-\Delta_0 - \varepsilon \le y_N \le \Delta_0 + \varepsilon)] = \Pr[-\Delta_0 - \varepsilon \le y \le \Delta_0 + \varepsilon]^N. \quad (11)

Here, the independence of the parental centroid components in the steady state has been used and y is distributed according to (10). Using

\sigma_{st} := \frac{R_{st}}{\sqrt{N}} \overset{(9)}{=} \sqrt{ \frac{\sigma_{ES}}{4 \mu c_{\mu/\mu,\lambda}} } \quad (12)

for the standard deviation in (10), one gets for a single component

\Pr[-\Delta_0 - \varepsilon \le y \le \Delta_0 + \varepsilon] = \Pr\!\left[ -\frac{\Delta_0 + \varepsilon}{\sigma_{st}} \le z \le \frac{\Delta_0 + \varepsilon}{\sigma_{st}} \right] = \Phi\!\left( \frac{\Delta_0 + \varepsilon}{\sigma_{st}} \right) - \Phi\!\left( -\frac{\Delta_0 + \varepsilon}{\sigma_{st}} \right) \quad (13)

where Φ(z) is the cdf of the standard normal variate z ~ 𝒩(0, 1). Thus, one gets for the success probability

P_s = \left[ 2\,\Phi\!\left( \frac{\Delta_0 + \varepsilon}{\sigma_{st}} \right) - 1 \right]^N. \quad (14)

Due to Eq. (12), σst depends on the strength σES of the frozen noise generated by the offspring. This standard deviation, which depends on the parental R, will be derived for the CSA-ES in Appendix B and is displayed in Fig. 4 together with an experimentally obtained curve for the σSA-ES labeled as √Var[C]. Apart from small deviations caused by the different offspring σ̃l values in the σSA-ES, which do not exist for the CSA-ES, the general tendency of the curves is the same: the frozen noise strength σES stays constant and only starts to drop if the distance R is of the order of Δ0. That is, even if the ES enters the global attractor region 𝒜ES, σES is still in the vicinity of its maximum value. Therefore, one can replace σES by its maximum value σES = A√(N/2). Plugging this into (12) yields

\sigma_{st} = \sqrt{ \frac{A \sqrt{N}}{4 \sqrt{2}\, \mu c_{\mu/\mu,\lambda}} }. \quad (15)

If inserted into (14), one finally obtains the success probability formula

P_s = \left[ 2\,\Phi\!\left( \sqrt{ \frac{4 \sqrt{2}\, \mu c_{\mu/\mu,\lambda}}{A \sqrt{N}} }\, (\Delta_0 + \varepsilon) \right) - 1 \right]^N. \quad (16)
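A direct numerical evaluation of (16) is straightforward; the only non-trivial ingredient is the progress coefficient cμ/μ,λ, which in the sketch below is estimated by Monte Carlo as the expected average of the μ largest of λ standard normal samples. The helper names, sample sizes, and the use of SciPy's norm.cdf are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.stats import norm

def c_mu_lam(mu, lam, trials=10000, seed=0):
    # Monte-Carlo estimate of c_{mu/mu,lambda}: expected mean of the
    # mu largest out of lam i.i.d. standard normal samples
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((trials, lam))
    return np.sort(z, axis=1)[:, -mu:].mean()

def success_probability(mu, N, A=1.0, alpha=2.0 * np.pi, eps=0.0, theta=0.5):
    # P_s according to Eq. (16), with Delta_0 from Eq. (3)
    lam = int(round(mu / theta))
    c = c_mu_lam(mu, lam)
    delta0 = A * alpha * np.pi / (A * alpha**2 - 2.0)
    x = np.sqrt(4.0 * np.sqrt(2.0) * mu * c / (A * np.sqrt(N))) * (delta0 + eps)
    return (2.0 * norm.cdf(x) - 1.0) ** N

if __name__ == "__main__":
    for mu in (25, 50, 100, 200):
        print(f"mu = {mu:4d}:  P_s = {success_probability(mu, N=100):.3f}")

For μ = 100 (𝜗 = 1/2) and N = 100, formula (16) with ε = 0 gives Ps ≈ 0.96, somewhat above the experimentally observed 0.88 of Fig. 3; such ε = 0 deviations are discussed in the next subsection.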

4.3. Comparison with Experiments

The predictive quality of the success probability formula (16) with (3) is evaluated for the σSA-ES and the CSA-ES in Fig. 6 using ε = 0. For the σSA-ES the learning parameter τ = 1/√(2N) was used, and for the CSA-ES c = 1/√N was chosen. Each data point was obtained by at least 500 independent runs of the ES. As expected, there are differences between the experimental data and the predictions. However, the general tendencies are well covered by (16). One can obtain better predictions in the case of the σSA-ES if one chooses τ = 1/√(4N) (not shown in this paper).

Figure 6. Success probability Ps vs. population size μ predicted by Eq. (16) with ε = 0. Experimental results are displayed by the stars. σSA-ES in top row with 𝜗 = 1/2 (left) and 𝜗 = 1/4 (right). CSA-ES in bottom row with 𝜗 = 1/2 (left) and 𝜗 = 1/4 (right).

In order to improve the predictions, one needs the correction term ε ≠ 0 in (16). The results are presented in Fig. 7, where ε was chosen according to Fig. 8. The ε values were determined experimentally by minimizing the sum of squared differences between (16) and the experimental values. As one can see, the model of a global success domain 𝒜ES in terms of (4) provides success curves that agree well with the real ES runs. Therefore, Eq. (16) can be used to derive a population sizing formula and to evaluate its scaling behavior.

Figure 7. Ps predicted by Eq. (16) with ε values according to Fig. 8 for 𝜗 = 1/2 (left column) and 𝜗 = 1/4 (right column). Upper row displays the σSA-ES with τ = 1/√(2N). Middle row displays the CSA-ES with c = 1/√N. Bottom row displays the CSA-ES with c = 1/N.

Figure 8. The dependence of ε on N such that the deviations between Eq. (16) and the experimental values in Fig. 7 are minimal.

5. Population Sizing

5.1. Derivation of Parent Population Size

The central question of this paper regards the choice of μ and λ that guarantees convergence of the ES towards the global optimizer. Given a fixed truncation ratio 𝜗, it suffices to derive a formula that predicts μ(Ps). Solving Eq. (16) for μ, under the assumption that cμ/μ,λ depends (asymptotically) only on the truncation ratio 𝜗 [6, p. 249], yields after a simple calculation

\mu \gtrsim \frac{A}{\sqrt{2}\, c_{\mu/\mu,\lambda}} \, \frac{\sqrt{N}}{4 (\Delta_0 + \varepsilon)^2} \left[ \Phi^{-1}\!\left( \frac{1}{2} + \frac{1}{2} P_s^{1/N} \right) \right]^2, \quad (17)

where Φ−1 is the quantile function of the standard normal distribution.
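The following sketch evaluates (17) numerically. The progress coefficient is set to the fixed value cμ/μ,λ ≈ 0.8, an illustrative assumption appropriate for 𝜗 = 1/2 and large λ (cf. Section 4.1); SciPy's norm.ppf serves as Φ^{−1}.

import numpy as np
from scipy.stats import norm

def population_size(N, Ps, A=1.0, alpha=2.0 * np.pi, eps=0.0, c=0.8):
    # parental population size according to Eq. (17), Delta_0 from Eq. (3)
    delta0 = A * alpha * np.pi / (A * alpha**2 - 2.0)
    q = norm.ppf(0.5 + 0.5 * Ps ** (1.0 / N))     # Phi^{-1}(1/2 + 1/2 Ps^(1/N))
    return A / (np.sqrt(2.0) * c) * np.sqrt(N) / (4.0 * (delta0 + eps) ** 2) * q**2

if __name__ == "__main__":
    for N in (30, 100, 300, 1000):
        mu = population_size(N, Ps=0.99)
        ratio = mu / (np.sqrt(N) * np.log(N))
        print(f"N = {N:5d}:  mu = {mu:7.1f},  mu / (sqrt(N) ln N) = {ratio:.2f}")

The ratio in the last column varies only slowly with N, in line with the 𝒪(√N ln N) asymptotics stated in (18) below.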

5.2. Comparison with Experiments

Figure 9 compares the prediction of the population size Eq. (17), depending on N and Ps for ε = 0, with experiments. 300 runs were executed to obtain each of the experimental data points displayed by the markers (+, ×, and ∘). While the theoretical predictions of (17) with ε = 0 differ from the experimental values, Eq. (17) predicts the general functional tendency well. The deviations are due to the different sizes of 𝒜ES encoded in ε. For the σSA-ES the population size is underestimated, in accordance with the negative values of ε in Fig. 8. In contrast to that, ε = 0 results in an overestimation of the population size for the CSA-ES in both cases c = 1/N and c = 1/√N.

Figure 9. Population size for Ps = 50% (blue curve and markers) and Ps = 99% (purple curve and markers). The data points obtained by ES runs represent the σSA-ES (+) with τ = 1/√(2N), the CSA-ES with c = 1/√N (×), and with c = 1/N (∘). Gray dashed-dotted lines show functions ∝ √N ln(N) for comparison.

As will be shown in Appendix C, (17) with ε = 0 behaves asymptotically like

\mu = \mathcal{O}(\sqrt{N} \ln(N)). \quad (18)

Respective curves proportional to √N ln(N) are displayed by gray dashed-dotted curves in Fig. 9. As one can infer from the data in Figs. 9 and 8, the population size scaling of 𝒪(√N ln(N)) can serve as an upper bound for the CSA-ES versions considered. As for the σSA-ES with τ = 1/√(2N), the growth rate is slightly above √N ln(N) for Ps = 0.5. However, corrections to the scaling law √N ln(N) cannot be obtained, indicating the limits of the model used, which does not take into account the influence of τ and c, respectively, on the σ adaptation.

Besides the N-scaling, the influence of the Rastrigin parameters A and α on the population sizing is of interest. To this end, a closer look at (3) reveals that A → ∞ ⇒ Δ0 → π/α. Thus, the influence of A in (17) becomes linear for sufficiently large A. This can be verified by the experiments presented in the left column of Fig. 10 for both the σSA-ES and the CSA-ES. ε in (17) was calculated by minimizing the difference between the slope of the experimental values and that of Eq. (17) for values larger than A = 4.

Figure 10. Scaling behavior of μ, Eq. (17), depending on the Rastrigin parameters A (left column) and α (right column). Top row represents the σSA-ES and the bottom row the CSA-ES with c = 1/N and 𝜗 = 1/2 optimizing the N = 100 case. Markers with dashed lines represent the experiments, where each data point was obtained by 500 independent runs. The grey dashed-dotted straight lines are linear and quadratic growth curves for comparison purposes.

In order to derive the scaling behavior w.r.t. α, it is important to realize that α → ∞ ⇒ Δ0 → π/α. That is, the extension of 𝒜ES shrinks with increasing α. Therefore, ε must shrink analogously, ε → ε/α. As a result, the term (Δ0 + ε)^{−2} in (17) becomes α²/(π + ε)². Therefore, the population size must grow quadratically with α. This is experimentally confirmed on the rhs of Fig. 10. ε in (17) was determined experimentally for values of α larger than 2.5π by minimizing the sum of squared differences between the experimental values and Eq. (17) using (Δ0 + ε)^{−2} = α²/(π + ε)².
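Both scaling claims can be checked directly on Eq. (17) with ε = 0, using the same illustrative value cμ/μ,λ ≈ 0.8 as in the earlier sketch: with α fixed, μ/A approaches a constant for growing A, and with A fixed, μ/α² approaches a constant for growing α.

import numpy as np
from scipy.stats import norm

def mu_eq17(N, Ps, A, alpha, eps=0.0, c=0.8):
    # Eq. (17) with Delta_0 from Eq. (3); c is an illustrative progress coefficient
    delta0 = A * alpha * np.pi / (A * alpha**2 - 2.0)
    q = norm.ppf(0.5 + 0.5 * Ps ** (1.0 / N))
    return A / (np.sqrt(2.0) * c) * np.sqrt(N) / (4.0 * (delta0 + eps) ** 2) * q**2

if __name__ == "__main__":
    N, Ps = 100, 0.99
    for A in (4.0, 8.0, 16.0, 32.0):                 # linear growth in A
        print(f"A = {A:5.1f}:       mu / A       = {mu_eq17(N, Ps, A, 2 * np.pi) / A:7.1f}")
    for k in (2.5, 5.0, 10.0, 20.0):                 # quadratic growth in alpha
        a = k * np.pi
        print(f"alpha = {k:4.1f}*pi: mu / alpha^2 = {mu_eq17(N, Ps, 1.0, a) / a**2:7.2f}")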

6. Conclusions

Reaching the global minimum of Rastrigin, a function that has a huge number of local minima, is a hopeless endeavor when tackled by gradient-based nonlinear optimization techniques. Yet, Evolution Strategies are able to find the global optimizer provided that the population size has been chosen sufficiently large. The question answered in this paper is how large "sufficiently large" is. To this end, a model has been developed that allows for the calculation of the probability Ps of reaching the global optimizer of the Rastrigin function depending on the population size parameters μ and λ. The basic idea was to separate Rastrigin into a sphere model part and a noise term and to apply the theory of ES performance on noisy sphere models. Given a fixed noise strength, an ES with fixed μ, λ cannot approach the optimizer of the sphere model arbitrarily closely. Instead, its parents fluctuate about the optimizer with an expected distance Rst. However, if this Rst is sufficiently small such that the parents hit the global attractor region, the ES will converge to the global minimum. The model abstracts from the details of the ES used, i.e., whether the σSA-ES or the CSA-ES is used. Therefore, the influence of strategy-specific parameters such as τ and c is not incorporated in the model. Yet, the model yields remarkable predictions. Basically, the parental population size must scale like 𝒪(√N ln(N)) in order to get reliable convergence to the global optimizer. That is, the growth is sublinear and slightly above √N. This is in contrast to gradient-based restart strategies that need an exponential number of restarts. Besides the influence of the search space dimensionality N, the influence of the Rastrigin parameters on the necessary population size came out of the analysis. While for sufficiently large A the population size scales linearly, the influence of the spatial frequency α is quadratic.

It seems that the analysis method presented can be extended to other highly multimodal test functions, provided that a reasonable noisy sphere model can be constructed. This will be a road for future research.

Supplementary Material

Appendix

Acknowledgments

This work was supported by the Austrian Science Fund (FWF) under grant P33702-N. Special thanks goes to Amir Omeradzic for providing valuable feedback and helpful discussions.

Contributor Information

Lisa Schönenberger, Email: lisa.schoenenberger@fhv.at.

Hans-Georg Beyer, Email: hans-georg.beyer@fhv.at.

References

• [1] Abramowitz M, Stegun IA. Pocketbook of Mathematical Functions. Verlag Harri Deutsch, Thun, 1984.
• [2] Arnold DV. Noisy Optimization with Evolution Strategies. Kluwer Academic Publishers, Dordrecht, 2002.
• [3] Arnold DV, Beyer H-G. Performance Analysis of Evolution Strategies with Multi-Recombination in High-Dimensional ℝ^N-Search Spaces Disturbed by Noise. Theoretical Computer Science. 2002;289:629–647.
• [4] Arnold DV, Beyer H-G. Performance Analysis of Evolutionary Optimization With Cumulative Step Length Adaptation. IEEE Trans. Automat. Control. 2004;49(4):617–622.
• [5] Beyer H-G. On the "Explorative Power" of ES/EP-like Algorithms. In: Porto VW, Saravanan N, Waagen D, Eiben AE (eds). Evolutionary Programming VII: Proceedings of the Seventh Annual Conference on Evolutionary Programming. Springer-Verlag, Heidelberg, 1998, pp. 323–334.
• [6] Beyer H-G. The Theory of Evolution Strategies. Springer, Heidelberg, 2001.
• [7] Beyer H-G, Arnold DV, Meyer-Nieberg S. A New Approach for Predicting the Final Outcome of Evolution Strategy Optimization under Noise. Genetic Programming and Evolvable Machines. 2005;6(1):7–24.
• [8] Beyer H-G, Schwefel H-P. Evolution Strategies: A Comprehensive Introduction. Natural Computing. 2002;1(1):3–52.
• [9] Hansen N. Verallgemeinerte individuelle Schrittweitenregelung in der Evolutionsstrategie. Doctoral thesis. Technical University of Berlin, Berlin, 1998.
• [10] Hansen N, Kern S. Evaluating the CMA Evolution Strategy on Multimodal Test Functions. In: Yao X, et al. (eds). Parallel Problem Solving from Nature VIII. Springer, Berlin, 2004, pp. 282–291.
• [11] Hansen N, Müller SD, Koumoutsakos P. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation. 2003;11(1):1–18. doi: 10.1162/106365603321828970.
• [12] Meyer-Nieberg S. Self-Adaptation in Evolution Strategies. Ph.D. Dissertation. University of Dortmund, CS Department, Dortmund, Germany, 2007.
• [13] Meyer-Nieberg S, Beyer H-G. On the Analysis of Self-Adaptive Recombination Strategies: First Results. In: Proceedings of the CEC'05 Conference, Piscataway, NJ, 2005, pp. 2341–2348.
• [14] Müller N, Glasmachers T. Non-local Optimization: Imposing Structure on Optimization Problems by Relaxation. In: Foundations of Genetic Algorithms (FOGA) 16. ACM, 2021, pp. 1–10.
• [15] Omeradzic A, Beyer H-G. Progress Rate Analysis of Evolution Strategies on the Rastrigin Function: First Results. In: Aguirre H, et al. (eds). Parallel Problem Solving from Nature XVII. Springer, Berlin, 2022, pp. 499–511.
• [16] Rechenberg I. Evolutionsstrategie '94. Frommann-Holzboog Verlag, Stuttgart, 1994.
