Author manuscript; available in PMC: 2024 Apr 11.
Published in final edited form as: Proc ACM SIGEVO Conf Found Genet Algorithms. 2023 Aug 30;2023:117–128. doi: 10.1145/3594805.3607126

Convergence Properties of the (μ/μI, λ)-ES on the Rastrigin Function

Amir Omeradzic 1, Hans-Georg Beyer 1
PMCID: PMC7615820  EMSID: EMS189097  PMID: 38606013

Abstract

The highly multimodal Rastrigin test function is analyzed by deriving a new aggregated progress rate measure. It is derived as a function of the residual distance to the optimizer by assuming normally distributed positional coordinates around the global optimizer. This assumption is justified for successful ES-runs operating with sufficiently slow step-size adaptation. The measure enables the investigation of further convergence properties. For moderately large mutation strengths a characteristic distance-dependent Rastrigin noise floor is derived. For small mutation strengths local attraction is analyzed and an escape condition is established. Both mutation strength regimes combined pose a major challenge when optimizing the Rastrigin function, which can be counteracted by increasing the population size. Hence, a population scaling relation needed to achieve high global convergence rates is derived, which shows good agreement with experimental data.

CCS Concepts: • Theory of computation → Theory of randomized search heuristics; Probabilistic computation • Mathematics of computing → Bio-inspired optimization

Keywords: Evolution Strategy, global optimization, progress rate, multimodal function

1. Introduction

Evolution Strategies (ES) have proven to be well suited for the optimization of highly multimodal real-valued fitness functions due to their underlying stochastic nature. Test functions such as the Rastrigin function contain a huge number of local minima scaling exponentially with the search space dimensionality N. Sampling the search space and applying multiple restarts with standard gradient-based optimization algorithms quickly becomes infeasible. On the other hand, ES achieve high success rates for global convergence on certain multimodal functions if sufficiently large population sizes are chosen. Experimental investigations in [9] indicate a non-exponential population scaling of at most O(N²) for the tested multimodal functions, with the Rastrigin function scaling sub-linearly in N. However, there is little understanding of how the ES explores the fitness landscape to find the global optimizer without getting trapped in one of the local minima. This paper investigates the convergence properties on the Rastrigin function based on progress rate theory results of [11] and [12]. To this end, a new aggregated progress rate is introduced, modeling the progress as a function of the residual distance to the optimizer. The obtained results are compared to real ES-runs operating with isotropic mutations and self-adaptation for step-size control.

In Sec. 2 the ES under investigation is introduced. In Sec. 3 the Rastrigin function is defined and averaging methods are discussed. Then, the method is applied to component-wise progress rate equations in Sec. 4 obtaining an aggregated measure. In Sec. 5 a population sizing relation is derived and compared to experimental results. Local attraction is discussed in Sec. 6 and a characteristic “escape” mutation strength is derived. Finally, in Sec. 7 conclusions and an outlook are provided.

2. The ES-Algorithm

The ES under investigation, see Algorithm 1, consists of μ parents and λ offspring with truncation ratio ϑ = μ/λ (1 ≤ μ < λ). Selection of the m = 1, …, μ best individuals (out of λ) is denoted by the subscript "m; λ". Normally distributed isotropic mutations of strength σ are applied. Intermediate multi-recombination with equal weights is used to obtain the parental location y^(g) = [y_1^(g), …, y_N^(g)] in the N-dimensional search space for each generation g. For the σ-adaptation (self-adaptation) the offspring mutation strengths are chosen from a log-normal distribution with learning parameter τ. A smaller τ-value therefore yields a slower σ-adaptation, which will be important later. The default choice is τ = 1/√(2N), see [10], which ensures optimal performance on the sphere in the limit N → ∞. Recombination also applies to the selected σ-values.
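As a concrete illustration, Algorithm 1 can be sketched in a few lines of Python. This is our own minimal implementation with illustrative parameter choices, not the code used for the paper's experiments:

```python
import math
import random

def rastrigin(y, A=1.0, alpha=2.0 * math.pi):
    # Eq. (1): f(y) = sum_i [ y_i^2 + A(1 - cos(alpha*y_i)) ]
    return sum(yi * yi + A * (1.0 - math.cos(alpha * yi)) for yi in y)

def sigma_sa_es(f, y0, sigma0, mu, lam, tau, generations):
    """Minimal (mu/mu_I, lambda)-sigma-SA-ES following Algorithm 1."""
    y, sigma = list(y0), sigma0
    n = len(y)
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            # line 5: log-normal mutation of the step size
            s = sigma * math.exp(tau * random.gauss(0.0, 1.0))
            # lines 6-7: isotropic mutation of the parental location
            child = [yi + s * random.gauss(0.0, 1.0) for yi in y]
            # line 8: fitness evaluation
            offspring.append((f(child), s, child))
        # line 10: rank offspring, keep the mu best
        offspring.sort(key=lambda t: t[0])
        best = offspring[:mu]
        # lines 11-12: intermediate recombination of positions and step sizes
        y = [sum(c[2][i] for c in best) / mu for i in range(n)]
        sigma = sum(c[1] for c in best) / mu
    return y, sigma

if __name__ == "__main__":
    random.seed(1)
    n = 10
    y, s = sigma_sa_es(rastrigin, [50.0] * n, 10.0, mu=20, lam=40,
                       tau=1.0 / math.sqrt(2.0 * n), generations=300)
    print(math.sqrt(sum(yi * yi for yi in y)))   # residual distance R
```

The population sizes and the initialization are arbitrary test values; the learning parameter follows the default τ = 1/√(2N) quoted above.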

3. Rastrigin Function and Averaging

The Rastrigin test function f is defined for a real-valued search vector y = [y1, …, yN] as

f(y) = Σ_{i=1}^{N} [ y_i² + A(1 − cos(α y_i)) ],   (1)

Algorithm 1. (μ/μI, λ)-σSA-ES.

1:  g ← 0
2:  initialize (y^(0), σ^(0))
3:  repeat
4:      for l = 1, …, λ do
5:          σ̃_l ← σ^(g) e^(τ 𝒩_l(0,1))
6:          x̃_l ← σ̃_l 𝒩_l(0,1)
7:          ỹ_l ← y^(g) + x̃_l
8:          f̃_l ← f(ỹ_l)
9:      end for
10:     (f̃_{1;λ}, …, f̃_{m;λ}, …, f̃_{μ;λ}) ← sort(f̃_1, …, f̃_λ)
11:     y^(g+1) ← (1/μ) Σ_{m=1}^{μ} ỹ_{m;λ}
12:     σ^(g+1) ← (1/μ) Σ_{m=1}^{μ} σ̃_{m;λ}
13:     g ← g + 1
14: until termination criterion

with oscillation amplitude A and frequency α. The function is minimized and the global minimizer is located at ŷ = 0. All simulations in this paper will be conducted at the default α = 2π (unless stated otherwise). Depending on A, the function shows a finite number M of local attractors for each of the N dimensions, such that the function contains M^N − 1 local minima, scaling exponentially with N. Note that for |y_i| > αA/2 the derivative satisfies ∂f/∂y_i ≠ 0 (for any i), such that no further local minima occur.

As an introductory example, real optimization runs are shown in Fig. 1 using Algorithm 1. The global convergence is investigated by evaluating the dynamics of the parental distance to the global optimizer denoted by R(g) = ‖y(g)‖. All runs are initialized far away from the local attractor landscape. The black line shows the median of all successful runs (the mean can also be used, but the median is more robust w.r.t. outliers). The global convergence probability is denoted as PS. After the initial phase, the σSA-ES maintains a nearly constant normalized mutation strength σ* ≈ 30 with a characteristic dip at g ≈ 200. σ* is defined as

σ* = σN/R.   (2)

Figure 1. Dynamic runs (500 trials) of Algorithm 1 using the (100/100I, 200)-σSA-ES on the Rastrigin function with N = 100 and A = 1 at learning parameter τ = 1/√(2N). The upper plot shows the R-dynamics, while the lower plot shows the normalized mutation strength (2). The green-yellow color palette depicts globally converging runs, while the cyan-magenta colors show locally converging runs. The black line marks the median of all successful runs. The measured success probability is PS = 0.91.

A constant σ*-level ensures scale-invariance on the sphere function and therefore linear convergence. The observations from Fig. 1 will be investigated in more detail throughout the paper.

The Rastrigin fitness (1) is defined as a function of y = [y_1, …, y_N]. Convergence, however, is usually measured as a function of the residual distance R, see Fig. 1. The quantity R is therefore an aggregated measure over all components. In general, the convergence properties can be investigated using progress rate equations (see Sec. 4). A component-wise progress rate, however, derived and discussed in [11], has the disadvantage of not being an aggregated measure. As an example, positive progress (convergence) between two generations occurs for decreasing R^(g+1) < R^(g) even if some components deteriorate and show negative progress. Furthermore, the analytic treatment of N equations is infeasible for large dimensionalities. Hence, the idea will be to express y-dependent functions as average values over all positions satisfying ‖y‖ = R to obtain aggregated measures. First, the approach is presented on the Rastrigin function and later transferred to its corresponding progress rate in Sec. 4.

The averaging problem to be solved is

f̄(R) := average_{‖y‖=R} [f(y)].   (3)

An approach to obtain f¯(R) is to integrate the function over the (N − 1)-dimensional sphere surface SN (R) with radius ∥y∥ = R and normalize by the sphere surface according to

f̄(R) = (1/S_N) ∮_{‖y‖=R} f(y) ds,   (4)

where ds denotes the (hyper-)surface element. The sphere surface area for N ≥ 2 evaluated using the gamma function Γ is given by

S_N(R) = 2π^{N/2} R^{N−1} / Γ(N/2).   (5)

Applying (4) to (1), the first two terms can be evaluated easily noting that R² = Σ_i y_i². Integrating over a constant yields S_N. Therefore, one gets the intermediate result

f̄(R) = R² + NA + T(R),   (6)

with

T(R) := −(A/S_N) ∮_{‖y‖=R} Σ_{i=1}^{N} cos(α y_i) ds.   (7)

Closed-form solutions of (7) can be obtained for N = 1 and N = 2. Starting with N = 1, only two discrete points are relevant (no integration necessary), with the two possible solutions y_1 = ±R. Averaging over these two points therefore yields

T(R) = −(A/2) Σ_{y_1=±R} cos(α y_1) = −A cos(αR).   (8)

For N = 2 one can use polar coordinates (y_1, y_2) = (R cos ϕ, R sin ϕ) with derivative vector d(y_1, y_2)/dϕ = (−R sin ϕ, R cos ϕ) on ϕ ∈ [0, 2π). Additionally, one has S_2 = 2πR. Therefore, inserting this parametrization into (7) and using the path element length ‖d(y_1, y_2)/dϕ‖ = R one has

T(R) = −(A/(2πR)) ∫_0^{2π} Σ_{i=1}^{2} cos[α y_i(R, ϕ)] ‖d(y_1, y_2)/dϕ‖ dϕ = −(A/(2π)) ∫_0^{2π} [cos(αR cos ϕ) + cos(αR sin ϕ)] dϕ.   (9)

The integrals obtained in (9) can be solved in terms of Bessel functions of the first kind Jn (x) with n ≥ 0 by applying the integral identity [1, p. 360, 9.1.18]

J_0(x) = (1/π) ∫_0^{π} cos(x sin t) dt = (1/π) ∫_0^{π} cos(x cos t) dt.   (10)

Due to the periodicity, integrating cos t and sin t over [0, π] yields the same contribution as the integration over [π, 2π]. Thus, one can extend the integral bounds of (10) as

2J_0(x) = (1/π) ∫_0^{2π} cos(x sin t) dt = (1/π) ∫_0^{2π} cos(x cos t) dt.   (11)

Comparing (9) with (11) and setting x = αR, the expression (9) is evaluated as

T(R) = −(A/(2π)) [2π J_0(αR) + 2π J_0(αR)] = −2A J_0(αR).   (12)

The final result for f̄(R) is summarized as

f̄(R) = R² + A(1 − cos(αR))   for N = 1,   (13)
f̄(R) = R² + 2A(1 − J_0(αR))   for N = 2.   (14)

Examples of results (13) and (14) are shown in Fig. 2. The analytic equations are compared to sampled results, where for each R random isotropic positions are chosen with ‖y‖ = R and averaged over 10⁴ trials. Excellent agreement can be observed. Furthermore, one notices a decrease of the oscillation effect when N is increased. This will be useful for the subsequent approach. Unfortunately, integral (4) yields analytically exact results only for the cases N < 3. In the context of the progress rate theory of Sec. 4 an approach is needed which can be applied to arbitrarily large dimensionality N.
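The N = 2 result can also be verified numerically. The following sketch (our own check, not part of the paper) evaluates the circular average of Eq. (1), cf. Eq. (9), by direct quadrature and compares it against the Bessel-function form (14), with J_0 computed from its power series:

```python
import math

def j0(x, terms=60):
    # power series J0(x) = sum_k (-1)^k (x^2/4)^k / (k!)^2
    s, term = 0.0, 1.0
    for k in range(terms):
        s += term
        term *= -(x * x / 4.0) / ((k + 1.0) ** 2)
    return s

def fbar_n2_quadrature(R, A, alpha, steps=4000):
    # direct circular average of Eq. (1) for N = 2, cf. Eq. (9)
    total = 0.0
    for k in range(steps):
        phi = 2.0 * math.pi * k / steps
        y1, y2 = R * math.cos(phi), R * math.sin(phi)
        total += (y1 * y1 + A * (1.0 - math.cos(alpha * y1))
                  + y2 * y2 + A * (1.0 - math.cos(alpha * y2)))
    return total / steps

def fbar_n2_bessel(R, A, alpha):
    # Eq. (14): fbar(R) = R^2 + 2A(1 - J0(alpha*R))
    return R * R + 2.0 * A * (1.0 - j0(alpha * R))
```

For periodic integrands, the equidistant rule above converges very fast, so the two evaluations agree to high accuracy, e.g. for A = 10 and α = 2π.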

Figure 2. Average Rastrigin function evaluated for A = 10 as a function of R. The solid black lines show sampled results using Eq. (1) for N = 1 and N = 2, respectively. The overlaid dotted green line shows (13) and the dashed cyan line (14).

The approach presented now will evaluate the average value of f(y) assuming independent, normally distributed random coordinates y_i with zero mean and variance σ_y² according to

y_i ∼ σ_y 𝒩(0, 1).   (15)

The approach can be considered as averaging by stochastic sampling (assuming large N) instead of analytic integration. For the determination of σy from (15) one demands

R² =! E[Σ_{i=1}^{N} y_i²] = σ_y² E[Σ_{i=1}^{N} 𝒩_i²(0,1)] = σ_y² E[χ_N²] = σ_y² N.   (16)

It was used that the sum over N independent squared standard normally distributed variables is equal to the chi-squared distributed variable χ_N² with E[χ_N²] = N. Solving (16) for σ_y, expression (15) can be rewritten as

y_i ∼ (R/√N) 𝒩(0, 1).   (17)

Equation (17) will be useful for averaging sums over trigonometric functions of y_i, where analytic integration is infeasible. Furthermore, successful ES runs on the Rastrigin function operating under default step-size adaptation also show normally distributed y_i as in (17), see also the experiments in Fig. 4. This property is used again in Sec. 4. As y_i is treated as a random variate for the cosine terms, Eq. (1) is now rewritten as

f(R, Y) ∼ R² + NA − AY,   (18)

with Y containing the sum over the random terms

Y := Σ_{i=1}^{N} cos(α y_i).   (19)

By the Central Limit Theorem, in the limit N → ∞ the sum over i.i.d. variates approaches a normal distribution with Y ∼ E[Y] + √(Var[Y]) 𝒩(0, 1). Additionally, it is shown in Appendix (A.7) that

√(Var[Y]) / E[Y] → 0   for N → ∞,   (20)

with the ratio √(Var[Y])/E[Y] vanishing as O(1/√N). In the asymptotic limit the fluctuation term of Y is negligible, which means that the random variate can be replaced by its expected value E[Y] = N e^{−(αR)²/(2N)} evaluated in Appendix (A.1). Equation (18) therefore yields (the overline denoting the average in the limit N → ∞)

f̄(R) = R² + NA − A E[Y]   (21)
     = R² + NA(1 − e^{−(αR)²/(2N)}).   (22)

Exemplary evaluations of (22) are shown in Fig. 3. The derived results match the sampled results well. Smaller deviations are observed for small values N = 3 or N = 10, which was expected. In the limit N → ∞ the deviations are smoothed out. The limits R → 0 and R → ∞ yield R²-dependent functions, i.e., sphere functions. For R → ∞ the exponential vanishes and NA is negligible, while for R → 0 one has e^{−(αR)²/(2N)} = 1 − (αR)²/(2N) + O(R⁴), giving f̄(R) = (1 + α²A/2) R².
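Such a sampling comparison is easy to reproduce. The sketch below (our own code, mirroring the experiment described for Fig. 3) averages Eq. (1) over random positions with ‖y‖ = R and compares against the closed form (22):

```python
import math
import random

def fbar_closed_form(R, N, A, alpha):
    # Eq. (22): fbar(R) = R^2 + N*A*(1 - exp(-(alpha*R)^2 / (2N)))
    return R * R + N * A * (1.0 - math.exp(-(alpha * R) ** 2 / (2.0 * N)))

def fbar_sampled(R, N, A, alpha, trials=2000, seed=0):
    # Monte Carlo average of Eq. (1) over random positions with ||y|| = R
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(N)]
        norm = math.sqrt(sum(x * x for x in g))
        y = [R * x / norm for x in g]           # uniform on the sphere of radius R
        acc += sum(yi * yi + A * (1.0 - math.cos(alpha * yi)) for yi in y)
    return acc / trials
```

For N = 100 and A = 10 the sampled average agrees with (22) to within the statistical and finite-N error, consistent with Fig. 3.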

Figure 3. Average Rastrigin function evaluated for A = 10 as a function of R. The solid black lines show sampled results of Eq. (1). The overlaid dotted green lines show Eq. (22) for N = {3, 10, 100, 1000}, from bottom to top.

Figure 4. Distribution of realized y_i-values (i = 1, …, N, over 500 trials) for the (100/100I, 200)-ES, N = 100, A = 1, and τ = 1/√(2N) at R ≈ 100 (left) and R ≈ 1 (right). The red solid curve shows the density of the normal distribution y_i ∼ 𝒩(0, R²/N).

4. Progress Rate

4.1. Derivation

The method introduced in Sec. 3 will now be applied to results for the progress rate on the Rastrigin function. The progress rate (denoted by φ) measures the expected positional change in search space between two generations g → g+1 as a function of fitness and ES parameters. A positive value corresponds to the ES approaching the optimizer and vice versa. The second order component-wise progress rate φ_i^II for the parental location y^(g) is defined as [8]

φ_i^II := E[(y_i^(g))² − (y_i^(g+1))² | y^(g), σ^(g)].   (23)

The second order refers to the square of the y_i-values, which ensures φ_i^II > 0 for (y_i^(g+1))² < (y_i^(g))² independent of the sign of y_i. A second order model is needed for a correct model of convergence involving large mutation strengths. The expected values for the determination of (23) were already evaluated in [11] in the asymptotic limit N, μ, λ → ∞ (ϑ = μ/λ = const). The result yields

φ_i^II = c_ϑ (σ²/D_Q) (4y_i² + e^{−(ασ)²/2} 2αA y_i sin(α y_i)) − σ²/μ.   (24)

In (24) the asymptotic progress coefficient [11] is given by

c_ϑ = e^{−[Φ^{−1}(ϑ)]²/2} / (√(2π) ϑ),   (25)

with Φ^{−1}(·) denoting the quantile function of the standard normal variate. c_ϑ is related to the progress coefficient via c_{μ/μ,λ} → c_ϑ (for μ, λ → ∞ with constant ϑ = μ/λ), see also [6, p. 249]. The quality gain variance D_Q² at location y given σ was evaluated in [12] giving

D_Q²(y) = Σ_{i=1}^{N} { 4σ² y_i² + 2σ⁴ + (A²/2)[1 − e^{−(ασ)²}][1 − cos(2α y_i) e^{−(ασ)²}] + 2Aασ² e^{−(ασ)²/2} [ασ² cos(α y_i) + 2y_i sin(α y_i)] }.   (26)

The first term of (24) is usually referred to as the gain term, while the second term is the loss term characteristic for intermediate recombination. A distinct property of the Rastrigin function is that the gain term (y_i-dependent) is not necessarily positive, as is the case for unimodal functions. This property will be discussed later.

The first step to obtain an R-dependent aggregation of expression (23) is to sum over all N components

Σ_{i=1}^{N} φ_i^II = E[Σ_{i=1}^{N} (y_i^(g))² − Σ_{i=1}^{N} (y_i^(g+1))²] = E[(R^(g))² − (R^(g+1))²],   (27)

such that one can define the R-dependent progress rate

φ_R^II := E[(R^(g))² − (R^(g+1))² | R^(g), σ^(g)].   (28)

Given the sphere function f_sph(R) = R², one can relate (28) to the sphere quality gain via E[f_sph(R^(g)) − f_sph(R^(g+1))] = φ_R^II, such that the quality gain normalization [5, p. 173] is applicable. This yields the normalized R-dependent progress rate (labeled by the asterisk "*")

φ_R^II,* := (N/(2R²)) φ_R^II.   (29)

For N → ∞ one has φ_R^II,* → φ_sph^*, see [2, p. 16], yielding the normalized sphere progress rate. Expression (29) has two important properties. First, it is an aggregated progress rate measure over all N components, which is new for the Rastrigin function. Second, its relation to the sphere function enables a direct comparison of progress rates.

A prerequisite for the further derivation will be the assumption of normally distributed y_i ∼ 𝒩(0, R²/N), see (17). This property is experimentally confirmed in Fig. 4, using the data of the 500 trials shown in Fig. 1, displayed at two residual distances. Good agreement is observed between the expected density (red curve) and the histogram. Each component contributes roughly as y_i² ≈ R²/N to the overall residual distance R². This concept of "equal contribution" is not new and was investigated in [7] for the quality gain on the ellipsoid. Slightly larger deviations occur at R ≈ 1 (right), where local attraction is more significant, see also the later discussion of Fig. 12. At small mutation strengths, where local attraction occurs, the assumption of course breaks down.

Now φ_R^II is derived starting from (24). Performing the summation one gets

φ_R^II(R, y) = c_ϑ (σ²/D_Q) (4R² + e^{−(ασ)²/2} 2αA Σ_{i=1}^{N} y_i sin(α y_i)) − Nσ²/μ.   (30)

Similarly, the summation of the variance terms in (26) yields

D_Q²(R, y) = 4σ²R² + 2Nσ⁴ + (A²/2)[1 − e^{−(ασ)²}] Σ_{i=1}^{N} [1 − cos(2α y_i) e^{−(ασ)²}] + 2Aασ² e^{−(ασ)²/2} [ασ² Σ_{i=1}^{N} cos(α y_i) + 2 Σ_{i=1}^{N} y_i sin(α y_i)].   (31)

Analogous to (19) and (20), the sums over the y_i-dependent trigonometric terms of (30) and (31) will be replaced by their respective expectation values, assuming y_i ∼ (R/√N) 𝒩(0, 1) and neglecting fluctuations for N → ∞. The needed expected values are derived in Appendix (A.1), (A.2), and (A.3) giving

E[Σ_{i=1}^{N} cos(α y_i)] = N e^{−(αR)²/(2N)}   (32)
E[Σ_{i=1}^{N} cos(2α y_i)] = N e^{−2(αR)²/N}   (33)
E[Σ_{i=1}^{N} y_i sin(α y_i)] = αR² e^{−(αR)²/(2N)}.   (34)

Furthermore, it is shown in Appendix A that √(Var[Σ_i(·)])/E[Σ_i(·)] → 0 for N → ∞ for all three sums. Finally, a fully R-dependent expression can be given for the progress rate

φ_R^II = c_ϑ (2R²σ²/D_Q(R)) (2 + α²A e^{−(α²/2)(σ² + R²/N)}) − Nσ²/μ.   (35)

Analogously, the R-dependent quality gain variance yields

D_Q²(R) = 4R²σ² + 2Nσ⁴ + (NA²/2)[1 − e^{−(ασ)²}][1 − e^{−α²(σ² + 2R²/N)}] + 2NAα²σ² e^{−(α²/2)(σ² + R²/N)} [σ² + 2R²/N].   (36)

Result (35) is important as it measures the progress on the Rastrigin function in R-space, aggregating the individual progress rates φ_i^II. Note that the first term of (35), i.e., the gain term, is now strictly positive, which is in contrast to Eq. (24).

One-generation experiments are conducted in Fig. 5 by performing single optimization steps for given mutation strength and averaging the results of the progress rates (23) and (28), respectively, over 10⁴ trials. Furthermore, the simulations are compared to the analytic expressions (24) and (35). To this end, two configurations (constant R = 7) are overlaid, one having y fixed, and one with randomly sampled y-values satisfying ‖y‖ = R. The values are normalized using (29) and displayed using scale-invariant mutations σ* of Eq. (2). All results are similar for moderate and large σ*-values showing good agreement. Differences emerge at small σ*. The fixed y = [0.7, …, 0.7] was chosen to lie within a local attractor. In this case Σ_i φ_i^II(y) correctly predicts negative progress for small σ*, while φ_R^II falsely assumes normally distributed y_i-coordinates and predicts positive progress. This error vanishes for large σ*, i.e., when the ES is searching at larger scales. Therefore, one can conclude that φ_R^II is a suitable aggregated measure of the component-wise φ_i^II, if sufficiently large mutations are applied. Indeed, real (successful) ES-runs, such as in Fig. 1 or later in Fig. 6, tend to maintain high σ*-levels, such that the normality assumption for y_i stays valid. Local attraction (assuming small σ*) is investigated further in Sec. 6.
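The aggregated expressions are straightforward to evaluate numerically. The following sketch (our own code, not from the paper) implements (25), (35), (36), and the normalization (29); setting A = 0 removes all Rastrigin-specific terms, so the sphere reduction can be checked against the closed sphere formula:

```python
import math
from statistics import NormalDist

def c_theta(theta):
    # Eq. (25): asymptotic progress coefficient
    q = NormalDist().inv_cdf(theta)
    return math.exp(-0.5 * q * q) / (math.sqrt(2.0 * math.pi) * theta)

def d_q(R, sigma, N, A, alpha):
    # Eq. (36): R-dependent quality gain standard deviation D_Q(R)
    e1 = math.exp(-(alpha * sigma) ** 2)
    e2 = math.exp(-alpha**2 * (sigma**2 + 2.0 * R**2 / N))
    e3 = math.exp(-0.5 * alpha**2 * (sigma**2 + R**2 / N))
    var = (4.0 * R**2 * sigma**2 + 2.0 * N * sigma**4
           + 0.5 * N * A**2 * (1.0 - e1) * (1.0 - e2)
           + 2.0 * N * A * alpha**2 * sigma**2 * e3 * (sigma**2 + 2.0 * R**2 / N))
    return math.sqrt(var)

def phi_R_star(sigma_star, R, N, mu, theta, A, alpha):
    # Eqs. (35) and (29): normalized aggregated progress rate
    sigma = sigma_star * R / N                     # de-normalize via Eq. (2)
    e3 = math.exp(-0.5 * alpha**2 * (sigma**2 + R**2 / N))
    gain = (c_theta(theta) * 2.0 * R**2 * sigma**2 / d_q(R, sigma, N, A, alpha)
            * (2.0 + alpha**2 * A * e3))
    phi = gain - N * sigma**2 / mu
    return N / (2.0 * R**2) * phi                  # normalization (29)

def phi_sphere(sigma_star, N, mu, theta):
    # sphere progress rate, Eq. (39)
    return (c_theta(theta) * sigma_star / math.sqrt(1.0 + sigma_star**2 / (2.0 * N))
            - sigma_star**2 / (2.0 * mu))
```

For A = 0 the two functions coincide for any R, reproducing the sphere special case discussed below in the text.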

Figure 5. One-generation experiments with 10⁴ repetitions for the (100/100I, 200)-ES, N = 100, A = 1 at constant R = 7. Black circles show the experimentally evaluated (23), summed over i, for constant y = [0.7, …, 0.7]. Blue crosses show (28), where y is randomly sampled for each trial such that ‖y‖ = R. The green dash-dotted line shows (24) (summed over i) and the red dashed line (35). All values are normalized using (29). The error bars are vanishing and not shown.

Figure 6. Progress rate φ_R^II,* for the (100/100I, 200)-ES with N = 100 and A = 1. High progress rate values are shown in yellow and blue values indicate small (negative) progress. The boundary φ_R^II,* = 0 is shown in bold white. The black (left) curve displays the median dynamics of Fig. 1 (τ = 1/√(2N), PS = 0.91), while the red (right) curve shows the same ES with τ = 1/√(8N) and PS = 0.99.

A few important remarks regarding results (35) and (36) are made now. Given expression (36), the variance can be written more compactly as

D_Q² = D_sph² + D_Ras²,   (37)

where D_sph² = 4R²σ² + 2Nσ⁴ corresponds to the quality gain variance of the sphere function [6]. The term D_Ras² = D_Q²(R) − D_sph² is Rastrigin-specific. In the limit of vanishing exponential functions (R → ∞), see later in Sec. 5, the term simplifies significantly, giving the so-called Rastrigin (maximum) noise strength

D_Ras² → NA²/2 =: σ_Ras².   (38)

Having derived (35) and (36), the sphere progress rate φ_sph^* can be recovered as a special case. It can be obtained from φ_R^II in multiple ways. The technical details are not shown since the calculations are simple and straightforward; only the main steps are explained now. As the normalized progress is constant on the sphere for constant scale-invariant mutations σ*, Eqs. (35) and (36) need to be rewritten as φ_R^II(σ*) and D_Q(σ*) by setting σ = σ*R/N via (2). Furthermore, normalization (29) needs to be applied. One way to recover φ_sph^* is by setting A = 0 or α = 0, which removes all Rastrigin-specific terms. Another way is applying the limit R → ∞, which suppresses the exponential terms. Additionally, the constant term NA²/2 is negligible in (36) for R → ∞, see also Appendix (B.5). The third way is the limit R → 0. All exponentials contain arguments being a function g(R²) after inserting σ = σ*R/N. Performing a Taylor expansion yields e^{−g(R²)} = 1 − g(R²) + O(R⁴) with negligible higher order terms. After simplification, all three approaches yield

φ_R^II,* = φ_sph^* = c_ϑ σ*/√(1 + σ*²/(2N)) − σ*²/(2μ),   (39)

and for N → ∞ the well-known asymptotic formula

φ_sph^* = c_ϑ σ* − σ*²/(2μ).   (40)

Both (39) and (40) are scale-invariant (R-independent) expressions. As a conclusion, the Rastrigin progress rate yields the sphere progress rate in the limits R → ∞ and R → 0. This result is important and was expected from (1), as y_i² is dominating at large scales. For y_i → 0 the global attractor is essentially a quadratic function.

An important property of φ_sph^* is that for sufficiently small σ* one has φ_sph^* > 0, while for too large σ*-values the progress rate becomes negative. The second (non-trivial) zero of (39), denoted by σ_φ0^*, is derived in Appendix B by setting φ_sph^* = 0 and yields in (B.8)

σ_φ0^* = [(N² + 8N c_ϑ² μ²)^{1/2} − N]^{1/2}.   (41)

Due to the same global (quadratic) structure, result (41) will also be applicable to the Rastrigin function as an upper bound for σ*.

4.2. Progress Landscape

A more detailed analysis of the progress rate (35) is provided now. Given the fitness parameters A, α, and N, the expression φ_R^II,*(σ*, R) is essentially a function of only two variables. Therefore, the results will be displayed in a two-dimensional σ*-R-space denoted as the progress landscape. Note that for the sphere function, see Eqs. (39) and (40), the progress rate is constant for all R (given σ* and N).

Figure 6 shows an example progress landscape, evaluated for σ* ∈ [0, σ_end^*] and R ∈ [10⁻¹, 10²]. The value σ_end^* is chosen slightly larger than σ_φ0^*, see Eq. (41), as for σ* > σ_φ0^* the progress rate becomes negative. The R-range was chosen large enough to provide good visibility of the relevant characteristics. Thin black lines display regions of equal progress rate level. For R → ∞ and R → 0 the sphere limit is recovered (vertical lines of constant progress).

The medians of real runs (black and red curves) show a characteristic σ*-drop (also visible in Fig. 1), which is directly related to the progress rate zero. The ESs are moving around the progress dip in σ*-R-space. Interestingly, the σSA-ES with τ = 1/√(2N) has a global convergence probability PS = 0.91, while τ = 1/√(8N) yields PS = 0.99. Maximizing σ* (while σ* ≤ σ_φ0^*) therefore maximizes PS, which is associated with a smaller learning parameter τ. This effect can also be observed for the CSA-ES (cumulative step-size adaptation), where a higher PS is observed for smaller cumulation constant values (due to a slower change of σ). The downside of large mutation strengths is less efficiency when optimizing the sphere limits (the sphere-optimal value for the (100/100I, 200)-ES, see Fig. 6, is at σ* ≈ 19). The respective median R-dynamics reaches the stopping value R = 10⁻³ at g ≈ 400 (τ = 1/√(2N)), while g ≈ 1100 for τ = 1/√(8N).

One observes that φ_R^II,* > 0 for sufficiently small σ*. This means that positive progress is expected at any R for arbitrarily small σ* > 0, which contradicts experimental observations, see also Fig. 5, as small σ* significantly increases the local convergence probability. Hence, local attraction is not modeled correctly by φ_R^II,*. Furthermore, the progress dip of Fig. 6 is not related to single local attractors. It is a cumulative effect of oscillations in all N dimensions related to the Rastrigin noise term (38). This is investigated in the next section.

5. Convergence and Population Sizing

In this section the convergence properties on the Rastrigin function are discussed. Global convergence (in expectation) requires φ_R^II,* > 0 for R ∈ (0, ∞). The boundary φ_R^II,* = 0, see Fig. 6, is therefore of most interest, especially the progress dip and its location in σ*-R-space. As shown in Appendix B, a closed-form solution can only be obtained under certain (simplifying) assumptions. An analytical solution for R(σ*), such that φ_R^II,* = 0, cannot be given due to the non-linearity of the underlying equations.

In the limit R → ∞, all exponentials of (35) and (36) vanish. The resulting equation for φ_R^II,*(σ*, R) = 0 simplifies significantly with D_Q² = D_sph² + NA²/2, see (37), such that a fourth order polynomial is obtained in Eq. (B.9) as

σ*⁴ + 2Nσ*² + N⁴A²/(4R⁴) − 8N c_ϑ² μ² = 0.   (42)

Solving (42) for R yields the zero-progress line

R_φ0(σ*) = ( (N⁴A²/4) / (8N c_ϑ² μ² − 2Nσ*² − σ*⁴) )^{1/4},   (43)

which is visualized in Fig. 7 as a black dashed line. An important relation to the noisy sphere model can be made. In [3] the residual location error R was derived for the (μ/μI, λ)-ES assuming a constant noise strength σϵ in the limit σ* → 0 as

R ≈ (σ_ϵ N / (4 c_{μ/μ,λ} μ))^{1/2}.   (44)

Figure 7. Progress rate φ_R^II,* for the (100/100I, 200)-ES with N = 100 and A = 3. The black dashed line shows Eq. (43). Two lines show Eq. (46) with δ = 5 (yellow, top) and δ = 1 (magenta, bottom), respectively. Crosses indicate the intersection points obtained by Eq. (49). Note that the progress dip at A = 3 is significantly larger compared to A = 1 from Fig. 6.

Applying the limit σ* → 0 to Eq. (43) and identifying the constant noise strength of the Rastrigin function (for sufficiently large R) as σ_Ras² = NA²/2 via (38) yields

R_φ0|_{σ*=0} = (N³A² / (32 c_ϑ² μ²))^{1/4} = (σ_Ras N / (4 c_ϑ μ))^{1/2},   (45)

which corresponds to result (44) with c_{μ/μ,λ} → c_ϑ and σ_ϵ = σ_Ras.
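Both characteristic quantities are easy to evaluate. The sketch below (our own code) implements the zero-progress line (43) and the noise-floor limit (45), which coincide at σ* = 0:

```python
import math
from statistics import NormalDist

def c_theta(theta):
    # Eq. (25)
    q = NormalDist().inv_cdf(theta)
    return math.exp(-0.5 * q * q) / (math.sqrt(2.0 * math.pi) * theta)

def r_phi0(sigma_star, N, mu, theta, A):
    # Eq. (43): zero-progress line in the limit R -> infinity
    c = c_theta(theta)
    denom = 8.0 * N * c**2 * mu**2 - 2.0 * N * sigma_star**2 - sigma_star**4
    return (0.25 * N**4 * A**2 / denom) ** 0.25

def r_floor(N, mu, theta, A):
    # Eq. (45): residual distance at sigma* = 0 (Rastrigin noise floor)
    sigma_ras = A * math.sqrt(N / 2.0)      # Eq. (38)
    return math.sqrt(sigma_ras * N / (4.0 * c_theta(theta) * mu))
```

Evaluated for the (100/100I, 200)-ES configuration of Fig. 7 (ϑ = 1/2, A = 3), the two agree at σ* = 0, and R_φ0 increases with σ* as long as the denominator of (43) stays positive.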

Results (43) and (45) explain the σ*-decrease observed in Fig. 1 and Fig. 6, occurring as the ES approaches the Rastrigin noise floor. The red curve (τ = 1/√(8N)) decreases σ* in order to have positive progress at all, while the black curve (τ = 1/√(2N)) exhibits smaller σ*-values, keeping a larger distance to the φ_R^II,* = 0 boundary. Thus, the latter realizes a larger local progress. This is the result of the faster adaptation of σ (due to the larger τ). As a result, one has smaller mutations, which are in turn more prone to be trapped in a local attractor. This is reflected by a lower success probability PS. The smaller τ, however, yields a larger PS-value by keeping a higher σ*-level.

In the limit R → 0, σ* increases again, as the ES reaches the global attractor optimizing a sphere function with constant σ* (same level as for R → ∞). Since there is no closed-form solution for the progress dip location, a different approach is needed to model the transition point. Recalling that R_φ0 of (43) was derived in the limit of vanishing exponential terms, a natural extension of this model is to parametrize the point at which the terms are vanishing. This can also be motivated by looking at Eq. (22) and Fig. 3, where the exponential term models the transition between the sphere limits. Hence, a transition relation R_tr(σ*) is introduced. It can be obtained by investigating the characteristic exponential term of φ_R^II in (35), which also occurs in variance (36). Introducing an attenuation factor δ > 0 and setting σ = σ*R_tr/N, one can demand

e^{−δ} =! e^{−((αR_tr)²/2) [(σ*/N)² + 1/N]},   such that   R_tr(σ*) = (√(2δN)/α) · 1/√(1 + σ*²/N).   (46)

It is assumed that δ is independent of the fitness and strategy parameters. Figure 7 shows R_φ0 from (43) and R_tr from (46) with two exemplary evaluations δ = 1 and δ = 5. One observes that R_φ0 (black dashed line) follows the zero-progress line up until the dip minimum is reached. The dip location along the R-axis is well approximated by the constant noise limit (45) at σ* = 0. The R_tr-curves (magenta and yellow, respectively) follow a characteristic path depending on the chosen attenuation factor δ. The intersection point σ_sec^* of both curves, namely

R_φ0(σ_sec^*) =! R_tr(σ_sec^*),   (47)

parametrizes the dip location and will give insight into the population scaling μ(N, α, A). Setting R_φ0 = R_tr, one obtains a fourth order polynomial in σ* as

σ*⁴ + 2Nσ*² + N² (α⁴A² − 128 δ² c_ϑ² μ²/N) / (α⁴A² + 16δ²) = 0.   (48)

The real non-negative solution of (48) yields after simplification the intersection point

σ_sec^* = [ N (1 + 8c_ϑ²μ²/N)^{1/2} (1 + α⁴A²/(16δ²))^{−1/2} − N ]^{1/2},   (49)

which is visualized in Fig. 7. Convergence on the sphere requires

0 < σ_sec^* < σ_φ0^*,   (50)

with the sphere-zero σ_φ0^* given in Eq. (41). The relation σ_sec^* < σ_φ0^* follows immediately for any A, α, δ > 0, and setting A = 0 or α = 0 yields σ_sec^* = σ_φ0^*. Demanding σ_sec^* > 0 in (49), it must hold that

8c_ϑ²μ²/N > α⁴A²/(16δ²).   (51)

Solving (51) for μ one arrives at the important population sizing result

μ > √(N/2) α²A / (8 c_ϑ δ).   (52)

Expression (52) relates the fitness-dependent parameters to the population size μ. For the subsequent experiments we will investigate the scaling properties of (52) without considering its potential prefactors. To this end, repeated experiments of Algorithm 1 are performed and the success probability PS is measured. Then, the necessary population size μ is evaluated to achieve a high success rate of PS ≥ 0.99. The results of Figs. 8, 9, and 10 show good agreement with the parameter scaling predicted by Eq. (52). The μ(N)-scaling from the experimental results is clearly sub-linear (as already observed in [9]) and indicates a scaling slightly larger than N^{1/2}. Some fluctuations can be observed, which is practically inevitable, as very large N and μ are tested, posing limits on the available CPU resources. Furthermore, certain deviations of the experiments from prediction (52) are expected to occur, as the underlying model is based on an expected value, see (23), without considering possible higher order moments of the y_i-distribution causing fluctuations.
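The scaling behavior of the bound (52) can be made concrete in a few lines (our own sketch; δ is the attenuation factor of (46), here an arbitrary test value):

```python
import math
from statistics import NormalDist

def c_theta(theta):
    # Eq. (25)
    q = NormalDist().inv_cdf(theta)
    return math.exp(-0.5 * q * q) / (math.sqrt(2.0 * math.pi) * theta)

def mu_lower_bound(N, A, alpha, theta, delta):
    # Eq. (52): mu > sqrt(N/2) * alpha^2 * A / (8 * c_theta * delta)
    return math.sqrt(N / 2.0) * alpha**2 * A / (8.0 * c_theta(theta) * delta)
```

By construction the bound scales as μ ∝ √N, μ ∝ A, and μ ∝ α², which is exactly the parameter dependence tested in Figs. 8, 9, and 10.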

Figure 8. Population sizing μ(N) using the σSA-ES with τ = 1/√(2N) and α = 2π. The top plot shows ϑ = 1/4 with A = 1, while the bottom plot shows ϑ = 1/2 with A = 10. The dotted cyan lines depict μ ∝ N^{1/2}, the dash-dotted green lines μ ∝ N^{5/8}, and the dashed red lines μ ∝ N. The number of evaluated trials, for increasing N, is 2000, 2000, 1000, 700, 500, 400 (top) and 3000, 3000, 1500, 1000, 700, 600 (bottom).

Figure 9. Population sizing μ(A) using the σSA-ES with ϑ = 1/4, N = 100, τ = 1/√(2N), and α = 2π. The dotted cyan lines show μ ∝ A. For each data point 2000 trials were evaluated.

Figure 10. Population sizing μ(α) using the σSA-ES with ϑ = 1/4, N = 100, τ = 1/√(2N), and A = 1. The dotted cyan lines show μ ∝ α². For each data point 2000 trials were evaluated.

6. Local Attraction

In this section the limitations of the R-dependent progress rate φ_R^II are discussed by investigating local attraction effects. In case of local convergence one has σ → 0 (equivalently σ* → 0) while R stagnates. In this case the local structure of the fitness landscape is dominating. Hence, the assumption of y_i ∼ (R/√N) 𝒩(0, 1) being normally distributed around the optimizer cannot hold. While the progress landscapes show positive progress for small σ*-values, this does not imply global convergence of real ES runs, see e.g. Fig. 12. It should be intuitively clear that for too small mutation strengths local convergence occurs. This issue was also observed in the one-generation experiments in Fig. 5, where negative progress rates are obtained at certain y if local attraction is present. As the aggregated (R-dependent) formula (35) is not able to model local attraction, a different approach is needed based on the y_i-dependent formula (24). The goal is to derive a σ-condition avoiding local attraction (in expectation). To this end, a characteristic "escape" mutation strength σ_esc is derived. It can serve as an additional stability criterion for the ES.

Starting with φi^II of Eq. (24), the gain function G is defined as

G(yi, σ) := 4yi² + e^{−(ασ)²/2} · 2αA·yi·sin(αyi). (53)

Requiring positive progress φi^II > 0, Eq. (24) yields

(cϑ/DQ)·G(yi, σ) > 1/μ. (54)

At this point the infinite population limit μ → ∞ is assumed in order to obtain closed-form solutions. As 1/μ → 0, it suffices to show that G > 0 for φi^II > 0 to hold. The function G is plotted in Fig. 11 as a function of σ and yi. One observes local attraction regions for small mutation strengths located at y0 ≈ {1, 2, 3, 4, 5}. For small σ each of the attractors is a “stable” point, as y0 + ϵ (with ϵ > 0) yields positive gain (decreasing yi in expectation), while y0 − ϵ yields negative gain (increasing yi). For sufficiently large σ > σesc (black dotted line) positive progress can be ensured. The threshold σesc is derived now. Starting with (53), requiring G = 0 and assuming yi ≠ 0, G is refactored as

G = 2yi·G̃, (55)

with G̃ defined as

G̃ := 2yi + e^{−(ασ)²/2} αA sin(αyi) = 0, (56)

yielding a first condition. The second condition ∂G/∂yi = 0 (at σ = σesc) can be inferred from Fig. 11, which yields for (55)

∂G/∂yi = 2G̃ + 2yi·∂G̃/∂yi = 0. (57)

Figure 11. Gain function (53) visualized for A = 10 and α = 2π. The boundary G = 0 is shown in bold white, enclosing regions of negative progress. σesc from (64) is shown as a black dotted line. Only the first five local attractors are shown (out of 31).
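The stability behavior around the first attractor in Fig. 11 can be probed directly from definition (53); the following minimal sketch (assuming A = 10 and α = 2π, as in Fig. 11) confirms the sign pattern of the gain around y0 = 1.

```python
import math

def gain(y, sigma, A=10.0, alpha=2.0 * math.pi):
    # Gain function (53): G(y, sigma) = 4y^2 + e^{-(alpha*sigma)^2/2} * 2*alpha*A*y*sin(alpha*y)
    return (4.0 * y**2
            + math.exp(-0.5 * (alpha * sigma)**2) * 2.0 * alpha * A * y * math.sin(alpha * y))

eps = 0.05
# Small sigma: the attractor near y0 = 1 is "stable" in expectation, since the
# gain is positive just above y0 (yi decreases) and negative just below (yi increases).
stable = gain(1 + eps, 0.1) > 0 and gain(1 - eps, 0.1) < 0
# For sigma above sigma_esc (~0.436 for these parameters) the exponential damps
# the sine term, and the gain is positive on both sides of the attractor.
escaped = gain(1 + eps, 0.5) > 0 and gain(1 - eps, 0.5) > 0
```

Scanning this function over a (yi, σ)-grid reproduces the negative-progress regions enclosed by the white G = 0 boundary in Fig. 11.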

As G̃ = 0 and yi ≠ 0, ∂G/∂yi = 0 is equivalent to ∂G̃/∂yi = 0. Therefore, one has

∂G̃/∂yi = 2 + e^{−(ασ)²/2} α²A cos(αyi) = 0, (58)

such that the following condition is obtained

e^{−(ασ)²/2} αA = −2/(α cos(αyi)). (59)

Inserting condition (59) into (56) yields

2yi − (2/α) sin(αyi)/cos(αyi) = 0. (60)

Introducing the substitution x = αyi and applying sin x/cos x = tan x yields

(2/α)(x − tan x) = 0. (61)

The first non-trivial solution of (61) is the most interesting one, as it corresponds to G = 0 of the first local attractor at yi ≈ 0.75, see Fig. 11. Furthermore, negative gain contributions are due to the sine term in (53). For small |yi| < 1 one has yi² < |yi|, such that the first local attractor corresponds to the worst case requiring the largest σ to obtain G = 0. Numerically solving (61) yields the zero

x0 ≈ 4.493. (62)
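The root (62) is easily reproduced numerically: x − tan x is monotonically decreasing on (π, 3π/2), so plain bisection suffices. A minimal sketch, with the bracket clipped away from the poles of tan:

```python
import math

# Solve x = tan(x) on (pi, 3*pi/2) by bisection. On this interval
# f(x) = x - tan(x) is monotonically decreasing (f' = 1 - sec^2(x) <= 0)
# and changes sign exactly once, at the first non-trivial root.
f = lambda x: x - math.tan(x)
lo, hi = math.pi + 0.1, 1.5 * math.pi - 0.01  # avoid the poles of tan
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if f(lo) * f(mid) <= 0:
        hi = mid
    else:
        lo = mid
x0 = 0.5 * (lo + hi)  # ~ 4.493, cf. Eq. (62)
```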

Multiplying (56) by α, identifying x0 = αyi and σ = σesc (the point of vanishing gain) results in

2x0 + e^{−(ασesc)²/2} α²A sin x0 = 0  ⟺  e^{−(ασesc)²/2} = −2x0/(α²A sin x0). (63)

Resolving (63) for σesc yields the final result

σesc = (1/α) √(2 ln(−α²A sin(x0)/(2x0))) ≈ (1/α) √(2 ln(0.1086 α²A)). (64)
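As a numerical check of (64): for A = 10 and α = 2π the formula gives σesc ≈ 0.436, the value quoted in the experiments with these parameters. A minimal sketch:

```python
import math

X0 = 4.4934  # first non-trivial root of x = tan(x), Eq. (62)

def sigma_esc(A, alpha):
    # Eq. (64): sigma_esc = (1/alpha) * sqrt(2 * ln(-alpha^2 * A * sin(x0) / (2*x0))).
    # Note sin(X0) < 0, so the log argument is positive; it must also exceed 1
    # (i.e. 0.1086 * alpha^2 * A > 1) for a real-valued sigma_esc to exist.
    return math.sqrt(2.0 * math.log(-alpha**2 * A * math.sin(X0) / (2.0 * X0))) / alpha

s = sigma_esc(A=10.0, alpha=2.0 * math.pi)  # ~ 0.436
```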

Figure 12. Median dynamics of unsuccessful runs (out of 100 trials) for the (400/400I, 800)-σSA-ES, N = 100, A = 10, with σesc ≈ 0.436 (red dashed line). The learning parameter was set to τ = 1/√N (blue, left, PS = 0.01), τ = 1/√(2N) (black, center, PS = 0.08), and τ = 1/√(8N) (magenta, right, PS = 0.29).

Figure 12 shows experiments of the (400/400I, 800)-σSA-ES with α = 2π and the relatively large A = 10, such that σesc ≈ 0.436 from result (64). A constant σ translates to R(σ*) = σN/σ*, see normalization (2), showing a 1/σ* characteristic (red dashed line) in the progress landscape. The median of real unsuccessful runs is shown for different learning parameters τ. The sharp decrease σ* → 0 indicates local convergence, which agrees well with the σesc-line. However, dropping below the threshold σ < σesc does not imply that local convergence must occur (see the success rates PS > 0 for all τ). Rather, σesc is a stability criterion: positive component-wise progress is maintained in expectation as long as σ > σesc holds. Of course, this can only hold up to the global attractor, within which σ → 0 must be ensured to obtain convergence.

Figure 13 shows numerically evaluated progress dip locations, see e.g. the dip at σ* ≈ 25 and R ≈ 2 in Fig. 12, for increasing μ-values while keeping ϑ and the fitness parameters constant. It shows how increasing μ shifts the dip location to larger σ*-values and smaller residual distances R. Using larger populations enables the ES to operate at larger mutation strengths and approach the optimizer more closely resulting in higher global convergence probabilities. The σesc-line (red dashed) remains constant as it was derived for μ → ∞. A progress dip located below the σesc-line is critical, as both noise floor and local attraction effects overlap yielding effectively zero success rates.

Figure 13. Numerically evaluated progress dip locations (black dots) for ϑ = 1/2, N = 100, and A = 10 with increasing values μ = {10, 50, 100, 200, …, 1000} from left to right. The red dashed curve shows σesc (displayed as R = σesc N/σ*).

The results obtained from Fig. 12 suggest a synthetic explicit σ-control rule for understanding the meaning of σesc. This rule uses a constant mutation strength σ for a sufficiently high number of generations until the global attractor is reached, and then decreases σ → 0. This is realized by defining a σ(g)-schedule that is constant for the first g < 9000 generations. For 9000 ≤ g ≤ 10⁴ it is decreased multiplicatively as σ^(g+1) = c·σ^(g) (0 < c < 1), such that the stopping criterion σ < 10⁻⁶ is reached at the last generation. The corresponding experiments are conducted in Fig. 14. The single-trial dynamics show that only the run at σesc converges globally (repeated experiments are shown in Fig. 15). ES-runs with σ < σesc tend to converge locally at large R due to the ES getting stuck in the local minima landscape. ES-runs operating at σ > σesc are less prone to local attraction and reach the Rastrigin noise floor at moderately large R (see the intersection of the red and white lines in Fig. 12).
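The schedule above can be written down directly; the decay factor c follows from requiring σ to drop from its constant level to 10⁻⁶ within the final generations. A sketch (the function name and defaults are illustrative):

```python
def sigma_schedule(sigma0, g, g_hold=9000, g_end=10_000, sigma_stop=1e-6):
    """Synthetic sigma-control rule: constant sigma for g < g_hold, then
    multiplicative decay sigma^(g+1) = c * sigma^(g) with 0 < c < 1 chosen
    so that sigma_stop is reached exactly at generation g_end."""
    if g < g_hold:
        return sigma0
    c = (sigma_stop / sigma0) ** (1.0 / (g_end - g_hold))
    return sigma0 * c ** (g - g_hold)
```

For example, with sigma0 = σesc ≈ 0.436 the schedule holds σ constant for 9000 generations and then decays it geometrically to the stopping level 10⁻⁶ at g = 10⁴.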

Figure 14. Single runs using constant σ for the (400/400I, 800)-ES, N = 100, and A = 10. The schedule σ(g) is constant during the first 9000 generations and decays exponentially within the last 1000 generations. One has σ = {0.1, 0.2, 0.3, 0.4, σesc, 0.5, 0.6}, from top to bottom (see ordering at g = 2000).

Figure 15. The success probability PS evaluated using the parameters of Fig. 14 and 500 repetitions for each σ. The peak occurs around σ ≈ σesc = 0.436.

In Fig. 15 different σ-values are tested and the success rate PS is evaluated over 500 repetitions. One observes that PS is maximized at σσesc. As expected, values σ < σesc are less successful due to local attraction. For σ > σesc local attraction is avoided, but the ES fluctuates at a larger residual distance before σ → 0, such that it is more likely to miss the global attractor.

7. Conclusions and Outlook

In this paper results from progress rate theory were applied and extended to investigate the convergence properties on the Rastrigin function. An aggregated, residual-distance-dependent progress rate was obtained assuming normally distributed yi-locations around the optimizer. The progress rate yields useful insights into the search behavior of the ES, which can be illustrated by recalling Fig. 1. Far away from local attraction the ES optimizes the sphere, keeping a constant scale-invariant mutation strength. Approaching the local attractor landscape leads to a significant reduction of the (normalized) mutation strength compared to the initial level. As the mutation strength σ decreases together with σ*, it may fall below the σesc-threshold (see Fig. 12). Having σ ≫ σesc (at σ*-levels comparable to sphere-optimal values) the ES performs a global search and is not significantly influenced by single local attractors. For σ ≪ σesc the search can be regarded as rather local and individual attractors gain importance, such that local convergence occurs with higher probability. Within the global attractor the sphere function is optimized again. Considering the ES performance, a two-fold positive effect of large populations on the success rate can be identified. First, large μ-values decrease the expected residual distance (45) to the global optimizer (similar to optimizing the sphere under constant noise). Second, intermediate recombination reduces the magnitude of the loss term −σ*²/(2μ) in (39). Large μ and recombination therefore allow the ES to operate at larger σ-levels, keeping σ > σesc and enabling a global search.

Furthermore, the progress rate analysis enabled the derivation of the population scaling result in (52), which could be experimentally verified. The result can serve, to some extent, as guidance for the investigation of other highly multimodal test functions, provided a global (spherical) structure with local perturbations exists.

There are multiple issues requiring further research. While it is now clear why large populations and mutation strengths are beneficial when optimizing the Rastrigin function, a detailed analysis of the full σSA-ES or CSA-ES including the step-size adaptation is still pending. Additionally, the ES-efficiency in terms of fitness evaluations as a function of population size, truncation ratio, and learning parameter has not yet been investigated. As the population size is a crucial parameter, the idea of using dynamic population control methods seems natural, see e.g. [4]. Indeed, the theoretical analysis of population size control strategies is an uncharted research field. Furthermore, a probabilistic model would be useful to predict the success rate PS as a function of fitness and ES parameters. Whether the obtained results can be transferred to other multimodal functions also remains part of future research.

Supplementary Material

Appendix

Acknowledgments

This work was supported by the Austrian Science Fund (FWF) under grant P33702-N. The authors thank Lisa Schönenberger for providing valuable feedback.

Contributor Information

Amir Omeradzic, Email: amir.omeradzic@fhv.at.

Hans-Georg Beyer, Email: hans-georg.beyer@fhv.at.

References

  • [1] Abramowitz M, Stegun IA. Pocketbook of Mathematical Functions. Verlag Harri Deutsch, Thun, 1984.
  • [2] Arnold DV. Noisy Optimization with Evolution Strategies. Kluwer Academic Publishers, Dordrecht, 2002.
  • [3] Arnold DV, Beyer H-G. Performance Analysis of Evolution Strategies with Multi-Recombination in High-Dimensional ℝ^N-Search Spaces Disturbed by Noise. Theoretical Computer Science 289 (2002), 629–647.
  • [4] Auger A, Hansen N. A Restart CMA Evolution Strategy with Increasing Population Size. In: Congress on Evolutionary Computation, CEC'05, 2005, pp. 1769–1776.
  • [5] Beyer H-G. Toward a Theory of Evolution Strategies: Some Asymptotical Results from the (1,+λ)-Theory. Evolutionary Computation 1(2) (1993), 165–188.
  • [6] Beyer H-G. The Theory of Evolution Strategies. Springer, Heidelberg, 2001.
  • [7] Beyer H-G, Arnold DV, Meyer-Nieberg S. A New Approach for Predicting the Final Outcome of Evolution Strategy Optimization under Noise. Genetic Programming and Evolvable Machines 6(1) (2005), 7–24.
  • [8] Beyer H-G, Melkozerov A. The Dynamics of Self-Adaptive Multi-Recombinant Evolution Strategies on the General Ellipsoid Model. IEEE Transactions on Evolutionary Computation 18(5) (2014), 764–778.
  • [9] Hansen N, Kern S. Evaluating the CMA Evolution Strategy on Multimodal Test Functions. In: Yao X, et al. (eds.), Parallel Problem Solving from Nature, Vol. 8, Springer, Berlin, 2004, pp. 282–291.
  • [10] Meyer-Nieberg S. Self-Adaptation in Evolution Strategies. Dissertation, Universität Dortmund, Dortmund, Germany, 2007.
  • [11] Omeradzic A, Beyer H-G. Progress Analysis of a Multi-Recombinative Evolution Strategy on the Highly Multimodal Rastrigin Function. Report, Vorarlberg University of Applied Sciences, 2022. https://opus.fhv.at/frontdoor/index/index/docId/4722
  • [12] Omeradzic A, Beyer H-G. Progress Rate Analysis of Evolution Strategies on the Rastrigin Function: First Results. In: Rudolph G, Kononova AV, Aguirre H, Kerschke P, Ochoa G, Tušar T (eds.), Parallel Problem Solving from Nature – PPSN XVII, Springer International Publishing, 2022, pp. 499–511.
