Abstract
The highly multimodal Rastrigin test function is analyzed by deriving a new aggregated progress rate measure. It is derived as a function of the residual distance to the optimizer by assuming normally distributed positional coordinates around the global optimizer. This assumption is justified for successful ES-runs operating with sufficiently slow step-size adaptation. The measure enables the investigation of further convergence properties. For moderately large mutation strengths a characteristic distance-dependent Rastrigin noise floor is derived. For small mutation strengths local attraction is analyzed and an escape condition is established. Both mutation strength regimes combined pose a major challenge when optimizing the Rastrigin function, which can be counteracted by increasing the population size. Hence, a population scaling relation for achieving high global convergence rates is derived, which shows good agreement with experimental data.
CCS Concepts: • Theory of computation → Theory of randomized search heuristics; Probabilistic computation; • Mathematics of computing → Bio-inspired optimization
Keywords: Evolution Strategy, global optimization, progress rate, multimodal function
1. Introduction
Evolution Strategies (ES) have proven to be well suited for the optimization of highly multimodal real-valued fitness functions due to their underlying stochastic nature. Test functions such as the Rastrigin function contain a huge number of local minima scaling exponentially with the search space dimensionality N. Sampling the search space and applying multiple restarts with standard gradient-based optimization algorithms quickly becomes infeasible. On the other hand, ES achieve high success rates for global convergence on certain multimodal functions, if sufficiently large population sizes are chosen. Experimental investigations in [9] indicate a non-exponential population scaling of at most O(N²) for the tested multimodal functions, with the Rastrigin function scaling sub-linearly in N. However, there is little understanding of how the ES is exploring the fitness landscape to find the global optimizer without getting trapped in one of the local minima. This paper investigates the convergence properties on the Rastrigin function based on progress rate theory results of [11] and [12]. To this end, a new aggregated progress rate is introduced modeling the progress as a function of the residual distance to the optimizer. The obtained results are compared to real ES-runs operating with isotropic mutations and self-adaptation for step-size control.
In Sec. 2 the ES under investigation is introduced. In Sec. 3 the Rastrigin function is defined and averaging methods are discussed. Then, the method is applied to component-wise progress rate equations in Sec. 4 obtaining an aggregated measure. In Sec. 5 a population sizing relation is derived and compared to experimental results. Local attraction is discussed in Sec. 6 and a characteristic “escape” mutation strength is derived. Finally, in Sec. 7 conclusions and an outlook are provided.
2. The ES-Algorithm
The ES under investigation, see Algorithm 1, consists of μ parents and λ offspring with truncation ratio ϑ = μ/λ (1 ≤ μ < λ). Selection of the m = 1, …, μ best individuals (out of λ) is denoted by subscript "m; λ". Normally distributed isotropic mutations of strength σ are applied. Intermediate multi-recombination with equal weights is used to obtain the parental location in the N-dimensional search space for each generation g. For the σ-adaptation (self-adaptation) the offspring mutation strengths are chosen from a log-normal distribution with learning parameter τ. A smaller τ-value therefore yields a slower σ-adaptation, which will be important later. The default choice of τ, see [10], ensures optimal performance on the sphere in the limit N → ∞. Recombination also applies to selected σ-values.
3. Rastrigin Function and Averaging
The Rastrigin test function f is defined for a real-valued search vector y = [y1, …, yN] as
| f(y) = Σ_{i=1}^{N} [ y_i² + A (1 − cos(α y_i)) ] | (1) |
Algorithm 1. (μ/μI, λ)-σSA-ES.
1: g ← 0
2: initialize (y(0), σ(0))
3: repeat
4: for l = 1, …, λ do
5: σl ← σ(g) exp(τ 𝒩l(0, 1))
6: zl ← 𝒩l(0, I)
7: yl ← y(g) + σl zl
8: fl ← f(yl)
9: end for
10: sort offspring w.r.t. ascending fitness fl
11: y(g+1) ← (1/μ) Σ_{m=1}^{μ} y_{m;λ}
12: σ(g+1) ← (1/μ) Σ_{m=1}^{μ} σ_{m;λ}
13: g ← g + 1
14: until termination criterion
with oscillation amplitude A and frequency α. The function is minimized and the global minimizer is located at ŷ = 0. All simulations in this paper will be conducted at default α = 2π (unless stated otherwise). Depending on A, the function shows a finite number M of local attractors for each of the N dimensions, such that the function contains M^N − 1 local minima, scaling exponentially with N. Note that for |yi| > αA/2 the derivative ∂f/∂yi = 2yi + Aα sin(α yi) cannot vanish (for any i), such that no further local minima occur.
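Algorithm 1 can be sketched compactly in code. The following is a minimal Python rendering (assuming numpy; all function and variable names are ours, not from the original sources) of the (μ/μI, λ)-σSA-ES with log-normal σ-self-adaptation, truncation selection, and intermediate recombination as described in Sec. 2:

```python
import numpy as np

def rastrigin(y, A=1.0, alpha=2 * np.pi):
    """Rastrigin fitness (1); supports a batch of search vectors."""
    return np.sum(y**2 + A * (1.0 - np.cos(alpha * y)), axis=-1)

def sigma_sa_es(f, y0, sigma0, mu, lam, tau, max_gen=500, seed=None):
    """Sketch of the (mu/mu_I, lambda)-sigma-SA-ES of Algorithm 1."""
    rng = np.random.default_rng(seed)
    y, sigma = np.array(y0, dtype=float), float(sigma0)
    N = y.size
    for _ in range(max_gen):
        # offspring mutation strengths from a log-normal distribution
        sig_l = sigma * np.exp(tau * rng.standard_normal(lam))
        # isotropic mutations of strength sigma_l
        y_l = y + sig_l[:, None] * rng.standard_normal((lam, N))
        # truncation selection of the mu best (minimization)
        idx = np.argsort(f(y_l))[:mu]
        # intermediate multi-recombination of positions and sigma-values
        y, sigma = y_l[idx].mean(axis=0), sig_l[idx].mean()
    return y, sigma
```

On the sphere limit (A = 0) with the common choice τ ∝ 1/√N, this sketch exhibits the linear convergence discussed below.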
As an introductory example, real optimization runs are shown in Fig. 1 using Algorithm 1. The global convergence is investigated by evaluating the dynamics of the parental distance to the global optimizer denoted by R(g) = ‖y(g)‖. All runs are initialized far away from the local attractor landscape. The black line shows the median of all successful runs (the mean can also be used, but the median is more robust w.r.t. outliers). The global convergence probability is denoted as PS. After the initial phase, the σSA-ES maintains a nearly constant normalized mutation strength σ* ≈ 30 with a characteristic dip at g ≈ 200. σ* is defined as
| σ* = σN/R | (2) |
Figure 1.
Dynamic runs (500 trials) of Algorithm 1 using the (100/100I, 200)-σSA-ES on the Rastrigin function with N = 100 and A = 1 at learning parameter τ. The upper plot shows the R-dynamics, while the lower plot shows the normalized mutation strength (2). The green-yellow color palette depicts globally converging runs, while the cyan-magenta colors show locally converging runs. The black line marks the median of all successful runs. The measured success probability is PS = 0.91.
A constant σ*-level ensures scale-invariance on the sphere function and therefore linear convergence. The observations from Fig. 1 will be investigated in more detail throughout the paper.
The Rastrigin fitness (1) is defined as a function of y = [y1, …, yN]. Convergence however is usually measured as a function of the residual distance R, see Fig. 1. Quantity R is therefore an aggregated measure over all components. In general, the convergence properties can be investigated using progress rate equations (see Sec. 4). A component-wise progress rate however, derived and discussed in [11], has the disadvantage of not being an aggregated measure. As an example, positive progress (convergence) between two generations occurs for decreasing R(g+1) < R(g) even if some components deteriorate and show negative progress. Furthermore, analytic treatment of N equations is unfeasible for large dimensionalities. Hence, the idea will be to express y-dependent functions as average values over all positions satisfying ∥y∥ = R to obtain aggregated measures. First, the approach is presented on the Rastrigin function and later transferred to its corresponding progress rate in Sec. 4.
The averaging problem to be solved is
| f̄(R) := ⟨ f(y) ⟩_{∥y∥ = R} | (3) |
An approach to obtain f̄(R) is to integrate the function over the (N − 1)-dimensional sphere surface SN (R) with radius ∥y∥ = R and normalize by the sphere surface according to
| f̄(R) = (1/SN(R)) ∮_{∥y∥=R} f(y) ds | (4) |
where ds denotes the (hyper-)surface element. The sphere surface area for N ≥ 2 evaluated using the gamma function Γ is given by
| SN(R) = 2 π^{N/2} R^{N−1} / Γ(N/2) | (5) |
Applying (4) to (1), the first two terms can be evaluated easily noting that Σ_{i=1}^{N} yi² = R² on the sphere surface. Integrating over a constant yields SN. Therefore, one gets the intermediate result
| f̄(R) = R² + NA − A c̄N(R) | (6) |
with
| c̄N(R) = (1/SN(R)) ∮_{∥y∥=R} Σ_{i=1}^{N} cos(α yi) ds | (7) |
Closed-form solutions of (7) can be obtained for N = 1 and N = 2. Starting with N = 1 only two discrete points are relevant (no integration necessary) with two possible solutions y1 = ±R. Averaging over two points therefore yields
| c̄1(R) = ½ [cos(αR) + cos(−αR)] = cos(αR) | (8) |
For N = 2 one can use polar coordinates (y1, y2) = (R cos ϕ, R sin ϕ) with derivative vector (dy1/dϕ, dy2/dϕ) = (−R sin ϕ, R cos ϕ) on ϕ ∈ [0, 2π). Additionally, one has S2 = 2πR. Therefore, inserting this parametrization into (7) and using the path element length ds = R dϕ one has
| c̄2(R) = (1/2π) ∫₀^{2π} [ cos(αR cos ϕ) + cos(αR sin ϕ) ] dϕ | (9) |
The integrals obtained in (9) can be solved in terms of Bessel functions of the first kind Jn (x) with n ≥ 0 by applying the integral identity [1, p. 360, 9.1.18]
| Jn(x) = (1/π) ∫₀^{π} cos(nt − x sin t) dt | (10) |
Due to periodicity, integrating cos(x sin t) and cos(x cos t) over [0, π] yields the same contribution as the integration over [π, 2π]. Thus, one can extend the integral bounds of (10) (for n = 0) as
| J₀(x) = (1/2π) ∫₀^{2π} cos(x sin t) dt = (1/2π) ∫₀^{2π} cos(x cos t) dt | (11) |
Comparing (9) with (11) and setting x = αR, the expression (9) is evaluated as
| c̄2(R) = 2 J₀(αR) | (12) |
The final result for f (R) is summarized as
| f̄(R) = R² + A (1 − cos(αR)), N = 1 | (13) |
| f̄(R) = R² + 2A (1 − J₀(αR)), N = 2 | (14) |
Examples of results (13) and (14) are shown in Fig. 2. The analytic equations are compared to sampled results, where for each R random isotropic positions with ∥y∥ = R are chosen and averaged over 10⁴ trials. Excellent agreement can be observed. Furthermore, one notices a decrease of the oscillation effect when N is increased. This will be useful for the subsequent approach. Unfortunately, integral (4) yields analytically exact results only for the cases N < 3. In the context of the progress rate theory of Sec. 4 an approach is needed which can be applied to arbitrarily large dimensionality N.
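The Bessel identities used in (10)-(12) are easy to verify numerically. The sketch below (standard library only; function names are ours) compares the extended integral representation of J₀ with its well-known power series J₀(x) = Σ_k (−1)^k (x/2)^{2k}/(k!)²:

```python
import math

def j0_series(x, terms=40):
    """Bessel J0 via its power series sum_k (-1)^k (x/2)^(2k) / (k!)^2."""
    s, term = 0.0, 1.0
    for k in range(terms):
        s += term
        term *= -(x / 2.0) ** 2 / ((k + 1.0) ** 2)
    return s

def j0_integral(x, n=2000):
    """J0 via (1/(2 pi)) * integral of cos(x sin t) over [0, 2 pi)."""
    h = 2.0 * math.pi / n
    # the rectangle rule is spectrally accurate for periodic integrands
    return sum(math.cos(x * math.sin(i * h)) for i in range(n)) * h / (2.0 * math.pi)
```

Both evaluations agree to high precision over the range of arguments relevant here.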
Figure 2.
Average Rastrigin function evaluated for A = 10 as a function of R. The solid black lines show sampled results using Eq. (1) for N = 1 and N = 2, respectively. The overlaid dotted green line shows (13) and the dashed cyan line (14).
The approach presented now will evaluate the average value of f(y) assuming independent, normally distributed random coordinates yi with zero mean and variance σy² according to
| yi ~ 𝒩(0, σy²), i = 1, …, N | (15) |
The approach can be considered as averaging by stochastic sampling (assuming large N) instead of analytic integration. For the determination of σy from (15) one demands
| E[ Σ_{i=1}^{N} yi² ] = N σy² = R² | (16) |
It was used that the sum over N independent standard normally distributed variables squared is a chi-squared distributed variable with E[χ²N] = N. Solving (16) for σy, expression (15) can be rewritten as
| yi ~ 𝒩(0, R²/N), i = 1, …, N | (17) |
Equation (17) will be useful for averaging sums over trigonometric functions of yi, where analytic integration is infeasible. Furthermore, successful ES runs on the Rastrigin function operating under default step-size adaptation also show normally distributed yi as in (17), see also experiments in Fig. 4. This property is used again in Sec. 4. As yi is treated as a random variate for the cosine terms, Eq. (1) is now rewritten as
| f(y) = Σ_{i=1}^{N} yi² + NA − A Y | (18) |
with Y containing the sum over the random terms
| Y = Σ_{i=1}^{N} cos(α yi) | (19) |
By the Central Limit Theorem in the limit N → ∞, the sum over i.i.d. variates approaches a normal distribution with mean E[Y] and variance Var[Y]. Additionally, it is shown in Appendix (A.7) that
| √Var[Y] / E[Y] = O(1/√N) | (20) |
with the ratio vanishing as N → ∞. In the asymptotic limit the fluctuation term of Y is negligible, which means that the random variate can be replaced by its expected value evaluated in Appendix (A.1). Equation (18) therefore yields (overline denoting the average in the limit N → ∞)
| f̄(R) = R² + NA − A E[Y], with E[Y] = N e^{−α²R²/(2N)} | (21) |
| f̄(R) = R² + NA (1 − e^{−α²R²/(2N)}) | (22) |
Exemplary evaluations of (22) are shown in Fig. 3. The derived results match the sampled results well. Smaller deviations are observed for small values N = 3 or N = 10, which was expected. In the limit N → ∞ the deviations are smoothed out. The limits R → 0 and R → ∞ yield R²-dependent functions, i.e., sphere functions. For R → ∞ the exponential vanishes and NA is negligible compared to R², while for R → 0 a Taylor expansion of the exponential gives f̄(R) ≈ R² (1 + Aα²/2).
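Result (22) can be checked directly by the same stochastic sampling: drawing yi ~ 𝒩(0, R²/N) as in (17), the sample mean of (1) should match R² + NA(1 − exp(−α²R²/(2N))). A numpy sketch (the parameter values are only an example):

```python
import numpy as np

def f_bar(R, N, A, alpha):
    """Average Rastrigin value (22) under the normal assumption (17)."""
    return R**2 + N * A * (1.0 - np.exp(-alpha**2 * R**2 / (2.0 * N)))

rng = np.random.default_rng(0)
N, A, alpha, R, trials = 100, 10.0, 2 * np.pi, 3.0, 20_000
# stochastic sampling of positions according to Eq. (17)
y = rng.normal(0.0, R / np.sqrt(N), size=(trials, N))
f_mc = np.mean(np.sum(y**2 + A * (1.0 - np.cos(alpha * y)), axis=1))
```

For these values the Monte Carlo mean agrees with (22) to well below one percent.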
Figure 3.
Average Rastrigin function evaluated for A = 10 as a function of R. The solid black lines show sampled results of Eq. (1). The overlaid dotted green lines show Eq. (22) for N = {3, 10, 100, 1000}, from bottom to top.
Figure 4.
Distribution of realized yi-values (i = 1, …, N over 500 trials) for the (100/100I, 200)-ES, N = 100, A = 1, at R ≈ 100 (left) and R ≈ 1 (right). The red solid curve shows the density of the normal distribution yi ~ 𝒩(0, R²/N).
4. Progress Rate
4.1. Derivation
The method introduced in Sec. 3 will now be applied to results for the progress rate on the Rastrigin function. The progress rate (denoted by φ) measures the expected positional change in search space between two generations g → g+1 as a function of fitness and ES parameters. A positive value corresponds to the ES approaching the optimizer and vice versa. The second order component-wise progress rate for the parental location y(g) is defined as [8]
| φ^II_i(σ, y) = E[ (yi(g))² − (yi(g+1))² | y(g), σ ] | (23) |
The second order refers to the square of yi-values, which ensures φ^II_i > 0 for |yi(g+1)| < |yi(g)| independent of the sign of yi. A second order model is needed for a correct modeling of convergence involving large mutation strengths. The expected values for the determination of (23) were already evaluated in [11] in the asymptotic limit N, µ, λ → ∞ (ϑ = μ/λ = const). The result yields
| (24) |
In (24) the asymptotic progress coefficient [11] is given by
| cϑ = exp(−[Φ⁻¹(1 − ϑ)]²/2) / (ϑ √(2π)) | (25) |
with Φ⁻¹(·) denoting the quantile function of the standard normal variate. The coefficient cϑ is related to the progress coefficient cμ/μ,λ ≃ cϑ (for µ, λ → ∞ with constant ϑ = μ/λ), see also [6, p. 249]. The quality gain variance at location y given σ was evaluated in [12] giving
| (26) |
The first term of (24) is usually referred to as the gain term, while the second term is the loss term characteristic for intermediate recombination. A distinct property of the Rastrigin function is that the gain term (yi -dependent) is not necessarily positive as it is the case for unimodal functions. This property will be discussed later.
The first step to obtain an R-dependent aggregation of expression (23) is to sum over all N components
| Σ_{i=1}^{N} φ^II_i = E[ Σ_{i=1}^{N} (yi(g))² − Σ_{i=1}^{N} (yi(g+1))² ] | (27) |
such that one can define the R-dependent progress rate
| φ^II_R(σ, R) = E[ R² − (R(g+1))² | R, σ ] | (28) |
Given the sphere function fsph(R) = R², one can relate (28) to the sphere quality gain Q = fsph(R(g)) − fsph(R(g+1)), such that the quality gain normalization [5, p. 173] is applicable. This yields the normalized R-dependent progress rate (labeled by the asterisk “*”)
| φ*_R = (N / (2R²)) φ^II_R | (29) |
For N → ∞ the normalized quality gain and the normalized progress rate coincide on the sphere, see [2, p. 16], yielding the normalized sphere progress rate. Expression (29) has two important properties. First, it is an aggregated progress rate measure over all N components, which is new for the Rastrigin function. Second, its relation to the sphere function enables direct comparison of progress rates.
A prerequisite for the further derivation will be the assumption of normally distributed yi ~ 𝒩(0, R²/N), see (17). This property is experimentally confirmed in Fig. 4, using the data of 500 trials shown in Fig. 1, displayed at two residual distances. Good agreement is observed between the expected density (red curve) and the histogram. Each component contributes roughly R²/N to the overall residual distance R². This concept of “equal contribution” is not new and was investigated in [7] for the quality gain on the ellipsoid. Slightly larger deviations occur at R ≈ 1 (right), where local attraction is more significant, see also the later discussion of Fig. 12. At small mutation strengths where local attraction occurs the assumption of course breaks down.
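The "equal contribution" property can also be illustrated without running an ES: sampling y uniformly on the sphere ∥y∥ = R (normalized Gaussian vectors), each coordinate has mean zero and variance exactly R²/N, matching (17). A small sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, R, trials = 100, 7.0, 50_000
# uniform sampling on the sphere ||y|| = R via normalized Gaussians
z = rng.standard_normal((trials, N))
y = R * z / np.linalg.norm(z, axis=1, keepdims=True)
coord_mean = y[:, 0].mean()   # close to 0
coord_var = y[:, 0].var()     # close to R^2 / N = 0.49
```

For large N the coordinate marginal is additionally close to the normal density shown in Fig. 4.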
Now φ^II_R is derived starting from (24). Performing the summation one gets
| (30) |
Similarly, the summation of the variance terms in (26) yields
| (31) |
Analogous to (19) and (20), the sums over the yi-dependent trigonometric terms of (30) and (31) will be replaced by their respective expectation values, assuming yi ~ 𝒩(0, R²/N) and neglecting fluctuations for N → ∞. The needed expected values are derived in Appendix (A.1), (A.2), and (A.3) giving
| E[ Σ_{i=1}^{N} cos(α yi) ] = N e^{−α²R²/(2N)} | (32) |
| E[ Σ_{i=1}^{N} yi sin(α yi) ] = α R² e^{−α²R²/(2N)} | (33) |
| E[ Σ_{i=1}^{N} sin²(α yi) ] = (N/2) (1 − e^{−2α²R²/N}) | (34) |
Furthermore, it is shown in Appendix A that the relative fluctuations vanish for N → ∞ for all three sums. Finally, a fully R-dependent expression can be given for the progress rate
| (35) |
Analogously, the R-dependent quality gain variance yields
| (36) |
Result (35) is important as it measures the progress on the Rastrigin function in R-space aggregating the individual progress rates φ^II_i. Note that the first term of (35), i.e., the gain term, is now strictly positive, which is in contrast to Eq. (24).
One-generation experiments are conducted in Fig. 5 by performing single optimization steps for given mutation strength and averaging the results of progress rates (23) and (28), respectively, over 10⁴ trials. Furthermore, the simulations are compared to the analytic expressions (24) and (35). To this end, two configurations (constant R = 7) are overlaid, one with y fixed and one with randomly sampled y-values satisfying ∥y∥ = R. The values are normalized using (29) and displayed using scale-invariant mutation strengths σ* of Eq. (2). All results are similar for moderate and large σ*-values showing good agreement. Differences emerge at small σ*. The fixed y = [0.7, …, 0.7] was chosen to lie within a local attractor. In this case (24) correctly predicts negative progress for small σ*, while (35) falsely assumes normally distributed yi-coordinates and predicts positive progress. This error vanishes for large σ*, i.e., when the ES is searching at larger scales. Therefore, one can conclude that the R-dependent progress rate (35) is a suitable aggregated measure of the component-wise progress rates (24), if sufficiently large mutations are applied. Indeed, real (successful) ES-runs, such as in Fig. 1 or later in Fig. 6, tend to maintain high σ*-levels, such that the normal assumption for yi stays valid. Local attraction (assuming small σ*) is investigated further in Sec. 6.
Figure 5.
One-generation experiments with 10⁴ repetitions for the (100/100I, 200)-ES, N = 100, A = 1 at constant R = 7. Black circles show the experimentally evaluated (23), summed over i, for constant y = [0.7, …, 0.7]. Blue crosses show (28), where y is randomly sampled for each trial such that ∥y∥ = R. The green dash-dotted line shows (24) (summed over i) and the red dashed line (35). All values are normalized using (29). The error bars are vanishingly small and not shown.
Figure 6.
Progress rate for the (100/100I, 200)-ES with N = 100 and A = 1. High progress rate values are shown in yellow, while blue values indicate small (negative) progress. The zero-progress boundary is shown in bold white. The black (left) curve displays the median dynamics of Fig. 1 (PS = 0.91), while the red (right) curve shows the same ES with a smaller learning parameter and PS = 0.99.
A few important remarks regarding results (35) and (36) are made now. Given expression (36), the variance can be written more compactly as
| (37) |
where the first term corresponds to the quality gain variance of the sphere function [6] and the second term is Rastrigin-specific. In the limit of vanishing exponential functions (R → ∞), see later in Sec. 5, the Rastrigin-specific term simplifies significantly, giving the so-called Rastrigin (maximum) noise strength
| σRas = A √(N/2) | (38) |
Having derived (35) and (36), the sphere progress rate can be recovered as a special case from (35) in multiple ways. The technical details are not shown since the calculations are simple and straightforward; only the main steps are explained now. As the normalized progress is constant on the sphere for constant scale-invariant mutations σ*, Eqs. (35) and (36) need to be rewritten as φ*R(σ*) and DQ(σ*) by setting σ = σ*R/N via (2). Furthermore, normalization (29) needs to be applied. One way to recover the sphere progress rate is by setting A = 0 or α = 0, which removes all Rastrigin-specific terms. Another way is applying the limit R → ∞, which suppresses the exponential terms. Additionally, the constant term NA²/2 is negligible in (36) for R → ∞, see also Appendix (B.5). The third way is the limit R → 0. After inserting σ = σ*R/N, all exponentials contain arguments that are functions of R². Performing a Taylor expansion of the exponentials for R → 0 with negligible higher order terms and simplifying, all three approaches yield
| (39) |
and for N → ∞ the well-known asymptotic formula
| φ*R = cϑ σ* − σ*²/(2μ) | (40) |
Both (39) and (40) are scale-invariant (R-independent) expressions. As a conclusion, the Rastrigin progress rate yields the sphere progress rate in the limits R → ∞ and R → 0. This result is important and was expected from (1), as the quadratic term Σᵢ yᵢ² is dominating at large scales. For yi → 0 the global attractor is essentially a quadratic function.
An important property of φ*R is that φ*R > 0 for sufficiently small σ*, while for too large σ*-values the progress rate becomes negative. The second (non-trivial) zero of (39), denoted by σ*₀, is derived in Appendix B by setting φ*R = 0 and yields in (B.8)
| (41) |
Due to the same global (quadratic) structure, result (41) will also be applicable to the Rastrigin function as an upper bound for σ*.
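For illustration, the asymptotic sphere quantities can be evaluated directly. The sketch below implements cϑ from (25) via the standard library inverse CDF, together with the asymptotic progress rate cϑσ* − σ*²/(2μ) of (40), whose second zero lies at σ* = 2μcϑ (function names are ours):

```python
import math
from statistics import NormalDist

def c_theta(theta):
    """Asymptotic progress coefficient (25) for truncation ratio theta."""
    t = NormalDist().inv_cdf(1.0 - theta)
    return math.exp(-0.5 * t * t) / (theta * math.sqrt(2.0 * math.pi))

def phi_star_sphere(s, mu, theta):
    """Asymptotic normalized sphere progress rate (40)."""
    return c_theta(theta) * s - s * s / (2.0 * mu)
```

For ϑ = 1/2 one obtains cϑ = 2/√(2π) ≈ 0.798, and φ*R vanishes at σ* = 2μcϑ.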
4.2. Progress Landscape
A more detailed analysis of the progress rate (35) is provided now. Given fitness parameters A, α, and N, the expression is essentially a function of only two variables. Therefore, the results will be displayed in a two-dimensional σ*-R-space denoted as the progress landscape. Note that for the sphere function, see Eqs. (39) and (40), the progress rate is constant for all R (given σ* and N).
Figure 6 shows an example progress landscape, evaluated for σ* up to a value slightly larger than the sphere zero σ*₀ of Eq. (41), as beyond it the progress rate gets negative, and for R ∈ [10⁻¹, 10²]. The R-range was chosen large enough to provide good visibility of the relevant characteristics. Thin black lines display regions of equal progress rate level. For R → ∞ and R → 0 the sphere limit is recovered (vertical lines of constant progress).
The medians of real runs (black and red curves) show a characteristic σ*-drop (also visible in Fig. 1), which is directly related to the progress rate zero. The ESs are moving around the progress dip in σ*-R-space. Interestingly, the σSA-ES with the larger learning parameter has a global convergence probability PS = 0.91, while the smaller learning parameter yields PS = 0.99. Maintaining a higher σ*-level therefore increases PS, which is associated with a smaller learning parameter τ. This effect can also be observed for the CSA-ES (cumulative step-size adaptation), where a higher PS is observed for smaller cumulation constant values (due to a slower change of σ). The downside of large mutation strengths is less efficiency in the sphere limits (the sphere-optimal value for the (100/100I, 200)-ES, see Fig. 6, is at σ* ≈ 19). The respective median R-dynamics reaches the stopping value R = 10⁻³ earlier for the larger τ, while g ≈ 1100 generations are needed for the smaller τ.
One observes that φ*R > 0 for sufficiently small σ*. This means that positive progress is expected at any R for arbitrarily small σ* > 0, which contradicts experimental observations, see also Fig. 5, as small σ* significantly increases the local convergence probability. Hence, local attraction is not modeled correctly by φ*R. Furthermore, the progress dip of Fig. 6 is not related to single local attractors. It is a cumulative effect of oscillations in all N dimensions related to the Rastrigin noise term (38). This is investigated in the next section.
5. Convergence and Population Sizing
In this section the convergence properties on the Rastrigin function are discussed. Global convergence (in expectation) requires φ*R > 0 for R ∈ (0, ∞). The boundary φ*R = 0, see Fig. 6, is therefore of most interest, especially the progress dip and its location in σ*-R-space. As shown in Appendix B, a closed-form solution can only be obtained under certain (simplified) assumptions. An analytical solution for R(σ*), such that φ*R(σ*, R) = 0, cannot be given due to the non-linearity of the underlying equations.
In the limit of R → ∞, all exponentials of (35) and (36) vanish. The resulting equation for φ*R = 0 simplifies significantly using (37), such that a fourth order polynomial is obtained in Eq. (B.9) as
| (42) |
Solving (42) for R yields the zero-progress line
| (43) |
which is visualized in Fig. 7 as a black dashed line. An important relation to the noisy sphere model can be made. In [3] the residual location error R∞ was derived for the (μ/μI, λ)-ES assuming a constant noise strength σϵ in the limit σ* → 0 as
| R∞ = √( σϵ N / (4 μ cμ/μ,λ) ) | (44) |
Figure 7.
Progress rate for (100/100I, 200)-ES with N = 100 and A = 3. The black dashed line shows Eq. (43). Two lines show Eq. (46) with δ = 5 (yellow, top) and δ = 1 (magenta, bottom), respectively. Crosses indicate the intersection points obtained by Eq. (49). Note that the progress dip at A = 3 is significantly larger compared to A = 1 from Fig. 6.
Applying the limit σ* → 0 to Eq. (43) and identifying the constant noise strength of the Rastrigin function (for sufficiently large R) as σϵ = σRas via (38) yields
| R∞ = √( σRas N / (4 μ cϑ) ) | (45) |
which corresponds to result (44) with cμ/μ,λ ≃ cϑ and σϵ = σRas.
Results (43) and (45) explain the σ*-decrease observed in Fig. 1 and Fig. 6, occurring for the ES approaching the Rastrigin noise floor. The red curve decreases σ* to have positive progress at all, while the black curve exhibits smaller σ*-values keeping a larger distance to the boundary. Thus, the latter realizes a larger local progress. This is the result of the faster adaptation of σ (due to the larger τ). As a result, one has smaller mutations which are in turn more prone to be trapped in a local attractor. This is reflected by a lower success probability PS. The smaller τ, however, yields a larger PS -value by keeping a higher σ*-level.
In the limit of R → 0, σ* increases again, as the ES reaches the global attractor optimizing a sphere function with constant σ* (same level as for R → ∞). Since there is no closed-form solution of the progress dip location, a different approach is needed to model the transition point. Recalling that the zero-progress line (43) was derived in the limit of vanishing exponential terms, a natural extension of this model is to parametrize the point at which the terms are vanishing. This can also be motivated by looking at Eq. (22) and Fig. 3, where the exponential term models the transition between the sphere limits. Hence, a transition relation Rtr(σ*) is introduced. It can be obtained by investigating the characteristic exponential term of (35), which also occurs in variance (36). Introducing an attenuation factor δ > 0 and setting σ = σ*Rtr/N, one can demand
| (46) |
It is assumed that δ is independent of the fitness and strategy parameters. Figure 7 shows the zero-progress line (43) and Rtr from (46) with two exemplary evaluations δ = 1 and δ = 5. One observes that Eq. (43) (black dashed line) follows the zero-progress boundary up until the dip minimum is reached. The dip location along the R-axis is well approximated by the constant noise limit (45) at σ* = 0. The Rtr-curves (magenta and yellow, respectively) follow a characteristic path depending on the chosen attenuation factor δ. The intersection point of both curves, namely
| (47) |
is parametrizing the dip location and will give insight into the population scaling μ(N, α, A). Setting Rtr = R∞, one obtains a fourth order polynomial in σ* as
| (48) |
The real non-negative solution of (48) yields after simplification the intersection point
| (49) |
which is visualized in Fig. 7. Convergence on the sphere requires
| (50) |
with the sphere-zero given in Eq. (41). The relation follows immediately for any A, α, δ > 0, and setting A = 0 or α = 0 recovers the sphere limit. Demanding condition (50) in (49), it must hold
| (51) |
Solving (51) for μ one arrives at the important population sizing result
| (52) |
Expression (52) relates the fitness-dependent parameters to the population size μ. For the subsequent experiments we will investigate the scaling properties of (52) without considering potential prefactors. To this end, repeated experiments of Algorithm 1 are performed and the success probability PS is measured. Then, the necessary population size μ is evaluated to achieve a high success rate of PS ≥ 0.99. The results of Figs. 8, 9, and 10 show good agreement with the parameter scaling predicted in Eq. (52). The μ(N)-scaling from experimental results is clearly sub-linear (as already observed in [9]) and indicates a scaling slightly larger than N^{1/2}. Some fluctuations can be observed, which is practically inevitable, as very large N and μ are tested, posing limits on the available CPU resources. Furthermore, certain deviations of the experiments from prediction (52) are expected to occur, as the underlying model is based on an expected value, see (23), without considering possible higher order moments of the yi-distribution causing fluctuations.
Figure 8.
Population sizing μ(N) using the σSA-ES with α = 2π. The top plot shows ϑ = 1/4 with A = 1, while the bottom plot shows ϑ = 1/2 with A = 10. The dotted cyan lines depict μ ∝ N^{1/2}, dash-dotted green lines μ ∝ N^{5/8}, and dashed red lines μ ∝ N. The number of evaluated trials, for increasing N, is 2000, 2000, 1000, 700, 500, 400 (top) and 3000, 3000, 1500, 1000, 700, 600 (bottom).
Figure 9.
Population sizing μ(A) using the σSA-ES with ϑ = 1/4, N = 100, and α = 2π. The dotted cyan lines show μ ∝ A. For each data point 2000 trials were evaluated.
Figure 10.
Population sizing μ(α) using the σSA-ES with ϑ = 1/4, N = 100, and A = 1. The dotted cyan lines show μ ∝ α². For each data point 2000 trials were evaluated.
6. Local Attraction
In this section the limitations of the R-dependent progress rate are discussed by investigating local attraction effects. In case of local convergence one has σ → 0 (equivalently σ* → 0) while R stagnates. In this case the local structure of the fitness landscape is dominating. Hence, the assumption of yi being normally distributed around the optimizer cannot hold. While the progress landscapes show positive progress for small σ*-values, this does not imply global convergence of real ES runs, see e.g. Fig. 12. It should be intuitively clear that for too small mutation strengths local convergence occurs. This issue was also observed in the one-generation experiments in Fig. 5, where negative progress rates are obtained at certain y, if local attraction is present. As the aggregated (R-dependent) formula (35) is not able to model local attraction, a different approach is needed based on the yi-dependent formula (24). The goal is to derive a σ-condition avoiding local attraction (in expectation). To this end, a characteristic “escape” mutation strength σesc is derived. It can serve as an additional stability criterion for the ES.
Starting with the component-wise progress rate of Eq. (24), the gain function G is defined as
| G(σ, yi) = yi + (Aα/2) sin(α yi) e^{−α²σ²/2} | (53) |
Requiring positive progress φ^II_i > 0, Eq. (24) yields
| (54) |
At this point the infinite population limit μ → ∞ is assumed in order to obtain closed-form solutions. As 1/μ → 0, it suffices to show that G > 0 for positive progress to hold. The function G is plotted in Fig. 11 as a function of σ and yi. One observes local attraction regions for small mutations located at y0 ≈ {1, 2, 3, 4, 5}. For small σ each of the attractors is a “stable” point, as y0 + ϵ (with ϵ > 0) yields positive gain (decreasing yi in expectation), while y0 − ϵ yields negative gain (increasing yi). For sufficiently large σ > σesc (black dotted line) positive progress can be ensured. The threshold σesc is derived now. Starting with (53), requiring G = 0 and assuming yi ≠ 0, G is refactored as
| G(σ, yi) = yi (1 + f̃(σ, yi)) | (55) |
with f̃ defined as
| f̃(σ, yi) = (Aα/2) (sin(α yi)/yi) e^{−α²σ²/2} | (56) |
yielding a first condition. The second condition (at σ = σesc) can be inferred from Fig. 11, which yields for (55)
| (57) |
Figure 11.
Gain function (53) visualized for A = 10 and α = 2π. The boundary G = 0 is shown in bold white, enclosing regions of negative progress. σesc from (64) is shown as a black dotted line. Only the first five local attractors are shown (out of 31).
As the exponential factor is positive and yi ≠ 0, G = 0 is equivalent to f̃ = −1. Therefore, one has
| (58) |
such that the following condition is obtained
| sin(α yi) = α yi cos(α yi) | (59) |
Inserting condition (59) into (56), it follows
| (Aα²/2) cos(α yi) e^{−α²σ²/2} = −1 | (60) |
Introducing the substitution x = αyi and applying sin x/cos x = tan x yields
| tan x = x | (61) |
The first non-trivial solution of (61) is the most interesting, as it corresponds to G = 0 of the first local attractor at yi ≈ 0.75, see Fig. 11. Furthermore, negative gain contributions are due to the sine term in (53). For small |yi| the ratio |sin(α yi)/yi| in (56) attains its largest values, such that the first local attractor corresponds to the worst case requiring the largest σ to obtain G = 0. Numerically solving (61) yields the zero
| x₀ ≈ 4.4934 | (62) |
Multiplying (56) by α, identifying x0 = αyi and σ = σesc (point of vanishing gain) results in
| (Aα²/2) (sin x₀ / x₀) e^{−α²σesc²/2} = −1 | (63) |
Resolving (63) for σesc yields the final result
| σesc = (1/α) √( 2 ln( Aα² |sin x₀| / (2 x₀) ) ) | (64) |
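The numbers in (62) and (64) can be reproduced with a few lines (standard library only; the bisection bracket is chosen on the branch (π/2, 3π/2), where tan x − x is monotonically increasing):

```python
import math

def first_root_tan_x_eq_x():
    """First non-trivial solution of tan(x) = x, cf. (61) and (62)."""
    lo, hi = 0.5 * math.pi + 1e-9, 1.5 * math.pi - 1e-9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.tan(mid) - mid >= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def sigma_esc(A, alpha):
    """Escape mutation strength, evaluating (64) at x0 from (62)."""
    x0 = first_root_tan_x_eq_x()
    return math.sqrt(2.0 * math.log(A * alpha**2 * abs(math.sin(x0)) / (2.0 * x0))) / alpha
```

For A = 10 and α = 2π this evaluates to σesc ≈ 0.436, the value used in Fig. 12.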
Figure 12.
Median dynamics of unsuccessful runs (out of 100 trials) for the (400/400I, 800)-σSA-ES, N = 100, A = 10, with σesc ≈ 0.436 (red dashed line). The learning parameter τ was set to three different values: blue (left, PS = 0.01), black (center, PS = 0.08), and magenta (right, PS = 0.29).
Figure 12 shows experiments of the (400/400I, 800)-σSA-ES with α = 2π and relatively large A = 10, such that one has σesc ≈ 0.436 from result (64). A constant σ translates to R(σ*) = σN/σ*, see normalization (2), showing a 1/σ* characteristic (red dashed line) in the progress landscape. The median of real unsuccessful runs is shown for different learning parameters τ. The sharp decrease σ* → 0 indicates local convergence, which agrees well with the σesc-line. However, dropping below the threshold σ < σesc does not imply that local convergence must occur (see success rate PS > 0 for all τ). Conversely, it is a stability criterion: positive component-wise progress is maintained in expectation, if σ > σesc is kept. Of course, this can only hold up to the global attractor, at which σ → 0 must be ensured to have convergence.
Figure 13 shows numerically evaluated progress dip locations, see e.g. the dip at σ* ≈ 25 and R ≈ 2 in Fig. 12, for increasing μ-values while keeping ϑ and the fitness parameters constant. It shows how increasing μ shifts the dip location to larger σ*-values and smaller residual distances R. Using larger populations enables the ES to operate at larger mutation strengths and approach the optimizer more closely resulting in higher global convergence probabilities. The σesc-line (red dashed) remains constant as it was derived for μ → ∞. A progress dip located below the σesc-line is critical, as both noise floor and local attraction effects overlap yielding effectively zero success rates.
Figure 13.
Numerically evaluated progress dip locations (black dots) for ϑ = 1/2, N = 100, and A = 10 with increasing values μ = {10, 50, 100, 200, …, 1000} from left to right. The red dashed curve shows σesc (displayed as R = σescN/σ*).
The results obtained from Fig. 12 suggest a synthetic explicit σ-control rule that illustrates the meaning of σesc. This rule uses a constant mutation strength σ for a sufficiently large number of generations until the global attractor is reached, and then decreases σ → 0. It is realized by defining a σ(g)-schedule that is constant for the first g < 9000 generations. For 9000 ≤ g ≤ 10⁴ it is decreased multiplicatively as σ(g+1) = cσ(g) (0 < c < 1), such that the stopping criterion σ < 10⁻⁶ is reached at the last generation. The corresponding experiments are shown in Fig. 14. The single-trial dynamics show that only the run at σesc converges globally (repeated experiments are shown in Fig. 15). ES-runs with σ < σesc tend to converge locally at large R due to the ES getting stuck in the local minima landscape. ES-runs operating at σ > σesc are less prone to local attraction, but they reach the Rastrigin noise floor at moderately large R (see the intersection of the red and white lines in Fig. 12).
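The synthetic σ(g)-schedule can be sketched as follows; the function name and the choice of deriving the decay factor c from the endpoint condition σ(10⁴) = 10⁻⁶ are illustrative assumptions, not taken from the paper:

```python
def sigma_schedule(g, sigma0, g_const=9000, g_end=10_000, sigma_end=1e-6):
    """Synthetic sigma-control rule: constant for g <= g_const, then
    multiplicative decay sigma(g+1) = c * sigma(g) reaching sigma_end at g_end."""
    if g <= g_const:
        return sigma0
    # endpoint condition sigma0 * c**(g_end - g_const) = sigma_end fixes c
    c = (sigma_end / sigma0) ** (1.0 / (g_end - g_const))
    return sigma0 * c ** (g - g_const)

# e.g. a run at the escape mutation strength sigma_esc ~ 0.436
print(sigma_schedule(5000, 0.436))    # constant phase: 0.436
print(sigma_schedule(10_000, 0.436))  # ≈ 1e-6 (stopping criterion reached)
```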
Figure 14.
Single runs using constant σ for the (400/400I, 800)-ES, N = 100, and A = 10. The schedule σ(g) is constant during the first 9000 generations and decays exponentially within the last 1000 generations. One has σ = {0.1, 0.2, 0.3, 0.4, σesc, 0.5, 0.6}, from top to bottom (see ordering at g = 2000).
Figure 15.
The success probability PS is evaluated using parameters of Fig. 14 and 500 repetitions for each σ. The peak occurs around σ ≈ σesc = 0.436.
In Fig. 15 different σ-values are tested and the success rate PS is evaluated over 500 repetitions. One observes that PS is maximized at σ ≈ σesc. As expected, values σ < σesc are less successful due to local attraction. For σ > σesc local attraction is avoided, but the ES fluctuates at a larger residual distance before σ → 0, such that it is more likely to miss the global attractor.
7. Conclusions and Outlook
In this paper, results from progress rate theory were applied and extended to investigate the convergence properties on the Rastrigin function. An aggregated residual-distance-dependent progress rate was obtained by assuming normally distributed yi-locations around the optimizer. The progress rate yields useful insights into the search behavior of the ES, which can be illustrated by recalling Fig. 1. Far away from local attraction the ES optimizes the sphere, keeping a constant scale-invariant mutation strength. Approaching the local attractor landscape leads to a significant reduction of the (normalized) mutation strength compared to the initial level. As the mutation strength σ decreases together with σ*, it may fall below the σesc-threshold (see Fig. 12). For σ ≫ σesc (at σ*-levels comparable to sphere-optimal values) the ES performs a global search and is not significantly influenced by single local attractors. For σ ⪅ σesc the search can be regarded as rather local and individual attractors gain importance, such that local convergence occurs with higher probability. Within the global attractor the sphere function is optimized again. Considering the ES performance, a two-fold positive effect of large populations on the success rate can be identified. First, large μ-values decrease the expected residual distance (45) to the global optimizer (similar to optimizing the sphere under constant noise). Second, intermediate recombination reduces the magnitude of the loss term −σ*²/(2μ) in (39). Large μ and recombination therefore allow the ES to operate at larger σ-levels, keeping σ > σesc and enabling a global search.
Furthermore, the progress rate analysis enabled the derivation of the population scaling result in (52), which was verified experimentally. The result can serve to some extent as guidance for the investigation of other highly multimodal test functions, provided that a global (spherical) structure with local perturbations exists.
There are multiple issues requiring further research. While it is now clear why large populations and mutation strengths are beneficial when optimizing the Rastrigin function, a detailed analysis of the full σSA-ES or CSA-ES including the step-size adaptation is still pending. Additionally, the ES-efficiency in terms of fitness evaluations as a function of population size, truncation ratio, and learning parameter has not yet been investigated. As the population size is a crucial parameter, the idea of using dynamic population control methods seems natural, see e.g. [4]. So far, the theoretical analysis of population size control strategies is an uncharted research field. Furthermore, a probabilistic model would be useful to predict the success rate PS as a function of fitness and ES parameters. Whether the obtained results can be transferred to other multimodal functions also remains a subject of future research.
Acknowledgments
This work was supported by the Austrian Science Fund (FWF) under grant P33702-N. The authors thank Lisa Schönenberger for providing valuable feedback.
Contributor Information
Amir Omeradzic, Email: amir.omeradzic@fhv.at.
Hans-Georg Beyer, Email: hans-georg.beyer@fhv.at.
References
- [1] Abramowitz M, Stegun IA. Pocketbook of Mathematical Functions. Verlag Harri Deutsch, Thun, 1984.
- [2] Arnold DV. Noisy Optimization with Evolution Strategies. Kluwer Academic Publishers, Dordrecht, 2002.
- [3] Arnold DV, Beyer H-G. Performance Analysis of Evolution Strategies with Multi-Recombination in High-Dimensional ℝN-Search Spaces Disturbed by Noise. Theoretical Computer Science 289 (2002), 629–647.
- [4] Auger A, Hansen N. A Restart CMA Evolution Strategy with Increasing Population Size. In: Congress on Evolutionary Computation, CEC'05, 2005, 1769–1776.
- [5] Beyer H-G. Toward a Theory of Evolution Strategies: Some Asymptotical Results from the (1,+λ)-Theory. Evolutionary Computation 1(2) (1993), 165–188.
- [6] Beyer H-G. The Theory of Evolution Strategies. Springer, Heidelberg, 2001.
- [7] Beyer H-G, Arnold DV, Meyer-Nieberg S. A New Approach for Predicting the Final Outcome of Evolution Strategy Optimization under Noise. Genetic Programming and Evolvable Machines 6(1) (2005), 7–24.
- [8] Beyer H-G, Melkozerov A. The Dynamics of Self-Adaptive Multi-Recombinant Evolution Strategies on the General Ellipsoid Model. IEEE Transactions on Evolutionary Computation 18(5) (2014), 764–778. doi:10.1109/TEVC.2013.2283968.
- [9] Hansen N, Kern S. Evaluating the CMA Evolution Strategy on Multimodal Test Functions. In: Yao X, et al. (Eds.), Parallel Problem Solving from Nature, Vol. 8. Springer, Berlin, 2004, 282–291.
- [10] Meyer-Nieberg S. Self-Adaptation in Evolution Strategies. Dissertation, Universität Dortmund, Dortmund, Germany, 2007.
- [11] Omeradzic A, Beyer H-G. Progress Analysis of a Multi-Recombinative Evolution Strategy on the Highly Multimodal Rastrigin Function. Report, Vorarlberg University of Applied Sciences, 2022. https://opus.fhv.at/frontdoor/index/index/docId/4722
- [12] Omeradzic A, Beyer H-G. Progress Rate Analysis of Evolution Strategies on the Rastrigin Function: First Results. In: Rudolph G, Kononova AV, Aguirre H, Kerschke P, Ochoa G, Tušar T (Eds.), Parallel Problem Solving from Nature – PPSN XVII. Springer International Publishing, 2022, 499–511.