iScience. 2025 Mar 11;28(4):112203. doi: 10.1016/j.isci.2025.112203

Emergence of natural and robust bipedal walking by learning from biologically plausible objectives

Pierre Schumacher 1,2,8,, Thomas Geijtenbeek 3, Vittorio Caggiano 4, Vikash Kumar 4, Syn Schmitt 5, Georg Martius 1,6, Daniel FB Haeufle 2,7
PMCID: PMC12002607  PMID: 40241757

Summary

Humans show unparalleled ability when maneuvering diverse terrains. While reinforcement learning (RL) has shown great promise for musculoskeletal simulation in the development of robust controllers, complex behaviors are only achievable under extensive use of motion data. We demonstrate that the combination of a recent RL algorithm with a biologically plausible reward is capable of learning controllers for 4 different musculoskeletal models and achieves locomotion with up to 90 muscles without demonstrations. Our controllers generalize to diverse and unseen terrains, while only a single adaptive objective function is needed for training. We validate our findings on four models in two different simulators. The RL agents perform robustly with complex 3D models, where reflex-controllers are difficult to apply, and produce close-to-natural motion. This is a first step for the motor control, biomechanics, and rehabilitation communities to generate complex human movements with RL, without using motion data or simple unrepresentative models.

Subject areas: Behavioral neuroscience, Biological sciences, Neuroscience

Graphical abstract


Highlights

  • Biomechanical human models learn to walk close-to-naturally with RL and energy costs

  • The learned controllers are robust under unseen conditions

  • Our results hold for walking and running for 4 different models, with up to 90 muscles


Behavioral neuroscience; Biological sciences; Neuroscience

Introduction

Humans excel at robust bipedal walking in complex natural environments by adequately adapting each movement and coordinating their muscles to compensate for uncertainties in ground conditions. In each step, they adequately tune the interaction of biomechanical muscle dynamics and neuronal signals to be robust against uncertainties in ground conditions.1 However, it is still not fully understood how the nervous system resolves the musculoskeletal redundancy to solve the multi-objective control problem considering stability, robustness, and energy efficiency. In computer simulations, energy minimization has been shown to be a successful optimization target, reproducing natural walking with trajectory optimization2 or reflex-based control methods.3,4,5,6 However, these methods focus on particular motions at a time and the resulting controllers are limited when compensating for perturbations.6,7,8,9 Trajectory optimization approaches using direct collocation also use energy as a cost term, however, they do not produce a feedback control strategy and are therefore limited in studying (neuronal) responses to an unexpected perturbation.10,11 In robotics, reinforcement learning (RL) methods recently achieved highly stable (and efficient) locomotion on quadruped systems12,13 and outperform optimal control approaches,14 but the generation of human-like walking with bipedal biomechanical models often relies on multi-camera marker-based motion capture data that is imitated by the controller.15,16,17 Some approaches are able to learn by imitation from a relatively small number of pre-recorded motions (<10)16,17,18 benefiting from faster convergence of RL algorithms with imitation components. 
In some cases, however, these controllers can be less robust.19 Additionally, the extension to behaviors unseen in the data is difficult with direct imitation learning and requires additional algorithmic complexity and large high-quality datasets.20,21 In a recent study,22 behaviors were learned for a high-dimensional musculoskeletal model by imitating only 25 min of motion data, but it was not probed whether the generated motions and muscle activities represent human experimental data well. One study23 learned human-like motions with RL without explicitly tracking experimental trajectories, but required a complex curriculum schedule that might not transfer to other models.

Achieving natural locomotion with RL without sacrificing its robustness and generalization capability and while using realistic muscle activity patterns might pave the way for approaches that study human walking in complex natural environments.24

In this study, we demonstrate that biologically plausible objectives in combination with recent RL methods can directly (without learning by imitation) lead to the emergence of close-to-natural bipedal locomotion behaviors that can generalize to diverse terrains. In our work, we do not try to achieve human-like behavior with RL by attempting to follow recorded kinematic data25 but only by optimizing biologically plausible objectives in combination with realistic biomechanical constraints embedded into simulation engines. While we still use experimental data in this framework, the data are strongly aggregated into average gait cycles, and no individual trajectories are tracked. The use of biologically plausible objectives has a long history in motor control research.26 In contrast to direct trajectory tracking, metrics such as muscular effort, pain, and others are potentially more closely aligned with objectives that humans optimize during movement generation.27 The combination of a biomechanical system with these objectives under an RL paradigm has the potential to be general enough to allow for the reproduction of natural gait, similar to the achievements of reflex-based control,4,28,29 but with the potential for generating diverse and robust behaviors under many different conditions. Indeed, recent works30 have shown that RL algorithms may provide more robust controllers than model-predictive control approaches and might be able to handle the complexity of diverse movement generation with a high-dimensional physical embodiment.20,22

From a technical point-of-view, general reward terms have the benefit of being applicable to a broader range of behaviors without the need to collect specific trajectories. They might also enhance the controller’s reaction to perturbations, as objectives such as effort and pain minimization stay relevant even when deviating from the original motion. A simple trajectory tracking approach would be unable to compute rewards when the controller faces unseen situations.

The purpose of this study was to investigate whether close-to-natural walking behaviors can be achieved with model-free RL using sufficiently good exploration31 combined with biologically plausible objectives (muscle effort, joint torques) and biomechanical human models. We investigated the quality of the produced gaits as well as the robustness of the controllers by comparing quantities derived from hip, knee, and ankle joint angle trajectories as well as the GRFs of the feet. Experimental human data are used to modify the relative importance of the different metrics in the controller optimization, similar to previous approaches.32,33,34 This goes beyond previous works applying RL to biomechanical models that either study low-dimensional systems,23 make use of human data17 during training, or learn unrealistic movements.31,35 The results are evaluated in four different models and two simulation engines of differing biomechanical complexity and accuracy using an identical training protocol and without changing the reward function.

Results

RL agents were trained on four musculoskeletal models in 2D and 3D in order to achieve controllers that produce close-to-human motion and are robust against unseen variations. It was found that the obtained controllers performed better than comparable reflex-based baselines and were easily applied to highly complex models where no previous controllers were available. The highest value of the used gait quality metric was achieved by the RL controller for the H1622 model, which has 16 DOF and 22 muscles. The method achieved comparable results during walking with a speed of 1.2 m/s and running with a speed of 5.2 m/s.

Reward function

Building on previous work on gait optimization,5 we found that a natural gait can be achieved with RL by using objectives that incentivize:

  • 1. learning to maintain a given speed without falling down,

  • 2. minimizing effort, and

  • 3. minimizing pain (we do not use the term “pain” in the medical sense, but interpret it as an indication of mechanical loads potentially leading to injury).

Thus, our reward function contains three main terms:

r = rvel − ceffort − cpain,   (Equation 1)

The first term specifies the external task the agent should solve. As we want the agent to move at a walking pace while keeping its balance, we chose the following objective:

rvel = { exp[−(v − vtarget)²]   if v < vtarget
       { 1                      otherwise,   (Equation 2)

where v is the center-of-mass velocity and the target velocity vtarget is chosen to be 1.2 m/s, which is close to the average energetically optimal human walking speed.36 The velocity reward is constant above the target velocity to improve the optimization of the auxiliary cost terms, inspired by a recent study on reward shaping in robotics.37 While there is no reward gradient for the target speed above the threshold, we observed in our experiments that the strong effort costs used in training prevent the policy from reaching higher speeds, as doing so would require higher muscle activity.
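As a minimal sketch, the velocity reward of Equation 2 can be written as follows; the 1.2 m/s target follows the text, while the function name is ours:

```python
import numpy as np

V_TARGET = 1.2  # m/s, near the average energetically optimal human walking speed

def velocity_reward(v_com: float, v_target: float = V_TARGET) -> float:
    """Equation 2: Gaussian-shaped reward below the target speed,
    constant (1.0) at or above it."""
    if v_com < v_target:
        return float(np.exp(-(v_com - v_target) ** 2))
    return 1.0
```

The flat region above `v_target` leaves no gradient toward higher speeds; as the text notes, the effort cost is what keeps the policy from overshooting.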

Important for achieving natural human walking is the use of minimal muscle effort, as the literature suggests that energy efficiency is a key component of human locomotion:38,39

ceffort = αt·⟨a³⟩ + w1·‖u − uprev‖² + w2·nactive,   (Equation 3)

where the first term penalizes cubic muscle activity a³,40 the second term incentivizes smoothness of the muscle excitations u, and the third term nactive incentivizes a small number of active muscles (penalizing activity exceeding a certain value). The last term was included as experimental human data and prior work on reflex-based controllers have shown that the activity of muscle groups is small during the swing phase of steady-state gait and that co-contraction levels of the leg muscles are low during the whole gait cycle.28 While an optimal solution to Equation 3 should use sparse muscle activity even without the last term, we found that it was necessary to add it in practice.
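A sketch of the effort cost in Equation 3, assuming mean cubic activity for the first term; the weights `w1`, `w2` and the threshold defining an "active" muscle are illustrative placeholders, not the paper's values:

```python
import numpy as np

def effort_cost(a, u, u_prev, alpha_t, w1=0.1, w2=0.01, act_thresh=0.1):
    """Equation 3 sketch. a = muscle activations, u = excitations.
    w1, w2 and act_thresh are placeholder values."""
    cubic_activity = alpha_t * np.mean(a ** 3)       # adaptive activity penalty
    smoothness = w1 * np.sum((u - u_prev) ** 2)      # excitation smoothness
    n_active_pen = w2 * np.sum(a > act_thresh)       # number of active muscles
    return cubic_activity + smoothness + n_active_pen
```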

From a technical standpoint, it proved challenging to effectively minimize muscle activity. Using a strong cost scale causes a performance collapse in early training, even if that value would lead to energy-efficient walking in late training. We, therefore, chose an approach rooted in constrained optimization.41 We conjecture that a large initial action cost incentivizes near zero actions before any increase in the other reward terms is encountered, therefore preventing learning.

We propose an adaptation mechanism for the weighting parameter α(t), increasing the weight only when the agent performs well in the main task (rvel) and decreasing it when this constraint is violated. Concretely, we measure the performance by the task return; the details are provided in Algorithm 1.

Algorithm 1. Effort weight adaptation.

Require: threshold θ, smoothing β, change in adaptation rate Δα, decay term λ ∈ [0, 1]

 rmean ← 0, αt ← 0, smean ← 0

 while True do

  r ← train_episode()

  rmean ← β·rmean + (1 − β)·r

  if rmean > θ and smean < 0.5 then

   Δα ← λ·Δα

  else if rmean > θ and smean > 0.5 then

   αt+1 ← αt + Δα

  else

   αt+1 ← αt − Δα

  end if

  starget ← 1 if rmean > θ, else 0

  smean ← β·smean + (1 − β)·starget

 end while
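The adaptation loop of Algorithm 1 can be sketched as a small stateful class; all hyperparameter values below are illustrative, not the paper's settings, and clamping α at zero is our addition:

```python
class EffortWeightAdapter:
    """Sketch of Algorithm 1 (effort weight adaptation).
    Hyperparameter defaults are illustrative placeholders."""

    def __init__(self, theta=100.0, beta=0.99, delta_alpha=1e-4, lam=0.999):
        self.theta = theta              # task-return threshold
        self.beta = beta                # exponential smoothing factor
        self.delta_alpha = delta_alpha  # adaptation rate
        self.lam = lam                  # decay of the adaptation rate
        self.r_mean = 0.0               # smoothed task return
        self.s_mean = 0.0               # smoothed success indicator
        self.alpha = 0.0                # effort weight alpha_t

    def update(self, episode_return: float) -> float:
        self.r_mean = self.beta * self.r_mean + (1 - self.beta) * episode_return
        if self.r_mean > self.theta and self.s_mean < 0.5:
            # performance just crossed the threshold: slow down adaptation
            self.delta_alpha *= self.lam
        elif self.r_mean > self.theta and self.s_mean > 0.5:
            # task consistently solved: increase the effort weight
            self.alpha += self.delta_alpha
        else:
            # constraint violated: back off (clamped at zero, our addition)
            self.alpha = max(0.0, self.alpha - self.delta_alpha)
        s_target = 1.0 if self.r_mean > self.theta else 0.0
        self.s_mean = self.beta * self.s_mean + (1 - self.beta) * s_target
        return self.alpha
```

Called once per training episode, the effort weight ramps up only while the smoothed task return stays above the threshold.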

This adaptive learning mechanism is agnostic to the model and the task and removes the need for hand-tuning of schedules. A change in the reward function over time could, however, destabilize learning, as previously collected environment transitions no longer reflect the current effort cost.42 We, therefore, monitor the performance of the policy in the current environment, while the effort cost is applied only at the moment data are sampled from the replay buffer. This relabeling of previously collected data ensures that our off-policy algorithm can make efficient use of the full replay buffer.
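The relabeling idea can be sketched as follows: the replay buffer stores the task reward without the effort cost, and the current weight is applied only when a batch is sampled. Field names and the effort function are illustrative, not the paper's implementation:

```python
import numpy as np

def relabel_batch(batch, alpha_t, effort_fn):
    """Recompute rewards for a sampled replay batch with the *current*
    effort weight, so old transitions stay consistent with the present
    cost scale. Dict keys are illustrative placeholders."""
    r_task = batch["task_reward"]  # stored without effort cost
    c_eff = np.array([effort_fn(a) for a in batch["activations"]])
    batch["reward"] = r_task - alpha_t * c_eff
    return batch
```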

The third term cpain is necessary to prevent unnatural optima. One striking example is the over-use of mechanical forces of the joint limits (e.g., massive knee over-extension) to keep a straight leg while minimizing muscle activity. As this is clearly unnatural behavior, we include objectives that account for the notion of pain:

cpain = w3·Σi τi^lim + w4·Σj Fj^GRF,   (Equation 4)

where τi^lim is the torque with which the joint-angle limit of joint i is violated (joint-limit pain) and Fj^GRF is the vertical ground reaction force (GRF) for foot j (joint-loading pain). We only penalize GRFs if they exceed 1.2 times the model’s body weight43,44 such that all pain cost terms vanish close to the natural gait and do not further bias the solution. Joint-limit violations in the simulation engines are modeled by forces that push back against the violation, representing the resistance of soft tissue in humans.45
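A sketch of the pain cost in Equation 4; the weights `w3` and `w4` are placeholders, while the 1.2× body-weight threshold for the GRF term follows the text:

```python
import numpy as np

def pain_cost(joint_limit_torques, grfs, body_weight,
              w3=1.0, w4=1.0, grf_factor=1.2):
    """Equation 4 sketch. GRFs are penalized only beyond
    grf_factor * body_weight; w3, w4 are placeholder weights."""
    limit_pain = w3 * np.sum(np.abs(joint_limit_torques))       # joint-limit pain
    grf_excess = np.maximum(np.asarray(grfs) - grf_factor * body_weight, 0.0)
    loading_pain = w4 * np.sum(grf_excess)                      # joint-loading pain
    return limit_pain + loading_pain
```

Because both terms are zero near a natural gait (no limit violations, moderate GRFs), the cost only shapes the solution away from unnatural optima.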

We emphasize that the used cost weights are constant across all models and all joints and muscles. The weights wi for i ∈ {1, …, 4} were found by first aligning the generated and experimental trajectories into gait cycles for each leg, starting and ending when the respective foot touches the ground. The result is then averaged over all gait cycles recorded from both legs. After this procedure, the data generated by the learned controllers are compared to their equivalent obtained from experimental human data.

The experimental match, defined as the fraction of the gait cycle for which the average simulated trajectory overlaps within the standard deviation of experimental data, serves as an optimization metric for our cost terms. The joint angles for the hips, knees and ankles of both legs are used to define this fraction. We note that the coefficients are identical across all joints and muscles, and stress that no human data were used during the learning process, but only to find weighting coefficients. This procedure is similar to Berret et al.32 with the difference that we search for one set of values that works across a range of models, instead of optimally for only one model.
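The experimental-match metric can be sketched as the fraction of phase-normalized samples where the simulated mean trajectory lies within one standard deviation of the experimental mean (the paper averages this over the hip, knee, and ankle of both legs; a single trajectory suffices here):

```python
import numpy as np

def experimental_match(sim_mean, exp_mean, exp_std):
    """Fraction of the (time-normalized) gait cycle for which the
    simulated average trajectory lies within one SD of the experimental
    mean. All inputs share the same gait-phase grid."""
    inside = np.abs(sim_mean - exp_mean) <= exp_std
    return float(np.mean(inside))
```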

Finally, we initialize the models with a randomized initial state that either starts with an elevated left or right leg and also contains a small amount of random noise, to mitigate asymmetries caused by the initial state distributions. We also clip all muscle excitations to lie between 0 and 0.5 at each time step, to further reduce muscle effort.

Models

With the reward function and the RL approach described above, we are able to learn robust control policies for several models of human walking, with varying complexity, and across two different simulation engines with different levels of biomechanical accuracy (see Figure 1):

Figure 1.


We achieve robust and energy-efficient natural walking with RL on a series of human models

The different models differ in number of degrees of freedom (DOF), muscles, geometry, and simulation engine. We also use an uneven terrain environment. Videos: https://sites.google.com/view/naturalwalkingrl.

H0918 A planar Hyfydy model with 9 degrees of freedom (DOF) and 18 muscles, based on Delp et al.46

H1622 A 3D Hyfydy model with 16 DOFs and 22 muscles, based on Delp et al.46

H2190 A 3D Hyfydy model with 21 DOFs and 90 muscles, and articulation between the otherwise rigid pelvis and torso, based on previous models.46,47,48

MyoLegV0 A 3D MuJoCo model with 20 DOFs and 80 muscles, based on Rajagopal et al.47 As for the H0918 and H1622 models, the pelvis and torso are one rigid body part. Additionally, each foot in the MyoLegV0 contains articulated toes (all five toes are joined into one body segment). See Figure 1 for a summary of the models.

Simulation engine

The simulation engines used for each model are indicated in the description and are either: (1) Hyfydy,49 which was used via the SCONE Python API44 or (2) MuJoCo,50 which was used via the MyoSuite51 environment. We chose these two engines, to highlight the versatility of our approach but also to bridge two communities: biomechanics and RL.

Hyfydy is an engine built for biomechanical accuracy. It is closely related to the well-established OpenSim52 framework, matching its level of detail in muscle and deformation-based contact-force models while providing increased computational performance. MuJoCo is a fast simulation framework widely used in the robotics and RL community. It offers a simplified muscle model with rigid tendons and resolves contact forces using the convex Gauss Principle. The MyoSuite51 builds on this framework, allowing for the development of high-dimensional muscle-control models which have recently gained a lot of interest from the RL community.53,54,55 Both engines achieve the required computational speed to train control policies for these high-dimensional models in under a day.

Learned behaviors

We first show that with our framework, we can train agents across 4 different models to produce walking gaits with the same training approach and reward function. In Figure 2 we compare the resulting gait kinematics against experimental data.56 Kinematics are shown for 5 rollouts of the most human-like policy checkpoint that was achieved over the entire training run over 10 random seeds, averaged over all gait cycles of both legs in a 10 s rollout (For videos, see: https://sites.google.com/view/naturalwalkingrl). The gait data for locomotion contains participants walking at 1.2 m/s and participants running at 5.2 m/s.

Figure 2.


Gait-kinematics for RL agents for all models during walking and running

Shown are the hip, knee, ankle and GRF values averaged over 5 rollouts of 10 s walking and running on flat ground. We excluded rollouts that did not achieve the whole episode length to clearly highlight the achieved kinematics. For walking, we observe slight discrepancies between experimental data (gray) and the RL behaviors (red), which are bigger for high-dimensional models.

(A) The experimental data show human subjects walking at 1.2 m/s and are included in SCONE.44

(B) The experimental data show human subjects running at 5 m/s and were extracted from Hamner et al.57 For running, the behaviors deviate more from experimental data overall; a weak knee flexion can be observed. Nevertheless, the characteristic single-peaked GRF curve is present for all models, except for the MyoLegV0. The proposed reward function provides a strong and flexible starting point for researchers aiming to create robust and natural controllers for high-dimensional musculoskeletal systems. Also, see the videos on the website. In all simulations, the values (red) are averaged over both legs and 5 rollouts of 10 s are recorded. The experimental data (shaded) are shown as maximum and minimum values over all subjects.

The results for the planar H0918 and the 3D H1622 model look very similar to the experimental data. While the agents achieve the most human-like gaits here, the models are also of limited complexity and applicability, compared to the high-dimensional systems, H2190 and MyoLegV0. As seen in Table 1 and Figure 2 our approach still achieves periodic gaits resembling human kinematics with the difficult-to-control 80 and 90 muscle models, even though they contain more artifacts. The H2190-agent exhibits less knee flexion and the MyoLegV0-agent lacks the double-peaked GRF structure; it produces periodic torso oscillations, see Figure 3. Overall, the behavior of the H2190 model appears more natural than the one produced with the MyoLegV0 model, see also the discussion and the supplementary videos. Both models produce agents with ankle profiles deviating slightly from experimental data.

Table 1.

Average cubic muscle activity (effort), the percentage match with human experimental data (exp. match), and the average distance walked on the uneven terrain (Figure 1)

controller  system   avg. effort      exp. match    avg. distance [m]
reflex      H0918    0.041 ± 3×10⁻³   0.68 ± 0.08   2.46 ± 0.98
RL          H0918    0.013 ± 3×10⁻⁴   0.67 ± 0.03   10.42 ± 0.94
RL          H1622    0.015 ± 2×10⁻⁴   0.73 ± 0.01   5.6 ± 0.99
RL          H2190    0.017 ± 2×10⁻⁵   0.50 ± 0.01   10.59 ± 2.51
RL          MyoLeg   0.013 ± 2×10⁻⁴   0.43 ± 0.05   n.a.

Note that the exp. match metric measures the percentage of the gait cycle during which the trajectory perfectly lies inside the standard deviation of the experimental data. Even relatively natural gaits can still achieve a low metric if the angles are slightly shifted. Reported are mean ± SD over 5 rollouts of 10 s with the best policy.

Figure 3.


Muscle activity and torso oscillations for the RL agents

We compare muscle activities for two controller types for natural walking with the H0918 model.

(A) The activity for the RL agent has been clipped to 0.5. We use 5 roll-outs of the most natural RL policy and 5 reflex-based controllers that were optimized until convergence. The initial state for the RL agent is randomized, which would cause collapse with the reflex controllers, as they are sensitive to the initial state.

(B) and (C) We show the torso angle with the vertical axis for 5 rollouts of 10 s for the H2190 and the MyoLegV0 models for walking and for running. The MyoLegV0 presents stronger lateral oscillations. The dashed line shows a straight torso posture.

Nevertheless, Table 1 shows that RL gaits not only approximate human walking but are also robust and energy-efficient across all models, without changes in the reward function and only minimal changes in the hyperparameters of the RL method. We provide the training curves and additional metrics for 10 random seeds in Figure S1.

In order to probe the robustness of our controllers, we perform rollouts on uneven terrain, which was not seen during training. The entire training procedure was performed with flat ground. The generated terrain contains 10 tiles of 1 m length with random slopes of ±5° and is fixed for all evaluations. The behavior of the planar H0918-model is compared against a popular reflex-based controller as an illustrative example, adapted from28,44 and included with the SCONE software. We were only able to use this simple reflex-based controller with the H0918 model, as it did not produce stable gaits with the other models. We train 5 reflex-based controllers with different initializations until convergence, while we use the most natural RL policy for each model and perform 20 roll-outs with randomized initial states to test the robustness. We chose this approach as reflex-based controllers are sensitive to the initial simulation state; different roll-outs would be almost identical if similar starting states were used.

While both approaches adequately match human kinematics, as quantified by the exp. match metric, with low energy consumption in the planar case, the reflex-based controller produces more natural gaits. However, when exposed to uneven terrain, the RL agent achieves an average distance of 10.42 m, which shows that it is much more robust than the reflex controller with an average distance of 2.46 m, see Table 1. Both controllers also induce similar average muscle activities over the gait cycle, with the RL agent inducing less smooth activity, shown in Figure 3.

With the same framework, we were also able to train agents to learn maximum speed running, by simply using the achieved velocity as the velocity reward in our reward function. Additionally, the action clipping and effort costs were omitted, as energy consumption is less critical for short maximum performance tasks. See Figure 2 for these results.

As a showcase of the extreme robustness of the RL agents, we generated a difficult suspension-bridge-terrain task with moveable environment elements that present dynamic perturbations, see Figure S3. We test the robustness of H1622 and H2190 RL controllers in this scenario, even though they were only ever trained on flat ground, and observe remarkable stability across the task. We report the data in Table 2 and in the videos.

Table 2.

Maximum running velocity for different models and total achieved distance in the dynamic terrain

system                 H0918   H1622         H2190          MyoLeg
max velocity [m/s]     5.38    5.04          6.49           5.44
achieved distance [m]  n.a.    9.87 ± 4.27   10.45 ± 4.77   n.a.

Velocities are expressed in m/s, the terrain is related to Figure S3. We show the maximum speed over 20 roll-outs for each model. We do not examine this environment for the H0918 model, as the 3D nature of the terrain is not applicable to it, and not for the MyoLegV0 model, as the terrain was not implemented yet in the MuJoCo engine when the experiments were conducted.

Note that we tried several alternatives to our approach, all of which yielded worse results. We performed experiments with different reward terms such as a constant instead of an adaptive effort term, with metabolic energy costs3 or with a cost of transport58,59 reward. Even though these terms sometimes led to small muscle activity during execution, the resulting kinematics were further from human data. We conjecture that energy minimization is not enough of an incentive for human-like gait if the learning algorithm is as flexible as an RL agent. See also Figure 4 for ablations of our reward function.

Figure 4.


Cost function ablations

We show several ablations of our cost function and plot the average match with experimental human data, as well as the average muscle activity. A natural gait is generally characterized by a large experimental match as well as minimal muscle activity. Different ablations are shown: The adaptive effort term is zero (α(t)=0): no-adapt. The entire effort cost term is zero (ceffort=0) and we deactivate the action clipping: no-effort. We only reward with the velocity reward term (ceffort=0 & cpain=0): only-vel. Only the combined cost function achieves a close resemblance to natural gait with low muscle activity. Leaving out the pain-related costs leads to the worst gait trajectories, while a combination of the effort cost terms and the adaptive cost term is needed to achieve the lowest muscle activity. All experiments report mean ± SD over 10 random seeds.

Larger effort term exponents, penalization of contacts between limbs or angle-based joint limit violation costs did not lead to better behavior. The prescription of hip movement at a certain frequency (step clock), keeping certain joint angles in pre-specified positions or minimizing torso rotation helped to achieve stable gaits, but prevented effort minimization and did not lead to natural kinematics.

Ablations

In order to pinpoint the contributions of the different introduced mechanisms, we performed ablation experiments, see Figure 4. The considered variants include a reward function without the adaptive term (no-adapt: α(t)=0), a reward function without any effort terms (no-effort: ceffort=0), and a variant where all rewards except for the external task reward rvel were zero (only-vel). It can be observed that the exp. match metric is the highest with the inclusion of all reward and cost terms (ours). While the non-adaptive effort reward also leads to a smaller average muscle effort than the other reward functions, the final effort value is still larger. The initially high muscle effort for the ours-variant coincides with high muscular co-contraction levels, diminishing over time, similar to studies with human participants.60

Discussion

The recently published DEP-RL31 approach was leveraged to learn feedback controllers for musculoskeletal systems. DEP-RL has been shown to achieve robust locomotion in several tasks, including running with a high-dimensional (120 muscles) bipedal ostrich model, by proposing an exploration scheme for overactuated systems. The learned behaviors, however, still exhibited unnatural artifacts, such as large co-contraction levels and excessive bending of several joints.

Here, we demonstrate that the combination of an effective RL algorithm31 with a reward function accounting for biologically plausible incentives results in gaits that closely resemble human walking. We achieve this by introducing a reward function that adapts its weights to the used model and is general enough to generate gaits across several models with up to 90 muscles in two and three dimensions and in simulators of differing biomechanical modeling accuracy without the need for manual tuning of the reward function for each model. Only the network size was decreased for the low-dimensional models to benefit from the computational speed up.

In our experiments with lower-dimensional 2D and 3D models (up to 22 muscles), the agents almost reach the naturalism of existing optimal control- and reflex-based frameworks.4,5,28 While the match to experimental human data is not perfect, it is closer than comparable methods in the literature that, similar to us, only use aggregated data to tune their reward function.23 In the complex 3D models with 80 and 90 muscles, which are substantially more challenging to control due to the large and redundant space of possible muscle excitations, our approach still achieved gaits with kinematics and GRFs similar to experimental human data, albeit with more artifacts. Achieving gaits in these complex models is a step toward applications in rehabilitation, neuroscience, and computer graphics requiring simulated human motor control with high-dimensional models in complex environments.

Striking is the robustness of the learned controllers exhibiting diverse stabilization strategies when faced with dynamic perturbations to an extent unseen in previous reflex-based controllers.8,9,28,61,62 As the used reward terms are considered plausible objectives for biological organisms, the general approach may also be applicable to different movements. Therefore, we see this study as a useful starting point for the community showing that RL in combination with biologically plausible objectives is a viable candidate to investigate the highly robust nature of complex human movements, even if no experimental data are available.

Close-to-human gait with RL can be achieved with an adaptive energy cost

As the human biomechanical system is highly redundant, there are many possible solutions to walking at a defined speed. There exists strong evidence that natural human walking is in part driven by energy-efficiency63 and optimal control approaches have shown that natural walking kinematics can be achieved if energy optimality is considered in the cost function (Some also suggest that muscle fatigue could be the driving factor to explain the experimentally observed kinematic patterns64).

However, while model-predictive control and trajectory optimization methods provide valuable insight into singular trajectories and can deal with unexpected perturbations up to a certain degree, recent studies have shown that RL methods can generate more robust controllers14 and learn a large number of diverse movements.20

To extend the promising results in the field of robotics to high-dimensional biomechanical systems, while considering the minimization of muscular effort as a guiding principle, we introduce an adaptive reward function. A single reward term schedule adapts the weighting of the energy term in the reward function depending on the current performance, achieving energy-efficient gaits with more natural kinematics in a biomechanical setting. Moreover, the adaptation Algorithm (Algorithm 1) and all other reward terms and their weighting coefficients are general enough to work—without any changes—across 2D and 3D models with different numbers of muscles and even different levels of biomechanical modeling accuracy.

This is a significant step toward finding a general reward function and framework to generate natural and robust movements with RL in muscle-driven systems. Other RL frameworks that do achieve natural muscle utilization either consider low-dimensional systems65 or strongly rely on motion capture data16 to render the learning problem feasible. Our approach works without the use of motion capture data during training and with few and very general reward terms and therefore may generalize better to other movements.

To our knowledge, there is only one comparable work.23 Its authors achieved human-level kinematics on a planar human model with 18 muscles by crafting a multi-stage learning curriculum that adjusts the weighting of seven reward terms. As this learning curriculum contains model-specific reward terms and adaptation procedures, we speculate that it would have to be hand-tuned for different models.

The realism of learned behaviors is influenced by the embodiment of the agent

While our approach achieved higher robustness than reflex-based controllers and kinematics closer to natural walking than previous demonstration-free RL approaches, several discrepancies from natural walking remain; see Figure 2 and the supplementary videos. The low-dimensional models (H0918 and H1622) generally do not exhibit proper ankle rolling, while the high-dimensional models (H2190 and MyoLegV0) exhibit less passive leg swing during the swing phase of the gait. We attribute these behavioral differences to the physiological modeling differences between the models.

The behavior of the MyoLegV0 model deviates more strongly from human data than that of the H2190, although the two are similar in complexity. The MyoLegV0 model lacks the double-peak structure in the GRFs, and we also observed a tendency toward unnatural lateral torso oscillations during walking and running; see Figure 3 and the videos. MyoLegV0 uses a different muscle geometry from H2190 and includes ellipsoidal collision objects for foot contact dynamics, which might increase learning difficulty. Alternatively, the more elaborate biomechanical features in Hyfydy, such as elastic tendons,66,67 non-linear foot-ground contact mechanics,68 variable pennation angles,69 and error-controlled integration, could account for the increased realism of the behaviors of the Hyfydy models.

Research on the contribution of biomechanical structures to the emergence of natural movement70,71,72,73 suggests that, in addition to the learning method and reward function, the biomechanical structures and modeling choices may play a crucial role in the accurate reproduction of human gait. This seems a plausible explanation for the increased realism of the Hyfydy models, as previous observations in predictive simulations suggest that, e.g., an elastic tendon is beneficial for natural gait.3,5,28,67 We regard this as an interesting area of future research, which could help us better understand the fundamentals of the interaction between biomechanics and neuronal control in human locomotion.

Conclusion

We achieved highly robust walking approaching human-like kinematics and ground reaction forces. While a higher degree of accuracy was achieved with the simpler models, we provide first promising results for difficult-to-control 80- and 90-muscle models that are of high interest for applications in rehabilitation, neuroscience, and computer graphics. Learning with the proposed reward function and RL framework achieves these results across several models of differing complexity and biomechanical modeling accuracy with only minimal changes to the hyperparameters of the method. We hope this inspires researchers from both the biomechanics and RL communities to further improve on our approach and to develop tools to unravel the fundamentals of the generation of complex, robust, and energy-efficient human movement.

Limitations of the study

We only performed experiments with two different behaviors, walking and running. It is unclear whether the reward function will generalize to other tasks. The learning problem did not include the control of the arms or the incorporation of vision-based sensory inputs, both of which could make the tasks more difficult. Finally, while the overall match to experimental human data is quite close, methods optimized for a particular model might achieve an even better fit. The influence of the distribution of male and female subjects in the reference data was not investigated.

Resource availability

Lead contact

Further information and requests for resources and information should be directed to, and will be fulfilled by, the lead contact, Pierre Schumacher (pierre.schumacher@cin.uni-tuebingen.de).

Materials availability

This study did not generate new unique materials.

Data and code availability

Acknowledgments

Pierre Schumacher was supported by the International Max Planck Research School for Intelligent Systems (IMPRS-IS). This work was supported by the Cyber Valley Research Fund (CyVy-RF-2020-11 to D.H. and G.M.). We acknowledge support from the Open Access Publication Fund of the University of Tübingen.

Author contributions

Conceptualization, P.S., G.M., D.F.B.H., and S.S.; methodology, P.S., G.M., D.F.B.H., T.G., V.K., and V.C.; software, P.S., T.G., V.K., and V.C.; writing – original draft, P.S., G.M., and D.F.B.H.; writing – review and editing, P.S., G.M., D.F.B.H., S.S., T.G., V.K., and V.C.

Declaration of interests

V.K. and V.C. are senior research scientists at Meta AI. T.G. is the author and proprietor of the Hyfydy simulation software.

STAR★Methods

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Deposited data

Human gait data | Hamner et al.57 | https://simtk.org/projects/nmbl_running

Software and algorithms

Reinforcement learning algorithms | Schumacher et al.31 and the present study | https://github.com/martius-lab/depRL
Hyfydy models used in the present study | The present study | https://github.com/tgeijten/sconegym
SCONE | The SCONE open-source software | https://scone.software/doku.php?id=start
MyoLeg model | MyoSuite | https://github.com/myohub/myosuite/

Method details

Simulation engines

We use both the Hyfydy and the MuJoCo simulation engines for our experiments. The two engines differ in the following key areas.

Musculotendon dynamics

The muscle model in Hyfydy is based on Millard et al.74 and includes tendon elasticity, muscle pennation, and muscle fiber damping. The MuJoCo muscle model is based on a simplified Hill-type model, parameterized to match existing OpenSim models51; it supports only rigid tendons and does not include variable pennation angles.

Contact forces

Hyfydy uses the Hunt-Crossley75 contact model with non-linear damping to generate contact forces, with a friction cone based on dynamic, static, and viscous friction coefficients.76 MuJoCo contacts are rigid, with a friction pyramid instead of a cone, and without separate coefficients for dynamic and viscous friction.

Contact geometry

The MuJoCo model uses a convex mesh for foot geometry, while in the Hyfydy models the foot geometry is approximated using three contact spheres.

Integration step size

Hyfydy uses an error-controlled integrator with variable step size, while MuJoCo uses a fixed step size and no error control. The average simulation step size in Hyfydy is around 0.00014 s (about 7,000 Hz) for the H2190 model, compared to the fixed MyoSuite step size of 0.001 s (1,000 Hz) for the MyoLegV0 model.

State space

The state space, or sensory input, of the RL agents consists of the Cartesian position and orientation of the pelvis and their velocities, the internal joint angles and velocities, the center-of-mass velocity, the torso orientation, the pelvis height, the positions of the feet relative to the pelvis, and the muscle fiber lengths, velocities, and forces. We also zero the Cartesian pelvis position in the x-dimension to improve performance, similar to previous work.15

Definition of reward terms

To facilitate the reproduction of our results, we give a detailed description and definition of all the reward terms used. The complete reward function is given by:

$r = r_{\mathrm{vel}} - c_{\mathrm{effort}} - c_{\mathrm{pain}}.$ (Equation 5)

The first term describes the task; in this case, it rewards reaching a target velocity:

$r_{\mathrm{vel}} = \begin{cases} \exp\!\left[-(v - v_{\mathrm{target}})^2\right] & \text{if } v < v_{\mathrm{target}} \\ 1 & \text{otherwise}, \end{cases}$ (Equation 6)

where v is the velocity of the pelvis in the forward direction.
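As a concrete illustration, the velocity reward of Equation 6 can be written in a few lines of Python (a sketch; the function name is ours):

```python
import math

def velocity_reward(v: float, v_target: float) -> float:
    """Task reward of Equation 6: Gaussian shaping below the target
    speed, a constant reward of 1 once the target is reached."""
    if v < v_target:
        return math.exp(-(v - v_target) ** 2)
    return 1.0
```

Any forward velocity at or above the target yields the maximal reward, so the agent is not pushed to walk faster than necessary.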

The second term, responsible for minimizing the effort, is:

$c_{\mathrm{effort}} = \alpha(t) \sum_{m=1}^{M} a_m^3 + w_1 \frac{1}{M} \sum_{m=1}^{M} \left(u_m - u_m^{\mathrm{prev}}\right)^2 + w_2 \sum_{m=1}^{M} \left[a_m > 0.15\right],$ (Equation 7)

where the sums run over all muscles in the respective model and [·] denotes the Iverson bracket, an indicator function that is 1 if the interior condition is true and 0 otherwise. Note that the term encouraging smoothness of the excitations, weighted by w1, is averaged over all muscles, while the other terms are not. The reasoning is that the first term carries the adaptive weight α(t), making additional weighting unnecessary, while the last term encourages sparse muscle activity: we want as few muscles active as possible at all times.
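A NumPy sketch of Equation 7 (not the authors' implementation; the function and argument names are ours):

```python
import numpy as np

def effort_cost(a, u, u_prev, alpha_t, w1, w2):
    """Effort cost of Equation 7.

    a      -- muscle activations, shape (M,)
    u      -- current muscle excitations, shape (M,)
    u_prev -- excitations from the previous time step, shape (M,)
    """
    cubed = alpha_t * np.sum(a ** 3)          # adaptive cubed-activation term
    smooth = w1 * np.mean((u - u_prev) ** 2)  # excitation smoothness, averaged over muscles
    sparse = w2 * np.sum(a > 0.15)            # Iverson bracket: count of active muscles
    return cubed + smooth + sparse
```

As in the text, only the smoothness term is averaged over the M muscles; the other two sums are left unnormalized.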

Lastly, the pain cost term is given by:

$c_{\mathrm{pain}} = w_3 \sum_{j=1}^{J} \tau_j^{\mathrm{lim}} + w_4 \max\!\left(0,\; \frac{1}{W} \sum_{k=1}^{2} F_k^{\mathrm{GRF}} - 1.2\right).$ (Equation 8)

The w4-term sums the ground reaction forces between each foot and the ground and normalizes them by the total body weight W. Only values exceeding 1.2 times the body weight are penalized.

The w3-term sums the forces with which joint limits are violated over all joints in the model. For the MyoLegV0, we did not consider the internal DOFs of the patella for this cost, as we did not achieve good results when incorporating them. This is a natural cost term, as joint limits in humans are not entirely stiff due to elastic tendons and other passive tissue.45 Both simulation engines allow small violations of the joint limits, which produce forces that drive the joint angle back to minimize the violation.
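A sketch of Equation 8 under the same caveats (names are ours; `tau_lim` collects the joint-limit violation forces, `grf` the per-foot ground reaction forces):

```python
import numpy as np

def pain_cost(tau_lim, grf, body_weight, w3, w4):
    """Pain cost of Equation 8: joint-limit violations plus excessive GRFs."""
    limit_term = w3 * np.sum(tau_lim)  # joint-limit violation forces over all joints
    # GRFs of both feet, normalized by body weight; only the part above 1.2 is punished
    grf_term = w4 * max(0.0, np.sum(grf) / body_weight - 1.2)
    return limit_term + grf_term
```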

The experiments with the running task also contain a self-contact cost:

$c_{\mathrm{contact}} = w_5 \, \mathrm{clip}\!\left(\sum_{i \notin \mathrm{ground}} \sum_{j \notin \mathrm{ground}} \left|f_c(i,j)\right|,\, 0,\, 100\right) / 100,$ (Equation 9)

which adds all contact forces between bodies i and j, except for collisions with the ground. We consider this cost to belong to the pain cost term, as strong and repetitive collisions between limbs during high-performance running would lead to injury in a human. We clip and scale the contact forces so that c_contact ∈ [0, w5]. Unscaled contact forces can vary wildly between simulation engines and might interfere with training stability. In this way, only strong and potentially painful self-contacts are considered, while weaker collisions can be safely ignored by the learner.
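The clipping and scaling of Equation 9 can be sketched in plain Python (names are ours; the clip range of 100 follows the text):

```python
def self_contact_cost(contact_forces, w5, f_max=100.0):
    """Self-contact cost of Equation 9 (sketch).

    contact_forces -- magnitudes |f_c(i, j)| for all body pairs,
                      excluding collisions with the ground.
    """
    total = sum(abs(f) for f in contact_forces)
    clipped = min(max(total, 0.0), f_max)  # clip to [0, f_max]
    return w5 * clipped / f_max            # result lies in [0, w5]
```

Because the sum is clipped before scaling, any combined self-contact force beyond the clip range incurs the same maximal penalty w5.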

Training curves

Here we present more detailed results on the training evolution in Figure S1. We plot the experimental match percentage between the collected gait-cycle-averaged data and the experimental human data, the muscle-averaged effort, the training returns, and the weight of the effort-reward term over training. This weight is adapted over time and depends on the agent's performance. It increases more slowly for the complex models and saturates at smaller values. It can also be seen that the returns for the MyoLegV0 are generally smaller than for the other models. We observed more variance over training and over different seeds for the MyoLegV0 agent, leading to much smaller averaged returns. It was still possible to find a training checkpoint that achieved robust, close-to-human-like walking for this model.

Effort cost adaption

The effort cost adaptation considers the current external task performance as the criterion for adaptation. This means that only the rvel term in Equation 1 is used to change the effort reward weight. We provide details of this mechanism in Algorithm 1.

Given a performance threshold θ, a smoothing parameter β, a change in adaptation rate Δα, and a decay term λ, we initialize a running mean of the episode return rmean and the switching indicator smean, and set the initial effort weight αt to zero. The effort weight αt increases by Δα if the running mean of the episode return rmean is above the defined threshold θ and the switching indicator smean is above 0.5. The indicator smean slowly moves toward 1 if the average episode return is above the threshold and slowly moves toward 0 if it is not. It therefore indicates whether the current episode performance has only been achieved for a short time or whether the agent has been performing well consistently. If the performance is above the threshold, but only recently so, as indicated by smean, we decrease the change in adaptation rate Δα by multiplying it with the decay term λ. In the case of consistently bad performance, the effort weight decreases.

This mechanism increases the effort weight as long as good task performance is achieved and decreases it if the agent no longer achieves it reliably. The rate decay was introduced to prevent performance oscillation without a decrease in the used effort. We observe that the agent only learns low-effort behaviors if the effort weight has been high for many training iterations, while the absolute value of the weight as well as the adaptation speed depend on the used model and the task. Our mechanism is therefore capable of creating a personalized learning schedule for each of the evaluated biomechanical models.

We have observed that for the models with a small number of DOFs, the effort weight goes up rapidly and the rate of increase will only get adjusted once or twice for an entire training run. For the high DOF models, and more difficult tasks which require longer training time, the learning progress will collapse many times until the rate of increase of the effort weight has slowed down significantly. In this manner, we benefit from a fast effort adjustment when network training is fast and the task is easy, while difficult training runs automatically adjust to a slow increase of the effort weight over time. See also Figure S2.
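Based on the description above, one update step of the adaptation mechanism can be sketched as follows. This is our reading of the prose, not the published Algorithm 1; the exact update order, the initial values, and the decrease rule for consistently bad performance are assumptions:

```python
def update_effort_weight(alpha_t, d_alpha, r_mean, s_mean, r_episode,
                         theta, beta, lam):
    """One update of the effort weight alpha_t after an episode (sketch).

    r_mean -- running mean of episode returns
    s_mean -- switching indicator in [0, 1]
    """
    r_mean = (1.0 - beta) * r_mean + beta * r_episode
    above = r_mean > theta
    s_mean = (1.0 - beta) * s_mean + beta * (1.0 if above else 0.0)
    if above and s_mean > 0.5:
        alpha_t += d_alpha                     # consistently good: raise effort weight
    elif above:
        d_alpha *= lam                         # only recently good: slow the increase
    elif s_mean < 0.5:
        alpha_t = max(0.0, alpha_t - d_alpha)  # consistently bad: relax the effort cost
    return alpha_t, d_alpha, r_mean, s_mean
```

Calling this once per training episode reproduces the qualitative behavior described above: the weight rises only after the return has stayed above the threshold for a while, and the rate of increase slows each time good performance has only just been reached.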

Running

We performed maximum-speed running experiments with every model. While most reward terms remained identical to the natural walking case, we replaced the external task reward by the velocity of the center of mass, rvel = v, and removed energetic constraints such as the muscle excitation clipping and the effort cost term. The gait-cycle- and leg-averaged kinematics are shown in Figure 2. As this task is a maximum-performance movement, we equalized the forces between the Hyfydy- and MuJoCo-based models, as the Hyfydy models in the main experiments are generally based on experimental data with weaker maximum isometric muscle forces.46 Note that we added a negative reward for self-collision forces for the running tasks, as the agents would often cross their legs and hit them against each other, thereby hopping instead of running. The self-collision forces are clipped to f ∈ [−100, 100] and normalized by dividing by F = 100. We use a reward coefficient of ωcontact = 10 for all environments.

Even though a stronger discrepancy remains between the produced kinematics and the experimental data than for walking, the hip movement and GRFs are generally well aligned for the Hyfydy models. The MyoLegV0 model exhibits very strong lateral torso oscillations during running; see also Figure 3. In future work, biological objectives such as head stabilization or the inclusion of arms in the model might minimize some of these artifacts. See Table 2 for the maximum running velocity of each model.

We also performed robustness experiments on a challenging obstacle course, see Figure S3 and supplementary videos.

Hyperparameters

The hyperparameter settings used for the RL agent, DEP, and the cost function are shown in Table S1. Non-reported RL values are left at their default settings in TonicRL.77 See Schumacher et al.31 for an explanation of the DEP-specific terms. The RL parameters were held constant, except for an increase in network capacity for H2190 and MyoLegV0.

Effect of network size

We observed that the models with a large number of DOFs generally require larger networks to train properly. Although the less complex models could potentially be trained with larger network sizes and good performance would eventually be reached, this increases training time tremendously. We therefore chose the smallest network size that reaches good performance for each model, to provide a quick-to-train baseline for other researchers to build on. Performing more difficult tasks with the less complex models may also require larger network sizes.

Quantification and statistical analysis

Basic statistical analysis was performed using numpy (version 1.26.4).78 Mean and standard deviation are indicated whenever an average over several rollouts or policies was performed. Additional information not contained in this section can be found in the figure legends.

RL training and evaluation

All RL policies are trained with the DEP-RL (https://github.com/martius-lab/depRL) software framework, which relies on the TonicRL (https://github.com/fabiopardo/tonic) implementation. We train each policy with 10 random seeds; the performance is plotted as the average with the standard deviation as shaded areas in all training performance plots. Each controller trains for 100 million environment interactions. The reflex controllers we use are included in the SCONE (https://scone.software/doku.php?id=ref:reflex_controller) software package.

For the evaluation of the most human-like gait, we take all training checkpoints after training and compute the experimental match metric by creating 5 rollouts of a 10 s walk for each checkpoint. The seed and training checkpoint with the largest value are taken for the evaluation.

For the robustness experiments, we randomly generate a terrain with 10 tiles of 1 m length and random slopes between ±5°. This terrain otherwise stays fixed for all evaluations.

To test the robustness of the reflex controller and the RL agent, we optimize 5 reflex controllers with different initializations until convergence, while for the RL agent we use the most natural policy for each model and perform 20 rollouts with randomized initial states. The walked distance serves as the robustness metric.

Data preparation

The used datasets contain motion capture and ground reaction force (GRF) recordings of adult males walking and running on a treadmill at 1.2 m/s and 5.2 m/s, respectively. All data are available through public repositories (https://simtk.org/projects/nmbl_running) or come freely with the SCONE software (https://scone.software/doku.php?id=start). We split all trajectories into gait cycles by measuring the time from ground contact to ground contact of the left or right foot via spikes in the respective GRFs. We consider a foot to be in contact with the ground if the computed normal force is larger than a threshold fthr = 0.001. The same procedure is repeated for the gaits produced by the RL agent.
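The contact-onset detection used to split trajectories into gait cycles can be sketched as follows (a sketch; the function name is ours, the threshold follows the text):

```python
import numpy as np

def contact_onsets(grf_normal, f_thr=0.001):
    """Indices at which one foot (re)gains ground contact.

    grf_normal -- 1D array of normal forces for one foot over time.
    An onset is a sample above f_thr whose predecessor was at or below it.
    """
    contact = grf_normal > f_thr
    # rising edges of the contact signal; +1 maps back to original indices
    return np.flatnonzero(contact[1:] & ~contact[:-1]) + 1
```

Consecutive onsets of the same foot then delimit one gait cycle.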

We then take the maximum and minimum joint angles or GRF values over all gait cycles of all participants moving at a given speed and check whether the values produced by the RL agent fall inside the range of the human experimental data. The percentage of the gait cycle for which the produced values lie inside this range is then used as a performance metric for optimization. This procedure is similar to the method included in the SCONE repository (https://github.com/opensim-org/SCONE/blob/6c8e8e3e45615ae2065a27243cfd8fbb778195a2/src/sconelib/scone/measures/GaitMeasure.cpp#L71). We use the joint angles of the hip, knee, and ankle as well as the GRFs for our optimization. Values from the left and right legs are averaged. The metric thereby implicitly values symmetric gaits higher than asymmetric ones, for which the values do not align as well.
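The resulting match metric can be sketched as follows (a simplification of the linked SCONE GaitMeasure code; function and array names are ours, and all curves are assumed to be resampled to a common gait-cycle length):

```python
import numpy as np

def experimental_match(sim, human_min, human_max):
    """Percentage of the gait cycle for which the simulated curve lies
    inside the per-sample [min, max] envelope of the human data."""
    inside = (sim >= human_min) & (sim <= human_max)
    return 100.0 * np.mean(inside)
```

Averaging this percentage over all considered joint angles and GRF curves yields a single scalar for checkpoint selection.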

While this approach uses experimental data to guide the hyperparameter selection, no recordings from human subjects are used during the learning process.

Published: March 11, 2025

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.isci.2025.112203.

Supplemental information

Document S1. Figures S1–S3 and Table S1
mmc1.pdf (764KB, pdf)
Video S1. Walking and running performance for different models and environments, related to Figure 2

References

  • 1. Patla A.E. Strategies for dynamic stability during adaptive human locomotion. IEEE Eng. Med. Biol. Mag. 2003;22:48–52. doi: 10.1109/memb.2003.1195695.
  • 2. Falisse A., Serrancolí G., Dembia C.L., Gillis J., Jonkers I., De Groote F. Rapid predictive simulations with complex musculoskeletal models suggest that diverse healthy and pathological human gaits can emerge from similar control strategies. J. R. Soc. Interface. 2019;16. doi: 10.1098/rsif.2019.0402.
  • 3. Wang J.M., Hamner S.R., Delp S.L., Koltun V. Optimizing locomotion controllers using biologically-based actuators and objectives. ACM Trans. Graph. 2012;31:25. doi: 10.1145/2185520.2185521.
  • 4. Song S., Geyer H. Generalization of a muscle-reflex control model to 3D walking. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan; 2013. pp. 7463–7466.
  • 5. Geijtenbeek T., van de Panne M., van der Stappen A.F. Flexible muscle-based locomotion for bipedal creatures. ACM Trans. Graph. 2013;32:1–11.
  • 6. Veerkamp K., Waterval N.F.J., Geijtenbeek T., Carty C.P., Lloyd D.G., Harlaar J., van der Krogt M.M. Evaluating cost function criteria in predicting healthy gait. J. Biomech. 2021;123. doi: 10.1016/j.jbiomech.2021.110530.
  • 7. Song S., Geyer H. A neural circuitry that emphasizes spinal feedback generates diverse behaviours of human locomotion. J. Physiol. 2015;593:3493–3511. doi: 10.1113/JP270228.
  • 8. Song S., Geyer H. Evaluation of a neuromechanical walking control model using disturbance experiments. Front. Comput. Neurosci. 2017;11. doi: 10.3389/fncom.2017.00015.
  • 9. Haeufle D.F.B., Schmortte B., Geyer H., Müller R., Schmitt S. The benefit of combining neuronal feedback and feed-forward control for robustness in step down perturbations of simulated human walking depends on the muscle function. Front. Comput. Neurosci. 2018;12. doi: 10.3389/fncom.2018.00080.
  • 10. van den Bogert A.J., Blana D., Heinrich D. Implicit methods for efficient musculoskeletal simulation and optimal control. Procedia IUTAM. 2011;2:297–316. doi: 10.1016/j.piutam.2011.04.027.
  • 11. De Groote F., Falisse A. Perspective on musculoskeletal modelling and predictive simulations of human movement to assess the neuromechanics of gait. Proc. Biol. Sci. 2021;288. doi: 10.1098/rspb.2020.2432.
  • 12. Agarwal A., Kumar A., Malik J., Pathak D. Legged locomotion in challenging terrains using egocentric vision. In: 6th Annual Conference on Robot Learning; 2022. https://openreview.net/forum?id=Re3NjSwf0WF
  • 13. Miki T., Lee J., Hwangbo J., Wellhausen L., Koltun V., Hutter M. Learning robust perceptive locomotion for quadrupedal robots in the wild. Sci. Robot. 2022;7:eabk2822. doi: 10.1126/scirobotics.abk2822.
  • 14. Song Y., Romero A., Müller M., Koltun V., Scaramuzza D. Reaching the limit in autonomous racing: Optimal control versus reinforcement learning. Sci. Robot. 2023;8. doi: 10.1126/scirobotics.adg1462.
  • 15. Barbera V.L., Pardo F., Tassa Y., Daley M., Richards C., Kormushev P., Hutchinson J. OstrichRL: A musculoskeletal ostrich simulation to study bio-mechanical locomotion. In: Deep RL Workshop NeurIPS 2021; 2021. https://openreview.net/forum?id=7KzszSyQP0D
  • 16. Lee S., Park M., Lee K., Lee J. Scalable muscle-actuated human simulation and control. ACM Trans. Graph. 2019;38:1–13. doi: 10.1145/3306346.3322972.
  • 17. Park J., Min S., Chang P., Lee J., Park M., Lee J. Generative GaitNet. In: ACM SIGGRAPH 2022 Conference Proceedings. Association for Computing Machinery; 2022.
  • 18. Park J., Park M., Lee J., Won J. Bidirectional GaitNet: A bidirectional prediction model of human gait and anatomical conditions. In: ACM SIGGRAPH 2023 Conference Proceedings. Association for Computing Machinery; 2023.
  • 19. Qi C., Abbeel P., Grover A. Imitating, fast and slow: Robust learning from demonstrations via decision-time planning. arXiv. 2022. https://arxiv.org/abs/2204.03597
  • 20. Luo Z., Cao J., Merel J., Winkler A., Huang J., Kitani K.M., Xu W. Universal humanoid motion representations for physics-based control. In: The Twelfth International Conference on Learning Representations; 2024. https://openreview.net/forum?id=OrOd8PxOO2
  • 21. Wagener N., Kolobov A., Frujeri F.V., Loynd R., Cheng C.-A., Hausknecht M. MoCapAct: a multi-task dataset for simulated humanoid control. In: Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22). Curran Associates Inc.; 2024.
  • 22. Feng Y., Xu X., Liu L. MuscleVAE: Model-based controllers of muscle-actuated characters. In: SIGGRAPH Asia 2023 Conference Papers. Association for Computing Machinery; 2023.
  • 23. Weng J., Hashemi E., Arami A. Natural walking with musculoskeletal models using deep reinforcement learning. IEEE Rob. Autom. Lett. 2021;6:4156–4162.
  • 24. Song S., Kidziński Ł., Peng X.B., Ong C., Hicks J., Levine S., Atkeson C.G., Delp S.L. Deep reinforcement learning for modeling human locomotion control in neuromechanical simulation. J. NeuroEng. Rehabil. 2021;18:1–7. doi: 10.1186/s12984-021-00919-y.
  • 25. Anand A.S., Zhao G., Roth H., Seyfarth A. A deep reinforcement learning based approach towards generating human walking behavior with a neuromuscular model. In: 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids); 2019. pp. 537–543.
  • 26. Alexander R.M. Optimization and gaits in the locomotion of vertebrates. Physiol. Rev. 1989;69:1199–1227. doi: 10.1152/physrev.1989.69.4.1199.
  • 27. Saibene F. The mechanisms for minimizing energy expenditure in human locomotion. Eur. J. Clin. Nutr. 1990;44:65–71.
  • 28. Geyer H., Herr H. A muscle-reflex model that encodes principles of legged mechanics produces human walking dynamics and muscle activities. IEEE Trans. Neural Syst. Rehabil. Eng. 2010;18:263–273. doi: 10.1109/TNSRE.2010.2047592.
  • 29. Bunz E.K., Haeufle D.F.B., Remy C.D., Schmitt S. Bioinspired preactivation reflex increases robustness of walking on rough terrain. Sci. Rep. 2023;13. doi: 10.1038/s41598-023-39364-3.
  • 30. Kaufmann E., Bauersfeld L., Loquercio A., Müller M., Koltun V., Scaramuzza D. Champion-level drone racing using deep reinforcement learning. Nature. 2023;620:982–987. doi: 10.1038/s41586-023-06419-4.
  • 31. Schumacher P., Haeufle D., Büchler D., Schmitt S., Martius G. DEP-RL: Embodied exploration for reinforcement learning in overactuated and musculoskeletal systems. In: The Eleventh International Conference on Learning Representations; 2023. https://openreview.net/forum?id=C-xa_D3oTj6
  • 32. Berret B., Chiovetto E., Nori F., Pozzo T. Evidence for composite cost functions in arm movement planning: An inverse optimal control approach. PLoS Comput. Biol. 2011;7. doi: 10.1371/journal.pcbi.1002183.
  • 33. Peng X.B., Ma Z., Abbeel P., Levine S., Kanazawa A. AMP: Adversarial motion priors for stylized physics-based character control. ACM Trans. Graph. 2021;40:1–20. doi: 10.1145/3450626.3459670.
  • 34. Weng J., Hashemi E., Arami A. Human gait cost function varies with walking speed: An inverse optimal control study. IEEE Rob. Autom. Lett. 2023;8:4777–4784.
  • 35. Xu J., Macklin M., Makoviychuk V., Narang Y., Garg A., Ramos F., Matusik W. Accelerated policy learning with parallel differentiable simulation. In: International Conference on Learning Representations; 2022. https://openreview.net/forum?id=ZSKRQMvttc
  • 36. Mohler B.J., Thompson W.B., Creem-Regehr S.H., Pick H.L., Jr., Warren W.H., Jr. Visual flow influences gait transition speed and preferred walking speed. Exp. Brain Res. 2007;181:221–228. doi: 10.1007/s00221-007-0917-0.
  • 37. Rudin N., Hoeller D., Bjelonic M., Hutter M. Advanced skills by learning locomotion and local navigation end-to-end. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); 2022. pp. 2497–2503.
  • 38. Abe D., Fukuoka Y., Horiuchi M. Economical speed and energetically optimal transition speed evaluated by gross and net oxygen cost of transport at different gradients. PLoS One. 2015;10. doi: 10.1371/journal.pone.0138154.
  • 39. Raffalt P.C., Guul M.K., Nielsen A.N., Puthusserypady S., Alkjær T. Economy, movement dynamics, and muscle activity of human walking at different speeds. Sci. Rep. 2017;7. doi: 10.1038/srep43986.
  • 40. Ackermann M., van den Bogert A.J. Optimality principles for model-based prediction of human gait. J. Biomech. 2010;43:1055–1060. doi: 10.1016/j.jbiomech.2009.12.012.
  • 41. Zahavy T., Schroecker Y., Behbahani F., Baumli K., Flennerhag S., Hou S., Singh S. Discovering policies with DOMiNO: Diversity optimization maintaining near optimality. In: The Eleventh International Conference on Learning Representations; 2023. https://openreview.net/forum?id=kjkdzBW3b8p
  • 42. Lee K., Smith L., Abbeel P. PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. In: International Conference on Machine Learning; 2021.
  • 43. Nilsson J., Thorstensson A. Ground reaction forces at different speeds of human walking and running. Acta Physiol. Scand. 1989;136:217–227. doi: 10.1111/j.1748-1716.1989.tb08655.x.
  • 44. Geijtenbeek T. SCONE: Open source software for predictive simulation of biological motion. J. Open Source Softw. 2019;4:1421. https://scone.software
  • 45. Hatze H. A three-dimensional multivariate model of passive human joint torques and articular boundaries. Clin. Biomech. 1997;12:128–135. doi: 10.1016/s0268-0033(96)00058-7.
  • 46. Delp S.L., Loan J.P., Hoy M.G., Zajac F.E. An interactive graphics-based model of the lower extremity to study orthopaedic surgical procedures. IEEE Trans. Biomed. Eng. 1990;37:757–767.
  • 47. Rajagopal A., Dembia C.L., DeMers M.S., Delp D.D., Hicks J.L., Delp S.L. Full body musculoskeletal model for muscle-driven simulation of human gait. IEEE Trans. Biomed. Eng. 2016;63:2068–2079. doi: 10.1109/TBME.2016.2586891.
  • 48. Christophy M., Faruk Senan N.A., Lotz J.C., O'Reilly O.M. A musculoskeletal model for the lumbar spine. Biomech. Model. Mechanobiol. 2012;11:19–34. doi: 10.1007/s10237-011-0290-6.
  • 49. Geijtenbeek T. The Hyfydy simulation software. 2021. https://hyfydy.com
  • 50. Todorov E., Erez T., Tassa Y. MuJoCo: A physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems; 2012. pp. 5026–5033.
  • 51. Caggiano V., Wang H., Durandau G., Sartori M., Kumar V. MyoSuite – a contact-rich simulation suite for musculoskeletal motor control. 2022. https://github.com/facebookresearch/myosuite
  • 52. Seth A., Hicks J.L., Uchida T.K., Habib A., Dembia C.L., Dunne J.J., Ong C.F., DeMers M.S., Rajagopal A., Millard M., et al. OpenSim: Simulating musculoskeletal dynamics and neuromuscular control to study human and animal movement. PLoS Comput. Biol. 2018;14. doi: 10.1371/journal.pcbi.1006223.
  • 53. Caggiano V., Dasari S., Kumar V. MyoDex: A generalizable prior for dexterous manipulation. https://openreview.net/forum?id=iYBTiYzN0A
  • 54. Berg C., Caggiano V., Kumar V. SAR: generalization of physiological agility and dexterity via synergistic action representation. Auton. Robots. 2024;48. doi: 10.48550/arXiv.2307.03716.
  • 55. Chiappa A.S., Marin Vargas A., Huang A., Mathis A. Latent exploration for reinforcement learning. 2023.
  • 56. Bovi G., Rabuffetti M., Mazzoleni P., Ferrarin M. A multiple-task gait analysis approach: Kinematic, kinetic and EMG reference data for healthy young and adult subjects. Gait Posture. 2011;33:6–13. doi: 10.1016/j.gaitpost.2010.08.009.
  • 57.Hamner S.R., Delp S.L. Muscle contributions to fore-aft and vertical body mass center accelerations over a range of running speeds. J. Biomech. 2013;46:780–787. doi: 10.1016/j.jbiomech.2012.11.024. https://www.sciencedirect.com/science/article/pii/S0021929012006768 [Online]. Available: [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Veerkamp K., Waterval N.F.J., Geijtenbeek T., Carty C.P., Lloyd D.G., Harlaar J., van der Krogt M.M. Evaluating cost function criteria in predicting healthy gait. J. Biomech. 2021;123 doi: 10.1016/j.jbiomech.2021.110530. https://www.sciencedirect.com/science/article/pii/S0021929021003110 [Online]. Available: [DOI] [PubMed] [Google Scholar]
  • 59.Mastrogeorgiou A., Papatheodorou A., Koutsoukis K., Papadopoulos E. In: CLAWAR 2022. Lecture Notes in Networks and Systems. Cascalho J.M., Tokhi M.O., Silva M.F., Mendes A., Goher K., Funk M., editors. Springer; Cham: 2023. Robotics in Natural Settings. 530. [DOI] [Google Scholar]
  • 60.Franklin D.W., Osu R., Burdet E., Kawato M., Milner T.E. Adaptation to stable and unstable dynamics achieved by combined impedance control and inverse dynamics model. J. Neurophysiol. 2003;90:3270–3282. doi: 10.1152/jn.01112.2002. pMID: 14615432. [Online]. Available: [DOI] [PubMed] [Google Scholar]
  • 61.Ramadan R., Geyer H., Jeka J., Schöner G., Reimann H. A neuromuscular model of human locomotion combines spinal reflex circuits with voluntary movements. Sci. Rep. 2022;12 doi: 10.1038/s41598-022-11102-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Schreff L., Haeufle D.F.B., Vielemeyer J., Müller R. Evaluating anticipatory control strategies for their capability to cope with step-down perturbations in computer simulations of human walking. Sci. Rep. 2022;12 doi: 10.1038/s41598-022-14040-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Selinger J., O’Connor S., Wong J., Donelan J. . Humans can continuously optimize energetic cost during walking. Curr. Biol. 2015;25:2452–2456. doi: 10.1016/j.cub.2015.08.016. https://www.sciencedirect.com/science/article/pii/S0960982215009586 [Online]. Available: [DOI] [PubMed] [Google Scholar]
  • 64.Ackermann M., van den Bogert A.J. Optimality principles for model-based prediction of human gait. J. Biomech. 2010;43:1055–1060. doi: 10.1016/j.jbiomech.2009.12.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Kidziński Ł., Mohanty S.P., Ong C.F., Huang Z., Zhou S., Pechenko A., Stelmaszczyk A., Jarosik P., Pavlov M., Kolesnikov S., et al. In: The NIPS ’17 Competition: Building Intelligent Systems. Escalera S., Weimer M., editors. Springer International Publishing; Cham: 2018. Learning to run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments; pp. 121–153. [Google Scholar]
  • 66.Ishikawa M., Komi P.V., Grey M.J., Lepola V., Bruggemann G.-P. Muscle-tendon interaction and elastic energy usage in human walking. J. Appl. Physiol. 2005;99:603–608. doi: 10.1152/japplphysiol.00189.2005. pMID: 15845776. [Online]. Available: [DOI] [PubMed] [Google Scholar]
  • 67.Blazevich A.J., Fletcher J.R. More than energy cost: multiple benefits of the long Achilles tendon in human walking and running. Biol. Rev. 2023;98:2210–2225. doi: 10.1111/brv.13002. [Online]. Available: [DOI] [PubMed] [Google Scholar]
  • 68.Saraiva L., Rodrigues da Silva M., Marques F., Tavares da Silva M., Flores P. A review on foot-ground contact modeling strategies for human motion analysis. Mech. Mach. Theor. 2022;177 https://www.sciencedirect.com/science/article/pii/S0094114X22002932 [Online]. Available: [Google Scholar]
  • 69.Sopher R.S., Amis A.A., Davies D.C., Jeffers J.R. The influence of muscle pennation angle and cross-sectional area on contact forces in the ankle joint. J. Strain Anal. Eng. Des. 2017;52:12–23. doi: 10.1177/0309324716669250. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Gerritsen K.G., van den Bogert A.J., Hulliger M., Zernicke R.F., Ronald F. Intrinsic Muscle Properties Facilitate Locomotor Control - A Computer Simulation Study. Mot. Control. 1998;2:206–220. doi: 10.1123/mcj.2.3.206. [DOI] [PubMed] [Google Scholar]
  • 71.Haeufle D.F.B., Grimmer S., Seyfarth A. The role of intrinsic muscle properties for stable hopping - stability is achieved by the force-velocity relation. Bioinspiration Biomimetics. 2010;5 doi: 10.1088/1748-3182/5/1/016004. [DOI] [PubMed] [Google Scholar]
  • 72.John C.T., Anderson F.C., Higginson J.S., Delp S.L. Stabilisation of walking by intrinsic muscle properties revealed in a three-dimensional muscle-driven simulation. Comput. Methods Biomech. Biomed. Eng. 2013;16:451–462. doi: 10.1080/10255842.2011.627560. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Wochner I., Schumacher P., Martius G., Büchler D., Schmitt S., Haeufle D. In: Liu K., Kulic D., Ichnowski J., editors. Vol. 205. PMLR; 2023. Learning with muscles: Benefits for data-efficiency and robustness in anthropomorphic tasks; pp. 1178–1188. (Proceedings of the 6th Conference on Robot Learning, Ser. Proceedings of Machine Learning Research). [Google Scholar]
  • 74.Millard M., Uchida T., Seth A., Delp S.L. Flexing computational muscle: modeling and simulation of musculotendon dynamics. J. Biomech. Eng. 2013;135 doi: 10.1115/1.4023390. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Hunt K.H., Crossley F.R.E. Coefficient of Restitution Interpreted as Damping in Vibroimpact. J. Appl. Mech. 1975;42:440–445. [Google Scholar]
  • 76.Sherman M.a., Seth A., Delp S.L. Simbody: multibody dynamics for biomedical research. Procedia IUTAM. 2011;2:241–261. doi: 10.1016/j.piutam.2011.04.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Pardo, F. Tonic: A deep reinforcement learning library for fast prototyping and benchmarking. preprint at arXiv. https://arxiv.org/pdf/2011.07537.
  • 78.Harris C.R., Millman K.J., van der Walt S.J., Gommers R., Virtanen P., Cournapeau D., Wieser E., Taylor J., Berg S., Smith N.J., et al. Array programming with NumPy. Nature. 2020;585:357–362. doi: 10.1038/s41586-020-2649-2. [Online]. Available: [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Document S1. Figures S1–S3 and Table S1
mmc1.pdf (764KB, pdf)
Video S1. Walking and running performance for different models and environments, related to Figure 2
Download video file (134MB, mp4)

Data Availability Statement


Articles from iScience are provided here courtesy of Elsevier