Abstract
Swimming microrobots are increasingly developed with complex materials and dynamic shapes and are expected to operate in complex environments in which the system dynamics are difficult to model and positional control of the microrobot is not straightforward to achieve. Deep reinforcement learning is a promising method of autonomously developing robust controllers for creating smart microrobots, which can adapt their behavior to operate in uncharacterized environments without the need to model the system dynamics. This article reports the development of a smart helical magnetic hydrogel microrobot that uses the soft actor critic reinforcement learning algorithm to autonomously derive a control policy which allows the microrobot to swim through an uncharacterized biomimetic fluidic environment under control of a time-varying magnetic field generated from a three-axis array of electromagnets. The reinforcement learning agent learned successful control policies from both state vector input and raw images, and the control policies learned by the agent recapitulated the behavior of rationally designed controllers based on physical models of helical swimming microrobots. Deep reinforcement learning applied to microrobot control is likely to significantly expand the capabilities of the next generation of microrobots.
Keywords: Robotics, microrobot, reinforcement learning, artificial intelligence, machine learning, control systems, magnetics
Graphical Abstract

Deep reinforcement learning discovers and implements optimal swimming behavior for magnetic microrobots, abrogating the need for both kinematic models of swimming and classical feedback controls approaches. This approach reveals a new strategy for microrobot manipulation in fluid environments similar to those in the human body, and thus has potential for medical impact in the future.
1. Introduction
Untethered swimming microrobotic systems have received significant research attention for performing micromanipulation tasks and particularly for their potential therapeutic biomedical applications.[1, 2] Microrobots operating remotely inside the human body have the potential to enable minimally invasive medical procedures including targeted drug, cell, or other cargo delivery,[3–6] tissue biopsy,[7] thermotherapy,[8] and blood clot removal.[9] Controlling these miniature devices within complex and dynamic environments such as the human body can present a significant engineering challenge.[10] This challenge arises in part because the design of microrobotic systems is trending towards the use of complex composite materials,[6, 11–14] dynamic morphologies,[15–18] and integrated biological components.[19–23] These features add layers of functionality to microrobotic systems, but can create difficulties when constructing accurate dynamic and kinematic models of microrobotic behavior, making it especially complex and challenging to use classical feedback control systems to control microrobot behavior.[5, 15, 24] Additionally, the environmental dynamics encountered by a biomedical microrobot inside the human body may be variable, complex, and poorly characterized.[17, 25, 26]
As one potential pathway to overcome these challenges, we can observe and adopt the strategies of natural biological agents that have evolved to operate in complex, unpredictable environments. Many biological systems can adapt to learn new behaviors based on experience, allowing them to thrive in a wide range of complex and variable environmental conditions by tailoring their behavior to suit the environment. Systems capable of learning adaptive behavioral patterns based on past events are ubiquitous in nature and are found across all levels of biological hierarchy, including in biochemical networks,[27] bacteria,[27, 28] nematode worms,[29] insects,[30] plants,[31] adaptive immune systems,[32] and animal behavior.[33] Inspired by the wide-ranging applicability of adaptation and learning to the success of living organisms, engineered microrobotic systems that learn new behaviors from past experience could enable new capabilities for complex microrobotic systems.[34]
Reinforcement learning (RL) is a biomimetic machine learning optimization technique inspired by the adaptive behavior of real-world organisms[35] that can enable learning behaviors in artificial engineered systems.[36] In RL, an agent observes the state of an environment, and chooses actions to perform in the environment to achieve a task specified by a reward signal, which is typically predefined. The reward signal is used to teach the agent to perform actions to maximize the expected future rewards, which enables the agent to learn to perform the task better based on past experience. RL algorithms have achieved success in a range of complex robotic control applications.[36–42] For example, RL for robotic control has been demonstrated to create robotic control policies that achieve better performance than many humans at complex tasks such as grasping and accurate throwing of irregularly shaped objects into bins.[40] RL algorithms have also been shown to exceed human level performance in complex virtual tasks with large possible state spaces that cannot be tractably and exhaustively modeled, such as the game of Go.[43] Machine learning techniques including RL have already demonstrated promise for developing policies to control microrobot behavior. In simulation, RL agents have been trained to control microrobot behavior for solving navigation and swimming challenges in heterogeneous fluids.[44, 45] An early RL algorithm, Q-learning, has also been shown to be effective for controlling the behavior of laser-driven microparticles in a discretized grid environment.[46] Other similar machine learning techniques such as Bayesian optimization have been demonstrated to learn walking gaits for difficult-to-model magneto-elastomeric millirobots.[47] However, control of swimming microrobots that use deep reinforcement learning to operate in dynamic, biomimetic, and microfluidic environments with clinically relevant magnetic actuation has yet to be reported.
In this work, we demonstrate that deep reinforcement learning based on the Soft Actor Critic algorithm[48] can be used to create smart soft helical magnetic microrobots that autonomously learn optimized swimming behaviors when actuated with non-uniform, nonlinear, and time-varying magnetic fields in a physical fluid environment. Our RL microrobots learned successful actuation policies both from state variable input and directly from raw images without any a priori knowledge about the dynamics of the microrobot, the electromagnetic actuator, or the environment (Figure 1a). Our RL agent discovered multiple successful actuation strategies in separate learning trials, and the control policies learned by the agent all recapitulated the behavior of theoretically optimal physics-based approaches for actuating helical magnetic microrobots. These results demonstrate the potential of reinforcement learning for developing high performance multi-input, multi-output (MIMO) controllers for microrobots without the need for explicit system modeling. This capability to autonomously learn model-free microrobot control algorithms could significantly reduce the time and resources required to develop high performance microrobotic systems.[34]
Figure 1. Microrobots with unknown dynamics in uncharacterized environments can be controlled with deep reinforcement learning.
(a) Microrobotic systems are designed with a great variety of shapes, sizes, materials, and actuation methods, and are often operated in challenging environments. Controllers based on artificial deep neural networks trained with reinforcement learning (RL) can factor in all of these complex dynamic systems and inputs to create model-free controllers for adaptive microrobots. (b) Our microrobotic system consisted of a helical agar magnetic robot (HAMR) in a circular polydimethylsiloxane (PDMS) fluidic track that was given the task of moving to a target position along the track under control of a multi-axis electromagnet (Magneturret). Images of the HAMR in the arena were captured with an overhead camera. (c) The smart microrobot control system used a neural network trained with the Soft Actor Critic (SAC) reinforcement learning algorithm to generate commands for the Magneturret. In the control loop, the stream of images from the overhead camera was processed to generate state information that was then fed into the actor neural network, which returned a set of continuous actions that were used to control the currents in the Magneturret. State (s), action (a), reward (r), and next state (s′) information was stored in a replay buffer, which was used to update the actor and critic neural networks off-policy in a learning loop.
2. Results
In order to create an environment where we could test the hypothesized efficacy of RL control systems for microrobots, we first designed and built a physical, biomimetic, fluidic arena with multidimensional magnetic actuation, and deployed a magnetic microrobot in the arena whose design was inspired by the work of Kumar and colleagues.[4] In our experimental setup (Figure 1b), a helical agar magnetic robot (HAMR) was tasked with swimming clockwise through a fluid filled lumen in a polydimethylsiloxane (PDMS) arena under control of a non-uniform rotating magnetic field generated by a three-axis array of electromagnetic coils (Magneturret). An overhead camera was used to track the position of the HAMR in the channel. The camera was used to pass images to a control algorithm consisting of an image processing module and a neural network which generated commands for the Magneturret (Figure 1c). The goal of the control system for our remotely actuated microrobot was to manipulate the shape and magnitude of the actuating energy field in order to move the microrobot to achieve an intended dynamic behavior. The controller neural network was trained via a reinforcement learning agent using the soft actor critic algorithm.
The fundamental control problem was encapsulated by this question: how should the currents in the electromagnetic coils be modulated in order to create a magnetic field that places forces and torques on the HAMR sufficient to drive its locomotion toward a specific target? Achieving this task usually requires an accurate dynamic model of the complete system, including the dynamics of the robot, the environment, and the actuator.[24] Significant work has been done developing dynamic and kinematic models for different microrobots and actuators.[15, 49–52] These models are often developed by making simplifying assumptions about the system such as uniform magnetization,[49] ideal shape,[49] and system linearity,[53, 54] which could lead to behavioral deviations between the physical system and the modeled system. The difficulty in accurately modeling the dynamics of microrobot behavior increases significantly for microrobots with complex magnetization profiles, soft material composition, or active shape-changing capabilities.[13, 15, 18, 55, 56]
Instead of explicitly modeling the dynamics of the magnetic actuator and the HAMR within the environment and specifying a controller, we performed the much simpler task of specifying the desired behavior of the HAMR in the form of a reward signal. The agent observed the state of the environment along with a reward signal containing information about actions that lead towards the successful completion of the task. The RL agent started without any a priori information about the task and had to learn to perform the task by sampling actions from the space of all possible actions and learning which actions resulted in behavior that was rewarded.
The RL controller required us to develop and formulate the task as well as the associated reward signal. At the beginning of each training episode, a target position was defined, 20° clockwise from the starting position of the HAMR in the circular channel. The objective of the RL agent was to develop an action policy, π, which maximized the total value of the rewards it would receive if it followed that policy in the future. When the environment was in state, s, the agent chose an action, a, from the policy according to a ∼ π(·|s), probabilistically selecting from a distribution of possible actions available in that state. The agent received a reward when it selected actions that moved the HAMR clockwise through the circular lumen towards the target, and it received a negative reward when it moved the HAMR counterclockwise. If the HAMR reached the target within the allotted time, the agent was given a large bonus reward, and the target position was immediately advanced 20° clockwise, starting a new episode. The reward function we selected was
| r(s, a) = −Δθ_HAMR + r_bonus · 1[|θ_HAMR − θ_goal| < 3°] | (1) |
where θ_HAMR is the angular position of the HAMR in the channel in degrees, θ_goal is the angular position of the goal, Δθ_HAMR is the change in angular position of the HAMR as a result of taking action a in state s, r_bonus is the large bonus reward, and 1[·] is an indicator function equal to 1 when its condition holds and 0 otherwise. Within reinforcement learning, proper reward selection can have a large effect on the performance of the system.[57] While we did not attempt to extensively optimize our reward function to maximize the learning speed of the agent, we designed this reward to encourage the agent to reach the goal as quickly as possible. This two-part reward function encourages actions that move the HAMR a large distance in the correct direction with each action (optimizing velocity) and directs the agent to end on the target position to receive the bonus reward (providing a terminal condition to end an episode). It is important to note that the agent received a positive reward when the HAMR moved clockwise during an action (which corresponds to a negative change in θ_HAMR using standard mathematical angle notation). The agent received the additional bonus reward for steps in which the position of the robot θ_HAMR was within 3° of the goal position θ_goal.
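As a concrete illustration, the reward logic described above can be sketched in Python. The helper name, the bonus magnitude of 100, and the omission of angle wrap-around handling are illustrative assumptions, not values taken from our implementation.

```python
def reward(theta_prev, theta_next, theta_goal, bonus=100.0, tolerance=3.0):
    """One-step reward: clockwise motion (a negative change in angle) earns
    a positive reward, and ending within `tolerance` degrees of the goal
    earns an additional bonus. Angle wrap-around is ignored for brevity."""
    delta_theta = theta_next - theta_prev      # negative when HAMR moves clockwise
    r = -delta_theta                           # reward clockwise progress
    if abs(theta_next - theta_goal) < tolerance:
        r += bonus                             # terminal bonus near the goal
    return r
```

Note that counterclockwise motion yields a negative reward automatically, since the change in angle is positive in that case.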
The reinforcement learning problem was formalized as a Markov decision process,[34] in which at time t, the state of the system, s_t, is observed by the agent. The agent performs an action, a_t, which changes the state of the environment to s_{t+1}, yielding a reward r_{t+1}. This process continues for the duration of the task, yielding a trajectory of the form (s_0, a_0, r_1, s_1, a_1, r_2, s_2, …). The goal of the RL agent is to identify an optimal policy for selecting actions, based on state observations, that maximizes the rewards received for following the policy. Over the course of training, the agent autonomously learned a control policy by trying actions in the environment, observing the reward obtained by performing those actions, and modifying its future behavior in order to maximize the expected future return.
The control problem for our microrobotic system was formulated as an episodic, discrete time problem with a continuous action space and continuous state space. The state space consisted of all the possible states for the system: the position of the HAMR within the channel, the speed of the HAMR, the shape of the magnetic fields, the time remaining in the episode, and the relative position of the robot to the target position in the channel. The action space consisted of four continuous actions, which controlled the magnitudes and phase angles of sinusoidal currents in the Magneturret. While the current waveforms could theoretically take on an infinite number of shapes, we chose to define the applied waveforms as sinusoids to bound the space of possible actions that the agent could take. Sinusoidal currents were chosen because these can be used to generate rotating magnetic fields in other three-axis electromagnetic actuators for microrobots, such as Helmholtz coils.[58] We formulated the problem as episodic, with a time limit and a goal for the robot: to reach the goal position as quickly as possible. While reinforcement learning problems can be framed as episodic (with terminal states) or continuous,[59] we chose to represent the problem episodically to recreate the conditions we expect would be present in plausible use cases for microrobots. We chose to include a goal position to represent targets that a microrobot moving in the context of a biomedical application might attempt to reach, such as a blood clot, lesion, or tumor. In our current implementation, when the HAMR reached the goal, this represented the terminal state of the task, and a new episode was started with a new goal position. By creating a virtual moving target for the HAMR during training, we were freed from having to manually reset the state of the system each time the target was reached, which facilitated an automated training process.
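A minimal sketch of how an observation vector covering the quantities above might be assembled follows; the sin/cos angle encoding, the signed goal offset, and the exact ordering are illustrative assumptions rather than our exact implementation.

```python
import numpy as np

def make_state(theta_hamr, theta_goal, speed, last_action, time_left):
    """Assemble an observation vector: the HAMR's angular position (encoded
    as sin/cos so the representation is continuous across the 0/360 degree
    wrap), its speed, the last coil command (a proxy for the field shape),
    the time remaining, and the signed angular offset to the target."""
    th = np.radians(theta_hamr)
    # signed offset in (-180, 180], positive when goal lies clockwise
    offset = (theta_hamr - theta_goal + 180.0) % 360.0 - 180.0
    return np.concatenate([
        [np.sin(th), np.cos(th)],
        [speed],
        np.asarray(last_action, dtype=float),  # four Magneturret actions
        [time_left, offset],
    ])
```

The resulting nine-element vector is what a state-vector-based agent would consume at each step.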
2.1. Entropy Regularized Deep Reinforcement Learning Enabled Continuous Microrobot Control
We selected the Soft Actor Critic RL algorithm (SAC) for this research. SAC is a maximum entropy RL algorithm that seeks to balance the expected future rewards with the information entropy of the policy.[48] In other words, SAC learns a policy that successfully completes the task while acting as randomly as possible, which in practice often leads to robust policies that are tolerant of perturbations in environmental conditions.[37] SAC had previously proven useful for real-world robotic tasks with high-dimensional, continuous state and action spaces,[37, 39] which suggested that it would be applicable to our microrobotic control problem. In previously reported applications of real-world reinforcement learning with physical systems,[60] SAC was demonstrated to be highly sample efficient, requiring relatively few environmental interactions in order to develop a successful policy. Sample efficiency is critical when performing reinforcement learning with real-world robotics (i.e. not simulated) in order to reduce wear and tear on the system, and in order to minimize the time needed to learn a policy.[38]
The SAC algorithm seeks to develop an optimal stochastic policy π*:
| π* = arg max_π Σ_t E_(s_t, a_t)∼ρ_π [r(s_t, a_t) + α H(π(·|s_t))] | (2) |
where H(π(·|s_t)) is the information entropy of the policy and α is a temperature hyperparameter, which balances the relative impact of the policy entropy against the expected future rewards. Here, we used a version of the SAC algorithm in which the temperature α is automatically tuned via gradient descent so that the entropy of the policy continually matches a target entropy, H̄, which we selected to be −4 (the negative of the dimension of the action space) using a heuristic suggested by Haarnoja et al.[61] A full derivation of the soft actor critic algorithm is beyond the scope of this paper, but interested readers are directed to Haarnoja et al.[48] Briefly, the SAC algorithm uses an agent called the actor, denoted as π_ψ, which is a deep neural network that takes the state of the system as input, and returns an action as output. A value function is created to rate the value of taking actions when in particular states, and instantiated using two critic neural networks, Q_φ1 and Q_φ2, which take states and actions as input, and return values Q_φi(s_t, a_t) corresponding to the relative value of taking action a_t in state s_t. Two Q networks are trained in order to reduce overestimation in the value function. Environmental transitions in the form of (s, a, r, s′, d) tuples are recorded in an experience replay buffer, D, where d is a done flag denoting a terminal state, set either when the microrobot has reached the goal, or when the episode has timed out. The SAC algorithm learns off-policy by randomly sampling minibatches of past experiences from D, and performing stochastic gradient descent over the minibatch in order to minimize loss functions for the actor network, L_π, critic networks, L_Q1 and L_Q2, and temperature parameter, α. Over the course of learning, the parameters of the actor and critic neural networks are updated so that the behavior of the policy approaches the optimum policy, π*. A detailed version of the soft actor critic algorithm for microrobot control that we used in this study is available in Algorithm 1.
Neural network architectures and hyperparameters used are available in Supplementary Table 1.
Algorithm 1.
Soft Actor Critic for Microrobot Control.
| 1: | Initialize policy parameters ψ, Q-function parameters φ_1, φ_2, and empty FIFO replay buffer D |
| 2: | Set target Q-function parameters equal to main parameters: φ′_1 ← φ_1, φ′_2 ← φ_2 |
| 3: | Initialize target entropy H̄ = −(number of actions) = −4, and temperature α |
| 4: | Observe initial state s, and calculate θ_HAMR |
| 5: | Set t = 0, θ_goal = θ_HAMR − 20° |
| 6: | Data Collection Process: Repeat |
| 7: | If new policy parameters ψ are available: update π_ψ |
| 8: | While t < 33 and done = false |
| 9: | Select action a ∼ π_ψ(·|s) |
| 10: | Execute a in the environment |
| 11: | For j in range (3) |
| 12: | Wait 0.3 seconds |
| 13: | Observe next state s′, reward r, and done flag d |
| 14: | End For |
| 15: | Set r according to Equation (1) |
| 16: | Store transition (s, a, r, s′, d) in replay buffer D |
| 17: | Set s = s′ |
| 18: | If done: set θ_goal = θ_HAMR − 20° |
| 19: | Set t = t + 1 |
| 20: | End While |
| 21: | Set t = 0 |
| 22: | Training Process: Repeat |
| 23: | If number of updates < number of transitions in D |
| 24: | Randomly sample batch B of n transitions (s, a, r, s′, d) from D |
| 25: | Compute targets for the Q functions: y(r, s′, d) = r + γ(1 − d)(min_{i=1,2} Q_φ′_i(s′, ã′) − α log π_ψ(ã′|s′)), where ã′ ∼ π_ψ(·|s′) |
| 26: | Update Q-functions using ∇_φ_i (1/n) Σ_B (Q_φ_i(s, a) − y(r, s′, d))² for i = 1, 2 |
| 27: | Update policy using ∇_ψ (1/n) Σ_B (min_{i=1,2} Q_φ_i(s, ã_ψ(s)) − α log π_ψ(ã_ψ(s)|s)), where ã_ψ(s) is sampled from π_ψ(·|s) via the reparameterization trick |
| 28: | Update temperature using ∇_α (1/n) Σ_B (−α log π_ψ(ã_ψ(s)|s) − α H̄) |
| 29: | Update the target Q-functions using φ′_i ← ρφ′_i + (1 − ρ)φ_i for i = 1, 2 |
| 30: | End If |
| 31: | Send latest ψ to data collection process every minute |
| 32: | Until convergence |
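The two core numerical steps of the training process, the soft Q target (step 25) and the target-network update (step 29), can be sketched as plain functions. The names `q_target` and `polyak_update` and the default values of α, γ, and ρ are illustrative assumptions, not hyperparameters from this study.

```python
import numpy as np

def q_target(r, q1_next, q2_next, logp_next, done, alpha=0.2, gamma=0.99):
    """Soft Q target (Algorithm 1, step 25): bootstrap from the minimum of
    the two target critics, subtracting the entropy term alpha * log pi;
    a done flag of 1 truncates the bootstrap at terminal states."""
    min_q = np.minimum(q1_next, q2_next)  # clipped double-Q estimate
    return r + gamma * (1.0 - done) * (min_q - alpha * logp_next)

def polyak_update(target_params, params, rho=0.995):
    """Target-network update (Algorithm 1, step 29): exponential moving
    average of the critic weights with coefficient rho."""
    return [rho * tp + (1.0 - rho) * p for tp, p in zip(target_params, params)]
```

In a full implementation these quantities would be computed over sampled minibatches of transitions, with gradients taken through the critic and actor networks.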
2.2. Hardware to Control Magnetic Microrobots with Reinforcement Learning
Magnetic fields created by electromagnetic coils are commonly used actuators for magnetic microrobots and have significant potential for clinical medical applications.[3, 62, 63] Magnetic fields act by imparting forces and torques on the robot. For a microrobot with a magnetic moment, m, in a magnetic field, B, the robot experiences a force according to
| F = ∇(m · B) | (3) |
In a non-uniform magnetic field (i.e., a magnetic field with a spatial gradient), a ferromagnetic or paramagnetic microrobot experiences a force directed toward regions of increasing magnetic field strength. The magnetic microrobot also experiences a torque according to
| τ = m × B | (4) |
which acts to align the magnetic moment of the microrobot with the direction of the magnetic field. When the magnetic field is rotated so that the direction of B is constantly changing, it is possible to use this torque to impart spin to the microrobot at the frequency of the rotating magnetic field, up to the step-out frequency of the robot.[49] If the spinning microrobot is helically shaped, rotation can be transduced into forward motion so that the microrobot swims similarly to how bacteria are propelled by flagella.[64] This non-reciprocal helical swimming is efficient in low Reynolds number fluidic environments commonly encountered by microrobots.[64] Because of the efficiency of this swimming mode, and because the magnetic torque available to a microrobot decreases more gradually with distance compared to the force,[24] magnetic microrobots designed for long range magnetic operation are often helically shaped.[6, 11, 65] For this reason, we selected a helical magnetic microrobot, the HAMR, as our model system.
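Equations (3) and (4) can be checked numerically with a short sketch; the dipole moment and the field function below are illustrative values, and the gradient of m · B is approximated by central finite differences.

```python
import numpy as np

def magnetic_torque(m, B):
    """Torque on a magnetic dipole, Eq. (4): tau = m x B
    (N*m for m in A*m^2 and B in T)."""
    return np.cross(m, B)

def magnetic_force(m, B_func, r, h=1e-6):
    """Force on a magnetic dipole, Eq. (3): F = grad(m . B), approximated
    by central finite differences of a field function B_func(r)."""
    F = np.zeros(3)
    for i in range(3):
        dr = np.zeros(3)
        dr[i] = h
        F[i] = (m @ B_func(r + dr) - m @ B_func(r - dr)) / (2 * h)
    return F
```

A uniform field gives zero force but can still exert a torque, which is why rotating fields are attractive for long-range actuation of helical swimmers.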
The HAMR that we created for this study was composed of a 2% w/v agar hydrogel, which was uniformly diffused with 10% w/v iron oxide nanopowder to form a magnetically susceptible soft polymer (Figure 2a).[4] This magnetic agar solution was heated to melting temperature and a syringe was used to inject the liquid into a helical mold created using a stereolithography 3D printer (Figure 2b). The agar in the mold solidified and the robots were removed with a metal needle and stored long-term in deionized (DI) water. The HAMRs molded for this study were 4.4 mm in length, 1 mm in diameter, and asymmetrical from head to tail, with a flat head and a pointed tail (Figure 2c,d). Microrobots formed with this technique have been previously shown to be controllable within rotating magnetic fields, and to perform biomedical functions such as cell delivery[4] and active biofilm removal in the root canal of human teeth.[58] For our application, this HAMR design had several advantages. The HAMRs were simple to manufacture at low cost with batch fabrication methods. The HAMRs were small enough to act as helical swimming robots in a flow regime with Reynolds number ~1, but large enough, about the size of a small grain of rice, to be easily manipulated and visualized without the use of microscopes or other micromanipulation tools. Because the HAMRs swim with non-reciprocal, helical motion in the presence of a rotating magnetic field, a very common motif in microrobotic research,[11, 65, 66] insights gained from this study could readily be extended to other microrobotic systems with similar characteristics.
Figure 2. Hardware for real-world control of magnetic microrobots using reinforcement learning.

(a) Helical Agar Magnetic Robot (HAMR) schematic. (b) HAMRs were fabricated by molding molten magnetic hydrogel in 3D printed molds. Scale bar = 10 mm. (c) HAMR with a United States 5-cent coin. Scale Bar = 5 mm. (d) HAMRs were composed of agar hydrogel infused with iron oxide nanopowder. Scale bar = 1 mm. (e) The Magneturret was composed of six coils of copper magnet wire wrapped around permalloy cores, positioned on the faces of a central 3D printed cube. Scale bar = 20 mm. (f) The Magneturret was enclosed within a 3D printed housing and sealed with epoxy, and glycerol coolant was continuously pumped through the Magneturret housing. Scale bar = 20 mm. (g) A circular PDMS track with a rectangular cross section used as an arena for the HAMR. A black, circular fiducial marker indicates the center of the arena. Scale bar = 10 mm. (h) The PDMS arena was submerged in a water-filled petri dish placed on top of the Magneturret, with an acrylic LED light sheet as a backlight for uniform bottom-up illumination. Scale bar = 30 mm. (i) The complete hardware system.
Because the HAMRs were made of soft hydrogel, they were flexible and deformable. Soft bodied robots have many favorable characteristics for in vivo use such as deformability to fit through irregular shaped channels and enhanced biocompatibility (e.g., by matching the elastic modulus of the robot to the biological environment).[67] Previous studies investigating the material properties of agar hydrogels suggest that our HAMRs should have an elastic modulus in the range of ~400 kPa[68] which makes our HAMR softer than many of the biological tissues that a microrobot might interact with in vivo, such as artery tissue.[69] However, the forces and torques experienced by the HAMR in this study did not lead to significant observable deformation, and the HAMR retained its helical shape throughout the training. Based on the relative rigidity of the HAMR, we expect that the swimming dynamics of the HAMR will behave analogously to a rigid helical swimmer with a similar geometry.[70] Soft-bodied microrobots are appealing for biomedical applications, but it can be more difficult to create accurate dynamic models for soft-bodied microrobots.[17] Our method of using reinforcement learning to develop control systems without explicit modeling could be particularly useful for soft microrobots due to this modeling constraint, although the increased complexity of a highly deformable or kinematically complex microrobot would be likely to significantly increase the time required to train an RL control system. Finally, despite being soft-bodied, the hydrogel structure of the HAMR did not experience noticeable wear over the course of several months of continuous use, thus meeting a practical reinforcement learning constraint that the system not be susceptible to significant wear and tear during extended use which would cause a distribution shift in the collected data as the dynamic properties of the system degraded.[60]
As an actuator for our microrobot system, we developed a three-axis magnetic coil actuator –the Magneturret– which contained six permalloy-core magnetic coils arranged on the faces of a 3D-printed acrylonitrile butadiene styrene (ABS) plastic cube (Figure 2e). The two coils on opposite sides of the central cube along each axis were wired together in series so that they both contributed to the generation of a magnetic field along their respective axis. Each of the three coil pairs, hereafter referred to as the X, Y, and Z coils, was driven with a sinusoidal current generated by a pulse width modulated (PWM) signal created by a microcontroller and amplified in an H-bridge motor driver. The resulting magnetic field, produced by the superposition of the magnetic fields from the three coils, could be modulated by varying the frequency, amplitude, and phase angle of the sinusoidal current waves in each coil. To cool the coils and prevent thermal damage, the Magneturret was sealed with epoxy resin into a 3D-printed housing and coolant was continuously pumped through while the coil was operating (Figure 2f). The RL agent was given direct control over the magnitudes and phase angles of the sinusoidal driving currents in the X-axis coils and Y-axis coils of the Magneturret, for a total of four continuously variable actions: M_x, M_y, φ_x, and φ_y (Table 1). The Z-axis magnitude, M_z, was calculated as the larger of the two magnitudes in X and Y, and the Z-axis phase angle, φ_z, was fixed. The sinusoidal currents in each axis used a fixed angular frequency, ω, of 100 rad/s (15.9 Hz).
Table 1.
Control inputs for electromagnet waveforms.
| Magnetic Coil Control Parameters |
| Control Variable | Symbol | Source | Range |
|---|---|---|---|
| Frequency | ω | Fixed | 100 rad/s (15.9 Hz) |
| Magnitude X | M_x | RL agent | [−1, 1] unitless |
| Magnitude Y | M_y | RL agent | [−1, 1] unitless |
| Magnitude Z | M_z | Calculated: M_z = max(\|M_x\|, \|M_y\|) | [0, 1] unitless |
| Phase Angle X | φ_x | RL agent | [0, 2π] radians |
| Phase Angle Y | φ_y | RL agent | [0, 2π] radians |

| Control Equations | | |
|---|---|---|
| Current in X-axis coil | I_x(t) = M_x sin(ωt + φ_x) | (5) |
| Current in Y-axis coil | I_y(t) = M_y sin(ωt + φ_y) | (6) |
| Current in Z-axis coil | I_z(t) = M_z sin(ωt + φ_z), with φ_z fixed | (7) |
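The mapping from the four agent actions to coil currents per the control equations (5)–(7) can be sketched as follows; the fixed Z-axis phase of 0 is an assumed value for illustration.

```python
import numpy as np

OMEGA = 100.0  # rad/s, fixed angular frequency (15.9 Hz)

def coil_currents(t, Mx, My, phi_x, phi_y, phi_z=0.0):
    """Normalized coil currents from the four agent actions, per the
    control equations: the Z magnitude is the larger of |Mx| and |My|,
    and the Z phase is held fixed (phi_z = 0 is an assumption)."""
    Ix = Mx * np.sin(OMEGA * t + phi_x)
    Iy = My * np.sin(OMEGA * t + phi_y)
    Iz = max(abs(Mx), abs(My)) * np.sin(OMEGA * t + phi_z)
    return Ix, Iy, Iz
```

Sweeping the phase offsets φ_x and φ_y relative to the fixed Z phase changes the rotation plane of the superposed field, which is what the agent learns to exploit.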
We chose to operate our HAMR in a circular, fluid-filled track for this study. This arena mimicked the tortuous in vivo luminal environments that microrobots operating in the body might encounter, while providing a simple setting for us to establish a robust proof-of-concept reinforcement learning control system (Figure 2g). The HAMR could swim in a complete circle within this arena, and no human intervention was required to reset the position of the robot in the environment during training, which facilitated automated learning.[38] The arena was constructed by pouring polydimethylsiloxane (PDMS) over a polyvinyl chloride ring with an outer diameter of 34 mm and a 1.7 mm × 3 mm rectangular cross section. The PDMS was then cured, and plasma bonded to a second flat sheet of cured PDMS to form a rectangular lumen through which the HAMR could swim. PDMS is transparent, allowing us to see the robot in the arena and to visually track it with an overhead camera. During long-term learning experiments, the PDMS arena was submerged in a petri dish filled with DI water in order to prevent the formation of air bubbles in the channel due to evaporation over the course of an experiment. This petri dish was then placed on top of the Magneturret, with the center of the Z-axis coil aligned with the center of the circular track (Figure 2h). A black rubber wafer was placed into the center of the arena on top of the PDMS to act as a fiducial marker so that the center of the arena could easily be identified with image processing. A diffuse white LED backlight was positioned between the Magneturret and the PDMS arena for uniform bottom-up illumination, which facilitated simple image processing by binary thresholding to identify the position of the microrobot in the channel.
We did not perform an extensive analysis to identify the shape or magnitude of the magnetic field created by the Magneturret, or to model the swimming dynamics of the HAMR in the PDMS arena. We hypothesized that we would be able to develop a high-performance control system using RL without going through the effort of developing a system model first.
2.3. Reinforcement Learning Can Be Used to Learn Microrobot Control Policies
Reinforcement learning systems have been demonstrated that can learn to achieve tasks from a wide range of state information sources.[36] For this study, we evaluated the ability of our RL agent to learn control policies from either state vector-based inputs (Figure 3a) or raw images augmented with the goal position (Figure 3b). In state-based input mode, state information of the microrobotic system was derived by using image processing to create a state vector, which was passed to the RL agent. The angular position, θ, of the microrobot in the channel was calculated by binary thresholding and simple morphological operations. The camera was deliberately run with a slow shutter speed so that the images were washed out, removing noise and simplifying the task of identifying the position of the HAMR and the center of the channel by binary thresholding. The angular position of the HAMR in the channel was measured relative to the fiducial marker in the center of the circular arena. This information, as well as the position of the goal, θgoal, the last action taken by the agent, a, and the time, t, remaining in the episode were used to create a state vector. In the second input mode, we gave the agent observations in the form of raw pixel data from the camera. The images from the overhead camera were scaled down to 64×64 pixels and augmented with a marker indicating the position of the goal, θgoal, as a line radiating outward from the center of the circular track to the goal position in front of the HAMR. These images were then passed into a convolutional neural network, a deep neural network architecture that has been demonstrated to effectively learn to identify features in images for classification tasks[71] and has been used to control robots from raw image input with reinforcement learning.[61]
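The state-vector extraction described above can be sketched as a short image-processing routine. The threshold value, frame size, and fiducial-marker radius below are illustrative assumptions, not values from our implementation:

```python
import numpy as np

def angular_position(frame, center, threshold=40, fiducial_radius=10):
    """Estimate the HAMR's angular position, theta, from a backlit
    grayscale frame. The robot appears dark on a bright background, so
    binary thresholding isolates it; the centroid of the dark pixels
    relative to the arena center gives theta. The threshold and fiducial
    radius are assumed values for this sketch."""
    ys, xs = np.nonzero(frame < threshold)      # dark pixels
    cy, cx = center                             # arena center (fiducial marker)
    # Ignore pixels belonging to the central fiducial marker itself.
    keep = (xs - cx) ** 2 + (ys - cy) ** 2 > fiducial_radius ** 2
    theta = np.arctan2(ys[keep].mean() - cy, xs[keep].mean() - cx)
    return np.degrees(theta)

# Synthetic 64x64 frame: bright background, dark robot blob to the right
# of center, i.e. at theta = 0 degrees.
frame = np.full((64, 64), 255, dtype=np.uint8)
frame[31, 50:54] = 0
print(angular_position(frame, center=(31, 31)))  # -> 0.0
```

In practice a morphological opening would precede the centroid step to reject stray dark pixels, as described in the text.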
Figure 3. Reinforcement learning yielded successful control policies for the HAMR within 100k time steps.
(a) We used two input modes to train the RL agent. The first mode was to calculate a state vector consisting of the robot position, θ, the goal position, θgoal, the last action taken by the agent, Mx, φx, My, φy, and the time remaining in the episode. Three sequential observations were taken 0.3 seconds apart and combined as a single state, s. (b) The second input mode used images of the microrobot in the arena as the state. Three sequential images were passed to a convolutional neural network as the state, s. (c) We trained the agent for a total of 100k time steps per training session. The trace represents the average and standard deviation of the return from three successful training runs. (d) Successful training resulted in policies which, after an initial learning period, achieved consistent clockwise motion, with the HAMR moving continuously around the circular arena. (e) The RL agent learned policies that moved the microrobot around the full circular arena with helical swimming motion.
Reinforcement learning is based on the mathematics of Markov decision processes, which theoretically require the full state of the system to be available to the agent in order for convergence to be guaranteed.[35] In our particular implementation, the velocity of the HAMR at any given time could not be determined from a single still-frame observation, so the total state of the system given to the agent at each time step was composed of three concatenated sub-observations taken 0.3 seconds apart, for a total step time of 0.9 seconds (see Algorithm 1, steps 11–15). This allowed the agent to infer the velocity of the HAMR based on differences between the three sub-observations. This technique of batching sequential observations for improving the observability of the system for RL has been used successfully in domains such as Atari video games, in which the agent learned from raw pixel data gathered from sequential screenshots of the game.[72]
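The observation-batching step described above can be sketched as follows; `observe` is a hypothetical callable standing in for the camera and image-processing pipeline:

```python
import time

def collect_state(observe, n_obs=3, dt=0.3):
    """Concatenate n_obs sub-observations taken dt seconds apart into one
    RL state, so the agent can infer the HAMR's velocity from differences
    between them (a sketch of steps 11-15 of Algorithm 1; `observe` is
    any callable returning one observation as a list of floats)."""
    subs = []
    for i in range(n_obs):
        subs.append(observe())
        if i < n_obs - 1:
            time.sleep(dt)              # 0.3 s apart -> 0.9 s per step
    # Flatten the sub-observations into a single state vector.
    return [x for sub in subs for x in sub]

# Hypothetical usage with stubbed sensor readings (dt=0 to run instantly):
readings = iter([[0.0], [2.5], [5.0]])
print(collect_state(lambda: next(readings), dt=0.0))  # -> [0.0, 2.5, 5.0]
```

For the image-input mode the same pattern applies, with each sub-observation being a 64×64 frame stacked along the channel axis instead of flattened.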
At the beginning of each learning trial, the actor and critic neural network parameters were randomly initialized. We allowed the RL agent to train for a maximum of 100,000 time steps, using a fixed ratio of one gradient update per environmental step, which has been shown to reduce training speed in exchange for higher training stability.[73] At the beginning of training, the agent was typically observed to reach the target and receive the bonus reward within the allotted time in approximately 25% of episodes, based on random sequences of actions. 100,000 environment steps provided adequate time to learn effective actuation policies that continuously moved the HAMR clockwise around the arena, both with state vector input and raw images (Figure 3c). It is commonly reported that reinforcement learning requires several million environmental steps to derive a successful policy,[38, 72] so this result demonstrates the sample efficiency of SAC, which is critically important for RL tasks that are trained on physical systems instead of in simulation. A time-lapse movie of the HAMR recorded during the learning process is shown in supplementary movie 1. We tracked the net movement of the HAMR during the training process, and each training session ended with the microrobot going continuously around the track in a clockwise direction (Figure 3d). After 100,000 steps, the state vector-based policies achieved a significantly higher overall level of performance, succeeding in approximately 90% of episodes compared to approximately 50% for the image-based input. It is likely that the raw-image-based policies could have benefited from longer training periods.[61]
Once the training sessions were complete, we evaluated the learned policies to test their performance. For each training session, we used the highest performing policy parameters, identified by monitoring a rolling average of the return over the last 100 episodes and saving the policy parameters each time the rolling average exceeded the previous best (Supplementary Figure 1). This was done because we sometimes observed a drop in performance after the peak performance was achieved in training, possibly due to overfitting. Early stopping, or selecting a policy before performance degradation has occurred, is a common technique used to prevent overfitting in neural networks.[74] Successful policies were able to move the robot indefinitely around the complete circular track (Figure 3e, Supplementary Movie 2).
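The rolling-average checkpointing used for early stopping can be sketched as a small helper class; the 100-episode window matches the text, while the class and method names are illustrative:

```python
from collections import deque

class BestPolicyTracker:
    """Snapshot the policy parameters whose rolling average return over
    the last `window` episodes is the best seen so far; an early-stopping
    guard against late-training performance drops (a sketch)."""

    def __init__(self, window=100):
        self.returns = deque(maxlen=window)   # oldest returns drop off
        self.best_avg = float("-inf")
        self.best_params = None

    def update(self, episode_return, params):
        """Record one episode's return; keep a snapshot of `params` if
        the rolling average improved. Returns the current average."""
        self.returns.append(episode_return)
        avg = sum(self.returns) / len(self.returns)
        if avg > self.best_avg:
            self.best_avg = avg
            self.best_params = params
        return avg
```

At evaluation time, `best_params` rather than the final training parameters would be loaded into the policy network.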
We recorded each action taken by the agent during training sessions and the resultant change in state of the microrobot. For a single training run with state-vector input, we plotted the distribution of actions as a function of the resultant change in HAMR position, Δθ (Figure 4a). Note that HAMR velocities in Figure 4 are indicated as positive for negative values of Δθ, indicating travel in the clockwise direction. We separated the total 100k steps into 5 bins of 20k steps each. The distribution of actions over the first 20k time steps (Figure 4a, i) is centered around a sharp peak of actions which result in no net movement, as we would expect from an agent with little experience randomly exploring the space of possible actions. By the second batch of 20k steps (Figure 4a, ii), the action distribution became bimodal: most actions still resulted in no net movement, but a second peak on the positive side indicates a trend towards selecting actions which result in clockwise movement. However, during this phase of training, the net motion of the robot remained close to zero (Figure 4b) because of a fattening of the negative tail of the action distribution. As the learning process continued, the distribution continued to shift until the average movement was clockwise, with a second peak around 5 degrees per time step and a narrow tail representing the few actions which caused the robot to move in the counterclockwise direction (Figure 4a, v).
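The chunking used for this analysis can be sketched as follows; the histogram bin count and range are assumptions for illustration:

```python
import numpy as np

def velocity_histograms(delta_theta, n_chunks=5, bins=40,
                        hist_range=(-10.0, 10.0)):
    """Split a training log of per-step position changes (degrees per
    step; negative delta-theta corresponds to clockwise travel) into
    n_chunks consecutive chunks, e.g. 5 chunks of 20k steps, and
    histogram each one, mirroring the analysis in Figure 4a (a sketch)."""
    chunks = np.array_split(np.asarray(delta_theta, dtype=float), n_chunks)
    return [np.histogram(c, bins=bins, range=hist_range) for c in chunks]
```

Averaging each chunk instead of histogramming it gives the net-velocity trace of Figure 4b.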
Figure 4. Evaluating the learning performance of the RL agent during multiple training sessions.

(a) We recorded the actions and state of the agent over 100,000 training steps. The actions taken by the agent at the beginning of the training session resulted in no net forward motion (i). As the agent learned, the action distribution became bimodal (ii-v), with a second peak at ~5 degrees per action, indicating the agent was increasingly taking actions which resulted in forward movement. (b) Averaging the velocity distribution shows that between 40k and 60k steps the agent learned to achieve net positive movement. (c) Following three training sessions for the RL agent, we evaluated the performance of the learned stochastic policy, π(a|s), and the deterministic policy, a = μ(s), for 3k steps without additional learning. For each policy, the performance was higher while selecting actions deterministically from μ(s). (d) Evaluation of deterministic and stochastic action selection for image-based policies. (e) The average velocity of the HAMR during evaluation for each input type and action selection method, ± standard deviation. (f) Actions taken by the policies during deterministic evaluation plotted according to the position θ at the beginning of the action, and color coded according to the resultant velocity during that action, with red indicating clockwise movement and blue indicating counterclockwise. (g) Schematic showing the control parameters for the sine waves used to drive the current in the Magneturret. The RL agent had control over the magnitude, M, and the phase angle, φ, in the X and Y coils. (h) Schematic showing the angular position of the robot in the circular channel.
The soft actor critic algorithm learns a continuous stochastic policy, π(a|s), sampling actions during training according to a ∼ π(a|s), in which actions are randomly drawn from a Gaussian distribution whose mean μ and variance the agent learns over the course of training.[48] This is done in order to explore the space of possible actions during training. During training, the agent seeks to balance the sum of future rewards with the information entropy of the policy by maximizing an entropy regularized objective function, and the policy entropy corresponds to the explore/exploit tradeoff the agent makes during training. However, once the policies were trained, performance during policy evaluation could be increased by selecting actions from the mean of the distribution without further stochastic exploration, according to a = μ(s). This deterministic evaluation led to an increase in the proportion of actions taken by the agent which resulted in positive motion for both state-based (Figure 4c) and image-based agents (Figure 4d). We compared the total average velocity achieved by all the trained policies in both deterministic and stochastic action selection modes, which showed that deterministic action selection led to higher performance (Figure 4e). This is consistent with our expectation that the reward function we used would optimize the agent for velocity, and that removing the additional exploratory behavior of the stochastic agent would manifest as faster swimming.
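The difference between stochastic and deterministic action selection can be sketched as below; the tanh squashing of actions into [−1, 1] follows the standard SAC formulation and is an assumption about this system:

```python
import numpy as np

def select_action(mu, log_std, deterministic=False, rng=None):
    """SAC-style action selection from a Gaussian policy (a sketch).

    During training, actions are sampled a ~ N(mu, sigma^2) to explore
    the action space; for deterministic evaluation the mean mu is used
    directly, removing the exploration noise."""
    mu = np.asarray(mu, dtype=float)
    if deterministic:
        pre_squash = mu                                   # a = mu(s)
    else:
        rng = rng or np.random.default_rng()
        sigma = np.exp(np.asarray(log_std, dtype=float))
        pre_squash = rng.normal(mu, sigma)                # a ~ pi(a|s)
    return np.tanh(pre_squash)                            # actions in [-1, 1]

# Deterministic evaluation of a 4-dimensional action (Mx, phi_x, My, phi_y):
print(select_action([2.0, 0.0, -2.0, 1.0], log_std=[0.0] * 4,
                    deterministic=True))
```

In the network itself, `mu` and `log_std` would be the outputs of the actor's final layer for the current state.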
We next examined the distribution of the action values chosen by the RL agent when evaluated deterministically according to a = μ(s). For each of the four actions taken by the policy over 3000 time steps, we plotted the value of the action against the position of the HAMR, θ (Figure 4f). The plotted actions are color-coded according to Δθ, with red actions indicating clockwise forward motion and blue actions indicating retrograde motion. The majority of actions taken by each of the six policies during evaluation resulted in positive motion. Each of the three policies trained using state-vector based input followed similar patterns, in which the phase angle of the X coil was held constant and the magnitude of the X coil varied according to the position of the microrobot, θ. The Y coil was controlled by actuating the phase angle as a function of position and holding the magnitude relatively constant. In contrast, the policies learned by the image-based agents were more heterogeneous, finding different ways to manipulate the four actions in order to produce forward motion. One consistent pattern across all learned policies is that the magnitudes tended to hold steady close to the maximum or minimum values of −1 and +1, regardless of θ. This results in the largest-amplitude current sine waves, which we would expect because stronger magnetic fields are able to create more powerful torques on the HAMR. The action distribution in Figure 4f is plotted against the position of the HAMR, which was the component of the state vector that correlated most strongly with variations in the policy action distribution. For reference, the action distribution is plotted versus the remaining time in the episode in supplementary figure S2, versus the previous action in supplementary figure S3, and versus the velocity in supplementary figure S4.
2.4. Optimizing the RL-Trained Policies Via Regression
We observed that the microrobot control policies learned by the RL agent sometimes performed actions that were obviously non-optimal (resulting in counterclockwise motion). We could likely further increase the performance of the learned policies by using techniques like hyperparameter tuning and longer training times.[73] However, by observing the behavior of the RL agent, we hypothesized that if we could distill the policies learned by the network into mathematical functions of the state variables, we might achieve a higher level of performance (Figure 5a). To test this, we chose one of the state-based policies and one of the image-based policies and fit regression models to the data in order to create continuous control signals as a function of the robot position, θ. First, we examined policy 1, learned by the state-vector based agent (Figure 4f). This policy acted by modulating the magnitude in the X coil in what approximated a square wave pattern and modulating the phase angle in the Y coil in what appeared closer to a sine wave. The other two actions were held approximately constant regardless of the position of the robot. From all 3000 actions taken during policy evaluation, we selected the subset of actions which had resulted in a positive movement of at least 3°, discarding the lower performing actions for this analysis. We then fit sinusoidal regression models to the Mx and φy action distributions, and also fit a square wave to Mx (Figure 5b). The resulting policies are shown in Figure 5b as solid black lines superimposed over the action distribution. The sine wave policy (Figure 5c) and the mixed sine/square wave policy (Figure 5d) that we developed with the regression models were then used to control the HAMR. The image-based policy that we evaluated was significantly more complex than the state-based policy (Figure 5e). We modeled this policy mathematically by fitting a 20th order polynomial to the data, and used this polynomial policy to drive the HAMR (Figure 5f).
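The sinusoidal regression can be sketched as a linear least-squares fit on a sin/cos basis; the function name and synthetic data below are illustrative:

```python
import numpy as np

def fit_sinusoid(theta_deg, actions):
    """Fit a(theta) = A*sin(theta) + B*cos(theta) + C by linear least
    squares, which is equivalent to M*sin(theta + phi) + C; a sketch of
    the sinusoidal regression used to distill a learned policy into a
    continuous function of the robot position theta."""
    th = np.radians(np.asarray(theta_deg, dtype=float))
    X = np.column_stack([np.sin(th), np.cos(th), np.ones_like(th)])
    coef, *_ = np.linalg.lstsq(X, np.asarray(actions, dtype=float),
                               rcond=None)
    return coef  # (A, B, C)

# Recover known coefficients from synthetic "actions" over one full lap:
theta = np.linspace(0.0, 360.0, 200)
actions = 0.8 * np.sin(np.radians(theta)) + 0.1
A, B, C = fit_sinusoid(theta, actions)
```

The square-wave and polynomial fits follow the same pattern with a different basis; in our analysis the fits were applied only to the subset of actions yielding at least 3° of positive movement.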
Figure 5. Control policies learned by the RL agent could be translated into continuous functions in order to increase performance.
(a) The policies learned by the RL agent were used as a basis to derive higher-performance policies via regression of the positive action distribution. (b) For one of the policies that the RL agent learned, the actions were plotted as a function of θ, and mathematical functions were fit to the subset of actions that yielded HAMR velocities greater than 3 degrees per step. Sine waves and square waves were fit to the data via regression (shown in black), and those mathematical functions were used to control the HAMR for 1000 time steps. (c) The results of running the sinusoidal policy and (d) the sine/square policy. (e) The policy learned by the image-based agent was fit with a 20th degree polynomial function. (f) The polynomial policy was used to drive the HAMR for 1000 time steps. (g) Comparing the average velocity of the HAMR when controlled by each policy.
The sinusoidal policy achieved the highest average HAMR velocity of all policies tested in this study (Figure 5g), while the square/sine policy performed slightly worse than the neural network policy on which it was based. Despite the complexity of the polynomial policy compared to the sine and square wave-based policies, its performance was approximately equal to that of the sine/square wave policy and superior to the neural network-based policy upon which it was based. Since these mathematical policies use only θ as the input, it is possible that we could further increase the performance of mathematically inferred policies by taking other parts of the state vector into account.
2.5. Control policies learned by the RL agent recapitulate the behavior of optimal policies based on physical models.
Finally, we wanted to know whether the policies learned by the RL agent were modulating the magnetic field in a way that matched the control systems that human researchers have developed based on physical models. When using a uniform rotating magnetic field to steer a helical microrobot, the rotating magnetic field is usually made to rotate about the helical axis of the microrobot, which is also the direction in which the microrobot will swim (Figure 6a).[70] Therefore, to drive a helical microrobot around a circular track like the one we used, we would expect that the direction of the rotating magnetic field would be tangent to the circular track at all points along the circle for the optimal policy.
Figure 6. Control policies learned by the RL agent recapitulate the behavior of optimal policies based on theoretical physical models.

(a) Magnetic helical microrobots swim with propeller-like motion, transducing magnetic field rotation into torque, torque into angular velocity, and angular velocity into linear velocity in the direction of travel. The theoretical optimal policy for controlling a magnetic helical microrobot in a circular channel would be to create a rotating magnetic field perpendicular to the direction of travel of the robot at each point in the circle, so that the microrobot moves tangent to the circle at all points along the track [70]. Plotting a vector perpendicular to the calculated plane of magnetic field rotation for each action at each point along the circular track demonstrates that each policy learned by the RL agent recapitulated this theoretical optimal policy behavior. Vectors are color coded according to their azimuthal angle. (b) The inferred mathematical policies. (c) The image trained policies. (d) The state trained policies.
Although we did not record the magnetic field at all points along the track during policy evaluation, we can approximate the behavior of the magnetic fields based on the actions taken by the agent. We used the actions recorded while each policy drove the HAMR around the track to estimate the magnetic field produced during each action. To do this, we constructed a three-dimensional vector B(t) from the sinusoidal coil currents, with components Mx sin(ωt + φx) and My sin(ωt + φy) in the X and Y directions, where the magnitudes and phase angles were selected by the policy. The actions taken by the agent run for a total of 0.9 seconds during each time step, during which time the actions are held constant and B(t) rotates as a function of time with an angular frequency of 100 rad/s. Taking the cross product of B(t) evaluated at two different times results in a vector n which points in the direction perpendicular to the plane of the rotating magnetic field described by B(t). By calculating the azimuthal angle, ψ, of n, we can approximate the direction of the rotating magnetic field during an action taken by the policy. The results of this analysis are shown by plotting an arrow with direction ψ at the point along the circular track for each action taken by the policy. Results are shown for the inferred mathematical policies (Figure 6b), the image-trained policies (Figure 6c), and the state-trained policies (Figure 6d). Based on this mathematical approximation of the magnetic field direction, each policy learned by the RL agent, regardless of input type, tended to create a rotating magnetic field nearly perpendicular to the direction of travel of the microrobot. This recapitulates the behavior of the theoretical optimal policy that we would expect based on a physical analysis of helical swimming magnetic microrobots.[70] These data also shed light on why the sine function was the highest performing policy that we tested: the sine function most closely approximates the theoretical optimal policy in this analysis.
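This field-direction analysis can be sketched numerically; the fixed Z-coil amplitude and the exact component ordering are assumptions of the sketch:

```python
import numpy as np

def field_rotation_axis(Mx, phi_x, My, phi_y, Mz=1.0, omega=100.0):
    """Approximate the axis of field rotation for one action (a sketch).

    The field is modeled as B(t) = [Mx*sin(omega*t + phi_x),
    My*sin(omega*t + phi_y), Mz*sin(omega*t)], with the Z-coil amplitude
    Mz assumed fixed. The cross product of B at two instants is
    perpendicular to the plane of rotation; the azimuthal angle psi of
    that normal approximates the heading of the rotating field."""
    def B(t):
        return np.array([Mx * np.sin(omega * t + phi_x),
                         My * np.sin(omega * t + phi_y),
                         Mz * np.sin(omega * t)])
    n = np.cross(B(0.0), B(np.pi / (2.0 * omega)))  # a quarter period apart
    n = n / np.linalg.norm(n)
    psi = np.arctan2(n[1], n[0])
    return n, psi

# A field rotating in the x-z plane has its rotation axis along -y,
# i.e. psi = -90 degrees:
n, psi = field_rotation_axis(Mx=1.0, phi_x=np.pi / 2, My=0.0, phi_y=0.0)
print(n, np.degrees(psi))
```

Plotting the unit vector n at the HAMR's position θ for every recorded action reproduces the arrow plots of Figure 6.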
3. Discussion
Here, we have reported the development of a closed-loop control system for magnetic helical microrobots, which was implemented using reinforcement learning to discover control policies without the need for any dynamic system modeling. Continuous control policies for high-dimensional action spaces were represented by deep neural networks for effective control of magnetic fields to actuate a helical microrobot within a fluid-filled lumen. Multiple high-dimensional state representations including state-vector and raw images were sufficient to represent the state of the microrobot and yield successfully trained policies. Compared with previously reported control systems for magnetic microrobots,[24] we believe that the system we have presented possesses a number of key advantages. Electromagnetic actuation systems for microrobots are either air core, such as Helmholtz coils and Maxwell coils, or contain soft magnetic materials in the core which enhance the strength of the generated magnetic field, but can lead to nonlinearities when summing the combined effect of fields from multiple coils with saturated magnetic cores.[75] Such nonlinearities make modeling the behavior of the system more difficult,[76] particularly when the coils are run with high enough power to magnetically saturate the core material. Additionally, when controlling microrobots with permanent magnets, those magnets are often modeled as dipole sources for simplicity,[77] and the actual behavior of the physical system may not match the idealized model behavior. Neural network-based controllers trained with RL learn control policies from observing the actual behavior of the physical system, and deep neural networks can accurately model non-linear functions.[78] Control policies learned with RL will automatically take into account the real system dynamics, and this model-free control approach can greatly simplify the job of the microrobotic engineer.
Significant work has been done to create accurate dynamic models of rigid helical microrobots, and many microrobots are relatively straightforward to control using these models and classical control systems.[49] However, many recently developed microrobotic systems are composed of soft, shape changing materials, which are inherently harder to model than rigid bodies.[5, 13, 15, 17, 18] Soft microrobots such as helical grippers, which can kinematically change their configuration during operation, might also be difficult to accurately model.[7] Here, we have shown that our algorithm was able to control a soft helical microrobot without any dynamic modeling on the part of the control system designers. RL based microrobot control could enhance both the capabilities of novel microrobot designs and increase the efficiency of researchers by allowing the RL agent to do the work of developing a high-performance controller. RL based controllers may be able to exceed the performance of classical control systems based on simplified models (e.g., linearized models) because the RL agent is able to learn based on the observed physical behavior of the system, and deep neural networks are capable of accurately modeling any observed nonlinearities that the microrobotic system might exhibit.
Our choice to use a model-free reinforcement learning algorithm trained from scratch on a physical robotic system does entail potential downsides that are worth noting. In many robotic applications, training physical robots with reinforcement learning is impractical due to the constraints imposed by the physical system or the task, particularly when safety is critical and exploration is costly.[60] This could make it difficult to amass a sufficient quantity of training data to train a high-performance system. One potential approach to scale up real-world learning time is to multiplex robot training with many concurrently learning robots performing the same task.[79]
Highly complex microrobots, which exhibit significant kinematic complexity and deformability, could be infeasible to train purely with model-free approaches because of the additional time it would require to train the system to fully explore the state space. For many complex robotic systems, transfer learning has been shown to be effective, in which a simulation of the physical system is used to amass a large quantity of training data in silico, and then the final control system is fine-tuned with training on the physical system.[80–82] There will likely continue to be a tradeoff between the difficulty and utility of constructing a useful system model for in silico training, and the practicality of collecting large amounts of real-world training data.
Other control strategies that have been successfully applied for microrobot control could be enhanced with RL. Algorithms which have been used to control soft microrobots such as force-current mapping with PID control and path planning algorithms[5, 24, 76] could potentially be combined with reinforcement learning in order to optimize the gains in the PID controllers, and adapt to changes in environmental conditions by a process of continuous learning, or to optimize for multiple variables. Force-current mapping algorithms used to control microrobots are also often created with assumptions of linearity in magnetic field superposition, which could be violated with soft magnetic cores in the driving coils,[53] a limitation that could be potentially overcome by the nonlinear function approximation capabilities of deep neural networks. Another potential combination of RL with classical control has been demonstrated by Zeng et al.,[40] who used a residual physics model in order to fine tune a physics-based model of a grasping and throwing problem for a robotic arm. Using such methods, imprecise models of microrobot dynamics could be improved by fine tuning their parameters based on a data driven RL approach, leading to increased performance.
We have demonstrated that it is possible to learn microrobot control policies with reinforcement learning based on no prior knowledge, and then fine tune the performance of the policy by fitting continuous mathematical functions to the learned policy behaviors. While the sine and sine/square policies that we created based on analyzing the learned policies might have been deduced by a first-principles analysis of the problem, this is certainly not the case for the polynomial policy we derived from the image-based input policy, which does not map to our intuitions of how to actuate a magnetic helical microrobot. We believe that this result provides strong support for the idea that a deep neural network trained to control microrobots with reinforcement learning is likely to arrive at policies that are unintuitive, and could potentially uncover useful behaviors that would not be suspected or created by human engineers. Furthermore, our analysis of the direction of the rotating magnetic fields created by the learned policies strongly supports the idea that the RL agent reliably developed near-optimal behavior, which matches the behavior of a rationally designed controller. This suggests that if RL were applied to a more complex microrobotic system for which no good models of optimal behavior were available, the RL agent might be able to autonomously identify the best way to control the system. This ability to detect subtle patterns from high dimensional data in a model-free RL approach could ultimately lead to state-of-the-art control policies that exceed the performance of human-designed policies, as has been seen with reinforcement learning algorithms in tasks like playing Go[43] and classic Atari games.[72]
In our experimentation, we found that the RL agent could learn successful policies from both state vector input and raw camera images. This input flexibility demonstrates that RL could be applicable for a broad class of biomedical imaging modes in which the state of the system might be represented by MRI, X-ray, ultrasound, or other biomedical imaging methods.[25] Our results are consistent with the findings of Haarnoja et al.,[61] in that the use of raw images as input requires more training time in order to develop high quality policies compared to state vector input. Using higher-dimensional inputs like images has the potential to encode richer policies which respond to objects in the field of view, such as obstacles which could impede the forward progress of the microrobot but would not be observable from the lower dimensional feedback available in a state vector representation. In complex environments in which environmental factors such as lumen shape, fluid flow profiles, surface interactions, and biological interactions are likely to be significant,[2] the ability to use machine vision for state representation could significantly improve microrobot performance. While the HAMR in this study was not observed to exhibit significant deformability, an image-based input to an RL control system could also help to observe and control more kinematically complex microrobots by encoding the configuration of the robot in the state representation. All these points strongly favor the use of RL for developing the next generation of microrobot control systems.
4. Methods
4.1. Helical agar magnetic robot.
The HAMR was constructed based on a method published by Hunter et al.[4] The structure of the robot was formed from a 2% w/v agar-based hydrogel (Fisher Cat. No. BP1423–500). The agar was melted to above 80 degrees Celsius and mixed with iron oxide nanopowder (Sigma Aldrich Cat. No. 637106) to a total concentration of 10% w/v. This mix was injected into a helical mold 3D printed on an Elegoo Mars stereolithography printer to form a helical microrobot 4.4 mm in length. The microrobot was manually removed from the mold after cooling and solidifying, and stored in deionized water until use. Because the yield of this batch fabrication technique was not 100%, robots used for subsequent experiments were chosen based on their morphology and responsiveness to magnetic fields.
4.2. Circular swimming arena.
The PDMS swimming arena was created by molding Sylgard 184 elastomer (Sigma Aldrich Cat. No. 761036) over a thin 3 mm tall section of polyvinyl chloride pipe (31 mm inner diameter, 34 mm outer diameter). Access holes for the microrobot were cut, and then the molded PDMS was plasma bonded to a thin uniform sheet of PDMS to close the channel and cured overnight at 65 °C.
4.3. Magneturret.
The Magneturret was constructed by winding 6 identical coils with 400 turns each of 30-gauge magnet wire (Remington Industries Cat. No. 30H200P) around a 0.26 inch diameter permalloy core (National Electronic Alloys Cat. No. HY-MU 80 Rod .260 AS DRAWN) cut to a length of 20 mm. These coils were fixed to the sides of an acrylonitrile butadiene styrene (ABS) 3D printed cube with quick-set epoxy. The coil assembly was enclosed in a 3D printed housing printed in Zortrax Z-Glass filament with a Zortrax M200 printer and sealed with epoxy. Glycerol coolant was pumped through the housing with a liquid CPU cooling system (Thermaltake Cat. No. CL-W253-CU12SW-A). The coils were energized by creating sinusoidal currents with an Arduino STEMtera breadboard, which took serial commands from the RL agent over USB and turned them into PWM signals that were sent to two Pololu Dual G2 High-Power Motor Driver 24v14 shields. The coils were powered by a Madewell 24 V DC power supply.
4.4. Overhead camera.
The overhead camera was an Alvium 1800 U-500c with a 6 mm fixed focal length lens from Edmund Optics. The camera used to capture state images was set to a long exposure so that the HAMR and the center mark were the only visible objects in the image. A second, identical camera placed above the arena at a slight angle simultaneously recorded normal-exposure video of the HAMR in the arena during operation, so that the features in this video were not washed out.
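The long-exposure imaging trick above reduces state extraction to finding bright blobs in an otherwise dark frame. A minimal, dependency-free sketch of that idea is shown below; the function name and threshold value are assumptions for illustration, not the preprocessing actually used in this work.

```python
def bright_centroid(image, threshold=200):
    """Return the (row, col) centroid of pixels brighter than threshold.

    `image` is a 2D grayscale frame (list of rows of pixel values).
    Under a long exposure like the one described above, only the HAMR
    and the center mark exceed the threshold. Returns None if no pixel
    qualifies.
    """
    count, r_sum, c_sum = 0, 0.0, 0.0
    for r, row in enumerate(image):
        for c, px in enumerate(row):
            if px > threshold:
                count += 1
                r_sum += r
                c_sum += c
    if count == 0:
        return None
    return (r_sum / count, c_sum / count)
```

In practice one would segment the two blobs separately (e.g. by connected components) to recover both the robot position and the center mark, but the thresholding step is the same.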
4.5. RL algorithm.
The soft actor critic RL agent was implemented in Python, using TensorFlow 2.0 for the neural network models, and was run on a desktop workstation from Lambda Labs. Separate processes were used for data collection and for updating the neural networks so that the two operations could run in parallel. Full algorithm details are available in Supplementary Algorithm 1.
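The parallel collector/learner pattern described above can be sketched in a few lines. This is a simplified illustration, not the published implementation (see Supplementary Algorithm 1): it uses threads and a lock-guarded buffer for brevity where the actual system used separate processes, and all names (`ReplayBuffer`, `collector`, `learner`) are hypothetical.

```python
import random
import threading
from collections import deque

class ReplayBuffer:
    """Thread-safe FIFO replay buffer shared by collector and learner."""
    def __init__(self, capacity=100_000):
        self._buf = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def add(self, transition):
        with self._lock:
            self._buf.append(transition)

    def sample(self, batch_size):
        # Return a random minibatch, or None until enough data exists.
        with self._lock:
            if len(self._buf) < batch_size:
                return None
            return random.sample(list(self._buf), batch_size)

def collector(buffer, env_step, n_steps):
    # Data-collection loop: step the environment and store transitions.
    for _ in range(n_steps):
        buffer.add(env_step())

def learner(buffer, update_fn, n_updates, batch_size=32):
    # Training loop: sample minibatches and update the networks.
    done = 0
    while done < n_updates:
        batch = buffer.sample(batch_size)
        if batch is not None:
            update_fn(batch)
            done += 1
```

Running `collector` and `learner` concurrently decouples the real-time control loop (which must issue magnet commands at a fixed rate) from the comparatively slow gradient updates.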
Supplementary Material
Supplementary Figure S1: Highest performing policy parameters during training
Supplementary Figure S2: Policy action distribution as a function of remaining time
Supplementary Figure S3: Policy action distribution as a function of the previous action
Supplementary Figure S4: Policy action distribution as a function of the velocity
Supplementary Figure S5: Comparison of mean policy action distribution vs raw policy action distribution
Table S1: Hyperparameters
Movie S1: Training process time course.
Movie S2: HAMR swimming around the track
Acknowledgements
Funding
National Institutes of Health through the Director’s New Innovator Award, DP2-GM132934 (WCR)
Air Force Office of Scientific Research, FA9550-18-1-0262 (WCR)
National Science Foundation, DMR 1709238 (WCR)
Office of Naval Research, N00014-17-1-2306 (WCR)
Army Research Laboratory W911NF2120208 (WCR)
National Institutes of Health Cellular Approaches to Tissue Engineering and Regeneration Training Program traineeship T32-EB001026 (MRB)
William Kepler Whiteford Faculty Fellowship from the Swanson School of Engineering at the University of Pittsburgh (WCR)
We gratefully thank Haley C. Fuller and Ting-Yen Wei for scientific discussions and manuscript suggestions.
Footnotes
Competing Interests
The authors declare that they have no competing interests.
Data and Materials Availability
All data are available in the main text or the supplementary materials. The code used in this work is available at https://github.com/Synthetic-Automated-Systems/RUDER_MBOT_RL
References
- [1].Wang B, et al., Trends in micro‐/nanorobotics: materials development, actuation, localization, and system integration for biomedical applications. Advanced Materials, 2021. 33(4): p. 2002047.
- [2].Ceylan H, et al., Translational prospects of untethered medical microrobots. Progress in Biomedical Engineering, 2019. 1(1): p. 012002.
- [3].Jeon S, et al., Magnetically actuated microrobots as a platform for stem cell transplantation. Science Robotics, 2019. 4(30): p. eaav4317.
- [4].Hunter EE, et al., Toward soft micro bio robots for cellular and chemical delivery. IEEE Robotics and Automation Letters, 2018. 3(3): p. 1592–1599.
- [5].Scheggi S, et al., Magnetic motion control and planning of untethered soft grippers using ultrasound image feedback. IEEE.
- [6].Dong M, et al., 3D‐printed soft magnetoelectric microswimmers for delivery and differentiation of neuron‐like cells. Advanced Functional Materials, 2020. 30(17): p. 1910323.
- [7].Jin Q, et al., Untethered single cell grippers for active biopsy. Nano Letters, 2020. 20(7): p. 5383–5390.
- [8].Park J, et al., Magnetically actuated degradable microrobots for actively controlled drug release and hyperthermia therapy. Advanced Healthcare Materials, 2019. 8(16): p. 1900213.
- [9].Hosney A, et al., In vitro validation of clearing clogged vessels using microrobots. IEEE.
- [10].Medina-Sánchez M and Schmidt OG, Medical microbots need better imaging and control. Nature, 2017. 545(7655): p. 406–408.
- [11].Wang X, et al., 3D printed enzymatically biodegradable soft helical microswimmers. Advanced Functional Materials, 2018. 28(45): p. 1804107.
- [12].Alapan Y, et al., Multifunctional surface microrollers for targeted cargo delivery in physiological blood flow. Science Robotics, 2020. 5(42): p. eaba5726.
- [13].Zhang J, et al., Voxelated three-dimensional miniature magnetic soft machines via multimaterial heterogeneous assembly. Science Robotics, 2021. 6(53): p. eabf0112.
- [14].Palagi S and Fischer P, Bioinspired microrobots. Nature Reviews Materials, 2018. 3(6): p. 113–124.
- [15].Hu W, et al., Small-scale soft-bodied robot with multimodal locomotion. Nature, 2018. 554(7690): p. 81–85.
- [16].Breger JC, et al., Self-folding thermo-magnetically responsive soft microgrippers. ACS Applied Materials & Interfaces, 2015. 7(5): p. 3398–3405.
- [17].Medina‐Sánchez M, et al., Swimming microrobots: Soft, reconfigurable, and smart. Advanced Functional Materials, 2018. 28(25): p. 1707228.
- [18].Xu T, et al., Millimeter-scale flexible robots with programmable three-dimensional magnetization and motions. Science Robotics, 2019. 4(29): p. eaav4494.
- [19].Ricotti L, et al., Biohybrid actuators for robotics: A review of devices actuated by living cells. Science Robotics, 2017. 2(12): p. eaaq0495.
- [20].Zhang H, et al., Dual-responsive biohybrid neutrobots for active target delivery. Science Robotics, 2021. 6(52): p. eaaz9519.
- [21].Xu H, et al., Sperm micromotors for cargo delivery through flowing blood. ACS Nano, 2020. 14(3): p. 2982–2993.
- [22].Alapan Y, et al., Soft erythrocyte-based bacterial microswimmers for cargo delivery. Science Robotics, 2018. 3(17): p. eaar4423.
- [23].Wei T-Y and Ruder WC, Engineering control circuits for molecular robots using synthetic biology. APL Materials, 2020. 8(10): p. 101104.
- [24].Xu T, et al., Magnetic actuation based motion control for microrobots: An overview. Micromachines, 2015. 6(9): p. 1346–1364.
- [25].Aziz A, et al., Medical imaging of microrobots: Toward in vivo applications. ACS Nano, 2020. 14(9): p. 10865–10893.
- [26].Yasa IC, et al., Elucidating the interaction dynamics between microswimmer body and immune system for medical microrobots. Science Robotics, 2020. 5(43): p. eaaz3867.
- [27].Stock JB, Ninfa AJ, and Stock AM, Protein phosphorylation and regulation of adaptive responses in bacteria. Microbiological Reviews, 1989. 53(4): p. 450–490.
- [28].Tagkopoulos I, Liu Y-C, and Tavazoie S, Predictive behavior within microbial genetic networks. Science, 2008. 320(5881): p. 1313–1317.
- [29].Ardiel EL and Rankin CH, An elegant mind: learning and memory in Caenorhabditis elegans. Learning & Memory, 2010. 17(4): p. 191–201.
- [30].Giurfa M, Learning and cognition in insects. WIREs Cognitive Science, 2015. 6(4): p. 383–395.
- [31].Calvo Garzón P and Keijzer F, Plants: Adaptive behavior, root-brains, and minimal cognition. Adaptive Behavior, 2011. 19(3): p. 155–171.
- [32].Bonilla FA and Oettgen HC, Adaptive immunity. Journal of Allergy and Clinical Immunology, 2010. 125(2, Supplement 2): p. S33–S40.
- [33].Dickinson A, Contemporary animal learning theory. Vol. 1. 1980: CUP Archive.
- [34].Tsang ACH, et al., Roads to smart artificial microswimmers. Advanced Intelligent Systems, 2020. 2(8): p. 1900137.
- [35].Sutton RS and Barto AG, Introduction to reinforcement learning. 1998.
- [36].Kober J, Bagnell JA, and Peters J, Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013. 32(11): p. 1238–1274.
- [37].Haarnoja T, et al., Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103, 2018.
- [38].Zhu H, et al., The ingredients of real-world robotic reinforcement learning. arXiv preprint arXiv:2004.12570, 2020.
- [39].Turan M, et al., Learning to navigate endoscopic capsule robots. IEEE Robotics and Automation Letters, 2019. 4(3): p. 3075–3082.
- [40].Zeng A, et al., TossingBot: Learning to throw arbitrary objects with residual physics. IEEE Transactions on Robotics, 2020. 36(4): p. 1307–1319.
- [41].Tan J, et al., Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332, 2018.
- [42].Mahmood AR, et al., Benchmarking reinforcement learning algorithms on real-world robots. PMLR.
- [43].Silver D, et al., Mastering the game of go without human knowledge. Nature, 2017. 550(7676): p. 354–359.
- [44].Colabrese S, et al., Flow navigation by smart microswimmers via reinforcement learning. Physical Review Letters, 2017. 118(15): p. 158004.
- [45].Yang Y, Bevan MA, and Li B, Efficient navigation of colloidal robots in an unknown environment via deep reinforcement learning. Advanced Intelligent Systems, 2020. 2(1): p. 1900106.
- [46].Muiños-Landin S, et al., Reinforcement learning with artificial microswimmers. Science Robotics, 2021. 6(52): p. eabd9285.
- [47].Demir SO, et al., Task space adaptation via the learning of gait controllers of magnetic soft millirobots. The International Journal of Robotics Research, 2021: p. 02783649211021869.
- [48].Haarnoja T, et al., Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. PMLR.
- [49].Wang X, et al., Dynamic modeling of magnetic helical microrobots. IEEE Robotics and Automation Letters, 2021. 7(2): p. 1682–1688.
- [50].Charreyron SL, et al., Modeling electromagnetic navigation systems. IEEE Transactions on Robotics, 2021. 37(4): p. 1009–1021.
- [51].Zhang J, et al., Reliable grasping of three-dimensional untethered mobile magnetic microgripper for autonomous pick-and-place. IEEE Robotics and Automation Letters, 2017. 2(2): p. 835–840.
- [52].Elgeti J, Winkler RG, and Gompper G, Physics of microswimmers—single particle motion and collective behavior: a review. Reports on Progress in Physics, 2015. 78(5): p. 056601.
- [53].Denasi A and Misra S, A robust controller for micro-sized agents: The prescribed performance approach. IEEE.
- [54].Pawashe C, et al., Two-dimensional autonomous microparticle manipulation strategies for magnetic microrobots in fluidic environments. IEEE Transactions on Robotics, 2011. 28(2): p. 467–477.
- [55].Huang H-W, et al., Soft micromachines with programmable motility and morphology. Nature Communications, 2016. 7(1): p. 1–10.
- [56].Li H, et al., Magnetic actuated pH-responsive hydrogel-based soft micro-robot for targeted drug delivery. Smart Materials and Structures, 2016. 25(2): p. 027001.
- [57].Grzes M, Reward shaping in episodic reinforcement learning. 2017.
- [58].Hwang G, et al., Catalytic antimicrobial robots for biofilm eradication. Science Robotics, 2019. 4(29): p. eaaw2388.
- [59].Szepesvári C, Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2010. 4(1): p. 1–103.
- [60].Ibarz J, et al., How to train your robot with deep reinforcement learning: lessons we have learned. The International Journal of Robotics Research, 2021. 40(4–5): p. 698–721.
- [61].Haarnoja T, et al., Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
- [62].Yan X, et al., Multifunctional biohybrid magnetite microrobots for imaging-guided therapy. Science Robotics, 2017. 2(12): p. eaaq1155.
- [63].Sitti M, et al., Biomedical applications of untethered mobile milli/microrobots. Proceedings of the IEEE, 2015. 103(2): p. 205–224.
- [64].Abbott JJ, et al., How should microrobots swim? The International Journal of Robotics Research, 2009. 28(11–12): p. 1434–1447.
- [65].Medina-Sánchez M, et al., Cellular cargo delivery: Toward assisted fertilization by sperm-carrying micromotors. Nano Letters, 2016. 16(1): p. 555–561.
- [66].Lee S, et al., A needle‐type microrobot for targeted drug delivery by affixing to a microtissue. Advanced Healthcare Materials, 2020. 9(7): p. 1901697.
- [67].Hu C, Pané S, and Nelson BJ, Soft micro- and nanorobotics. Annual Review of Control, Robotics, and Autonomous Systems, 2018. 1: p. 53–75.
- [68].Nayar VT, et al., Elastic and viscoelastic characterization of agar. Journal of the Mechanical Behavior of Biomedical Materials, 2012. 7: p. 60–68.
- [69].Rus D and Tolley MT, Design, fabrication and control of soft robots. Nature, 2015. 521(7553): p. 467–475.
- [70].Wang X, et al., Dynamic modeling of magnetic helical microrobots. IEEE Robotics and Automation Letters, 2021: p. 1–1.
- [71].He J, et al., Hookworm detection in wireless capsule endoscopy images with deep learning. IEEE Transactions on Image Processing, 2018. 27(5): p. 2379–2392.
- [72].Mnih V, et al., Human-level control through deep reinforcement learning. Nature, 2015. 518(7540): p. 529–533.
- [73].Fu J, et al., Diagnosing bottlenecks in deep Q-learning algorithms. PMLR.
- [74].Caruana R, Lawrence S, and Giles C, Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. Advances in Neural Information Processing Systems, 2000. 13.
- [75].Kummer MP, et al., OctoMag: An electromagnetic system for 5-DOF wireless micromanipulation. IEEE Transactions on Robotics, 2010. 26(6): p. 1006–1017.
- [76].Ongaro F, et al., Design of an electromagnetic setup for independent three-dimensional control of pairs of identical and nonidentical microrobots. IEEE Transactions on Robotics, 2018. 35(1): p. 174–183.
- [77].Mahoney AW and Abbott JJ, Five-degree-of-freedom manipulation of an untethered magnetic device in fluid using a single permanent magnet with application in stomach capsule endoscopy. The International Journal of Robotics Research, 2016. 35(1–3): p. 129–147.
- [78].Arulkumaran K, et al., Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 2017. 34(6): p. 26–38.
- [79].Kalashnikov D, et al., QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
- [80].Lee J, et al., Learning quadrupedal locomotion over challenging terrain. Science Robotics, 2020. 5(47): p. eabc5986.
- [81].Wulfmeier M, Posner I, and Abbeel P, Mutual alignment transfer learning. PMLR.
- [82].Andrychowicz M, et al., Hindsight experience replay. Advances in Neural Information Processing Systems, 2017. 30.