As part of my development of a legged helicopter landing gear (more on this later), I started studying ways to minimize landing impact using a combination of leg retraction (active impedance) and elastic elements (passive impedance). This led me to the paper “Can Active Impedance Protect Robots from Landing Impact?” by Houman Dallali et al.
Although I won’t be going over all the technical details (especially regarding the passive impedance), I would like to share some of my results from using Reinforcement Learning based on Particle Filtering (Kormushev and Caldwell) for the task of cushioning a landing robot.
The robot model used in this simulation consisted of a hip body constrained to move along the vertical axis, connected to a 312 mm link (the thigh) by an unactuated revolute joint. This was in turn connected to a 434 mm link (the shank) by an actuated knee joint. Finally, an unactuated revolute joint connected the shank to a foot constrained to move along the same axis as the hip.
The general idea of the active impedance controller is this: first, model what a retraction reflex looks like. In the paper, a knee angle trajectory was generated by smoothing a square retraction reflex, and this trajectory was tracked using a PD controller on the difference between the desired and actual knee angle. The parameters of the generated trajectory in the original paper were the time to begin the reflex (relative to the impact time), the time to return to the original pose, and the magnitude of the reflex. In my simulation, I removed the return-time parameter, ending each trial with the robot in the fully retracted position, and added a parameter that controls the smoothing of the square reflex.
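As a rough sketch of this idea, the smoothed square reflex can be modeled as a slope-limited ramp from zero to the reflex magnitude, tracked by a PD law on the knee angle. The function names, the ramp-style smoothing, and the gain values below are my own illustrative assumptions, not taken from the paper or my simulation code.

```python
def reflex_trajectory(t, t_begin, smooth_velocity, magnitude):
    """Desired knee retraction angle at time t: a square reflex of the given
    magnitude starting at t_begin, smoothed into a ramp whose slope is
    limited to smooth_velocity (rad/s)."""
    if t < t_begin:
        return 0.0
    return min(magnitude, smooth_velocity * (t - t_begin))

def knee_torque(theta_des, theta, theta_dot, kp=80.0, kd=4.0):
    """PD tracking of the desired knee angle (gains are placeholders)."""
    return kp * (theta_des - theta) - kd * theta_dot
```

At each physics step, the simulation would evaluate `reflex_trajectory` at the current time and feed the result to `knee_torque` along with the measured knee state.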
The reinforcement learning algorithm attempts to find the set of parameters (begin time, smooth velocity, reflex magnitude) that optimizes the reward function. In my case, the reward function gives higher rewards to trials with a lower maximum hip acceleration, favoring a smoother landing. The algorithm alternates between exploration (global random sampling of the parameter space) and exploitation. The exploitation step simply picks a random particle, with higher rewards corresponding to a higher chance of selection; a new particle is then spawned near the selected one with some exponentially decaying noise added on. As more trials are performed, the chance of exploitation rises.
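The sampling step described above might look something like the following sketch. The reward shaping, the schedule constants, and the function names are all illustrative assumptions; only the overall structure (reward-weighted selection, exponentially decaying noise, exploitation probability rising with trial count) comes from the description above.

```python
import math
import random

def reward(peak_hip_accel):
    # One possible shaping: higher reward for lower peak hip acceleration.
    return 1.0 / (1.0 + peak_hip_accel)

def next_params(particles, bounds, trial,
                noise0=0.3, noise_decay=1e-3, exploit_ramp=2e-3):
    """Pick the next parameter set (begin time, smooth velocity, magnitude).

    particles: list of (params, reward) pairs already evaluated.
    bounds:    per-dimension (lo, hi) limits of the parameter space.
    All constants here are illustrative, not from the paper.
    """
    p_exploit = 1.0 - math.exp(-exploit_ramp * trial)  # rises toward 1 over trials
    if particles and random.random() < p_exploit:
        # Exploitation: reward-weighted random pick, perturbed by decaying noise.
        weights = [max(r, 1e-9) for _, r in particles]
        base = random.choices([p for p, _ in particles], weights=weights, k=1)[0]
        sigma = noise0 * math.exp(-noise_decay * trial)
        return [min(hi, max(lo, x + random.gauss(0.0, sigma * (hi - lo))))
                for x, (lo, hi) in zip(base, bounds)]
    # Exploration: uniform global sample of the parameter space.
    return [random.uniform(lo, hi) for lo, hi in bounds]
```

Each trial then simulates the landing with the returned parameters, computes the reward, and appends the new (params, reward) particle to the list.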
Although the learning algorithm as presented doesn’t include a culling step (removing all but the N best particles every fixed number of trials), I added one. Without it, the algorithm tends to add more and more particles around prior clusters, even after a better policy has been found through exploration. This is because the combined weight of the sub-optimal particles outweighs that of a superior, but isolated, particle.
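The culling step itself is simple; a minimal sketch, assuming particles are stored as (params, reward) pairs as above:

```python
def cull(particles, n_keep):
    """Keep only the n_keep highest-reward particles, so large clusters of
    mediocre particles can't outweigh a lone superior one."""
    return sorted(particles, key=lambda p: p[1], reverse=True)[:n_keep]
```

Running this every fixed number of trials resets the selection weights to favor only the best regions found so far.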
Simulation was performed using the CAGE engine (made by my brother and me, not available publicly yet). The engine uses Box2D for physics and OpenGL for rendering. In this simulation, I just wanted to get some experience with the reinforcement learning algorithm, so I left out passive impedance for simplicity.
The following video shows the first few trials, simulated in real time. The top graph shows the parameter space, with each dot representing an evaluated set of parameters. The colors of the dots are scaled from black (lowest reward) to white (highest reward). The blue circle marks the last added particle. The bottom graph shows the highest reward found at or before a certain trial. As you can see, the algorithm alternates between exploitation and exploration, with a higher emphasis on exploration in the earlier trials.
In the next video, the simulation has already been run for a few thousand trials. The algorithm tries to exploit more heavily now.
Finally, here is the best policy in slow motion (1/5 real-time speed). As you can see, the downward force is relatively constant throughout the landing phase.