Being able to learn from mistakes is a powerful ability that humans (being mistake-prone) take advantage of all the time. Even if we screw something up that we’re trying to do, we probably got parts of it at least a little bit correct, and we can build on what we did right to do better next time. Eventually, we succeed.
Robots can use similar trial-and-error techniques to learn new tasks. With reinforcement learning, a robot tries different ways of doing a thing, and gets rewarded whenever an attempt helps it to get closer to the goal. Based on the reinforcement provided by that reward, the robot tries more of those same sorts of things until it succeeds.
Where humans differ is in how we’re able to learn from our failures as well as our successes. It’s not just that we learn what doesn’t work relative to our original goal; we also collect information about how we fail that we may later be able to apply to a goal that’s slightly different, making us much more effective at generalizing what we learn than robots tend to be.
Today, San Francisco-based AI research company OpenAI is releasing an open source algorithm called Hindsight Experience Replay, or HER, which reframes failures as successes in order to help robots learn more like humans.
The key insight that HER formalizes is what humans do intuitively: Even though you have not succeeded at a specific goal, you have at least achieved a different one. So why not just pretend that you wanted to achieve this goal to begin with, instead of the one that you set out to achieve originally?
To understand how HER works, imagine that you’re up to bat in a game of baseball. Your goal is to hit a home run. On the first pitch, you hit a ball that goes foul. It’s a failure to hit a home run, which sucks, but you’ve actually learned two things in the process: You’ve learned one way of not hitting a home run, and you’ve also learned exactly how to hit a foul ball. Of course, you didn’t know beforehand that you were going to hit a foul ball, but who cares? With hindsight experience replay, you decide to learn from what you just did anyway, essentially by saying, “You know, if I’d wanted to hit a foul ball, that would have been perfect!” You might not have achieved your original goal, but you’ve still made progress.
The other nice thing about HER is that it uses what researchers call “sparse rewards” to guide learning. Rewards are how we tell robots whether what they’re doing is a good thing or a bad thing as part of the reinforcement learning process—they’re just numbers in an algorithm, but you can think of them like cookies. Most reinforcement learning algorithms use “dense rewards,” where the robot gets cookies of different sizes depending on how close it gets to completing a task. These cookies encourage the robot as it goes, rewarding individual aspects of a task separately and helping, in some sense, to direct the robot to learn the way you want it to.
Dense rewards are effective, but engineering them can be tricky, and they’re not always realistic in real-world applications. Most applications are very results-focused, and for practical purposes, you either succeed at them or you don’t. Sparse rewards mean that the robot gets a single cookie only if it succeeds, and that’s it: easier to measure, easier to program, and easier to implement. The trade-off, though, is that learning becomes slower, because the robot gets no incremental feedback; it’s just told over and over “no cookie for you” unless it gets very lucky and manages to succeed by accident.
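To make the distinction concrete, here’s a minimal sketch (not OpenAI’s code) of what the two kinds of reward might look like for a toy puck-pushing task. The function names, the 2D positions, and the success radius are all illustrative assumptions; the sparse 0-on-success, -1-otherwise convention follows the one used in the HER paper.

```python
import numpy as np

# Hypothetical tolerance for deciding "the puck reached the goal."
SUCCESS_RADIUS = 0.05

def dense_reward(puck_pos, goal_pos):
    """'You're getting warmer': less negative the closer the puck is to the goal."""
    return -np.linalg.norm(puck_pos - goal_pos)

def sparse_reward(puck_pos, goal_pos):
    """0 if the goal is reached, -1 otherwise -- no intermediate clues."""
    if np.linalg.norm(puck_pos - goal_pos) < SUCCESS_RADIUS:
        return 0.0
    return -1.0

puck = np.array([0.40, 0.10])
goal = np.array([0.50, 0.10])
print(dense_reward(puck, goal))   # about -0.1: partial credit for being close
print(sparse_reward(puck, goal))  # -1.0: no cookie until the puck actually arrives
```

The dense version hands out graded feedback every step, which is exactly what makes it powerful to learn from and tricky to engineer; the sparse version only needs a success test.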
This is where HER comes in: It lets robots learn with sparse rewards, by treating every attempt as a success at something, changing the goal so that the robot can learn a little bit. Just imagine the robot not succeeding and then being like, “Yeah I totally meant to do that.” With HER, you’d say, “Oh, well, in that case, great, have a cookie!”
By doing this substitution, the reinforcement learning algorithm can obtain a learning signal, since it has achieved some goal, even if it wasn’t the one you meant to achieve originally. If you repeat this process, you will eventually learn how to achieve arbitrary goals, including the goals that you really want to achieve.
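The substitution itself is simple enough to sketch in a few lines. In this illustrative example (again, not OpenAI’s implementation), each transition in a failed episode gets a relabeled copy whose goal is the position the puck actually ended up at, and the sparse reward is recomputed against that substituted goal. Relabeling with the episode’s final achieved state is the simplest of the goal-selection strategies described in the HER paper; the field names and success radius here are assumptions.

```python
import numpy as np

SUCCESS_RADIUS = 0.05  # hypothetical success tolerance

def sparse_reward(achieved, goal):
    return 0.0 if np.linalg.norm(achieved - goal) < SUCCESS_RADIUS else -1.0

def her_relabel(episode):
    """Hindsight relabeling: pretend the goal was wherever the puck
    actually ended up, and recompute each transition's sparse reward."""
    final_achieved = episode[-1]["achieved"]
    relabeled = []
    for t in episode:
        new = dict(t)
        new["goal"] = final_achieved
        new["reward"] = sparse_reward(t["achieved"], final_achieved)
        relabeled.append(new)
    return relabeled

# A failed two-step episode: the intended goal at (1.0, 0.0) was never reached,
# so every transition earned a reward of -1.
goal = np.array([1.0, 0.0])
episode = [
    {"achieved": np.array([0.2, 0.0]), "goal": goal, "reward": -1.0},
    {"achieved": np.array([0.4, 0.0]), "goal": goal, "reward": -1.0},
]
# After relabeling, the last transition is a success at the goal the robot
# actually achieved, so the sparse reward finally fires: "have a cookie!"
print(her_relabel(episode)[-1]["reward"])  # 0.0
```

Both the original and the relabeled transitions go into the replay buffer, so the algorithm learns from the failure at the intended goal and the success at the substituted one.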
Here’s how well it works in practice, compared to an unmodified deep reinforcement learning approach:
To learn more about what makes HER more effective than other reinforcement learning algorithms, we spoke via email with Matthias Plappert, a member of the technical staff at OpenAI:
IEEE Spectrum: Can you explain what the difference is between sparse and dense rewards, and why you recommend sparse rewards as being more realistic in robotics applications?
Matthias Plappert: Traditionally, in the AI field of reinforcement learning (RL), the AI agent essentially plays a guessing game to learn a new task. Let’s take the arm pushing the puck as an example (which you can view in the video). It tries some motion randomly, like just hitting the puck from the side. In the traditional RL setting, an oracle would give the agent a reward based on how close to the goal the puck ends up. The closer the puck is to the goal, the bigger the reward. So, in a way, the oracle tells the agent, “You’re getting warmer”—this is a dense reward.
Sparse rewards essentially push this paradigm to the limit: The oracle only gives a reward if the goal is reached. The oracle doesn’t say, “You’re getting warmer” anymore. It only says: “You succeeded” or “You failed.” This is a much harder setting to learn in, since you’re not getting any intermediate clues. It also better corresponds to reality, which has fewer moments where you obtain a specific reward for doing a specific thing.
To what extent do you think these techniques will be practically useful on real robots?
Learning with HER on real robots is still hard, since it requires a significant number of samples. However, if the reward is sparse, it would potentially be much simpler to do some form of fine-tuning on the real robot, since figuring out whether an attempt was successful is much simpler than computing the correct dense reward at every timestep.
We also found that learning with HER in simulation is often much simpler, for two reasons. It does not require extensive tuning of the reward function, since it is typically much easier to detect whether an outcome was successful. And the critic (a neural network that tries to predict how well the agent will do in the future) has a much simpler job as well: instead of learning a very complex function, it only has to differentiate between successful and unsuccessful outcomes.
OpenAI has made an open source version of HER available, and it’s releasing a set of simulated robot environments based on real robot platforms, including a Shadow hand and a Fetch research robot. If you’re an ambitious sort, OpenAI has also posted a set of requests for HER-related research. All this good stuff is available in the blog post linked below, and you can read the 2017 NIPS paper introducing HER here.
[ OpenAI ]