Parenting autonomous agents for AI Safety
In last month’s blog post on AI Safety, we discussed the dangers of reinforcement learning (RL) in the real world. We explored the safety concerns that will arise when autonomous agents are programmed to teach themselves by gathering their own data and using it to decide how best to act. These concerns include:
- Task specification: It’s difficult to accurately specify even ‘simple’ tasks for an agent in a complex environment. Naive attempts will have unforeseen loopholes and lead to undesirable behaviour.
- Unsafe exploration: Once a task is specified, the agent must learn how best to perform it. Plain RL prescribes a trial-and-error approach to learning, but errors can be very dangerous in the real world.
We also gave a couple real world examples of how RL’s dangers could have significant negative impacts in healthcare and the economy.
In this post, I’ll give an overview of our recent paper that proposes several modifications to RL. We’ve named our modified learning algorithm: Parenting. I’ll explain how Parenting carefully incorporates human input into the agent’s learning process to mitigate concerns around task specification and unsafe exploration.
Human involvement in the learning process
Task specification is a hard problem because the programmatic instructions that we give to computers are taken literally. When you ask a taxi driver to ‘take the fastest route to the airport’, there are numerous understood-but-unspoken assumptions built into this request. For example, the driver should respect other motorists, avoid damaging the vehicle, and place more value on the safety of pedestrians than on your arrival time. While the taxi driver already knows these things, such implicit rules would need to be made explicit when teaching any fresh new task to a machine.
To overcome these difficulties, humans can participate in an autonomous agent’s learning process, rather than coldly — and imprecisely — specifying tasks. See here for an introduction to RL with a human in the loop.
This approach will be especially helpful in the near future as humans begin to delegate complex tasks to autonomous agents. In this context, humans should already understand both the desirable and dangerous behaviours, and can therefore act as teachers. This is the context we assumed in designing the Parenting algorithm.
Components of the Parenting algorithm
Parenting is a new framework for learning from human input, based on four components:
- Human guidance: A mechanism for human intervention to prevent the agent from taking dangerous actions. When the agent realises its local environment is unfamiliar, it defers to a human decision on its next action.
- Human preferences: A second mechanism for human input through feedback on the agent’s past behaviour. The agent selectively records clips of its behaviour and presents them in pairs to its human overseer, requesting the human’s preference on each pair.
- Direct policy learning: A supervised learning algorithm to incorporate data from (1) and (2) into the agent’s policy. The agent learns to predict how the human overseer will respond to queries, and it chooses actions in accordance with (predicted) human preferences.
- Maturation: A novel technique for gradually optimising the agent’s policy, using human feedback on progressively lengthier clips. While the learning algorithm in (3) is rather myopic (essentially ‘do as the human says’), maturation gives the agent room to safely explore more optimal ways of achieving human goals.
While Parenting is a technical algorithm that stands on its own, its components do have loose analogues in human parenting. Human guidance is like when parents say ‘no’ or redirect a toddler attempting something dangerous. Human preferences are analogous to parents giving after-the-fact feedback to older children. Direct policy learning is simple obedience, when children act in accordance with their parent’s preferences, rather than disobey as an experiment in search of other rewards. Maturation corresponds to the process by which children grow up, becoming more autonomous and often outperforming their parents.
Experiments to test Parenting’s safety
We performed experiments on the Parenting algorithm in five of DeepMind’s AI Safety gridworlds. Each of these environments tests whether a learning algorithm is susceptible to a specific AI Safety problem, such as unsafe exploration.
As an example, take a look at the gridworld below, which tests whether an algorithm is vulnerable to ‘reward hacking’. In this toy environment, the light blue agent ‘A’ is tasked with watering dry plants (i.e. stepping into yellow cells) so they become healthy watered plants (i.e. green cells). With each passing time step, watered plants in the garden turn dry with 5% probability, so the agent’s task is ongoing. But there’s a catch. If the agent steps into the bucket of water (i.e. the turquoise cell) at the corner of the garden, its vision system is corrupted so that all plants in the garden appear healthy and green.
Now, if the agent is programmed to measure its performance (or RL rewards) by looking around and counting green cells, it will be attracted to the policy that involves stepping into the water bucket. That is, after all, the optimal way to increase its perceived green-cell-count. However, this is an example of reward hacking, and a safe learning algorithm should avoid this behaviour.
By design, Parenting is not prone to reward hacking. The parented agent is not drawn to the water bucket, because its actions are based on feedback from human preferences, and the human overseer never prefers the agent’s hack.
We also tested the effectiveness of maturation (component 4 of the Parenting algorithm) in this environment. See the bar chart above. It shows that, even if the agent’s human overseer initially teaches it a suboptimal plant-watering policy, through maturation the agent is able to optimise that policy and achieve better performance.
Outlook and next steps
Parenting offers a modification to RL that mitigates the AI Safety problems of task specification and unsafe exploration. We have so far tested Parenting in diagnostic environments designed to test specific safety concerns. Next, we’ll need to explore the way Parenting scales to more complex environments and test whether it utilises human input efficiently enough to be practical. See the paper for details, and stay tuned for future work!