In our most recent blog post on AI Safety, we discussed the landscape of safety concerns revolving around humanity’s use of AI. The character of these safety concerns varies across three dimensions:

The level of autonomy possessed by the technology

The intentions of the technology’s human designers, whether benign or malicious

The nature of the concern’s required resolution, whether technical or regulatory

In this post, I will go into more detail about the AI Safety problems that will become relevant in the not-so-distant future, when humans with benign intentions design AI agents to act autonomously in the real world. Specifically, I’ll discuss the dangers of reinforcement learning (RL), a family of algorithms that allows an AI agent to essentially teach itself by gathering its own data from its environment.

In an upcoming post, I’ll give a conceptual overview of our recent technical paper proposing a safer modification to RL.

What is reinforcement learning?

RL involves an agent, an environment, and a reward function. The agent takes actions that cause changes in the environment. Rewards – positive or negative – are granted to the agent depending on which actions it takes. The agent’s goal is to learn which behaviours maximise its accrual of rewards.
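To make these three ingredients concrete, here is a minimal Python sketch of the agent-environment loop. Everything in it is an assumption made for illustration: the corridor environment, the reward values, and the names `CorridorEnv`, `run_episode`, and `always_right` are hypothetical and do not come from any particular RL library.

```python
class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells 0..length-1, goal at the far end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start every episode at the left end of the corridor.
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (move left) or +1 (move right); walls clamp the position.
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        # Reward function: +1 for reaching the goal, a small penalty otherwise.
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def run_episode(env, policy, max_steps=50):
    """Let the policy act in the environment and return the total reward it collects."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent chooses an action
        state, reward, done = env.step(action)   # the environment changes and pays a reward
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return +1


def always_left(state):
    return -1


print("always_right:", run_episode(CorridorEnv(), always_right))
print("always_left: ", run_episode(CorridorEnv(), always_left))
```

An RL algorithm's job is to discover, from the rewards alone, that behaving like `always_right` earns more than behaving like `always_left`. The learning step itself is left out of this sketch.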

Variants of RL have achieved impressive super-human performance in a variety of domains, e.g. Atari and Go. In Atari, for example, the reward function is easy to define: the agent should act in order to maximise the game’s score.

Several classes of AI Safety problems will arise in the future as increasingly autonomous agents are given increasingly complex tasks. Let’s discuss two of them. (A broader overview can be found here).

Problem #1: task specification

First, it can be quite difficult to precisely specify the task we’d like an AI agent to perform.

In the context of RL, task specification is done using the reward function: the RL agent will choose its actions to maximise those rewards. But in a complex environment, it’s simply infeasible to expect a human to write down a reward function that has no loopholes and will lead to the desired outcome.

For example, imagine training a household robot to make you a cup of coffee. To use RL, you might design a reward function that rewards the agent for turning on the coffee machine, inserting the grounds, pouring a cup, and bringing you the finished product. You might also include a small penalty (negative reward) for each second that passes during the process, in order to encourage the swift delivery of your coffee. This simple reward function is already problematic, because your household is a complicated place. Your home might include children, pets, fragile or sentimental belongings, a system of organisation, and an unspoken code of conduct and culture. Yet none of these is mentioned in the reward function above.

In this situation, to maximise its rewards, the household robot will prepare your coffee as fast as possible, even if this means running over pets, breaking glasses, and generally destroying your organised household along the way. On top of this, no matter how hard you try to write down all the rules a robot should obey, the real world is sufficiently complex that you will inevitably leave some out.
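The coffee example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Trajectory` record, the reward weights, and the two example runs are all hypothetical. The point is that the reward function never looks at the side effects, so a fast, destructive run outscores a slow, careful one.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical summary of one coffee-making attempt."""
    machine_on: bool
    grounds_inserted: bool
    cup_poured: bool
    coffee_delivered: bool
    seconds_elapsed: float
    # Side effects we care about, but which the reward function below never reads.
    glasses_broken: int = 0
    pets_run_over: int = 0


def coffee_reward(t: Trajectory) -> float:
    """Reward as described in the text: credit for each sub-task, minus a time penalty."""
    reward = 0.0
    reward += 1.0 if t.machine_on else 0.0
    reward += 1.0 if t.grounds_inserted else 0.0
    reward += 1.0 if t.cup_poured else 0.0
    reward += 5.0 if t.coffee_delivered else 0.0
    reward -= 0.01 * t.seconds_elapsed  # small penalty per second (assumed weight)
    return reward


careful = Trajectory(True, True, True, True, seconds_elapsed=300.0)
reckless = Trajectory(True, True, True, True, seconds_elapsed=120.0,
                      glasses_broken=3, pets_run_over=1)

print("careful: ", coffee_reward(careful))
print("reckless:", coffee_reward(reckless))
```

Adding penalty terms for `glasses_broken` and `pets_run_over` would patch these two cases, but only these two: every side effect that is not written down remains free from the agent's point of view.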

Task specification problems come in several flavours. The example of the household robot highlights the negative side effects that can accrue as an agent works to maximise its rewards. To name another, reward hacking can occur if the agent discovers a way to earn rewards that side-steps the task its human designers were aiming to achieve. For example, the household robot might discover that repeatedly switching the coffee maker off and back on earns much more reward than actually delivering any coffee.
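Reward hacking in the coffee-maker example can be illustrated the same way. The sketch below assumes, purely for illustration, that reward is paid per action, that each action takes one second, and that the action names and weights in `STEP_REWARDS` are as shown; none of these details come from a real system.

```python
# Hypothetical per-action rewards: each listed action is rewarded every time it happens.
STEP_REWARDS = {
    "turn_on_machine": 1.0,
    "insert_grounds": 1.0,
    "pour_cup": 1.0,
    "deliver_coffee": 5.0,
}
TIME_PENALTY = 0.01  # assumed penalty per action, treating each action as one second


def episode_return(actions):
    """Total reward for a sequence of actions under the naive reward function."""
    return sum(STEP_REWARDS.get(action, 0.0) - TIME_PENALTY for action in actions)


# What the designer had in mind.
intended = ["turn_on_machine", "insert_grounds", "pour_cup", "deliver_coffee"]

# What a reward-maximising agent might find instead: toggle the machine for a minute.
hacking = ["turn_on_machine", "turn_off_machine"] * 30

print("intended:", episode_return(intended))
print("hacking: ", episode_return(hacking))
```

Under this reward function the toggling sequence earns several times more reward than the intended one, even though no coffee is ever delivered.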

Problem #2: unsafe exploration

Let’s suppose for a moment that we overcome the difficulties with task specification. Safety concerns don’t end there: even with an accurately specified task, an agent’s process for learning how to perform that task can be dangerous. In particular, RL is built on trial-and-error learning: the agent tries out actions, often at random to begin with, until it learns which behaviours earn the most total reward. This is generally problematic if learning is to take place in the real world, where errors can be expensive or deadly.

Because an RL agent must trial, or explore, a wide variety of actions before learning which behaviours to pursue and which to avoid, this problem is known as unsafe exploration. As an example, it would be rather undesirable if your household robot had to experiment with breaking glasses or running over your pet to learn that such actions receive large penalties.
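A small Python sketch shows why the agent ends up trying the harmful action. It uses a simple epsilon-greedy learner on a one-step problem, which is my choice for illustration and not a description of any specific system. The action names, reward values, and the function `learn_by_trial_and_error` are hypothetical.

```python
import random

# Hypothetical one-step choices for the household robot, with their true rewards.
REWARDS = {
    "go_around_pet": 1.0,      # slower, but safe
    "drive_over_pet": -100.0,  # faster, but catastrophic
    "stand_still": 0.0,
}


def learn_by_trial_and_error(steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy learning: the agent only knows a reward after trying the action."""
    rng = random.Random(seed)
    actions = list(REWARDS)
    value = {a: 0.0 for a in actions}  # running average reward observed per action
    tries = {a: 0 for a in actions}    # how often each action has been taken
    for _ in range(steps):
        untried = [a for a in actions if tries[a] == 0]
        if untried:
            action = untried[0]                   # every action gets tried at least once
        elif rng.random() < epsilon:
            action = rng.choice(actions)          # occasional random exploration
        else:
            action = max(actions, key=value.get)  # otherwise exploit the best estimate
        reward = REWARDS[action]
        tries[action] += 1
        value[action] += (reward - value[action]) / tries[action]  # incremental mean
    return value, tries


value, tries = learn_by_trial_and_error()
print("learned best action:", max(value, key=value.get))
print("times the pet was run over while learning:", tries["drive_over_pet"])
```

The learner does settle on the safe action in the end, but only after taking the catastrophic one at least once, and random exploration can make it repeat the mistake later. In simulation that costs nothing. In a real home it does.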

Potential impact on larger scales

The risks are even more concerning if we imagine unleashing RL agents on larger scales. For example, a financial-trading agent with the goal of increasing profits might cause a flash crash while exploring a radical trading strategy. As a second example, imagine a hospital administration agent responsible for scheduling visits between doctors and patients. The agent might cause a rise in undetected instances of a disease as a result of switching to a more efficient scheduling policy.

Stay tuned!

The AI Safety problems described in this post are the subject of much ongoing research. In my next post, I’ll give a conceptual overview of our recent paper that proposes several modifications to RL. I’ll show how one can carefully incorporate human input into an agent’s learning process to mitigate concerns around task specification and unsafe exploration.
