
AI Safety

Ensuring the safe deployment of AI across its many applications presents many challenges. Some are subtle and difficult to formulate with current technology (such as malicious autonomous AI agents), whereas others can be articulated as problems facing the current adoption of AI in industry (such as biased decision-making algorithms).

To provide an overview of the areas where AI could be unsafe, we use two independent dimensions: the autonomy level of the decision-making agent, and the intention behind the (use of) AI.

We are working on elements of all three of the current-horizon quadrants.

Fig 1. AI Safety landscape.


1. Parenting as a general approach to safe reinforcement learning

Autonomous learning agents trained with reinforcement learning present many safety concerns: hacking their reward functions, behaviour with unintended side effects, and unsafe exploration of new behaviour (to name a few). However, humans are themselves autonomous learning agents who learn to behave safely in complex environments through attentive training from other humans (primarily our parents).

We present a framework for safe reinforcement learning, inspired by the ways that humans parent their children, and demonstrate that this solves a broad class of AI Safety problems: essentially all those that would not be problematic to a human. We find that our Parenting algorithm solves these problems in all relevant AI Safety ‘grid-world’ environments, that an agent can learn to outperform its parents as it is allowed to ‘mature’, and that the behaviour learned from Parenting is generalisable to new environments.

Our results set the foundation of an approach to reinforcement learning that mimics the way people train their children.

A summary of the elements of parenting that solve the various AI Safety problems:

Fig 2. Elements of Parenting and the AI Safety problems that they solve.


2. Mitigating the potential for AI-based political disinformation

Strategic disinformation enabled by technology has already been used to undermine individuals and institutions. AI has the potential to exacerbate this trend by enabling the creation of fake videos, indistinguishable from real footage, of public figures making statements they never made.

We are working with the Alliance of Democracies to mitigate this risk through the creation of a classifier that determines if a video is fake. We are also working on the creation of educational material that demonstrates the current state of AI-based disinformation so that the public will not be caught off-guard if it is deployed maliciously.


3. Avoiding bias in neural networks

Avoiding bias in the predictions made by machine learning algorithms can be subtle, especially for more complex algorithms such as those based on neural networks. It is not enough simply to remove a sensitive feature in the data that you do not want your algorithm to use, since this feature probably correlates with a number of other seemingly innocuous features.
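The correlation problem can be illustrated with a small synthetic experiment. The sketch below is only an illustration under stated assumptions: the `sensitive` attribute, the `proxy` feature, the noise level and the 0.5 threshold are all hypothetical and do not come from any real dataset. The sensitive attribute is never shown to the "model", yet a simple threshold on the correlated feature recovers it far more often than chance.

```python
import random

def proxy_leak_accuracy(n=2000, noise=0.5, seed=0):
    """Fraction of rows where a threshold on a proxy feature recovers the sensitive one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # Hypothetical protected attribute; it has been "removed" from the inputs.
        sensitive = rng.randint(0, 1)
        # Hypothetical innocuous-looking feature that happens to correlate with it.
        proxy = sensitive + rng.gauss(0.0, noise)
        # The only thing the guesser sees is the proxy.
        guess = 1 if proxy > 0.5 else 0
        hits += int(guess == sensitive)
    return hits / n

acc = proxy_leak_accuracy()
print(f"sensitive attribute recovered from the proxy alone: {acc:.0%}")
```

Raising `noise` weakens the correlation and pushes the recovery rate towards 50%, which is the chance level for a balanced binary attribute.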

An approach to training general (e.g. neural network) algorithms to robustly avoid using sensitive features is known as adversarial fairness. An example of this type of algorithm (taken from this paper) works as follows (see Figure 3):

[1] A classifier is trained to predict whether or not a customer receives a loan.

[2] However, that classifier receives as input a representation of the data that is also used by an adversarial classifier that attempts to predict the sensitive features.

[3] Both are simultaneously trained with an encoder-decoder network that ensures the adversary is unable to discern the sensitive features, no matter how hard it tries.

Fig 3. Framework for creating fair, black-box classifiers

In this way, an ‘insensitive data representation’ is created. It cannot contain sensitive information, since otherwise the adversary would be able to exploit it (provided the adversary is not deliberately weak). Since the loan-determining classifier receives only the insensitive data representation as input, its predictions cannot depend on the sensitive features.
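The training scheme in steps [1] to [3] can be sketched as a toy example. This is a minimal sketch under several assumptions, not the method of the cited paper: the encoder is a single linear map to a one-dimensional representation, the decoder (reconstruction) part is omitted, the data from `make_data` is synthetic, and the trade-off weight `lam`, the learning rate and the weight clipping are hypothetical choices made to keep the example small and bounded. It makes no claim about convergence, which is delicate for adversarial objectives in general.

```python
import math
import random

def sigmoid(t):
    t = max(-30.0, min(30.0, t))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-t))

def clip(v, bound=5.0):
    # Weight clipping is an assumption added to keep this toy game bounded.
    return max(-bound, min(bound, v))

def make_data(n=400, seed=0):
    """Toy loan data: x = (signal, proxy); the proxy leaks the sensitive attribute s."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        s = rng.randint(0, 1)
        signal = rng.gauss(0.0, 1.0)
        proxy = (2 * s - 1) + rng.gauss(0.0, 0.5)
        y = 1 if signal + 0.5 * proxy > 0 else 0  # label partly contaminated by the proxy
        rows.append(((signal, proxy), y, s))
    return rows

def train(rows, lam=1.0, lr=0.05, epochs=20, seed=0):
    """Simultaneous gradient updates for encoder w, classifier c and adversary a."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)]
    c, a = 0.0, 0.0
    for _ in range(epochs):
        for x, y, s in rows:
            z = w[0] * x[0] + w[1] * x[1]   # data representation (1-D here)
            g_y = sigmoid(c * z) - y        # gradient of classifier log-loss w.r.t. its logit
            g_s = sigmoid(a * z) - s        # gradient of adversary log-loss w.r.t. its logit
            g_z = g_y * c - lam * g_s * a   # encoder descends L_clf - lam * L_adv
            c = clip(c - lr * g_y * z)      # classifier minimises its own loss
            a = clip(a - lr * g_s * z)      # adversary minimises its own loss
            w = [clip(wi - lr * g_z * xi) for wi, xi in zip(w, x)]
    return w, c, a

def accuracy(rows, w, weight, target):
    """Share of rows where sigmoid(weight * z) > 0.5 matches row[target] (1 = label, 2 = sensitive)."""
    hits = 0
    for row in rows:
        x = row[0]
        z = w[0] * x[0] + w[1] * x[1]
        hits += int((sigmoid(weight * z) > 0.5) == bool(row[target]))
    return hits / len(rows)

rows = make_data()
w, c, a = train(rows)
print("loan classifier accuracy:", accuracy(rows, w, c, target=1))
print("adversary accuracy on the sensitive attribute:", accuracy(rows, w, a, target=2))
```

The encoder update is the only place where the two objectives meet: it follows the classifier's gradient while moving against the adversary's, weighted by `lam`. Note that the loop reads the sensitive attribute `s` on every step, because the adversary cannot be trained without it.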

Interestingly, this approach to unbiased machine learning requires access to the sensitive data. Ironically, then, removing access to sensitive features can prevent the creation of robustly unbiased algorithms. This subtlety is sometimes not fully appreciated by regulators or policy makers.