When putting a machine learning algorithm into production we might discover that it treats unfairly members of certain protected groups. In practice, the fact that an algorithm is biased will be a sufficient reason to cancel its deployment. The obvious solution, training an algorithm that is free from biases, is actually a very difficult task because these biases are deeply rooted in the training data.

In this two-part series I will go over a few methods that you can use to remove biases from an already trained model. In part one, I will set up the problem, explain the idea behind one of these repairing methods, as well as highlighting some of the limitations. In part two, I will go on to present two more algorithms. Each of these algorithms aims to achieve a different definition of fairness.

A few words of warning: this is not a detailed discussion on the topic of fairness. Fairness is a complicated topic, under active research, with a lot interesting points to highlight – way too many to be included in a short blog like this one. A complete discussion of fairness should involve a lot more than a few lines of code and superficial maths. If you are new to observational fairness, I strongly recommend you read this excellent (and yet to be completed) textbook by Solon Barocas, Moritz Hardt and Arvind Narayanan. Do not blindly apply the methods below.

## Notation and definitions

For the sake of simplicity, I will discuss these techniques in the context of supervised, binary classification. The relevant variables at play are:

• The training data: $X$.
• The target variable / training labels: $Yin {0, 1}$.
• The protected attribute with respect to which we want to be fair: $A$. This might or might not be explicitly contained in $X$. I will call groups the different values that the variable $A$ may take.
• The output of the machine learning algorithm is a score function, $R$, taking values in the interval $[0,1]$.
• The predicted label, $D$ is a function that takes discrete values in ${0, 1}$. We obtain $D$ by applying a threshold to the score $R$.

For example, if we have trained a model to decide which applicants get accepted into grad school, the training data could be the applicant’s CV; the target variable would be whether or not the applicant graduated in time (or any other binary variable that measures if the student is qualified for grad school or not); the protected attribute could be either of gender, ethnic background, nationality, etc.. In practice the score $R$ is typically the model’s predicted probabilities $mathrm{Pr}(Y=1|X)$.

What do we mean when we say our model is biased? Here I mean that, given the outcome of the model, the joint probability distribution $p_{A, R, Y}(a, r, y)$ or $p_{A, D, Y}(a, d, y)$, does not satisfy either of the following criteria:

1.  Independence: The probability distribution is such that $p_{R|A}(r|a) = p_{R}(r)$. Back to our grad school example, the score $R|A$ would represent the acceptance rate across the different groups. Independence is a widely accepted notion of fairness because it reflects our current beliefs as a society: in a world that is fair, we would expect that an equal number of men and women get accepted into university and that minorities are proportionally represented.
2.  Separation: The probability distribution is such that $p_{R|A, Y}(r|a, y) = p_{R|Y}(r|y)$. If you are the candidate applying to grad school, this is the definition of fairness that you would care about. Separation is also called equality of opportunity and it means that, if you’re a qualified applicant, your ethnicity, gender, etc. won’t make a difference to your application. The key distinction from independence is the “if you are a qualified applicant” part.
3.  Sufficiency: The probability distribution is such that $p_{Y|A, R}(y|a, r) = p_{Y|R}(y|r)$. If you were in charge of admissions, this is the notion of fairness that you would have in mind. You’d want to graduate students at equal rates across the different groups. In other words, you’d want that whoever gets accepted into grad school has the same chances of graduating regardless of their ethnicity, gender, etc.

If you find that your machine learning algorithm fails at any of the above criteria, the goals is then to find a new score function, $tilde{R}$, that is fair. A key problem to keep in mind is that, in any realistic scenario, one can prove that the above definitions of fairness are pair-wise exclusive. By realistic scenario I mean two things. One, the protected groups are fundamentally different. For example, people from different ethnic backgrounds will inevitably perform differently at school because they have been exposed to different socio-economical circumstances and therefore to a different quality of education. And two, your classification algorithm will make mistakes from time to time. Choosing the appropriate definition of fairness is already a challenge in itself and I will not debate on this. Instead, I’m assuming that, through some very careful process, you were able to determine that the correct definition of fairness for your case at hand is one of the above. I will now explain what you can do to achieve it.

## Achieving independence

Here’s the wrong way of achieving independence: Randomly accept members of group $A=b$ at a rate equal to $Pr(D=1|A=a)$. This classifier will be fair according to the definition of independence, but it will certainly perform poorly in terms of accuracy (I use the term accuracy vaguely, to refer to whatever metric we are using to evaluate the model’s performance regarding the task of classification). This is a limitation of independence and it happens because the target variable $Y$ does not appear in the definition. The real challenge is to achieve independence without shamelessly giving up on accuracy.

Here’s the right way of achieving independence. This algorithm was introduced in this paper by Feldman et. al. It goes as follows:

1.  Compute the cumulative distributions $F_{R|A}(r|a)$ for each group $a$.
2.  Find the median cumulative distribution, defined as the distribution $tilde{F}$ that satisfies $$tilde{F}^{-1}(u) = underset{ain A}{mathrm{median} }: F^{-1}_{R|A}(u)$$
3. Take the derived score for group $a$ to be: $$tilde{r}_a = tilde{F}^{-1}circ F_{R|A}(r|a)$$

This figure illustrates the idea behind the algorithm. The CDF trick. The score r=0.8, marked with a black dot, is redefined for each group. The new score can be obtained by following the dotted arrows.

Let’s go through this with a toy example. Start by making up some scores for two different groups. I use a beta distribution because it already takes values in unit interval.

import numpy as np
np.random.seed(42)

scores_a = np.random.beta(8,2, size=1000)
scores_b = np.random.beta(6,4, size=1000)


Step 1: Compute the cumulative distributions by integrating the densities.


xbins = np.linspace(0, 1, 2**10)
x = (xbins[1:] + xbins[:-1])/2  # midpoint of each bin
dx = xbins[1:] - xbins[:-1]
ya, _ = np.histogram(scores_a, density=True, bins=xbins)
yb, _ = np.histogram(scores_b, density=True, bins=xbins)

# Cumulative distributions:
fa = ya.cumsum()*dx
fb = yb.cumsum()*dx

Step 2: Find the median distribution $tilde{F}$.


f_space = np.linspace(0,1, 2**9)
x_med = np.zeros_like(f_space)
for i, f in enumerate(f_space):
a_arg = np.abs(fa - f).argmin()
b_arg = np.abs(fb - f).argmin()
x_med[i] = np.median([x[a_arg], x[b_arg]])

Step 3: The scores can be made fair by using a function like:

def repair_score(r, group):
x_arg = np.abs(x - r).argmin()

if group == 'a':
f_val = fa[x_arg]
elif group == 'b':
f_val = fb[x_arg]

f_arg = np.abs(f_space - f_val).argmin()
r_new = x_med[f_arg]
return r_new

print(repair_score(0.8, 'a'))
# 0.6984359726295211
print(repair_score(0.8, 'b'))
# 0.8670576735092863

This algorithm can be applied to any black-box model as long as you have access to its score function. The resulting score, $tilde{R}$, is statistically independent of the protected attribute because now all groups have the same distribution of scores. We will often refer back to this algorithm so, for brevity, let’s call it the CDF trick.

## Partial repair

In practice, any modification to the original score function $R$ will most likely decrease the accuracy of the classifier. We would like to have more control over this tradeoff between independence and accuracy. To achieve this, first define a repair factor $gamma in [0,1]$ that quantifies how much we want to repair the scores. A partial repair function, $g$, would satisfy
$$g(r, a, gamma = 0) = r quad text{ and }quad g(r, a, gamma=1) = tilde{r}_a$$
where $tilde{r}_a$ is the fully-repaired score. For example, we can choose
$$g(r, a, gamma) = r + gamma (tilde{r}_a - r)$$.
And we can easily add it to our code:

def partial_repair_score(r, group, gamma=0.5):
rtilde = repair_score(r, group)
return r + gamma*(rtilde-r)

The plots below show the comparison between the original scores and the partially/fully repaired scores. The paper by Feldman et. al. contains a full analysis of the motivations and limitations behind the above algorithm, and I strongly encourage you to read it. The success of the algorithm presented here varies from case to case, and the hit in performance is also very case-dependent. Sometimes the original score function $R$ might have discontinuities that result in cumulative distributions that are not strictly monotonic. In these cases the CDF trick will not be well defined.