Last week, my colleague Cathy wrote a blog post about the commercial benefits of A/B tests. In this blog post, I’ll provide some technical background on how A/B testing works in practice, specifically how we use Bayesian inference to evaluate A/B tests for our clients here at Faculty.

Let’s assume we want to evaluate two things: the impact of a promotional offer on the amount customers spend with a brand, and the effect that the same offer has on customers’ rebuy rate. Our analytical solution follows a three-step process:

1. First, we make assumptions about how spend and rebuy are generated given certain parameters, i.e. the sampling distribution. We also specify our current beliefs about likely values of parameters in the form of a prior distribution.
2. Second, given observed data, we can infer which parameters could have given rise to the observed data under our assumed generating process, i.e. the posterior.
3. Finally, we integrate the posterior against the sampling distribution to get the likely spread of outcomes we would see if the promotional offer were rolled out more widely.

## Bayes’ theorem

Bayesian inference is an approach to statistical inference that accounts for prior beliefs about model parameters. In the case of A/B testing, the parameters of interest are the success rates in group A (who see the promotional offer) and group B (who do not), which we will compare to determine which group was more successful. Bayes’ theorem lets us determine the plausibility of different model parameters that could have generated the data we observe, i.e. the posterior distribution.

The posterior distribution will take different forms depending on both our prior and sampling distributions. For example, if our sampling distribution is a binomial distribution, there is only one parameter (the probability of ‘success’). Therefore the posterior will be a probability distribution of a single variable. If our sampling distribution were a normal distribution, then we have two parameters (the mean and the variance) to estimate, resulting in a two-dimensional posterior distribution.

To model the posterior, we need to make certain assumptions about the prior and sampling distributions and their parameters. Constructing the posterior for a binary outcome like the rebuy rate has a well known solution, but spend is less straightforward. We will take a detailed look at the difference below.

I’ll use the rebuy rate as an example to demonstrate how to derive a posterior distribution analytically, before focusing on the intricacies of modelling spend. Note that you could also use sampling methods to derive the shape of the posterior, but we prefer the analytical solution for its interpretability.

## Inference

Once we have calculated the posterior, we can use it to calculate the probability of any outcome if we were to run the experiment again – or, in the case of A/B testing, if the promotional offer were to be rolled out.

By integrating the posterior corresponding to an offer against a sampling distribution, we can explicitly measure the frequency of particular outcomes for this offer. Formally, we want to predict $P(S|N,y)$: the probability of $S$ (i.e. number of rebuys or average spend per person) if we rolled the offer out to $N$ customers, having observed data $y$:

$$P(S|N,y) = \int d\theta \, P(S|N,\theta) \, P(\theta|y)$$

where $P(\theta|y)$ is the posterior and $P(S|N,\theta)$ is the sampling distribution. This allows us to compare the likely outcomes between offers and quantify the certainty with which one offer will outperform another at any sample size.
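This integral can be approximated by Monte Carlo: draw parameters from the posterior, then draw outcomes from the sampling distribution. A minimal sketch for the rebuy case, using purely illustrative numbers (a Beta(1, 1) prior, 40 rebuys from 200 offers, and a hypothetical roll-out to 1,000 customers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative posterior for the rebuy rate: Beta(alpha + s, beta + n - s),
# here with a Beta(1, 1) prior and s = 40 rebuys out of n = 200 offers.
alpha_post, beta_post = 1 + 40, 1 + 200 - 40

# Monte Carlo approximation of P(S | N, y): draw theta from the posterior,
# then S from the sampling distribution given theta.
N = 1000  # customers in the hypothetical roll-out
theta = rng.beta(alpha_post, beta_post, size=100_000)
S = rng.binomial(N, theta)

print(S.mean())                        # expected number of rebuys
print(np.percentile(S, [2.5, 97.5]))   # 95% predictive interval
```

Note how the predictive interval is wider than binomial sampling noise alone would suggest: it also carries our remaining uncertainty about the rebuy rate itself.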

In the case of the rebuy rate, there are two possible outcomes: either a customer buys again or not. This makes determining the sampling distribution easy – the probability of whether a customer rebuys follows a Bernoulli distribution, which is fully characterised by the probability of success (the rebuy rate).

So I model the prior using the conjugate prior of the Bernoulli distribution, as this gives a closed-form expression for the posterior and removes the need to use sampling methods to estimate it. The conjugate prior of a Bernoulli distribution is the beta distribution.

The resulting posterior follows a beta distribution:

$$P(\theta|y) \propto \theta^{\alpha + s - 1} (1 - \theta)^{\beta + n - s - 1}$$

where $\alpha$ and $\beta$ are the parameters of the beta prior, $s$ is the number of successes/rebuys and $n$ is the number of events/offers sent out.
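In code, this conjugate update is just parameter arithmetic. A short sketch with hypothetical numbers (a Beta(2, 2) prior and 40 rebuys from 200 offers):

```python
from scipy import stats

# Hypothetical prior and data: a Beta(2, 2) prior, s = 40 rebuys from n = 200 offers.
alpha, beta = 2, 2
s, n = 40, 200

# Conjugacy: the posterior is Beta(alpha + s, beta + n - s).
posterior = stats.beta(alpha + s, beta + n - s)

print(posterior.mean())          # posterior mean of the rebuy rate
print(posterior.interval(0.95))  # 95% credible interval
```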

The beta distribution is characterised by two parameters ($\alpha$, $\beta$) that describe the shape of the distribution. You’ll need to choose these parameters by hand. Prior selection is a topic in its own right; there isn’t time here to discuss how we approached it but, as a general rule, you will normally want to choose the priors such that your starting assumption is that all campaigns performed equally well.
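With that rule in mind (the same prior for both groups, so neither is favoured a priori), comparing two offers reduces to comparing posterior draws. A hedged sketch with made-up counts for groups A and B:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same Beta(1, 1) prior for both groups: a priori, neither offer is favoured.
alpha0, beta0 = 1, 1

# Hypothetical observed data for each group.
s_a, n_a = 55, 250  # group A: saw the promotional offer
s_b, n_b = 40, 250  # group B: did not

# Posterior draws for each group's rebuy rate.
theta_a = rng.beta(alpha0 + s_a, beta0 + n_a - s_a, size=100_000)
theta_b = rng.beta(alpha0 + s_b, beta0 + n_b - s_b, size=100_000)

# Estimated probability that the offer outperforms the control.
print((theta_a > theta_b).mean())
```

This single number ("the probability that A beats B") is exactly the kind of quantity that makes Bayesian A/B results easy to communicate.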

## Modelling spend

Spend has a more complicated distribution than rebuys. The observed data $y$ is continuous, with a distribution that spikes at zero (because most people don’t buy anything) and has a long tail (because a few people will spend much more than the overall average).

This means that modelling the sampling distribution is not as straightforward as it was for rebuys, because we cannot describe it by a simple equation: it cannot be captured by a normal or log-normal distribution, for example.

I use a trick to deal with this problem: the central limit theorem states that, in a sufficiently large sample of customers, the distribution of average spend should follow a normal distribution, which can be fully characterised by its mean and variance. The size of the sample required for the central limit theorem to hold will depend on how skewed your spend distribution is, i.e. how far it is from being normally distributed.
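You can check this trick empirically. The sketch below simulates a hypothetical zero-spiked, long-tailed spend distribution (a purely illustrative mixture of zeros and log-normal draws) and compares the skewness of raw spend with the skewness of sample averages:

```python
import numpy as np

rng = np.random.default_rng(0)

def spend_sample(size):
    """Hypothetical spend data: a spike at zero plus a long log-normal tail."""
    buys = rng.random(size) < 0.2  # most customers spend nothing
    return np.where(buys, rng.lognormal(3.0, 1.0, size), 0.0)

# Raw spend is heavily skewed...
raw = spend_sample(100_000)

# ...but average spend over samples of 500 customers is far closer to normal.
means = np.array([spend_sample(500).mean() for _ in range(2_000)])

def skewness(x):
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

print(skewness(raw))    # large: raw spend is far from normal
print(skewness(means))  # much smaller: the averages look roughly normal
```

The more skewed your real spend data, the larger the per-sample size (here 500) needs to be before the averages look normal.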

Now that we know that we can represent the sampling distribution as a normal distribution, we can use its conjugate prior to solve analytically for the posterior distribution.

The conjugate prior of a normal distribution is the normal-inverse-gamma distribution, which is described by four parameters: $\alpha, \beta, \gamma, \lambda$. These parameters roughly relate to the mean and variance of the sample and the uncertainty around each. Combining the sampling distribution with its conjugate prior (using a lot of algebra) gives a posterior distribution that also follows a normal-inverse-gamma distribution.
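The resulting update rules are closed-form. A sketch of one common parameterisation — a prior mean $\mu_0$, a pseudo-count $\lambda$ expressing confidence in it, and inverse-gamma shape/scale $\alpha$, $\beta$ for the variance (the naming may differ from the four parameters above) — applied to hypothetical per-sample average spends:

```python
import numpy as np

def nig_update(mu0, lam, alpha, beta, data):
    """Conjugate update of a normal-inverse-gamma prior given normal data.

    One common parameterisation: mu0 is the prior mean, lam a pseudo-count
    for confidence in mu0, and alpha/beta the inverse-gamma shape/scale
    for the variance.
    """
    x = np.asarray(data, dtype=float)
    n, xbar = len(x), x.mean()
    mu_n = (lam * mu0 + n * xbar) / (lam + n)
    lam_n = lam + n
    alpha_n = alpha + n / 2
    beta_n = (beta + 0.5 * ((x - xbar) ** 2).sum()
              + lam * n * (xbar - mu0) ** 2 / (2 * (lam + n)))
    return mu_n, lam_n, alpha_n, beta_n

# Hypothetical batch of per-sample average spends (already roughly normal,
# thanks to the central limit theorem).
rng = np.random.default_rng(0)
avg_spends = rng.normal(6.5, 0.8, size=50)

print(nig_update(mu0=5.0, lam=1.0, alpha=2.0, beta=1.0, data=avg_spends))
```

As with the beta-Bernoulli case, this closed form is what lets us skip sampling methods and keep the posterior interpretable.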