This post will introduce you to Stan. To quote the Stan documentation:

Stan® is a state-of-the-art platform for statistical modeling and high-performance statistical computation… Users specify log density functions in Stan’s probabilistic programming language and get:
– full Bayesian statistical inference with MCMC sampling (NUTS, HMC)
– approximate Bayesian inference with variational inference (ADVI)
– penalized maximum likelihood estimation with optimization (L-BFGS)

Stan models are specified in a custom programming language that closely mimics mathematical notation. For example, the statement $x \sim \mathcal{N}(0, 1)$ becomes (ignoring variable declarations for the moment)

x ~ normal(0, 1);

Once you’ve specified your model in a .stan file, the Stan compiler stanc transpiles the code to optimized C++ then compiles it to produce an executable. This results in extremely efficient custom samplers.


The Stan modelling language

Stan programs are specified in blocks. The most important ones are:

  • data – define the data that the Stan model should expect to receive as input
  • parameters – define the parameters in the model
  • model – specify the joint distribution of parameters and data by specifying a prior and a sampling distribution

Additionally we can specify

  • functions – define reusable functions within the Stan program. Can be used to define custom probability distributions in advanced models
  • transformed data – apply transformations to the input data. This block lets us transform the data once and cache the result, avoiding expensive transformations being recomputed at every iteration of sampling. Usually you’ll do most of your data preprocessing in Python or R, but this block can be useful for aggregating input data or computing auxiliary quantities
  • transformed parameters – this block allows us to specify transformations that should be applied to sampled parameters in each sampling iteration
  • generated quantities – in this block we can generate quantities from sampled parameters in each iteration of the sampling algorithm. For example, one might use this block to generate predictions on test data with each sample of the parameters in the model. This allows us to compute Monte Carlo estimates for quantities of interest while sampling

In this tutorial we’ll focus mainly on the three core blocks and generated quantities, but check out future posts for more advanced model specification.


Types

Stan has two primitive types, declared as follows:

int n;  // n can take integer values
real x;  // x can take continuous values

It also has three linear algebra types:

vector[4] v;  // a 4-dimensional column vector
row_vector[5] w;  // a 5-dimensional row vector
matrix[2, 3] m;  // a 2x3 matrix

Any type can be made into an array type by adding array dimensions after the variable name. For example:

real x[10];  // an array of 10 reals
matrix[3, 3] m[6, 7];  // a 2D array of shape (6, 7) whose elements are 3x3 matrices

You can also specify constraints on variables using angle brackets. For example:

int<lower = 1> n;  // a positive integer
// a 4-dimensional column vector with each entry in [-1, 1]
vector<lower = -1, upper = 1>[4] corr;

Upper and lower bounds specified like this are inclusive. Constraints on variables declared in data, transformed data, transformed parameters and generated quantities act as error checks on input or generated values, whereas constraints on variables declared in the parameters block determine the range of values the sampler can draw from.

In addition to the above, there are special data types for structured vectors and matrices that we will not cover in detail here. They are: simplex, unit_vector, ordered, positive_ordered, corr_matrix, cov_matrix, cholesky_factor_cov, cholesky_factor_corr.


Example: The Paris birth problem

The best way to demonstrate writing a Stan model is with a specific example. We will fit a simple binomial model to the number of female and male births in Paris over the period 1745–1770. This problem was first studied by Laplace, who independently developed many of the insights originally due to Thomas Bayes.

Our data consists of two integers:

female_births = 241945
male_births = 251527

We would like to understand the probability of a given birth being male or female, and answer questions such as ‘is a given birth equally likely to be male or female?’


Specifying the model

We let n denote the total number of births, y the number of female births, and p the probability that any given baby is born female (births are assumed to be independent, identically distributed events). The model we are going to fit can be specified as:

$$p \sim \mathrm{Unif}(0, 1)$$
$$y \sim \mathrm{Bin}(n, p)$$

That is to say, we assume every possible value for the probability p is equally likely, and assume that y is binomially distributed with n trials and probability p.
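
As an aside, this particular model has a closed-form posterior: the uniform prior is a $\mathrm{Beta}(1, 1)$ distribution, and the beta family is conjugate to the binomial likelihood, so

$$p \mid y \sim \mathrm{Beta}(y + 1,\; n - y + 1).$$

This gives us an exact answer to sanity-check the MCMC output against; in general no closed form exists, which is precisely when Stan becomes useful.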

In Stan, we would specify this model as follows:

// paris_births.stan

data {
  int<lower=0> n;  // total number of births
  int<lower=0, upper=n> y;  // number of female births
}
parameters {
  real<lower=0, upper=1> p;  // probability of female birth
}
model {
  // specifying the prior on p is not strictly necessary
  // as Stan will place a uniform prior on p by default
  p ~ uniform(0, 1);
  y ~ binomial(n, p);
}

Compiling and running the model

Stan has numerous interfaces for compiling and running models, including CmdStan, RStan, PyStan, MatlabStan, Stan.jl, StataStan and MathematicaStan. We’ll demonstrate PyStan and RStan, which are among the most mature and feature-rich options (alongside CmdStan) and are also very user-friendly if you already know Python or R.


PyStan

First, install PyStan:

pip install pystan

then from Python we can do the following:

import pystan

female_births = 241945
male_births = 251527

# data is passed into Stan as a dict
data = {
    "n": female_births + male_births,
    "y": female_births,
}

# compile the model
# for a simple model, this takes much longer than sampling!
sm = pystan.StanModel(file="paris_births.stan")

fit = sm.sampling(data=data, chains=4, iter=2000, n_jobs=1)
print(fit)
Inference for Stan model: anon_model_c0885aeecd94a1dc173da101ba25b432.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

       mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
p      0.49  1.8e-5 7.0e-4   0.49   0.49   0.49   0.49   0.49   1520    1.0
lp__ -3.4e5    0.02   0.68 -3.4e5 -3.4e5 -3.4e5 -3.4e5 -3.4e5   1549    1.0

Samples were drawn using NUTS at Fri Jan 11 13:26:41 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

As you can see above, printing the fit object gives us basic summary statistics for the sampling: the sample mean, the Monte Carlo standard error of the mean (se_mean), the sample standard deviation, and various sample quantiles that let us read off the central 50% and 95% intervals for each parameter. (In our case we have only one parameter, p; lp__ is the log posterior, which every Stan model generates.) In addition to summary statistics, fit displays two diagnostic metrics:

n_eff – MCMC draws correlated samples. n_eff is a crude measure of the ‘effective sample size’: performing inference with the correlated sample is approximately like performing inference with n_eff independent samples. If n_eff is a small fraction of the number of samples drawn, the samples are highly correlated, which can cause problems for inference.

Rhat – This is a convergence metric for the sampling algorithm. If all your chains have converged it will be 1 or very close to it. There’s no definite rule on what a good value of Rhat is, but if you have Rhat greater than 1.2 you should probably investigate, and if it’s greater than 2 your chains are very unlikely to have converged properly. Note that Rhat equal to 1 doesn’t actually guarantee convergence; it is a necessary condition but not sufficient.
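
If you want these diagnostics programmatically rather than printed, PyStan also exposes the summary table as arrays via fit.summary(). A minimal sketch, using pandas purely for readability:

import pandas as pd

# fit.summary() returns a dict containing the summary table and its row/column labels
s = fit.summary()
summary = pd.DataFrame(
    s["summary"],
    index=s["summary_rownames"],
    columns=s["summary_colnames"],
)

# pull out just the diagnostic columns for a quick check
print(summary[["n_eff", "Rhat"]])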

We can extract the sampled parameters into a Python dictionary using the extract method

fit.extract()["p"]
array([0.49075869, 0.48934217, 0.49044326, ..., 0.48911428, 0.48898282,
       0.48935031])
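
With the raw draws in hand, answering our original question takes only a little NumPy. A quick sketch:

import numpy as np

p_samples = fit.extract()["p"]

print(p_samples.mean())                       # posterior mean of p
print(np.percentile(p_samples, [2.5, 97.5]))  # central 95% interval
print((p_samples >= 0.5).mean())              # Monte Carlo estimate of P(p >= 0.5 | data)

For this data the last number comes out as essentially zero: the posterior puts effectively no mass on p >= 0.5, so a given birth is slightly more likely to be male than female.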

Diagnosing sampling issues and understanding the posterior distribution is often best done by visualising the samples. The visualisation tools inside PyStan are being deprecated in favour of the library arviz. Install it with

pip install arviz

Then, to visualise your chains and sampled parameters, do something like the following:

import arviz as az

az_data = az.from_pystan(posterior=fit)
az.plot_density(az_data, var_names=["p"])
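
For chain-level diagnostics, a trace plot showing each chain’s draws over the iterations is the usual first check for mixing problems; something like the following should work:

# one panel per parameter: marginal density plus per-chain traces
az.plot_trace(az_data, var_names=["p"])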

RStan

The above analysis can be repeated similarly in R. First install RStan:

install.packages("rstan")

This takes surprisingly long… When it’s done, you can do the following from R:

library(rstan)

# automatically cache compiled model as .RDS file
rstan_options(auto_write = TRUE)
# specify number of cores
options(mc.cores = 1)

female_births <- 241945
male_births <- 251527

data <- list(
  n = female_births + male_births,
  y = female_births
)

fit <- stan(
  file = "paris_births.stan",
  data = data,
  chains = 4,
  iter = 2000
)

Much like in PyStan, you can print the fit object to get a high-level summary of the samples, or investigate more deeply with traceplots and the like. RStan has built-in plotting functions based on ggplot2; check out the documentation for details.


Generated quantities

To finish this introduction, let’s take a quick look at the generated quantities block. We’ll specify a new model as follows:

// paris_births_gq.stan

data {
  int<lower=0> n;  // total number of births
  int<lower=0, upper=n> y;  // number of female births
}
parameters {
  real<lower=0, upper=1> p;  // probability of female birth
}
model {
  // specifying the prior on p is not strictly necessary
  // as Stan will place a uniform prior on p by default
  p ~ uniform(0, 1);
  y ~ binomial(n, p);
}
generated quantities {
  int<lower=0, upper=n> y_sim;

  // generate births data with sampled p
  y_sim = binomial_rng(n, p);
}

We run this with much the same Python code as before:

sm = pystan.StanModel(file="paris_births_gq.stan")

fit = sm.sampling(data=data, chains=4, iter=2000, n_jobs=1)
print(fit)
Inference for Stan model: anon_model_bb79043a746024083d185b7fd5482af0.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
p       0.49  2.0e-5 7.2e-4   0.49   0.49   0.49   0.49   0.49   1300    1.0
y_sim  2.4e5   11.67 497.65  2.4e5  2.4e5  2.4e5  2.4e5  2.4e5   1818    1.0
lp__  -3.4e5    0.02   0.73 -3.4e5 -3.4e5 -3.4e5 -3.4e5 -3.4e5   1403    1.0

Samples were drawn using NUTS at Fri Jan 11 13:28:24 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

and we find that fit has now sampled the new variable y_sim, which gives us a range of likely female birth counts given the observed data. Once again, we can extract these with extract:

fit.extract()["y_sim"]
array([241772., 241553., 242757., ..., 241920., 241043., 242368.])

Note that the variance in this range will be larger than the variance in the binomial distribution with $p$ fixed at the sample mean. This demonstrates one of the powers of full Bayesian inference, as point estimates will typically underestimate uncertainty, leading to overconfident inferences.
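
We can check this claim directly from the samples. A rough sketch, reusing the fit and data objects from above:

import numpy as np

y_sim = fit.extract()["y_sim"]
p_hat = fit.extract()["p"].mean()
n = data["n"]

# spread of the posterior predictive draws
print(y_sim.std())
# spread of a binomial with p fixed at the posterior mean
print(np.sqrt(n * p_hat * (1 - p_hat)))

The first number is larger because it folds in our remaining uncertainty about p on top of the binomial sampling noise.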


Further reading

The Stan documentation is extremely good, particularly the language manual, which includes a full API reference, many examples, optimisation hints and implementation details. The documentation more generally has advice on specifying models such as prior choice recommendations. Finally, there is a great community behind Stan, including an active group of developers, and a lot of help available online. Do search the Stan forums for help if you get stuck.