A question of polling: using data science to predict elections
The results from the second round of the French election are in, and Macron was comfortably elected President. In the second round, the public polls were not particularly accurate – perhaps surprisingly, given their first-round performance. They underestimated Macron’s eventual vote share, although this made little difference to the result given his large winning margin.
In the run-up to the second round of the French election, it wasn’t just on the continent that people were eagerly awaiting the result. Obviously, the French people care deeply about their next leader, but the result had a wider set of implications for the rest of the EU, the UK and the world.
In our professional capacity as data scientists, we were also very interested to see what the election would reveal about the accuracy of the public polls. In contrast to other recent elections, the public polls on the first round of the French election quite closely predicted the results. But would they maintain their performance?
ASI specialises in applying advanced data science tools to solve real-world problems. Polling is interesting and difficult, and it probably needs some new technology to work consistently, so over the last two years we’ve been exploring what value we could add in this area.
The reason that polls fail is reasonably well understood. Most pollsters depend on drawing a random sample from a large population. Fifty years ago, randomly dialling telephone numbers was a good way to gather this sample. Then, as communications technology developed, fewer and fewer people used landlines. Today, even when the pollsters do get through, many people just hang up on them. In 1980 the response rate to a phone poll was about 72% – but now, it is around 0.9%.
But this methodology was invented in the 1930s, and many things have changed since then. In particular, we have more powerful computers, more powerful algorithms and easier ways of reaching people. Is it possible to reimagine polling given all the advantages of the 21st century?
If you spend any time in Silicon Valley, it won’t be long before you hear someone say ‘Software is eating the world’. This is a quote from Marc Andreessen, an investor and co-founder of Netscape. It is a pithy way of summarising a wider argument that all businesses will become software businesses. According to this theory, because software is cheaper than people and quick to iterate on, software companies will out-compete those that are not.
So how does this apply to polling? In current polling methods, the complexity lies in collecting a representative sample. In the methods we’ve been developing, we move this complexity into software, reducing the requirement for a good sample. We try to correct for the sample’s biases in the maths, rather than by fixing the sample itself.
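To make the idea of correcting in the maths concrete, here is a minimal sketch of poststratification, one common way of adjusting a non-representative sample. It is an illustration only, not a description of our own models: the age groups, population shares and responses below are all invented, and the function names are hypothetical. The sample is deliberately skewed towards younger respondents, and each group's answers are reweighted to match its assumed share of the population.

```python
from collections import defaultdict

# Hypothetical population shares per age group (in practice, from a census).
POPULATION_SHARE = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Invented responses: (age_group, supports_candidate_a).
# The sample is heavily skewed towards the youngest group.
RESPONSES = (
    [("18-34", True)] * 60 + [("18-34", False)] * 20
    + [("35-54", True)] * 8 + [("35-54", False)] * 8
    + [("55+", True)] * 1 + [("55+", False)] * 3
)


def raw_estimate(responses):
    """Share of supporters in the sample, with no correction."""
    return sum(1 for _, supports in responses if supports) / len(responses)


def poststratified_estimate(responses, population_share):
    """Weight each group's support rate by its share of the population."""
    totals = defaultdict(int)
    supporters = defaultdict(int)
    for group, supports in responses:
        totals[group] += 1
        supporters[group] += int(supports)

    # A group with no respondents cannot be estimated by simple reweighting.
    missing = set(population_share) - set(totals)
    if missing:
        raise ValueError(f"no respondents in groups: {sorted(missing)}")

    return sum(
        share * supporters[group] / totals[group]
        for group, share in population_share.items()
    )


if __name__ == "__main__":
    print(f"raw sample estimate:     {raw_estimate(RESPONSES):.4f}")
    print(f"poststratified estimate: {poststratified_estimate(RESPONSES, POPULATION_SHARE):.4f}")
```

With these made-up numbers the raw sample suggests 69% support, while the reweighted figure is just under 49%. Note that the oldest group's rate rests on only four respondents, so the corrected estimate is very noisy. That is one reason this kind of approach needs a lot of data, and why simple reweighting is usually combined with a statistical model that shares information across groups.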
The trade-off, though, is that this requires a lot of data. The technique is still in development, but we have been able to experiment with predictions for two elections – one for the UK’s Brexit referendum and one for the French election. (Unfortunately there isn’t enough time, because of the short run-in, for us to do a third test on the UK general election this year.)
The basis of this technique was developed in the US by Professor Andrew Gelman and colleagues, who gathered their data by putting survey questions to Xbox users. In Gelman’s demonstration of the technique, the non-representative polling out-performed the average of the public polls for the 2012 US presidential election, which Obama won.