Flagging extremist content online
It is well known that online propaganda from terrorist organisations plays a key part in radicalisation in Europe and the UK. In collaboration with the UK Home Office, we recently developed and tested an AI algorithm designed to detect such propaganda on the web.
Following media coverage of the classifier (see, for example, BBC News and The Guardian), there has been much discussion of how it works and how well it performs. In this post, we’ll do our best to outline the general approach we took to designing the classifier without exposing the inner workings of the algorithm. We will also specify precisely the metrics used and the performance achieved by the classifier.
Let’s briefly review the basic problem: binary classification. Given a media file (audio or video), we wish to determine which of two classes it belongs to. In this case, the classes are ‘extremist’ and ‘non-extremist’.
We attempt to do this following a two-step process. First, we train a model on a large number of labelled media files (i.e., the class of the media file is known); then, we test the model on media files it has never seen, to validate its performance.
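As a rough illustration of the train-then-test idea, the sketch below holds out part of a labelled dataset so that it is never seen during training. The file names, the 80/20 split and the helper name `split_train_test` are all hypothetical; the post does not say how the real data were divided.

```python
import random

def split_train_test(labelled_files, test_fraction=0.2, seed=0):
    """Shuffle labelled examples and hold out a fraction for testing."""
    items = list(labelled_files)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    n_test = int(len(items) * test_fraction)
    # The first n_test shuffled items are held out; the rest are for training.
    return items[n_test:], items[:n_test]

# Hypothetical labelled data: (file name, label), 1 = extremist, 0 = non-extremist.
labelled = [(f"file_{i}.mp4", i % 2) for i in range(10)]
train, test = split_train_test(labelled)
print(len(train), "training files,", len(test), "held-out test files")
```

The important property is that the two sets are disjoint: performance measured on the held-out files says something about media the model has never encountered.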
The success of the classifier depends on the choice of model. To be effective, the model must have the capacity to learn those features of the input that best discriminate between extremist and non-extremist content.
Our model was designed to be able to learn multiple underlying signals in extremist propaganda. It then learns to combine all these signals in such a way as to achieve the high degree of performance reported. But how do we quantify the performance?
ROC curves, sensitivity and specificity
Let’s take a step back. The output of the algorithm is not a simple ‘yes’ or ‘no’. The raw output is a continuous probability of belonging to the extremist class. This gives us flexibility to choose where to draw the line when it comes to attaching a label to a given input.
For example, we may choose to say a video is extremist if the output probability is > 0.5, and non-extremist otherwise. The probability threshold (0.5 in this example) will give simultaneous values for two important metrics: true positive rate and true negative rate.
The true positive rate (TPR, also known as sensitivity or recall) is the fraction of extremist videos that are correctly identified as extremist. Similarly, the true negative rate (TNR, also known as specificity) is the fraction of non-extremist videos that are correctly identified as non-extremist. As we vary the probability threshold, we move along what is called an ROC curve (see figure above).
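To make the two definitions concrete, here is a minimal sketch that computes both rates at a single threshold. The probabilities and labels are made-up toy values, and `tpr_tnr` is a hypothetical helper, not part of the actual classifier.

```python
def tpr_tnr(probabilities, labels, threshold=0.5):
    """Return (true positive rate, true negative rate) at a given threshold.

    labels: 1 = extremist, 0 = non-extremist.
    A file is predicted extremist when its probability exceeds the threshold.
    """
    tp = fn = tn = fp = 0
    for p, y in zip(probabilities, labels):
        predicted_extremist = p > threshold
        if y == 1:
            if predicted_extremist:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_extremist:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Toy data (assumed for illustration only).
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
tpr, tnr = tpr_tnr(probs, labels, threshold=0.5)
print(f"TPR = {tpr:.2f}, TNR = {tnr:.2f}")
```

In this toy set, two of the three extremist files and two of the three non-extremist files are classified correctly, so both rates come out at two thirds.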
If we set the probability threshold high, then the model must be more confident that a given media file is extremist content for it to be classified as such. This means that the model is less likely to label any media file as extremist, whether or not it actually is. Hence we expect the true negative rate to go up (fewer innocent videos will be incorrectly labelled as extremist), but the trade-off is that we also expect the true positive rate to go down (because fewer extremist videos will be correctly labelled as extremist).
On the other hand, reducing the threshold means that more media files from both classes will be marked as extremist, and hence we expect the true negative rate to go down and the true positive rate to go up. We choose the threshold with this trade-off in mind, based on the needs of the user (for example, very large volumes of videos might require an extremely high true negative rate).
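The trade-off can be seen by sweeping the threshold over a small set of scores. This is only a sketch on invented toy data; `roc_points` is a hypothetical helper name.

```python
def roc_points(probabilities, labels, thresholds):
    """Return a list of (threshold, TPR, TNR), one entry per threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        # Extremist files scored above the threshold are true positives.
        tp = sum(1 for p, y in zip(probabilities, labels) if y == 1 and p > t)
        # Non-extremist files scored at or below it are true negatives.
        tn = sum(1 for p, y in zip(probabilities, labels) if y == 0 and p <= t)
        points.append((t, tp / n_pos, tn / n_neg))
    return points

# Toy scores (assumed for illustration only).
probs = [0.95, 0.8, 0.65, 0.3, 0.7, 0.4, 0.2, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
points = roc_points(probs, labels, thresholds=[0.1, 0.5, 0.9])
for t, tpr, tnr in points:
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  TNR={tnr:.2f}")
```

As the threshold rises from 0.1 to 0.9, the true negative rate climbs from 0.25 to 1.00 while the true positive rate falls from 1.00 to 0.25, which is exactly the movement along the ROC curve described above.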
Consider a random classifier, that is, one that assigns each file a probability drawn uniformly at random between 0 and 1, regardless of its content. If we give a random classifier a threshold of 0.7, about 70% of all videos get classified as non-extremist, resulting in a true positive rate of 30% and a true negative rate of 70%. In general, a threshold of t gives a true negative rate of t and a true positive rate of 1 − t, so the ROC curve is simply a straight line joining the points (100, 0) and (0, 100), as shown in the figure above.
A good classifier is one that can simultaneously achieve good true positive and true negative rates, and so the further to the top right the ROC curve lies, the better. We illustrate a model that is better than random (i.e. has more discriminatory power) with the grey solid line labelled ‘useful classifier’. Finally, the perfect classifier has an ROC curve that looks like a right angle with its corner pointing north-east.
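A quick simulation can be used to check the figures quoted for the random classifier. The sketch assumes that ‘at random’ means scores drawn uniformly from [0, 1) independently of the true class; the sample size and seed are arbitrary.

```python
import random

def random_classifier_rates(threshold, n_per_class=100_000, seed=0):
    """Simulate a classifier whose scores ignore the true class entirely."""
    rng = random.Random(seed)
    # Scores are uniform on [0, 1) for both classes (the stated assumption).
    pos_scores = [rng.random() for _ in range(n_per_class)]
    neg_scores = [rng.random() for _ in range(n_per_class)]
    tpr = sum(s > threshold for s in pos_scores) / n_per_class
    tnr = sum(s <= threshold for s in neg_scores) / n_per_class
    return tpr, tnr

tpr, tnr = random_classifier_rates(threshold=0.7)
print(f"TPR = {tpr:.3f}, TNR = {tnr:.3f}")
```

With a threshold of 0.7 the simulated rates land close to 0.3 and 0.7, and at any other threshold the two rates still sum to roughly 1, which is why the random classifier traces a straight diagonal line.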
The blue curve in the figure is the ROC curve for our classifier after generating predictions for about 100,000 media files (none of which were used for training). We were delighted with the overall performance of our model. We optimised for a high true negative rate for a number of reasons, one of which is that given the enormous volume of non-extremist content uploaded to internet platforms, even a small percentage of false alarms would quickly overwhelm and discredit the system. From our ROC curve, we chose the point at which TNR = 99.995% and TPR = 94%, as shown by the red dot in the figure.
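One simple way to pick an operating point with a required true negative rate is to look only at the scores of held-out non-extremist files and take the lowest threshold that leaves the required fraction of them unflagged. The sketch below illustrates that idea on toy data; it is not a description of how the actual threshold was chosen, and `threshold_for_tnr` is a hypothetical helper.

```python
import math

def threshold_for_tnr(negative_scores, target_tnr):
    """Smallest observed score t such that the fraction of non-extremist
    files with score <= t is at least target_tnr."""
    ordered = sorted(negative_scores)
    n = len(ordered)
    # At least ceil(target * n) negatives must fall at or below the threshold.
    k = math.ceil(target_tnr * n) - 1
    k = max(0, min(k, n - 1))  # guard against edge cases and rounding
    return ordered[k]

# Toy held-out scores (assumed for illustration only).
neg = [0.05, 0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.9]
pos = [0.55, 0.7, 0.8, 0.95]

threshold = threshold_for_tnr(neg, target_tnr=0.9)
tnr = sum(s <= threshold for s in neg) / len(neg)
tpr = sum(s > threshold for s in pos) / len(pos)
print(f"threshold={threshold}, TNR={tnr:.2f}, TPR={tpr:.2f}")
```

A target as strict as 99.995% needs a very large pool of non-extremist test files, since the threshold is then set by only a handful of the highest-scoring innocent files.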
With a TNR of 99.995%, only 0.005% of non-extremist videos are wrongly flagged. Assuming that almost all uploads are non-extremist, the average number of innocuous videos flagged as extremist is therefore only 250 per day for a site with 5 million daily uploads (e.g. one of the largest media hosting platforms in the world). Meanwhile, a TPR of 94% means that 94 out of every 100 extremist videos are caught.
Such a low number of false positives could be manually screened by a single analyst. We consider this reduction a significant step in the fight against online extremism, and we will help technology platforms to use our tool in order to remove the vast majority of Daesh content online with minimal impact on their business.
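The figure of 250 follows from a one-line calculation, sketched below under the assumption that essentially all of the 5 million daily uploads are non-extremist.

```python
daily_uploads = 5_000_000   # site size quoted in the text
tnr = 0.99995               # true negative rate at the chosen operating point

# Fraction of non-extremist videos that are wrongly flagged as extremist.
false_positive_rate = 1 - tnr

# Assumption: nearly every upload is non-extremist, so the rate applies
# to the whole daily volume.
expected_false_alarms = daily_uploads * false_positive_rate
print(round(expected_false_alarms))  # 250
```

The same arithmetic shows how sensitive the workload is to the true negative rate: at 99.9%, for instance, the same site would see about 5,000 false alarms a day.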