Beyond the reputational and ethical advantages, there are clear commercial reasons for building robustness into your AI – though robustness rarely gets the same level of attention as concepts like explainability or privacy.
Pillars of safe and performant AI
What is robustness and why does it matter?
At Faculty, creating ‘robust’ models means establishing clear guarantees for how AI systems will behave upon deployment in the real world. Robustness allows us to trust that a model will draw good conclusions and that we’ll know when it’s uncertain about the accuracy of its predictions.
Today, we face a stark issue: most business leaders fail to understand the need for robustness and most data scientists simply do not have good technical solutions for it. This means that models are being put into production without safeguards to ensure they are robust, which increases the risk of models producing incorrect predictions – and therefore the risk of businesses making decisions in error based on those predictions. The significance of an incorrect prediction can range from the trivial (someone got an advert for dog food when they actually have a cat) to the severe (someone couldn’t get a medical bill covered). Making decisions based on incorrect predictions is not only unsafe, it can hurt reputations and bottom lines.
The importance of this should be clear: AI’s promise is to help society make better decisions. If we can’t trust what the model predicts, then there isn’t much point in using a model at all.
As an applied AI company developing solutions for public and private sector organisations, we see first hand that good robustness solutions create more purposeful and effective AI systems. If you want a model to generate new revenue streams, inform decisions or make processes more efficient, then you can’t dismiss robustness as just a technical issue.
Here, we’ll explore five key arguments for including robustness measures in your model development.
Robustness helps you understand the limitations of your model
When applied in businesses, AI can be invaluable when it comes to processing data faster, making better decisions, and reducing costs. But it’s not infallible; no model is 100% accurate and all models will generate predictions that are more accurate on some parts of the data than others.
Figuring out how to maximise the profitability of your AI means figuring out which parts of your data generate less accurate predictions. Implemented correctly, robustness tools will provide an objective and simple measurement that describes when you should and should not trust a model prediction.
At Faculty, we refer to this measurement as a credibility score. To provide the score, we apply a technique that allows us to simplify the representation of our data and understand which subsets of the data will generate better predictions than others.
Let’s consider this approach in practice for the example below, where we’re trying to predict whether or not someone is a high-earner based on population demographics. We can plot the distinct values of a particular feature over this two-dimensional representation of the data set; in the example below, we’ve chosen marital status.
Our example data – a two-dimensional representation of US census data on marital status
Next, we can visualise which areas of this distribution, when fed into our model, produce predictions that have high credibility and which parts of the data produce predictions that have low credibility. In the graph above we can see that the model has, on average, high credibility on individuals who have never been married, while it has low credibility on individuals who are married.
Tests show that, if a region of data has a high credibility score, the predictions the model makes based on that data are usually correct. You can see that this is true in the comparison below: on the left we can see the credibility scores distributed over the data set, on the right we can see where the model makes correct or incorrect predictions. We can clearly see that high credibility regions lead to correct predictions.
Note the high correlation between high credibility score and correct predictions
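One way to run a check like this on your own model is to group test predictions by credibility score and compare the accuracy in each group. The sketch below is illustrative only: the function name accuracy_by_band, the equal-width bands and the toy values are assumptions made for the example, not Faculty's tooling or the census results shown above, and it assumes each score lies between 0 and 1.

```python
from collections import defaultdict


def accuracy_by_band(credibility, correct, n_bands=4):
    """Group predictions into equal-width credibility bands and return accuracy per band."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for score, ok in zip(credibility, correct):
        # A score of exactly 1.0 is placed in the top band.
        band = min(int(score * n_bands), n_bands - 1)
        totals[band] += 1
        hits[band] += int(ok)
    return {band: hits[band] / totals[band] for band in sorted(totals)}


# Toy values, invented for illustration only.
credibility = [0.10, 0.20, 0.30, 0.45, 0.55, 0.60, 0.80, 0.85, 0.90, 0.95]
correct = [False, True, False, True, True, False, True, True, True, True]

print(accuracy_by_band(credibility, correct))
# {0: 0.5, 1: 0.5, 2: 0.5, 3: 1.0}
```

If accuracy does not rise as you move up the bands, the score is not telling you much about when to trust the model.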
Most organisations lack the technology needed to conduct the analysis above. This means that low credibility data regions are dragging down the overall accuracy of their models without anyone noticing.
Let’s say the model visualised in the diagram below has 100 test data points. In this example, using 100% of the data points gives this model a ‘baseline’ accuracy of 85% – meaning that 15% of the time the model is likely to produce an inaccurate prediction. Not ideal if you’re a business looking to make serious business decisions based on the outputs of this model.
Test accuracy of a model achieved when using 100% of the available data
If, however, we decide to remove the 25 data points with the lowest credibility scores (but we won’t discard them completely – more on this later), our model can analyse the remaining 75 data points with 90% accuracy.
Test accuracy of a model achieved when using 75% of the available data
Analysing the top 50 most credible data points means our model can make predictions with 95% accuracy.
Test accuracy of a model achieved when using 50% of the available data
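The trade-off in the three charts above, between how much of the data the model handles and how accurate it is on that portion, can be computed directly once every test prediction has a credibility score. This is a minimal sketch under stated assumptions: accuracy_at_coverage is a hypothetical helper, and the 20 toy data points are invented for the example rather than taken from the model above.

```python
def accuracy_at_coverage(credibility, correct, keep_fraction):
    """Accuracy on the most credible `keep_fraction` of the predictions."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Rank predictions from most to least credible.
    ranked = sorted(zip(credibility, correct), key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = ranked[:n_keep]
    return sum(ok for _, ok in kept) / n_keep


# Toy data: 20 predictions with scores 0.05, 0.10, ..., 1.00.
# The three errors sit among the lower scores (invented for illustration).
credibility = [i / 20 for i in range(1, 21)]
correct = [i not in (0, 2, 6) for i in range(20)]

for fraction in (1.0, 0.75, 0.5):
    accuracy = accuracy_at_coverage(credibility, correct, fraction)
    print(f"keep {fraction:.0%} of points: accuracy {accuracy:.1%}")
```

The points that fall outside the kept fraction are not thrown away by this function; it only measures accuracy on the rest, so they remain available for other handling.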
Even a marginal gain can have a significant impact on the performance of any organisation, but by applying robustness we find a short-cut to making double digit improvements in model accuracy. What could similar gains across deployed models do for your business?
Robustness helps you allocate human resources better
We’ve established that credibility scores protect users from making decisions based on low-credibility data. But we can also use them to save time and resources by highlighting which predictions can be taken at face value and which need a double check from a human.
During model deployment, a credibility score is a simple representation of how likely a prediction is to be correct. The score can be any number from 0 to 1, where 1 means completely trustworthy and 0 the opposite. For the end-user, the credibility score is therefore an intuitive standard against which they can evaluate how trustworthy a prediction is. If the score drops, they’ll know that the model is below its required performance standard and that a human needs to intervene to check some of the predictions. By precisely flagging only the predictions you need to check, you can assign human resources more efficiently or across a greater number of models.
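In code, this kind of triage can be as simple as comparing each score with a cut-off. The sketch below is hypothetical: the Prediction class, the triage function, the record identifiers and the 0.7 threshold are all assumptions made for illustration, and in practice the threshold would be chosen from the performance standard your use case requires.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    record_id: str
    label: str
    credibility: float  # 0 = not trustworthy, 1 = completely trustworthy


def triage(predictions, threshold=0.7):
    """Split predictions into those accepted as-is and those flagged for a human check."""
    accepted = [p for p in predictions if p.credibility >= threshold]
    needs_review = [p for p in predictions if p.credibility < threshold]
    return accepted, needs_review


# Invented example records.
predictions = [
    Prediction("txn-001", "legitimate", 0.95),
    Prediction("txn-002", "suspicious", 0.40),
    Prediction("txn-003", "legitimate", 0.72),
]

accepted, needs_review = triage(predictions, threshold=0.7)
print("accepted:", [p.record_id for p in accepted])
print("needs review:", [p.record_id for p in needs_review])
```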
This is particularly important in cases where large numbers of false positives drive lengthy and expensive human interventions – like, for example, when financial institutions use AI to prevent financial crimes by monitoring for anomalous signals. Being able to disqualify more false positives upfront can save huge amounts of time that would have been spent on costly investigations of legitimate transactions.
Robustness helps your models self-improve
Instead of passing the low-credibility data to a human for analysis, we can also pass that data to a new model which is specifically designed to handle the specific quirks of this section of the data.
This means that we can almost automatically improve the overall prediction accuracy of our models without needing to get a human involved. Robustness measures aren’t just about ensuring that your models are performing well now; with processes like these in place, robustness tools are also a major investment in the constant improvement of your models.
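A rough sketch of that routing is shown below. Everything in it is a stand-in chosen for illustration: predict_with_fallback, the two toy models, the toy credibility scorer, the feature names and the 0.7 threshold are assumptions, not a description of how Faculty builds or scores its models.

```python
def predict_with_fallback(features, primary, specialist, score_credibility, threshold=0.7):
    """Use the primary model when its credibility is high enough, else the specialist."""
    if score_credibility(features) >= threshold:
        return primary(features), "primary"
    return specialist(features), "specialist"


# Stand-in models and scorer (hypothetical rules, for illustration only).
def primary_model(features):
    return "high earner" if features["education_years"] >= 16 else "not high earner"


def specialist_model(features):
    # Imagined as trained only on the low-credibility region of the data.
    return "high earner" if features["years_in_role"] >= 10 else "not high earner"


def toy_credibility(features):
    return 0.9 if features["marital_status"] == "never married" else 0.4


person_a = {"marital_status": "never married", "education_years": 18, "years_in_role": 2}
person_b = {"marital_status": "married", "education_years": 18, "years_in_role": 2}

for person in (person_a, person_b):
    print(predict_with_fallback(person, primary_model, specialist_model, toy_credibility))
```

Returning the name of the model that answered alongside the prediction makes it easy to track how often the specialist is used and whether it is earning its keep.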
Robustness makes your models resilient to change
Models are trained on data. Typically that data represents a certain reality, but what if that reality starts to change?
COVID-19 is a good example of this. If a credit card provider is using an AI system to forecast revenue streams and default risks based on factors like purchasing behaviour, how is that AI system going to cope with a huge shift in spending habits caused by a pandemic? Once in a lifetime events aside, your model may also make incorrect predictions for more subtle reasons, like noisy and erroneous input data or poor parameter choices.
Robustness measures allow you to check that your model is still working when the dynamics of the underlying data change. By giving you an estimate of the credibility of each prediction, your AI system will be able to alert you when sudden or gradual shifts in the data might be skewing your results.
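One simple way to turn per-prediction credibility into an alert is to track a rolling average and compare it with the level seen at validation time. The sketch below assumes exactly that; the CredibilityMonitor class, the window size, the baseline and the tolerance are hypothetical choices for illustration rather than a description of a specific monitoring product.

```python
from collections import deque


class CredibilityMonitor:
    """Flags when the rolling mean credibility falls well below a baseline."""

    def __init__(self, baseline, tolerance=0.1, window=50):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def update(self, score):
        """Record a new score; return True if an alert should be raised."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # wait until the window is full
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.baseline - self.tolerance


# Toy stream of scores: steady at first, then a sharp drop.
monitor = CredibilityMonitor(baseline=0.8, tolerance=0.1, window=3)
alerts = [monitor.update(score) for score in (0.85, 0.80, 0.82, 0.50, 0.40)]
print(alerts)
# [False, False, False, False, True]
```

A rolling window smooths out single odd predictions, so the alert fires on sustained shifts rather than on one noisy input.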
This is vital if you’re using AI to make money or deliver services; if your model outputs are at risk of being skewed by transformative market forces, you’ll have a decision system in place for thinking carefully about whether or not to take your model’s advice. You’ll also know when your model outputs aren’t being skewed, so you won’t waste money abandoning models that are still working.
Combining robustness with explainability helps you understand your models better
The most powerful AI systems are also the most complex. As a result, they often operate as “black boxes”, meaning that it’s hard to understand how they use data to make predictions and if those predictions are ethical. Explainability is the science of interpreting exactly how any model makes its prediction. Explainability is a hot topic right now and, like robustness, is a requirement for building trust in AI systems.
What is interesting is how robustness can be combined with explainability to drive tremendous business outcomes. In the plot below, we can see Faculty’s explainability technology being used to describe which model inputs were most important when a model is predicting an individual’s salary.
Graph showing factors a model considers most predictive when estimating a person’s salary
In this example, explainability shows that the model has determined that marital status and education are the most predictive factors for determining whether or not someone is a high earner. This alone is incredibly valuable; tools like these are being adopted to help model development teams save time building impactful models, while making it easier to explain how the model works to colleagues who don’t know anything about statistics.
When you layer good robustness technology on top, the time savings multiply: we can see how the features the model relies on differ between its low credibility predictions and its high credibility predictions.
In the graph below, we see that for data points in the lowest credibility quartile, the most predictive features are occupation and education.
Graph showing most predictive factors in the low-credibility quartile of data
Knowing this, we can take steps to improve the credibility of these predictions by improving the quality and granularity of the occupation and education data we feed into the model. For example, we might conclude that we need to include information about years of experience in the current job, so that the model can anticipate that someone who has only spent a year in a company will usually earn much less than someone who’s been in the sector for a decade.
More generally, making use of that low credibility region would be hard without explainability, but because we know which features have the greatest predictive power, data scientists can get to work tuning the model on low credibility data regions with a precise understanding of which features will be helpful in making the biggest improvements. When these things are done in sequence, the chances of putting a rock solid, safe, highly predictive model into production increase substantially.
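To make the combination concrete, the sketch below ranks features by importance using only the lowest-credibility quartile of the data. It swaps in permutation importance, a generic technique, as a stand-in for an explainability tool; it is not Faculty's explainability technology. The toy model, the feature names, the credibility values and both helper functions are assumptions made for illustration.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[feature] for row in rows]
            rng.shuffle(shuffled)
            permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances[feature] = sum(drops) / n_repeats
    return importances


def lowest_credibility_quartile(rows, labels, credibility):
    """Return the quarter of the rows (and their labels) with the lowest scores."""
    order = sorted(range(len(rows)), key=lambda i: credibility[i])
    keep = order[: max(1, len(rows) // 4)]
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy model: looks only at education and ignores the "noise" feature.
def toy_model(row):
    return row["education_years"] >= 14


# Invented data, for illustration only.
rows = [{"education_years": 10 + i % 8, "noise": i % 3} for i in range(16)]
labels = [toy_model(row) for row in rows]
credibility = [((i * 7) % 16) / 16 for i in range(16)]

low_rows, low_labels = lowest_credibility_quartile(rows, labels, credibility)
importances = permutation_importance(toy_model, low_rows, low_labels)
print(importances)
```

A feature the model never uses scores exactly zero here, because shuffling it cannot change any prediction; features with larger values are the ones worth improving first in the low-credibility region.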
In a world bound to ever-changing market dynamics, robustness is about creating trust between humans and AI. It is a toolset for practitioners to develop models that behave consistently, and that generate more purposeful predictions. It gives executives confidence that models will behave as expected, and that they will know when they don’t. It’s the concrete technical foundation for leaders that want to allocate resources more effectively, improve predictive performance, and minimise risk.
At Faculty, robustness is a fundamental part of our AI Safety framework. For each facet of AI safety, our teams use a combination of deep research and real-world experience to understand what is required and then develop the technology to get there.
If you are interested in AI Safety, how you can use our technology to build more robust AI, or use credibility scores to improve MLOps and decision-making systems, drop us a line.