AI advances keep coming… but what do they actually mean for your business?

AI models can seem powerful on paper – but can they actually perform in the high-stakes context of your business?

2025-05-08
AI Strategy
Divish Rengasamy
Senior Machine Learning Engineer

Another week, another new AI model, promising unprecedented capabilities. The headlines trumpet new records, benchmark scores soar… 

But for leaders in specialised sectors like insurance, finance, or law, the immediate reaction might reasonably be: 'So what?'

How do all these AI capabilities actually deliver on the big promises of higher ROI and streamlined operations? Especially in knowledge-based industries, where value lies in deep expertise and complex nuance.

The reality is, businesses struggle to see clearly how these generic improvements affect the highly specialised tasks their underwriters, lawyers, and finance professionals perform.

In this blog, I’ll discuss the importance of evaluations: why you should test AI models against the specific analytical reasoning, risk assessment, regulatory understanding, and document interpretation required in your sector.

The gap between generic AI capabilities and specific industry needs

Whether you’re in insurance, financial analysis or a legal practice, you’re probably using AI in some form during your working day. You may also be reading about the progress of research, new models and the latest trends in the space. But how do you link all of this to the specific needs of your business?

An AI scoring high on a broad test like MMLU (Massive Multitask Language Understanding) is impressive. But does it understand the nuances of a complex insurance policy? Can it navigate intricate financial regulations, or perform accurate legal discovery? 

Current AI benchmarks rarely reflect the work these experts actually do.

Without evaluations tailored to these professions, clear insights are missing. 

It's difficult to answer the critical question business leaders are asking: "How much work within these professions could realistically, and reliably, be performed by AI?" 

This uncertainty makes it hard to plan, invest, and adapt effectively.

Moving beyond the generic tests

A good starting point is to evaluate AI using metrics grounded in the skills and knowledge you genuinely need.

For example, evaluate your models against criteria derived from the Chartered Financial Analyst (CFA) exam syllabus, or from the detailed knowledge needed for Financial Conduct Authority (FCA) compliance in the UK.

OpenAI made a big splash about GPT-4’s performance on the Uniform Bar Exam (UBE), so why isn’t your industry measuring against similar benchmarks?

These established standards test core competencies, foundational technical knowledge, and critical reasoning skills in a structured, measurable way. This is more relevant to predicting performance than relying on generic tests. 

This approach will help you figure out whether an AI model can actually handle the kind of work you'd trust a professional to do.
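To make this concrete, here is a minimal sketch of what an exam-style evaluation harness might look like in Python. It is only an illustration: it assumes an OpenAI-style chat client, an example model name, and a hypothetical question bank file (cfa_style_questions.json) of multiple-choice questions you have written or licensed yourself.

```python
# Minimal sketch of an exam-style evaluation harness (assumptions noted above).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, options: dict[str, str]) -> str:
    """Pose one multiple-choice question and return the model's letter answer."""
    prompt = question + "\n" + "\n".join(f"{k}) {v}" for k, v in options.items())
    prompt += "\nAnswer with a single letter only."
    response = client.chat.completions.create(
        model="gpt-4o",  # example only -- swap in whichever model you are assessing
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()[:1].upper()

def run_eval(path: str) -> float:
    """Score the model against a question bank and return its accuracy."""
    with open(path) as f:
        questions = json.load(f)  # [{"question": ..., "options": {...}, "answer": "B"}, ...]
    correct = sum(ask(q["question"], q["options"]) == q["answer"] for q in questions)
    return correct / len(questions)

if __name__ == "__main__":
    print(f"Accuracy on domain exam questions: {run_eval('cfa_style_questions.json'):.0%}")
```

Even a harness this small forces the useful discipline of writing down, question by question, what a competent professional would be expected to get right.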

But feeding AI these exams is just the beginning. 

Passing an exam is a whole different thing from handling the complicated, day-to-day reality of the job.

The ultimate goal for truly effective AI evaluation is to benchmark performance against your actual tasks, workflows, and decision-making processes. From analysing intricate client case files to stress-testing financial models using live market data.
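If you do reach the point of benchmarking against your own workflows, a simple starting shape might look like the sketch below. Again, this is only an illustration: the Case structure, the expert-written key-fact lists, and the summarise() placeholder are assumptions to be replaced with your own documents, validated criteria, and model integration, and the plain string matching stands in for whatever scoring your experts agree is fair.

```python
# Minimal sketch of a task-level benchmark: summarising client case files and
# checking coverage of expert-validated key facts. Case files, key-fact lists,
# and the summarise() call are placeholders for your own data and model.
from dataclasses import dataclass

@dataclass
class Case:
    text: str             # the raw case file
    key_facts: list[str]  # facts an expert says any acceptable summary must mention

def summarise(case_text: str) -> str:
    """Placeholder for your model call (API, local model, or vendor tool)."""
    raise NotImplementedError

def score_case(case: Case) -> float:
    """Fraction of expert-validated key facts covered by the model's summary."""
    summary = summarise(case.text).lower()
    return sum(fact.lower() in summary for fact in case.key_facts) / len(case.key_facts)

def run_benchmark(cases: list[Case]) -> float:
    """Average key-fact coverage across the internal case set."""
    return sum(score_case(c) for c in cases) / len(cases)
```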

This is more than just an academic exercise. Running AI through your own internal, validated benchmarks has a convincing business case.

The benefits of establishing better benchmarks

Measuring how well your AI performs within the specific context and operational reality of the actual job function can:

1. Improve your clarity on which AI projects to prioritise

Knowing how well different models perform on specific, relevant tasks means you can focus your investments on where AI shows genuine, measurable capability.

2. Encourage stakeholders to support your projects

When performance can be demonstrated against relevant benchmarks, it gives leaders confidence to champion AI investments. 

3. Guide strategic technology choices

You can make informed choices about which AI providers to partner with, select the most suitable models for specific tasks, and decide on the best deployment approach based on proven performance.

4. Inform your long-term workforce strategy

By understanding which specific professional activities AI can improve, you can proactively plan future workforce needs. You can identify where AI can free up your team for higher-impact work, and where human oversight is still essential.

Don’t trust AI blindly: put it to the test

When you’re working in complex fields like finance, insurance and law, there’s little value in relying on AI that’s only been tested on generic benchmarks.

Real impact depends on how well the AI is adapted to your specific context, where it can complement the distinct inner workings of your business.

Demand relevant benchmarks that are grounded in professional standards and proven against your daily tasks. As we discuss in lesson nine of our ‘Ten Lessons From Ten Years of Applied AI’ book, rather than trying to come up with a separate AI strategy, you should test how AI can help deliver your existing priorities. If it can’t, ignore it and focus on the things that can.

By following this approach, you turn abstract AI potential into tangible business advantage.