Beyond manual red teaming: Scaling model safety with Automated Red-teaming Emulation System (ARES)
Our AI Safety and Security Lead, Thomas Clarke, walks through the different types of red teaming, where current approaches fall short, what we built to address the gap, and how it works in practice.
Over the last few years, large language models (LLMs) have gone from impressive demos to critical infrastructure. They write production code, draft regulatory documents, generate research summaries and increasingly sit inside systems that have real-world consequences. As their capabilities expand, so does the surface area of risk and the need for rigorous safety evaluations.
Pre-deployment red teaming remains one of the most important tools for understanding misuse potential and characterising model behaviour under adversarial or high-risk conditions. At Faculty, we continue to push the boundaries of safety testing for frontier AI models, particularly in the domain of red teaming for high-risk areas such as biosecurity.
In this blog, I'll cover the different types of red teaming, what the current process gets wrong, what we built to address it, and how it works.
Not all red teaming is the same
Broadly, red teaming techniques fall into two categories.
Single-turn evaluations test how a model responds to a specific high-risk prompt in isolation, often scored against structured criteria. These approaches provide standardisation and comparability, but can under-represent how capability unfolds in realistic settings, particularly when models recognise evaluation-style prompts and respond more cautiously.
In contrast, multi-turn red teaming simulates sustained interaction, where a potentially malicious actor steers the conversation. It is through this progression that operational detail, boundary testing, and the evolution of model behaviour across an interaction become clearer. In high-risk domains like biosecurity, such in-depth assessments should be a core part of responsible model deployment. The tool that we describe in this post focuses specifically on multi-turn red teaming because it more closely reflects real-world interaction, where model behaviour unfolds over a sequence of exchanges rather than a single prompt. However, as models get better at reasoning, adapting and handling more complex interactions, they require more sophisticated probing, and multi-turn red teaming takes more time and effort to conduct.
The three structural limits of manual red teaming
Manual multi-turn red teaming is an effective AI safety testing technique. When done by experienced subject matter experts in their specific harm domains, it can produce nuanced, context-aware assessments that benchmarks simply cannot replicate. Experts can adapt dynamically to a model’s responses, pursue unexpected but revealing lines of inquiry, validate technical claims, and apply domain judgement in ways that structured test sets cannot easily capture.
However, manual red teaming has structural limits:
First, it is time-bound. In many cases, experts are given a fixed window, sometimes as little as 10 hours, to explore an entire misuse domain. Coming up with high-quality probes and thoughtful follow-ups is demanding: it takes time and deep consideration. Within that window, only a small number of conversational threads can be explored in depth, and red teamers may spend valuable time pursuing lines of inquiry that ultimately reveal little or prove to be dead ends.
Second, coverage is uneven. Two experts starting from the same seed question can diverge rapidly based on model responses or personal instinct. That flexibility is a strength, but it also means coverage across a domain is hard to systematise. Given that the potential conversation space is so large, some risk chains will be explored deeply while others may not be touched at all.
Third, reproducibility is limited. If conversational trajectories vary significantly across sessions, comparing capability across models (or even across versions of the same model) becomes difficult. This is further complicated by the qualitative nature of manual red teaming and its reliance on single conversational instances, which often do not capture variability across repeated runs.
There are other AI safety assessment approaches that attempt to address these constraints:
1) Evaluations with broad questions and rubric-based scoring
2) Narrow questions with programmatic scoring
Both offer structure and scale, but they often rely on constrained elicitation patterns that do not resemble realistic threat actor interaction, or are restricted to performance in a single-turn format. As a result, they can under-estimate what a model might provide in a multi-turn, adaptive setting.
So we are left with a trade-off:
Manual red teaming offers depth, realism and adaptability.
Automated evaluations offer scale and structure.
Neither, on its own, fully solves the problem.
Our starting point was simple: can automation be used to augment experts, rather than replace them?
Building a high-throughput collaborator
Funded by the Frontier Model Forum AI Safety Fund, in collaboration with SecureBio, we developed ARES, the Automated Red Teaming Emulation System. We built the tool for the biological threat domain, but it can be easily re-tooled for other harm areas.
ARES is not an attempt to automate away the expert, but to remove two of the biggest bottlenecks in manual red teaming: generating large numbers of diverse probes, and prioritising conversations for expert reviews.
ARES implements a feedback-driven pipeline with four components:
A probe generator creates diverse initial questions from a user-defined guidance topic, informed by previous scoring (if available)
A conversation simulator runs multi-turn dialogues between an emulated red teamer and the target model
A rater scores each conversation against predefined risk criteria, such as Actionability, Accuracy, Detail, Willingness, Conversation Relevance and Scale of Harm, and provides structured reasoning with citations
An orchestrator coordinates these components over multiple iterations, using scoring feedback to inform subsequent probe generation
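To make the rater's output concrete, the sketch below models a single rating as a small data structure. The six criterion names come from the list above; the 0 to 5 scale, the field names and the mean-based aggregation are assumptions made for illustration, not a description of ARES internals.

```python
from dataclasses import dataclass, field

# Criterion names are taken from the post; everything else is hypothetical.
CRITERIA = (
    "Actionability",
    "Accuracy",
    "Detail",
    "Willingness",
    "Conversation Relevance",
    "Scale of Harm",
)


@dataclass
class Rating:
    """One rater verdict for one conversation (assumed 0-5 scale)."""

    scores: dict[str, int]
    reasoning: str = ""
    citations: list[str] = field(default_factory=list)  # e.g. quoted turns

    def __post_init__(self) -> None:
        # Every criterion must be scored, and every score must be in range.
        missing = set(CRITERIA) - set(self.scores)
        if missing:
            raise ValueError(f"missing criteria: {sorted(missing)}")
        for name, value in self.scores.items():
            if not 0 <= value <= 5:
                raise ValueError(f"{name} score {value} is outside 0-5")

    def overall(self) -> float:
        """Assumed aggregation: plain mean across the six criteria."""
        return sum(self.scores[c] for c in CRITERIA) / len(CRITERIA)
```

A mean is only one possible aggregation; taking the maximum, or weighting Scale of Harm more heavily, would be equally easy to swap in.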
In practice, a red teamer specifies a guidance topic (e.g. a sub-area within biosecurity). ARES generates batches of probes, simulates conversations up to a configurable depth, and scores each thread.
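As a rough illustration of what "simulating a conversation up to a configurable depth" can look like, here is a minimal sketch. The `red_teamer` and `target` callables stand in for model calls and are hypothetical, as are the function name and the `max_turns` parameter.

```python
from typing import Callable

Message = tuple[str, str]  # (speaker, text)


def simulate_conversation(
    probe: str,
    red_teamer: Callable[[list[Message]], str],
    target: Callable[[list[Message]], str],
    max_turns: int = 5,
) -> list[Message]:
    """Alternate target replies and red-teamer follow-ups, starting from the probe."""
    history: list[Message] = [("red_teamer", probe)]
    for turn in range(max_turns):
        # The target model always answers the latest red-teamer message.
        history.append(("target", target(history)))
        # The emulated red teamer follows up, except after the final reply.
        if turn < max_turns - 1:
            history.append(("red_teamer", red_teamer(history)))
    return history
```

Because both sides receive the full history, the emulated red teamer can adapt each follow-up to what the target has already said, which is the property that separates this from a batch of single-turn prompts.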
Crucially, the system runs iteratively. Probes associated with higher-risk conversations are fed back into the generation process, encouraging the system to explore adjacent or under-examined areas. The result is a systematic widening of the search space.
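The iterative loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: the three callables (`generate`, `simulate`, `rate`) are hypothetical placeholders for the probe generator, conversation simulator and rater, and "feed the top-scoring probes back" is reduced to passing the `top_k` best probes seen so far into the next generation call.

```python
from typing import Callable

Conversation = list[tuple[str, str]]  # (speaker, text) pairs


def run_iterations(
    topic: str,
    generate: Callable[[str, list[str], int], list[str]],
    simulate: Callable[[str, int], Conversation],
    rate: Callable[[Conversation], float],
    iterations: int = 3,
    batch_size: int = 4,
    max_turns: int = 6,
    top_k: int = 2,
) -> list[tuple[str, float]]:
    """Generate, simulate and score probes, feeding the best ones back in."""
    results: list[tuple[str, float]] = []
    feedback: list[str] = []  # empty on the first pass: no prior scoring yet
    for _ in range(iterations):
        probes = generate(topic, feedback, batch_size)
        scored = [(probe, rate(simulate(probe, max_turns))) for probe in probes]
        results.extend(scored)
        # The highest-scoring probes so far guide the next batch.
        ranked = sorted(results, key=lambda item: item[1], reverse=True)
        feedback = [probe for probe, _ in ranked[:top_k]]
    return results
```

Keeping the orchestrator ignorant of how each component works means any one of them can be replaced (for example, a different rater) without touching the loop.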
ARES is deployed as a web application. Users can filter conversations by score, star or hide threads, and, importantly, branch conversations at any point. This branching feature allows a human expert to take over from an automatically generated exchange and pursue a line of questioning manually. In other words, automated breadth, with human-guided depth where it matters.
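One simple way to represent branching is to fork a thread at a chosen message while leaving the original untouched. The sketch below is a hypothetical data structure for illustration only; the class and field names are assumptions, not the ARES schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Thread:
    """A conversation thread; `parent` records where it branched from."""

    messages: list[tuple[str, str]] = field(default_factory=list)
    parent: Optional["Thread"] = None
    branched_at: Optional[int] = None  # number of messages inherited

    def branch(self, keep: int) -> "Thread":
        """Fork a new thread that keeps only the first `keep` messages."""
        if not 0 <= keep <= len(self.messages):
            raise ValueError("branch point is outside the conversation")
        # Copy the prefix so edits to the branch never alter the original.
        return Thread(list(self.messages[:keep]), parent=self, branched_at=keep)
```

An expert could then append their own follow-up to the branch, while the automatically generated thread remains available for comparison.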
What happened when we tested it
To assess whether ARES actually changed outcomes, we conducted a controlled comparative experiment with seven biological subject matter experts from SecureBio. While the sample size was small and the quantitative results should therefore be interpreted with caution, the experiment provided an initial signal on how the tool affects red-teaming coverage and surfaced promising qualitative feedback from participants.
Participants red teamed two models – Kimi K2 and Qwen 3 32B – in parallel experimental lanes. In one session they used ARES; in another, a standard chatbot interface (OpenWebUI). Sessions were counterbalanced to reduce order effects.
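For readers unfamiliar with counterbalancing, the idea is to vary which condition each participant sees first so that practice or fatigue does not systematically favour one interface. The sketch below shows one common scheme (alternating order across a shuffled participant list). It is an illustration only, not the assignment procedure used in the experiment, and the function name and labels are hypothetical.

```python
import random


def counterbalance(participants, conditions=("ARES", "manual"), seed=0):
    """Alternate the order of two conditions across a shuffled participant list."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    orders = [tuple(conditions), tuple(reversed(conditions))]
    # Even positions get the first order, odd positions the reverse.
    return {p: orders[i % 2] for i, p in enumerate(shuffled)}
```

With an odd number of participants the split cannot be perfectly even: seven participants give four in one order and three in the other.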
The most obvious difference was volume.
With ARES, participants generated hundreds of conversations across their sessions (configured at 40 conversations per run). With the manual interface, the median was five conversations per session, though some of those threads went very deep, reaching up to 50 messages.
This illustrates the core trade-off. Manual red teaming naturally optimises for depth in a small number of threads. ARES optimises for breadth across a much wider portion of the risk landscape.
In terms of risk assessment, we observed general increases in scores when ARES was used, though not all were statistically significant across the full cohort. These increases are consistent with ARES covering more of the conversational space within the same time window and therefore surfacing higher-risk interactions that might otherwise have remained undiscovered. At the same time, they underline the need to ensure that higher scores reflect genuinely increased exposure to risk rather than differences introduced by the evaluation setup itself.
Most consistently, the “Willingness” criterion showed significant increases across both models. Because ARES surfaces a high volume of conversations, participants encountered more instances where the model advanced hazardous objectives or answered questions that might otherwise have remained undiscovered within a limited manual session.
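With a cohort this small, any significance claim rests on very few paired observations. As a sketch of one way a paired comparison of this kind could be tested (an assumption for illustration, not the analysis used in this experiment), an exact sign-flip permutation test enumerates every sign pattern of the per-participant differences. The function name is hypothetical.

```python
from itertools import product


def paired_sign_flip_pvalue(with_tool, without_tool):
    """Exact one-sided p-value that the mean paired difference is above zero."""
    if len(with_tool) != len(without_tool) or not with_tool:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [a - b for a, b in zip(with_tool, without_tool)]
    observed = sum(diffs)
    hits = total = 0
    # Under the null hypothesis each difference is equally likely to be + or -.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        # Note: with float inputs, near-ties may be affected by rounding.
        if sum(s * d for s, d in zip(signs, diffs)) >= observed:
            hits += 1
    return hits / total
```

With seven pairs there are 2^7 = 128 sign patterns, so the smallest one-sided p-value this test can ever return is 1/128 (about 0.008), which is one reason results from small cohorts should be read cautiously.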
Qualitative feedback added nuance. Several participants reported that ARES surfaced more detailed and actionable responses, reflecting its ability to explore a wider range of conversational paths within the same time window. At the same time, manual interaction remained superior for creative, highly contextual or fine-grained explorations, particularly where subtle domain judgement was required.
Trust in automated scoring was mixed. While structured reasoning and citations were helpful, calibration with human judgement needs further refinement.
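Calibration between automated and human scores can be summarised with a few simple statistics. The sketch below is a minimal illustration, assuming both the rater and the SME score the same conversations on the same integer scale; the function and key names are hypothetical.

```python
def calibration_summary(auto_scores, sme_scores):
    """Compare automated ratings with SME ratings for the same conversations."""
    if len(auto_scores) != len(sme_scores) or not auto_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - s for a, s in zip(auto_scores, sme_scores)]
    n = len(diffs)
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        # Positive bias means the automated rater scores higher than the SME.
        "mean_bias": sum(diffs) / n,
    }
```

Reporting the signed bias alongside agreement is useful because a rater that is consistently one point too high calls for a different fix than one that is simply noisy.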
What this changes and what it doesn’t
ARES does not eliminate the need for manual red teaming. Nor does it solve the problem of alignment.
What it does change is the shape of the workflow.
If generating conversations is no longer the primary bottleneck, expert time can shift from producing interactions to interpreting them. The limiting factor becomes analysis, not exploration.
It also introduces more systematic structure. Probes, conversations and scores are versioned and stored hierarchically. Runs are parameterised. Criteria are explicitly defined. This makes cross-model comparisons and iterative testing more grounded than relying purely on anecdotal transcripts.
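To illustrate what parameterised runs and hierarchical storage might look like, the sketch below derives a stable run identifier from the run's parameters and nests probes and conversations beneath it. The default of 40 conversations per run mirrors the configuration mentioned earlier; every other field name, default and path layout is an assumption for illustration, not the ARES storage format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical parameters that define one run."""

    topic: str
    target_model: str
    conversations_per_run: int = 40
    max_turns: int = 10
    criteria_version: str = "v1"

    def run_id(self) -> str:
        # Identical parameters always hash to the same identifier.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def record_path(config: RunConfig, probe_index: int, conversation_index: int) -> str:
    """Run -> probe -> conversation hierarchy expressed as a storage path."""
    return (
        f"runs/{config.run_id()}"
        f"/probes/{probe_index:03d}"
        f"/conversations/{conversation_index:03d}.json"
    )
```

Because the identifier is derived from the parameters, two runs can be compared across models only when everything except `target_model` matches, which makes accidental apples-to-oranges comparisons easier to spot.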
However, several open research questions remain:
Scoring calibration: Aligning automated ratings more tightly with SME judgement, potentially through curated “golden” labels.
Generation quality: Improving probe diversity and follow-up precision, potentially moving beyond in-context prompting towards supervised fine-tuning or reinforcement learning grounded in expert data.
User control: Introducing more granular steering, such as probe staging or custom risk aggregation.
Coverage metrics: Developing explicit measures of semantic coverage across a misuse domain, rather than relying solely on score increases across iterations.
Applicability to strongly safeguarded models: Evaluating whether the approach generalises to models with more comprehensive guardrails.
Probe generation effectiveness: Validating whether an iterative probe generation process (where high-scoring conversations guide subsequent probes) consistently improves probe quality and exploration of the domain.
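As a starting point for the coverage-metrics question above, one very simple measure is the share of a predefined sub-topic taxonomy that at least one conversation has touched. The sketch below assumes each conversation has already been tagged with sub-topics (by an expert or a classifier); the taxonomy, the tags and the function name are all hypothetical.

```python
from collections import Counter


def coverage_report(taxonomy, conversation_tags):
    """Share of taxonomy sub-topics touched by at least one conversation."""
    if not taxonomy:
        raise ValueError("taxonomy must not be empty")
    counts = Counter(
        tag
        for tags in conversation_tags
        for tag in set(tags)  # count each sub-topic once per conversation
        if tag in taxonomy  # ignore tags outside the taxonomy
    )
    untouched = [topic for topic in taxonomy if counts[topic] == 0]
    return {
        "coverage": (len(taxonomy) - len(untouched)) / len(taxonomy),
        "untouched": untouched,
        "counts": dict(counts),
    }
```

A taxonomy-based measure like this is much cruder than true semantic coverage, but it makes gaps visible: the `untouched` list could be fed straight back to the probe generator as guidance topics.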
Where this leaves AI safety
As LLMs become more capable, the question for frontier models is not only whether they are aligned, but how confidently we can characterise their capabilities.
Manual red teaming remains indispensable because it captures nuance and adaptive behaviour. However, as models become more capable in multi-turn interactions, the amount of effort required to meaningfully explore their behaviour increases quickly. ARES is an attempt to make the process more systematic and more scalable without sacrificing what makes manual evaluation valuable.
If you want to learn more about the ARES tool, or implement it within your red teaming set-up, reach out to the Faculty safety team through this form.